Scripts for generating the graphs of the paper: Are the chemical families still there?

Table of Contents

1 Overview

This repository contains the scripts (source code) and raw data needed to reproduce the results of: "Are the chemical families still there? Exploration of similarity among elements". Authors: Eugenio Llanos Ballestas, Wilmer Leal, Andrés Bernal, Jürgen Jost and Peter F. Stadler.

Contact: ellanos@sciocorp.org

2 Repository contents

  • code/: directory containing the scripts.
  • data/: directory containing raw data to make the plots and tables of this document.

3 Software requirements

To run the code in this repository, bash (4.3.42(1)-release or greater), gawk (4.1.3 or greater) and sbcl (1.2.7 or greater) are required.
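A quick way to check which versions are installed (the exact output format varies between systems):

bash --version | head -n 1
gawk --version | head -n 1
sbcl --version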

3.1 Installation of gawk and bash

If these tools are not already installed on your system, they can be installed on a Debian-based distro as follows:

sudo apt install bash gawk

Installation takes around a minute.

3.2 Installation of SBCL

Steel Bank Common Lisp (SBCL) is a free distribution of Common Lisp. To install it, just type

sudo apt install sbcl

4 Figure 1

Diversity of combination motifs among elements. The code acts on a file containing formulae and molecule codes.

gawk -f code/expected.awk data/formulae-compounds-minyear.tsv

The output is a tab separated table of all elements with the following fields:

Z | element | expected rank | number of compounds | number of combinations
1 | H | 0.00452634 | 25,889,651 | 69,769
2 | C | 0.00432876 | 25,829,779 | 62,347

The code is stored in code/expected.awk.
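As a quick check of the output, the table can be sorted by expected rank. This is an illustrative one-liner, not part of the repository; it assumes the paths used above and the five-column layout of the output, and drops the first line (the header, if one is printed):

gawk -f code/expected.awk data/formulae-compounds-minyear.tsv | tail -n +2 | sort -t $'\t' -k3,3gr | head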

5 Figure 2

Growth of chemistry's bias toward favored combination motifs. The script runs as follows:

gawk -f code/expected-rank-element-year.awk > data/expected-rank-element-year.tsv

The output is a tab separated table whose first column corresponds to the year (descending order). The following columns correspond to the expected motif rank of each element.

The code is stored in code/expected-rank-element-year.awk.
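As a sanity check (an illustrative one-liner, not part of the repository), the dimensions of the generated table can be inspected; each row should contain the year column plus one column per element:

gawk -F "\t" 'NR==1{print "columns:", NF} END{print "rows:", NR}' data/expected-rank-element-year.tsv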

6 Figure 4

Comparison of similarity measures between elements in both directions. The script runs as follows:

gawk -f code/sim-asym-dif.awk data/similarity-stoich.tsv

The output is a table of every pair of elements:

\(e_1\) | \(e_2\) | \(e_1 \to e_2\) | \(e_2 \to e_1\)
Lu | No | 0.00101334 | 0.461538
Lu | Ra | 0.00371559 | 0.431373

The code is stored in code/sim-asym-dif.awk.
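The most asymmetric pairs can be listed by sorting on the absolute difference between the two directed similarities. This is an illustrative one-liner, not part of the repository; it assumes the four-column layout shown above:

gawk -f code/sim-asym-dif.awk data/similarity-stoich.tsv | gawk '{d=$3-$4; if (d<0) d=-d; print d, $0}' | sort -k1,1gr | head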

7 Figure 5

Singularity by year. Data is obtained as follows:

This code generates data for each year and stores them in separate files named singularity-YEAR.tsv, where YEAR corresponds to the year value.

# Select all compounds first reported up to a given year.
function echo_year() {
    gawk -F "\t" -v year=$1 '$3 && $3<=year' data/formulae-compounds-minyear.tsv
}

count=0
for y in {1800..2015}
do
    # Compute neighbourhoods, similarities and yearly singularities in a background job.
    echo_year $y | gawk -f neigh.awk | gawk -f sim.awk | gawk -v year=$y -f sing-yearly.awk > yearly/singularity-$y.tsv &
    let count+=1
    # Every 50 years, wait for the running background jobs before launching more.
    [[ $((count%50)) -eq 0 ]] && wait
done

The following awk code calculates the singularity of each element for all years. It is fed the above results (the singularity-YEAR.tsv files).

{
    # $1 = year, $2 = element; the singularity fraction is $3/$4,
    # where $4 is the element's total number of formulae.
    sing[$1][$2]=$3/$4
    tot[$2]=$4
}
END{
    # Order elements by total frequency (number of formulae), highest first.
    PROCINFO["sorted_in"]="@val_num_desc"
    # Header row: "year" followed by the element symbols.
    printf "year"
    for (e in tot){
        printf "\t%s",e
    }
    print ""
    # One row per year with the singularity of each element, as a percentage.
    for (y=1800;y<=2015;y++){
        printf "%d",y
        for (e in tot){
            printf "\t%s",sing[y][e] ? sing[y][e]*100 : ""
        }
        print ""
    }
}

The result is a file whose first column is the year, and the following columns correspond to the percentage of singularity of each element in that year. The elements are organized in descending order by the total frequency (number of formulae).
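For example, if the awk program above is saved as sing-all-years.awk (an illustrative name, not fixed by the repository), it can be fed all the per-year files at once:

gawk -F "\t" -f sing-all-years.awk yearly/singularity-*.tsv > data/singularity-by-year.tsv

Here data/singularity-by-year.tsv is likewise an arbitrary name for the resulting table.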

8 Figure 6

Network of the most similar element. The data is obtained as follows:

gawk -f code/pt-most-similar.awk data/elem-z-group-period.csv data/sim-rank-stoich.tsv

The code can be found in code/pt-most-similar.awk.

9 Figure 7

Distributions of coverage measures. Data is obtained using three different measures.

gawk -f code/rank-sim-comb.awk data/groups.tsv data/similarity-stoich.tsv > data/rank-sim-comb.csv
gawk -f code/rank-sim-diverse.awk data/groups.tsv data/similarity-stoich.tsv > data/rank-sim-diverse.csv
gawk -f code/rank-sim-mol-size.awk data/groups.tsv data/similarity-stoich.tsv > data/rank-sim-mol-size.csv
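
Since the three commands differ only in the name of the measure, they can equivalently be run in a loop:

for m in comb diverse mol-size
do
    gawk -f code/rank-sim-$m.awk data/groups.tsv data/similarity-stoich.tsv > data/rank-sim-$m.csv
done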

10 Figure 8

Rank asymmetry vs. rank value.

gawk -f code/asymmetry.awk data/groups.tsv data/rank-sim-comb.csv > data/asym-sym-comb.csv

The result is a table stored in data/asym-sym-comb.csv.

11 Table 1

To calculate the family clusters, we developed a simple algorithm in Common Lisp that performs a hierarchical cluster analysis (HCA) and then extracts the family clusters. The code is in code/rank-motif.lisp.

sbcl --script code/rank-motif.lisp

The command above creates the file data/rank-motif.nwk, which contains the dendrogram in Newick format.
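As a quick sanity check on the dendrogram (an illustrative one-liner, assuming the whole Newick string is written on a single line), the number of leaves equals the number of commas plus one:

gawk -F "," 'NR==1{print NF " leaves"}' data/rank-motif.nwk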

12 Figure 9

To obtain normalized family rankings, we run the following code:

gawk -f code/elem-group-rank.awk data/elements data/groups.tsv rank-mol-size.csv > data/elem-group-rank-molsize.csv

The result is a table as follows:

Z | Symbol | Norm. Rank | Symbol | Z | Family
1 | H | 0.2 | F | 9 | 16
1 | H | 0.4 | Cl | 17 | 16
1 | H | 0.8 | Br | 35 | 16
1 | H | 1.6 | I | 53 | 16
2 | He | 0.2 | Ar | 18 | 17
2 | He | 0.4 | Ne | 10 | 17
2 | He | 2.6 | Xe | 54 | 17
2 | He | 6 | Kr | 36 | 17
3 | Li | 0.25 | Na | 11 | 1
3 | Li | 0.5 | K | 19 | 1
3 | Li | 0.75 | Cs | 55 | 1
3 | Li | 1 | Rb | 37 | 1
The columns are: the nuclear charge Z of the first element, the symbol of the first element, the normalized rank of the second element, the symbol and nuclear charge of the second element, and the family to which both elements belong.
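As an illustration of how this table can be used (a sketch only, assuming tab-separated columns in the order listed above and no header row; adjust -F if the file is comma-separated), the average normalized rank per family is:

gawk -F "\t" '{s[$6]+=$3; n[$6]++} END{for (f in s) printf "%s\t%.3f\n", f, s[f]/n[f]}' data/elem-group-rank-molsize.csv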

13 Figure 10a

To generate Figure 10a, a series of random experiments is performed. Run the following code in the shell:

select_random_mol () {
    # Print each formula once per compound, sample $1 compounds at random,
    # and recompute neighbourhoods, similarities and combination ranks on the sample.
    gawk -F "\t" '{for (i=1;i<=$2;i++) print $1}' data/formulae-compounds-minyear.tsv | shuf -n $1 | gawk -f code/neigh.awk | gawk -f code/sim.awk | gawk -f code/sim-rank-comb.awk
}

for p in {10..90..10}
do
    # Number of compounds corresponding to p percent of the database.
    lines=$(gawk -v l=$p '{count+=$2}END{print int(l*count/100)}' data/formulae-compounds-minyear.tsv)
    for replica in {1..100}
    do
        select_random_mol $lines > data/random-compounds/rank-$p-$replica.csv
    done
done

After generating all the rankings for each random experiment, the percentage of elements in family clusters is calculated. The following code generates 100 dendrograms for each percentage of compounds.

sbcl --script code/random-compounds-trees.lisp

For each dendrogram from the random experiments, the percentage of elements that fall in family clusters is calculated.

sbcl --script code/random-compounds-percentage.lisp > data/rand-comp-percentage.csv

The result is stored in data/rand-comp-percentage.csv.

14 Figure 10b

In order to obtain the data required for this figure, we deleted compounds from selected motifs in decreasing order of frequency. The first step is to obtain the motifs in decreasing order.

gawk -f code/motif-freq.awk data/similarity-stoich.tsv > data/motifs-freq.tsv

Then, the ranks are generated by the following code:

for i in 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 20000 30000 40000 50000 60000 70000 80000 90000 100000
do
    # Exclude the $i most frequent motifs, then recompute the combination ranks in a background job.
    gawk -v exclude=$i -f code/delete-combinations.awk similarity-stoich.csv | gawk -F "\t" -f code/sim-rank-comb.awk > data/motif-del/rank-comb-$i.csv &
done

From the rankings for each element, dendrograms are calculated for each set of compounds.

sbcl --script code/motif-deletion-trees.lisp

Author: Eugenio Llanos

Created: 2022-12-06 Tue 13:03
