Scripts for generating the graphs of the paper: Are the chemical families still there?
1 Overview
This repository contains the scripts (source code) and raw data to reproduce the results of "Are the chemical families still there? exploration of similarity among elements".
Authors: Eugenio Llanos Ballestas, Wilmer Leal, Andrés Bernal, Jürgen Jost and Peter F. Stadler.
Contact: ellanos@sciocorp.org
2 Repository contents
3 Software requirements
Running the code in this repository requires bash (4.3.42(1)-release or later), gawk (4.1.3 or later) and sbcl (1.2.7 or later).
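A quick way to check these requirements before running anything is sketched below. The helpers `version_ge` and `check_tool` are illustrative names, not part of the repository, and the comparison relies on `sort -V`, which is assumed to be available (GNU coreutils provides it). Only the bash version is compared here; the gawk and sbcl versions can be read from `gawk --version` and `sbcl --version`.

```shell
# Succeed if version $1 is greater than or equal to version $2.
# Assumes a sort that supports -V (version sort), e.g. GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Report whether a tool is on the PATH (hypothetical helper).
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
        return 1
    fi
}

for tool in bash gawk sbcl; do
    check_tool "$tool" || true
done

# bash exposes its version through BASH_VERSINFO (major, minor, patch).
if command -v bash >/dev/null 2>&1; then
    bash_version=$(bash -c 'echo "${BASH_VERSINFO[0]}.${BASH_VERSINFO[1]}.${BASH_VERSINFO[2]}"')
    if version_ge "$bash_version" 4.3.42; then
        echo "bash $bash_version is recent enough"
    else
        echo "bash $bash_version is older than 4.3.42"
    fi
fi
```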
3.1 Installation of gawk and bash
If these programs are not installed on your system, you can install them on a Debian-based distribution as follows:
sudo apt install bash gawk
Installation takes around a minute.
3.2 Installation of SBCL
Steel Bank Common Lisp (SBCL) is a free distribution of Common Lisp. To install it, just type
sudo apt install sbcl
4 Figure 1
Diversity of combination motifs among elements. The code acts on a file containing formulae and molecule codes.
gawk -f expected.awk formulae-compounds-minyear.tsv
The output is a tab-separated table of all elements with the following fields:
Z | element | expected rank | number of compounds | number of combinations |
---|---|---|---|---|
1 | H | 0.00452634 | 25,889,651 | 69,769 |
2 | C | 0.00432876 | 25,829,779 | 62,347 |
The code is stored in this file.
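If the counts are written with thousands separators, as in the table above, they have to be stripped before any numeric processing. The sketch below assumes a tab-separated file with the five columns shown; the sample rows are copied from the table, `strip_separators` is a hypothetical helper, and plain `awk` is used so the snippet also runs where gawk is not installed.

```shell
# Two sample rows taken from the table above, tab-separated.
sample=$(printf '1\tH\t0.00452634\t25,889,651\t69,769\n2\tC\t0.00432876\t25,829,779\t62,347\n')

# Remove the commas from the two count columns so they parse as numbers.
strip_separators() {
    awk -F '\t' -v OFS='\t' '{ gsub(/,/, "", $4); gsub(/,/, "", $5); print }'
}

plain=$(printf '%s\n' "$sample" | strip_separators)
printf '%s\n' "$plain"
```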
5 Figure 2
Growth of chemistry's bias toward favored combination motifs. The script runs as follows:
gawk -f code/expected-rank-element-year.awk > data/expected-rank-element-year.tsv
The output is a tab-separated table whose first column is the year, in descending order. The remaining columns contain the expected motif rank of each element.
The code is stored in this file.
6 Figure 4
Comparison of similarity measures between elements in both directions. The script runs as follows:
gawk -f code/sim-asym-dif.awk data/similarity-stoich.tsv
The output is a table of every pair of elements:
\(e_1\) | \(e_2\) | \(e_1 \to e_2\) | \(e_2 \to e_1\) |
---|---|---|---|
Lu | No | 0.00101334 | 0.461538 |
Lu | Ra | 0.00371559 | 0.431373 |
The code is stored in this file.
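To see how asymmetric a pair is, the two directional values can be subtracted. This is only an illustrative sketch: it assumes a tab-separated table with the four columns shown above, and the absolute difference is a choice made here, not necessarily the quantity computed by `sim-asym-dif.awk`. The helper name `asym_diff` is hypothetical.

```shell
# Two sample rows taken from the table above, tab-separated.
sample=$(printf 'Lu\tNo\t0.00101334\t0.461538\nLu\tRa\t0.00371559\t0.431373\n')

# Print each pair followed by the absolute difference of the two directions.
asym_diff() {
    awk -F '\t' '{
        d = $3 - $4
        if (d < 0) d = -d
        printf "%s\t%s\t%.6f\n", $1, $2, d
    }'
}

diffs=$(printf '%s\n' "$sample" | asym_diff)
printf '%s\n' "$diffs"
```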
7 Figure 5
Singularity by year. The data is obtained in two steps.
The following code generates the data for each year and stores it in separate files named singularity-YEAR.tsv, where YEAR is the year.
function echo_year() {
    gawk -F "\t" -v year=$1 '$3 && $3<=year' data/formulae-compounds-minyear.tsv
}

count=0
for y in {1800..2015}
do
    echo_year $y | gawk -f neigh.awk | gawk -f sim.awk | gawk -v year=$y -f sing-yearly.awk > yearly/singularity-$y.tsv &
    let count+=1
    [[ $((count%50)) -eq 0 ]] && wait
done
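The loop above combines two ideas: a cumulative year filter (`$3 && $3<=year` keeps every record that has a year not later than the given one) and batches of background jobs separated by `wait`. The self-contained sketch below reproduces both on a toy file. The sample formulae, years and batch size are made up for illustration, and the three-column layout (formula, number of compounds, year) is an assumption inferred from the commands in this document. A final `wait` is added so that jobs from the last, incomplete batch finish before the results are read.

```shell
workdir=$(mktemp -d)

# Toy input: formula, number of compounds, year (hypothetical values).
printf 'H2O\t3\t1800\nCO2\t2\t1802\nNaCl\t5\t1803\nXeF2\t1\t\n' > "$workdir/toy.tsv"

# Keep records that have a year and whose year is not later than the requested one.
echo_year() {
    awk -F '\t' -v year="$1" '$3 && $3 <= year' "$workdir/toy.tsv"
}

count=0
for y in 1800 1801 1802 1803 1804; do
    echo_year "$y" | wc -l | tr -d ' ' > "$workdir/count-$y.txt" &
    count=$((count + 1))
    # After every second job, wait for the running batch to finish.
    if [ $((count % 2)) -eq 0 ]; then wait; fi
done
wait   # collect the jobs of the last, possibly incomplete, batch

# Number of records available in each year, in chronological order.
counts=$(cat "$workdir"/count-180[0-4].txt | paste -s -d ' ' -)
echo "$counts"
rm -r "$workdir"
```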
The following awk code collects the singularities of the elements for all years. Its input is the set of files produced above (singularity-YEAR.tsv).
{
    sing[$1][$2]=$3/$4
    tot[$2]=$4
}
END{
    # traverse tot by value, largest first (gawk-specific)
    PROCINFO["sorted_in"]="@val_num_desc"
    printf "year"
    for (e in tot){
        printf "\t%s",e
    }
    print ""
    for (y=1800;y<=2015;y++){
        printf "%d",y
        for (e in tot){
            printf "\t%s",sing[y][e] ? sing[y][e]*100 : ""
        }
        print ""
    }
}
The result is a file whose first column is the year; the remaining columns give the percentage of singularity of each element in that year. The elements are sorted in descending order of total frequency (number of formulae).
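Because the result is a wide table with one column per element, plotting a single element requires locating its column by name in the header. A minimal sketch, assuming a tab-separated file whose header row is `year` followed by the element symbols (as printed by the awk code above); the numbers in the sample are invented, and `element_series` is a hypothetical helper.

```shell
# Invented sample in the wide layout: year, then one column per element.
sample=$(printf 'year\tH\tC\tO\n1800\t12.5\t\t3\n1801\t12\t40\t2.5\n')

# Print "year<TAB>value" for one element, skipping years with no value.
element_series() {
    awk -F '\t' -v elem="$1" '
        NR == 1 { for (i = 2; i <= NF; i++) if ($i == elem) col = i; next }
        col && $col != "" { print $1 "\t" $col }
    '
}

series_c=$(printf '%s\n' "$sample" | element_series C)
series_h=$(printf '%s\n' "$sample" | element_series H)
printf '%s\n' "$series_c"
```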
8 Figure 6
Network of the most similar element. The data is obtained as follows:
gawk -f code/pt-most-similar.awk data/elem-z-group-period.csv data/sim-rank-stoich.sv
The code can be found here.
9 Figure 7
Distributions of coverage measures. Data is obtained using three different measures.
gawk -f code/rank-sim-comb.awk data/groups.tsv data/similarity-stoich.tsv > data/rank-sim-comb.csv
gawk -f code/rank-sim-diverse.awk data/groups.tsv data/similarity-stoich.tsv > data/rank-sim-diverse.csv
gawk -f code/rank-sim-mol-size.awk data/groups.tsv data/similarity-stoich.tsv > data/rank-sim-mol-size.csv
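The three commands differ only in the name of the measure, so they can also be generated by a loop. The sketch below only prints the commands (a dry run), which makes it safe to try outside the repository; the function name `rank_sim_commands` is illustrative.

```shell
# Print one gawk command per coverage measure, following the file naming used above.
rank_sim_commands() {
    for measure in comb diverse mol-size; do
        printf 'gawk -f code/rank-sim-%s.awk data/groups.tsv data/similarity-stoich.tsv > data/rank-sim-%s.csv\n' "$measure" "$measure"
    done
}

rank_sim_commands
```

Piping the output to `sh` from the repository root would execute the commands instead of printing them.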
10 Figure 8
Rank asymmetry vs Rank value.
gawk -f code/asymmetry.awk data/groups.tsv data/rank-sim-comb.csv > data/asym-sym-comb.csv
The result is a table
11 Table 1
To calculate the family clusters, we developed a simple algorithm in Common Lisp that performs a hierarchical cluster analysis (HCA), and then extracted the family clusters from its result. The code is in this file.
sbcl --script code/rank-motif.lisp
This command creates the file data/rank-motif.nwk, which contains the dendrogram in Newick format.
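A quick sanity check on a Newick file is to list its leaf labels, which here should be element symbols. The sketch below splits a Newick string on its structural characters and drops branch lengths; it assumes that internal nodes carry no labels and that labels are unquoted. The sample tree is invented, not taken from data/rank-motif.nwk, and `newick_leaves` is a hypothetical helper.

```shell
# Invented example tree with branch lengths.
tree='((H:0.1,Li:0.2):0.3,(F:0.1,Cl:0.1):0.4);'

# Print one leaf label per line.
newick_leaves() {
    awk '{
        n = split($0, tok, /[(),;]/)
        for (i = 1; i <= n; i++) {
            sub(/:.*/, "", tok[i])      # drop the branch length, if any
            if (tok[i] != "") print tok[i]
        }
    }'
}

leaves=$(printf '%s\n' "$tree" | newick_leaves | paste -s -d ' ' -)
echo "$leaves"
```

On the real file, the same function would be called as `newick_leaves < data/rank-motif.nwk`.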
12 Figure 9
To obtain normalized family rankings, we run the following code:
gawk -f code/elem-group-rank.awk data/elements data/groups.tsv rank-mol-size.csv > data/elem-group-rank-molsize.csv
The result is a table as follows:
Z | Symbol | Norm. Rank | Symbol | Z | Family |
---|---|---|---|---|---|
1 | H | 0.2 | F | 9 | 16 |
1 | H | 0.4 | Cl | 17 | 16 |
1 | H | 0.8 | Br | 35 | 16 |
1 | H | 1.6 | I | 53 | 16 |
2 | He | 0.2 | Ar | 18 | 17 |
2 | He | 0.4 | Ne | 10 | 17 |
2 | He | 2.6 | Xe | 54 | 17 |
2 | He | 6 | Kr | 36 | 17 |
3 | Li | 0.25 | Na | 11 | 1 |
3 | Li | 0.5 | K | 19 | 1 |
3 | Li | 0.75 | Cs | 55 | 1 |
3 | Li | 1 | Rb | 37 | 1 |
The columns are, in order: the nuclear charge (Z) of the first element, the symbol of the first element, the normalized rank of the second element, the symbol of the second element, the nuclear charge of the second element, and the family that both elements belong to.
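With this layout, the family members of a given element can be listed in order of their normalized rank. The sketch below uses rows copied from the table above (deliberately shuffled) and assumes tab-separated columns; adjust `-F` if the file uses another delimiter. `family_of` is a hypothetical helper.

```shell
# Rows from the table above, in shuffled order, tab-separated.
sample=$(printf '2\tHe\t2.6\tXe\t54\t17\n2\tHe\t0.2\tAr\t18\t17\n3\tLi\t0.25\tNa\t11\t1\n2\tHe\t6\tKr\t36\t17\n2\tHe\t0.4\tNe\t10\t17\n')

# Print "rank<TAB>symbol" for every family member of the element in $1, lowest rank first.
family_of() {
    awk -F '\t' -v elem="$1" '$2 == elem { print $3 "\t" $4 }' | LC_ALL=C sort -n
}

he_family=$(printf '%s\n' "$sample" | family_of He | cut -f 2 | paste -s -d ' ' -)
echo "$he_family"
```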
13 Figure 10a
To generate Figure 10a, a series of random experiments is performed. Run the following code in the shell.
select_random_mol (){
    gawk -F "\t" '{for (i=1;i<=$2;i++) print $1}' data/formulae-compounds-minyear.tsv |
        shuf -n $1 | gawk -f code/neigh | gawk -f code/sim.awk | gawk -f sim-rank-comb.awk
}

for p in {10..90..10}
do
    lines=$(gawk -v l=$p '{count+=$2}END{print int(l*count/100)}' data/formulae-compounds-minyear.tsv)
    for replica in {1..100}
    do
        select_random_mol $lines > data/random-compounds/rank-$p-$replica.csv
    done
done
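The sampling step repeats each formula once per compound, so that `shuf` draws compounds rather than formulae, and converts the percentage into a number of lines. A toy version of these two steps is sketched below with invented formulae and counts; the two-column layout (formula, number of compounds) is inferred from the command above, the helper names are hypothetical, and `shuf` is assumed to be available (it is part of GNU coreutils).

```shell
# Invented toy data: formula and number of compounds, tab-separated.
toy=$(printf 'H2O\t3\nCO2\t2\nNaCl\t5\n')

# One line per compound: a formula with n compounds is printed n times.
expand_formulae() {
    awk -F '\t' '{ for (i = 1; i <= $2; i++) print $1 }'
}

# Number of lines that corresponds to a given percentage of all compounds.
lines_for_percent() {
    awk -F '\t' -v l="$1" '{ count += $2 } END { print int(l * count / 100) }'
}

total=$(printf '%s\n' "$toy" | expand_formulae | wc -l | tr -d ' ')
lines=$(printf '%s\n' "$toy" | lines_for_percent 30)
sample_size=$(printf '%s\n' "$toy" | expand_formulae | shuf -n "$lines" | wc -l | tr -d ' ')
echo "total=$total lines=$lines sampled=$sample_size"
```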
After generating all the rankings for each random experiment, the percentage of elements in family clusters is calculated. The following code generates 100 dendrograms for each percentage of compounds.
sbcl --script code/random-compounds-trees.lisp
For each dendrogram from a random experiment, the percentage of elements that fall in family clusters is calculated.
sbcl --script code/random-compounds-percentage.lisp > data/rand-comp-percentage.csv
The result is in this file.
14 Figure 10b
In order to obtain the data required for this figure, we deleted compounds from selected motifs in decreasing order of frequency. The first step is to obtain the motifs in decreasing order.
gawk -f code/motif-freq.awk data/similarity-stoich.tsv > data/motifs-freq.tsv
Then, the ranks are generated by the following code:
for i in 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 20000 30000 40000 50000 60000 70000 80000 90000 100000
do
    gawk -v exclude=$i -f delete-combinations.awk similarity-stoich.csv | gawk -F "\t" -f sim-rank-comb.awk > data/motif-del/rank-comb-$i.csv &
done
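The list of thresholds follows two regular progressions (steps of 1000 up to 9000, then steps of 10000 up to 100000), so it can be generated instead of typed. A sketch using `seq`, which is assumed to be available (it ships with GNU coreutils):

```shell
# 1000, 2000, ..., 9000 followed by 10000, 20000, ..., 100000
thresholds=$( { seq 1000 1000 9000; seq 10000 10000 100000; } | paste -s -d ' ' - )
echo "$thresholds"
```

The loop header could then read `for i in $thresholds`, leaving the loop body unchanged.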
From the rankings for each element, dendrograms are calculated for each set of compounds.
sbcl --script code/motif-deletion-trees.lisp