It can be started either by listing the input files in FASTA format (at least two) after the OPTIONS, or by giving a single file that contains the paths to them. The latter is especially useful when the number of files grows. Each file should contain all proteins (or the subset that should be investigated) of one species.
In a first step, all files are blasted against each other. The hits are evaluated according to the given OPTIONS and transformed into a graph in which each protein is represented by a node. This graph is then split into its connected components, i.e. groups of proteins that are connected to each other.
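The graph step described above can be illustrated with a small sketch. It is not Proteinortho's own code: the function name and the protein ids are made up for illustration, and each pair simply stands for a blast hit that passed the OPTIONS filter.

```python
from collections import defaultdict

def connected_components(edges):
    """Group node ids into the connected components of an undirected graph."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Iterative depth-first search from a node not visited so far.
        stack = [start]
        seen.add(start)
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components

# Hypothetical protein ids; each pair stands for one accepted blast hit.
hits = [("A_p1", "B_p7"), ("B_p7", "C_p3"), ("A_p2", "B_p9")]
groups = connected_components(hits)
```

Here A_p1 and C_p3 end up in the same group although they share no direct hit, because both are connected to B_p7.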
Proteinortho was designed to deal with large data sets while keeping memory consumption low. However, it works for small sets as well.
Details about the algorithm and benchmarks can be found in this thesis.
The first line starts with # followed by the file names. The second line starts with # followed by the corresponding number of proteins in each file.
Every following line represents one connected component and thus lists the ids of the proteins determined to be orthologous.
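A reader for this output might look like the following sketch. Two points are assumptions rather than facts taken from this text: that fields are separated by whitespace, and that the comma convention mentioned for paralogous protein ids applies within a field. The function name and the sample content are hypothetical.

```python
def parse_components(text):
    """Return (file_names, protein_counts, components) from output text."""
    header = []
    components = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # The two header lines: file names, then protein counts.
            header.append(line[1:].split())
            continue
        # ASSUMPTION: whitespace separates fields, commas separate paralogs.
        components.append([field.split(",") for field in line.split()])
    file_names = header[0] if len(header) > 0 else []
    counts = [int(n) for n in header[1]] if len(header) > 1 else []
    return file_names, counts, components

# Hypothetical output, made up for this example.
sample = (
    "# speciesA.faa speciesB.faa\n"
    "# 2 3\n"
    "A_p1 B_p7,B_p8\n"
    "A_p2 B_p9\n"
)
names, counts, comps = parse_components(sample)
```

If the real output uses a different separator, only the two `split()` calls need to change.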
proteinortho.pl [OPTIONS] <FILES> >OUTPUT
proteinortho.pl [OPTIONS] <FILE> >OUTPUT
[default: 1e-10]
[default: 1]
[default: blastp]
[default: 1 (enabled)]
[default: 0.95 (nearly equal)]
[default directory: working directory]
paralogous protein ids are separated by ","
[default file: cc.matrix]
[default file: pb.log]
[default file: cc.log]
the creation is very time intensive and not recommended if more than 10,000 proteins are involved
[default file: ultimate.log]
The main part of Proteinortho consists of blasting each species against each other. This can take from several hours up to days if hundreds of species are involved, even on multi-core machines. For this reason a mechanism has been implemented that distributes the workload over multiple machines.
Every option aside from -a=#THREADS needs to be the same on all machines. This is especially important for the directory in which the blasts are stored. A file named sync will be created there and used to synchronize the processes. As flock does not work reliably on network file systems, a temporary directory named lock/ is used for locking. Both may need to be removed if Proteinortho was interrupted or crashed and a restart is intended.
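The idea of locking through a directory can be sketched as follows. This is an illustration of the general technique, not Proteinortho's actual code: creating a directory either succeeds or fails because it already exists, so only one process can hold the lock at a time. The function name, polling interval and timeout are assumptions made for the example.

```python
import os
import time
from contextlib import contextmanager

@contextmanager
def directory_lock(path, poll_seconds=0.1, timeout=None):
    """Hold a lock by creating `path` as a directory; remove it on exit."""
    waited = 0.0
    while True:
        try:
            os.mkdir(path)  # raises FileExistsError if another holder exists
            break
        except FileExistsError:
            if timeout is not None and waited >= timeout:
                raise TimeoutError(f"could not acquire {path}")
            time.sleep(poll_seconds)
            waited += poll_seconds
    try:
        yield
    finally:
        # Release the lock. A killed process never reaches this line,
        # which leaves a stale directory behind.
        os.rmdir(path)
```

A caller would wrap the critical section in `with directory_lock("lock"):`. The stale directory left behind by a killed process is the reason such a lock may have to be removed by hand before a restart.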
Run all scripts using the option
As the scripts synchronize themselves, neither the order nor the time at which you start them on different machines matters. You can even stop certain processes if needed. See SIGNALS for more details on that topic. After the blasts are done, all started scripts will terminate.
Once that has happened, you can collect the results and finish the calculations. Start the script again on one machine using the same options as before. Instead of -blastonly use the option
To run this program the standard way, comparing two or more species, type:
proteinortho.pl speciesA.faa speciesB.faa >orthologs.out
If you want to define the number of threads, get a live progress report and store the blast files in a separate folder, type:
mkdir blastout
proteinortho.pl -a=4 -verbose -dir=blastout/ files.list >orthologs.out
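A list file such as files.list could be generated with a short script like the one below. It is a sketch under one assumption that this text does not confirm: that the list holds one path per line. The function name and the `*.faa` pattern are illustrative choices.

```python
from pathlib import Path

def write_file_list(fasta_dir, list_path, pattern="*.faa"):
    """Write the paths of all matching FASTA files to list_path."""
    paths = sorted(str(p) for p in Path(fasta_dir).glob(pattern))
    if len(paths) < 2:
        # At least two input files are required.
        raise ValueError("at least two FASTA files are required")
    # ASSUMPTION: one path per line.
    Path(list_path).write_text("\n".join(paths) + "\n")
    return paths
```

Calling `write_file_list("genomes", "files.list")` on a hypothetical genomes/ directory would collect every .faa file in it.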
Marcus Lechner marcus[at]bioinf[dot]uni-leipzig[dot]de
Written by Marcus Lechner, Lydia Steiner and Sonja J. Prohaska
Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, University of Leipzig
orthomatrix2tree.pl - a tool that generates trees based on the shared proteins