Proteinortho4

NAME

Proteinortho4 - Orthology detection tool

SYNTAX

proteinortho4.pl [OPTIONS]... <FILE1> <FILE2>... >OUTPUT
proteinortho4.pl [OPTIONS]... <FILELIST> >OUTPUT

DESCRIPTION

Predicts orthologous and co-orthologous proteins within different species.
Given a list of proteome files in fasta format, Proteinortho runs an all-against-all blast search and applies a partitioning algorithm to the adaptive best blast hits to infer (co-)orthologous groups.
This tool is designed to deal with large data sets and keeps memory consumption low even when applied to millions of proteins.

Important: Protein ids must be globally unique! Note also that blast may cut ids at the first whitespace and use only the first part.
Proteome files can be given directly on the command line (<FILE1> <FILE2>...) or in a <FILELIST> containing one filename per line.
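
The uniqueness requirement can be checked before a run. The following is a minimal sketch, not part of Proteinortho: the function name find_duplicate_ids and the sample sequences are made up for illustration. It compares only the first whitespace-separated token of each fasta header, since that is the part blast may keep.

```python
from collections import defaultdict


def find_duplicate_ids(fasta_texts):
    """Map each protein id that occurs more than once to the files containing it.

    fasta_texts: dict of {file name: fasta content as a string}.
    Only the first whitespace-separated token of each header line is used,
    because blast may cut ids at the first whitespace.
    """
    seen = defaultdict(list)
    for name, text in fasta_texts.items():
        for line in text.splitlines():
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    seen[fields[0]].append(name)
    return {pid: files for pid, files in seen.items() if len(files) > 1}


# Hypothetical input: the id "a2" is used in both proteomes.
example = {
    "speciesA.faa": ">a1 kinase\nMKV\n>a2 transporter\nMLL\n",
    "speciesB.faa": ">b1 kinase\nMKI\n>a2 hypothetical protein\nMLA\n",
}
duplicates = find_duplicate_ids(example)
print(duplicates)
```

An empty result means that no id is shared between or within the given files.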

OUTPUT FORMAT

The OUTPUT is a tab separated matrix.
First line starts with # followed by the column and file names, respectively.
Second line starts with # followed by the corresponding values and numbers of proteins for each file.
Each following line represents a (co-)orthologous group. Besides the number of species and proteins, the (approximated) algebraic connectivity of each group is given. This value indicates the degree of conservation within each group: groups with values close to 1 are highly conserved, whereas groups with values close to 0 are poorly conserved.
The last line starts with # and reports the version used as well as the applied parameters.
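
The layout described above can be read with a few lines of code. This is a sketch under assumptions: the column names (species, proteins, alg.-conn.), the comma-separated protein ids and the trailing line in the sample are invented for illustration and may differ from real Proteinortho output. Only the rules stated above are relied on: tab separation, a header on the first line, and # at the start of every non-group line.

```python
def parse_orthology_table(text):
    """Split a tab-separated orthology matrix into header columns and group rows."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # The first line carries the column and file names after the leading '#'.
    header = lines[0].lstrip("#").strip().split("\t")
    groups = []
    for line in lines[1:]:
        if line.startswith("#"):
            continue  # totals line and trailing version/parameter line
        groups.append(dict(zip(header, line.split("\t"))))
    return header, groups


# Hypothetical sample; column names and cell contents are assumptions.
sample = "\n".join([
    "#species\tproteins\talg.-conn.\tspeciesA.faa\tspeciesB.faa",
    "#2\t5\t-\t3\t2",
    "2\t2\t1\ta1\tb1",
    "2\t3\t0.42\ta2,a3\tb2",
    "#Proteinortho4 (hypothetical trailer line)",
])
header, groups = parse_orthology_table(sample)

# Select poorly conserved groups by their connectivity value.
weak = [g for g in groups if float(g["alg.-conn."]) < 0.5]
print(header)
print(len(groups), len(weak))
```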

OPTIONS

-e=<E-VALUE>: <E-VALUE> threshold for blasts [default: 1e-10]
-p=blastp|blastn: Defines the blast program [default: blastp]
Use blastp for amino acid sequences (.faa)
Use blastn for nucleotide sequences (.fna)
-id=(0..100): Min. percent identity of best blast hits [default: 25] Hits below this level will be ignored.
-cov=(0..1): Min. coverage of best blast hits [default: 0.5] Hits below this level will be ignored.
-conn=(0..1): Min. algebraic connectivity for each (co-)orthologous group [default: 0.1] Proteinortho will split groups until the given level of connectivity is reached. Raising this level can be useful to remove less conserved paralogs from the output. The average group size will decrease. A value of 0.5 is already very strict. Raising the value even higher is not recommended unless you want to focus on strongly conserved sets only.
-m=(0..1): Min. similarity for additional hits [default: 0.95 (nearly equal)]
All blast hits with score/bestscore >= m are included, even if they are not the best hit. This option allows recovering (co-)orthologous groups even if ambiguous paralogs exist. Setting this value to 1 corresponds to running a regular (non-adaptive) reciprocal best blast hit approach. Lowering this value will potentially include more paralogous proteins in the groups. Values lower than 0.75 are not recommended unless you know what you are doing.
-pairs: Do not remove simple pairs from output
(Co-)orthologous groups of size two are very likely to occur by chance, so they are normally removed. However, these groups might be of interest to some users as well, especially if the number of species is small.
-selfblast: Apply blast for each species against itself.
Proteinortho infers paralogous genes indirectly from comparisons to other species. As a consequence, paralogs will not be detected if there is no co-orthologous gene in any other species. Use this option to recover them as well. This will significantly increase runtime. Using -pairs in addition is recommended.
-unambiguous: Exclude connected components with paralogs from output
Beware: This option might exclude a considerable number of groups.
-a=<THREADS>: The number of processors to use [default: auto]
-noiolimit: Proteinortho automatically limits the number of concurrent I/O threads to spare the hard disk. If you use an SSD or a RAM disk, you can disable this behavior to speed up the analysis.
-f: Force blastall (even if blast output is found)
-ff: Force formatdb (even if databases are found), implies -f

-dir=<DIRECTORY>

Defines the <DIRECTORY> for the blast outputs [default: working directory]

-remove

Proteinortho allows reusing the blast output for additional analyses, which saves a significant amount of time. However, if you do not intend to run an additional analysis with at least some of these species, you can tell Proteinortho to remove unnecessary files.

-log=<FILE>

Writes a detailed log of reciprocal best hits to <FILE>.

-o=<FILE>

Prints the output to the given <FILE> rather than STDOUT.

-verbose

Gives information about what happens, including a progress report

-debug

Gives detailed information for bug tracking
Does not work in combination with -verbose
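
The effect of the -m option can be illustrated with a small sketch. It assumes the threshold is a minimum ratio between a hit's score and the best score for the same query, which is the reading under which m = 1 reduces to plain best hits. The function name adaptive_best_hits and the scores are hypothetical, and this is not Proteinortho's actual implementation.

```python
def adaptive_best_hits(hits, m=0.95):
    """Return the targets whose score is at least m times the best score.

    hits: dict of {target id: blast score} for a single query protein.
    Scores are assumed to be positive (for example bit scores).
    """
    if not hits:
        return []
    best = max(hits.values())
    return sorted(t for t, score in hits.items() if score >= m * best)


# Hypothetical scores of one query protein against three targets.
scores = {"b1": 200.0, "b2": 195.0, "b3": 120.0}
near_best = adaptive_best_hits(scores, m=0.95)  # b2 is within 5% of the best hit
only_best = adaptive_best_hits(scores, m=1.0)   # classic best hit only
relaxed = adaptive_best_hits(scores, m=0.5)     # admits the weak hit as well
print(near_best, only_best, relaxed)
```

Lowering m widens the set of accepted hits, which is why very low values tend to pull additional paralogs into a group.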

MULTIPLE MACHINE OPTIONS

The main part of Proteinortho consists of blasting each species against each other. This can take several hours up to days if hundreds of species are involved, even on multi-core machines. For this purpose, a mechanism has been implemented that allows distributing the workload over multiple machines.
Every option aside from -a=<THREADS> needs to be the same on all machines. This is especially important for the directory in which the blasts are stored. A file named sync will be created there and used to synchronize the processes. As flock cannot be relied on for network file systems, a temporary directory named lock is used for locking. Both may need to be removed if Proteinortho was interrupted or crashed and a restart is intended.
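
Locking through a directory can be sketched as follows. This illustrates the general technique only and is not Proteinortho's code: the function names, retry counts and delays are assumptions. It relies on directory creation either succeeding or failing with an "already exists" error, so only one process obtains the lock at a time.

```python
import os
import tempfile
import time


def acquire_lock(lock_dir, retries=50, delay=0.1):
    """Try to create lock_dir; return True once this process holds the lock."""
    for _ in range(retries):
        try:
            os.mkdir(lock_dir)
            return True
        except FileExistsError:
            time.sleep(delay)  # another process holds the lock, try again later
    return False


def release_lock(lock_dir):
    """Give up the lock by removing the directory again."""
    os.rmdir(lock_dir)


blast_dir = tempfile.mkdtemp()  # stands in for the shared blast directory
lock_path = os.path.join(blast_dir, "lock")

first = acquire_lock(lock_path)
second = acquire_lock(lock_path, retries=2, delay=0.01)  # lock is already held
release_lock(lock_path)
third = acquire_lock(lock_path)
release_lock(lock_path)
print(first, second, third)
```

A lock directory left behind by a crashed process blocks every later attempt, which is why it may have to be removed by hand before a restart.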

Run all scripts using the option

-blastonly

As the scripts synchronize themselves, the order and time at which you start them on different machines do not matter. You can even stop certain processes if needed. See SIGNALS for more details on that topic. After the blasts are done, all started scripts will terminate.

Once that has happened, you can collect the results and finish the calculations. Start the script again on one machine using the same options as before. Instead of -blastonly use the option

-blastdone

This skips database creation and the blasts and thus speeds up the start of the connected component calculation.

-batch

Returns a batch-list of jobs on STDOUT
This is preferable to -blastonly if you use a cluster-management system.
"wait;" will indicate that all jobs above have to be finished before proceeding further
Requires -o=<FILE>
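
How a scheduler might interpret the "wait;" markers can be sketched as follows. The job lines are placeholders and the exact content of a real batch list is not shown here. The sketch assumes only that jobs appear one per line and that "wait;" stands on a line of its own.

```python
def split_into_stages(batch_text):
    """Group job lines into stages; a stage ends at each 'wait;' line."""
    stages, current = [], []
    for line in batch_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "wait;":
            if current:
                stages.append(current)
                current = []
        else:
            current.append(line)
    if current:
        stages.append(current)
    return stages


# Placeholder jobs; real batch lists contain the actual commands.
batch = """
echo prepare speciesA
echo prepare speciesB
wait;
echo compare speciesA speciesB
wait;
echo finish
"""
stages = split_into_stages(batch)
for number, jobs in enumerate(stages, start=1):
    print(number, jobs)
```

All jobs within one stage can be submitted in parallel, while the next stage must not start before every job of the previous stage has finished.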

SIGNALS

Sending the signal INT or TERM to a Proteinortho process will lead to a clean stop, which allows a later continuation from this point. If used on MULTIPLE MACHINES, this allows stopping certain processes without interfering with the ongoing calculation. As running blast jobs need to finish first, the termination may take a while.

However, sending the signal twice (or using KILL) will lead to an immediate stop and may result in corrupted data. In that case it is advisable to remove all files from the blast output directory and not use the data any further. This also applies if the blasts were distributed over multiple machines.

Furthermore, if a full stop of all processes on MULTIPLE MACHINES is intended, a file named stop can be placed in the blast output directory. This will lead to a clean stop, as described above, for all running scripts.
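
The stop behavior described in this section can be modeled in a few lines. This is a sketch of the general pattern and not Proteinortho's implementation: the class StopController and its methods are hypothetical. The first INT or TERM only sets a flag that a worker loop checks between jobs, a second signal exits immediately, and a file named stop in the blast output directory has the same effect as the first signal.

```python
import os
import signal
import tempfile


class StopController:
    """Track stop requests from signals and from a 'stop' file."""

    def __init__(self, blast_dir):
        self.blast_dir = blast_dir
        self.signals_seen = 0

    def handle(self, signum, frame):
        # First signal: remember the request. Second signal: leave at once.
        self.signals_seen += 1
        if self.signals_seen >= 2:
            raise SystemExit(1)

    def install(self):
        # Register the handler; call this from the main thread of a real script.
        signal.signal(signal.SIGINT, self.handle)
        signal.signal(signal.SIGTERM, self.handle)

    def stop_requested(self):
        stop_file = os.path.join(self.blast_dir, "stop")
        return self.signals_seen > 0 or os.path.exists(stop_file)


# Demonstration without sending real signals.
blast_dir = tempfile.mkdtemp()  # stands in for the blast output directory
controller = StopController(blast_dir)
before = controller.stop_requested()
open(os.path.join(blast_dir, "stop"), "w").close()  # request a full stop
after_file = controller.stop_requested()

other = StopController(tempfile.mkdtemp())
other.handle(signal.SIGINT, None)  # simulate a single INT
after_signal = other.stop_requested()
print(before, after_file, after_signal)
```

A worker would call stop_requested() after each finished blast job and exit cleanly when it returns True.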

EXAMPLES

To run this program the standard way, comparing two or more species, type:
proteinortho4.pl speciesA.faa speciesB.faa >orthologs.out

If you want a live progress report and want to store the blast files in a separate folder, type:
mkdir blastout/
proteinortho4.pl -verbose -dir=blastout/ files.list >orthologs.out

If you use a cluster-management system and want to handle threads yourself, type:
mkdir blastout/
proteinortho4.pl -batch -dir=blastout/ -o=orthologs.out files.list > batch.sh
Now batch.sh can be executed via the management software.

COPYRIGHT

Copyright 2008 Free Software Foundation, Inc. License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

REPORTING BUGS

Marcus Lechner <marcus[at]bioinf.uni-leipzig.de>

AUTHORS

Written by Marcus Lechner and Lydia Steiner
Interdisciplinary Center for Bioinformatics, University of Leipzig