Proteinortho Manual / PoFF Manual

This manual corresponds to version 6.0 beta

Introduction

Proteinortho is a tool to detect orthologous genes within different species. To do so, it compares similarities of given gene sequences and clusters them to find significant groups. The algorithm was designed to handle large-scale data and can be applied to hundreds of species at once. Details can be found in Lechner et al., BMC Bioinformatics. 2011 Apr 28;12:124.
To enhance the prediction accuracy, the relative order of genes (synteny) can be used as an additional feature for the discrimination of orthologs. The corresponding extension, namely PoFF (manuscript in preparation), is already built into Proteinortho.

Installation

Prerequisites

Proteinortho uses standard software which is often installed already or is part of the package repositories and can thus easily be installed. To run Proteinortho, you need

NCBI BLAST+ or NCBI BLAST legacy
(to test this, type tblastn (BLAST+) or blastall (BLAST legacy) in the command line)
Perl v5.08 or higher
(to test this, type perl -v in the command line)
Python v2.6.0 or higher to include synteny analysis
(to test this, type python -V in the command line)

Proteinortho optionally supports similarity search using

Diamond (to test this, type 'diamond' in the command line)
Last (to test this, type 'lastal' in the command line)
Rapsearch (to test this, type 'rapsearch' in the command line)
Topaz (to test this, type 'topaz' in the command line)
usearch (to test this, type 'usearch' in the command line)
ublast (is part of usearch)
blat (to test this, type 'blat' in the command line)

The sources come with a precompiled version of Proteinortho for 64-bit Linux. If you want or need to recompile and install it, you will also need

GNU make (to test this, type make in the command line)
GNU g++ (to test this, type g++ in the command line)
CMake (to test this, type cmake in the command line)
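
The manual checks listed above can also be scripted. The following is a minimal sketch, assuming that "typing the name in the command line" is equivalent to looking the executable up on PATH; the grouping into REQUIRED, BLAST, OPTIONAL and BUILD is only an illustration and not part of Proteinortho.

```python
import shutil

# Executable names are taken from the "to test this, type ..." hints above.
# The grouping below is an assumption made for this sketch.
REQUIRED = ["perl"]
BLAST = ["tblastn", "blastall"]  # BLAST+ or BLAST legacy: one of them is enough
OPTIONAL = ["python", "diamond", "lastal", "rapsearch", "topaz", "usearch", "blat"]
BUILD = ["make", "g++", "cmake"]


def find_tools(names):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def report():
    """Print one line per tool so that missing prerequisites are easy to spot."""
    groups = [("required", REQUIRED), ("blast", BLAST),
              ("optional", OPTIONAL), ("build", BUILD)]
    for label, names in groups:
        for name, path in find_tools(names).items():
            print(f"{label:9s} {name:10s} {path or 'NOT FOUND'}")


report()
```

This only tells you whether an executable is found, not whether its version is recent enough; the version checks (perl -v, python -V) still have to be done as described above.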

Building and installing from source

Fetch the latest source code archive from www.bioinf.uni-leipzig.de/Software/proteinortho/.
Extract the files e.g. via tar -xzvf proteinortho_v6.0b.tar.gz
Change directory into the extracted folder e.g. via cd proteinortho_v6.0b
You can now run ./proteinortho6.pl directly
If you want to recompile and install Proteinortho, type make followed by sudo make install (requires root privileges).
In any case, run make test to make sure Proteinortho works as expected.

Quick Start

proteinortho5.pl [OPTIONS] FASTA1 FASTA2 [FASTAn...] performs an orthology analysis for the given sets of proteins. Add -p=blastn in case your sequences are represented as nucleotides (ACTG) rather than as amino acids.

Quick Tutorial

This manual assumes that you have Proteinortho installed on your system and can thus directly run it via proteinortho5.pl. If you have not installed it but only downloaded and extracted it to a folder, please use /FULL/PATH/TO/proteinortho5.pl instead.

Proteinortho assumes that you have all your gene sequences in FASTA format, represented either as amino acids or as nucleotides. The source code archive contains some examples, namely C.faa, E.faa, L.faa, M.faa located in the test/ directory. By default Proteinortho assumes amino acids and thus uses blastp+ to compare sequences. If you have nucleotide sequences, you need to change this by adding the parameter -p=blastn+. (In case you only have NCBI BLAST legacy installed, you need to state this too, by adding -p=blastp or -p=blastn respectively.) The full command for the example files would thus be proteinortho5.pl -project=test test/C.faa test/E.faa test/L.faa test/M.faa. Instead of naming the FASTA files one by one, you could also supply test/*.faa as argument. Please note that the parameter -project=test is optional. With this, you can set the prefix of the output files generated by Proteinortho. If you skip the project parameter, the default project name will be myproject.

Proteinortho will automatically determine the number of available CPU threads and use them accordingly to speed up the calculations. You can use the parameter -cpus= to manually set the number of threads. When the analysis is done, you will find a new file in your current working directory, namely test.proteinortho. To have a quick look, you can e.g. use less -S test.proteinortho. The tab-separated output generated looks like this:

# Species	Genes	Alg.-Conn.	M.faa	L.faa	C.faa		E.faa
4		4	1		M_10	L_10	C_10		E_10
4		4	1		M_11	L_11	C_11		E_11
4		4	1		M_14	L_14	C_14		E_14
...
4		5	0.2		M_19	L_19	C_22,C_63	E_19
...

The first line, starting with #, is a comment line indicating the meaning of each column. Each of the following lines represents one orthologous group. The very first column indicates the number of species covered by this group. The second column indicates the number of genes included in the group. Often, this number will equal the number of species, meaning that there is a single ortholog in each species. If the number of genes is larger than the number of species, co-orthologs are present. The third column gives the algebraic connectivity of the respective group. Basically, this indicates how densely the genes are connected in the orthology graph that was used for clustering. A connectivity of 1 indicates a perfectly dense cluster with each gene similar to each other gene. By default, Proteinortho splits each group into two denser subgroups when the connectivity is below 0.1. In the second-to-last line of the example above, there is a group with two paralogs in species C (C.faa). They are separated by a comma (,), indicating that they are co-orthologous to the genes in the other species.
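
For further processing, the table can be read with a few lines of Python. The sketch below is not part of Proteinortho; it assumes that fields in the real file are separated by single tabs (the example above uses extra tabs for visual alignment) and that a species without a gene in a group is marked by an empty cell or by "*", which is not specified in this manual. The function names read_groups and co_ortholog_groups are made up for this illustration.

```python
import io

# Two rows copied from the example above, with single tabs between fields.
SAMPLE = (
    "# Species\tGenes\tAlg.-Conn.\tM.faa\tL.faa\tC.faa\tE.faa\n"
    "4\t4\t1\tM_10\tL_10\tC_10\tE_10\n"
    "4\t5\t0.2\tM_19\tL_19\tC_22,C_63\tE_19\n"
)


def read_groups(handle):
    """Yield one dict per orthologous group from a .proteinortho table."""
    species = []
    for line in handle:
        row = line.rstrip("\n").split("\t")
        if not line.strip():
            continue
        if line.startswith("#"):
            species = row[3:]  # header: species file names start in column 4
            continue
        genes = {}
        for name, cell in zip(species, row[3:]):
            # Assumption: "" or "*" means that the species has no gene here.
            genes[name] = [] if cell in ("", "*") else cell.split(",")
        yield {
            "n_species": int(row[0]),
            "n_genes": int(row[1]),
            "connectivity": float(row[2]),
            "genes": genes,
        }


def co_ortholog_groups(groups):
    """Keep groups with more genes than species, i.e. with co-orthologs."""
    return [g for g in groups if g["n_genes"] > g["n_species"]]


groups = list(read_groups(io.StringIO(SAMPLE)))
for group in co_ortholog_groups(groups):
    print(group["connectivity"], group["genes"])
```

To read a real result, replace io.StringIO(SAMPLE) with an open file handle, e.g. open("test.proteinortho").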

The PoFF extension allows you to use the relative order of genes (synteny) as an additional criterion to disentangle complex co-orthology relations. To do so, add the parameter -synteny. You can use it either to come closer to one-to-one orthology relations by preferring syntenically conserved copies in the presence of two very similar paralogs (default), or just to reduce noise in the predictions by detecting multiple copies of genomic areas (add the parameter -dups=3). Please note that you need additional data to include synteny, namely the gene positions in GFF3 format. As Proteinortho is primarily made for proteins, it will only accept GFF entries of type CDS (column #3 in the GFF file). The attributes column (#9) must contain Name=GENE IDENTIFIER, where GENE IDENTIFIER corresponds to the respective identifier in the FASTA file. It must not contain a semicolon (;)! Alternatively, you can also set ID=GENE IDENTIFIER. Example files are provided in the source code archive. Hence, we can run proteinortho5.pl -project=test -synteny test/A1.faa test/B1.faa test/E1.faa test/F1.faa to add synteny information to the calculations. Of course, this only makes sense if the species are sufficiently similar. You won't gain much when comparing e.g. bacteria with fungi. When the analysis is done, you will find an additional file in your current working directory, namely test.poff. This file is equivalent to the .proteinortho file (above) but can be considered more accurate, as synteny was involved in its construction.
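
Before running with -synteny, it can help to check that the GFF entries and FASTA identifiers actually match. The following sketch illustrates the requirements stated above (type CDS in column 3, Name= or ID= in column 9). It is not Proteinortho's own parser. Taking the first word of a FASTA header as the identifier and preferring Name= over ID= are assumptions, and the sample records are hypothetical.

```python
def fasta_ids(lines):
    """Collect identifiers: the first word after '>' on each FASTA header line."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].split()
    }


def gff_gene_ids(lines):
    """Return the gene identifiers of all CDS entries, read from Name= or ID=."""
    ids = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue  # only entries of type CDS are considered
        attrs = dict(
            item.split("=", 1) for item in cols[8].split(";") if "=" in item
        )
        gene = attrs.get("Name") or attrs.get("ID")
        if gene:
            ids.append(gene)
    return ids


# Hypothetical records for illustration only.
FASTA = [">gene1 some description", "MKV", ">gene2", "MAA"]
GFF = [
    "##gff-version 3",
    "chr1\tsrc\tCDS\t10\t90\t.\t+\t0\tName=gene1",
    "chr1\tsrc\tgene\t10\t90\t.\t+\t.\tName=gene1",
    "chr1\tsrc\tCDS\t200\t290\t.\t-\t0\tID=gene3",
]

missing_in_gff = fasta_ids(FASTA) - set(gff_gene_ids(GFF))
unknown_in_gff = set(gff_gene_ids(GFF)) - fasta_ids(FASTA)
print("FASTA entries without CDS:", sorted(missing_in_gff))
print("CDS entries without FASTA:", sorted(unknown_in_gff))
```

An empty result for both sets suggests that every protein has a position and every CDS refers to a known sequence.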

In addition, Proteinortho will generate graph files containing all pairwise orthology relationships including similarity scores. If they are not generated, rerun Proteinortho with the -graph parameter.

myproject.blast-graph: filtered raw blast data based on adaptive reciprocal best blast matches (= reciprocal best match plus all reciprocal matches within a range of 95% by default)
myproject.proteinortho-graph: clustered blast graph. Its connected components are represented in myproject.proteinortho.
myproject.ffadj-graph: filtered blast data based on adaptive reciprocal best blast matches and synteny (only if -synteny is set)
myproject.poff-graph: clustered ffadj graph. Its connected components are represented in myproject.poff (only if -synteny is set)
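
The idea of "adaptive reciprocal best matches" mentioned for the blast-graph can be illustrated as follows. This is a simplified sketch and not Proteinortho's implementation: it assumes that hits are compared by bit score and that "within a range of 95%" means a score of at least 0.95 times the best score of the same query. All function names and scores are hypothetical.

```python
from collections import defaultdict


def adaptive_best(hits, sim=0.95):
    """For each query keep every target scoring at least sim * its best score."""
    by_query = defaultdict(dict)
    for query, target, score in hits:
        # keep the highest score if a pair is reported more than once
        by_query[query][target] = max(score, by_query[query].get(target, 0.0))
    kept = set()
    for query, targets in by_query.items():
        best = max(targets.values())
        kept.update((query, t) for t, s in targets.items() if s >= sim * best)
    return kept


def reciprocal_pairs(a_vs_b, b_vs_a, sim=0.95):
    """Return pairs (a, b) that pass the filter in both search directions."""
    forward = adaptive_best(a_vs_b, sim)
    backward = adaptive_best(b_vs_a, sim)
    return sorted((a, b) for a, b in forward if (b, a) in backward)


# Hypothetical bit scores, not real data.
a_vs_b = [("M_19", "C_22", 262.0), ("M_19", "C_63", 255.0), ("M_19", "C_70", 90.0)]
b_vs_a = [("C_22", "M_19", 262.0), ("C_63", "M_19", 250.0),
          ("C_70", "M_19", 95.0), ("C_70", "M_3", 300.0)]
print(reciprocal_pairs(a_vs_b, b_vs_a))
```

In this toy data, both C_22 and C_63 stay connected to M_19 because their scores lie within 95% of the best hit in both directions, while C_70 is dropped.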

The format of all graph files is similar:

# file_a        file_b
# a     b       evalue_ab       bitscore_ab     evalue_ba       bitscore_ba
# M.faa 	L.faa
M_15    L_15    0.0     	893     	0.0     	893
M_16    L_16    3e-175  	481     	3e-175  	481
M_19    L_19    8e-93   	262     	8e-93   	262
...
# M.faa E.faa
M_10    E_10    3e-137  	415     	2e-148  	441
M_11    E_11    2e-71   	221     	9e-68   	209
...

The first two rows are comments explaining the meaning of each column. Whenever a further comment line (starting with #) follows, it indicates that results comparing the two named species are about to follow. E.g. # M.faa L.faa tells that the next lines represent results for species M and L. All matches are reciprocal matches. If e.g. a match for M_15 L_15 is shown, L_15 M_15 exists implicitly. E-values and bit scores for both directions are given after each match.
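
A graph file in this format can be read line by line while remembering the current species pair. The following is a minimal sketch under the assumption that fields are separated by arbitrary whitespace and that file names contain no spaces; read_graph is a made-up name, not a tool shipped with Proteinortho.

```python
SAMPLE = """\
# file_a\tfile_b
# a\tb\tevalue_ab\tbitscore_ab\tevalue_ba\tbitscore_ba
# M.faa\tL.faa
M_15\tL_15\t0.0\t893\t0.0\t893
M_16\tL_16\t3e-175\t481\t3e-175\t481
# M.faa\tE.faa
M_10\tE_10\t3e-137\t415\t2e-148\t441
"""


def read_graph(lines):
    """Yield (species_a, species_b, gene_a, gene_b, e_ab, bits_ab, e_ba, bits_ba)."""
    pair = (None, None)
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if line.startswith("#"):
            names = line[1:].split()
            # A comment with exactly two names starts a new species pair;
            # the generic "file_a file_b" header is skipped.
            if len(names) == 2 and names != ["file_a", "file_b"]:
                pair = (names[0], names[1])
            continue
        if len(fields) < 6:
            continue  # not a complete match line
        e_ab, bits_ab, e_ba, bits_ba = (float(x) for x in fields[2:6])
        yield (pair[0], pair[1], fields[0], fields[1], e_ab, bits_ab, e_ba, bits_ba)


edges = list(read_graph(SAMPLE.splitlines()))
for edge in edges:
    print(edge)
```

Any further columns on a match line are ignored by this sketch, so only the first six fields are returned.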

The synteny-based graph files (myproject.ffadj-graph and myproject.poff-graph) have two additional columns: same_strand and simscore. The first one indicates whether the two genes of a match are located on the same strand (1) or not (-1). The second one is an internal score which can be interpreted as a normalized weight ranging from 0 to 1 based on the respective E-values. Moreover, a second comment line follows each species line, e.g.

...
# M.faa L.faa
# Scores: 4     39      34.000000       39.000000
...

These scores are derived from the ffadj algorithm comparing the gene similarities and gene orders in the respective species. They are:

the number of breakpoints to match both gene orders
the number of edges required to match both gene orders
calculated weight of adjacencies
calculated weight of gene similarities
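
To make the four values easier to use, the Scores line can be mapped to named fields. This is a small sketch; the field names are chosen for this illustration, and treating the first two values as integers is an assumption based on the example line.

```python
from typing import NamedTuple, Optional


class FfadjScores(NamedTuple):
    breakpoints: int          # number of breakpoints to match both gene orders
    edges: int                # number of edges required to match both gene orders
    adjacency_weight: float   # calculated weight of adjacencies
    similarity_weight: float  # calculated weight of gene similarities


def parse_scores(line: str) -> Optional[FfadjScores]:
    """Parse a '# Scores: ...' comment line; return None for any other line."""
    fields = line.lstrip("#").split()
    if len(fields) != 5 or fields[0] != "Scores:":
        return None
    return FfadjScores(
        int(fields[1]), int(fields[2]), float(fields[3]), float(fields[4])
    )


scores = parse_scores("# Scores: 4     39      34.000000       39.000000")
print(scores)
```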

Hints

Using .faa to indicate that your file contains amino acids and .fna to show it contains nucleotides makes life much easier.
Sequence IDs must be unique within a single FASTA file. Consider renaming otherwise.
Note: Until version 5.15, sequence IDs had to be unique across the whole dataset. Proteinortho now keeps track of name and species to avoid the necessity of renaming.
You need write permissions in the directory of your FASTA files as Proteinortho will create blast databases. If this is not the case, consider using symbolic links to the FASTA files.
The directory tools contains useful tools, e.g. grab_proteins.pl, which fetches protein sequences of orthologous groups from the Proteinortho output table.
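
Whether the sequence IDs of a FASTA file are unique can be checked with a short script. The sketch below assumes that the ID is the first whitespace-delimited word of each header line; the records are hypothetical and duplicate_ids is not a tool from the tools directory.

```python
from collections import Counter


def duplicate_ids(lines):
    """Return identifiers that occur more than once among FASTA header lines."""
    ids = [
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].split()
    ]
    return sorted(name for name, count in Counter(ids).items() if count > 1)


# Hypothetical records, not taken from the test data.
FASTA = [">M_10 first copy", "MKT", ">M_11", "MSA", ">M_10 second copy", "MKT"]
print(duplicate_ids(FASTA))
```

For a real file, pass an open file handle instead of the list, e.g. duplicate_ids(open("test/C.faa")).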

Usage

proteinortho5.pl [OPTIONS] FASTA1 FASTA2 [FASTAn...]

Option	Default value	Description

[General options]
-project=	myproject	prefix for all result file names
-cpus=	auto	use the given number of threads
-verbose		give a lot of information about the current progress
-keep	-	store temporary blast results for reuse (advisable for larger jobs)
-temp=	working directory	path for temporary files
-force		force recalculation of blast results in any case
-clean		remove all unnecessary files after processing

[Search options]
-e=	1e-05	E-value for blast
-p=	diamond	similarity tool to use:
		diamond → sequences are given as amino acids (fastest)
		blastn+ → sequences are given as nucleotides
		blastp+ → sequences are given as amino acids
		tblastx+ → sequences are given as nucleotides and will be interpreted as (translated) amino acids in all three reading frames
		in case you only have access to blast legacy:
		blastn → sequences are given as nucleotides
		blastp → sequences are given as amino acids
		tblastx → sequences are given as nucleotides and will be interpreted as (translated) amino acids in all three reading frames
-selfblast		apply selfblast to detect paralogs directly; normally these are inferred indirectly from orthology data to other species (experimental!)
-sim=	0.95	min. similarity for additional hits
-identity=	25	min. percent identity of best blast alignments
-cov=	50	min. coverage of best blast alignments in percent
-subpara=		additional parameters for blast; set these in quotes (e.g. -subpara='-seg no'). This parameter was named -blastParameters in earlier versions

[Synteny options]
-synteny	-	activate PoFF extension to separate similar sequences using synteny (requires a GFF file for each FASTA file)
-dups=	0	applied in combination with -synteny; number of reiterations for adjacencies heuristic to determine duplicated regions; if set to a higher number, co-orthologs will tend to get clustered together rather than getting separated
-cs=	3	applied in combination with -synteny; size of a maximum common substring (MCS) for adjacency matches; the larger this value becomes, the longer syntenic regions need to be in order to be detected
-alpha=	0.5	weight of adjacencies vs. sequence similarity

[Clustering options]
-singles		also report genes without orthologs in table output
-purity=	1	avoid spurious graph assignments [range: 0.01-1, default 0.75]
-conn=	0.1	min. algebraic connectivity of orthologous groups during clustering
-nograph	-	do not generate .graph files with pairwise orthology data (saves some time)

[Misc options]
-desc		write gene description file (XXX.descriptions); currently works only with NCBI-formatted FASTA entries
-blastpath=		path to your local blast installation (if not present in default paths)
-debug		gives detailed information for bug tracking

[Large compute jobs]
Parameters needed to distribute the runs over several machines
-step=	0	perform only specific steps of the analysis
		1 → generate indices
		2 → perform pairwise analyses
		3 → perform clustering
		0 → perform all steps
-jobs=N/M		distribute blast step into M subsets and run job number N out of M in this very process; only works in combination with -step=2

Using several machines

If you want to involve multiple machines or separate a Proteinortho run into smaller chunks, use the -jobs=N/M option. First, run proteinortho5.pl -step=1 ... to generate the indices. Then you can run proteinortho5.pl -step=2 -jobs=N/M ... to run small chunks separately. Replace M with the number of jobs you want to divide the run into and N with the number of the job to be performed by the respective process. E.g. to divide a Proteinortho run into 4 jobs to run on several machines, use

proteinortho5.pl -step=1 ...

on a single PC, then

proteinortho5.pl -step=2 -jobs=1/4 ...
proteinortho5.pl -step=2 -jobs=2/4 ...
proteinortho5.pl -step=2 -jobs=3/4 ...
proteinortho5.pl -step=2 -jobs=4/4 ...

separately on different machines (can be run in parallel or iteratively within the same shared working directory). After all step 2 runs are done, run

proteinortho5.pl -step=3 ...

to perform the clustering and merge all calculations on a single PC.
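
The command lines for such a distributed run can be generated rather than typed by hand. The sketch below only builds the strings and does not start anything. It uses the option spellings -step= and -jobs=N/M as listed in the option table, and the function name distributed_commands as well as its parameters are made up for this illustration.

```python
import shlex


def distributed_commands(fasta_files, jobs, project="myproject",
                         script="proteinortho5.pl"):
    """Return the command lines for step 1, step 2 (one per job) and step 3."""
    if jobs < 1:
        raise ValueError("jobs must be at least 1")
    base = [script, f"-project={project}"]
    files = list(fasta_files)
    commands = [base + ["-step=1"] + files]      # generate indices once
    for n in range(1, jobs + 1):                 # one pairwise chunk per job
        commands.append(base + ["-step=2", f"-jobs={n}/{jobs}"] + files)
    commands.append(base + ["-step=3"] + files)  # cluster after all chunks
    return [shlex.join(cmd) for cmd in commands]


for command in distributed_commands(["test/C.faa", "test/E.faa"], jobs=4,
                                    project="test"):
    print(command)
```

The first and last command are meant to run on a single machine; the step 2 commands in between can be handed to different machines or to a job scheduler, provided they share the working directory.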
