ExonMatchSolver

A pipeline assisting curation of gene annotations on fragmented genomes

System requirements/Installation
Input files
Program options
Output files
Trouble shooting

1. System requirements/Installation

The ExonMatchSolver-pipeline consists of two components, the ExonMatchSolver-Integer Linear Programming (ILP) and an accompanying perl-script for the pipeline-setup itself. The ExonMathSolver-ILP is embedded within this script, but can also be run seperately. Note that IBM ILOG CPLEX Optimizer 12.6 is required for running the ExonMatchSolver-ILP.

Download:

Unzip:

tar -zxvf ExonMatchSolver.tar.gz

Copy the library file obtained from the CPLEX Optimizer in this file:

cp [lib_file] ExonMatchSolver/os/[linux|macosx|windows]/[x86|x86_64]/.
rm  ExonMatchSolver/os/[linux|macosx|windows]/[x86|x86_64]/dummy

For running the whole pipeline, the following programs are required (linux only):

perl including the following packages: Bio::Perl; Bio::SeqIO; Bio::Seq; Data::Dumper; Bio::DB::Fasta; Statistics::Standard_Normal; Getopt::Long; List::Util; FindBin, Cwd; DateTime
Scipio (v 1.4.1)
BLAST+ tool suite (v 2.2.26)
exonerate (v 2.0.0) or ProSplign (v 0.0.0)

Install the ExonMatchSover-ILP in a directory of your choice. The path to this directory may be specified by changing $Bin in ExonMatchSolver_pipline.pl. Additional programs will be required for the following modes:

Fasta-mode:

clustalw (v.2.0.12)

Alignment-mode:

getORF (v EMBOSS:6.6.0.0)
HMMER (v 3.1b1)

To get a more sensitive HMMER, it is necessary to copy its source code in a different directory and change the source code in /src/p7_pipeline.c Outcommand the following two lines of code with "/*" (see also Cryptogenomicon). The path to the more sensitive hmmer may be specified by changing $hmmer_path in the perl-script.

    if (pli->ddef->nregions   == 0) return eslOK; /* score passed threshold but there's no discrete domains here       */
    if (pli->ddef->nenvelopes == 0) return eslOK; /* rarer: region was found, stochastic clustered, no envelopes found */

2. Input files

As described in the paper, the ExonMatchSolver-pipeline can be run in three different modes: alignment-mode (green dot), fasta-mode (red dot) and user-mode (yellow dot).

Setup of the ExonMatchSolver pipeline. hMM - hidden Markov Model, TCE - translated coding exon.

The preparation of the input-files is critical for the output of the pipeline. We recommand the use of the alignment-mode, followed by user-mode and as last option fasta-mode for maximization of sensitivity. If intron losses or intron gains occured among the paralogous group of interest (of the query sequences), Translated Coding exons (TCEs) have to be split so that they can be derived by TCE loss from a hypothetical 'ancestor' (example). In this example, TCE 5 of ADGRL2a, ADGRL2b, ADGRL3a and ADGRL3b is encoded by one exon on genome level, while three exons code for one TCE each in ADGRL1a and ADGRL1b. The corresponding TCE (TCE 5) in ADGRL2a, ADGRL2b, ADGRL3a and ASGRL3b has to be split so that the first and last position of the new TCE 6 are homologous to the girst and last position of TCE 6 in ADGRL1a and ADGRL1b etc. . Amino acdids encoded by split codons are deleted. In these complicated cases, alignment- or user-mode should be used. The following files are required as input for the respective modes.

For all modes:

target genome
Path to the genome of the species you wish to annotate your paralogous protein family in. The genome should meet the requirements to be parsed with fastacmd. Option: -target
paralog sequences (.fa)

Protein sequences of all paralogs to be included in the analysis in fasta-format (example). For user- and alignment-mode, make sure that the numbering of paralogs is consistent with the "homologous TCEs"/alignment-file(s) (header should be: >Paralog${number}). For fasta-mode, the first entry in this file has to be the longest paralog/ the one with most TCEs. The paralogs can have either name in fasta-mode. The first entry will correspond to Paralog1 etc. in the output. Option: -i

Alignment-mode:

path to folder with alignment-files/ paralog-, TCE-specific alignments (.stk)
These must be TCE-individual and paralog-specific. Considering n TCEs of m paralogs, there should be n x m stockholm-files in this folder. Amino acids (split codons) that are split among two TCEs are not included in either of the TCE-sequences. They should be named as following: Paralog1_exon1.stk - Paralog{m}_exon{n}.stk. Option: -m
homologous TCEs (.fa)

TCE-individual and paralog-specific protein sequences of the paralogs in the query species in fasta-format. This will be used as input for the spliced alignment tools. Make sure, that the paralog- and TCE-numbering is consistent with the one in the alignment-files (example). Option: -user

User-mode:

homologous TCEs (.fa)

TCE-individual and paralog-specific protein sequences of the paralogs in the query species. This will be used as input for the spliced alignment tools, (example). Option: -user

Fasta-mode:

query genome

Path to the genome of the species, paralog/query sequences are provided for. The genome should meet the requirements to be parsed with fastacmd. Option: -query

3. Program options

Seperate call of the ILP:

java -jar ExonMatchSolver.jar [input] <int> <max> > [output]

Example input and example output files for the ExonMatchSolver-ILP. For int, max see Options.

Pipeline usage:

perl ExonMatchSolver_pipeline.pl -i [fasta] -o [dir] -mode <fasta/alignment/user> -target [file] -[OPTIONS]

Options:

[GENERAL OPTIONS]
-o <dir>	output directory
-mode <alignment/fasta/user>	mode to be run: input can either be an alignment (hMM are built), an user-prepared fasta-file with paralog-specific and TCE-individual protein-sequences or a fasta-file with protein-sequences of the paralogs. Default: fasta.
-WGD <yes/no>	WGD expected? e.g. starting from a tetrapod query annotating a teleost fish. Default: no.
-noWGD <integer,integer,...>	specify paralogs for which no WGD is expected separated by commas only (no space), if WGD is set to yes i.e. the specified paralogs are expected to be unduplicated.
-s <yes/no>	spliced alignment programm used: exonerate or ProSplign. Default: no (ProSplign).
-l <integer>	length cutoff for the preparational step in fasta-mode. Default: Length of longest TCE (AA)/15.
-z <integer>	Z-score cutoff applied during preparational step in fasta-mode. Default: 3 (3 standard deviations).
[OPTIONS for TBLASTN]
-dFirst <integer>	nucleotide distance considered upstream of the first blast hit found. Default: 10000.
-dLast <integer>	nucleotide distance considered downstream of the last blast hit found. Default: 10000.
[OPTIONS for the ILP]
-int <integer/fraction>	integer: round bitscore to a number dividable by this number; fraction: return all optimal solutions within this fraction. Default: "".
-max <integer>	maximal number of solution to be returned. Default: "".
[OPTIONS for SCIPIO]
-scipio_opt <string>	see Scipio documentation for options. Default: --min_identity=60--max_move_exon=6 --blat_score=15 --blat_identity=54
-c <integer>	number of cores available to use. Default: 2.
-h	help

4. Output files

In the specified folder, -o, output-files are generated. Inspect the STDOUT on the commandline.

The most important ones are:

ExonMatchSolver${number}.out/ExonMatchSolver${number}.input
$number can be 1, 2 or 3. This is the output after the first (initial), second (after refinement by sensitive tblastn/hmmsearch) and third ExonMatchSolver-run (after spliced alignment of nearly complete hits). These steps are again explained in detail below. The ouput each includes the contig-name, the paralog that was assigned to it (first number) and all TCEs found on that contig. This output also includes all TCEs that were found alone on one contig.
Single_exons, Single_reliable
Single TCEs on contigs postulated in the optimal solution of the ExonMatchSolver-assignment are filtered after normalization of the bitscore by TCE length. Only TCEs that are considered reliable are processed in further steps of the pipeline (Single_reliable). Those hits that were slightly below the applied bitscore cutoff, which is 1.5, might require manual inspection as they could represent additional TCEs (Single_exons).
Summary_scipio.gff, Summary_scipio.yaml
Scipio output files. The input, Summary.fa, includes the original query sequences and the refined models obtained from the spliced alignment on the merged fragments.
folder SearchTarget

ProSplign (*.out/*.inf) or exonerate output files (*complete.aln).

More details on the refined search steps:

Refinement step 1: In addition to the optimal paralog-contig assignment, the ExonMatchSolver returns a list of unscored TCEs per contig, for which at least one paralog obtained a non-zero score while the score is missing for the listed paralogs on the respective contigs (example). To obtain the missing scores, the TCEs of the query sequence are scored against the respective contigs with less restrictive settings (bl2seq/tblastn, E-value cutoff 100, hmmsearch without filters -E 1 --domE 1).
Refinement step 2: The first and last TCEs of protein coding genes are often more diverse than the rest of the sequence and can be especially short making their annotation more difficult. To this end, we assume that all contigs to which one of the five first TCEs and one of the last three TCEs were assigned to, possibly encode one complete paralog. Short or diverse TCEs that were not yet identified by tblastn are possibly found by ProSplign/exonerate on respective contigs. This is assumed to be true if the single TCEs of the query sequence are recovered in the protein annotation originating from the spliced alignment tool with an E-value cutoff lower 0.01 using bl2seq -p blastp. As these hits could theoretically still belong to either of the paralogs, they are added to the hitlist with a score of 0 for all paralogs. In this case, the ExonMatchSolver is run for a third time delivering the final contig-paralog assignment.

And the other output files?

folder SearchTarget: There are two stages, when the spliced alignment tools are applied to the target genome (during refinement step 2 and during the final stage on the merged fragments). You might wish to look at these input/output-files. "*_final*" indicates the spliced alignment on the merged fragments.
- *complete.aln: complete exonerate output (alignment)
- *.out: ProSplign output (alignment)
- *.inf: ProSplign output as list
- *.fa: full length protein sequence of the paralog used as input for ProSplign/exonerate (original query sequence or refined model from ProSplign/exonerate step 2)
folder TargetSequences: single fasta-files of all contigs, on which hits were found including the final assembled/merged contigs
folder SearchQuery (only in fasta-mode):
- *.blastout: blast output of tblastn search of query sequences against the query genome
- *.out: ProSplign output of query sequences aligned to the query genome
- *.fa: full protein query sequences as inferred on the query genome provided
folder HomologousExons (only in fasta-mode):
- paralogs.fa: assignment of paralog numbers to the query paralog sequences
- paralog_exon_alignment.fa: segmentation of the query paralog sequences into its TCEs
- clustalw ouput files originating from preprocessing step 0b
- paralog_exon_alignment_final.fa: final assignment of TCEs into homologous groups with consistent, homology-based numbering
folder hmmModels (only in alignment-mode): hMM of the individual, paralog-specific TCEs created during preparational step 0a

5. Trouble shooting

error: Entry not found [fastacmd].

Is the target/query genome in the right format? Ensembl-genomes often need to be reformatted so that the single fasta entries do not start with numbers. This script will reformat the genome resulting in a file ${genome}.tmp saved to the folder where the query/target genome was located. Entries like ">1" will be reformatted to ">adopted_1".

    perl reformat_genome.pl [input_genome]

Error: File existence/permissions problem in trying to open HMM file hmmModels/... or The amino acid sequence of paralog x, exon y does not exist.

The TCE model or TCE does not exist. The TCE was not included for that paralog. Is that what you expect? In fasta-mode, these messages point to homologous groups that were not included for some paralogs i.e. they did no pass the length cutoff.
Something is wrong with the preparational step 0b in fasta-mode (assignment of TCEs into homologous groups)./ WARNING: Number of paralogs assigned to one homologous TCE group is greater than number of paralogs. Check assignment of homologous TCEs.
Congratulation, you found out! This can have many reasons. If one TCE is represented by two separate TCEs in another paralog (e.g. caused by an intron insertion), the homology assignment might not work. You should have a look at HomologousExons/paralog_exon_alignment_final.fa and HomologousExons/paralog_exon_alignment.fa. If this is the problem, you can use these files as starting file to prepare a user-file for the user-mode. If this is not the problem, but you loose many homologous TCEs during this preparational step, play with -z and -l. Make sure, the first entry in your query file is the longest paralog/ the one with the highest number of TCEs.
WARNING: There are hits on the positive and negative strand of the same contig!
This could be a scrambled locus. Please check in the respective ExonMatchSolver${number}.input file whether this is a false positive hit or whether your gene of interest lies on a fragment that is scrambled i.e. single contigs are assembled to build a bigger scaffold in the wrong direction. In this case, hits might lie on both, the swapped contig and the bigger scaffold, on opposite strands. You will have to do the annotation of this paralog manually and decide yourself whether you want to trust this scrambled locus.
WARNING: Check for two paralogs on one unit: positive. Check this before continuing!
Inspect the ExonMatchSolver3.input. At least two hits with an E-value < 1*10e-1 were found on one contig with the same TCE-model. This could point to a tandem duplication, if this warning appears for a lot of different TCEs situated on the same contig. In this case, you will have to manually prepare the target genome so that these paralogs lie on different fragments. If this warning appears for only a few TCEs on the same fragment, this could represent a region of high sequence similarity to another gene outside the paralogous family given as queries (e.g. having the same domain, see Paper). Check available annotations if necessary. Make sure, you included all paralogous members that appear as high-scoring blast hits of your query sequences. If this warning appears for one or two single TCEs, this could also point to a split of the TCE relative to the query sequence.
WARNING: Several different optimal solutions were obtained by the last run of ExonMatchSolver
Inspect ExonMatchsolver3.out. A count of how often one contig was included in the best solutions is displayed below the warning message (count contig paralog TCEs). Presumably, two different hits have the exact some bitscores. Check, whether they and the surrounding intron have the exact same nucleotide sequences (coordinates in ExonMatchSolver3.input). If so, these hits probably represent remnants of an uncomplete assembly of both fragments.
Which paralog is which (fasta-mode)?

Check HomologousExons/paralogs.fa. The first entry in your input paralogs.fa-file will be Paralog1, the second Paralog2 etc.

ExonMatchSolver

A pipeline assisting curation of gene annotations on fragmented genomes

Table of contents

1. System requirements/Installation

Download:

For running the whole pipeline, the following programs are required (linux only):

Fasta-mode:

Alignment-mode:

2. Input files

For all modes:

Alignment-mode:

User-mode:

Fasta-mode:

3. Program options

Seperate call of the ILP:

Pipeline usage:

Options:

4. Output files

The most important ones are:

More details on the refined search steps:

And the other output files?

5. Trouble shooting

Please forward any remaining questions and possible bugs considering the ExonMatchSolver-pipeline to henrike@bioinf.uni-leipzig.de!