A pipeline assisting curation of gene annotations on fragmented genomes

Manual as pdf

Table of contents

  1. System requirements/Installation
  2. Input files
  3. Program options
  4. Output files
  5. Trouble shooting

1. System requirements/Installation

The ExonMatchSolver-pipeline consists of two components: the ExonMatchSolver-Integer Linear Programming (ILP) and an accompanying Perl script that sets up the pipeline itself. The ExonMatchSolver-ILP is embedded within this script, but can also be run separately. Note that IBM ILOG CPLEX Optimizer 12.6 is required for running the ExonMatchSolver-ILP.



tar -zxvf ExonMatchSolver.tar.gz

Copy the library file obtained from the CPLEX Optimizer into the directory matching your operating system and architecture, then remove the dummy file:

cp [lib_file] ExonMatchSolver/os/[linux|macosx|windows]/[x86|x86_64]/.
rm  ExonMatchSolver/os/[linux|macosx|windows]/[x86|x86_64]/dummy
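Picking the right target directory can be sketched as follows. This is only an illustration: the function name `cplex_lib_dir` and both lookup tables are hypothetical, and the mapping from platform names to the directory names of the archive (for example "Darwin" to "macosx") is an assumption, not something the pipeline provides.

```python
from pathlib import PurePosixPath

# Assumed mapping from platform.system() values to the directory names
# used in the archive ("Darwin" -> "macosx" is an assumption).
OS_DIRS = {"Linux": "linux", "Darwin": "macosx", "Windows": "windows"}

# Assumed mapping from platform.machine() values to the two architectures.
ARCH_DIRS = {
    "x86_64": "x86_64",
    "AMD64": "x86_64",
    "i386": "x86",
    "i686": "x86",
    "x86": "x86",
}


def cplex_lib_dir(system, machine, root="ExonMatchSolver"):
    """Return the directory that should receive the CPLEX library file."""
    if system not in OS_DIRS:
        raise ValueError(f"unsupported operating system: {system}")
    if machine not in ARCH_DIRS:
        raise ValueError(f"unsupported architecture: {machine}")
    return str(PurePosixPath(root) / "os" / OS_DIRS[system] / ARCH_DIRS[machine])


# Explicit values are used here; platform.system() and platform.machine()
# from the standard library would supply them on a real machine.
print(cplex_lib_dir("Linux", "x86_64"))
```

The returned path is where [lib_file] would be copied and where the dummy file would be removed.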

For running the whole pipeline, the following programs are required (linux only):

Install the ExonMatchSolver-ILP in a directory of your choice. The path to this directory may be specified by changing $Bin in ExonMatchSolver_pipeline.pl. Additional programs will be required for the following modes:



    if (pli->ddef->nregions   == 0) return eslOK; /* score passed threshold but there's no discrete domains here       */
    if (pli->ddef->nenvelopes == 0) return eslOK; /* rarer: region was found, stochastic clustered, no envelopes found */

2. Input files

As described in the paper, the ExonMatchSolver-pipeline can be run in three different modes: alignment-mode (green dot), fasta-mode (red dot) and user-mode (yellow dot).

Setup of the ExonMatchSolver pipeline. HMM - hidden Markov model, TCE - translated coding exon.

The preparation of the input files is critical for the output of the pipeline. To maximize sensitivity, we recommend alignment-mode, followed by user-mode, with fasta-mode as the last option.

If intron losses or intron gains occurred within the paralogous group of interest (i.e. among the query sequences), translated coding exons (TCEs) have to be split so that the TCEs of every paralog can be derived by TCE loss from a hypothetical 'ancestor' (example). In this example, TCE 5 of ADGRL2a, ADGRL2b, ADGRL3a and ADGRL3b is encoded by a single exon at the genome level, while in ADGRL1a and ADGRL1b three exons code for one TCE each. The corresponding TCE (TCE 5) in ADGRL2a, ADGRL2b, ADGRL3a and ADGRL3b therefore has to be split so that the first and last positions of the new TCE 6 are homologous to the first and last positions of TCE 6 in ADGRL1a and ADGRL1b, and likewise for the other new TCEs. Amino acids encoded by split codons are deleted. In these complicated cases, alignment- or user-mode should be used.

The following files are required as input for the respective modes.
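The splitting rule can be illustrated with a minimal sketch. It assumes that the positions of the additional introns are already known as residue indices together with an intron phase; the function `split_tce`, its input format and the toy sequence are hypothetical and not part of the pipeline.

```python
def split_tce(protein, introns):
    """Split one TCE protein sequence at additional intron positions.

    introns: list of (index, phase) tuples with 0-based residue indices,
    sorted by index.
      phase 0:      the intron lies between two codons, directly before
                    residue `index`; no residue is lost.
      phase 1 or 2: the intron interrupts the codon of residue `index`;
                    that residue is encoded by a split codon and is deleted.
    """
    pieces = []
    start = 0
    for index, phase in introns:
        if phase not in (0, 1, 2):
            raise ValueError(f"invalid intron phase: {phase}")
        if not start < index < len(protein):
            raise ValueError(f"intron index out of order or range: {index}")
        pieces.append(protein[start:index])
        # Skip the residue encoded by a split codon.
        start = index if phase == 0 else index + 1
    pieces.append(protein[start:])
    return pieces


# Toy sequence (not a real ADGRL TCE) with one phase-0 and one phase-1 intron.
print(split_tce("MKTAYIAKQR", [(3, 0), (6, 1)]))
```

In this toy case one TCE becomes three, and the residue at index 6 is dropped because its codon is interrupted by the second intron.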

For all modes:




3. Program options

Separate call of the ILP:

java -jar ExonMatchSolver.jar [input] <int> <max> > [output]

Example input and example output files are available for the ExonMatchSolver-ILP. For <int> and <max>, see the -int and -max options under "Pipeline usage".

Pipeline usage:

perl ExonMatchSolver_pipeline.pl -i [fasta] -o [dir] -mode <fasta/alignment/user> -target [file] -[OPTIONS]
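When the pipeline is launched from a script, the call above can be assembled as an argument list. The following is a sketch only: the helper `pipeline_command` and the file names in the example are hypothetical, and only flags that appear in this manual are used.

```python
# The three modes named in the usage line.
MODES = ("fasta", "alignment", "user")


def pipeline_command(fasta, out_dir, target, mode="fasta", extra=()):
    """Build the argument list for one pipeline run.

    extra: further flags from the option list, e.g. ["-c", "4"].
    """
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}, got {mode!r}")
    return [
        "perl", "ExonMatchSolver_pipeline.pl",
        "-i", str(fasta),
        "-o", str(out_dir),
        "-mode", mode,
        "-target", str(target),
        *extra,
    ]


# Hypothetical file names; the list could be passed to subprocess.run().
print(" ".join(pipeline_command("paralogs.fa", "results", "target.fa",
                                mode="alignment", extra=["-c", "4"])))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.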


-o <dir> output directory
-mode <alignment/fasta/user> mode to be run: input can either be an alignment (HMMs are built), a user-prepared fasta-file with paralog-specific and TCE-individual protein-sequences, or a fasta-file with protein-sequences of the paralogs. Default: fasta.
-WGD <yes/no> is a whole-genome duplication (WGD) expected, e.g. when annotating a teleost fish starting from a tetrapod query? Default: no.
-noWGD <integer,integer,...> paralogs for which no WGD is expected, separated by commas only (no spaces). Only relevant if -WGD is set to yes; the specified paralogs are then expected to be unduplicated.
-s <yes/no> spliced alignment program used: exonerate (yes) or ProSplign (no). Default: no (ProSplign).
-l <integer> length cutoff for the preparational step in fasta-mode. Default: Length of longest TCE (AA)/15.
-z <integer> Z-score cutoff applied during preparational step in fasta-mode. Default: 3 (3 standard deviations).
-dFirst <integer> nucleotide distance considered upstream of the first blast hit found. Default: 10000.
-dLast <integer> nucleotide distance considered downstream of the last blast hit found. Default: 10000.
[OPTIONS for the ILP]
-int <integer/fraction> integer: round bitscores to a number divisible by this number; fraction: return all optimal solutions within this fraction. Default: "".
-max <integer> maximum number of solutions to be returned. Default: "".
-scipio_opt <string> see Scipio documentation for options. Default: --min_identity=60 --max_move_exon=6 --blat_score=15 --blat_identity=54
-c <integer> number of cores available to use. Default: 2.
-h help
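Three of the options above involve small calculations or formats that a sketch can make concrete. All function names are hypothetical and the code does not come from the pipeline. In particular, whether the pipeline rounds the default of -l, and whether -int rounds to the nearest multiple or in another direction, is not stated in this manual; the choices below are assumptions.

```python
import math
import re


def default_length_cutoff(tce_lengths):
    """Default of -l: length of the longest TCE (amino acids) divided by 15.

    The result is left unrounded; any rounding is not specified here.
    """
    return max(tce_lengths) / 15


def round_bitscore(bitscore, step):
    """Round a bitscore to a multiple of `step` (integer use of -int).

    Assumption: nearest multiple, with ties rounded up.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    return math.floor(bitscore / step + 0.5) * step


def parse_no_wgd(value):
    """Parse a -noWGD value: integers separated by commas only, no spaces."""
    if not re.fullmatch(r"\d+(,\d+)*", value):
        raise ValueError(f"expected integers separated by commas only: {value!r}")
    return [int(item) for item in value.split(",")]


print(default_length_cutoff([30, 150, 75]))
print(round_bitscore(57.3, 5), round_bitscore(58, 5))
print(parse_no_wgd("1,3,4"))
```

Rounding bitscores to a coarser grid makes near-identical scores compare as equal; the step size of 5 in the example is arbitrary.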

4. Output files

Output files are generated in the directory specified with -o. Also inspect the messages printed to STDOUT on the command line.

The most important ones are:

More details on the refined search steps:

And the other output files?

5. Trouble shooting

    perl reformat_genome.pl [input_genome]

Please forward any remaining questions and possible bugs concerning the ExonMatchSolver-pipeline to henrike@bioinf.uni-leipzig.de!