The ExonMatchSolver-pipeline consists of two components, the ExonMatchSolver-Integer Linear Programming (ILP) and an accompanying perl-script for the pipeline-setup itself. The ExonMathSolver-ILP is embedded within this script, but can also be run seperately. Note that IBM ILOG CPLEX Optimizer 12.6 is required for running the ExonMatchSolver-ILP.
Unzip:
tar -zxvf ExonMatchSolver.tar.gz
Copy the library file obtained from the CPLEX Optimizer in this file:
cp [lib_file] ExonMatchSolver/os/[linux|macosx|windows]/[x86|x86_64]/.
rm ExonMatchSolver/os/[linux|macosx|windows]/[x86|x86_64]/dummy
Install the ExonMatchSover-ILP in a directory of your choice. The path to this directory may be specified by changing $Bin in ExonMatchSolver_pipline.pl. Additional programs will be required for the following modes:
HMMER (v 3.1b1)
To get a more sensitive HMMER, it is necessary to copy its source code in a different directory and change the source code in /src/p7_pipeline.c Outcommand the following two lines of code with "/*" (see also Cryptogenomicon). The path to the more sensitive hmmer may be specified by changing $hmmer_path in the perl-script.
if (pli->ddef->nregions == 0) return eslOK; /* score passed threshold but there's no discrete domains here */
if (pli->ddef->nenvelopes == 0) return eslOK; /* rarer: region was found, stochastic clustered, no envelopes found */
As described in the paper, the ExonMatchSolver-pipeline can be run in three different modes: alignment-mode (green dot), fasta-mode (red dot) and user-mode (yellow dot).
The preparation of the input-files is critical for the output of the pipeline. We recommand the use of the alignment-mode, followed by user-mode and as last option fasta-mode for maximization of sensitivity. If intron losses or intron gains occured among the paralogous group of interest (of the query sequences), Translated Coding exons (TCEs) have to be split so that they can be derived by TCE loss from a hypothetical 'ancestor' (example). In this example, TCE 5 of ADGRL2a, ADGRL2b, ADGRL3a and ADGRL3b is encoded by one exon on genome level, while three exons code for one TCE each in ADGRL1a and ADGRL1b. The corresponding TCE (TCE 5) in ADGRL2a, ADGRL2b, ADGRL3a and ASGRL3b has to be split so that the first and last position of the new TCE 6 are homologous to the girst and last position of TCE 6 in ADGRL1a and ADGRL1b etc. . Amino acdids encoded by split codons are deleted. In these complicated cases, alignment- or user-mode should be used. The following files are required as input for the respective modes.
target genome
Path to the genome of the species you wish to annotate your paralogous protein family in. The genome should meet the requirements to be parsed with fastacmd. Option: -targetparalog sequences (.fa)
Protein sequences of all paralogs to be included in the analysis in fasta-format (example). For user- and alignment-mode, make sure that the numbering of paralogs is consistent with the "homologous TCEs"/alignment-file(s) (header should be: >Paralog${number}). For fasta-mode, the first entry in this file has to be the longest paralog/ the one with most TCEs. The paralogs can have either name in fasta-mode. The first entry will correspond to Paralog1 etc. in the output. Option: -i
path to folder with alignment-files/ paralog-, TCE-specific alignments (.stk)
These must be TCE-individual and paralog-specific. Considering n TCEs of m paralogs, there should be n x m stockholm-files in this folder. Amino acids (split codons) that are split among two TCEs are not included in either of the TCE-sequences. They should be named as following: Paralog1_exon1.stk - Paralog{m}_exon{n}.stk. Option: -mhomologous TCEs (.fa)
TCE-individual and paralog-specific protein sequences of the paralogs in the query species in fasta-format. This will be used as input for the spliced alignment tools. Make sure, that the paralog- and TCE-numbering is consistent with the one in the alignment-files (example). Option: -user
homologous TCEs (.fa)
TCE-individual and paralog-specific protein sequences of the paralogs in the query species. This will be used as input for the spliced alignment tools, (example). Option: -user
query genome
Path to the genome of the species, paralog/query sequences are provided for. The genome should meet the requirements to be parsed with fastacmd. Option: -query
java -jar ExonMatchSolver.jar [input] <int> <max> > [output]
Example input and example output files for the ExonMatchSolver-ILP. For int, max see Options.
perl ExonMatchSolver_pipeline.pl -i [fasta] -o [dir] -mode <fasta/alignment/user> -target [file] -[OPTIONS]
[GENERAL OPTIONS] | |
-o <dir> | output directory |
-mode <alignment/fasta/user> | mode to be run: input can either be an alignment (hMM are built), an user-prepared fasta-file with paralog-specific and TCE-individual protein-sequences or a fasta-file with protein-sequences of the paralogs. Default: fasta. |
-WGD <yes/no> | WGD expected? e.g. starting from a tetrapod query annotating a teleost fish. Default: no. |
-noWGD <integer,integer,...> | specify paralogs for which no WGD is expected separated by commas only (no space), if WGD is set to yes i.e. the specified paralogs are expected to be unduplicated. |
-s <yes/no> | spliced alignment programm used: exonerate or ProSplign. Default: no (ProSplign). |
-l <integer> | length cutoff for the preparational step in fasta-mode. Default: Length of longest TCE (AA)/15. |
-z <integer> | Z-score cutoff applied during preparational step in fasta-mode. Default: 3 (3 standard deviations). |
[OPTIONS for TBLASTN] | |
-dFirst <integer> | nucleotide distance considered upstream of the first blast hit found. Default: 10000. |
-dLast <integer> | nucleotide distance considered downstream of the last blast hit found. Default: 10000. |
[OPTIONS for the ILP] | |
-int <integer/fraction> | integer: round bitscore to a number dividable by this number; fraction: return all optimal solutions within this fraction. Default: "". |
-max <integer> | maximal number of solution to be returned. Default: "". |
[OPTIONS for SCIPIO] | |
-scipio_opt <string> | see Scipio documentation for options. Default: --min_identity=60--max_move_exon=6 --blat_score=15 --blat_identity=54 |
-c <integer> | number of cores available to use. Default: 2. |
-h | help |
In the specified folder, -o, output-files are generated. Inspect the STDOUT on the commandline.
ExonMatchSolver${number}.out/ExonMatchSolver${number}.input
$number can be 1, 2 or 3. This is the output after the first (initial), second (after refinement by sensitive tblastn/hmmsearch) and third ExonMatchSolver-run (after spliced alignment of nearly complete hits). These steps are again explained in detail below. The ouput each includes the contig-name, the paralog that was assigned to it (first number) and all TCEs found on that contig. This output also includes all TCEs that were found alone on one contig.Single_exons, Single_reliable
Single TCEs on contigs postulated in the optimal solution of the ExonMatchSolver-assignment are filtered after normalization of the bitscore by TCE length. Only TCEs that are considered reliable are processed in further steps of the pipeline (Single_reliable). Those hits that were slightly below the applied bitscore cutoff, which is 1.5, might require manual inspection as they could represent additional TCEs (Single_exons).Summary_scipio.gff, Summary_scipio.yaml
Scipio output files. The input, Summary.fa, includes the original query sequences and the refined models obtained from the spliced alignment on the merged fragments.folder SearchTarget
ProSplign (*.out/*.inf) or exonerate output files (*complete.aln).
error: Entry not found [fastacmd].
Is the target/query genome in the right format? Ensembl-genomes often need to be reformatted so that the single fasta entries do not start with numbers. This script will reformat the genome resulting in a file ${genome}.tmp saved to the folder where the query/target genome was located. Entries like ">1" will be reformatted to ">adopted_1".
perl reformat_genome.pl [input_genome]
Error: File existence/permissions problem in trying to open HMM file hmmModels/... or The amino acid sequence of paralog x, exon y does not exist.
The TCE model or TCE does not exist. The TCE was not included for that paralog. Is that what you expect? In fasta-mode, these messages point to homologous groups that were not included for some paralogs i.e. they did no pass the length cutoff.
Something is wrong with the preparational step 0b in fasta-mode (assignment of TCEs into homologous groups)./ WARNING: Number of paralogs assigned to one homologous TCE group is greater than number of paralogs. Check assignment of homologous TCEs.
Congratulation, you found out! This can have many reasons. If one TCE is represented by two separate TCEs in another paralog (e.g. caused by an intron insertion), the homology assignment might not work. You should have a look at HomologousExons/paralog_exon_alignment_final.fa and HomologousExons/paralog_exon_alignment.fa. If this is the problem, you can use these files as starting file to prepare a user-file for the user-mode. If this is not the problem, but you loose many homologous TCEs during this preparational step, play with -z and -l. Make sure, the first entry in your query file is the longest paralog/ the one with the highest number of TCEs.WARNING: There are hits on the positive and negative strand of the same contig!
This could be a scrambled locus. Please check in the respective ExonMatchSolver${number}.input file whether this is a false positive hit or whether your gene of interest lies on a fragment that is scrambled i.e. single contigs are assembled to build a bigger scaffold in the wrong direction. In this case, hits might lie on both, the swapped contig and the bigger scaffold, on opposite strands. You will have to do the annotation of this paralog manually and decide yourself whether you want to trust this scrambled locus.WARNING: Check for two paralogs on one unit: positive. Check this before continuing!
Inspect the ExonMatchSolver3.input. At least two hits with an E-value < 1*10e-1 were found on one contig with the same TCE-model. This could point to a tandem duplication, if this warning appears for a lot of different TCEs situated on the same contig. In this case, you will have to manually prepare the target genome so that these paralogs lie on different fragments. If this warning appears for only a few TCEs on the same fragment, this could represent a region of high sequence similarity to another gene outside the paralogous family given as queries (e.g. having the same domain, see Paper). Check available annotations if necessary. Make sure, you included all paralogous members that appear as high-scoring blast hits of your query sequences. If this warning appears for one or two single TCEs, this could also point to a split of the TCE relative to the query sequence.WARNING: Several different optimal solutions were obtained by the last run of ExonMatchSolver
Inspect ExonMatchsolver3.out. A count of how often one contig was included in the best solutions is displayed below the warning message (count contig paralog TCEs). Presumably, two different hits have the exact some bitscores. Check, whether they and the surrounding intron have the exact same nucleotide sequences (coordinates in ExonMatchSolver3.input). If so, these hits probably represent remnants of an uncomplete assembly of both fragments.Which paralog is which (fasta-mode)?
Check HomologousExons/paralogs.fa. The first entry in your input paralogs.fa-file will be Paralog1, the second Paralog2 etc.