In the following, using a small example data set, we outline how to remap reads that
remained unmapped after an initial segemehl run, using our novel tool lack, which is
part of the segemehl distribution. Moreover, the complete list of options available for
lack is described, and details about the compatibility of lack with other split-read aligners
such as Blat, TopHat 2, and STAR are explained.
-d, --database <file> [<file> ...] | list of path/filename(s) of database sequence(s) |
-q, --query | path/filename of alignment file in (gzip'ed) SAM format, file must be coordinate-sorted |
-o, --outfile <string> | outputfile in SAM format (default:stdout) |
-r, --remapfilename <file> | filename for unmapped reads to be remapped (default:none) file must contain best-seed information, that is automatically added by segemehl (see Section below if another aligner was used) |
-u, --nomatchfilename <file> | filename for reads that remained unmapped (default:none) |
-t, --threads <n> | start <n> threads (default:1) |
-s, --silent | shut up! |
-A, --accuracy <n> | min percentage of matches per read in semi-global alignment (default:90) |
-W, --minsplicecover <n> | min coverage for spliced transcripts (default:80) |
-U, --minfragscore <n> | min score of a spliced fragment (default:5) |
-Z, --minfraglen <n> | min length of a spliced fragment (default:5) |
-M, --maxdist <n> | max number of distant sites to consider, 0 to disable (default:100) |
Version:
0.1.7-403 (2013-09-12 11:46:53 +0200 (Thu, 12 Sep 2013))
Bugs:
Please report bugs to christian [at] bioinf.uni-leipzig [dot] de.
References:
SEGEMEHL is free software for non-commercial use
© 2012 Bioinformatik Leipzig
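As an illustration of how the options listed above fit together, the following minimal sketch assembles a lack command line in Python without executing it. The executable name `lack.x` and all file names are assumptions chosen for illustration; only the option letters are taken from the list above.

```python
import shlex

def build_lack_command(genomes, alignments, remap, out,
                       unmatched=None, threads=1, executable="lack.x"):
    """Assemble an argument list for lack from the documented options.

    The executable name and every file name passed in are placeholders.
    """
    cmd = [
        executable,
        "-d", *genomes,       # --database: one or more reference sequence files
        "-q", alignments,     # --query: coordinate-sorted (gzip'ed) SAM file
        "-r", remap,          # --remapfilename: unmapped reads with best-seed info
        "-o", out,            # --outfile: SAM output (stdout if not given)
        "-t", str(threads),   # --threads
    ]
    if unmatched is not None:
        cmd += ["-u", unmatched]  # --nomatchfilename: reads that stay unmapped
    return cmd

# Hypothetical file names, for illustration only.
cmd = build_lack_command(["genome.fa"], "mapped.sorted.sam.gz",
                         "unmapped.reads", "remapped.sam",
                         unmatched="still_unmapped.reads", threads=4)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` once the paths point to real files; the alignment thresholds (-A, -W, -U, -Z, -M) are left at their defaults here.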
In order to use lack with split-read aligners other than segemehl, it is necessary to get a file
comprising unmapped reads, to add best seed information, and to postprocess the alignments
reported by the split-read aligner.
Here, we provide detailed information on these performance tests. In the light of the
heated debates, we would like to stress that benchmarks only measure specific aspects
and may not be used to claim any universal superiority or inferiority of a tool.
In order to reproduce the results of the read aligner comparisons, we have assembled
a benchmark package containing all data sets (simulated and down-sampled real data),
read aligners (pre-compiled versions), optimal alignments (obtained by RazerS 3), and
all custom scripts. Below, we also explain how to re-run the benchmarks.
We would like to encourage all readers to reproduce these results and to come up with
alternative benchmarks.
If not, please refer to the section 'Tools' below.
Details on the output are given in the section 'Output' below.
Note: The initial execution will take much longer since the genome
will be downloaded and the genome index of each aligner needs to be
built.
Details on the output are given in the section 'Output' below.
Note: The execution of these benchmarks takes longer than under
default parameters since multiple different parameter settings are
used for every aligner and data set, some of which are much more
time-consuming than the default setting.
Bowtie 2 v.2.1.0, BWA/BWA-SW/BWA-MEM v.0.7.4, GEM pre-release 3,
segemehl v.0.1.7, and STAR v.2.3.0e are already included precompiled
in the package above. Please check whether all tools are working
properly. Note that all aligners were downloaded as source (if
available) and compiled on a Linux x86 (64-bit) machine.
If any aligner is not working properly, you will need to download
the aligner manually, compile it, and possibly update the paths in the
respective runX.sh in the scripts subfolder.
The optimal alignments, obtained by RazerS 3 v.3.1, are included in the
package above. These are necessary for the evaluation of the benchmarks.
Details about the command line options used to generate these alignments
and the processing of paired-end data are given in the paper.
To reduce the size of the package, genome sequence and genome indices
of the aligners are not included in the package above.
During the first execution of the scripts, the genome will be downloaded
and the indices will be built. More specifically, the human genome (hg19)
will automatically be downloaded from UCSC and converted.
Haplotypes, random contigs, and 'non-chromosomal' sequences will
be removed in the process.
Genome indices of the aligners will be built once as well.
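The removal of haplotypes, random contigs, and 'non-chromosomal' sequences can be sketched as a simple FASTA filter. This is not the conversion script shipped with the package; the rule used here (drop every sequence whose name contains an underscore, as in the UCSC hg19 names chr6_apd_hap1 or chrUn_gl000211) is an assumption, and the demo sequences are invented.

```python
import io

def keep_sequence(name):
    """Assumed rule: extra hg19 sequences carry an underscore in their name."""
    return "_" not in name

def filter_fasta(lines):
    """Yield only the FASTA records whose header passes keep_sequence."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            keep = keep_sequence(name)
        if keep:
            yield line

# Tiny invented example instead of the real genome.
demo = io.StringIO(
    ">chr1\nACGT\n"
    ">chr6_apd_hap1\nGGGG\n"
    ">chrUn_gl000211\nTTTT\n"
    ">chrX\nCCCC\n"
)
filtered = "".join(filter_fasta(demo))
print(filtered, end="")
```

Because the filter works line by line, it can be applied to a genome-sized file without loading it into memory.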
The output of runDefault.sh or runVarParam.sh will be put into the
subfolders mappings and mapping_varparam, respectively. The
subfolders contain all final alignment files (coordinate-sorted,
gzip'ed SAM), time and memory files containing the running time and
peak virtual memory measurements, and evaluation files.
Finally, there will be a file, named all.eval.txt, that contains all
the evaluation measures of every dataset and aligner. It is formatted
as follows:
dataset<tab>tool<tab>params<tab>type<tab>time<tab>mem<tab>sens<tab>fp
where dataset is the name of the data set, tool is the short name of
the read aligner, params is the concatenated, whitespace-reduced
list of parameters used for execution, type is the evaluation
scenario (allbest or anybest), time is the user time in seconds, mem is
the peak virtual memory consumption in MB, sens is the calculated
sensitivity, and fp is the calculated number of false positives.
Note that the all.eval.txt files are sufficient to generate all
figures and tables of the read aligner comparison shown in the paper
and supplement.
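A minimal sketch of how all.eval.txt could be read for further analysis, assuming the eight tab-separated columns described above. Whether the file starts with a header line is not stated, so lines with non-numeric measurement fields are skipped; the two demo rows are invented placeholders, not benchmark results.

```python
import csv
import io

COLUMNS = ["dataset", "tool", "params", "type", "time", "mem", "sens", "fp"]
NUMERIC = ["time", "mem", "sens", "fp"]

def read_eval(handle):
    """Parse tab-separated evaluation lines into a list of dictionaries."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        row = dict(zip(COLUMNS, fields))
        try:
            for key in NUMERIC:
                row[key] = float(row[key])
        except ValueError:
            continue  # e.g. a header line, if one is present
        rows.append(row)
    return rows

# Invented placeholder rows; replace with open("all.eval.txt") for real data.
demo = io.StringIO(
    "demo_set\ttoolA\t-x 1\tallbest\t12.5\t2048\t0.98\t3\n"
    "demo_set\ttoolB\t-y 2\tallbest\t20.0\t1024\t0.95\t7\n"
)
rows = read_eval(demo)

# Rank the tools of one evaluation scenario by sensitivity.
ranked = sorted((r for r in rows if r["type"] == "allbest"),
                key=lambda r: r["sens"], reverse=True)
for r in ranked:
    print(r["dataset"], r["tool"], r["sens"], r["fp"])
```

Keeping each line as a dictionary keyed by column name makes it straightforward to group by dataset or type before plotting.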