Publications - Supplemental material

Please find below supplemental material corresponding to publications of our group. Currently, we list 122 supplements. If you have problems accessing electronic information, please let us know:

©NOTICE: All documents are copyrighted by the authors; If you would like to use all or a portion of any paper, please contact the author.

This supplement is also available at http://www.bioinf.uni-leipzig.de/publications/supplements/13-008
You may use this URL to cite or link to us.

BIOINF 13-008: Lacking alignments? The next-generation sequencing mapper segemehl revisited

Christian Otto, Peter F. Stadler, Steve Hoffmann

1. General Information

  • Supplementary methods, figures, and tables [PDF]

2. Remapping unmapped reads with lack

In the following, using a small example data set, we outline how to perform remapping
on unmapped reads, initially mapped by segemehl, using our novel tool lack, which is
part of the segemehl distribution. Moreover, the complete list of options available for
lack is described and details about the compatibility of lack to other split-read aligners
such as Blat, TopHat 2, and STAR are explained.

2.1 Basic usage

  1. Download the example data, custom scripts (necessary for compatibility with other
    split-read aligners) and segemehl version 0.1.7 (or newer)
  2. uncompress all files, and compile segemehl.x and lack.x
  3. tar -xzvf lack_example_data.tar.gz
    tar -xzvf lack_comp_scripts.tar.gz
    tar -xzvf segemehl_0_1_7.tar.gz
    cd segemehl_0_1_7/segemehl && make && cd ../..
  4. run split-read alignment with segemehl with split-mapping option
  5. segemehl_0_1_7/segemehl/segemehl.x -s -d reference.fa -i reference.idx -q reads.fq -o mapped.sam -u unmapped.fq -S
  6. sort and compress SAM file, compress unmapped read file
  7. samtools view -bS mapped.sam | samtools sort - mapped.sorted
    samtools view -h mapped.sorted.bam | gzip -c > mapped.sorted.sam.gz
    gzip -f unmapped.fq
  8. run remapping with lack
  9. segemehl_0_1_7/segemehl/lack.x -s -d reference.fa -q mapped.sorted.sam.gz -r unmapped.fq.gz -o remapped.sam

2.2 More options

Parameters:
-d, --database <file> [<file> ...]list of path/filename(s) of database sequence(s)
-q, --query [ ...]path/filename of alignment file in (gzip'ed)
SAM format, file must be coordinate-sorted
-o, --outfile <string>outputfile in SAM format (default:stdout)
-r, --remapfilename <file>filename for unmapped reads to be remapped
(default:none) file must contain best-seed
information, that is automatically added by
segemehl (see Section below if another aligner
was used)
-u, --nomatchfilename <file>filename for reads that remained unmapped
(default:none)
-t, --threads <n>start <n> threads (default:1)
-s, --silentshut up!
-A, --accuracy <n> min percentage of matches per read in
semi-global alignment (default:90)
-W, --minsplicecover <n>min coverage for spliced transcripts (default:80)
-U, --minfragscore <n>min score of a spliced fragment (default:5)
-Z, --minfraglen <n>min length of a spliced fragment (default:5)
-M, --maxdist <n>max number of distant sites to consider,
0 to disable (default:100)

Version:
0.1.7-403 (2013-09-12 11:46:53 +0200 (Thu, 12 Sep 2013))

Bugs:
Please report bugs to christian [at] bioinf.uni-leipzig [dot] de.

References:
SEGEMEHL is free software for non-commercial use
© 2012 Bioinformatik Leipzig

2.3 Compatibility with other split-read aligners

In order to use lack with split-read aligners other than segemehl, it is necessary to get a file
comprising unmapped reads, to add best seed information, and to postprocess the alignments
reported by the split-read aligner.

  1. run split-read aligner and convert into extended SAM format of segemehl
  2. (1) Blat (tested with version 35x1):
    blat -noHead reference.fa reads.fa mapped.psl
    pslReps -nohead mapped.psl mapped.filtered.psl mapped.filtered.psr
    cat mapped.filtered.psl | perl convBl2Se.pl reads.fq | samtools view -bT reference.fa - | samtools sort - mapped.sorted
    samtools calmd mapped.sorted.bam reference.fa | gzip -c > mapped.sorted.sam.gz
    (2) TopHat2 (tested with version 2.0.9):
    bowtie2-build reference.fa reference_bowtie2
    tophat2 --fusion-search -o mapped -p 8 reference_bowtie2 reads.fq
    samtools view -F 0x4 mapped/accepted_hits.bam | perl convTo2Se.pl | samtools view -bT reference.fa - | samtools sort - mapped.sorted
    samtools calmd mapped.sorted.bam reference.fa | gzip -c > mapped.sorted.sam.gz
    (3) STAR (tested with version 2.3.0e):
    STAR --runMode genomeGenerate --genomeDir reference_star --genomeFastaFiles reference.fa --runThreadN 8
    STAR --outFileNamePrefix mapped/ --genomeLoad NoSharedMemory --genomeDir reference_star --runThreadN 8 --chimSegmentMin 17 --chimScoreMin 17 --outSAMattributes All --readFilesIn reads.fq
    samtools view -F 0x4 -S mapped/Aligned.out.sam | perl convTo2Se.pl > mapped.tmp1
    samtools view -F 0x4 -S mapped/Chimeric.out.sam | perl convSt2Se.pl > mapped.tmp2
    perl mergeOutputStar.pl mapped.tmp1 mapped.tmp2 | samtools view -bhT reference.fa - | samtools sort - mapped.sorted
    samtools calmd mapped.sorted.bam reference.fa | gzip -c > mapped.sorted.sam.gz
  3. get unmapped reads,
    either
    • by split-read aligner itself, stored in preunmapped.fq
    or
    • by use of getUnmapped.pl (part of custom script package above)
    • samtools view -S mapped.sorted.sam.gz | getUnmapped.pl reads.fq > preunmapped.fq
  4. find best seeds using segemehl
  5. segemehl_0_1_7/segemehl/segemehl.x -d genome.fa -i genome.idx -q preunmapped.fq -A 100 -u unmapped.fq -o preunmapped.sam -t 8
    gzip -f unmapped.fq
  6. run remapping with lack
  7. segemehl_0_1_7/segemehl/lack.x -s -d reference.fa -q mapped.sorted.sam.gz -r unmapped.fq.gz -o remapped.sam

3. Read aligner comparison

Here, we provide detailed information on these performance tests. In the light of the
heated debates, we would like to stress that benchmarks only measure specific aspects
and may not be used to claim any universal superiority or inferiority of a tool.

In order to reproduce the results of the read aligner comparisons, we have assembled
a benchmark package containing all data sets (simulated and down-sampled real data),
read aligners (pre-compiled versions), optimal alignments (obtained by RazerS 3), and
all custom scripts. In addition, we subsequently explain how to re-run the benchmarks.

We would like to encourage all readers to reproduce this data and to come up with
alternative benchmarks.

3.1 Download

benchmark package 0.1 (342 MB, initial version)
benchmark package 0.2 (369 MB, paired-end mRNA-seq data set added)
benchmark package 0.3 (369 MB, bug fixes in run*.sh files)
benchmark package 0.4 (372 MB, base qualities scaled to Phred+33)

3.2 Requirements

  • 64-bit linux machine with 50 GB RAM or more
  • multi-threading of each aligner with 8 threads is utilized
    (adapt parameter $THR in scripts/run*.sh if necessary)
  • samtools installed (newest version 0.1.19 and bugfixed according to
    samtools-devel, otherwise there will be an error message if no
    alignments are present in SAM)

3.3 How to run the benchmarks?

  1. Download the benchmark package, uncompress it, and change into
    the new directory by typing
  2. tar -xzvf segemehl_benchmarks.tar.gz && cd segemehl_benchmarks
  3. Check whether all aligners are working properly using the following
    commands
  4. ./tools/bowtie2-2.1.0/bowtie2 -h ./tools/bwa-0.7.4/bwa
    ./tools/GEM-binaries-Linux-x86_64-core_i3-20130406-045632/bin/gem-mapper
    ./tools/segemehl_0_1_7/segemehl/segemehl.x -h
    ./tools/STAR_2.3.0e.Linux_x86_64/STAR --outFileNamePrefix test && rm -rf test*

    If not, please refer to the section 'Tools' below.

  5. Run the benchmarks under default parameter values by typing
  6. ./scripts/runDefault.sh

    Details on the output is given in the section 'Output' below.

    Note: The initial execution will take much longer since the genome
    will be downloaded and the genome index of each aligner needs be
    built.

  7. Run the benchmarks with the set of different of parameter options
    (in order to define best-sensitivity and best-false positive settings)
    by typing
  8. ./scripts/runVarParam.sh

    Details on the output is given in the section 'Output' below.

    Note: The execution of these benchmarks takes longer than under
    default parameters since multiple different parameter settings are
    used for every aligner and data set, some of which are much more
    time-consuming than the default setting.

3.4 Tools

Bowtie 2 v.2.1.0, BWA/BWA-SW/BWA-MEM v.0.7.4, GEM pre-release 3,
segemehl v.0.1.7, and STAR v.2.3.0e are already included precompiled
in the package above. Please check whether all tools are working
properly. Note that all aligners were downloaded as source (if
available) and compiled on a linux x86 (64-bit) machine.

If any aligner is not working properly, you would need to download
the aligner manually, compile it, and possibly update the paths in the
respective runX.sh in the scripts subfolder.

3.5 Optimal alignments

The optimal alignments, obtained by RazerS 3 v.3.1, are included in the
package above. These are necessary for the evaluation of the benchmarks.

Details about the command line options used to generate these alignments
and the processing of paired-end data are given in the paper.

3.6 Genome and genome indices

To reduce the size of the package, genome sequence and genome indices
of the aligners are not included in the package above.

During the first execution of the scripts, the genome will be downloaded
and the indices will be build. More specifically, the Human genome (hg19)
will automatically be downloaded from UCSC and converted properly.
Haplotypes, random contigs, and 'non-chromosomal' sequences will
therefore be removed.

Genome indices of the aligners will be build once as well.

3.7 Output

The output of runDefault.sh or runVarParam.sh will be put into the
subfolders mappings and mapping_varparam, respectively. The
subfolders contain all final alignment files (coordinate sorted
gzip'ed SAM), time and memory files containing the running time and
peak virtual memory measurements, and evaluation files.

Finally, there will be a file, named all.eval.txt, that contains all
the evaluation measures of every dataset and aligner. It is formatted
as follows:
dataset<tab>tool<tab>params<tab>type<tab>time<tab>mem<tab>sens<tab>fp
where dataset is the name of the data set, tool is the short name of
the split-read aligner, params is the concatenated, white-spaced
reduced list of parameters used for execution, type is the evaluation
scenario (allbest or anybest), time is the user time in seconds, mem is
the peak virtual memory consumtion in MB, sens is the calculated
sensitivity, and fp is the calculated number of false positives.

Note that the all.eval.txt files are sufficient to generate all
figures and tables of the read aligner comparison, shown in paper
and supplement.