segemehl

Steve Hoffmann, Christian Otto, Stefan Kurtz, David Langenberger, Cynthia Sharma, Joerg Vogel, Philipp Khaitovitch, Peter F. Stadler & Joerg Hackermueller

1. Introduction

segemehl is a software to map short sequencer reads to reference genomes. Unlike other methods, segemehl is able to detect not only mismatches but also insertions and deletions. Furthermore, segemehl is not limited to a specific read length and is able to mapprimer- or polyadenylation contaminated reads correctly. segemehl implements a matching strategy based on enhanced suffix arrays (ESA). Segemehl now supports the SAM format, reads gziped queries to save both disk and memory space and allows bisulfite sequencing mapping and split read mapping.

2. Matching model

For each suffix of a read, segemehl aims to find the best scoring seed. Seeds might contain insertions, deletions and mismatches (differences). The number of differences allowed within a single seed is user controlled [parameter -D, --differences] and is crucial for the runtime of the program. Subsequently, seeds that undercut the user definedE-value [parameter -E, --evalue] are passed on to an exact semi global alignment procedure. Finally, reads with a minimum accuracy of [-A, --accuracy] percent are reported to the user.

3. Quick Start

To build segemehl from source you simply download the archive and extract it with

tar -xvzf segemehl_0_*.tar.gz

subsequently go to the new directory and type

make

First, you have to build the index structures for the reference sequences (i.e. chromosomes, genomes) you want to match against. Please make sure the reference provided in fasta format. To build the index structure, say for some fasta file chr1.fa, you callsegemehl with option -x

If you want to build indices for multiple fasta files, say chr1.fa chr2.fa chr3.fa you simply type

Please note: building and storing of index structures needsmemory. The index for the human chromosome 1, e.g., takes a about 4GB of disk space. Building the index needs approximately 6 GB of RAM. Please make sure that your machine has at least 8GB of RAM available. Machines with 32GB of RAM, e.g., allow building a single index for half of the Human Chromosomes at once (see below).

Matching is just as easy. To match your reads (provided in a single multiple fasta file) just type

All results will be written to the file mymap.sam

If the index was build from mutliple fasta files, please provide the fasta files in the same order used when building the index. Otherwise you will recieve an error message (m5 keys do not match). If you are sure you have supplied a correct pair of index and reference sequence you can continue and the md5 field will be automatically updated.

3a. Reducing memory consumption

To reduce the memory consumption, you can also run segemehl

chromosome-wise as follows.

Afterwards, you can merge the sam files (e.g. using MergeSamFiles from Picard tools, update the SAM tags and enforce best-only (if necessary) using our tool processMerged.pl by typing

By running segemehl on each human chromosome separately, its peak memory

consumption is reduced to around 6 GB of RAM.

4. Threads

Parallel threads [parameter -t, --threads] will make matching much faster on machines with multiple cores. Use them! To start segemehl with, e.g. 8 parallel threads, type

Using threads ten million reads with 35bp can be matched in 30 minutes.

5. Bisulfite read mapping

segemehl supports mapping of bisulfite-treated sequencing data of both currently used protocols for the construction of the bisulfite-treated libraries, namely methylC-seq (Lister et al., 2009) and BS-seq (Cokus et al., 2008) with the option [-F 1, --bisulfite 1] and [-F 2,--bisulfite 2], respectively. The sequencing reads exhibit asymmetric bisulfite-related mismatches due to the sodium bisulfite treatment which will be considered as matches during the mapping. To enable the seed search in bisulfite mapping, segemehl requires two bisulfite indices, one C-to-T and one G-to-A-converted index by using

To perform mapping of bisulfite reads, simply use

In the SAM-formatted output, the custom tag XB:Z indicates the genomic strand from which the read was derived according to the mapping, i.e., plus and minus strand in case of XB:Z:CT and XB:Z:GA, respectively.

We provide the simple methylation caller callMajorMethyl.pl which is based on majority voting and uses the output generated by the mpileup tool from the samtools package. In addition to the majority vote, the tool calculates the methylation rate, i.e., C/(C+T), as

confidence measure of each methylation call. The methylation rate may also be directly used as a measure.To get further information on the required input and output as well as on the majority voting approach, just type

Note that the methylation caller is currently limited to bisulfite sequencing data generated by the methylC-seq protocol.

5. Split read mapping

Split read mapping with segemehl is easy: just add the -S option. A split has to have a minmum length and a minimum score. You can change that length using the -Z and the -U option. The score is calculated as the number of matches minus the number of mismatches and indels. Remember that segemehl uses a local alignment for split mapping, i.e. parts of a read might not be aligned at all. The the option -W makes sure that a minumum percentage of the read is actually aligned.

After mapping you might want to call

to get out the junctions. !!Attention: make sure that the sam file is sorted. The program does not check it for you and will produce no congruent output!!

6. Help

For additional information read the manual or type :

5. Downloads

Please note: old versions has a dependency on openssl/md5. Please make sure you have the approriate package (most likely a developer package) installed. Indices build with previous versions of should be readable with later versions of the program

Initial version

- segemehl_0_0_1.tar.gz

Bugfix: floppy alignment ends

- segemehl_0_0_2.tar.gz

Optionally now displays the semi-global alignments; header description corrected; parameter -D now defaults to 1 (sufficient in most cases)

- segemehl_0_0_4.tar.gz

Bugfix: Invalid %N$ use detected (reported on Ubuntu systems; -DDIRECTOUTPUT flag); output contains representation of semi-global alignment; sensitivity increased for equally scoring seeds

- segemehl_0_0_5.tar.gz

Bugfix: Unmatched files bug fixed. Several bugs and flaws occuring during generation of large indices fixed.

- segemehl_0_0_7.tar.gz

Bugfix: missing alignments due to missing carry bit for reads of size 32. Thanks go to Jonathan Hoser!

- segemehl_0_0_8.tar.gz

Bugfix: incorrect margins in alignment step. Due to this bug alignments at the ends of reference sequences were not reported. Thanks go to Benoit Nabholz! A problem with chopping fasta files with few sequences into pieces was fixed.

- segemehl_0_0_9_3.tar.gz

New release:

- segemehl_0_1.tar.gz

Critial Bugfix: this version corrects an error that occured in the MD:Z: of some alignments. Important for variant calling:

-segemehl_0_1_2.tar.gz

New release:

-segemehl_0_1_3.tar.gz

Bugfix: A conditional check in the chaining prohibited the chaining for longer fragments. The fix increases the sensitivity in the split read mapping. Another bug in the seeding step of segemehl reduced its capabilities of finding mismatches in very short reads. This was fixed. Furthermore, the new version fixes some minor inconsistencies in the SAM-file.

New release:

-segemehl_0_1_4.tar.gz

New release:

-segemehl_0_1_6.tar.gz

This release fixes a bug that occured in the backtracking of the local transition alignment for strand switches.

New release:

-segemehl_0_1_7.tar.gz

Updates for the Realigner. Bugfix for a segfault that occasionally occured when mapping reads to (very) small references.

New release:

-segemehl_0_1_8.tar.gz

Reference implementation of variation calling model. Sort the sam!

New release:

-segemehl_0_1_9.tar.gz

gnu scientific library (>=1.15; lgsl and lgslcblas) only required for making haarz.x. All the other tools should now work without this dependency.

New release:

-segemehl_0_2_0.tar.gz

various bug fixes.

6. Contact

If you have any further questions, complaints or bug reports please mail to steve (thesign) bioinf (thesign) uni-leipzig (thesign) de

7. Reference

If you use this program in your work you might want to cite:

Hoffmann S, Otto C, Kurtz S, Sharma CM, Khaitovich P, Vogel J, Stadler PF, Hackermueller J: "Fast mapping of short sequences with mismatches, insertions and deletions using index structures", PLoS Comput Biol (2009) vol. 5 (9) pp. e1000502

Hoffmann S, Otto C, Doose G, Tanzer A, Langenberger D, Christ S, Kunz M, Holdt L, Teupser D, Hackermueller J, Stadler PF: "A multi-split mapping algorithm for circular RNA, splicing, trans-splicing, and fusion detection", Genome Biology (2014) 15:R34

For the Bisulfite option:

Otto C, Stadler PF, Hoffmann S: "Fast and sensitive mapping of bisulfite-treated sequencing data", Bioinformatics (2012) 28:1698-1704

./segemehl.x -x chr1.idx -d chr1.fa

./segemehl.x -x chr1_2_3.idx -d chr1.fa chr2.fa chr3.fa

./segemehl.x -i chr1.idx -d chr1.fa -q myreads.fa > mymap.sam

./segemehl.x -i chr1_2_3.idx -d chr1.fa chr2.fa chr3.fa -q myreads.fa --threads 8 > mymap.sam

./segemehl.x --help

./segemehl.x -x chr1.ctidx -y chr1.gaidx -d chr1.fa -F [1 or 2]

./segemehl.x -i chr1.ctidx -j chr1.gaidx -d chr1.fa -q myreads.fa -o mymap.sam -F [1 or 2]

perl callMajorMethyl.pl

./segemehl.x -i chr1.idx -d chr1.fa -q myreads.fa> mymap_chr1.sam

./segemehl.x -i chr2.idx -d chr2.fa -q myreads.fa> mymap_chr2.sam

java -jar MergeSamFiles.jar QUIET=true MSD=true SO=queryname O=/dev/stdout I=mymap_chr1.sam I=mymap_chr2.sam | samtools view -h - | perl processedMerged.pl -b > mymap.sam

>testrealign.x -d hg19.fa -q file.sam -n