segemehl
short read mapping with gaps
Steve Hoffmann, Christian Otto, Stefan Kurtz, Cynthia Sharma, Joerg Vogel, Philipp Khaitovitch, Peter F. Stadler & Joerg Hackermueller
1. Introduction
segemehl
is a software to map short sequencer reads to reference genomes. Unlike
other methods, segemehl is able to detect not only mismatches but also
insertions and deletions. Furthermore, segemehl is not limited to a
specific read length and is able to map primer- or polyadenylation
contaminated reads correctly. segemehl implements a matching
strategy based on enhanced suffix arrays (ESA).
2. Matching model
For each suffix of a read, segemehl
aims to find the best scoring seed. Seeds might contain insertions,
deletions and mismatches (differences). The number of differences
allowed within a single seed is user controlled [parameter -D,
--differences] and is crucial for the runtime of the program.
Subsequently, seeds that undercut the user defined E-value
[parameter -E, --evalue] are passed on to an exact semi global alignment procedure.
Finally, reads with a minimum accuracy of [-A, --accuracy] percent are
reported to the user.
3. Quick Start
To build segemehl from source you simply download the archive and extract it with
tar -xvzf segemehl_0_0_*.tar.gz
subsequently go to the new directory and type
make
First,
you have to build the index structures for the reference sequences
(i.e. chromosomes, genomes) you want to match against. Please make sure
the reference provided in fasta format. To build the index structure,
say for some fasta file chr1.fa, you call segemehl with option -x
./segemehl.x -x chr1.idx -d chr1.fa
If you want to build indices for multiple fasta files, say chr1.fa chr2.fa chr3.fa you simply type
./segemehl.x -x chr1_2_3.idx -d chr1.fa chr2.fa chr3.fa
Please
note: building and storing of index structures needs memory. The
index for the human chromosome 1, e.g., takes a about 4GB of disk
space. Building the index needs approximately 6 GB of RAM. Please make
sure that your machine has at least 8GB of RAM available. Machines with
32GB of RAM, e.g., allow building a single index for half of the Human
Chromosomes at once.
Matching is just as easy. To match your reads (provided in a single multiple fasta file) just type
./segemehl.x -i chr1.idx -d chr1.fa -q myreads.fa > mymap.map
All results will be written to the file mymap.map
If the index was build from mutliple fasta files, please provide the fasta files in the same order used when building the index. Otherwise you will recieve an error message (m5 keys do not match)
4. Threads
Parallel threads [parameter -t, --threads] will make matching much faster on machines with multiple cores. Use them! To start segemehl with, e.g. 8 parallel threads, type
./segemehl.x -i chr1_2_3.idx -d chr1.fa chr2.fa chr3.fa -q myreads.fa --threads 8 > mymap.map
Using threads ten million reads with 35bp can be matched in 30 minutes.
For additional information please type
./segemehl.x --help
5. Download
Please note: the current version has a dependency on openssl/md5. Please make sure you have the approriate package (most likely a developer package) installed.
Indices build with previous versions of should be readable with later versions of the program
Initial version
Bugfix: floppy alignment ends
Optionally now displays the semi-global alignments; header description corrected; parameter -D now defaults to 1 (sufficient in most cases)
Bugfix: Invalid %N$ use detected (reported on Ubuntu systems; -DDIRECTOUTPUT flag); output contains representation of semi-global alignment; sensitivity increased for equally scoring seeds
Bugfix: Unmatched files bug fixed. Several bugs and flaws occuring during generation of large indices fixed.
Bugfix: missing alignments due to missing carry bit for reads of size 32. Thanks go to Jonathan Hoser!
Bugfix: incorrect margins in alignment step. Due to this bug alignments at the ends of reference sequences were not reported. Thanks go to Benoit Nabholz! A problem with chopping fasta files with few sequences into pieces was fixed.
5. Contact
If you have any further questions, complaints or bug reports please mail to steve (thesymbol) bioinf (another symbol) uni-leipzig (another symbol) de