Segemehl (diplonema version)
General Information
- segemehl is a very sensitive read aligner that maps genomic reads and can also handle splicing events via the command-line option -S
- in both cases it requires an index structure (similar to other aligners such as Bowtie); the index is generated with the option -x and is then provided to the mapping run with the option -i
- segemehl_diplonema.x is an extended and customized version of segemehl that performs (split-)mapping while allowing an arbitrary number of C->T and A->G substitutions, either during both the seeding and the alignment step (via -F 5) or only during the alignment step (via -F 6)
- using segemehl with -F 5 requires an additional index (with collapsed alphabet), which is generated via the -x option together with -F 5; in the example calls below it carries the suffix '.ctgaidx'
- the diplonema-specific modes have a higher sensitivity but a lower specificity than normal segemehl because they allow an unlimited number of C->T and A->G substitutions
- both diplonema-specific modes (-F 5 and -F 6) can be coupled with the split-read mapping (via -S)
- in split-read mapping, common segemehl can have an advantage if the read data contain no substitutions: its higher specificity helps to identify the correct split-read alignment, whereas the diplonema mode may not report an alignment at all; in our benchmarks we observed higher support for the trans-splicing junctions with normal split-read segemehl than with split-read segemehl in a diplonema-specific mode (this probably does not hold for highly edited transcripts)
- however, the benchmarks also showed that in split-read mapping, the diplonema mode -F 6 appears to work best with respect to the trade-off between specificity and sensitivity of finding multi-split mappings with edits
- in addition to mapping the read data itself, segemehl can also be used to map assembled transcripts to the genome, which can help to identify entire genes or isoforms instead of individual split junctions; we tested this with manually assembled non-edited and edited transcripts: the non-edited ones were found best with common segemehl (as expected) and the edited ones with segemehl in diplonema mode -F 6
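The collapsed alphabet behind -F 5 can be pictured with a toy example. The sketch below only illustrates the idea and is not segemehl's internal implementation: the helper collapse and both sequences are made up, and it assumes that collapsing means treating C as T and A as G on the reference and on the read alike.

```shell
#!/bin/sh
# Illustration only, NOT segemehl's code; 'collapse' is a hypothetical helper.
# Map C->T and A->G so that edited and unedited bases become indistinguishable.
collapse() {
    printf '%s\n' "$1" | tr 'CA' 'TG'
}

ref="ACGTCCAT"        # made-up reference fragment
read_seq="GTGTTCGT"   # same fragment carrying A->G and C->T substitutions

# In the full alphabet the two sequences differ ...
if [ "$ref" = "$read_seq" ]; then
    echo "identical in full alphabet"
else
    echo "different in full alphabet"
fi

# ... but they coincide once both are collapsed.
if [ "$(collapse "$ref")" = "$(collapse "$read_seq")" ]; then
    echo "identical in collapsed alphabet: $(collapse "$ref")"
fi
```

The same toy helper also hints at the loss of specificity mentioned above: a stretch of C's and a stretch of T's collapse to the same string, so a read can no longer tell those two loci apart.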
Preprocessing
- reads: the headers of the read files need to be restored to their original Illumina format (with a space before '1:N:...' instead of '_') so that segemehl can check the concordance of the paired-end read files, e.g. using the following commands
zcat PA_1-r.a53.q20_paired.fastq.gz | paste - - - - | awk 'BEGIN{FS="\t"; OFS="\n"}{sub("_", " ", $1); print}' | gzip -c > PA_1-r.a53.q20_paired_se.fastq.gz
zcat PA_2-r.a53.q20_paired.fastq.gz | paste - - - - | awk 'BEGIN{FS="\t"; OFS="\n"}{sub("_", " ", $1); print}' | gzip -c > PA_2-r.a53.q20_paired_se.fastq.gz
- mitogenome:
  - segemehl does not accept '>' symbols within the FASTA description line; they can be replaced manually or with the following command
cat Dp_mito_Cass_20150806.fasta | sed 's/->/ to /g' > Dp_mito_Cass_20150806_se.fasta
  - to allow for poly-T inserts, an additional poly-T contig needs to be added to the mitogenome
cat Dp_mito_Cass_20150806_se.fasta > Dp_mito_Cass_20150806_se_polyT.fasta
perl -e 'print ">polyT\n".("T" x 100)."\n"' >> Dp_mito_Cass_20150806_se_polyT.fasta
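Both preprocessing steps can be tried on toy data before running them on the real files. The following is a minimal sketch: the read name, the FASTA header and the toy_* file names are invented for illustration, and the poly-T contig is written with awk as an assumed equivalent of the perl one-liner above. Note that sub() replaces only the first '_' in the header, so read names that contain an earlier underscore would need a different pattern.

```shell
#!/bin/sh
tmp=$(mktemp -d)

# 1) read headers: one made-up FASTQ record with '_' before '1:N:...'
printf '@HWI-ST1:7:1101:1234:5678_1:N:0:ACGT\nACGTACGT\n+\nIIIIIIII\n' > "$tmp/toy_1.fastq"
paste - - - - < "$tmp/toy_1.fastq" \
  | awk 'BEGIN{FS="\t"; OFS="\n"}{sub("_", " ", $1); print}' > "$tmp/toy_1_se.fastq"
fixed_header=$(head -n 1 "$tmp/toy_1_se.fastq")
echo "fixed header: $fixed_header"

# 2) mitogenome: made-up FASTA description containing '->'
printf '>toy_contig A->B\nACGTACGT\n' > "$tmp/toy.fasta"
sed 's/->/ to /g' "$tmp/toy.fasta" > "$tmp/toy_se.fasta"

# append a poly-T contig of 100 nt (awk stand-in for the perl one-liner)
cp "$tmp/toy_se.fasta" "$tmp/toy_se_polyT.fasta"
awk 'BEGIN{s=""; for(i=0;i<100;i++) s = s "T"; print ">polyT"; print s}' >> "$tmp/toy_se_polyT.fasta"

# checks: no description line may contain a further '>', poly-T must be 100 nt
extra_gt=$(grep '^>' "$tmp/toy_se_polyT.fasta" | cut -c2- | grep -c '>' || true)
polyT_len=$(tail -n 1 "$tmp/toy_se_polyT.fasta" | tr -d '\n' | wc -c | tr -d ' ')
echo "headers with extra '>': $extra_gt; polyT length: $polyT_len"

rm -rf "$tmp"
```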
Example Calls
- the following example calls for segemehl can be used to map RNA-seq data of Diplonema to the mitogenomic sequences
- each block generates the mappings, converts them to sorted BAM, and performs junction calling with our segemehl junction caller (testrealign.x); see the segemehl manual (section 11) for more information
- please adjust the number of threads (-t) to your needs and capacities
generate segemehl indices
./segemehl_diplonema.x -d Dp_mito_Cass_20150806_se_polyT.fasta -x Dp_mito_Cass_20150806_se_polyT.idx
./segemehl_diplonema.x -d Dp_mito_Cass_20150806_se_polyT.fasta -x Dp_mito_Cass_20150806_se_polyT.ctgaidx -F 5
normal split-read mapping with segemehl
./segemehl_diplonema.x -d Dp_mito_Cass_20150806_se_polyT.fasta -i Dp_mito_Cass_20150806_se_polyT.idx -q PA_1-r.a53.q20_paired_se.fastq.gz -p PA_2-r.a53.q20_paired_se.fastq.gz -o PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S.sam -t 10 -s -S -u PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S.unmapped.fastq 2> PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S.log
samtools view -bS PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S.sam | samtools sort - PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S
samtools index PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S.bam
rm -f PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S.sam
gzip -f PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S.unmapped.fastq
samtools view -h PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S.bam | gzip -c > PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S.sam.gz
./testrealign.x -d Dp_mito_Cass_20150806_se_polyT.fasta -q PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S.sam.gz -n -T PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S.trans.bed -U PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S.splice.bed
split-read mapping with segemehl in diplonema mode -F 5
./segemehl_diplonema.x -d Dp_mito_Cass_20150806_se_polyT.fasta -i Dp_mito_Cass_20150806_se_polyT.ctgaidx -q PA_1-r.a53.q20_paired_se.fastq.gz -p PA_2-r.a53.q20_paired_se.fastq.gz -o PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S-F5.sam -t 10 -s -S -F 5 -u PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S-F5.unmapped.fastq 2> PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S-F5.log
samtools view -bS PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S-F5.sam | samtools sort - PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S-F5
samtools index PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S-F5.bam
rm -f PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S-F5.sam
gzip -f PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S-F5.unmapped.fastq
samtools view -h PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S-F5.bam | gzip -c > PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S-F5.sam.gz
./testrealign.x -d Dp_mito_Cass_20150806_se_polyT.fasta -q PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S-F5.sam.gz -n -T PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S-F5.trans.bed -U PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S-F5.splice.bed
split-read mapping with segemehl in diplonema mode -F 6
./segemehl_diplonema.x -d Dp_mito_Cass_20150806_se_polyT.fasta -i Dp_mito_Cass_20150806_se_polyT.idx -q PA_1-r.a53.q20_paired_se.fastq.gz -p PA_2-r.a53.q20_paired_se.fastq.gz -o PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S-F6.sam -t 10 -s -S -F 6 -u PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S-F6.unmapped.fastq 2> PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S-F6.log
samtools view -bS PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S-F6.sam | samtools sort - PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S-F6
samtools index PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S-F6.bam
rm -f PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S-F6.sam
gzip -f PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S-F6.unmapped.fastq
samtools view -h PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S-F6.bam | gzip -c > PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S-F6.sam.gz
./testrealign.x -d Dp_mito_Cass_20150806_se_polyT.fasta -q PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S-F6.sam.gz -n -T PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S-F6.trans.bed -U PA_1-r.a53.q20_paired_se.Dp_mito_Cass_20150806_se_polyT.segemehl-S-F6.splice.bed
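The three mapping blocks above differ only in the index file, the -F flag and the suffix of the output names. As a sketch, the loop below prints (but does not execute) the segemehl call for each mode, so that the long file names are built in one place. The function names out_prefix and segemehl_cmd and the mode labels are hypothetical; the flags are copied from the calls above. The last printed line uses the 'samtools sort -o' form, which assumes samtools 1.3 or later; the 'samtools sort - PREFIX' form used above is the older syntax.

```shell
#!/bin/sh
# Dry run: only prints the commands; nothing is mapped here.
ref="Dp_mito_Cass_20150806_se_polyT"     # reference name without .fasta
r1="PA_1-r.a53.q20_paired_se.fastq.gz"
r2="PA_2-r.a53.q20_paired_se.fastq.gz"
threads=10                                # adjust to your machine

# Output prefix for a mode suffix ("", "-F5" or "-F6"); hypothetical helper.
out_prefix() {
    printf '%s.%s.segemehl-S%s' "${r1%.fastq.gz}" "$ref" "$1"
}

# Print the segemehl call for one mode: normal, F5 or F6 (labels are made up).
segemehl_cmd() {
    case "$1" in
        normal) idx="$ref.idx";     flag="";      suffix=""    ;;
        F5)     idx="$ref.ctgaidx"; flag=" -F 5"; suffix="-F5" ;;
        F6)     idx="$ref.idx";     flag=" -F 6"; suffix="-F6" ;;
        *)      echo "unknown mode: $1" >&2; return 1 ;;
    esac
    p=$(out_prefix "$suffix")
    printf './segemehl_diplonema.x -d %s.fasta -i %s -q %s -p %s -o %s.sam -t %s -s -S%s -u %s.unmapped.fastq 2> %s.log\n' \
        "$ref" "$idx" "$r1" "$r2" "$p" "$threads" "$flag" "$p" "$p"
}

for mode in normal F5 F6; do
    segemehl_cmd "$mode"
done

# Assumption: samtools >= 1.3, where sorting takes -o instead of an output prefix.
p=$(out_prefix "-F6")
printf 'samtools sort -o %s.bam %s.sam\n' "$p" "$p"
```

Only -F 5 switches to the .ctgaidx index; -F 6 keeps the normal .idx index because the substitutions are allowed during the alignment step only.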