Mapping

Fast and efficient bisulfite-sensitive read alignment including postprocessing filters.


BAT_mapping

To map reads with BAT, you only need a reference genome in FastA format, your read sequences in FastA or (gzip'ed) FastQ format, and the external tools samtools (version 1.3) and segemehl (version 0.2.0-420). The result is a set of read alignments in BAM format.

BAT_mapping takes care of pre- and postprocessing steps as well as running the short-read aligner segemehl in bisulfite mode. If the segemehl bisulfite indices do not exist yet, the two required indices are built automatically. The actual alignment of reads is done multi-threaded (option -t), and the resulting SAM file is filtered for mappings showing a high number of false-stranded bisulfite mismatches, i.e. G-to-A mismatches during the C-to-T mapping run and vice versa. The final output is a set of sorted and indexed BAM files, ready for further processing.

Note

We recommend some quality analysis and filtering prior to mapping, for example with the FASTX and FastQC toolkits. At the very least, adapter clipping is crucial for reads containing adapter sequences.

Basic usage

BAT_mapping  -g <file> -q <file> [-p <file>] -i <prefix> -o <prefix>
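As a minimal sketch, a paired-end run on a directional library could be assembled as shown below. All file names, the index prefix, the output prefix and the thread count are hypothetical placeholders; only options documented in this section are used. The command is printed first and executed only if BAT_mapping is found in PATH and the input files exist.

```shell
#!/bin/sh
# Hypothetical inputs; replace with your own files.
GENOME=genome.fa              # reference genome (FastA)
READS=sample_R1.fastq.gz      # reads (gzip'ed FastQ)
MATES=sample_R2.fastq.gz      # mates, in the same order as READS
INDEX=index/genome            # prefix of the two bisulfite indices
OUT=mapping/sample            # prefix of the output files
THREADS=4

# -F 1 selects the directional (methylC-Seq) protocol, which is also the default.
CMD="BAT_mapping -g $GENOME -q $READS -p $MATES -i $INDEX -o $OUT -t $THREADS -F 1"
echo "$CMD"

# Run only when the tool is installed and the inputs exist.
# The unquoted expansion relies on file names without whitespace.
if command -v BAT_mapping >/dev/null 2>&1 && [ -f "$GENOME" ] && [ -f "$READS" ]; then
    $CMD
fi
```

For single-end data, drop the -p part of the command.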

Output files

File Description
prefix.bam Indexed and sorted BAM file containing the filtered read alignments.
prefix.excluded.bam Indexed and sorted BAM file containing the read alignments that were excluded by the quality control filter.
prefix.log Log file.

Input/Output options

Option Description
-g Filename of reference genome FastA.
-q Filename of query sequence in FastA or (gzip'ed) FastQ format.
-p Filename of pair (mate) sequences (in case of paired-end data). Please make sure that reads and mates are concordant.
-i Prefix of genome indices. In bisulfite mode, segemehl requires two different indices. In case these indices do not exist, they will be built under the given path with the given prefix. This can be time-consuming, but only needs to be done once per genome and can then be reused for further bisulfite datasets.
-o Prefix of output files. All output files will be written to the directory given by the prefix. By default, this directory also serves as the temporary directory if none is specified with --tmp.
--tmp Path of temporary directory. In bisulfite mode, segemehl produces many temporary files during mapping which will all be written to this directory. If a non-existent directory is specified, it will be created and removed at the end.

Alignment/Filtering options

Option Description
-t Number of threads (default: 1).
-F Type of library preparation protocol for bisulfite sequencing: 1 = methylC-Seq/directional/Lister et al.; 2 = BS-Seq/non-directional/Cokus et al. (default: 1).
-a Additional parameters for segemehl (default: none). This option is only recommended for advanced users.
--exclude Filtering threshold for bisulfite read alignments. Read alignments whose XF tag value exceeds the given threshold will be excluded (default: 3). Only change this option if you are an advanced user.
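To illustrate what the --exclude threshold means, here is a small sketch that classifies toy SAM records by their XF tag. It is not the filter implemented in BAT_mapping. It assumes that the tag is stored as an integer tag of the form XF:i:<n> and that a record without the tag counts as 0; the records themselves are made up.

```shell
#!/bin/sh
# Toy SAM records (hypothetical); fields are tab-separated, tags start at column 12.
SAM=$(printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    r1 0  chr1 100 60 10M '*' 0 0 ACGTACGTAC IIIIIIIIII NM:i:1 XF:i:0 \
    r2 0  chr1 200 60 10M '*' 0 0 ACGTACGTAC IIIIIIIIII NM:i:5 XF:i:4 \
    r3 16 chr1 300 60 10M '*' 0 0 ACGTACGTAC IIIIIIIIII NM:i:3 XF:i:3)

THRESHOLD=3   # same value as the documented default of --exclude

RESULT=$(printf '%s\n' "$SAM" | awk -F '\t' -v max="$THRESHOLD" '
    /^@/ { next }                          # skip header lines
    {
        xf = 0                             # assumption: missing XF tag counts as 0
        for (i = 12; i <= NF; i++)
            if ($i ~ /^XF:i:/) xf = substr($i, 6) + 0
        # an alignment is excluded only if XF exceeds the threshold
        print $1, (xf > max ? "excluded" : "kept")
    }')
echo "$RESULT"
```

With the default threshold of 3, r2 (XF = 4) is excluded, while r3 (XF = 3) is kept because its value does not exceed the threshold.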

External tools

Option Description
--segemehl Path to segemehl executable (recommended: v0.2.0-420). Required if segemehl executable is not in PATH. For installation, manual or problems please go to the segemehl website.
--samtools Path to samtools executable (recommended: v1.3). Required if samtools executable is not in PATH. For installation, manual or problems please go to the samtools website.



BAT_mapping_stat

BAT_mapping_stat counts your mapped reads and gives an indication of the quality of the mapping. It uses the BAM file created by BAT_mapping as input and reports several statistics including count statistics and frequency distributions.

BAT_mapping_stat can distinguish between single-end and paired-end read alignments as well as between reads aligned to one (unique) and multiple locations. In bisulfite mode, paired-end reads are allowed to map independently as single mates when both mates are mapped, but in different runs (C-to-T and G-to-A). In addition, the statistics are visualized in an additional PDF file.

Basic usage

BAT_mapping_stat  --bam <file> --excluded <file> --fastq <file>

or

samtools view mapping.bam | BAT_mapping_stat --excluded excluded.bam

Statistics

  1. Total amount of mappings: Number of read alignments. This can be larger than the number of mapped reads because a read may be aligned to multiple locations.
  2. Mapped reads: Number of aligned reads subdivided into paired-end reads aligned as pair that were aligned once (unique paired-end) or multiple times (multiple paired ends), paired-end reads where only one end was aligned once (unique one mate) or multiple times (multiple one mate), as well as into single-end reads that were aligned once (unique single-end) or multiple times (multiple single-end).
  3. Sum of all mapped reads: For paired-end data, the sum of all mapped fragments is reported as well.
  4. Amount of split and not split mates: Reported only if split reads are present.
  5. Frequencies of multiple hits: List of number of hits and the frequency of reads with this number of hits; separately for paired-end reads and one mates and single-end reads. Format: nr_of_hits <tab> frequency.
  6. Frequencies of the e-distances: List of edit distance and the frequency of non-split read alignments with this edit distance. Format: e_distance <tab> frequency.
  7. Frequency of number of split fragments: List of number of split-read fragments and the frequency of read alignments that are split into this number of fragments. Format: nr_split_fragments <tab> frequency.
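The two-column frequency tables are easy to post-process. As a sketch, the following computes the share of reads with exactly one hit from a table in the nr_of_hits <tab> frequency format. The numbers are made up, and lines that do not start with a number are simply skipped, since the exact layout of the surrounding output is not described here.

```shell
#!/bin/sh
# Hypothetical table in the format nr_of_hits <tab> frequency.
HITS=$(printf '%s\t%s\n' 1 900 2 60 3 40)

UNIQUE_PCT=$(printf '%s\n' "$HITS" | LC_ALL=C awk -F '\t' '
    $1 ~ /^[0-9]+$/ {                      # ignore any non-numeric lines
        total += $2
        if ($1 == 1) unique = $2
    }
    END { if (total > 0) printf "%.1f", 100 * unique / total }')

echo "reads with a unique hit: ${UNIQUE_PCT}%"
```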

Input/Output options

Option Description
--bam Path to input BAM file (default: stdin).
--excluded Path to input excluded BAM file (default: none).
--fastq Path to FastQ file (first read pair if paired-end) (default: none).
-p Name of output file for frequency of multiple hits of paired-end reads (default: stdout).
-m Name of output file for frequency of multiple hits of one mate reads (default: stdout).
-s Name of output file for frequency of multiple hits of single-end reads (default: stdout).
-e Name of output file for distribution of the edit distances of read alignments (default: stdout).
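As a sketch of how these options fit together, the call below reads the two BAM files written by BAT_mapping and sends each frequency table to its own file instead of stdout. The prefix, the FastQ file name and the names of the four table files are hypothetical; the command is printed and only executed if BAT_mapping_stat is in PATH and the BAM file exists.

```shell
#!/bin/sh
PREFIX=mapping/sample          # hypothetical -o prefix of the BAT_mapping run
FASTQ=sample_R1.fastq.gz       # hypothetical FastQ file of the first reads

CMD="BAT_mapping_stat --bam $PREFIX.bam --excluded $PREFIX.excluded.bam --fastq $FASTQ"
CMD="$CMD -p $PREFIX.hits_paired.txt -m $PREFIX.hits_one_mate.txt"
CMD="$CMD -s $PREFIX.hits_single.txt -e $PREFIX.edit_distance.txt"
echo "$CMD"

# Run only when the tool is installed and the input exists.
# The unquoted expansion relies on file names without whitespace.
if command -v BAT_mapping_stat >/dev/null 2>&1 && [ -f "$PREFIX.bam" ]; then
    $CMD
fi
```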



BAT_merging

BAT_merging facilitates the merging of multiple BAM files into a single file (e.g. if one sample was sequenced on multiple lanes). Several options can be set in order to add meta information of the input as read group information to the header of the output file. Read group identifiers (RG tag in SAM format) in the read alignments are then used to link this meta information (e.g., sequencing center, library protocol, sequencing platform, etc.) to the corresponding read alignments. In such a way, it is later possible to trace possible batch effects or sequencing artefacts back to the original input file(s).

Basic usage

BAT_merging  -o <file> --bam <file>, ... ,<file>

Output file

File Description
output.bam Indexed and sorted BAM file containing the read alignments of all input files.

Input/Output options

Option Description
--bam Comma-separated list of BAM filenames to be merged. The list must not contain whitespace.
-o Filename of merged output BAM file.

Read group options

Option Description
--id Comma-separated list of read group identifiers, one for each BAM file (default: prefix of filename).
--cn Comma-separated list of names of sequencing centers that generated the read data, one for each BAM file (default: none).
--ds Comma-separated list of descriptions, one for each BAM file (default: none).
--dt Comma-separated list of dates (ISO8601 date or date/time) the runs were produced, one for each BAM file (default: none).
--fo Comma-separated list of flow orders, one for each BAM file (default: none).
--ks Comma-separated list of arrays of nucleotide bases that correspond to the key sequence of each read, one for each BAM file (default: none).
--lb Comma-separated list of libraries, one for each BAM file (default: none).
--pi Comma-separated list of predicted median insert sizes, one for each BAM file (default: none).
--pl Comma-separated list of platforms/technologies that were used to generate the reads, one for each BAM file (default: none). Valid values: CAPILLARY, LS454, ILLUMINA, SOLID, HELICOS, IONTORRENT, PACBIO.
--pu Comma-separated list of platform units, one for each BAM file (default: none).
--sm Comma-separated list of samples, one for each BAM file (default: none).
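Because every read group option expects one entry per BAM file, in the same order as the --bam list, it can help to build all lists in one loop. The sketch below does this for three hypothetical lane files of one sample; the file names, the sample name and the choice of the file name without .bam as read group identifier are assumptions for illustration. It only prints the resulting command.

```shell
#!/bin/sh
# Hypothetical per-lane BAM files of one sample.
set -- lane1.bam lane2.bam lane3.bam

BAMS=""; IDS=""; SAMPLES=""; PLATFORMS=""
for f in "$@"; do
    case $f in
        *" "*|*,*) echo "unsupported file name: $f" >&2; exit 1 ;;  # lists must not contain blanks
    esac
    id=$(basename "$f" .bam)               # identifier chosen here: file name without .bam
    BAMS="${BAMS:+$BAMS,}$f"               # append with a comma, no whitespace
    IDS="${IDS:+$IDS,}$id"
    SAMPLES="${SAMPLES:+$SAMPLES,}sampleA"
    PLATFORMS="${PLATFORMS:+$PLATFORMS,}ILLUMINA"
done

CMD="BAT_merging -o merged.bam --bam $BAMS --id $IDS --sm $SAMPLES --pl $PLATFORMS"
echo "$CMD"
```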

External tools

Option Description
--samtools Path to samtools executable. Required if samtools executable is not in PATH. For installation, manual or problems please go to the samtools website.
