Product Grammars for Alignment and Folding

(How to Multiply Dynamic Programming Algorithms)

We develop a theory of algebraic operations over linear and context-free grammars. Using this theory, we make it possible to multiply dynamic programming algorithms.

The submitted extended paper: Product Grammars for Alignment and Folding; Höner zu Siederdissen, Hofacker, and Stadler (2014).

The supplement for the extended paper: Supplement.

Additional supplemental material for the DNA-Protein Alignment example can be found below.

The BSB 2013 paper: How to Multiply Dynamic Programming Algorithms; Höner zu Siederdissen, Hofacker, and Stadler (2013).

The BSB 2013 supplement: Supplement.

Products of Context-Free Grammars (or how to Multiply Dynamic Programming Algorithms)

Many sequence alignment algorithms have a structure that can be composed in a meaningful way from a small set of ``atomic grammars’’. We provide a number of algebraic operators to compose grammars in different ways. To this end, it is useful to consider that the atomic grammars describe structure on a single tape. Combining two grammars using a product operation then yields an alignment of the two input sequences under the constraints of the structure prescribed by the individual single-tape grammars.

Example: Frameshift-aware DNA-Protein Alignment

This algorithm aligns protein sequences to genomes. Frameshifts are explicitly allowed. A number of species perform major insertions and deletions to the mitochondrial transcriptome. Usual alignment algorithms are therefore not able to successfully align protein sequences to the mitogenome. The Bundschuh group developed special algorithms to deal with these cases using a complex recursive scheme. We demonstrate how to calculate a general frameshift-aware alignment algorithm from a set of simple grammars. Our algorithm recovers a large fraction of the annotated genes, even if the alignment between protein and DNA is between fairly distant species. For details, consult our new paper Product Grammars for Alignment and Folding and the supplement.

Supplemental material:

Physarum polycephalum protein sequences aligned to P.poly. mitogenome.

Reclinomonas americana protein sequences aligned to P.poly. mitogenome.

DNA-Protein Alignment Tool, gzipped Binary, compiled on Fedora Core 17.

Link to NCBI Blast matrices.

Ansi to HTML script.

Usage example:

Features of the Compiler

Compiler, gzipped Binary, compiled on Fedora Core 17.

How to Use the Framework

The code below consists of a “Step” grammar which either reads a single character on the right, or loop on the non-terminal. In addition, we have a “Done” grammar that recognizes the empty word, and a “Loop” grammar to remove the loop case.

The product “ThreeWay” creates a three-tape grammar that aligns three input sequences at once – if you provide a scoring algebra.

Grammar: Step
NT: W
T:  c
W -> step <<< W c
W -> loop <<< W
//
Grammar: Done
NT: W
T:  empty
W -> nil <<< empty
//
Grammar: Loop
NT: W
W -> loop <<< W
Product: ThreeWay
Step >< Step >< Step + Done * 3 - Loop * 3
//

As we already provide an example of LaTeX output in paper, we now consider usage in Haskell. Note that the QuasiQouter qqGDG makes it possible to directly define the grammar in a Haskell module as an embedded domain-specific language.

[qqGDG|
Grammar: Step
{- code from above -}
//
Product: ThreeWay
Step >< Step >< Step + Done * 3 - Loop * 3
//
|]

This code snippet emits a signature “SThreeWay” and a grammar function “gThreeWay”. These are equivalent to the following code:

data SThreeWay _m _x _r c empty = SThreeWay
  { loop_loop_step :: _x -> (((Z:.()):.()):.c) -> _x
  , loop_step_loop :: _x -> (((Z:.()):.c):.()) -> _x
  , loop_step_step :: _x -> (((Z:.()):.c):.c) ->  _x
  , step_loop_loop :: _x -> (((Z:.c):.()):.()) -> _x
  , step_loop_step :: _x -> (((Z:.c):.()):.c) ->  _x
  , step_step_loop :: _x -> (((Z:.c):.c):.()) ->  _x
  , step_step_step :: _x -> (((Z:.c):.c):.c) ->   _x
  , nil_nil_nil :: (((Z:.empty):.empty):.empty) -> _x
  , h :: Stream _m _x -> _m _r
  }

gThreeWay sThreeWay {-non-terminals:-} www {-terminals:-} c_1 c_2 c_3 empty_1 empty_2 empty_3 =
  (Z:.
  ( www , loop_loop_step sThreeWay <<< www % (T:!None:!None:!c_3)  |||
          loop_step_loop sThreeWay <<< www % (T:!None:!c_2:!None)  |||
          loop_step_step sThreeWay <<< www % (T:!None:!c_2:!c_3)   |||
          step_loop_loop sThreeWay <<< www % (T:!c_1:!None:!None)  |||
          step_loop_step sThreeWay <<< www % (T:!c_1:!None:!c_3)   |||
          step_step_loop sThreeWay <<< www % (T:!c_1:!c_2:!None)   |||
          step_step_step sThreeWay <<< www % (T:!c_1:!c_2:!c_3)    |||
          nil_nil_nil sThreeWay <<< (T:!empty_1:!empty_2:!empty_3) ... h sThreeWay )
  )
{-# INLINE gThreeWay #-}

The underlying ADPfusion framework makes sure that the code is efficient.

Some grammars

  1. Protein-DNA Alignment
  2. Three-Sequence Alignment

Installation

All libraries and program are available on hackage. Make sure you have a modern GHC (7.6.x) and the newest “cabal” tool. This allows building in a “sandbox”.

  1. create a temporary directory: “mkdir ~/tmp/gramprod; cd ~/tmp/gramprod”
  2. install the Grammar Products framework: “cabal update; cabal sandbox init; cabal install GrammarProducts”

The DNA/Protein Aligner grammar was created using the “standalone Haskell module” target and does not require the GrammarProducts library to be built beforehand! To install

  1. create a temporary directory: “mkdir ~/tmp/dnaprot; cd ~/tmp/dnaprot”
  2. install the Grammar Products framework: “cabal update; cabal sandbox init; cabal install DnaProteinAlignment”

local list of references

Christian Höner zu Siederdissen, Ivo L. Hofacker, and Peter F. Stadler. 2013. How to multiply dynamic programming algorithms. Brazilian symposium on bioinformatics (BSB 2013). Lecture notes in bioinformatics 8213. Springer, Heidelberg. doi:10.1007/978-3-319-02624-4_8.

---. 2014. Product grammars for alignment and folding. IEEE/ACM Transactions on Computational Biology and Bioinformatics 12, no. 3. 12: 507–19. doi:10.1109/TCBB.2014.2326155.