Homology Search with Fragmented Nucleic Acid Sequence Patterns

Axel Mosig, Julian L. Chen, Peter F. Stadler


WABI 2007 (R. Giancarlo & S. Hannenhalli, eds.), Springer Verlag, Berlin, LNBI 4645, pp. 335-345 (2007).


The comprehensive annotation of non-coding RNAs in newly sequenced genomes is still a largely unsolved problem because many functional RNAs exhibit not only poorly conserved sequences but also large variability in structure. In many cases, such as Y RNAs, vault RNAs, or telomerase RNAs, sequences differ by large insertions or deletions and have only a few small sequence patterns in common. <br>Here we present <tt>fragrep2</tt>, a purely sequence-based approach to detect such patterns in complete genomes. A <tt>fragrep2</tt> pattern consists of an ordered list of position-specific weight matrices (PWMs) describing short, approximately conserved sequence elements that are separated by non-conserved regions of bounded length. The program uses a fractional programming approach to align the PWMs to genomic DNA, allowing for a bounded number of insertions and deletions in the patterns. The resulting PWM matches are then combined into significant combinations. At this step, a subset of PWMs may be deleted, i.e., have no match in the current region of the genome. The program furthermore estimates <i>p</i>- and <i>E</i>-values for the matches. <br>We apply <tt>fragrep2</tt> to homology searches for RNase MRP, unveiling two previously unidentified matches as well as reproducing the results of two previous surveys. Furthermore, we complement the picture of vertebrate vault RNAs, a class of ncRNAs that has not received much attention so far.
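
To make the notion of a fragmented pattern concrete, the following Python sketch scans a sequence for an ordered list of PWMs separated by gaps of bounded length. It is an illustration under simplifying assumptions, not the <tt>fragrep2</tt> implementation: it scores windows by plain log-odds against a uniform background, and it omits insertions and deletions within PWMs, the fractional programming step, deleted PWMs, and the <i>p</i>- and <i>E</i>-value estimates. All function names, the pseudocount, and the thresholds are hypothetical.

```python
from math import log

BASES = "ACGT"

def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Turn per-column base counts into a log-odds PWM (uniform background assumed)."""
    pwm = []
    for col in counts:
        total = sum(col.get(b, 0) for b in BASES) + 4 * pseudocount
        pwm.append({b: log(((col.get(b, 0) + pseudocount) / total) / background)
                    for b in BASES})
    return pwm

def pwm_hits(seq, pwm, threshold):
    """Return start positions of ungapped windows scoring at least `threshold`."""
    width = len(pwm)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(c not in BASES for c in window):
            continue  # skip windows containing ambiguous symbols such as N
        score = sum(col[c] for col, c in zip(pwm, window))
        if score >= threshold:
            hits.append(start)
    return hits

def fragmented_matches(seq, blocks):
    """Find chains of PWM hits that respect the block order and the gap bounds.

    `blocks` is a list of (pwm, threshold, min_gap, max_gap) tuples. The gap
    bounds give the allowed distance to the end of the preceding block and are
    ignored for the first block. Each chain is a list of (start, end) pairs
    with `end` exclusive.
    """
    seq = seq.upper()
    chains = None
    for pwm, threshold, min_gap, max_gap in blocks:
        width = len(pwm)
        hits = pwm_hits(seq, pwm, threshold)
        if chains is None:
            chains = [[(h, h + width)] for h in hits]
        else:
            chains = [chain + [(h, h + width)]
                      for chain in chains
                      for h in hits
                      if min_gap <= h - chain[-1][1] <= max_gap]
    return chains or []

def consensus_counts(word, n=10):
    """Hypothetical helper: counts for a PWM built from n copies of `word`."""
    return [{base: n} for base in word]

if __name__ == "__main__":
    first = log_odds_pwm(consensus_counts("GACG"))
    second = log_odds_pwm(consensus_counts("CATG"))
    genome = "TTGACGTTTTTTCCATGAA"
    # Second block must start 5 to 10 bases after the first block ends.
    print(fragmented_matches(genome, [(first, 4.0, 0, 0), (second, 4.0, 5, 10)]))
```

In this toy setting the threshold of 4.0 admits only exact matches to each four-base consensus; lowering it would admit approximate matches, which is the role the PWMs play in the pattern definition above.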