Publications of Sebastian Will

Hosna Jabbari, Ian Wark, Carlo Montemagno, and Sebastian Will. Knotty: efficient and accurate prediction of complex RNA pseudoknot structures. Bioinformatics, 34(22):3849–3856, 2018. doi:10.1093/bioinformatics/bty420
Motivation: The computational prediction of RNA secondary structure by free energy minimization has become an important tool in RNA research. However in practice, energy minimization is mostly limited to pseudoknot-free structures or rather simple pseudoknots, not covering many biologically important structures such as kissing hairpins. Algorithms capable of predicting sufficiently complex pseudoknots (for sequences of length n) used to have extreme complexities, e.g. Pknots has O(n6) time and O(n4) space complexity. The algorithm CCJ dramatically improves the asymptotic run time for predicting complex pseudoknots (handling almost all relevant pseudoknots, while being slightly less general than Pknots), but this came at the cost of large constant factors in space and time, which strongly limited its practical application ( approximately 200 bases already require 256 GB space). Results: We present a CCJ-type algorithm, Knotty, that handles the same comprehensive pseudoknot class of structures as CCJ with improved space complexity of Theta(n3+Z)-due to the applied technique of sparsification, the number of 'candidates', Z, appears to grow significantly slower than n4 on our benchmark set (which include pseudoknotted RNAs up to 400 nt). In terms of run time over this benchmark, Knotty clearly outperforms Pknots and the original CCJ implementation, CCJ 1.0; Knotty's space consumption fundamentally improves over CCJ 1.0, being on a par with the space-economic Pknots. By comparing to CCJ 2.0, our unsparsified Knotty variant, we demonstrate the isolated effect of sparsification. Moreover, Knotty employs the state-of-the-art energy model of 'HotKnots DP09', which results in superior prediction accuracy over Pknots. Availability and implementation: Our software is available at https://github.com/HosnaJabbari/Knotty. Supplementary information: Supplementary data are available at Bioinformatics online.
Stefan Hammer, Yann Ponty, Wei Wang, and Sebastian Will. Fixed-Parameter Tractable Sampling for Rna Design with Multiple Target Structures. In Ben Raphael, editor, Proc. of the 22th Annual International Conferences on Computational Molecular Biology (RECOMB'18), page 2, 2018. [Preprint]
The design of multi-stable RNA molecules has important applications in biology, medicine, and biotechnology. Synthetic design approaches largely benefit from effective in-silico methods, which can tremendously impact their cost and feasibility. Here, we revisit a central ingredient of most in-silico design methods: the sampling of sequences for multi-target design. We establish the #P-hardness of uniform sampling, and introduce RNARedPrint, a tree decomposition-based algorithm for efficient, fixed parameter tractable sampling. By modeling the problem as a constraint network, RNARedPrint supports generic Boltzmann-weighted sampling for arbitrary additive RNA energy models; this enables generating designs meeting specific goals like expected free energies or GC-content. Finally, we empirically study general properties of the approach and generate biologically relevant multi-target Boltzmann-weighted designs for a common design benchmark. In particular, we show significant improvements over seed sequences generated by uniform sampling, the previously best available sampling strategy for multi-target design. Our software is freely available at: https://github.com/yannponty/RNARedPrint Keywords: Multi-target RNA Design, Counting Complexity, Boltzmann Sampling
Björn Grüning, Ryan Dale, Andreas Sjodin, Brad A. Chapman, Jillian Rowe, Christopher H. Tomkins-Tinch, Renan Valieris, Johannes Koster and The Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods, 15(7):475–476, 2018. doi:10.1038/s41592-018-0046-7
Martin Raden, Syed M. Ali, Omer S. Alkhnbashi, Anke Busch, Fabrizio Costa, Jason A. Davis, Florian Eggenhofer, Rick Gelhausen, Jens Georg, Steffen Heyne, Michael Hiller, Kousik Kundu, Robert Kleinkauf, Steffen C. Lott, Mostafa M. Mohamed, Alexander Mattheis, Milad Miladi, Andreas S. Richter, Sebastian Will, Joachim Wolff, Patrick R. Wright, and Rolf Backofen. Freiburg RNA tools: a central online resource for RNA-focused research and teaching. NAR, 46(W1):W25–W29, 2018. doi:10.1093/nar/gky329
The Freiburg RNA tools webserver is a well established online resource for RNA-focused research. It provides a unified user interface and comprehensive result visualization for efficient command line tools. The webserver includes RNA-RNA interaction prediction (IntaRNA, CopraRNA, metaMIR), sRNA homology search (GLASSgo), sequence-structure alignments (LocARNA, MARNA, CARNA, ExpaRNA), CRISPR repeat classification (CRISPRmap), sequence design (antaRNA, INFO-RNA, SECISDesign), structure aberration evaluation of point mutations (RaSE), and RNA/protein-family models visualization (CMV), and other methods. Open education resources offer interactive visualizations of RNA structure and RNA-RNA interaction prediction as well as basic and advanced sequence alignment algorithms. The services are freely available at http://rna.informatik.uni-freiburg.de.
Felix Kühnl, Peter F. Stadler, and Sebastian Will. Tractable Rna–ligand interaction kinetics. BMC Bioinformatics, 18(Suppl 12):47, 2017. doi:10.1186/s12859-017-1823-5
Background. The binding of small ligands to RNA elements can cause substantial changes in the RNA structure. This constitutes an important, fast-acting mechanism of ligand-controlled transcriptional and translational gene regulation implemented by a wide variety of riboswitches. The associated refolding processes often cannot be explained by thermodynamic effects alone. Instead, they are governed by the kinetics of RNA folding. While the computational analysis of RNA folding can make use of well-established models of the thermodynamics of RNA structures formation, RNA textendash RNA interaction, and RNA textendash ligand interaction, kinetic effects pose fundamentally more challenging problems due to the enormous size of the conformation space. The analysis of the combined process of ligand binding and structure formation even for small RNAs is plagued by intractably large state spaces. Moreover, the interaction is concentration-dependent and thus is intrinsically non-linear. This precludes the direct transfer of the strategies previously used for the analysis of RNA folding kinetics. Results. In our novel, computationally tractable approach to RNA textendash ligand kinetics, we overcome the two main difficulties by applying a gradient-based coarse graining to RNA textendash ligand systems and solving the process in a pseudo-first order approximation. The latter is well-justified for the most common case of ligand excess in RNA textendash ligand systems. We present the approach rigorously and discuss the parametrization of the model based on empirical data. The method supports the kinetic study of RNA textendash ligand systems, in particular at different ligand concentrations. As an example, we apply our approach to analyze the concentration dependence of the ligand response of the rationally designed, artificial theophylline riboswitch RS3. Conclusion. This work demonstrates the tractability of the computational analysis of RNA textendash ligand interaction. Naturally, the model will profit as more accurate measurements of folding and binding parameters become available. Due to this work, computational analysis is available to support tasks like the design of riboswitches; our analysis of RS3 suggests strong co-transcriptional effects for this riboswitch. Availability. The method used in this study is available online: www.bioinf.uni-leipzig.de/~felix/software/RLIkin. Keywords: RNA secondary structure prediction, RNA interaction kinetics, RNA textendash ligand interaction, Riboswitches.
Sven Findeiß, Maja Etzel, Sebastian Will, Mario Mörl, and Peter F Stadler. Design of Artificial Riboswitches as Biosensors. Sensors, 17(9):1990, 2017. doi:10.3390/s17091990
RNA aptamers readily recognize small organic molecules, polypeptides, as well as other nucleic acids in a highly specific manner. Many such aptamers have evolved as parts of regulatory systems in nature. Experimental selection techniques such as SELEX have been very successful in finding artificial aptamers for a wide variety of natural and synthetic ligands. Changes in structure and/or stability of aptamers upon ligand binding can propagate through larger RNA constructs and cause specific structural changes at distal positions. In turn, these may affect transcription, translation, splicing, or binding events. The RNA secondary structure model realistically describes both thermodynamic and kinetic aspects of RNA structure formation and refolding at a single, consistent level of modelling. Thus, this framework allows studying the function of natural riboswitches in silico. Moreover, it enables rationally designing artificial switches, combining essentially arbitrary sensors with a broad choice of read-out systems. Eventually, this approach sets the stage for constructing versatile biosensors. Keywords: aptamer; RNA structure; ligand binding; refolding; thermodynamics; rational design; folding kinetics
Jörg Fallmann, Sebastian Will, Jan Engelhardt, Björn Grüning, Rolf Backofen, and Peter F Stadler. Recent advances in Rna folding. Journal of Biotechnology, 2017. doi:10.1016/j.jbiotec.2017.07.007
In the realm of nucleic acid structures, secondary structure forms a conceptually important intermediate level of description and explains the dominating part of the free energy of structure formation. Secondary structures are well conserved over evolutionary time-scales and for many classes of RNAs evolve slower than the underlying primary sequences. Given the close link between structure and function, secondary structure is routinely used as a basis to explain experimental findings. Recent technological advances, finally, have made it possible to assay secondary structure directly using high throughput methods. From a computational biology point of view, secondary structures have a special role because they can be computed efficiently using exact dynamic programming algorithms. In this contribution we provide a short overview of RNA folding algorithms, recent additions and variations and address methods to align, compare, and cluster RNA structures, followed by a tabular summary of the most important software suites in the fields. Keywords: RNA secondary structure; RNA constraint folding; RNA interactions; RNA secondary structure comparison; RNA analysis
Gesine Domin, Sven Findeiß, Manja Wachsmuth, Sebastian Will, Peter F Stadler, and Mario Mörl. Applicability of a computational design approach for synthetic riboswitches. Nucleic acids research, 45(7):4108–4119, 2017. doi:10.1093/nar/gkw1267
Abstract Riboswitches have gained attention as tools for synthetic biology, since they enable researchers to reprogram cells to sense and respond to exogenous molecules. In vitro evolutionary approaches produced numerous RNA aptamers that bind such small ligands, but their conversion into functional riboswitches remains difficult. We previously developed a computational approach for the design of synthetic theophylline riboswitches based on secondary structure prediction. These riboswitches have been constructed to regulate ligand-dependent transcription termination in Escherichia coli. Here, we test the usability of this design strategy by applying the approach to tetracycline and streptomycin aptamers. The resulting tetracycline riboswitches exhibit robust regulatory properties in vivo. Tandem fusions of these riboswitches with theophylline riboswitches represent logic gates responding to two different input signals. In contrast, the conversion of the streptomycin aptamer into functional riboswitches appears to be difficult. Investigations of the underlying aptamer secondary structure revealed differences between in silico prediction and structure probing. We conclude that only aptamers adopting the minimal free energy (MFE) structure are suitable targets for construction of synthetic riboswitches with design approaches based on equilibrium thermodynamics of RNA structures. Further improvements in the design strategy are required to implement aptamer structures not corresponding to the calculated MFE state. Topic: gene expression galactosidase ligands logic protein structure, secondary streptomycin thermodynamics tetracycline theophylline rna escherichia coli free energy binding (molecular function) molecule
Björn A Grüning, Jörg Fallmann, Dilmurat Yusuf, Sebastian Will, Anika Erxleben, Florian Eggenhofer, Torsten Houwaart, Bérénice Batut, Pavankumar Videm, Andrea Bagnacani, and others. The Rna workbench: best practices for Rna and high-throughput sequencing bioinformatics in Galaxy. Nucleic Acids Research, 2017. doi:10.1093/nar/gkx409
RNA-based regulation has become a major research topic in molecular biology. The analysis of epigenetic and expression data is therefore incomplete if RNA-based regulation is not taken into account. Thus, it is increasingly important but not yet standard to combine RNA-centric data and analysis tools with other types of experimental data such as RNA-seq or ChIP-seq. Here, we present the RNA workbench, a comprehensive set of analysis tools and consolidated workflows that enable the researcher to combine these two worlds. Based on the Galaxy framework the workbench guarantees simple access, easy extension, flexible adaption to personal and security needs, and sophisticated analyses that are independent of command-line knowledge. Currently, it includes more than 50 bioinformatics tools that are dedicated to different research areas of RNA biology including RNA structure analysis, RNA alignment, RNA annotation, RNA-protein interaction, ribosome profiling, RNA-seq analysis and RNA target prediction. The workbench is developed and maintained by experts in RNA bioinformatics and the Galaxy framework. Together with the growing community evolving around this workbench, we are committed to keep the workbench up-to-date for future standards and needs, providing researchers with a reliable and robust framework for RNA data analysis. Availability: The RNA workbench is available at https://github.com/bgruening/galaxy-rna-workbench.
Hosna Jabbari, Ian Wark, Carlo Montemagno, and Sebastian Will. Sparsification Enables Predicting Kissing Hairpin Pseudoknot Structures of Long RNAs in Practice. In Russell Schwartz and Knut Reinert, editors, 17th International Workshop on Algorithms in Bioinformatics (WABI 2017), volume 88 of Leibniz International Proceedings in Informatics (LIPIcs), pages 12:1–12:13, Dagstuhl, Germany, 2017. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik. doi:10.4230/LIPIcs.WABI.2017.12
While computational RNA secondary structure prediction is an important tool in RNA research, it is still fundamentally limited to pseudoknot-free structures (or at best very simple pseudoknots) in practice. Here, we make the prediction of complex pseudoknots - including kissing hairpin structures - practically applicable by reducing the originally high space consumption. For this aim, we apply the technique of sparsification and other space-saving modifications to the recurrences of the pseudoknot prediction algorithm by Chen, Condon and Jabbari (CCJ algorithm). Thus, the theoretical space complexity of free energy minimization is reduced to Theta(n^3+Z), in the sequence length n and the number of non-optimally decomposable fragments ("candidates") Z. The sparsified CCJ algorithm, sparseCCJ, is presented in detail. Moreover, we provide and compare three generations of CCJ implementations, which continuously improve the space requirements: the original CCJ implementation, our first modified implementation, and our final sparsified implementation. The two latest implementations implement the established HotKnots DP09 energy model. In our experiments, using 244GB of RAM, the original CCJ implementation failed to handle sequences longer than 195 bases; sparseCCJ handles our pseudoknot data set (up to about length 400 bases) in this space limit. All three CCJ implementations are available at https://github.com/HosnaJabbari/CCJ. Keywords: RNA, secondary structure prediction, pseudoknots, space efficiency, sparsification
Felix Kühnl, Peter F. Stadler, and Sebastian Will. Tractable Kinetics of Rna–Ligand Interaction. In Anu Bourgeois, Pavel Skums, Xiang Wan, and Alex Zelikovsky, editors, Bioinformatics Research and Applications (ISBRA 2016), volume 9683 of LNBI, page 2. Springer International Publishing, June 2016. doi:10.1007/978-3-319-38782-6
The interaction of RNAs and their ligands strongly depends on folding kinetics and thus requires explanations that go beyond thermodynamic effects. Whereas the computational prediction of minimum energy secondary structures, and even RNA–RNA and RNA–ligand interactions, are well established, the analysis of their kinetics is still in its infancy. Due to enormous conformation spaces, the exact analysis of the combined processes of ligand binding and structure formation requires either the explicit modeling of an intractably large conformation space or—often debatable—simplifications. Moreover, concentration effects play a crucial role. This increases the complexity of modeling the interaction kinetics fundamentally over single molecule kinetics.
Francesco Righetti, Aaron M. Nuss, Christian Twittenhoff, Sascha Beele, Kristina Urban, Sebastian Will, Stephan H. Bernhart, Peter F. Stadler, Petra Dersch, and Franz Narberhaus. Temperature-responsive in vitro Rna structurome of Yersinia pseudotuberculosis. Proceedings of the National Academy of Sciences, 113(26):7237–7242, 2016. doi:10.1073/pnas.1523004113
RNA structures are fundamentally important for RNA function. Dynamic, condition-dependent structural changes are able to modulate gene expression as shown for riboswitches and RNA thermometers. By parallel analysis of RNA structures, we mapped the RNA structurome of Yersinia pseudotuberculosis at three different temperatures. This human pathogen is exquisitely responsive to host body temperature (37 textdegree C), which induces a major metabolic transition. Our analysis profiles the structure of more than 1,750 RNAs at 25 textdegree C, 37 textdegree C, and 42 textdegree C. Average mRNAs tend to be unstructured around the ribosome binding site. We searched for 5'-UTRs that are folded at low temperature and identified novel thermoresponsive RNA structures from diverse gene categories. The regulatory potential of 16 candidates was validated. In summary, we present a dynamic bacterial RNA structurome and find that the expression of virulence-relevant functions in Y. pseudotuberculosis and reprogramming of its metabolism in response to temperature is associated with a restructuring of numerous mRNAs.
Sebastian Will and Hosna Jabbari. Sparse RNA folding revisited: space-efficient minimum free energy structure prediction. Algorithms Mol Biol, 11(7):13, 2016. doi:10.1186/s13015-016-0071-y
BACKGROUND: RNA secondary structure prediction by energy minimization is the central computational tool for the analysis of structural non-coding RNAs and their interactions. Sparsification has been successfully applied to improve the time efficiency of various structure prediction algorithms while guaranteeing the same result; however, for many such folding problems, space efficiency is of even greater concern, particularly for long RNA sequences. So far, space-efficient sparsified RNA folding with fold reconstruction was solved only for simple base-pair-based pseudo-energy models. RESULTS: Here, we revisit the problem of space-efficient free energy minimization. Whereas the space-efficient minimization of the free energy has been sketched before, the reconstruction of the optimum structure has not even been discussed. We show that this reconstruction is not possible in trivial extension of the method for simple energy models. Then, we present the time- and space-efficient sparsified free energy minimization algorithm SparseMFEFold that guarantees MFE structure prediction. In particular, this novel algorithm provides efficient fold reconstruction based on dynamically garbage-collected trace arrows. The complexity of our algorithm depends on two parameters, the number of candidates Z and the number of trace arrows T; both are bounded by [Formula: see text], but are typically much smaller. The time complexity of RNA folding is reduced from [Formula: see text] to [Formula: see text]; the space complexity, from [Formula: see text] to [Formula: see text]. Our empirical results show more than 80% space savings over RNAfold [Vienna RNA package] on the long RNAs from the RNA STRAND database (>=2500 bases). CONCLUSIONS: The presented technique is intentionally generalizable to complex prediction algorithms; due to their high space demands, algorithms like pseudoknot prediction and RNA-RNA-interaction prediction are expected to profit even stronger than "standard" MFE folding. SparseMFEFold is free software, available at http://www.bioinf.uni-leipzig.de/~will/Software/SparseMFEFold.
Sarah Juliane Berkemer, Christian Höner zu Siederdissen, Fabian Amman, Axel Wintsche, Sebastian Will, Ivo Hofacker, Sonja Prohaska, and Peter Stadler. Processed Small RNAs in Archaea and BHB Elements. Genomics and Computational Biology, 1(1):e18, 2015. doi:10.18547/gcb.2015.vol1.iss1.e18
Bulge-helix-bulge (BHB) elements guide the enzymatic splicing machinery that in Archaea excises introns from tRNAs, rRNAs from their primary precursor, and accounts for the assembly of piece-wise encoded tRNAs. This processing pathway renders the intronic sequences as circularized RNA species. Although archaeal transcriptomes harbor a large number of circular small RNAs, it remains unknown whether most or all of them are produced through BHB-dependent splicing. We therefore conduct a genome-wide survey of BHB elements of a phylogenetically diverse set of archaeal species and complement this approach by searching for BHB-like structures in the vicinity of circularized transcripts. We find that besides tRNA introns, the majority of box C/D snoRNAs is associated with BHB elements. Not all circularized sRNAs, however, can be explained by BHB elements, suggesting that there is at least one other mechanism of RNA circularization at work in Archaea. Pattern search methods were unable, however, to identify common sequence and/or secondary structure features that could be characteristic for such a mechanism. Keywords: Circular RNA; Archaea; splicing; structure-based motif search
Sebastian Will and Hosna Jabbari. Sparse Rna folding revisited: space-efficient minimum free energy prediction. In Mihai Pop and Hélène Touzet, editors, Proceedings of the 15th Workshop on Algorithms in Bioinformatics (WABI 2015), volume 9289 of LNCS, pages 257–270. Springer Berlin Heidelberg, 2015. doi:10.1007/978-3-662-48221-6_19
RNA secondary structure prediction by energy minimization is the central computational tool for the analysis of structural non-coding RNAs and their interactions. Sparsification has been successfully applied to improve the time efficiency of various structure prediction algorithms while guaranteeing the same result; however, for many such folding problems, space efficiency is of even greater concern, in particular for long RNAs and complex folding algorithms. So far, space-efficient sparsified RNA folding with fold reconstruction was solved only for simple pseudo-energy models. Here, we revisit the problem of space-efficient free energy minimization. Whereas the space-efficient minimization of the free energy has been sketched before, the reconstruction of the optimum structure has not even been discussed. We show that this reconstruction is not possible in trivial extension of the method for simple energy models. Then, we present the time- and space-efficient sparsified free energy minimization algorithm SparseMFEFold, which guarantees optimal structure prediction. In particular, this novel algorithm provides efficient fold reconstruction based on dynamically garbage collected trace arrows. We provide theoretical and empirical results on the efficiency of the method. SparseMFEFold is free software, available at http://www.bioinf.uni-leipzig.de/~will/Software/SparseMFEFold. Keywords: space efficient sparsification, pseudoknot-free RNA folding, RNA secondary structure prediction
Sebastian Will, Christina Otto, Milad Miladi, Mathias Mohl, and Rolf Backofen. SPARSE: quadratic time simultaneous alignment and folding of RNAs without sequence-based heuristics. Bioinformatics, 31(15):2489–2496, 2015. doi:10.1093/bioinformatics/btv185
MOTIVATION: RNA-Seq experiments have revealed a multitude of novel ncRNAs. The gold standard for their analysis based on simultaneous alignment and folding suffers from extreme time complexity of [Formula: see text]. Subsequently, numerous faster 'Sankoff-style' approaches have been suggested. Commonly, the performance of such methods relies on sequence-based heuristics that restrict the search space to optimal or near-optimal sequence alignments; however, the accuracy of sequence-based methods breaks down for RNAs with sequence identities below 60%. Alignment approaches like LocARNA that do not require sequence-based heuristics, have been limited to high complexity ([Formula: see text] quartic time). RESULTS: Breaking this barrier, we introduce the novel Sankoff-style algorithm 'sparsified prediction and alignment of RNAs based on their structure ensembles (SPARSE)', which runs in quadratic time without sequence-based heuristics. To achieve this low complexity, on par with sequence alignment algorithms, SPARSE features strong sparsification based on structural properties of the RNA ensembles. Following PMcomp, SPARSE gains further speed-up from lightweight energy computation. Although all existing lightweight Sankoff-style methods restrict Sankoff's original model by disallowing loop deletions and insertions, SPARSE transfers the Sankoff algorithm to the lightweight energy model completely for the first time. Compared with LocARNA, SPARSE achieves similar alignment and better folding quality in significantly less time (speedup: 3.7). At similar run-time, it aligns low sequence identity instances substantially more accurate than RAF, which uses sequence-based heuristics. AVAILABILITY AND IMPLEMENTATION: SPARSE is freely available at http://www.bioinf.uni-freiburg.de/Software/SPARSE. CONTACT: backofen@informatik.uni-freiburg.de Supplementary information: Supplementary data are available at Bioinformatics online.
Christina Otto, Mathias Möhl, Steffen Heyne, Mika Amit, Gad M. Landau, Rolf Backofen, and Sebastian Will. ExpaRNA-P: simultaneous exact pattern matching and folding of RNAs. BMC Bioinformatics, 15(1):404, 2014. doi:10.1186/s12859-014-0404-0
Background: Identifying sequence-structure motifs common to two RNAs can speed up the comparison of structural RNAs substantially. The core algorithm of the existent approach ExpaRNA solves this problem for a priori known input structures. However, such structures are rarely known; moreover, predicting them computationally is no rescue, since single sequence structure prediction is highly unreliable. Results: The novel algorithm ExpaRNA-P computes exactly matching sequence-structure motifs in entire Boltzmann-distributed structure ensembles of two RNAs; thereby we match and fold RNAs simultaneously, analogous to the well-known inverted question marksimultaneous alignment and folding inverted question mark of RNAs. While this implies much higher flexibility compared to ExpaRNA, ExpaRNA-P has the same very low complexity (quadratic in time and space), which is enabled by its novel structure ensemble-based sparsification. Furthermore, we devise a generalized chaining algorithm to compute compatible subsets of ExpaRNA-P inverted question marks sequence-structure motifs. Resulting in the very fast RNA alignment approach ExpLoc-P, we utilize the best chain as anchor constraints for the sequence-structure alignment tool LocARNA. ExpLoc-P is benchmarked in several variants and versus state-of-the-art approaches. In particular, we formally introduce and evaluate strict and relaxed variants of the problem; the latter makes the approach sensitive to compensatory mutations. Across a benchmark set of typical non-coding RNAs, ExpLoc-P has similar accuracy to LocARNA but is four times faster (in both variants), while it achieves a speed-up over 30-fold for the longest benchmark sequences ( inverted question mark400nt). Finally, different ExpLoc-P variants enable tailoring of the method to specific application scenarios. ExpaRNA-P and ExpLoc-P are distributed as part of the LocARNA package. The source code is freely available at http://www.bioinf.uni-freiburg.de/Software/ExpaRNA-P. Conclusions: ExpaRNA-P inverted question marks novel ensemble-based sparsification reduces its complexity to quadratic time and space. Thereby, ExpaRNA-P significantly speeds up sequence-structure alignment while maintaining the alignment quality. Different ExpaRNA-P variants support a wide range of applications.
Mika Amit, Rolf Backofen, Steffen Heyne, Gad M. Landau, Mathias Möhl, Christina Otto, and Sebastian Will. Local Exact Pattern Matching for Non-Fixed RNA Structures. IEEE/ACM Trans. Comput. Biology Bioinform., 11(1):219–230, 2014. doi:10.1109/TCBB.2013.2297113
Detecting local common sequence-structure regions of RNAs is a biologically important problem. Detecting such regions allows biologists to identify functionally relevant similarities between the inspected molecules. We developed dynamic programming algorithms for finding common structure-sequence patterns between two RNAs. The RNAs are given by their sequence and a set of potential base pairs with associated probabilities. In contrast to prior work on local pattern matching of RNAs, we support the breaking of arcs. This allows us to add flexibility over matching only fixed structures; potentially matching only a similar subset of specified base pairs. We present an O(n³) algorithm for local exact pattern matching between two nested RNAs, and an O(n³ log n) algorithm for one nested RNA and one bounded-unlimited RNA. In addition, an algorithm for approximate pattern matching is introduced that for two given nested RNAs and a number k, finds the maximal local pattern matching score between the two RNAs with at most k mismatches in O(n³k²) time. Finally, we present an O(n³) algorithm for finding the most similar subforest between two nested RNAs.
Sebastian Will and Peter F. Stadler. A common Framework for Linear and Cyclic Multiple Sequence Alignment Problems. In Dan Brown and Burkhard Morgenstern, editors, Proceedings of Algorithms in Bioinformatics (WABI'14), volume 8701 of LNCS, pages 135–147. Springer Berlin Heidelberg, 2014. doi:10.1007/978-3-662-44753-6_11
Circularized RNAs have received considerable attention is the last few years following the discovery that they are not only a rather common phenomenon in the transcriptomes of Eukarya and Archaea but also may have key regulatory functions. This calls for the adaptation of basic tools of sequence analysis to accommodate cyclic sequences. Here we discuss a common formal framework for linear and circular alignments as partitions that preserve (cyclic) order. We focus on the similarities and differences and describe a prototypical ILP formulation.
Jerome Waldispuhl, Charles W. O'Donnell, Sebastian Will, Srinivas Devadas, Rolf Backofen, and Bonnie Berger. Simultaneous Alignment and Folding of Protein Sequences. J Comput Biol, 21(7):477–491, 2014. doi:10.1089/cmb.2013.0163
Abstract Accurate comparative analysis tools for low-homology proteins remains a difficult challenge in computational biology, especially sequence alignment and consensus folding problems. We present partiFold-Align, the first algorithm for simultaneous alignment and consensus folding of unaligned protein sequences; the algorithm's complexity is polynomial in time and space. Algorithmically, partiFold-Align exploits sparsity in the set of super-secondary structure pairings and alignment candidates to achieve an effectively cubic running time for simultaneous pairwise alignment and folding. We demonstrate the efficacy of these techniques on transmembrane beta-barrel proteins, an important yet difficult class of proteins with few known three-dimensional structures. Testing against structurally derived sequence alignments, partiFold-Align significantly outperforms state-of-the-art pairwise and multiple sequence alignment tools in the most difficult low-sequence homology case. It also improves secondary structure prediction where current approaches fail. Importantly, partiFold-Align requires no prior training. These general techniques are widely applicable to many more protein families (partiFold-Align is available at http://partifold.csail.mit.edu/ ).
Hugo Schneider, Sebastian Bartschat, Gero Doose, Lucas Maciel, Erick Pizani, Marcelo Bassani, Fernando Araripe Torres, Sebastian Will, Tainá Raiol, Marcelo Brígido, Maria Emília Walter, and Peter Stadler. Genome-Wide Identification of Non-coding Rnas in Komagatella pastoris str. Gs115. In Sérgio Campos, editor, Advances in Bioinformatics and Computational Biology, volume 8826 of Lecture Notes in Computer Science, pages 115–122. Springer International Publishing, 2014. doi:10.1007/978-3-319-12418-6_15
Sita J. Lange, Omer S. Alkhnbashi, Dominic Rose, Sebastian Will, and Rolf Backofen. CRISPRmap: an automated classification of repeat conservation in prokaryotic adaptive immune systems. Nucleic Acids Res, 41(17):8034–44, 2013. SJL, OSA and DR contributed equally to this work. doi:10.1093/nar/gkt606
Central to Clustered Regularly Interspaced Short Palindromic Repeat (CRISPR)-Cas systems are repeated RNA sequences that serve as Cas-protein-binding templates. Classification is based on the architectural composition of associated Cas proteins, considering repeat evolution is essential to complete the picture. We compiled the largest data set of CRISPRs to date, performed comprehensive, independent clustering analyses and identified a novel set of 40 conserved sequence families and 33 potential structure motifs for Cas-endoribonucleases with some distinct conservation patterns. Evolutionary relationships are presented as a hierarchical map of sequence and structure similarities for both a quick and detailed insight into the diversity of CRISPR-Cas systems. In a comparison with Cas-subtypes, I-C, I-E, I-F and type II were strongly coupled and the remaining type I and type III subtypes were loosely coupled to repeat and Cas1 evolution, respectively. Subtypes with a strong link to CRISPR evolution were almost exclusive to bacteria; nevertheless, we identified rare examples of potential horizontal transfer of I-C and I-E systems into archaeal organisms. Our easy-to-use web server provides an automated assignment of newly sequenced CRISPRs to our classification system and enables more informed choices on future hypotheses in CRISPR-Cas research: http://rna.informatik.uni-freiburg.de/CRISPRmap.
Sebastian Will, Michael Yu, and Bonnie Berger. Structure-based whole-genome realignment reveals many novel noncoding RNAs. Genome Res, 23(6):1018–1027, 2013. doi:10.1101/gr.137091.111
Recent genome-wide computational screens that search for conservation of RNA secondary structure in whole-genome alignments (WGAs) have predicted thousands of structural noncoding RNAs (ncRNAs). The sensitivity of such approaches, however, is limited, due to their reliance on sequence-based whole-genome aligners, which regularly misalign structural ncRNAs. This suggests that many more structural ncRNAs may remain undetected. Structure-based alignment, which could increase the sensitivity, has been prohibitive for genome-wide screens due to its extreme computational costs. Breaking this barrier, we present the pipeline REAPR (RE-Alignment for Prediction of structural ncRNA), which efficiently realigns whole genomes based on RNA sequence and structure, thus allowing us to boost the performance of de novo ncRNA predictors, such as RNAz. Key to the pipeline's efficiency is the development of a novel banding technique for multiple RNA alignment. REAPR significantly outperforms the widely used predictors RNAz and EvoFold in genome-wide screens; in direct comparison to the most recent RNAz screen on D. melanogaster, REAPR predicts twice as many high-confidence ncRNA candidates. Moreover, modENCODE RNA-seq experiments confirm a substantial number of its predictions as transcripts. REAPR's advancement of de novo structural characterization of ncRNAs complements the identification of transcripts from rapidly accumulating RNA-seq data.
Sebastian Will, Christina Schmiedl, Milad Miladi, Mathias Möhl, and Rolf Backofen. SPARSE: Quadratic Time Simultaneous Alignment and Folding of RNAs Without Sequence-Based Heuristics. In Minghua Deng, Rui Jiang, Fengzhu Sun, and Xuegong Zhang, editors, Proceedings of the 17th International Conference on Research in Computational Molecular Biology (RECOMB 2013), volume 7821 of LNCS, pages 289–290. Springer Berlin Heidelberg, 2013. doi:10.1007/978-3-642-37195-0_28
Motivation: There is increasing evidence of pervasive transcription, resulting in hundreds of thousands of ncRNAs of unknown function. Standard computational analysis tasks for inferring functional annotations like clustering require fast and accurate RNA comparisons based on sequence and structure similarity. The gold standard for the latter is Sankoffâ texteuro texttrademark s algorithm, which simultaneously aligns and folds RNAs. Because of its extreme time complexity of O(n⁶), numerous faster "Sankoff-style" approaches have been suggested. Several such approaches introduce heuristics based on sequence alignment, which compromises the alignment quality for RNAs with sequence identities below 60%. Avoiding such heuristics, as e.g. in LocARNA, has been assumed to prohibit time complexities better than O(n⁴), which strongly limits large-scale applications. Results: Breaking this barrier, we introduce SPARSE (Sparse Prediction and Alignment of RNAs using Structure Ensembles), a novel quadratic time Sankoff-style approach that does not rely on sequence-based heuristics but employs structural properties of RNA ensembles; its O(n²) complexity matches the one of sequence alignment. The approach is based on a novel lightweight Sankoff-style alignment model, for which we introduce the algorithm PARSE. For the first time it transfers the Sankoff-model completely to a lightweight energy model; thus, it is more expressive than all previous lightweight methods, which inherit the PMcomp model. In comparison to LocARNA and similar approaches, the novel model enables much stronger sparsification based on the RNA structure ensemble; consequently, SPARSE aligns and folds RNAs with similar alignment and better folding quality in significantly less time. Finally, SPARSE aligns ncRNAs from the challenging low sequence identity region more accurately than tools relying on sequence-based heuristics. Conclusion: Our results indicate that a complete lightweight Sankoff-style model with stronger sparsification can increase the performance and accuracy of RNA alignment, where the potential of the model points far beyond the studied prototype. Not falling back on sequence comparison, SPARSE suggests itself for large scale similarity assessment of RNAs with moderate to very low sequence identity.
Fabian Amman, Stephan H. Bernhart, Gero Doose, Ivo L. Hofacker, Jing Qin, Peter F. Stadler, and Sebastian Will. The Trouble with Long-Range Base Pairs in Rna Folding. In João C. Setubal and Nalvo F. Almeida, editors, Advances in Bioinformatics and Computational Biology, volume 8213 of Lecture Notes in Computer Science, pages 1–11. Springer International Publishing, 2013. doi:10.1007/978-3-319-02624-4_1
RNA prediction has long been struggling with long-range base pairs since prediction accuracy decreases with base pair span. We analyze here the empirical distribution of base pair spans in large collection of experimentally known RNA structures. Surprisingly, we find that long-range base pairs are overrepresented in these data. In particular, there is no evidence that long-range base pairs are systematically overpredicted relative to short-range interactions in thermodynamic predictions. This casts doubt on a recent suggestion that kinetic effects are the cause of length-dependent decrease of predictability. Instead of a modification of the energy model we advocate a modification of the expected accuracy model for RNA secondary structures. We demonstrate that the inclusion of a span-dependent penalty leads to improved maximum expected accuracy structure predictions compared to both the standard MEA model and a modiS,ed folding algorithm with an energy penalty function. The prevalence of long-range base pairs provide further evidence that RNA structures in general do not have the so-called polymer zeta property. This has consequences for the asymptotic performance for a large class of sparsified RNA folding algorithms.
Sebastian Will, Tejal Joshi, Ivo L. Hofacker, Peter F. Stadler, and Rolf Backofen. LocARNA-P: Accurate boundary prediction and improved detection of structural RNAs. RNA, 18(5):900–14, 2012. doi:10.1261/rna.029041.111
Current genomic screens for noncoding RNAs (ncRNAs) predict a large number of genomic regions containing potential structural ncRNAs. The analysis of these data requires highly accurate prediction of ncRNA boundaries and discrimination of promising candidate ncRNAs from weak predictions. Existing methods struggle with these goals because they rely on sequence-based multiple sequence alignments, which regularly misalign RNA structure and therefore do not support identification of structural similarities. To overcome this limitation, we compute columnwise and global reliabilities of alignments based on sequence and structure similarity; we refer to these structure-based alignment reliabilities as STARs. The columnwise STARs of alignments, or STAR profiles, provide a versatile tool for the manual and automatic analysis of ncRNAs. In particular, we improve the boundary prediction of the widely used ncRNA gene finder RNAz by a factor of 3 from a median deviation of 47 to 13 nt. Post-processing RNAz predictions, LocARNA-P's STAR score allows much stronger discrimination between true- and false-positive predictions than RNAz's own evaluation. The improved accuracy, in this scenario increased from AUC 0.71 to AUC 0.87, significantly reduces the cost of successive analysis steps. The ready-to-use software tool LocARNA-P produces structure-based multiple RNA alignments with associated columnwise STARs and predicts ncRNA boundaries. We provide additional results, a web server for LocARNA/LocARNA-P, and the software package, including documentation and a pipeline for refining screens for structural ncRNA, at http://www.bioinf.uni-freiburg.de/Supplements/LocARNA-P/.
Stefan Washietl, Sebastian Will, David A. Hendrix, Loyal A. Goff, John L. Rinn, Bonnie Berger, and Manolis Kellis. Computational analysis of noncoding RNAs. Wiley Interdiscip Rev RNA, 3(6):759–78, 2012. doi:10.1002/wrna.1134
Noncoding RNAs have emerged as important key players in the cell. Understanding their surprisingly diverse range of functions is challenging for experimental and computational biology. Here, we review computational methods to analyze noncoding RNAs. The topics covered include basic and advanced techniques to predict RNA structures, annotation of noncoding RNAs in genomic data, mining RNA-seq data for novel transcripts and prediction of transcript structures, computational aspects of microRNAs, and database resources. These authors contributed equally WIREs RNA 2012. doi: 10.1002/wrna.1134 For further resources related to this article, please visit the WIREs website.
Dragos A. Sorescu, Mathias Möhl, Martin Mann, Rolf Backofen, and Sebastian Will. CARNA - alignment of RNA structure ensembles. Nucleic Acids Res, 40(W1):W49–W53, 2012. DAS, MMö, and MMa contributed equally to this work. doi:10.1093/nar/gks491
Due to recent algorithmic progress, tools for the gold standard of comparative RNA analysis, namely Sankoff-style simultaneous alignment and folding, are now readily applicable. Such approaches, however, compare RNAs with respect to a simultaneously predicted, single, nested consensus structure. To make multiple alignment of RNAs available in cases, where this limitation of the standard approach is critical, we introduce a web server that provides a complete and convenient interface to the RNA structure alignment tool 'CARNA'. This tool uniquely supports RNAs with multiple conserved structures per RNA and aligns pseudoknots intrinsically; these features are highly desirable for aligning riboswitches, RNAs with conserved folding pathways, or pseudoknots. We represent structural input and output information as base pair probability dot plots; this provides large flexibility in the input, ranging from fixed structures to structure ensembles, and enables immediate visual analysis of the results. In contrast to conventional Sankoff-style approaches, 'CARNA' optimizes all structural similarities in the input simultaneously, for example across an entire RNA structure ensemble. Even compared with already costly Sankoff-style alignment, 'CARNA' solves an intrinsically much harder problem by applying advanced, constraint-based, algorithmic techniques. Although 'CARNA' is specialized to the alignment of RNAs with several conserved structures, its performance on RNAs in general is on par with state-of-the-art general-purpose RNA alignment tools, as we show in a Bralibase 2.1 benchmark. The web server is freely available at http://rna.informatik.uni-freiburg.de/CARNA.
Christina Schmiedl, Mathias Möhl, Steffen Heyne, Mika Amit, Gad M. Landau, Sebastian Will, and Rolf Backofen. Exact Pattern Matching for RNA Structure Ensembles. In Benny Chor, editor, Proceedings of the 16th International Conference on Research in Computational Molecular Biology (RECOMB 2012), volume 7262 of LNCS, pages 245–260. Springer-Verlag, 2012. doi:10.1007/978-3-642-29627-7_27
ExpaRNA's core algorithm computes, for two fixed RNA structures, a maximal non-overlapping set of maximal exact matchings. We introduce an algorithm ExpaRNA-P that solves the lifted problem of finding such sets of exact matchings in entire Boltzmann-distributed structure ensembles of two RNAs. Due to a novel kind of structural sparsification, the new algorithm maintains the time and space complexity of the algorithm for fixed input structures. Furthermore, we generalized the chaining algorithm of ExpaRNA in order to compute a compatible subset of ExpaRNA-P's exact matchings.We show that ExpaRNA-P outperforms ExpaRNA in BRAliBase 2.1 benchmarks, where we pass the chained exact matchings as anchor constraints to the RNA alignment tool LocARNA. Compared to LocARNA, this novel approach shows similar accuracy but is six times faster.
Mika Amit, Rolf Backofen, Steffen Heyne, Gad M. Landau, Mathias Möhl, Christina Schmiedl, and Sebastian Will. Local Exact Pattern Matching for Non-fixed RNA Structures. In Juha Kärkkäinen and Jens Stoye, editors, Proceedings of the 23th Annual Symposium on Combinatorial Pattern Matching (CPM 2012), volume 7354 of LNCS, pages 306–320. Springer-Verlag, 2012. doi:10.1007/978-3-642-31265-6_25
Detecting local common sequence-structure regions of RNAs is a biologically meaningful problem. By detecting such regions, biologists are able to identify functional similarity between the inspected molecules. We developed dynamic programming algorithms for finding common structure-sequence patterns between two RNAs. The RNAs are given by their sequence and a set of potential base pairs with associated probabilities. In contrast to prior work which matches fixed structures, we support the arc breaking edit operation; this allows to match only a subset of the given base pairs. We present an O(n³) algorithm for local exact pattern matching between two nested RNAs, and an O(n³ logn) algorithm for one nested RNA and one bounded-unlimited RNA.
Sebastian Will, Michael Yu, and Bonnie Berger. Structure-based Whole Genome Realignment Reveals Many Novel Non-coding RNAs. In Benny Chor, editor, Proceedings of the 16th International Conference on Research in Computational Molecular Biology (RECOMB 2012), volume 7262 of LNCS, page 341. Springer-Verlag, 2012. doi:10.1007/978-3-642-29627-7_35
Recent genome-wide computational screens that search for conservation of RNA secondary structure in whole genome alignments (WGAs) have predicted thousands of structural non-coding RNAs (ncRNAs). The sensitivity of such approaches, however, is limited due to their reliance on sequence-based whole- genome aligners, which regularly misalign structural ncRNAs. This suggests that many more structural ncRNAs may remain undetected. Structure-based align- ment, which could increase the sensitivity, has been prohibitive for genome- wide screens due to its extreme computational costs. Breaking this barrier, we present the pipeline REAPR (RE-Alignment for de novo Prediction of structural ncRNA) that realigns whole genomes based on RNA sequence and structure and then evaluates the realignments for potential structural ncRNAs with a ncRNA predictor such as RNAz 2.0. For efficiency of the pipeline, we develop a novel banding realignment algorithm for the RNA multiple alignment tool LocARNA. This allows us to perform very fast structure-based realignment within limited deviation of the original multiple alignment from the WGA. We apply REAPR to the complete twelve Drosophila WGAs to predict ncRNAs across all these Drosophila species. Compared to direct prediction from the original WGA at the same False Discovery Rate (FDR), we predict twice as many high-confidence ncRNA candidates in D.melanogaster while less than doubling the run-time. As a novelty in ncRNA prediction, we control the FDR, going beyond the usual a posteriori FDR estimation. Applying the sequence-based alignment tool Muscle for realignment, we demonstrate that structure-based methods are necessary for effective prediction of originally misaligned ncRNAs. Comparing to recent screens of Drosophila and ENCODE we show that REAPR outperforms the widely-used de novo predictors RNAz, EvoFold, and CMfinder. Finally, we reveal, with high confidence, a putative structural motif in the long ncRNA roX1 of D.melanogaster, known to regulate X chromosome dosage compensation in male ies. Interestingly, we recapitulate the Drosophila phylogeny, based on co-predicted ncRNAs across all genomes.
Fernando Meyer, Stefan Kurtz, Rolf Backofen, Sebastian Will, and Michael Beckstette. Structator: fast index-based search for RNA sequence-structure patterns. BMC Bioinformatics, 12(1):214, 2011. doi:10.1186/1471-2105-12-214
ABSTRACT: BACKGROUND: The secondary structure of RNA molecules is intimately related to their function and often more conserved than the sequence. Hence, the important task of searching databases for RNAs requires to match sequence-structure patterns. Unfortunately, current tools for this task have, in the best case, a running time that is only linear in the size of sequence databases. Furthermore, established index data structures for fast sequence matching, like suffix trees or arrays, cannot benefit from the complementarity constraints introduced by the secondary structure of RNAs. RESULTS: We present a novel method and readily applicable software for time efficient matching of RNA sequence-structure patterns in sequence databases. Our approach is based on affix arrays, a recently introduced index data structure, preprocessed from the target database. Affix arrays support bidirectional pattern search, which is required for efficiently handling the structural constraints of the pattern. Structural patterns like stem-loops can be matched inside out, such that the loop region is matched first and then the pairing bases on the boundaries are matched consecutively. This allows to exploit base pairing information for search space reduction and leads to an expected running time that is sublinear in the size of the sequence database. The incorporation of a new chaining approach in the search of RNA sequence-structure patterns enables the description of molecules folding into complex secondary structures with multiple ordered patterns. The chaining approach removes spurious matches from the set of intermediate results, in particular of patterns with little specificity. In benchmark experiments on the Rfam database, our method runs up to two orders of magnitude faster than previous methods. CONCLUSIONS: The presented method's sublinear expected running time makes it well suited for RNA sequence-structure pattern matching in large sequence databases. RNA molecules containing several stem-loop substructures can be described by multiple sequence-structure patterns and their matches are efficiently handled by a novel chaining method. Beyond our algorithmic contributions, we provide with Structator a complete and robust open-source software solution for index-based search of RNA sequence-structure patterns. The Structator software is available at http://www.zbh.uni-hamburg.de/Structator.
Sushmita Roy, Jason Ernst, Peter V. Kharchenko, Pouya Kheradpour, Nicolas Negre, Matthew L. Eaton, Jane M. Landolin, Christopher A. Bristow, Lijia Ma, Michael F. Lin, Stefan Washietl, Bradley I. Arshinoff, Ferhat Ay, Patrick E. Meyer, Nicolas Robine, Nicole L. Washington, Luisa Di Stefano, Eugene Berezikov, Christopher D. Brown, Rogerio Candeias, Joseph W. Carlson, Adrian Carr, Irwin Jungreis, Daniel Marbach, Rachel Sealfon, Michael Y. Tolstorukov, Sebastian Will, Artyom A. Alekseyenko, Carlo Artieri, Benjamin W. Booth, Angela N. Brooks, Qi Dai, Carrie A. Davis, Michael O. Duff, Xin Feng, Andrey A. Gorchakov, Tingting Gu, Jorja G. Henikoff, Philipp Kapranov, Renhua Li, Heather K. MacAlpine, John Malone, Aki Minoda, Jared Nordman, Katsutomo Okamura, Marc Perry, Sara K. Powell, Nicole C. Riddle, Akiko Sakai, Anastasia Samsonova, Jeremy E. Sandler, Yuri B. Schwartz, Noa Sher, Rebecca Spokony, David Sturgill, Marijke van Baren, Kenneth H. Wan, Li Yang, Charles Yu, Elise Feingold, Peter Good, Mark Guyer, Rebecca Lowdon, Kami Ahmad, Justen Andrews, Bonnie Berger, Steven E. Brenner, Michael R. Brent, Lucy Cherbas, Sarah C. R. Elgin, Thomas R. Gingeras, Robert Grossman, Roger A. Hoskins, Thomas C. Kaufman, William Kent, Mitzi I. Kuroda, Terry Orr-Weaver, Norbert Perrimon, Vincenzo Pirrotta, James W. Posakony, Bing Ren, Steven Russell, Peter Cherbas, Brenton R. Graveley, Suzanna Lewis, Gos Micklem, Brian Oliver, Peter J. Park, Susan E. Celniker, Steven Henikoff, Gary H. Karpen, Eric C. Lai, David M. MacAlpine, Lincoln D. Stein, Kevin P. White, and Manolis Kellis. Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science, 330(6012):1787–97, 2010. doi:10.1126/science.1198374
To gain insight into how genomic information is translated into cellular and developmental programs, the Drosophila model organism Encyclopedia of DNA Elements (modENCODE) project is comprehensively mapping transcripts, histone modifications, chromosomal proteins, transcription factors, replication proteins and intermediates, and nucleosome properties across a developmental time course and in multiple cell lines. We have generated more than 700 data sets and discovered protein-coding, noncoding, RNA regulatory, replication, and chromatin elements, more than tripling the annotated portion of the Drosophila genome. Correlated activity patterns of these elements reveal a functional regulatory network, which predicts putative new functions for genes, reveals stage- and tissue-specific regulators, and enables gene-expression prediction. Our results provide a foundation for directed experimental and computational studies in Drosophila and related species and also a model for systematic data integration toward comprehensive genomic and functional annotation.
Mathias Möhl, Raheleh Salari, Sebastian Will, Rolf Backofen, and S. Cenk Sahinalp. Sparsification of RNA structure prediction including pseudoknots. Algorithms Mol Biol, 5(1):39, 2010. doi:10.1186/1748-7188-5-39
ABSTRACT: BACKGROUND: Although many RNA molecules contain pseudoknots, computational prediction of pseudoknotted RNA structure is still in its infancy due to high running time and space consumption implied by the dynamic programming formulations of the problem. RESULTS: We introduce sparsification to significantly speedup the dynamic programming approaches for pseudoknotted RNA structure prediction, which also lower the space requirements. Although sparsification has been applied to a number of RNA-related structure prediction problems in the past few years, we provide the first application of sparsification to pseudoknotted RNA structure prediction specifically and to handling gapped fragments more generally - which has a much more complex recursive structure than other problems to which sparsification has been applied. We analyse how to sparsify four pseudoknot structure prediction algorithms, among those the most general method available (the Rivas-Eddy algorithm) and the fastest one (Reeder-Giegerich algorithm). In all algorithms the number of "candidate" substructures to be considered is reduced. CONCLUSION: Experimental results on the sparsified Reeder-Giegerich algorithm suggest a linear speedup over the unsparsified implementation.
Mathias Möhl, Raheleh Salari, Sebastian Will, Rolf Backofen, and S. Cenk Sahinalp. Sparsification of RNA Structure Prediction Including Pseudoknots. In Vincent Moulton and Mona Singh, editors, Proc. of the 10th Workshop on Algorithms in Bioinformatics (WABI), volume 6293 of Lecture Notes in Computer Science, pages 40–51. Springer Berlin / Heidelberg, 2010. doi:10.1007/978-3-642-15294-8_4
Although many RNA molecules contain pseudoknots, computational prediction of pseudoknotted RNA structure is still in its infancy due to high running time and space consumption implied by the dynamic programming formulations of the problem. In this paper, we introduce sparsification to significantly speedup the dynamic programming approaches for pseudoknotted RNA structure prediction, which also lower the space requirements. Although sparsification has been applied to a number of RNA-related structure prediction problems in the past few years, we provide the first application of sparsification to pseudoknotted RNA structure prediction specifically and to handling gapped fragments more generally - which has a much more complex recursive structure than other problems to which sparsification has been applied. We show that sparsification, when applied to the fastest, as well as the most general pseudoknotted structure prediction methods available, - respectively the Reeder-Giegerich algorithm and the Rivas-Eddy algorithm - reduces the number of candidate substructures to be considered significantly. In fact, experimental results on the sparsified Reeder-Giegerich algorithm suggest a linear speedup over the unsparsified implementation.
Alessandro Dal Palu, Mathias Möhl, and Sebastian Will. A propagator for Maximum Weight String Alignment with Arbitrary Pairwise Dependencies. In Proceedings of the 16th International Conference on Principles and Practice of Constraint Programming (CP-2010), pages 167–175, 2010. doi:10.1007/978-3-642-15396-9_16
The optimization of weighted string alignments is a well studied problem recurring in a number of application domains and can be solved efficiently. The problem becomes MAX-SNP-hard as soon as arbitrary pairwise dependencies among the alignment edges are introduced. We present a global propagator for this problem which is based on efficiently solving a relaxation of it. In the context of bioinformatics, the problem is known as alignment of arc-annotated sequences, which is e.g. used for comparing RNA molecules. For a restricted version of this alignment problem, we show that a constraint program based on our propagator is on par with state of the art methods. For the general problem with unrestricted dependencies, our tool constitutes the first available method with promising applications in this field.
Cameron Smith, Steffen Heyne, Andreas S. Richter, Sebastian Will, and Rolf Backofen. Freiburg RNA Tools: a web server integrating IntaRNA, ExpaRNA and LocARNA. Nucleic Acids Res, 38 Suppl:W373–7, 2010. doi:10.1093/nar/gkq316
The Freiburg RNA tools web server integrates three tools for the advanced analysis of RNA in a common web-based user interface. The tools IntaRNA, ExpaRNA and LocARNA support the prediction of RNA-RNA interaction, exact RNA matching and alignment of RNA, respectively. The Freiburg RNA tools web server and the software packages of the stand-alone tools are freely accessible at http://rna.informatik.uni-freiburg.de.
Raheleh Salari, Mathias Möhl, Sebastian Will, S. Cenk Sahinalp, and Rolf Backofen. Time and Space Efficient RNA-RNA Interaction Prediction via Sparse Folding. In Bonnie Berger, editor, Proc. of RECOMB 2010, volume 6044 of Lecture Notes in Computer Science, pages 473–490. Springer-Verlag Berlin Heidelberg, 2010. doi:10.1007/978-3-642-12683-3_31
In the past years, a large set of new regulatory ncRNAs have been identified, but the number of experimentally verified targets is considerably low. Thus, computational target prediction methods are on high demand. Whereas all previous approaches for predicting a general joint structure have a complexity of O(n⁶) running time and O(n⁴) space, a more time and space efficient interaction prediction that is able to handle complex joint structures is necessary for genome-wide target prediction problems. In this paper we show how to reduce both the time and space complexity of the RNA-RNA interaction prediction problem as described by Alkan et al. via dynamic programming sparsification - which allows to discard large portions of DP tables without loosing optimality. Applying sparsification techniques reduces the complexity of the original algorithm from O(n⁶) time and O(n⁴) space to O(n⁴ textbackslash psi(n)) time and O(n² textbackslash psi(n) + n³) space for some function textbackslash psi(n), which turns out to have small values for the range of n that we encounter in practice. Under the assumption that the polymer-zeta property holds for RNA-structures, we demonstrate that textbackslash psi(n)=O(n) on average, resulting in a linear time and space complexity improvement over the original algorithm. We evaluate our sparsified algorithm for RNA-RNA interaction prediction by total free energy minimization, based on the energy model of Chitsaz et al., on a set of known interactions. Our results confirm the significant reduction of time and space requirements in practice.
Mathias Möhl, Sebastian Will, and Rolf Backofen. Lifting prediction to alignment of RNA pseudoknots. J Comput Biol, 17(3):429–42, 2010. doi:10.1089/cmb.2009.0168
Prediction and alignment of RNA pseudoknot structures are NP-hard. Nevertheless, several efficient prediction algorithms by dynamic programming have been proposed for restricted classes of pseudoknots. We present a general scheme that yields an efficient alignment algorithm for arbitrary such classes. Moreover, we show that such an alignment algorithm benefits from the class restriction in the same way as the corresponding structure prediction algorithm does. We look at six of these classes in greater detail. The time and space complexity of the alignment algorithm is increased by only a linear factor over the respective prediction algorithm. For five of the classes, no efficient alignment algorithms were known. For the sixth, most general class, we improve the previously best complexity of O(n(5)m(5)) time to O(nm(6)), where n and m denote sequence lengths. Finally, we apply our fastest algorithm with O(nm(4)) time and O(nm(2)) space to comparative de-novo pseudoknot prediction.
Steffen Heyne, Sebastian Will, Michael Beckstette, and Rolf Backofen. Lightweight Comparison of RNAs Based on Exact Sequence-Structure Matches. Bioinformatics, 25(16):2095–2102, 2009. doi:10.1093/bioinformatics/btp065
MOTIVATION: Specific functions of ribonucleic acid (RNA) molecules are often associated with different motifs in the RNA structure. The key feature that forms such an RNA motif is the combination of sequence and structure properties. In this article, we introduce a new RNA sequence-structure comparison method which maintains exact matching substructures. Existing common substructures are treated as whole unit while variability is allowed between such structural motifs. Based on a fast detectable set of overlapping and crossing substructure matches for two nested RNA secondary structures, our method ExpaRNA (exact pattern of alignment of RNA) computes the longest collinear sequence of substructures common to two RNAs in O(H ast nm) time and O(nm) space, where H << n.m for real RNA structures. Applied to different RNAs, our method correctly identifies sequence-structure similarities between two RNAs. RESULTS: We have compared ExpaRNA with two other alignment methods that work with given RNA structures, namely RNAforester and RNAalign. The results are in good agreement, but can be obtained in a fraction of running time, in particular for larger RNAs. We have also used ExpaRNA to speed up state-of-the-art Sankoff-style alignment tools like LocARNA, and observe a tradeoff between quality and speed. However, we get a speedup of 4.25 even in the highest quality setting, where the quality of the produced alignment is comparable to that of LocARNA alone. AVAILABILITY: The presented algorithm is implemented in the program ExpaRNA, which is available from our website (http://www.bioinf.uni-freiburg.de/Software).
Martin Mann, Cameron Smith, Mohamad Rabbath, Marlien Edwards, Sebastian Will, and Rolf Backofen. CPSP-web-tools: a server for 3D lattice protein studies. Bioinformatics, 25(5):676–7, 2009. doi:10.1093/bioinformatics/btp034
Studies on proteins are often restricted to highly simplified models to face the immense computational complexity of the associated problems. Constraint-based protein structure prediction (CPSP) tools is a package of very fast algorithms for ab initio optimal structure prediction and related problems in 3D HP-models [cubic and face centered cubic (FCC)]. Here, we present CPSP-web-tools, an interactive online interface of these programs for their immediate use. They include the first method for the direct prediction of optimal energies and structures in 3D HP side-chain models. This newest extension of the CPSP approach is described here for the first time. AVAILABILITY AND IMPLEMENTATION: Free access at http://cpsp.informatik.uni-freiburg.de
Jérôme Waldispühl, Charles W. O'Donnell, Sebastian Will, Srinivas Devadas, Rolf Backofen, and Bonnie Berger. Simultaneous Alignment and Folding of Protein Sequences. In Serafim Batzoglou, editor, Proc.of the 13^th Annual International Conferences on Computational Molecular Biology (RECOMB'09), volume 5541 of LNBI, pages 339–355. Springer, 2009.
One of the central challenges in computational biology is to develop accurate tools for protein structure analysis. Particularly difficult cases of this are sequence alignment and consensus folding of low-homology proteins. In this work, we present partiFold-Align, the first algorithm for simultaneous alignment and consensus folding of unaligned protein sequences; the algorithm's complexity is polynomial in time and space. Algorithmically, partiFold-Align additionally exploits sparsity in the set of likely super-secondary structure pairings and alignment candidates for each amino acid to achieve an effectively cubic running time for simultaneous pairwise alignment and folding. We demonstrate the efficacy of these techniques on transmembrane beta-barrel proteins, an important yet difficult class of proteins with very few available three-dimensional structures. In tests on sequence alignments derived from structure alignments, partiFold-Align is significantly more accurate than current best approaches for pairwise sequence alignment in the difficult case of low sequence homology and improves secondary structure prediction when current approaches fail. Importantly, partiFold-Align does not require training on transmembrane beta-barrel proteins. The generality of these techniques should allow them to be applied to a wide variety of protein structures.
Mathias Möhl, Sebastian Will, and Rolf Backofen. Lifting Prediction to Alignment of RNA Pseudoknots. In Serafim Batzoglou, editor, Proc. of the 13th Annual International Conferences on Computational Molecular Biology (RECOMB'09), volume 5541 of LNBI, pages 285–301. Springer, 2009.
Many computational problems concerning RNA pseudoknot structures, most prominently their prediction and alignment, are NP-hard. For structure prediction, several algorithms have been proposed that are restricted to certain classes of pseudoknots in order to run efficiently. We present a general scheme that yields an efficient alignment algorithm for arbitrary classes of pseudoknots that can be predicted efficiently by dynamic programming. Moreover, we show that such an alignment algorithm benefits from restrictions to certain structure classes in the same way as structure prediction algorithms do. We look at five of these classes in greater detail. Compared to the respective structure prediction algorithm, the time and space complexity of the obtained alignment algorithm is increased by only a linear factor. For four of the classes, there do not exist alignment algorithms so far. For the fifth, most general class, our algorithm improves the time complexity compared to the best algorithm known so far, from O(n⁵m⁵) time to O(nm⁶), where n and m are the length of the two aligned sequences. Finally, we apply the fastest of the generated algorithms with complexity O(nm⁴) time and O(nm²) space to comparative de-novo prediction of pseudoknots.
Stephan H. Bernhart, Ivo L. Hofacker, Sebastian Will, Andreas R. Gruber, and Peter F. Stadler. RNAalifold: improved consensus structure prediction for RNA alignments. BMC Bioinformatics, 9:474, 2008. doi:10.1186/1471-2105-9-474
Background: The prediction of a consensus structure for a set of related RNAs is an important first step for subsequent analyses. RNAalifold, which computes the minimum energy structure that is simultaneously formed by a set of aligned sequences, is one of the oldest and most widely used tools for this task. In recent years, several alternative approaches have been advocated, pointing to several shortcomings of the original RNAalifold approach. Results: We show that the accuracy of RNAalifold predictions can be improved substantially by introducing a different, more rational handling of alignment gaps, and by replacing the rather simplistic model of covariance scoring with more sophisticated RIBOSUM-like scoring matrices. These improvements are achieved without compromising the computational efficiency of the algorithm. We show here that the new version of RNAalifold not only outperforms the old one, but also several other tools recently developed, on different datasets. Conclusions: The new version of RNAalifold not only can replace the old one for almost any application but it is also competitive with other approaches including those based on SCFGs, maximum expected accuracy, or hierarchical nearest neighbor classifiers.
Mathias Möhl, Sebastian Will, and Rolf Backofen. Fixed Parameter Tractable Alignment of RNA Structures Including Arbitrary Pseudoknots. In Proceedings of the 19th Annual Symposium on Combinatorial Pattern Matching (CPM 2008), LNCS, pages 69–81. Springer-Verlag, 2008.
We present an algorithm that computes the edit distance of two RNA structures with arbitrary kinds of pseudoknots. A main benefit of the algorithm is that, despite the problem is NP-hard, the algorithmic complexity adapts to the complexity of the RNA structures. Due to fixed parameter tractability, we can guarantee polynomial run-time for a parameter which is small in practice. Our algorithm can be considered as a generalization of the algorithm of Jiang et al.~(Jiang, 2002) to arbitrary pseudoknots. In their absence, it gracefully degrades to the same polynomial algorithm. A prototypical implementation demonstrates the applicability of the method.
Sebastian Will, Anke Busch, and Rolf Backofen. Efficient Sequence Alignment with Side-Constraints by Cluster Tree Elimination. Constraints Journal, 13(1):110–129, 2008. doi:10.1007/s10601-007-9032-x
Aligning DNA and protein sequences is a core technique in molecular biology. Often, it is desirable to include partial prior knowledge and conditions in an alignment. Going beyond prior work, we aim at the integration of such side constraints in free combination into alignment algorithms. The most common and successful technique for efficient alignment algorithms is dynamic programming (DP). However, a weakness of DP is that one cannot include additional constraints without specifically tailoring a new DP algorithm. Here, we discuss a declarative approach that is based on constraint techniques and show how it can be extended by formulating additional knowledge as constraints. We take special care to obtain the efficiency of DP for sequence alignment. This is achieved by careful modeling and applying proper solving strategies. Finally, we apply our method to the scanning for RNA motifs in large sequences. This case study demonstrates how the new approach can be used in real biological problems. A prototypic implementation of the method is available at http://www.bioinf.uni-freiburg.de/Software/CTE-Alignment.
Wolfgang Otto, Sebastian Will, and Rolf Backofen. Structure Local Multiple Alignment of RNA. In Proceedings of German Conference on Bioinformatics (GCB'2008), volume P-136 of Lecture Notes in Informatics (LNI), pages 178–188. Gesellschaft für Informatik (GI), 2008.
Today, RNA is well known to perform important regulatory and catalytical function due to its distinguished structure. Consequentially, structure is conserved in evolution and state-of-the art RNA multiple alignment algorithms consider structure as well as sequence information. However, existing tools neglect the important aspect of locality. Notably, locality in RNA occurs as similarity of subsequences as well as similarity of only substructures.
We present a novel approach for multiple alignment of RNAs that deals with both kinds of locality. The approach extends LocARNA by structural locality and delegates the construction of multiple alignments to T-Coffee.
The paper systematically investigates structural locality in known RNA families. Benchmarking multiple alignment tools on structurally local families shows the need for algorithmic support of this locality. The improvement in accuracy in special cases is achieved while staying competitive with state-of-the-art alignment tools across the whole Bralibase.
LocARNA and its T-Coffee extended variant LocARNATE are freely available at http://www.bioinf.uni-freiburg.de/Software/LocARNA/.
Steffen Heyne, Sebastian Will, Michael Beckstette, and Rolf Backofen. Lightweight comparison of RNAs based on exact sequence-structure matches. In Proceedings of the German Conference on Bioinformatics (GCB'2008), volume P-136 of Lecture Notes in Informatics (LNI), pages 189–198. Gesellschaft für Informatik (GI), 2008.
Specific functions of RNA molecules are often associated to different motifs in the RNA structure. The key feature is that the combination of sequence and structure properties form such an RNA motif. In this paper we introduce a new RNA sequence-structure comparison method which maintains exact matching substructures. Existing common substructures are treated as whole unit while variability is allowed between such structural motifs. Based on a fast detectable set of overlapping and crossing substructure matches for two nested RNA secondary structures, our method computes the longest colinear sequence of substructures common to two RNAs in O(n²m²) time and O(nm) space. Applied to different RNAs, our method correctly identifies sequence-structure similarities between two RNAs. The results of our experiments are in good agreement with existing alignment-based methods, but can be obtained in a fraction of running time, in particular for larger RNAs. The proposed algorithm is implemented in the program expaRNA, which is available from our website (www.bioinf.uni-freiburg.de/Software).
Martin Mann, Sebastian Will, and Rolf Backofen. CPSP-tools - Exact and Complete Algorithms for High-throughput 3D Lattice Protein Studies. BMC Bioinformatics, 9:230, 2008. doi:10.1186/1471-2105-9-230
Background: The principles of protein folding and evolution pose problems of very high inherent complexity. Often these problems are tackled using simplified protein models, e.g. lattice proteins. The CPSP-tools package provides programs to solve exactly and completely the problems typical of studies using 3D lattice protein models. Among the tasks addressed are the prediction of (all) globally optimal and/or suboptimal structures as well as sequence design and neutral network exploration. Results: In contrast to stochastic approaches, which are not capable of answering many fundamental questions, our methods are based on fast, non-heuristic techniques. The resulting tools are designed for high-throughput studies of 3D-lattice proteins utilizing the Hydrophobic-Polar (HP) model. The source bundle is freely available at http://www.bioinf.uni-freiburg.de/sw/cpsp/ Conclusions: The CPSP-tools package is the first set of exact and complete methods for extensive, high-throughput studies of non-restricted 3D-lattice protein models. In particular, our package deals with cubic and face centered cubic (FCC) lattices.
Sebastian Will, Kristin Reiche, Ivo L. Hofacker, Peter F. Stadler, and Rolf Backofen. Inferring Non-Coding RNA Families and Classes by Means of Genome-Scale Structure-Based Clustering. PLoS Comput Biol, 3(4):e65, 2007. doi:10.1371/journal.pcbi.0030065
The RFAM database defines families of ncRNAs by means of sequence similarities that are sufficientto establish homology. In some cases, such as microRNAs, box H/ACA snoRNAs, functional commonalities define classes of RNAs that are characterized by structural similarities, and typically consist ofmultiple RNA families. Recent advances in high-throughput transcriptomics and comparative genomics have produced very large sets of putative non-coding RNAs and regulatory RNA signals. For many ofthem, evidence for stabilizing selection acting on their secondary structures has been derived, and at least approximate models of their structures have been computed. The overwhelming majority of these hypo-thetical RNAs cannot be assigned to established families or classes. We present here a structure-based clustering approach that is capable of extracting putative RNA classesfrom genome-wide surveys for structured RNAs. The LocARNA tool implements a novel variant of theSankoff algorithm that is sufficiently fast to deal with several thousand candidate sequences. The method is also robust against false positive predictions, i.e., a contamination of the input data with unstructured ornon-conserved sequences. We have successfully tested the LocARNA-based clustering approach on the sequences of the RFAM-seedalignments. Furthermore, we have applied it to a previously published set of 3332 predicted structured elements in the Ciona intestinalis genomes (Missal et al., Bioinformatics 21(S2), i77-i78). In addition torecovering e.g. tRNAs as a structure-based class, the method identifies several RNA families, including microRNA and snoRNA candidates, and suggests several novel classes of ncRNAs for which to-date norepresentative has been experimentally characterized.
Athanasius F. Bompfünewerer, Rolf Backofen, Stephan H. Bernhart, Jana Hertel, Ivo L. Hofacker, Peter F. Stadler, and Sebastian Will. Variations on RNA folding and alignment: lessons from Benasque. Journal of Mathematical Biology, 56(1-2):129–144, 2008. doi:10.1007/s00285-007-0107-5
Dynamic programming algorithms solve many standard problems of RNA bioinformatics in polynomial time. In this contribution we discuss a series of variations on these standard methods that implement refined biophysical models, such as a restriction of RNA folding to canonical structures, and an extension of structural alignments to an explicit scoring of stacking propensities. Furthermore, we demonstrate that a local structural alignment can be employed for ncRNA gene finding. In this context we discuss scanning variants for folding and alignment algorithms.
Athanasius F. Bompfünewerer Consortium, Rolf Backofen, Stephan H. Bernhart, Christoph Flamm, Claudia Fried, Guido Fritzsch, Jörg Hackermüller, Jana Hertel, Ivo L. Hofacker, Kristin Missal, Axel Mosig, Sonja J. Prohaska, Dominic Rose, Peter F. Stadler, Andrea Tanzer, Stefan Washietl, and Sebastian Will. RNAs everywhere: genome-wide annotation of structured RNAs. J Exp Zoolog B Mol Dev Evol, 308B(1):1–25, 2007. doi:10.1002/jez.b.21130
Starting with the discovery of microRNAs and the advent of genome-wide transcriptomics, non-protein-coding transcripts have moved from a fringe topic to a central field research in molecular biology. In this contribution we review the state of the art of "computational RNomics", i.e., the bioinformatics approaches to genome-wide RNA annotation. Instead of rehashing results from recently published surveys in detail, we focus here on the open problem in the field, namely (functional) annotation of the plethora of putative RNAs. A series of exploratory studies are used to provide non-trivial examples for the discussion of some of the difficulties.
Alessandro Dal Palù, Agostino Dovier, Fran c cois Fages, and Sebastian Will. Constraint-based Methods for Bioinformatics. In Frédéric Benhamou, Narendra Jussien, and Barry O'Sullivan, editors, Trends in Constraint Programming, chapter 6, pages 129–150. ISTE, London, UK, may 2007.
Martin Mann, Sebastian Will, and Rolf Backofen. The Energy Landscape Library - A platform for Generic Algorithms. In BIRD'07 - 1st international Conference on Bioinformatics Research and Development, volume 217, pages 83–86. Oesterreichische Computer Gesellschaft, 2007.
The study of energy landscapes of biopolymers and their models is an important field in bioinformatics. For instance the investigation of kinetics or folding simulations are done using methods that are based on sampling or exhaustive enumeration. Most of such algorithms are independent of the underlying landscape model. Therefore frameworks for generic algorithms to investigate the landscape properties is needed. Here, we present the Energy Landscape Library (ELL) that allows such a model-independent formulation of generic algorithms dealing with discrete states. The ELL is a completely object-oriented C++ library that is highly modular, easy to extend, and freely available online. It can be used for a fast and easy implementation of new generic algorithms (possibly based on the provided basic method pool) or as a framework to test their properties for different landscape models, which can be formulated straightforward.
Rolf Backofen and Sebastian Will. A constraint-Based Approach to Fast and Exact Structure Prediction in Three-Dimensional Protein Models. Journal of Constraints, 11(1):5–30, January 2006. doi:10.1007/s10601-006-6848-8
Simplified protein models are used for investigating general properties of proteins and principles of protein folding. Furthermore, they are suited for hierarchical approaches to protein structure prediction. A well known protein model is the HPmodel of Lau and Dill [33], which models the important aspect of hydrophobicity. One can define the HP-model for various lattices, among them two-dimensional and three-dimensional ones. Here, we investigate the three-dimensional case. The main motivation for studying simplified protein models is to be able to predict model structures much more quickly and more accurately than is possible for real proteins. However, up to now there was a dilemma: the algorithmically tractable, simple protein models can not model real protein structures with good quality and introduce strong artifacts.
We present a constraint-based method that largely improves this situation. It outperforms all existing approaches for lattice protein folding in HP-models. This approach is the first one that can be applied to two three-dimensional lattices, namely the cubic lattice and the face-centered-cubic (FCC ) lattice. Moreover, it is the only exact method for the FCC lattice. The ability to use the FCC lattice is a significant improvement over the cubic lattice. The key to our approach is the ability to compute maximally compact sets of points (used as hydrophobic cores), which we accomplish for the first time for the FCC lattice.
Keywords: protein structure prediction, HP-model, face-centered cubic lattice, constraint programming
Michael T. Wolfinger, Sebastian Will, Ivo L. Hofacker, Rolf Backofen, and Peter F. Stadler. Exploring the lower part of discrete polymer model energy landscapes. Europhysics Letters, 74(4):725–732, 2006. doi:10.1209/epl/i2005-10577-0
We present a generic, problem independent algorithm for exploration of the lowenergy portion of the energy landscape of discrete systems and apply it to the energy landscape of lattice proteins. Starting from a set of optimal and near-optimal conformations derived from a constraint-based search technique, we are able to selectively investigate the lower part of lattice protein energy landscapes in two and three dimensions. This novel approach allows, in contrast to exhaustive enumeration, for an eOEcient calculation of optimal and near-optimal structures below a given energy threshold and is only limited by the available amount of memory. A straightforward application of the algorithm is calculation of barrier trees (representing the energy landscape), which then allows dynamics studies based on landscape theory.
Anke Busch, Sebastian Will, and Rolf Backofen. SECISDesign: a server to design SECIS-elements within the coding sequence. Bioinformatics, 21(15):3312–3, 2005.
SUMMARY: SECISDesign is a server for the design of SECIS-elements and arbitrary RNA-elements within the coding sequence of an mRNA. The element has to satisfy both structure and sequence constraints. At the same time, a certain amino acid similarity to the original protein has to be kept. The designed sequence can be used for recombinant expression of selenoproteins in Escherichia coli. AVAILABILITY: The server is available at http://www.bio.inf.uni-jena.de/Software/SECISDesign/index.html.
Rolf Backofen and Sebastian Will. Local Sequence-Structure Motifs in RNA. Journal of Bioinformatics and Computational Biology (JBCB), 2(4):681–698, 2004.
Ribonuclic acid (RNA) enjoys increasing interest in molecular biology; despite this interest fundamental algorithms are lacking, e.g. for identifying local motifs. As proteins, RNA molecules have a distinctive structure. Therefore, in addition to sequence information, structure plays an important part in assessing the similarity of RNAs. Furthermore, common sequence-structure features in two or several RNA molecules are often only spatially local, where possibly large parts of the molecules are dissimilar. Consequently, we address the problem of comparing RNA molecules by computing an optimal local alignment with respect to sequence and structure information. While local alignment is superior to global alignment for identifying local similarities, no general local sequence-structure alignment algorithms are currently known. We suggest a new general definition of locality for sequence-structure alignments that is biologically motivated and efficiently tractable. To show the former, we discuss locality of RNA and prove that the defined locality means connectivity by atomic and non-atomic bonds. To show the latter, we present an efficient algorithm for the newly defined pairwise local sequence-structure alignment (lssa) problem for RNA. For molecules of lengthes n and m, the algorithm has worst-case time complexity of O(n²m²max(n,m)) and a space complexity of only O(nm). An implementation of our algorithm is available at http://www.bio.inf.uni-jena.de. Its runtime is competitive with global sequence-structure alignment.
Keywords: RNA; local alignment; local sequence-structure alignment; lssa
Sebastian Will and Rolf Backofen. Breaking of Partial Symmetries in the Photo and Alignment Problem. In Barbara Smith, Ian Gent, and Warwick Harvey, editors, Proceedings of the Third International Workshop on Symmetry in Constraint Satisfaction Problems (SymCon 2003), pages 187–194, 2003.
Symmetry breaking by adding symmetric constraints during search usually assumes that symmetric constraints are simple. We identify symmetries with complex symmetric constraints, where the symmetries nevertheless can be handled by a similar method. For this aim, we introduce partial symmetries. We identify those symmetries in two problems. The photo problem is a well known example problem, while the alignment problem is a real world problem from bioinformatics.
Rolf Backofen and Sebastian Will. A constraint-Based Approach to Structure Prediction for Simplified Protein Models that Outperforms Other Existing Methods. In Proceedings of the 19th International Conference on Logic Programming (ICLP 2003), volume 2916 of Lecture Notes in Computer Science, pages 49–71. Springer, 2003.
Lattice protein models are used for hierarchical approaches to protein structure prediction, as well as for investigating general principles of protein folding. So far, one has the problem that either the lattice does not model real protein conformations with good quality, or there is no efficient method known for finding native conformations. We present a constraint-based method that largely improves this situ- ation. It outperforms all existing approaches in lattice protein folding on the type of model we have chosen (namely the HP-model by Lau and Dill[34], which models the important aspect of hydrophobicity). It is the only exact method that has been applied to two different lattices. Furthermore, it is the only exact method for the face-centered cubic lattice. This lattice is important since it has been shown [38] that the FCC lattice can model real protein conformations with coordinate root mean square deviation below 2 Angstrom. Our method uses a constraint-based approach. It works by first calculating maximally compact sets of points (hydrophobic cores), and then threading the given HP-sequence to the hydrophobic cores such that the core is occupied by H-monomers.
Rolf Backofen and Sebastian Will. Excluding symmetries in constraint-based search. Constraints, 7(3):333–349, 2002.
We introduce a new method for excluding symmetries in constraint based search. To our knowledge, it is the first declarative method that can be applied to textbackslash em arbitrary symmetries. Our method is based on the notion of symmetric constraints, which are used in our modification of a general constraint based search algorithm. The method does not influence the search strategy. Furthermore, it can be used with either the full set of symmetries, or with an subset of all symmetries. We proof correctness, completeness and symmetry exclusion properties of our method. Then, we show how to apply the method in the special case of geometric symmetries (rotations and reflections) and permutation symmetries. Furthermore, we give results from practical applications.
Sebastian Will. Constraint-based Hydrophobic Core Construction for Protein Structure Prediction in the Face-Centered-Cubic Lattice. In Russ B. Altman, A. Keith Dunker, Lawrence Hunter, and Teri E. Klein, editors, Proceedings of the Pacific Symposium on Biocomputing 2002 (PSB 2002), pages 661–672, Singapore, 2002. World Scientific Publishing Co. Pte. Ltd.
We present an algorithm for exact protein structure prediction in the FCC-HP-model. This model is a lattice protein model on the face-centered-cubic lattice that models the main force of protein folding, namely the hydrophobic force. The structure prediction for this model can be based on the construction of hydrophobic cores. The main focus of the paper is on an algorithm for constructing maximally and submaximally compact hydrophobic cores of a given size. This algorithm treats core construction as a constraint satisfaction problem (CSP), and the paper describes its constraint model. The algorithm employs symmetry excluding constraint-based search [Backofen, Will: CP99] and relies heavily on good upper bounds on the number of contacts. Here, we use and strengthen upper bounds presented earlier. [Backofen, Will: CPM2001] The resulting structure prediction algorithm (including previous work [Backofen, Will: CPM2001, Backofen, Will: CP2001]) handles sequences of sizes in the range of real proteins fast, i.e. we predict a first structure often within a few minutes. The algorithm is the first exact one for the FCC, besides full enumeration which is impracticable for chain lengths greater than about 15. We tested the algorithm succesfully up to sequence length of 160, which is far beyond the capabilities even of previous heuristic approaches.
Rolf Backofen and Sebastian Will. Optimally Compact Finite Sphere Packings — Hydrophobic Cores in the FCC. In Proc. of the 12th Annual Symposium on Combinatorial Pattern Matching (CPM2001), volume 2089 of Lecture Notes in Computer Science, pages 257–272, Berlin, 2001. Springer-Verlag.
Lattice protein models are used for hierarchical approaches to protein structure prediction, as well as for investigating principles of protein folding. The problem is that there is so far no known lattice that can model real protein conformations with good quality, textbackslash emphand for which there is an efficient method to prove whether a conformation found by some heuristic algorithm is optimal. We present such a method for the FCC-HP-Model [Agarwala et al., JCB 97]. For the FCC-HP-Model, we need to find conformations with a maximally compact hydrophobic core. Our method allows us to enumerate maximally compact hydrophobic cores for sufficiently great number of hydrophobic amino-acids. We have used our method to prove the optimality of heuristically predicted structures for HP-sequences in the FCC-HP-model.
Rolf Backofen and Sebastian Will. Fast, Constraint-based Threading of HP-Sequences to Hydrophobic Cores. In Proc. of the 7th International Conference on Principle and Practice of Constraint Programming (CP'2001), volume 2239 of Lecture Notes in Computer Science, pages 494–508, Berlin, 2001. Springer-Verlag.
Lattice protein models are used for hierarchical approaches to protein structure prediction, as well as for investigating principles of protein folding. So far, one has the problem that there exists no lattice that can model real protein conformations with good quality textbackslash emphand for which an efficient method to find native conformations is known. We present the first method for the FCC-HP-Model [Agarwala et al., JCB 97] that is capable of finding native conformations for real-sized HP-sequences. It has been shown [Park and Levitt, JMB 95] that the FCC lattice can model real protein conformations with coordinate root mean square deviation below 2 textbackslash AA. Our method uses a constraint-based approach. It works by first calculating maximally compact sets of points (hydrophobic cores), and then threading the given HP-sequence to the hydrophobic cores such that the core is occupied by H-monomers.
Rolf Backofen, Sebastian Will, and Peter Clote. Algorithmic approach to quantifying the hydrophobic force contribution in protein folding. In Russ B. Altman, A. Keith Dunker, Lawrence Hunter, and Teri E. Klein, editors, Proceedings of the Pacific Symposium on Biocomputing (PSB 2000), volume 5, pages 92–103, 2000.
Though the electrostatic, ionic, van der Waals, Lennard-Jones, hydrogen bonding, and other forces play an important role in the energy function minimized at a protein's native state, it is widely believed that the textbackslash em hydrophobic force is the dominant term in protein folding. In this paper, we attempt to quantify the extent to which the hydrophobic force determines the positions of the backbone textbackslash alpha-carbon atoms in PDB data, by applying Monte-Carlo and genetic algorithms to determine the predicted conformation with minimum energy, where only the hydrophobic force is considered (i.e. Dill's HP-model, and refinements using Woese's polar requirement). This is done by computing the root mean square deviation between the normalized distance matrix D = (dsub>i,j) (dsub>i,j is normalized Euclidean distance between residues rsub>i and rsub>j) for PDB data with that obtained from the output of our algorithms. Our program was run on the database of ancient conserved regions drawn from GenBank 101 generously supplied by W. Gilbert's lab, as well as medium-sized proteins (E. Coli RecA, 2reb, Erythrocruorin, 1eca, and Actinidin 2act). The root mean square deviation (RMSD) between distance matrices derived from the PDB data and from our program output is quite small, and by comparison with RMSD between PDB data and random coils, allows a quantification of the hydrophobic force contribution Keywords: lattice, face-centered-cubic, hydrophobic force, automorphism
Slim Abdennadher, Matthias Saft, and Sebastian Will. Classroom Assignment using Constraint Logic Programming. In Proceedings of the Second International Conference and Exhibition on The Practical Application of Constraint Technologies and Logic Programming (PACLP 2000), 2000.
The Classroom Assignment problem consists of scheduling a set of courses into a fixed number of rooms, given a fixed timetable. At the University of Munich, a classroom plan has to be created each semester after collecting timetables of all departments and wishes of teachers. This planning is very complex since a lot of constraints have to be met, e.g. room size and equipment. Using constraint-based programming, we developed a prototype, called RoomPlan, that supports automatic creation of classroom plans. With RoomPlan a schedule can be created interactively within some minutes instead of some days. RoomPlan was presented at the Systems'99 Computer exhibition in Munich.
Rolf Backofen, Sebastian Will, and Erich Bornberg-Bauer. Application of Constraint Programming Techniques for Structure Prediction of Lattice Proteins with Extended Alphabets. Bioinformatics, 15(3):234–242, 1999.
Predicting the ground state of biopolymers is a notoriously hard problem in biocomputing. Model systems, such as lattice proteins are simple tools and valuable to test and improve new methods. Best known are HP-type models with sequences composed from a binary (hydrophobic and polar) alphabet. Major drawback is the degeneracy, i.e. the number of different ground state conformations. Here we show how recently developed constraint programming techniques can be used to solve the structure prediction problem efficiently for a higher order alphabet. To our knowledge it is the first report of an exact and computationally feasible solution to model proteins of length up to 36 and without resorting to maximally compact states. We further show that degeneracy is reduced by more than one order of magnitude and that ground state conformations are not necessarily compact. Therefore, more realistic protein simulations become feasible with our model.
Rolf Backofen and Sebastian Will. Excluding Symmetries in Constraint-Based Search. In Joxan Jaffar, editor, Proceedings of the 5th International Conference on Principle and Practice of Constraint Programming (CP'99), volume 1713 of Lecture Notes in Computer Science, pages 73–87, Berlin, 1999. Springer-Verlag.
We introduce a new method for excluding symmetries in constraint based search. To our knowledge, it is the first declarative method that can be applied to textbackslash em arbitrary symmetries. Our method is based on the notion of symmetric constraints, which are used in our modification of a general constraint based search algorithm. The method does not influence the search strategy. Furthermore, it can be used with either the full set of symmetries, or with an subset of all symmetries. We proof correctness, completeness and symmetry exclusion properties of our method. We then show how to apply the method in the special case of geometric symmetries (rotations and reflections) and permutation symmetries. Furthermore, we give results from practical applications.
Rolf Backofen, Sebastian Will, and Peter Clote. Algorithmic approach to quantifying the hydrophobic force contribution in protein folding. In Proceedings of the German Conference on Bioinformatics (GCB'99), pages 93–106, 1999. http://www.bioinfo.de/isb/gcb99.
Though the electrostatic, ionic, van der Waals, Lennard-Jones, hydrogen bonding, and other forces play an important role in the energy function minimized at a protein's native state, it is widely believed that the textbackslash em hydrophobic force is the dominant term in protein folding. In this paper, we attempt to quantify the extent to which the hydrophobic force determines the positions of the backbone textbackslash alpha-carbon atoms in PDB data, by applying Monte-Carlo and genetic algorithms to determine the predicted conformation with minimum energy, where only the hydrophobic force is considered (i.e. Dill's HP-model, and refinements using Woese's polar requirement). This is done by computing the root mean square deviation between the normalized distance matrix D = (dsub>i,j) (dsub>i,j is normalized Euclidean distance between residues rsub>i and rsub>j) for PDB data with that obtained from the output of our algorithms. Our program was run on the database of ancient conserved regions drawn from GenBank 101 generously supplied by W. Gilbert's lab, as well as medium-sized proteins (E. Coli RecA, 2reb, Erythrocruorin, 1eca, and Actinidin 2act). The root mean square deviation (RMSD) between distance matrices derived from the PDB data and from our program output is quite small, and by comparison with RMSD between PDB data and random coils, allows a quantification of the hydrophobic force contribution The final version of this paper will appear in the proceedings of PSB'2000. Keywords: lattice, face-centered-cubic, hydrophobic force, automorphism
Rolf Backofen and Sebastian Will. Structure Prediction in an HP-type Lattice with an Extended Alphabet. In Proc of German Conference on Bioinformatics (GCB'98), 1998.
The protein structure prediction problem is one of the most important problems in computational biology. This problem consists of finding the conformation of a protein (given by a sequence of amino-acids) with minimal energy. Because of the complexity of this problem, simplified models like Dill's HP-lattice model have become a major tool for investigating general properties of protein folding. Even for this simplified model, the structure prediction problem has been proven to be NP-complete. A disadvantage of the HP-problem is its high degeneracy. I.e., for every sequence there are a lot of conformations having the minimal energy. For this reason, extended alphabets have been used in the literature. One of these alphabets is the HPNX-alphabet, which considers hydrophobic amino acids as well as positive and negative charged ones. In this paper, we describe an exact algorithm for solving the structure prediction problem for the HPNX-alphabet. To our knowledge, our algorithm is the first exact one for finding the minimal conformation of an lattice protein in a lattice model with an alphabet more complex than HP. We also compare our results with results as given for the HP-model.