The 3-dimensional (3D) structure of the genome is of significant importance for many cellular processes. In this paper, we study the problem of reconstructing the 3D structure of chromosomes from Hi-C data of diploid organisms, which poses additional challenges compared to the better-studied haploid setting. With the help of techniques from algebraic geometry, we prove that a small amount of phased data is sufficient to ensure finite identifiability, both for noiseless and noisy data. In the light of these results, we propose a new 3D reconstruction method based on semidefinite programming, paired with numerical algebraic geometry and local optimization. The performance of this method is tested on several simulated datasets under different noise levels and with different amounts of phased data. We also apply it to a real dataset from mouse X chromosomes, and we are then able to recover previously known structural features.
Mitochondrial tRNAs have acquired a diverse portfolio of aberrant structures throughout metazoan evolution. With the availability of more than 12,500 mitogenome sequences, it is essential to compile a comprehensive overview of the pattern changes with regard to mitochondrial tRNA repertoire and structural variations. This, of course, requires reanalysis of the sequence data of more than 250,000 mitochondrial tRNAs with a uniform workflow. Here, we report our results on the complete reannotation of all mitogenomes available in the RefSeq database by September 2022 using mitos2. Based on the individual cases of mitochondrial tRNA variants reported throughout the literature, our data pinpoint the respective hotspots of change, i.e. Acanthocephala (Lophotrochozoa), Nematoda, Acariformes, and Araneae (Arthropoda). Less dramatic deviations of mitochondrial tRNAs from the norm are observed throughout many other clades. Loss of arms in animal mitochondrial tRNA clearly is a phenomenon that occurred independently many times, not limited to a small number of specific clades. The summary data here provide a starting point for systematic investigations into the detailed evolutionary processes of structural reduction and loss of mitochondrial tRNAs as well as a resource for further improvements of annotation workflows for mitochondrial tRNA annotation.
The periodic table organizes chemical elements into families based on their similarity, now understood through Quantum Mechanics. However, these families were inferred from limited compounds in the nineteenth century. Since then, the number of compounds has exponentially grown, leading to the discovery of new types unknown to pioneers. This situation prompts the question of whether these families can still be discerned from the amassed data, or if their recognition is confined to specific thermochemical domains. To address this inquiry, we conducted a comprehensive exploration by comparing formulae (1771–2015) as a proxy for chemical similarity. Our findings reveal that stoichiometry not only captures a significant portion of the trends observed within families but also unveils other intriguing features of the formal structure of chemical similarity. These patterns approach equivalence classes independent of thermochemical context and demonstrate high resilience. Temporal analysis demonstrates that, since approximately 1980, similarity is diminishing due to an increasing production of unique formulae for nearly all elements. Nevertheless, chemical families endure over time and they stand out as the most robust similarity patterns. Our analysis offers compelling evidence that any study will reach the same conclusions, provided there is a sufficient diversity in input compound data.
The accurate classification of non-coding RNA (ncRNA) sequences is pivotal for advanced non-coding genome annotation and analysis, a fundamental aspect of genomics that facilitates understanding of ncRNA functions and regulatory mechanisms in various biological processes. While traditional machine learning approaches have been employed for distinguishing ncRNA, these often necessitate extensive feature engineering. Recently, deep learning algorithms have provided advancements in ncRNA classification. This study presents BioDeepFuse, a hybrid deep learning framework integrating convolutional neural networks (CNN) or bidirectional long short-term memory (BiLSTM) networks with handcrafted features for enhanced accuracy. This framework employs a combination of k-mer one-hot, k-mer dictionary, and feature extraction techniques for input representation. Extracted features, when embedded into the deep network, enable optimal utilization of spatial and sequential nuances of ncRNA sequences. Using benchmark datasets and real-world RNA samples from bacterial organisms, we evaluated the performance of BioDeepFuse. Results exhibited high accuracy in ncRNA classification, underscoring the robustness of our tool in addressing complex ncRNA sequence data challenges. The effective melding of CNN or BiLSTM with external features heralds promising directions for future research, particularly in refining ncRNA classifiers and deepening insights into ncRNAs in cellular processes and disease manifestations. In addition to its original application in the context of bacterial organisms, the methodologies and techniques integrated into our framework can potentially render BioDeepFuse effective in various and broader domains.
Graphs have become widely used to represent and study social, biological, and technological systems. Statistical methods to analyze empirical graphs were proposed based on the graph's spectral density. However, their running time is cubic in the number of vertices, precluding direct application to large instances. Thus, efficient algorithms to calculate the spectral density become necessary. For sparse graphs, the cavity method can efficiently approximate the spectral density of locally treelike undirected and directed graphs. However, it does not apply to most empirical graphs because they have heterogeneous structures. Thus, we propose methods for undirected and directed graphs with heterogeneous structures using a new vertex's neighborhood definition and the cavity approach. Our methods' time and space complexities are O(|E|h_{max}^{3}t) and O(|E|h_{max}^{2}t), respectively, where |E| is the number of edges, h_{max} is the size of the largest local neighborhood of a vertex, and t is the number of iterations required for convergence. We demonstrate the practical efficacy by estimating the spectral density of simulated and real-world undirected and directed graphs.
High-order structures have been recognised as suitable models for systems going beyond the binary relationships for which graph models are appropriate. Despite their importance and surge in research on these structures, their random cases have been only recently become subjects of interest. One of these high-order structures is the oriented hypergraph, which relates couples of subsets of an arbitrary number of vertices. Here we develop the Erdős-Rényi model for oriented hypergraphs, which corresponds to the random realisation of oriented hyperedges of the complete oriented hypergraph. A particular feature of random oriented hypergraphs is that the ratio between their expected number of oriented hyperedges and their expected degree or size is 3/2 for large number of vertices. We highlight the suitability of oriented hypergraphs for modelling large collections of chemical reactions and the importance of random oriented hypergraphs to analyse the unfolding of chemistry.
Over the last quarter of a century it has become clear that RNA is much more than just a boring intermediate in protein expression. Ancient RNAs still appear in the core information metabolism and comprise a surprisingly large component in bacterial gene regulation. A common theme with these types of mostly small RNAs is their reliance of conserved secondary structures. Large-scale sequencing projects, on the other hand, have profoundly changed our understanding of eukaryotic genomes. Pervasively transcribed, they give rise to a plethora of large and evolutionarily extremely flexible non-coding RNAs that exert a vastly diverse array of molecule functions. In this chapter we provide a-necessarily incomplete-overview of the current state of comparative analysis of non-coding RNAs, emphasizing computational approaches as a means to gain a global picture of the modern RNA world.
Hepatitis C virus (HCV) is a plus-stranded RNA virus that often chronically infects liver hepatocytes and causes liver cirrhosis and cancer. These viruses replicate their genomes employing error-prone replicases. Thereby, they routinely generate a large 'cloud' of RNA genomes (quasispecies) which-by trial and error-comprehensively explore the sequence space available for functional RNA genomes that maintain the ability for efficient replication and immune escape. In this context, it is important to identify which RNA secondary structures in the sequence space of the HCV genome are conserved, likely due to functional requirements. Here, we provide the first genome-wide multiple sequence alignment (MSA) with the prediction of RNA secondary structures throughout all representative full-length HCV genomes. We selected 57 representative genomes by clustering all complete HCV genomes from the BV-BRC database based on k-mer distributions and dimension reduction and adding RefSeq sequences. We include annotations of previously recognized features for easy comparison to other studies. Our results indicate that mainly the core coding region, the C-terminal NS5A region, and the NS5B region contain secondary structure elements that are conserved beyond coding sequence requirements, indicating functionality on the RNA level. In contrast, the genome regions in between contain less highly conserved structures. The results provide a complete description of all conserved RNA secondary structures and make clear that functionally important RNA secondary structures are present in certain HCV genome regions but are largely absent from other regions. Full-genome alignments of all branches of Hepacivirus C are provided in the supplement.
Segmentations are partitions of an ordered set into non-overlapping intervals. The Consensus Segmentation or Segmentation Aggregation problem is a special case of the median problems with applications in time series analysis and computational biology. A wide range of dissimilarity measures for segmentations can be expressed in terms of potentials, a special type of set-functions. In this contribution, we shed more light on the properties of potentials, and how such properties affect the solutions of the Consensus Segmentation problem. In particular, we disprove a conjecture stated in 2021, and we provide further insights into the theoretical foundations of the problem.
Genetic Algorithms are a powerful method to solve optimization problems with complex cost functions over vast search spaces that rely in particular on recombining parts of previous solutions. Crossover operators play a crucial role in this context. Here, we describe a large class of these operators designed for searching over spaces of graphs. These operators are based on introducing small cuts into graphs and rejoining the resulting induced subgraphs of two parents. This form of cut-and-join crossover can be restricted in a consistent way to preserve local properties such as vertex-degrees (valency), or bond-orders, as well as global properties such as graph-theoretic planarity. In contrast to crossover on strings, cut-and-join crossover on graphs is powerful enough to ergodically explore chemical space even in the absence of mutation operators. Extensive benchmarking shows that the offspring of molecular graphs are again plausible molecules with high probability, while at the same time crossover drastically increases the diversity compared to initial molecule libraries. Moreover, desirable properties such as favorable indices of synthesizability are preserved with sufficient frequency that candidate offsprings can be filtered efficiently for such properties. As an application we utilized the cut-and-join crossover in REvoLd, a GA-based system for computer-aided drug design. In optimization runs searching for ligands binding to four different target proteins we consistently found candidate molecules with binding constants exceeding the best known binders as well as candidates found in make-on-demand libraries. Taken together, cut-and-join crossover operators constitute a mathematically simple and well-characterized approach to recombination of molecules that performed very well in real-life CADD tasks.