========
Appendix
========

Pre-processing miRBase data:
----------------------------

The ``MIRfix`` pipeline :cite:`Yazbeck:19a` provides the general core of
functions to curate *bona fide* metazoan microRNA annotations. To make use of
this curation process, it is fundamental to organize the input data in a
specific format, as referenced in more detail in :cite:`Yazbeck:19a`. In
summary, it is required:

 - A set of precursor sequences with their associated mature sequences.
 - Genome sequences from which miRNAs were annotated.
 - A relation file that describes the relation between precursors and their
   annotated matures.

Additional parameters are required, but they did not depend from external
information/databases.

On ``miRNAture`` the source of the curation data has been obtained from a re-evaluation
of the annotations deposited on ``miRBase`` v.22.1 :cite:`Kozomara:19`. In this
version, ``miRBase`` accounted for a set of X *canonical* and *non-canonical* miRNA families,
from which Y are constituted by metazoan sequences. Internally, ``miRNAture``
performs an evaluation of the *canonical* model, that relies on the correct
positioning cleavages performed by Drosha and Dicer at the precursor maturation steps. 
Computationally, this is translated on a correct position of the mature sequence, 
accurate delimitation of precursor, and a phylogenetic support, addressed by the 
construction of family *mature-anchored* structural alignments. As previously
reported for the ``Rfam`` miRNA families :cite:`VelandiaDiss:2022`, an iterative
assessment involves a selection of sequences, consistent criteria to evaluate
the miRNAs and their mature products, and generation of probabilistic models
derived from anchored-alignments to search additional candidates that would
incorporate defined curation criteria. This criteria was inherited to perform an 
evaluation of the ``miRBase`` metazoan families and generate the corrected
dataset that ``miRNAture`` uses to evaluate new candidates and their maturation
entities.

As a toy example, the family miR-17 (MIPF0000001) was selected to demonstrate
the assessment steps performed over all pre-calculated dataset used by `miRNAture`.
As reported in ``miRBase`` this family is composed by 239 miRNA precursors derived from 
39 vertebrate species. Through the filtering approach the following subsetting steps are
considered:

- Remove non-metazoan sequences.
- Filter duplicates (which share 100% identity) and select one representant
  sequence.

In this family, 117 duplicated sequences where recognized. For instance the
sequence bta-mir-18a (MI0004740) from *Bos taurus* has shown 21 orthologs, as follows:

.. _table_dup:

.. figure:: ./table_MI0004740.png
   :width: 600
   :align: center
   :alt: Orthologs in other species
   :figclass: align-center

   Identified orthologs from bta-mir-18a on vertebrates.

And some corresponding alignments:

.. _align_dup:

.. figure:: ./alignments_MI0004740.png
   :width: 600
   :align: center
   :alt: Orthologs in other species
   :figclass: align-center
   
   Alignments as evidence of 100% identity.

Remaining 122 families were subject of a structural assessment by ``MIRfix``,
which filtered 4 sequences based on the incorrect miRNA folding in regard their
annotated mature sequences, and one sequence contained a bad positioned mature
sequence in the reported precursor, a successful extension of the precursor based
on the miR and miR* prediction, rescued the candidate.

================================ ==============================================
  Category                           Accession numbers
================================ ==============================================
Bad position mature sequences    MI0004822
Filtered sequences               MI0012797, MI0012947, MI0019542, and MI0013837
================================ ==============================================

At the end of the assessment 118 sequences passed all filters to be considered
into the curation dataset used on ``miRNAture``. 

The same approach curated all metazoan miRNA families from ``miRBase`` (1415),
validating about 79% (1111) of the families and setting the curation dataset used on
``miRNAture``.


Construction of Hidden Markov and Covariance Models:
----------------------------------------------------

As described in :cite:`Velandia:2021`, a set of quality-filtering steps could be
used to construct family structural alignments and their corresponding
covariance models (CMs). In this case, to build new structural alignments from
``miRBase`` sequences, we selected all sequences from metazoan
species and removed all of those from studied organisms. Given that curated
subset, a genetic algorithm was used to maximize the quality the final structural
alignment. To do so, filtering miRNA sequences was done in function of: Identity
percentage (:math:`I`), phylogenetic distribution of sequences (:math:`C`) and quality
(:math:`Q`) [#f3]_, where: :math:`I = (60, 70, 80, 90, 100)`, :math:`C =` (Metazoa, Vertebrata,
Mammalia, Primates) and :math:`Q =` (normal, high). An individual :math:`A_{n}` was
defined as a vector :math:`\overrightarrow{A_{n}}= \begin{pmatrix} I \\ C \\ Q
\end{pmatrix}`, which return a structural alignment using ``MIRfix``, using selected
sequences. The *fitness* function (:math:`F`) to be maximized was defined through
empirical observation over features inferred from generated structural alignment, 
as follows:

.. math::

   F = (N_{seq} + (N_{spe} * (-F_{energy})) + (N_{parts} * 10))

Where :math:`N_{seq}` is the final number of sequences, :math:`N_{spe}` is the number of species,
:math:`F_{energy}` corresponds to folding energy calculated using
``RNAalifold`` :cite:`Lorenz2011` and :math:`N_{parts}` accounts the number of
additional (:math:`> 1`) stem-loops on the reported consensus structure. The initial
population was :math:`A_{p}=40`, used operators were: *Selection* = Tournament,
:math:`n=39`; *Crossover* = Single point, probability=0.7; *Mutation* =
Displacement mutation, probability=0.1. The implementation were performed in
``Python`` v3.7.9 using ``deap`` package :cite:`Fortin:2012`. 

Finally, hidden Markov (HMMs) and covariance (CMs) models were build as
described in :cite:`Velandia:2021` using ``RNAalifold`` :cite:`Lorenz2011`
and ``Infernal`` package v.1.1.2 :cite:`Nawrocki:2013`.

.. rubric:: Footnotes

.. [#f3] Confidence of the annotation assigned by ``miRBase``, see `https://www.mirbase.org/blog/2014/03/high-confidence-micrornas/`