
ArchiveII

ArchiveII is a dataset of RNA sequences and their secondary structures, widely used in RNA secondary structure prediction benchmarks.

ArchiveII contains 2975 RNA samples across 10 RNA families, with sequence lengths ranging from 28 to 2968 nucleotides. This dataset is frequently used to evaluate RNA secondary structure prediction methods, including those that handle both pseudoknotted and non-pseudoknotted structures.

It is considered complementary to the RNAStrAlign dataset.

Disclaimer

This is an UNOFFICIAL release of the ArchiveII dataset by Mehdi Saman Booy et al.

The team that released ArchiveII did not write this dataset card; it was written by the MultiMolecule team.

Dataset Description

Example Entry

id sequence secondary_structure family
16S_rRNA-A.fulgidus AUUCUGGUUGAUCCUGCCAGAGGCCGCUGCUA… …(((((…(((.))))).((((((((((.... 16S_rRNA

Column Description

  • id: A unique identifier for each RNA entry. This ID is derived from the family and the original .sta file name, and serves as a reference to the specific RNA structure within the dataset.

  • sequence: The nucleotide sequence of the RNA molecule, represented using the standard RNA bases:

    • A: Adenine
    • C: Cytosine
    • G: Guanine
    • U: Uracil
  • secondary_structure: The secondary structure of the RNA represented in dot-bracket notation, using the following symbols to indicate base pairing and unpaired regions, as per bpRNA’s standard:

    • Dots (.): Represent unpaired nucleotides.
    • Parentheses (( and )): Represent base pairs in standard stems (page 1).
  • family: The RNA family to which the sequence belongs, such as 16S rRNA, 5S rRNA, etc.
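The dot-bracket notation described above can be decoded into explicit base pairs with a stack. The following is a minimal sketch, not part of the dataset's tooling: the function name `dot_bracket_to_pairs` is hypothetical, the input string is a toy example rather than a dataset entry, and only the symbols listed above (dots and parentheses) are handled.

```python
from typing import List, Tuple


def dot_bracket_to_pairs(structure: str) -> List[Tuple[int, int]]:
    """Return sorted 0-based (i, j) index pairs for each matched '(' and ')'.

    Hypothetical helper for illustration; any symbol other than
    '.', '(' and ')' is rejected.
    """
    open_positions: List[int] = []  # indices of '(' still waiting for a partner
    pairs: List[Tuple[int, int]] = []
    for j, symbol in enumerate(structure):
        if symbol == "(":
            open_positions.append(j)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {j}")
            # The most recently opened '(' pairs with this ')'.
            pairs.append((open_positions.pop(), j))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {j}")
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return sorted(pairs)


if __name__ == "__main__":
    # Toy structure: two nested base pairs around a two-nucleotide loop.
    print(dot_bracket_to_pairs("((..))."))
```

Each returned pair gives the positions of two nucleotides that are base-paired; positions that appear in no pair correspond to dots.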

Variations

In addition to the main dataset, two length-capped variants are available:

  • archiveii: The main ArchiveII dataset.
  • archiveii.512: ArchiveII dataset with sequences no longer than 512 nucleotides.
  • archiveii.1024: ArchiveII dataset with sequences no longer than 1024 nucleotides.

Related datasets:

  • RNAStrAlign: A database of RNA secondary structures with the same families as ArchiveII, usually used for training.
  • bpRNA-spot: Another commonly used database in RNA secondary structure prediction.
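The length caps of the archiveii.512 and archiveii.1024 variants can be illustrated with a simple filter on sequence length. This is a sketch under stated assumptions: the card does not say how the released variants were actually built, the helper name `filter_by_max_length` is hypothetical, and the records below are toy data (not real ArchiveII entries) that only mimic the `id` and `sequence` columns.

```python
from typing import Dict, List


def filter_by_max_length(
    entries: List[Dict[str, str]], max_length: int
) -> List[Dict[str, str]]:
    """Keep entries whose sequence is no longer than max_length nucleotides."""
    return [entry for entry in entries if len(entry["sequence"]) <= max_length]


if __name__ == "__main__":
    # Toy records with made-up ids and placeholder sequences.
    toy_entries = [
        {"id": "toy-short", "sequence": "A" * 28},
        {"id": "toy-medium", "sequence": "A" * 512},
        {"id": "toy-long", "sequence": "A" * 1025},
    ]
    for cap in (512, 1024):
        kept = filter_by_max_length(toy_entries, cap)
        print(cap, [entry["id"] for entry in kept])
```

The comparison is inclusive, matching the wording "no longer than": a sequence of exactly 512 nucleotides is kept under the 512 cap.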

License

This dataset is licensed under the AGPL-3.0 License.

SPDX-License-Identifier: AGPL-3.0-or-later

Citation

BibTeX
@article{samanbooy2022rna,
  author    = {Saman Booy, Mehdi and Ilin, Alexander and Orponen, Pekka},
  journal   = {BMC Bioinformatics},
  keywords  = {Deep learning; Pseudoknotted structures; RNA structure prediction},
  month     = feb,
  number    = 1,
  pages     = {58},
  publisher = {Springer Science and Business Media LLC},
  title     = {{RNA} secondary structure prediction with convolutional neural networks},
  volume    = 23,
  year      = 2022
}