RNAStrAlign¶
RNAStrAlign is a comprehensive dataset of RNA sequences and their secondary structures.
RNAStrAlign aggregates data from multiple established RNA structure repositories, covering diverse RNA families such as 5S ribosomal RNA, tRNA, and group I introns.
It is considered complementary to the ArchiveII dataset.
Disclaimer¶
This is an UNOFFICIAL release of the RNAStrAlign dataset by Zhen Tan, et al.
The team releasing RNAStrAlign did not write a dataset card for it, so this dataset card has been written by the MultiMolecule team.
Dataset Description¶
- Homepage: https://multimolecule.danling.org/datasets/rnastralign
- datasets: https://huggingface.co/datasets/multimolecule/rnastralign
- Point of Contact: David H. Mathews and Gaurav Sharma
Example Entry¶
id | sequence | secondary_structure | family | subfamily |
---|---|---|---|---|
16S_rRNA-Actinobacteria-AB002635 | ACACAUGCAAGCGAACGUGAUCUCCAGCUUGC… | .(((.(((..((..((((.(((((.((....)… | 16S_rRNA | Actinobacteria |
Column Description¶
- id: A unique identifier for each RNA entry. This ID is derived from the family and the original `.sta` file name, and serves as a reference to the specific RNA structure within the dataset.
- sequence: The nucleotide sequence of the RNA molecule, represented using the standard RNA bases:
  - A: Adenine
  - C: Cytosine
  - G: Guanine
  - U: Uracil
- secondary_structure: The secondary structure of the RNA represented in dot-bracket notation, using up to three types of symbols to indicate base pairing and unpaired regions, as per bpRNA’s standard:
  - Dots (`.`): Represent unpaired nucleotides.
  - Parentheses (`(` and `)`): Represent base pairs in standard stems (page 1).
- family: The RNA family to which the sequence belongs, such as 16S rRNA, 5S rRNA, etc.
- subfamily: A more specific subfamily within the family, such as Actinobacteria for 16S rRNA. Not all families have subfamilies, in which case this field will be `None`.
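The dot-bracket convention described for secondary_structure can be decoded with a simple stack. The following is a minimal sketch, not part of the dataset tooling: the function names `parse_dot_bracket` and `is_valid_entry` are hypothetical, and it assumes that only the symbols listed above (dots and parentheses) and only the four listed bases occur. Real entries may contain other characters, in which case the checks would need to be relaxed.

```python
from typing import List, Tuple

# Assumption: only the four bases listed in the column description occur.
RNA_BASES = set("ACGU")


def parse_dot_bracket(structure: str) -> List[Tuple[int, int]]:
    """Return sorted 0-based (i, j) index pairs for each matched '(' and ')'."""
    stack: List[int] = []
    pairs: List[Tuple[int, int]] = []
    for j, symbol in enumerate(structure):
        if symbol == "(":
            # Remember the opening position until its partner is found.
            stack.append(j)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {j}")
            pairs.append((stack.pop(), j))
        elif symbol != ".":
            # Assumption: anything other than '.', '(' and ')' is rejected.
            raise ValueError(f"unexpected symbol {symbol!r} at position {j}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


def is_valid_entry(sequence: str, secondary_structure: str) -> bool:
    """Check that sequence and structure have equal length and are well formed."""
    if len(sequence) != len(secondary_structure):
        return False
    if not set(sequence) <= RNA_BASES:
        return False
    try:
        parse_dot_bracket(secondary_structure)
    except ValueError:
        return False
    return True


# Toy hairpin made up for illustration; it is not taken from the dataset.
toy_sequence = "GGGAAAUCCC"
toy_structure = "(((....)))"
print(parse_dot_bracket(toy_structure))  # [(0, 9), (1, 8), (2, 7)]
print(is_valid_entry(toy_sequence, toy_structure))  # True
```

Each returned pair gives the positions of two nucleotides that form a base pair, and every position that appears in no pair corresponds to a dot in the notation.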
Variations¶
This dataset is available in three variants, the main dataset and two length-restricted subsets:
- rnastralign: The main RNAStrAlign dataset.
- rnastralign.512: RNAStrAlign dataset with sequences no longer than 512 nucleotides.
- rnastralign.1024: RNAStrAlign dataset with sequences no longer than 1024 nucleotides.
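The length-restricted variants can be pictured as a filter over the main dataset. The sketch below is illustrative only: `filter_by_max_length` is a hypothetical helper, the records are made-up toy entries, and it assumes that the `.512` and `.1024` variants differ from the main dataset solely by the sequence-length cap stated above.

```python
from typing import Dict, Iterable, List


def filter_by_max_length(entries: Iterable[Dict[str, str]], max_length: int) -> List[Dict[str, str]]:
    """Keep only entries whose sequence is no longer than max_length nucleotides."""
    return [entry for entry in entries if len(entry["sequence"]) <= max_length]


# Toy records made up for illustration; they are not taken from the dataset.
toy_entries = [
    {"id": "toy-short", "sequence": "ACGU"},
    {"id": "toy-512", "sequence": "A" * 512},
    {"id": "toy-513", "sequence": "A" * 513},
    {"id": "toy-1025", "sequence": "A" * 1025},
]

subset_512 = filter_by_max_length(toy_entries, 512)
subset_1024 = filter_by_max_length(toy_entries, 1024)
print([entry["id"] for entry in subset_512])
print([entry["id"] for entry in subset_1024])
```

Under this assumption the cap is inclusive, so a sequence of exactly 512 nucleotides belongs to both length-restricted variants.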
Related Datasets¶
- ArchiveII: A database of RNA secondary structures with the same families as RNAStrAlign, usually used for testing.
- bpRNA-spot: Another commonly used database in RNA secondary structure prediction.
License¶
This dataset is licensed under the AGPL-3.0 License.