bpRNA-spot¶
bpRNA-spot is a collection of the datasets used by SPOT-RNA for RNA secondary structure prediction.
The dataset is released as a composite repository, bpRNA-spot, and three numbered component repositories:
- bpRNA-spot-0: the initial bpRNA split, TR0, VL0, and TS0.
- bpRNA-spot-1: the PDB transfer-learning split, TR1, VL1, and TS1.
- bpRNA-spot-2: the NMR-only evaluation split, TS2.
bpRNA-spot concatenates the components in order:
- train: TR0 + TR1
- validation: VL0 + VL1
- test: TS0 + TS1 + TS2
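The concatenation order can be illustrated with a minimal sketch. The records here are hypothetical placeholders, and `COMPOSITE` and `build_composite` are names invented for this example rather than part of any released tooling:

```python
from itertools import chain

# Hypothetical stand-ins for the component splits; real records carry the
# columns listed in the schema.
components = {
    "TR0": [{"id": "tr0-a"}, {"id": "tr0-b"}],
    "TR1": [{"id": "tr1-a"}],
    "VL0": [{"id": "vl0-a"}],
    "VL1": [{"id": "vl1-a"}],
    "TS0": [{"id": "ts0-a"}],
    "TS1": [{"id": "ts1-a"}],
    "TS2": [{"id": "ts2-a"}],
}

# Which components feed each composite split, in concatenation order.
COMPOSITE = {
    "train": ["TR0", "TR1"],
    "validation": ["VL0", "VL1"],
    "test": ["TS0", "TS1", "TS2"],
}


def build_composite(components):
    """Concatenate the component splits in the documented order."""
    return {
        split: list(chain.from_iterable(components[name] for name in names))
        for split, names in COMPOSITE.items()
    }


composite = build_composite(components)
```

Because the order is fixed, the TS2 records always sit at the end of the composite test split.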
The TR0/VL0/TS0 split is a subset of bpRNA-1m.
The split is built by running CD-HIT (CD-HIT-EST) on bpRNA-1m to remove sequences with more than 80% sequence similarity.
The remaining sequences are then randomly divided into training, validation, and test sets at a ratio of approximately 8:1:1.
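The random 8:1:1 division can be sketched as follows. This is only an illustration of the idea, not the original split procedure: the function name `random_split`, the seed, and the rounding rule are assumptions, and the CD-HIT-EST filtering step is assumed to have already been applied to the input.

```python
import random


def random_split(items, ratios=(8, 1, 1), seed=0):
    """Shuffle items and cut them into train/validation/test by ratio.

    Sizes are rounded down for train and validation; the test set takes
    whatever remains, so the ratio is only approximate.
    """
    items = list(items)
    rng = random.Random(seed)  # local generator, so the split is reproducible
    rng.shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


# Usage: split 100 hypothetical sequence indices.
train, val, test = random_split(range(100), seed=1)
```

The three parts are disjoint by construction, since each item appears exactly once in the shuffled list.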
The TR1/VL1/TS1 split contains high-resolution PDB RNAs used for transfer learning.
TS2 contains 39 RNAs solved by NMR and is used for post-training evaluation.
All secondary structures are stored as dot-bracket notation.
For the sequence/label splits, base pairs that would make a nucleotide pair with multiple partners are removed before converting to dot-bracket notation.
Non-A/C/G/U symbols in those sequence files are normalized to N.
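The two clean-up steps above can be sketched in a few lines. The helper names are hypothetical, and two details are assumptions rather than documented behavior: every pair touching a nucleotide with more than one partner is dropped (the card does not say whether one of the conflicting pairs is kept), and sequences are taken to be uppercase, so `T` and lowercase letters also become `N`.

```python
import re
from collections import Counter


def normalize_sequence(sequence):
    """Replace every symbol other than A, C, G, or U with N."""
    return re.sub(r"[^ACGU]", "N", sequence)


def drop_multi_partner_pairs(pairs):
    """Keep only pairs whose two positions each appear in exactly one pair.

    Assumption: all pairs involving a multi-partner nucleotide are removed.
    """
    # Normalize orientation and remove exact duplicates before counting.
    unique = sorted({(min(i, j), max(i, j)) for i, j in pairs})
    counts = Counter(pos for pair in unique for pos in pair)
    return [(i, j) for i, j in unique if counts[i] == 1 and counts[j] == 1]


# Usage: position 1 pairs with both 8 and 7, so both of those pairs are dropped.
clean_pairs = drop_multi_partner_pairs([(0, 9), (1, 8), (1, 7), (2, 6)])
clean_seq = normalize_sequence("ACGTXU")
```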
Schema¶
| Column | Description |
|---|---|
| id | Identifier of the sequence. |
| sequence | RNA sequence. |
| secondary_structure | Secondary structure in dot-bracket notation. Pseudoknots may use bracket tiers beyond (). |
| structural_annotation | bpRNA-style structural annotation generated from the stored dot-bracket structure. |
| functional_annotation | bpRNA-style functional annotation generated from the stored dot-bracket structure. |
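To work with the `secondary_structure` column, a dot-bracket string can be decoded into base pairs by keeping one stack per bracket tier. The sketch below is not the dataset's own tooling. It assumes the commonly used extended tiers `[]`, `{}`, `<>`, and matching upper/lowercase letters (`A`/`a`); which tiers actually occur in this dataset is not stated here, and `dot_bracket_to_pairs` is a name chosen for this example.

```python
CLOSE_TO_OPEN = {")": "(", "]": "[", "}": "{", ">": "<"}
OPENERS = set(CLOSE_TO_OPEN.values())


def dot_bracket_to_pairs(structure):
    """Return sorted (i, j) base pairs (0-indexed) from a dot-bracket string."""
    stacks = {}  # one stack of open positions per bracket tier
    pairs = []
    for index, symbol in enumerate(structure):
        if symbol == ".":
            continue  # unpaired nucleotide
        if symbol in OPENERS or symbol.isupper():
            stacks.setdefault(symbol, []).append(index)
        elif symbol in CLOSE_TO_OPEN or symbol.islower():
            # Letter tiers close with the lowercase form of their opener.
            opener = CLOSE_TO_OPEN.get(symbol, symbol.upper())
            if not stacks.get(opener):
                raise ValueError(f"unmatched {symbol!r} at position {index}")
            pairs.append((stacks[opener].pop(), index))
        else:
            raise ValueError(f"unexpected symbol {symbol!r} at position {index}")
    if any(stacks.values()):
        raise ValueError("structure has unclosed brackets")
    return sorted(pairs)


# Usage: a small pseudoknot where the [] tier crosses the () tier.
pairs = dot_bracket_to_pairs("((..[[..))..]]")
```

Keeping a separate stack per tier is what lets crossing pairs decode correctly; a single shared stack would mispair the pseudoknot.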
Disclaimer¶
This is an UNOFFICIAL release of bpRNA-spot by Jaswinder Singh, et al.
The team releasing bpRNA-spot did not write a dataset card for it, so this dataset card has been written by the MultiMolecule team.
Dataset Description¶
- Homepage: https://multimolecule.danling.org/datasets/bprna-spot
- datasets: https://huggingface.co/datasets/multimolecule/bprna-spot
- Point of Contact: Kuldip Paliwal and Yaoqi Zhou
Related Datasets¶
- bpRNA-1m: A database of single molecule secondary structures annotated using bpRNA.
- bpRNA-new: A dataset of newly discovered RNA families from Rfam 14.2, designed for cross-family validation to assess generalization capability.
- RNAStrAlign: A database of RNA secondary structures with the same families as ArchiveII, usually used for training.
License¶
This dataset is licensed under the GNU Affero General Public License.
For additional terms and clarifications, please refer to our License FAQ.
Citation¶
Note
The artifacts distributed in this repository are part of the MultiMolecule project. If MultiMolecule supports your research, please cite the MultiMolecule project as follows: