EternaBench-CM
EternaBench-CM is a synthetic RNA dataset comprising 12,711 RNA constructs that have been chemically mapped using SHAPE and MAP-seq methods. These RNA sequences are probed to obtain experimental data on their nucleotide reactivity, which indicates whether specific regions of the RNA are flexible or structured. The dataset provides high-resolution, large-scale data that can be used for studying RNA folding and stability.
Disclaimer
This is an UNOFFICIAL release of EternaBench-CM by Hannah K. Wayment-Steele, et al.
The team releasing EternaBench-CM did not write a dataset card for it, so this card has been written by the MultiMolecule team.
Dataset Description
- Homepage: https://multimolecule.danling.org/datasets/eternabench_cm
- datasets: https://huggingface.co/datasets/multimolecule/eternabench-cm
- Point of Contact: Rhiju Das
The dataset includes a large set of synthetic RNA sequences with experimental chemical mapping data, which provides a quantitative readout of RNA nucleotide reactivity. These data are ensemble-averaged and serve as a critical benchmark for evaluating secondary structure prediction algorithms in their ability to model RNA folding dynamics.
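As a rough illustration of how such a benchmark comparison could be set up, the sketch below scores a predictor by correlating its per-nucleotide unpaired probabilities with measured reactivity. The choice of Pearson correlation, the `pearson` helper, and all numeric values are assumptions made for illustration; they are not taken from the dataset or from this card.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy values, NOT taken from the dataset.
reactivity = [0.9, 1.2, 0.1, 0.05, 0.8]
# Unpaired probabilities from some hypothetical structure predictor.
p_unpaired = [0.8, 0.9, 0.2, 0.1, 0.7]

score = pearson(reactivity, p_unpaired)
print(f"Pearson r = {score:.3f}")
```

A higher correlation would suggest that the predictor's ensemble view agrees better with the ensemble-averaged measurement; a rank-based metric could be swapped in if the relationship is not expected to be linear.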
Example Entry
index | design | sequence | secondary_structure | reactivity | errors | signal_to_noise |
---|---|---|---|---|---|---|
769337-1 | d+m plots weaker again | GGAAAAAAAAAAA… | ................ | [0.642,1.4853,0.1629, …] | [0.3181,0.4221,0.1823, …] | 3.227 |
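Rows shaped like the one above can be screened on their signal_to_noise value before use. The following is a minimal sketch, assuming each row is available as a Python dict keyed by column name; the toy rows, the `filter_by_snr` helper, and the threshold of 1.0 are all hypothetical and not prescribed by the dataset.

```python
# Hypothetical toy rows shaped like the example entry; values are illustrative only.
rows = [
    {
        "design": "toy-high",
        "sequence": "GGAAAC",
        "reactivity": [0.6, 1.4, 0.2, 0.9, 0.3, 0.1],
        "errors": [0.3, 0.4, 0.2, 0.3, 0.1, 0.1],
        "signal_to_noise": 3.2,
    },
    {
        "design": "toy-low",
        "sequence": "GGUUUC",
        "reactivity": [0.1, 0.2, 0.1, 0.3, 0.1, 0.2],
        "errors": [0.4, 0.5, 0.3, 0.6, 0.4, 0.5],
        "signal_to_noise": 0.4,
    },
]


def filter_by_snr(rows, min_snr=1.0):
    """Keep rows whose signal_to_noise is at least min_snr (assumed threshold)."""
    kept = []
    for row in rows:
        # Each error value is expected to match one reactivity value.
        if len(row["reactivity"]) != len(row["errors"]):
            raise ValueError(f"length mismatch in row {row['design']!r}")
        if row["signal_to_noise"] >= min_snr:
            kept.append(row)
    return kept


kept = filter_by_snr(rows)
print([row["design"] for row in kept])
```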
Column Description
- id: A unique identifier for each RNA sequence entry.
- design: The name given to each RNA design by contributors, used for easy reference.
- sequence: The nucleotide sequence of the RNA molecule, represented using the standard RNA bases:
  - A: Adenine
  - C: Cytosine
  - G: Guanine
  - U: Uracil
- secondary_structure: The secondary structure of the RNA in dot-bracket notation, using dots for unpaired nucleotides and up to three types of brackets for base pairs, as per bpRNA’s standard:
  - Dots (.): Represent unpaired nucleotides.
  - Parentheses (( and )): Represent base pairs in standard stems (page 1).
  - Square brackets ([ and ]): Represent base pairs in pseudoknots (page 2).
  - Curly braces ({ and }): Represent base pairs in additional pseudoknots (page 3).
- reactivity: A list of normalized reactivity values, one per nucleotide, indicating how likely each nucleotide is to be unpaired. High reactivity indicates high flexibility (unpaired regions), and low reactivity corresponds to paired or structured regions.
- errors: A list of floating-point numbers giving the experimental error of each corresponding value in reactivity. These values quantify the uncertainty in the reactivity measurements.
- signal_to_noise: The signal-to-noise ratio calculated from the reactivity and error values, providing a measure of data quality.
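The columns above can be combined in a few lines of code. The sketch below parses a dot-bracket string with the three bracket types into base pairs, then splits reactivity values into paired and unpaired groups. The helper names and the toy inputs are hypothetical, and the code assumes that reactivity value i refers to position i of the structure, which this card does not state explicitly.

```python
# Map each closing bracket to its opening partner (the three bracket "pages").
CLOSERS = {")": "(", "]": "[", "}": "{"}


def parse_dot_bracket(structure):
    """Return a sorted list of zero-based (i, j) base pairs."""
    stacks = {opener: [] for opener in CLOSERS.values()}
    pairs = []
    for pos, symbol in enumerate(structure):
        if symbol == ".":
            continue  # unpaired nucleotide
        if symbol in stacks:
            stacks[symbol].append(pos)
        elif symbol in CLOSERS:
            stack = stacks[CLOSERS[symbol]]
            if not stack:
                raise ValueError(f"unmatched {symbol!r} at position {pos}")
            pairs.append((stack.pop(), pos))
        else:
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    for opener, stack in stacks.items():
        if stack:
            raise ValueError(f"unmatched {opener!r} at position {stack[-1]}")
    return sorted(pairs)


def mean_reactivity_by_state(structure, reactivity):
    """Return (mean paired reactivity, mean unpaired reactivity).

    Assumes reactivity[i] belongs to structure[i]; a group with no
    values yields None.
    """
    paired_positions = {i for pair in parse_dot_bracket(structure) for i in pair}
    paired, unpaired = [], []
    for pos, value in enumerate(reactivity):
        (paired if pos in paired_positions else unpaired).append(value)

    def mean(values):
        return sum(values) / len(values) if values else None

    return mean(paired), mean(unpaired)


# Hypothetical toy inputs, not taken from the dataset.
toy_pairs = parse_dot_bracket("((..[[..))..]]")
toy_means = mean_reactivity_by_state("(..)", [0.1, 0.9, 1.1, 0.3])
print(toy_pairs)
print(toy_means)
```

Keeping one stack per bracket type is what lets pseudoknotted pairs cross the standard stems without being mismatched.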
Related Datasets
- eternabench-switch
- eternabench-external.1200: EternaBench-External dataset with maximum sequence length of 1200 nucleotides.
License
This dataset is licensed under the AGPL-3.0 License.