RYOS¶

RYOS is a database of RNA backbone stability in aqueous solution.

RYOS focuses on exploring the stability of mRNA molecules for vaccine applications. This dataset is part of a broader effort to address one of the key challenges of mRNA vaccines: degradation during shipping and storage.

Statement¶

"Deep learning models for predicting RNA degradation via dual crowdsourcing" is published in Nature Machine Intelligence, a Closed Access / Author-Fee journal.

Machine learning has been at the forefront of the movement for free and open access to research.

We see no role for closed access or author-fee publication in the future of machine learning research and believe the adoption of these journals as an outlet of record for the machine learning community would be a retrograde step.

The MultiMolecule team is committed to the principles of open access and open science.

We do NOT endorse the publication of manuscripts in Closed Access / Author-Fee journals and encourage the community to support Open Access journals and conferences.

Please consider signing the Statement on Nature Machine Intelligence.

Disclaimer¶

This is an UNOFFICIAL release of the RYOS dataset by Hannah K. Wayment-Steele et al.

The team releasing RYOS did not write a dataset card for it, so this dataset card has been written by the MultiMolecule team.

Dataset Description¶

Homepage: https://multimolecule.danling.org/datasets/ryos
Point of Contact: Rhiju Das
Kaggle Challenge: https://www.kaggle.com/competitions/stanford-covid-vaccine
Eterna Round 1: https://eternagame.org/labs/9830365
Eterna Round 2: https://eternagame.org/labs/10207059

Example Entry¶

id	design	sequence	secondary_structure	reactivity	errors_reactivity	signal_to_noise_reactivity	deg_pH10	errors_deg_pH10	signal_to_noise_deg_pH10	deg_50C	errors_deg_50C	signal_to_noise_deg_50C	deg_Mg_pH10	errors_deg_Mg_pH10	signal_to_noise_deg_Mg_pH10	deg_Mg_50C	errors_deg_Mg_50C	signal_to_noise_deg_Mg_50C	SN_filter
9830366	testing	GGAAAUUUGC…	.......(((…	[0.4167, 1.5941, 1.2359, …]	[0.1689, 0.2323, 0.193, …]	5.326	[1.5966, 2.6482, 1.3761, …]	[0.3058, 0.3294, 0.233, …]	4.198	[0.7885, 1.93, 2.0423, …]	[0.2773, 0.328, 0.3048, …]	3.746	[1.5966, 2.6482, 1.3761, …]	[0.3058, 0.3294, 0.233, …]	4.198	[0.7885, 1.93, 2.0423, …]	[0.2773, 0.328, 0.3048, …]	3.746	True

Column Description¶

id: A unique identifier for each RNA sequence entry.
design: The name given to each RNA design by contributors, used for easy reference.
sequence: The nucleotide sequence of the RNA molecule, represented using the standard RNA bases:
- A: Adenine
- C: Cytosine
- G: Guanine
- U: Uracil
secondary_structure: The secondary structure of the RNA represented in dot-bracket notation, using up to three types of symbols to indicate base pairing and unpaired regions, as per bpRNA’s standard:
- Dots (.): Represent unpaired nucleotides.
- Parentheses (( and )): Represent base pairs in standard stems (page 1).
- Square Brackets ([ and ]): Represent base pairs in pseudoknots (page 2).
- Curly Braces ({ and }): Represent base pairs in additional pseudoknots (page 3).
reactivity: A list of floating-point values that provide an estimate of the likelihood of the RNA backbone being cut at each nucleotide position. These values help determine the stability of the RNA structure under various experimental conditions.
deg_pH10 and deg_Mg_pH10: Arrays of degradation rates observed under two conditions: incubation at pH 10 without and with magnesium, respectively. These values provide insight into how different conditions affect the stability of RNA molecules.
deg_50C and deg_Mg_50C: Arrays of degradation rates after incubation at 50°C, without and with magnesium. These values capture how RNA sequences respond to elevated temperatures, which is relevant for storage and transportation conditions.
errors_* columns: Arrays of floating-point numbers giving the experimental errors for the corresponding measurements in the reactivity and deg_* columns. These values quantify the uncertainty in the degradation rates and reactivity measurements.
SN_filter: A filter applied to the dataset based on the signal-to-noise ratio, indicating whether a specific sequence meets the dataset’s quality criteria.

If the SN_filter is True, the sequence meets the quality criteria; otherwise, it does not.
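The dot-bracket notation described for secondary_structure can be turned into an explicit list of base pairs. The following is a minimal sketch, not the MultiMolecule implementation: the function name `parse_dot_bracket` is hypothetical, and it assumes only the three bracket types listed above plus dots appear in the string.

```python
from typing import List, Tuple

# Opening and closing symbols for the three bracket "pages" described above.
BRACKETS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = {close: open_ for open_, close in BRACKETS.items()}


def parse_dot_bracket(structure: str) -> List[Tuple[int, int]]:
    """Return sorted 0-indexed (i, j) base pairs; raise ValueError if unbalanced."""
    # One stack per bracket type, so pairs on different pages may cross.
    stacks = {open_: [] for open_ in BRACKETS}
    pairs = []
    for index, symbol in enumerate(structure):
        if symbol == ".":
            continue  # unpaired nucleotide
        if symbol in BRACKETS:
            stacks[symbol].append(index)
        elif symbol in CLOSERS:
            stack = stacks[CLOSERS[symbol]]
            if not stack:
                raise ValueError(f"unmatched {symbol!r} at position {index}")
            pairs.append((stack.pop(), index))
        else:
            raise ValueError(f"unexpected symbol {symbol!r} at position {index}")
    for open_, stack in stacks.items():
        if stack:
            raise ValueError(f"unmatched {open_!r} at position {stack[-1]}")
    return sorted(pairs)
```

Keeping a separate stack per bracket type is what lets a pseudoknot such as `([)]` parse cleanly, while a string like `(()` is rejected.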

Note that due to technical limitations, the ground truth measurements are not available for the final bases of each RNA sequence. To facilitate processing, all measurement arrays (reactivity, deg_pH10, deg_50C, deg_Mg_pH10, deg_Mg_50C and their corresponding error fields) are padded with None values to match the full sequence length. When working with this data, please be aware that the trailing elements of these arrays are padding values and do not represent actual measurements.
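When computing statistics, the trailing None padding usually needs to be separated from the measured prefix. A minimal sketch of one way to do this, assuming each measurement array is a plain Python list and that None appears only at the end (`split_measured` is a hypothetical helper, not part of any library):

```python
from typing import List, Optional, Tuple


def split_measured(values: List[Optional[float]]) -> Tuple[List[float], int]:
    """Return the leading measured values and the count of trailing None pads."""
    measured_length = len(values)
    # Walk back from the end until the last real measurement is found.
    while measured_length > 0 and values[measured_length - 1] is None:
        measured_length -= 1
    measured = values[:measured_length]
    # Assumption: padding is strictly trailing, so an interior None is an error.
    if any(value is None for value in measured):
        raise ValueError("None found before the trailing padding")
    return measured, len(values) - measured_length
```

For example, `split_measured([0.4, 1.5, None, None])` yields the two measured values and a padding count of 2.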

Variants¶

This dataset is available in two subsets:

RYOS-1: The RYOS dataset from round 1 of the Eterna RYOS lab. The sequence length for RYOS-1 is 107, and the label length is 68.
RYOS-2: The RYOS dataset from round 2 of the Eterna RYOS lab. The sequence length for RYOS-2 is 130, and the label length is 102.
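The sequence and label lengths above determine how many trailing positions of each measurement array hold padding rather than data. As a small worked example (the dictionary layout and the function name `padded_positions` are illustrative assumptions, not part of the dataset):

```python
# Lengths as stated for each variant above.
VARIANTS = {
    "RYOS-1": {"sequence_length": 107, "label_length": 68},
    "RYOS-2": {"sequence_length": 130, "label_length": 102},
}


def padded_positions(variant: str) -> int:
    """Number of trailing positions without ground-truth measurements."""
    lengths = VARIANTS[variant]
    return lengths["sequence_length"] - lengths["label_length"]
```

Under these lengths, RYOS-1 has 107 - 68 = 39 unmeasured trailing positions and RYOS-2 has 130 - 102 = 28.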

Preprocess¶

The MultiMolecule team preprocesses this dataset with the following steps:

Compute signal_to_noise by averaging all 5 signal_to_noise_* columns.
Remove all sequences whose signal_to_noise < 1.
Remove all sequences without a proper secondary structure (i.e., the brackets in the dot-bracket notation do not match).
Pad/truncate all chemical measurements to the sequence length.
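The steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the MultiMolecule team's actual pipeline: rows are assumed to be dictionaries keyed by the column names in this card, a "proper" structure is assumed to mean one whose length equals the sequence length and whose brackets are balanced, and the names `is_balanced`, `fit_to_length`, and `preprocess` are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Optional

MEASUREMENTS = ["reactivity", "deg_pH10", "deg_50C", "deg_Mg_pH10", "deg_Mg_50C"]
OPENER_OF = {")": "(", "]": "[", "}": "{"}


def is_balanced(structure: str) -> bool:
    """True if every bracket of each type has a partner of the same type."""
    depth = {"(": 0, "[": 0, "{": 0}
    for symbol in structure:
        if symbol in depth:
            depth[symbol] += 1
        elif symbol in OPENER_OF:
            if depth[OPENER_OF[symbol]] == 0:
                return False  # closing bracket with no open partner
            depth[OPENER_OF[symbol]] -= 1
        elif symbol != ".":
            return False  # symbol outside the notation described in this card
    return all(count == 0 for count in depth.values())


def fit_to_length(values: List[Optional[float]], length: int) -> List[Optional[float]]:
    """Pad with None or truncate so the list has exactly `length` items."""
    return (list(values) + [None] * length)[:length]


def preprocess(rows: List[Dict]) -> List[Dict]:
    kept = []
    for row in rows:
        row = dict(row)  # work on a copy, leave the input untouched
        # Step 1: average the five signal_to_noise_* columns.
        row["signal_to_noise"] = mean(
            [row[f"signal_to_noise_{name}"] for name in MEASUREMENTS]
        )
        # Step 2: drop low-quality sequences.
        if row["signal_to_noise"] < 1:
            continue
        # Step 3: drop sequences whose structure is not well formed (assumed meaning).
        structure = row["secondary_structure"]
        if len(structure) != len(row["sequence"]) or not is_balanced(structure):
            continue
        # Step 4: pad/truncate measurements and their errors to the sequence length.
        for name in MEASUREMENTS:
            for column in (name, f"errors_{name}"):
                row[column] = fit_to_length(row[column], len(row["sequence"]))
        kept.append(row)
    return kept
```

Counting depth per bracket type is enough for a balance check here because pairs on different pages are allowed to cross; recovering the actual pair indices would need a stack per type instead.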

License¶

This dataset is licensed under the GNU Affero General Public License.

For additional terms and clarifications, please refer to our License FAQ.

Text Only
`SPDX-License-Identifier: AGPL-3.0-or-later`

Citation¶

BibTeX
@article{waymentsteele2021deep,
  author  = {Wayment-Steele, Hannah K and Kladwang, Wipapat and Watkins, Andrew M and Kim, Do Soon and Tunguz, Bojan and Reade, Walter and Demkin, Maggie and Romano, Jonathan and Wellington-Oguri, Roger and Nicol, John J and Gao, Jiayang and Onodera, Kazuki and Fujikawa, Kazuki and Mao, Hanfei and Vandewiele, Gilles and Tinti, Michele and Steenwinckel, Bram and Ito, Takuya and Noumi, Taiga and He, Shujun and Ishi, Keiichiro and Lee, Youhan and {\"O}zt{\"u}rk, Fatih and Chiu, Anthony and {\"O}zt{\"u}rk, Emin and Amer, Karim and Fares, Mohamed and Participants, Eterna and Das, Rhiju},
  journal = {ArXiv},
  month   = oct,
  title   = {Deep learning models for predicting {RNA} degradation via dual crowdsourcing},
  year    = 2021
}

Note

The artifacts distributed in this repository are part of the MultiMolecule project. If MultiMolecule supports your research, please cite the MultiMolecule project as follows:

BibTeX
@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}