SPOT-RNA¶
Pre-trained model for RNA secondary structure prediction using two-dimensional deep neural networks and transfer learning.
Disclaimer¶
This is an UNOFFICIAL implementation of "RNA secondary structure prediction using an ensemble of two-dimensional deep neural networks and transfer learning" by Jaswinder Singh, et al.
The OFFICIAL repository of SPOT-RNA is at jaswindersingh2/SPOT-RNA.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.
The team that released SPOT-RNA did not write a model card for it, so this model card was written by the MultiMolecule team.
Model Details¶
SPOT-RNA is a 2D convolutional neural network for predicting RNA secondary structure (base-pair contact maps) from single RNA sequences. It predicts both canonical (Watson-Crick and wobble) and non-canonical base pairs, including pseudoknots and other tertiary interactions.
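To make the canonical versus non-canonical distinction concrete, here is a minimal sketch that labels a pair of bases by the standard definitions (A-U and G-C are Watson-Crick, G-U is wobble, everything else is non-canonical). The helper name `classify_pair` is hypothetical and is not part of SPOT-RNA or MultiMolecule.

```python
# Illustrative helper only; the name and return labels are assumptions.
WATSON_CRICK = {frozenset("AU"), frozenset("GC")}
WOBBLE = {frozenset("GU")}


def classify_pair(base_i: str, base_j: str) -> str:
    """Label a base pair as 'watson-crick', 'wobble', or 'non-canonical'."""
    # A frozenset makes the lookup independent of the order of the two bases.
    pair = frozenset((base_i.upper(), base_j.upper()))
    if pair in WATSON_CRICK:
        return "watson-crick"
    if pair in WOBBLE:
        return "wobble"
    return "non-canonical"


for left, right in [("A", "U"), ("G", "U"), ("G", "A")]:
    print(left, right, classify_pair(left, right))
```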
The model uses:
- pairwise representation: outer concatenation of canonical nucleotide features into an `L x L x 8` feature matrix.
- convolutional blocks: 2D residual convolution blocks with LayerNorm, dropout, and checkpoint-matched ReLU/ELU activations.
- architecture paths: checkpoint-matched 2D-BLSTM or dilated-convolution paths where used by the released predictor.
- training strategy: transfer learning from bpRNA to high-resolution PDB RNA structures.
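The pairwise representation in the list above can be sketched in plain Python. This is an illustration under stated assumptions, not the released implementation: the function name `outer_concatenate`, the A, C, G, U channel order, and the choice to place the features of position `i` before those of position `j` are all assumptions.

```python
from typing import List


def outer_concatenate(one_hot: List[List[float]]) -> List[List[List[float]]]:
    """Build an L x L x 2C pairwise tensor from an L x C per-residue matrix."""
    # Entry (i, j) is the feature vector of residue i followed by that of residue j.
    return [[row_i + row_j for row_j in one_hot] for row_i in one_hot]


# Toy sequence "ACG" with an assumed A, C, G, U channel order.
one_hot = [
    [1.0, 0.0, 0.0, 0.0],  # A
    [0.0, 1.0, 0.0, 0.0],  # C
    [0.0, 0.0, 1.0, 0.0],  # G
]
pairwise = outer_concatenate(one_hot)
print(len(pairwise), len(pairwise[0]), len(pairwise[0][0]))
```

For a sequence of length `L`, the result has `L x L` positions with 8 channels each, which matches the `L x L x 8` shape described above.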
MultiMolecule provides SPOT-RNA as a single checkpoint, multimolecule/spotrna.
Model Specification¶
| Num Parameters (M) | FLOPs (G) | MACs (G) |
|---|---|---|
| 17.46 | 8642.10 | 4302.16 |
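As a quick sanity check on the figures in the table, the snippet below derives two quantities from them. It assumes the usual convention that one multiply-accumulate counts as roughly two floating-point operations, and it assumes the weights are stored as 32-bit floats; neither assumption is stated in this model card.

```python
# Values copied from the specification table above.
num_parameters_m = 17.46
flops_g = 8642.10
macs_g = 4302.16

# Ratio of reported FLOPs to MACs; expected to be close to 2 under the
# "one MAC is about two FLOPs" convention.
flops_per_mac = flops_g / macs_g

# Approximate weight size, ASSUMING float32 storage (4 bytes per parameter).
float32_weights_mb = num_parameters_m * 1e6 * 4 / 1e6

print(f"FLOPs per MAC: {flops_per_mac:.3f}")
print(f"float32 weights: {float32_weights_mb:.2f} MB")
```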
Links¶
- Code: multimolecule.spotrna
- Weights: multimolecule/spotrna
- Data: multimolecule/bprna-spot
- Paper: RNA secondary structure prediction using an ensemble of two-dimensional deep neural networks and transfer learning
- Developed by: Jaswinder Singh, Jack Hanson, Kuldip Paliwal, Yaoqi Zhou
- Original Repository: jaswindersingh2/SPOT-RNA
Usage¶
The model file depends on the multimolecule library. You can install it using pip: `pip install multimolecule`.
Direct Use¶
RNA Secondary Structure Pipeline¶
You can use SPOT-RNA directly with the MultiMolecule secondary-structure pipeline:
PyTorch Inference¶
Here is how to use this model to predict RNA secondary structure in PyTorch:
Training Details¶
SPOT-RNA was trained using a two-stage transfer learning approach on RNA secondary structure prediction.
Training Data¶
- initial training source: bpRNA-1m (Version 1.0) with 102,348 annotated RNAs.
- initial training filtering: CD-HIT-EST at 80% sequence identity, removal of RNAs with PDB structures, and maximum sequence length of 500 nucleotides.
- initial training corpus: 13,419 RNAs after preprocessing.
- initial training split: TR0 = 10,814, VL0 = 1,300, TS0 = 1,305.
- transfer-learning source: high-resolution PDB RNAs downloaded on March 2, 2019.
- transfer-learning filtering: resolution better than 3.5 Å and CD-HIT-EST at 80% sequence identity.
- transfer-learning corpus: 226 nonredundant RNAs after preprocessing.
- transfer-learning split before homology filtering: TR1 = 120, VL1 = 30, TS1 = 76.
- additional TS1 filtering: CD-HIT-EST against the training data at 80% identity, followed by BLAST-N against TR0 and TR1 with e-value cutoff 10.
- final TS1 benchmark: 67 RNAs.
- additional evaluation set: TS2 = 39 NMR-solved RNAs selected from 641 candidates after CD-HIT-EST filtering at 80% identity and BLAST-N filtering against TR0, TR1, and TS1.
- use of TS2: post-training evaluation only.
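The split sizes listed above are internally consistent, which the following short check illustrates. The numbers are copied from the list; the helper name `fractions` is hypothetical.

```python
# Split sizes as reported in the list above.
initial_split = {"TR0": 10_814, "VL0": 1_300, "TS0": 1_305}
transfer_split = {"TR1": 120, "VL1": 30, "TS1": 76}

# Each set of splits should add up to its preprocessed corpus size.
assert sum(initial_split.values()) == 13_419
assert sum(transfer_split.values()) == 226

# Number of TS1 RNAs removed by the additional homology filtering.
removed_from_ts1 = transfer_split["TS1"] - 67


def fractions(split: dict) -> dict:
    """Return each split's share of the total, rounded to three decimals."""
    total = sum(split.values())
    return {name: round(count / total, 3) for name, count in split.items()}


print(fractions(initial_split))
print(fractions(transfer_split))
print("removed from TS1:", removed_from_ts1)
```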
Training Procedure¶
Preprocessing¶
- input representation: one-hot `L x 4` matrix following the MultiMolecule tokenizer order.
- missing-value handling: invalid or missing residues encoded as `-1` in the original TensorFlow implementation before one-hot conversion.
- pairwise features: outer concatenation from `L x 4` to `L x L x 8`.
- input normalization: standardization to zero mean and unit variance using training-set statistics.
- structure labels: extracted from PDB coordinates with DSSR.
- reference NMR model: model 1.
- pseudoknot and motif definitions: bpRNA definitions from the paper.
- unknown-token handling: `N` tokens are excluded from the canonical four-base features before pairwise feature construction.
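The encoding and normalization steps above can be sketched as follows. This is not the library's preprocessing code. It assumes an A, C, G, U channel order, treats "excluded from the canonical four-base features" as an all-zero row for `N`, and uses a single scalar mean and standard deviation; the function names and the example statistics are hypothetical.

```python
from typing import List

CANONICAL_BASES = "ACGU"  # ASSUMED channel order; the real tokenizer order may differ.


def one_hot_encode(sequence: str) -> List[List[float]]:
    """Return an L x 4 matrix; symbols outside A/C/G/U (such as N) become all-zero rows."""
    rows = []
    for base in sequence.upper().replace("T", "U"):
        rows.append([1.0 if base == canonical else 0.0 for canonical in CANONICAL_BASES])
    return rows


def standardize(features: List[List[float]], mean: float, std: float) -> List[List[float]]:
    """Shift and scale every entry with statistics computed on the training set."""
    return [[(value - mean) / std for value in row] for row in features]


encoded = one_hot_encode("ACGUN")
# The mean and std below are placeholders, not the statistics used for SPOT-RNA.
normalized = standardize(encoded, mean=0.25, std=0.5)
print(encoded[-1], normalized[0])
```

Whether standardization is applied before or after the outer concatenation is not stated here, so the sketch leaves that ordering open.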
Pre-training¶
The paper states that training was run on Nvidia GTX TITAN X GPUs.
- training split: TR0.
- validation split: VL0.
- optimizer: Adam.
- regularization: 25% dropout before convolution layers and 50% dropout in hidden fully connected layers.
- hyperparameter search over `N_A`: 16 to 32 residual blocks.
- hyperparameter search over `D_RES`: 32 to 72 convolution channels.
- hyperparameter search over `D_BL`: 128 to 256 2D-BLSTM hidden units per direction.
- hyperparameter search over `N_B`: 0 to 4 fully connected blocks.
- hyperparameter search over `D_FC`: 256 to 512 fully connected hidden units.
- model selection: validation-performance model selection described in the paper.
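The search ranges above can be written down as a small search space. The sketch below draws candidates uniformly at random, which is purely an assumption for illustration: this model card does not state the search strategy or the step sizes that were used. The names `SEARCH_SPACE`, `sample_configuration`, and `is_within_search_space` are hypothetical.

```python
import random

# Inclusive (low, high) bounds copied from the list above.
SEARCH_SPACE = {
    "N_A": (16, 32),     # residual blocks
    "D_RES": (32, 72),   # convolution channels
    "D_BL": (128, 256),  # 2D-BLSTM hidden units per direction
    "N_B": (0, 4),       # fully connected blocks
    "D_FC": (256, 512),  # fully connected hidden units
}


def sample_configuration(rng: random.Random) -> dict:
    """Draw one integer per hyperparameter, inclusive of both bounds (ASSUMED strategy)."""
    return {name: rng.randint(low, high) for name, (low, high) in SEARCH_SPACE.items()}


def is_within_search_space(config: dict) -> bool:
    """Check that every hyperparameter lies inside its reported range."""
    return all(low <= config[name] <= high for name, (low, high) in SEARCH_SPACE.items())


rng = random.Random(0)  # fixed seed so the sketch is reproducible
candidate = sample_configuration(rng)
print(candidate, is_within_search_space(candidate))
```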
Transfer Learning¶
The pretrained TR0 models were retrained on TR1 with the same architecture and optimization settings.
- initialization: start from the TR0-trained models.
- training split: TR1.
- validation split: VL1.
- frozen layers: none; all weights were updated.
- architecture and optimization settings: same as the TR0-trained models.
- model selection: validation-performance model selection described in the paper.
- decision rule: a single probability threshold chosen to optimize validation performance.
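The decision rule above, a single threshold tuned on validation data, can be sketched as a simple sweep. The choice of F1 as the validation metric, the candidate grid, and the toy data are assumptions made for illustration; the function names are hypothetical.

```python
from typing import Sequence


def f1_score(probabilities: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """F1 of the binary predictions obtained by thresholding the probabilities."""
    tp = fp = fn = 0
    for probability, label in zip(probabilities, labels):
        predicted = probability >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_threshold(probabilities, labels, candidates) -> float:
    """Return the first candidate threshold with the highest validation F1."""
    return max(candidates, key=lambda t: f1_score(probabilities, labels, t))


# Toy validation data: flattened pair probabilities and their true labels.
validation_probabilities = [0.05, 0.20, 0.40, 0.60, 0.90]
validation_labels = [0, 0, 1, 0, 1]
candidates = [i / 100 for i in range(5, 100, 5)]
best = select_threshold(validation_probabilities, validation_labels, candidates)
print(best, f1_score(validation_probabilities, validation_labels, best))
```

The selected value is then applied unchanged to every test sequence, which is what makes it a single global threshold rather than a per-sequence one.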
Citation¶
Note
The artifacts distributed in this repository are part of the MultiMolecule project. If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:
Contact¶
Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the SPOT-RNA paper for questions or comments on the paper/model.
License¶
This model implementation is licensed under the GNU Affero General Public License.
For additional terms and clarifications, please refer to our License FAQ.
multimolecule.models.spotrna¶
RnaTokenizer¶
Bases: Tokenizer
Tokenizer for RNA sequences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `Alphabet \| str \| List[str] \| None` | Alphabet to use for tokenization. | `None` |
| | `int` | Size of kmer to tokenize. | `1` |
| | `bool` | Whether to tokenize into codons. | `False` |
| | `bool` | Whether to replace T with U. | `True` |
| | `bool` | Whether to convert input to uppercase. | `True` |
Examples:
Source code in multimolecule/tokenisers/rna/tokenization_rna.py
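The tokenizer options described above (uppercasing, T-to-U replacement, and k-mer size) can be illustrated with a small standalone function. This is not the `RnaTokenizer` implementation: the function name, the parameter names, and the use of overlapping k-mers with stride 1 are assumptions for illustration only.

```python
from typing import List


def normalize_and_tokenize(
    sequence: str,
    kmer_size: int = 1,
    replace_t_with_u: bool = True,
    uppercase: bool = True,
) -> List[str]:
    """Normalize an RNA string and split it into overlapping k-mers (ASSUMED stride of 1)."""
    if kmer_size < 1:
        raise ValueError("kmer_size must be at least 1")
    if uppercase:
        sequence = sequence.upper()
    if replace_t_with_u:
        # Handle both cases so the option also works when uppercasing is disabled.
        sequence = sequence.replace("T", "U").replace("t", "u")
    return [sequence[i : i + kmer_size] for i in range(len(sequence) - kmer_size + 1)]


print(normalize_and_tokenize("acgt"))
print(normalize_and_tokenize("ACGT", kmer_size=3))
```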
SpotRnaConfig¶
Bases: PreTrainedConfig
This is the configuration class to store the configuration of a
SpotRnaModel. It is used to instantiate a SPOT-RNA model according to
the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will
yield a similar configuration to that of the SPOT-RNA
jaswindersingh2/SPOT-RNA architecture.
Configuration objects inherit from PreTrainedConfig and can be used to
control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `int` | Token vocabulary size of the SPOT-RNA model. Defaults to 5 for the … | `5` |
| | `list[SpotRnaModuleConfig] \| None` | List of internal architecture configurations. Each entry is a … | `None` |
| | `int` | Number of input feature channels after outer concatenation. Defaults to 8 for the canonical four-base pairwise representation. | `8` |
| | `str` | The non-linear activation function in the convolutional and fully connected blocks. | `'relu'` |
| | `float` | Dropout rate in the convolutional blocks. | `0.25` |
| | `float` | Dropout rate in the fully connected blocks. | `0.5` |
| | `float` | Probability threshold for predicting base pairs during post-processing. | `0.335` |
Examples:
Source code in multimolecule/models/spotrna/configuration_spotrna.py
SpotRnaModuleConfig¶
Bases: FlatDict
Configuration for one internal SPOT-RNA architecture member.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | | Number of convolutional blocks (N_A in the paper). | required |
| | | Number of 2D bidirectional LSTM blocks. Set to 0 to disable. | required |
| | | Number of fully connected blocks. Set to 0 to disable. | required |
| | | Number of channels in the convolutional blocks. | required |
| | | Hidden size per direction in the 2D-BLSTM. Ignored if `num_blstm_blocks` is 0. | required |
| | | Hidden size of the fully connected blocks. Ignored if `num_fc_blocks` is 0. | required |
| | | Activation used in the convolutional residual blocks. | required |
| | | Optional activation used in the fully connected blocks. Falls back to … | required |
| | | Activation applied before the final normalization stage. | required |
| | | Whether to use dilated convolutions. | required |
| | | The cycle length for the dilation factor. | required |
Source code in multimolecule/models/spotrna/configuration_spotrna.py
SpotRnaModel¶
Bases: SpotRnaPreTrainedModel
Examples:
Source code in multimolecule/models/spotrna/modeling_spotrna.py
SpotRnaModelOutput `dataclass`¶
Bases: ModelOutput
Output type for SPOT-RNA model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided | Binary cross-entropy loss for base-pair prediction. | `None` |
| | `torch.FloatTensor` of shape `(batch_size, seq_len, seq_len)` | Prediction logits before sigmoid. | `None` |
| | `torch.FloatTensor` of shape `(batch_size, seq_len, seq_len)`, *optional* | Base-pair probability matrix (after sigmoid). | `None` |
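The relationship between the three outputs above can be sketched in plain Python for a single sequence. This is a minimal illustration, not the model's post-processing code: it assumes a symmetric `seq_len x seq_len` matrix, reads only the upper triangle, uses a strict `>` comparison against the default threshold of 0.335, and averages the loss over all entries. The function names are hypothetical.

```python
import math
from typing import List, Tuple


def sigmoid(x: float) -> float:
    """Map a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def logits_to_pairs(logits: List[List[float]], threshold: float = 0.335) -> List[Tuple[int, int]]:
    """Return (i, j) pairs with i < j whose probability exceeds the threshold."""
    length = len(logits)
    return [
        (i, j)
        for i in range(length)
        for j in range(i + 1, length)
        if sigmoid(logits[i][j]) > threshold
    ]


def binary_cross_entropy(logits: List[List[float]], labels: List[List[int]]) -> float:
    """Mean binary cross-entropy computed from logits in a numerically stable form."""
    total, count = 0.0, 0
    for logit_row, label_row in zip(logits, labels):
        for z, y in zip(logit_row, label_row):
            # Stable equivalent of -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(z).
            total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count


# Toy 3 x 3 example: only positions 0 and 2 are predicted to pair.
toy_logits = [[-4.0, -4.0, 2.0], [-4.0, -4.0, -4.0], [2.0, -4.0, -4.0]]
toy_labels = [[0, 0, 1], [0, 0, 0], [1, 0, 0]]
pairs = logits_to_pairs(toy_logits)
loss = binary_cross_entropy(toy_logits, toy_labels)
print(pairs, round(loss, 4))
```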