ERNIE-RNA¶
Pre-trained model on non-coding RNA (ncRNA) using a masked language modeling (MLM) objective.
Disclaimer¶
This is an UNOFFICIAL implementation of ERNIE-RNA: An RNA Language Model with Structure-enhanced Representations by Weijie Yin, Zhaoyu Zhang, Liang He, et al.
The OFFICIAL repository of ERNIE-RNA is at Bruce-ywj/ERNIE-RNA.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.
The team releasing ERNIE-RNA did not write this model card, so it has been written by the MultiMolecule team.
Model Details¶
ERNIE-RNA is a BERT-style model pre-trained on a large corpus of non-coding RNA sequences in a self-supervised fashion. This means the model was trained on raw RNA nucleotide sequences only, with an automatic process generating inputs and labels from those sequences. Please refer to the Training Details section for more information on the training process.
Variants¶
- multimolecule/ernierna: The ERNIE-RNA model pre-trained on non-coding RNA sequences.
- multimolecule/ernierna-ss: The ERNIE-RNA model fine-tuned on RNA secondary structure prediction.
Model Specification¶
| Num Layers | Hidden Size | Num Heads | Intermediate Size | Num Parameters (M) | FLOPs (G) | MACs (G) | Max Num Tokens |
|---|---|---|---|---|---|---|---|
| 12 | 768 | 12 | 3072 | 85.67 | 22.37 | 11.18 | 1024 |
Links¶
- Code: multimolecule.ernierna
- Data: multimolecule/rnacentral
- Paper: ERNIE-RNA: An RNA Language Model with Structure-enhanced Representations
- Developed by: Weijie Yin, Zhaoyu Zhang, Liang He, Rui Jiang, Shuo Zhang, Gan Liu, Xuegong Zhang, Tao Qin, Zhen Xie
- Model type: BERT - ERNIE
- Original Repository: Bruce-ywj/ERNIE-RNA
Usage¶
The model file depends on the multimolecule library. You can install it using pip:
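Assuming the library is published on PyPI under the name `multimolecule`, a typical install is:

```bash
pip install multimolecule
```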
Direct Use¶
Masked Language Modeling¶
You can use this model directly with a pipeline for masked language modeling:
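A minimal sketch, assuming the checkpoint is available on the Hugging Face Hub as `multimolecule/ernierna` (as listed under Variants) and that importing `multimolecule` registers the architecture with `transformers`; the RNA sequence is an arbitrary toy example:

```python
import multimolecule  # importing multimolecule registers ERNIE-RNA with transformers
from transformers import pipeline

# Load the pre-trained checkpoint into a fill-mask pipeline.
unmasker = pipeline("fill-mask", model="multimolecule/ernierna")

# Predict the nucleotide hidden behind <mask>.
print(unmasker("gguc<mask>cucugguuagaccagaucugagccu"))
```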
Downstream Use¶
Extract Features¶
Here is how to use this model to get the features of a given sequence in PyTorch:
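A minimal sketch using `RnaTokenizer` and `ErnieRnaModel` (both documented below); the input sequence is a toy example:

```python
import torch
from multimolecule import RnaTokenizer, ErnieRnaModel

tokenizer = RnaTokenizer.from_pretrained("multimolecule/ernierna")
model = ErnieRnaModel.from_pretrained("multimolecule/ernierna")

text = "UAGCUUAUCAGACUGAUGUUGA"  # toy RNA sequence
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Per-token representations of the input sequence.
embeddings = outputs.last_hidden_state
```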
Sequence Classification / Regression¶
Note
This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.
Here is how to use this model as a backbone to fine-tune for a sequence-level task in PyTorch:
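A minimal sketch using `ErnieRnaForSequencePrediction` (documented below); the sequence and label are toy placeholders, and a real fine-tuning run would iterate over a labelled dataset with an optimizer:

```python
import torch
from multimolecule import RnaTokenizer, ErnieRnaForSequencePrediction

tokenizer = RnaTokenizer.from_pretrained("multimolecule/ernierna")
model = ErnieRnaForSequencePrediction.from_pretrained("multimolecule/ernierna")

text = "UAGCUUAUCAGACUGAUGUUGA"  # toy RNA sequence
inputs = tokenizer(text, return_tensors="pt")
label = torch.tensor([1])  # toy sequence-level label

# Passing labels makes the head also return a loss for backpropagation.
output = model(**inputs, labels=label)
```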
Token Classification / Regression¶
Note
This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for token classification or regression.
Here is how to use this model as a backbone to fine-tune for a nucleotide-level task in PyTorch:
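A minimal sketch using `ErnieRnaForTokenPrediction` (documented below); the sequence and per-nucleotide labels are toy placeholders:

```python
import torch
from multimolecule import RnaTokenizer, ErnieRnaForTokenPrediction

tokenizer = RnaTokenizer.from_pretrained("multimolecule/ernierna")
model = ErnieRnaForTokenPrediction.from_pretrained("multimolecule/ernierna")

text = "UAGCUUAUCAGACUGAUGUUGA"  # toy RNA sequence
inputs = tokenizer(text, return_tensors="pt")
label = torch.randint(2, (len(text),))  # toy per-nucleotide labels

# Passing labels makes the head also return a loss for backpropagation.
output = model(**inputs, labels=label)
```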
Contact Classification / Regression¶
Note
This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.
Here is how to use this model as a backbone to fine-tune for a contact-level task in PyTorch:
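A minimal sketch using `ErnieRnaForContactPrediction` (documented below); the sequence and pairwise contact labels are toy placeholders:

```python
import torch
from multimolecule import RnaTokenizer, ErnieRnaForContactPrediction

tokenizer = RnaTokenizer.from_pretrained("multimolecule/ernierna")
model = ErnieRnaForContactPrediction.from_pretrained("multimolecule/ernierna")

text = "UAGCUUAUCAGACUGAUGUUGA"  # toy RNA sequence
inputs = tokenizer(text, return_tensors="pt")
label = torch.randint(2, (len(text), len(text)))  # toy contact map labels

# Passing labels makes the head also return a loss for backpropagation.
output = model(**inputs, labels=label)
```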
Training Details¶
ERNIE-RNA used Masked Language Modeling (MLM) as the pre-training objective: taking a sequence, the model randomly masks 15% of the tokens in the input, runs the entire masked sequence through the model, and has to predict the masked tokens. This is comparable to the Cloze task in language modeling.
Training Data¶
The ERNIE-RNA model was pre-trained on RNAcentral. RNAcentral is a free, public resource that offers integrated access to a comprehensive and up-to-date set of non-coding RNA sequences provided by a collaborating group of Expert Databases representing a broad range of organisms and RNA types.
ERNIE-RNA applied CD-HIT (CD-HIT-EST) with a cut-off of 100% sequence identity to remove redundancy from RNAcentral, resulting in 25 million unique sequences. Sequences longer than 1024 nucleotides were subsequently excluded. The final dataset contains 20.4 million non-redundant RNA sequences. ERNIE-RNA preprocessed all tokens by replacing “T”s with “U”s.
Note that RnaTokenizer will convert “T”s to “U”s for you; you may disable this behaviour by passing replace_T_with_U=False.
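As a short illustration of this behaviour (assuming `replace_T_with_U` is accepted by `from_pretrained`, as with standard Transformers tokenizer keyword arguments):

```python
from multimolecule import RnaTokenizer

# Default behaviour: DNA-style "T"s are converted to "U"s before tokenization.
tokenizer = RnaTokenizer.from_pretrained("multimolecule/ernierna")
print(tokenizer.tokenize("ACGT"))

# Disable the conversion to keep "T"s as-is.
tokenizer = RnaTokenizer.from_pretrained("multimolecule/ernierna", replace_T_with_U=False)
print(tokenizer.tokenize("ACGT"))
```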
Training Procedure¶
Preprocessing¶
ERNIE-RNA used masked language modeling (MLM) as the pre-training objective. The masking procedure is similar to the one used in BERT:
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `<mask>`.
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
- In the 10% remaining cases, the masked tokens are left as is.
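As an illustration of this 80/10/10 scheme, here is a minimal, self-contained sketch (not the authors' implementation):

```python
import random

def mask_sequence(tokens, vocab, mask_token="<mask>", mask_rate=0.15):
    """BERT-style masking: of the ~15% of positions selected for prediction,
    80% become <mask>, 10% a random different token, and 10% stay unchanged."""
    masked, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if random.random() < mask_rate:
            labels[i] = token  # the model must recover the original token here
            roll = random.random()
            if roll < 0.8:
                masked[i] = mask_token
            elif roll < 0.9:
                masked[i] = random.choice([t for t in vocab if t != token])
            # otherwise: leave the original token in place
    return masked, labels

print(mask_sequence(list("UAGCUUAUCAGACUGAUGUUGA"), vocab=list("ACGU")))
```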
Pre-training¶
The model was trained on 24 NVIDIA V100 GPUs with 32 GiB of memory each.
- Learning rate: 1e-4
- Learning rate warm-up: 20,000 steps
- Weight decay: 0.01
Citation¶
Note
The artifacts distributed in this repository are part of the MultiMolecule project. If you use MultiMolecule in your research, you must cite the MultiMolecule project; the BibTeX entry is provided in the MultiMolecule repository.
Contact¶
Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the ERNIE-RNA paper for questions or comments on the paper/model.
License¶
This model is licensed under the GNU Affero General Public License.
For additional terms and clarifications, please refer to our License FAQ.
multimolecule.models.ernierna¶
RnaTokenizer¶
Bases: Tokenizer
Tokenizer for RNA sequences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `alphabet` | `Alphabet \| str \| List[str] \| None` | Alphabet to use for tokenization. | `None` |
| `nmers` | `int` | Size of kmer to tokenize. | `1` |
| `codon` | `bool` | Whether to tokenize into codons. | `False` |
| `replace_T_with_U` | `bool` | Whether to replace T with U. | `True` |
| `do_upper_case` | `bool` | Whether to convert input to uppercase. | `True` |
Examples:
Source code in multimolecule/tokenisers/rna/tokenization_rna.py
ErnieRnaConfig¶
Bases: PreTrainedConfig
This is the configuration class to store the configuration of an
ErnieRnaModel. It is used to instantiate an ErnieRna model according to the
specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a
similar configuration to that of the ErnieRna Bruce-ywj/ERNIE-RNA
architecture.
Configuration objects inherit from PreTrainedConfig and can be used to
control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `vocab_size` | `int` | Vocabulary size of the ErnieRna model. Defines the number of different tokens that can be represented by the `input_ids` passed when calling `ErnieRnaModel`. | `26` |
| `hidden_size` | `int` | Dimensionality of the encoder layers and the pooler layer. | `768` |
| `num_hidden_layers` | `int` | Number of hidden layers in the Transformer encoder. | `12` |
| `num_attention_heads` | `int` | Number of attention heads for each attention layer in the Transformer encoder. | `12` |
| `intermediate_size` | `int` | Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder. | `3072` |
| `hidden_act` | `str` | The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, `"relu"`, `"silu"` and `"gelu_new"` are supported. | `'gelu'` |
| `hidden_dropout` | `float` | The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. | `0.1` |
| `attention_dropout` | `float` | The dropout ratio for the attention probabilities. | `0.1` |
| `max_position_embeddings` | `int` | The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048). | `1026` |
| `initializer_range` | `float` | The standard deviation of the truncated_normal_initializer for initializing all weight matrices. | `0.02` |
| `layer_norm_eps` | `float` | The epsilon used by the layer normalization layers. | `1e-12` |
| `pairwise_alpha` | `float` | Scaling factor for pairwise bias in the attention mechanism. | `0.8` |
| `position_embedding_type` | `str` | Type of position embedding. | `'sinusoidal'` |
| `is_decoder` | `bool` | Whether the model is used as a decoder or not. If `False`, the model is used as an encoder. | `False` |
| `use_cache` | `bool` | Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if `config.is_decoder=True`. | `True` |
| `head` | `HeadConfig \| None` | The configuration of the head. | `None` |
| `lm_head` | `MaskedLMHeadConfig \| None` | The configuration of the masked language model head. | `None` |
| `output_attention_biases` | `bool` | Whether to return attention bias maps. | `False` |
| `add_cross_attention` | `bool` | Whether to add cross-attention layers when the model is used as a decoder. | `False` |
Examples:
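A minimal usage sketch, assuming the standard Transformers configuration/model pattern:

```python
from multimolecule import ErnieRnaConfig, ErnieRnaModel

# Initialise a configuration with ERNIE-RNA style defaults.
configuration = ErnieRnaConfig()

# Initialise a model (with random weights) from that configuration.
model = ErnieRnaModel(configuration)

# Access the model configuration.
configuration = model.config
```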
Source code in multimolecule/models/ernierna/configuration_ernierna.py
ErnieRnaForContactPrediction¶
Bases: ErnieRnaPreTrainedModel
Examples:
Source code in multimolecule/models/ernierna/modeling_ernierna.py
ErnieRnaForMaskedLM¶
Bases: ErnieRnaPreTrainedModel
Examples:
Source code in multimolecule/models/ernierna/modeling_ernierna.py
ErnieRnaForSecondaryStructurePrediction¶
Bases: ErnieRnaForPreTraining
Examples:
Source code in multimolecule/models/ernierna/modeling_ernierna.py
ErnieRnaForSequencePrediction¶
Bases: ErnieRnaPreTrainedModel
Examples:
Source code in multimolecule/models/ernierna/modeling_ernierna.py
ErnieRnaForTokenPrediction¶
Bases: ErnieRnaPreTrainedModel
Examples:
Source code in multimolecule/models/ernierna/modeling_ernierna.py
ErnieRnaModel¶
Bases: ErnieRnaPreTrainedModel
Examples:
Source code in multimolecule/models/ernierna/modeling_ernierna.py
forward¶
forward(
input_ids: Tensor | NestedTensor | None = None,
attention_mask: Tensor | None = None,
position_ids: Tensor | None = None,
inputs_embeds: Tensor | NestedTensor | None = None,
encoder_hidden_states: Tensor | None = None,
encoder_attention_mask: Tensor | None = None,
past_key_values: Cache | None = None,
use_cache: bool | None = None,
cache_position: Tensor | None = None,
output_attention_biases: bool | None = None,
**kwargs: Unpack[TransformersKwargs]
) -> (
Tuple[Tensor, ...]
| ErnieRnaModelOutputWithPoolingAndCrossAttentions
)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `encoder_hidden_states` | `Tensor \| None` | Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder. | `None` |
| `encoder_attention_mask` | `Tensor \| None` | Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`: 1 for tokens that are not masked, 0 for tokens that are masked. | `None` |
| `past_key_values` | `Cache \| None` | Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding. | `None` |
| `use_cache` | `bool \| None` | If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see `past_key_values`). | `None` |
Source code in multimolecule/models/ernierna/modeling_ernierna.py
ErnieRnaPreTrainedModel¶
Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.