ERNIE-RNA¶
Pre-trained model on non-coding RNA (ncRNA) using a masked language modeling (MLM) objective.
Disclaimer¶
This is an UNOFFICIAL implementation of ERNIE-RNA: An RNA Language Model with Structure-enhanced Representations by Weijie Yin, Zhaoyu Zhang, Liang He, et al.
The OFFICIAL repository of ERNIE-RNA is at Bruce-ywj/ERNIE-RNA.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.
The team releasing ERNIE-RNA did not write this model card, so it has been written by the MultiMolecule team.
Model Details¶
ERNIE-RNA is a BERT-style model pre-trained on a large corpus of non-coding RNA sequences in a self-supervised fashion. This means the model was trained on the raw nucleotides of RNA sequences only, with an automatic process generating inputs and labels from those sequences. Please refer to the Training Details section for more information on the training process.
Variations¶
- multimolecule/ernierna: The ERNIE-RNA model pre-trained on non-coding RNA sequences.
- multimolecule/ernierna.ss: The ERNIE-RNA model fine-tuned on RNA secondary structure prediction.
Model Specification¶
| Num Layers | Hidden Size | Num Heads | Intermediate Size | Num Parameters (M) | FLOPs (G) | MACs (G) | Max Num Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 12 | 768 | 12 | 3072 | 85.67 | 22.36 | 11.17 | 1024 |
Links¶
- Code: multimolecule.ernierna
- Data: RNAcentral
- Paper: ERNIE-RNA: An RNA Language Model with Structure-enhanced Representations
- Developed by: Weijie Yin, Zhaoyu Zhang, Liang He, Rui Jiang, Shuo Zhang, Gan Liu, Xuegong Zhang, Tao Qin, Zhen Xie
- Model type: BERT - ERNIE
- Original Repository: https://github.com/Bruce-ywj/ERNIE-RNA
Usage¶
The model file depends on the multimolecule
library. You can install it using pip:
```bash
pip install multimolecule
```
Direct Use¶
You can use this model directly with a pipeline for masked language modeling:
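The snippet below is a minimal sketch rather than the verbatim official example; it assumes the checkpoint is published on the Hugging Face Hub as `multimolecule/ernierna` (importing `multimolecule` registers the model with Transformers), and the input sequence is arbitrary:

```python
import multimolecule  # noqa: F401  # registers ERNIE-RNA with the Transformers Auto* classes
from transformers import pipeline

# Build a fill-mask pipeline backed by the pre-trained ERNIE-RNA checkpoint.
unmasker = pipeline("fill-mask", model="multimolecule/ernierna")

# Predict the masked nucleotide in an arbitrary RNA sequence.
predictions = unmasker("gguc<mask>cucugguuagaccagaucugagccu")
for prediction in predictions:
    print(prediction["token_str"], prediction["score"])
```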
Downstream Use¶
Extract Features¶
Here is how to use this model to get the features of a given sequence in PyTorch:
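The following is a minimal sketch built from the classes documented later on this page (`RnaTokenizer`, `ErnieRnaModel`); the example sequence is arbitrary:

```python
import torch
from multimolecule import RnaTokenizer, ErnieRnaModel

tokenizer = RnaTokenizer.from_pretrained("multimolecule/ernierna")
model = ErnieRnaModel.from_pretrained("multimolecule/ernierna")

text = "UAGCUUAUCAGACUGAUGUUG"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Per-token embeddings (including the special tokens) and the pooled sequence embedding.
token_embeddings = outputs.last_hidden_state  # (1, len(text) + 2, 768)
sequence_embedding = outputs.pooler_output    # (1, 768)
```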
Sequence Classification / Regression¶
Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.
Here is how to use this model as a backbone to fine-tune for a sequence-level task in PyTorch:
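A sketch with a dummy label, assuming `ErnieRnaForSequencePrediction` (documented below) as the task head; in practice you would wrap this in a proper training loop or a `Trainer`:

```python
import torch
from multimolecule import RnaTokenizer, ErnieRnaForSequencePrediction

tokenizer = RnaTokenizer.from_pretrained("multimolecule/ernierna")
model = ErnieRnaForSequencePrediction.from_pretrained("multimolecule/ernierna")

text = "UAGCUUAUCAGACUGAUGUUG"
inputs = tokenizer(text, return_tensors="pt")
labels = torch.tensor([1])  # dummy sequence-level label, for illustration only

outputs = model(**inputs, labels=labels)
outputs.loss.backward()  # a fine-tuning step (optimizer.step(), etc.) would follow
```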
Token Classification / Regression¶
Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
Here is how to use this model as a backbone to fine-tune for a nucleotide-level task in PyTorch:
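A sketch with dummy per-nucleotide labels, assuming `ErnieRnaForTokenPrediction` (documented below) as the task head:

```python
import torch
from multimolecule import RnaTokenizer, ErnieRnaForTokenPrediction

tokenizer = RnaTokenizer.from_pretrained("multimolecule/ernierna")
model = ErnieRnaForTokenPrediction.from_pretrained("multimolecule/ernierna")

text = "UAGCUUAUCAGACUGAUGUUG"
inputs = tokenizer(text, return_tensors="pt")
labels = torch.randint(2, (1, len(text)))  # dummy binary label per nucleotide

outputs = model(**inputs, labels=labels)
outputs.loss.backward()
```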
Contact Classification / Regression¶
Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.
Here is how to use this model as a backbone to fine-tune for a contact-level task in PyTorch:
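A sketch with dummy pairwise labels, assuming `ErnieRnaForContactPrediction` (documented below) as the task head:

```python
import torch
from multimolecule import RnaTokenizer, ErnieRnaForContactPrediction

tokenizer = RnaTokenizer.from_pretrained("multimolecule/ernierna")
model = ErnieRnaForContactPrediction.from_pretrained("multimolecule/ernierna")

text = "UAGCUUAUCAGACUGAUGUUG"
inputs = tokenizer(text, return_tensors="pt")
labels = torch.randint(2, (1, len(text), len(text)))  # dummy binary label per nucleotide pair

outputs = model(**inputs, labels=labels)
outputs.loss.backward()
```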
Training Details¶
ERNIE-RNA used Masked Language Modeling (MLM) as the pre-training objective: taking a sequence, the model randomly masks 15% of the tokens in the input, then runs the entire masked sequence through the model and has to predict the masked tokens. This is comparable to the Cloze task in language modeling.
Training Data¶
The ERNIE-RNA model was pre-trained on RNAcentral. RNAcentral is a free, public resource that offers integrated access to a comprehensive and up-to-date set of non-coding RNA sequences provided by a collaborating group of Expert Databases representing a broad range of organisms and RNA types.
ERNIE-RNA applied CD-HIT (CD-HIT-EST) with a cut-off at 100% sequence identity to remove redundancy from RNAcentral, resulting in 25 million unique sequences. Sequences longer than 1024 nucleotides were subsequently excluded. The final dataset contains 20.4 million non-redundant RNA sequences. ERNIE-RNA preprocessed all tokens by replacing "T"s with "U"s.
Note that `RnaTokenizer` will convert "T"s to "U"s for you; you may disable this behaviour by passing `replace_T_with_U=False`.
Training Procedure¶
Preprocessing¶
ERNIE-RNA used masked language modeling (MLM) as the pre-training objective. The masking procedure is similar to the one used in BERT (a sketch follows the list below):

- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `<mask>`.
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
- In the 10% remaining cases, the masked tokens are left as is.
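The procedure above can be sketched as follows. This is an illustrative re-implementation (similar in spirit to the Transformers masked-LM data collator), not ERNIE-RNA's actual training code, and the helper name and arguments are hypothetical:

```python
import torch


def mask_tokens(input_ids: torch.Tensor, mask_token_id: int, vocab_size: int,
                special_token_ids: set, mlm_probability: float = 0.15):
    """Return (masked_input_ids, labels) following the 80/10/10 masking scheme."""
    labels = input_ids.clone()

    # Select 15% of the (non-special) tokens as prediction targets.
    special = torch.tensor([[t in special_token_ids for t in row] for row in input_ids.tolist()])
    probability = torch.full(input_ids.shape, mlm_probability)
    probability.masked_fill_(special, 0.0)
    masked = torch.bernoulli(probability).bool()
    labels[~masked] = -100  # loss is only computed on masked positions

    input_ids = input_ids.clone()
    # 80% of the masked positions are replaced by <mask>.
    replaced = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked
    input_ids[replaced] = mask_token_id
    # 10% are replaced by a random token (the real training code may also force it to differ).
    randomized = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & masked & ~replaced
    input_ids[randomized] = torch.randint(vocab_size, input_ids.shape)[randomized]
    # The remaining 10% are left unchanged.
    return input_ids, labels
```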
PreTraining¶
The model was trained on 24 NVIDIA V100 GPUs, each with 32 GiB of memory.
- Learning rate: 1e-4
- Weight decay: 0.01
- Learning rate warm-up: 20,000 steps
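As an illustration only, the hyper-parameters above correspond to a setup along the following lines; the choice of AdamW and of a linear decay after warm-up are assumptions, and `total_steps` is a placeholder not stated in this card:

```python
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(8, 8)  # stand-in module; use the ERNIE-RNA model in practice
total_steps = 1_000_000        # placeholder; the total number of training steps is not stated here

# Learning rate 1e-4, weight decay 0.01, 20,000 warm-up steps, as listed above.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=20_000, num_training_steps=total_steps)
```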
Citation¶
BibTeX:
Contact¶
Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the ERNIE-RNA paper for questions or comments on the paper/model.
License¶
This model is licensed under the AGPL-3.0 License.
multimolecule.models.ernierna¶
RnaTokenizer¶
Bases: Tokenizer
Tokenizer for RNA sequences.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `alphabet` | `Alphabet \| str \| List[str] \| None` | Alphabet to use for tokenization. | `None` |
| `nmers` | `int` | Size of kmer to tokenize. | `1` |
| `codon` | `bool` | Whether to tokenize into codons. | `False` |
| `replace_T_with_U` | `bool` | Whether to replace T with U. | `True` |
| `do_upper_case` | `bool` | Whether to convert input to uppercase. | `True` |
Examples:
>>> from multimolecule import RnaTokenizer
>>> tokenizer = RnaTokenizer()
>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')["input_ids"]
[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]
>>> tokenizer('acgu')["input_ids"]
[1, 6, 7, 8, 9, 2]
>>> tokenizer('acgt')["input_ids"]
[1, 6, 7, 8, 9, 2]
>>> tokenizer = RnaTokenizer(replace_T_with_U=False)
>>> tokenizer('acgt')["input_ids"]
[1, 6, 7, 8, 3, 2]
>>> tokenizer = RnaTokenizer(nmers=3)
>>> tokenizer('uagcuuauc')["input_ids"]
[1, 83, 17, 64, 49, 96, 84, 22, 2]
>>> tokenizer = RnaTokenizer(codon=True)
>>> tokenizer('uagcuuauc')["input_ids"]
[1, 83, 49, 22, 2]
>>> tokenizer('uagcuuauca')["input_ids"]
Traceback (most recent call last):
ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10
Source code in multimolecule/tokenisers/rna/tokenization_rna.py
ErnieRnaConfig¶
Bases: PreTrainedConfig
This is the configuration class to store the configuration of an ErnieRnaModel. It is used to instantiate an ErnieRna model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the ErnieRna Bruce-ywj/ERNIE-RNA architecture.
Configuration objects inherit from PreTrainedConfig
and can be used to
control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
|
int
|
Vocabulary size of the ErnieRna model. Defines the number of different tokens that can be represented by
the |
26
|
|
int
|
Dimensionality of the encoder layers and the pooler layer. |
768
|
|
int
|
Number of hidden layers in the Transformer encoder. |
12
|
|
int
|
Number of attention heads for each attention layer in the Transformer encoder. |
12
|
|
int
|
Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder. |
3072
|
|
float
|
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. |
0.1
|
|
float
|
The dropout ratio for the attention probabilities. |
0.1
|
|
int
|
The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048). |
1026
|
|
float
|
The standard deviation of the truncated_normal_initializer for initializing all weight matrices. |
0.02
|
|
float
|
The epsilon used by the layer normalization layers. |
1e-12
|
Examples:
>>> from multimolecule import ErnieRnaModel, ErnieRnaConfig
>>> # Initializing a ERNIE-RNA multimolecule/ernierna style configuration
>>> configuration = ErnieRnaConfig()
>>> # Initializing a model (with random weights) from the multimolecule/ernierna style configuration
>>> model = ErnieRnaModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
Source code in multimolecule/models/ernierna/configuration_ernierna.py
ErnieRnaForContactClassification¶
Bases: ErnieRnaForPreTraining
Examples:
>>> from multimolecule.models import ErnieRnaConfig, ErnieRnaForContactClassification, RnaTokenizer
>>> config = ErnieRnaConfig()
>>> model = ErnieRnaForContactClassification(config)
>>> tokenizer = RnaTokenizer.from_pretrained("multimolecule/rna")
>>> input = tokenizer("ACGUN", return_tensors="pt")
>>> output = model(**input)
Source code in multimolecule/models/ernierna/modeling_ernierna.py
ErnieRnaForContactPrediction¶
Bases: ErnieRnaPreTrainedModel
Examples:
>>> import torch
>>> from multimolecule import ErnieRnaConfig, ErnieRnaForContactPrediction, RnaTokenizer
>>> config = ErnieRnaConfig()
>>> model = ErnieRnaForContactPrediction(config)
>>> tokenizer = RnaTokenizer.from_pretrained("multimolecule/rna")
>>> input = tokenizer("ACGUN", return_tensors="pt")
>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))
>>> output["logits"].shape
torch.Size([1, 5, 5, 2])
>>> output["loss"]
tensor(..., grad_fn=<NllLossBackward0>)
Source code in multimolecule/models/ernierna/modeling_ernierna.py
ErnieRnaForMaskedLM¶
Bases: ErnieRnaPreTrainedModel
Examples:
>>> from multimolecule import ErnieRnaConfig, ErnieRnaForMaskedLM, RnaTokenizer
>>> config = ErnieRnaConfig()
>>> model = ErnieRnaForMaskedLM(config)
>>> tokenizer = RnaTokenizer.from_pretrained("multimolecule/rna")
>>> input = tokenizer("ACGUN", return_tensors="pt")
>>> output = model(**input, labels=input["input_ids"])
>>> output["logits"].shape
torch.Size([1, 7, 26])
>>> output["loss"]
tensor(..., grad_fn=<NllLossBackward0>)
Source code in multimolecule/models/ernierna/modeling_ernierna.py
ErnieRnaForSequencePrediction¶
Bases: ErnieRnaPreTrainedModel
Examples:
>>> from multimolecule import ErnieRnaConfig, ErnieRnaForSequencePrediction, RnaTokenizer
>>> config = ErnieRnaConfig()
>>> model = ErnieRnaForSequencePrediction(config)
>>> tokenizer = RnaTokenizer.from_pretrained("multimolecule/rna")
>>> input = tokenizer("ACGUN", return_tensors="pt")
>>> output = model(**input)
>>> output["logits"].shape
torch.Size([1, 2])
Source code in multimolecule/models/ernierna/modeling_ernierna.py
ErnieRnaForTokenPrediction¶
Bases: ErnieRnaPreTrainedModel
Examples:
>>> import torch
>>> from multimolecule import ErnieRnaConfig, ErnieRnaForTokenPrediction, RnaTokenizer
>>> config = ErnieRnaConfig()
>>> model = ErnieRnaForTokenPrediction(config)
>>> tokenizer = RnaTokenizer.from_pretrained("multimolecule/rna")
>>> input = tokenizer("ACGUN", return_tensors="pt")
>>> output = model(**input, labels=torch.randint(2, (1, 5)))
>>> output["logits"].shape
torch.Size([1, 5, 2])
>>> output["loss"]
tensor(..., grad_fn=<NllLossBackward0>)
Source code in multimolecule/models/ernierna/modeling_ernierna.py
ErnieRnaModel¶
Bases: ErnieRnaPreTrainedModel
Examples:
>>> from multimolecule import ErnieRnaConfig, ErnieRnaModel, RnaTokenizer
>>> config = ErnieRnaConfig()
>>> model = ErnieRnaModel(config)
>>> tokenizer = RnaTokenizer.from_pretrained("multimolecule/rna")
>>> input = tokenizer("ACGUN", return_tensors="pt")
>>> output = model(**input)
>>> output["last_hidden_state"].shape
torch.Size([1, 7, 768])
>>> output["pooler_output"].shape
torch.Size([1, 768])
Source code in multimolecule/models/ernierna/modeling_ernierna.py
forward¶
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `encoder_hidden_states` | `Tensor \| None` | Shape: `(batch_size, sequence_length, hidden_size)`. Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder. | `None` |
| `encoder_attention_mask` | `Tensor \| None` | Shape: `(batch_size, sequence_length)`. Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`. | `None` |
| `past_key_values` | `Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] \| None` | Tuple of length `config.num_hidden_layers`, containing precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding. If `past_key_values` are used, the user can optionally input only the last `input_ids`. | `None` |
| `use_cache` | `bool \| None` | If set to `True`, `past_key_values` key-value states are returned and can be used to speed up decoding (see `past_key_values`). | `None` |
Source code in multimolecule/models/ernierna/modeling_ernierna.py
ErnieRnaPreTrainedModel¶
Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.