Optimus 5-Prime¶
Convolutional neural network that predicts the mean ribosome load (MRL) of a fixed 50 nt human 5’ untranslated region (5’UTR) from sequence alone.
Disclaimer¶
This is an UNOFFICIAL implementation of "Human 5’ UTR design and variant effect prediction from a massively parallel translation assay" by Paul J. Sample et al.
The OFFICIAL repository of Optimus 5-Prime is at pjsample/human_5utr_modeling.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.
The team that released Optimus 5-Prime did not write a model card for it, so this model card was written by the MultiMolecule team.
Model Details¶
Optimus 5-Prime is a simple, fully feed-forward 1D convolutional network trained on a massively parallel polysome-profiling assay of ~280,000 random 50 nt 5’UTRs upstream of an eGFP reporter expressed in HEK293T cells. The network takes a fixed 50 nt 5’UTR as a one-hot tensor and applies three stacked padding="same" 1D convolutions (120 filters, kernel size 8, ReLU), with dropout between the second/third convolutions. It then flattens the per-position activations in channels-last order and passes them through a 40-unit fully connected layer and a linear regression head, which emits a single standardized mean ribosome load (MRL) score. Please refer to the Training Details section for more information on the training process.
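To illustrate why the sequence length stays at 50 through the convolutional stack, here is a minimal pure-Python sketch of a zero-padded "same" convolution. It is a single-channel toy, not the model's implementation: the real layers sum over input channels and use 120 learned filters. With an even kernel the padding cannot be symmetric; placing the extra zero on the right is an assumption of this sketch.

```python
def conv1d_same(x, kernel):
    """1D cross-correlation over a list of floats with zero 'same' padding.

    The output has the same length as the input. For an even kernel width
    the padding is asymmetric; this sketch pads (k - 1) // 2 zeros on the
    left and the remainder on the right (an assumption, not confirmed
    against the upstream code).
    """
    k = len(kernel)
    left = (k - 1) // 2
    right = (k - 1) - left
    padded = [0.0] * left + list(x) + [0.0] * right
    return [
        sum(w * padded[i + j] for j, w in enumerate(kernel))
        for i in range(len(x))
    ]


signal = [float(i % 4) for i in range(50)]  # stand-in for one input channel
kernel = [1.0 / 8] * 8                      # toy averaging filter of width 8

out = signal
for _ in range(3):                          # three stacked convolutions
    out = [max(0.0, v) for v in conv1d_same(out, kernel)]  # ReLU

# With 120 filters per layer, flattening gives 50 positions x 120 channels.
flattened_size = len(out) * 120
print(len(out), flattened_size)  # 50 6000
```

Because every layer preserves the 50 positions, the dense layer always sees a 6,000-element vector (50 positions times 120 filters).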
The MRL scalar is the per-sequence mean of polysome-profile-derived ribosome loading and is used by the original authors both to score natural human 5’UTRs and to engineer new sequences with predictable translation efficiency. Variant-effect scoring is performed externally by computing the MRL difference between the reference and alternative sequences; the model itself takes a single sequence as input.
Model Specification¶
| Num Layers | Hidden Size | Num Parameters (M) | FLOPs (M) | MACs (M) | Max Num Tokens |
|---|---|---|---|---|---|
| 4 | 40 | 0.48 | 24.04 | 12.00 | 50 |
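The parameter and MAC figures in the table can be cross-checked from the architecture description. The sketch below assumes 5 one-hot input channels (the `vocab_size` default), a bias term on every layer, and one multiply-accumulate per weight per output position; these assumptions reproduce the rounded table values but are not confirmed by the source. The FLOPs column is not reproduced here.

```python
SEQ_LEN = 50       # fixed 5'UTR length
IN_CHANNELS = 5    # assumed: one-hot channels for A, C, G, U, N
FILTERS = 120      # filters per convolution
KERNEL = 8         # convolution kernel size
NUM_CONV = 3       # stacked convolutions
HIDDEN = 40        # dense layer width


def conv_cost(c_in, c_out, kernel, length):
    """Parameters (weights + biases) and MACs of a 'same'-padded 1D conv."""
    params = kernel * c_in * c_out + c_out
    macs = length * kernel * c_in * c_out
    return params, macs


def dense_cost(n_in, n_out):
    """Parameters (weights + biases) and MACs of a fully connected layer."""
    return n_in * n_out + n_out, n_in * n_out


params = macs = 0
c_in = IN_CHANNELS
for _ in range(NUM_CONV):
    p, m = conv_cost(c_in, FILTERS, KERNEL, SEQ_LEN)
    params, macs = params + p, macs + m
    c_in = FILTERS

# Flattened conv output -> dense layer -> single regression output.
for n_in, n_out in [(SEQ_LEN * FILTERS, HIDDEN), (HIDDEN, 1)]:
    p, m = dense_cost(n_in, n_out)
    params, macs = params + p, macs + m

print(f"{params / 1e6:.2f} M parameters, {macs / 1e6:.2f} M MACs")
# 0.48 M parameters, 12.00 M MACs
```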
Links¶
- Code: multimolecule.optimus5prime
- Weights: multimolecule/optimus5prime
- Data: Massively parallel polysome-profiling MRL library on randomized 50 nt 5’UTRs in HEK293T, GEO GSE114002
- Paper: Human 5’ UTR design and variant effect prediction from a massively parallel translation assay
- Developed by: Paul J. Sample, Ban Wang, David W. Reid, Vlad Presnyak, Iain J. McFadyen, David R. Morris, Georg Seelig
- Model type: 1D CNN for mean ribosome load (MRL) regression from a fixed 50 nt 5’UTR sequence
- Original Repository: pjsample/human_5utr_modeling
Usage¶
The model file depends on the multimolecule library. You can install it using pip: `pip install multimolecule`
Direct Use¶
Mean Ribosome Load Prediction¶
You can use this model directly to predict the mean ribosome load (MRL) of a fixed 50 nt 5’UTR sequence with `Optimus5PrimeForSequencePrediction`.
The pre-regression dense representation is exposed on the backbone, `Optimus5PrimeModel`.
Interface¶
- Input length: fixed 50 nt 5’UTR sequence
- Padding: shorter sequences are right-padded with zeros to 50 nt; longer sequences are truncated to the first 50 nt
- Alphabet: `ACGUN`; the upstream checkpoint only learned the four canonical nucleotides, the `N` channel stays zero
- Special tokens: none added; `input_ids` are consumed positionally as one-hot channels
- Output: standardized mean ribosome load score (`logits`) of shape `(batch_size, 1)`; raw-MRL calibration requires the external scaler used by the upstream training workflow
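The input rules above can be sketched as a small standalone encoder. This is an illustration, not the library's tokenizer: `encode_utr` is a hypothetical helper, and giving `N` its own fifth one-hot channel is an assumption of this sketch. Uppercasing and T-to-U replacement follow the tokenizer defaults documented below.

```python
ALPHABET = "ACGUN"  # channel order is an assumption of this sketch
SEQ_LEN = 50


def encode_utr(sequence, length=SEQ_LEN):
    """One-hot encode a 5'UTR into `length` rows of len(ALPHABET) channels.

    Longer inputs keep only the first `length` nucleotides; shorter inputs
    are right-padded with all-zero rows.
    """
    seq = sequence.upper().replace("T", "U")[:length]
    rows = []
    for base in seq:
        if base not in ALPHABET:
            raise ValueError(f"unsupported nucleotide: {base!r}")
        row = [0.0] * len(ALPHABET)
        row[ALPHABET.index(base)] = 1.0
        rows.append(row)
    # Right-pad with zero rows up to the fixed length.
    rows.extend([0.0] * len(ALPHABET) for _ in range(length - len(seq)))
    return rows


encoded = encode_utr("acgt")
print(len(encoded), encoded[3])  # 50 [0.0, 0.0, 0.0, 1.0, 0.0]
```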
Variant Effect¶
Optimus 5-Prime is a single-sequence regression model. To score the effect of a variant on translation, run the reference and alternative 5’UTRs through the model independently and compute the difference between their predicted MRL values:
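A minimal sketch of this reference-versus-alternative scoring follows. `variant_effect` and `toy_predict_mrl` are hypothetical names; the toy scorer is an arbitrary placeholder so the example runs without the model, and it is not Optimus 5-Prime. The sign convention (alternative minus reference) is also an assumption. In practice `predict_mrl` would wrap a call to the real model.

```python
from typing import Callable


def variant_effect(
    predict_mrl: Callable[[str], float], reference: str, alternative: str
) -> float:
    """Predicted MRL change caused by a variant (alternative minus reference)."""
    return predict_mrl(alternative) - predict_mrl(reference)


def toy_predict_mrl(sequence: str) -> float:
    """Placeholder scorer, NOT the real model: subtracts 1 per AUG triplet."""
    seq = sequence.upper().replace("T", "U")
    return -float(seq.count("AUG"))


# Made-up 50 nt reference and a single-nucleotide C->U substitution.
reference = "G" * 10 + "ACG" + "C" * 37
alternative = reference[:11] + "U" + reference[12:]

delta = variant_effect(toy_predict_mrl, reference, alternative)
print(len(reference), len(alternative), delta)  # 50 50 -1.0
```

Both sequences are scored independently, so the same helper works for substitutions of any size as long as each input remains a valid 50 nt sequence.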
Training Details¶
Optimus 5-Prime was trained to regress the per-sequence mean ribosome load (MRL) derived from polysome profiling on a massively parallel reporter assay.
Training Data¶
Optimus 5-Prime was trained on approximately 280,000 randomized 50 nt 5’UTRs placed upstream of an eGFP reporter and expressed in HEK293T cells. Mean ribosome load was computed per sequence from polysome-fractionation read counts. The raw sequencing data are available at GEO accession GSE114002.
Training Procedure¶
Pre-training¶
The published main_MRL_model checkpoint was trained with mean-squared-error loss against standardized per-sequence MRL values. The optimizer was Adam with learning rate 1e-3, batch size 128, default Adam betas (0.9, 0.999), and epsilon 1e-8.
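As a rough illustration of training against standardized targets, the sketch below z-scores a handful of made-up MRL values, computes a mean-squared-error loss, and maps a standardized prediction back to the raw MRL scale. The values, the helper names, and the use of the population standard deviation are assumptions; the actual scaler is the external one used by the upstream training workflow.

```python
from statistics import fmean, pstdev


def fit_scaler(values):
    """Return (mean, std) for z-scoring; population std is an assumption."""
    return fmean(values), pstdev(values)


def standardize(values, mean, std):
    return [(v - mean) / std for v in values]


def inverse_standardize(values, mean, std):
    """Map standardized scores back to the raw MRL scale."""
    return [v * std + mean for v in values]


def mse(predictions, targets):
    """Mean squared error between two equal-length lists."""
    return fmean((p - t) ** 2 for p, t in zip(predictions, targets))


raw_mrl = [3.0, 5.0, 7.0, 9.0]            # made-up MRL values
mean, std = fit_scaler(raw_mrl)
targets = standardize(raw_mrl, mean, std)

predictions = [t + 0.1 for t in targets]   # pretend model output
loss = mse(predictions, targets)           # training objective
recovered = inverse_standardize(targets, mean, std)

print(round(loss, 4))  # 0.01
```

The same inverse transform is what turns the model's standardized `logits` into raw MRL values once the scaler's mean and standard deviation are known.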
Citation¶
Note
The artifacts distributed in this repository are part of the MultiMolecule project. If you use MultiMolecule in your research, you must cite the MultiMolecule project.
Contact¶
Please use the MultiMolecule GitHub issues for any questions or comments on the model card.
Please contact the authors of the Optimus 5-Prime paper for questions or comments on the paper/model.
License¶
This model implementation is licensed under the GNU Affero General Public License.
For additional terms and clarifications, please refer to our License FAQ.
multimolecule.models.optimus5prime¶
RnaTokenizer¶
Bases: Tokenizer
Tokenizer for RNA sequences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `Alphabet \| str \| List[str] \| None` | Alphabet to use for tokenization. | `None` |
| | `int` | Size of kmer to tokenize. | `1` |
| | `bool` | Whether to tokenize into codons. | `False` |
| | `bool` | Whether to replace T with U. | `True` |
| | `bool` | Whether to convert input to uppercase. | `True` |
Examples:
Source code in multimolecule/tokenisers/rna/tokenization_rna.py
Optimus5PrimeConfig¶
Bases: PreTrainedConfig
This is the configuration class to store the configuration of an
Optimus5PrimeModel. It is used to instantiate an Optimus 5-Prime model
according to the specified arguments, defining the model architecture. Instantiating a configuration with the
defaults will yield a similar configuration to that of the Optimus 5-Prime main MRL model from
pjsample/human_5utr_modeling.
Configuration objects inherit from PreTrainedConfig and can be used to
control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `int` | Vocabulary size of the Optimus 5-Prime model. Defines the number of one-hot input channels. | `5` |
| | `int` | The fixed 5’UTR input sequence length Optimus 5-Prime was trained on (50 nt). | `50` |
| | `int` | Number of stacked 1D convolutions. The published main MRL model uses 3. | `3` |
| | `int` | Number of filters in every convolution. The published main MRL model uses 120. | `120` |
| | `int` | Convolution kernel size. The published main MRL model uses 8. | `8` |
| | `float` | Dropout probability applied after each intermediate convolution. The published main MRL model uses 0.0. | `0.0` |
| | `int` | Size of the fully connected layer between the convolutional stack and the regression output. The published main MRL model uses 40. | `40` |
| | `float` | Dropout probability applied after the dense hidden layer. The published main MRL model uses 0.2. | `0.2` |
| | `str` | The non-linear activation function used by the convolutional and dense layers. | `'relu'` |
| | `int` | Number of output labels. Optimus 5-Prime predicts a single mean ribosome load (MRL) scalar, so this defaults to 1. | `1` |
| | `HeadConfig \| None` | The configuration of the sequence-level prediction head. Defaults to a regression head. | `None` |
Examples:
Source code in multimolecule/models/optimus5prime/configuration_optimus5prime.py
Optimus5PrimeForSequencePrediction¶
Bases: Optimus5PrimePreTrainedModel
Optimus 5-Prime model with a sequence-level prediction head.
The published model is a regression network that predicts the mean ribosome load (MRL) scalar for a fixed 50 nt 5’UTR. This wrapper exposes the converted upstream regression decoder through the standard MultiMolecule sequence-prediction head.
Examples:
Source code in multimolecule/models/optimus5prime/modeling_optimus5prime.py
Optimus5PrimeModel¶
Bases: Optimus5PrimePreTrainedModel
The bare Optimus 5-Prime model outputting the pre-regression shared representation.
Examples:
Source code in multimolecule/models/optimus5prime/modeling_optimus5prime.py
Optimus5PrimeModelOutput `dataclass`¶
Bases: ModelOutput
Base class for outputs of the Optimus 5-Prime model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `torch.FloatTensor` of shape `(batch_size, hidden_size)` | The pre-regression dense representation consumed by the MRL regression layer. | `None` |
Source code in multimolecule/models/optimus5prime/modeling_optimus5prime.py
Optimus5PrimePreTrainedModel¶
Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.