Framepool¶
Frame-aware pooling convolutional network for predicting mean ribosome load from variable-length 5’UTR sequences.
Disclaimer¶
This is an UNOFFICIAL implementation of Predicting mean ribosome load for 5’UTR of any length using deep learning by Alexander Karollus et al.
The OFFICIAL repository of Framepool is at Karollus/5UTR and the published Kipoi wrapper is at kipoi/models.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.
The team that released Framepool did not write a model card for this model, so this model card was written by the MultiMolecule team.
Model Details¶
Framepool is a small 1D convolutional network that predicts the mean ribosome load (MRL) of a human 5’ untranslated region from sequence alone. It extends the fixed-length network of Sample et al., 2019 with a frame-aware pooling layer that reverses the sequence to anchor reading frames at the start codon, slices the convolutional feature map into the three reading frames, and applies global max and masked global average pooling per frame. The pooled representation is length-independent and is consumed by a small dense head followed by a per-sub-library scaling regression that recalibrates the prediction across the two training libraries (egfp_unmod_1 and random). Please refer to the Training Details section for more information on the training process.
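The pooling step described above can be illustrated with a minimal pure-Python sketch. It is not the MultiMolecule or upstream implementation: the function name `frame_pool`, the list-of-lists feature map, and the order in which pooled values are concatenated are all assumptions made for illustration.

```python
from typing import List, Sequence


def frame_pool(features: Sequence[Sequence[float]], mask: Sequence[int]) -> List[float]:
    """Pool a (length, channels) feature map separately for each reading frame.

    Hypothetical helper: masked positions are dropped, the remaining positions
    are reversed so that index 0 is the position next to the start codon, and
    frame k then holds reversed positions k, k + 3, k + 6, ...
    """
    num_channels = len(features[0])
    # Keep only unmasked positions, then anchor the frames at the 3' end.
    kept = [row for row, keep in zip(features, mask) if keep]
    kept.reverse()

    pooled: List[float] = []
    for frame in range(3):
        rows = kept[frame::3]
        for channel in range(num_channels):
            column = [row[channel] for row in rows]
            # Frames with no positions (very short inputs) fall back to zero.
            pooled.append(max(column) if column else 0.0)
            pooled.append(sum(column) / len(column) if column else 0.0)
    return pooled


# The output length depends only on the number of channels, not on the input length.
short = frame_pool([[0.5, 1.0]] * 10, [1] * 10)
long = frame_pool([[0.5, 1.0]] * 100, [1] * 100)
print(len(short), len(long))
```

With 128 channels, three frames, and two pooling types, this layout would give 3 × 2 × 128 = 768 values, which is consistent with the Hidden Size listed in the specification table, although the real implementation may order or combine the values differently.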
The released combined_residual checkpoint is recommended by the upstream authors for variant effect scoring; it is the checkpoint exposed by the official Kipoi Framepool entry.
Model Specification¶
| Num Layers | Hidden Size | Num Parameters (M) | FLOPs (G) | MACs (G) | Max Num Tokens |
|---|---|---|---|---|---|
| 4 | 768 | 0.28 | 0.05 | 0.02 | unlimited |
Links¶
- Code: multimolecule.framepool
- Weights: multimolecule/framepool
- Data: eGFP polysome-profiling massively parallel reporter assay (MPRA) from Sample et al., 2019, HEK293T cells, fixed-length (50 nt) and variable-length (25-100 nt) 5’UTR libraries
- Paper: Predicting mean ribosome load for 5’UTR of any length using deep learning
- Developed by: Alexander Karollus, Žiga Avsec, Julien Gagneur
- Model type: 1D residual CNN with frame-aware pooling for mean-ribosome-load prediction from 5’UTR sequence
- Original Repository: Karollus/5UTR
Usage¶
The model file depends on the multimolecule library. You can install it using pip: `pip install multimolecule`.
Direct Use¶
Mean Ribosome Load Prediction¶
You can use this model directly to predict the mean ribosome load of a 5’UTR sequence:
Interface¶
- Input length: variable; the upstream MPRA training data is 25-100 nt 5’UTR but the model accepts any length because of frame-aware pooling
- Alphabet: DNA (`A`, `C`, `G`, `T`); `N` and other non-canonical tokens are encoded as all-zero columns and ignored by the masked pooling
- Padding: zero-padding is supported via `attention_mask` and is excluded from pooling
- Output: single scalar per sequence, the predicted mean ribosome load (`logits`, shape `(batch_size, 1)`)
- Auxiliary inputs: optional `library_indicator` (shape `(batch_size, library_size)`) selecting one of the two training sub-libraries for the scaling regression. Defaults to the `random` library, matching the upstream Kipoi variant effect interface
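The input conventions listed above can be sketched in plain Python. This is a simplified illustration, not the library's tokenizer: it uses four channels in an assumed `A`, `C`, `G`, `T` order, and the helper names `one_hot`, `pad_batch`, and `library_indicator` are hypothetical. Mapping index 1 to the `random` library is also an assumption.

```python
from typing import List, Sequence, Tuple

CHANNELS = "ACGT"  # assumed channel order, for illustration only


def one_hot(sequence: str) -> List[List[float]]:
    """Encode a DNA string as (length, 4) rows; non-ACGT symbols become all-zero rows."""
    rows = []
    for base in sequence.upper():
        row = [0.0] * len(CHANNELS)
        index = CHANNELS.find(base)
        if index >= 0:
            row[index] = 1.0
        rows.append(row)
    return rows


def pad_batch(sequences: Sequence[str]) -> Tuple[list, list]:
    """Right-pad encoded sequences with zero rows and build the matching attention mask."""
    encoded = [one_hot(s) for s in sequences]
    max_len = max(len(e) for e in encoded)
    inputs, masks = [], []
    for e in encoded:
        pad = max_len - len(e)
        inputs.append(e + [[0.0] * len(CHANNELS) for _ in range(pad)])
        masks.append([1] * len(e) + [0] * pad)
    return inputs, masks


def library_indicator(batch_size: int, library_index: int = 1, library_size: int = 2) -> List[List[float]]:
    """Build a one-hot (batch_size, library_size) indicator selecting one sub-library."""
    return [[1.0 if i == library_index else 0.0 for i in range(library_size)]
            for _ in range(batch_size)]


inputs, masks = pad_batch(["ACGTN", "AC"])
print(masks)                 # padded positions are marked with 0
print(library_indicator(2))  # every row selects index 1
```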
Variant Effect¶
Framepool supports paired reference/alternative scoring through the optional alternative_input_ids argument:
- Single sequence (reference only): `logits` is the predicted mean ribosome load (one scalar per sequence)
- Reference + alternative: `logits` is the `log2` mean ribosome load fold change `log2(MRL_alt / MRL_ref)`, matching the Kipoi `UTRVariantEffectModel.predict_on_batch` `mrl_fold_change` output
- Reference and alternative sequences are scored independently; both must use the same `library_indicator` so that the scaling regression cancels out of the fold change
- For the upstream “shifted-frame” variant effect outputs (`shift_1`, `shift_2`), prepend one or two zero columns (or `N` tokens) to both reference and alternative inputs before scoring, matching the Kipoi loop
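The scoring rules above reduce to a small amount of arithmetic, sketched below. The `score` callable stands in for any function that maps a sequence to a positive predicted MRL; it is a hypothetical placeholder, as are the helper names, and the toy scorer at the end is not a model.

```python
import math
from typing import Callable, Dict


def log2_fold_change(mrl_ref: float, mrl_alt: float) -> float:
    """Return log2(MRL_alt / MRL_ref) for two positive MRL predictions."""
    return math.log2(mrl_alt / mrl_ref)


def shift_frame(sequence: str, shift: int) -> str:
    """Prepend `shift` N tokens (0, 1 or 2) to a sequence."""
    if shift not in (0, 1, 2):
        raise ValueError("shift must be 0, 1 or 2")
    return "N" * shift + sequence


def variant_effects(score: Callable[[str], float], reference: str, alternative: str) -> Dict[str, float]:
    """Score reference and alternative independently for each frame shift."""
    names = {0: "mrl_fold_change", 1: "shift_1", 2: "shift_2"}
    return {
        name: log2_fold_change(
            score(shift_frame(reference, shift)),
            score(shift_frame(alternative, shift)),
        )
        for shift, name in names.items()
    }


def toy_score(sequence: str) -> float:
    """Toy stand-in for a model: MRL grows with the number of A bases."""
    return 1.0 + sequence.count("A")


print(variant_effects(toy_score, "ACGT", "AAGT"))
```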
Training Details¶
Framepool was trained on polysome-profiling MPRA data measuring the mean ribosome load of randomized 5’UTR sequences and uses frame-aware pooling so that a single network can score sequences of arbitrary length.
Training Data¶
Framepool was trained on the eGFP polysome-profiling MPRA libraries of Sample et al., 2019 in HEK293T cells: the fixed-length library (egfp_unmod_1, 50 nt) and the variable-length library (random, 25-100 nt). Approximately 260,000 sequences were used for training, with 20,000 held out for testing; additional validation was performed on endogenous data.
Training Procedure¶
Pre-training¶
- Loss: mean squared error between the predicted and measured mean ribosome load
- Optimizer: Adam with `lr = 1e-3`, `beta_1 = 0.9`, `beta_2 = 0.999`, `epsilon = 1e-8`
- Epochs: 6
- Mini-batch sampling: the two training libraries are mixed within every batch; a one-hot library indicator is fed to the scaling regression layer so that the network can absorb the library-specific offset
Citation¶
Note
The artifacts distributed in this repository are part of the MultiMolecule project. If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:
Contact¶
Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the Framepool paper for questions or comments on the paper/model.
License¶
This model implementation is licensed under the GNU Affero General Public License.
For additional terms and clarifications, please refer to our License FAQ.
multimolecule.models.framepool¶
DnaTokenizer¶
Bases: Tokenizer
Tokenizer for DNA sequences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `Alphabet \| str \| List[str] \| None` | Alphabet to use for tokenization. | `None` |
| | `int` | Size of kmer to tokenize. | `1` |
| | `bool` | Whether to tokenize into codons. | `False` |
| | `bool` | Whether to replace U with T. | `True` |
| | `bool` | Whether to convert input to uppercase. | `True` |
Examples:
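The options in the table can be illustrated with a small standalone sketch. It does not call the real `DnaTokenizer`; the function names and keyword names below are hypothetical, and keeping a trailing partial codon as-is is an assumption about behaviour that the source does not specify.

```python
from typing import List


def normalize(sequence: str, replace_u_with_t: bool = True, upper: bool = True) -> str:
    """Apply the uppercase and U-to-T options to a raw nucleotide string."""
    if upper:
        sequence = sequence.upper()
    if replace_u_with_t:
        # Handle both cases so the option also works when `upper` is False.
        sequence = sequence.replace("U", "T").replace("u", "t")
    return sequence


def codons(sequence: str) -> List[str]:
    """Split a sequence into consecutive, non-overlapping triplets."""
    return [sequence[i:i + 3] for i in range(0, len(sequence), 3)]


print(normalize("augGcc"))
print(codons(normalize("augGcc")))
```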
Source code in multimolecule/tokenisers/dna/tokenization_dna.py
FramepoolConfig¶
Bases: PreTrainedConfig
This is the configuration class to store the configuration of a
FramepoolModel. It is used to instantiate a Framepool model according to
the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will
yield a similar configuration to that of the Framepool combined_residual architecture released with the
Karollus et al., 2021 paper.
Configuration objects inherit from PreTrainedConfig and can be used to
control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `int` | Number of one-hot input channels derived from the MultiMolecule DNA tokenizer. Defaults to 5 (… | `5` |
| | `int \| None` | Channel index that represents the upstream “no nucleobase” token (… | `4` |
| | `int` | Number of stacked length-preserving residual convolutions in the encoder. | `3` |
| | `int` | Number of output channels for every convolution in the encoder. | `128` |
| | `int \| list[int]` | Kernel sizes of the encoder convolutions. Either a scalar shared across all layers, or a list with one entry per layer. | `7` |
| | `int \| list[int]` | Dilation rates of the encoder convolutions. Either a scalar shared across all layers, or a list with one entry per layer. | `1` |
| | `str` | Non-linear activation applied after each encoder convolution. | `'relu'` |
| | `str` | Convolution padding mode. | `'same'` |
| | `str` | | `'residual'` |
| | `int` | Number of fully-connected layers between the frame-pooled representation and the unscaled MRL output. | `1` |
| | `list[int] \| None` | Hidden sizes of the fully-connected layers. Length must match … | `None` |
| | `float` | Dropout probability applied after every fully-connected layer. | `0.2` |
| | `bool` | If … | `False` |
| | `int` | Number of training sub-libraries supported by the scaling regression head. The released checkpoint was trained jointly on the … | `2` |
| | `int` | Default training sub-library index used to construct the one-hot library indicator at inference. Matches the … | `1` |
| | `int` | Number of scalar outputs of the model. Framepool predicts a single scalar mean ribosome load value. | `1` |
| | `HeadConfig \| None` | Configuration of the … | `None` |
Examples:
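The "scalar or one entry per layer" convention used for kernel sizes and dilation rates can be sketched as follows. This is an illustrative helper rather than code from `FramepoolConfig`; the names `per_layer` and `receptive_field` are hypothetical, and the receptive-field formula is the standard one for stacked stride-1 convolutions, applied here only to the layers passed in.

```python
from typing import List, Sequence, Union


def per_layer(value: Union[int, Sequence[int]], num_layers: int, name: str = "value") -> List[int]:
    """Expand a scalar to one entry per layer, or validate an explicit list."""
    if isinstance(value, int):
        return [value] * num_layers
    values = list(value)
    if len(values) != num_layers:
        raise ValueError(f"{name} has {len(values)} entries, expected {num_layers}")
    return values


def receptive_field(kernel_sizes: Sequence[int], dilations: Sequence[int]) -> int:
    """Receptive field of stacked stride-1 convolutions: 1 + sum((k - 1) * d)."""
    return 1 + sum((k - 1) * d for k, d in zip(kernel_sizes, dilations))


kernels = per_layer(7, 3, name="kernel sizes")
dilations = per_layer(1, 3, name="dilation rates")
print(kernels, dilations, receptive_field(kernels, dilations))
```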
Source code in multimolecule/models/framepool/configuration_framepool.py
FramepoolForSequencePrediction¶
Bases: FramepoolPreTrainedModel
Framepool with a sequence-level prediction head.
When called with a single sequence the head returns the unscaled mean ribosome load (MRL) prediction. When called
with both a reference and an alternative sequence it returns the log2 mean ribosome load fold change
(log2(alternative / reference)), matching the upstream Kipoi
UTRVariantEffectModel variant effect interface.
Examples:
Source code in multimolecule/models/framepool/modeling_framepool.py
FramepoolModel¶
Bases: FramepoolPreTrainedModel
The bare Framepool model, producing a frame-aware representation from a 5’UTR sequence.
Framepool replaces the fixed-length flatten of Sample et al., 2019 with a frame-aware pooling layer that splits the convolutional feature map into the three reading frames relative to the start codon and pools each frame independently. The resulting representation is length-independent.
Examples:
Source code in multimolecule/models/framepool/modeling_framepool.py
FramepoolModelOutput `dataclass`¶
Bases: ModelOutput
Base class for outputs of the Framepool model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `torch.FloatTensor` of shape `(batch_size, hidden_size)` | The concatenation of per-frame max (and optionally average) pooled feature vectors consumed by the sequence-level prediction head. | `None` |
| | `torch.FloatTensor` of shape `(batch_size, sequence_length, num_filters)` | The encoder feature map before the frame-aware pooling, with padded positions zeroed out. | `None` |
Source code in multimolecule/models/framepool/modeling_framepool.py
FramepoolPreTrainedModel¶
Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.