
OptMRL

Convolutional neural network for predicting the mean ribosome load (MRL) of an mRNA from the 50 nucleotides upstream of the coding sequence.

Disclaimer

This is an UNOFFICIAL implementation of Interpreting Deep Neural Networks for the Prediction of Translation Rates by Frederick Korbel et al.

The OFFICIAL repository of OptMRL is at ohlerlab/mlcis.

Tip

The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

The team that released OptMRL did not write a model card for it, so this model card was written by the MultiMolecule team.

Model Details

OptMRL is a small 1D convolutional neural network trained to predict the mean ribosome load (MRL), a polysome-profiling-derived translation efficiency proxy, from the 50 nucleotides of 5’ untranslated region (5’UTR) sequence immediately upstream of the coding sequence. The model was first pre-trained on roughly 260,000 random 5’UTR reporters and then fine-tuned on roughly 20,000 endogenous human 5’UTRs. Please refer to the Training Details section for more information on the training process.

The architecture is a stack of three Conv1D layers (120 filters, kernel size 8, same padding, ReLU activation) followed by a Flatten, a 40-unit Dense bottleneck with ReLU activation and dropout, and a final scalar Dense regression head.

Model Specification

Num Layers | Hidden Size | Num Parameters | FLOPs | MACs | Max Num Tokens
5 | 40 | 475,641 | 24,036,161 | 12,000,040 | 50
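
The parameter and MAC figures in the table can be reproduced from the layer sizes given in the architecture description. The sketch below is only a back-of-the-envelope check, not the tool used to generate the table. It assumes a 5-channel one-hot input (A, C, G, U, N), a bias on every layer, and same padding so that each convolution is evaluated at all 50 positions.

```python
# Layer sizes taken from the architecture description; the counting rules are assumptions.
VOCAB_SIZE = 5   # input channels (A, C, G, U, N)
SEQ_LEN = 50     # fixed 5'UTR window length
FILTERS = 120    # filters per convolution
KERNEL = 8       # kernel size of every convolution
NUM_CONV = 3     # number of stacked convolutions
DENSE = 40       # units in the dense bottleneck


def conv1d_params(in_channels, out_channels, kernel):
    """Weights plus one bias per output channel."""
    return in_channels * out_channels * kernel + out_channels


def linear_params(in_features, out_features):
    """Weights plus one bias per output unit."""
    return in_features * out_features + out_features


params = 0
macs = 0
in_channels = VOCAB_SIZE
for _ in range(NUM_CONV):
    params += conv1d_params(in_channels, FILTERS, KERNEL)
    # Same padding keeps all 50 positions, so each filter is applied 50 times.
    macs += SEQ_LEN * in_channels * FILTERS * KERNEL
    in_channels = FILTERS

flattened = SEQ_LEN * FILTERS           # 6000 features after Flatten
params += linear_params(flattened, DENSE)
macs += flattened * DENSE
params += linear_params(DENSE, 1)       # scalar regression head
macs += DENSE

print(f"{params:,}")  # 475,641
print(f"{macs:,}")    # 12,000,040
```

Most of the parameters sit in the two 120-to-120 convolutions and in the 6000-to-40 dense layer; the first convolution and the head are comparatively tiny.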

Usage

The model file depends on the multimolecule library. You can install it using pip:

Bash
pip install multimolecule

Direct Use

Mean Ribosome Load Prediction

You can use this model directly to predict the mean ribosome load of a 50-nucleotide 5’UTR window:

Python
>>> from multimolecule import RnaTokenizer, OptMrlForSequencePrediction

>>> tokenizer = RnaTokenizer.from_pretrained("multimolecule/optmrl")
>>> model = OptMrlForSequencePrediction.from_pretrained("multimolecule/optmrl")
>>> sequence = "ACGU" * 12 + "AC"  # 50 nt
>>> inputs = tokenizer(sequence, add_special_tokens=False, return_tensors="pt")
>>> output = model(**inputs)

>>> output.logits.shape
torch.Size([1, 1])

Interface

  • Input length: 50 nt fixed 5’UTR window taken immediately upstream of the coding sequence
  • Padding: shorter sequences are right-padded with zeros to 50 nt; longer sequences are truncated to the first 50 nt
  • Alphabet: ACGU only; unknown / N tokens contribute zero one-hot signal
  • Special tokens: do not add (add_special_tokens=False)
  • Output: single scalar mean-ribosome-load (MRL) score per window
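
The interface rules above can be illustrated with a small, dependency-free sketch. It is not the MultiMolecule implementation: `encode_window` is a hypothetical helper, and it uses four one-hot channels (A, C, G, U) rather than the five-channel layout the model config describes. It only shows the truncation, right-padding, and zero-signal rules as listed.

```python
ALPHABET = "ACGU"
WINDOW = 50


def encode_window(sequence, window=WINDOW):
    """Hypothetical helper: one-hot encode a 5'UTR window as `window` rows of 4 values."""
    # Normalise case, map T to U, and keep only the first `window` nucleotides.
    seq = sequence.upper().replace("T", "U")[:window]
    rows = []
    for base in seq:
        row = [0.0] * len(ALPHABET)
        idx = ALPHABET.find(base)
        if idx >= 0:  # N and other unknown characters stay all-zero
            row[idx] = 1.0
        rows.append(row)
    # Right-pad short inputs with all-zero rows up to `window`.
    rows.extend([0.0] * len(ALPHABET) for _ in range(window - len(rows)))
    return rows


encoded = encode_window("acgtn")
print(len(encoded))   # 50
print(encoded[3])     # [0.0, 0.0, 0.0, 1.0]  (t is read as U)
print(encoded[4])     # [0.0, 0.0, 0.0, 0.0]  (N carries no signal)
```

Because padding and unknown bases both map to zero rows in this sketch, a short window and a window ending in N characters look identical to the convolutions.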

Training Details

OptMRL was first pre-trained on a large random-5’UTR reporter library and then fine-tuned on a smaller library of endogenous human 5’UTRs.

Training Data

  • Pre-training: ~260,000 random 5’UTR reporters paired with polysome-profiling MRL measurements.
  • Fine-tuning: ~20,000 endogenous human 5’UTR reporters paired with polysome-profiling MRL measurements.

Each reporter contributes a 50-nucleotide 5’UTR window immediately upstream of the coding sequence and a scalar MRL label.

Note: RnaTokenizer converts “T”s to “U”s for you. You can disable this behaviour by passing replace_T_with_U=False.

Training Procedure

Pre-training

The model was first pre-trained as a regression task to predict the measured MRL of each random 5’UTR reporter, then fine-tuned end-to-end on the human-5’UTR reporters using the same regression objective. The published checkpoint is the fine-tuned model.

Citation

BibTeX
@article{korbel2023interpreting,
  author    = {Korbel, Frederick and Eroshok, Ekaterina and Ohler, Uwe},
  title     = {Interpreting Deep Neural Networks for the Prediction of Translation Rates},
  journal   = {bioRxiv},
  publisher = {Cold Spring Harbor Laboratory},
  year      = {2023},
  doi       = {10.1101/2023.06.02.543405}
}

Note

The artifacts distributed in this repository are part of the MultiMolecule project. If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:

BibTeX
@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}

Contact

Please use GitHub issues of MultiMolecule for any questions or comments on the model card.

Please contact the authors of the OptMRL paper for questions or comments on the paper/model.

License

This model implementation is licensed under the GNU Affero General Public License.

For additional terms and clarifications, please refer to our License FAQ.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later

multimolecule.models.optmrl

RnaTokenizer

Bases: Tokenizer

Tokenizer for RNA sequences.

Parameters:

Name Type Description Default

alphabet

Alphabet | str | List[str] | None

alphabet to use for tokenization.

  • If is None, the standard RNA alphabet will be used.
  • If is a string, it should correspond to the name of a predefined alphabet. The options include
    • standard
    • extended
    • streamline
    • nucleobase
  • If is an alphabet or a list of characters, that specific alphabet will be used.
None

nmers

int

Size of kmer to tokenize.

1

codon

bool

Whether to tokenize into codons.

False

replace_T_with_U

bool

Whether to replace T with U.

True

do_upper_case

bool

Whether to convert input to uppercase.

True

Examples:

Python Console Session
>>> from multimolecule import RnaTokenizer
>>> tokenizer = RnaTokenizer()
>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHVIX|.*-?')["input_ids"]
[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 2]
>>> tokenizer('acgu')["input_ids"]
[1, 6, 7, 8, 9, 2]
>>> tokenizer('acgt')["input_ids"]
[1, 6, 7, 8, 9, 2]
>>> tokenizer = RnaTokenizer(replace_T_with_U=False)
>>> tokenizer('acgt')["input_ids"]
[1, 6, 7, 8, 3, 2]
>>> tokenizer = RnaTokenizer(nmers=3)
>>> tokenizer('uagcuuauc')["input_ids"]
[1, 83, 17, 64, 49, 96, 84, 22, 2]
>>> tokenizer = RnaTokenizer(codon=True)
>>> tokenizer('uagcuuauc')["input_ids"]
[1, 83, 49, 22, 2]
>>> tokenizer('uagcuuauca')["input_ids"]
Traceback (most recent call last):
ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10
Source code in multimolecule/tokenisers/rna/tokenization_rna.py
Python
class RnaTokenizer(Tokenizer):
    """
    Tokenizer for RNA sequences.

    Args:
        alphabet: alphabet to use for tokenization.

            - If is `None`, the standard RNA alphabet will be used.
            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include
                + `standard`
                + `extended`
                + `streamline`
                + `nucleobase`
            - If is an alphabet or a list of characters, that specific alphabet will be used.
        nmers: Size of kmer to tokenize.
        codon: Whether to tokenize into codons.
        replace_T_with_U: Whether to replace T with U.
        do_upper_case: Whether to convert input to uppercase.

    Examples:
        >>> from multimolecule import RnaTokenizer
        >>> tokenizer = RnaTokenizer()
        >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHVIX|.*-?')["input_ids"]
        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 2]
        >>> tokenizer('acgu')["input_ids"]
        [1, 6, 7, 8, 9, 2]
        >>> tokenizer('acgt')["input_ids"]
        [1, 6, 7, 8, 9, 2]
        >>> tokenizer = RnaTokenizer(replace_T_with_U=False)
        >>> tokenizer('acgt')["input_ids"]
        [1, 6, 7, 8, 3, 2]
        >>> tokenizer = RnaTokenizer(nmers=3)
        >>> tokenizer('uagcuuauc')["input_ids"]
        [1, 83, 17, 64, 49, 96, 84, 22, 2]
        >>> tokenizer = RnaTokenizer(codon=True)
        >>> tokenizer('uagcuuauc')["input_ids"]
        [1, 83, 49, 22, 2]
        >>> tokenizer('uagcuuauca')["input_ids"]
        Traceback (most recent call last):
        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10
    """

    model_input_names = ["input_ids", "attention_mask"]

    def __init__(
        self,
        alphabet: Alphabet | str | List[str] | None = None,
        nmers: int = 1,
        codon: bool = False,
        replace_T_with_U: bool = True,
        do_upper_case: bool = True,
        additional_special_tokens: List | Tuple | None = None,
        **kwargs,
    ):
        if codon and (nmers > 1 and nmers != 3):
            raise ValueError("Codon and nmers cannot be used together.")
        if codon:
            nmers = 3  # set to 3 to get correct vocab
        if not isinstance(alphabet, Alphabet):
            alphabet = get_alphabet(alphabet, nmers=nmers)
        super().__init__(
            alphabet=alphabet,
            nmers=nmers,
            codon=codon,
            replace_T_with_U=replace_T_with_U,
            do_upper_case=do_upper_case,
            additional_special_tokens=additional_special_tokens,
            **kwargs,
        )
        self.replace_T_with_U = replace_T_with_U
        self.nmers = nmers
        self.codon = codon

    def _tokenize(self, text: str, **kwargs):
        if self.do_upper_case:
            text = text.upper()
        if self.replace_T_with_U:
            text = text.replace("T", "U")
        if self.codon:
            if len(text) % 3 != 0:
                raise ValueError(
                    f"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}"
                )
            return [text[i : i + 3] for i in range(0, len(text), 3)]
        if self.nmers > 1:
            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203
        return list(text)
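
The difference between the `nmers` and `codon` branches of `_tokenize` is easiest to see on the example sequence from the docstring. The sketch below re-implements just the splitting logic in plain Python. The function names `sliding_kmers` and `codons` are illustrative and not part of the library, and no vocabulary lookup is performed.

```python
def sliding_kmers(text, k):
    """Overlapping k-mers with stride 1, mirroring the `nmers > 1` branch."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def codons(text):
    """Non-overlapping triplets, mirroring the `codon=True` branch."""
    if len(text) % 3 != 0:
        raise ValueError(f"length must be a multiple of 3, but got {len(text)}")
    return [text[i:i + 3] for i in range(0, len(text), 3)]


# Same normalisation as the tokenizer defaults: upper-case, then T -> U.
seq = "uagcuuauc".upper().replace("T", "U")
kmers = sliding_kmers(seq, 3)
triplets = codons(seq)
print(kmers)     # ['UAG', 'AGC', 'GCU', 'CUU', 'UUA', 'UAU', 'AUC']
print(triplets)  # ['UAG', 'CUU', 'AUC']
```

The codon tokens are exactly every third element of the sliding 3-mer list, which is why the IDs 83, 49 and 22 appear in both docstring outputs.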

OptMrlConfig

Bases: PreTrainedConfig

This is the configuration class to store the configuration of an OptMrlModel. It is used to instantiate an OptMRL model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the OptMRL ohlerlab/mlcis architecture.

OptMRL predicts the mean ribosome load (MRL) of an mRNA from the 50 nucleotides immediately upstream of the coding sequence. The published architecture is a three-layer 1D convolutional stack (same padding, length preserved) followed by a flatten step, a dense bottleneck, and a scalar regression head.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Parameters:

Name Type Description Default

vocab_size

int

Vocabulary size of the OptMRL model. Defines the number of input channels of the first convolution. Defaults to 5 (A, C, G, U, N), matching the MultiMolecule RNA streamline alphabet. The upstream checkpoint only uses the first four (A, C, G, U); the N channel stays zero.

5

sequence_length

int

The fixed 5’UTR input sequence length OptMRL was trained on (50 nt upstream of the coding sequence).

50

num_conv_layers

int

Number of stacked 1D convolutions. The published OptMRL uses three.

3

conv_filters

int

Number of filters in each convolutional layer.

120

conv_kernel_size

int

Kernel size (sequence span) of each convolutional layer. Convolutions use same padding so the output length matches sequence_length after every layer.

8

conv_dropout

float

Dropout probability applied after the second and third convolutions.

0.0

dense_size

int

Number of units in the dense bottleneck consumed by the regression head.

40

dense_dropout

float

Dropout probability applied after the dense bottleneck activation.

0.2

hidden_act

str

The non-linear activation function used by the convolutional and dense layers.

'relu'

num_labels

int

Number of output labels. OptMRL is a single-output regression model, so this defaults to 1.

1

head

HeadConfig | None

The configuration of the sequence-level prediction head. Defaults to a regression head (problem_type="regression").

None

Examples:

Python Console Session
>>> from multimolecule import OptMrlConfig, OptMrlModel
>>> # Initializing an OptMRL ohlerlab/mlcis style configuration
>>> configuration = OptMrlConfig()
>>> # Initializing a model (with random weights) from the configuration
>>> model = OptMrlModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
Source code in multimolecule/models/optmrl/configuration_optmrl.py
Python
class OptMrlConfig(PreTrainedConfig):
    r"""
    This is the configuration class to store the configuration of an
    [`OptMrlModel`][multimolecule.models.OptMrlModel]. It is used to instantiate an OptMRL model according to the
    specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a
    similar configuration to that of the OptMRL [ohlerlab/mlcis](https://github.com/ohlerlab/mlcis) architecture.

    OptMRL predicts the mean ribosome load (MRL) of an mRNA from the 50 nucleotides immediately upstream of the coding
    sequence. The published architecture is a three-layer 1D convolutional stack (same padding, length preserved)
    followed by a flatten step, a dense bottleneck, and a scalar regression head.

    Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to
    control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]
    for more information.

    Args:
        vocab_size:
            Vocabulary size of the OptMRL model. Defines the number of input channels of the first convolution.
            Defaults to 5 (`A`, `C`, `G`, `U`, `N`), matching the MultiMolecule RNA `streamline` alphabet. The
            upstream checkpoint only uses the first four (`A`, `C`, `G`, `U`); the `N` channel stays zero.
        sequence_length:
            The fixed 5'UTR input sequence length OptMRL was trained on (50 nt upstream of the coding sequence).
        num_conv_layers:
            Number of stacked 1D convolutions. The published OptMRL uses three.
        conv_filters:
            Number of filters in each convolutional layer.
        conv_kernel_size:
            Kernel size (sequence span) of each convolutional layer. Convolutions use `same` padding so the
            output length matches `sequence_length` after every layer.
        conv_dropout:
            Dropout probability applied after the second and third convolutions.
        dense_size:
            Number of units in the dense bottleneck consumed by the regression head.
        dense_dropout:
            Dropout probability applied after the dense bottleneck activation.
        hidden_act:
            The non-linear activation function used by the convolutional and dense layers.
        num_labels:
            Number of output labels. OptMRL is a single-output regression model, so this defaults to 1.
        head:
            The configuration of the sequence-level prediction head. Defaults to a regression head
            (`problem_type="regression"`).

    Examples:
        >>> from multimolecule import OptMrlConfig, OptMrlModel
        >>> # Initializing an OptMRL ohlerlab/mlcis style configuration
        >>> configuration = OptMrlConfig()
        >>> # Initializing a model (with random weights) from the configuration
        >>> model = OptMrlModel(configuration)
        >>> # Accessing the model configuration
        >>> configuration = model.config
    """

    model_type = "optmrl"

    def __init__(
        self,
        vocab_size: int = 5,
        sequence_length: int = 50,
        num_conv_layers: int = 3,
        conv_filters: int = 120,
        conv_kernel_size: int = 8,
        conv_dropout: float = 0.0,
        dense_size: int = 40,
        dense_dropout: float = 0.2,
        hidden_act: str = "relu",
        num_labels: int = 1,
        head: HeadConfig | None = None,
        **kwargs,
    ):
        super().__init__(num_labels=num_labels, **kwargs)
        if vocab_size < 4:
            raise ValueError(
                f"vocab_size ({vocab_size}) must be at least 4 to cover the canonical nucleotide alphabet `ACGU`."
            )
        if sequence_length < 1:
            raise ValueError(f"sequence_length ({sequence_length}) must be a positive integer.")
        if conv_kernel_size < 1:
            raise ValueError(f"conv_kernel_size ({conv_kernel_size}) must be a positive integer.")
        if num_conv_layers < 1:
            raise ValueError(f"num_conv_layers ({num_conv_layers}) must be a positive integer.")
        self.vocab_size = vocab_size
        self.sequence_length = sequence_length
        self.num_conv_layers = num_conv_layers
        self.conv_filters = conv_filters
        self.conv_kernel_size = conv_kernel_size
        self.conv_dropout = conv_dropout
        self.dense_size = dense_size
        self.dense_dropout = dense_dropout
        self.hidden_act = hidden_act
        # ``hidden_size`` is the dimensionality of the dense bottleneck consumed by the
        # MultiMolecule sequence-prediction head.
        self.hidden_size = dense_size
        if head is None:
            head = HeadConfig(problem_type="regression")
        else:
            head = HeadConfig(head)
            if head.problem_type is None:
                head.problem_type = "regression"
        self.head = head
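
The claim that `same` padding keeps the output length equal to `sequence_length` can be checked with the standard convolution length formula. This is a sketch under stated assumptions: the helper names are hypothetical, stride and dilation are taken to be 1, and the odd padding element for an even kernel is assumed to go on the right.

```python
def same_padding(kernel_size, dilation=1):
    """Total padding needed to preserve length, split with the extra element on the right."""
    total = dilation * (kernel_size - 1)
    left = total // 2
    return left, total - left


def conv_output_length(length, kernel_size, pad_left, pad_right, stride=1):
    """Standard 1D convolution output length (dilation 1)."""
    return (length + pad_left + pad_right - kernel_size) // stride + 1


# Default configuration values: sequence_length=50, conv_kernel_size=8,
# num_conv_layers=3, conv_filters=120.
left, right = same_padding(8)
length = 50
for _ in range(3):
    length = conv_output_length(length, 8, left, right)

flattened = length * 120  # features entering the dense bottleneck
print(left, right)        # 3 4
print(length)             # 50
print(flattened)          # 6000
```

With an even kernel the padding cannot be symmetric, so which side receives the extra element is a convention of the framework rather than something the configuration controls.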

OptMrlForSequencePrediction

Bases: OptMrlPreTrainedModel

OptMRL model with a sequence-level prediction head emitting the mean ribosome load (MRL) scalar.

Examples:

Python Console Session
>>> import torch
>>> from multimolecule import OptMrlConfig, OptMrlForSequencePrediction
>>> config = OptMrlConfig()
>>> model = OptMrlForSequencePrediction(config)
>>> output = model(
...     torch.randint(config.vocab_size, (1, config.sequence_length)),
...     labels=torch.tensor([[1.0]]),
... )
>>> output["logits"].shape
torch.Size([1, 1])
>>> output["loss"]
tensor(..., grad_fn=<MseLossBackward0>)
Source code in multimolecule/models/optmrl/modeling_optmrl.py
Python
class OptMrlForSequencePrediction(OptMrlPreTrainedModel):
    """
    OptMRL model with a sequence-level prediction head emitting the mean ribosome load (MRL) scalar.

    Examples:
        >>> import torch
        >>> from multimolecule import OptMrlConfig, OptMrlForSequencePrediction
        >>> config = OptMrlConfig()
        >>> model = OptMrlForSequencePrediction(config)
        >>> output = model(
        ...     torch.randint(config.vocab_size, (1, config.sequence_length)),
        ...     labels=torch.tensor([[1.0]]),
        ... )
        >>> output["logits"].shape
        torch.Size([1, 1])
        >>> output["loss"]  # doctest:+ELLIPSIS
        tensor(..., grad_fn=<MseLossBackward0>)
    """

    def __init__(self, config: OptMrlConfig):
        super().__init__(config)
        self.model = OptMrlModel(config)
        self.sequence_head = SequencePredictionHead(config, config.head)
        self.head_config = self.sequence_head.config
        # Initialize weights and apply final processing
        self.post_init()

    @property
    def output_channels(self) -> list[str]:
        num_labels = int(self.sequence_head.num_labels)
        if num_labels != 1:
            return [f"mean_ribosome_load_{index}" for index in range(num_labels)]
        return ["mean_ribosome_load"]

    @can_return_tuple
    def forward(
        self,
        input_ids: Tensor | NestedTensor | None = None,
        attention_mask: Tensor | None = None,
        inputs_embeds: Tensor | NestedTensor | None = None,
        labels: Tensor | None = None,
        **kwargs: Unpack[TransformersKwargs],
    ) -> Tuple[Tensor, ...] | SequencePredictorOutput:
        outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
            inputs_embeds=inputs_embeds,
            return_dict=True,
            **kwargs,
        )

        output = self.sequence_head(outputs, labels)

        return SequencePredictorOutput(loss=output.loss, logits=output.logits)
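
The `MseLossBackward0` in the docstring output indicates that the regression head is trained with mean squared error. The plain-Python sketch below shows that computation on made-up numbers. `mse_loss` is an illustrative helper, not the head's actual code path.

```python
def mse_loss(predictions, targets):
    """Mean squared error averaged over every element of a (batch, num_labels) nested list."""
    flat_pred = [p for row in predictions for p in row]
    flat_true = [t for row in targets for t in row]
    if len(flat_pred) != len(flat_true):
        raise ValueError("predictions and targets must have the same number of elements")
    return sum((p - t) ** 2 for p, t in zip(flat_pred, flat_true)) / len(flat_pred)


logits = [[0.5], [2.0]]  # made-up model outputs, shape (batch, 1)
labels = [[1.0], [1.0]]  # made-up MRL labels, shape (batch, 1)
loss = mse_loss(logits, labels)
print(loss)  # 0.625
```

Labels are given the same `(batch, 1)` shape as the logits, matching the `labels=torch.tensor([[1.0]])` call in the docstring example.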

OptMrlModel

Bases: OptMrlPreTrainedModel

The bare OptMRL model outputting the dense bottleneck representation used by the regression head.

Examples:

Python Console Session
>>> import torch
>>> from multimolecule import OptMrlConfig, OptMrlModel
>>> config = OptMrlConfig()
>>> model = OptMrlModel(config)
>>> output = model(torch.randint(config.vocab_size, (1, config.sequence_length)))
>>> output["pooler_output"].shape
torch.Size([1, 40])
Source code in multimolecule/models/optmrl/modeling_optmrl.py
Python
class OptMrlModel(OptMrlPreTrainedModel):
    """
    The bare OptMRL model outputting the dense bottleneck representation used by the regression head.

    Examples:
        >>> import torch
        >>> from multimolecule import OptMrlConfig, OptMrlModel
        >>> config = OptMrlConfig()
        >>> model = OptMrlModel(config)
        >>> output = model(torch.randint(config.vocab_size, (1, config.sequence_length)))
        >>> output["pooler_output"].shape
        torch.Size([1, 40])
    """

    def __init__(self, config: OptMrlConfig):
        super().__init__(config)
        self.embeddings = OptMrlEmbedding(config)
        self.encoder = OptMrlEncoder(config)
        # Initialize weights and apply final processing
        self.post_init()

    @merge_with_config_defaults
    @capture_outputs
    def forward(
        self,
        input_ids: Tensor | NestedTensor | None = None,
        attention_mask: Tensor | None = None,
        inputs_embeds: Tensor | NestedTensor | None = None,
        **kwargs: Unpack[TransformersKwargs],
    ) -> OptMrlModelOutput:
        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
        if input_ids is None and inputs_embeds is None:
            raise ValueError("You have to specify either input_ids or inputs_embeds")

        if isinstance(input_ids, NestedTensor):
            if attention_mask is None:
                attention_mask = input_ids.mask
            input_ids = input_ids.tensor
        if isinstance(inputs_embeds, NestedTensor):
            if attention_mask is None:
                attention_mask = inputs_embeds.mask
            inputs_embeds = inputs_embeds.tensor

        embedding_output = self.embeddings(
            input_ids=input_ids,
            attention_mask=attention_mask,
            inputs_embeds=inputs_embeds,
        )

        pooled_output = self.encoder(embedding_output)

        return OptMrlModelOutput(pooler_output=pooled_output)

OptMrlModelOutput dataclass

Bases: ModelOutput

Base class for outputs of the OptMRL model.

Parameters:

Name Type Description Default

pooler_output

`torch.FloatTensor` of shape `(batch_size, dense_size)`

The dense bottleneck representation consumed by the MultiMolecule sequence-prediction head.

None
Source code in multimolecule/models/optmrl/modeling_optmrl.py
Python
@dataclass
class OptMrlModelOutput(ModelOutput):
    """
    Base class for outputs of the OptMRL model.

    Args:
        pooler_output (`torch.FloatTensor` of shape `(batch_size, dense_size)`):
            The dense bottleneck representation consumed by the MultiMolecule sequence-prediction head.
    """

    pooler_output: torch.FloatTensor | None = None

OptMrlPreTrainedModel

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

Source code in multimolecule/models/optmrl/modeling_optmrl.py
Python
class OptMrlPreTrainedModel(PreTrainedModel):
    """
    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
    models.
    """

    config_class = OptMrlConfig
    base_model_prefix = "model"
    _can_record_outputs: dict[str, Any] | None = None
    _no_split_modules = ["OptMrlEncoder"]

    @torch.no_grad()
    def _init_weights(self, module):
        super()._init_weights(module)
        # Use transformers.initialization wrappers (imported as `init`); they check the
        # `_is_hf_initialized` flag so they don't clobber tensors loaded from a checkpoint.
        if isinstance(module, (nn.Conv1d, nn.Linear)):
            init.kaiming_uniform_(module.weight, a=math.sqrt(5))
            if module.bias is not None:
                fan_in, _ = nn.init._calculate_fan_in_and_fan_out(module.weight)
                bound = 1 / math.sqrt(fan_in) if fan_in > 0 else 0
                init.uniform_(module.bias, -bound, bound)
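
With `a=math.sqrt(5)`, the Kaiming-uniform weight bound simplifies to `1 / sqrt(fan_in)`, which is the same bound used for the bias above. The sketch below works through that arithmetic using the usual PyTorch definition (leaky-ReLU gain, `fan_in` mode). It assumes the `init` wrappers follow the same formula, and the layer names and fan-in values are derived from the default configuration rather than read from the modules.

```python
import math


def kaiming_uniform_bound(fan_in, a=math.sqrt(5)):
    """Bound of U(-b, b) for Kaiming-uniform init with leaky-ReLU slope `a`."""
    gain = math.sqrt(2.0 / (1.0 + a ** 2))  # sqrt(2 / 6) when a = sqrt(5)
    std = gain / math.sqrt(fan_in)
    return math.sqrt(3.0) * std             # equals 1 / sqrt(fan_in) for a = sqrt(5)


def bias_bound(fan_in):
    """Same expression as the bias branch of `_init_weights`."""
    return 1 / math.sqrt(fan_in) if fan_in > 0 else 0


# Assumed fan-in per layer: in_channels * kernel_size for convolutions,
# in_features for the dense layer.
fan_ins = {
    "conv1": 5 * 8,
    "conv2": 120 * 8,
    "conv3": 120 * 8,
    "dense": 50 * 120,
}
for name, fan_in in fan_ins.items():
    print(name, round(kaiming_uniform_bound(fan_in), 4), round(bias_bound(fan_in), 4))
```

Under this formula, weights and biases of each layer are drawn from the same interval, and that interval shrinks as the fan-in grows.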