Xpresso¶

Disclaimer¶

This is an UNOFFICIAL implementation of Predicting mRNA Abundance Directly from Genomic Sequence Using Deep Convolutional Neural Networks by Vikram Agarwal et al.

The OFFICIAL repository of Xpresso is at vagarwal87/Xpresso.

Tip

The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

The team that released Xpresso did not write a model card for this model, so this model card has been written by the MultiMolecule team.

Model Details¶

Xpresso is a deep convolutional neural network (CNN) that predicts steady-state mRNA expression level directly from genomic sequence. It consumes a promoter window of roughly 10.5 kb centered on the transcription start site (TSS), processes it through a stack of 1D convolution + max-pooling blocks, flattens the result, concatenates a small set of auxiliary numeric mRNA half-life features, and passes the combined representation through fully-connected layers to predict a single scalar expression value. Please refer to the Training Details section for more information on the training process.

Model Specification¶

Input Length	Conv Blocks	Hidden Size	Auxiliary Features	Num Parameters (M)	FLOPs (G)	MACs (G)	Max Num Tokens
10,500	2	2	6	0.11	0.11	0.05	10,500
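
The parameter count in the table can be roughly reproduced from the default `XpressoConfig` values documented further down this page. The following is a back-of-the-envelope sketch, not the library's own accounting: it assumes "same" convolution padding, floor-mode max pooling, four input channels (A, C, G, T), and a final one-unit regression layer, none of which is confirmed by this page.

```python
# Shape and parameter arithmetic for the default configuration.
# Assumptions: "same" padding, floor-mode pooling, 4 input channels,
# and a 1-unit output layer after the fully-connected stack.
input_length = 10_500
in_channels = 4
conv_channels = [128, 32]
conv_kernel_sizes = [6, 9]
pool_sizes = [30, 10]
num_features = 6
fc_dims = [64, 2]

length, channels, num_params = input_length, in_channels, 0
for out_channels, kernel, pool in zip(conv_channels, conv_kernel_sizes, pool_sizes):
    num_params += channels * kernel * out_channels + out_channels  # weights + biases
    length //= pool  # "same" padding keeps the length; pooling divides it
    channels = out_channels

flattened = length * channels        # size of the flattened convolutional output
width = flattened + num_features     # half-life features are concatenated here
for dim in fc_dims + [1]:            # two hidden layers, then the scalar output
    num_params += width * dim + dim
    width = dim

print(flattened)   # 1120
print(num_params)  # 112357
```

Under these assumptions the total is about 0.11 M parameters, in line with the table. Most of them sit in the first fully-connected layer, which reads the flattened convolutional output.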

Links¶

Code: multimolecule.xpresso
Weights: multimolecule/xpresso
Data: Roadmap Epigenomics gene-expression data with promoter sequence and mRNA half-life features
Paper: Predicting mRNA Abundance Directly from Genomic Sequence Using Deep Convolutional Neural Networks
Developed by: Vikram Agarwal, Jay Shendure
Model type: 1D CNN over promoter DNA combined with auxiliary mRNA half-life features for mRNA-abundance regression
Original Repository: vagarwal87/Xpresso

Usage¶

The model file depends on the multimolecule library. You can install it using pip:

Bash
pip install multimolecule

Direct Use¶

mRNA Expression Prediction¶

You can use this model directly to predict the mRNA expression of a promoter sequence together with its auxiliary mRNA half-life features:

Python
>>> import torch
>>> from multimolecule import DnaTokenizer, XpressoForSequencePrediction

>>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/xpresso")
>>> model = XpressoForSequencePrediction.from_pretrained("multimolecule/xpresso")
>>> input = tokenizer("ACGTACGTACGTACGT", return_tensors="pt")
>>> features = torch.randn(1, model.config.num_features)
>>> output = model(**input, features=features)

>>> output.logits.shape
torch.Size([1, 1])

The auxiliary half-life features are passed through the features argument as a float tensor of shape (batch_size, num_features). Models configured with a non-zero num_features require this tensor; models configured with num_features=0 do not accept it.

Interface¶

Input length: fixed 10,500 bp promoter window centered on the TSS
Padding: shorter inputs right-padded; longer inputs center-cropped to input_length
Auxiliary inputs: features tensor of shape (batch_size, num_features) required when num_features > 0; not accepted when num_features = 0
Output: scalar mRNA expression
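
The padding and cropping rule above can be sketched in a few lines of plain Python. `fit_to_window` is a hypothetical helper written for illustration, not a function of the library, and the way an odd surplus is split between the two ends is an assumption. The pad id of 4 follows the `XpressoConfig` default of `pad_token_id = vocab_size - 1`.

```python
def fit_to_window(token_ids, input_length, pad_id):
    """Right-pad short inputs and center-crop long ones to `input_length`."""
    n = len(token_ids)
    if n < input_length:
        # Shorter than the window: append padding on the right.
        return list(token_ids) + [pad_id] * (input_length - n)
    # Longer than (or equal to) the window: keep the central slice.
    start = (n - input_length) // 2  # assumed rounding for odd surpluses
    return list(token_ids[start : start + input_length])


padded = fit_to_window([0, 1, 2], input_length=5, pad_id=4)
cropped = fit_to_window([0, 1, 2, 3, 0, 1, 2], input_length=3, pad_id=4)
print(padded)   # [0, 1, 2, 4, 4]
print(cropped)  # [2, 3, 0]
```

Because short inputs are padded on one side only, a sequence that is much shorter than 10,500 bp (such as the 16-base example above) no longer has its TSS at the center of the window, so such inputs are best treated as smoke tests rather than meaningful predictions.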

Training Details¶

Xpresso was trained to predict steady-state mRNA expression levels (median across tissues/cell lines) from genomic promoter sequence.

Training Data¶

Xpresso was trained on human and mouse genes, using promoter sequences (~10.5 kb windows centered on the TSS) together with mRNA half-life features derived from gene-body and UTR properties. Expression targets are log-transformed median mRNA levels across tissues.

The default Xpresso model is the published humanMedian model. Other published variants (K562, GM12878, mESC, mouseMedian) share the same architecture but are not exposed as separate default model variants.

Training Procedure¶

Pre-training¶

The model was trained to minimize a mean-squared-error loss between predicted and observed log mRNA expression values.

Optimizer: Adam
Loss: Mean squared error
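
For reference, the mean-squared-error objective can be written out directly. The numbers below are made up purely for illustration and are not taken from the paper or the training data.

```python
def mean_squared_error(predicted, observed):
    """Average squared difference between predicted and observed values."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted)


# Illustrative log-expression values for three genes (hypothetical).
predicted = [0.5, 1.0, -0.25]
observed = [0.0, 1.0, 0.25]
loss = mean_squared_error(predicted, observed)
print(loss)
```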

Citation¶

BibTeX
@article{agarwal2020predicting,
  author    = {Agarwal, Vikram and Shendure, Jay},
  journal   = {Cell Reports},
  number    = 7,
  pages     = {107663},
  publisher = {Elsevier BV},
  title     = {Predicting mRNA Abundance Directly from Genomic Sequence Using Deep Convolutional Neural Networks},
  volume    = 31,
  year      = 2020
}

Note

The artifacts distributed in this repository are part of the MultiMolecule project. If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:

BibTeX
@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}

Contact¶

Please use GitHub issues of MultiMolecule for any questions or comments on the model card.

Please contact the authors of the Xpresso paper for questions or comments on the paper/model.

License¶

This model implementation is licensed under the GNU Affero General Public License.

For additional terms and clarifications, please refer to our License FAQ.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later

multimolecule.models.xpresso ¶

DnaTokenizer ¶

Bases: Tokenizer

Tokenizer for DNA sequences.

Parameters:

Name	Type	Description	Default
`alphabet` ¶	`Alphabet \| str \| List[str] \| None`	Alphabet to use for tokenization. If it is `None`, the standard DNA alphabet will be used. If it is a `string`, it should correspond to the name of a predefined alphabet; the options are `standard`, `iupac`, `streamline`, and `nucleobase`. If it is an alphabet or a list of characters, that specific alphabet will be used.	`None`
`nmers` ¶	`int`	Size of kmer to tokenize.	`1`
`codon` ¶	`bool`	Whether to tokenize into codons.	`False`
`replace_U_with_T` ¶	`bool`	Whether to replace U with T.	`True`
`do_upper_case` ¶	`bool`	Whether to convert input to uppercase.	`True`

Examples:

Python Console Session
>>> from multimolecule import DnaTokenizer
>>> tokenizer = DnaTokenizer()
>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGTNRYSWKMBDHVX|.*-?')["input_ids"]
[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2]
>>> tokenizer('acgt')["input_ids"]
[1, 6, 7, 8, 9, 2]
>>> tokenizer('acgu')["input_ids"]
[1, 6, 7, 8, 9, 2]
>>> tokenizer = DnaTokenizer(replace_U_with_T=False)
>>> tokenizer('acgu')["input_ids"]
[1, 6, 7, 8, 3, 2]
>>> tokenizer = DnaTokenizer(nmers=3)
>>> tokenizer('tataaagta')["input_ids"]
[1, 84, 21, 81, 6, 8, 19, 71, 2]
>>> tokenizer = DnaTokenizer(codon=True)
>>> tokenizer('tataaagta')["input_ids"]
[1, 84, 6, 71, 2]
>>> tokenizer('tataaagtaa')["input_ids"]
Traceback (most recent call last):
ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10
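
The difference between the `nmers` and `codon` modes is easier to see on the token strings than on the ids. The sketch below mirrors the splitting logic of `_tokenize` shown in the source listing; `split_tokens` is a hypothetical standalone helper, and it does not add special tokens or map tokens to ids.

```python
def split_tokens(sequence, nmers=1, codon=False):
    """Split a DNA string into tokens: overlapping k-mers or non-overlapping codons."""
    text = sequence.upper().replace("U", "T")
    if codon:
        if len(text) % 3 != 0:
            raise ValueError(f"length must be a multiple of 3, but got {len(text)}")
        # Codons: stride 3, no overlap.
        return [text[i : i + 3] for i in range(0, len(text), 3)]
    if nmers > 1:
        # k-mers: stride 1, so consecutive tokens overlap by nmers - 1 bases.
        return [text[i : i + nmers] for i in range(len(text) - nmers + 1)]
    return list(text)


kmers = split_tokens("tataaagta", nmers=3)
codons = split_tokens("tataaagta", codon=True)
print(kmers)   # ['TAT', 'ATA', 'TAA', 'AAA', 'AAG', 'AGT', 'GTA']
print(codons)  # ['TAT', 'AAA', 'GTA']
```

This matches the id sequences in the examples above: seven ids for the 3-mer tokenizer and three for the codon tokenizer, each wrapped by the start and end special tokens.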

Source code in multimolecule/tokenisers/dna/tokenization_dna.py

Python
class DnaTokenizer(Tokenizer):
    """
    Tokenizer for DNA sequences.

    Args:
        alphabet: alphabet to use for tokenization.

            - If is `None`, the standard DNA alphabet will be used.
            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include
                + `standard`
                + `iupac`
                + `streamline`
                + `nucleobase`
            - If is an alphabet or a list of characters, that specific alphabet will be used.
        nmers: Size of kmer to tokenize.
        codon: Whether to tokenize into codons.
        replace_U_with_T: Whether to replace U with T.
        do_upper_case: Whether to convert input to uppercase.

    Examples:
        >>> from multimolecule import DnaTokenizer
        >>> tokenizer = DnaTokenizer()
        >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGTNRYSWKMBDHVX|.*-?')["input_ids"]
        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2]
        >>> tokenizer('acgt')["input_ids"]
        [1, 6, 7, 8, 9, 2]
        >>> tokenizer('acgu')["input_ids"]
        [1, 6, 7, 8, 9, 2]
        >>> tokenizer = DnaTokenizer(replace_U_with_T=False)
        >>> tokenizer('acgu')["input_ids"]
        [1, 6, 7, 8, 3, 2]
        >>> tokenizer = DnaTokenizer(nmers=3)
        >>> tokenizer('tataaagta')["input_ids"]
        [1, 84, 21, 81, 6, 8, 19, 71, 2]
        >>> tokenizer = DnaTokenizer(codon=True)
        >>> tokenizer('tataaagta')["input_ids"]
        [1, 84, 6, 71, 2]
        >>> tokenizer('tataaagtaa')["input_ids"]
        Traceback (most recent call last):
        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10
    """

    model_input_names = ["input_ids", "attention_mask"]

    def __init__(
        self,
        alphabet: Alphabet | str | List[str] | None = None,
        nmers: int = 1,
        codon: bool = False,
        replace_U_with_T: bool = True,
        do_upper_case: bool = True,
        additional_special_tokens: List | Tuple | None = None,
        **kwargs,
    ):
        if codon and (nmers > 1 and nmers != 3):
            raise ValueError("Codon and nmers cannot be used together.")
        if codon:
            nmers = 3  # set to 3 to get correct vocab
        if not isinstance(alphabet, Alphabet):
            alphabet = get_alphabet(alphabet, nmers=nmers)
        super().__init__(
            alphabet=alphabet,
            nmers=nmers,
            codon=codon,
            replace_U_with_T=replace_U_with_T,
            do_upper_case=do_upper_case,
            additional_special_tokens=additional_special_tokens,
            **kwargs,
        )
        self.replace_U_with_T = replace_U_with_T
        self.nmers = nmers
        self.codon = codon

    def _tokenize(self, text: str, **kwargs):
        if self.do_upper_case:
            text = text.upper()
        if self.replace_U_with_T:
            text = text.replace("U", "T")
        if self.codon:
            if len(text) % 3 != 0:
                raise ValueError(
                    f"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}"
                )
            return [text[i : i + 3] for i in range(0, len(text), 3)]
        if self.nmers > 1:
            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203
        return list(text)

XpressoConfig ¶

Bases: PreTrainedConfig

This is the configuration class that stores the configuration of an XpressoModel. It is used to instantiate an Xpresso model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults yields a configuration similar to that of the vagarwal87/Xpresso architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Parameters:

Name	Type	Description	Default
`vocab_size` ¶	`int`	Vocabulary size of the Xpresso model. Defines the number of feature channels derived from `input_ids` for the first convolution. Defaults to 5.	`5`
`input_length` ¶	`int`	The length of the promoter sequence window (centered on the TSS) consumed by the convolutional stack.	`10500`
`num_conv_layers` ¶	`int`	Number of convolutional blocks in the encoder.	`2`
`conv_channels` ¶	`list[int] \| None`	Number of output channels for each convolutional block. Length must equal `num_conv_layers`.	`None`
`conv_kernel_sizes` ¶	`list[int] \| None`	Convolution kernel size for each convolutional block. Length must equal `num_conv_layers`.	`None`
`conv_dilations` ¶	`list[int] \| None`	Dilation factor for each convolutional block. Length must equal `num_conv_layers`.	`None`
`pool_sizes` ¶	`list[int] \| None`	Max-pooling window for each convolutional block. Length must equal `num_conv_layers`.	`None`
`num_features` ¶	`int`	Number of auxiliary numeric mRNA half-life features concatenated with the convolutional representation before the fully-connected head.	`6`
`fc_dims` ¶	`list[int] \| None`	Dimensionality of each fully-connected layer in the head.	`None`
`hidden_act` ¶	`str`	The non-linear activation function (function or string) in the encoder and the head. If string, `"gelu"`, `"relu"`, `"silu"` and `"gelu_new"` are supported.	`'relu'`
`hidden_dropout` ¶	`float`	The dropout probability applied after each fully-connected layer.	`0.00099`
`num_labels` ¶	`int`	Number of output labels. Xpresso predicts a single scalar mRNA expression value.	`1`
`head` ¶	`HeadConfig \| None`	The configuration of the prediction head. Defaults to a regression head (`problem_type="regression"`), matching Xpresso’s mRNA abundance prediction task.	`None`

Examples:

Python Console Session
>>> from multimolecule import XpressoConfig, XpressoModel
>>> # Initializing a Xpresso multimolecule/xpresso style configuration
>>> configuration = XpressoConfig()
>>> # Initializing a model (with random weights) from the multimolecule/xpresso style configuration
>>> model = XpressoModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config

Source code in multimolecule/models/xpresso/configuration_xpresso.py

Python
class XpressoConfig(PreTrainedConfig):
    r"""
    This is the configuration class to store the configuration of a
    [`XpressoModel`][multimolecule.models.XpressoModel]. It is used to instantiate a Xpresso model according to the
    specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a
    similar configuration to that of the Xpresso
    [vagarwal87/Xpresso](https://github.com/vagarwal87/Xpresso) architecture.

    Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to
    control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]
    for more information.

    Args:
        vocab_size:
            Vocabulary size of the Xpresso model. Defines the number of feature channels derived from `input_ids` for
            the first convolution. Defaults to 5.
        input_length:
            The length of the promoter sequence window (centered on the TSS) consumed by the convolutional stack.
        num_conv_layers:
            Number of convolutional blocks in the encoder.
        conv_channels:
            Number of output channels for each convolutional block. Length must equal `num_conv_layers`.
        conv_kernel_sizes:
            Convolution kernel size for each convolutional block. Length must equal `num_conv_layers`.
        conv_dilations:
            Dilation factor for each convolutional block. Length must equal `num_conv_layers`.
        pool_sizes:
            Max-pooling window for each convolutional block. Length must equal `num_conv_layers`.
        num_features:
            Number of auxiliary numeric mRNA half-life features concatenated with the convolutional representation
            before the fully-connected head.
        fc_dims:
            Dimensionality of each fully-connected layer in the head.
        hidden_act:
            The non-linear activation function (function or string) in the encoder and the head. If string, `"gelu"`,
            `"relu"`, `"silu"` and `"gelu_new"` are supported.
        hidden_dropout:
            The dropout probability applied after each fully-connected layer.
        num_labels:
            Number of output labels. Xpresso predicts a single scalar mRNA expression value.
        head:
            The configuration of the prediction head. Defaults to a regression head
            (`problem_type="regression"`), matching Xpresso's mRNA abundance prediction task.

    Examples:
        >>> from multimolecule import XpressoConfig, XpressoModel
        >>> # Initializing a Xpresso multimolecule/xpresso style configuration
        >>> configuration = XpressoConfig()
        >>> # Initializing a model (with random weights) from the multimolecule/xpresso style configuration
        >>> model = XpressoModel(configuration)
        >>> # Accessing the model configuration
        >>> configuration = model.config
    """

    model_type = "xpresso"

    def __init__(
        self,
        vocab_size: int = 5,
        input_length: int = 10500,
        num_conv_layers: int = 2,
        conv_channels: list[int] | None = None,
        conv_kernel_sizes: list[int] | None = None,
        conv_dilations: list[int] | None = None,
        pool_sizes: list[int] | None = None,
        num_features: int = 6,
        fc_dims: list[int] | None = None,
        hidden_act: str = "relu",
        hidden_dropout: float = 0.00099,
        num_labels: int = 1,
        head: HeadConfig | None = None,
        **kwargs,
    ):
        kwargs.setdefault("pad_token_id", vocab_size - 1)
        kwargs.setdefault("unk_token_id", vocab_size - 1)
        kwargs.setdefault("bos_token_id", None)
        kwargs.setdefault("eos_token_id", None)
        kwargs.setdefault("mask_token_id", None)
        kwargs.setdefault("null_token_id", None)
        super().__init__(num_labels=num_labels, **kwargs)
        self.vocab_size = vocab_size
        self.input_length = input_length
        self.num_conv_layers = num_conv_layers
        if conv_channels is None:
            conv_channels = [128, 32]
        if conv_kernel_sizes is None:
            conv_kernel_sizes = [6, 9]
        if conv_dilations is None:
            conv_dilations = [1, 1]
        if pool_sizes is None:
            pool_sizes = [30, 10]
        if fc_dims is None:
            fc_dims = [64, 2]
        self.conv_channels = conv_channels
        self.conv_kernel_sizes = conv_kernel_sizes
        self.conv_dilations = conv_dilations
        self.pool_sizes = pool_sizes
        self.num_features = num_features
        self.fc_dims = fc_dims
        self.hidden_act = hidden_act
        self.hidden_dropout = hidden_dropout
        self.num_labels = num_labels
        # `hidden_size` is the dimensionality of the pooled representation consumed by
        # `SequencePredictionHead`; it equals the width of the last fully-connected layer.
        self.hidden_size = self.fc_dims[-1]
        if head is None:
            head = HeadConfig(problem_type="regression")
        else:
            head = HeadConfig(head)
            if head.problem_type is None:
                head.problem_type = "regression"
        self.head = head
        self._validate()

    def _validate(self) -> None:
        per_layer = {
            "conv_channels": self.conv_channels,
            "conv_kernel_sizes": self.conv_kernel_sizes,
            "conv_dilations": self.conv_dilations,
            "pool_sizes": self.pool_sizes,
        }
        for name, value in per_layer.items():
            if len(value) != self.num_conv_layers:
                raise ValueError(
                    f"`{name}` must have length `num_conv_layers` ({self.num_conv_layers}), got {len(value)}."
                )
        if self.input_length <= 0:
            raise ValueError(f"`input_length` must be positive, got {self.input_length}.")
        if self.num_features < 0:
            raise ValueError(f"`num_features` must be non-negative, got {self.num_features}.")
        if not self.fc_dims:
            raise ValueError("`fc_dims` must contain at least one fully-connected dimension.")

XpressoForSequencePrediction ¶

Bases: XpressoPreTrainedModel

Examples:

Python Console Session
>>> import torch
>>> from multimolecule import XpressoConfig, XpressoForSequencePrediction, DnaTokenizer
>>> config = XpressoConfig()
>>> model = XpressoForSequencePrediction(config)
>>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/xpresso")
>>> input = tokenizer(["ACGTACGTACGT", "TGCATGCATGCA"], return_tensors="pt")
>>> features = torch.randn(2, config.num_features)
>>> output = model(**input, features=features, labels=torch.randn(2, 1))
>>> output["logits"].shape
torch.Size([2, 1])

Source code in multimolecule/models/xpresso/modeling_xpresso.py

Python
class XpressoForSequencePrediction(XpressoPreTrainedModel):
    """
    Examples:
        >>> import torch
        >>> from multimolecule import XpressoConfig, XpressoForSequencePrediction, DnaTokenizer
        >>> config = XpressoConfig()
        >>> model = XpressoForSequencePrediction(config)
        >>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/xpresso")
        >>> input = tokenizer(["ACGTACGTACGT", "TGCATGCATGCA"], return_tensors="pt")
        >>> features = torch.randn(2, config.num_features)
        >>> output = model(**input, features=features, labels=torch.randn(2, 1))
        >>> output["logits"].shape
        torch.Size([2, 1])
    """

    def __init__(self, config: XpressoConfig):
        super().__init__(config)
        self.model = XpressoModel(config)
        self.sequence_head = SequencePredictionHead(config)
        self.head_config = self.sequence_head.config
        # Initialize weights and apply final processing
        self.post_init()

    @property
    def output_channels(self) -> list[str]:
        if self.config.num_labels == 1:
            return ["expression"]
        return [f"expression_{index}" for index in range(self.config.num_labels)]

    @can_return_tuple
    def forward(
        self,
        input_ids: Tensor | NestedTensor | None = None,
        attention_mask: Tensor | None = None,
        inputs_embeds: Tensor | NestedTensor | None = None,
        features: Tensor | None = None,
        labels: Tensor | None = None,
        **kwargs: Unpack[TransformersKwargs],
    ) -> Tuple[Tensor, ...] | SequencePredictorOutput:
        outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
            inputs_embeds=inputs_embeds,
            features=features,
            return_dict=True,
            **kwargs,
        )

        output = self.sequence_head(outputs, labels)
        logits, loss = output.logits, output.loss

        return SequencePredictorOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

XpressoModel ¶

Bases: XpressoPreTrainedModel

Examples:

Python Console Session
>>> import torch
>>> from multimolecule import XpressoConfig, XpressoModel, DnaTokenizer
>>> config = XpressoConfig()
>>> model = XpressoModel(config)
>>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/xpresso")
>>> input = tokenizer(["ACGTACGTACGT", "TGCATGCATGCA"], return_tensors="pt")
>>> features = torch.randn(2, config.num_features)
>>> output = model(**input, features=features)
>>> output["pooler_output"].shape
torch.Size([2, 2])

Source code in multimolecule/models/xpresso/modeling_xpresso.py

Python
class XpressoModel(XpressoPreTrainedModel):
    """
    Examples:
        >>> import torch
        >>> from multimolecule import XpressoConfig, XpressoModel, DnaTokenizer
        >>> config = XpressoConfig()
        >>> model = XpressoModel(config)
        >>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/xpresso")
        >>> input = tokenizer(["ACGTACGTACGT", "TGCATGCATGCA"], return_tensors="pt")
        >>> features = torch.randn(2, config.num_features)
        >>> output = model(**input, features=features)
        >>> output["pooler_output"].shape
        torch.Size([2, 2])
    """

    def __init__(self, config: XpressoConfig):
        super().__init__(config)
        self.gradient_checkpointing = False
        self.embeddings = XpressoEmbedding(config)
        self.encoder = XpressoEncoder(config)
        self.head = XpressoHead(config)
        # Initialize weights and apply final processing
        self.post_init()

    # Xpresso's `last_hidden_state` is the *flattened* convolutional representation, not a
    # per-position layer output, so it must not be tied into the recorded `hidden_states` tuple.
    @merge_with_config_defaults
    @capture_outputs(tie_last_hidden_states=False)
    def forward(
        self,
        input_ids: Tensor | NestedTensor | None = None,
        attention_mask: Tensor | None = None,
        inputs_embeds: Tensor | NestedTensor | None = None,
        features: Tensor | None = None,
        **kwargs: Unpack[TransformersKwargs],
    ) -> XpressoModelOutput:
        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
        if input_ids is None and inputs_embeds is None:
            raise ValueError("You have to specify either input_ids or inputs_embeds")

        output_hidden_states = kwargs.get("output_hidden_states", self.config.output_hidden_states)

        if isinstance(input_ids, NestedTensor):
            if attention_mask is None:
                attention_mask = input_ids.mask
            input_ids = input_ids.tensor
        if isinstance(inputs_embeds, NestedTensor):
            if attention_mask is None:
                attention_mask = inputs_embeds.mask
            inputs_embeds = inputs_embeds.tensor
        if input_ids is not None:
            batch_size = input_ids.size(0)
        else:
            if inputs_embeds is None:
                raise ValueError("You have to specify either input_ids or inputs_embeds")
            batch_size = inputs_embeds.size(0)
        self._validate_features(features, batch_size)

        embedding_output = self.embeddings(
            input_ids=input_ids,
            attention_mask=attention_mask,
            inputs_embeds=inputs_embeds,
        )
        encoder_outputs = self.encoder(embedding_output, **kwargs)
        conv_output = encoder_outputs.last_hidden_state
        pooler_output = self.head(conv_output, features=features)

        return XpressoModelOutput(
            last_hidden_state=conv_output,
            pooler_output=pooler_output,
            hidden_states=encoder_outputs.hidden_states if output_hidden_states else None,
            attentions=None,
        )

    def _validate_features(self, features: Tensor | None, batch_size: int) -> None:
        if self.config.num_features == 0:
            if features is not None:
                raise ValueError(
                    "This Xpresso model is configured with num_features=0 and does not accept a `features` tensor."
                )
            return
        if features is None:
            raise ValueError(
                f"This Xpresso model is configured with num_features={self.config.num_features}; "
                "you must pass the auxiliary `features` tensor."
            )
        if features.ndim != 2:
            raise ValueError(
                "`features` must be a 2D tensor of shape "
                f"(batch_size, {self.config.num_features}), got shape {tuple(features.shape)}."
            )
        if features.size(0) != batch_size:
            raise ValueError(f"`features` batch size ({features.size(0)}) must match input batch size ({batch_size}).")
        if features.size(1) != self.config.num_features:
            raise ValueError(
                f"`features` last dimension ({features.size(1)}) must equal "
                f"`config.num_features` ({self.config.num_features})."
            )

XpressoModelOutput `dataclass` ¶

Bases: ModelOutput

Base class for outputs of the Xpresso backbone.

Parameters:

Name	Type	Description	Default
`last_hidden_state` ¶	`torch.FloatTensor` of shape `(batch_size, flattened_conv_size)`	Flattened convolutional representation of the promoter sequence.	`None`
`pooler_output` ¶	`torch.FloatTensor` of shape `(batch_size, hidden_size)`	Final fully-connected representation, with the auxiliary mRNA half-life features fused in. This is the tensor consumed by `SequencePredictionHead`.	`None`
`hidden_states` ¶	`tuple(torch.FloatTensor)`, optional	Convolutional feature maps recorded along the encoder stack: one tensor for the embedding output plus one after each convolutional block, each of shape `(batch_size, length, channels)`. Returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`.	`None`
`attentions` ¶	always `None`	Xpresso is a purely convolutional architecture and has no attention; this field is always `None` and is present only for compatibility with the Transformers output convention.	`None`

Source code in multimolecule/models/xpresso/modeling_xpresso.py

Python
@dataclass
class XpressoModelOutput(ModelOutput):
    """
    Base class for outputs of the Xpresso backbone.

    Args:
        last_hidden_state (`torch.FloatTensor` of shape `(batch_size, flattened_conv_size)`):
            Flattened convolutional representation of the promoter sequence.
        pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`):
            Final fully-connected representation, with the auxiliary mRNA half-life features fused in. This is the
            tensor consumed by [`SequencePredictionHead`][multimolecule.modules.SequencePredictionHead].
        hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or
            when `config.output_hidden_states=True`):
            Tuple of `torch.FloatTensor` (one for the embedding output plus one after each convolutional block) of
            shape `(batch_size, length, channels)`. Convolutional feature maps recorded along the encoder stack.
        attentions (always `None`):
            Xpresso is a purely convolutional architecture and has no attention; this field is always `None` and is
            present only for compatibility with the Transformers output convention.
    """

    last_hidden_state: torch.FloatTensor | None = None
    pooler_output: torch.FloatTensor | None = None
    hidden_states: tuple[torch.FloatTensor, ...] | None = None
    attentions: tuple[torch.FloatTensor, ...] | None = None

XpressoPreTrainedModel ¶

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

Source code in multimolecule/models/xpresso/modeling_xpresso.py

Python
class XpressoPreTrainedModel(PreTrainedModel):
    """
    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
    models.
    """

    config_class = XpressoConfig
    base_model_prefix = "model"
    supports_gradient_checkpointing = True
    _can_record_outputs: dict[str, Any] | None = None
    _no_split_modules = ["XpressoBlock"]

    @torch.no_grad()
    def _init_weights(self, module):
        super()._init_weights(module)
        # Use transformers.initialization wrappers (imported as `init`); they check the
        # `_is_hf_initialized` flag so they don't clobber tensors loaded from a checkpoint.
        if isinstance(module, nn.Conv1d):
            init.kaiming_normal_(module.weight, mode="fan_out", nonlinearity="relu")
            if module.bias is not None:
                init.zeros_(module.bias)
        # copied from the `reset_parameters` method of `class Linear(Module)` in `torch`.
        elif isinstance(module, nn.Linear):
            init.kaiming_uniform_(module.weight, a=math.sqrt(5))
            if module.bias is not None:
                fan_in, _ = nn.init._calculate_fan_in_and_fan_out(module.weight)
                bound = 1 / math.sqrt(fan_in) if fan_in > 0 else 0
                init.uniform_(module.bias, -bound, bound)
        elif isinstance(module, (nn.BatchNorm1d, nn.LayerNorm, nn.GroupNorm)):
            init.ones_(module.weight)
            init.zeros_(module.bias)
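
For the `nn.Linear` branch, the choice `a=math.sqrt(5)` makes the weight bound collapse to the same value as the bias bound, `1 / sqrt(fan_in)`. The worked example below uses the standard Kaiming-uniform formula in plain Python, without importing `torch`. The fan-in of 64 corresponds to a layer that reads the first default `fc_dims` entry, assuming the fully-connected layers are stacked sequentially.

```python
import math


def kaiming_uniform_bound(fan_in, a=math.sqrt(5)):
    """Bound of the uniform distribution used by Kaiming-uniform initialization."""
    gain = math.sqrt(2.0 / (1.0 + a**2))  # leaky-ReLU gain with negative slope `a`
    std = gain / math.sqrt(fan_in)
    return math.sqrt(3.0) * std


fan_in = 64
weight_bound = kaiming_uniform_bound(fan_in)
bias_bound = 1 / math.sqrt(fan_in) if fan_in > 0 else 0
print(round(weight_bound, 6), bias_bound)  # 0.125 0.125
```

With `a = sqrt(5)` the gain is `sqrt(1/3)`, which cancels the `sqrt(3)` factor, so weights and biases are both drawn from `U(-1/sqrt(fan_in), 1/sqrt(fan_in))`.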
