Malinois¶

Convolutional neural network for predicting cell-type-targeting cis-regulatory element (CRE) activity from DNA sequence.

Disclaimer¶

This is an UNOFFICIAL implementation of Machine-guided design of cell-type-targeting cis-regulatory elements by Sager J. Gosai, Rodrigo I. Castro, et al.

The OFFICIAL repository of Malinois is at sjgosai/boda2.

Tip

The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

The team that released Malinois did not write a model card for it, so this model card was written by the MultiMolecule team.

Model Details¶

Malinois is a deep convolutional neural network (a tuned Basset-style “branched” architecture) trained to quantitatively predict cell-type-informed CRE activity from ~200 bp DNA sequences measured by a massively parallel reporter assay (MPRA). The model emits three regression outputs, one per human cell line: K562, HepG2 and SK-N-SH (in that order).

The architecture consists of three convolutional blocks, one shared fully-connected block, and a branched grouped-linear tower with an independent parameter set per cell line. Please refer to the Training Details section for more information on the training process.

Model Specification¶

Num Layers	Hidden Size	Num Parameters (M)	FLOPs (M)	MACs (M)	Max Num Tokens
8	420	4.11	332.95	165.70	600

Links¶

Code: multimolecule.malinois
Weights: multimolecule/malinois
Data: MPRA libraries across K562, HepG2, and SK-N-SH human cell lines
Paper: Machine-guided design of cell-type-targeting cis-regulatory elements
Developed by: Sager J. Gosai, Rodrigo I. Castro, Natalia Fuentes, John C. Butts, Kousuke Mouri, Michael Alasoadura, Susan Kales, Thanh Thanh L. Nguyen, Ramil R. Noche, Arya S. Rao, Mary T. Joy, Pardis C. Sabeti, Steven K. Reilly, Ryan Tewhey
Model type: 1D CNN with cell-type-specific grouped-linear output head for MPRA cis-regulatory element activity
Original Repository: sjgosai/boda2

Usage¶

The model file depends on the multimolecule library. You can install it using pip:

Bash
pip install multimolecule

Direct Use¶

CRE Activity Prediction¶

You can use this model directly to predict the cell-type-informed CRE activity (K562, HepG2, SK-N-SH) of a sequence. Malinois pads each ~200 bp candidate to 600 bp with fixed MPRA plasmid flanks before inference; the example below uses a pre-padded 600 bp sequence:

Python
>>> import torch
>>> from multimolecule import DnaTokenizer, MalinoisForSequencePrediction

>>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/malinois")
>>> model = MalinoisForSequencePrediction.from_pretrained("multimolecule/malinois")
>>> sequence = "ACGT" * 150
>>> output = model(**tokenizer(sequence, return_tensors="pt"))

>>> output.logits.shape
torch.Size([1, 3])
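
The three values in `output.logits` follow the cell-line order stated above (K562, HepG2, SK-N-SH). A minimal sketch of pairing one row of predictions with those names, using a plain list in place of the tensor row; the helper name `label_activities` and the numbers are illustrative only and are not part of the library:

```python
CELL_LINES = ("K562", "HepG2", "SK-N-SH")  # output order stated in this model card


def label_activities(values):
    """Pair one row of predictions with cell-line names (hypothetical helper)."""
    if len(values) != len(CELL_LINES):
        raise ValueError(f"expected {len(CELL_LINES)} values, got {len(values)}")
    return dict(zip(CELL_LINES, values))


# Made-up numbers standing in for one row of output.logits converted to a list
example = label_activities([1.25, -0.5, 0.0])
print(example)
```

Keeping the names in one tuple avoids hard-coding column indices wherever predictions are read.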

Interface¶

Input length: fixed 600 bp window
Padding: each ~200 bp candidate CRE is centered and padded with fixed MPRA plasmid flanks (MPRA_UPSTREAM / MPRA_DOWNSTREAM); flank padding is part of the data pipeline, not the model
Output: 3 cell-line CRE activity values (K562, HepG2, SK-N-SH)
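
Because flank padding happens in the data pipeline, callers have to build the 600 bp window themselves. The sketch below shows one way to center a candidate between two flanks. The function name, the choice to take the flank bases nearest the insert, and the placeholder flank strings are all assumptions; the real `MPRA_UPSTREAM` / `MPRA_DOWNSTREAM` sequences are not reproduced here and should be taken from the original repository:

```python
def pad_with_flanks(candidate, upstream, downstream, input_length=600):
    """Center `candidate` between flanks so the result is `input_length` long."""
    missing = input_length - len(candidate)
    if missing < 0:
        raise ValueError("candidate is longer than the input window")
    left = missing // 2
    right = missing - left
    if left > len(upstream) or right > len(downstream):
        raise ValueError("flanks are too short to fill the window")
    # Assumption: use the bases adjacent to the insert, i.e. the end of the
    # upstream flank and the start of the downstream flank.
    left_flank = upstream[len(upstream) - left:]
    right_flank = downstream[:right]
    return left_flank + candidate + right_flank


# Placeholder flanks: NOT the real MPRA_UPSTREAM / MPRA_DOWNSTREAM sequences.
PLACEHOLDER_UPSTREAM = "AC" * 100
PLACEHOLDER_DOWNSTREAM = "GT" * 100

candidate = "ACGT" * 50  # a 200 bp candidate
padded = pad_with_flanks(candidate, PLACEHOLDER_UPSTREAM, PLACEHOLDER_DOWNSTREAM)
print(len(padded))  # 600
```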

Training Details¶

Malinois was trained to predict quantitative, cell-type-informed CRE activity from DNA sequence.

Training Data¶

Malinois was trained on a lentiMPRA dataset measuring the regulatory activity of ~200 bp sequences across three human cell lines (K562, HepG2 and SK-N-SH). Each training example is a sequence with three continuous activity values (log2 fold-change over input), one per cell line. Genomic sequences were split by chromosome into training, validation, and test sets to avoid sequence leakage.
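
A chromosome-based split can be sketched as follows. The record layout and the held-out chromosome sets are arbitrary placeholders chosen for illustration; this card does not state which chromosomes were held out:

```python
def split_by_chromosome(records, validation_chroms, test_chroms):
    """Assign each (chromosome, sequence) record to train, validation, or test."""
    splits = {"train": [], "validation": [], "test": []}
    for chrom, sequence in records:
        if chrom in test_chroms:
            splits["test"].append(sequence)
        elif chrom in validation_chroms:
            splits["validation"].append(sequence)
        else:
            splits["train"].append(sequence)
    return splits


# Toy records and arbitrary held-out chromosomes, for illustration only.
records = [("chr1", "ACGT"), ("chr2", "TTGA"), ("chr3", "GGCA"), ("chr1", "CCAT")]
splits = split_by_chromosome(records, validation_chroms={"chr2"}, test_chroms={"chr3"})
print({name: len(items) for name, items in splits.items()})
```

Splitting on the chromosome rather than on individual sequences keeps overlapping or neighboring genomic windows in the same partition.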

Training Procedure¶

Pre-training¶

The model was trained to minimize an L1 + KL-divergence mixed loss between predicted and measured cell-type CRE activities, with the architecture and training hyperparameters selected by Bayesian optimization.

Optimizer: Adam
Loss: L1 + KL-divergence mixed loss
Early stopping on validation loss
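
The card names the loss but does not give its exact form. One plausible reading, sketched here in plain Python for a single sequence, applies L1 to the raw activities and a KL divergence to the softmax-normalized activities across the three cell lines. The function names, the direction of the KL term, and the weights `alpha` and `beta` are assumptions, not the authors' confirmed definition:

```python
import math


def log_softmax(values):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(values)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in values))
    return [v - log_norm for v in values]


def l1_kl_mixed_loss(predicted, measured, alpha=1.0, beta=1.0):
    """Weighted sum of an L1 term and a KL term for one sequence.

    alpha and beta are hypothetical weights; the source does not state them.
    """
    l1 = sum(abs(p - m) for p, m in zip(predicted, measured)) / len(measured)
    log_p = log_softmax(predicted)
    log_m = log_softmax(measured)
    # KL(measured || predicted) over the across-cell-line distributions
    kl = sum(math.exp(lm) * (lm - lp) for lm, lp in zip(log_m, log_p))
    return alpha * l1 + beta * kl


measured = [2.0, 0.5, -1.0]  # made-up activities for K562, HepG2, SK-N-SH
print(l1_kl_mixed_loss(measured, measured))         # identical inputs: both terms vanish
print(l1_kl_mixed_loss([1.0, 1.0, 1.0], measured))  # a flat prediction is penalized
```

Under this reading the L1 term tracks absolute activity while the KL term tracks the relative pattern across cell lines, which is the quantity that matters for cell-type specificity.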

Citation¶

BibTeX
@article{gosai2024malinois,
  author    = {Gosai, Sager J. and Castro, Rodrigo I. and Fuentes, Natalia and Butts, John C. and Mouri, Kousuke and Alasoadura, Michael and Kales, Susan and Nguyen, Thanh Thanh L. and Noche, Ramil R. and Rao, Arya S. and Joy, Mary T. and Sabeti, Pardis C. and Reilly, Steven K. and Tewhey, Ryan},
  journal   = {Nature},
  month     = oct,
  number    = 8036,
  pages     = {1211--1220},
  publisher = {Springer Science and Business Media LLC},
  title     = {Machine-guided design of cell-type-targeting cis-regulatory elements},
  volume    = 634,
  year      = 2024
}

Note

The artifacts distributed in this repository are part of the MultiMolecule project. If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:

BibTeX
@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}

Contact¶

Please use MultiMolecule's GitHub issues for any questions or comments on the model card.

Please contact the authors of the Malinois paper for questions or comments on the paper/model.

License¶

This model implementation is licensed under the GNU Affero General Public License.

For additional terms and clarifications, please refer to our License FAQ.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later

multimolecule.models.malinois ¶

DnaTokenizer ¶

Bases: Tokenizer

Tokenizer for DNA sequences.

Parameters:

Name	Type	Description	Default
`alphabet` ¶	`Alphabet \| str \| List[str] \| None`	Alphabet to use for tokenization. If `None`, the standard DNA alphabet is used. If a string, it must name a predefined alphabet: `standard`, `iupac`, `streamline`, or `nucleobase`. If an alphabet or a list of characters, that specific alphabet is used.	`None`
`nmers` ¶	`int`	Size of kmer to tokenize.	`1`
`codon` ¶	`bool`	Whether to tokenize into codons.	`False`
`replace_U_with_T` ¶	`bool`	Whether to replace U with T.	`True`
`do_upper_case` ¶	`bool`	Whether to convert input to uppercase.	`True`

Examples:

Python Console Session
>>> from multimolecule import DnaTokenizer
>>> tokenizer = DnaTokenizer()
>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGTNRYSWKMBDHVX|.*-?')["input_ids"]
[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2]
>>> tokenizer('acgt')["input_ids"]
[1, 6, 7, 8, 9, 2]
>>> tokenizer('acgu')["input_ids"]
[1, 6, 7, 8, 9, 2]
>>> tokenizer = DnaTokenizer(replace_U_with_T=False)
>>> tokenizer('acgu')["input_ids"]
[1, 6, 7, 8, 3, 2]
>>> tokenizer = DnaTokenizer(nmers=3)
>>> tokenizer('tataaagta')["input_ids"]
[1, 84, 21, 81, 6, 8, 19, 71, 2]
>>> tokenizer = DnaTokenizer(codon=True)
>>> tokenizer('tataaagta')["input_ids"]
[1, 84, 6, 71, 2]
>>> tokenizer('tataaagtaa')["input_ids"]
Traceback (most recent call last):
ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10
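
The `nmers` and `codon` modes differ only in how the sequence is cut before tokens are mapped to IDs. The standalone sketch below reproduces just that cutting step without the library; the function names are illustrative:

```python
def overlapping_kmers(sequence, k):
    """Sliding window with stride 1, as in the `nmers` mode."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def codons(sequence):
    """Non-overlapping triplets, as in the `codon` mode."""
    if len(sequence) % 3 != 0:
        raise ValueError(f"length must be a multiple of 3, got {len(sequence)}")
    return [sequence[i:i + 3] for i in range(0, len(sequence), 3)]


sequence = "tataaagta".upper()
print(overlapping_kmers(sequence, 3))  # 7 overlapping 3-mers
print(codons(sequence))                # 3 codons
```

This matches the ID lists in the session above: the three codons `TAT`, `AAA`, `GTA` are the first, fourth, and seventh of the overlapping 3-mers, and their IDs (84, 6, 71) sit at those positions in the `nmers=3` output.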

Source code in multimolecule/tokenisers/dna/tokenization_dna.py

Python
class DnaTokenizer(Tokenizer):
    """
    Tokenizer for DNA sequences.

    Args:
        alphabet: alphabet to use for tokenization.

            - If is `None`, the standard DNA alphabet will be used.
            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include
                + `standard`
                + `iupac`
                + `streamline`
                + `nucleobase`
            - If is an alphabet or a list of characters, that specific alphabet will be used.
        nmers: Size of kmer to tokenize.
        codon: Whether to tokenize into codons.
        replace_U_with_T: Whether to replace U with T.
        do_upper_case: Whether to convert input to uppercase.

    Examples:
        >>> from multimolecule import DnaTokenizer
        >>> tokenizer = DnaTokenizer()
        >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGTNRYSWKMBDHVX|.*-?')["input_ids"]
        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2]
        >>> tokenizer('acgt')["input_ids"]
        [1, 6, 7, 8, 9, 2]
        >>> tokenizer('acgu')["input_ids"]
        [1, 6, 7, 8, 9, 2]
        >>> tokenizer = DnaTokenizer(replace_U_with_T=False)
        >>> tokenizer('acgu')["input_ids"]
        [1, 6, 7, 8, 3, 2]
        >>> tokenizer = DnaTokenizer(nmers=3)
        >>> tokenizer('tataaagta')["input_ids"]
        [1, 84, 21, 81, 6, 8, 19, 71, 2]
        >>> tokenizer = DnaTokenizer(codon=True)
        >>> tokenizer('tataaagta')["input_ids"]
        [1, 84, 6, 71, 2]
        >>> tokenizer('tataaagtaa')["input_ids"]
        Traceback (most recent call last):
        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10
    """

    model_input_names = ["input_ids", "attention_mask"]

    def __init__(
        self,
        alphabet: Alphabet | str | List[str] | None = None,
        nmers: int = 1,
        codon: bool = False,
        replace_U_with_T: bool = True,
        do_upper_case: bool = True,
        additional_special_tokens: List | Tuple | None = None,
        **kwargs,
    ):
        if codon and (nmers > 1 and nmers != 3):
            raise ValueError("Codon and nmers cannot be used together.")
        if codon:
            nmers = 3  # set to 3 to get correct vocab
        if not isinstance(alphabet, Alphabet):
            alphabet = get_alphabet(alphabet, nmers=nmers)
        super().__init__(
            alphabet=alphabet,
            nmers=nmers,
            codon=codon,
            replace_U_with_T=replace_U_with_T,
            do_upper_case=do_upper_case,
            additional_special_tokens=additional_special_tokens,
            **kwargs,
        )
        self.replace_U_with_T = replace_U_with_T
        self.nmers = nmers
        self.codon = codon

    def _tokenize(self, text: str, **kwargs):
        if self.do_upper_case:
            text = text.upper()
        if self.replace_U_with_T:
            text = text.replace("U", "T")
        if self.codon:
            if len(text) % 3 != 0:
                raise ValueError(
                    f"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}"
                )
            return [text[i : i + 3] for i in range(0, len(text), 3)]
        if self.nmers > 1:
            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203
        return list(text)

MalinoisConfig ¶

Bases: PreTrainedConfig

This is the configuration class to store the configuration of a MalinoisModel. It is used to instantiate a Malinois model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Malinois sjgosai/boda2 architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Parameters:

Name	Type	Description	Default
`vocab_size` ¶	`int`	Vocabulary size of the Malinois model. Defines the number of feature channels in the one-hot encoded input fed to the first convolution. Defaults to 5.	`5`
`input_length` ¶	`int`	The fixed length (in base pairs) of the input fed to the first convolution. Upstream Malinois pads each 200 bp candidate sequence with fixed MPRA plasmid flanks up to this length before the convolution stack. Defaults to 600.	`600`
`conv_channels` ¶	`list[int] \| None`	Number of output channels for each convolutional block.	`None`
`conv_kernel_sizes` ¶	`list[int] \| None`	Convolution kernel size for each convolutional block.	`None`
`num_linear_layers` ¶	`int`	Number of fully-connected layers between the convolutional stack and the branched tower.	`1`
`linear_channels` ¶	`int`	Hidden size for each fully-connected layer.	`1000`
`linear_act` ¶	`str`	The non-linear activation function (function or string) applied after the convolutional and linear layers. If string, `"gelu"`, `"relu"`, `"silu"` and `"gelu_new"` are supported.	`'relu'`
`linear_dropout` ¶	`float`	The dropout probability for the fully-connected layers.	`0.11625456877954289`
`num_branched_layers` ¶	`int`	Number of grouped (branched) layers, one independent tower per output cell line.	`3`
`branched_channels` ¶	`int`	Hidden size for each branch in the branched tower.	`140`
`branched_act` ¶	`str`	The non-linear activation function applied between branched layers.	`'relu'`
`branched_dropout` ¶	`float`	The dropout probability for the branched tower.	`0.5757068086404574`
`batch_norm_eps` ¶	`float`	The epsilon used by the batch normalization layers.	`1e-05`
`batch_norm_momentum` ¶	`float`	The momentum used by the batch normalization layers.	`0.1`
`num_labels` ¶	`int`	Number of regression outputs. Malinois predicts cell-type-informed cis-regulatory activity for three human cell lines: K562, HepG2 and SK-N-SH (in that order).	`3`
`head` ¶	`HeadConfig \| None`	The configuration of the prediction head. Defaults to a regression head (`problem_type="regression"`), matching Malinois’s CRE activity prediction task.	`None`

Examples:

Python Console Session
>>> from multimolecule import MalinoisConfig, MalinoisModel
>>> # Initializing a Malinois multimolecule/malinois style configuration
>>> configuration = MalinoisConfig()
>>> # Initializing a model (with random weights) from the multimolecule/malinois style configuration
>>> model = MalinoisModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config

Source code in multimolecule/models/malinois/configuration_malinois.py

Python
class MalinoisConfig(PreTrainedConfig):
    r"""
    This is the configuration class to store the configuration of a
    [`MalinoisModel`][multimolecule.models.MalinoisModel]. It is used to instantiate a Malinois model according to the
    specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a
    similar configuration to that of the Malinois [sjgosai/boda2](https://github.com/sjgosai/boda2) architecture.

    Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to
    control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]
    for more information.

    Args:
        vocab_size:
            Vocabulary size of the Malinois model. Defines the number of feature channels in the one-hot encoded
            input fed to the first convolution.
            Defaults to 5.
        input_length:
            The fixed length (in base pairs) of the input fed to the first convolution. Upstream Malinois pads each
            200 bp candidate sequence with fixed MPRA plasmid flanks up to this length before the convolution stack.
            Defaults to 600.
        conv_channels:
            Number of output channels for each convolutional block.
        conv_kernel_sizes:
            Convolution kernel size for each convolutional block.
        num_linear_layers:
            Number of fully-connected layers between the convolutional stack and the branched tower.
        linear_channels:
            Hidden size for each fully-connected layer.
        linear_act:
            The non-linear activation function (function or string) applied after the convolutional and linear
            layers. If string, `"gelu"`, `"relu"`, `"silu"` and `"gelu_new"` are supported.
        linear_dropout:
            The dropout probability for the fully-connected layers.
        num_branched_layers:
            Number of grouped (branched) layers, one independent tower per output cell line.
        branched_channels:
            Hidden size for each branch in the branched tower.
        branched_act:
            The non-linear activation function applied between branched layers.
        branched_dropout:
            The dropout probability for the branched tower.
        batch_norm_eps:
            The epsilon used by the batch normalization layers.
        batch_norm_momentum:
            The momentum used by the batch normalization layers.
        num_labels:
            Number of regression outputs. Malinois predicts cell-type-informed cis-regulatory activity for three
            human cell lines: K562, HepG2 and SK-N-SH (in that order).
        head:
            The configuration of the prediction head. Defaults to a regression head
            (`problem_type="regression"`), matching Malinois's CRE activity prediction task.

    Examples:
        >>> from multimolecule import MalinoisConfig, MalinoisModel
        >>> # Initializing a Malinois multimolecule/malinois style configuration
        >>> configuration = MalinoisConfig()
        >>> # Initializing a model (with random weights) from the multimolecule/malinois style configuration
        >>> model = MalinoisModel(configuration)
        >>> # Accessing the model configuration
        >>> configuration = model.config
    """

    model_type = "malinois"

    def __init__(
        self,
        vocab_size: int = 5,
        input_length: int = 600,
        conv_channels: list[int] | None = None,
        conv_kernel_sizes: list[int] | None = None,
        num_linear_layers: int = 1,
        linear_channels: int = 1000,
        linear_act: str = "relu",
        linear_dropout: float = 0.11625456877954289,
        num_branched_layers: int = 3,
        branched_channels: int = 140,
        branched_act: str = "relu",
        branched_dropout: float = 0.5757068086404574,
        batch_norm_eps: float = 1e-5,
        batch_norm_momentum: float = 0.1,
        num_labels: int = 3,
        head: HeadConfig | None = None,
        **kwargs,
    ):
        super().__init__(num_labels=num_labels, **kwargs)
        if conv_channels is None:
            conv_channels = [300, 200, 200]
        if conv_kernel_sizes is None:
            conv_kernel_sizes = [19, 11, 7]
        if len(conv_channels) != len(conv_kernel_sizes):
            raise ValueError(
                f"conv_channels and conv_kernel_sizes must have the same length, "
                f"got {len(conv_channels)} and {len(conv_kernel_sizes)}."
            )
        if len(conv_channels) != 3:
            raise ValueError(f"Malinois uses exactly 3 convolutional blocks, got {len(conv_channels)}.")
        if input_length <= 0:
            raise ValueError(f"input_length must be positive, got {input_length}.")
        if num_linear_layers <= 0:
            raise ValueError(f"num_linear_layers must be positive, got {num_linear_layers}.")
        if num_branched_layers <= 0:
            raise ValueError(f"num_branched_layers must be positive, got {num_branched_layers}.")
        self.vocab_size = vocab_size
        self.input_length = input_length
        self.conv_channels = conv_channels
        self.conv_kernel_sizes = conv_kernel_sizes
        self.num_conv_layers = len(conv_channels)
        self.num_linear_layers = num_linear_layers
        self.linear_channels = linear_channels
        self.linear_act = linear_act
        self.linear_dropout = linear_dropout
        self.num_branched_layers = num_branched_layers
        self.branched_channels = branched_channels
        self.branched_act = branched_act
        self.branched_dropout = branched_dropout
        self.batch_norm_eps = batch_norm_eps
        self.batch_norm_momentum = batch_norm_momentum
        if head is None:
            head = HeadConfig(problem_type="regression")
        else:
            head = HeadConfig(head)
            if head.problem_type is None:
                head.problem_type = "regression"
        self.head = head

    @property
    def flatten_factor(self) -> int:
        hook = self.input_length // 3 // 4
        return (hook + 2) // 4

    @property
    def hidden_size(self) -> int:
        return self.num_labels * self.branched_channels
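
The `flatten_factor` property above shrinks `input_length` through a chain of integer divisions. A worked instance with the default values, reimplemented as a standalone function so it runs without the library. Multiplying the factor by the last `conv_channels` entry to get a flattened feature size is an assumption about how the encoder uses it, not something the listing confirms:

```python
def flatten_factor(input_length):
    """Same integer arithmetic as MalinoisConfig.flatten_factor."""
    hook = input_length // 3 // 4
    return (hook + 2) // 4


conv_channels = [300, 200, 200]  # default conv_channels from the config
factor = flatten_factor(600)
print(factor)  # 13
# Assumption: the flattened feature size is the last channel count times this factor.
print(conv_channels[-1] * factor)  # 2600
```

Step by step for 600 bp: 600 // 3 = 200, 200 // 4 = 50, and (50 + 2) // 4 = 13.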

MalinoisForSequencePrediction ¶

Bases: MalinoisPreTrainedModel

Examples:

Python Console Session
>>> import torch
>>> from multimolecule import MalinoisConfig, MalinoisForSequencePrediction, DnaTokenizer
>>> config = MalinoisConfig()
>>> model = MalinoisForSequencePrediction(config)
>>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/malinois")
>>> input = tokenizer(["ACGT" * 150, "TGCA" * 150], return_tensors="pt")
>>> output = model(**input, labels=torch.randn(2, 3))
>>> output["logits"].shape
torch.Size([2, 3])

Source code in multimolecule/models/malinois/modeling_malinois.py

Python
class MalinoisForSequencePrediction(MalinoisPreTrainedModel):
    """
    Examples:
        >>> import torch
        >>> from multimolecule import MalinoisConfig, MalinoisForSequencePrediction, DnaTokenizer
        >>> config = MalinoisConfig()
        >>> model = MalinoisForSequencePrediction(config)
        >>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/malinois")
        >>> input = tokenizer(["ACGT" * 150, "TGCA" * 150], return_tensors="pt")
        >>> output = model(**input, labels=torch.randn(2, 3))
        >>> output["logits"].shape
        torch.Size([2, 3])
    """

    def __init__(self, config: MalinoisConfig):
        super().__init__(config)
        self.model = MalinoisModel(config)
        self.sequence_head = SequencePredictionHead(config)
        self.head_config = self.sequence_head.config

        # Initialize weights and apply final processing
        self.post_init()

    @property
    def output_channels(self) -> list[str]:
        if self.config.num_labels == 3:
            return ["K562", "HepG2", "SK-N-SH"]
        return [f"cell_{index}" for index in range(self.config.num_labels)]

    @can_return_tuple
    def forward(
        self,
        input_ids: Tensor | NestedTensor | None = None,
        attention_mask: Tensor | None = None,
        inputs_embeds: Tensor | NestedTensor | None = None,
        labels: Tensor | None = None,
        **kwargs: Unpack[TransformersKwargs],
    ) -> Tuple[Tensor, ...] | SequencePredictorOutput:
        outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
            inputs_embeds=inputs_embeds,
            return_dict=True,
            **kwargs,
        )

        output = self.sequence_head(outputs, labels)
        logits, loss = output.logits, output.loss

        return SequencePredictorOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

MalinoisModel ¶

Bases: MalinoisPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import MalinoisConfig, MalinoisModel, DnaTokenizer
>>> config = MalinoisConfig()
>>> model = MalinoisModel(config)
>>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/malinois")
>>> input = tokenizer(["ACGT" * 150, "TGCA" * 150], return_tensors="pt")
>>> output = model(**input)
>>> output["pooler_output"].shape
torch.Size([2, 420])

Source code in multimolecule/models/malinois/modeling_malinois.py

Python
class MalinoisModel(MalinoisPreTrainedModel):
    """
    Examples:
        >>> from multimolecule import MalinoisConfig, MalinoisModel, DnaTokenizer
        >>> config = MalinoisConfig()
        >>> model = MalinoisModel(config)
        >>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/malinois")
        >>> input = tokenizer(["ACGT" * 150, "TGCA" * 150], return_tensors="pt")
        >>> output = model(**input)
        >>> output["pooler_output"].shape
        torch.Size([2, 420])
    """

    def __init__(self, config: MalinoisConfig):
        super().__init__(config)
        self.embeddings = MalinoisEmbedding(config)
        self.encoder = MalinoisEncoder(config)
        self.pooler = MalinoisPooler(config)

        # Initialize weights and apply final processing
        self.post_init()

    @merge_with_config_defaults
    @capture_outputs
    def forward(
        self,
        input_ids: Tensor | NestedTensor | None = None,
        attention_mask: Tensor | None = None,
        inputs_embeds: Tensor | NestedTensor | None = None,
        **kwargs: Unpack[TransformersKwargs],
    ) -> MalinoisModelOutput:
        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
        elif input_ids is None and inputs_embeds is None:
            raise ValueError("You have to specify either input_ids or inputs_embeds")

        if isinstance(input_ids, NestedTensor):
            if attention_mask is None:
                attention_mask = input_ids.mask
            input_ids = input_ids.tensor
        if isinstance(inputs_embeds, NestedTensor):
            if attention_mask is None:
                attention_mask = inputs_embeds.mask
            inputs_embeds = inputs_embeds.tensor

        embedding_output = self.embeddings(
            input_ids=input_ids,
            attention_mask=attention_mask,
            inputs_embeds=inputs_embeds,
        )
        sequence_output = self.encoder(embedding_output)
        pooled_output = self.pooler(sequence_output)

        return MalinoisModelOutput(
            last_hidden_state=sequence_output,
            pooler_output=pooled_output,
        )

MalinoisModelOutput `dataclass` ¶

Bases: ModelOutput

Base class for outputs of the Malinois model.

Parameters:

Name	Type	Description	Default
`last_hidden_state` ¶	`torch.FloatTensor` of shape `(batch_size, flattened_conv_features)`	Flattened feature map produced by the convolutional encoder.	`None`
`pooler_output` ¶	`torch.FloatTensor` of shape `(batch_size, num_labels * branched_channels)`	Branch-major sequence-level representation produced by the fully-connected and branched tower. The first `branched_channels` features belong to the K562 branch, the next to HepG2, and the last to SK-N-SH.	`None`
`attentions` ¶	`tuple(torch.FloatTensor)`, optional	Always `None`; Malinois is a convolutional model without attention.	`None`
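
Since `pooler_output` is branch-major, the features for each cell line can be recovered by slicing. A minimal sketch on a plain list standing in for one row of `pooler_output`; the helper name `split_branches` is hypothetical, and the defaults mirror `num_labels=3` and `branched_channels=140`:

```python
def split_branches(features, names=("K562", "HepG2", "SK-N-SH"), branched_channels=140):
    """Slice a branch-major feature vector into one chunk per cell line."""
    expected = len(names) * branched_channels
    if len(features) != expected:
        raise ValueError(f"expected {expected} features, got {len(features)}")
    return {
        name: features[i * branched_channels:(i + 1) * branched_channels]
        for i, name in enumerate(names)
    }


# A stand-in for one row of pooler_output (420 values with the defaults)
row = list(range(420))
branches = split_branches(row)
print({name: (chunk[0], chunk[-1]) for name, chunk in branches.items()})
```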

Source code in multimolecule/models/malinois/modeling_malinois.py

Python
@dataclass
class MalinoisModelOutput(ModelOutput):
    """
    Base class for outputs of the Malinois model.

    Args:
        last_hidden_state (`torch.FloatTensor` of shape `(batch_size, flattened_conv_features)`):
            Flattened feature map produced by the convolutional encoder.
        pooler_output (`torch.FloatTensor` of shape `(batch_size, num_labels * branched_channels)`):
            Branch-major sequence-level representation produced by the fully-connected and branched tower. The first
            `branched_channels` features belong to the K562 branch, the next to HepG2, and the last to SK-N-SH.
        hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or
            when `config.output_hidden_states=True`):
            Hidden-states of the model at the output of each layer.
        attentions (`tuple(torch.FloatTensor)`, *optional*):
            Always `None`; Malinois is a convolutional model without attention.
    """

    last_hidden_state: torch.FloatTensor | None = None
    pooler_output: torch.FloatTensor | None = None
    hidden_states: tuple[torch.FloatTensor, ...] | None = None
    attentions: tuple[torch.FloatTensor, ...] | None = None

MalinoisPreTrainedModel ¶

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

Source code in multimolecule/models/malinois/modeling_malinois.py

Python
class MalinoisPreTrainedModel(PreTrainedModel):
    """
    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
    models.
    """

    config_class = MalinoisConfig
    base_model_prefix = "model"
    _can_record_outputs: dict[str, Any] | None = None
    _no_split_modules = ["MalinoisConvBlock", "MalinoisBranchedLayer"]

    @torch.no_grad()
    def _init_weights(self, module):
        super()._init_weights(module)
        # Use transformers.initialization wrappers (imported as `init`); they check the
        # `_is_hf_initialized` flag so they don't clobber tensors loaded from a checkpoint.
        if isinstance(module, nn.Conv1d):
            init.kaiming_normal_(module.weight, mode="fan_out", nonlinearity="relu")
            if module.bias is not None:
                init.zeros_(module.bias)
        # copied from the `reset_parameters` method of `class Linear(Module)` in `torch`.
        elif isinstance(module, nn.Linear):
            init.kaiming_uniform_(module.weight, a=math.sqrt(5))
            if module.bias is not None:
                fan_in, _ = nn.init._calculate_fan_in_and_fan_out(module.weight)
                bound = 1 / math.sqrt(fan_in) if fan_in > 0 else 0
                init.uniform_(module.bias, -bound, bound)
        elif isinstance(module, MalinoisGroupedLinear):
            init.kaiming_uniform_(module.weight, a=math.sqrt(3))
            fan_in, _ = nn.init._calculate_fan_in_and_fan_out(module.weight)
            bound = 1 / math.sqrt(fan_in) if fan_in > 0 else 0
            init.uniform_(module.bias, -bound, bound)
        elif isinstance(module, (nn.BatchNorm1d, nn.LayerNorm, nn.GroupNorm)):
            init.ones_(module.weight)
            init.zeros_(module.bias)
