DeepSTARR¶

Convolutional neural network for predicting enhancer activity directly from DNA sequence.

Disclaimer¶

This is an UNOFFICIAL implementation of "DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers" by Bernardo P. de Almeida, Franziska Reiter, et al.

The OFFICIAL repository of DeepSTARR is at bernardo-de-almeida/DeepSTARR.

Tip

The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

The team releasing DeepSTARR did not write a model card for this model, so this model card was written by the MultiMolecule team.

Model Details¶

DeepSTARR is a convolutional neural network (CNN) trained to quantitatively predict enhancer activity from 249 bp DNA sequences. The model was trained on genome-wide STARR-seq data from Drosophila melanogaster S2 cells and predicts two regression outputs: developmental and housekeeping enhancer activity. The architecture consists of four convolutional blocks (Conv1D + BatchNorm + ReLU + MaxPool) followed by two fully-connected layers. Please refer to the Training Details section for more information on the training process.
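
As a rough illustration of how the four blocks shrink a 249 bp window, the sketch below tracks the sequence length through each block. It assumes length-preserving ("same") convolutions and non-overlapping max pooling of width 2 that discards any leftover position. The channel count of the last block is taken from the `DeepStarrConfig` defaults documented in the API reference on this page. The padding behaviour is not stated in this model card, so treat the numbers as an estimate rather than a description of the implementation.

```python
def conv_stack_lengths(input_length=249, num_blocks=4, pool_size=2):
    """Return the sequence length after each Conv1D + MaxPool block.

    Assumes 'same'-padded convolutions (length unchanged by the convolution)
    and floor-division pooling. Both are assumptions, not documented facts.
    """
    lengths = []
    length = input_length
    for _ in range(num_blocks):
        length = length // pool_size  # pooling drops any leftover position
        lengths.append(length)
    return lengths


lengths = conv_stack_lengths()
last_channels = 120  # last entry of the default conv_channels [256, 60, 60, 120]
flattened = lengths[-1] * last_channels
print(lengths, flattened)
```

Under these assumptions the lengths are 124, 62, 31 and 15, so the flattened feature map passed to the first fully-connected layer would hold 15 × 120 = 1800 values.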

Model Specification¶

Num Conv Layers	Num FC Layers	Hidden Size	Num Parameters (M)	FLOPs (M)	MACs (M)	Max Num Tokens
4	2	256	0.62	21.03	10.26	249
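
The parameter count in the table can be sanity-checked with simple arithmetic. The sketch below is an estimate, not the library's own counting routine. It assumes the `DeepStarrConfig` defaults (`vocab_size=5`, `conv_channels=[256, 60, 60, 120]`, `conv_kernel_sizes=[7, 3, 5, 3]`, `fc_dims=[256, 256]`), "same"-padded convolutions, a bias on every layer, and a batch normalization layer after each convolution and each fully-connected layer. The batch normalization after the fully-connected layers in particular is an assumption.

```python
def conv_params(in_channels, out_channels, kernel_size):
    # weights plus one bias per output channel
    return in_channels * kernel_size * out_channels + out_channels


def linear_params(in_features, out_features):
    return in_features * out_features + out_features


def batch_norm_params(channels):
    # learnable scale and shift; running statistics are buffers, not parameters
    return 2 * channels


def estimate_parameters(
    in_channels=5,
    input_length=249,
    conv_channels=(256, 60, 60, 120),
    conv_kernel_sizes=(7, 3, 5, 3),
    pool_size=2,
    fc_dims=(256, 256),
    num_labels=2,
):
    total = 0
    length = input_length
    previous = in_channels
    for out_channels, kernel_size in zip(conv_channels, conv_kernel_sizes):
        total += conv_params(previous, out_channels, kernel_size)
        total += batch_norm_params(out_channels)
        length //= pool_size  # assumed 'same' convolution, then pooling
        previous = out_channels
    features = length * previous  # flattened convolutional feature map
    for dim in fc_dims:
        total += linear_params(features, dim) + batch_norm_params(dim)
        features = dim
    total += linear_params(features, num_labels)  # regression head
    return total


total = estimate_parameters()
print(f"{total} parameters (about {total / 1e6:.2f} M)")
```

With these assumptions the estimate comes to 624,514 parameters, which is close to the 0.62 M listed above. Most of them sit in the first fully-connected layer.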

Links¶

Code: multimolecule.deepstarr
Weights: multimolecule/deepstarr
Data: Drosophila S2 UMI-STARR-seq enhancer-activity data
Paper: DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers
Developed by: Bernardo P. de Almeida, Franziska Reiter, Michaela Pagani, Alexander Stark
Model type: Four-block 1D CNN over 249 bp DNA for developmental and housekeeping enhancer-activity regression
Original Repository: bernardo-de-almeida/DeepSTARR

Usage¶

The model file depends on the multimolecule library. You can install it using pip:

Bash
pip install multimolecule

Direct Use¶

Enhancer Activity Prediction¶

You can use this model directly to predict the developmental and housekeeping enhancer activity of a 249 bp DNA sequence:

Python
>>> import torch
>>> from multimolecule import DnaTokenizer, DeepStarrForSequencePrediction

>>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/deepstarr")
>>> model = DeepStarrForSequencePrediction.from_pretrained("multimolecule/deepstarr")
>>> sequence = "ACGT" * 62 + "A"
>>> output = model(**tokenizer(sequence, return_tensors="pt"))

>>> output.logits.shape
torch.Size([1, 2])

Interface¶

Input length: fixed 249 bp DNA window
Output: 2 regression outputs (developmental and housekeeping enhancer activity, log2 enrichment over input)
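
Because the input window is fixed at 249 bp, sequences of other lengths need to be brought to that length before tokenization. The helper below is one possible way to do that. It is a hypothetical convenience function, not part of `multimolecule`. It center-crops longer sequences and pads shorter ones symmetrically with `N`. Whether `N` padding gives meaningful predictions is an assumption, since the model was trained on 249 bp genomic sequences.

```python
def to_fixed_window(sequence, length=249, pad_char="N"):
    """Center-crop or symmetrically pad a DNA string to `length` bases."""
    sequence = sequence.upper()
    extra = len(sequence) - length
    if extra > 0:
        start = extra // 2  # drop the same amount from both ends
        return sequence[start:start + length]
    missing = -extra
    left = missing // 2
    return pad_char * left + sequence + pad_char * (missing - left)


window = to_fixed_window("ACGT" * 100)
print(len(window))
```

A sequence that already has 249 bases is returned unchanged apart from upper-casing.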

Training Details¶

DeepSTARR was trained to predict quantitative enhancer activity from DNA sequence.

Training Data¶

DeepSTARR was trained on genome-wide UMI-STARR-seq data from Drosophila melanogaster S2 cells, measuring enhancer activity under two transcriptional programs: a developmental program (driven by a developmental core promoter) and a housekeeping program (driven by a housekeeping core promoter).

Each training example is a 249 bp genomic sequence with two continuous activity values (developmental and housekeeping, log2 enrichment over input). Chromosomes were split into training, validation, and test sets to avoid sequence leakage.
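
A chromosome-level split of this kind can be sketched as follows. The record layout (`chrom`, `sequence`, `developmental`, `housekeeping`) and the choice of held-out chromosomes are hypothetical and only illustrate the idea. They are not the assignment used to train DeepSTARR.

```python
def split_by_chromosome(records, validation_chroms, test_chroms):
    """Assign each record to a split based only on its chromosome."""
    splits = {"train": [], "validation": [], "test": []}
    for record in records:
        chrom = record["chrom"]
        if chrom in test_chroms:
            splits["test"].append(record)
        elif chrom in validation_chroms:
            splits["validation"].append(record)
        else:
            splits["train"].append(record)
    return splits


# Toy records; the activity values are made up for illustration.
records = [
    {"chrom": "chr2L", "sequence": "ACGT", "developmental": 1.2, "housekeeping": -0.3},
    {"chrom": "chr3R", "sequence": "TTGA", "developmental": 0.1, "housekeeping": 2.4},
    {"chrom": "chrX", "sequence": "GGCA", "developmental": -0.7, "housekeeping": 0.5},
]
splits = split_by_chromosome(records, validation_chroms={"chr3R"}, test_chroms={"chrX"})
print({name: len(items) for name, items in splits.items()})
```

Since every sequence from a given chromosome lands in exactly one split, overlapping windows from the same locus cannot appear in both training and evaluation data.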

Training Procedure¶

Pre-training¶

The model was trained to minimize a mean-squared-error loss between predicted and measured enhancer activities.

Optimizer: Adam
Learning rate: 2e-3
Loss: Mean Squared Error
Early stopping on validation loss
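
The loss and stopping rule listed above can be illustrated in plain Python. This is a minimal sketch, not the training code: the patience value is an assumption (the list above does not state one), and the validation losses are made-up numbers.

```python
def mean_squared_error(predicted, measured):
    """MSE over paired (developmental, housekeeping) activity values."""
    squared = [
        (p - m) ** 2
        for pred_row, meas_row in zip(predicted, measured)
        for p, m in zip(pred_row, meas_row)
    ]
    return sum(squared) / len(squared)


def early_stopping_epoch(validation_losses, patience=3):
    """Return the 0-based epoch at which training would stop.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs (hypothetical default of 3).
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(validation_losses) - 1


print(mean_squared_error([[1.0, 2.0]], [[0.0, 4.0]]))
print(early_stopping_epoch([1.0, 0.8, 0.9, 0.85, 0.95, 0.7]))
```

In the second call the best loss (0.8) is reached at epoch 1, the next three epochs do not improve on it, and training stops at epoch 4 before the later value of 0.7 is ever seen.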

Citation¶

BibTeX
@article{deAlmeida2022deepstarr,
  author    = {de Almeida, Bernardo P. and Reiter, Franziska and Pagani, Michaela and Stark, Alexander},
  journal   = {Nature Genetics},
  month     = may,
  number    = 5,
  pages     = {613--624},
  publisher = {Springer Science and Business Media LLC},
  title     = {{DeepSTARR} predicts enhancer activity from {DNA} sequence and enables the de novo design of synthetic enhancers},
  volume    = 54,
  year      = 2022
}

Note

The artifacts distributed in this repository are part of the MultiMolecule project. If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:

BibTeX
@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}

Contact¶

Please use GitHub issues of MultiMolecule for any questions or comments on the model card.

Please contact the authors of the DeepSTARR paper for questions or comments on the paper/model.

License¶

This model implementation is licensed under the GNU Affero General Public License.

For additional terms and clarifications, please refer to our License FAQ.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later

multimolecule.models.deepstarr ¶

DnaTokenizer ¶

Bases: Tokenizer

Tokenizer for DNA sequences.

Parameters:

Name	Type	Description	Default
`alphabet` ¶	`Alphabet \| str \| List[str] \| None`	Alphabet to use for tokenization. If `None`, the standard DNA alphabet is used. If a string, it should be the name of a predefined alphabet: `standard`, `iupac`, `streamline`, or `nucleobase`. If an `Alphabet` or a list of characters, that specific alphabet is used.	`None`
`nmers` ¶	`int`	Size of kmer to tokenize.	`1`
`codon` ¶	`bool`	Whether to tokenize into codons.	`False`
`replace_U_with_T` ¶	`bool`	Whether to replace U with T.	`True`
`do_upper_case` ¶	`bool`	Whether to convert input to uppercase.	`True`

Examples:

Python Console Session
>>> from multimolecule import DnaTokenizer
>>> tokenizer = DnaTokenizer()
>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGTNRYSWKMBDHVX|.*-?')["input_ids"]
[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2]
>>> tokenizer('acgt')["input_ids"]
[1, 6, 7, 8, 9, 2]
>>> tokenizer('acgu')["input_ids"]
[1, 6, 7, 8, 9, 2]
>>> tokenizer = DnaTokenizer(replace_U_with_T=False)
>>> tokenizer('acgu')["input_ids"]
[1, 6, 7, 8, 3, 2]
>>> tokenizer = DnaTokenizer(nmers=3)
>>> tokenizer('tataaagta')["input_ids"]
[1, 84, 21, 81, 6, 8, 19, 71, 2]
>>> tokenizer = DnaTokenizer(codon=True)
>>> tokenizer('tataaagta')["input_ids"]
[1, 84, 6, 71, 2]
>>> tokenizer('tataaagtaa')["input_ids"]
Traceback (most recent call last):
ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10

Source code in multimolecule/tokenisers/dna/tokenization_dna.py

Python
class DnaTokenizer(Tokenizer):
    """
    Tokenizer for DNA sequences.

    Args:
        alphabet: alphabet to use for tokenization.

            - If is `None`, the standard DNA alphabet will be used.
            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include
                + `standard`
                + `iupac`
                + `streamline`
                + `nucleobase`
            - If is an alphabet or a list of characters, that specific alphabet will be used.
        nmers: Size of kmer to tokenize.
        codon: Whether to tokenize into codons.
        replace_U_with_T: Whether to replace U with T.
        do_upper_case: Whether to convert input to uppercase.

    Examples:
        >>> from multimolecule import DnaTokenizer
        >>> tokenizer = DnaTokenizer()
        >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGTNRYSWKMBDHVX|.*-?')["input_ids"]
        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2]
        >>> tokenizer('acgt')["input_ids"]
        [1, 6, 7, 8, 9, 2]
        >>> tokenizer('acgu')["input_ids"]
        [1, 6, 7, 8, 9, 2]
        >>> tokenizer = DnaTokenizer(replace_U_with_T=False)
        >>> tokenizer('acgu')["input_ids"]
        [1, 6, 7, 8, 3, 2]
        >>> tokenizer = DnaTokenizer(nmers=3)
        >>> tokenizer('tataaagta')["input_ids"]
        [1, 84, 21, 81, 6, 8, 19, 71, 2]
        >>> tokenizer = DnaTokenizer(codon=True)
        >>> tokenizer('tataaagta')["input_ids"]
        [1, 84, 6, 71, 2]
        >>> tokenizer('tataaagtaa')["input_ids"]
        Traceback (most recent call last):
        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10
    """

    model_input_names = ["input_ids", "attention_mask"]

    def __init__(
        self,
        alphabet: Alphabet | str | List[str] | None = None,
        nmers: int = 1,
        codon: bool = False,
        replace_U_with_T: bool = True,
        do_upper_case: bool = True,
        additional_special_tokens: List | Tuple | None = None,
        **kwargs,
    ):
        if codon and (nmers > 1 and nmers != 3):
            raise ValueError("Codon and nmers cannot be used together.")
        if codon:
            nmers = 3  # set to 3 to get correct vocab
        if not isinstance(alphabet, Alphabet):
            alphabet = get_alphabet(alphabet, nmers=nmers)
        super().__init__(
            alphabet=alphabet,
            nmers=nmers,
            codon=codon,
            replace_U_with_T=replace_U_with_T,
            do_upper_case=do_upper_case,
            additional_special_tokens=additional_special_tokens,
            **kwargs,
        )
        self.replace_U_with_T = replace_U_with_T
        self.nmers = nmers
        self.codon = codon

    def _tokenize(self, text: str, **kwargs):
        if self.do_upper_case:
            text = text.upper()
        if self.replace_U_with_T:
            text = text.replace("U", "T")
        if self.codon:
            if len(text) % 3 != 0:
                raise ValueError(
                    f"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}"
                )
            return [text[i : i + 3] for i in range(0, len(text), 3)]
        if self.nmers > 1:
            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203
        return list(text)
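
The three tokenization modes in `_tokenize` differ only in how the string is sliced. The standalone function below mirrors that logic without the `multimolecule` dependency so the resulting token strings can be inspected directly. It is a simplified re-statement for illustration (the name `tokenize_dna` is made up here) and does not add special tokens or map tokens to ids.

```python
def tokenize_dna(text, nmers=1, codon=False, replace_u_with_t=True):
    """Split a DNA string into single bases, overlapping k-mers, or codons."""
    text = text.upper()
    if replace_u_with_t:
        text = text.replace("U", "T")
    if codon:
        if len(text) % 3 != 0:
            raise ValueError(
                f"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}"
            )
        # non-overlapping windows of three bases
        return [text[i:i + 3] for i in range(0, len(text), 3)]
    if nmers > 1:
        # overlapping windows with stride 1
        return [text[i:i + nmers] for i in range(len(text) - nmers + 1)]
    return list(text)


print(tokenize_dna("tataaagta", nmers=3))
print(tokenize_dna("tataaagta", codon=True))
```

For the nine-base input, k-mer mode yields seven overlapping 3-mers while codon mode yields three non-overlapping ones, which matches the number of ids between the leading and trailing special tokens in the examples above.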

DeepStarrConfig ¶

Bases: PreTrainedConfig

This is the configuration class to store the configuration of a DeepStarrModel. It is used to instantiate a DeepSTARR model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the DeepSTARR bernardo-de-almeida/DeepSTARR architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Parameters:

Name	Type	Description	Default
`vocab_size` ¶	`int`	Vocabulary size of the DeepSTARR model. Defines the number of feature channels in the one-hot encoded input fed to the first convolution. Defaults to 5.	`5`
`input_length` ¶	`int`	The fixed length (in base pairs) of the input DNA sequence. Defaults to 249.	`249`
`num_conv_layers` ¶	`int`	Number of convolutional blocks (Conv1D + BatchNorm + ReLU + MaxPool).	`4`
`conv_channels` ¶	`list[int] \| None`	Number of output channels for each convolutional block.	`None`
`conv_kernel_sizes` ¶	`list[int] \| None`	Convolution kernel size for each convolutional block.	`None`
`pool_size` ¶	`int`	Max pooling window applied after every convolutional block.	`2`
`num_fc_layers` ¶	`int`	Number of fully-connected layers between the convolutional stack and the prediction head.	`2`
`fc_dims` ¶	`list[int] \| None`	Hidden size for each fully-connected layer.	`None`
`hidden_act` ¶	`str`	The non-linear activation function (function or string) in the encoder. If string, `"gelu"`, `"relu"`, `"silu"` and `"gelu_new"` are supported.	`'relu'`
`hidden_dropout` ¶	`float`	The dropout probability for the fully-connected layers.	`0.4`
`batch_norm_eps` ¶	`float`	The epsilon used by the batch normalization layers.	`0.001`
`batch_norm_momentum` ¶	`float`	The momentum used by the batch normalization layers.	`0.1`
`num_labels` ¶	`int`	Number of regression outputs. DeepSTARR predicts developmental and housekeeping enhancer activity.	`2`
`head` ¶	`HeadConfig \| None`	The configuration of the prediction head. Defaults to a regression head (`problem_type="regression"`), matching DeepSTARR’s enhancer activity prediction task.	`None`

Examples:

Python Console Session
>>> from multimolecule import DeepStarrConfig, DeepStarrModel
>>> # Initializing a DeepSTARR multimolecule/deepstarr style configuration
>>> configuration = DeepStarrConfig()
>>> # Initializing a model (with random weights) from the multimolecule/deepstarr style configuration
>>> model = DeepStarrModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
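
To make the meaning of `vocab_size` more concrete, the sketch below one-hot encodes a DNA string into five channels. The channel order `ACGTN` and the rule that unrecognized characters fall into the `N` channel are assumptions made for illustration. The actual mapping from token ids to channels is determined by the library and is not described on this page.

```python
CHANNELS = "ACGTN"  # hypothetical channel order, one channel per symbol


def one_hot(sequence, channels=CHANNELS):
    """Return a list of rows, one per base, each with a single 1.0 entry."""
    index = {base: i for i, base in enumerate(channels)}
    unknown = index["N"]  # assumed fallback for characters outside the alphabet
    rows = []
    for base in sequence.upper():
        row = [0.0] * len(channels)
        row[index.get(base, unknown)] = 1.0
        rows.append(row)
    return rows


encoded = one_hot("ACGT" * 62 + "A")
print(len(encoded), len(encoded[0]))
```

The result has one row per base and `vocab_size` columns, which is the shape (up to a transpose) that a 1D convolution over sequence positions consumes.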

Source code in multimolecule/models/deepstarr/configuration_deepstarr.py

Python
class DeepStarrConfig(PreTrainedConfig):
    r"""
    This is the configuration class to store the configuration of a
    [`DeepStarrModel`][multimolecule.models.DeepStarrModel]. It is used to instantiate a DeepSTARR model according to
    the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will
    yield a similar configuration to that of the DeepSTARR
    [bernardo-de-almeida/DeepSTARR](https://github.com/bernardo-de-almeida/DeepSTARR) architecture.

    Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to
    control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]
    for more information.

    Args:
        vocab_size:
            Vocabulary size of the DeepSTARR model. Defines the number of feature channels in the one-hot encoded
            input fed to the first convolution.
            Defaults to 5.
        input_length:
            The fixed length (in base pairs) of the input DNA sequence.
            Defaults to 249.
        num_conv_layers:
            Number of convolutional blocks (Conv1D + BatchNorm + ReLU + MaxPool).
        conv_channels:
            Number of output channels for each convolutional block.
        conv_kernel_sizes:
            Convolution kernel size for each convolutional block.
        pool_size:
            Max pooling window applied after every convolutional block.
        num_fc_layers:
            Number of fully-connected layers between the convolutional stack and the prediction head.
        fc_dims:
            Hidden size for each fully-connected layer.
        hidden_act:
            The non-linear activation function (function or string) in the encoder. If string, `"gelu"`, `"relu"`,
            `"silu"` and `"gelu_new"` are supported.
        hidden_dropout:
            The dropout probability for the fully-connected layers.
        batch_norm_eps:
            The epsilon used by the batch normalization layers.
        batch_norm_momentum:
            The momentum used by the batch normalization layers.
        num_labels:
            Number of regression outputs. DeepSTARR predicts developmental and housekeeping enhancer activity.
        head:
            The configuration of the prediction head. Defaults to a regression head
            (`problem_type="regression"`), matching DeepSTARR's enhancer activity prediction task.

    Examples:
        >>> from multimolecule import DeepStarrConfig, DeepStarrModel
        >>> # Initializing a DeepSTARR multimolecule/deepstarr style configuration
        >>> configuration = DeepStarrConfig()
        >>> # Initializing a model (with random weights) from the multimolecule/deepstarr style configuration
        >>> model = DeepStarrModel(configuration)
        >>> # Accessing the model configuration
        >>> configuration = model.config
    """

    model_type = "deepstarr"

    def __init__(
        self,
        vocab_size: int = 5,
        input_length: int = 249,
        num_conv_layers: int = 4,
        conv_channels: list[int] | None = None,
        conv_kernel_sizes: list[int] | None = None,
        pool_size: int = 2,
        num_fc_layers: int = 2,
        fc_dims: list[int] | None = None,
        hidden_act: str = "relu",
        hidden_dropout: float = 0.4,
        batch_norm_eps: float = 1e-3,
        batch_norm_momentum: float = 0.1,
        num_labels: int = 2,
        head: HeadConfig | None = None,
        **kwargs,
    ):
        super().__init__(num_labels=num_labels, **kwargs)
        if conv_channels is None:
            conv_channels = [256, 60, 60, 120]
        if conv_kernel_sizes is None:
            conv_kernel_sizes = [7, 3, 5, 3]
        if fc_dims is None:
            fc_dims = [256, 256]
        if len(conv_channels) != num_conv_layers:
            raise ValueError(f"conv_channels must have {num_conv_layers} entries, got {len(conv_channels)}.")
        if len(conv_kernel_sizes) != num_conv_layers:
            raise ValueError(f"conv_kernel_sizes must have {num_conv_layers} entries, got {len(conv_kernel_sizes)}.")
        if len(fc_dims) != num_fc_layers:
            raise ValueError(f"fc_dims must have {num_fc_layers} entries, got {len(fc_dims)}.")
        if input_length <= 0:
            raise ValueError(f"input_length must be positive, got {input_length}.")
        if pool_size <= 0:
            raise ValueError(f"pool_size must be positive, got {pool_size}.")
        if not fc_dims:
            raise ValueError("fc_dims must contain at least one fully-connected layer.")
        self.vocab_size = vocab_size
        self.input_length = input_length
        self.num_conv_layers = num_conv_layers
        self.conv_channels = conv_channels
        self.conv_kernel_sizes = conv_kernel_sizes
        self.pool_size = pool_size
        self.num_fc_layers = num_fc_layers
        self.fc_dims = fc_dims
        self.hidden_size = fc_dims[-1]
        self.hidden_act = hidden_act
        self.hidden_dropout = hidden_dropout
        self.batch_norm_eps = batch_norm_eps
        self.batch_norm_momentum = batch_norm_momentum
        if head is None:
            head = HeadConfig(problem_type="regression")
        else:
            head = HeadConfig(head)
            if head.problem_type is None:
                head.problem_type = "regression"
        self.head = head

DeepStarrForSequencePrediction ¶

Bases: DeepStarrPreTrainedModel

Examples:

Python Console Session
>>> import torch
>>> from multimolecule import DeepStarrConfig, DeepStarrForSequencePrediction, DnaTokenizer
>>> config = DeepStarrConfig()
>>> model = DeepStarrForSequencePrediction(config)
>>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/deepstarr")
>>> input = tokenizer(["ACGT" * 62 + "A", "TGCA" * 62 + "T"], return_tensors="pt")
>>> output = model(**input, labels=torch.randn(2, 2))
>>> output["logits"].shape
torch.Size([2, 2])

Source code in multimolecule/models/deepstarr/modeling_deepstarr.py

Python
class DeepStarrForSequencePrediction(DeepStarrPreTrainedModel):
    """
    Examples:
        >>> import torch
        >>> from multimolecule import DeepStarrConfig, DeepStarrForSequencePrediction, DnaTokenizer
        >>> config = DeepStarrConfig()
        >>> model = DeepStarrForSequencePrediction(config)
        >>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/deepstarr")
        >>> input = tokenizer(["ACGT" * 62 + "A", "TGCA" * 62 + "T"], return_tensors="pt")
        >>> output = model(**input, labels=torch.randn(2, 2))
        >>> output["logits"].shape
        torch.Size([2, 2])
    """

    def __init__(self, config: DeepStarrConfig):
        super().__init__(config)
        self.model = DeepStarrModel(config)
        self.sequence_head = SequencePredictionHead(config)
        self.head_config = self.sequence_head.config

        # Initialize weights and apply final processing
        self.post_init()

    @property
    def output_channels(self) -> list[str]:
        if self.config.num_labels == 2:
            return ["developmental", "housekeeping"]
        return [f"enhancer_activity_{index}" for index in range(self.config.num_labels)]

    @can_return_tuple
    def forward(
        self,
        input_ids: Tensor | NestedTensor | None = None,
        attention_mask: Tensor | None = None,
        inputs_embeds: Tensor | NestedTensor | None = None,
        labels: Tensor | None = None,
        **kwargs: Unpack[TransformersKwargs],
    ) -> Tuple[Tensor, ...] | SequencePredictorOutput:
        outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
            inputs_embeds=inputs_embeds,
            return_dict=True,
            **kwargs,
        )

        output = self.sequence_head(outputs, labels)
        logits, loss = output.logits, output.loss

        return SequencePredictorOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )
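
The `output_channels` property gives a name to each column of `logits`. The snippet below restates that naming rule as a plain function and pairs it with one row of predictions. The helper `label_predictions` and the example values are hypothetical and only show how the names line up with the outputs.

```python
def output_channels(num_labels):
    """Mirror of the naming rule used by the `output_channels` property."""
    if num_labels == 2:
        return ["developmental", "housekeeping"]
    return [f"enhancer_activity_{index}" for index in range(num_labels)]


def label_predictions(logits_row):
    """Pair one row of regression outputs with its channel names."""
    return dict(zip(output_channels(len(logits_row)), logits_row))


print(label_predictions([1.5, -0.25]))
```

With a real model, a row could be obtained with something like `output.logits[0].tolist()` before passing it to such a helper.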

DeepStarrModel ¶

Bases: DeepStarrPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import DeepStarrConfig, DeepStarrModel, DnaTokenizer
>>> config = DeepStarrConfig()
>>> model = DeepStarrModel(config)
>>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/deepstarr")
>>> input = tokenizer(["ACGT" * 62 + "A", "TGCA" * 62 + "T"], return_tensors="pt")
>>> output = model(**input)
>>> output["pooler_output"].shape
torch.Size([2, 256])

Source code in multimolecule/models/deepstarr/modeling_deepstarr.py

Python
class DeepStarrModel(DeepStarrPreTrainedModel):
    """
    Examples:
        >>> from multimolecule import DeepStarrConfig, DeepStarrModel, DnaTokenizer
        >>> config = DeepStarrConfig()
        >>> model = DeepStarrModel(config)
        >>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/deepstarr")
        >>> input = tokenizer(["ACGT" * 62 + "A", "TGCA" * 62 + "T"], return_tensors="pt")
        >>> output = model(**input)
        >>> output["pooler_output"].shape
        torch.Size([2, 256])
    """

    def __init__(self, config: DeepStarrConfig):
        super().__init__(config)
        self.embeddings = DeepStarrEmbedding(config)
        self.encoder = DeepStarrEncoder(config)
        self.pooler = DeepStarrPooler(config)

        # Initialize weights and apply final processing
        self.post_init()

    @merge_with_config_defaults
    @capture_outputs
    def forward(
        self,
        input_ids: Tensor | NestedTensor | None = None,
        attention_mask: Tensor | None = None,
        inputs_embeds: Tensor | NestedTensor | None = None,
        **kwargs: Unpack[TransformersKwargs],
    ) -> DeepStarrModelOutput:
        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
        elif input_ids is None and inputs_embeds is None:
            raise ValueError("You have to specify either input_ids or inputs_embeds")

        if isinstance(input_ids, NestedTensor):
            if attention_mask is None:
                attention_mask = input_ids.mask
            input_ids = input_ids.tensor
        if isinstance(inputs_embeds, NestedTensor):
            if attention_mask is None:
                attention_mask = inputs_embeds.mask
            inputs_embeds = inputs_embeds.tensor

        embedding_output = self.embeddings(
            input_ids=input_ids,
            attention_mask=attention_mask,
            inputs_embeds=inputs_embeds,
        )
        sequence_output = self.encoder(embedding_output)
        pooled_output = self.pooler(sequence_output)

        return DeepStarrModelOutput(
            last_hidden_state=sequence_output,
            pooler_output=pooled_output,
        )

DeepStarrModelOutput `dataclass` ¶

Bases: ModelOutput

Base class for outputs of DeepSTARR model.

Parameters:

Name	Type	Description	Default
`last_hidden_state` ¶	`torch.FloatTensor` of shape `(batch_size, flattened_conv_features)`	Flattened feature map produced by the convolutional encoder.	`None`
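`pooler_output` ¶	`torch.FloatTensor` of shape `(batch_size, hidden_size)`	Sequence-level representation produced by the fully-connected pooler.	`None`
`hidden_states` ¶	`tuple(torch.FloatTensor)`, optional, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`	Hidden-states of the model at the output of each layer.	`None`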
`attentions` ¶	`tuple(torch.FloatTensor)`, optional	Always `None`; DeepSTARR is a convolutional model without attention.	`None`

Source code in multimolecule/models/deepstarr/modeling_deepstarr.py

Python
@dataclass
class DeepStarrModelOutput(ModelOutput):
    """
    Base class for outputs of DeepSTARR model.

    Args:
        last_hidden_state (`torch.FloatTensor` of shape `(batch_size, flattened_conv_features)`):
            Flattened feature map produced by the convolutional encoder.
        pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`):
            Sequence-level representation produced by the fully-connected pooler.
        hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or
            when `config.output_hidden_states=True`):
            Hidden-states of the model at the output of each layer.
        attentions (`tuple(torch.FloatTensor)`, *optional*):
            Always `None`; DeepSTARR is a convolutional model without attention.
    """

    last_hidden_state: torch.FloatTensor | None = None
    pooler_output: torch.FloatTensor | None = None
    hidden_states: tuple[torch.FloatTensor, ...] | None = None
    attentions: tuple[torch.FloatTensor, ...] | None = None

DeepStarrPreTrainedModel ¶

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

Source code in multimolecule/models/deepstarr/modeling_deepstarr.py

Python
class DeepStarrPreTrainedModel(PreTrainedModel):
    """
    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
    models.
    """

    config_class = DeepStarrConfig
    base_model_prefix = "model"
    supports_gradient_checkpointing = True
    _can_record_outputs: dict[str, Any] | None = None
    _no_split_modules = ["DeepStarrBlock"]

    @torch.no_grad()
    def _init_weights(self, module):
        super()._init_weights(module)
        # Use transformers.initialization wrappers (imported as `init`); they check the
        # `_is_hf_initialized` flag so they don't clobber tensors loaded from a checkpoint.
        if isinstance(module, nn.Conv1d):
            init.kaiming_normal_(module.weight, mode="fan_out", nonlinearity="relu")
            if module.bias is not None:
                init.zeros_(module.bias)
        # copied from the `reset_parameters` method of `class Linear(Module)` in `torch`.
        elif isinstance(module, nn.Linear):
            init.kaiming_uniform_(module.weight, a=math.sqrt(5))
            if module.bias is not None:
                fan_in, _ = nn.init._calculate_fan_in_and_fan_out(module.weight)
                bound = 1 / math.sqrt(fan_in) if fan_in > 0 else 0
                init.uniform_(module.bias, -bound, bound)
        elif isinstance(module, (nn.BatchNorm1d, nn.LayerNorm, nn.GroupNorm)):
            init.ones_(module.weight)
            init.zeros_(module.bias)
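
The scale of the initial weights follows from standard formulas for Kaiming initialization. The sketch below evaluates them with the standard library only. It assumes the usual PyTorch conventions: `fan_out` of a `Conv1d` weight is `out_channels * kernel_size`, the ReLU gain is `sqrt(2)`, and `kaiming_uniform_` with `a=sqrt(5)` uses gain `sqrt(2 / (1 + a**2))`. The layer sizes in the example come from the `DeepStarrConfig` defaults.

```python
import math


def conv_kaiming_normal_std(out_channels, kernel_size):
    """Std of kaiming_normal_(mode='fan_out', nonlinearity='relu') for Conv1d."""
    fan_out = out_channels * kernel_size
    return math.sqrt(2.0) / math.sqrt(fan_out)


def linear_uniform_bound(in_features):
    """Bound of kaiming_uniform_(a=sqrt(5)) for a Linear weight.

    gain = sqrt(2 / (1 + 5)) and bound = gain * sqrt(3 / fan_in), which
    simplifies to 1 / sqrt(fan_in), the same bound used for the bias.
    """
    gain = math.sqrt(2.0 / (1.0 + 5.0))
    return gain * math.sqrt(3.0 / in_features)


print(conv_kaiming_normal_std(out_channels=256, kernel_size=7))
print(linear_uniform_bound(in_features=256))
```

For a 256-input linear layer both the weight and the bias are therefore drawn uniformly from the interval [-1/16, 1/16].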
