
MaxEntScan

Maximum-entropy model for scoring short sequence motifs at RNA splice sites.

Disclaimer

This is an UNOFFICIAL implementation of "Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals" by Gene Yeo and Christopher B. Burge.

The OFFICIAL distribution of MaxEntScan is at the Burge Lab MaxEntScan page.

Tip

The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

The team that released MaxEntScan did not write a model card for it, so this model card was written by the MultiMolecule team.

Model Details

MaxEntScan is a maximum-entropy model for the splice donor (5’) and splice acceptor (3’) sequence motifs. It is not a neural network and has no trainable weights. The model parameters are fixed maximum-entropy probability tables estimated by Yeo & Burge (2004) from human splice-site sequences. These tables are registered as persistent buffers on the model so they serialize with saved checkpoints.

Model Specification

MaxEntScan is a parameter-free maximum-entropy model. It performs fixed table lookups, has no learnable weights, and contains no floating-point arithmetic that the profiler can attribute to a module, so every count in the table below is reported as zero.

Mode    Window (nt)  Num Parameters (M)  FLOPs (G)  MACs (G)
score5  9            0.00                0.00       0.00
score3  23           0.00                0.00       0.00

Usage

The model file depends on the multimolecule library. You can install it using pip:

Bash
pip install multimolecule

Direct Use

5’ Splice-Site Scoring

Python
>>> import torch
>>> from multimolecule import DnaTokenizer, MaxEntScanModel, MaxEntScanConfig

>>> config = MaxEntScanConfig()
>>> model = MaxEntScanModel(config)
>>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/maxentscan-score5")
>>> # MaxEntScan scores a raw fixed-length window; do not add special tokens.
>>> input = tokenizer("CAGGTAAGT", add_special_tokens=False, return_tensors="pt")["input_ids"]
>>> output = model(input)
>>> output.logits.shape
torch.Size([1, 1])

3’ Splice-Site Scoring

Python
>>> import torch
>>> from multimolecule import MaxEntScanConfig, MaxEntScanModel
>>> config = MaxEntScanConfig(mode="score3")
>>> model = MaxEntScanModel(config)
>>> output = model(torch.randint(4, (1, config.window)))
>>> output.logits.shape
torch.Size([1, 1])

Interface

  • Input length: 9 nt fixed window for score5; 23 nt fixed window for score3
  • Alphabet: ACGT only; unknown / N tokens are clamped onto A before table lookup
  • Special tokens: do not add (add_special_tokens=False)
  • inputs_embeds: not supported; the model scores discrete token windows only
  • Output: single scalar splice-site log-odds score per window
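
The interface constraints above can be checked before any tokens reach the model. The following is a minimal sketch in plain Python and is not part of the multimolecule API: `WINDOW_FOR_MODE`, `BASE_TO_INDEX`, `encode_window`, and `iter_windows` are hypothetical names, and the 0-3 base indices are an illustrative assumption rather than the tokenizer's actual ids. It slides a fixed-length window along a longer sequence and clamps non-ACGT characters onto A, mirroring the behaviour described for unknown tokens.

```python
from typing import Iterator, List, Tuple

# Window lengths taken from the interface description above.
WINDOW_FOR_MODE = {"score5": 9, "score3": 23}
# Illustrative indices only; the real tokenizer defines its own ids.
BASE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_window(window: str, mode: str = "score5") -> List[int]:
    """Validate the window length for `mode` and map bases to 0-3 indices."""
    expected = WINDOW_FOR_MODE[mode]
    if len(window) != expected:
        raise ValueError(f"{mode} expects {expected} nt, got {len(window)}")
    normalized = window.upper().replace("U", "T")
    # Unknown characters (for example N) are clamped onto A, i.e. index 0.
    return [BASE_TO_INDEX.get(base, 0) for base in normalized]


def iter_windows(sequence: str, mode: str = "score5") -> Iterator[Tuple[int, str]]:
    """Yield (start, window) for every full-length window in `sequence`."""
    size = WINDOW_FOR_MODE[mode]
    for start in range(len(sequence) - size + 1):
        yield start, sequence[start : start + size]


windows = list(iter_windows("ACAGGTAAGTC", mode="score5"))
print(windows)
print(encode_window("CAGGTAAGT"))
```

Each yielded window could then be tokenized with `add_special_tokens=False` and scored individually, since the model accepts only one fixed-length window per row.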

Training Details

MaxEntScan is not trained. Its maximum-entropy probability tables were estimated once by Yeo & Burge (2004) from a set of human constitutive splice-site sequences using an iterative maximum-entropy procedure. The published tables are reused verbatim.

Scoring Modes

  • score5: scores 5’ (donor) splice sites over a 9-nucleotide window (3 exonic + 6 intronic nucleotides). The score is read from the published me2x5 maximum-entropy probability table combined with the consensus background ratios.
  • score3: scores 3’ (acceptor) splice sites over a 23-nucleotide window. The 23-mer is decomposed into nine overlapping maximum-entropy submodels following the published maximum-entropy decomposition; the score is the log-ratio of the numerator and denominator submodel products.
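
The way the 5' score combines the table value with the consensus ratios can be sketched as below. This illustrates the log-odds arithmetic as it is commonly described for the original score5.pl and is not the multimolecule implementation: `score5_sketch` is a hypothetical name, the probabilities are toy placeholders rather than the published constants, and the `me2x5` table is passed in as a plain dictionary keyed by 7-mer.

```python
import math
from typing import Dict


def score5_sketch(
    window: str,
    me2x5: Dict[str, float],
    cons1: Dict[str, float],
    cons2: Dict[str, float],
    background: Dict[str, float],
) -> float:
    """Log2-odds score of a 9-nt donor window (3 exonic + 6 intronic nt)."""
    if len(window) != 9:
        raise ValueError(f"expected a 9-nt window, got {len(window)}")
    window = window.upper()
    # Positions 3 and 4 (0-based) are the first two intronic bases, usually GT.
    first, second = window[3], window[4]
    consensus_ratio = (cons1[first] * cons2[second]) / (background[first] * background[second])
    # The remaining 7 bases index the maximum-entropy table.
    rest = window[:3] + window[5:]
    return math.log2(consensus_ratio * me2x5[rest])


# Toy placeholder values, NOT the published MaxEntScan constants.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
cons1 = {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01}
cons2 = {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97}
me2x5 = {"CAGAAGT": 0.5}

score = score5_sketch("CAGGTAAGT", me2x5, cons1, cons2, background)
print(round(score, 3))
```

With the real tables, the dictionary would hold one entry per possible 7-mer; the toy dictionary here contains only the single key needed for the example.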

Training Data

  • Source: human RefSeq splice-site sequences as described in Yeo & Burge (2004).
  • Maximum-entropy constraints: pairwise and higher-order positional dependencies within the splice-site window.

MaxEntScan has no neural checkpoint. Its parameters are the fixed maximum-entropy probability tables distributed as plain-text files with the original Yeo & Burge (2004) MaxEntScan tool: me2x5 for the 5’ scorer and the nine maximum-entropy decomposition matrices me2x3acc1..9 for the 3’ scorer. The consensus and background ratios are fixed constants from the original score5.pl and score3.pl programs.
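
For the 3' scorer, the role of the nine matrices can be sketched as a product of five overlapping submodels divided by a product of four. The slice offsets below are recalled from the original score3.pl and should be checked against that program before being relied on; `score3_sketch` is a hypothetical name, the tables are stand-in callables, and all probabilities are toy placeholders rather than the published values.

```python
import math
from typing import Callable, Dict, Sequence

# (start, length) slices of the 21-nt remainder left after removing the AG.
# Assumed layout: the first five feed the numerator, the last four the denominator.
NUMERATOR_SLICES = [(0, 7), (7, 7), (14, 7), (4, 7), (11, 7)]
DENOMINATOR_SLICES = [(4, 3), (7, 4), (11, 3), (14, 4)]


def score3_sketch(
    window: str,
    tables: Sequence[Callable[[str], float]],
    cons1: Dict[str, float],
    cons2: Dict[str, float],
    background: Dict[str, float],
) -> float:
    """Log2-odds score of a 23-nt acceptor window from nine submodel tables."""
    if len(window) != 23:
        raise ValueError(f"expected a 23-nt window, got {len(window)}")
    if len(tables) != 9:
        raise ValueError(f"expected nine submodel tables, got {len(tables)}")
    window = window.upper()
    # Positions 18 and 19 (0-based) hold the acceptor dinucleotide, usually AG.
    first, second = window[18], window[19]
    consensus_ratio = (cons1[first] * cons2[second]) / (background[first] * background[second])
    rest = window[:18] + window[20:]
    numerator = 1.0
    for table, (start, length) in zip(tables[:5], NUMERATOR_SLICES):
        numerator *= table(rest[start : start + length])
    denominator = 1.0
    for table, (start, length) in zip(tables[5:], DENOMINATOR_SLICES):
        denominator *= table(rest[start : start + length])
    return math.log2(consensus_ratio * numerator / denominator)


# Toy placeholders: uniform submodels and made-up consensus probabilities.
uniform = lambda kmer: 0.25 ** len(kmer)
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
cons1 = {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01}
cons2 = {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01}

score = score3_sketch("T" * 18 + "AG" + "GTA", [uniform] * 9, cons1, cons2, background)
print(round(score, 3))
```

Dividing by the four shorter submodels removes the contribution of the regions where the five 7-mer submodels overlap, which is why the denominator slices coincide with those overlaps.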

The MaxEntScan model includes those tables as score5_me2x5.txt and score3_me2x3acc.txt in their native one-float-per-line order, which equals the base-4 order of the published splice5sequences enumeration. convert_checkpoint.py builds persistent score-table buffers directly from the bundled plain-text tables.
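
The base-4 ordering mentioned above means that a k-mer can be turned into a line offset in the table file by reading it as a base-4 number. A minimal sketch follows, assuming the digit assignment A=0, C=1, G=2, T=3 with the leftmost base as the most significant digit; `base4_index` is a hypothetical helper name, not a function exported by multimolecule.

```python
BASE_TO_DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}


def base4_index(kmer: str) -> int:
    """Interpret a k-mer as a base-4 number (leftmost base most significant)."""
    index = 0
    for base in kmer.upper():
        index = index * 4 + BASE_TO_DIGIT[base]
    return index


print(base4_index("AAAAAAA"))  # 0, the first entry of a 7-mer table
print(base4_index("AAAAAAC"))  # 1
print(base4_index("TTTTTTT"))  # 16383, the last of the 4**7 entries
```

Under this assumption, a table stored one float per line can be loaded into a flat tensor and indexed directly, with no dictionary keyed by sequence.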

Citation

BibTeX
@article{yeo2004maximum,
  author    = {Yeo, Gene and Burge, Christopher B.},
  title     = {Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals},
  journal   = {Journal of Computational Biology},
  volume    = {11},
  number    = {2-3},
  pages     = {377--394},
  year      = {2004},
  publisher = {Mary Ann Liebert, Inc.},
  doi       = {10.1089/1066527041410418}
}

Note

The artifacts distributed in this repository are part of the MultiMolecule project. If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:

BibTeX
@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}

Contact

Please use GitHub issues of MultiMolecule for any questions or comments on the model card.

Please contact the authors of the MaxEntScan paper for questions or comments on the paper/model.

License

This model implementation is licensed under the GNU Affero General Public License.

For additional terms and clarifications, please refer to our License FAQ.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later

multimolecule.models.maxentscan

DnaTokenizer

Bases: Tokenizer

Tokenizer for DNA sequences.

Parameters:

  • alphabet (Alphabet | str | List[str] | None, default: None): Alphabet to use for tokenization.
    • If it is None, the standard DNA alphabet will be used.
    • If it is a string, it should be the name of a predefined alphabet. The options are:
      • standard
      • iupac
      • streamline
      • nucleobase
    • If it is an alphabet or a list of characters, that specific alphabet will be used.
  • nmers (int, default: 1): Size of the k-mer to tokenize.
  • codon (bool, default: False): Whether to tokenize into codons.
  • replace_U_with_T (bool, default: True): Whether to replace U with T.
  • do_upper_case (bool, default: True): Whether to convert input to uppercase.

Examples:

Python Console Session
>>> from multimolecule import DnaTokenizer
>>> tokenizer = DnaTokenizer()
>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGTNRYSWKMBDHVX|.*-?')["input_ids"]
[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2]
>>> tokenizer('acgt')["input_ids"]
[1, 6, 7, 8, 9, 2]
>>> tokenizer('acgu')["input_ids"]
[1, 6, 7, 8, 9, 2]
>>> tokenizer = DnaTokenizer(replace_U_with_T=False)
>>> tokenizer('acgu')["input_ids"]
[1, 6, 7, 8, 3, 2]
>>> tokenizer = DnaTokenizer(nmers=3)
>>> tokenizer('tataaagta')["input_ids"]
[1, 84, 21, 81, 6, 8, 19, 71, 2]
>>> tokenizer = DnaTokenizer(codon=True)
>>> tokenizer('tataaagta')["input_ids"]
[1, 84, 6, 71, 2]
>>> tokenizer('tataaagtaa')["input_ids"]
Traceback (most recent call last):
ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10
Source code in multimolecule/tokenisers/dna/tokenization_dna.py
Python
class DnaTokenizer(Tokenizer):
    """
    Tokenizer for DNA sequences.

    Args:
        alphabet: alphabet to use for tokenization.

            - If it is `None`, the standard DNA alphabet will be used.
            - If it is a `string`, it should correspond to the name of a predefined alphabet. The options include
                + `standard`
                + `iupac`
                + `streamline`
                + `nucleobase`
            - If it is an alphabet or a list of characters, that specific alphabet will be used.
        nmers: Size of kmer to tokenize.
        codon: Whether to tokenize into codons.
        replace_U_with_T: Whether to replace U with T.
        do_upper_case: Whether to convert input to uppercase.

    Examples:
        >>> from multimolecule import DnaTokenizer
        >>> tokenizer = DnaTokenizer()
        >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGTNRYSWKMBDHVX|.*-?')["input_ids"]
        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2]
        >>> tokenizer('acgt')["input_ids"]
        [1, 6, 7, 8, 9, 2]
        >>> tokenizer('acgu')["input_ids"]
        [1, 6, 7, 8, 9, 2]
        >>> tokenizer = DnaTokenizer(replace_U_with_T=False)
        >>> tokenizer('acgu')["input_ids"]
        [1, 6, 7, 8, 3, 2]
        >>> tokenizer = DnaTokenizer(nmers=3)
        >>> tokenizer('tataaagta')["input_ids"]
        [1, 84, 21, 81, 6, 8, 19, 71, 2]
        >>> tokenizer = DnaTokenizer(codon=True)
        >>> tokenizer('tataaagta')["input_ids"]
        [1, 84, 6, 71, 2]
        >>> tokenizer('tataaagtaa')["input_ids"]
        Traceback (most recent call last):
        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10
    """

    model_input_names = ["input_ids", "attention_mask"]

    def __init__(
        self,
        alphabet: Alphabet | str | List[str] | None = None,
        nmers: int = 1,
        codon: bool = False,
        replace_U_with_T: bool = True,
        do_upper_case: bool = True,
        additional_special_tokens: List | Tuple | None = None,
        **kwargs,
    ):
        if codon and (nmers > 1 and nmers != 3):
            raise ValueError("Codon and nmers cannot be used together.")
        if codon:
            nmers = 3  # set to 3 to get correct vocab
        if not isinstance(alphabet, Alphabet):
            alphabet = get_alphabet(alphabet, nmers=nmers)
        super().__init__(
            alphabet=alphabet,
            nmers=nmers,
            codon=codon,
            replace_U_with_T=replace_U_with_T,
            do_upper_case=do_upper_case,
            additional_special_tokens=additional_special_tokens,
            **kwargs,
        )
        self.replace_U_with_T = replace_U_with_T
        self.nmers = nmers
        self.codon = codon

    def _tokenize(self, text: str, **kwargs):
        if self.do_upper_case:
            text = text.upper()
        if self.replace_U_with_T:
            text = text.replace("U", "T")
        if self.codon:
            if len(text) % 3 != 0:
                raise ValueError(
                    f"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}"
                )
            return [text[i : i + 3] for i in range(0, len(text), 3)]
        if self.nmers > 1:
            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203
        return list(text)
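
The two multi-base branches of `_tokenize` above differ only in their stride: k-mer tokenization slides one base at a time, while codon tokenization jumps three bases and rejects lengths that are not a multiple of three. The standalone sketch below reproduces that slicing logic outside the class; `overlapping_kmers` and `codons` are hypothetical helper names.

```python
from typing import List


def overlapping_kmers(text: str, k: int) -> List[str]:
    """Slide a window of size k one base at a time (stride 1)."""
    return [text[i : i + k] for i in range(len(text) - k + 1)]


def codons(text: str) -> List[str]:
    """Split into non-overlapping triplets (stride 3)."""
    if len(text) % 3 != 0:
        raise ValueError(
            f"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}"
        )
    return [text[i : i + 3] for i in range(0, len(text), 3)]


sequence = "tataaagta".upper()
print(overlapping_kmers(sequence, 3))  # 7 overlapping tokens
print(codons(sequence))  # 3 non-overlapping tokens
```

This matches the token counts in the docstring examples: seven ids between the special tokens for `nmers=3` and three for `codon=True` on the same 9-nt input.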

MaxEntScanConfig

Bases: PreTrainedConfig

This is the configuration class to store the configuration of a MaxEntScanModel. It is used to instantiate a MaxEntScan scorer according to the specified arguments, defining the model behavior. Instantiating a configuration with the defaults will yield a configuration equivalent to the 5’ splice-site scorer (score5) of the original MaxEntScan tool.

MaxEntScan is a maximum-entropy model and has no trainable weights. The score tables are fixed maximum-entropy probability tables published with the original tool and are registered as buffers on the model.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Parameters:

  • vocab_size (int, default: 5): Vocabulary size of the MaxEntScan model. Defines the number of different tokens that can be represented by the input_ids passed when calling MaxEntScanModel. The default of 5 corresponds to the streamline DNA alphabet ACGTN.
  • mode (str, default: 'score5'): Which splice-site scorer to use. "score5" scores 5’ (donor) splice sites, "score3" scores 3’ (acceptor) splice sites.
  • window (int | None, default: None): The fixed length of the input window. Must match mode: 9 for score5, 23 for score3. If None, it is derived from mode.
  • num_labels (int, default: 1): Number of output labels. MaxEntScan emits a single maximum-entropy score, so this must be 1.

Examples:

Python Console Session
>>> from multimolecule import MaxEntScanConfig, MaxEntScanModel
>>> # Initializing a MaxEntScan multimolecule/maxentscan-score5 style configuration
>>> configuration = MaxEntScanConfig()
>>> # Initializing a model (with random buffers) from the multimolecule/maxentscan-score5 style configuration
>>> model = MaxEntScanModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
Source code in multimolecule/models/maxentscan/configuration_maxentscan.py
Python
class MaxEntScanConfig(PreTrainedConfig):
    r"""
    This is the configuration class to store the configuration of a
    [`MaxEntScanModel`][multimolecule.models.MaxEntScanModel]. It is used to instantiate a MaxEntScan scorer according
    to the specified arguments, defining the model behavior. Instantiating a configuration with the defaults will yield
    a configuration equivalent to the 5' splice-site scorer (`score5`) of the original MaxEntScan tool.

    MaxEntScan is a maximum-entropy model and has no trainable weights. The score tables are fixed maximum-entropy
    probability tables published with the original tool and are registered as buffers on the model.

    Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to
    control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]
    for more information.

    Args:
        vocab_size:
            Vocabulary size of the MaxEntScan model. Defines the number of different tokens that can be represented by
            the `input_ids` passed when calling [`MaxEntScanModel`].
            Defaults to 5 (the streamline DNA alphabet `ACGTN`).
        mode:
            Which splice-site scorer to use. `"score5"` scores 5' (donor) splice sites, `"score3"` scores 3' (acceptor)
            splice sites.
        window:
            The fixed length of the input window. Must match `mode`: 9 for `score5`, 23 for `score3`. If `None`, it is
            derived from `mode`.
        num_labels:
            Number of output labels. MaxEntScan emits a single maximum-entropy score, so this must be 1.

    Examples:
        >>> from multimolecule import MaxEntScanConfig, MaxEntScanModel
        >>> # Initializing a MaxEntScan multimolecule/maxentscan-score5 style configuration
        >>> configuration = MaxEntScanConfig()
        >>> # Initializing a model (with random buffers) from the multimolecule/maxentscan-score5 style configuration
        >>> model = MaxEntScanModel(configuration)
        >>> # Accessing the model configuration
        >>> configuration = model.config
    """

    model_type = "maxentscan"

    def __init__(
        self,
        vocab_size: int = 5,
        mode: str = "score5",
        window: int | None = None,
        hidden_size: int = 1,
        head: HeadConfig | None = None,
        num_labels: int = 1,
        bos_token_id: int | None = None,
        eos_token_id: int | None = None,
        pad_token_id: int = 4,
        **kwargs,
    ):
        super().__init__(num_labels=num_labels, pad_token_id=pad_token_id, **kwargs)
        if mode not in WINDOW_FOR_MODE:
            raise ValueError(f"`mode` must be one of {sorted(WINDOW_FOR_MODE)}, got {mode!r}")
        expected_window = WINDOW_FOR_MODE[mode]
        if window is None:
            window = expected_window
        if window != expected_window:
            raise ValueError(f"`window` ({window}) does not match `mode` ({mode!r}); expected window {expected_window}")
        if num_labels != 1:
            raise ValueError(f"MaxEntScan emits a single score; `num_labels` must be 1, got {num_labels}")
        if hidden_size != 1:
            raise ValueError(f"MaxEntScan emits a single scalar feature; `hidden_size` must be 1, got {hidden_size}")
        self.bos_token_id = bos_token_id  # type: ignore[assignment]
        self.eos_token_id = eos_token_id  # type: ignore[assignment]
        self.vocab_size = vocab_size
        self.mode = mode
        self.window = window
        # The maximum-entropy score is a single scalar feature; the downstream regression head projects from it.
        self.hidden_size = hidden_size
        self.num_labels = num_labels
        self.problem_type = "regression"
        self.head = HeadConfig(head) if head is not None else HeadConfig(num_labels=1, problem_type="regression")
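
The mode and window checks in `__init__` above can be exercised in isolation. The sketch below restates that logic as a standalone function; `resolve_window` is a hypothetical name, and the contents of `WINDOW_FOR_MODE` are an assumption inferred from the documented window lengths, since its definition is not shown here.

```python
from typing import Optional

# Assumed mapping, inferred from the documented lengths (9 for score5, 23 for score3).
WINDOW_FOR_MODE = {"score5": 9, "score3": 23}


def resolve_window(mode: str = "score5", window: Optional[int] = None) -> int:
    """Return the window length for `mode`, rejecting unknown modes and mismatches."""
    if mode not in WINDOW_FOR_MODE:
        raise ValueError(f"`mode` must be one of {sorted(WINDOW_FOR_MODE)}, got {mode!r}")
    expected = WINDOW_FOR_MODE[mode]
    if window is None:
        # Derive the window from the mode when it is not given explicitly.
        return expected
    if window != expected:
        raise ValueError(f"`window` ({window}) does not match `mode` ({mode!r}); expected window {expected}")
    return window


print(resolve_window())  # 9
print(resolve_window("score3"))  # 23
```

Passing an explicit `window` is therefore only a consistency check; leaving it as `None` is the simplest way to avoid a mismatch error.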

MaxEntScanForSequencePrediction

Bases: MaxEntScanPreTrainedModel

MaxEntScan scorer with sequence-level regression loss support.

Examples:

Python Console Session
>>> import torch
>>> from multimolecule import MaxEntScanConfig, MaxEntScanForSequencePrediction
>>> config = MaxEntScanConfig()
>>> model = MaxEntScanForSequencePrediction(config)
>>> output = model(torch.randint(4, (1, config.window)), labels=torch.randn(1, 1))
>>> output["logits"].shape
torch.Size([1, 1])
Source code in multimolecule/models/maxentscan/modeling_maxentscan.py
Python
class MaxEntScanForSequencePrediction(MaxEntScanPreTrainedModel):
    """
    MaxEntScan scorer with sequence-level regression loss support.

    Examples:
        >>> import torch
        >>> from multimolecule import MaxEntScanConfig, MaxEntScanForSequencePrediction
        >>> config = MaxEntScanConfig()
        >>> model = MaxEntScanForSequencePrediction(config)
        >>> output = model(torch.randint(4, (1, config.window)), labels=torch.randn(1, 1))
        >>> output["logits"].shape
        torch.Size([1, 1])
    """

    def __init__(self, config: MaxEntScanConfig):
        super().__init__(config)
        self.model = MaxEntScanModel(config)
        self.sequence_head = SequencePredictionHead(config)
        self.head_config = self.sequence_head.config
        # Initialize weights and apply final processing
        self.post_init()

    @can_return_tuple
    def forward(
        self,
        input_ids: Tensor | NestedTensor | None = None,
        attention_mask: Tensor | None = None,
        inputs_embeds: Tensor | NestedTensor | None = None,
        labels: Tensor | None = None,
        **kwargs: Any,
    ) -> Tuple[Tensor, ...] | SequencePredictorOutput:
        outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
            inputs_embeds=inputs_embeds,
            return_dict=True,
            **kwargs,
        )
        output = self.sequence_head(outputs, labels, output_name="logits")
        logits, loss = output.logits, output.loss
        return SequencePredictorOutput(loss=loss, logits=logits)

MaxEntScanModel

Bases: MaxEntScanPreTrainedModel

Maximum-entropy splice-site scorer (Yeo & Burge, 2004).

The model has no trainable weights. It exposes a single maximum-entropy score per input window through fixed score-table buffers populated from the published Yeo & Burge (2004) tables.

Examples:

Python Console Session
>>> import torch
>>> from multimolecule import MaxEntScanConfig, MaxEntScanModel
>>> config = MaxEntScanConfig()
>>> model = MaxEntScanModel(config)
>>> output = model(torch.randint(4, (1, config.window)))
>>> output["logits"].shape
torch.Size([1, 1])
Source code in multimolecule/models/maxentscan/modeling_maxentscan.py
Python
class MaxEntScanModel(MaxEntScanPreTrainedModel):
    """
    Maximum-entropy splice-site scorer (Yeo & Burge, 2004).

    The model has no trainable weights. It exposes a single maximum-entropy score per input window through fixed
    score-table buffers populated from the published Yeo & Burge (2004) tables.

    Examples:
        >>> import torch
        >>> from multimolecule import MaxEntScanConfig, MaxEntScanModel
        >>> config = MaxEntScanConfig()
        >>> model = MaxEntScanModel(config)
        >>> output = model(torch.randint(4, (1, config.window)))
        >>> output["logits"].shape
        torch.Size([1, 1])
    """

    def __init__(self, config: MaxEntScanConfig):
        super().__init__(config)
        self.mode = config.mode
        self.window = config.window
        self.scorer = MaxEntScanScorer(config)
        # Initialize weights and apply final processing
        self.post_init()

    @can_return_tuple
    def forward(
        self,
        input_ids: Tensor | NestedTensor | None = None,
        attention_mask: Tensor | None = None,
        inputs_embeds: Tensor | NestedTensor | None = None,
        **kwargs: Any,
    ) -> SequencePredictorOutput:
        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
        if input_ids is None and inputs_embeds is None:
            raise ValueError("You have to specify either input_ids or inputs_embeds")
        if inputs_embeds is not None:
            raise ValueError("MaxEntScan scores discrete token windows and does not support inputs_embeds")
        assert input_ids is not None
        if isinstance(input_ids, NestedTensor):
            input_ids = input_ids.tensor
        if input_ids.dim() == 1:
            input_ids = input_ids.unsqueeze(0)
        if input_ids.size(1) != self.window:
            raise ValueError(
                f"MaxEntScan {self.mode} expects a fixed window of {self.window} tokens, got {input_ids.size(1)}"
            )
        score = self.scorer(input_ids)
        # The maximum-entropy score is exposed through `logits`; the downstream head reads it via `output_name`.
        return SequencePredictorOutput(logits=score)

MaxEntScanPreTrainedModel

Bases: PreTrainedModel

An abstract class to handle the fixed maximum-entropy score tables and a simple interface for downloading and loading the published MaxEntScan parameters.

Source code in multimolecule/models/maxentscan/modeling_maxentscan.py
Python
class MaxEntScanPreTrainedModel(PreTrainedModel):
    """
    An abstract class to handle the fixed maximum-entropy score tables and a simple interface for downloading and
    loading the published MaxEntScan parameters.
    """

    config_class = MaxEntScanConfig
    base_model_prefix = "model"
    _can_record_outputs: dict[str, Any] | None = None

    @torch.no_grad()
    def _init_weights(self, module):
        # MaxEntScan has no trainable parameters; nothing to initialize.
        return

    @property
    def dtype(self) -> torch.dtype:
        # MaxEntScan has no `nn.Parameter`; the base `PreTrainedModel.dtype` iterates
        # `self.parameters()` and raises `StopIteration` for a parameter-free model. Fall back
        # to the dtype of the first floating-point buffer (or float32 if none is set yet).
        for tensor in self.buffers():
            if tensor.is_floating_point():
                return tensor.dtype
        return torch.float32

    @property
    def device(self) -> torch.device:
        # MaxEntScan has no `nn.Parameter`; the base `PreTrainedModel.device` iterates
        # `self.parameters()` and raises `StopIteration` for a parameter-free model. Fall back
        # to the device of the first score-table buffer.
        for tensor in self.buffers():
            return tensor.device
        return torch.device("cpu")
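
The reason for the `dtype` and `device` overrides can be seen on a small module that, like MaxEntScan, holds only buffers. The following is a minimal sketch using plain PyTorch: `TableOnlyModule` and `score_table` are hypothetical names, and the zero-filled table stands in for the real score tables.

```python
import torch
from torch import nn


class TableOnlyModule(nn.Module):
    """Hypothetical stand-in for a scorer that stores only a lookup table."""

    def __init__(self, num_entries: int = 16):
        super().__init__()
        # A persistent buffer is saved in state_dict() but is not a parameter.
        self.register_buffer("score_table", torch.zeros(num_entries), persistent=True)

    @property
    def dtype(self) -> torch.dtype:
        # Same fallback idea as above: use the first floating-point buffer.
        for tensor in self.buffers():
            if tensor.is_floating_point():
                return tensor.dtype
        return torch.float32


module = TableOnlyModule()
num_parameters = sum(p.numel() for p in module.parameters())
state_keys = list(module.state_dict().keys())
first_parameter = next(module.parameters(), None)  # no parameter to inspect
print(num_parameters, state_keys, first_parameter, module.dtype)
```

Because the parameter iterator is empty, any code that calls `next()` on it without a default fails, which is the situation the buffer-based fallbacks are written to avoid.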