MaxEntScan

Maximum-entropy model for scoring short sequence motifs at RNA splice sites.

Disclaimer

This is an UNOFFICIAL implementation of "Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals" by Gene Yeo and Christopher B. Burge.

The OFFICIAL distribution of MaxEntScan is at the Burge Lab MaxEntScan page.

Tip

The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

The team that released MaxEntScan did not write a model card for it, so this model card was written by the MultiMolecule team.

Model Details

MaxEntScan is a maximum-entropy model for the splice donor (5’) and splice acceptor (3’) sequence motifs. It is not a neural network and has no trainable weights. The model parameters are fixed maximum-entropy probability tables estimated by Yeo & Burge (2004) from human splice-site sequences.

Model Specification

MaxEntScan has no learnable weights. It performs fixed table lookups, so the profiler attributes no parameters and no floating-point operations to any module. The bundled score tables that serve as the model’s fixed parameters are:

  • score5: a single 16,384-entry me2x5 probability table (4⁷ floats) indexed by the base-4 hash of the 7 non-consensus positions of the 9-mer.
  • score3: nine overlapping maximum-entropy decomposition tables (me2x3acc1..9) with sizes 4⁷, 4⁷, 4⁷, 4⁷, 4⁷, 4³, 4⁴, 4³, 4⁴ (5 × 16384 + 2 × 64 + 2 × 256 = 82560 floats total).
Mode    Window  Num Parameters (M)  FLOPs (G)  MACs (G)
score5  9       0.00                0.00       0.00
score3  23
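
The table sizes listed above can be checked with a few lines of arithmetic, and the base-4 indexing can be sketched in the same way. This is a minimal illustration, not the MultiMolecule implementation: the digit mapping (A=0, C=1, G=2, U=3), the most-significant-first digit order, and the helper name `base4_index` are all assumptions, and the bundled tables may order their entries differently.

```python
# Table sizes expressed as exponents of 4, copied from the list above.
SCORE5_EXPONENTS = [7]
SCORE3_EXPONENTS = [7, 7, 7, 7, 7, 3, 4, 3, 4]

score5_floats = sum(4**k for k in SCORE5_EXPONENTS)
score3_floats = sum(4**k for k in SCORE3_EXPONENTS)

# Assumed digit mapping; not taken from the MultiMolecule source.
DIGIT = {"A": 0, "C": 1, "G": 2, "U": 3}


def base4_index(kmer: str) -> int:
    """Read a k-mer as a base-4 number, most significant base first."""
    index = 0
    for base in kmer:
        index = index * 4 + DIGIT[base]
    return index


print(score5_floats, score3_floats)
print(base4_index("AAAAAAA"), base4_index("UUUUUUU"))
```

Under this mapping every 7-mer lands on a distinct index between 0 and 4⁷ − 1, which is why a flat 16,384-entry table is enough for the seven non-consensus positions.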

Usage

The model file depends on the multimolecule library. You can install it using pip:

Bash
pip install multimolecule

Direct Use

5’ Splice-Site Scoring

Python
>>> import torch
>>> from multimolecule import RnaTokenizer, MaxEntScanModel, MaxEntScanConfig

>>> config = MaxEntScanConfig()
>>> model = MaxEntScanModel(config)
>>> tokenizer = RnaTokenizer.from_pretrained("multimolecule/maxentscan-score5")
>>> # MaxEntScan scores a raw fixed-length window; do not add special tokens.
>>> input_ids = tokenizer("CAGGUAAGU", add_special_tokens=False, return_tensors="pt")["input_ids"]
>>> output = model(input_ids)
>>> output.logits.shape
torch.Size([1, 1])

3’ Splice-Site Scoring

Python
>>> config = MaxEntScanConfig(mode="score3")
>>> model = MaxEntScanModel(config)
>>> output = model(torch.randint(4, (1, config.window)))
>>> output.logits.shape
torch.Size([1, 1])

Interface

  • Input length: 9 nt fixed window for score5; 23 nt fixed window for score3
  • Alphabet: ACGU only; unknown / N tokens are clamped onto A before table lookup
  • Special tokens: do not add (add_special_tokens=False)
  • inputs_embeds: not supported; the model scores discrete token windows only
  • Output: single scalar splice-site log-odds score per window
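
Because the model only accepts fixed-length windows, a longer sequence has to be cut into windows before scoring. The sketch below shows one way to do that in plain Python. The helper names `prepare_window` and `iter_windows` are hypothetical, the T-to-U conversion is an added convenience rather than documented behaviour, and the window lengths are the ones listed above.

```python
# Window lengths per mode, as listed in the interface description.
WINDOW_FOR_MODE = {"score5": 9, "score3": 23}


def prepare_window(sequence: str, mode: str = "score5") -> str:
    """Upper-case one window, map T to U, and check its length for `mode`."""
    window = sequence.upper().replace("T", "U")
    expected = WINDOW_FOR_MODE[mode]
    if len(window) != expected:
        raise ValueError(f"{mode} expects {expected} nt, got {len(window)}")
    return window


def iter_windows(sequence: str, mode: str = "score5"):
    """Yield (offset, window) for every fixed-length window of a longer sequence."""
    size = WINDOW_FOR_MODE[mode]
    for start in range(len(sequence) - size + 1):
        yield start, prepare_window(sequence[start:start + size], mode)


for offset, window in iter_windows("ACGUACGUAC"):
    print(offset, window)
```

Each yielded window can then be tokenized with `add_special_tokens=False` and passed to the model as in the examples above.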

Training Details

MaxEntScan is not trained. Its maximum-entropy probability tables were estimated once by Yeo & Burge (2004) from a set of human constitutive splice-site sequences using an iterative maximum-entropy procedure. The published tables are reused verbatim.

Scoring Modes

  • score5: scores 5’ (donor) splice sites over a 9-nucleotide window (3 exonic + 6 intronic nucleotides). The score is read from the published me2x5 maximum-entropy probability table combined with the consensus background ratios.
  • score3: scores 3’ (acceptor) splice sites over a 23-nucleotide window. The 23-mer is decomposed into nine overlapping maximum-entropy submodels following the published maximum-entropy decomposition; the score is the log-ratio of the numerator and denominator submodel products.
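
The two scoring rules can be sketched in plain Python. This is an illustration under stated assumptions, not the MultiMolecule code: the slice boundaries, the position of the consensus dinucleotide in the 23-mer, and the use of a base-2 logarithm follow my reading of the original Perl scripts and should be checked against them. The tables are placeholder uniform distributions rather than the published values, and `consensus_ratio` is a hypothetical argument standing in for the consensus/background term.

```python
import math

DIGIT = {"A": 0, "C": 1, "G": 2, "U": 3}  # assumed digit mapping


def base4_index(kmer: str) -> int:
    index = 0
    for base in kmer:
        index = index * 4 + DIGIT[base]
    return index


# Assumed slices into the 21 non-consensus positions of the 23-mer.
NUMERATOR_SLICES = [(0, 7), (7, 14), (14, 21), (4, 11), (11, 18)]
DENOMINATOR_SLICES = [(4, 7), (7, 11), (11, 14), (14, 18)]


def uniform_table(k: int) -> list:
    """Placeholder table: every k-mer gets probability 4**-k."""
    return [4.0 ** -k] * 4 ** k


def score5_sketch(window: str, table: list, consensus_ratio: float) -> float:
    # 3 exonic + 6 intronic bases: indices 3 and 4 hold the consensus dinucleotide.
    rest = window[:3] + window[5:]
    return math.log2(consensus_ratio * table[base4_index(rest)])


def score3_sketch(window: str, tables: list, consensus_ratio: float) -> float:
    # Assumed layout: indices 18 and 19 hold the consensus dinucleotide.
    rest = window[:18] + window[20:]
    numerator = math.prod(
        tables[i][base4_index(rest[a:b])] for i, (a, b) in enumerate(NUMERATOR_SLICES)
    )
    denominator = math.prod(
        tables[5 + i][base4_index(rest[a:b])] for i, (a, b) in enumerate(DENOMINATOR_SLICES)
    )
    return math.log2(consensus_ratio * numerator / denominator)


table5 = uniform_table(7)
tables3 = [uniform_table(b - a) for a, b in NUMERATOR_SLICES + DENOMINATOR_SLICES]
score5 = score5_sketch("CAGGUAAGU", table5, consensus_ratio=1.0)
score3 = score3_sketch("U" * 18 + "AG" + "UUU", tables3, consensus_ratio=1.0)
print(score5, score3)
```

The denominator slices are exactly the overlaps between neighbouring numerator slices, so dividing by them removes the positions that the overlapping submodels would otherwise count twice. With uniform placeholder tables and a ratio of 1, the scores reduce to log2(4⁻⁷) and log2(4⁻²¹).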

Training Data

  • Source: human RefSeq splice-site sequences as described in Yeo & Burge (2004).
  • Maximum-entropy constraints: pairwise and higher-order positional dependencies within the splice-site window.

The model parameters are the fixed maximum-entropy probability tables distributed as plain-text files with the original Yeo & Burge (2004) MaxEntScan tool: me2x5 for the 5’ scorer and the nine maximum-entropy decomposition matrices me2x3acc1..9 for the 3’ scorer. The consensus and background ratios are fixed constants from the original score5.pl and score3.pl programs.

Training Procedure

Pre-training

MaxEntScan does not use neural-network pre-training. Its maximum-entropy probability tables are reused from the original MaxEntScan distribution.

Citation

BibTeX
@article{yeo2004maximum,
  author    = {Yeo, Gene and Burge, Christopher B.},
  title     = {Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals},
  journal   = {Journal of Computational Biology},
  volume    = {11},
  number    = {2-3},
  pages     = {377--394},
  year      = {2004},
  publisher = {Mary Ann Liebert, Inc.},
  doi       = {10.1089/1066527041410418}
}

Note

The artifacts distributed in this repository are part of the MultiMolecule project. If MultiMolecule supports your research, please cite the MultiMolecule project as follows:

BibTeX
@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}

Contact

Please use GitHub issues of MultiMolecule for any questions or comments on the model card.

Please contact the authors of the MaxEntScan paper for questions or comments on the paper/model.

License

This model implementation is licensed under the GNU Affero General Public License.

For additional terms and clarifications, please refer to our License FAQ.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later

API Reference

MaxEntScanConfig

Bases: PreTrainedConfig

This is the configuration class to store the configuration of a MaxEntScanModel. It is used to instantiate a MaxEntScan scorer according to the specified arguments, defining the model behavior. Instantiating a configuration with the defaults will yield a configuration equivalent to the 5’ splice-site scorer (score5) of the original MaxEntScan tool.

MaxEntScan is a maximum-entropy model and has no trainable weights. The score tables are fixed maximum-entropy probability tables published with the original tool and are registered as buffers on the model.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Parameters:

Name Type Description Default

vocab_size

int

Vocabulary size of the MaxEntScan model. Defines the number of different tokens that can be represented by the input_ids passed when calling MaxEntScanModel. Defaults to 5 (the streamline RNA alphabet ACGUN).

5

mode

str

Which splice-site scorer to use. "score5" scores 5’ (donor) splice sites, "score3" scores 3’ (acceptor) splice sites.

'score5'

window

int | None

The fixed length of the input window. Must match mode: 9 for score5, 23 for score3. If None, it is derived from mode.

None

num_labels

int

Number of output labels. MaxEntScan emits a single maximum-entropy score, so this must be 1.

1

Examples:

Python Console Session
>>> from multimolecule import MaxEntScanConfig, MaxEntScanModel
>>> # Initializing a MaxEntScan multimolecule/maxentscan-score5 style configuration
>>> configuration = MaxEntScanConfig()
>>> # Initializing a model (with random buffers) from the multimolecule/maxentscan-score5 style configuration
>>> model = MaxEntScanModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
Source code in multimolecule/models/maxentscan/configuration_maxentscan.py
Python
class MaxEntScanConfig(PreTrainedConfig):
    r"""
    This is the configuration class to store the configuration of a
    [`MaxEntScanModel`][multimolecule.models.MaxEntScanModel]. It is used to instantiate a MaxEntScan scorer according
    to the specified arguments, defining the model behavior. Instantiating a configuration with the defaults will yield
    a configuration equivalent to the 5' splice-site scorer (`score5`) of the original MaxEntScan tool.

    MaxEntScan is a maximum-entropy model and has no trainable weights. The score tables are fixed maximum-entropy
    probability tables published with the original tool and are registered as buffers on the model.

    Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to
    control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]
    for more information.

    Args:
        vocab_size:
            Vocabulary size of the MaxEntScan model. Defines the number of different tokens that can be represented by
            the `input_ids` passed when calling [`MaxEntScanModel`].
            Defaults to 5 (the streamline RNA alphabet `ACGUN`).
        mode:
            Which splice-site scorer to use. `"score5"` scores 5' (donor) splice sites, `"score3"` scores 3' (acceptor)
            splice sites.
        window:
            The fixed length of the input window. Must match `mode`: 9 for `score5`, 23 for `score3`. If `None`, it is
            derived from `mode`.
        num_labels:
            Number of output labels. MaxEntScan emits a single maximum-entropy score, so this must be 1.

    Examples:
        >>> from multimolecule import MaxEntScanConfig, MaxEntScanModel
        >>> # Initializing a MaxEntScan multimolecule/maxentscan-score5 style configuration
        >>> configuration = MaxEntScanConfig()
        >>> # Initializing a model (with random buffers) from the multimolecule/maxentscan-score5 style configuration
        >>> model = MaxEntScanModel(configuration)
        >>> # Accessing the model configuration
        >>> configuration = model.config
    """

    model_type = "maxentscan"

    def __init__(
        self,
        vocab_size: int = 5,
        mode: str = "score5",
        window: int | None = None,
        hidden_size: int = 1,
        head: HeadConfig | None = None,
        num_labels: int = 1,
        bos_token_id: int | None = None,
        eos_token_id: int | None = None,
        pad_token_id: int = 4,
        **kwargs,
    ):
        super().__init__(num_labels=num_labels, pad_token_id=pad_token_id, **kwargs)
        if mode not in WINDOW_FOR_MODE:
            raise ValueError(f"`mode` must be one of {sorted(WINDOW_FOR_MODE)}, got {mode!r}")
        expected_window = WINDOW_FOR_MODE[mode]
        if window is None:
            window = expected_window
        if window != expected_window:
            raise ValueError(f"`window` ({window}) does not match `mode` ({mode!r}); expected window {expected_window}")
        if num_labels != 1:
            raise ValueError(f"MaxEntScan emits a single score; `num_labels` must be 1, got {num_labels}")
        if hidden_size != 1:
            raise ValueError(f"MaxEntScan emits a single scalar feature; `hidden_size` must be 1, got {hidden_size}")
        self.bos_token_id = bos_token_id  # type: ignore[assignment]
        self.eos_token_id = eos_token_id  # type: ignore[assignment]
        self.vocab_size = vocab_size
        self.mode = mode
        self.window = window
        # The maximum-entropy score is a single scalar feature; the downstream regression head projects from it.
        self.hidden_size = hidden_size
        self.num_labels = num_labels
        self.problem_type = "regression"
        self.head = HeadConfig(head) if head is not None else HeadConfig(num_labels=1, problem_type="regression")

MaxEntScanForSequencePrediction

Bases: MaxEntScanPreTrainedModel

MaxEntScan scorer with sequence-level regression loss support.

Examples:

Python Console Session
>>> import torch
>>> from multimolecule import MaxEntScanConfig, MaxEntScanForSequencePrediction
>>> config = MaxEntScanConfig()
>>> model = MaxEntScanForSequencePrediction(config)
>>> output = model(torch.randint(4, (1, config.window)), labels=torch.randn(1, 1))
>>> output["logits"].shape
torch.Size([1, 1])
Source code in multimolecule/models/maxentscan/modeling_maxentscan.py
Python
class MaxEntScanForSequencePrediction(MaxEntScanPreTrainedModel):
    """
    MaxEntScan scorer with sequence-level regression loss support.

    Examples:
        >>> import torch
        >>> from multimolecule import MaxEntScanConfig, MaxEntScanForSequencePrediction
        >>> config = MaxEntScanConfig()
        >>> model = MaxEntScanForSequencePrediction(config)
        >>> output = model(torch.randint(4, (1, config.window)), labels=torch.randn(1, 1))
        >>> output["logits"].shape
        torch.Size([1, 1])
    """

    def __init__(self, config: MaxEntScanConfig):
        super().__init__(config)
        self.model = MaxEntScanModel(config)
        head = config.head
        if head is None:
            raise ValueError("MaxEntScanForSequencePrediction requires `config.head` to be set")
        # MaxEntScan is parameter-free: the score is passed straight to `Criterion` with no trainable head.
        self.criterion = Criterion(head)
        # Initialize weights and apply final processing
        self.post_init()

    @can_return_tuple
    def forward(
        self,
        input_ids: Tensor | NestedTensor | None = None,
        attention_mask: Tensor | None = None,
        inputs_embeds: Tensor | NestedTensor | None = None,
        labels: Tensor | None = None,
        **kwargs: Any,
    ) -> tuple[Tensor, ...] | SequencePredictorOutput:
        outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
            inputs_embeds=inputs_embeds,
            return_dict=True,
            **kwargs,
        )
        logits = outputs.logits
        loss = self.criterion(logits, labels) if labels is not None else None
        return SequencePredictorOutput(loss=loss, logits=logits)

MaxEntScanModel

Bases: MaxEntScanPreTrainedModel

Maximum-entropy splice-site scorer (Yeo & Burge, 2004).

The model has no trainable weights. It exposes a single maximum-entropy score per input window through fixed score-table buffers populated from the published Yeo & Burge (2004) tables.

Examples:

Python Console Session
>>> import torch
>>> from multimolecule import MaxEntScanConfig, MaxEntScanModel
>>> config = MaxEntScanConfig()
>>> model = MaxEntScanModel(config)
>>> output = model(torch.randint(4, (1, config.window)))
>>> output["logits"].shape
torch.Size([1, 1])
Source code in multimolecule/models/maxentscan/modeling_maxentscan.py
Python
class MaxEntScanModel(MaxEntScanPreTrainedModel):
    """
    Maximum-entropy splice-site scorer (Yeo & Burge, 2004).

    The model has no trainable weights. It exposes a single maximum-entropy score per input window through fixed
    score-table buffers populated from the published Yeo & Burge (2004) tables.

    Examples:
        >>> import torch
        >>> from multimolecule import MaxEntScanConfig, MaxEntScanModel
        >>> config = MaxEntScanConfig()
        >>> model = MaxEntScanModel(config)
        >>> output = model(torch.randint(4, (1, config.window)))
        >>> output["logits"].shape
        torch.Size([1, 1])
    """

    def __init__(self, config: MaxEntScanConfig):
        super().__init__(config)
        self.mode = config.mode
        self.window = config.window
        self.scorer = MaxEntScanScorer(config)
        # Initialize weights and apply final processing
        self.post_init()

    @can_return_tuple
    def forward(
        self,
        input_ids: Tensor | NestedTensor | None = None,
        attention_mask: Tensor | None = None,
        inputs_embeds: Tensor | NestedTensor | None = None,
        **kwargs: Any,
    ) -> SequencePredictorOutput:
        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
        if input_ids is None and inputs_embeds is None:
            raise ValueError("You have to specify either input_ids or inputs_embeds")
        if inputs_embeds is not None:
            raise ValueError("MaxEntScan scores discrete token windows and does not support inputs_embeds")
        assert input_ids is not None  # narrowed: both-None and inputs_embeds-not-None are rejected above
        if isinstance(input_ids, NestedTensor):
            input_ids = input_ids.tensor
        if input_ids.dim() == 1:
            input_ids = input_ids.unsqueeze(0)
        if input_ids.size(1) != self.window:
            raise ValueError(
                f"MaxEntScan {self.mode} expects a fixed window of {self.window} tokens, " f"got {input_ids.size(1)}"
            )
        score = self.scorer(input_ids)
        # The maximum-entropy score is exposed through `logits`; the downstream head reads it via `output_name`.
        return SequencePredictorOutput(logits=score)

MaxEntScanPreTrainedModel

Bases: PreTrainedModel

An abstract class to handle the fixed maximum-entropy score tables and a simple interface for downloading and loading the published MaxEntScan parameters.

Source code in multimolecule/models/maxentscan/modeling_maxentscan.py
Python
class MaxEntScanPreTrainedModel(PreTrainedModel):
    """
    An abstract class to handle the fixed maximum-entropy score tables and a simple interface for downloading and
    loading the published MaxEntScan parameters.
    """

    config_class = MaxEntScanConfig
    base_model_prefix = "model"
    _can_record_outputs: dict[str, Any] | None = None

    @torch.no_grad()
    def _init_weights(self, module):
        # MaxEntScan has no trainable parameters; nothing to initialize.
        return

    @property
    def dtype(self) -> torch.dtype:
        # MaxEntScan has no `nn.Parameter`; the base `PreTrainedModel.dtype` iterates
        # `self.parameters()` and raises `StopIteration` for a parameter-free model. Fall back
        # to the dtype of the first floating-point buffer (or float32 if none is set yet).
        for tensor in self.buffers():
            if tensor.is_floating_point():
                return tensor.dtype
        return torch.float32

    @property
    def device(self) -> torch.device:
        # MaxEntScan has no `nn.Parameter`; the base `PreTrainedModel.device` iterates
        # `self.parameters()` and raises `StopIteration` for a parameter-free model. Fall back
        # to the device of the first score-table buffer.
        for tensor in self.buffers():
            return tensor.device
        return torch.device("cpu")
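
The fallback pattern used by the `dtype` and `device` properties above can be illustrated without torch. In this sketch `FakeBuffer` and `first_floating_dtype` are hypothetical stand-ins, not library types, and the `next(...)` call only mimics how iterating an empty parameter list can raise `StopIteration`; the exact base-class code is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class FakeBuffer:
    """Hypothetical stand-in for a tensor buffer; not a torch type."""

    dtype: str
    floating: bool


def first_floating_dtype(buffers, default="float32"):
    # Return on the first floating-point buffer, otherwise fall back to the default.
    for buffer in buffers:
        if buffer.floating:
            return buffer.dtype
    return default


# Taking the first element of an empty iterator raises StopIteration,
# which is the failure mode a parameter-free model has to avoid.
try:
    next(iter([]))
    raised = False
except StopIteration:
    raised = True

buffers = [FakeBuffer("int64", False), FakeBuffer("float64", True)]
print(raised, first_floating_dtype([]), first_floating_dtype(buffers))
```

A `for` loop with an early `return` never raises on an empty collection, which makes it a safe replacement for a bare `next(...)` when the collection may be empty.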