MaxEntScan¶

Maximum-entropy model for scoring short sequence motifs at RNA splice sites.

Disclaimer¶

This is an UNOFFICIAL implementation of Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals by Gene Yeo and Christopher B. Burge.

The OFFICIAL distribution of MaxEntScan is at the Burge Lab MaxEntScan page.

Tip

The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.

The team releasing MaxEntScan did not write a model card for this model, so this model card has been written by the MultiMolecule team.

Model Details¶

MaxEntScan is a maximum-entropy model for the splice donor (5’) and splice acceptor (3’) sequence motifs. It is not a neural network and has no trainable weights. The model parameters are fixed maximum-entropy probability tables estimated by Yeo & Burge (2004) from human splice-site sequences.

Model Specification¶

MaxEntScan is a parameter-free maximum-entropy model: it performs fixed table lookups and contains no learnable weights or floating-point arithmetic that the profiler can attribute to a module, so the parameter, FLOP, and MAC counts in the table below are all reported as zero. The bundled score tables that serve as the model’s fixed parameters are:

score5: a single 16,384-entry me2x5 probability table (4⁷ floats) indexed by the base-4 hash of the 7 non-consensus positions of the 9-mer.
score3: nine overlapping maximum-entropy decomposition tables (me2x3acc1..9) with sizes 4⁷, 4⁷, 4⁷, 4⁷, 4⁷, 4³, 4⁴, 4³, 4⁴ (5 × 16384 + 2 × 64 + 2 × 256 = 82560 floats total).
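
The table sizes and the base-4 indexing described above can be checked with a few lines of plain Python. This is a minimal sketch, not the library's code: the digit assignment (A=0, C=1, G=2, U=3), the most-significant-first digit order, and the choice of offsets 3 and 4 as the two consensus positions of the 9-mer are assumptions, and `base4_index` and `score5_table_index` are hypothetical helper names.

```python
# Illustrative sketch only; encoding and consensus offsets are assumptions.
BASE_TO_DIGIT = {"A": 0, "C": 1, "G": 2, "U": 3}  # assumed base-4 encoding


def base4_index(kmer: str) -> int:
    """Read a k-mer as a base-4 number, most significant base first."""
    index = 0
    for base in kmer:
        index = index * 4 + BASE_TO_DIGIT[base]
    return index


def score5_table_index(ninemer: str) -> int:
    """Drop the two consensus positions (assumed offsets 3 and 4) and hash the other 7."""
    if len(ninemer) != 9:
        raise ValueError(f"expected a 9-mer, got length {len(ninemer)}")
    rest = ninemer[:3] + ninemer[5:]
    return base4_index(rest)


# Table sizes quoted above: five 4^7 tables, then 4^3, 4^4, 4^3, 4^4.
score3_table_sizes = [4**7] * 5 + [4**3, 4**4, 4**3, 4**4]
total_score3_entries = sum(score3_table_sizes)

print(score5_table_index("CAGGUAAGU"))  # index into a 16,384-entry table
print(total_score3_entries)
```

Under these assumptions every 7-mer maps to a distinct integer in the range 0 to 4⁷ − 1, which is why a flat 16,384-entry table is enough for score5.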

Mode	Window	Num Parameters (M)	FLOPs (G)	MACs (G)
score5	9	0.00	0.00	0.00
score3	23	0.00	0.00	0.00

Links¶

Code: multimolecule.maxentscan
Data: Human RefSeq splice-site sequences curated by Yeo and Burge
Paper: Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals
Developed by: Gene Yeo, Christopher B. Burge
Model type: Maximum-entropy splice-site scoring with fixed probability tables for 5’ and 3’ splice sites
Original Repository: Burge Lab MaxEntScan

Usage¶

The model file depends on the multimolecule library. You can install it using pip:

Bash
pip install multimolecule

Direct Use¶

5’ Splice-Site Scoring¶

Python
>>> import torch
>>> from multimolecule import RnaTokenizer, MaxEntScanModel, MaxEntScanConfig

>>> config = MaxEntScanConfig()
>>> model = MaxEntScanModel(config)
>>> tokenizer = RnaTokenizer.from_pretrained("multimolecule/maxentscan-score5")
>>> # MaxEntScan scores a raw fixed-length window; do not add special tokens.
>>> input_ids = tokenizer("CAGGUAAGU", add_special_tokens=False, return_tensors="pt")["input_ids"]
>>> output = model(input_ids)
>>> output.logits.shape
torch.Size([1, 1])

3’ Splice-Site Scoring¶

Python
>>> config = MaxEntScanConfig(mode="score3")
>>> model = MaxEntScanModel(config)
>>> output = model(torch.randint(4, (1, config.window)))
>>> output.logits.shape
torch.Size([1, 1])

Interface¶

Input length: 9 nt fixed window for score5; 23 nt fixed window for score3
Alphabet: ACGU only; unknown / N tokens are clamped onto A before table lookup
Special tokens: do not add (add_special_tokens=False)
inputs_embeds: not supported; the model scores discrete token windows only
Output: single scalar splice-site log-odds score per window
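
The interface rules above (fixed window length, ACGU alphabet, N clamped onto A) can be mirrored in a small standalone check before a window is handed to the model. This is a hedged sketch: `encode_window` is a hypothetical helper that is not part of multimolecule, and the A, C, G, U to 0..3 id mapping is an assumption rather than the tokenizer's documented vocabulary.

```python
# Hypothetical helper, not part of multimolecule.
WINDOW_FOR_MODE = {"score5": 9, "score3": 23}  # window lengths stated above
TOKEN_ID = {"A": 0, "C": 1, "G": 2, "U": 3}  # assumed id mapping


def encode_window(sequence: str, mode: str = "score5") -> list[int]:
    """Check the fixed window length and clamp anything outside ACGU onto A (id 0)."""
    expected = WINDOW_FOR_MODE[mode]
    if len(sequence) != expected:
        raise ValueError(f"{mode} expects {expected} nt, got {len(sequence)}")
    return [TOKEN_ID.get(base, 0) for base in sequence.upper()]


ids = encode_window("CAGGUNAGU")  # the N at offset 5 is scored as if it were A
print(ids)
```

Clamping N onto A keeps every window scorable, but the resulting score then reflects an A at that position, so windows containing ambiguous bases should be interpreted with care.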

Training Details¶

MaxEntScan is not trained. Its maximum-entropy probability tables were estimated once by Yeo & Burge (2004) from a set of human constitutive splice-site sequences using an iterative maximum-entropy procedure. The published tables are reused verbatim.

Scoring Modes¶

score5: scores 5’ (donor) splice sites over a 9-nucleotide window (3 exonic + 6 intronic nucleotides). The score is read from the published me2x5 maximum-entropy probability table combined with the consensus background ratios.
score3: scores 3’ (acceptor) splice sites over a 23-nucleotide window. The 23-mer is decomposed into nine overlapping maximum-entropy submodels following the published maximum-entropy decomposition; the score is the log-ratio of the numerator and denominator submodel products.
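
The score3 log-ratio of submodel products can be sketched as follows. Everything specific here is an assumption to be verified against the original distribution: the (start, length) slices are recalled from the original score3.pl rather than stated on this page, they index into the 21 bases assumed to remain once the two consensus positions are removed from the 23-mer, the logarithm is taken base 2, and the tables are uniform placeholders, not the published values. The consensus and background ratio term is left out.

```python
import math

# Assumed (start, length) slices into the 21 non-consensus bases; verify before use.
NUMERATOR_SLICES = [(0, 7), (7, 7), (14, 7), (4, 7), (11, 7)]
DENOMINATOR_SLICES = [(4, 3), (7, 4), (11, 3), (14, 4)]
DIGIT = {"A": 0, "C": 1, "G": 2, "U": 3}  # assumed base-4 encoding


def base4_index(kmer):
    """Read a k-mer as a base-4 number, most significant base first."""
    index = 0
    for base in kmer:
        index = index * 4 + DIGIT[base]
    return index


def submodel_log_ratio(rest, tables):
    """log2 of (product of 5 numerator lookups) / (product of 4 denominator lookups)."""
    if len(rest) != 21:
        raise ValueError(f"expected 21 non-consensus bases, got {len(rest)}")
    log_score = 0.0
    for table, (start, length) in zip(tables[:5], NUMERATOR_SLICES):
        log_score += math.log2(table[base4_index(rest[start:start + length])])
    for table, (start, length) in zip(tables[5:], DENOMINATOR_SLICES):
        log_score -= math.log2(table[base4_index(rest[start:start + length])])
    return log_score


# Placeholder tables with uniform probabilities, NOT the published me2x3acc values.
toy_tables = [
    [4.0 ** (-length)] * (4 ** length)
    for _, length in NUMERATOR_SLICES + DENOMINATOR_SLICES
]
print(submodel_log_ratio("A" * 21, toy_tables))
```

Summing logarithms instead of multiplying nine small probabilities gives the same ratio while avoiding floating-point underflow. The denominator slices are the regions where neighbouring numerator slices overlap, which is what keeps the overlapping submodels from double-counting those positions.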

Training Data¶

Source: human RefSeq splice-site sequences as described in Yeo & Burge (2004).
Maximum-entropy constraints: pairwise and higher-order positional dependencies within the splice-site window.

The model parameters are the fixed maximum-entropy probability tables distributed as plain-text files with the original Yeo & Burge (2004) MaxEntScan tool: me2x5 for the 5’ scorer and the nine maximum-entropy decomposition matrices me2x3acc1..9 for the 3’ scorer. The consensus and background ratios are fixed constants from the original score5.pl and score3.pl programs.
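
For the 5’ scorer, the way the me2x5 table and the consensus and background ratios combine can be written compactly. This is a sketch under stated assumptions: the logarithm is taken base 2, the consensus dinucleotide is assumed to occupy positions 4 and 5 of the 9-mer $x_1 \dots x_9$, and the symbols $T_{\mathrm{me2x5}}$ (table), $h$ (base-4 hash), $c_1$, $c_2$ (consensus frequencies) and $b$ (background frequencies) are notation introduced here, not names from the original programs.

```latex
S_5(x_1 \dots x_9) = \log_2\!\left(
  \frac{c_1(x_4)\, c_2(x_5)}{b(x_4)\, b(x_5)}
  \cdot T_{\mathrm{me2x5}}\big[\, h(x_1 x_2 x_3 x_6 x_7 x_8 x_9) \,\big]
\right)
```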

Training Procedure¶

Pre-training¶

MaxEntScan does not use neural-network pre-training. Its maximum-entropy probability tables are reused from the original MaxEntScan distribution.

Citation¶

BibTeX
@article{yeo2004maximum,
  author    = {Yeo, Gene and Burge, Christopher B.},
  title     = {Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals},
  journal   = {Journal of Computational Biology},
  volume    = {11},
  number    = {2-3},
  pages     = {377--394},
  year      = {2004},
  publisher = {Mary Ann Liebert, Inc.},
  doi       = {10.1089/1066527041410418}
}

Note

The artifacts distributed in this repository are part of the MultiMolecule project. If MultiMolecule supports your research, please cite the MultiMolecule project as follows:

BibTeX
@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}

Contact¶

Please use GitHub issues of MultiMolecule for any questions or comments on the model card.

Please contact the authors of the MaxEntScan paper for questions or comments on the paper/model.

License¶

This model implementation is licensed under the GNU Affero General Public License.

For additional terms and clarifications, please refer to our License FAQ.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later

API Reference¶

MaxEntScanConfig ¶

Bases: PreTrainedConfig

This is the configuration class to store the configuration of a MaxEntScanModel. It is used to instantiate a MaxEntScan scorer according to the specified arguments, defining the model behavior. Instantiating a configuration with the defaults will yield a configuration equivalent to the 5’ splice-site scorer (score5) of the original MaxEntScan tool.

MaxEntScan is a maximum-entropy model and has no trainable weights. The score tables are fixed maximum-entropy probability tables published with the original tool and are registered as buffers on the model.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Parameters:

Name	Type	Description	Default
`vocab_size` ¶	`int`	Vocabulary size of the MaxEntScan model. Defines the number of different tokens that can be represented by the `input_ids` passed when calling `MaxEntScanModel`. Defaults to 5 (the streamline RNA alphabet `ACGUN`).	`5`
`mode` ¶	`str`	Which splice-site scorer to use. `"score5"` scores 5’ (donor) splice sites, `"score3"` scores 3’ (acceptor) splice sites.	`'score5'`
`window` ¶	`int | None`	The fixed length of the input window. Must match `mode`: 9 for `score5`, 23 for `score3`. If `None`, it is derived from `mode`.	`None`
`num_labels` ¶	`int`	Number of output labels. MaxEntScan emits a single maximum-entropy score, so this must be 1.	`1`

Examples:

Python Console Session
>>> from multimolecule import MaxEntScanConfig, MaxEntScanModel
>>> # Initializing a MaxEntScan multimolecule/maxentscan-score5 style configuration
>>> configuration = MaxEntScanConfig()
>>> # Initializing a model (with random buffers) from the multimolecule/maxentscan-score5 style configuration
>>> model = MaxEntScanModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config

Source code in multimolecule/models/maxentscan/configuration_maxentscan.py

Python
class MaxEntScanConfig(PreTrainedConfig):
    r"""
    This is the configuration class to store the configuration of a
    [`MaxEntScanModel`][multimolecule.models.MaxEntScanModel]. It is used to instantiate a MaxEntScan scorer according
    to the specified arguments, defining the model behavior. Instantiating a configuration with the defaults will yield
    a configuration equivalent to the 5' splice-site scorer (`score5`) of the original MaxEntScan tool.

    MaxEntScan is a maximum-entropy model and has no trainable weights. The score tables are fixed maximum-entropy
    probability tables published with the original tool and are registered as buffers on the model.

    Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to
    control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]
    for more information.

    Args:
        vocab_size:
            Vocabulary size of the MaxEntScan model. Defines the number of different tokens that can be represented by
            the `input_ids` passed when calling [`MaxEntScanModel`].
            Defaults to 5 (the streamline RNA alphabet `ACGUN`).
        mode:
            Which splice-site scorer to use. `"score5"` scores 5' (donor) splice sites, `"score3"` scores 3' (acceptor)
            splice sites.
        window:
            The fixed length of the input window. Must match `mode`: 9 for `score5`, 23 for `score3`. If `None`, it is
            derived from `mode`.
        num_labels:
            Number of output labels. MaxEntScan emits a single maximum-entropy score, so this must be 1.

    Examples:
        >>> from multimolecule import MaxEntScanConfig, MaxEntScanModel
        >>> # Initializing a MaxEntScan multimolecule/maxentscan-score5 style configuration
        >>> configuration = MaxEntScanConfig()
        >>> # Initializing a model (with random buffers) from the multimolecule/maxentscan-score5 style configuration
        >>> model = MaxEntScanModel(configuration)
        >>> # Accessing the model configuration
        >>> configuration = model.config
    """

    model_type = "maxentscan"

    def __init__(
        self,
        vocab_size: int = 5,
        mode: str = "score5",
        window: int | None = None,
        hidden_size: int = 1,
        head: HeadConfig | None = None,
        num_labels: int = 1,
        bos_token_id: int | None = None,
        eos_token_id: int | None = None,
        pad_token_id: int = 4,
        **kwargs,
    ):
        super().__init__(num_labels=num_labels, pad_token_id=pad_token_id, **kwargs)
        if mode not in WINDOW_FOR_MODE:
            raise ValueError(f"`mode` must be one of {sorted(WINDOW_FOR_MODE)}, got {mode!r}")
        expected_window = WINDOW_FOR_MODE[mode]
        if window is None:
            window = expected_window
        if window != expected_window:
            raise ValueError(f"`window` ({window}) does not match `mode` ({mode!r}); expected window {expected_window}")
        if num_labels != 1:
            raise ValueError(f"MaxEntScan emits a single score; `num_labels` must be 1, got {num_labels}")
        if hidden_size != 1:
            raise ValueError(f"MaxEntScan emits a single scalar feature; `hidden_size` must be 1, got {hidden_size}")
        self.bos_token_id = bos_token_id  # type: ignore[assignment]
        self.eos_token_id = eos_token_id  # type: ignore[assignment]
        self.vocab_size = vocab_size
        self.mode = mode
        self.window = window
        # The maximum-entropy score is a single scalar feature; the downstream regression head projects from it.
        self.hidden_size = hidden_size
        self.num_labels = num_labels
        self.problem_type = "regression"
        self.head = HeadConfig(head) if head is not None else HeadConfig(num_labels=1, problem_type="regression")

MaxEntScanForSequencePrediction ¶

Bases: MaxEntScanPreTrainedModel

MaxEntScan scorer with sequence-level regression loss support.

Examples:

Python Console Session
>>> import torch
>>> from multimolecule import MaxEntScanConfig, MaxEntScanForSequencePrediction
>>> config = MaxEntScanConfig()
>>> model = MaxEntScanForSequencePrediction(config)
>>> output = model(torch.randint(4, (1, config.window)), labels=torch.randn(1, 1))
>>> output["logits"].shape
torch.Size([1, 1])

Source code in multimolecule/models/maxentscan/modeling_maxentscan.py

Python
class MaxEntScanForSequencePrediction(MaxEntScanPreTrainedModel):
    """
    MaxEntScan scorer with sequence-level regression loss support.

    Examples:
        >>> import torch
        >>> from multimolecule import MaxEntScanConfig, MaxEntScanForSequencePrediction
        >>> config = MaxEntScanConfig()
        >>> model = MaxEntScanForSequencePrediction(config)
        >>> output = model(torch.randint(4, (1, config.window)), labels=torch.randn(1, 1))
        >>> output["logits"].shape
        torch.Size([1, 1])
    """

    def __init__(self, config: MaxEntScanConfig):
        super().__init__(config)
        self.model = MaxEntScanModel(config)
        head = config.head
        if head is None:
            raise ValueError("MaxEntScanForSequencePrediction requires `config.head` to be set")
        # MaxEntScan is parameter-free: the score is passed straight to `Criterion` with no trainable head.
        self.criterion = Criterion(head)
        # Initialize weights and apply final processing
        self.post_init()

    @can_return_tuple
    def forward(
        self,
        input_ids: Tensor | NestedTensor | None = None,
        attention_mask: Tensor | None = None,
        inputs_embeds: Tensor | NestedTensor | None = None,
        labels: Tensor | None = None,
        **kwargs: Any,
    ) -> tuple[Tensor, ...] | SequencePredictorOutput:
        outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
            inputs_embeds=inputs_embeds,
            return_dict=True,
            **kwargs,
        )
        logits = outputs.logits
        loss = self.criterion(logits, labels) if labels is not None else None
        return SequencePredictorOutput(loss=loss, logits=logits)

MaxEntScanModel ¶

Bases: MaxEntScanPreTrainedModel

Maximum-entropy splice-site scorer (Yeo & Burge, 2004).

The model has no trainable weights. It exposes a single maximum-entropy score per input window through fixed score-table buffers populated from the published Yeo & Burge (2004) tables.

Examples:

Python Console Session
>>> import torch
>>> from multimolecule import MaxEntScanConfig, MaxEntScanModel
>>> config = MaxEntScanConfig()
>>> model = MaxEntScanModel(config)
>>> output = model(torch.randint(4, (1, config.window)))
>>> output["logits"].shape
torch.Size([1, 1])

Source code in multimolecule/models/maxentscan/modeling_maxentscan.py

Python
class MaxEntScanModel(MaxEntScanPreTrainedModel):
    """
    Maximum-entropy splice-site scorer (Yeo & Burge, 2004).

    The model has no trainable weights. It exposes a single maximum-entropy score per input window through fixed
    score-table buffers populated from the published Yeo & Burge (2004) tables.

    Examples:
        >>> import torch
        >>> from multimolecule import MaxEntScanConfig, MaxEntScanModel
        >>> config = MaxEntScanConfig()
        >>> model = MaxEntScanModel(config)
        >>> output = model(torch.randint(4, (1, config.window)))
        >>> output["logits"].shape
        torch.Size([1, 1])
    """

    def __init__(self, config: MaxEntScanConfig):
        super().__init__(config)
        self.mode = config.mode
        self.window = config.window
        self.scorer = MaxEntScanScorer(config)
        # Initialize weights and apply final processing
        self.post_init()

    @can_return_tuple
    def forward(
        self,
        input_ids: Tensor | NestedTensor | None = None,
        attention_mask: Tensor | None = None,
        inputs_embeds: Tensor | NestedTensor | None = None,
        **kwargs: Any,
    ) -> SequencePredictorOutput:
        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
        if input_ids is None and inputs_embeds is None:
            raise ValueError("You have to specify either input_ids or inputs_embeds")
        if inputs_embeds is not None:
            raise ValueError("MaxEntScan scores discrete token windows and does not support inputs_embeds")
        assert input_ids is not None  # narrowed: both-None and inputs_embeds-not-None are rejected above
        if isinstance(input_ids, NestedTensor):
            input_ids = input_ids.tensor
        if input_ids.dim() == 1:
            input_ids = input_ids.unsqueeze(0)
        if input_ids.size(1) != self.window:
            raise ValueError(
                f"MaxEntScan {self.mode} expects a fixed window of {self.window} tokens, got {input_ids.size(1)}"
            )
        score = self.scorer(input_ids)
        # The maximum-entropy score is exposed through `logits`; the downstream head reads it via `output_name`.
        return SequencePredictorOutput(logits=score)

MaxEntScanPreTrainedModel ¶

Bases: PreTrainedModel

An abstract class to handle the fixed maximum-entropy score tables and a simple interface for downloading and loading the published MaxEntScan parameters.

Source code in multimolecule/models/maxentscan/modeling_maxentscan.py

Python
class MaxEntScanPreTrainedModel(PreTrainedModel):
    """
    An abstract class to handle the fixed maximum-entropy score tables and a simple interface for downloading and
    loading the published MaxEntScan parameters.
    """

    config_class = MaxEntScanConfig
    base_model_prefix = "model"
    _can_record_outputs: dict[str, Any] | None = None

    @torch.no_grad()
    def _init_weights(self, module):
        # MaxEntScan has no trainable parameters; nothing to initialize.
        return

    @property
    def dtype(self) -> torch.dtype:
        # MaxEntScan has no `nn.Parameter`; the base `PreTrainedModel.dtype` iterates
        # `self.parameters()` and raises `StopIteration` for a parameter-free model. Fall back
        # to the dtype of the first floating-point buffer (or float32 if none is set yet).
        for tensor in self.buffers():
            if tensor.is_floating_point():
                return tensor.dtype
        return torch.float32

    @property
    def device(self) -> torch.device:
        # MaxEntScan has no `nn.Parameter`; the base `PreTrainedModel.device` iterates
        # `self.parameters()` and raises `StopIteration` for a parameter-free model. Fall back
        # to the device of the first score-table buffer.
        for tensor in self.buffers():
            return tensor.device
        return torch.device("cpu")
