MaxEntScan
Maximum-entropy model for scoring short sequence motifs at RNA splice sites.
Disclaimer
This is an UNOFFICIAL implementation of Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals by Gene Yeo, et al.
The OFFICIAL distribution of MaxEntScan is at the Burge Lab MaxEntScan page.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.
The team that released MaxEntScan did not write a model card for this model, so this model card was written by the MultiMolecule team.
Model Details
MaxEntScan is a maximum-entropy model for the splice donor (5’) and splice acceptor (3’) sequence motifs. It is not a neural network and has no trainable weights. The model parameters are fixed maximum-entropy probability tables estimated by Yeo & Burge (2004) from human splice-site sequences.
Model Specification
MaxEntScan is a parameter-free maximum-entropy model. It performs fixed table lookups and contains no learnable weights or floating-point arithmetic that the profiler can attribute to a module. The bundled score tables that serve as the model’s fixed parameters are:
score5: a single 16,384-entry me2x5 probability table (4⁷ floats) indexed by the base-4 hash of the 7 non-consensus positions of the 9-mer.
score3: nine overlapping maximum-entropy decomposition tables (me2x3acc1..9) with sizes 4⁷, 4⁷, 4⁷, 4⁷, 4⁷, 4³, 4⁴, 4³, 4⁴ (5 × 16384 + 2 × 64 + 2 × 256 = 82560 floats total).
| Mode | Window | Num Parameters (M) | FLOPs (G) | MACs (G) |
|---|---|---|---|---|
| score5 | 9 | 0.00 | 0.00 | 0.00 |
| score3 | 23 | | | |
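The table sizes quoted above follow directly from the k-mer widths that index each table. A minimal sketch of that arithmetic and of the base-4 indexing it implies. The helper names `table_entries` and `base4_index`, and the digit order (A, C, G, U mapped to 0..3, first position most significant), are illustrative assumptions and not part of the MultiMolecule API:

```python
# Widths (in nucleotides) of the k-mers that index each score table, as listed above.
SCORE5_WIDTHS = [7]
SCORE3_WIDTHS = [7, 7, 7, 7, 7, 3, 4, 3, 4]


def table_entries(widths):
    """Total number of entries across tables indexed by k-mers of the given widths."""
    return sum(4 ** k for k in widths)


def base4_index(kmer, alphabet="ACGU"):
    """Assumed indexing: read the k-mer as a base-4 number, first position most significant."""
    index = 0
    for base in kmer:
        index = index * 4 + alphabet.index(base)
    return index


print(table_entries(SCORE5_WIDTHS))  # 16384
print(table_entries(SCORE3_WIDTHS))  # 82560
print(base4_index("AAAAAAC"))        # 1
```

A 7-mer therefore addresses exactly one of the 4⁷ entries, which is why each full-width table holds 16,384 values.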
Links
Usage
The model file depends on the multimolecule library. You can install it using pip:
| Bash |
|---|
| pip install multimolecule
|
Direct Use
5’ Splice-Site Scoring
| Python |
|---|
| >>> import torch
>>> from multimolecule import RnaTokenizer, MaxEntScanModel, MaxEntScanConfig
>>> config = MaxEntScanConfig()
>>> model = MaxEntScanModel(config)
>>> tokenizer = RnaTokenizer.from_pretrained("multimolecule/maxentscan-score5")
>>> # MaxEntScan scores a raw fixed-length window; do not add special tokens.
>>> input_ids = tokenizer("CAGGUAAGU", add_special_tokens=False, return_tensors="pt")["input_ids"]
>>> output = model(input_ids)
>>> output.logits.shape
torch.Size([1, 1])
|
3’ Splice-Site Scoring
| Python |
|---|
| >>> config = MaxEntScanConfig(mode="score3")
>>> model = MaxEntScanModel(config)
>>> output = model(torch.randint(4, (1, config.window)))
>>> output.logits.shape
torch.Size([1, 1])
|
Interface
- Input length: 9 nt fixed window for score5; 23 nt fixed window for score3
- Alphabet: ACGU only; unknown / N tokens are clamped onto A before table lookup
- Special tokens: do not add (add_special_tokens=False)
- inputs_embeds: not supported; the model scores discrete token windows only
- Output: single scalar splice-site log-odds score per window
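The interface rules above can be checked before a window ever reaches the model. The following is a minimal sketch in plain Python, not the library's tokenizer. The helper `encode_window`, the token ids (A, C, G, U, N mapped to 0..4) and the acceptance of DNA-style `T` are assumptions made for illustration:

```python
# Assumed token ids for the streamline RNA alphabet "ACGUN" (vocab_size=5).
TOKEN_IDS = {"A": 0, "C": 1, "G": 2, "U": 3, "N": 4}
WINDOW_FOR_MODE = {"score5": 9, "score3": 23}


def encode_window(sequence, mode="score5"):
    """Hypothetical helper: validate a window and map it to table-ready ids in 0..3."""
    expected = WINDOW_FOR_MODE[mode]
    sequence = sequence.upper().replace("T", "U")  # assumption: accept DNA-style input
    if len(sequence) != expected:
        raise ValueError(f"{mode} expects {expected} nt, got {len(sequence)}")
    # Anything outside ACGU is treated as N.
    ids = [TOKEN_IDS.get(base, TOKEN_IDS["N"]) for base in sequence]
    # Clamp unknown / N tokens onto A (id 0) before any table lookup.
    return [0 if token > 3 else token for token in ids]


print(encode_window("CAGGUAAGU"))  # [1, 0, 2, 2, 3, 0, 0, 2, 3]
print(encode_window("CAGGUNAGU"))  # [1, 0, 2, 2, 3, 0, 0, 2, 3]
```

Because N is clamped onto A, a window containing N scores the same as the same window with A in that position, so ambiguous bases are worth filtering upstream if that matters for the analysis.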
Training Details
MaxEntScan is not trained. Its maximum-entropy probability tables were estimated once by Yeo & Burge (2004) from a set of human constitutive splice-site sequences using an iterative maximum-entropy procedure. The published tables are reused verbatim.
Scoring Modes
score5: scores 5’ (donor) splice sites over a 9-nucleotide window (3 exonic + 6 intronic nucleotides). The score is read from the published me2x5 maximum-entropy probability table combined with the consensus background ratios.
score3: scores 3’ (acceptor) splice sites over a 23-nucleotide window. The 23-mer is decomposed into nine overlapping maximum-entropy submodels following the published maximum-entropy decomposition; the score is the log-ratio of the numerator and denominator submodel products.
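Both scoring modes can be sketched in a few lines of plain Python. This is an illustration of the arithmetic under stated assumptions, not the MultiMolecule implementation: the consensus positions (GU at offsets 3 and 4 of the 9-mer, AG at offsets 18 and 19 of the 23-mer), the submodel slices in `NUMERATORS` and `DENOMINATORS`, and the base-4 digit order are assumptions chosen to be consistent with the table widths listed under Model Specification. The consensus/background ratio is passed in as a placeholder, and the tables are uniform toy values rather than the published ones.

```python
import math

BASES = "ACGU"


def base4_index(kmer):
    """Read a k-mer as a base-4 number (A, C, G, U -> 0..3), first position most significant."""
    index = 0
    for base in kmer:
        index = index * 4 + BASES.index(base)
    return index


def score5_sketch(window, me2x5, consensus_ratio=1.0):
    """log2 of (consensus ratio x table probability) for a 9-nt donor window."""
    assert len(window) == 9
    rest = window[:3] + window[5:]  # drop the two assumed consensus positions, keep 7
    return math.log2(consensus_ratio * me2x5[base4_index(rest)])


# Assumed submodel layout over the 21 non-consensus positions, as (start, end) slices.
NUMERATORS = [(0, 7), (7, 14), (14, 21), (4, 11), (11, 18)]
DENOMINATORS = [(4, 7), (7, 11), (11, 14), (14, 18)]


def score3_sketch(window, tables, consensus_ratio=1.0):
    """log2 of consensus ratio x (numerator product / denominator product) for a 23-nt window."""
    assert len(window) == 23
    rest = window[:18] + window[20:]  # drop the two assumed consensus positions, keep 21
    slices = NUMERATORS + DENOMINATORS
    probs = [tables[i][base4_index(rest[a:b])] for i, (a, b) in enumerate(slices)]
    numerator = math.prod(probs[:5])
    denominator = math.prod(probs[5:])
    return math.log2(consensus_ratio * numerator / denominator)


# Toy uniform tables (NOT the published values), just to exercise the arithmetic.
toy_me2x5 = [4.0 ** -7] * 4 ** 7
toy_tables = [[4.0 ** -(b - a)] * 4 ** (b - a) for a, b in NUMERATORS + DENOMINATORS]

print(score5_sketch("CAGGUAAGU", toy_me2x5))                 # -14.0
print(score3_sketch("U" * 18 + "AG" + "GUA", toy_tables))    # -42.0
```

With uniform tables every window gets the same score; the published tables are what make the scores sequence-specific.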
Training Data
- Source: human RefSeq splice-site sequences as described in Yeo & Burge (2004).
- Maximum-entropy constraints: pairwise and higher-order positional dependencies within the splice-site window.
The model parameters are the fixed maximum-entropy probability tables distributed as plain-text files with the original Yeo & Burge (2004) MaxEntScan tool: me2x5 for the 5’ scorer and the nine maximum-entropy decomposition matrices me2x3acc1..9 for the 3’ scorer. The consensus and background ratios are fixed constants from the original score5.pl and score3.pl programs.
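As a rough illustration of what reusing a plain-text table involves, the sketch below parses a table and infers the k-mer width from its length. The one-value-per-line layout and the helper names `kmer_width` and `load_table` are assumptions for this example; check the files in the original distribution before relying on them.

```python
import io


def kmer_width(n_entries):
    """Return k such that 4**k == n_entries, or raise if the count is not a power of 4."""
    k = 0
    while 4 ** k < n_entries:
        k += 1
    if 4 ** k != n_entries:
        raise ValueError(f"expected a power-of-4 number of entries, got {n_entries}")
    return k


def load_table(handle):
    """Hypothetical loader: parse one floating-point value per non-empty line."""
    values = [float(line) for line in handle if line.strip()]
    return values, kmer_width(len(values))


# Toy stand-in for a table file (NOT published values): 4**3 uniform entries.
toy_file = io.StringIO("\n".join(["0.015625"] * 64))
table, k = load_table(toy_file)
print(len(table), k)  # 64 3
```

The width check is a cheap guard against truncated files: a me2x5-sized table should report a width of 7.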
Training Procedure
Pre-training
MaxEntScan does not use neural-network pre-training. Its maximum-entropy probability tables are reused from the original MaxEntScan distribution.
Citation
| BibTeX |
|---|
| @article{yeo2004maximum,
author = {Yeo, Gene and Burge, Christopher B.},
title = {Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals},
journal = {Journal of Computational Biology},
volume = {11},
number = {2-3},
pages = {377--394},
year = {2004},
publisher = {Mary Ann Liebert, Inc.},
doi = {10.1089/1066527041410418}
}
|
Note
The artifacts distributed in this repository are part of the MultiMolecule project.
If MultiMolecule supports your research, please cite the MultiMolecule project as follows:
| BibTeX |
|---|
| @software{chen_2024_12638419,
author = {Chen, Zhiyuan and Zhu, Sophia Y.},
title = {MultiMolecule},
doi = {10.5281/zenodo.12638419},
publisher = {Zenodo},
url = {https://doi.org/10.5281/zenodo.12638419},
year = 2024,
month = may,
day = 4
}
|
Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the MaxEntScan paper for questions or comments on the paper/model.
License
This model implementation is licensed under the GNU Affero General Public License.
For additional terms and clarifications, please refer to our License FAQ.
| Text Only |
|---|
| SPDX-License-Identifier: AGPL-3.0-or-later
|
API Reference
MaxEntScanConfig
Bases: PreTrainedConfig
This is the configuration class to store the configuration of a
MaxEntScanModel. It is used to instantiate a MaxEntScan scorer according
to the specified arguments, defining the model behavior. Instantiating a configuration with the defaults will yield
a configuration equivalent to the 5’ splice-site scorer (score5) of the original MaxEntScan tool.
MaxEntScan is a maximum-entropy model and has no trainable weights. The score tables are fixed maximum-entropy
probability tables published with the original tool and are registered as buffers on the model.
Configuration objects inherit from PreTrainedConfig and can be used to
control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| vocab_size | int | Vocabulary size of the MaxEntScan model. Defines the number of different tokens that can be represented by the input_ids passed when calling MaxEntScanModel. Defaults to 5 (the streamline RNA alphabet ACGUN). | 5 |
| mode | str | Which splice-site scorer to use. "score5" scores 5’ (donor) splice sites, "score3" scores 3’ (acceptor) splice sites. | 'score5' |
| window | int \| None | The fixed length of the input window. Must match mode: 9 for score5, 23 for score3. If None, it is derived from mode. | None |
| num_labels | int | Number of output labels. MaxEntScan emits a single maximum-entropy score, so this must be 1. | 1 |
Examples:
| Python Console Session |
|---|
| >>> from multimolecule import MaxEntScanConfig, MaxEntScanModel
>>> # Initializing a MaxEntScan multimolecule/maxentscan-score5 style configuration
>>> configuration = MaxEntScanConfig()
>>> # Initializing a model (with random buffers) from the multimolecule/maxentscan-score5 style configuration
>>> model = MaxEntScanModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
|
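The interaction between `mode` and `window` can be tried out without installing the library. The function below is a standalone mirror of the checks performed in the configuration's constructor, written for illustration. The name `resolve_window` is hypothetical, and the mapping values are taken from the parameter descriptions (9 for score5, 23 for score3):

```python
# Window length required by each mode, per the parameter descriptions.
WINDOW_FOR_MODE = {"score5": 9, "score3": 23}


def resolve_window(mode="score5", window=None):
    """Standalone mirror of the mode/window checks; not the library function itself."""
    if mode not in WINDOW_FOR_MODE:
        raise ValueError(f"`mode` must be one of {sorted(WINDOW_FOR_MODE)}, got {mode!r}")
    expected = WINDOW_FOR_MODE[mode]
    if window is None:
        return expected  # derive the window from the mode
    if window != expected:
        raise ValueError(f"`window` ({window}) does not match `mode` ({mode!r}); expected window {expected}")
    return window


print(resolve_window())          # 9
print(resolve_window("score3"))  # 23
try:
    resolve_window("score5", window=23)
except ValueError as error:
    print(error)
```

In practice this means `window` rarely needs to be passed at all; setting `mode` is enough.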
Source code in multimolecule/models/maxentscan/configuration_maxentscan.py
| Python |
|---|
| class MaxEntScanConfig(PreTrainedConfig):
    r"""
    This is the configuration class to store the configuration of a
    [`MaxEntScanModel`][multimolecule.models.MaxEntScanModel]. It is used to instantiate a MaxEntScan scorer according
    to the specified arguments, defining the model behavior. Instantiating a configuration with the defaults will yield
    a configuration equivalent to the 5' splice-site scorer (`score5`) of the original MaxEntScan tool.

    MaxEntScan is a maximum-entropy model and has no trainable weights. The score tables are fixed maximum-entropy
    probability tables published with the original tool and are registered as buffers on the model.

    Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to
    control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]
    for more information.

    Args:
        vocab_size:
            Vocabulary size of the MaxEntScan model. Defines the number of different tokens that can be represented by
            the `input_ids` passed when calling [`MaxEntScanModel`].
            Defaults to 5 (the streamline RNA alphabet `ACGUN`).
        mode:
            Which splice-site scorer to use. `"score5"` scores 5' (donor) splice sites, `"score3"` scores 3' (acceptor)
            splice sites.
        window:
            The fixed length of the input window. Must match `mode`: 9 for `score5`, 23 for `score3`. If `None`, it is
            derived from `mode`.
        num_labels:
            Number of output labels. MaxEntScan emits a single maximum-entropy score, so this must be 1.

    Examples:
        >>> from multimolecule import MaxEntScanConfig, MaxEntScanModel
        >>> # Initializing a MaxEntScan multimolecule/maxentscan-score5 style configuration
        >>> configuration = MaxEntScanConfig()
        >>> # Initializing a model (with random buffers) from the multimolecule/maxentscan-score5 style configuration
        >>> model = MaxEntScanModel(configuration)
        >>> # Accessing the model configuration
        >>> configuration = model.config
    """

    model_type = "maxentscan"

    def __init__(
        self,
        vocab_size: int = 5,
        mode: str = "score5",
        window: int | None = None,
        hidden_size: int = 1,
        head: HeadConfig | None = None,
        num_labels: int = 1,
        bos_token_id: int | None = None,
        eos_token_id: int | None = None,
        pad_token_id: int = 4,
        **kwargs,
    ):
        super().__init__(num_labels=num_labels, pad_token_id=pad_token_id, **kwargs)
        if mode not in WINDOW_FOR_MODE:
            raise ValueError(f"`mode` must be one of {sorted(WINDOW_FOR_MODE)}, got {mode!r}")
        expected_window = WINDOW_FOR_MODE[mode]
        if window is None:
            window = expected_window
        if window != expected_window:
            raise ValueError(f"`window` ({window}) does not match `mode` ({mode!r}); expected window {expected_window}")
        if num_labels != 1:
            raise ValueError(f"MaxEntScan emits a single score; `num_labels` must be 1, got {num_labels}")
        if hidden_size != 1:
            raise ValueError(f"MaxEntScan emits a single scalar feature; `hidden_size` must be 1, got {hidden_size}")
        self.bos_token_id = bos_token_id  # type: ignore[assignment]
        self.eos_token_id = eos_token_id  # type: ignore[assignment]
        self.vocab_size = vocab_size
        self.mode = mode
        self.window = window
        # The maximum-entropy score is a single scalar feature; the downstream regression head projects from it.
        self.hidden_size = hidden_size
        self.num_labels = num_labels
        self.problem_type = "regression"
        self.head = HeadConfig(head) if head is not None else HeadConfig(num_labels=1, problem_type="regression")
|
MaxEntScanForSequencePrediction
Bases: MaxEntScanPreTrainedModel
MaxEntScan scorer with sequence-level regression loss support.
Examples:
| Python Console Session |
|---|
| >>> import torch
>>> from multimolecule import MaxEntScanConfig, MaxEntScanForSequencePrediction
>>> config = MaxEntScanConfig()
>>> model = MaxEntScanForSequencePrediction(config)
>>> output = model(torch.randint(4, (1, config.window)), labels=torch.randn(1, 1))
>>> output["logits"].shape
torch.Size([1, 1])
|
Source code in multimolecule/models/maxentscan/modeling_maxentscan.py
| Python |
|---|
| class MaxEntScanForSequencePrediction(MaxEntScanPreTrainedModel):
    """
    MaxEntScan scorer with sequence-level regression loss support.

    Examples:
        >>> import torch
        >>> from multimolecule import MaxEntScanConfig, MaxEntScanForSequencePrediction
        >>> config = MaxEntScanConfig()
        >>> model = MaxEntScanForSequencePrediction(config)
        >>> output = model(torch.randint(4, (1, config.window)), labels=torch.randn(1, 1))
        >>> output["logits"].shape
        torch.Size([1, 1])
    """

    def __init__(self, config: MaxEntScanConfig):
        super().__init__(config)
        self.model = MaxEntScanModel(config)
        head = config.head
        if head is None:
            raise ValueError("MaxEntScanForSequencePrediction requires `config.head` to be set")
        # MaxEntScan is parameter-free: the score is passed straight to `Criterion` with no trainable head.
        self.criterion = Criterion(head)
        # Initialize weights and apply final processing
        self.post_init()

    @can_return_tuple
    def forward(
        self,
        input_ids: Tensor | NestedTensor | None = None,
        attention_mask: Tensor | None = None,
        inputs_embeds: Tensor | NestedTensor | None = None,
        labels: Tensor | None = None,
        **kwargs: Any,
    ) -> tuple[Tensor, ...] | SequencePredictorOutput:
        outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
            inputs_embeds=inputs_embeds,
            return_dict=True,
            **kwargs,
        )
        logits = outputs.logits
        loss = self.criterion(logits, labels) if labels is not None else None
        return SequencePredictorOutput(loss=loss, logits=logits)
|
MaxEntScanModel
Bases: MaxEntScanPreTrainedModel
Maximum-entropy splice-site scorer (Yeo & Burge, 2004).
The model has no trainable weights. It exposes a single maximum-entropy score per input window through fixed
score-table buffers populated from the published Yeo & Burge (2004) tables.
Examples:
| Python Console Session |
|---|
| >>> import torch
>>> from multimolecule import MaxEntScanConfig, MaxEntScanModel
>>> config = MaxEntScanConfig()
>>> model = MaxEntScanModel(config)
>>> output = model(torch.randint(4, (1, config.window)))
>>> output["logits"].shape
torch.Size([1, 1])
|
Source code in multimolecule/models/maxentscan/modeling_maxentscan.py
| Python |
|---|
| class MaxEntScanModel(MaxEntScanPreTrainedModel):
    """
    Maximum-entropy splice-site scorer (Yeo & Burge, 2004).

    The model has no trainable weights. It exposes a single maximum-entropy score per input window through fixed
    score-table buffers populated from the published Yeo & Burge (2004) tables.

    Examples:
        >>> import torch
        >>> from multimolecule import MaxEntScanConfig, MaxEntScanModel
        >>> config = MaxEntScanConfig()
        >>> model = MaxEntScanModel(config)
        >>> output = model(torch.randint(4, (1, config.window)))
        >>> output["logits"].shape
        torch.Size([1, 1])
    """

    def __init__(self, config: MaxEntScanConfig):
        super().__init__(config)
        self.mode = config.mode
        self.window = config.window
        self.scorer = MaxEntScanScorer(config)
        # Initialize weights and apply final processing
        self.post_init()

    @can_return_tuple
    def forward(
        self,
        input_ids: Tensor | NestedTensor | None = None,
        attention_mask: Tensor | None = None,
        inputs_embeds: Tensor | NestedTensor | None = None,
        **kwargs: Any,
    ) -> SequencePredictorOutput:
        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
        if input_ids is None and inputs_embeds is None:
            raise ValueError("You have to specify either input_ids or inputs_embeds")
        if inputs_embeds is not None:
            raise ValueError("MaxEntScan scores discrete token windows and does not support inputs_embeds")
        assert input_ids is not None  # narrowed: both-None and inputs_embeds-not-None are rejected above
        if isinstance(input_ids, NestedTensor):
            input_ids = input_ids.tensor
        if input_ids.dim() == 1:
            input_ids = input_ids.unsqueeze(0)
        if input_ids.size(1) != self.window:
            raise ValueError(
                f"MaxEntScan {self.mode} expects a fixed window of {self.window} tokens, "
                f"got {input_ids.size(1)}"
            )
        score = self.scorer(input_ids)
        # The maximum-entropy score is exposed through `logits`; the downstream head reads it via `output_name`.
        return SequencePredictorOutput(logits=score)
|
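Because `forward` rejects any input whose length differs from the configured window, scanning a longer sequence means cutting it into fixed-length windows first. A minimal sketch of that step in plain Python; the helper `iter_windows` is hypothetical and the sequence is a toy example, not real data:

```python
def iter_windows(sequence, window):
    """Yield (offset, subsequence) for every full-length window in `sequence`."""
    for start in range(len(sequence) - window + 1):
        yield start, sequence[start:start + window]


transcript = "GGCAGGUAAGUAUC"  # toy sequence, not real data
for offset, nine_mer in iter_windows(transcript, 9):
    print(offset, nine_mer)
```

Each yielded 9-mer can then be tokenized without special tokens and scored individually, or the windows can be stacked into one batch of shape `(num_windows, 9)`. A sequence shorter than the window yields nothing.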
MaxEntScanPreTrainedModel
Bases: PreTrainedModel
An abstract class to handle the fixed maximum-entropy score tables and a simple interface for downloading and
loading the published MaxEntScan parameters.
Source code in multimolecule/models/maxentscan/modeling_maxentscan.py
| Python |
|---|
| class MaxEntScanPreTrainedModel(PreTrainedModel):
    """
    An abstract class to handle the fixed maximum-entropy score tables and a simple interface for downloading and
    loading the published MaxEntScan parameters.
    """

    config_class = MaxEntScanConfig
    base_model_prefix = "model"
    _can_record_outputs: dict[str, Any] | None = None

    @torch.no_grad()
    def _init_weights(self, module):
        # MaxEntScan has no trainable parameters; nothing to initialize.
        return

    @property
    def dtype(self) -> torch.dtype:
        # MaxEntScan has no `nn.Parameter`; the base `PreTrainedModel.dtype` iterates
        # `self.parameters()` and raises `StopIteration` for a parameter-free model. Fall back
        # to the dtype of the first floating-point buffer (or float32 if none is set yet).
        for tensor in self.buffers():
            if tensor.is_floating_point():
                return tensor.dtype
        return torch.float32

    @property
    def device(self) -> torch.device:
        # MaxEntScan has no `nn.Parameter`; the base `PreTrainedModel.device` iterates
        # `self.parameters()` and raises `StopIteration` for a parameter-free model. Fall back
        # to the device of the first score-table buffer.
        for tensor in self.buffers():
            return tensor.device
        return torch.device("cpu")
|