MaxEntScan¶
Maximum-entropy model for scoring short sequence motifs at RNA splice sites.
Disclaimer¶
This is an UNOFFICIAL implementation of Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals by Gene Yeo and Christopher B. Burge.
The OFFICIAL distribution of MaxEntScan is at the Burge Lab MaxEntScan page.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.
The team that released MaxEntScan did not write a model card for it, so this model card was written by the MultiMolecule team.
Model Details¶
MaxEntScan is a maximum-entropy model for the splice donor (5’) and splice acceptor (3’) sequence motifs. It is not a neural network and has no trainable weights. The model parameters are fixed maximum-entropy probability tables estimated by Yeo & Burge (2004) from human splice-site sequences. These tables are registered as persistent buffers on the model so they serialize with saved checkpoints.
Model Specification¶
MaxEntScan is a parameter-free maximum-entropy model. It performs fixed table lookups and contains no learnable weights, and the profiler attributes no floating-point arithmetic to any module, which is why every entry in the table below is zero.
| Mode | Window | Num Parameters (M) | FLOPs (G) | MACs (G) |
|---|---|---|---|---|
| score5 | 9 | 0.00 | 0.00 | 0.00 |
| score3 | 23 | 0.00 | 0.00 | 0.00 |
Links¶
- Code: multimolecule.maxentscan
- Data: Human RefSeq splice-site sequences curated by Yeo and Burge
- Paper: Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals
- Developed by: Gene Yeo, Christopher B. Burge
- Model type: Maximum-entropy splice-site scoring with fixed probability tables for 5’ and 3’ splice sites
- Original Distribution: Burge Lab MaxEntScan
Usage¶
The model file depends on the multimolecule library, which can be installed using pip (`pip install multimolecule`).
Direct Use¶
5’ Splice-Site Scoring¶
3’ Splice-Site Scoring¶
Interface¶
- Input length: 9 nt fixed window for `score5`; 23 nt fixed window for `score3`
- Alphabet: `ACGT` only; unknown / `N` tokens are clamped onto `A` before table lookup
- Special tokens: do not add (`add_special_tokens=False`)
- `inputs_embeds`: not supported; the model scores discrete token windows only
- Output: single scalar splice-site log-odds score per window
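The interface rules above can be illustrated with a small stand-alone sketch. The helper `encode_window` and the `WINDOW` mapping are hypothetical names and are not part of the multimolecule API; the sketch only mirrors the listed rules (fixed window length, ACGT alphabet, unknown bases clamped onto A, no special tokens).

```python
# Hypothetical stand-alone sketch; not the multimolecule implementation.
WINDOW = {"score5": 9, "score3": 23}  # fixed window length per scoring mode
BASE_TO_ID = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_window(sequence: str, mode: str = "score5") -> list[int]:
    """Map a fixed-length window to integer ids, clamping unknown bases onto A."""
    expected = WINDOW[mode]
    if len(sequence) != expected:
        raise ValueError(f"{mode} expects {expected} nt, got {len(sequence)}")
    # Anything outside ACGT (for example N) falls back to the id of A.
    return [BASE_TO_ID.get(base, BASE_TO_ID["A"]) for base in sequence.upper()]


print(encode_window("CAGGTAAGT"))  # 9-nt donor window
print(encode_window("CAGGTNAGT"))  # N is clamped onto A
```

Because unknown bases are silently mapped to A, a window containing N receives the score of the corresponding A-substituted window rather than an error, which may be worth checking for upstream.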
Training Details¶
MaxEntScan is not trained. Its maximum-entropy probability tables were estimated once by Yeo & Burge (2004) from a set of human constitutive splice-site sequences using an iterative maximum-entropy procedure. The published tables are reused verbatim.
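The idea of an iterative maximum-entropy fit can be illustrated on a toy problem. The sketch below is not the procedure or code used by Yeo & Burge; it applies iterative proportional fitting to a 3-nt window with adjacent-pair marginal constraints, and the example sequences and function names are made up for illustration.

```python
from itertools import product

BASES = "ACGT"
# Made-up toy windows (3 nt each); purely illustrative, not real splice sites.
SAMPLES = ["GTA", "GTA", "GTG", "GCA", "ATA", "GTA", "CTG", "GTT"]
PAIRS = [(0, 1), (1, 2)]  # position pairs whose marginals are constrained


def pair_marginal(dist, i, j):
    """Marginal distribution of positions (i, j) under a joint distribution."""
    out = {}
    for seq, p in dist.items():
        key = (seq[i], seq[j])
        out[key] = out.get(key, 0.0) + p
    return out


# Empirical joint distribution and the pairwise marginals used as constraints.
empirical = {}
for s in SAMPLES:
    empirical[s] = empirical.get(s, 0.0) + 1.0 / len(SAMPLES)
targets = {pair: pair_marginal(empirical, *pair) for pair in PAIRS}

# Start from the uniform distribution, which has maximum entropy.
model = {"".join(t): 1.0 / 4**3 for t in product(BASES, repeat=3)}

# Repeated sweeps over the constraints; a fixed count keeps the sketch simple.
for _ in range(50):
    for (i, j) in PAIRS:
        current = pair_marginal(model, i, j)
        for seq in model:
            key = (seq[i], seq[j])
            if current[key] > 0.0:
                # Rescale so the model marginal matches the target marginal.
                model[seq] *= targets[(i, j)].get(key, 0.0) / current[key]

fitted = pair_marginal(model, 0, 1)
print(fitted[("G", "T")], targets[(0, 1)][("G", "T")])
```

The fitted distribution reproduces the constrained marginals while staying as close to uniform as those constraints allow, which is the property a maximum-entropy model is built around.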
Scoring Modes¶
- `score5`: scores 5’ (donor) splice sites over a 9-nucleotide window (3 exonic + 6 intronic nucleotides). The score is read from the published `me2x5` maximum-entropy probability table combined with the consensus background ratios.
- `score3`: scores 3’ (acceptor) splice sites over a 23-nucleotide window. The 23-mer is decomposed into nine overlapping maximum-entropy submodels following the published maximum-entropy decomposition; the score is the log-ratio of the numerator and denominator submodel products.
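The score5 computation described above can be sketched in a few lines. This is a hedged illustration, not the MultiMolecule code: the consensus and background constants are assumed to match score5.pl and should be checked against the original script, `ME2X5_PLACEHOLDER` is a made-up value standing in for a real `me2x5` table entry, and all function names are hypothetical.

```python
import math

# Assumed to match score5.pl; check against the original script before relying on them.
BACKGROUND = {"A": 0.27, "C": 0.23, "G": 0.23, "T": 0.27}
CONS_POS4 = {"A": 0.004, "C": 0.0032, "G": 0.9896, "T": 0.0032}   # 4th nt (G of GT)
CONS_POS5 = {"A": 0.0034, "C": 0.0039, "G": 0.0042, "T": 0.9884}  # 5th nt (T of GT)


def consensus_ratio(window: str) -> float:
    """Consensus-to-background ratio for the two near-invariant intronic bases."""
    b4, b5 = window[3], window[4]
    return CONS_POS4[b4] * CONS_POS5[b5] / (BACKGROUND[b4] * BACKGROUND[b5])


def rest_of_window(window: str) -> str:
    """The 7 nt left after removing positions 4 and 5; this is the table key."""
    return window[:3] + window[5:]


def score5(window: str, me2x5_value: float) -> float:
    """log2 of the table value for the 7-mer times the consensus ratio."""
    return math.log2(consensus_ratio(window) * me2x5_value)


ME2X5_PLACEHOLDER = 0.01  # made-up number; a real score needs the published table
window = "CAGGTAAGT"
print(rest_of_window(window))              # 7-mer that would be looked up
print(score5(window, ME2X5_PLACEHOLDER))   # not a real MaxEntScan score
```

Separating the consensus ratio from the table lookup keeps the table at 4^7 entries instead of 4^9, since the two consensus positions are handled by fixed constants.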
Training Data¶
- Source: human RefSeq splice-site sequences as described in Yeo & Burge (2004).
- Maximum-entropy constraints: pairwise and higher-order positional dependencies within the splice-site window.
MaxEntScan has no neural checkpoint. Its parameters are the fixed maximum-entropy probability tables distributed as plain-text files with the original Yeo & Burge (2004) MaxEntScan tool: me2x5 for the 5’ scorer and the nine maximum-entropy decomposition matrices me2x3acc1..9 for the 3’ scorer. The consensus and background ratios are fixed constants from the original score5.pl and score3.pl programs.
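The nine-table decomposition for the 3’ scorer can be sketched as follows. The window offsets are assumed to follow score3.pl and should be checked against the original script; uniform placeholder tables stand in for the published me2x3acc1..9 values, so the printed ratio is not a real MaxEntScan quantity. All function names are hypothetical.

```python
import math

# (start, length) of each submodel within the 21-mer left after removing the AG.
# Offsets assumed from score3.pl; verify against the original script.
NUMERATOR = [(0, 7), (7, 7), (14, 7), (4, 7), (11, 7)]
DENOMINATOR = [(4, 3), (7, 4), (11, 3), (14, 4)]


def kmer_index(kmer: str) -> int:
    """Base-4 index with A=0, C=1, G=2, T=3."""
    index = 0
    for base in kmer:
        index = index * 4 + "ACGT".index(base)
    return index


def uniform_table(length: int) -> list[float]:
    """Placeholder table: every k-mer gets the value 4**-k."""
    return [4.0 ** -length] * 4**length


def maxent_ratio(rest: str, tables: list[list[float]]) -> float:
    """Product of the five numerator lookups over the four denominator lookups."""
    windows = NUMERATOR + DENOMINATOR
    values = [tables[n][kmer_index(rest[s:s + k])] for n, (s, k) in enumerate(windows)]
    return math.prod(values[:5]) / math.prod(values[5:])


window = "TTCCAAACGAACTTTTGTAGGGA"  # 20 intronic + 3 exonic nt
rest = window[:18] + window[20:]    # drop the AG at positions 19 and 20
tables = [uniform_table(k) for _, k in NUMERATOR + DENOMINATOR]
print(len(rest), maxent_ratio(rest, tables))
```

With uniform placeholder tables the ratio collapses to the uniform probability of a 21-mer, which is a convenient sanity check that the numerator and denominator windows overlap consistently.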
The MaxEntScan model includes those tables as score5_me2x5.txt and score3_me2x3acc.txt in their native one-float-per-line order, which equals the base-4 order of the published splice5sequences enumeration. convert_checkpoint.py builds persistent score-table buffers directly from the bundled plain-text tables.
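The correspondence between line order and base-4 order can be checked with a short sketch. It assumes the digit mapping A=0, C=1, G=2, T=3 and a lexicographic enumeration of 7-mers; `kmer_to_line` is a hypothetical helper, not a function from the repository.

```python
from itertools import product


def kmer_to_line(kmer: str) -> int:
    """Zero-based line number of a k-mer in a one-float-per-line table."""
    line = 0
    for base in kmer:
        line = line * 4 + "ACGT".index(base)  # assumed digits: A=0, C=1, G=2, T=3
    return line


# Lexicographic enumeration of all 7-mers, one per line.
enumeration = ["".join(t) for t in product("ACGT", repeat=7)]
print(len(enumeration))  # number of lines in a 7-mer table
print(kmer_to_line("CAGAAGT"), enumeration[kmer_to_line("CAGAAGT")])
```

Under these assumptions a table can be stored as a flat array and indexed directly, with no need to keep the sequence strings alongside the values.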
Citation¶
Note
The artifacts distributed in this repository are part of the MultiMolecule project. If you use MultiMolecule in your research, you must cite the MultiMolecule project.
Contact¶
Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the MaxEntScan paper for questions or comments on the paper/model.
License¶
This model implementation is licensed under the GNU Affero General Public License.
For additional terms and clarifications, please refer to our License FAQ.
multimolecule.models.maxentscan¶
DnaTokenizer¶
Bases: Tokenizer
Tokenizer for DNA sequences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | Alphabet \| str \| List[str] \| None | alphabet to use for tokenization. | None |
| | int | Size of kmer to tokenize. | 1 |
| | bool | Whether to tokenize into codons. | False |
| | bool | Whether to replace U with T. | True |
| | bool | Whether to convert input to uppercase. | True |
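The effect of the two normalization options can be illustrated without the library. `normalize_dna` and its keyword names are hypothetical stand-ins, not the DnaTokenizer API; the sketch only mirrors the documented defaults (uppercase conversion and U-to-T replacement both enabled).

```python
def normalize_dna(sequence: str, *, upper: bool = True, u_to_t: bool = True) -> str:
    """Apply the two normalization steps described in the parameter table."""
    if upper:
        sequence = sequence.upper()
    if u_to_t:
        # Replace both cases so the step also works when uppercasing is disabled.
        sequence = sequence.replace("U", "T").replace("u", "t")
    return sequence


print(normalize_dna("cagguaagu"))               # RNA-style input becomes DNA
print(normalize_dna("cagguaagu", upper=False))  # case preserved, u replaced by t
```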
Examples:
Source code in multimolecule/tokenisers/dna/tokenization_dna.py
MaxEntScanConfig¶
Bases: PreTrainedConfig
This is the configuration class to store the configuration of a
MaxEntScanModel. It is used to instantiate a MaxEntScan scorer according
to the specified arguments, defining the model behavior. Instantiating a configuration with the defaults will yield
a configuration equivalent to the 5’ splice-site scorer (score5) of the original MaxEntScan tool.
MaxEntScan is a maximum-entropy model and has no trainable weights. The score tables are fixed maximum-entropy probability tables published with the original tool and are registered as buffers on the model.
Configuration objects inherit from PreTrainedConfig and can be used to
control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | int | Vocabulary size of the MaxEntScan model. Defines the number of different tokens that can be represented by the | 5 |
| | str | Which splice-site scorer to use. | 'score5' |
| | int \| None | The fixed length of the input window. Must match | None |
| | int | Number of output labels. MaxEntScan emits a single maximum-entropy score, so this must be 1. | 1 |
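The constraints in the table can be summarised by a small stand-in class. `ScorerConfigSketch` and its field names are assumptions made for illustration and are not the MaxEntScanConfig API; the sketch encodes the documented defaults, the rule that a single label is emitted, and the assumption that the window length has to agree with the scoring mode (9 nt for score5, 23 nt for score3).

```python
from dataclasses import dataclass
from typing import Optional

WINDOW_BY_MODE = {"score5": 9, "score3": 23}


@dataclass
class ScorerConfigSketch:
    """Illustrative stand-in; field names are hypothetical."""

    vocab_size: int = 5
    mode: str = "score5"
    window: Optional[int] = None  # None means "derive from the mode"
    num_labels: int = 1

    def __post_init__(self) -> None:
        if self.mode not in WINDOW_BY_MODE:
            raise ValueError(f"unknown mode: {self.mode!r}")
        expected = WINDOW_BY_MODE[self.mode]
        if self.window is None:
            self.window = expected
        elif self.window != expected:
            raise ValueError(f"{self.mode} requires a {expected}-nt window")
        if self.num_labels != 1:
            raise ValueError("a single score is emitted, so num_labels must be 1")


print(ScorerConfigSketch())               # defaults: the 5' scorer
print(ScorerConfigSketch(mode="score3"))  # window derived from the mode
```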
Examples:
Source code in multimolecule/models/maxentscan/configuration_maxentscan.py
MaxEntScanForSequencePrediction¶
Bases: MaxEntScanPreTrainedModel
MaxEntScan scorer with sequence-level regression loss support.
Examples:
Source code in multimolecule/models/maxentscan/modeling_maxentscan.py
MaxEntScanModel¶
Bases: MaxEntScanPreTrainedModel
Maximum-entropy splice-site scorer (Yeo & Burge, 2004).
The model has no trainable weights. It exposes a single maximum-entropy score per input window through fixed score-table buffers populated from the published Yeo & Burge (2004) tables.
Examples:
Source code in multimolecule/models/maxentscan/modeling_maxentscan.py
MaxEntScanPreTrainedModel¶
Bases: PreTrainedModel
An abstract class to handle the fixed maximum-entropy score tables and a simple interface for downloading and loading the published MaxEntScan parameters.