HAL¶
Hexamer Additive Linear model for predicting alternative splicing from sequence.
Disclaimer¶
This is an UNOFFICIAL implementation of Learning the Sequence Determinants of Alternative Splicing from Millions of Random Sequences by Alexander B. Rosenberg et al.
The OFFICIAL repository of HAL is at Alex-Rosenberg/cell-2015.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.
The team that released HAL did not write a model card for it, so this model card was written by the MultiMolecule team.
Model Details¶
HAL is a linear (additive) model that scores alternative 5’ splice-site usage from normalized hexamer (6-mer) frequencies across a 160-nucleotide donor-region window. It was learned from massively parallel reporter assays measuring splicing of millions of random synthetic sequences. The published coefficient table contains a (4096, 8) matrix of hexamer effects; the model averages the eight coefficient columns into one effect per hexamer and applies those effects to normalized hexamer frequencies.
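The scoring rule described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the MultiMolecule implementation: the helper names (`hexamer_frequencies`, `average_columns`, `hal_score`) are hypothetical, the coefficient table holds random placeholder values rather than the published coefficients, the rows are assumed to follow lexicographic `ACGT` order, and dividing counts by the number of valid hexamers is one plausible reading of "normalized".

```python
import random
from itertools import product

K = 6          # hexamer length
WINDOW = 160   # fixed donor-region window length

# All 4**6 = 4096 hexamers (lexicographic order is an assumption).
HEXAMERS = ["".join(p) for p in product("ACGT", repeat=K)]
INDEX = {hexamer: i for i, hexamer in enumerate(HEXAMERS)}


def hexamer_frequencies(seq):
    """Count overlapping hexamers and divide by the number of valid ones."""
    counts = [0.0] * len(HEXAMERS)
    total = 0
    for i in range(len(seq) - K + 1):
        idx = INDEX.get(seq[i:i + K])  # None if the hexamer has a non-ACGT base
        if idx is not None:
            counts[idx] += 1.0
            total += 1
    if total == 0:
        return counts
    return [c / total for c in counts]


def average_columns(table):
    """Collapse a (4096, 8) table into one effect per hexamer."""
    return [sum(row) / len(row) for row in table]


def hal_score(seq, effects):
    """Linear combination of normalized hexamer frequencies and effects."""
    seq = seq.upper()
    if len(seq) != WINDOW:
        raise ValueError(f"expected a {WINDOW}-nt window, got {len(seq)}")
    freqs = hexamer_frequencies(seq)
    return sum(f * e for f, e in zip(freqs, effects))


# Placeholder data only: random values stand in for the published coefficients.
rng = random.Random(0)
placeholder_table = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4096)]
effects = average_columns(placeholder_table)
window = "".join(rng.choice("ACGT") for _ in range(WINDOW))
score = hal_score(window, effects)
print(f"{score:.4f}")
```

With random coefficients the printed number is arbitrary; only substituting the published table for `placeholder_table` would make the score meaningful.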
Model Specification¶
| Window | Published Coefficient Columns | Hexamer Features | Num Parameters | FLOPs | MACs |
|---|---|---|---|---|---|
| 160 nt | 8 averaged | 4,096 | 4,096 | 8,192 | 4,096 |
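The numbers in this table can be checked with a few lines of arithmetic. The sketch below assumes the usual convention that one multiply-accumulate counts as two floating-point operations, and that hexamers are read from the window with a stride of one; neither convention is stated in the table itself.

```python
KMER_SIZE = 6      # hexamers
NUCLEOBASES = 4    # A, C, G, T
WINDOW = 160       # nucleotides in the scored window

num_hexamer_features = NUCLEOBASES ** KMER_SIZE  # distinct hexamers
num_parameters = num_hexamer_features            # one averaged effect per hexamer
macs = num_hexamer_features                      # one multiply-accumulate per feature
flops = 2 * macs                                 # multiply and add counted separately
positions = WINDOW - KMER_SIZE + 1               # overlapping hexamers per window (stride 1)

print(num_hexamer_features, num_parameters, macs, flops)  # 4096 4096 4096 8192
print(positions)                                          # 155
```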
Links¶
- Code: multimolecule.hal
- Data: Rosenberg lab random-library 5’ splice-site MPRA
- Paper: Learning the Sequence Determinants of Alternative Splicing from Millions of Random Sequences
- Developed by: Alexander B. Rosenberg, Rupali P. Patwardhan, Jay Shendure, Georg Seelig
- Model type: Linear regression over normalized hexamer-frequency features with learned per-hexamer effect coefficients
- Original Repository: Alex-Rosenberg/cell-2015
Usage¶
The model file depends on the multimolecule library, which you can install with pip (`pip install multimolecule`).
Direct Use¶
Alternative Splicing Prediction¶
You can use this model directly to predict a splicing score for a 160-nucleotide DNA sequence window:
Interface¶
- Input length: 160 nt fixed donor-region window
- Alphabet: `ACGT` only; any hexamer spanning an unknown/`N` token is ignored
- Special tokens: do not add (`add_special_tokens=False`)
- Output: single scalar splicing score per window
- Variant effect: subtract two window scores and apply sigmoid externally for paired donor comparisons
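Two of the interface rules above, skipping hexamers that touch an `N` and comparing two windows through a sigmoid, can be illustrated as follows. The function names `valid_hexamers` and `paired_donor_preference` are hypothetical and not part of the multimolecule API, and the scores passed in at the end are made-up numbers used only to show the call.

```python
import math


def valid_hexamers(seq, k=6):
    """Yield hexamers made only of A/C/G/T; any hexamer touching N is skipped."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            yield kmer


def paired_donor_preference(score_a, score_b):
    """Sigmoid of the difference between two window scores."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


# The N at index 6 invalidates every hexamer that overlaps it.
print(list(valid_hexamers("ACGTACNACGTACG")))  # ['ACGTAC', 'ACGTAC', 'CGTACG']

# Equal scores give no preference; a higher first score pushes the value above 0.5.
print(paired_donor_preference(0.0, 0.0))       # 0.5
print(paired_donor_preference(2.0, 0.5) > 0.5) # True
```

Keeping the sigmoid outside the model means the model output stays a plain linear score, so the same two scores can be reused for other comparisons.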
Training Details¶
HAL was learned from massively parallel splicing reporter assays in which millions of random synthetic sequences were inserted into an alternatively spliced reporter minigene. Splicing outcomes were measured by high-throughput sequencing of the resulting mRNA isoforms.
Training Data¶
The model was trained on the splicing measurements of millions of degenerate (random) sequences from the reporter library described in the HAL paper. Hexamer coefficients were estimated by regressing the measured splicing index against the hexamer composition of each sequence.
Training Procedure¶
Pre-training¶
HAL is a linear regression model. The published hexamer coefficient table is fit to the measured splicing index, and the model prediction is the linear combination of normalized hexamer frequencies with the averaged hexamer effects.
The HAL model uses the published HAL_mer_scores.npz hexamer coefficient table from Rosenberg et al. The table stores 4,096 hexamer rows and eight coefficient columns; the eight columns are averaged into the single per-hexamer effect used by the HAL formula.
Citation¶
Note
The artifacts distributed in this repository are part of the MultiMolecule project. If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:
| BibTeX | |
|---|---|
Contact¶
Please use the MultiMolecule GitHub issues for any questions or comments on the model card.
Please contact the authors of the HAL paper for questions or comments on the paper/model.
License¶
This model implementation is licensed under the GNU Affero General Public License.
For additional terms and clarifications, please refer to our License FAQ.
multimolecule.models.hal¶
DnaTokenizer¶
Bases: Tokenizer
Tokenizer for DNA sequences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | Alphabet \| str \| List[str] \| None | Alphabet to use for tokenization. | None |
| | int | Size of kmer to tokenize. | 1 |
| | bool | Whether to tokenize into codons. | False |
| | bool | Whether to replace U with T. | True |
| | bool | Whether to convert input to uppercase. | True |
Source code in multimolecule/tokenisers/dna/tokenization_dna.py
HalConfig¶
Bases: PreTrainedConfig
This is the configuration class to store the configuration of a
HalModel. It is used to instantiate a HAL model according to the specified
arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
configuration to that of the HAL model from
Learning the Sequence Determinants of Alternative Splicing from Millions of Random
Sequences.
HAL (Hexamer Additive Linear model) is a linear model over hexamer (k-mer) features that predicts alternative splicing outcomes such as 5’ splice-site usage. The model weights are a published table of hexamer coefficients.
Configuration objects inherit from PreTrainedConfig and can be used to
control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | int | Vocabulary size of the HAL model. Defines the number of different tokens that can be represented by the model input. | 5 |
| | int | The k-mer (hexamer) size used for feature extraction. The published HAL model uses hexamers (6-mers). | 6 |
| | int | Number of canonical nucleotides used to enumerate k-mers. The number of k-mer features is `nucleobase_size ** kmer_size`. | 4 |
| | int | The length of the sequence region scored by the model. The published HAL/Kipoi model scores a fixed 160-nucleotide 5’ splice-site window. | 160 |
| | int | Size of the scalar feature consumed by the optional sequence prediction loss wrapper. HAL emits one score, so this must be 1. | 1 |
| | int | Number of output labels. HAL is a single-output regression model, so this defaults to 1. | 1 |
Source code in multimolecule/models/hal/configuration_hal.py
num_kmers property¶
num_kmers: int
Number of distinct k-mer features (nucleobase_size ** kmer_size).
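The relationship between the two configuration fields and this derived property can be shown with a small stand-in class. `ToyHalConfig` below is hypothetical and is not the library's `HalConfig`; it only mirrors the `nucleobase_size ** kmer_size` formula stated above.

```python
from dataclasses import dataclass


@dataclass
class ToyHalConfig:
    """Hypothetical stand-in for illustration; not multimolecule's HalConfig."""

    kmer_size: int = 6        # hexamers by default
    nucleobase_size: int = 4  # A, C, G, T

    @property
    def num_kmers(self) -> int:
        # Number of distinct k-mer features.
        return self.nucleobase_size ** self.kmer_size


print(ToyHalConfig().num_kmers)             # 4096
print(ToyHalConfig(kmer_size=3).num_kmers)  # 64
```

Deriving the feature count from the two fields, instead of storing it, keeps the three values from drifting out of sync.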
num_regions property¶
num_regions: int
Number of position-specific HAL coefficient regions in the published artifact.
HalForSequencePrediction¶
Bases: HalPreTrainedModel
Source code in multimolecule/models/hal/modeling_hal.py
HalModel¶
Bases: HalPreTrainedModel
Source code in multimolecule/models/hal/modeling_hal.py
HalModelOutput dataclass¶
Bases: ModelOutput
Base class for outputs of the HAL model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `torch.FloatTensor` of shape `(batch_size, num_labels)` | The HAL splicing score predicted by the linear hexamer model. | None |
| | `torch.FloatTensor` of shape `(batch_size, num_kmers)`, *optional* | The normalized hexamer (k-mer) frequency features derived from the input sequence region. | None |
| | Tuple[FloatTensor, ...] \| None | Always | None |
| | Tuple[FloatTensor, ...] \| None | Always | None |
Source code in multimolecule/models/hal/modeling_hal.py
HalPreTrainedModel¶
Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.