OpenSpliceAI¶
Modular native-PyTorch reimplementation of SpliceAI for predicting pre-mRNA splice sites from primary DNA sequence.
Disclaimer¶
This is an UNOFFICIAL implementation of OpenSpliceAI: An efficient, modular implementation of SpliceAI enabling easy retraining on non-human species by Kuan-Hao Chao, Alan Mao et al.
The OFFICIAL repository of OpenSpliceAI is at Kuanhao-Chao/OpenSpliceAI.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.
The team that released OpenSpliceAI did not write a model card for this model; this model card was written by the MultiMolecule team.
Model Details¶
OpenSpliceAI is a deep dilated residual convolutional neural network that reimplements the SpliceAI architecture in native PyTorch. It predicts, for each nucleotide of a pre-mRNA transcript, whether the position is a splice acceptor, a splice donor, or neither. The model stacks dilated residual units with increasing kernel size and atrous rate so that a wide genomic context window contributes to each per-nucleotide prediction, while skip connections aggregate multi-scale features. OpenSpliceAI reproduces the predictive behavior of SpliceAI while providing an efficient, modular training pipeline that can be retrained on non-human species.
Variants¶
OpenSpliceAI ships trained model families for human MANE and four non-human species. Each family provides four
flanking-context sizes. The listed Hub repositories use one deterministic seed (rs10) for each family/context pair;
the other seeds are training replicates and are not exposed as separate model variants.
Model Specification¶
| Flanking Context | Residual Blocks | Hidden Size | Num Parameters (M) | FLOPs (G) | MACs (G) |
|---|---|---|---|---|---|
| 80 nt | 4 | 32 | 0.09 | 0.95 | 0.47 |
| 400 nt | 8 | 32 | 0.19 | 2.00 | 0.99 |
| 2,000 nt | 12 | 32 | 0.36 | 5.03 | 2.50 |
| 10,000 nt | 16 | 32 | 0.70 | 20.90 | 10.40 |
Model size is determined by flanking context and is shared across species for the same context. FLOPs and MACs are reported for a single 5,000-nucleotide output sequence.
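The link between the residual-block counts and the flanking context in the table can be checked with a short calculation. The sketch below assumes the stage layout of the original SpliceAI design (groups of four residual blocks with kernel sizes 11, 11, 21, 41 and dilations 1, 4, 10, 25, and two convolutions per block); these values are not stated on this page, and `STAGES`, `flanking_context`, and `input_length` are illustrative names, not part of the multimolecule API.

```python
# Assumed SpliceAI-style stage layout: (num_blocks, kernel_size, dilation).
# These values come from the original SpliceAI design, not from this page.
STAGES = [(4, 11, 1), (4, 11, 4), (4, 21, 10), (4, 41, 25)]
CONVS_PER_BLOCK = 2  # assumption: each residual block holds two convolutions


def flanking_context(num_stages: int) -> int:
    """Total flanking context (both sides) covered by the first `num_stages` stages."""
    total = 0
    for num_blocks, kernel_size, dilation in STAGES[:num_stages]:
        # One dilated convolution widens the receptive field by (kernel_size - 1) * dilation.
        total += num_blocks * CONVS_PER_BLOCK * (kernel_size - 1) * dilation
    return total


def input_length(output_length: int, context: int) -> int:
    """Input length needed so that every output position sees its full context."""
    return output_length + context


for n in range(1, len(STAGES) + 1):
    blocks = sum(num_blocks for num_blocks, _, _ in STAGES[:n])
    context = flanking_context(n)
    # Columns: residual blocks, flanking context (nt), input length for 5,000 outputs
    print(blocks, context, input_length(5000, context))
```

Under these assumptions the four variants come out at 80, 400, 2,000, and 10,000 nt of context, matching the table, and a 5,000-nucleotide output needs between 5,080 and 15,000 input nucleotides.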
Links¶
- Code: multimolecule.openspliceai
- Data: Human MANE/GENCODE for the MANE variants; species annotations follow the original OpenSpliceAI release.
- Paper: OpenSpliceAI: An efficient, modular implementation of SpliceAI enabling easy retraining on non-human species
- Developed by: Kuan-Hao Chao, Alan Mao, Anqi Liu, Steven L. Salzberg, Mihaela Pertea
- Model type: Dilated residual 1D CNN over pre-mRNA DNA for per-nucleotide three-class splice-site classification
- Original Repository: Kuanhao-Chao/OpenSpliceAI
Usage¶
The model file depends on the multimolecule library, which can be installed with pip (`pip install multimolecule`).
Direct Use¶
RNA Splicing Site Prediction¶
You can use this model directly to predict the splice sites of a pre-mRNA sequence:
Each output position carries three logits corresponding to neither, acceptor, and donor.
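Turning the three logits into a per-position call can be sketched without any model code. The example below is a minimal illustration in plain Python: it applies a softmax to each position and picks the most probable channel, using the channel order stated above. The function names and the logit values are made up for illustration and do not come from the multimolecule library.

```python
import math

# Channel order as stated above: neither, acceptor, donor.
CHANNELS = ["neither", "acceptor", "donor"]


def softmax(logits):
    """Numerically stable softmax over one position's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def call_sites(position_logits):
    """Return a (label, probabilities) pair for each position."""
    calls = []
    for logits in position_logits:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        calls.append((CHANNELS[best], probs))
    return calls


# Made-up logits for three positions, for illustration only.
example = [[4.0, -2.0, -3.0], [-1.0, 3.5, -2.0], [-0.5, -1.0, 2.0]]
for index, (label, probs) in enumerate(call_sites(example)):
    print(index, label, [round(p, 3) for p in probs])
```

In practice a probability threshold on the acceptor and donor channels is often more useful than a plain argmax, since splice sites are rare relative to non-site positions.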
Interface¶
- Input length: variable pre-mRNA sequence
- Flanking context: 80 / 400 / 2,000 / 10,000 nt per variant family, split evenly on both sides of every predicted position
- Padding: sequence ends padded with `N`
- Output: per-position 3-class logits (`neither`, `acceptor`, `donor`)
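The padding rule in the interface can be illustrated with a small helper. This is a sketch only: `pad_flanks` is a hypothetical function, and this page does not say whether padding is applied by the tokenizer, the model, or the caller.

```python
def pad_flanks(sequence: str, context: int) -> str:
    """Pad both ends with N so every position has context // 2 flanking bases per side."""
    if context % 2:
        raise ValueError("context is expected to be even so that it splits evenly")
    flank = "N" * (context // 2)
    return flank + sequence + flank


# An 8-nt toy sequence with the 80 nt context variant: 40 N on each side.
padded = pad_flanks("ACGTAGGT", 80)
print(len(padded))
```

With this layout, the number of predicted positions equals the original sequence length, and the padded input is longer by exactly the flanking context.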
Training Details¶
OpenSpliceAI was trained to predict the location of splice donor and acceptor sites from primary DNA sequence, following the SpliceAI training methodology.
Training Data¶
The MANE variants were trained on transcripts from the GENCODE/MANE human reference annotation. The non-human variants use the species annotations released by OpenSpliceAI for mouse, zebrafish, honeybee, and Arabidopsis. For each predicted nucleotide, the model receives a flanking context of 80, 400, 2,000, or 10,000 nucleotides, split evenly across the two sides of the output sequence, with sequence ends padded with N. Annotated splice donor and acceptor sites serve as positive labels; all other positions are negative.
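The labeling scheme described above can be sketched as follows. The helper is hypothetical, the class indices follow the channel order given on this page (neither, acceptor, donor), and the coordinates are assumed to be 0-based offsets into the transcript.

```python
NEITHER, ACCEPTOR, DONOR = 0, 1, 2  # assumed class indices, in the channel order on this page


def label_positions(length, acceptors, donors):
    """Build one class label per nucleotide from annotated splice-site positions."""
    labels = [NEITHER] * length  # every position is negative unless annotated
    for position in acceptors:
        labels[position] = ACCEPTOR
    for position in donors:
        labels[position] = DONOR
    return labels


# Toy transcript of 10 nt with one acceptor at offset 2 and one donor at offset 7.
labels = label_positions(10, acceptors=[2], donors=[7])
print(labels)
```

Because nearly all positions are negative, the resulting label vectors are heavily imbalanced toward the `neither` class.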
Training Procedure¶
Pre-training¶
The model was trained to minimize a cross-entropy loss between predicted splice-site probabilities and the reference annotation.
- Optimizer: Adam
- Loss: cross-entropy
Please refer to the OpenSpliceAI paper for the full training protocol and hardware details.
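For reference, the per-position cross-entropy objective can be written out in plain Python. This is a minimal sketch, not the training code: it assumes unweighted classes and a mean over positions, neither of which is specified on this page, and `cross_entropy` is an illustrative name.

```python
import math


def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the reference labels over all positions."""
    total = 0.0
    for position_logits, label in zip(logits, labels):
        # log-sum-exp with the maximum subtracted for numerical stability
        peak = max(position_logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in position_logits))
        total += log_norm - position_logits[label]
    return total / len(labels)


# Uniform logits give a loss of log(3), the entropy of a uniform 3-class guess.
print(cross_entropy([[0.0, 0.0, 0.0]], [1]))
```

A confident correct prediction drives the loss toward zero, while a confident wrong prediction is penalized roughly in proportion to the logit gap.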
Citation¶
Note
The artifacts distributed in this repository are part of the MultiMolecule project. If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:
Contact¶
Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the OpenSpliceAI paper for questions or comments on the paper/model.
License¶
This model implementation is licensed under the GNU Affero General Public License.
For additional terms and clarifications, please refer to our License FAQ.
multimolecule.models.openspliceai¶
DnaTokenizer¶
Bases: Tokenizer
Tokenizer for DNA sequences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `Alphabet \| str \| List[str] \| None` | Alphabet to use for tokenization. | `None` |
| | `int` | Size of kmer to tokenize. | `1` |
| | `bool` | Whether to tokenize into codons. | `False` |
| | `bool` | Whether to replace U with T. | `True` |
| | `bool` | Whether to convert input to uppercase. | `True` |
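The effect of the uppercase, U-to-T, and k-mer options can be illustrated with a standalone sketch. It does not call `DnaTokenizer`: the function and parameter names below are hypothetical, overlapping k-mers with a stride of one are an assumption, and the codon option is not modeled.

```python
def normalize_and_split(sequence, nmers=1, replace_u_with_t=True, do_upper_case=True):
    """Normalize a nucleotide string and split it into k-mers (illustrative only)."""
    if do_upper_case:
        sequence = sequence.upper()
    if replace_u_with_t:
        # Handle both cases so the option also works when uppercasing is disabled.
        sequence = sequence.replace("U", "T").replace("u", "t")
    # Assumption: overlapping k-mers with a stride of one nucleotide.
    return [sequence[i:i + nmers] for i in range(len(sequence) - nmers + 1)]


print(normalize_and_split("acgu"))
print(normalize_and_split("acgua", nmers=3))
```

With the defaults, an RNA-style input such as `acgu` is treated as the DNA sequence `ACGT`, one token per nucleotide.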
Examples:
Source code in multimolecule/tokenisers/dna/tokenization_dna.py
OpenSpliceAiConfig¶
Bases: PreTrainedConfig
This is the configuration class to store the configuration of an
OpenSpliceAiModel. It is used to instantiate an OpenSpliceAI model
according to the specified arguments, defining the model architecture. Instantiating a configuration with the
defaults will yield a similar configuration to that of the OpenSpliceAI
Kuanhao-Chao/OpenSpliceAI openspliceai-mane 10000nt architecture.
Configuration objects inherit from PreTrainedConfig and can be used to
control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `int` | Vocabulary size of the OpenSpliceAI model. Defines the number of different tokens that can be represented by the | `4` |
| | `int` | The length of the context window. The input sequence will be padded with zeros of length | `10000` |
| | `int` | Dimensionality of the encoder layers. | `32` |
| | `list[OpenSpliceAiStageConfig] \| None` | Configuration for each stage in the OpenSpliceAI model. Each stage is a [ | `None` |
| | `str` | The non-linear activation function (function or string) in the encoder. String values are resolved through | `'leaky_relu'` |
| | `dict[str, object] \| None` | Keyword arguments used when instantiating string activations. Defaults to | `None` |
| | `float` | The epsilon used by the batch normalization layers. | `1e-05` |
| | `float` | The momentum used by the batch normalization layers. | `0.1` |
| | `int` | Number of output labels (neither / acceptor / donor). | `3` |
| | `HeadConfig \| None` | The configuration of the prediction head. | `None` |
| | `bool` | Whether to output the context vectors for each stage. | `False` |
Examples:
Source code in multimolecule/models/openspliceai/configuration_openspliceai.py
OpenSpliceAiStageConfig¶
Bases: FlatDict
Configuration for a single OpenSpliceAI stage.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | | Number of residual convolutional blocks in the stage. | required |
| | | Convolution kernel size for the stage. | required |
| | | Dilation (atrous) factor for the stage. | required |
Source code in multimolecule/models/openspliceai/configuration_openspliceai.py
OpenSpliceAiForTokenPrediction¶
Bases: OpenSpliceAiPreTrainedModel
OpenSpliceAI model for per-nucleotide splice-site classification (neither / acceptor / donor).
Examples:
Source code in multimolecule/models/openspliceai/modeling_openspliceai.py
postprocess¶
postprocess(
outputs: (
OpenSpliceAiTokenPredictorOutput
| ModelOutput
| Tensor
),
) -> tuple[Tensor, list[str]]
Return OpenSpliceAI splice-site probabilities with semantic channel names.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `OpenSpliceAiTokenPredictorOutput \| ModelOutput \| Tensor` | The output of | required |
Returns:
| Type | Description |
|---|---|
| `tuple[Tensor, list[str]]` | A tuple of |
Source code in multimolecule/models/openspliceai/modeling_openspliceai.py
OpenSpliceAiModel¶
Bases: OpenSpliceAiPreTrainedModel
The bare OpenSpliceAI backbone producing per-nucleotide context representations.
Examples:
Source code in multimolecule/models/openspliceai/modeling_openspliceai.py
OpenSpliceAiModelOutput dataclass¶
Bases: ModelOutput
Base class for outputs of the OpenSpliceAI backbone.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)` | Per-nucleotide context representation after the dilated residual stack. | `None` |
| | `tuple(torch.FloatTensor)`, *optional*, returned when `output_contexts=True` | Per-stage context representations. | `None` |
| | `tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` | Per-stage context representations. | `None` |
Source code in multimolecule/models/openspliceai/modeling_openspliceai.py
OpenSpliceAiPreTrainedModel¶
Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.
Source code in multimolecule/models/openspliceai/modeling_openspliceai.py
OpenSpliceAiTokenPredictorOutput dataclass¶
Bases: ModelOutput
Base class for outputs of OpenSpliceAI token prediction models.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `torch.FloatTensor`, *optional*, returned when `labels` is provided | Token prediction loss. | `None` |
| | `torch.FloatTensor` of shape `(batch_size, sequence_length, num_labels)` | Per-nucleotide splice-site classification scores. | `None` |
| | `tuple(torch.FloatTensor)`, *optional*, returned when `output_contexts=True` | Per-stage context representations. | `None` |
| | `tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` | Per-stage context representations. | `None` |