SpliceAI
Convolutional neural network for predicting mRNA splicing from pre-mRNA sequences.
Disclaimer
This is an UNOFFICIAL implementation of Predicting Splicing from Primary Sequence with Deep Learning by Kishore Jaganathan, Sofia Kyriazopoulou Panagiotopoulou, Jeremy F. McRae, et al.
The OFFICIAL repository of SpliceAI is at Illumina/SpliceAI.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.
The team releasing SpliceAI did not write this model card, so it has been written by the MultiMolecule team.
Model Details
SpliceAI is a convolutional neural network (CNN) trained to predict mRNA splicing site locations (acceptor and donor) from primary pre-mRNA sequences. The model was trained in a supervised manner using annotated splice junctions from human reference transcripts. It processes input RNA sequences and, for each nucleotide, predicts the probability of it being a splice acceptor, a splice donor, or neither. This allows for the identification of canonical splice sites and the prediction of cryptic splice sites potentially activated or inactivated by sequence variants. Please refer to the Training Details section for more information on the training process.
Model Specification
| Num Layers | Hidden Size | Num Parameters (M) | FLOPs (G) | MACs (G) |
| ---------- | ----------- | ------------------ | --------- | -------- |
| 16 | 32 | 3.48 | 70.39 | 35.11 |
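The parameter count above can be checked directly against the released checkpoint. The snippet below is an illustrative sketch only; FLOPs and MACs require a dedicated profiler and are not recomputed here.

```python
# Illustrative sketch: recount the trainable parameters of the released checkpoint.
from multimolecule import SpliceAiModel

model = SpliceAiModel.from_pretrained("multimolecule/spliceai")
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{num_params / 1e6:.2f}M trainable parameters")  # expected to be roughly 3.48M
```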
Links
- Code: multimolecule.spliceai
- Weights: multimolecule/spliceai
- Paper: Predicting Splicing from Primary Sequence with Deep Learning
- Developed by: Kishore Jaganathan, Sofia Kyriazopoulou Panagiotopoulou, Jeremy F. McRae, Siavash Fazel Darbandi, David Knowles, Yang I. Li, Jack A. Kosmicki, Juan Arbelaez, Wenwu Cui, Grace B. Schwartz, Eric D. Chow, Efstathios Kanterakis, Hong Gao, Amirali Kia, Serafim Batzoglou, Stephan J. Sanders, Kyle Kai-How Farh
- Original Repository: Illumina/SpliceAI
Usage
The model file depends on the `multimolecule` library. You can install it using pip:
```bash
pip install multimolecule
```
Direct Use
RNA Splicing Site Prediction
You can use this model directly to predict the splicing sites of an RNA sequence:
```python
>>> from multimolecule import RnaTokenizer, SpliceAiModel
>>> tokenizer = RnaTokenizer.from_pretrained("multimolecule/spliceai")
>>> model = SpliceAiModel.from_pretrained("multimolecule/spliceai")
>>> output = model(tokenizer("agcagucauuauggcgaa", return_tensors="pt")["input_ids"])
>>> output.keys()
odict_keys(['logits'])
>>> output.logits.squeeze()
tensor([[ 8.5123, -4.9607, -7.6787],
        [ 8.6559, -4.4936, -8.6357],
        [ 5.8514, -1.9375, -6.8030],
        [ 7.3739, -5.3444, -5.2559],
        [ 8.6336, -5.3187, -7.5741],
        [ 6.1947, -1.5497, -7.6286],
        [ 9.0482, -6.1002, -7.1229],
        [ 7.9647, -5.6973, -6.5327],
        [ 8.8795, -6.3714, -7.0204],
        [ 7.9459, -5.4744, -6.0865],
        [ 8.4272, -5.2556, -7.9027],
        [ 7.7523, -5.8517, -6.9109],
        [ 7.3027, -4.6946, -5.9420],
        [ 8.1432, -4.3085, -7.7892],
        [ 7.9060, -4.9454, -7.0091],
        [ 8.9770, -5.3971, -7.3313],
        [ 8.4292, -5.7455, -6.7811],
        [ 8.2709, -6.1388, -6.6784]], grad_fn=<SqueezeBackward0>)
```
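The three logits at each position correspond to the neither/acceptor/donor classes; applying a softmax turns them into per-nucleotide probabilities. The following is a minimal sketch that continues the session above and assumes the conventional SpliceAI class ordering (neither, acceptor, donor):

```python
# Continues the session above; the class order (neither, acceptor, donor) is assumed.
import torch

probabilities = torch.softmax(output.logits.squeeze(), dim=-1)  # shape: (length, 3)
acceptor_probability = probabilities[:, 1]
donor_probability = probabilities[:, 2]

# Positions whose acceptor or donor probability exceeds a chosen cutoff (e.g. 0.5)
# can be reported as predicted splice sites.
predicted_acceptors = torch.nonzero(acceptor_probability > 0.5).flatten()
predicted_donors = torch.nonzero(donor_probability > 0.5).flatten()
```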
Training Details
SpliceAI was trained to predict the location of splice donor and acceptor sites from the primary pre-mRNA sequence.
Training Data
The SpliceAI model was trained on human reference transcripts obtained from GENCODE (release 24, GRCh38).
This dataset comprises both protein-coding and non-protein-coding transcripts.
For training, a sequence window of 10,000 base pairs (bp) was used for each nucleotide whose splicing status was to be predicted, including 5,000 bp upstream and 5,000 bp downstream.
Sequences near transcript ends were padded with ‘N’ (unknown nucleotide) characters to maintain a consistent input length.
Annotated splice donor and acceptor sites from GENCODE served as positive labels for their respective classes.
All other intronic and exonic positions within these transcripts were considered negative (non-splice site) labels.
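For illustration only, the windowing and padding described above can be sketched as follows; the helper below is hypothetical and not part of the original training pipeline.

```python
# Hypothetical sketch of the described preprocessing: for a target position, take
# 5,000 bp of upstream and downstream context and pad with 'N' wherever the
# transcript ends before the window does.
def extract_window(sequence: str, position: int, flank: int = 5000) -> str:
    start, end = position - flank, position + flank + 1
    left_pad = "N" * max(0, -start)
    right_pad = "N" * max(0, end - len(sequence))
    return left_pad + sequence[max(0, start):min(len(sequence), end)] + right_pad


window = extract_window("ACGU" * 1000, position=10)
assert len(window) == 10001  # the target nucleotide plus 5,000 bp on each side
```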
The data was partitioned by chromosome:

- Chromosomes 1-19, X, and Y were designated for the training set.
- Chromosome 20 was reserved as a test set.
- A validation set, comprising 5% of transcripts from each training chromosome, was used for model selection and to monitor for overfitting.
Positions within 50 bp of a masked interval (an interval of >10 ‘N’s) or within 50 bp of a transcript end were excluded from the training and validation datasets.
To address class imbalance, training examples were weighted such that the total loss contribution from positive examples (acceptor or donor sites) equaled that from negative examples (non-splice sites).
Within positive examples, acceptor and donor sites were weighted equally.
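One way to express this balancing is a per-class weighted cross-entropy, as in the sketch below; the class counts are made up and the exact mechanism used by the original implementation may differ.

```python
# Hypothetical sketch of the described class balancing: positives (acceptor, donor)
# together receive the same total weight as negatives (neither), and the positive
# weight is split equally between acceptors and donors.
import torch
from torch import nn

# counts[c] = number of training positions with class c (0: neither, 1: acceptor, 2: donor)
counts = torch.tensor([1_000_000.0, 5_000.0, 5_000.0])  # made-up numbers
negatives = counts[0]

weights = torch.empty(3)
weights[0] = 1.0                          # negatives keep unit weight
weights[1] = negatives / (2 * counts[1])  # acceptors carry half of the positive mass
weights[2] = negatives / (2 * counts[2])  # donors carry the other half

criterion = nn.CrossEntropyLoss(weight=weights)
```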
Training Procedure
Pre-training
The model was trained to minimize a cross-entropy loss, comparing its predicted splice site probabilities against the ground truth labels from GENCODE.
- Batch Size: 64
- Epochs: 4
- Optimizer: Adam
- Learning rate: 1e-3
- Learning rate scheduler: Exponential
- Minimum learning rate: 1e-5
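A minimal PyTorch sketch of this optimization setup is shown below; the decay factor and the way the learning-rate floor is enforced are assumptions, not taken from the original code.

```python
# Hypothetical sketch of the listed hyperparameters: Adam at 1e-3 with exponential decay,
# clamped so the learning rate never falls below 1e-5.
import torch
from torch import nn

model = nn.Linear(4, 3)  # stand-in module; the real setup would use SpliceAiModel
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.5)  # gamma is an assumption

for epoch in range(4):  # 4 epochs; the omitted inner loop iterates over batches of 64
    # ... forward pass, weighted cross-entropy loss, backward pass, optimizer.step() ...
    scheduler.step()
    for group in optimizer.param_groups:
        group["lr"] = max(group["lr"], 1e-5)  # enforce the minimum learning rate
    print(epoch, optimizer.param_groups[0]["lr"])
```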
Citation
BibTeX:
```bibtex
@article{jaganathan2019the,
  abstract = {The splicing of pre-mRNAs into mature transcripts is remarkable for its precision, but the mechanisms by which the cellular machinery achieves such specificity are incompletely understood. Here, we describe a deep neural network that accurately predicts splice junctions from an arbitrary pre-mRNA transcript sequence, enabling precise prediction of noncoding genetic variants that cause cryptic splicing. Synonymous and intronic mutations with predicted splice-altering consequence validate at a high rate on RNA-seq and are strongly deleterious in the human population. De novo mutations with predicted splice-altering consequence are significantly enriched in patients with autism and intellectual disability compared to healthy controls and validate against RNA-seq in 21 out of 28 of these patients. We estimate that 9\%-11\% of pathogenic mutations in patients with rare genetic disorders are caused by this previously underappreciated class of disease variation.},
  author = {Jaganathan, Kishore and Kyriazopoulou Panagiotopoulou, Sofia and McRae, Jeremy F and Darbandi, Siavash Fazel and Knowles, David and Li, Yang I and Kosmicki, Jack A and Arbelaez, Juan and Cui, Wenwu and Schwartz, Grace B and Chow, Eric D and Kanterakis, Efstathios and Gao, Hong and Kia, Amirali and Batzoglou, Serafim and Sanders, Stephan J and Farh, Kyle Kai-How},
  copyright = {http://www.elsevier.com/open-access/userlicense/1.0/},
  journal = {Cell},
  keywords = {artificial intelligence; deep learning; genetics; splicing},
  language = {en},
  month = jan,
  number = 3,
  pages = {535--548.e24},
  publisher = {Elsevier BV},
  title = {Predicting splicing from primary sequence with deep learning},
  volume = 176,
  year = 2019
}
```
Please use the GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the SpliceAI paper for questions or comments on the paper/model.
License
This model is licensed under the AGPL-3.0 License and the CC-BY-NC-4.0 License.
```text
SPDX-License-Identifier: AGPL-3.0-or-later AND CC-BY-NC-4.0
```
multimolecule.models.spliceai
RnaTokenizer
Bases: Tokenizer
Tokenizer for RNA sequences.
Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| alphabet | Alphabet \| str \| List[str] \| None | Alphabet to use for tokenization. If None, the standard RNA alphabet will be used. If a string, it should correspond to the name of a predefined alphabet: standard, extended, streamline, or nucleobase. If an alphabet or a list of characters, that specific alphabet will be used. | None |
| nmers | int | Size of kmer to tokenize. | 1 |
| codon | bool | Whether to tokenize into codons. | False |
| replace_T_with_U | bool | Whether to replace T with U. | True |
| do_upper_case | bool | Whether to convert input to uppercase. | True |
Examples:
```python
>>> from multimolecule import RnaTokenizer
>>> tokenizer = RnaTokenizer()
>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')["input_ids"]
[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]
>>> tokenizer('acgu')["input_ids"]
[1, 6, 7, 8, 9, 2]
>>> tokenizer('acgt')["input_ids"]
[1, 6, 7, 8, 9, 2]
>>> tokenizer = RnaTokenizer(replace_T_with_U=False)
>>> tokenizer('acgt')["input_ids"]
[1, 6, 7, 8, 3, 2]
>>> tokenizer = RnaTokenizer(nmers=3)
>>> tokenizer('uagcuuauc')["input_ids"]
[1, 83, 17, 64, 49, 96, 84, 22, 2]
>>> tokenizer = RnaTokenizer(codon=True)
>>> tokenizer('uagcuuauc')["input_ids"]
[1, 83, 49, 22, 2]
>>> tokenizer('uagcuuauca')["input_ids"]
Traceback (most recent call last):
ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10
```
Source code in multimolecule/tokenisers/rna/tokenization_rna.py
```python
class RnaTokenizer(Tokenizer):
    """
    Tokenizer for RNA sequences.

    Args:
        alphabet: alphabet to use for tokenization.

            - If is `None`, the standard RNA alphabet will be used.
            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include
                + `standard`
                + `extended`
                + `streamline`
                + `nucleobase`
            - If is an alphabet or a list of characters, that specific alphabet will be used.
        nmers: Size of kmer to tokenize.
        codon: Whether to tokenize into codons.
        replace_T_with_U: Whether to replace T with U.
        do_upper_case: Whether to convert input to uppercase.

    Examples:
        >>> from multimolecule import RnaTokenizer
        >>> tokenizer = RnaTokenizer()
        >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')["input_ids"]
        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]
        >>> tokenizer('acgu')["input_ids"]
        [1, 6, 7, 8, 9, 2]
        >>> tokenizer('acgt')["input_ids"]
        [1, 6, 7, 8, 9, 2]
        >>> tokenizer = RnaTokenizer(replace_T_with_U=False)
        >>> tokenizer('acgt')["input_ids"]
        [1, 6, 7, 8, 3, 2]
        >>> tokenizer = RnaTokenizer(nmers=3)
        >>> tokenizer('uagcuuauc')["input_ids"]
        [1, 83, 17, 64, 49, 96, 84, 22, 2]
        >>> tokenizer = RnaTokenizer(codon=True)
        >>> tokenizer('uagcuuauc')["input_ids"]
        [1, 83, 49, 22, 2]
        >>> tokenizer('uagcuuauca')["input_ids"]
        Traceback (most recent call last):
        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10
    """

    model_input_names = ["input_ids", "attention_mask"]

    def __init__(
        self,
        alphabet: Alphabet | str | List[str] | None = None,
        nmers: int = 1,
        codon: bool = False,
        replace_T_with_U: bool = True,
        do_upper_case: bool = True,
        additional_special_tokens: List | Tuple | None = None,
        **kwargs,
    ):
        if codon and (nmers > 1 and nmers != 3):
            raise ValueError("Codon and nmers cannot be used together.")
        if codon:
            nmers = 3  # set to 3 to get correct vocab
        if not isinstance(alphabet, Alphabet):
            alphabet = get_alphabet(alphabet, nmers=nmers)
        super().__init__(
            alphabet=alphabet,
            nmers=nmers,
            codon=codon,
            replace_T_with_U=replace_T_with_U,
            do_upper_case=do_upper_case,
            additional_special_tokens=additional_special_tokens,
            **kwargs,
        )
        self.replace_T_with_U = replace_T_with_U
        self.nmers = nmers
        self.codon = codon

    def _tokenize(self, text: str, **kwargs):
        if self.do_upper_case:
            text = text.upper()
        if self.replace_T_with_U:
            text = text.replace("T", "U")
        if self.codon:
            if len(text) % 3 != 0:
                raise ValueError(
                    f"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}"
                )
            return [text[i : i + 3] for i in range(0, len(text), 3)]
        if self.nmers > 1:
            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203
        return list(text)
```
SpliceAiConfig
Bases: PreTrainedConfig
This is the configuration class to store the configuration of a SpliceAiModel. It is used to instantiate a SpliceAI model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the SpliceAI Illumina/SpliceAI architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| vocab_size | int | Vocabulary size of the SpliceAI model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling SpliceAiModel. | 4 |
| context | int | The length of the context window. The input sequence will be padded with zeros of length context // 2 on each side. | 10000 |
| hidden_size | int | Dimensionality of the encoder layers. | 32 |
| stages | list[SpliceAiStageConfig] \| None | Configuration for each stage in the SpliceAI model. Each stage is a SpliceAiStageConfig object. | None |
| hidden_act | str | The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "silu" and "gelu_new" are supported. | 'gelu' |
| hidden_dropout | float | The dropout probability for all convolution layers in the encoder. | 0.1 |
| batch_norm_eps | float | The epsilon used by the batch normalization layers. | 0.001 |
| batch_norm_momentum | float | The momentum used by the batch normalization layers. | 0.01 |
| output_contexts | bool | Whether to output the context vectors for each stage. | False |
Examples:
```python
>>> from multimolecule import SpliceAiConfig, SpliceAiModel
>>> # Initializing a SpliceAI multimolecule/spliceai style configuration
>>> configuration = SpliceAiConfig()
>>> # Initializing a model (with random weights) from the multimolecule/spliceai style configuration
>>> model = SpliceAiModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```
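The default stages (visible in the source below) form four groups of residual blocks with increasing kernel sizes and dilations. A smaller variant could be configured in the same spirit; the sketch below assumes SpliceAiStageConfig can be imported from the configuration module listed underneath.

```python
# Sketch of a custom configuration; SpliceAiStageConfig is assumed to be importable from
# the configuration module shown below (configuration_spliceai.py).
from multimolecule import SpliceAiConfig, SpliceAiModel
from multimolecule.models.spliceai.configuration_spliceai import SpliceAiStageConfig

config = SpliceAiConfig(
    context=2000,  # smaller context window than the default 10,000
    stages=[
        SpliceAiStageConfig(num_blocks=4, kernel_size=11),
        SpliceAiStageConfig(num_blocks=4, kernel_size=11, dilation=4),
    ],
)
model = SpliceAiModel(config)  # randomly initialised, with a reduced receptive field
```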
Source code in multimolecule/models/spliceai/configuration_spliceai.py
```python
class SpliceAiConfig(PreTrainedConfig):
    r"""
    This is the configuration class to store the configuration of a
    [`SpliceAiModel`][multimolecule.models.SpliceAiModel]. It is used to instantiate a SpliceAI model according to the
    specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a
    similar configuration to that of the SpliceAI [Illumina/SpliceAI](https://github.com/Illumina/SpliceAI)
    architecture.

    Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to
    control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]
    for more information.

    Args:
        vocab_size:
            Vocabulary size of the SpliceAI model. Defines the number of different tokens that can be represented by the
            `inputs_ids` passed when calling [`SpliceAiModel`].
            Defaults to 5.
        context:
            The length of the context window. The input sequence will be padded with zeros of length `context // 2` on
            each side.
        hidden_size:
            Dimensionality of the encoder layers.
        stages:
            Configuration for each stage in the SpliceAI model. Each stage is a [`SpliceAiStageConfig`] object.
        hidden_act:
            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
            `"relu"`, `"silu"` and `"gelu_new"` are supported.
        hidden_dropout:
            The dropout probability for all convolution layers in the encoder.
        batch_norm_eps:
            The epsilon used by the batch normalization layers.
        batch_norm_momentum:
            The momentum used by the batch normalization layers.
        output_contexts:
            Whether to output the context vectors for each stage.

    Examples:
        >>> from multimolecule import SpliceAiConfig, SpliceAiModel
        >>> # Initializing a SpliceAI multimolecule/spliceai style configuration
        >>> configuration = SpliceAiConfig()
        >>> # Initializing a model (with random weights) from the multimolecule/spliceai style configuration
        >>> model = SpliceAiModel(configuration)
        >>> # Accessing the model configuration
        >>> configuration = model.config
    """

    model_type = "spliceai"

    def __init__(
        self,
        vocab_size: int = 4,
        context: int = 10000,
        hidden_size: int = 32,
        stages: list[SpliceAiStageConfig] | None = None,
        hidden_act: str = "gelu",
        hidden_dropout: float = 0.1,
        batch_norm_eps: float = 1e-3,
        batch_norm_momentum: float = 0.01,
        num_labels: int = 3,
        output_contexts: bool = False,
        **kwargs,
    ):
        super().__init__(**kwargs)
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.context = context
        if stages is None:
            stages = [
                SpliceAiStageConfig(num_blocks=4, kernel_size=11),
                SpliceAiStageConfig(num_blocks=4, kernel_size=11, dilation=4),
                SpliceAiStageConfig(num_blocks=4, kernel_size=21, dilation=10),
                SpliceAiStageConfig(num_blocks=4, kernel_size=41, dilation=25),
            ]
        self.stages = stages
        self.hidden_act = hidden_act
        self.hidden_dropout = hidden_dropout
        self.batch_norm_eps = batch_norm_eps
        self.batch_norm_momentum = batch_norm_momentum
        self.num_labels = num_labels
        self.output_contexts = output_contexts
```
SpliceAiModel
Bases: SpliceAiPreTrainedModel
Examples:
```python
>>> from multimolecule import SpliceAiConfig, SpliceAiModel, RnaTokenizer
>>> config = SpliceAiConfig()
>>> model = SpliceAiModel(config)
>>> tokenizer = RnaTokenizer.from_pretrained("multimolecule/spliceai")
>>> input = tokenizer("ACGUN", return_tensors="pt")
>>> output = model(**input)
>>> output["logits"].shape
torch.Size([1, 5, 3])
```
Source code in multimolecule/models/spliceai/modeling_spliceai.py
```python
class SpliceAiModel(SpliceAiPreTrainedModel):
    """
    Examples:
        >>> from multimolecule import SpliceAiConfig, SpliceAiModel, RnaTokenizer
        >>> config = SpliceAiConfig()
        >>> model = SpliceAiModel(config)
        >>> tokenizer = RnaTokenizer.from_pretrained("multimolecule/spliceai")
        >>> input = tokenizer("ACGUN", return_tensors="pt")
        >>> output = model(**input)
        >>> output["logits"].shape
        torch.Size([1, 5, 3])
    """

    def __init__(self, config: SpliceAiConfig):
        super().__init__(config)
        self.embeddings = SpliceAiEmbedding(config)
        self.networks = nn.ModuleList([SpliceAiModule(config) for _ in range(5)])

    def forward(
        self,
        input_ids: Tensor | NestedTensor | None = None,
        attention_mask: Tensor | None = None,
        inputs_embeds: Tensor | NestedTensor | None = None,
        output_contexts: bool | None = None,
        output_hidden_states: bool | None = None,
        return_dict: bool | None = None,
        **kwargs,
    ) -> SpliceAiModelOutput | Tuple[Tensor, Tuple[Tensor, ...]] | Tensor:
        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
        elif input_ids is None and inputs_embeds is None:
            raise ValueError("You have to specify either input_ids or inputs_embeds")
        output_contexts = output_contexts if output_contexts is not None else self.config.output_contexts
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        if isinstance(input_ids, NestedTensor):
            input_ids, attention_mask = input_ids.tensor, input_ids.mask
        embedding_output = self.embeddings(
            input_ids=input_ids,
            attention_mask=attention_mask,
            inputs_embeds=inputs_embeds,
        )
        all_outputs = [
            module(
                embedding_output,
                output_contexts=output_contexts,
                output_hidden_states=output_hidden_states,
                return_dict=return_dict,
            )
            for module in self.networks
        ]
        if not return_dict:
            return tuple(average_output(output) for output in zip(*all_outputs))
        outputs: Dict = {k: [outputs[k] for outputs in all_outputs] for k in all_outputs[0]}
        for key, output in outputs.items():
            outputs[key] = average_output(output)
        return SpliceAiModelOutput(**outputs)
```