
SpliceAI

Convolutional neural network for predicting mRNA splicing from pre-mRNA sequences.

Disclaimer

This is an UNOFFICIAL implementation of Predicting Splicing from Primary Sequence with Deep Learning by Kishore Jaganathan, Sofia Kyriazopoulou Panagiotopoulou, Jeremy F. McRae, et al.

The OFFICIAL repository of SpliceAI is at Illumina/SpliceAI.

Tip

The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.

The team releasing SpliceAI did not write this model card, so it has been written by the MultiMolecule team.

Model Details

SpliceAI is a convolutional neural network (CNN) trained to predict mRNA splicing site locations (acceptor and donor) from primary pre-mRNA sequences. The model was trained in a supervised manner using annotated splice junctions from human reference transcripts. It processes input RNA sequences and, for each nucleotide, predicts the probability of it being a splice acceptor, a splice donor, or neither. This allows for the identification of canonical splice sites and the prediction of cryptic splice sites potentially activated or inactivated by sequence variants. Please refer to the Training Details section for more information on the training process.

Model Specification

Num Layers | Hidden Size | Num Parameters (M) | FLOPs (G) | MACs (G)
-----------|-------------|--------------------|-----------|---------
16         | 32          | 3.48               | 70.39     | 35.11

Usage

The model file depends on the multimolecule library. You can install it using pip:

Bash
pip install multimolecule

Direct Use

RNA Splicing Site Prediction

You can use this model directly to predict the splicing sites of an RNA sequence:

Python
>>> from multimolecule import RnaTokenizer, SpliceAiModel

>>> tokenizer = RnaTokenizer.from_pretrained("multimolecule/spliceai")
>>> model = SpliceAiModel.from_pretrained("multimolecule/spliceai")
>>> output = model(tokenizer("agcagucauuauggcgaa", return_tensors="pt")["input_ids"])

>>> output.keys()
odict_keys(['logits'])

>>> output.logits.squeeze()
tensor([[ 8.5123, -4.9607, -7.6787],
        [ 8.6559, -4.4936, -8.6357],
        [ 5.8514, -1.9375, -6.8030],
        [ 7.3739, -5.3444, -5.2559],
        [ 8.6336, -5.3187, -7.5741],
        [ 6.1947, -1.5497, -7.6286],
        [ 9.0482, -6.1002, -7.1229],
        [ 7.9647, -5.6973, -6.5327],
        [ 8.8795, -6.3714, -7.0204],
        [ 7.9459, -5.4744, -6.0865],
        [ 8.4272, -5.2556, -7.9027],
        [ 7.7523, -5.8517, -6.9109],
        [ 7.3027, -4.6946, -5.9420],
        [ 8.1432, -4.3085, -7.7892],
        [ 7.9060, -4.9454, -7.0091],
        [ 8.9770, -5.3971, -7.3313],
        [ 8.4292, -5.7455, -6.7811],
        [ 8.2709, -6.1388, -6.6784]], grad_fn=<SqueezeBackward0>)
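
The logits are unnormalized scores over the three per-nucleotide classes; a softmax converts them to probabilities. The sketch below, continuing from the session above, assumes the conventional SpliceAI channel order of (neither, acceptor, donor), which should be verified against the checkpoint:

Python
import torch

# Continuing from the session above: output.logits has shape (1, length, 3).
probs = torch.softmax(output.logits.squeeze(0), dim=-1)

# Channel order assumed to be (neither, acceptor, donor).
neither, acceptor, donor = probs.unbind(dim=-1)

# Positions with the highest acceptor and donor probabilities.
print(acceptor.argmax().item(), donor.argmax().item())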

Training Details

SpliceAI was trained to predict the locations of splice donor and acceptor sites from the primary pre-mRNA sequence.

Training Data

The SpliceAI model was trained on human reference transcripts obtained from GENCODE (release 24, GRCh38). This dataset comprises both protein-coding and non-protein-coding transcripts.

For training, a sequence window of 10,000 base pairs (bp) was used for each nucleotide whose splicing status was to be predicted, including 5,000 bp upstream and 5,000 bp downstream. Sequences near transcript ends were padded with ‘N’ (unknown nucleotide) characters to maintain a consistent input length. Annotated splice donor and acceptor sites from GENCODE served as positive labels for their respective classes. All other intronic and exonic positions within these transcripts were considered negative (non-splice site) labels.
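
As an illustration of this windowing scheme, the sketch below extracts a fixed-length window around a target nucleotide and pads with 'N' where the window extends past the transcript ends (the function name and the 5,000 bp flank parameter are illustrative):

Python
def extract_window(sequence: str, position: int, flank: int = 5000) -> str:
    """Return the (2 * flank + 1)-nt window centred on position, padded with 'N'."""
    start, end = position - flank, position + flank + 1
    left_pad = "N" * max(0, -start)
    right_pad = "N" * max(0, end - len(sequence))
    return left_pad + sequence[max(0, start):min(len(sequence), end)] + right_pad

# A position near the start of a short transcript yields a mostly 'N'-padded window.
assert len(extract_window("ACGU" * 10, position=2)) == 10001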

The data was partitioned by chromosome: Chromosomes 1-19, X, and Y were designated for the training set. Chromosome 20 was reserved as a test set. A validation set, comprising 5% of transcripts from each training chromosome, was used for model selection and to monitor for overfitting. Positions within 50 bp of a masked interval (an interval of >10 ‘N’s) or within 50 bp of a transcript end were excluded from the training and validation datasets.

To address class imbalance, training examples were weighted such that the total loss contribution from positive examples (acceptor or donor sites) equaled that from negative examples (non-splice sites). Within positive examples, acceptor and donor sites were weighted equally.
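
In PyTorch, such a weighting scheme can be realized with per-class weights in the cross-entropy loss. A minimal sketch, assuming classes indexed as (neither, acceptor, donor) and hypothetical label counts:

Python
import torch
from torch import nn

# Hypothetical label counts: non-splice sites, acceptors, donors.
n_neg, n_acc, n_don = 9_000_000.0, 50_000.0, 50_000.0

# Each positive class receives half the total weight of the negative class, so
# positives and negatives contribute equally to the loss overall, and acceptors
# and donors contribute equally to each other.
weight = torch.tensor([1.0, n_neg / (2 * n_acc), n_neg / (2 * n_don)])

# Accepts logits of shape (batch, 3, length) against integer labels of shape (batch, length).
criterion = nn.CrossEntropyLoss(weight=weight)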

Training Procedure

Pre-training

The model was trained to minimize a cross-entropy loss, comparing its predicted splice site probabilities against the ground truth labels from GENCODE.

  • Batch Size: 64
  • Epochs: 4
  • Optimizer: Adam
  • Learning rate: 1e-3
  • Learning rate scheduler: Exponential
  • Minimum learning rate: 1e-5
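
Assuming a standard PyTorch training loop, the hyperparameters above might translate into a set-up like the following; the per-epoch decay factor is illustrative, as the model card does not specify it:

Python
import torch
from multimolecule import SpliceAiConfig, SpliceAiModel

model = SpliceAiModel(SpliceAiConfig())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Exponential decay; gamma = 0.5 is an illustrative choice.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.5)

for epoch in range(4):
    ...  # forward/backward passes over the training set go here
    scheduler.step()
    # Enforce the minimum learning rate.
    for group in optimizer.param_groups:
        group["lr"] = max(group["lr"], 1e-5)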

Citation

BibTeX:

BibTeX
@article{jaganathan2019the,
  abstract  = {The splicing of pre-mRNAs into mature transcripts is remarkable for its precision, but the mechanisms by which the cellular machinery achieves such specificity are incompletely understood. Here, we describe a deep neural network that accurately predicts splice junctions from an arbitrary pre-mRNA transcript sequence, enabling precise prediction of noncoding genetic variants that cause cryptic splicing. Synonymous and intronic mutations with predicted splice-altering consequence validate at a high rate on RNA-seq and are strongly deleterious in the human population. De novo mutations with predicted splice-altering consequence are significantly enriched in patients with autism and intellectual disability compared to healthy controls and validate against RNA-seq in 21 out of 28 of these patients. We estimate that 9\%-11\% of pathogenic mutations in patients with rare genetic disorders are caused by this previously underappreciated class of disease variation.},
  author    = {Jaganathan, Kishore and Kyriazopoulou Panagiotopoulou, Sofia and McRae, Jeremy F and Darbandi, Siavash Fazel and Knowles, David and Li, Yang I and Kosmicki, Jack A and Arbelaez, Juan and Cui, Wenwu and Schwartz, Grace B and Chow, Eric D and Kanterakis, Efstathios and Gao, Hong and Kia, Amirali and Batzoglou, Serafim and Sanders, Stephan J and Farh, Kyle Kai-How},
  copyright = {http://www.elsevier.com/open-access/userlicense/1.0/},
  journal   = {Cell},
  keywords  = {artificial intelligence; deep learning; genetics; splicing},
  language  = {en},
  month     = jan,
  number    = 3,
  pages     = {535--548.e24},
  publisher = {Elsevier BV},
  title     = {Predicting splicing from primary sequence with deep learning},
  volume    = 176,
  year      = 2019
}

Contact

Please use the GitHub issues of MultiMolecule for any questions or comments on the model card.

Please contact the authors of the SpliceAI paper for questions or comments on the paper/model.

License

This model is licensed under the AGPL-3.0 License and the CC-BY-NC-4.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later AND CC-BY-NC-4.0

multimolecule.models.spliceai

RnaTokenizer

Bases: Tokenizer

Tokenizer for RNA sequences.

Parameters:

alphabet (Alphabet | str | List[str] | None, default: None)
    Alphabet to use for tokenization.
      • If None, the standard RNA alphabet will be used.
      • If a string, it should correspond to the name of a predefined alphabet: standard, extended, streamline, or nucleobase.
      • If an Alphabet or a list of characters, that specific alphabet will be used.

nmers (int, default: 1)
    Size of k-mer to tokenize.

codon (bool, default: False)
    Whether to tokenize into codons.

replace_T_with_U (bool, default: True)
    Whether to replace T with U.

do_upper_case (bool, default: True)
    Whether to convert input to uppercase.

Examples:

Python Console Session
>>> from multimolecule import RnaTokenizer
>>> tokenizer = RnaTokenizer()
>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')["input_ids"]
[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]
>>> tokenizer('acgu')["input_ids"]
[1, 6, 7, 8, 9, 2]
>>> tokenizer('acgt')["input_ids"]
[1, 6, 7, 8, 9, 2]
>>> tokenizer = RnaTokenizer(replace_T_with_U=False)
>>> tokenizer('acgt')["input_ids"]
[1, 6, 7, 8, 3, 2]
>>> tokenizer = RnaTokenizer(nmers=3)
>>> tokenizer('uagcuuauc')["input_ids"]
[1, 83, 17, 64, 49, 96, 84, 22, 2]
>>> tokenizer = RnaTokenizer(codon=True)
>>> tokenizer('uagcuuauc')["input_ids"]
[1, 83, 49, 22, 2]
>>> tokenizer('uagcuuauca')["input_ids"]
Traceback (most recent call last):
ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10
Source code in multimolecule/tokenisers/rna/tokenization_rna.py
Python
class RnaTokenizer(Tokenizer):
    """
    Tokenizer for RNA sequences.

    Args:
        alphabet: alphabet to use for tokenization.

            - If is `None`, the standard RNA alphabet will be used.
            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include
                + `standard`
                + `extended`
                + `streamline`
                + `nucleobase`
            - If is an alphabet or a list of characters, that specific alphabet will be used.
        nmers: Size of kmer to tokenize.
        codon: Whether to tokenize into codons.
        replace_T_with_U: Whether to replace T with U.
        do_upper_case: Whether to convert input to uppercase.

    Examples:
        >>> from multimolecule import RnaTokenizer
        >>> tokenizer = RnaTokenizer()
        >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')["input_ids"]
        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]
        >>> tokenizer('acgu')["input_ids"]
        [1, 6, 7, 8, 9, 2]
        >>> tokenizer('acgt')["input_ids"]
        [1, 6, 7, 8, 9, 2]
        >>> tokenizer = RnaTokenizer(replace_T_with_U=False)
        >>> tokenizer('acgt')["input_ids"]
        [1, 6, 7, 8, 3, 2]
        >>> tokenizer = RnaTokenizer(nmers=3)
        >>> tokenizer('uagcuuauc')["input_ids"]
        [1, 83, 17, 64, 49, 96, 84, 22, 2]
        >>> tokenizer = RnaTokenizer(codon=True)
        >>> tokenizer('uagcuuauc')["input_ids"]
        [1, 83, 49, 22, 2]
        >>> tokenizer('uagcuuauca')["input_ids"]
        Traceback (most recent call last):
        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10
    """

    model_input_names = ["input_ids", "attention_mask"]

    def __init__(
        self,
        alphabet: Alphabet | str | List[str] | None = None,
        nmers: int = 1,
        codon: bool = False,
        replace_T_with_U: bool = True,
        do_upper_case: bool = True,
        additional_special_tokens: List | Tuple | None = None,
        **kwargs,
    ):
        if codon and (nmers > 1 and nmers != 3):
            raise ValueError("Codon and nmers cannot be used together.")
        if codon:
            nmers = 3  # set to 3 to get correct vocab
        if not isinstance(alphabet, Alphabet):
            alphabet = get_alphabet(alphabet, nmers=nmers)
        super().__init__(
            alphabet=alphabet,
            nmers=nmers,
            codon=codon,
            replace_T_with_U=replace_T_with_U,
            do_upper_case=do_upper_case,
            additional_special_tokens=additional_special_tokens,
            **kwargs,
        )
        self.replace_T_with_U = replace_T_with_U
        self.nmers = nmers
        self.codon = codon

    def _tokenize(self, text: str, **kwargs):
        if self.do_upper_case:
            text = text.upper()
        if self.replace_T_with_U:
            text = text.replace("T", "U")
        if self.codon:
            if len(text) % 3 != 0:
                raise ValueError(
                    f"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}"
                )
            return [text[i : i + 3] for i in range(0, len(text), 3)]
        if self.nmers > 1:
            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203
        return list(text)

SpliceAiConfig

Bases: PreTrainedConfig

This is the configuration class to store the configuration of a SpliceAiModel. It is used to instantiate a SpliceAI model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the SpliceAI Illumina/SpliceAI architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Parameters:

vocab_size (int, default: 4)
    Vocabulary size of the SpliceAI model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling SpliceAiModel.

context (int, default: 10000)
    The length of the context window. The input sequence will be padded with zeros of length context // 2 on each side.

hidden_size (int, default: 32)
    Dimensionality of the encoder layers.

stages (list[SpliceAiStageConfig] | None, default: None)
    Configuration for each stage in the SpliceAI model. Each stage is a SpliceAiStageConfig object.

hidden_act (str, default: 'gelu')
    The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "silu" and "gelu_new" are supported.

hidden_dropout (float, default: 0.1)
    The dropout probability for all convolution layers in the encoder.

batch_norm_eps (float, default: 0.001)
    The epsilon used by the batch normalization layers.

batch_norm_momentum (float, default: 0.01)
    The momentum used by the batch normalization layers.

num_labels (int, default: 3)
    Number of per-nucleotide output labels: neither, acceptor, and donor.

output_contexts (bool, default: False)
    Whether to output the context vectors for each stage.

Examples:

Python Console Session
>>> from multimolecule import SpliceAiConfig, SpliceAiModel
>>> # Initializing a SpliceAI multimolecule/spliceai style configuration
>>> configuration = SpliceAiConfig()
>>> # Initializing a model (with random weights) from the multimolecule/spliceai style configuration
>>> model = SpliceAiModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
Source code in multimolecule/models/spliceai/configuration_spliceai.py
Python
class SpliceAiConfig(PreTrainedConfig):
    r"""
    This is the configuration class to store the configuration of a
    [`SpliceAiModel`][multimolecule.models.SpliceAiModel]. It is used to instantiate a SpliceAI model according to the
    specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a
    similar configuration to that of the SpliceAI [Illumina/SpliceAI](https://github.com/Illumina/SpliceAI)
    architecture.

    Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to
    control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]
    for more information.

    Args:
        vocab_size:
            Vocabulary size of the SpliceAI model. Defines the number of different tokens that can be represented by the
            `inputs_ids` passed when calling [`SpliceAiModel`].
            Defaults to 4.
        context:
            The length of the context window. The input sequence will be padded with zeros of length `context // 2` on
            each side.
        hidden_size:
            Dimensionality of the encoder layers.
        stages:
            Configuration for each stage in the SpliceAI model. Each stage is a [`SpliceAiStageConfig`] object.
        hidden_act:
            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
            `"relu"`, `"silu"` and `"gelu_new"` are supported.
        hidden_dropout:
            The dropout probability for all convolution layers in the encoder.
        batch_norm_eps:
            The epsilon used by the batch normalization layers.
        batch_norm_momentum:
            The momentum used by the batch normalization layers.
        output_contexts:
            Whether to output the context vectors for each stage.

    Examples:
        >>> from multimolecule import SpliceAiConfig, SpliceAiModel
        >>> # Initializing a SpliceAI multimolecule/spliceai style configuration
        >>> configuration = SpliceAiConfig()
        >>> # Initializing a model (with random weights) from the multimolecule/spliceai style configuration
        >>> model = SpliceAiModel(configuration)
        >>> # Accessing the model configuration
        >>> configuration = model.config
    """

    model_type = "spliceai"

    def __init__(
        self,
        vocab_size: int = 4,
        context: int = 10000,
        hidden_size: int = 32,
        stages: list[SpliceAiStageConfig] | None = None,
        hidden_act: str = "gelu",
        hidden_dropout: float = 0.1,
        batch_norm_eps: float = 1e-3,
        batch_norm_momentum: float = 0.01,
        num_labels: int = 3,
        output_contexts: bool = False,
        **kwargs,
    ):
        super().__init__(**kwargs)
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.context = context
        if stages is None:
            stages = [
                SpliceAiStageConfig(num_blocks=4, kernel_size=11),
                SpliceAiStageConfig(num_blocks=4, kernel_size=11, dilation=4),
                SpliceAiStageConfig(num_blocks=4, kernel_size=21, dilation=10),
                SpliceAiStageConfig(num_blocks=4, kernel_size=41, dilation=25),
            ]
        self.stages = stages
        self.hidden_act = hidden_act
        self.hidden_dropout = hidden_dropout
        self.batch_norm_eps = batch_norm_eps
        self.batch_norm_momentum = batch_norm_momentum
        self.num_labels = num_labels
        self.output_contexts = output_contexts
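
The default stages give the network its 10,000-nt receptive field: assuming each residual block contains two dilated convolutions, as in the original SpliceAI residual unit, the field spans 4 × 2 × (10·1 + 10·4 + 20·10 + 40·25) = 10,000 nt, matching the default context. A hypothetical smaller variant can be configured by passing custom stages; the import path for SpliceAiStageConfig is assumed from the source file above:

Python
from multimolecule import SpliceAiConfig, SpliceAiModel
# Import path assumed from the configuration source file shown above.
from multimolecule.models.spliceai.configuration_spliceai import SpliceAiStageConfig

# Two stages with a combined receptive field of 4 * 2 * (10 * 1 + 10 * 4) = 400 nt,
# so the context is set to 400 to match.
config = SpliceAiConfig(
    context=400,
    stages=[
        SpliceAiStageConfig(num_blocks=4, kernel_size=11),
        SpliceAiStageConfig(num_blocks=4, kernel_size=11, dilation=4),
    ],
)
model = SpliceAiModel(config)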

SpliceAiModel

Bases: SpliceAiPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import SpliceAiConfig, SpliceAiModel, RnaTokenizer
>>> config = SpliceAiConfig()
>>> model = SpliceAiModel(config)
>>> tokenizer = RnaTokenizer.from_pretrained("multimolecule/spliceai")
>>> input = tokenizer("ACGUN", return_tensors="pt")
>>> output = model(**input)
>>> output["logits"].shape
torch.Size([1, 5, 3])
Source code in multimolecule/models/spliceai/modeling_spliceai.py
Python
class SpliceAiModel(SpliceAiPreTrainedModel):
    """
    Examples:
        >>> from multimolecule import SpliceAiConfig, SpliceAiModel, RnaTokenizer
        >>> config = SpliceAiConfig()
        >>> model = SpliceAiModel(config)
        >>> tokenizer = RnaTokenizer.from_pretrained("multimolecule/spliceai")
        >>> input = tokenizer("ACGUN", return_tensors="pt")
        >>> output = model(**input)
        >>> output["logits"].shape
        torch.Size([1, 5, 3])
    """

    def __init__(self, config: SpliceAiConfig):
        super().__init__(config)
        self.embeddings = SpliceAiEmbedding(config)
        self.networks = nn.ModuleList([SpliceAiModule(config) for _ in range(5)])

    def forward(
        self,
        input_ids: Tensor | NestedTensor | None = None,
        attention_mask: Tensor | None = None,
        inputs_embeds: Tensor | NestedTensor | None = None,
        output_contexts: bool | None = None,
        output_hidden_states: bool | None = None,
        return_dict: bool | None = None,
        **kwargs,
    ) -> SpliceAiModelOutput | Tuple[Tensor, Tuple[Tensor, ...]] | Tensor:
        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
        elif input_ids is None and inputs_embeds is None:
            raise ValueError("You have to specify either input_ids or inputs_embeds")

        output_contexts = output_contexts if output_contexts is not None else self.config.output_contexts
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        if isinstance(input_ids, NestedTensor):
            input_ids, attention_mask = input_ids.tensor, input_ids.mask

        embedding_output = self.embeddings(
            input_ids=input_ids,
            attention_mask=attention_mask,
            inputs_embeds=inputs_embeds,
        )

        all_outputs = [
            module(
                embedding_output,
                output_contexts=output_contexts,
                output_hidden_states=output_hidden_states,
                return_dict=return_dict,
            )
            for module in self.networks
        ]

        if not return_dict:
            return tuple(average_output(output) for output in zip(*all_outputs))

        outputs: Dict = {k: [outputs[k] for outputs in all_outputs] for k in all_outputs[0]}
        for key, output in outputs.items():
            outputs[key] = average_output(output)

        return SpliceAiModelOutput(**outputs)
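
Because the model scores every nucleotide, a common downstream use, following the SpliceAI paper, is to quantify a variant's effect as the change in splice-site probabilities between a reference and an alternative sequence. A minimal sketch of this delta-score idea; the helper function and the toy variant are illustrative, not part of the multimolecule API:

Python
import torch
from multimolecule import RnaTokenizer, SpliceAiModel

tokenizer = RnaTokenizer.from_pretrained("multimolecule/spliceai")
model = SpliceAiModel.from_pretrained("multimolecule/spliceai")

def splice_probs(sequence: str) -> torch.Tensor:
    """Per-nucleotide class probabilities, assuming (neither, acceptor, donor) channel order."""
    with torch.no_grad():
        logits = model(tokenizer(sequence, return_tensors="pt")["input_ids"]).logits
    return torch.softmax(logits.squeeze(0), dim=-1)

ref = "agcagucauuauggcgaa"
alt = "agcagucuuuauggcgaa"  # hypothetical single-nucleotide variant

# Delta score: the largest gain in acceptor or donor probability caused by the variant.
delta = (splice_probs(alt) - splice_probs(ref))[:, 1:].max()
print(f"delta score: {delta.item():.4f}")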