MTSplice¶
Tissue-specific modeling of the effects of genetic variants on splicing.
Disclaimer¶
This is an UNOFFICIAL implementation of the paper "MTSplice predicts effects of genetic variants on tissue-specific splicing" by Jun Cheng et al.
The OFFICIAL repository of MTSplice is at gagneurlab/MMSplice_MTSplice.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.
The team that released MTSplice did not write a model card for it, so this model card was written by the MultiMolecule team.
Model Details¶
MTSplice is the tissue-specific second generation of MMSplice. It predicts the effect of genetic variants on cassette-exon splicing across 56 GTEx tissues. The cassette exon together with its flanking introns is fed into two parallel sequence towers whose outputs are combined into a per-tissue delta-logit-PSI splicing-effect vector. Please refer to the Training Details section for more information on the training process.
MTSplice is distributed as a deep four-member ensemble (mtsplice_deep0..3) and an earlier eight-member ensemble (mtsplice0..7). The default deep-family model is represented as a single deterministic model based on mtsplice_deep0.
Model Specification¶
| Num Blocks | Hidden Size | Num Tissues | Num Parameters | FLOPs (M) | MACs (M) |
|---|---|---|---|---|---|
| 8 | 64 | 56 | 210,840 | 164.36 | 80.90 |
(Num Blocks is per tower; FLOPs and MACs measured on an 800 bp cassette-exon-with-flanks input.)
Links¶
- Code: multimolecule.mtsplice
- Weights: multimolecule/mtsplice
- Data: GTEx cassette-exon PSI quantifications across 56 tissues
- Paper: MTSplice predicts effects of genetic variants on tissue-specific splicing
- Developed by: Jun Cheng, Muhammed Hasan Çelik, Anshul Kundaje, Julien Gagneur
- Model type: Two parallel dilated 1D CNN towers with positional B-spline re-weighting for tissue-specific delta-logit-PSI prediction
- Original Repository: gagneurlab/MMSplice_MTSplice
Usage¶
The model file depends on the multimolecule library. You can install it using pip: `pip install multimolecule`.
Direct Use¶
Tissue Scores¶
Variant Effect¶
Interface¶
- Input length: cassette exon with flanking intronic context (typical ~800 bp)
- Output (reference-only call, `input_ids` / `inputs_embeds`): per-tissue score vector `logits` of shape `(batch_size, 56)`
- Reference + alternative call (also pass `alternative_input_ids` / `alternative_inputs_embeds`): additionally returns `alternative_logits` and per-tissue `delta_logits = alternative_logits - logits`
- `MtSpliceForSequencePrediction`: returns per-tissue deltas (or per-tissue scores when no alternative is supplied); applies standard regression loss when labels are provided
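The relationship between the reference and alternative outputs can be sketched in plain Python. This is only an illustration with toy numbers, not the library's code: `delta_logits` and `delta_psi` are hypothetical helper names, and the second helper assumes a reference PSI is known from data so that a delta-logit-PSI can be turned into a PSI change via the logit definition.

```python
import math


def delta_logits(reference_logits, alternative_logits):
    """Per-tissue effect: alternative score minus reference score."""
    return [alt - ref for ref, alt in zip(reference_logits, alternative_logits)]


def delta_psi(reference_psi, delta_logit_psi):
    """Shift a known reference PSI by a delta-logit-PSI and return the PSI change."""
    logit_ref = math.log(reference_psi / (1.0 - reference_psi))
    psi_alt = 1.0 / (1.0 + math.exp(-(logit_ref + delta_logit_psi)))
    return psi_alt - reference_psi


# Toy scores for three tissues (illustrative numbers, not model output).
ref = [0.25, -0.5, 1.0]
alt = [0.75, -0.5, 0.0]
deltas = delta_logits(ref, alt)
print(deltas)

# A delta of zero leaves PSI unchanged; a positive delta raises it.
print(delta_psi(0.5, 0.0), delta_psi(0.5, 1.0))
```

In the real model each vector would have 56 entries, one per GTEx tissue.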
Training Details¶
MTSplice was trained to predict tissue-specific percent-spliced-in (PSI) of cassette exons across GTEx tissues, building on the MMSplice modular splicing model with an added tissue-specific neural module.
Training Data¶
MTSplice was trained on cassette-exon PSI quantifications across 56 GTEx tissues, together with the human reference splice-site and exon sequence context. The variant-effect predictions were validated against tissue-specific splicing quantitative trait loci (sQTL) and MPRA exon-skipping data.
Training Procedure¶
Pre-training¶
The two sequence towers consume one-hot encoded DNA. A dilated-convolution stack with positional B-spline re-weighting extracts splicing features, which a dense head maps to per-tissue delta-logit-PSI. The tissue-resolved predictions are formed from the reference/alternative score deltas.
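As a rough illustration of the one-hot input mentioned above, the following sketch encodes a DNA string into four channels. The channel order `ACGT` and the all-zero row for unknown bases are assumptions made for this example; the actual order and handling of ambiguous bases are defined by the tokenizer and model code.

```python
# Channel order is an assumption for illustration only.
CHANNELS = "ACGT"


def one_hot(sequence):
    """Encode a DNA string as rows of four floats; unknown bases become all zeros."""
    rows = []
    for base in sequence.upper():
        rows.append([1.0 if base == channel else 0.0 for channel in CHANNELS])
    return rows


encoded = one_hot("acgtn")
for base, row in zip("ACGTN", encoded):
    print(base, row)
```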
Citation¶
Note
The artifacts distributed in this repository are part of the MultiMolecule project. If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:
| BibTeX | |
|---|---|
Contact¶
Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the MTSplice paper for questions or comments on the paper/model.
License¶
This model implementation is licensed under the GNU Affero General Public License.
For additional terms and clarifications, please refer to our License FAQ.
multimolecule.models.mtsplice¶
DnaTokenizer¶
Bases: Tokenizer
Tokenizer for DNA sequences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | Alphabet \| str \| List[str] \| None | alphabet to use for tokenization. | None |
| | int | Size of kmer to tokenize. | 1 |
| | bool | Whether to tokenize into codons. | False |
| | bool | Whether to replace U with T. | True |
| | bool | Whether to convert input to uppercase. | True |
Examples:
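The following is a minimal pure-Python sketch of two of the options described above (uppercasing and U-to-T replacement) with the default k-mer size of 1. It is not the library's implementation; `normalize_dna` and its parameter names are hypothetical and chosen only for this illustration.

```python
def normalize_dna(sequence, replace_u_with_t=True, uppercase=True):
    """Hypothetical helper mirroring two tokenizer options described above."""
    if uppercase:
        sequence = sequence.upper()
    if replace_u_with_t:
        # Handle both cases so the option also works when uppercasing is off.
        sequence = sequence.replace("U", "T").replace("u", "t")
    # k-mer size 1: one token per nucleotide.
    return list(sequence)


tokens = normalize_dna("acgu")
print(tokens)
print(normalize_dna("acgu", uppercase=False))
```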
Source code in multimolecule/tokenisers/dna/tokenization_dna.py
MtSpliceConfig¶
Bases: PreTrainedConfig
This is the configuration class to store the configuration of a
MtSpliceModel. It is used to instantiate an MTSplice model according to the
specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a
similar configuration to that of the MTSplice
gagneurlab/MMSplice_MTSplice architecture.
Configuration objects inherit from PreTrainedConfig and can be used to
control the model outputs. Read the documentation from PreTrainedConfig
for more information.
MTSplice (Cheng et al. 2021) is the tissue-specific second generation of MMSplice. It scores a cassette exon together with its flanking introns through two parallel dilated-convolution towers: an acceptor (3’ splice site) tower over the upstream region and a donor (5’ splice site) tower over the downstream region. The two towers are positionally re-weighted by B-spline transformations, pooled, and combined by a small dense head into a tissue-resolved delta-logit-PSI splicing-effect score across 56 GTEx tissues.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | int | Vocabulary size of the MTSplice model. Defines the number of feature channels derived from the one-hot encoded | 4 |
| | int | Number of convolution filters in the two sequence towers. | 64 |
| | int | Kernel size of the first (stem) convolution in each tower. | 11 |
| | int | Number of residual dilated-convolution blocks per tower. | 8 |
| | int | Kernel size of the residual dilated-convolution blocks. | 3 |
| | int | Base of the exponentially growing dilation rate; block | 2 |
| | int | Length (in bp) of the acceptor (3’ splice site) input region, intron overhang plus exon flank. | 400 |
| | int | Length (in bp) of the donor (5’ splice site) input region, exon flank plus intron overhang. | 400 |
| | int | Number of B-spline bases used by the positional re-weighting layers. | 10 |
| | int | Polynomial degree of the B-spline bases. | 3 |
| | int | Hidden size of the dense head that maps pooled features to tissue scores. | 32 |
| | str | The non-linear activation function in the convolution towers and the dense head. | 'relu' |
| | float | The epsilon used by the batch normalization layers. Defaults to 0.001 to match the upstream Keras | 0.001 |
| | float | The dropout probability applied before the tissue projection. | 0.5 |
| | int | Number of tissue outputs. MTSplice predicts delta-logit-PSI for the 56 GTEx tissues, so this defaults to 56. | 56 |
Examples:
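To give a feel for how the convolution defaults above interact, the sketch below estimates the receptive field of one tower. It rests on two assumptions that the text here does not confirm: each block contains a single dilated convolution, and block `i` uses dilation `dilation_base ** i`. The function and argument names are hypothetical and are not the configuration's attribute names.

```python
def receptive_field(stem_kernel=11, num_blocks=8, block_kernel=3, dilation_base=2):
    """Receptive field (in bp) of a stem convolution followed by dilated convolutions.

    Assumes one convolution per block and dilation = dilation_base ** block_index.
    """
    field = stem_kernel
    for block_index in range(num_blocks):
        dilation = dilation_base ** block_index
        # Each dilated convolution widens the field by (kernel - 1) * dilation.
        field += (block_kernel - 1) * dilation
    return field


print(receptive_field())
```

Under these assumptions the defaults give a receptive field wider than a 400 bp tower input; a different dilation schedule or block layout would change the number.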
Source code in multimolecule/models/mtsplice/configuration_mtsplice.py
MtSpliceForSequencePrediction¶
Bases: MtSplicePreTrainedModel
MTSplice with sequence-level regression loss support.
The wrapper returns the per-tissue score vector (or, when a reference and an alternative sequence are provided, the per-tissue score deltas) and applies a regression criterion when labels are supplied.
Examples:
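The regression criterion is not spelled out here. As a sketch only, and assuming it is a mean squared error over all samples and tissues, the loss on a toy batch could be computed as follows. `mse_loss` is a hypothetical helper, not the wrapper's actual criterion.

```python
def mse_loss(predictions, labels):
    """Mean squared error over every (sample, tissue) entry."""
    squared_errors = [
        (pred - label) ** 2
        for pred_row, label_row in zip(predictions, labels)
        for pred, label in zip(pred_row, label_row)
    ]
    return sum(squared_errors) / len(squared_errors)


# Toy batch: two samples, two tissues (the real model has 56 tissues).
predictions = [[0.5, -1.0], [0.0, 2.0]]
labels = [[0.0, -1.0], [1.0, 2.0]]
loss = mse_loss(predictions, labels)
print(loss)
```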
Source code in multimolecule/models/mtsplice/modeling_mtsplice.py
MtSpliceModel¶
Bases: MtSplicePreTrainedModel
The bare MTSplice tissue-specific backbone.
MTSplice scores a cassette exon together with its flanking introns with two parallel dilated-convolution towers (an acceptor tower over the upstream region and a donor tower over the downstream region), positionally re-weights each tower with B-spline transformations, pools, and combines the two towers into a per-tissue delta-logit-PSI vector. The backbone returns the per-tissue score vector. For variant-effect prediction, pass both a reference and an alternative sequence; the backbone then also returns the per-tissue deltas.
Examples:
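The two-tower layout can be illustrated by splitting an input into an upstream window and a downstream window. This sketch assumes the acceptor tower sees the leading 400 bp and the donor tower the trailing 400 bp of the input; how the actual model slices its input is not stated here, and `split_regions` is a hypothetical helper.

```python
def split_regions(sequence, acceptor_length=400, donor_length=400):
    """Hypothetical split: leading window for the acceptor tower, trailing window for the donor tower."""
    if len(sequence) < max(acceptor_length, donor_length):
        raise ValueError("sequence is shorter than a tower window")
    acceptor = sequence[:acceptor_length]
    donor = sequence[len(sequence) - donor_length:]
    return acceptor, donor


# Toy 800 bp input: upstream intron, exon, downstream intron.
sequence = "A" * 300 + "C" * 200 + "G" * 300
acceptor, donor = split_regions(sequence)
print(len(acceptor), len(donor))
```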
Source code in multimolecule/models/mtsplice/modeling_mtsplice.py
MtSpliceModelOutput dataclass¶
Bases: ModelOutput
Base class for outputs of the MTSplice tissue-specific model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| logits | `torch.FloatTensor` of shape `(batch_size, num_labels)` | The per-tissue delta-logit-PSI score vector for the (reference) input sequence, ordered as the 56 GTEx tissues (see | None |
| alternative_logits | `torch.FloatTensor` of shape `(batch_size, num_labels)`, *optional* | The per-tissue score vector for the alternative sequence, returned when an alternative sequence is provided. | None |
| delta_logits | `torch.FloatTensor` of shape `(batch_size, num_labels)`, *optional* | The per-tissue deltas `alternative_logits - logits`, returned when an alternative sequence is provided. | None |
Source code in multimolecule/models/mtsplice/modeling_mtsplice.py
MtSplicePreTrainedModel¶
Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.