SpTransformer¶
Transformer network for predicting tissue-specific splicing from pre-mRNA sequences.
Disclaimer¶
This is an UNOFFICIAL implementation of the paper "SpliceTransformer predicts tissue-specific splicing linked to human diseases" by Ningyuan You et al.
The OFFICIAL repository of SpliceTransformer (SpTransformer) is at ShenLab-Genomics/SpliceTransformer.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.
The team that released SpTransformer did not write a model card for it, so this model card was written by the MultiMolecule team.
Model Details¶
SpTransformer (SpliceTransformer) is a deep neural network that predicts tissue-specific splicing from primary pre-mRNA sequence. It combines two pretrained SpliceAI-style dilated-residual convolutional feature extractors with a trainable input-projection path; the concatenated features are processed by a Sinkhorn transformer attention block with axial positional embeddings. For each position the network predicts a 3-channel splice-site score (no-splice / acceptor / donor) and a per-position splice-site usage score across 15 human tissues. The model uses a fixed flanking context of 4,000 nucleotides on each side of every predicted position. SpTransformer is typically used to estimate the effect of genetic variants on tissue-specific splicing by scoring reference and alternate sequences and taking the difference. Please refer to the Training Details section for more information on the training process.
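The reference-versus-alternate comparison mentioned above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' scoring pipeline: `score_fn` is a hypothetical stand-in for a forward pass that returns one row of channel scores per position, `toy_score_fn` is a placeholder with no biological meaning, and summarising the difference by its largest absolute value is an assumption.

```python
from typing import Callable, List

Scores = List[List[float]]  # one row of channel scores per sequence position


def apply_snv(sequence: str, index: int, alt: str) -> str:
    """Return a copy of `sequence` with the base at `index` replaced by `alt`."""
    if not 0 <= index < len(sequence):
        raise IndexError("variant index outside the sequence")
    return sequence[:index] + alt + sequence[index + 1:]


def delta_scores(ref: Scores, alt: Scores) -> Scores:
    """Element-wise difference (alternate minus reference)."""
    return [
        [a - r for r, a in zip(ref_row, alt_row)]
        for ref_row, alt_row in zip(ref, alt)
    ]


def variant_effect(
    sequence: str, index: int, alt: str, score_fn: Callable[[str], Scores]
) -> float:
    """Score both alleles and summarise the change (assumed: max absolute delta)."""
    ref_scores = score_fn(sequence)
    alt_scores = score_fn(apply_snv(sequence, index, alt))
    delta = delta_scores(ref_scores, alt_scores)
    return max(abs(value) for row in delta for value in row)


def toy_score_fn(sequence: str) -> Scores:
    # Hypothetical placeholder model: one channel, 1.0 wherever the base is "G".
    return [[1.0 if base == "G" else 0.0] for base in sequence]


effect = variant_effect("ACGTAC", 2, "A", toy_score_fn)
```

In practice `score_fn` would wrap the real model, and the summary statistic would be chosen to match the downstream analysis.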
Model Specification¶
| Num Layers | Hidden Size | Num Heads | Intermediate Size | Max Seq Len | Num Parameters (M) | FLOPs (G) | MACs (G) | Context |
|---|---|---|---|---|---|---|---|---|
| 8 | 256 | 8 | 1024 | 8192 | 17.07 | 290.72 | 144.65 | 4000 |
Links¶
- Code: multimolecule.sptransformer
- Weights: multimolecule/sptransformer
- Data: GTEx human RNA-seq across 15 tissues with gene annotations from GENCODE and multi-species sequence data
- Paper: SpliceTransformer predicts tissue-specific splicing linked to human diseases
- Developed by: Ningyuan You, Chang Liu, Yuxin Gu, Rong Wang, Hanying Jia, Tianyun Zhang, Song Jiang, Jinsong Shi, Ming Chen, Min-Xin Guan, Siqi Sun, Shanshan Pei, Zhihong Liu, Ning Shen
- Model type: Transformer encoder with windowed-local and Sinkhorn sorted-bucket attention for tissue-specific splicing prediction
- Original Repository: ShenLab-Genomics/SpliceTransformer
Usage¶
The model file depends on the multimolecule library. You can install it using pip: `pip install multimolecule`.
Direct Use¶
RNA Splicing Site Prediction¶
You can use this model directly to predict per-nucleotide tissue-specific splicing of a pre-mRNA sequence.
The logits tensor reproduces the original SpTransformer output: a 3-channel splice-site score (no-splice / acceptor / donor) and a per-tissue (15 tissues) splice-site usage score for each position.
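As a rough illustration of that layout, the sketch below splits one position's score vector into its splice-site and tissue-usage parts. It assumes the three splice-site channels come first and the 15 tissue channels follow; that ordering is an assumption here and should be checked against the model output. The helper name `split_channels` is hypothetical.

```python
from typing import List, Sequence, Tuple

NUM_SPLICE_LABELS = 3   # no-splice / acceptor / donor
NUM_TISSUES = 15        # per-tissue splice-site usage


def split_channels(position_scores: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Split one position's scores into (splice-site scores, tissue-usage scores).

    Assumes the splice-site channels precede the tissue channels.
    """
    expected = NUM_SPLICE_LABELS + NUM_TISSUES
    if len(position_scores) != expected:
        raise ValueError(f"expected {expected} channels, got {len(position_scores)}")
    splice = list(position_scores[:NUM_SPLICE_LABELS])
    usage = list(position_scores[NUM_SPLICE_LABELS:])
    return splice, usage


splice_scores, usage_scores = split_channels([float(i) for i in range(18)])
```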
Downstream Use¶
Token Prediction¶
You can fine-tune SpTransformer for per-nucleotide tissue-specific splicing regression with SpTransformerForTokenPrediction, which adds a shared token prediction head on top of the backbone.
Interface¶
- Input length: variable pre-mRNA sequence
- Flanking context: fixed 4,000 nt on each side of every predicted position
- Padding: ends padded with `N`
- Output: per-position 3-channel splice-site score (`no-splice`/`acceptor`/`donor`) + per-tissue (15 tissues) splice-site usage score
- Attention recording: opt-in via `output_attentions=True`; returns faithful sparse-attention maps (see Faithful Sparse-Attention Exposure)
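The padding rule can be pictured with a small helper. This is only a sketch of the idea: whether the library applies this padding for you is not stated here, and `pad_with_flank` is a hypothetical name, not part of multimolecule.

```python
CONTEXT = 4000  # flanking nucleotides on each side, as listed above


def pad_with_flank(sequence: str, context: int = CONTEXT, pad: str = "N") -> str:
    """Surround `sequence` with `context` copies of the unknown-nucleotide symbol."""
    if len(pad) != 1:
        raise ValueError("pad must be a single character")
    return pad * context + sequence.upper() + pad * context


padded = pad_with_flank("acgt")
```

With the default context, a sequence of length `n` becomes a string of length `n + 8000`.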
Faithful Sparse-Attention Exposure¶
SpTransformer’s attention block does not compute dense self-attention. Each layer
(`SpTransformerSelfAttention`)
splits its heads into two groups with fundamentally different sparse-attention structures:
- Windowed-local heads: each window of `bucket_size` tokens attends only to itself plus the immediately preceding and following window (a `look_backward=1`, `look_forward=1` look-around). Boundary positions are masked.
- Sinkhorn sorted-bucket heads: each query bucket attends to the concatenation of (a) one sorted / reordered key bucket selected by a parameter-free attention-sort net (`differentiable_topk(R, k=1)`) and (b) its own local bucket.
Because these two patterns operate on different key axes, there is no single dense (batch, heads,
sequence, sequence) tensor that faithfully represents the computation. Materialising a zero-filled
sequence x sequence grid would be a misleading interpretability artifact, so this model does not
expose one.
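To make the windowed-local pattern concrete, the sketch below lists, for each window, the absolute positions its keys come from, using `-1` for look-around columns that fall outside the sequence. The column order `[preceding | own | following]` is an assumption made for illustration, and the function is a plain-Python approximation rather than the model's code.

```python
from typing import List


def local_key_positions(
    sequence_length: int,
    bucket_size: int,
    look_backward: int = 1,
    look_forward: int = 1,
) -> List[List[int]]:
    """Absolute key positions seen by each window; -1 marks out-of-range columns."""
    if sequence_length % bucket_size != 0:
        raise ValueError("sequence_length must be a multiple of bucket_size")
    num_windows = sequence_length // bucket_size
    rows = []
    for window in range(num_windows):
        row: List[int] = []
        # Walk over the neighbouring windows, from the earliest to the latest.
        for offset in range(-look_backward, look_forward + 1):
            source = window + offset
            if 0 <= source < num_windows:
                start = source * bucket_size
                row.extend(range(start, start + bucket_size))
            else:
                row.extend([-1] * bucket_size)  # masked boundary columns
        rows.append(row)
    return rows


positions = local_key_positions(sequence_length=8, bucket_size=2)
```

Each row has `(look_backward + 1 + look_forward) * bucket_size` columns, so the key axis is independent of the total sequence length.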
Instead, attention recording is opt-in and faithful. Passing output_attentions=True (or setting
config.output_attentions=True) returns, for every attention layer, a
SpTransformerAttentionMap holding the actual softmax
weights used in the forward pass plus the indexing/permutation needed to map them back to absolute sequence
positions:
- `local_attentions` `(B, num_local_heads, num_windows, W, (look_backward + 1 + look_forward) * W)`: the real per-window softmax weights; padded look-around columns carry weight `0`.
- `local_key_positions` `(num_windows, (look_backward + 1 + look_forward) * W)`: absolute source position of every local key-axis column (`-1` marks padded columns).
- `sinkhorn_attentions` `(B, num_sinkhorn_heads, num_buckets, W, 2 * W)`: the real per-bucket softmax weights over the `[reordered-bucket | own-bucket]` key axis.
- `sinkhorn_reorder` `(B, num_sinkhorn_heads, num_buckets, num_buckets)`: the exact bucket-permutation matrix; for query bucket `u`, the nonzero column `v` of row `u` says the reordered key bucket (columns `0:W` of `sinkhorn_attentions`) is source bucket `v` (absolute positions `v*W : v*W + W`).
- scalar metadata: `bucket_size`, `look_backward`, `look_forward`, `num_local_heads`, `num_sinkhorn_heads`, `sequence_length`.
W is bucket_size; local heads come first along the head axis, Sinkhorn heads second. These are
structured block weights, not dense attention matrices — re-deriving the per-type attention output by
contracting these exact weights with the (block-gathered) values reproduces the layer output exactly.
Recording is opt-in, so the default forward path and its numerics are byte-for-byte unchanged.
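The mapping from a `sinkhorn_reorder` row back to absolute positions can be sketched as follows. The example works on a plain nested list for a single head and batch element. It assumes the own-bucket half of the key axis covers the query bucket's own positions `u*W : u*W + W`; the helper name `sinkhorn_key_positions` is hypothetical.

```python
from typing import List, Sequence


def sinkhorn_key_positions(
    query_bucket: int, reorder_row: Sequence[float], bucket_size: int
) -> List[int]:
    """Absolute positions behind the [reordered-bucket | own-bucket] key axis."""
    sources = [v for v, weight in enumerate(reorder_row) if weight != 0]
    if len(sources) != 1:
        raise ValueError("expected exactly one nonzero entry in the reorder row")
    source = sources[0]
    # Columns 0:W come from the selected source bucket v.
    reordered = list(range(source * bucket_size, (source + 1) * bucket_size))
    # Columns W:2W are assumed to be the query bucket's own positions.
    own = list(range(query_bucket * bucket_size, (query_bucket + 1) * bucket_size))
    return reordered + own


# Toy permutation over three buckets (rows are query buckets).
reorder = [
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
]
keys_for_bucket_0 = sinkhorn_key_positions(0, reorder[0], bucket_size=2)
keys_for_bucket_1 = sinkhorn_key_positions(1, reorder[1], bucket_size=2)
```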
Training Details¶
SpTransformer was trained to predict tissue-specific splicing from primary pre-mRNA sequence.
Training Data¶
SpTransformer was trained on splicing measurements derived from RNA-seq data across 15 human tissues, using gene annotations from GENCODE, together with multi-species sequence data.
The two convolutional feature extractors were pre-trained as SpliceAI-style splice-site predictors and remain trainable submodules for downstream fine-tuning.
For each predicted nucleotide, a sequence window centered on that nucleotide was used, with the flanking context padded with N (unknown nucleotide) when near transcript ends.
Training Procedure¶
Pre-training¶
The model was trained to minimize a combination of cross-entropy loss over splice-site classification and a regression loss over per-tissue splice-site usage, comparing predictions against measurements derived from RNA-seq.
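A combined objective of this kind can be sketched for a single position as below. The source only states that a cross-entropy term and a regression term are combined. Using mean squared error for the regression term and a scalar `usage_weight` to balance the two are assumptions made for illustration.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-softmax of the target class (numerically stabilised)."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def mean_squared_error(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Assumed regression term; the actual regression loss is not specified above."""
    if len(predicted) != len(observed):
        raise ValueError("length mismatch")
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted)


def combined_loss(
    splice_logits: Sequence[float],
    splice_label: int,
    usage_predicted: Sequence[float],
    usage_observed: Sequence[float],
    usage_weight: float = 1.0,  # hypothetical balancing factor
) -> float:
    classification = cross_entropy(splice_logits, splice_label)
    regression = mean_squared_error(usage_predicted, usage_observed)
    return classification + usage_weight * regression


loss = combined_loss([0.0, 0.0, 0.0], 1, [1.0, 2.0], [1.0, 4.0])
```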
Citation¶
Note
The artifacts distributed in this repository are part of the MultiMolecule project. If you use MultiMolecule in your research, you must cite the MultiMolecule project.
Contact¶
Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the SpliceTransformer paper for questions or comments on the paper/model.
License¶
This model implementation is licensed under the GNU Affero General Public License.
For additional terms and clarifications, please refer to our License FAQ.
multimolecule.models.sptransformer
¶
DnaTokenizer
¶
Bases: Tokenizer
Tokenizer for DNA sequences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | Alphabet \| str \| List[str] \| None | Alphabet to use for tokenization. | None |
| | int | Size of kmer to tokenize. | 1 |
| | bool | Whether to tokenize into codons. | False |
| | bool | Whether to replace U with T. | True |
| | bool | Whether to convert input to uppercase. | True |
Examples:
Source code in multimolecule/tokenisers/dna/tokenization_dna.py
SpTransformerConfig
¶
Bases: PreTrainedConfig
This is the configuration class to store the configuration of a
SpTransformerModel. It is used to instantiate a SpTransformer model
according to the specified arguments, defining the model architecture. Instantiating a configuration with the
defaults will yield a configuration similar to that of the SpliceTransformer
(ShenLab-Genomics/SpliceTransformer) architecture.
Configuration objects inherit from PreTrainedConfig and can be used to
control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | int | Vocabulary size of the SpTransformer model. Defines the number of different tokens that can be represented by the … | 5 |
| | int | The length of the context window. The encoder consumes … | 4000 |
| | int | Dimensionality of the trainable input-projection path. | 128 |
| | list[SpTransformerFeatureEncoderConfig] \| None | Configuration for each SpliceAI-style convolutional feature encoder. Each encoder is a … | None |
| | int | Dimensionality of the Sinkhorn transformer attention block. | 256 |
| | int | Number of layers in the Sinkhorn transformer attention block. | 8 |
| | int | Number of attention heads in the Sinkhorn transformer attention block. | 8 |
| | int | Number of attention heads that use local (windowed) attention instead of Sinkhorn attention. | 2 |
| | int | Dimensionality of the feed-forward layers in the attention block. | 1024 |
| | int | Token bucket size for Sinkhorn / local attention. | 64 |
| | int | Maximum sequence length consumed by the attention block. The concatenated features are center-cropped or padded to this length before the attention block. | 8192 |
| | int | Number of splice-site score channels predicted by the original output head (no-splice, acceptor, donor). | 3 |
| | int | Number of tissues for which per-position splice-site usage is predicted by the original output head. | 15 |
| | list[str] \| None | Names for the per-tissue splice-site usage channels. Defaults to … | None |
| | str | The non-linear activation function (function or string) in the SpliceAI-style feature encoders. | 'relu' |
| | str | The non-linear activation function (function or string) in the transformer feed-forward layers. | 'gelu' |
| | float | The epsilon used by the batch normalization layers. | 1e-05 |
| | float | The momentum used by the batch normalization layers. | 0.1 |
| | int | Number of output labels for the … | 15 |
| | HeadConfig \| None | Configuration for the … | None |
| | str \| None | Problem type for the token prediction head. | 'regression' |
| | bool | Whether to output the per-position attention-block representation. | False |
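The center-crop-or-pad behaviour described for the maximum sequence length can be sketched on a plain list. How an odd remainder is split between the two sides and which value is used for padding are assumptions here; `center_crop_or_pad` is a hypothetical helper, not the library's function.

```python
from typing import List, Sequence


def center_crop_or_pad(
    items: Sequence[float], target_length: int, pad_value: float = 0.0
) -> List[float]:
    """Return exactly `target_length` items, keeping the centre of `items`."""
    length = len(items)
    if length >= target_length:
        # Drop an equal amount from both ends (extra item dropped on the right).
        start = (length - target_length) // 2
        return list(items[start:start + target_length])
    # Pad both ends (extra pad item added on the right).
    missing = target_length - length
    left = missing // 2
    right = missing - left
    return [pad_value] * left + list(items) + [pad_value] * right


cropped = center_crop_or_pad([1, 2, 3, 4, 5, 6], target_length=4)
padded_features = center_crop_or_pad([1, 2], target_length=5)
```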
Examples:
Source code in multimolecule/models/sptransformer/configuration_sptransformer.py
SpTransformerFeatureEncoderConfig
¶
Bases: FlatDict
Configuration for a single SpliceAI-style convolutional feature encoder used by SpTransformer.
SpTransformer reuses two pre-trained dilated-residual convolutional encoders to extract per-position sequence features. Each encoder is a stack of dilated residual blocks; the feature map is taken before the encoder’s own output projections.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | | Number of channels in the encoder. | required |
Source code in multimolecule/models/sptransformer/configuration_sptransformer.py
SpTransformerAttentionMap
dataclass
¶
Bases: ModelOutput
Faithful, structured attention weights for one SpTransformer attention layer.
SpTransformer’s attention layer (SpTransformerSelfAttention) is not dense self-attention. It
splits the heads into two groups with fundamentally different sparse-attention structures, so there is no
single dense (batch, heads, seq, seq) tensor that faithfully represents the computation. Fabricating one
(e.g. by scattering the block weights into a zero-filled seq x seq grid) would be a misleading
interpretability artifact. Instead, this object exposes the actual softmax weights computed in the
forward pass for each attention type, plus the indexing/permutation needed to map them back to absolute
sequence positions.
Conventions: B = batch, S = sequence length, W = window_size = bucket_size,
num_windows = S // W, num_buckets = S // W. Local heads come first along the head axis,
Sinkhorn heads second, matching the split inside SpTransformerSelfAttention.
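Under these conventions, the shapes of the recorded tensors follow directly from the configuration. The sketch below computes them for the default values listed in the configuration table (sequence length 8192, bucket size 64, 8 heads of which 2 are local). It assumes the Sinkhorn heads are simply the remaining heads and that `look_backward = look_forward = 1`; the helper name is hypothetical.

```python
from typing import Dict, Tuple


def attention_map_shapes(
    batch: int,
    sequence_length: int,
    bucket_size: int,
    num_heads: int,
    num_local_heads: int,
    look_backward: int = 1,
    look_forward: int = 1,
) -> Dict[str, Tuple[int, ...]]:
    """Expected tensor shapes for one layer's structured attention map."""
    if sequence_length % bucket_size != 0:
        raise ValueError("sequence_length must be a multiple of bucket_size")
    num_windows = sequence_length // bucket_size  # also num_buckets
    num_sinkhorn_heads = num_heads - num_local_heads  # assumed: remaining heads
    span = (look_backward + 1 + look_forward) * bucket_size
    return {
        "local_attentions": (batch, num_local_heads, num_windows, bucket_size, span),
        "local_key_positions": (num_windows, span),
        "sinkhorn_attentions": (
            batch, num_sinkhorn_heads, num_windows, bucket_size, 2 * bucket_size
        ),
        "sinkhorn_reorder": (batch, num_sinkhorn_heads, num_windows, num_windows),
    }


shapes = attention_map_shapes(
    batch=1, sequence_length=8192, bucket_size=64, num_heads=8, num_local_heads=2
)
```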
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `int` | | None |
| | `int` | number of preceding windows each local window attends to (… | None |
| | `int` | number of following windows each local window attends to (… | None |
| | `int` | number of windowed-local heads (first heads along the head axis). | None |
| | `int` | number of Sinkhorn sorted-bucket heads (remaining heads). | None |
| | `int` | | None |
Faithfulness guarantee: re-deriving the per-type attention output by contracting these exact softmax weights with the (block-gathered) values reproduces the layer’s attention output bit-for-bit.
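What "contracting the weights with the block-gathered values" means can be illustrated for a single local window and head. The sketch below is plain Python on nested lists, so it shows the arithmetic only and makes no claim about reproducing the model's floating-point results. The function name and argument layout are hypothetical.

```python
from typing import List, Sequence


def contract_local_window(
    weights: Sequence[Sequence[float]],   # one softmax row per query in the window
    key_positions: Sequence[int],         # absolute position per key column, -1 if padded
    values: Sequence[Sequence[float]],    # value vector for every absolute position
) -> List[List[float]]:
    """Weighted sum of the gathered value vectors for each query in one window."""
    dim = len(values[0])
    outputs = []
    for row in weights:
        accumulator = [0.0] * dim
        for weight, position in zip(row, key_positions):
            if position < 0:
                continue  # padded column: carries weight 0 and has no source value
            for d in range(dim):
                accumulator[d] += weight * values[position][d]
        outputs.append(accumulator)
    return outputs


toy_values = [[1.0, 0.0], [0.0, 1.0]]
toy_output = contract_local_window(
    weights=[[0.0, 0.25, 0.75]],
    key_positions=[-1, 0, 1],
    values=toy_values,
)
```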
Source code in multimolecule/models/sptransformer/modeling_sptransformer.py
SpTransformerForTokenPrediction
¶
Bases: SpTransformerPreTrainedModel
Examples:
Source code in multimolecule/models/sptransformer/modeling_sptransformer.py
SpTransformerModel
¶
Bases: SpTransformerPreTrainedModel
Examples:
Source code in multimolecule/models/sptransformer/modeling_sptransformer.py
postprocess
¶
postprocess(outputs: SpTransformerModelOutput | ModelOutput | Tensor) -> tuple[Tensor, list[str]]
Return SpTransformer splice-site probabilities and tissue-usage scores with semantic channel names.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| outputs | SpTransformerModelOutput \| ModelOutput \| Tensor | The output of … | required |
Returns:
| Type | Description |
|---|---|
| Tensor | A tuple of … |
| list[str] | … are returned in the model’s native scale. |
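The general idea of pairing scores with semantic channel names can be sketched as below for one position. This is not the library's `postprocess`: it assumes the splice-site channels are turned into probabilities with a softmax, that tissue-usage channels are passed through unchanged, and that channel names look like the hypothetical ones used here.

```python
import math
from typing import List, Sequence, Tuple

SPLICE_NAMES = ["no_splice", "acceptor", "donor"]  # hypothetical channel names


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stabilised softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def name_channels(
    position_scores: Sequence[float], tissue_names: Sequence[str]
) -> Tuple[List[float], List[str]]:
    """Return (values, names) for one position under the assumptions above."""
    num_splice = len(SPLICE_NAMES)
    usage = list(position_scores[num_splice:])
    if len(usage) != len(tissue_names):
        raise ValueError("one tissue name is required per usage channel")
    probabilities = softmax(position_scores[:num_splice])
    return probabilities + usage, SPLICE_NAMES + list(tissue_names)


values, names = name_channels([0.0, 0.0, 0.0, 0.5, 0.2], ["tissue_a", "tissue_b"])
```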
Source code in multimolecule/models/sptransformer/modeling_sptransformer.py
SpTransformerModelOutput
dataclass
¶
Bases: ModelOutput
Base class for outputs of the SpTransformer model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `torch.FloatTensor` of shape `(batch_size, sequence_length, attention_hidden_size)` | Per-position attention-block representation. Consumed by … | None |
| | `torch.FloatTensor` of shape `(batch_size, sequence_length, num_splice_labels + num_tissues)` | Original SpTransformer per-position splice-site score (no-splice / acceptor / donor) and per-tissue splice-site usage score outputs. | None |
Source code in multimolecule/models/sptransformer/modeling_sptransformer.py
SpTransformerPreTrainedModel
¶
Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.
Source code in multimolecule/models/sptransformer/modeling_sptransformer.py
SpTransformerTokenPredictorOutput
dataclass
¶
Bases: ModelOutput
Base class for outputs of SpTransformer token prediction models.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `torch.FloatTensor`, *optional*, returned when `labels` is provided | Token prediction loss. | None |
| | `torch.FloatTensor` of shape `(batch_size, sequence_length, num_labels)` | Per-nucleotide prediction outputs. | None |
| | `tuple(torch.FloatTensor)`, *optional*, returned when `output_contexts=True` | Per-position attention-block representations. | None |
| | `tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` | Attention-block hidden states before the first layer and after each layer. | None |
| | `tuple(SpTransformerAttentionMap)`, *optional*, returned when `output_attentions=True` | Structured sparse-attention weights for each attention layer. | None |