Pangolin
Convolutional neural network for predicting tissue-specific splice site strength from pre-mRNA sequences.
Disclaimer
This is an UNOFFICIAL implementation of "Predicting RNA splicing from DNA sequence using Pangolin" by Tony Zeng et al.
The OFFICIAL repository of Pangolin is at tkzeng/Pangolin.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.
The team releasing Pangolin did not write a model card for this model, so this model card was written by the MultiMolecule team.
Model Details
Pangolin is a deep convolutional neural network (CNN) that predicts splice site strength from primary pre-mRNA sequence. It extends the dilated-residual SpliceAI architecture to predict tissue-specific splice site usage, and is trained on splicing measurements derived from RNA-seq data across multiple tissues. The network processes a one-hot encoded nucleotide sequence and, for each position, predicts a splice-site score and a splice-site usage score per tissue. Pangolin is typically used to estimate the effect of genetic variants on splicing by scoring reference and alternate sequences and taking the difference. Please refer to the Training Details section for more information on the training process.
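The reference-versus-alternate comparison described above can be sketched as follows. This is a minimal illustration and not the Pangolin scoring code: `score_fn` is a hypothetical stand-in for a model call that returns one score per position, and `toy_score_fn` exists only so the example runs without a trained model.

```python
from typing import Callable, List


def apply_snv(sequence: str, position: int, alt: str) -> str:
    """Return a copy of `sequence` with the base at `position` (0-based) replaced by `alt`."""
    if not 0 <= position < len(sequence):
        raise IndexError("position outside sequence")
    return sequence[:position] + alt + sequence[position + 1:]


def variant_effect(
    reference: str,
    position: int,
    alt: str,
    score_fn: Callable[[str], List[float]],
) -> List[float]:
    """Per-position score difference (alternate minus reference)."""
    alternate = apply_snv(reference, position, alt)
    ref_scores = score_fn(reference)
    alt_scores = score_fn(alternate)
    return [a - r for a, r in zip(alt_scores, ref_scores)]


def toy_score_fn(sequence: str) -> List[float]:
    """Hypothetical scorer: 1.0 at every G that starts a 'GU' dinucleotide, else 0.0."""
    return [
        1.0 if sequence[i:i + 2] == "GU" else 0.0
        for i in range(len(sequence))
    ]


reference = "AAGGUAAGU"
# Replace the G at index 3, which destroys the first 'GU' dinucleotide.
deltas = variant_effect(reference, position=3, alt="C", score_fn=toy_score_fn)
```

A negative entry in `deltas` marks a position whose score drops in the alternate sequence; with a real model, `score_fn` would return the per-position channel of interest for one tissue.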
Model Specification
| Num Layers | Hidden Size | Num Parameters (M) | FLOPs (G) | MACs (G) |
|---|---|---|---|---|
| 16 | 32 | 8.36 | 168.85 | 84.04 |
Links
- Code: multimolecule.pangolin
- Data: Cross-species RNA-seq splice-site usage from human, rhesus, rat, and mouse tissues
- Paper: Predicting RNA splicing from DNA sequence using Pangolin
- Developed by: Tony Zeng, Yang I. Li
- Model type: Dilated residual 1D CNN ensemble for per-nucleotide multi-tissue splice-site usage prediction
- Original Repository: tkzeng/Pangolin
Usage
The model file depends on the multimolecule library. You can install it using pip: `pip install multimolecule`.
Direct Use
RNA Splicing Site Prediction
You can use this model directly to predict per-nucleotide tissue-specific splice-site score and usage channels for a pre-mRNA sequence.
The probabilities tensor reproduces the original Pangolin output: for each of the four tissues, two splice-site score channels (softmax) and one splice-site usage channel (sigmoid).
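The channel layout described above can be illustrated with plain Python. This sketch assumes a tissue-major ordering, `[score_0, score_1, usage]` repeated for each tissue; that ordering, the helper name `split_by_tissue`, and the numeric values are assumptions for illustration and should be checked against the released model.

```python
from typing import Dict, List

TISSUES = ["heart", "liver", "brain", "testis"]  # order stated in the config description
CHANNELS_PER_TISSUE = 3  # 2 splice-site score channels + 1 usage channel


def split_by_tissue(position_scores: List[float]) -> Dict[str, Dict[str, object]]:
    """Split one position's 12 channels into per-tissue score and usage values.

    Assumes a tissue-major layout: [score_0, score_1, usage] repeated per tissue.
    """
    expected = len(TISSUES) * CHANNELS_PER_TISSUE
    if len(position_scores) != expected:
        raise ValueError(f"expected {expected} channels, got {len(position_scores)}")
    result: Dict[str, Dict[str, object]] = {}
    for index, tissue in enumerate(TISSUES):
        start = index * CHANNELS_PER_TISSUE
        result[tissue] = {
            "score": position_scores[start:start + 2],  # softmax pair
            "usage": position_scores[start + 2],  # sigmoid output
        }
    return result


# Made-up values for a single nucleotide, for illustration only.
example = [0.9, 0.1, 0.05, 0.8, 0.2, 0.10, 0.7, 0.3, 0.20, 0.6, 0.4, 0.30]
per_tissue = split_by_tissue(example)
```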
Downstream Use
Token Prediction
You can fine-tune Pangolin for per-nucleotide splice site strength regression with PangolinForTokenPrediction, which adds a shared token prediction head on top of the backbone.
Interface
- Input length: variable pre-mRNA sequence
- Padding: flanking context padded with `N` near transcript ends
- Output: per-position tissue-specific channels; for each of 4 tissues, 2 splice-site score channels + 1 splice-site usage channel
Training Details
Pangolin was trained to predict tissue-specific splice site usage from primary pre-mRNA sequence.
Training Data
Pangolin was trained on splice site usage derived from RNA-seq data in heart, liver, brain, and testis tissues from human and three other species, using gene annotations from GENCODE.
For each nucleotide whose splicing status was predicted, a sequence window centered on that nucleotide was used, with the flanking context padded with N (unknown nucleotide) when near transcript ends.
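The centered, `N`-padded window described above can be sketched as follows. The helper name `centered_window` and the small `flank` value are hypothetical and chosen for readability; the flank size used to train the model is not implied by this example.

```python
def centered_window(sequence: str, center: int, flank: int) -> str:
    """Return `flank` bases on each side of `center` (0-based), padded with N at the ends."""
    if not 0 <= center < len(sequence):
        raise IndexError("center outside sequence")
    left = max(0, center - flank)
    right = min(len(sequence), center + flank + 1)
    # Pad with N wherever the window would run past the transcript.
    left_pad = "N" * (flank - (center - left))
    right_pad = "N" * (flank - (right - 1 - center))
    return left_pad + sequence[left:right] + right_pad


transcript = "ACGUACGU"
window = centered_window(transcript, center=1, flank=3)
```

Every window has length `2 * flank + 1`, and the nucleotide being predicted always sits at index `flank`, regardless of how close it is to a transcript end.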
Training Procedure
Pre-training
The model was trained to minimize a combination of cross-entropy loss over splice-site classification and a regression loss over splice-site usage, comparing predictions against measurements derived from RNA-seq.
- Optimizer: AdamW
- Learning rate scheduler: Step decay
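How a classification term and a usage term might be combined into one objective can be sketched as below. The source states only that a cross-entropy loss and a regression loss are combined; the use of squared error for the regression term, the `usage_weight` parameter, and all function names are assumptions made for illustration.

```python
import math
from typing import List


def cross_entropy(probabilities: List[float], target_index: int) -> float:
    """Negative log-likelihood of the true class given predicted probabilities."""
    return -math.log(max(probabilities[target_index], 1e-12))


def squared_error(predicted: float, measured: float) -> float:
    """Squared difference; assumed here as the regression term."""
    return (predicted - measured) ** 2


def combined_loss(
    score_probs: List[float],
    true_class: int,
    usage_pred: float,
    usage_measured: float,
    usage_weight: float = 1.0,  # hypothetical weighting between the two terms
) -> float:
    """Sum of the classification term and the weighted usage term for one position."""
    classification = cross_entropy(score_probs, true_class)
    regression = squared_error(usage_pred, usage_measured)
    return classification + usage_weight * regression


# Made-up prediction and measurement for a single nucleotide in one tissue.
loss = combined_loss([0.2, 0.8], true_class=1, usage_pred=0.5, usage_measured=0.7)
```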
Citation
| BibTeX | |
|---|---|
Note
The artifacts distributed in this repository are part of the MultiMolecule project. If MultiMolecule supports your research, please cite the MultiMolecule project as follows:
| BibTeX | |
|---|---|
Contact
Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the Pangolin paper for questions or comments on the paper/model.
License
This model implementation is licensed under the GNU Affero General Public License.
For additional terms and clarifications, please refer to our License FAQ.
| Text Only | |
|---|---|
API Reference
PangolinConfig
Bases: PreTrainedConfig
This is the configuration class to store the configuration of a
PangolinModel. It is used to instantiate a Pangolin model according to the
specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a
similar configuration to that of the Pangolin tkzeng/Pangolin architecture.
Configuration objects inherit from PreTrainedConfig and can be used to
control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | int | Vocabulary size of the Pangolin model. Defines the number of different tokens that can be represented by the ... | 5 |
| | int | The length of the context window. The input sequence is padded with zeros of length ... | 10000 |
| | int | Dimensionality of the encoder layers. | 32 |
| | list[PangolinStageConfig] \| None | Configuration for each stage in the Pangolin model. Each stage is a ... | None |
| | str | The non-linear activation function (function or string) in the encoder. If string, ... | 'relu' |
| | float | The epsilon used by the batch normalization layers. | 1e-05 |
| | float | The momentum used by the batch normalization layers. | 0.1 |
| | int | Number of replicate networks averaged inside each tissue-specific model group. The official Pangolin v2 release uses three replicates per tissue. | 3 |
| | int | Number of tissue-specific model groups. The official release predicts four tissues (heart, liver, brain, testis), each with a splice-site score (2 channels) and a splice-site usage score (1 channel), for a total of ... | 4 |
| | list[str] \| None | Names for the tissue-specific output groups. Defaults to the official Pangolin v2 tissue order: heart, liver, brain, and testis. | None |
| | int | Number of output labels for the ... | 4 |
| | HeadConfig \| None | Configuration for the ... | None |
| | str \| None | Problem type for the token prediction head. | 'regression' |
| | bool | Whether to output the context vectors for each stage. | False |
Examples:
Source code in multimolecule/models/pangolin/configuration_pangolin.py
PangolinStageConfig
Bases: FlatDict
Configuration for a single Pangolin stage.
A stage is a contiguous group of dilated residual blocks that share a kernel size and dilation, followed by a skip-connection convolution.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | | Number of dilated residual blocks in the stage. | required |
| | | Convolution kernel size for the blocks in the stage. | required |
| | | Dilation (atrous rate) for the blocks in the stage. | required |
Source code in multimolecule/models/pangolin/configuration_pangolin.py
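The relationship between the three stage parameters and the flanking context can be worked through with a small calculation. This sketch is not the library's `PangolinStageConfig`: the `Stage` dataclass and its field names are hypothetical, it assumes two convolutions per residual block as in SpliceAI, and the stage values are those of the SpliceAI 10k architecture, assumed here rather than read from the Pangolin checkpoints.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    """Illustrative stand-in for one stage; field names are hypothetical."""

    num_blocks: int
    kernel_size: int
    dilation: int


def context_length(stages: List[Stage], convs_per_block: int = 2) -> int:
    """Total flanking context consumed by the stack of dilated convolutions.

    Each convolution widens the receptive field by (kernel_size - 1) * dilation.
    """
    return sum(
        convs_per_block * s.num_blocks * (s.kernel_size - 1) * s.dilation
        for s in stages
    )


# Stage values of the SpliceAI 10k architecture, assumed here for illustration.
stages = [
    Stage(num_blocks=4, kernel_size=11, dilation=1),
    Stage(num_blocks=4, kernel_size=11, dilation=4),
    Stage(num_blocks=4, kernel_size=21, dilation=10),
    Stage(num_blocks=4, kernel_size=41, dilation=25),
]
total_context = context_length(stages)
total_blocks = sum(s.num_blocks for s in stages)
```

Under these assumptions the total is 10000 positions, which agrees with the documented default context window, and the 16 blocks agree with the Num Layers entry in the specification table.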
PangolinForTokenPrediction
Bases: PangolinPreTrainedModel
Examples:
Source code in multimolecule/models/pangolin/modeling_pangolin.py
PangolinModel
Bases: PangolinPreTrainedModel
Examples:
Source code in multimolecule/models/pangolin/modeling_pangolin.py
postprocess
postprocess(
outputs: PangolinModelOutput | ModelOutput | Tensor,
) -> tuple[Tensor, list[str]]
Return Pangolin splice-site scores with semantic tissue channel names.
Pangolin’s outputs are already probability-like from the original head: two softmax splice-site channels and one sigmoid usage channel for each tissue. This method attaches the model-defined tissue channel names so direct model users and pipelines share the same output semantics.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | PangolinModelOutput \| ModelOutput \| Tensor | The output of ... | required |

Returns:
| Type | Description |
|---|---|
| Tensor | A tuple of ... |
| list[str] | and ... |
Source code in multimolecule/models/pangolin/modeling_pangolin.py
PangolinModelOutput
dataclass
Bases: ModelOutput
Base class for outputs of the Pangolin model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)` | Per-position encoder representation, averaged across ensemble members. Consumed by ... | None |
| | `torch.FloatTensor` of shape `(batch_size, sequence_length, num_tissues * 3)` | Original Pangolin per-tissue splice-site score (softmax, 2 channels) and splice-site usage score (sigmoid, 1 channel) outputs, averaged across ensemble members. These are post-activation probabilities; Pangolin has no pre-softmax logit surface. | None |
Source code in multimolecule/models/pangolin/modeling_pangolin.py
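The phrase "averaged across ensemble members" can be made concrete with a small element-wise mean. This is a sketch using nested lists rather than tensors; the helper name `average_replicates` and the numbers are hypothetical, and it assumes a simple unweighted mean over replicates.

```python
from typing import List


def average_replicates(replicate_outputs: List[List[List[float]]]) -> List[List[float]]:
    """Element-wise mean over replicates.

    `replicate_outputs` is indexed as [replicate][position][channel].
    """
    if not replicate_outputs:
        raise ValueError("need at least one replicate")
    count = len(replicate_outputs)
    positions = len(replicate_outputs[0])
    channels = len(replicate_outputs[0][0])
    return [
        [
            sum(rep[p][c] for rep in replicate_outputs) / count
            for c in range(channels)
        ]
        for p in range(positions)
    ]


# Three made-up replicates, each with two positions and two channels.
replicates = [
    [[0.25, 0.75], [0.5, 0.5]],
    [[0.5, 0.5], [0.5, 0.5]],
    [[0.75, 0.25], [0.5, 0.5]],
]
averaged = average_replicates(replicates)
```

Because the averaged values are already post-activation probabilities, the mean of softmax pairs still sums to one at each position.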
PangolinPreTrainedModel
Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.
Source code in multimolecule/models/pangolin/modeling_pangolin.py
PangolinTokenPredictorOutput
dataclass
Bases: ModelOutput
Base class for outputs of Pangolin token prediction models.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `torch.FloatTensor`, *optional*, returned when `labels` is provided | Token prediction loss. | None |
| | `torch.FloatTensor` of shape `(batch_size, sequence_length, num_labels)` | Per-nucleotide prediction outputs. | None |
| | `tuple(torch.FloatTensor)`, *optional*, returned when `output_contexts=True` | Per-stage context representations. | None |
| | `tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` | Per-stage context representations. | None |