APARENT2¶
Deep residual neural network for predicting human 3’ UTR Alternative Polyadenylation (APA) and cleavage magnitude at base-pair resolution, and for deciphering the impact of genetic variants on polyadenylation.
Disclaimer¶
This is an UNOFFICIAL implementation of Deciphering the impact of genetic variation on human polyadenylation using APARENT2 by Johannes Linder, Samantha E. Koplik et al.
The OFFICIAL repository of APARENT2 is at johli/aparent-resnet.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.
The team releasing APARENT2 did not write a model card for this model, so this model card has been written by the MultiMolecule team.
Model Details¶
APARENT2 is a residual convolutional neural network (a ResNet successor to the original APARENT) trained on a 3’ UTR massively parallel reporter assay (MPRA). Given a fixed 205bp polyadenylation signal (PAS) sequence, it predicts a base-pair-resolution cleavage probability distribution as well as the overall isoform abundance. It is primarily used to score the effect of genetic variants on polyadenylation by comparing the predictions for a reference and an alternate sequence.
Model Specification¶
| Num Layers | Hidden Size | Num Parameters (M) | FLOPs (G) | MACs (G) | Max Num Tokens |
|---|---|---|---|---|---|
| 28 | 32 | 0.19 | 0.08 | 0.04 | 205 |
Links¶
- Code: multimolecule.aparent2
- Weights: multimolecule/aparent2
- Data: Massively-parallel polyadenylation MPRA with variant-effect evaluation data
- Paper: Deciphering the impact of genetic variation on human polyadenylation using APARENT2
- Developed by: Johannes Linder, Samantha E. Koplik, Anshul Kundaje, Georg Seelig
- Model type: 1D residual CNN successor to APARENT for polyadenylation isoform, cleavage, and variant-effect prediction
- Original Repository: johli/aparent-resnet
Usage¶
The model file depends on the multimolecule library. You can install it using pip: `pip install multimolecule`.
Direct Use¶
Polyadenylation Cleavage Prediction¶
You can use this model directly to predict the cleavage distribution of a 205bp polyadenylation signal sequence (core hexamer starting at position 70):
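The model returns pre-softmax cleavage scores, so a small post-processing step is needed to read them as a distribution. The sketch below is plain Python and does not call multimolecule; `cleavage_distribution` is a hypothetical helper, and the toy `scores` list merely stands in for one row of model output.

```python
import math


def cleavage_distribution(scores):
    """Convert 206 pre-softmax scores into per-position probabilities
    and the probability of the trailing "no cleavage in window" bucket."""
    if len(scores) != 206:
        raise ValueError(f"expected 206 scores, got {len(scores)}")
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs[:-1], probs[-1]


# Toy scores standing in for one row of model output (assumption).
scores = [0.0] * 206
scores[95] = 5.0
position_probs, no_cleavage = cleavage_distribution(scores)

# Index of the most likely cleavage position inside the 205 bp window.
top_position = max(range(len(position_probs)), key=position_probs.__getitem__)
```

Splitting off the last entry keeps the 205 per-position probabilities aligned with the input coordinates.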
Variant Effect Scoring¶
Score a reference and an alternate sequence separately, then compare:
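One way to make the comparison concrete is a log odds ratio between the alternate and reference predictions. The sketch below assumes both sequences have already been scored and that each yields 206 pre-softmax cleavage scores. It uses the summed in-window cleavage probability as a stand-in for isoform usage, which is an illustrative choice and not necessarily the scoring rule used in the paper. All function names are hypothetical.

```python
import math


def in_window_usage(scores):
    """Total cleavage probability inside the window (drops the trailing bucket)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    return sum(exps[:-1]) / sum(exps)


def log_odds_ratio(ref_scores, alt_scores, eps=1e-9):
    """log(odds_alt / odds_ref); positive values mean the variant increases usage."""
    def logit(p):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        return math.log(p / (1.0 - p))
    return logit(in_window_usage(alt_scores)) - logit(in_window_usage(ref_scores))


# Toy scores standing in for model outputs: 205 positions + 1 trailing bucket.
ref = [0.0] * 205 + [2.0]
alt = [0.0] * 205 + [3.0]  # more mass in the "no cleavage in window" bucket
effect = log_odds_ratio(ref, alt)  # negative: the toy variant reduces in-window cleavage
```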
Interface¶
- Input length: fixed 205 bp window
- Hexamer position: core hexamer (e.g., AATAAA) at position 70 (0-indexed) of the 205 bp window
- Output: 206-dim cleavage distribution (one score per input position + trailing “no cleavage in window” bucket)
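The input constraints listed above can be checked before a sequence is passed to the tokenizer. This is a minimal sketch in plain Python; `check_pas_window`, `WINDOW_LENGTH`, and `HEXAMER_START` are hypothetical names introduced for illustration, and the synthetic sequence is not real data.

```python
WINDOW_LENGTH = 205  # fixed input length stated in the interface
HEXAMER_START = 70   # 0-indexed start of the core hexamer


def check_pas_window(sequence: str, hexamer: str = "AATAAA") -> str:
    """Normalize a PAS window and verify its length and hexamer placement."""
    seq = sequence.upper().replace("U", "T")
    if len(seq) != WINDOW_LENGTH:
        raise ValueError(f"expected {WINDOW_LENGTH} nt, got {len(seq)}")
    found = seq[HEXAMER_START:HEXAMER_START + len(hexamer)]
    if found != hexamer:
        raise ValueError(
            f"expected {hexamer} at position {HEXAMER_START}, found {found}"
        )
    return seq


# Synthetic example: 70 nt upstream, the hexamer, then 129 nt downstream.
window = check_pas_window("c" * 70 + "aauaaa" + "g" * 129)
```

The `hexamer` argument is left configurable because the interface names AATAAA only as an example of a core hexamer.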
Variant Effect¶
- Score reference and alternate sequences separately and compare their cleavage / isoform predictions
- There is no separate ref/alt output dataclass
Training Details¶
APARENT2 was trained to predict base-pair-resolution cleavage and isoform abundance from 3’ UTR MPRA measurements.
Training Data¶
The model was trained on the 3’ UTR MPRA library used by the original APARENT, re-processed with additional improvements (exact cleavage positions for the Alien1 Random sublibrary and a 20 nt random barcode upstream of the USE in the Alien1 sublibrary). The measured variant data and processed data repository are available at the original APARENT GitHub.
Training Procedure¶
Pre-training¶
The model minimizes a combination of a sigmoid KL-divergence isoform loss and a KL-divergence cleavage loss, weighted equally. The released inference model corresponds to the residual-network model trained for 5 epochs on all sublibraries (excluding ClinVar wild-type sequences), with dropout disabled for inference.
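As a rough illustration of this objective, the sketch below combines a Bernoulli KL term on the isoform proportion with a categorical KL term on the cleavage distribution, using equal weights. It reflects one reading of "sigmoid KL-divergence" (KL between the measured proportion and the sigmoid of a predicted logit). The function names, the epsilon clipping, and the 0.5/0.5 weighting are assumptions, not the original training code.

```python
import math


def kl_categorical(target, predicted, eps=1e-7):
    """KL(target || predicted) for two discrete distributions of equal length."""
    total = 0.0
    for t, p in zip(target, predicted):
        if t > 0.0:  # terms with zero target mass contribute nothing
            total += t * math.log(t / max(p, eps))
    return total


def kl_bernoulli(target, predicted, eps=1e-7):
    """KL divergence between two Bernoulli parameters (isoform proportions)."""
    p = min(max(predicted, eps), 1.0 - eps)
    return kl_categorical([target, 1.0 - target], [p, 1.0 - p], eps)


def combined_loss(iso_true, iso_logit, cleavage_true, cleavage_pred):
    """Equal-weight sum of the isoform and cleavage KL terms."""
    iso_pred = 1.0 / (1.0 + math.exp(-iso_logit))  # sigmoid of the isoform logit
    return (0.5 * kl_bernoulli(iso_true, iso_pred)
            + 0.5 * kl_categorical(cleavage_true, cleavage_pred))


# A perfect prediction gives zero loss; a mismatched one gives a positive loss.
perfect = combined_loss(0.5, 0.0, [0.25] * 4, [0.25] * 4)
mismatch = combined_loss(0.9, 0.0, [1.0, 0.0], [0.5, 0.5])
```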
Citation¶
Note
The artifacts distributed in this repository are part of the MultiMolecule project. If you use MultiMolecule in your research, you must cite the MultiMolecule project.
Contact¶
Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the APARENT2 paper for questions or comments on the paper/model.
License¶
This model implementation is licensed under the GNU Affero General Public License.
For additional terms and clarifications, please refer to our License FAQ.
multimolecule.models.aparent2¶
DnaTokenizer¶
Bases: Tokenizer
Tokenizer for DNA sequences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | Alphabet \| str \| List[str] \| None | alphabet to use for tokenization. | None |
| | int | Size of kmer to tokenize. | 1 |
| | bool | Whether to tokenize into codons. | False |
| | bool | Whether to replace U with T. | True |
| | bool | Whether to convert input to uppercase. | True |
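The options in this table can be illustrated without the library. The sketch below is plain Python, not the `DnaTokenizer` implementation. The function and argument names are hypothetical, and producing overlapping k-mers is an assumption about what "size of kmer" means here.

```python
def normalize_dna(sequence: str, replace_u: bool = True, upper: bool = True) -> str:
    """Apply the uppercase and U-to-T normalization options."""
    if upper:
        sequence = sequence.upper()
    if replace_u:
        sequence = sequence.replace("U", "T").replace("u", "t")
    return sequence


def split_kmers(sequence: str, k: int = 1, codon: bool = False) -> list[str]:
    """Overlapping k-mers by default; non-overlapping triplets when codon=True."""
    if codon:
        if len(sequence) % 3 != 0:
            raise ValueError("codon mode needs a length divisible by 3")
        return [sequence[i:i + 3] for i in range(0, len(sequence), 3)]
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


tokens = split_kmers(normalize_dna("aauaaa"))               # single-nucleotide tokens
codons = split_kmers(normalize_dna("aauaaa"), codon=True)   # triplet tokens
```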
Examples:
Source code in multimolecule/tokenisers/dna/tokenization_dna.py
Aparent2Config¶
Bases: PreTrainedConfig
This is the configuration class to store the configuration of an Aparent2Model. It is used to instantiate an APARENT2 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the APARENT2 johli/aparent-resnet architecture.
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
APARENT2 is a residual convolutional network that predicts human 3’ UTR Alternative Polyadenylation (APA) and cleavage magnitude at base-pair resolution. The network is fully convolutional plus a position-wise locally-connected library-bias layer; it does not contain any flatten/dense layers.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | int | Vocabulary size of the APARENT2 model. Defines the number of one-hot input channels derived from | 5 |
| | int | The fixed length of the polyadenylation signal sequence the model was trained on. APARENT2 expects a 205bp window with the core hexamer (e.g. | 205 |
| | int | Number of feature channels used throughout the residual network. | 32 |
| | int | Number of residual-block groups. | 7 |
| | int | Number of residual blocks per group. | 4 |
| | int | Convolution kernel size used inside each residual block. | 3 |
| | list[int] \| None | Dilation factor for each residual-block group. Must have | None |
| | int | Dimensionality of the one-hot training sub-library bias input. | 13 |
| | int | The training sub-library index used to construct the deterministic library-bias input. The upstream variant-effect workflow always uses index 11. | 11 |
| | str | The non-linear activation function used inside the residual blocks. | 'relu' |
| | float | The epsilon used by the batch normalization layers. | 0.001 |
| | float | The momentum used by the batch normalization layers. | 0.99 |
| | int | Number of output labels. APARENT2 predicts a cleavage distribution over | 206 |
| | HeadConfig \| None | The configuration of the prediction head. Defaults to a regression head ( | None |
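Two of the defaults above lend themselves to a quick check: the group and block counts multiply to 28, which agrees with the "Num Layers" value in the specification table, and the library-bias input is a one-hot vector of size 13 with index 11 set. The sketch below is plain Python with hypothetical constant and function names. It does not use `Aparent2Config` and makes no claim about how the library builds this input internally.

```python
NUM_GROUPS = 7         # residual-block groups (default from the table)
BLOCKS_PER_GROUP = 4   # residual blocks per group (default from the table)
LIBRARY_DIM = 13       # size of the one-hot sub-library bias input
LIBRARY_INDEX = 11     # index used by the upstream variant-effect workflow


def library_one_hot(index: int = LIBRARY_INDEX, size: int = LIBRARY_DIM) -> list[float]:
    """One-hot vector selecting a training sub-library."""
    if not 0 <= index < size:
        raise ValueError(f"index {index} outside [0, {size})")
    return [1.0 if i == index else 0.0 for i in range(size)]


total_blocks = NUM_GROUPS * BLOCKS_PER_GROUP  # 7 * 4 residual blocks
bias_input = library_one_hot()
```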
Examples:
Source code in multimolecule/models/aparent2/configuration_aparent2.py
Aparent2ForSequencePrediction¶
Bases: Aparent2PreTrainedModel
APARENT2 with a sequence-level prediction head.
The backbone already produces a sequence_length + 1 dimensional cleavage score (the APA cleavage distribution
before softmax), so this wrapper exposes those converted upstream scores directly and adds the shared
MultiMolecule regression loss.
Examples:
Source code in multimolecule/models/aparent2/modeling_aparent2.py
Aparent2Model¶
Bases: Aparent2PreTrainedModel
The bare APARENT2 residual network.
APARENT2 predicts a base-pair-resolution cleavage distribution for a fixed 205bp polyadenylation signal window.
The core hexamer (e.g. AATAAA) is expected to start at position 70 (0-indexed). Variant effect is an
input-schema concern: score a reference and an alternate sequence separately and compare their cleavage /
isoform predictions; there is no separate ref/alt output dataclass.
Examples:
Source code in multimolecule/models/aparent2/modeling_aparent2.py
Aparent2ModelOutput dataclass¶
Bases: ModelOutput
Base class for outputs of the APARENT2 model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `torch.FloatTensor` of shape `(1,)`, *optional* | Not produced by the bare model; present for API compatibility. | None |
| | `torch.FloatTensor` of shape `(batch_size, sequence_length + 1)` | APA cleavage scores (before SoftMax) for each position plus a trailing “no cleavage in window” bucket. | None |
| | `torch.FloatTensor` of shape `(batch_size, sequence_length + 1)` | Same content as | None |
| | `torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)` | The residual-network feature map before the final cleavage projection. | None |
| | `tuple(torch.FloatTensor)`, *optional* | Hidden states of the model at the output of each layer. | None |
Source code in multimolecule/models/aparent2/modeling_aparent2.py
Aparent2PreTrainedModel¶
Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.