APARENT¶
Convolutional neural network for predicting human 3’UTR Alternative Polyadenylation (APA) from sequence.
Disclaimer¶
This is an UNOFFICIAL implementation of A Deep Neural Network for Predicting and Engineering Alternative Polyadenylation by Nicholas Bogard, Johannes Linder et al.
The OFFICIAL repository of APARENT is at johli/aparent.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.
The team that released APARENT did not write a model card for this model, so this model card was written by the MultiMolecule team.
Model Details¶
APARENT (APA REgression NeT) is a convolutional neural network trained on more than 3.5 million randomized 3’UTR poly-A signals expressed on mini-gene reporters in HEK293. Given a fixed-length 205 nt 3’UTR/polyA sequence, APARENT predicts the alternative-polyadenylation isoform proportion (a scalar) and a positional cleavage distribution. The model is primarily used to score the impact of genetic variants on APA regulation and to engineer new polyadenylation signals. Please refer to the Training Details section for more information on the training process.
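As an illustration of variant scoring, the effect of a variant can be summarised as the change in log-odds between the predicted isoform proportions of the reference and variant sequences. This is a minimal sketch under that assumption. The function name `isoform_log_odds_ratio` and the example proportions are hypothetical and are not part of the MultiMolecule API; the proportions would come from running the model on both sequences.

```python
import math


def isoform_log_odds_ratio(ref_prop: float, var_prop: float, eps: float = 1e-6) -> float:
    """Log-odds change in predicted isoform usage from reference to variant.

    Positive values mean the variant increases predicted usage of the site.
    """
    # Clamp to (eps, 1 - eps) so the logit stays finite at exactly 0 or 1.
    ref = min(max(ref_prop, eps), 1.0 - eps)
    var = min(max(var_prop, eps), 1.0 - eps)
    return math.log(var / (1.0 - var)) - math.log(ref / (1.0 - ref))


# Hypothetical predicted proportions, for illustration only.
delta = isoform_log_odds_ratio(ref_prop=0.60, var_prop=0.45)
print(f"log-odds change: {delta:+.3f}")
```

Working in log-odds rather than raw proportions keeps the score comparable between sites with very different baseline usage.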
The base, non-normalised APARENT model is recommended by the original authors for isoform and variant-effect prediction.
Model Specification¶
| Num Layers | Hidden Size | Num Parameters (M) | FLOPs (G) | MACs (G) | Max Num Tokens |
|---|---|---|---|---|---|
| 4 | 256 | 6.43 | 0.03 | 0.01 | 205 |
Links¶
- Code: multimolecule.aparent
- Weights: multimolecule/aparent
- Data: Massively-parallel polyadenylation MPRA, GEO GSE113849
- Paper: A Deep Neural Network for Predicting and Engineering Alternative Polyadenylation
- Developed by: Nicholas Bogard, Johannes Linder, Alexander B. Rosenberg, Georg Seelig
- Model type: 1D CNN for alternative polyadenylation isoform and cleavage prediction from 3’UTR sequence
- Original Repository: johli/aparent
Usage¶
The model file depends on the multimolecule library. You can install it using pip by running `pip install multimolecule`.
Direct Use¶
APA Isoform Prediction¶
You can use this model directly, through AparentForSequencePrediction, to predict the APA isoform proportion of a 3’UTR/polyA sequence.
The full APARENT isoform and cleavage outputs are available on the backbone, AparentModel.
Interface¶
- Input length: fixed 205 nt 3’UTR / polyA sequence
- Output (`AparentModel`): `isoform_logits` (scalar APA proportion) + `cleavage_logits` (206-dim positional cleavage distribution)
- Output (`AparentForSequencePrediction`): APA isoform scalar only (`logits`)
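Because the outputs are logits, a sigmoid and a softmax are needed to turn them into a proportion and a probability distribution. The sketch below shows that post-processing in plain Python. The helper `postprocess` and the placeholder logits are hypothetical; in practice the values would be taken from the model output for one sequence.

```python
import math
from typing import List, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def softmax(logits: List[float]) -> List[float]:
    """Softmax with max-subtraction for numerical stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(isoform_logit: float, cleavage_logits: List[float]) -> Tuple[float, List[float]]:
    """Turn raw logits into an isoform proportion and a cleavage distribution."""
    if len(cleavage_logits) != 206:
        raise ValueError("expected 206 cleavage logits (205 positions + 1 extra slot)")
    return sigmoid(isoform_logit), softmax(cleavage_logits)


# Placeholder logits standing in for one sequence's model outputs.
toy_cleavage = [0.0] * 206
toy_cleavage[100] = 3.0  # pretend index 100 is the favoured cleavage position
proportion, cleavage = postprocess(0.0, toy_cleavage)
best_index = max(range(206), key=cleavage.__getitem__)
print(proportion, best_index)
```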
Training Details¶
APARENT was trained to jointly predict the APA isoform proportion and the positional cleavage distribution of randomized 3’UTR poly-A signals.
Training Data¶
APARENT was trained on more than 3.5 million randomized 3’UTR poly-A signal sequences expressed on mini-gene reporters in HEK293 cells (a massively parallel reporter assay, MPRA). The raw sequencing data for the 3’UTR MPRA libraries are available at GEO accession GSE113849.
This APARENT model was trained on all MPRA libraries (no libraries held out) to produce the best general-purpose APA predictor; it differs from the per-library held-out model evaluated in the paper.
Training Procedure¶
Pre-training¶
The model was trained to minimize a combined objective: a sigmoid KL-divergence on the isoform proportion and a KL-divergence on the positional cleavage distribution, weighted equally.
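The objective described above can be sketched as the sum of two KL terms. This is an illustrative reading, not the original training code: it assumes "weighted equally" means an unweighted sum per example, treats the isoform term as a KL divergence between two Bernoulli distributions, and uses hypothetical function names and toy values (three cleavage bins instead of 206).

```python
import math
from typing import Sequence


def bernoulli_kl(p_true: float, p_pred: float, eps: float = 1e-7) -> float:
    """KL(true || pred) between two Bernoulli distributions."""
    t = min(max(p_true, eps), 1.0 - eps)
    q = min(max(p_pred, eps), 1.0 - eps)
    return t * math.log(t / q) + (1.0 - t) * math.log((1.0 - t) / (1.0 - q))


def categorical_kl(p_true: Sequence[float], p_pred: Sequence[float], eps: float = 1e-7) -> float:
    """KL(true || pred) between two categorical distributions."""
    total = 0.0
    for t, q in zip(p_true, p_pred):
        if t > 0.0:  # bins with zero true mass contribute nothing
            total += t * math.log(t / max(q, eps))
    return total


def combined_loss(iso_true: float, iso_pred: float,
                  cut_true: Sequence[float], cut_pred: Sequence[float]) -> float:
    """Equal-weight sum of the isoform and cleavage KL terms."""
    return bernoulli_kl(iso_true, iso_pred) + categorical_kl(cut_true, cut_pred)


# Toy example with three cleavage bins instead of 206.
cut_true = [0.1, 0.7, 0.2]
cut_pred = [0.2, 0.6, 0.2]
loss = combined_loss(0.30, 0.50, cut_true, cut_pred)
print(f"combined loss: {loss:.4f}")
```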
Citation¶
Note
The artifacts distributed in this repository are part of the MultiMolecule project. If you use MultiMolecule in your research, you must cite the MultiMolecule project.
Contact¶
Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the APARENT paper for questions or comments on the paper/model.
License¶
This model implementation is licensed under the GNU Affero General Public License.
For additional terms and clarifications, please refer to our License FAQ.
multimolecule.models.aparent¶
DnaTokenizer¶
Bases: Tokenizer
Tokenizer for DNA sequences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | Alphabet \| str \| List[str] \| None | Alphabet to use for tokenization. | None |
| | int | Size of kmer to tokenize. | 1 |
| | bool | Whether to tokenize into codons. | False |
| | bool | Whether to replace U with T. | True |
| | bool | Whether to convert input to uppercase. | True |
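
To make the effect of these options concrete, the sketch below mimics uppercasing, U-to-T replacement, and k-mer splitting in plain Python. It is a stand-in for illustration only and does not call the real tokenizer. The function name, its parameter names, and the choice of overlapping stride-1 k-mers are assumptions.

```python
from typing import List


def normalize_and_split(sequence: str, nmers: int = 1,
                        replace_u_with_t: bool = True,
                        do_upper_case: bool = True) -> List[str]:
    """Illustrative stand-in for the options in the table, not the real tokenizer."""
    if do_upper_case:
        sequence = sequence.upper()
    if replace_u_with_t:
        # Handle both cases so the option also works when uppercasing is off.
        sequence = sequence.replace("U", "T").replace("u", "t")
    # Assumed behaviour: overlapping k-mers with stride 1.
    # With nmers=1 this yields one token per base.
    return [sequence[i:i + nmers] for i in range(len(sequence) - nmers + 1)]


single = normalize_and_split("aauaaa")
triples = normalize_and_split("AATAAA", nmers=3)
print(single)
print(triples)
```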
Source code in multimolecule/tokenisers/dna/tokenization_dna.py
AparentConfig¶
Bases: PreTrainedConfig
This is the configuration class to store the configuration of an
AparentModel. It is used to instantiate an APARENT model according to the
specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a
similar configuration to that of the APARENT johli/aparent architecture.
Configuration objects inherit from PreTrainedConfig and can be used to
control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | int | Vocabulary size of the APARENT model. Defines the number of input channels of the first convolution. Defaults to 5. | 5 |
| | int | The fixed 3’UTR/polyA input sequence length APARENT was trained on (205 nt). | 205 |
| | int | Number of filters in the first convolution. | 96 |
| | int | Kernel size (sequence span) of the first convolution. The first convolution also spans the full nucleotide dimension. | 8 |
| | int | Pooling window of the max-pooling layer after the first convolution. | 2 |
| | int | Number of filters in the second convolution. | 128 |
| | int | Kernel size of the second convolution. | 6 |
| | list[int] \| None | Sizes of the two fully connected layers after the convolutional stack. The second value is the size of the shared sequence representation. | None |
| | list[float] \| None | Dropout probabilities applied after each fully connected layer. | None |
| | str | The non-linear activation function used by the convolutional and dense layers. | 'relu' |
| | int | Dimension of the upstream isoform-proportion output (sigmoid). APARENT predicts a single scalar. | 1 |
| | int | Dimension of the upstream positional cleavage-distribution output (softmax). APARENT predicts 206 positions (205 sequence positions + 1 distal/library bias slot). | 206 |
| | int | Size of the upstream one-hot library-identity input concatenated before the output layers. The MultiMolecule API keeps this as a non-persistent zero feature, matching the upstream default encoder. | 13 |
| | HeadConfig \| None | The configuration of the sequence-level prediction head. Defaults to a regression head. | None |
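
The default values above determine how the sequence axis shrinks through the convolutional stack. The arithmetic below is a sketch that assumes stride-1 convolutions without padding and non-overlapping max pooling. Those assumptions are not stated in this reference, and the helper names are hypothetical.

```python
def conv_out_len(length: int, kernel_size: int) -> int:
    """Output length of a stride-1 convolution without padding."""
    return length - kernel_size + 1


def pool_out_len(length: int, pool_size: int) -> int:
    """Output length of non-overlapping max pooling (floor division)."""
    return length // pool_size


sequence_length = 205                            # fixed input length
after_conv1 = conv_out_len(sequence_length, 8)   # first convolution, kernel size 8
after_pool = pool_out_len(after_conv1, 2)        # max pooling, window 2
after_conv2 = conv_out_len(after_pool, 6)        # second convolution, kernel size 6
flattened = after_conv2 * 128                    # 128 filters in the second convolution

print(after_conv1, after_pool, after_conv2, flattened)
```

Under these assumptions the flattened feature vector feeding the first fully connected layer has `after_conv2 * 128` entries.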
Source code in multimolecule/models/aparent/configuration_aparent.py
AparentForSequencePrediction¶
Bases: AparentPreTrainedModel
APARENT model with a sequence-level prediction head.
APARENT’s primary sequence-level output is the alternative-polyadenylation isoform score. This wrapper exposes the
converted upstream isoform decoder directly. The upstream positional cleavage distribution is intentionally not
exposed by this head; it remains available on `AparentModel` as `cleavage_logits`.
Source code in multimolecule/models/aparent/modeling_aparent.py
AparentModel¶
Bases: AparentPreTrainedModel
The bare APARENT model outputting the shared sequence representation together with the upstream isoform and cleavage predictions.
Source code in multimolecule/models/aparent/modeling_aparent.py
AparentModelOutput dataclass¶
Bases: ModelOutput
Base class for outputs of the APARENT model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `torch.FloatTensor` of shape `(batch_size, hidden_size)` | The shared sequence representation after the two fully connected layers. Consumed by the MultiMolecule sequence-prediction head. | None |
| `isoform_logits` | `torch.FloatTensor` of shape `(batch_size, num_isoform_labels)` | Pre-sigmoid logits of the upstream alternative-polyadenylation isoform-proportion output. | None |
| `cleavage_logits` | `torch.FloatTensor` of shape `(batch_size, num_cleavage_labels)` | Pre-softmax logits of the upstream positional cleavage distribution. | None |
Source code in multimolecule/models/aparent/modeling_aparent.py
AparentPreTrainedModel¶
Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.