MPRA-DragoNN
Disclaimer
This is an UNOFFICIAL implementation of "Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays" by Rajiv Movva et al.
The OFFICIAL repository of MPRA-DragoNN is at kundajelab/MPRA-DragoNN.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.
The team that released MPRA-DragoNN did not write a model card for this model, so this model card was written by the MultiMolecule team.
Model Details
MPRA-DragoNN is a convolutional neural network (CNN) trained to quantitatively predict Sharpr-MPRA reporter activity from 145 bp DNA sequences. The released ConvModel consists of three convolutional blocks (Conv1D + ReLU + BatchNorm + Dropout, 120 filters of width 5 with valid padding) followed by a flatten and a single fully-connected layer that emits 12 task outputs. Each task corresponds to a (cell line, reporter promoter, replicate) combination from the Sharpr-MPRA experiment: the K562 and HepG2 cell lines, each measured with both a minimal promoter (minP) and the strong SV40 promoter (SV40p), with two individual replicates plus a pooled average per condition. Please refer to the Training Details section for more information on the training process.
Model Specification
| Num Conv Layers | Num FC Layers | Hidden Size | Num Parameters (M) | FLOPs (M) | MACs (M) | Max Num Tokens |
|---|---|---|---|---|---|---|
| 3 | 1 | 15960 | 0.34 | 40.40 | 20.05 | 145 |
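The figures in this table can be reproduced from the architecture described above. The following is a minimal sketch, not the library's own accounting: it assumes 5 input channels (the default vocabulary size in the configuration), a bias term on every convolutional and fully-connected layer, and two learnable parameters (scale and shift) per BatchNorm channel. The function name `conv_stack_stats` is hypothetical.

```python
# Shape and parameter bookkeeping for three valid-padded Conv1D blocks
# (120 filters, width 5) followed by a flatten and a 12-output linear layer.
# Assumptions: 5 input channels, biases everywhere, 2 BatchNorm params/channel.

SEQ_LEN = 145
IN_CHANNELS = 5
FILTERS = [120, 120, 120]
KERNEL = 5
NUM_TASKS = 12


def conv_stack_stats(seq_len=SEQ_LEN, in_channels=IN_CHANNELS):
    length, channels = seq_len, in_channels
    params = 0
    macs = 0
    for out_channels in FILTERS:
        length = length - KERNEL + 1  # valid padding, stride 1
        params += channels * KERNEL * out_channels + out_channels  # conv weights + bias
        params += 2 * out_channels  # BatchNorm scale and shift
        macs += length * out_channels * channels * KERNEL
        channels = out_channels
    hidden_size = length * channels  # size of the flattened feature map
    params += hidden_size * NUM_TASKS + NUM_TASKS  # linear weights + bias
    macs += hidden_size * NUM_TASKS
    return hidden_size, params, macs


hidden_size, params, macs = conv_stack_stats()
print(hidden_size, round(params / 1e6, 2), round(macs / 1e6, 2))
# -> 15960 0.34 20.05
```

Each valid convolution of width 5 shortens the sequence by 4 positions (145 to 141 to 137 to 133), so the flattened feature map has 133 x 120 = 15960 entries, which matches the Hidden Size column.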
Links
- Code: multimolecule.mpradragonn
- Weights: multimolecule/mpradragonn
- Data: Sharpr-MPRA dataset (Ernst et al. 2016)
- Paper: Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays
- Developed by: Rajiv Movva, Peyton Greenside, Georgi K. Marinov, Surag Nair, Avanti Shrikumar, Anshul Kundaje
- Model type: Three-layer 1D CNN over 145 bp DNA for multi-task Sharpr-MPRA activity regression
- Original Repository: kundajelab/MPRA-DragoNN
Usage
The model file depends on the multimolecule library. You can install it using pip: `pip install multimolecule`.
Direct Use
MPRA Activity Prediction
You can use this model directly to predict the Sharpr-MPRA activity of a 145 bp DNA sequence.
Interface
- Input length: fixed 145 bp DNA window
- Output: 12 MPRA activity scalars (z-scored log2 RNA/DNA ratios) in the order `k562_minp_{rep1, rep2, avg}`, `k562_sv40p_{rep1, rep2, avg}`, `hepg2_minp_{rep1, rep2, avg}`, `hepg2_sv40p_{rep1, rep2, avg}`
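The output order listed above can be expanded into explicit task names. This is a small illustrative sketch in plain Python; `TASK_NAMES` and `label_outputs` are hypothetical helpers, not part of the multimolecule API.

```python
from itertools import product

CELLS = ("k562", "hepg2")
PROMOTERS = ("minp", "sv40p")
MEASUREMENTS = ("rep1", "rep2", "avg")

# product() varies the last factor fastest, which reproduces the documented
# order: all k562 tasks first, minP before SV40p, rep1/rep2/avg innermost.
TASK_NAMES = [
    f"{cell}_{promoter}_{measurement}"
    for cell, promoter, measurement in product(CELLS, PROMOTERS, MEASUREMENTS)
]


def label_outputs(scores):
    """Pair a length-12 vector of predicted activities with its task names."""
    if len(scores) != len(TASK_NAMES):
        raise ValueError(f"expected {len(TASK_NAMES)} scores, got {len(scores)}")
    return dict(zip(TASK_NAMES, scores))


print(TASK_NAMES)
```

Labelling the outputs this way makes it harder to confuse an individual replicate with the pooled `avg` track when comparing conditions.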
Training Details
MPRA-DragoNN was trained to predict quantitative Sharpr-MPRA reporter activity from DNA sequence.
Training Data
MPRA-DragoNN was trained on the Sharpr-MPRA dataset (Ernst et al. 2016, GEO accession GSE71279), which assays ~487K 145 bp candidate regulatory elements in the K562 and HepG2 cell lines under two reporter promoters (a minimal promoter and the strong SV40 promoter) and provides two replicates plus a pooled count per condition (12 tasks total).
Raw counts were preprocessed by (1) computing log2((RNA + 1) / (DNA + 1)) per task, (2) column-wise z-score normalisation per task, and (3) augmenting with the reverse complement of every sequence. Chromosomes were split with chr8 held out as validation, chr18 held out as test, and all remaining chromosomes used for training (~900K training, ~30K validation, ~20K test sequences after the reverse-complement augmentation).
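The three preprocessing steps and the chromosome split can be sketched in plain Python. This is an illustration under stated assumptions rather than the original pipeline: the use of the population standard deviation for the z-score, the handling of `N`, and all function names are assumptions.

```python
import math
from statistics import mean, pstdev

# Complement table; N is mapped to N (an assumption about ambiguous bases).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def log2_ratio(rna_count, dna_count):
    """Step 1: log2((RNA + 1) / (DNA + 1)) with a pseudocount of 1."""
    return math.log2((rna_count + 1) / (dna_count + 1))


def zscore(column):
    """Step 2: z-score one task column (population std is an assumption)."""
    mu, sigma = mean(column), pstdev(column)
    return [(value - mu) / sigma for value in column]


def reverse_complement(sequence):
    """Step 3: the reverse complement used for augmentation."""
    return sequence.upper().translate(_COMPLEMENT)[::-1]


def split_for(chromosome):
    """chr8 is held out for validation, chr18 for test, the rest is training."""
    return {"chr8": "validation", "chr18": "test"}.get(chromosome, "train")


def augment(records):
    """Return every (sequence, targets) pair plus its reverse complement."""
    augmented = []
    for sequence, targets in records:
        augmented.append((sequence, targets))
        augmented.append((reverse_complement(sequence), targets))
    return augmented
```

Because the reverse complement keeps the same targets, the augmentation doubles the number of sequences without adding new measurements.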
Training Procedure
Pre-training
The model was trained to minimise a task-wise mean-squared-error loss between predicted and measured MPRA activities and evaluated with Spearman correlation per task.
- Optimizer: Adam
- Loss: Mean Squared Error (task-wise, equally weighted)
- Regularization: Batch normalization and dropout (p=0.1) after every convolutional block
- Validation: chr8 sequences; Test: chr18 sequences
Citation
Note
The artifacts distributed in this repository are part of the MultiMolecule project. If you use MultiMolecule in your research, you must cite the MultiMolecule project.
Contact
Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the MPRA-DragoNN paper for questions or comments on the paper/model.
License
This model implementation is licensed under the GNU Affero General Public License.
For additional terms and clarifications, please refer to our License FAQ.
multimolecule.models.mpradragonn
DnaTokenizer
Bases: Tokenizer
Tokenizer for DNA sequences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | Alphabet \| str \| List[str] \| None | Alphabet to use for tokenization. | None |
| | int | Size of kmer to tokenize. | 1 |
| | bool | Whether to tokenize into codons. | False |
| | bool | Whether to replace U with T. | True |
| | bool | Whether to convert input to uppercase. | True |
Examples:
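The effect of the two normalisation options in the table (uppercasing and replacing U with T) can be illustrated without the library. This is a hedged stand-in, not the DnaTokenizer implementation: `normalize_dna` and its keyword names are hypothetical, and it only covers single-nucleotide tokens (kmer size 1, no codons).

```python
def normalize_dna(sequence, replace_u_with_t=True, upper=True):
    """Normalise a nucleotide string and split it into single-base tokens."""
    if upper:
        sequence = sequence.upper()
    if replace_u_with_t:
        # Handle both cases so the option also works when upper=False.
        sequence = sequence.replace("U", "T").replace("u", "t")
    return list(sequence)


print(normalize_dna("acgu"))
# -> ['A', 'C', 'G', 'T']
```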
Source code in multimolecule/tokenisers/dna/tokenization_dna.py
MpraDragoNnConfig
Bases: PreTrainedConfig
This is the configuration class to store the configuration of a MpraDragoNnModel. It is used to instantiate an MPRA-DragoNN model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the kundajelab/MPRA-DragoNN ConvModel architecture.
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | int | Vocabulary size of the MPRA-DragoNN model. Defines the number of feature channels in the one-hot encoded input fed to the first convolution. | 5 |
| | int | The fixed length (in base pairs) of the input DNA sequence. | 145 |
| | int | Number of convolutional blocks (Conv1D + BatchNorm + activation + Dropout). | 3 |
| | list[int] \| None | Number of output channels for each convolutional block. | None |
| | list[int] \| None | Convolution kernel size for each convolutional block. | None |
| | str | The non-linear activation function (function or string) in the encoder. If string, … | 'relu' |
| | float | The dropout probability applied after each convolutional block. | 0.1 |
| | float | The epsilon used by the batch normalization layers. | 0.001 |
| | float | The momentum used by the batch normalization layers (PyTorch convention; equivalent to …). | 0.01 |
| | int | Number of regression outputs. MPRA-DragoNN predicts Sharpr-MPRA activity for 12 tasks: K562 / HepG2 cell lines, each with minP and SV40p reporter promoters, each measured as two replicates plus a pooled "avg" track (2 cells x 2 promoters x 3 measurements = 12 tasks). | 12 |
| | HeadConfig \| None | The configuration of the prediction head. Defaults to a regression head (…). | None |
Examples:
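The batch normalization momentum default of 0.01 is stated in the PyTorch convention, where momentum weights the new batch statistic. In the Keras convention, momentum weights the old running value instead, so the same update corresponds to a Keras-style momentum of 0.99. The sketch below demonstrates this equivalence with plain floats; the function names are hypothetical and the code does not use either framework.

```python
def torch_style_update(running, batch_stat, momentum=0.01):
    """PyTorch convention: momentum is the weight of the new batch statistic."""
    return (1 - momentum) * running + momentum * batch_stat


def keras_style_update(running, batch_stat, momentum=0.99):
    """Keras convention: momentum is the weight of the old running value."""
    return momentum * running + (1 - momentum) * batch_stat


running_mean, batch_mean = 2.0, 5.0
a = torch_style_update(running_mean, batch_mean)
b = keras_style_update(running_mean, batch_mean)
assert abs(a - b) < 1e-12  # both move the running mean 1% toward the batch mean
```

This matters when porting weights between frameworks: copying the momentum value unchanged would make the running statistics follow each batch almost completely instead of averaging over many batches.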
Source code in multimolecule/models/mpradragonn/configuration_mpradragonn.py
MpraDragoNnForSequencePrediction
Bases: MpraDragoNnPreTrainedModel
Source code in multimolecule/models/mpradragonn/modeling_mpradragonn.py
MpraDragoNnModel
Bases: MpraDragoNnPreTrainedModel
Source code in multimolecule/models/mpradragonn/modeling_mpradragonn.py
MpraDragoNnModelOutput (dataclass)
Bases: ModelOutput
Base class for outputs of the MPRA-DragoNN model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `torch.FloatTensor` of shape `(batch_size, pooled_length * conv_channels[-1])` | Flattened feature map produced by the convolutional encoder. | None |
| | `torch.FloatTensor` of shape `(batch_size, pooled_length * conv_channels[-1])` | Sequence-level representation. MPRA-DragoNN has no learned pooler, so this is the same flattened convolutional feature map as … | None |
| | `tuple(torch.FloatTensor)`, *optional* | Always … | None |
Source code in multimolecule/models/mpradragonn/modeling_mpradragonn.py
MpraDragoNnPreTrainedModel
Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.