Malinois
Convolutional neural network for predicting cell-type-targeting cis-regulatory element (CRE) activity from DNA sequence.
Disclaimer
This is an UNOFFICIAL implementation of “Machine-guided design of cell-type-targeting cis-regulatory elements” by Sager J. Gosai, Rodrigo I. Castro, et al.
The OFFICIAL repository of Malinois is at sjgosai/boda2.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.
The team releasing Malinois did not write a model card for this model, so this model card has been written by the MultiMolecule team.
Model Details
Malinois is a deep convolutional neural network (a tuned Basset-style “branched” architecture) trained to quantitatively predict cell-type-informed CRE activity from ~200 bp DNA sequences measured by a massively parallel reporter assay (MPRA). The model emits three regression outputs, one per human cell line: K562, HepG2 and SK-N-SH (in that order).
The architecture consists of three convolutional blocks, one shared fully-connected block, and a branched grouped-linear tower with an independent parameter set per cell line. Please refer to the Training Details section for more information on the training process.
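To illustrate what "an independent parameter set per cell line" means in the branched tower, here is a minimal pure-Python sketch of a grouped linear layer. The helper name `grouped_linear`, the toy dimensions, and the toy weights are illustrative assumptions; this is not the MultiMolecule or boda2 implementation.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def grouped_linear(inputs: List[Vector], weights: List[Matrix], biases: List[Vector]) -> List[Vector]:
    """Apply one independent linear map per branch (hypothetical helper).

    inputs[g]  : input vector of branch g
    weights[g] : (out_features x in_features) matrix owned by branch g
    biases[g]  : bias vector of length out_features owned by branch g
    """
    outputs = []
    for x, w, b in zip(inputs, weights, biases):
        # Each branch multiplies only its own weights with its own input,
        # so no parameters are shared across cell lines.
        outputs.append([sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)])
    return outputs


CELL_LINES = ["K562", "HepG2", "SK-N-SH"]  # output order stated in the model description

# Toy example: 3 branches, each mapping 2 shared features to 1 activity value.
shared = [1.0, 2.0]
inputs = [shared, shared, shared]
weights = [[[1.0, 0.0]], [[0.0, 1.0]], [[1.0, 1.0]]]
biases = [[0.0], [0.0], [0.5]]

activities = {
    name: out[0]
    for name, out in zip(CELL_LINES, grouped_linear(inputs, weights, biases))
}
print(activities)
```

Because every branch owns its weights, changing the parameters of one cell line's branch leaves the predictions for the other two cell lines untouched.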
Model Specification
| Num Layers | Hidden Size | Num Parameters (M) | FLOPs (M) | MACs (M) | Max Num Tokens |
|---|---|---|---|---|---|
| 8 | 420 | 4.11 | 332.95 | 165.70 | 600 |
Links
- Code: multimolecule.malinois
- Weights: multimolecule/malinois
- Data: MPRA libraries across K562, HepG2, and SK-N-SH human cell lines
- Paper: Machine-guided design of cell-type-targeting cis-regulatory elements
- Developed by: Sager J. Gosai, Rodrigo I. Castro, Natalia Fuentes, John C. Butts, Kousuke Mouri, Michael Alasoadura, Susan Kales, Thanh Thanh L. Nguyen, Ramil R. Noche, Arya S. Rao, Mary T. Joy, Pardis C. Sabeti, Steven K. Reilly, Ryan Tewhey
- Model type: 1D CNN with cell-type-specific grouped-linear output head for MPRA cis-regulatory element activity
- Original Repository: sjgosai/boda2
Usage
The model file depends on the multimolecule library. You can install it using pip: `pip install multimolecule`
Direct Use
CRE Activity Prediction
You can use this model directly to predict the cell-type-informed CRE activity (K562, HepG2, SK-N-SH) of a sequence. Malinois pads each ~200 bp candidate to 600 bp with fixed MPRA plasmid flanks before inference, so the model itself expects a pre-padded 600 bp sequence.
Interface
- Input length: fixed 600 bp window
- Padding: each ~200 bp candidate CRE is centered and padded with fixed MPRA plasmid flanks (MPRA_UPSTREAM/MPRA_DOWNSTREAM); flank padding is part of the data pipeline, not the model
- Output: 3 cell-line CRE activity values (K562, HepG2, SK-N-SH)
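The padding step in the interface above can be sketched as follows. This is a minimal illustration, assuming the candidate is centered and the window is filled with the flank bases adjacent to the insert. The helper `pad_with_flanks` is hypothetical, and the flank strings are placeholders, not the real MPRA plasmid sequences.

```python
def pad_with_flanks(candidate: str, upstream: str, downstream: str, target_length: int = 600) -> str:
    """Center `candidate` in a window of `target_length` bases (hypothetical helper)."""
    missing = target_length - len(candidate)
    if missing < 0:
        raise ValueError("candidate is longer than the target window")
    left = missing // 2
    right = missing - left
    if left > len(upstream) or right > len(downstream):
        raise ValueError("flanks are too short to fill the window")
    # Assumption: use the bases adjacent to the insert, i.e. the end of the
    # upstream flank and the start of the downstream flank.
    left_flank = upstream[len(upstream) - left:]
    right_flank = downstream[:right]
    return left_flank + candidate + right_flank


# Placeholder flanks for illustration only: NOT the real plasmid sequences.
MPRA_UPSTREAM = "ACGT" * 75    # 300 bp stand-in
MPRA_DOWNSTREAM = "TGCA" * 75  # 300 bp stand-in

candidate = "G" * 200
padded = pad_with_flanks(candidate, MPRA_UPSTREAM, MPRA_DOWNSTREAM)
print(len(padded), padded[200:400] == candidate)
```

Keeping this step outside the model, as the interface notes, means the same weights can be reused with any pipeline that already produces 600 bp windows.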
Training Details
Malinois was trained to predict quantitative, cell-type-informed CRE activity from DNA sequence.
Training Data
Malinois was trained on an MPRA dataset measuring the regulatory activity of ~200 bp sequences across three human cell lines (K562, HepG2 and SK-N-SH). Each training example is a sequence with three continuous activity values (log2 fold-change over input), one per cell line. Genomic sequences were split by chromosome into training, validation, and test sets to avoid sequence leakage.
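A chromosome-based split of this kind can be sketched as below. The held-out chromosome sets, the record layout, and the activity values are placeholders chosen for illustration; they are not the sets or data used to train Malinois.

```python
from typing import Dict, List, Tuple

# (chromosome, sequence, activities for K562 / HepG2 / SK-N-SH)
Record = Tuple[str, str, Tuple[float, float, float]]

# Placeholder hold-out sets: assumptions for this sketch only.
VALIDATION_CHROMOSOMES = {"chr19"}
TEST_CHROMOSOMES = {"chr7"}


def split_by_chromosome(records: List[Record]) -> Dict[str, List[Record]]:
    """Assign every record to exactly one split based on its chromosome."""
    splits: Dict[str, List[Record]] = {"train": [], "validation": [], "test": []}
    for record in records:
        chromosome = record[0]
        if chromosome in TEST_CHROMOSOMES:
            splits["test"].append(record)
        elif chromosome in VALIDATION_CHROMOSOMES:
            splits["validation"].append(record)
        else:
            splits["train"].append(record)
    return splits


# Made-up toy records.
records = [
    ("chr1", "ACGT", (0.5, -0.2, 1.1)),
    ("chr7", "GGCA", (1.3, 0.0, 0.4)),
    ("chr19", "TTAG", (-0.7, 2.1, 0.9)),
    ("chr1", "CCGA", (0.1, 0.3, -0.5)),
]
splits = split_by_chromosome(records)
print({name: len(items) for name, items in splits.items()})
```

Splitting on the chromosome rather than on individual sequences keeps overlapping or neighbouring genomic sequences in the same split, which is what prevents leakage.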
Training Procedure
Pre-training
The model was trained to minimize an L1 + KL-divergence mixed loss between predicted and measured cell-type CRE activities, with the architecture and training hyperparameters selected by Bayesian optimization.
- Optimizer: Adam
- Loss: L1 + KL-divergence mixed loss
- Early stopping on validation loss
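One plausible form of an L1 + KL-divergence mixed loss is sketched below for a single example. It assumes the KL term compares softmax-normalized activity profiles across the three cell lines; that normalization, the helper names, and the weights `alpha` and `beta` are assumptions of this sketch rather than confirmed details of the original training code.

```python
import math
from typing import List, Sequence


def softmax(values: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one activity profile."""
    highest = max(values)
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def l1_kl_mixed_loss(pred: Sequence[float], target: Sequence[float],
                     alpha: float = 1.0, beta: float = 1.0) -> float:
    """alpha * mean absolute error + beta * KL(softmax(target) || softmax(pred))."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)
    pred_dist = softmax(pred)
    target_dist = softmax(target)
    kl = sum(t * math.log(t / p) for t, p in zip(target_dist, pred_dist))
    return alpha * l1 + beta * kl


target = [0.5, -0.5, 1.0]    # made-up K562, HepG2, SK-N-SH activities
shifted = [1.5, 0.5, 2.0]    # same relative profile, offset by +1
reordered = [1.0, -0.5, 0.5]  # different relative profile

print(l1_kl_mixed_loss(target, target))
print(l1_kl_mixed_loss(shifted, target))
print(l1_kl_mixed_loss(reordered, target))
```

In this formulation the L1 term penalizes errors in absolute activity, while the KL term only reacts when the relative profile across cell lines is wrong, which is the property that matters for cell-type specificity.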
Citation
Note
The artifacts distributed in this repository are part of the MultiMolecule project. If you use MultiMolecule in your research, you must cite the MultiMolecule project.
Contact
Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the Malinois paper for questions or comments on the paper/model.
License
This model implementation is licensed under the GNU Affero General Public License.
For additional terms and clarifications, please refer to our License FAQ.
multimolecule.models.malinois
DnaTokenizer
Bases: Tokenizer
Tokenizer for DNA sequences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `Alphabet \| str \| List[str] \| None` | Alphabet to use for tokenization. | `None` |
| | `int` | Size of kmer to tokenize. | `1` |
| | `bool` | Whether to tokenize into codons. | `False` |
| | `bool` | Whether to replace U with T. | `True` |
| | `bool` | Whether to convert input to uppercase. | `True` |
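The preprocessing options in the table above can be illustrated with a small standalone function. This is a hedged sketch, not `DnaTokenizer` itself: the function name and argument names are hypothetical, and it assumes overlapping k-mer windows with stride 1 and non-overlapping triplets for codons.

```python
from typing import List


def simple_dna_tokenize(sequence: str, nmers: int = 1, codon: bool = False,
                        replace_u_with_t: bool = True, do_upper_case: bool = True) -> List[str]:
    """Toy tokenizer mirroring the documented options (hypothetical helper)."""
    if do_upper_case:
        sequence = sequence.upper()
    if replace_u_with_t:
        sequence = sequence.replace("U", "T").replace("u", "t")
    if codon:
        # Assumption: non-overlapping triplets, trailing partial codon dropped.
        return [sequence[i:i + 3] for i in range(0, len(sequence) - 2, 3)]
    # Assumption: overlapping windows of size `nmers` with stride 1.
    return [sequence[i:i + nmers] for i in range(len(sequence) - nmers + 1)]


print(simple_dna_tokenize("acgu"))
print(simple_dna_tokenize("ACGT", nmers=2))
print(simple_dna_tokenize("ACGTACG", codon=True))
```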
Source code in multimolecule/tokenisers/dna/tokenization_dna.py
MalinoisConfig
Bases: PreTrainedConfig
This is the configuration class to store the configuration of a
MalinoisModel. It is used to instantiate a Malinois model according to the
specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a
similar configuration to that of the Malinois sjgosai/boda2 architecture.
Configuration objects inherit from PreTrainedConfig and can be used to
control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `int` | Vocabulary size of the Malinois model. Defines the number of feature channels in the one-hot encoded input fed to the first convolution. Defaults to 5. | `5` |
| | `int` | The fixed length (in base pairs) of the input fed to the first convolution. Upstream Malinois pads each 200 bp candidate sequence with fixed MPRA plasmid flanks up to this length before the convolution stack. Defaults to 600. | `600` |
| | `list[int] \| None` | Number of output channels for each convolutional block. | `None` |
| | `list[int] \| None` | Convolution kernel size for each convolutional block. | `None` |
| | `int` | Number of fully-connected layers between the convolutional stack and the branched tower. | `1` |
| | `int` | Hidden size for each fully-connected layer. | `1000` |
| | `str` | The non-linear activation function (function or string) applied after the convolutional and linear layers. If string, | `'relu'` |
| | `float` | The dropout probability for the fully-connected layers. | `0.11625456877954289` |
| | `int` | Number of grouped (branched) layers, one independent tower per output cell line. | `3` |
| | `int` | Hidden size for each branch in the branched tower. | `140` |
| | `str` | The non-linear activation function applied between branched layers. | `'relu'` |
| | `float` | The dropout probability for the branched tower. | `0.5757068086404574` |
| | `float` | The epsilon used by the batch normalization layers. | `1e-05` |
| | `float` | The momentum used by the batch normalization layers. | `0.1` |
| | `int` | Number of regression outputs. Malinois predicts cell-type-informed cis-regulatory activity for three human cell lines: K562, HepG2 and SK-N-SH (in that order). | `3` |
| | `HeadConfig \| None` | The configuration of the prediction head. Defaults to a regression head ( | `None` |
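The first two parameters describe a one-hot input with 5 feature channels and a fixed length of 600 bp. A minimal sketch of such an encoding follows; the channel order `ACGTN` and the helper name `one_hot` are assumptions of this example, and the actual token order used by the library may differ.

```python
from typing import List

ALPHABET = "ACGTN"  # assumed channel order, 5 symbols to match a vocabulary size of 5


def one_hot(sequence: str, alphabet: str = ALPHABET) -> List[List[float]]:
    """Return a (channels x length) matrix with a single 1.0 per position."""
    index = {base: i for i, base in enumerate(alphabet)}
    matrix = [[0.0] * len(sequence) for _ in alphabet]
    for position, base in enumerate(sequence.upper()):
        # A base outside the alphabet raises KeyError instead of being silently dropped.
        matrix[index[base]][position] = 1.0
    return matrix


window = "ACGTN" * 120  # toy 600 bp input
encoded = one_hot(window)
print(len(encoded), len(encoded[0]))
```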
Source code in multimolecule/models/malinois/configuration_malinois.py
MalinoisForSequencePrediction
Bases: MalinoisPreTrainedModel
Source code in multimolecule/models/malinois/modeling_malinois.py
MalinoisModel
Bases: MalinoisPreTrainedModel
Source code in multimolecule/models/malinois/modeling_malinois.py
MalinoisModelOutput `dataclass`
Bases: ModelOutput
Base class for outputs of the Malinois model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `torch.FloatTensor` of shape `(batch_size, flattened_conv_features)` | Flattened feature map produced by the convolutional encoder. | `None` |
| | `torch.FloatTensor` of shape `(batch_size, num_labels * branched_channels)` | Branch-major sequence-level representation produced by the fully-connected and branched tower. The first | `None` |
| | `tuple(torch.FloatTensor)`, *optional* | Always | `None` |
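To make the branch-major layout of the `(batch_size, num_labels * branched_channels)` representation concrete, the sketch below splits one flat vector into per-branch slices. It assumes "branch-major" means all channels of the first branch come first, followed by the second and third; the helper `split_branches` is hypothetical, and the sizes 3 and 140 are the defaults listed in the configuration.

```python
from typing import List, Sequence


def split_branches(flat: Sequence[float], num_labels: int = 3,
                   branched_channels: int = 140) -> List[List[float]]:
    """Split one flat branch-major vector into `num_labels` per-branch vectors."""
    expected = num_labels * branched_channels
    if len(flat) != expected:
        raise ValueError(f"expected {expected} values, got {len(flat)}")
    return [
        list(flat[g * branched_channels:(g + 1) * branched_channels])
        for g in range(num_labels)
    ]


flat = [float(i) for i in range(3 * 140)]  # toy representation for one sequence
k562, hepg2, sknsh = split_branches(flat)
print(len(k562), k562[0], hepg2[0], sknsh[0])
```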
Source code in multimolecule/models/malinois/modeling_malinois.py
MalinoisPreTrainedModel
Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.