DeepCpG-DNA¶
DNA-only convolutional neural network from DeepCpG for predicting per-cell DNA methylation states in single cells from a CpG-centered sequence window.
Disclaimer¶
This is an UNOFFICIAL implementation of DeepCpG: accurate prediction of single-cell DNA methylation states using deep learning by Christof Angermueller, et al.
The OFFICIAL repository of DeepCpG is at cangermueller/deepcpg.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.
The team that released DeepCpG-DNA did not write a model card for this model, so this model card was written by the MultiMolecule team.
Model Details¶
DeepCpG-DNA is the DNA submodule of the DeepCpG joint model. It is a 1D convolutional neural network that predicts the per-cell methylation state of a CpG site from a fixed-length 1001 bp DNA window centered on the site. The model consumes a one-hot encoded sequence and applies valid-padded convolutional blocks (Conv1D + ReLU + MaxPool) followed by a dense bottleneck and one binary classification head per single cell in the training dataset. Please refer to the Training Details section for more information on the training process.
The full DeepCpG model combines this DNA submodule with a recurrent CpG-context submodule and a joint head; this model card covers the DNA submodule only.
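To make the effect of valid-padded convolution and max-pooling on the 1001 bp window concrete, the following minimal sketch traces the sequence length through each block. The kernel sizes (11, 3) and pool sizes (4, 2) are assumptions for illustration and are not stated in this model card; the sketch also assumes stride-1 convolutions and non-overlapping pooling with floor rounding.

```python
def trace_lengths(seq_len, kernel_sizes, pool_sizes):
    """Return the sequence length after every conv and pool step."""
    lengths = [seq_len]
    for kernel, pool in zip(kernel_sizes, pool_sizes):
        # Valid (unpadded) stride-1 convolution shortens the input by kernel - 1.
        seq_len = seq_len - kernel + 1
        lengths.append(seq_len)
        # Non-overlapping max-pooling divides the length, discarding any remainder.
        seq_len = seq_len // pool
        lengths.append(seq_len)
    return lengths


# Hypothetical two-block configuration (assumed values, see lead-in).
print(trace_lengths(1001, [11, 3], [4, 2]))
# [1001, 991, 247, 245, 122]
```

Because no padding is applied, the flattened feature size fed to the dense bottleneck depends on the input length, which is one reason the window length has to be fixed.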
Variants¶
The DeepCpG-DNA module is trained per single-cell dataset, so the number of output cells depends on the variant.
| Dataset | Architecture | Cells | Hub repository |
|---|---|---|---|
| Smallwood 2014 serum mESC | CnnL2h128 | 18 | deepcpgdna-smallwood2014-serum |
| Smallwood 2014 2i mESC | CnnL3h128 | 12 | deepcpgdna-smallwood2014-2i |
| Hou 2016 HCC | CnnL2h128 | 25 | deepcpgdna-hou2016-hcc |
| Hou 2016 HepG2 | CnnL3h128 | 6 | deepcpgdna-hou2016-hepg2 |
| Hou 2016 mESC | CnnL2h128 | 6 | deepcpgdna-hou2016-mesc |
Model Specification¶
| Architecture | Num Conv Layers | Hidden Size | Num Cells | Num Parameters (M) | FLOPs (M) | MACs (M) | Max Num Tokens |
|---|---|---|---|---|---|---|---|
| CnnL2h128 | 2 | 128 | 18 | 4.11 | 70.63 | 35.06 | 1001 |
| CnnL3h128 | 3 | 128 | 12 | 4.43 | 165.02 | 82.18 | 1001 |
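As a rough sanity check on the parameter counts above, the sketch below tallies the weights and biases of a Conv1D stack, a dense bottleneck, and a per-cell output layer. The filter counts, kernel sizes, and pool sizes are assumptions chosen for illustration (this card does not list them), and the vocabulary size of 5 is taken from the configuration default. Under these assumptions the totals round to 4.11 M and 4.43 M.

```python
def count_parameters(vocab_size, seq_len, filters, kernels, pools, hidden, cells):
    """Count trainable parameters of a valid-padded 1D CNN with a dense bottleneck."""
    params = 0
    channels, length = vocab_size, seq_len
    for n_filters, kernel, pool in zip(filters, kernels, pools):
        params += channels * kernel * n_filters + n_filters  # conv weights + biases
        length = (length - kernel + 1) // pool               # valid conv, then max-pool
        channels = n_filters
    params += channels * length * hidden + hidden  # dense bottleneck on flattened features
    params += hidden * cells + cells               # one output logit per cell
    return params


# Assumed layer sizes (hypothetical, not taken from this model card).
l2 = count_parameters(5, 1001, [128, 256], [11, 3], [4, 2], hidden=128, cells=18)
l3 = count_parameters(5, 1001, [128, 256, 512], [11, 3, 3], [4, 2, 2], hidden=128, cells=12)
print(round(l2 / 1e6, 2), round(l3 / 1e6, 2))
# 4.11 4.43
```

Most of the parameters sit in the dense bottleneck, since it connects every flattened convolutional feature to the 128-dimensional embedding.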
Links¶
- Code: multimolecule.deepcpgdna
- Data: scBS-seq (Smallwood 2014) and scRRBS-seq (Hou 2016) single-cell bisulfite sequencing datasets
- Paper: DeepCpG: accurate prediction of single-cell DNA methylation states using deep learning
- Developed by: Christof Angermueller, Heather J. Lee, Wolf Reik, Oliver Stegle
- Model type: Two- or three-layer 1D CNN over a 1001 bp CpG-centered DNA window for per-cell binary methylation prediction
- Original Repository: cangermueller/deepcpg
Usage¶
The model file depends on the multimolecule library. You can install it using pip (`pip install multimolecule`).
Direct Use¶
Single-Cell Methylation Prediction¶
You can use this model directly to predict the per-cell methylation state of a 1001 bp DNA window centered on a CpG site.
Each logit is a per-cell methylation score for one of the single cells in the chosen training dataset; apply a sigmoid to obtain methylation probabilities.
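The conversion from logits to probabilities can be sketched in plain Python as follows. The function names and the 0.5 decision threshold are assumptions for illustration; the logit values are made up and do not come from a real checkpoint.

```python
import math


def sigmoid(logit):
    """Numerically stable logistic function."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    exp_logit = math.exp(logit)
    return exp_logit / (1.0 + exp_logit)


def methylation_calls(logits, threshold=0.5):
    """Map per-cell logits to probabilities and binary calls (threshold is an assumption)."""
    probs = [sigmoid(z) for z in logits]
    calls = [int(p >= threshold) for p in probs]
    return probs, calls


# Illustrative logits for three cells.
probs, calls = methylation_calls([2.0, -1.5, 0.0])
print(calls)
```

Each position in `probs` corresponds to one single cell of the training dataset, in the order used by the chosen checkpoint.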
Interface¶
- Input length: fixed 1001 bp DNA window centered on a CpG site
- Padding: not supported; pad or crop genomic windows so they match `sequence_length` exactly
- Alphabet: DNA (`A`, `C`, `G`, `T`); `N` is encoded as an all-zero channel
- Output: per-cell methylation logits; the number of cells is dataset-specific (see Variants table)
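One way to prepare inputs that satisfy these constraints is sketched below. The helper names are hypothetical, filling out-of-range positions with `N` is an assumption, and the four-channel layout is only one possible encoding: the library's tokenizer uses a vocabulary of 5 by default and its exact channel order is not shown here.

```python
CHANNELS = {"A": 0, "C": 1, "G": 2, "T": 3}


def extract_window(chrom_seq, center, length=1001):
    """Cut a window of exactly `length` bases centered on `center` (0-based)."""
    half = length // 2
    start, end = center - half, center + half + 1
    # Positions outside the chromosome are filled with N (assumption).
    left_pad = max(0, -start)
    right_pad = max(0, end - len(chrom_seq))
    core = chrom_seq[max(0, start):min(len(chrom_seq), end)]
    return "N" * left_pad + core.upper() + "N" * right_pad


def one_hot(seq):
    """Encode A/C/G/T as unit vectors; N (or anything else) stays all-zero."""
    rows = []
    for base in seq:
        row = [0.0, 0.0, 0.0, 0.0]
        index = CHANNELS.get(base)
        if index is not None:
            row[index] = 1.0
        rows.append(row)
    return rows


# Tiny example with a 5 bp window so the result is easy to inspect.
window = extract_window("ACGTACG", center=1, length=5)
encoded = one_hot(window)
print(window)
```

With the default `length=1001`, the CpG cytosine ends up at index 500 of the window.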
Training Details¶
DeepCpG-DNA was trained to predict the per-cell methylation state of CpG sites from their flanking DNA context.
Training Data¶
DeepCpG-DNA was trained on single-cell bisulfite sequencing datasets:
- Smallwood 2014: scBS-seq profiles of mouse embryonic stem cells, with 18 serum and 12 2i mESCs (excluding two serum cells whose methylation pattern deviated strongly from the remainder).
- Hou 2016: scRRBS-seq profiles of 25 human hepatocellular carcinoma (HCC) cells, 6 human hepatoblastoma-derived (HepG2) cells, and 6 mESCs, restricted to CpG sites covered by at least four reads.
Each training example is a 1001 bp DNA window centered on a CpG site, paired with one label per cell: methylated, unmethylated, or missing. Chromosomes were split into training, validation, and test sets to avoid sequence leakage.
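A chromosome-level split can be sketched as follows. The chromosome assignments and site positions are hypothetical placeholders, since this model card does not list which chromosomes went into which set.

```python
from collections import defaultdict


def split_by_chromosome(sites, val_chroms, test_chroms):
    """Assign each (chromosome, position) site to a split based on its chromosome."""
    splits = defaultdict(list)
    for chrom, position in sites:
        if chrom in test_chroms:
            name = "test"
        elif chrom in val_chroms:
            name = "val"
        else:
            name = "train"
        splits[name].append((chrom, position))
    return dict(splits)


# Made-up CpG sites and chromosome assignments, for illustration only.
sites = [("chr1", 3000827), ("chr2", 4001), ("chr3", 9120), ("chr1", 3001007)]
splits = split_by_chromosome(sites, val_chroms={"chr2"}, test_chroms={"chr3"})
print(splits)
```

Splitting by chromosome rather than by site keeps overlapping 1001 bp windows from neighbouring CpGs in the same set.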
Training Procedure¶
Pre-training¶
The model was trained to minimize a per-cell binary cross-entropy loss, comparing its predicted per-cell methylation probabilities (sigmoid of the per-cell logits) against the observed single-cell bisulfite labels. Missing labels are masked out during training.
- Optimizer: Adam
- Loss: Per-cell binary cross-entropy
- Regularization: Dropout and L2 weight decay
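The masked per-cell loss described above can be sketched in plain Python. Using `-1` as the marker for a missing label and averaging over observed cells are assumptions for illustration; the actual training code may encode and reduce the loss differently.

```python
import math


def masked_bce(logits, labels, missing=-1):
    """Mean binary cross-entropy over cells with an observed label."""
    total, count = 0.0, 0
    for logit, label in zip(logits, labels):
        if label == missing:
            continue  # missing labels contribute nothing to the loss
        # Stable form of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))].
        total += max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))
        count += 1
    return total / count if count else 0.0


# Three cells: methylated, missing, unmethylated (illustrative values).
loss = masked_bce([0.0, 5.0, 0.0], [1, -1, 0])
print(loss)
```

In this example the second cell is skipped entirely, so its large logit has no influence on the loss.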
Citation¶
Note
The artifacts distributed in this repository are part of the MultiMolecule project. If MultiMolecule supports your research, please cite the MultiMolecule project.
Contact¶
Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the DeepCpG paper for questions or comments on the paper/model.
License¶
This model implementation is licensed under the GNU Affero General Public License.
For additional terms and clarifications, please refer to our License FAQ.
API Reference¶
DeepCpgDnaConfig¶
Bases: PreTrainedConfig
This is the configuration class to store the configuration of a
DeepCpgDnaModel. It is used to instantiate a DeepCpG-DNA model according
to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will
yield a similar configuration to that of the DeepCpG DNA submodule
(cangermueller/deepcpg) CnnL2h128 architecture as distributed for the
Smallwood2014 serum mESC checkpoint.
Configuration objects inherit from PreTrainedConfig and can be used to
control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `int` | Vocabulary size of the DeepCpG-DNA model. DeepCpG consumes a one-hot encoding of DNA nucleotides, so this also defines the number of input channels of the first convolution. Defaults to 5. | `5` |
| | `int` | The fixed length of the DNA window (in base pairs) centered on a CpG site. Defaults to 1001. | `1001` |
| | `list[int] \| None` | Number of filters for each convolutional layer. | `None` |
| | `list[int] \| None` | Kernel size for each convolutional layer. | `None` |
| | `list[int] \| None` | Max-pool size applied after each convolutional layer. | `None` |
| | `int` | Dimensionality of the dense bottleneck embedding. This is the model's hidden size. Defaults to 128. | `128` |
| | `str` | The non-linear activation function (function or string) in the encoder. | `'relu'` |
| | `float` | The dropout probability for the bottleneck. | `0.0` |
| | `int` | Number of output labels. DeepCpG-DNA predicts per-cell methylation state, so this equals the number of single cells in the training dataset and is dataset-specific. Defaults to 18 to match the Smallwood2014 serum mESC checkpoint. | `18` |
| | `HeadConfig \| None` | The configuration of the prediction head. Defaults to a per-cell binary methylation head. | `None` |
Source code in multimolecule/models/deepcpgdna/configuration_deepcpgdna.py
DeepCpgDnaForSequencePrediction¶
Bases: DeepCpgDnaPreTrainedModel
The per-cell methylation (final dense) layer of DeepCpG-DNA is dataset-specific: it has one output per single
cell in the training dataset. num_labels therefore equals the number of cells in the chosen dataset (18 for the
shipped Smallwood2014 serum mESC checkpoint) and is exposed through the shared
SequencePredictionHead decoder.
Source code in multimolecule/models/deepcpgdna/modeling_deepcpgdna.py
DeepCpgDnaModel¶
Bases: DeepCpgDnaPreTrainedModel
Source code in multimolecule/models/deepcpgdna/modeling_deepcpgdna.py
DeepCpgDnaModelOutput dataclass¶
Bases: ModelOutput
Base class for outputs of the DeepCpG-DNA backbone.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `torch.FloatTensor` of shape `(batch_size, hidden_size)` | Final bottleneck embedding produced by the DeepCpG-DNA encoder. | `None` |
| | `torch.FloatTensor` of shape `(batch_size, hidden_size)` | Same tensor as | `None` |
| | `tuple[FloatTensor, ...] \| None` | Always | `None` |
Source code in multimolecule/models/deepcpgdna/modeling_deepcpgdna.py
DeepCpgDnaPreTrainedModel¶
Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.