a2z-chromatin
Disclaimer
This is an UNOFFICIAL implementation of Modeling chromatin state from sequence across angiosperms using recurrent convolutional neural networks by Travis Wrightsman et al.
The OFFICIAL repository of a2z-chromatin is at twrightsman/a2z-regulatory.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.
The team releasing a2z-chromatin did not write a model card for this model, so this model card was written by the MultiMolecule team.
Model Details
a2z-chromatin is a recurrent convolutional neural network (CNN+BLSTM, DanQ topology) trained to predict chromatin state from a fixed-length 600 bp one-hot encoded angiosperm DNA sequence. The single convolutional layer applies 320 filters with a kernel size of 26, followed by dropout and a max-pool over 13 positions; the resulting feature sequence is fed to a bidirectional LSTM (320 units per direction) whose final forward and backward hidden states are concatenated, projected through a 925-unit dense layer, and read out as a single per-window probability.
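The layer sizes above imply a simple length calculation, sketched below. It assumes an unpadded ("valid") stride-1 convolution, as in the original DanQ design, and a non-overlapping max-pool that discards any incomplete trailing window. Neither detail is stated in this card, and `trace_shapes` is a hypothetical helper, not part of MultiMolecule.

```python
def trace_shapes(seq_len=600, kernel_size=26, pool_size=13,
                 lstm_units=320, dense_units=925):
    """Trace feature sizes through a DanQ-style CNN+BLSTM stack."""
    # Assumption: "valid" convolution with stride 1, so the output
    # shrinks by kernel_size - 1 positions.
    conv_positions = seq_len - kernel_size + 1
    # Assumption: pooling stride equals the pool size and an incomplete
    # trailing window is dropped.
    pooled_positions = conv_positions // pool_size
    return {
        "conv_positions": conv_positions,
        "pooled_positions": pooled_positions,
        # Final forward and backward hidden states are concatenated.
        "blstm_features": 2 * lstm_units,
        "dense_features": dense_units,
        "outputs": 1,
    }


if __name__ == "__main__":
    for name, size in trace_shapes().items():
        print(f"{name}: {size}")
```

Under these assumptions the 600 bp window yields 575 convolution positions and 44 pooled positions, and the concatenated LSTM states form a 640-dimensional vector that feeds the 925-unit dense layer.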
Two checkpoints are released by the authors: a2z-accessibility (predicts chromatin accessibility from leaf ATAC-seq) and a2z-methylation (predicts lack of CG/CHG/CHH DNA methylation). Both share the same architecture and differ only in the supervision used during training. The canonical MultiMolecule checkpoint is the accessibility model and is registered under regulatory-sequence-prediction; the methylation checkpoint can be converted with the same architecture but belongs to a DNA methylation task rather than the regulatory-sequence task.
Please refer to the Training Details section for more information on the training process.
Model Specification
| Num Conv Layers | Num LSTM Layers | Hidden Size | Num Parameters (M) | FLOPs (M) | MACs (M) | Max Num Tokens |
|---|---|---|---|---|---|---|
| 1 | 1 (bidirectional) | 925 | 1.23 | 14.61 | 7.30 | 600 |
Links
- Code: multimolecule.a2zchromatin
- Weights: multimolecule/a2zchromatin
- Data: Leaf ATAC-seq from 12 angiosperm species + unmethylated-region calls from 10 angiosperms
- Paper: Modeling chromatin state from sequence across angiosperms using recurrent convolutional neural networks
- Developed by: Travis Wrightsman, Alexandre P. Marand, Peter A. Crisp, Nathan M. Springer, Edward S. Buckler
- Model type: 1D CNN + bidirectional LSTM over 600 bp angiosperm DNA for chromatin accessibility / methylation prediction
- Original Repository: twrightsman/a2z-regulatory
Usage
The model file depends on the multimolecule library. You can install it using pip: `pip install multimolecule`
Direct Use
Chromatin State Prediction
You can use this model directly to predict the chromatin accessibility (or lack of DNA methylation, for the methylation variant) of a 600 bp angiosperm DNA sequence.
Interface
- Input length: fixed 600 bp DNA window
- Alphabet: DNA IUPAC tokens; ambiguous bases are encoded with the same fractional A/C/G/T mixtures as the upstream implementation, and non-IUPAC tokens map to an all-zero vector
- Output: single per-window logit (binary chromatin accessibility for a2z-accessibility, lack of DNA methylation for a2z-methylation)
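The input encoding described in this list can be illustrated with a small pure-Python sketch. It assumes the channel order A, C, G, T and that each ambiguity code spreads its weight uniformly over the bases it stands for; the card only says the mixtures are "fractional", so the exact weights are an assumption. `IUPAC_BASES` and `encode_window` are hypothetical names and not part of the MultiMolecule API.

```python
# Standard IUPAC nucleotide codes and the bases each one can represent.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
CHANNELS = "ACGT"  # assumed channel order


def encode_window(sequence, length=600):
    """Return a length x 4 matrix of fractional one-hot rows."""
    if len(sequence) != length:
        raise ValueError(f"expected {length} bases, got {len(sequence)}")
    rows = []
    for symbol in sequence.upper():
        bases = IUPAC_BASES.get(symbol, "")  # non-IUPAC symbols give ""
        # Assumption: uniform weight over the compatible bases.
        weight = 1.0 / len(bases) if bases else 0.0
        rows.append([weight if base in bases else 0.0 for base in CHANNELS])
    return rows


if __name__ == "__main__":
    for row in encode_window("ARN?", length=4):
        print(row)
```

Here `R` (A or G) becomes half A and half G, `N` becomes a quarter of each base, and the non-IUPAC `?` becomes a row of zeros.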
Training Details
a2z-chromatin was trained to predict per-window chromatin state across angiosperms using a single shared cross-species DanQ topology.
Training Data
a2z-chromatin was trained on two cross-species data resources:
- Chromatin accessibility: leaf ATAC-seq peaks from 12 angiosperm species, with each 600 bp genomic interval labelled as accessible or inaccessible.
- DNA methylation: unmethylated-region (UMR) calls from 10 angiosperm species, with each 600 bp genomic interval labelled as unmethylated or methylated. Unmethylated regions overlap significantly with accessible chromatin in plants, so the two tasks share the same architecture.
Each training example is a 600 bp one-hot encoded DNA sequence with a single binary label.
Training Procedure
Pre-training
Each variant was trained to minimize a binary cross-entropy loss between its sigmoid-activated per-window prediction and the observed accessibility / unmethylation label, sweeping cross-species splits to evaluate generalization.
- Optimizer: Adam
- Loss: Binary cross-entropy
- Regularization: Dropout (0.2 after the convolution, 0.5 after the bidirectional LSTM)
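As a rough illustration of the objective named above, the sketch below computes the sigmoid probability and the binary cross-entropy for a single window directly from its logit. The numerically stable logit form is a standard equivalent of applying a sigmoid followed by cross-entropy; it is not claimed to be how the original training code computes the loss, and both function names are hypothetical.

```python
import math


def sigmoid(logit):
    """Map a logit to a probability without overflowing for large |logit|."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    exp_logit = math.exp(logit)
    return exp_logit / (1.0 + exp_logit)


def bce_with_logits(logit, label):
    """Binary cross-entropy between sigmoid(logit) and a 0/1 label."""
    # Equivalent to -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(logit),
    # rearranged so that exp() is only taken of non-positive values.
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


if __name__ == "__main__":
    logit, label = 2.0, 1.0  # made-up example values
    print("probability:", sigmoid(logit))
    print("loss:", bce_with_logits(logit, label))
```

A logit of 0 corresponds to a probability of 0.5 and a loss of ln 2 for either label, which makes a convenient sanity check.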
Citation
Note
The artifacts distributed in this repository are part of the MultiMolecule project. If you use MultiMolecule in your research, you must cite the MultiMolecule project.
Contact
Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the a2z-chromatin paper for questions or comments on the paper/model.
License
This model implementation is licensed under the GNU Affero General Public License.
For additional terms and clarifications, please refer to our License FAQ.
multimolecule.models.a2zchromatin
DnaTokenizer
Bases: Tokenizer
Tokenizer for DNA sequences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | Alphabet \| str \| List[str] \| None | Alphabet to use for tokenization. | None |
| | int | Size of kmer to tokenize. | 1 |
| | bool | Whether to tokenize into codons. | False |
| | bool | Whether to replace U with T. | True |
| | bool | Whether to convert input to uppercase. | True |
Source code in multimolecule/tokenisers/dna/tokenization_dna.py
A2zChromatinConfig
Bases: PreTrainedConfig
This is the configuration class to store the configuration of an
A2zChromatinModel. It is used to instantiate an a2z-chromatin model
according to the specified arguments, defining the model architecture. Instantiating a configuration with the
defaults will yield a similar configuration to that of the a2z-chromatin
twrightsman/a2z-regulatory architecture (DanQ topology trained on
angiosperm chromatin data, distributed via Kipoi as a2z-accessibility and a2z-methylation).
Configuration objects inherit from PreTrainedConfig and can be used to
control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | int | Vocabulary size of the a2z-chromatin model. Upstream a2z-chromatin consumes four nucleotide channels, but the converted MultiMolecule checkpoint expands the first convolution to the DNA IUPAC tokenizer alphabet so ambiguity tokens reproduce upstream fractional one-hot encodings. Defaults to 16. | 16 |
| | int | The fixed length of the input DNA sequence in base pairs. Defaults to 600. | 600 |
| | int | Number of filters in the first (and only) 1D convolution. Defaults to 320. | 320 |
| | int | Kernel size of the 1D convolution. Defaults to 26. | 26 |
| | float | Dropout probability applied after the convolution. Defaults to 0.2. | 0.2 |
| | int | Max-pool window size and stride applied after the convolution. Defaults to 13. | 13 |
| | int | Hidden dimensionality of each direction of the bidirectional LSTM. Defaults to 320. | 320 |
| | float | Dropout probability applied after the bidirectional LSTM. Defaults to 0.5. | 0.5 |
| | int | Hidden dimensionality of the fully-connected layer between the LSTM and the prediction head. Defaults to 925. | 925 |
| | str | The non-linear activation function (function or string) applied after the convolution. If string, | 'relu' |
| | int | Number of output labels. a2z-chromatin predicts a single binary target (chromatin accessibility for the | 1 |
| | HeadConfig \| None | The configuration of the prediction head. Defaults to a binary classification head ( | None |
Source code in multimolecule/models/a2zchromatin/configuration_a2zchromatin.py
A2zChromatinForSequencePrediction
Bases: A2zChromatinPreTrainedModel
Source code in multimolecule/models/a2zchromatin/modeling_a2zchromatin.py
A2zChromatinModel
Bases: A2zChromatinPreTrainedModel
Source code in multimolecule/models/a2zchromatin/modeling_a2zchromatin.py
A2zChromatinModelOutput (dataclass)
Bases: ModelOutput
Base class for outputs of the a2z-chromatin backbone.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `torch.FloatTensor` of shape `(batch_size, hidden_size)` | Sequence-level representation produced by the DanQ CNN+BLSTM encoder and dense projection. The upstream Keras model returns only this final feature vector rather than per-position hidden states. | None |
| | `torch.FloatTensor` of shape `(batch_size, hidden_size)` | Alias of | None |
| | `tuple(torch.FloatTensor)`, *optional* | Always | None |
| | `tuple(torch.FloatTensor)`, *optional* | Always | None |
Source code in multimolecule/models/a2zchromatin/modeling_a2zchromatin.py
A2zChromatinPreTrainedModel
Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.