ChromBPNet
Bias-factorized, base-resolution convolutional neural network for predicting chromatin accessibility (ATAC-seq / DNase-seq) from DNA sequence.
Disclaimer
This is an UNOFFICIAL implementation of "ChromBPNet: bias factorized, base-resolution deep learning models of chromatin accessibility reveal cis-regulatory sequence syntax, transcription factor footprints and regulatory variants" by Anusri Pampari et al.
The OFFICIAL repository of ChromBPNet is at kundajelab/chrombpnet.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.
The team that released ChromBPNet did not write a model card for this model, so this model card was written by the MultiMolecule team.
Model Details
ChromBPNet is a convolutional neural network (CNN) trained to predict base-resolution chromatin accessibility (ATAC-seq or DNase-seq) from primary DNA sequence with explicit enzyme-bias correction. It builds on the BPNet architecture and internally composes a bias sub-model with an accessibility sub-model. The composed output is factorized into profile and count branches, and the usable base-resolution prediction is reconstructed by ChromBPNetForProfilePrediction.postprocess. Please refer to the Training Details section for more information on the training process.
Model Specification
| Input Length | Profile Length | Num Layers | Hidden Size | Bias Hidden Size | Num Parameters (M) |
|---|---|---|---|---|---|
| 2114 | 1000 | 9 + 5 | 512 | 128 | 5.5 |
The accessibility sub-model has 1 stem convolution + 8 dilated residual blocks (512 filters); the bias sub-model has 1 stem convolution + 4 dilated residual blocks (128 filters).
Links
- Code: multimolecule.chrombpnet
- Weights: multimolecule/chrombpnet
- Data: RoboATAC ChromBPNet Models
- Paper: ChromBPNet: bias factorized, base-resolution deep learning models of chromatin accessibility reveal cis-regulatory sequence syntax, transcription factor footprints and regulatory variants
- Developed by: Anusri Pampari, Anna Shcherbina, Evgeny Kvon, Michael Kosicki, Surag Nair, Soumya Kundu, Arwa S. Kathiria, Viviana I. Risca, Kristiina Simola, Melissa J. Funk, Eileen E. M. Furlong, Len A. Pennacchio, William J. Greenleaf, Anshul Kundaje
- Model type: BPNet-style 1D dilated CNN composed with an enzyme-bias model for bias-corrected chromatin-accessibility prediction
- Original Repository: kundajelab/chrombpnet
Usage
The model file depends on the multimolecule library. You can install it using pip: `pip install multimolecule`
Direct Use
Chromatin Accessibility Profile Prediction
You can use this model directly to predict base-resolution chromatin accessibility of a DNA sequence:
The recombined track is the usable, bias-corrected base-resolution accessibility prediction.
Interface
- Input length: 2114 bp DNA window
- Profile length: 1000 bp
- Output: factorized (profile_logits, count_logits); recombine the bias-corrected base-resolution track via ChromBPNetForProfilePrediction.postprocess
- Composition: profile logits added across bias + accessibility sub-models; counts combined via logsumexp
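The relationship between the 2114 bp input window and the 1000 bp profile can be sketched with a little coordinate arithmetic. This is a minimal illustration that assumes the profile is centered in the input window (as the configuration reference describes); the helper names `window_for_center` and `profile_span` are hypothetical and not part of the multimolecule API.

```python
INPUT_LENGTH = 2114    # input window length, from the interface list above
PROFILE_LENGTH = 1000  # output profile length, from the interface list above


def window_for_center(center: int) -> tuple[int, int]:
    """Return the [start, end) input window centered on `center` (hypothetical helper)."""
    start = center - INPUT_LENGTH // 2
    return start, start + INPUT_LENGTH


def profile_span(window_start: int) -> tuple[int, int]:
    """Return the [start, end) span of the input covered by the profile.

    Assumes the profile is centered, leaving equal context on both sides.
    """
    flank = (INPUT_LENGTH - PROFILE_LENGTH) // 2  # 557 bp of context per side
    start = window_start + flank
    return start, start + PROFILE_LENGTH


window = window_for_center(10_000)
print(window)                   # (8943, 11057)
print(profile_span(window[0]))  # (9500, 10500)
```

Under this assumption, each side of the window contributes 557 bp of sequence context that is read by the model but not predicted.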
Training Details
ChromBPNet was trained to predict base-resolution chromatin accessibility profiles from ATAC-seq / DNase-seq with explicit enzyme-bias correction.
Training Data
The default ChromBPNet variant is the HEK293T GFP-control model from the RoboATAC ChromBPNet Models release (an automated ATAC-seq dataset from the Kundaje/Greenleaf labs). The accessibility and scaled-bias sub-models are composed internally.
Training Procedure
Pre-training
The model was trained with a composite loss: a multinomial negative log-likelihood on the per-position profile shape plus a mean-squared-error regression on the log total counts.
- Optimizer: Adam
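The composite loss described above can be sketched in plain Python for a single task and strand. This is an illustration under stated assumptions, not the training code: the multinomial coefficient is dropped because it does not depend on the logits, the count target is assumed to be log(1 + total counts), and the name `count_loss_weight` is hypothetical.

```python
import math


def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax over positions."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def composite_loss(
    profile_logits: list[float],
    count_logit: float,
    observed_counts: list[float],
    count_loss_weight: float = 1.0,  # hypothetical name for the count-loss weight
) -> float:
    """Multinomial NLL on the profile shape plus weighted MSE on log total counts."""
    log_probs = log_softmax(profile_logits)
    # Shape term: negative log-likelihood, omitting the multinomial coefficient.
    profile_nll = -sum(c * lp for c, lp in zip(observed_counts, log_probs))
    # Magnitude term: squared error against log(1 + total counts) (assumed target).
    count_mse = (count_logit - math.log1p(sum(observed_counts))) ** 2
    return profile_nll + count_loss_weight * count_mse


loss = composite_loss([0.0, 0.0, 0.0, 0.0], math.log1p(4.0), [1.0, 1.0, 1.0, 1.0])
print(loss)  # equals 4 * log(4) up to rounding: uniform shape, exact count
```

The shape term only sees how reads are distributed across positions, while the count term only sees their total, which is what lets the two branches be trained and interpreted separately.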
Citation
Note
The artifacts distributed in this repository are part of the MultiMolecule project. If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:
Contact
Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the ChromBPNet paper for questions or comments on the paper/model.
License
This model implementation is licensed under the GNU Affero General Public License.
For additional terms and clarifications, please refer to our License FAQ.
multimolecule.models.chrombpnet
DnaTokenizer
Bases: Tokenizer
Tokenizer for DNA sequences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | Alphabet \| str \| List[str] \| None | Alphabet to use for tokenization. | None |
| | int | Size of kmer to tokenize. | 1 |
| | bool | Whether to tokenize into codons. | False |
| | bool | Whether to replace U with T. | True |
| | bool | Whether to convert input to uppercase. | True |
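The effect of the options above can be illustrated with a small stand-in function. This is a hypothetical sketch that mimics the documented behavior, not the DnaTokenizer implementation: the function and argument names are invented for illustration, and the choice of overlapping k-mers with stride 1 is an assumption.

```python
def toy_dna_tokens(
    sequence: str,
    kmer_size: int = 1,
    codon: bool = False,
    replace_u_with_t: bool = True,
    upper: bool = True,
) -> list[str]:
    """Hypothetical stand-in that mimics the documented options; not DnaTokenizer itself."""
    if upper:
        sequence = sequence.upper()
    if replace_u_with_t:
        sequence = sequence.replace("U", "T").replace("u", "t")
    if codon:
        # Non-overlapping triplets; assumes the length is a multiple of 3.
        return [sequence[i : i + 3] for i in range(0, len(sequence), 3)]
    # Overlapping k-mers with stride 1 (an assumption about the k-mer mode).
    return [sequence[i : i + kmer_size] for i in range(len(sequence) - kmer_size + 1)]


print(toy_dna_tokens("acgu"))                 # ['A', 'C', 'G', 'T']
print(toy_dna_tokens("ACGTAC", kmer_size=3))  # ['ACG', 'CGT', 'GTA', 'TAC']
print(toy_dna_tokens("ACGTAC", codon=True))   # ['ACG', 'TAC']
```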
Examples:
Source code in multimolecule/tokenisers/dna/tokenization_dna.py
ChromBPNetConfig
Bases: PreTrainedConfig
This is the configuration class to store the configuration of a
ChromBPNetModel. It is used to instantiate a ChromBPNet model according to
the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will
yield a similar configuration to that of the ChromBPNet
HEK293T-GFP architecture.
Configuration objects inherit from PreTrainedConfig and can be used to
control the model outputs. Read the documentation from PreTrainedConfig
for more information.
ChromBPNet predicts base-resolution chromatin accessibility (ATAC-seq / DNase-seq) with explicit enzyme-bias correction. It internally composes two BPNet-style dilated-convolution sub-models:
- a bias sub-model that captures the Tn5/DNase enzyme cleavage bias on chromatin background;
- an accessibility sub-model that learns the bias-corrected accessibility signal.
The final prediction is a single base-resolution task whose output is factorized into two terminal branches that share their respective backbones:
- a profile branch producing per-position multinomial logits of shape (batch_size, profile_length, num_tasks * num_strands);
- a count branch producing a scalar per task and strand of shape (batch_size, num_tasks * num_strands).
The bias and accessibility sub-models are composed internally: their profile logits are added, and their count
logits are combined in log/exp space (logsumexp). They are not a user-facing split.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | int | Vocabulary size of the ChromBPNet model. Defines the number of one-hot input channels derived from | 5 |
| | int | The canonical input DNA sequence length in base pairs. Defaults to 2114. | 2114 |
| | int | The centered output profile length in base pairs. Defaults to 1000. | 1000 |
| | int | Number of channels in the convolutional backbone of the accessibility sub-model. | 512 |
| | int | Number of channels in the convolutional backbone of the bias sub-model. | 128 |
| | int | Kernel size of the first (motif) convolution. | 21 |
| | int | Number of dilated residual convolution blocks following the stem in the accessibility sub-model. | 8 |
| | int | Number of dilated residual convolution blocks following the stem in the bias sub-model. | 4 |
| | int | Kernel size of each dilated residual convolution. | 3 |
| | int | Kernel size of the wide convolution in the profile branch. | 75 |
| | int | Number of prediction tasks. | 1 |
| | int | Number of strands predicted per task. ChromBPNet ATAC/DNase predicts a single (unstranded) track. | 1 |
| | str | The non-linear activation function (function or string) in the backbones. | 'relu' |
| | float | The weight applied to the count regression loss when combining it with the profile multinomial loss. | 1.0 |
| | HeadConfig \| None | The configuration of the generic token prediction head. If not provided, it defaults to regression. | None |
| | bool | Whether to output the backbone hidden states. | False |
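One way to see that the default kernel sizes, block counts, and lengths fit together is to track the sequence length through unpadded convolutions. The sketch below uses only the defaults listed in the table; the absence of padding and the dilation schedule of 2**i for block i are assumptions taken from the original ChromBPNet design and are not stated on this page, so the MultiMolecule implementation may pad or crop differently.

```python
def valid_conv_length(length: int, kernel_size: int, dilation: int = 1) -> int:
    """Output length of an unpadded 1D convolution with stride 1."""
    return length - dilation * (kernel_size - 1)


def tower_output_length(
    input_length: int = 2114,
    stem_kernel: int = 21,
    num_dilated_layers: int = 8,
    dilated_kernel: int = 3,
    profile_kernel: int = 75,
) -> int:
    """Length after the stem, the dilated blocks, and the profile convolution.

    Assumes no padding and dilation 2**i for block i = 1..num_dilated_layers.
    """
    length = valid_conv_length(input_length, stem_kernel)
    for i in range(1, num_dilated_layers + 1):
        length = valid_conv_length(length, dilated_kernel, dilation=2**i)
    return valid_conv_length(length, profile_kernel)


accessibility_length = tower_output_length(num_dilated_layers=8)
bias_length = tower_output_length(num_dilated_layers=4)
print(accessibility_length)         # 1000
print(bias_length)                  # 1960
print((bias_length - 1000) // 2)    # 480
```

Under these assumptions the accessibility tower lands exactly on the 1000 bp profile, while the shallower bias tower would need a center crop of 480 positions per side to match it.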
Examples:
Source code in multimolecule/models/chrombpnet/configuration_chrombpnet.py
ChromBPNetForProfilePrediction
Bases: ChromBPNetPreTrainedModel
ChromBPNet with the factorized profile/count head for base-resolution chromatin-accessibility prediction.
This is a token/positional-prediction model: it is registered with the token AutoModel family and predicts a per-position value for every input nucleotide. The single base-resolution task is factorized into two terminal branches:
- profile_logits: per-position multinomial logits of shape (batch_size, profile_length, num_labels);
- count_logits: a scalar per task and strand of shape (batch_size, num_labels),
where num_labels = num_tasks * num_strands. Use ChromBPNetForProfilePrediction.postprocess to recombine them into the usable base-resolution track.
The enzyme-bias correction (the internal bias + accessibility composition) is performed inside
ChromBPNetModel; the factorized head here mirrors BPNet and operates on
the already bias-corrected, composed profile and count logits.
Examples:
Source code in multimolecule/models/chrombpnet/modeling_chrombpnet.py
postprocess
postprocess(
outputs: ChromBPNetProfilePredictorOutput | ModelOutput,
) -> Tensor
Recombine the factorized profile and count branches into the usable base-resolution track.
ChromBPNet does not predict the accessibility track directly; the profile branch predicts the shape (a
per-position multinomial distribution) and the count branch predicts the total magnitude (in log space).
The usable prediction recombines them as softmax(profile_logits, positions) * expm1(count_logits).
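The recombination formula can be sketched in plain Python for a single task and strand. This is a minimal illustration of softmax(profile_logits) * expm1(count_logits) on lists of floats, not the tensor implementation of postprocess; the function name `recombine` is hypothetical.

```python
import math


def recombine(profile_logits: list[float], count_logit: float) -> list[float]:
    """Softmax over positions, scaled by expm1 of the log-count scalar."""
    peak = max(profile_logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in profile_logits]
    total = sum(exps)
    scale = math.expm1(count_logit)  # undo the log(1 + counts) scale
    return [scale * e / total for e in exps]


track = recombine([0.0, 1.0, 2.0, 1.0, 0.0], count_logit=math.log1p(100.0))
print(sum(track))  # close to 100.0, the total implied by count_logit
```

Because the softmax sums to one over positions, the track always sums to expm1(count_logits), so the profile branch controls only where the signal falls and the count branch controls only how much of it there is.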
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | ChromBPNetProfilePredictorOutput \| ModelOutput | The output of | required |
Returns:
| Type | Description |
|---|---|
| Tensor | The predicted base-resolution track of shape |
Source code in multimolecule/models/chrombpnet/modeling_chrombpnet.py
ChromBPNetForTokenPrediction
Bases: ChromBPNetPreTrainedModel
ChromBPNet accessibility backbone with a randomly initialized generic token-prediction head.
This class attaches the shared MultiMolecule token head to the accessibility sub-model representation and returns a
standard single-logits output for downstream fine-tuning. The published ChromBPNet profile/count task remains
exposed through ChromBPNetForProfilePrediction.
Examples:
Source code in multimolecule/models/chrombpnet/modeling_chrombpnet.py
ChromBPNetHeadOutput
dataclass
Bases: ModelOutput
Output of the factorized ChromBPNet profile/count head.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `torch.FloatTensor` of shape `(batch_size, profile_length, num_labels)` | Per-position multinomial logits. | None |
| | `torch.FloatTensor` of shape `(batch_size, num_labels)` | Per task/strand log-count scalars. | None |
| | `torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided | Composite multinomial-NLL + weighted count-MSE loss. | None |
Source code in multimolecule/models/chrombpnet/modeling_chrombpnet.py
ChromBPNetModel
Bases: ChromBPNetPreTrainedModel
The bare ChromBPNet model: an enzyme-bias sub-model composed with a bias-corrected accessibility sub-model.
ChromBPNet predicts base-resolution chromatin accessibility (ATAC-seq / DNase-seq) with explicit enzyme-bias correction. It internally owns two BPNet-style dilated-convolution sub-models and composes them so the model exposes a single clean factorized profile/count output:
- the bias sub-model captures the Tn5/DNase enzyme cleavage bias on chromatin background;
- the accessibility sub-model learns the bias-corrected accessibility signal.
The two sub-models are composed internally: their per-position profile logits are added, and their count logits are
combined in log/exp space via logsumexp. The sub-model split is an implementation detail, not a user-facing API.
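The composition rule can be sketched in plain Python for a single task and strand. This is an illustration of the arithmetic only; the function name `compose` and its arguments are hypothetical, and the real model operates on batched tensors.

```python
import math


def compose(
    bias_profile: list[float],
    acc_profile: list[float],
    bias_count: float,
    acc_count: float,
) -> tuple[list[float], float]:
    """Add profile logits position-wise; combine count logits with logsumexp."""
    profile = [b + a for b, a in zip(bias_profile, acc_profile)]
    peak = max(bias_count, acc_count)  # stabilize the exponentials
    count = peak + math.log(math.exp(bias_count - peak) + math.exp(acc_count - peak))
    return profile, count


profile, count = compose(
    [0.1, 0.2, 0.3], [1.0, 0.0, -1.0], math.log(20.0), math.log(80.0)
)
print(profile)
print(math.exp(count))  # close to 100.0: the two count terms add in linear space
```

Adding logits corresponds to multiplying the two unnormalized profile distributions, while logsumexp corresponds to adding the two count contributions before returning to log space.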
Examples:
Source code in multimolecule/models/chrombpnet/modeling_chrombpnet.py
ChromBPNetModelOutput
dataclass
Bases: ModelOutput
Base class for outputs of the ChromBPNet backbone.
The ChromBPNet backbone performs the bias + accessibility composition and exposes both the accessibility branch representation for generic fine-tuning and the composed factorized profile / count logits.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)` | Accessibility branch backbone representation. | None |
| | `torch.FloatTensor` of shape `(batch_size, profile_length, num_labels)` | Composed (bias-corrected) per-position multinomial logits. | None |
| | `torch.FloatTensor` of shape `(batch_size, num_labels)` | Composed per task/strand log-count scalars. | None |
Source code in multimolecule/models/chrombpnet/modeling_chrombpnet.py
ChromBPNetPreTrainedModel
Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.
Source code in multimolecule/models/chrombpnet/modeling_chrombpnet.py
ChromBPNetProfilePredictorOutput
dataclass
Bases: ModelOutput
Base class for outputs of ChromBPNetForProfilePrediction.
The standard single-logits predictor dataclasses cannot express ChromBPNet’s factorized output, so this model-local dataclass exposes the two terminal branches separately. Use ChromBPNetForProfilePrediction.postprocess to recombine them.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided | Composite multinomial-NLL (profile) + weighted count-MSE (count) loss. | None |
| | `torch.FloatTensor` of shape `(batch_size, profile_length, num_labels)` | Per-position multinomial logits, where | None |
| | `torch.FloatTensor` of shape `(batch_size, num_labels)` | Per task/strand log-count scalars. | None |