BPNet¶
Base-resolution convolutional neural network for predicting transcription-factor binding profiles from DNA sequence.
Disclaimer¶
This is an UNOFFICIAL implementation of "Base-resolution models of transcription-factor binding reveal soft motif syntax" by Žiga Avsec, Melanie Weilert et al.
The OFFICIAL repository of BPNet is at kundajelab/bpnet.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.
The team releasing BPNet did not write a model card for this model, so this model card has been written by the MultiMolecule team.
Model Details¶
BPNet is a convolutional neural network (CNN) trained to predict base-resolution transcription-factor binding signal (ChIP-nexus) from primary DNA sequence. It uses a convolutional motif stem followed by a stack of dilated residual convolutions that aggregate ~1 kb of genomic context. The output is factorized into profile and count branches, and the usable base-resolution prediction is reconstructed by BPNetForProfilePrediction.postprocess. Please refer to the Training Details section for more information on the training process.
Model Specification¶
| Num Layers | Hidden Size | Num Parameters (M) | FLOPs (G) | MACs (G) |
|---|---|---|---|---|
| 10 | 64 | 0.13 | 0.24 | 0.12 |
Links¶
- Code: multimolecule.bpnet
- Weights: multimolecule/bpnet
- Data: BPNet manuscript data
- Paper: Base-resolution models of transcription-factor binding reveal soft motif syntax
- Developed by: Žiga Avsec, Melanie Weilert, Avanti Shrikumar, Sabrina Krueger, Amr Alexandari, Khyati Dalal, Robin Fropf, Charles McAnany, Julien Gagneur, Anshul Kundaje, Julia Zeitlinger
- Model type: 1D dilated CNN with factorized profile-and-count heads for base-resolution transcription-factor binding prediction
- Original Repository: kundajelab/bpnet
Usage¶
The model file depends on the multimolecule library, which you can install with pip: `pip install multimolecule`.
Direct Use¶
Transcription-Factor Binding Profile Prediction¶
You can use this model directly to predict the transcription-factor binding profiles of a DNA sequence.
The track recombined by BPNetForProfilePrediction.postprocess is the usable base-resolution prediction. Its last dimension stacks num_tasks (Oct4, Sox2, Nanog, Klf4) by num_strands (forward, reverse).
Interface¶
- Input length: 1000 bp DNA window
- Output: factorized (profile_logits, count_logits); recombine the usable base-resolution track via BPNetForProfilePrediction.postprocess
- Output shape: (batch_size, profile_length, num_tasks × num_strands); default Oct4 / Sox2 / Nanog / Klf4 × forward / reverse = 8 channels
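The channel arithmetic above can be illustrated with a small sketch. It assumes a task-major ordering of the last dimension (all strands of one task before the next task), which this card does not confirm; `channel_index` is a hypothetical helper, not part of multimolecule.

```python
from itertools import product

TASKS = ("Oct4", "Sox2", "Nanog", "Klf4")
STRANDS = ("forward", "reverse")

# ASSUMPTION: task-major layout (task varies slowest, strand fastest).
# Check the ordering against the real model outputs before relying on it.
CHANNELS = list(product(TASKS, STRANDS))


def channel_index(task, strand):
    """Hypothetical helper: index into the last output dimension."""
    return TASKS.index(task) * len(STRANDS) + STRANDS.index(strand)


print(len(CHANNELS))                     # num_tasks * num_strands
print(channel_index("Sox2", "reverse"))  # position of Sox2 / reverse
```

If the real layout turns out to be strand-major, only the `product` order and the index formula need to change.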
Training Details¶
BPNet was trained to predict the base-resolution ChIP-nexus binding profiles of the pluripotency transcription factors Oct4, Sox2, Nanog and Klf4 in mouse embryonic stem cells.
Training Data¶
The published BPNet-OSKN model was trained on ChIP-nexus profiles for Oct4, Sox2, Nanog and Klf4, using 1 kb genomic windows centered on detected binding peaks. The training regions and trained Keras checkpoint are distributed as part of the BPNet manuscript data.
Training Procedure¶
Pre-training¶
The model was trained with a composite loss: a multinomial negative log-likelihood on the per-position profile shape plus a mean-squared-error regression on the log total counts.
- Optimizer: Adam
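The composite loss described above can be sketched for a single task/strand channel in plain Python. This is an illustrative sketch, not the MultiMolecule or original BPNet implementation: the function names are hypothetical, and the count target is assumed to be log(1 + total counts) so that empty regions stay finite.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over positions."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def multinomial_nll(profile_logits, counts):
    """Negative log-likelihood of observed per-position counts
    under the multinomial defined by softmax(profile_logits)."""
    log_p = log_softmax(profile_logits)
    total = sum(counts)
    log_coeff = math.lgamma(total + 1) - sum(math.lgamma(c + 1) for c in counts)
    return -(log_coeff + sum(c * lp for c, lp in zip(counts, log_p)))


def composite_loss(profile_logits, count_logit, counts, count_weight=1.0):
    """Profile multinomial NLL plus weighted squared error on log counts."""
    profile_loss = multinomial_nll(profile_logits, counts)
    # ASSUMPTION: the regression target is log(1 + total counts).
    target = math.log1p(sum(counts))
    count_loss = (count_logit - target) ** 2
    return profile_loss + count_weight * count_loss


counts = [0, 3, 1, 0]
loss = composite_loss([0.0, 0.0, 0.0, 0.0], math.log1p(4), counts)
print(loss)
```

Here `count_weight` plays the role of the count-regression weight listed in the configuration, whose default is 1.0.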
Citation¶
Note
The artifacts distributed in this repository are part of the MultiMolecule project. If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:
| BibTeX | |
|---|---|
Contact¶
Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the BPNet paper for questions or comments on the paper/model.
License¶
This model implementation is licensed under the GNU Affero General Public License.
For additional terms and clarifications, please refer to our License FAQ.
multimolecule.models.bpnet¶
DnaTokenizer¶
Bases: Tokenizer
Tokenizer for DNA sequences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | Alphabet \| str \| List[str] \| None | Alphabet to use for tokenization. | None |
| | int | Size of kmer to tokenize. | 1 |
| | bool | Whether to tokenize into codons. | False |
| | bool | Whether to replace U with T. | True |
| | bool | Whether to convert input to uppercase. | True |
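To make the options in this table concrete, here is a rough pure-Python sketch of the preprocessing they describe. It is not the DnaTokenizer implementation: `sketch_tokenize` and its argument names are hypothetical, and the overlapping (stride 1) k-mer windowing is an assumption.

```python
def sketch_tokenize(sequence, kmer_size=1, replace_u_with_t=True, upper_case=True):
    """Hypothetical illustration of the tokenizer options; not library code."""
    if upper_case:
        sequence = sequence.upper()
    if replace_u_with_t:
        # Map RNA-style input onto the DNA alphabet.
        sequence = sequence.replace("U", "T").replace("u", "t")
    if kmer_size <= 1:
        return list(sequence)
    # ASSUMPTION: k-mers overlap with stride 1.
    return [sequence[i:i + kmer_size] for i in range(len(sequence) - kmer_size + 1)]


print(sketch_tokenize("acgu"))
print(sketch_tokenize("ACGT", kmer_size=3))
```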
Examples:
Source code in multimolecule/tokenisers/dna/tokenization_dna.py
BPNetConfig¶
Bases: PreTrainedConfig
This is the configuration class to store the configuration of a
BPNetModel. It is used to instantiate a BPNet model according to the specified
arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
configuration to that of the BPNet-OSKN architecture.
Configuration objects inherit from PreTrainedConfig and can be used to
control the model outputs. Read the documentation from PreTrainedConfig
for more information.
BPNet predicts a single base-resolution signal task whose output is factorized into two terminal branches that share the dilated-convolution backbone:
- a profile branch producing per-position multinomial logits of shape (batch_size, sequence_length, num_tasks * num_strands);
- a count branch producing a scalar per task and strand of shape (batch_size, num_tasks * num_strands).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | int | Vocabulary size of the BPNet model. Defines the number of one-hot input channels derived from | 5 |
| | int | Number of channels in the convolutional backbone. | 64 |
| | int | Kernel size of the first (motif) convolution. | 25 |
| | int | Number of dilated residual convolution blocks following the stem. | 9 |
| | int | Kernel size of each dilated residual convolution. | 3 |
| | int | Kernel size of the transposed convolution in the profile branch. | 25 |
| | int | Number of prediction tasks (e.g. transcription factors). | 4 |
| | int | Number of strands predicted per task. | 2 |
| | str | The non-linear activation function (function or string) in the backbone. | 'relu' |
| | float | The weight applied to the count regression loss when combining it with the profile multinomial loss. | 1.0 |
| | HeadConfig \| None | The configuration of the generic token prediction head. If not provided, it defaults to regression. | None |
| | bool | Whether to output the backbone hidden states. | False |
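As a rough check on how much sequence context these defaults cover, the receptive field of the backbone can be estimated with a few lines of arithmetic. The sketch assumes that the dilation rate doubles with each residual block, starting at 2; that schedule is not stated in this table, and `receptive_field` is a hypothetical helper.

```python
def receptive_field(stem_kernel, num_blocks, block_kernel, dilations=None):
    """Receptive field (in bp) of a stem convolution followed by
    a stack of dilated convolutions, all with stride 1."""
    if dilations is None:
        # ASSUMPTION: dilation doubles per block: 2, 4, ..., 2**num_blocks.
        dilations = [2 ** (i + 1) for i in range(num_blocks)]
    field = stem_kernel  # the stem alone sees `stem_kernel` positions
    for dilation in dilations:
        # Each dilated layer widens the field by (kernel - 1) * dilation.
        field += (block_kernel - 1) * dilation
    return field


rf = receptive_field(stem_kernel=25, num_blocks=9, block_kernel=3)
print(rf)  # 2069 under the assumed dilation schedule
```

Under that assumption the field spans roughly ±1 kb around each position, which is wider than the 1000 bp input window, so every output position could in principle draw on the whole input.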
Examples:
Source code in multimolecule/models/bpnet/configuration_bpnet.py
BPNetForProfilePrediction¶
Bases: BPNetPreTrainedModel
BPNet with the factorized profile/count head for base-resolution signal prediction.
This is a token/positional-prediction model: it is registered with the token AutoModel family and predicts a per-position value for every input nucleotide. The single base-resolution task is factorized into two terminal branches sharing the backbone:
- profile_logits: per-position multinomial logits of shape (batch_size, sequence_length, num_labels);
- count_logits: a scalar per task and strand of shape (batch_size, num_labels),
where num_labels = num_tasks * num_strands. Use postprocess to recombine them into the usable base-resolution track.
Examples:
Source code in multimolecule/models/bpnet/modeling_bpnet.py
postprocess¶
postprocess(
outputs: BPNetProfilePredictorOutput | ModelOutput,
) -> Tensor
Recombine the factorized profile and count branches into the usable base-resolution track.
BPNet does not predict the signal track directly; the profile branch predicts the shape (a per-position
multinomial distribution) and the count branch predicts the total magnitude (in log space). The usable
prediction recombines them as softmax(profile_logits, positions) * exp(count_logits).
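The recombination formula above can be sketched for a single task/strand channel in plain Python. This illustrates the arithmetic only and is not the library's postprocess code; `recombine` is a hypothetical name, and the real method operates on batched tensors.

```python
import math


def recombine(profile_logits, count_logit):
    """softmax(profile_logits) over positions, scaled by exp(count_logit)."""
    m = max(profile_logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in profile_logits]
    z = sum(exps)
    total = math.exp(count_logit)  # the count branch lives in log space
    return [total * e / z for e in exps]


track = recombine([0.0, 1.0, 2.0, 1.0], count_logit=math.log(100.0))
print(track)
```

Because the softmax sums to one over positions, the recombined track sums to exp(count_logit): the profile branch only decides how that total is distributed along the sequence.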
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | BPNetProfilePredictorOutput \| ModelOutput | The output of | required |

Returns:
| Type | Description |
|---|---|
| Tensor | The predicted base-resolution track of shape |
Source code in multimolecule/models/bpnet/modeling_bpnet.py
BPNetForTokenPrediction¶
Bases: BPNetPreTrainedModel
BPNet backbone with a randomly initialized generic token-prediction head.
This class is intended for downstream fine-tuning from the BPNet backbone. It returns the standard
TokenPredictorOutput with a single logits field, unlike BPNetForProfilePrediction, which exposes the published
factorized profile_logits / count_logits task head.
Examples:
Source code in multimolecule/models/bpnet/modeling_bpnet.py
BPNetHeadOutput dataclass¶
Bases: ModelOutput
Output of the factorized BPNet profile/count head.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `torch.FloatTensor` of shape `(batch_size, sequence_length, num_labels)` | Per-position multinomial logits. | None |
| | `torch.FloatTensor` of shape `(batch_size, num_labels)` | Per task/strand log-count scalars. | None |
| | `torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided | Composite multinomial-NLL + weighted count-MSE loss. | None |
Source code in multimolecule/models/bpnet/modeling_bpnet.py
BPNetModel¶
Bases: BPNetPreTrainedModel
The bare BPNet dilated-convolution backbone producing per-position features.
Examples:
Source code in multimolecule/models/bpnet/modeling_bpnet.py
BPNetModelOutput dataclass¶
Bases: ModelOutput
Base class for outputs of the BPNet backbone.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)` | Per-position backbone features. | None |
Source code in multimolecule/models/bpnet/modeling_bpnet.py
BPNetPreTrainedModel¶
Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.
Source code in multimolecule/models/bpnet/modeling_bpnet.py
BPNetProfilePredictorOutput dataclass¶
Bases: ModelOutput
Base class for outputs of BPNetForProfilePrediction.
The standard single-logits predictor dataclasses cannot express BPNet's factorized output, so this model-local
dataclass exposes the two terminal branches separately. Use postprocess to recombine them.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided | Composite multinomial-NLL (profile) + weighted count-MSE (count) loss. | None |
| | `torch.FloatTensor` of shape `(batch_size, sequence_length, num_labels)` | Per-position multinomial logits, where | None |
| | `torch.FloatTensor` of shape `(batch_size, num_labels)` | Per task/strand log-count scalars. | None |