Basenji¶
Deep convolutional neural network for predicting genomic coverage tracks across chromosomes.
Disclaimer¶
This is an UNOFFICIAL implementation of Sequential regulatory activity prediction across chromosomes with convolutional neural networks by David R. Kelley, Yakir A. Reshef et al.
The OFFICIAL repository of Basenji is at calico/basenji.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.
The team that released Basenji did not write a model card for this model, so this model card was written by the MultiMolecule team.
Model Details¶
Basenji is a deep convolutional neural network trained to predict genomic regulatory activity from long DNA sequences. It consumes a long DNA window (~131 kb), passes it through a convolution + pooling stem that downsamples the sequence, and then through a tower of dilated residual convolutional blocks that expand the receptive field. A pointwise output head predicts a vector of genomic coverage tracks for each output bin. Because the stem downsamples the input, the prediction is binned: the output has shape (batch_size, num_bins, num_tracks) where each bin summarizes 128 bp of sequence and num_tracks is the number of genomic coverage experiments.
Model Specification¶
| Input Length | Bin Size | Output Bins | Hidden Size | Dilated Blocks | Num Labels |
|---|---|---|---|---|---|
| 131,072 | 128 | 896 | 768 | 11 | 5,313 |
Links¶
- Code: multimolecule.basenji
- Weights: multimolecule/basenji
- Data: ENCODE, FANTOM5, GTEx, and related genomic coverage tracks aligned to human and mouse genomes
- Paper: Sequential regulatory activity prediction across chromosomes with convolutional neural networks
- Developed by: David R. Kelley, Yakir A. Reshef, Maxwell Bileschi, David Belanger, Cory Y. McLean, Jasper Snoek
- Model type: 1D dilated residual CNN with pre-activation blocks for binned multi-track genomic coverage prediction
- Original Repository: calico/basenji
Usage¶
The model file depends on the multimolecule library. You can install it using pip: `pip install multimolecule`
Direct Use¶
Genomic Coverage Prediction¶
You can use this model to predict binned genomic coverage tracks from a DNA sequence:
The binned positional axis is treated as the “token” axis: each output position corresponds to one genomic bin rather than a single nucleotide.
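To make the binned axis concrete, the sketch below maps an output bin index back to the stretch of the input window it summarizes. It assumes the 128 bp bin size and the 64-bin crop per side stated in this card. The helper name `bin_to_interval` and the zero-based, half-open coordinate convention are illustrative choices and are not part of the multimolecule API.

```python
BIN_SIZE = 128   # bp summarized by one output bin (from this model card)
CROP_BINS = 64   # bins trimmed from each side of the window (from this model card)


def bin_to_interval(bin_index, window_start=0, bin_size=BIN_SIZE, crop_bins=CROP_BINS):
    """Return the half-open [start, end) interval covered by one output bin.

    Hypothetical helper: coordinates are zero-based and measured on the same
    axis as window_start.
    """
    if bin_index < 0:
        raise ValueError("bin_index must be non-negative")
    # The first kept bin sits crop_bins bins into the window.
    start = window_start + (crop_bins + bin_index) * bin_size
    return start, start + bin_size


first_bin = bin_to_interval(0)     # (8192, 8320)
last_bin = bin_to_interval(895)    # (122752, 122880)
```

Under these assumptions the first and last 8,192 bp of each 131,072 bp window provide context only and receive no prediction of their own.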
Interface¶
- Input length: fixed 131,072 bp DNA window
- Output binning: 128 bp per output bin; 896 output bins per window (after Cropping1D(64) on each side)
- Output: (batch_size, num_bins, num_tracks); num_tracks defaults to 5,313 human coverage experiments
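The interface numbers above follow from simple arithmetic, which the sketch below spells out. The function names are hypothetical; the default values are the ones listed in this card.

```python
def num_output_bins(sequence_length=131_072, bin_size=128, crop_bins=64):
    """Number of bins left after downsampling and cropping both sides."""
    if sequence_length % bin_size != 0:
        raise ValueError("sequence_length must be a multiple of bin_size")
    total_bins = sequence_length // bin_size   # 131072 / 128 = 1024
    kept_bins = total_bins - 2 * crop_bins     # 1024 - 2 * 64 = 896
    if kept_bins <= 0:
        raise ValueError("crop_bins removes every bin")
    return kept_bins


def output_shape(batch_size, num_tracks=5313, **window_kwargs):
    """Shape of the prediction tensor: (batch_size, num_bins, num_tracks)."""
    return (batch_size, num_output_bins(**window_kwargs), num_tracks)


shape = output_shape(batch_size=2)
```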
Training Details¶
Basenji was trained to predict genomic coverage tracks (DNase-seq, ATAC-seq, ChIP-seq and CAGE) from the human and mouse reference genomes.
Training Data¶
The model was trained on a large compendium of functional genomics experiments aligned to the human (hg38) and mouse (mm10) reference genomes. The genome was divided into overlapping windows; for each window the per-128-bp coverage of every experiment served as the regression target.
Training Procedure¶
Pre-training¶
The model was trained to minimize a Poisson regression loss between predicted and observed coverage.
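As a rough illustration of this objective, the sketch below computes a Poisson negative log-likelihood in plain Python. It drops the term that depends only on the observed counts, which is a common simplification. The function name, the `eps` stabilizer, and the averaging over elements are assumptions for illustration; the exact reduction used upstream is not stated here.

```python
import math


def poisson_nll(predicted, observed, eps=1e-7):
    """Mean Poisson negative log-likelihood, omitting the log(observed!) term.

    predicted: non-negative predicted coverage per bin and track (flattened).
    observed:  measured coverage with the same length.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    total = 0.0
    for rate, count in zip(predicted, observed):
        # eps keeps the logarithm finite when the predicted rate is zero.
        total += rate - count * math.log(rate + eps)
    return total / len(predicted)


observed = [1.0, 3.0, 10.0]
loss_matched = poisson_nll(observed, observed)
loss_too_low = poisson_nll([0.5, 1.5, 5.0], observed)
loss_too_high = poisson_nll([2.0, 6.0, 20.0], observed)
```

The loss is smallest when the predicted rate equals the observed coverage, so both under- and over-prediction are penalized.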
Citation¶
Note
The artifacts distributed in this repository are part of the MultiMolecule project. If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:
| BibTeX | |
|---|---|
Contact¶
Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the Basenji paper for questions or comments on the paper/model.
License¶
This model implementation is licensed under the GNU Affero General Public License.
For additional terms and clarifications, please refer to our License FAQ.
multimolecule.models.basenji¶
DnaTokenizer¶
Bases: Tokenizer
Tokenizer for DNA sequences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | Alphabet \| str \| List[str] \| None | Alphabet to use for tokenization. | None |
| | int | Size of kmer to tokenize. | 1 |
| | bool | Whether to tokenize into codons. | False |
| | bool | Whether to replace U with T. | True |
| | bool | Whether to convert input to uppercase. | True |
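The options in the table can be pictured with a small stand-alone sketch. This is not the DnaTokenizer implementation: the function and argument names are hypothetical, and the choice of overlapping k-mers (stride 1) versus non-overlapping codons (stride 3) is an assumption made for illustration.

```python
def normalize_dna(sequence, replace_u=True, upper=True):
    """Apply the uppercase and U-to-T normalization described in the table."""
    if upper:
        sequence = sequence.upper()
    if replace_u:
        sequence = sequence.replace("U", "T").replace("u", "t")
    return sequence


def split_tokens(sequence, nmers=1, codon=False):
    """Split a sequence into tokens.

    Assumption: codon mode yields non-overlapping triplets, while k-mer mode
    slides a window of size nmers one base at a time.
    """
    if nmers < 1:
        raise ValueError("nmers must be at least 1")
    size, step = (3, 3) if codon else (nmers, 1)
    return [sequence[i:i + size] for i in range(0, len(sequence) - size + 1, step)]


tokens = split_tokens(normalize_dna("acgu"))
```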
Source code in multimolecule/tokenisers/dna/tokenization_dna.py
BasenjiBlockConfig¶
Bases: FlatDict
Configuration for the dilated residual tower of the Basenji2 trunk.
Basenji2 stacks num_blocks dilated residual units. Each unit runs on a hidden_size-channel
residual stream and internally bottlenecks to bottleneck_size channels for the dilated
convolution before projecting back. The dilation factor starts at dilation and is multiplied
by dilation_rate after every block (rounded to the nearest integer when round_dilation is
set), which is how Basenji2 reaches the receptive field needed for ~131 kb input windows.
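The dilation schedule described above can be sketched in a few lines. The starting dilation of 1 and the rate of 1.5 are illustrative assumptions, since this page lists those parameters as required without giving values. The sketch also uses Python's built-in `round`, which breaks ties toward the even integer; the library's tie-breaking may differ.

```python
def dilation_schedule(num_blocks, dilation=1.0, dilation_rate=1.5, round_dilation=True):
    """Return the integer dilation used by each block of the tower.

    The running dilation is multiplied by dilation_rate after every block and,
    when round_dilation is set, rounded before the next block uses it.
    """
    schedule = []
    current = float(dilation)
    for _ in range(num_blocks):
        # A convolution needs an integer dilation, so round at the point of use.
        schedule.append(int(round(current)))
        current *= dilation_rate
        if round_dilation:
            current = float(round(current))
    return schedule


schedule = dilation_schedule(num_blocks=11)
```

With 11 blocks and these assumed values the dilation grows roughly geometrically, which is what lets a fairly shallow tower cover a wide span of bins.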
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | | Number of dilated residual blocks in the tower. | required |
| | | Kernel size of the dilated (bottleneck) convolution. | required |
| | | Channel count of the dilated convolution bottleneck. | required |
| | | Dilation factor of the first block. | required |
| | | Multiplicative factor applied to the dilation after each block. | required |
| | | Whether to round the running dilation to the nearest integer after each multiply (upstream Basenji2 uses …). | required |
| | | Dropout probability applied to the projected (return) convolution of every block. | required |
Source code in multimolecule/models/basenji/configuration_basenji.py
BasenjiConfig¶
Bases: PreTrainedConfig
This is the configuration class to store the configuration of a
BasenjiModel. It is used to instantiate a Basenji model according to the
specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a
configuration that faithfully reproduces the upstream Basenji2 human graph
(calico/basenji, manuscripts/cross2020/params_human.json).
Configuration objects inherit from PreTrainedConfig and can be used to
control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Basenji2 predicts genomic coverage tracks at a binned resolution. A long DNA window of
sequence_length base pairs is downsampled by the convolution + pooling stem and tower, then
cropped by crop_bins bins on each side, so the output has shape
(batch_size, num_bins, num_labels) where num_labels is the number of coverage tracks.
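One way to see how the stem and tower reach 128 bp bins is to count the pooling steps. The sketch below assumes a pooling size of 2 at every stage, as listed among the configuration defaults on this page. The function names are hypothetical.

```python
def pooling_stages(bin_size=128, pool_size=2):
    """Count the pool_size-fold poolings needed to reach one bin per bin_size bp."""
    if pool_size < 2:
        raise ValueError("pool_size must be at least 2")
    stages, width = 0, 1
    while width < bin_size:
        width *= pool_size
        stages += 1
    if width != bin_size:
        raise ValueError("bin_size must be a power of pool_size")
    return stages


def binned_lengths(sequence_length=131_072, bin_size=128, pool_size=2):
    """Length of the positional axis before and after each pooling step."""
    lengths = [sequence_length]
    for _ in range(pooling_stages(bin_size, pool_size)):
        lengths.append(lengths[-1] // pool_size)
    return lengths


lengths = binned_lengths()
```

Seven halvings take 131,072 positions down to 1,024 bins; cropping 64 bins per side then leaves the 896 bins that appear in the output shape.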
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | int | Vocabulary size of the Basenji model. Defines the number of input feature channels derived from the MultiMolecule DNA token order. Defaults to 5 (…). | 5 |
| | int | The length, in base pairs, of the input DNA window. Defaults to 131072 (~131 kb). | 131072 |
| | int | Number of channels produced by the first (stem) convolution. | 288 |
| | int | Kernel size of the first (stem) convolution. | 15 |
| | int | Pooling size applied after every convolution block in the stem and tower. | 2 |
| | list[int] \| None | Explicit per-stage output channel schedule of the reducing convolution tower. Basenji2 grows the width as … | None |
| | int | Kernel size used by every convolution in the reducing tower. | 5 |
| | BasenjiBlockConfig \| None | Configuration of the dilated residual tower. A single … | None |
| | int | Number of bins trimmed from each side of the binned axis after the dilated tower (upstream …). | 64 |
| | int | Channel count of the final pointwise convolution block feeding the track head. | 1536 |
| | str | The non-linear activation used throughout the network. Basenji2 uses the tanh-approximation GELU (…). | 'gelu_new' |
| | str | The activation applied to the final track projection. Basenji2 uses … | 'softplus' |
| | float | Dropout probability of the final pointwise convolution block. | 0.05 |
| | float | The epsilon used by the batch normalization layers. | 0.001 |
| | float | The momentum used by the batch normalization layers (PyTorch convention; upstream Keras momentum 0.9 corresponds to PyTorch momentum 0.1). | 0.1 |
| | int | Number of genomic coverage tracks predicted per bin. Defaults to 5313 (the human track set released with Basenji2). | 5313 |
| | HeadConfig \| None | The configuration of the binned track prediction head. Defaults to a regression head (…). | None |
| | bool | Whether to output the context vectors for each tower block. | False |
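The batch normalization momentum note in the table reflects two opposite conventions: Keras weights the old running statistic by the momentum, while PyTorch weights the new batch statistic by it. The sketch below shows the conversion and checks that both update rules agree. The function names are illustrative.

```python
def keras_to_torch_momentum(keras_momentum):
    """Convert a Keras BatchNormalization momentum to the PyTorch convention."""
    return 1.0 - keras_momentum


def keras_update(running, batch_stat, momentum):
    # Keras: momentum is the weight kept on the previous running statistic.
    return momentum * running + (1.0 - momentum) * batch_stat


def torch_update(running, batch_stat, momentum):
    # PyTorch: momentum is the weight given to the new batch statistic.
    return (1.0 - momentum) * running + momentum * batch_stat


torch_momentum = keras_to_torch_momentum(0.9)
keras_result = keras_update(2.0, 4.0, 0.9)
torch_result = torch_update(2.0, 4.0, torch_momentum)
```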
Source code in multimolecule/models/basenji/configuration_basenji.py
BasenjiForTokenPrediction¶
Bases: BasenjiPreTrainedModel
Basenji2 with a pointwise regression head over genomic coverage tracks.
The binned positional axis is treated as the “token” axis: logits have shape
(batch_size, num_bins, num_labels) where num_labels is the number of coverage tracks.
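A pointwise head applies the same projection independently to every bin. The pure-Python sketch below illustrates the idea for a single window, using a softplus output because the configuration lists 'softplus' as the default activation of the final track projection. Whether the library applies the activation inside the head or elsewhere is an assumption here, and all names are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x)); always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pointwise_head(hidden, weight, bias):
    """Project each bin's hidden vector to one non-negative value per track.

    hidden: num_bins vectors of length hidden_size
    weight: num_labels vectors of length hidden_size
    bias:   num_labels scalars
    """
    return [
        [softplus(sum(w * h for w, h in zip(row, vector)) + b) for row, b in zip(weight, bias)]
        for vector in hidden
    ]


hidden = [[0.5, -1.0], [2.0, 0.0], [0.0, 0.0]]       # 3 bins, hidden_size 2
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0]]  # 4 tracks
bias = [0.0, 0.0, 0.0, 0.0]
logits = pointwise_head(hidden, weight, bias)        # 3 bins x 4 tracks
```

Because no bin looks at its neighbours in this step, the head is equivalent to a convolution with kernel size 1 along the binned axis.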
Source code in multimolecule/models/basenji/modeling_basenji.py
BasenjiModel¶
Bases: BasenjiPreTrainedModel
The bare Basenji2 backbone. Consumes a long DNA window and returns binned hidden states.
The architecture faithfully reproduces the upstream Basenji2 trunk: a pre-activation
convolution stem (GELU -> Conv -> BatchNorm -> MaxPool), a width-growing reducing tower, a
dilated residual tower on a wide stream with a narrow bottleneck, a Cropping1D, and a final
pointwise convolution block. The positional axis of the output is binned: a window of
config.sequence_length base pairs is downsampled by the stem/tower and cropped, so
last_hidden_state has shape (batch_size, num_bins, head_hidden_size).
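The configuration documentation describes the default activation, 'gelu_new', as the tanh approximation of GELU. The sketch below compares that approximation with the exact erf-based definition over a small grid. It is a stand-alone illustration, not code taken from the model.

```python
import math


def gelu_exact(x):
    """GELU defined through the Gaussian CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Largest absolute difference on a grid from -6 to 6 in steps of 0.1.
max_gap = max(abs(gelu_exact(v / 10) - gelu_tanh(v / 10)) for v in range(-60, 61))
```

The two curves are close but not identical, so matching the upstream activation variant matters when comparing intermediate representations against the original implementation.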
Source code in multimolecule/models/basenji/modeling_basenji.py
BasenjiPreTrainedModel¶
Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.