Basset¶
Deep convolutional neural network for predicting chromatin accessibility (DNase I hypersensitivity) from DNA sequence.
Disclaimer¶
This is an UNOFFICIAL implementation of Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks by David R. Kelley et al.
The OFFICIAL repository of Basset is at davek44/Basset.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.
The team releasing Basset did not write a model card for this model, so this model card was written by the MultiMolecule team.
Model Details¶
Basset is a convolutional neural network (CNN) trained to predict the chromatin accessibility (DNase I hypersensitivity) of a DNA sequence across 164 cell types. The model consumes a fixed-length 600 bp one-hot encoded DNA sequence and applies three convolutional blocks (convolution, batch normalization, ReLU, and max pooling) followed by two fully-connected blocks before a multi-label binary classification head. Please refer to the Training Details section for more information on the training process.
Model Specification¶
| Num Conv Layers | Num FC Layers | Hidden Size | Num Parameters (M) | FLOPs (G) | MACs (G) | Max Num Tokens |
|---|---|---|---|---|---|---|
| 3 | 2 | 1000 | 4.14 | 0.30 | 0.15 | 600 |
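The layer sizes behind the architecture described in Model Details can be worked through by hand. The sketch below assumes the hyperparameters reported in the original Basset paper (300/200/200 filters, kernel widths 19/11/7, pool widths 3/4/4, two 1000-unit fully-connected layers), unpadded convolutions, and a learnable batch-norm scale and shift after every block. None of these values are stated on this page, so treat them as assumptions rather than the settings of this implementation.

```python
# Assumed hyperparameters (from the original Basset paper, not from this page).
FILTERS = [300, 200, 200]
KERNELS = [19, 11, 7]
POOLS = [3, 4, 4]
FC_SIZES = [1000, 1000]


def basset_shapes_and_params(seq_len=600, in_channels=4, num_labels=164):
    """Return (per-block output lengths, flattened size, trainable parameter count)."""
    params = 0
    lengths = []
    channels, length = in_channels, seq_len
    for filters, kernel, pool in zip(FILTERS, KERNELS, POOLS):
        params += channels * kernel * filters + filters  # conv weights + biases
        params += 2 * filters  # batch-norm scale and shift
        length = (length - kernel + 1) // pool  # unpadded conv, then max pool
        lengths.append(length)
        channels = filters
    flat = channels * length  # size of the vector fed to the first FC layer
    width = flat
    for hidden in FC_SIZES:
        params += width * hidden + hidden  # linear weights + biases
        params += 2 * hidden  # batch-norm scale and shift
        width = hidden
    params += width * num_labels + num_labels  # classification head
    return lengths, flat, params


lengths, flat, params = basset_shapes_and_params()
print(lengths, flat, params)  # [194, 46, 10] 2000 4135064
```

Under these assumptions the count comes to about 4.14 M parameters, in line with the table, although that agreement alone does not confirm the exact settings of this implementation.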
Links¶
- Code: multimolecule.basset
- Weights: multimolecule/basset
- Data: ENCODE and Roadmap Epigenomics DNase-seq accessibility compendium across 164 cell types
- Paper: Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks
- Developed by: David R. Kelley, Jasper Snoek, John L. Rinn
- Model type: Three-layer 1D CNN over 600 bp DNA for multi-task chromatin-accessibility prediction
- Original Repository: davek44/Basset
Usage¶
The model file depends on the multimolecule library, which you can install using pip (`pip install multimolecule`).
Direct Use¶
Chromatin Accessibility Prediction¶
You can use this model directly to predict the DNase I hypersensitivity of a DNA sequence.
Interface¶
- Input length: fixed 600 bp DNA window
- Output: 164 per-cell-type accessibility logits (multi-label binary)
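To make the interface concrete, here is a minimal sketch of what goes in and what comes out: a 600 bp window turned into a one-hot matrix, and 164 logits turned into per-cell-type probabilities. It does not call the multimolecule library, whose tokenizer and model handle these steps internally. The `ACGT` channel order, the helper names, and the 0.5 threshold are assumptions made for illustration.

```python
import math

BASES = "ACGT"  # assumed channel order; the real model may order channels differently


def one_hot(sequence, length=600):
    """Encode a DNA string as `length` rows of 4 floats; unknown bases become all zeros."""
    sequence = sequence.upper()
    if len(sequence) != length:
        raise ValueError(f"expected {length} bp, got {len(sequence)}")
    return [[1.0 if base == b else 0.0 for b in BASES] for base in sequence]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def accessible_cell_types(logits, threshold=0.5):
    """Return indices of cell types whose predicted probability exceeds the threshold."""
    return [i for i, logit in enumerate(logits) if sigmoid(logit) > threshold]


window = "ACGT" * 150  # toy 600 bp sequence
encoded = one_hot(window)
print(len(encoded), encoded[0])  # 600 [1.0, 0.0, 0.0, 0.0]

logits = [-2.0] * 164  # made-up logits standing in for a model output
logits[3] = 1.5
print(accessible_cell_types(logits))  # [3]
```

Because the task is multi-label, each logit passes through its own sigmoid rather than a softmax, so any number of cell types can be called accessible for the same window.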
Training Details¶
Basset was trained to predict the chromatin accessibility of DNA sequences across 164 cell types.
Training Data¶
Basset was trained on DNase I hypersensitivity peaks from ENCODE and the Roadmap Epigenomics project, covering 164 cell types. Each 600 bp genomic interval is labeled with a binary vector indicating which of the 164 cell types show an accessibility peak overlapping that interval.
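The labeling scheme described above can be sketched as an interval-overlap test per cell type. This is an illustration only: the helper names and the toy coordinates are hypothetical, and counting any shared base as an overlap is an assumption, since the exact overlap criterion of the original pipeline is not given here.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) share a base."""
    return a_start < b_end and b_start < a_end


def label_interval(interval, peaks_by_cell_type):
    """Build a 0/1 vector with one entry per cell type for a genomic interval.

    interval: (chrom, start, end)
    peaks_by_cell_type: one list of (chrom, start, end) peaks per cell type
    """
    chrom, start, end = interval
    return [
        int(any(c == chrom and overlaps(start, end, s, e) for c, s, e in peaks))
        for peaks in peaks_by_cell_type
    ]


window = ("chr1", 1000, 1600)  # a 600 bp interval
peaks = [
    [("chr1", 900, 1100)],                         # cell type 0: overlaps the window
    [("chr1", 1600, 1800), ("chr2", 1000, 1600)],  # cell type 1: adjacent, or wrong chromosome
    [("chr1", 1590, 1700)],                        # cell type 2: overlaps by 10 bp
]
print(label_interval(window, peaks))  # [1, 0, 1]
```

With 164 cell types the same function would return a 164-entry vector, which is the target the model is trained against.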
Training Procedure¶
Pre-training¶
The model was trained to minimize a multi-label binary cross-entropy loss, comparing its predicted per-cell-type accessibility probabilities against the observed DNase I hypersensitivity labels.
- Optimizer: RMSprop
- Loss: Multi-label binary cross-entropy
- Regularization: Batch normalization and dropout
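The loss and optimizer named above can be written out in a few lines. This is a plain-Python sketch rather than the training code: the loss is the standard multi-label binary cross-entropy computed from logits, and the RMSprop step uses placeholder values for the learning rate, decay, and epsilon, none of which are stated on this page.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed stably from raw logits."""
    total = 0.0
    for x, y in zip(logits, targets):
        # Equivalent to -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(x).
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


def rmsprop_step(params, grads, cache, lr=1e-3, decay=0.9, eps=1e-8):
    """One RMSprop update; lr, decay and eps are placeholder values."""
    # Running average of squared gradients.
    new_cache = [decay * c + (1.0 - decay) * g * g for c, g in zip(cache, grads)]
    # Scale each gradient by the root of its running average.
    new_params = [
        p - lr * g / (math.sqrt(c) + eps)
        for p, g, c in zip(params, grads, new_cache)
    ]
    return new_params, new_cache


print(round(bce_with_logits([0.0, 0.0], [1, 0]), 4))  # 0.6931
```

A logit of zero corresponds to a probability of 0.5, so the loss for an uninformative prediction is log 2 regardless of the label, which is what the printed value shows.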
Citation¶
Note
The artifacts distributed in this repository are part of the MultiMolecule project. If you use MultiMolecule in your research, you must cite the MultiMolecule project.
Contact¶
Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the Basset paper for questions or comments on the paper/model.
License¶
This model implementation is licensed under the GNU Affero General Public License.
For additional terms and clarifications, please refer to our License FAQ.
multimolecule.models.basset¶
DnaTokenizer¶
Bases: Tokenizer
Tokenizer for DNA sequences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | Alphabet \| str \| List[str] \| None | alphabet to use for tokenization. | None |
| | int | Size of kmer to tokenize. | 1 |
| | bool | Whether to tokenize into codons. | False |
| | bool | Whether to replace U with T. | True |
| | bool | Whether to convert input to uppercase. | True |
Source code in multimolecule/tokenisers/dna/tokenization_dna.py
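The options in the DnaTokenizer parameter table can be illustrated with a small standalone function. This is not the library source: the function and argument names are hypothetical, and two behaviors are assumed rather than documented here, namely that k-mers overlap with a stride of one base and that codon mode reads non-overlapping triplets.

```python
def tokenize_dna(sequence, nmers=1, codon=False, replace_u_with_t=True, upper=True):
    """Split a nucleotide string into tokens (illustrative, not the library code)."""
    if upper:
        sequence = sequence.upper()
    if replace_u_with_t:
        sequence = sequence.replace("U", "T").replace("u", "t")
    if codon:
        # Assumption: codon mode reads non-overlapping triplets.
        size, stride = 3, 3
    else:
        # Assumption: k-mers overlap and advance one base at a time.
        size, stride = nmers, 1
    if size < 1:
        raise ValueError("token size must be at least 1")
    return [sequence[i:i + size] for i in range(0, len(sequence) - size + 1, stride)]


print(tokenize_dna("acgu"))                # ['A', 'C', 'G', 'T']
print(tokenize_dna("ACGTAC", nmers=3))     # ['ACG', 'CGT', 'GTA', 'TAC']
print(tokenize_dna("ACGTAC", codon=True))  # ['ACG', 'TAC']
```

With the defaults (single-base tokens, uppercase, U replaced by T), a 600 bp window yields 600 tokens, one per position, which is the granularity Basset's one-hot input expects.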
BassetConfig¶
Bases: PreTrainedConfig
This is the configuration class to store the configuration of a
BassetModel. It is used to instantiate a Basset model according to the
specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a
similar configuration to that of the Basset davek44/Basset architecture.
Configuration objects inherit from PreTrainedConfig and can be used to
control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | int | Vocabulary size of the Basset model. Basset consumes a one-hot encoding of the four DNA nucleotides, so this also defines the number of input channels of the first convolution. Defaults to 4. | 4 |
| | int | The fixed length of the input DNA sequence in base pairs. Defaults to 600. | 600 |
| | int | Number of convolutional layers in the encoder. | 3 |
| | list[int] \| None | Number of filters for each convolutional layer. | None |
| | list[int] \| None | Kernel size for each convolutional layer. | None |
| | list[int] \| None | Max-pool size applied after each convolutional layer. | None |
| | list[int] \| None | Hidden dimensionality of each fully-connected layer. | None |
| | str | The non-linear activation function (function or string) in the encoder. If string, | 'relu' |
| | float | The dropout probability for the fully-connected layers. | 0.3 |
| | float | The epsilon used by the batch normalization layers. | 1e-05 |
| | float | The momentum used by the batch normalization layers. | 0.1 |
| | int | Number of output labels. Basset predicts DNase I hypersensitivity across 164 cell types. Defaults to 164. | 164 |
| | HeadConfig \| None | The configuration of the prediction head. Defaults to a multi-label binary classification head ( | None |
Source code in multimolecule/models/basset/configuration_basset.py
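The defaults in the BassetConfig parameter table can be mirrored in a small dataclass to show how the per-layer lists relate to the layer count. This is a hypothetical stand-in, not the library class: the field names are placeholders chosen for illustration, and the fallback values used when a list is `None` are taken from the original Basset paper as an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BassetConfigSketch:
    """Illustrative stand-in; field names are placeholders, not the library's."""

    vocab_size: int = 4
    sequence_length: int = 600
    num_conv_layers: int = 3
    conv_filters: Optional[List[int]] = None
    conv_kernel_sizes: Optional[List[int]] = None
    pool_sizes: Optional[List[int]] = None
    fc_sizes: Optional[List[int]] = None
    hidden_act: str = "relu"
    dropout: float = 0.3
    batch_norm_eps: float = 1e-5
    batch_norm_momentum: float = 0.1
    num_labels: int = 164

    def __post_init__(self):
        # Assumed fallbacks from the original paper; how the library resolves
        # its None defaults is not shown on this page.
        fallbacks = {
            "conv_filters": [300, 200, 200],
            "conv_kernel_sizes": [19, 11, 7],
            "pool_sizes": [3, 4, 4],
            "fc_sizes": [1000, 1000],
        }
        for name, value in fallbacks.items():
            if getattr(self, name) is None:
                setattr(self, name, list(value))
        # Every per-layer list must have one entry per convolutional layer.
        for name in ("conv_filters", "conv_kernel_sizes", "pool_sizes"):
            if len(getattr(self, name)) != self.num_conv_layers:
                raise ValueError(f"{name} needs {self.num_conv_layers} entries")


config = BassetConfigSketch()
print(config.conv_filters, config.fc_sizes)  # [300, 200, 200] [1000, 1000]
```

Checking list lengths at construction time surfaces a mismatched layer count immediately instead of as a shape error deep inside the model.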
BassetForSequencePrediction¶
Bases: BassetPreTrainedModel
Source code in multimolecule/models/basset/modeling_basset.py
BassetModel¶
Bases: BassetPreTrainedModel
Source code in multimolecule/models/basset/modeling_basset.py
BassetPreTrainedModel¶
Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.