DeepSTARR¶
Convolutional neural network for predicting enhancer activity directly from DNA sequence.
Disclaimer¶
This is an UNOFFICIAL implementation of the paper "DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers" by Bernardo P. de Almeida, Franziska Reiter, et al.
The OFFICIAL repository of DeepSTARR is at bernardo-de-almeida/DeepSTARR.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.
The team that released DeepSTARR did not write a model card for it, so this model card has been written by the MultiMolecule team.
Model Details¶
DeepSTARR is a convolutional neural network (CNN) trained to quantitatively predict enhancer activity from 249 bp DNA sequences. The model was trained on genome-wide STARR-seq data from Drosophila melanogaster S2 cells and predicts two regression outputs: developmental and housekeeping enhancer activity. The architecture consists of four convolutional blocks (Conv1D + BatchNorm + ReLU + MaxPool) followed by two fully-connected layers. Please refer to the Training Details section for more information on the training process.
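As a rough illustration of the four-block architecture described above, the sketch below traces how the 249 bp input shrinks through the convolutional stack. It assumes "same"-padded convolutions (so only pooling changes the length) and a pooling window of 2 that drops any trailing remainder; the helper name `trace_lengths` is hypothetical and not part of the library.

```python
def trace_lengths(seq_len: int = 249, num_blocks: int = 4, pool_size: int = 2) -> list[int]:
    """Return the sequence length after each Conv1D + MaxPool block.

    Assumption: convolutions are "same"-padded and keep the length unchanged,
    and max pooling is non-overlapping with floor division of the length.
    """
    lengths = []
    for _ in range(num_blocks):
        seq_len = seq_len // pool_size  # pooling halves the length, dropping the remainder
        lengths.append(seq_len)
    return lengths


lengths = trace_lengths()
print(lengths)  # [124, 62, 31, 15]
```

Under these assumptions the flattened feature map passed to the fully-connected layers has 15 positions per output channel of the last block.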
Model Specification¶
| Num Conv Layers | Num FC Layers | Hidden Size | Num Parameters (M) | FLOPs (M) | MACs (M) | Max Num Tokens |
|---|---|---|---|---|---|---|
| 4 | 2 | 256 | 0.62 | 21.03 | 10.26 | 249 |
Links¶
- Code: multimolecule.deepstarr
- Weights: multimolecule/deepstarr
- Data: Drosophila S2 UMI-STARR-seq enhancer-activity data
- Paper: DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers
- Developed by: Bernardo P. de Almeida, Franziska Reiter, Michaela Pagani, Alexander Stark
- Model type: Four-block 1D CNN over 249 bp DNA for developmental and housekeeping enhancer-activity regression
- Original Repository: bernardo-de-almeida/DeepSTARR
Usage¶
The model file depends on the multimolecule library. You can install it using pip: `pip install multimolecule`.
Direct Use¶
Enhancer Activity Prediction¶
You can use this model directly to predict the developmental and housekeeping enhancer activity of a 249 bp DNA sequence.
Interface¶
- Input length: fixed 249 bp DNA window
- Output: 2 regression outputs (developmental and housekeeping enhancer activity, log2 enrichment over input)
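The interface above can be sketched as an input check plus one-hot encoding. This is a minimal illustration only, not the library's tokenizer or model code: the channel order `ACGTN` is an assumption (the configuration reference states only that the one-hot input has 5 channels), and the names `one_hot`, `SEQ_LEN`, and `ALPHABET` are hypothetical.

```python
SEQ_LEN = 249        # fixed input window, as stated in the interface
ALPHABET = "ACGTN"   # ASSUMED channel order; only the channel count (5) is documented


def one_hot(sequence: str) -> list[list[int]]:
    """Encode a 249 bp DNA window as a (249, 5) one-hot matrix."""
    seq = sequence.upper()
    if len(seq) != SEQ_LEN:
        raise ValueError(f"expected {SEQ_LEN} bp, got {len(seq)} bp")
    index = {base: i for i, base in enumerate(ALPHABET)}
    rows = []
    for base in seq:
        row = [0] * len(ALPHABET)
        row[index.get(base, index["N"])] = 1  # unknown symbols fall back to N
        rows.append(row)
    return rows


window = "ACGT" * 62 + "A"  # toy 249 bp sequence
encoded = one_hot(window)
print(len(encoded), len(encoded[0]))  # 249 5
```

A model consuming this matrix would return two values per sequence, one for developmental and one for housekeeping activity.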
Training Details¶
DeepSTARR was trained to predict quantitative enhancer activity from DNA sequence.
Training Data¶
DeepSTARR was trained on genome-wide UMI-STARR-seq data from Drosophila melanogaster S2 cells, measuring enhancer activity under two transcriptional programs: a developmental program (driven by a developmental core promoter) and a housekeeping program (driven by a housekeeping core promoter).
Each training example is a 249 bp genomic sequence with two continuous activity values (developmental and housekeeping, log2 enrichment over input). Chromosomes were split into training, validation, and test sets to avoid sequence leakage.
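A chromosome-level split of this kind can be sketched as follows. The chromosome-to-split assignment and the toy records are hypothetical and chosen only for illustration; they are not the assignment used to train DeepSTARR.

```python
from collections import defaultdict

# HYPOTHETICAL assignment of chromosomes to held-out splits (illustration only).
HELD_OUT = {"chr3R": "validation", "chr2R": "test"}


def split_by_chromosome(records, held_out=HELD_OUT, default="train"):
    """Group records by split so that no chromosome appears in two splits."""
    splits = defaultdict(list)
    for record in records:
        splits[held_out.get(record["chrom"], default)].append(record)
    return dict(splits)


# Toy records: a chromosome plus the two activity values (made-up numbers).
records = [
    {"chrom": "chr2L", "dev": 1.2, "hk": -0.3},
    {"chrom": "chr2R", "dev": 0.4, "hk": 2.1},
    {"chrom": "chr3R", "dev": -0.8, "hk": 0.5},
    {"chrom": "chrX", "dev": 3.0, "hk": 0.1},
]
splits = split_by_chromosome(records)
print({name: len(rows) for name, rows in splits.items()})
```

Splitting on whole chromosomes rather than on individual windows keeps overlapping or neighbouring sequences out of different splits.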
Training Procedure¶
Pre-training¶
The model was trained to minimize a mean-squared-error loss between predicted and measured enhancer activities.
- Optimizer: Adam
- Learning rate: 2e-3
- Loss: Mean Squared Error
- Early stopping on validation loss
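The objective and stopping rule listed above can be sketched in plain Python. This is an illustrative sketch, not the training code: the `patience` value and the function names are hypothetical, and the loss values are made up.

```python
def mse(predictions, targets):
    """Mean squared error over all samples and both outputs (dev, hk)."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(predictions, targets):
        for pred, target in zip(pred_row, target_row):
            total += (pred - target) ** 2
            count += 1
    return total / count


def early_stopping(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once `patience` epochs pass without a new best loss.
    """
    best_loss, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


loss = mse([[1.0, 2.0], [0.0, -1.0]], [[1.5, 2.0], [0.0, 0.0]])
print(loss)  # 0.3125

stop = early_stopping([1.0, 0.8, 0.7, 0.75, 0.72, 0.71], patience=3)  # hypothetical patience
print(stop)  # (5, 2)
```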
Citation¶
Note
The artifacts distributed in this repository are part of the MultiMolecule project. If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:
| BibTeX | |
|---|---|
Contact¶
Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the DeepSTARR paper for questions or comments on the paper/model.
License¶
This model implementation is licensed under the GNU Affero General Public License.
For additional terms and clarifications, please refer to our License FAQ.
multimolecule.models.deepstarr¶
DnaTokenizer¶
Bases: Tokenizer
Tokenizer for DNA sequences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | Alphabet \| str \| List[str] \| None | Alphabet to use for tokenization. | None |
| | int | Size of kmer to tokenize. | 1 |
| | bool | Whether to tokenize into codons. | False |
| | bool | Whether to replace U with T. | True |
| | bool | Whether to convert input to uppercase. | True |
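To make the tokenizer options above concrete, here is a small sketch of what uppercasing, U-to-T replacement, k-mer tokenization, and codon tokenization could look like. It is not the `DnaTokenizer` implementation: the function and argument names are hypothetical, and it assumes that k-mers overlap with a stride of 1 while codons are non-overlapping triplets.

```python
def normalize(sequence: str, replace_u: bool = True, do_upper_case: bool = True) -> str:
    """Apply the case and U->T options before tokenization."""
    if do_upper_case:
        sequence = sequence.upper()
    if replace_u:
        # Handle both cases so the option also works when uppercasing is off.
        sequence = sequence.replace("U", "T").replace("u", "t")
    return sequence


def tokenize(sequence: str, k: int = 1, codon: bool = False) -> list[str]:
    """Split a sequence into tokens.

    ASSUMPTIONS: k-mers are overlapping (stride 1); codons are
    non-overlapping triplets and require a length divisible by 3.
    """
    if codon:
        if len(sequence) % 3 != 0:
            raise ValueError("codon tokenization needs a length divisible by 3")
        return [sequence[i:i + 3] for i in range(0, len(sequence), 3)]
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


print(tokenize(normalize("acgu")))            # ['A', 'C', 'G', 'T']
print(tokenize(normalize("acgu"), k=2))       # ['AC', 'CG', 'GT']
print(tokenize("ATGGCC", codon=True))         # ['ATG', 'GCC']
```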
Examples:
Source code in multimolecule/tokenisers/dna/tokenization_dna.py
DeepStarrConfig¶
Bases: PreTrainedConfig
This is the configuration class to store the configuration of a DeepStarrModel. It is used to instantiate a DeepSTARR model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the DeepSTARR bernardo-de-almeida/DeepSTARR architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | int | Vocabulary size of the DeepSTARR model. Defines the number of feature channels in the one-hot encoded input fed to the first convolution. Defaults to 5. | 5 |
| | int | The fixed length (in base pairs) of the input DNA sequence. Defaults to 249. | 249 |
| | int | Number of convolutional blocks (Conv1D + BatchNorm + ReLU + MaxPool). | 4 |
| | list[int] \| None | Number of output channels for each convolutional block. | None |
| | list[int] \| None | Convolution kernel size for each convolutional block. | None |
| | int | Max pooling window applied after every convolutional block. | 2 |
| | int | Number of fully-connected layers between the convolutional stack and the prediction head. | 2 |
| | list[int] \| None | Hidden size for each fully-connected layer. | None |
| | str | The non-linear activation function (function or string) in the encoder. If string, | 'relu' |
| | float | The dropout probability for the fully-connected layers. | 0.4 |
| | float | The epsilon used by the batch normalization layers. | 0.001 |
| | float | The momentum used by the batch normalization layers. | 0.1 |
| | int | Number of regression outputs. DeepSTARR predicts developmental and housekeeping enhancer activity. | 2 |
| | HeadConfig \| None | The configuration of the prediction head. Defaults to a regression head ( | None |
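As a sanity check on how these settings combine, the sketch below estimates the number of convolutional and fully-connected weights. The per-block channel widths and kernel sizes default to None in the table above, so the values used here (256/60/60/120 channels, kernels 7/3/5/3) are assumptions taken from the hyperparameters of the original DeepSTARR model, as are "same"-padded convolutions. Batch normalization parameters are left out, and the helper names are hypothetical.

```python
def conv1d_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Weights plus biases of a 1D convolution."""
    return in_channels * kernel_size * out_channels + out_channels


def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of a fully-connected layer."""
    return in_features * out_features + out_features


# Values stated on this page.
vocab_size, sequence_length, pool_size, num_labels = 5, 249, 2, 2
fc_sizes = [256, 256]               # 2 FC layers with hidden size 256

# ASSUMED values (not stated on this page).
conv_channels = [256, 60, 60, 120]
kernel_sizes = [7, 3, 5, 3]

total, in_channels, length = 0, vocab_size, sequence_length
for out_channels, kernel_size in zip(conv_channels, kernel_sizes):
    total += conv1d_params(in_channels, out_channels, kernel_size)
    in_channels = out_channels
    length //= pool_size            # "same" convolution, then pooling

features = in_channels * length     # flattened feature map: 120 * 15 = 1800
for hidden in fc_sizes:
    total += linear_params(features, hidden)
    features = hidden
total += linear_params(features, num_labels)  # regression head

print(total)  # 622498
```

Under these assumptions the estimate is about 0.62 M parameters, in line with the figure in the Model Specification table.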
Examples:
Source code in multimolecule/models/deepstarr/configuration_deepstarr.py
DeepStarrForSequencePrediction¶
Bases: DeepStarrPreTrainedModel
Examples:
Source code in multimolecule/models/deepstarr/modeling_deepstarr.py
DeepStarrModel¶
Bases: DeepStarrPreTrainedModel
Examples:
Source code in multimolecule/models/deepstarr/modeling_deepstarr.py
DeepStarrModelOutput dataclass¶
Bases: ModelOutput
Base class for outputs of DeepSTARR model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `torch.FloatTensor` of shape `(batch_size, flattened_conv_features)` | Flattened feature map produced by the convolutional encoder. | None |
| | `torch.FloatTensor` of shape `(batch_size, hidden_size)` | Sequence-level representation produced by the fully-connected pooler. | None |
| | `tuple(torch.FloatTensor)`, *optional* | Always | None |
Source code in multimolecule/models/deepstarr/modeling_deepstarr.py
DeepStarrPreTrainedModel¶
Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.