ProCapNet¶
Base-resolution convolutional neural network for predicting PRO-cap transcription-initiation signal from DNA sequence.
Disclaimer¶
This is an UNOFFICIAL implementation of Dissecting the cis-regulatory syntax of transcription initiation with deep learning by Kelly Cochran et al.
The OFFICIAL repository of ProCapNet is at kundajelab/ProCapNet.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.
The team that released ProCapNet did not write a model card for it, so this model card was written by the MultiMolecule team.
Model Details¶
ProCapNet is a convolutional neural network (CNN) trained to predict base-resolution PRO-cap transcription-initiation signal from primary DNA sequence. Its architecture is largely adapted from Jacob Schreiber’s bpnet-lite and shares BPNet’s dilated-convolution backbone and profile/count factorization. The output is two-stranded (plus / minus strand), mappability-aware, and reconstructed by ProCapNetForProfilePrediction.postprocess. Please refer to the Training Details section for more information on the training process.
Model Specification¶
| Input Length | Profile Length | Num Layers | Hidden Size | Num Parameters (M) | FLOPs (G) | MACs (G) |
|---|---|---|---|---|---|---|
| 2114 | 1000 | 9 | 512 | 6.43 | 27.17 | 13.58 |
FLOPs and MACs are measured on the canonical 2114 bp ProCapNet input window.
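As a rough consistency check on the 2114 bp input and 1000 bp profile lengths, the sketch below adds up how many positions each convolution would consume if it were applied without padding. It uses the default kernel sizes listed for ProCapNetConfig (21 for the stem, 3 for each of the 8 dilated blocks, 75 for the profile branch). The dilation schedule (doubling from 2 to 256, as in BPNet) is an assumption, not something stated on this page, and the helper name `positions_consumed` is hypothetical. Whether the implementation uses unpadded convolutions or pads and then crops is also not stated here; the arithmetic is the same either way.

```python
def positions_consumed(kernel_size: int, dilation: int = 1) -> int:
    """Positions lost by one unpadded 1D convolution (hypothetical helper)."""
    return dilation * (kernel_size - 1)


# Stem (motif) convolution with kernel size 21.
stem = positions_consumed(21)

# Eight dilated blocks with kernel size 3; dilations 2, 4, ..., 256 are ASSUMED.
dilated = sum(positions_consumed(3, dilation=2 ** i) for i in range(1, 9))

# Profile-branch convolution with kernel size 75.
profile_head = positions_consumed(75)

total = stem + dilated + profile_head
input_length = 2114
print(total)                 # prints 1114
print(input_length - total)  # prints 1000
```

Under these assumptions the receptive-field shrinkage (1114 bp) accounts exactly for the difference between the input window and the centered profile.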
Links¶
- Code: multimolecule.procapnet
- Weights: multimolecule/procapnet
- Data: K562 PRO-cap (ENCODE ENCSR261KBX)
- Paper: Dissecting the cis-regulatory syntax of transcription initiation with deep learning
- Developed by: Kelly Cochran, Melody Yin, Anika Mantripragada, Jacob Schreiber, Georgi K. Marinov, Sagar R. Shah, Haiyuan Yu, John T. Lis, Anshul Kundaje
- Model type: BPNet-derived 1D dilated CNN with two-stranded factorized profile-and-count heads for PRO-cap transcription-initiation prediction
- Original Repository: kundajelab/ProCapNet
Usage¶
The model file depends on the multimolecule library. You can install it using pip: `pip install multimolecule`.
Direct Use¶
Transcription-Initiation Profile Prediction¶
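You can use this model directly to predict the PRO-cap transcription-initiation profile of a DNA sequence: run ProCapNetForProfilePrediction on a tokenized 2114 bp window, then pass its outputs to ProCapNetForProfilePrediction.postprocess.
The recombined track is the usable base-resolution prediction. Its last dimension stacks the num_strands (plus, minus) PRO-cap signal predictions.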
Interface¶
- Input length: 2114 bp DNA window
- Profile length: 1000 bp, two-stranded (plus / minus)
- Output: factorized (profile_logits, count_logits); recombine the base-resolution PRO-cap track via ProCapNetForProfilePrediction.postprocess
Training Details¶
ProCapNet was trained to predict the base-resolution, two-stranded PRO-cap transcription-initiation signal in human cell lines. The default model is the K562 model.
Training Data¶
The published ProCapNet models were trained on PRO-cap signal using ~2 kb genomic windows. The default K562 model was trained on K562 PRO-cap experiment ENCSR261KBX. Training and test regions, observed signal tracks, and contribution scores are distributed through the same ENCODE release.
Training Procedure¶
Pre-training¶
The model was trained with a composite loss: a (strand-merged) multinomial negative log-likelihood on the per-position, two-stranded profile shape plus a mean-squared-error regression on log(count + 1) total counts.
- Optimizer: Adam
- Training is mappability-aware
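The composite loss described above can be sketched in plain Python as follows. This is a minimal illustration, not the training code: it treats the profile as one multinomial over every (position, strand) entry, includes the multinomial coefficient, and ignores batching and the mappability handling. The function names and the `count_weight` argument are hypothetical, standing in for the count-loss weight described in the configuration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a flat list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def multinomial_nll(logits, counts):
    """Negative log-likelihood of observed counts under softmax(logits)."""
    log_p = log_softmax(logits)
    n = sum(counts)
    # log of the multinomial coefficient n! / (k_1! * ... * k_m!)
    log_coeff = math.lgamma(n + 1) - sum(math.lgamma(k + 1) for k in counts)
    return -(log_coeff + sum(k * lp for k, lp in zip(counts, log_p)))


def composite_loss(profile_logits, count_logit, observed, count_weight=1.0):
    """Profile multinomial NLL plus weighted MSE on log(count + 1).

    profile_logits and observed are lists of [plus, minus] pairs, one per
    position; both strands are flattened into a single multinomial (ASSUMED
    reading of "strand-merged").
    """
    flat_logits = [x for pos in profile_logits for x in pos]
    flat_counts = [k for pos in observed for k in pos]
    profile_term = multinomial_nll(flat_logits, flat_counts)
    target = math.log(sum(flat_counts) + 1)
    count_term = (count_logit - target) ** 2
    return profile_term + count_weight * count_term


# Two positions, two strands, one read in each bin, perfect count prediction.
loss = composite_loss([[0.0, 0.0], [0.0, 0.0]], math.log(5), [[1, 1], [1, 1]])
print(loss)
```

With a perfect count prediction the count term vanishes, so the printed value is only the profile term for uniform logits.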
Citation¶
Note
The artifacts distributed in this repository are part of the MultiMolecule project. If you use MultiMolecule in your research, you must cite the MultiMolecule project.
Contact¶
Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the ProCapNet paper for questions or comments on the paper/model.
License¶
This model implementation is licensed under the GNU Affero General Public License.
For additional terms and clarifications, please refer to our License FAQ.
multimolecule.models.procapnet¶
DnaTokenizer¶
Bases: Tokenizer
Tokenizer for DNA sequences.
Parameters:
| Type | Description | Default |
|---|---|---|
| Alphabet \| str \| List[str] \| None | Alphabet to use for tokenization. | None |
| int | Size of kmer to tokenize. | 1 |
| bool | Whether to tokenize into codons. | False |
| bool | Whether to replace U with T. | True |
| bool | Whether to convert input to uppercase. | True |
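To illustrate what these options control, here is a small standalone sketch of the preprocessing they describe. It does not call the tokenizer itself: `normalize` and `split_tokens` are hypothetical helpers, and the strides (overlapping k-mers with stride 1, non-overlapping codons with stride 3) are assumptions rather than documented behavior.

```python
def normalize(sequence: str, upper: bool = True, u_to_t: bool = True) -> str:
    """Hypothetical helper mirroring the uppercase and U-to-T options."""
    if upper:
        sequence = sequence.upper()
    if u_to_t:
        # Replace both cases so the option also works when upper=False.
        sequence = sequence.replace("U", "T").replace("u", "t")
    return sequence


def split_tokens(sequence: str, k: int = 1, codon: bool = False) -> list:
    """Split into k-mers (stride 1) or codons (stride 3); strides are ASSUMED."""
    if codon:
        return [sequence[i:i + 3] for i in range(0, len(sequence) - 2, 3)]
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


tokens = split_tokens(normalize("acgUacg"), k=3)
print(tokens)  # prints ['ACG', 'CGT', 'GTA', 'TAC', 'ACG']
```

With the default k-mer size of 1, each nucleotide becomes its own token, which is the setting a base-resolution model such as ProCapNet relies on.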
Source code in multimolecule/tokenisers/dna/tokenization_dna.py
ProCapNetConfig¶
Bases: PreTrainedConfig
This is the configuration class to store the configuration of a
ProCapNetModel. It is used to instantiate a ProCapNet model according to
the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will
yield a similar configuration to that of the published ProCapNet
K562 PRO-cap architecture.
Configuration objects inherit from PreTrainedConfig and can be used to
control the model outputs. Read the documentation from PreTrainedConfig
for more information.
ProCapNet predicts the base-resolution PRO-cap transcription-initiation signal. Its output is factorized into two terminal branches that share the dilated-convolution backbone:
- a profile branch producing per-position, two-stranded multinomial logits of shape (batch_size, profile_length, num_strands);
- a count branch producing a single strand-merged log-count scalar of shape (batch_size, 1).
Parameters:
| Type | Description | Default |
|---|---|---|
| int | Vocabulary size of the ProCapNet model. Defines the number of one-hot input channels. | 5 |
| int | The canonical input DNA sequence length in base pairs. | 2114 |
| int | The centered output profile length in base pairs. | 1000 |
| int | Number of channels in the convolutional backbone. | 512 |
| int | Kernel size of the first (motif) convolution. | 21 |
| int | Number of dilated residual convolution blocks following the stem. | 8 |
| int | Kernel size of each dilated residual convolution. | 3 |
| int | Kernel size of the profile-branch convolution. | 75 |
| int | Number of strands predicted per position (plus / minus). ProCapNet is a two-stranded model. | 2 |
| str | The non-linear activation function (function or string) in the backbone. | 'relu' |
| float | The weight applied to the count regression loss when combining it with the profile multinomial loss. | 1.0 |
| HeadConfig \| None | The configuration of the generic token prediction head. If not provided, it defaults to regression. | None |
| bool | Whether to output the backbone hidden states. | False |
Source code in multimolecule/models/procapnet/configuration_procapnet.py
ProCapNetForProfilePrediction¶
Bases: ProCapNetPreTrainedModel
ProCapNet with the factorized profile/count head for base-resolution PRO-cap signal prediction.
This is a token/positional-prediction model: it is registered with the token AutoModel family and predicts a per-position value for every input nucleotide. The single base-resolution PRO-cap transcription-initiation task is factorized into two terminal branches sharing the backbone:
- profile_logits: per-position, two-stranded multinomial logits of shape (batch_size, profile_length, num_strands);
- count_logits: a single strand-merged log-count scalar of shape (batch_size, 1).
Unlike single-stranded BPNet, the ProCapNet profile is a joint multinomial over both strands and all positions (the plus and minus strands share one count), so postprocess normalizes the profile over the strand-and-position axes jointly.
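The practical effect of joint normalization can be seen on toy numbers. The sketch below is an illustration only, with made-up logits and a hypothetical `softmax` helper. It compares one softmax over every (position, strand) entry with a separate softmax per strand, which is shown purely for contrast.

```python
import math


def softmax(values):
    """Numerically stable softmax over a flat list."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    z = sum(exps)
    return [e / z for e in exps]


# Toy logits: 3 positions x 2 strands (plus, minus); the plus strand dominates.
logits = [[3.0, 0.0], [2.0, 0.0], [1.0, 0.0]]

# Joint: a single softmax across all (position, strand) entries.
flat = softmax([x for pos in logits for x in pos])
joint_plus = sum(flat[0::2])   # probability mass assigned to the plus strand
joint_minus = sum(flat[1::2])  # probability mass assigned to the minus strand

# Per-strand (contrast only): each strand is normalized on its own.
per_strand_plus = sum(softmax([pos[0] for pos in logits]))
per_strand_minus = sum(softmax([pos[1] for pos in logits]))

print(joint_plus, joint_minus)            # the two masses sum to 1
print(per_strand_plus, per_strand_minus)  # each strand sums to 1 on its own
```

Under joint normalization the split of signal between strands is carried by the profile itself, which is what lets a single shared count scale both strands.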
Source code in multimolecule/models/procapnet/modeling_procapnet.py
postprocess¶
postprocess(
outputs: ProCapNetProfilePredictorOutput | ModelOutput,
) -> Tensor
Recombine the factorized profile and count branches into the usable base-resolution track.
ProCapNet does not predict the signal track directly; the profile branch predicts the shape and the count
branch predicts the total magnitude (in log space). Because ProCapNet is two-stranded with a single
strand-merged count, the profile is a joint multinomial over both strands and all positions. The usable
prediction recombines them as softmax(profile_logits, strands & positions) * exp(count_logits).
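The recombination formula can be sketched without any framework. The following is a minimal pure-Python illustration for a single example, not the library's implementation: `recombine` is a hypothetical name, and nested lists stand in for a tensor of shape (profile_length, num_strands).

```python
import math


def recombine(profile_logits, count_logit):
    """Joint softmax over all (position, strand) entries, scaled by exp(count_logit)."""
    flat = [x for pos in profile_logits for x in pos]
    m = max(flat)  # subtract the maximum for numerical stability
    exps = [[math.exp(x - m) for x in pos] for pos in profile_logits]
    z = sum(e for pos in exps for e in pos)
    total = math.exp(count_logit)
    return [[total * e / z for e in pos] for pos in exps]


# 3 positions x 2 strands of toy logits, with a predicted log-count of log(50).
profile_logits = [[0.0, 1.0], [2.0, 0.0], [0.0, 0.0]]
track = recombine(profile_logits, count_logit=math.log(50.0))

# The track keeps the input layout, and its entries sum to exp(count_logit).
print(sum(v for pos in track for v in pos))
```

Because the softmax sums to one over both strands and all positions, the total of the recombined track equals the exponentiated count prediction.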
Parameters:
| Type | Description | Default |
|---|---|---|
| ProCapNetProfilePredictorOutput \| ModelOutput | The model output holding the factorized profile and count branches. | required |
Returns:
| Type | Description |
|---|---|
| Tensor | The predicted base-resolution track. |
Source code in multimolecule/models/procapnet/modeling_procapnet.py
ProCapNetForTokenPrediction¶
Bases: ProCapNetPreTrainedModel
ProCapNet backbone with a randomly initialized generic token-prediction head.
This class is intended for downstream fine-tuning from the ProCapNet backbone. It returns the standard TokenPredictorOutput with a single logits field, unlike ProCapNetForProfilePrediction, which exposes the published factorized profile_logits / count_logits task head.
Source code in multimolecule/models/procapnet/modeling_procapnet.py
ProCapNetHeadOutput dataclass¶
Bases: ModelOutput
Output of the factorized ProCapNet profile/count head.
Parameters:
| Type | Description | Default |
|---|---|---|
| `torch.FloatTensor` of shape `(batch_size, profile_length, num_strands)` | Per-position, two-stranded multinomial logits. | None |
| `torch.FloatTensor` of shape `(batch_size, 1)` | Strand-merged log-count scalar. | None |
| `torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided | Composite multinomial-NLL + weighted count-MSE loss. | None |
Source code in multimolecule/models/procapnet/modeling_procapnet.py
ProCapNetModel¶
Bases: ProCapNetPreTrainedModel
The bare ProCapNet dilated-convolution backbone producing per-position features.
Source code in multimolecule/models/procapnet/modeling_procapnet.py
ProCapNetModelOutput dataclass¶
Bases: ModelOutput
Base class for outputs of the ProCapNet backbone.
Parameters:
| Type | Description | Default |
|---|---|---|
| `torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)` | Per-position backbone features. | None |
Source code in multimolecule/models/procapnet/modeling_procapnet.py
ProCapNetPreTrainedModel¶
Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.
Source code in multimolecule/models/procapnet/modeling_procapnet.py
ProCapNetProfilePredictorOutput dataclass¶
Bases: ModelOutput
Base class for outputs of ProCapNetForProfilePrediction.
The standard single-logits predictor dataclasses cannot express ProCapNet’s factorized output, so this model-local dataclass exposes the two terminal branches separately. Use ProCapNetForProfilePrediction.postprocess to recombine them.
Parameters:
| Type | Description | Default |
|---|---|---|
| `torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided | Composite multinomial-NLL (profile) + weighted count-MSE (count) loss. | None |
| `torch.FloatTensor` of shape `(batch_size, profile_length, num_strands)` | Per-position, two-stranded multinomial logits. | None |
| `torch.FloatTensor` of shape `(batch_size, 1)` | Strand-merged log-count scalar. | None |