Xpresso¶
Disclaimer¶
This is an UNOFFICIAL implementation of Predicting mRNA Abundance Directly from Genomic Sequence Using Deep Convolutional Neural Networks by Vikram Agarwal et al.
The OFFICIAL repository of Xpresso is at vagarwal87/Xpresso.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.
The team releasing Xpresso did not write a model card for this model, so this model card was written by the MultiMolecule team.
Model Details¶
Xpresso is a deep convolutional neural network (CNN) that predicts steady-state mRNA expression level directly from genomic sequence. It consumes a promoter window of roughly 10.5 kb centered on the transcription start site (TSS), processes it through a stack of 1D convolution + max-pooling blocks, flattens the result, concatenates a small set of auxiliary numeric mRNA half-life features, and passes the combined representation through fully-connected layers to predict a single scalar expression value. Please refer to the Training Details section for more information on the training process.
Model Specification¶
| Input Length | Conv Blocks | Hidden Size | Auxiliary Features | Num Parameters (M) | FLOPs (G) | MACs (G) | Max Num Tokens |
|---|---|---|---|---|---|---|---|
| 10,500 | 2 | 2 | 6 | 0.11 | 0.11 | 0.05 | 10,500 |
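The parameter count in the table can be roughly cross-checked with a little arithmetic. The sketch below is an illustration only. The per-block channel counts, kernel sizes, pooling windows, and the first fully-connected width are assumptions (this page does not list them), as is the use of length-preserving ("same") padding. Only the input length, the two conv blocks, the six auxiliary features, the final hidden size of 2, and the vocabulary size of 5 come from this page. Any normalization layers are ignored.

```python
def conv_stack_summary(input_length, in_channels, channels, kernels, pools,
                       num_features, fc_sizes, num_labels=1):
    """Return (flattened_conv_size, parameter_count) for a conv + FC stack.

    Assumes 'same' padding, so each convolution keeps the sequence length
    and only max-pooling shortens it (integer division by the pool window).
    """
    length, prev, params = input_length, in_channels, 0
    for out_channels, kernel, pool in zip(channels, kernels, pools):
        params += prev * kernel * out_channels + out_channels  # weights + biases
        length //= pool
        prev = out_channels
    flattened = length * prev
    width = flattened + num_features  # auxiliary features are concatenated here
    for size in fc_sizes:
        params += width * size + size
        width = size
    params += width * num_labels + num_labels  # final scalar regression layer
    return flattened, params


# Hypothetical hyperparameters chosen for illustration (not stated on this page).
flattened, params = conv_stack_summary(
    input_length=10500, in_channels=5,
    channels=[128, 32], kernels=[6, 9], pools=[30, 10],
    num_features=6, fc_sizes=[64, 2],
)
print(flattened, params, round(params / 1e6, 2))
```

Under these assumptions the total rounds to 0.11 M, consistent with the table, and most of the parameters sit in the first fully-connected layer rather than in the convolutions.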
Links¶
- Code: multimolecule.xpresso
- Weights: multimolecule/xpresso
- Data: Roadmap Epigenomics gene-expression data with promoter sequence and mRNA half-life features
- Paper: Predicting mRNA Abundance Directly from Genomic Sequence Using Deep Convolutional Neural Networks
- Developed by: Vikram Agarwal, Jay Shendure
- Model type: 1D CNN over promoter DNA combined with auxiliary mRNA half-life features for mRNA-abundance regression
- Original Repository: vagarwal87/Xpresso
Usage¶
The model file depends on the multimolecule library. You can install it using pip, for example with `pip install multimolecule`.
Direct Use¶
mRNA Expression Prediction¶
You can use this model directly to predict the mRNA expression of a promoter sequence together with its auxiliary mRNA half-life features.
The auxiliary half-life features are passed through the features argument as a float tensor of shape (batch_size, num_features). Models configured with a non-zero num_features require this tensor; models configured with num_features=0 do not accept it.
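The contract described above can be expressed as a small validation routine. This is a minimal sketch, not the library's code: `check_features` is a hypothetical helper, and nested Python lists stand in for the float tensor so that the example has no dependencies.

```python
def check_features(features, batch_size, num_features):
    """Validate an auxiliary-feature batch against the configured feature count.

    `features` is a list of rows (a stand-in for a float tensor of shape
    (batch_size, num_features)), or None when no features are supplied.
    """
    if num_features == 0:
        if features is not None:
            raise ValueError("features are not accepted when num_features is 0")
        return
    if features is None:
        raise ValueError(f"features are required when num_features is {num_features}")
    if len(features) != batch_size:
        raise ValueError(f"expected {batch_size} rows, got {len(features)}")
    for row in features:
        if len(row) != num_features:
            raise ValueError(f"expected {num_features} values per row, got {len(row)}")


# A batch of two genes with six half-life features each passes silently.
check_features([[0.0] * 6, [1.0] * 6], batch_size=2, num_features=6)
```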
Interface¶
- Input length: fixed 10,500 bp promoter window centered on the TSS
- Padding: shorter inputs right-padded; longer inputs center-cropped to `input_length`
- Auxiliary inputs: `features` tensor of shape `(batch_size, num_features)` required when `num_features > 0`; not accepted when `num_features = 0`
- Output: scalar mRNA expression
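The padding and cropping rule from the list above can be sketched on plain strings. This is an illustration under stated assumptions, not the library's implementation: `fit_to_window` is a hypothetical helper, `N` is an assumed padding character, and for an odd excess the crop offset is rounded down.

```python
def fit_to_window(sequence, input_length=10500, pad_char="N"):
    """Right-pad short sequences and center-crop long ones to `input_length`."""
    excess = len(sequence) - input_length
    if excess > 0:
        start = excess // 2  # drop roughly equal amounts from both ends
        return sequence[start:start + input_length]
    return sequence + pad_char * (-excess)


print(fit_to_window("ACGT", input_length=6))      # padded on the right
print(fit_to_window("AACCGGTT", input_length=4))  # cropped around the center
```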
Training Details¶
Xpresso was trained to predict steady-state mRNA expression levels (median across tissues/cell lines) from genomic promoter sequence.
Training Data¶
Xpresso was trained on human and mouse genes, using promoter sequences (~10.5 kb windows centered on the TSS) together with mRNA half-life features derived from gene-body and UTR properties. Expression targets are log-transformed median mRNA levels across tissues.
The default Xpresso model is the published humanMedian model. Other published variants (K562, GM12878, mESC, mouseMedian) share the same architecture but are not exposed as separate default model variants.
Training Procedure¶
Pre-training¶
The model was trained to minimize a mean-squared-error loss between predicted and observed log mRNA expression values.
- Optimizer: Adam
- Loss: Mean squared error
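For reference, the objective named above is the plain mean of squared differences between predicted and observed log expression values. The sketch below uses a hypothetical `mean_squared_error` helper on Python lists with made-up numbers. It illustrates the loss only and does not reproduce the training loop or the optimizer.

```python
def mean_squared_error(predicted, observed):
    """Mean of squared differences between two equal-length sequences."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not predicted:
        raise ValueError("at least one value is required")
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted)


# Illustrative values only: predicted vs. observed log expression for three genes.
loss = mean_squared_error([0.5, 1.5, 2.0], [1.0, 1.0, 3.0])
print(loss)
```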
Citation¶
| BibTeX | |
|---|---|
Note
The artifacts distributed in this repository are part of the MultiMolecule project. If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:
| BibTeX | |
|---|---|
Contact¶
Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the Xpresso paper for questions or comments on the paper/model.
License¶
This model implementation is licensed under the GNU Affero General Public License.
For additional terms and clarifications, please refer to our License FAQ.
| Text Only | |
|---|---|
multimolecule.models.xpresso¶
DnaTokenizer¶
Bases: Tokenizer
Tokenizer for DNA sequences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `Alphabet \| str \| List[str] \| None` | Alphabet to use for tokenization. | `None` |
| | `int` | Size of kmer to tokenize. | `1` |
| | `bool` | Whether to tokenize into codons. | `False` |
| | `bool` | Whether to replace U with T. | `True` |
| | `bool` | Whether to convert input to uppercase. | `True` |
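The normalization options in the table (uppercasing and U-to-T replacement) can be illustrated without the library. This is a minimal sketch with hypothetical names (`normalize_dna`, `upper`, `replace_u`). It is not the `DnaTokenizer` implementation and covers only the single-nucleotide case, not k-mer or codon tokenization.

```python
def normalize_dna(sequence, upper=True, replace_u=True):
    """Apply the two normalization steps and split into single-base tokens."""
    if upper:
        sequence = sequence.upper()
    if replace_u:
        # Handle both cases so the step also works when `upper` is False.
        sequence = sequence.replace("U", "T").replace("u", "t")
    return list(sequence)


print(normalize_dna("acgu"))
```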
Examples:
Source code in multimolecule/tokenisers/dna/tokenization_dna.py
XpressoConfig¶
Bases: PreTrainedConfig
This is the configuration class to store the configuration of an XpressoModel. It is used to instantiate an Xpresso model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the Xpresso vagarwal87/Xpresso architecture.
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `int` | Vocabulary size of the Xpresso model. Defines the number of feature channels derived from | `5` |
| | `int` | The length of the promoter sequence window (centered on the TSS) consumed by the convolutional stack. | `10500` |
| | `int` | Number of convolutional blocks in the encoder. | `2` |
| | `list[int] \| None` | Number of output channels for each convolutional block. Length must equal | `None` |
| | `list[int] \| None` | Convolution kernel size for each convolutional block. Length must equal | `None` |
| | `list[int] \| None` | Dilation factor for each convolutional block. Length must equal | `None` |
| | `list[int] \| None` | Max-pooling window for each convolutional block. Length must equal | `None` |
| | `int` | Number of auxiliary numeric mRNA half-life features concatenated with the convolutional representation before the fully-connected head. | `6` |
| | `list[int] \| None` | Dimensionality of each fully-connected layer in the head. | `None` |
| | `str` | The non-linear activation function (function or string) in the encoder and the head. If string, | `'relu'` |
| | `float` | The dropout probability applied after each fully-connected layer. | `0.00099` |
| | `int` | Number of output labels. Xpresso predicts a single scalar mRNA expression value. | `1` |
| | `HeadConfig \| None` | The configuration of the prediction head. Defaults to a regression head ( | `None` |
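Several of the parameters above are per-block lists whose length is tied to the number of convolutional blocks. The sketch below shows one way such a constraint could be checked. `ConvStackSpec` and its field names are hypothetical and are not the attributes of `XpressoConfig`. The example values are illustrative only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ConvStackSpec:
    """Hypothetical container for per-block convolution settings."""

    num_blocks: int = 2
    channels: Optional[List[int]] = None
    kernel_sizes: Optional[List[int]] = None
    dilations: Optional[List[int]] = None
    pool_sizes: Optional[List[int]] = None

    def __post_init__(self):
        # Each list that is provided needs exactly one entry per block.
        for name in ("channels", "kernel_sizes", "dilations", "pool_sizes"):
            values = getattr(self, name)
            if values is not None and len(values) != self.num_blocks:
                raise ValueError(
                    f"{name} has {len(values)} entries, expected {self.num_blocks}"
                )


spec = ConvStackSpec(num_blocks=2, channels=[128, 32], kernel_sizes=[6, 9])
print(spec)
```

Leaving a list as `None` skips the check, mirroring the `None` defaults in the table.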
Examples:
Source code in multimolecule/models/xpresso/configuration_xpresso.py
XpressoForSequencePrediction¶
Bases: XpressoPreTrainedModel
Examples:
Source code in multimolecule/models/xpresso/modeling_xpresso.py
XpressoModel¶
Bases: XpressoPreTrainedModel
Examples:
Source code in multimolecule/models/xpresso/modeling_xpresso.py
XpressoModelOutput dataclass¶
Bases: ModelOutput
Base class for outputs of the Xpresso backbone.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `torch.FloatTensor` of shape `(batch_size, flattened_conv_size)` | Flattened convolutional representation of the promoter sequence. | `None` |
| | `torch.FloatTensor` of shape `(batch_size, hidden_size)` | Final fully-connected representation, with the auxiliary mRNA half-life features fused in. This is the tensor consumed by | `None` |
| | always `None` | Xpresso is a purely convolutional architecture and has no attention; this field is always `None`. | `None` |
Source code in multimolecule/models/xpresso/modeling_xpresso.py
XpressoPreTrainedModel¶
Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.