ProGen2¶
Pre-trained model on protein sequences using a causal language modeling (CLM) objective.
Disclaimer¶
This is an UNOFFICIAL implementation of ProGen2: Exploring the Boundaries of Protein Language Models by Erik Nijkamp, Jeffrey A. Ruffolo, et al.
The OFFICIAL repository of ProGen2 is at enijkamp/progen2.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.
The team releasing ProGen2 did not write a model card for this model, so this model card has been written by the MultiMolecule team.
Model Details¶
ProGen2 is a GPT-J-style model pre-trained on a large corpus of protein sequences in a self-supervised fashion. This means that the model was trained on the raw amino acids of protein sequences only, with an automatic process to generate inputs and labels from those sequences. Please refer to the Training Details section for more information on the training process.
Variants¶
- multimolecule/progen2-xlarge: The ProGen2 model pre-trained on Uniref90 and BFD30 with 6.4 billion parameters.
- multimolecule/progen2-large: The ProGen2 model pre-trained on Uniref90 and BFD30 with 2.7 billion parameters.
- multimolecule/progen2-bfd90: The ProGen2 model pre-trained on Uniref90 and BFD90 with 2.7 billion parameters.
- multimolecule/progen2-base: The ProGen2 model pre-trained on Uniref90 and BFD30 with 764 million parameters.
- multimolecule/progen2-oas: The ProGen2 model pre-trained on OAS with 764 million parameters.
- multimolecule/progen2-medium: The ProGen2 model pre-trained on Uniref90 and BFD30 with 764 million parameters.
- multimolecule/progen2-small: The ProGen2 model pre-trained on Uniref90 and BFD30 with 151 million parameters.
Model Specification¶
| Variants | Num Layers | Hidden Size | Num Heads | Intermediate Size | Num Parameters (M) | FLOPs (G) | MACs (G) | Max Num Tokens |
|---|---|---|---|---|---|---|---|---|
| ProGen2-xlarge | 32 | 4096 | 16 | 16384 | 6443.66 | 6735.76 | 3367.27 | 1024 |
| ProGen2-large | 32 | 2560 | 32 | 10240 | 2517.34 | 2664.21 | 1331.45 | 1024 |
| ProGen2-bfd90 | 32 | 2560 | 32 | 10240 | 2517.34 | 2664.21 | 1331.45 | 1024 |
| ProGen2-base | 27 | 1536 | 16 | 6144 | 764.81 | 826.85 | 413.12 | 2048 |
| ProGen2-oas | 27 | 1536 | 16 | 6144 | 764.81 | 826.85 | 413.12 | 1024 |
| ProGen2-medium | 27 | 1536 | 16 | 6144 | 764.81 | 826.85 | 413.12 | 1024 |
| ProGen2-small | 12 | 1024 | 16 | 4096 | 151.15 | 167.74 | 83.75 | 1024 |
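As a sanity check on the parameter counts in the table, the following sketch recomputes them from the number of layers, the hidden size, and the intermediate size. It assumes a GPT-J-style block (four bias-free attention projections, a two-layer MLP with biases, one LayerNorm per block), a final LayerNorm, an input embedding, an untied output head with bias, and a vocabulary of 35 tokens (the `ProGen2Config` default). These layout details are assumptions, not statements from the model card, and `estimate_parameters` is an illustrative name. Under these assumptions the estimate agrees with the table to two decimals.

```python
def estimate_parameters(
    num_layers: int,
    hidden_size: int,
    intermediate_size: int,
    vocab_size: int = 35,
) -> int:
    """Estimate the parameter count of a GPT-J-style decoder (assumed layout)."""
    # q, k, v and output projections, all without biases (assumption)
    attention = 4 * hidden_size * hidden_size
    # two linear layers with biases
    mlp = 2 * hidden_size * intermediate_size + intermediate_size + hidden_size
    # a single LayerNorm per block: weight and bias
    layer_norm = 2 * hidden_size
    per_layer = attention + mlp + layer_norm

    embedding = vocab_size * hidden_size
    final_norm = 2 * hidden_size
    # untied output projection with bias (assumption)
    lm_head = hidden_size * vocab_size + vocab_size
    return num_layers * per_layer + embedding + final_norm + lm_head


if __name__ == "__main__":
    # (num_layers, hidden_size, intermediate_size) taken from the table
    specs = {
        "ProGen2-xlarge": (32, 4096, 16384),
        "ProGen2-base": (27, 1536, 6144),
        "ProGen2-small": (12, 1024, 4096),
    }
    for name, (layers, hidden, intermediate) in specs.items():
        millions = estimate_parameters(layers, hidden, intermediate) / 1e6
        print(f"{name}: {millions:.2f} M")
    # ProGen2-xlarge: 6443.66 M
    # ProGen2-base: 764.81 M
    # ProGen2-small: 151.15 M
```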
Links¶
- Code: multimolecule.progen2
- Weights: multimolecule/progen2
- Data: UniRef, BFD
- Paper: ProGen2: Exploring the Boundaries of Protein Language Models
- Developed by: Erik Nijkamp, Jeffrey A. Ruffolo, Eli N. Weinstein, Nikhil Naik, Ali Madani
- Model type: GPT-J
- Original Repository: enijkamp/progen2
Usage¶
The model file depends on the multimolecule library. You can install it with pip: `pip install multimolecule`.
Direct Use¶
Text Generation¶
You can use this model directly with a pipeline for text generation.
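To illustrate what autoregressive generation does under the hood, here is a library-free sketch of a sampling loop. It does not call multimolecule or load any weights: `toy_next_token_logits` is a hypothetical stand-in for the model, and the `<eos>` marker and all function names are assumptions made for this example only.

```python
import math
import random
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
EOS = "<eos>"  # hypothetical end-of-sequence marker for this toy example


def toy_next_token_logits(prefix: str) -> Dict[str, float]:
    """Stand-in for a language model: one logit per candidate token.

    A real model would compute these from the prefix. Here every residue is
    equally likely and the end marker becomes more likely as the prefix grows.
    """
    logits = {aa: 0.0 for aa in AMINO_ACIDS}
    logits[EOS] = -4.0 + 0.2 * len(prefix)
    return logits


def sample_token(logits: Dict[str, float], temperature: float, rng: random.Random) -> str:
    """Sample one token from softmax(logits / temperature)."""
    tokens = list(logits)
    scaled = [logits[token] / temperature for token in tokens]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(tokens, weights=weights, k=1)[0]


def generate(prompt: str, max_new_tokens: int = 50, temperature: float = 1.0, seed: int = 0) -> str:
    """Extend `prompt` one token at a time until the end marker or the length limit."""
    rng = random.Random(seed)
    sequence = prompt
    for _ in range(max_new_tokens):
        token = sample_token(toy_next_token_logits(sequence), temperature, rng)
        if token == EOS:
            break
        sequence += token  # the new token becomes part of the next prefix
    return sequence


if __name__ == "__main__":
    print(generate("MKT", max_new_tokens=30, seed=1))
```

Each step conditions on everything generated so far, which is the same left-to-right factorization the CLM objective trains.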
Downstream Use¶
Extract Features¶
Here is how to use this model to get the features of a given sequence in PyTorch:
Sequence Classification / Regression¶
Note
This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.
Here is how to use this model as backbone to fine-tune for a sequence-level task in PyTorch:
Token Classification / Regression¶
Note
This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for token classification or regression.
Here is how to use this model as backbone to fine-tune for a residue-level task in PyTorch:
Training Details¶
ProGen2 used Causal Language Modeling (CLM) as the pre-training objective: given a protein sequence, the model is trained to predict the next amino acid token autoregressively.
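A minimal sketch of how inputs and labels can be derived from a raw sequence under this objective: the labels are the inputs shifted by one position, and the loss is the average negative log-likelihood of each correct next token. The `<bos>` and `<eos>` markers and both function names are hypothetical placeholders, not ProGen2's actual special tokens or training code.

```python
import math
from typing import List, Sequence, Tuple


def make_clm_example(sequence: str, bos: str = "<bos>", eos: str = "<eos>") -> Tuple[List[str], List[str]]:
    """Build (inputs, labels) for next-token prediction by shifting one position."""
    tokens = [bos, *sequence, eos]  # one token per amino acid
    inputs = tokens[:-1]  # everything except the last token
    labels = tokens[1:]  # the token that follows each input position
    return inputs, labels


def average_nll(label_probabilities: Sequence[float]) -> float:
    """Mean negative log-likelihood, given the probability assigned to each correct label."""
    return -sum(math.log(p) for p in label_probabilities) / len(label_probabilities)


if __name__ == "__main__":
    inputs, labels = make_clm_example("MKV")
    print(inputs)  # ['<bos>', 'M', 'K', 'V']
    print(labels)  # ['M', 'K', 'V', '<eos>']
```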
Training Data¶
The ProGen2 models were pre-trained on protein sequence databases:
- Uniref90: A clustered version of the UniProt database at 90% sequence identity, containing approximately 135 million sequences.
- BFD30: The Big Fantastic Database clustered at 30% sequence identity, approximately one-third the size of Uniref90.
- BFD90: The Big Fantastic Database clustered at 90% sequence identity, approximately twice the size of Uniref90.
- OAS: The Observed Antibody Space database, clustered at 85% sequence identity.
Different model variants were trained on different combinations:
- progen2-small, progen2-medium, progen2-base, progen2-large, progen2-xlarge: Trained on Uniref90 and BFD30.
- progen2-bfd90: Trained on Uniref90 and BFD90.
- progen2-oas: Trained on the OAS database.
Training Procedure¶
ProGen2 used causal language modeling (CLM) as the pre-training objective.
Pre-training¶
The model was trained on Google TPU-v3 pods using JAX.
- Batch size: 500,000 – 1,000,000
- Steps: 350,000 – 400,000
- Optimizer: Adam(β1=0.9, β2=0.999, ε=1e-8)
- Learning rate: 1e-5 – 6e-4
- Learning rate scheduler: Cosine
- Learning rate warm-up: 3,000 – 10,000 steps
- Weight decay: 0.1
- Maximum Gradient Norm: 0.8 – 1.0
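The schedule and clipping settings listed above can be sketched in plain Python. The example picks one value from each listed range (peak learning rate 6e-4, 3,000 warm-up steps, 350,000 total steps, maximum gradient norm 1.0). The linear warm-up shape, the decay floor of zero, and the function names are assumptions, since the list does not specify them.

```python
import math
from typing import List, Sequence


def learning_rate(
    step: int,
    peak_lr: float = 6e-4,
    warmup_steps: int = 3_000,
    total_steps: int = 350_000,
    min_lr: float = 0.0,
) -> float:
    """Linear warm-up to `peak_lr`, then cosine decay to `min_lr` (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads: Sequence[float], max_norm: float = 1.0) -> List[float]:
    """Rescale gradients so that their L2 norm does not exceed `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


if __name__ == "__main__":
    for step in (0, 1_500, 3_000, 176_500, 350_000):
        print(step, f"{learning_rate(step):.2e}")
    print(clip_by_global_norm([3.0, 4.0]))  # norm 5.0 is rescaled to norm 1.0
```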
Citation¶
Note
The artifacts distributed in this repository are part of the MultiMolecule project. If MultiMolecule supports your research, please cite the MultiMolecule project.
Contact¶
Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the ProGen2 paper for questions or comments on the paper/model.
License¶
This model implementation is licensed under the GNU Affero General Public License.
For additional terms and clarifications, please refer to our License FAQ.
API Reference¶
ProGen2Config¶
Bases: PreTrainedConfig
This is the configuration class to store the configuration of a ProGen2Model.
It is used to instantiate a ProGen2 model according to the specified arguments, defining the model architecture.
Instantiating a configuration with the defaults will yield a similar configuration to that of the ProGen2
salesforce/progen2 architecture, which follows the GPT-J style transformer.
Configuration objects inherit from PreTrainedConfig and can be used to
control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | int | Vocabulary size of the ProGen2 model. Defines the number of different tokens that can be represented by the | 35 |
| | int | Dimensionality of the encoder layers and the pooler layer. | 1536 |
| | int | Number of hidden layers in the Transformer encoder. | 27 |
| | int | Number of attention heads for each attention layer in the Transformer encoder. | 16 |
| | int \| None | Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder. | None |
| | str | The non-linear activation function (function or string) in the encoder and pooler. If string, | 'gelu_new' |
| | float | The dropout probability for the embedding layer. | 0.0 |
| | float | The dropout probability for residual connections and fully connected layers in the decoder. | 0.0 |
| | float | The dropout ratio for the attention probabilities. | 0.0 |
| | int | The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048). | 2048 |
| | float | The standard deviation of the truncated_normal_initializer for initializing all weight matrices. | 0.02 |
| | float | The epsilon used by the layer normalization layers. | 1e-05 |
| | int \| None | Dimensionality of rotary position embeddings. If | 48 |
| | bool | Whether to scale attention weights by sqrt(head_dim). | True |
| | bool | Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if | True |
| | bool | Whether the model is used as a decoder or not. If | True |
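With the defaults listed above, each attention head has 1536 / 16 = 96 dimensions, and rotary position embeddings cover the first 48 of them. The following sketch shows how a partial rotary embedding and the sqrt(head_dim) scaling could interact for a single head. It assumes the GPT-J convention of rotating adjacent pairs of dimensions with base 10000; the function names are illustrative and are not taken from the multimolecule source.

```python
import math
from typing import List, Sequence


def apply_rotary(vector: Sequence[float], position: int, rotary_dim: int, base: float = 10000.0) -> List[float]:
    """Rotate adjacent pairs in the first `rotary_dim` entries; leave the rest untouched."""
    if rotary_dim % 2 != 0 or rotary_dim > len(vector):
        raise ValueError("rotary_dim must be even and no larger than the head dimension")
    rotated = list(vector)
    for i in range(0, rotary_dim, 2):
        angle = position * base ** (-i / rotary_dim)  # lower frequency for later pairs
        cos, sin = math.cos(angle), math.sin(angle)
        x, y = vector[i], vector[i + 1]
        rotated[i] = x * cos - y * sin
        rotated[i + 1] = x * sin + y * cos
    return rotated


def attention_score(query: Sequence[float], key: Sequence[float]) -> float:
    """Dot product scaled by sqrt(head_dim)."""
    head_dim = len(query)
    return sum(q * k for q, k in zip(query, key)) / math.sqrt(head_dim)


if __name__ == "__main__":
    hidden_size, num_heads, rotary_dim = 1536, 16, 48
    head_dim = hidden_size // num_heads
    print(head_dim)  # 96

    query = [0.1 * (i + 1) for i in range(head_dim)]
    key = [0.05 * (head_dim - i) for i in range(head_dim)]
    # The score depends only on the relative offset between the two positions.
    near = attention_score(apply_rotary(query, 5, rotary_dim), apply_rotary(key, 3, rotary_dim))
    shifted = attention_score(apply_rotary(query, 2, rotary_dim), apply_rotary(key, 0, rotary_dim))
    print(math.isclose(near, shifted, rel_tol=1e-9))  # True
```

Because only the first 48 dimensions are rotated, the remaining 48 dimensions of each head carry position-independent content.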
Source code in multimolecule/models/progen2/configuration_progen2.py
ProGen2ForCausalLM¶
Bases: ProGen2PreTrainedModel, GenerationMixin
Source code in multimolecule/models/progen2/modeling_progen2.py
ProGen2ForSequencePrediction¶
Bases: ProGen2PreTrainedModel
Source code in multimolecule/models/progen2/modeling_progen2.py
ProGen2ForTokenPrediction¶
Bases: ProGen2PreTrainedModel
Source code in multimolecule/models/progen2/modeling_progen2.py
ProGen2Model¶
Bases: ProGen2PreTrainedModel
Note
When gradient checkpointing is enabled (model.gradient_checkpointing_enable()), use_cache is
incompatible with recomputation and should be set to False; past key-value caching will not function
correctly under gradient checkpointing.
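One way to respect this constraint in a training script is to resolve the flag before each forward pass. The helper below is a hypothetical, library-free sketch: its name and signature are not part of multimolecule, and restricting the override to training mode is an assumption modelled on common practice.

```python
import warnings


def resolve_use_cache(use_cache: bool, gradient_checkpointing: bool, training: bool = True) -> bool:
    """Return the effective `use_cache` value for a forward pass.

    Caching is switched off when gradient checkpointing is active during
    training, because recomputed activations cannot reuse cached key/values.
    """
    if use_cache and gradient_checkpointing and training:
        warnings.warn(
            "use_cache=True is incompatible with gradient checkpointing; "
            "falling back to use_cache=False."
        )
        return False
    return use_cache


if __name__ == "__main__":
    print(resolve_use_cache(use_cache=True, gradient_checkpointing=False))  # True
    print(resolve_use_cache(use_cache=False, gradient_checkpointing=True))  # False
```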
Source code in multimolecule/models/progen2/modeling_progen2.py