AbLang
Pre-trained antibody language model using a masked language modeling (MLM) objective.
Disclaimer
This is an UNOFFICIAL implementation of AbLang: an antibody language model for completing antibody sequences by Tobias H. Olsen, et al.
The OFFICIAL repository of AbLang is at oxpig/AbLang.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.
The team that released AbLang did not write this model card; it was written by the MultiMolecule team.
Model Details
AbLang v1 is an encoder-only Transformer trained on antibody sequences from the Observed Antibody Space (OAS). The official release provides separate heavy-chain and light-chain checkpoints. Both variants use the same architecture and vocabulary, but they were trained on chain-specific data and are represented as separate MultiMolecule variants.
Variants
Model Specification
| Variant | Chain Type | Num Layers | Hidden Size | Num Heads | Intermediate Size | Num Parameters (M) | FLOPs (G) | MACs (G) | Max Num Tokens |
|---|---|---|---|---|---|---|---|---|---|
| AbLang-Heavy | Heavy | 12 | 768 | 12 | 3072 | 85.83 | 28.18 | 14.06 | 159 |
| AbLang-Light | Light | | | | | | | | |
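As a rough sanity check on the parameter count in the table, the figure can be approximated from the architecture numbers alone. The sketch below assumes a BERT-style layout: token and position embeddings followed by a layer norm, encoder layers with biased linear projections and two layer norms each, no pooler, and an untied masked-LM head made of a dense transform, a layer norm, and a decoder. That layout is an assumption rather than something stated in this model card, and `estimate_ablang_params` is a hypothetical helper.

```python
def estimate_ablang_params(
    vocab_size: int = 37,
    hidden_size: int = 768,
    num_layers: int = 12,
    intermediate_size: int = 3072,
    max_position_embeddings: int = 160,
) -> int:
    """Approximate parameter count under an ASSUMED BERT-style layout."""
    layer_norm = 2 * hidden_size  # weight + bias
    # Token embeddings + learned position embeddings + embedding layer norm.
    embeddings = (vocab_size + max_position_embeddings) * hidden_size + layer_norm
    # Query, key, value, and output projections, each with a bias.
    attention = 4 * (hidden_size * hidden_size + hidden_size)
    # Two feed-forward projections, each with a bias.
    feed_forward = (hidden_size * intermediate_size + intermediate_size) + (
        intermediate_size * hidden_size + hidden_size
    )
    layer = attention + feed_forward + 2 * layer_norm
    # Masked-LM head: dense transform + layer norm + untied decoder with bias.
    lm_head = (hidden_size * hidden_size + hidden_size) + layer_norm + (
        hidden_size * vocab_size + vocab_size
    )
    return embeddings + num_layers * layer + lm_head


total = estimate_ablang_params()
print(f"{total / 1e6:.2f} M parameters")  # 85.83 M parameters
```

Under these assumptions the estimate agrees with the 85.83 M reported for AbLang-Heavy. The number of attention heads does not enter the total, because heads only partition the hidden dimension.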
Usage
The model file depends on the multimolecule library. You can install it using pip:
| Bash |
|---|
| pip install multimolecule
|
Direct Use
Masked Language Modeling
You can use this model directly with a pipeline for masked language modeling:
| Python |
|---|
| import multimolecule # you must import multimolecule to register models
from transformers import pipeline
predictor = pipeline("fill-mask", model="multimolecule/ablang-heavy")
output = predictor("EVQLVESGGGLVQPGGSLRLSCAASGFTFSSY<mask>MSWVRQAPGKGLEWVSA")
|
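The input to the fill-mask pipeline is an ordinary sequence with one residue replaced by the mask token. A minimal sketch of building such an input is shown here; `mask_residue` is a hypothetical helper, not part of multimolecule, and the `<mask>` string is taken from the example above.

```python
MASK_TOKEN = "<mask>"  # mask token string as written in the pipeline example


def mask_residue(sequence: str, index: int, mask_token: str = MASK_TOKEN) -> str:
    """Replace the residue at a 0-based index with the mask token."""
    if not 0 <= index < len(sequence):
        raise IndexError(f"index {index} is outside a sequence of length {len(sequence)}")
    return sequence[:index] + mask_token + sequence[index + 1:]


heavy = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSA"
masked = mask_residue(heavy, 32)
print(masked)  # EVQLVESGGGLVQPGGSLRLSCAASGFTFSSY<mask>MSWVRQAPGKGLEWVSA
```

Masking index 32 of the 50-residue sequence reproduces the string passed to the pipeline above.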
Downstream Use
Here is how to use this model to get the features of a given antibody sequence in PyTorch:
| Python |
|---|
| from multimolecule import AbLangModel, ProteinTokenizer
tokenizer = ProteinTokenizer.from_pretrained("multimolecule/ablang-heavy")
model = AbLangModel.from_pretrained("multimolecule/ablang-heavy")
text = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSA"
input = tokenizer(text, return_tensors="pt")
output = model(**input)
|
Sequence Classification / Regression
Note
This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.
Here is how to use this model as a backbone to fine-tune for a sequence-level task in PyTorch:
| Python |
|---|
| import torch
from multimolecule import AbLangForSequencePrediction, ProteinTokenizer
tokenizer = ProteinTokenizer.from_pretrained("multimolecule/ablang-heavy")
model = AbLangForSequencePrediction.from_pretrained("multimolecule/ablang-heavy")
text = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSA"
input = tokenizer(text, return_tensors="pt")
label = torch.tensor([1])
output = model(**input, labels=label)
|
Token Classification / Regression
Note
This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for token classification or regression.
Here is how to use this model as a backbone to fine-tune for a residue-level task in PyTorch:
| Python |
|---|
| import torch
from multimolecule import AbLangForTokenPrediction, ProteinTokenizer
tokenizer = ProteinTokenizer.from_pretrained("multimolecule/ablang-heavy")
model = AbLangForTokenPrediction.from_pretrained("multimolecule/ablang-heavy")
text = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSA"
input = tokenizer(text, return_tensors="pt")
label = torch.randint(2, (len(text), ))
output = model(**input, labels=label)
|
Training Details
AbLang was trained with masked language modeling (MLM) as the pre-training objective.
Training Data
AbLang was trained on antibody sequences from the Observed Antibody Space.
The heavy-chain model was trained on 14,126,724 sequences, and the light-chain model was trained on 187,068 sequences.
Training Procedure
Pre-training
The heavy-chain and light-chain checkpoints were trained separately on chain-specific OAS sequences.
Please refer to the original paper for details on the training setup.
Citation
| BibTeX |
|---|
| @article{olsen2022ablang,
title = {AbLang: an antibody language model for completing antibody sequences},
author = {Olsen, Tobias H. and Moal, Iain H. and Deane, Charlotte M.},
journal = {Bioinformatics Advances},
volume = {2},
number = {1},
pages = {vbac046},
year = {2022},
doi = {10.1093/bioadv/vbac046},
url = {https://doi.org/10.1093/bioadv/vbac046},
}
|
Note
The artifacts distributed in this repository are part of the MultiMolecule project.
If MultiMolecule supports your research, please cite the MultiMolecule project as follows:
| BibTeX |
|---|
| @software{chen_2024_12638419,
author = {Chen, Zhiyuan and Zhu, Sophia Y.},
title = {MultiMolecule},
doi = {10.5281/zenodo.12638419},
publisher = {Zenodo},
url = {https://doi.org/10.5281/zenodo.12638419},
year = 2024,
month = may,
day = 4
}
|
Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the AbLang paper for questions or comments on the paper/model.
License
This model implementation is licensed under the GNU Affero General Public License.
For additional terms and clarifications, please refer to our License FAQ.
| Text Only |
|---|
| SPDX-License-Identifier: AGPL-3.0-or-later
|
API Reference
AbLangConfig
Bases: PreTrainedConfig
This is the configuration class to store the configuration of an
AbLangModel. It is used to instantiate an AbLang v1 model according to the
specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a
configuration similar to the official AbLang v1 heavy/light checkpoints.
Configuration objects inherit from PreTrainedConfig and can be used to
control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| vocab_size | int | Vocabulary size of the AbLang model. Defines the number of different tokens that can be represented by the input_ids passed when calling AbLangModel. | 37 |
| hidden_size | int | Dimensionality of the encoder layers and the pooler output. | 768 |
| num_hidden_layers | int | Number of hidden layers in the Transformer encoder. | 12 |
| num_attention_heads | int | Number of attention heads for each attention layer in the Transformer encoder. | 12 |
| intermediate_size | int | Dimensionality of the feed-forward layer in the Transformer encoder. | 3072 |
| hidden_act | str | Non-linear activation function used by the feed-forward layer and masked language modeling head. | 'gelu' |
| hidden_dropout | float | Dropout probability applied after embeddings, self-attention, and feed-forward projections. | 0.1 |
| attention_dropout | float | Dropout probability applied to attention probabilities. | 0.1 |
| max_position_embeddings | int | Size of the learned absolute position embedding table. Position id 0 is reserved for padding. | 160 |
| initializer_range | float | Standard deviation of the normal initializer for embedding and linear layers. | 0.02 |
| layer_norm_eps | float | Epsilon used by layer normalization layers. | 1e-12 |
| chain | str \| None | Optional antibody chain label for converted checkpoints. AbLang v1 provides separate heavy and light checkpoints trained on different data. | None |
| head | HeadConfig \| None | The configuration of the downstream prediction head. | None |
| lm_head | MaskedLMHeadConfig \| None | The configuration of the masked language model head. | None |
Examples:
| Python Console Session |
|---|
| >>> from multimolecule.models.ablang import AbLangConfig, AbLangModel
>>> configuration = AbLangConfig()
>>> model = AbLangModel(configuration)
>>> configuration = model.config
|
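The `max_position_embeddings` description notes that position id 0 is reserved for padding, so a table of size 160 leaves 159 usable positions, which matches the Max Num Tokens value listed for AbLang-Heavy. One common way to assign such ids is a running count over non-padding tokens. The sketch below illustrates that scheme in plain Python; it is an assumption about how the ids could be derived, not the library's embedding code, and `padding_aware_position_ids` is a hypothetical name.

```python
from __future__ import annotations


def padding_aware_position_ids(input_ids: list[int], pad_token_id: int = 0) -> list[int]:
    """Number real tokens 1, 2, 3, ... and give every padding token position 0."""
    position_ids = []
    count = 0
    for token_id in input_ids:
        if token_id == pad_token_id:
            position_ids.append(0)  # reserved padding position
        else:
            count += 1
            position_ids.append(count)
    return position_ids


MAX_POSITION_EMBEDDINGS = 160
max_tokens = MAX_POSITION_EMBEDDINGS - 1  # position 0 is not available to real tokens

ids = [1, 9, 23, 21, 15, 2, 0, 0]  # last two entries are padding (pad_token_id=0)
positions = padding_aware_position_ids(ids)
print(positions)   # [1, 2, 3, 4, 5, 6, 0, 0]
print(max_tokens)  # 159
```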
Source code in multimolecule/models/ablang/configuration_ablang.py
| Python |
|---|
| class AbLangConfig(PreTrainedConfig):
    r"""
    This is the configuration class to store the configuration of an
    [`AbLangModel`][multimolecule.models.AbLangModel]. It is used to instantiate an AbLang v1 model according to the
    specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a
    configuration similar to the official AbLang v1 heavy/light checkpoints.
    Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to
    control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]
    for more information.
    Args:
        vocab_size:
            Vocabulary size of the AbLang model. Defines the number of different tokens that can be represented by the
            `input_ids` passed when calling [`AbLangModel`].
        hidden_size:
            Dimensionality of the encoder layers and the pooler output.
        num_hidden_layers:
            Number of hidden layers in the Transformer encoder.
        num_attention_heads:
            Number of attention heads for each attention layer in the Transformer encoder.
        intermediate_size:
            Dimensionality of the feed-forward layer in the Transformer encoder.
        hidden_act:
            Non-linear activation function used by the feed-forward layer and masked language modeling head.
        hidden_dropout:
            Dropout probability applied after embeddings, self-attention, and feed-forward projections.
        attention_dropout:
            Dropout probability applied to attention probabilities.
        max_position_embeddings:
            Size of the learned absolute position embedding table. Position id `0` is reserved for padding.
        initializer_range:
            Standard deviation of the normal initializer for embedding and linear layers.
        layer_norm_eps:
            Epsilon used by layer normalization layers.
        chain:
            Optional antibody chain label for converted checkpoints. AbLang v1 provides separate `heavy` and `light`
            checkpoints trained on different data.
        head:
            The configuration of the downstream prediction head.
        lm_head:
            The configuration of the masked language model head.
    Examples:
        >>> from multimolecule.models.ablang import AbLangConfig, AbLangModel
        >>> configuration = AbLangConfig()
        >>> model = AbLangModel(configuration)
        >>> configuration = model.config
    """
    model_type = "ablang"
    position_embedding_type = "absolute"
    def __init__(
        self,
        vocab_size: int = 37,
        hidden_size: int = 768,
        num_hidden_layers: int = 12,
        num_attention_heads: int = 12,
        intermediate_size: int = 3072,
        hidden_act: str = "gelu",
        hidden_dropout: float = 0.1,
        attention_dropout: float = 0.1,
        max_position_embeddings: int = 160,
        initializer_range: float = 0.02,
        layer_norm_eps: float = 1.0e-12,
        chain: str | None = None,
        pad_token_id: int = 0,
        bos_token_id: int = 1,
        eos_token_id: int = 2,
        unk_token_id: int = 3,
        mask_token_id: int = 4,
        null_token_id: int = 5,
        head: HeadConfig | None = None,
        lm_head: MaskedLMHeadConfig | None = None,
        **kwargs,
    ):
        kwargs.setdefault("tie_word_embeddings", False)
        super().__init__(
            pad_token_id=pad_token_id,
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            unk_token_id=unk_token_id,
            mask_token_id=mask_token_id,
            null_token_id=null_token_id,
            **kwargs,
        )
        validate_attention_dimensions(hidden_size, num_attention_heads)
        hidden_act = hidden_act.lower()
        if max_position_embeddings <= 1:
            raise ValueError("max_position_embeddings must be greater than 1 because position id 0 is padding.")
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.intermediate_size = intermediate_size
        self.hidden_act = hidden_act
        self.hidden_dropout = hidden_dropout
        self.attention_dropout = attention_dropout
        self.max_position_embeddings = max_position_embeddings
        self.initializer_range = initializer_range
        self.layer_norm_eps = layer_norm_eps
        self.chain = chain
        self.position_embedding_type = "absolute"
        self.head = HeadConfig(**head) if head is not None else None
        self.lm_head = (
            MaskedLMHeadConfig(**lm_head)
            if lm_head is not None
            else MaskedLMHeadConfig(
                transform="nonlinear",
                transform_act=hidden_act,
                bias=True,
                layer_norm_eps=layer_norm_eps,
            )
        )
|
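The constructor above normalizes and validates its arguments before storing them. The standalone sketch below mirrors those checks without depending on multimolecule. `MiniAbLangConfig` is a hypothetical stand-in, and the divisibility check is an assumption about what `validate_attention_dimensions` does, since its body is not shown here.

```python
from dataclasses import dataclass


@dataclass
class MiniAbLangConfig:
    """Hypothetical stand-in for illustration; not the library class."""

    hidden_size: int = 768
    num_attention_heads: int = 12
    max_position_embeddings: int = 160
    hidden_act: str = "gelu"

    def __post_init__(self) -> None:
        # ASSUMED behaviour of validate_attention_dimensions: heads must split
        # the hidden dimension evenly.
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError("hidden_size must be divisible by num_attention_heads.")
        # Same rule as the constructor above: position id 0 is padding.
        if self.max_position_embeddings <= 1:
            raise ValueError("max_position_embeddings must be greater than 1 because position id 0 is padding.")
        # The constructor lower-cases the activation name.
        self.hidden_act = self.hidden_act.lower()

    @property
    def head_dim(self) -> int:
        return self.hidden_size // self.num_attention_heads


config = MiniAbLangConfig(hidden_act="GELU")
print(config.head_dim, config.hidden_act)  # 64 gelu
```

With the defaults, each of the 12 heads works on a 64-dimensional slice of the 768-dimensional hidden state.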
AbLangForMaskedLM
Bases: AbLangPreTrainedModel
Examples:
| Python Console Session |
|---|
| >>> import torch
>>> from multimolecule.models.ablang import AbLangConfig, AbLangForMaskedLM
>>> config = AbLangConfig()
>>> model = AbLangForMaskedLM(config)
>>> input_ids = torch.tensor([[1, 9, 23, 21, 15, 2]])
>>> output = model(input_ids, labels=input_ids)
>>> output["logits"].shape
torch.Size([1, 6, 37])
|
Source code in multimolecule/models/ablang/modeling_ablang.py
| Python |
|---|
| class AbLangForMaskedLM(AbLangPreTrainedModel):
    """
    Examples:
        >>> import torch
        >>> from multimolecule.models.ablang import AbLangConfig, AbLangForMaskedLM
        >>> config = AbLangConfig()
        >>> model = AbLangForMaskedLM(config)
        >>> input_ids = torch.tensor([[1, 9, 23, 21, 15, 2]])
        >>> output = model(input_ids, labels=input_ids)
        >>> output["logits"].shape
        torch.Size([1, 6, 37])
    """
    _tied_weights_keys = {
        "lm_head.decoder.bias": "lm_head.bias",
    }
    def get_expanded_tied_weights_keys(self, all_submodels: bool = False) -> dict:
        tied_weights = super().get_expanded_tied_weights_keys(all_submodels=all_submodels)
        if all_submodels:
            return tied_weights
        return tied_weights | self._tied_weights_keys
    def __init__(self, config: AbLangConfig):
        super().__init__(config)
        self.model = AbLangModel(config, add_pooling_layer=False)
        self.lm_head = MaskedLMHead(config)
        # Initialize weights and apply final processing
        self.post_init()
    def get_output_embeddings(self):
        return self.lm_head.decoder
    def set_output_embeddings(self, embeddings):
        self.lm_head.decoder = embeddings
        if hasattr(self.lm_head, "bias"):
            self.lm_head.bias = embeddings.bias
    @can_return_tuple
    def forward(
        self,
        input_ids: Tensor | NestedTensor | None = None,
        attention_mask: Tensor | None = None,
        inputs_embeds: Tensor | NestedTensor | None = None,
        labels: Tensor | None = None,
        **kwargs: Unpack[TransformersKwargs],
    ) -> tuple[Tensor, ...] | MaskedLMOutput:
        outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
            inputs_embeds=inputs_embeds,
            return_dict=True,
            **kwargs,
        )
        output = self.lm_head(outputs, labels)
        logits, loss = output.logits, output.loss
        return MaskedLMOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )
|
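When reading the `loss` returned for a freshly initialized model, a useful reference point is the cross-entropy of a uniform prediction over the 37-token vocabulary, which is ln(37). The sketch below computes that baseline in plain Python. It assumes the head uses a standard cross-entropy loss, which is not shown in the source above, and `cross_entropy` is an illustrative helper.

```python
import math


def cross_entropy(logits: list, target: int) -> float:
    """Cross-entropy for one position, using a numerically stable log-sum-exp."""
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target]


VOCAB_SIZE = 37  # default vocab_size of AbLangConfig
uniform_logits = [0.0] * VOCAB_SIZE  # every token equally likely
baseline = cross_entropy(uniform_logits, target=9)
print(round(baseline, 4))  # 3.6109
```

A loss well below this value indicates the model is predicting residues better than chance.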
AbLangForSequencePrediction
Bases: AbLangPreTrainedModel
Examples:
| Python Console Session |
|---|
| >>> import torch
>>> from multimolecule.models.ablang import AbLangConfig, AbLangForSequencePrediction
>>> config = AbLangConfig()
>>> model = AbLangForSequencePrediction(config)
>>> input_ids = torch.tensor([[1, 9, 23, 21, 15, 2]])
>>> output = model(input_ids, labels=torch.tensor([[1]]))
>>> output["logits"].shape
torch.Size([1, 1])
|
Source code in multimolecule/models/ablang/modeling_ablang.py
| Python |
|---|
| class AbLangForSequencePrediction(AbLangPreTrainedModel):
    """
    Examples:
        >>> import torch
        >>> from multimolecule.models.ablang import AbLangConfig, AbLangForSequencePrediction
        >>> config = AbLangConfig()
        >>> model = AbLangForSequencePrediction(config)
        >>> input_ids = torch.tensor([[1, 9, 23, 21, 15, 2]])
        >>> output = model(input_ids, labels=torch.tensor([[1]]))
        >>> output["logits"].shape
        torch.Size([1, 1])
    """
    def __init__(self, config: AbLangConfig):
        super().__init__(config)
        self.model = AbLangModel(config)
        self.num_labels = config.num_labels
        self.sequence_head = SequencePredictionHead(config)
        self.head_config = self.sequence_head.config
        # Initialize weights and apply final processing
        self.post_init()
    @can_return_tuple
    def forward(
        self,
        input_ids: Tensor | NestedTensor | None = None,
        attention_mask: Tensor | None = None,
        inputs_embeds: Tensor | NestedTensor | None = None,
        labels: Tensor | None = None,
        **kwargs: Unpack[TransformersKwargs],
    ) -> tuple[Tensor, ...] | SequencePredictorOutput:
        outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
            inputs_embeds=inputs_embeds,
            return_dict=True,
            **kwargs,
        )
        output = self.sequence_head(outputs, labels)
        logits, loss = output.logits, output.loss
        return SequencePredictorOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )
|
AbLangForTokenPrediction
Bases: AbLangPreTrainedModel
Examples:
| Python Console Session |
|---|
| >>> import torch
>>> from multimolecule.models.ablang import AbLangConfig, AbLangForTokenPrediction
>>> config = AbLangConfig()
>>> model = AbLangForTokenPrediction(config)
>>> input_ids = torch.tensor([[1, 9, 23, 21, 15, 2]])
>>> output = model(input_ids, labels=torch.randint(2, (1, 4)))
>>> output["logits"].shape
torch.Size([1, 4, 1])
|
Source code in multimolecule/models/ablang/modeling_ablang.py
| Python |
|---|
| class AbLangForTokenPrediction(AbLangPreTrainedModel):
    """
    Examples:
        >>> import torch
        >>> from multimolecule.models.ablang import AbLangConfig, AbLangForTokenPrediction
        >>> config = AbLangConfig()
        >>> model = AbLangForTokenPrediction(config)
        >>> input_ids = torch.tensor([[1, 9, 23, 21, 15, 2]])
        >>> output = model(input_ids, labels=torch.randint(2, (1, 4)))
        >>> output["logits"].shape
        torch.Size([1, 4, 1])
    """
    def __init__(self, config: AbLangConfig):
        super().__init__(config)
        self.model = AbLangModel(config, add_pooling_layer=False)
        self.num_labels = config.num_labels
        self.token_head = TokenPredictionHead(config)
        self.head_config = self.token_head.config
        # Initialize weights and apply final processing
        self.post_init()
    @can_return_tuple
    def forward(
        self,
        input_ids: Tensor | NestedTensor | None = None,
        attention_mask: Tensor | None = None,
        inputs_embeds: Tensor | NestedTensor | None = None,
        labels: Tensor | None = None,
        **kwargs: Unpack[TransformersKwargs],
    ) -> tuple[Tensor, ...] | TokenPredictorOutput:
        outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
            inputs_embeds=inputs_embeds,
            return_dict=True,
            **kwargs,
        )
        output = self.token_head(outputs, attention_mask, input_ids, labels)
        logits, loss = output.logits, output.loss
        return TokenPredictorOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )
|
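In the example for this class, six input ids yield logits of shape `[1, 4, 1]`. That is consistent with the token head dropping special tokens (id 1 and id 2 are the default `bos_token_id` and `eos_token_id`) so that one prediction remains per residue. The sketch below illustrates that alignment in plain Python. It is inferred from the shapes in the example rather than from the `TokenPredictionHead` source, which is not shown here, and `residue_positions` is a hypothetical helper.

```python
def residue_positions(input_ids, bos_token_id=1, eos_token_id=2, pad_token_id=0):
    """Return the indices of tokens that are not BOS, EOS, or padding."""
    special = {bos_token_id, eos_token_id, pad_token_id}
    return [i for i, token_id in enumerate(input_ids) if token_id not in special]


ids = [1, 9, 23, 21, 15, 2]  # BOS, four residues, EOS (same ids as the example)
kept = residue_positions(ids)
print(kept)       # [1, 2, 3, 4]
print(len(kept))  # 4
```

Residue-level labels should therefore have one entry per residue, not one per input id.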
AbLangModel
Bases: AbLangPreTrainedModel
Examples:
| Python Console Session |
|---|
| >>> import torch
>>> from multimolecule.models.ablang import AbLangConfig, AbLangModel
>>> config = AbLangConfig()
>>> model = AbLangModel(config)
>>> input_ids = torch.tensor([[1, 9, 23, 21, 15, 2]])
>>> output = model(input_ids)
>>> output["last_hidden_state"].shape
torch.Size([1, 6, 768])
>>> output["pooler_output"].shape
torch.Size([1, 768])
|
Source code in multimolecule/models/ablang/modeling_ablang.py
| Python |
|---|
| class AbLangModel(AbLangPreTrainedModel):
    """
    Examples:
        >>> import torch
        >>> from multimolecule.models.ablang import AbLangConfig, AbLangModel
        >>> config = AbLangConfig()
        >>> model = AbLangModel(config)
        >>> input_ids = torch.tensor([[1, 9, 23, 21, 15, 2]])
        >>> output = model(input_ids)
        >>> output["last_hidden_state"].shape
        torch.Size([1, 6, 768])
        >>> output["pooler_output"].shape
        torch.Size([1, 768])
    """
    def __init__(self, config: AbLangConfig, add_pooling_layer: bool = True):
        super().__init__(config)
        self.pad_token_id = config.pad_token_id
        self.embeddings = AbLangEmbeddings(config)
        self.encoder = AbLangEncoder(config)
        self.pooler = AbLangPooler(config) if add_pooling_layer else None
        # Initialize weights and apply final processing
        self.post_init()
    def get_input_embeddings(self):
        return self.embeddings.word_embeddings
    def set_input_embeddings(self, value):
        self.embeddings.word_embeddings = value
    @merge_with_config_defaults
    def forward(
        self,
        input_ids: Tensor | NestedTensor | None = None,
        attention_mask: Tensor | None = None,
        inputs_embeds: Tensor | NestedTensor | None = None,
        **kwargs: Unpack[TransformersKwargs],
    ) -> tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:
        if isinstance(input_ids, NestedTensor):
            if attention_mask is None:
                attention_mask = input_ids.mask
            input_ids = input_ids.tensor
        if isinstance(inputs_embeds, NestedTensor):
            if attention_mask is None:
                attention_mask = inputs_embeds.mask
            inputs_embeds = inputs_embeds.tensor
        if (input_ids is None) ^ (inputs_embeds is not None):
            raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
        if attention_mask is None:
            if input_ids is not None and self.pad_token_id is not None:
                attention_mask = input_ids.ne(self.pad_token_id)
            else:
                if inputs_embeds is None:
                    raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
                input_shape = inputs_embeds.shape[:2]
                attention_mask = torch.ones(input_shape, dtype=torch.bool, device=inputs_embeds.device)
        else:
            attention_mask = attention_mask.to(torch.bool)
        embedding_output = self.embeddings(
            input_ids=input_ids,
            attention_mask=attention_mask,
            inputs_embeds=inputs_embeds,
        )
        encoder_outputs = self.encoder(
            embedding_output,
            attention_mask=attention_mask,
            output_hidden_states=kwargs.get("output_hidden_states", self.config.output_hidden_states),
            output_attentions=kwargs.get("output_attentions", self.config.output_attentions),
        )
        sequence_output = encoder_outputs.last_hidden_state
        pooled_output = (
            self.pooler(sequence_output, attention_mask=attention_mask, input_ids=input_ids) if self.pooler else None
        )
        return BaseModelOutputWithPoolingAndCrossAttentions(
            last_hidden_state=sequence_output,
            pooler_output=pooled_output,
            hidden_states=encoder_outputs.hidden_states,
            attentions=encoder_outputs.attentions,
        )
|
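Two pieces of the `forward` logic above are easy to check in isolation: the XOR test that rejects calls supplying both or neither of `input_ids` and `inputs_embeds`, and the default attention mask derived from the padding id. The sketch below reproduces both with plain Python lists instead of tensors; `check_inputs` and `default_attention_mask` are hypothetical helper names.

```python
def check_inputs(input_ids, inputs_embeds) -> None:
    """Raise unless exactly one of the two inputs is provided (same XOR as forward)."""
    if (input_ids is None) ^ (inputs_embeds is not None):
        raise ValueError("You must specify exactly one of input_ids or inputs_embeds")


def default_attention_mask(input_ids, pad_token_id=0):
    """List-based analogue of input_ids.ne(pad_token_id): True for real tokens."""
    return [[token_id != pad_token_id for token_id in row] for row in input_ids]


batch = [[1, 9, 23, 2, 0, 0]]
check_inputs(batch, None)  # accepted: only input_ids is given
mask = default_attention_mask(batch)
print(mask)  # [[True, True, True, True, False, False]]
```

Passing both arguments, or neither, makes the XOR evaluate to true and triggers the `ValueError`.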
AbLangPreTrainedModel
Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
models.
Source code in multimolecule/models/ablang/modeling_ablang.py
| Python |
|---|
| class AbLangPreTrainedModel(PreTrainedModel):
    """
    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
    models.
    """
    config_class = AbLangConfig
    base_model_prefix = "model"
    supports_gradient_checkpointing = True
    _can_record_outputs: dict[str, Any] | None = None
    _no_split_modules = ["AbLangLayer"]
    @torch.no_grad()
    def _init_weights(self, module: nn.Module):
        std = self.config.initializer_range
        if isinstance(module, nn.Linear):
            init.normal_(module.weight, mean=0.0, std=std)
            if module.bias is not None:
                init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            init.normal_(module.weight, mean=0.0, std=std)
            if module.padding_idx is not None and not getattr(module.weight, "_is_hf_initialized", False):
                module.weight.data[module.padding_idx].zero_()
        elif isinstance(module, nn.LayerNorm):
            init.ones_(module.weight)
            init.zeros_(module.bias)
|