Ab-RoBERTa
Ab-RoBERTa is a pretrained masked language model (MLM) built on the RoBERTa architecture and trained on antibody sequences from the Observed Antibody Space (OAS) database. The model was trained on amino acid sequences written as uppercase letters with no spaces between them, so inputs must follow the same format. Ab-RoBERTa is described in detail in this paper, and was originally released at this location.
Model Description
- Developed by: Eunna Huh, Hyeonsu Lee, Hyunjin Shin
- Funded by: Mogam Institute for Biomedical Research
- Model type: RoBERTa
- Trained Database: Observed Antibody Space (OAS)
- License: MIT License
Main configuration
| hidden_size | num_hidden_layers | num_attention_heads | intermediate_size | total_parameters |
|---|---|---|---|---|
| 768 | 12 | 12 | 3,072 | 125M |
Uses
This model can be utilized to extract features from antibody sequences or fine-tuned for various downstream tasks. It is compatible with the Transformers library for easy loading and integration.
Example usage
from transformers import (
    RobertaTokenizer,
    RobertaModel,
    RobertaForMaskedLM,
    RobertaForSequenceClassification,
)
# Load tokenizer (No need to add spaces to the sequence)
tokenizer = RobertaTokenizer.from_pretrained("mogam-ai/Ab-RoBERTa", do_lower_case=False)
# Load pre-trained model (exclude mlm head)
model = RobertaModel.from_pretrained("mogam-ai/Ab-RoBERTa", add_pooling_layer=False)
# Load pre-trained model (include mlm head)
mlm_model = RobertaForMaskedLM.from_pretrained("mogam-ai/Ab-RoBERTa")
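The RobertaForSequenceClassification class imported above is not used in this snippet, but it is a natural starting point for fine-tuning on downstream tasks. The sketch below is only an illustration; the num_labels value is a placeholder that depends on your task.
# Load a sequence-classification head on top of the pre-trained encoder
# for downstream fine-tuning (num_labels=2 is an illustrative placeholder).
cls_model = RobertaForSequenceClassification.from_pretrained(
    "mogam-ai/Ab-RoBERTa", num_labels=2
)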
The tokenizer is designed to process batch inputs without requiring spaces between characters.
The tokenizer adds a start token ("<s>", token ID 0) at the beginning of each sequence and an end token ("</s>", token ID 2) at the end of each sequence.
To standardize sequence lengths within a batch, padding tokens ("<pad>", token ID 1) are added following the end token, extending each sequence to the maximum length observed in the batch.
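These conventions can be verified directly on the loaded tokenizer via the standard Hugging Face tokenizer attributes; the commented values simply restate the token IDs described above.
print(tokenizer.bos_token, tokenizer.bos_token_id)  # "<s>", 0
print(tokenizer.pad_token, tokenizer.pad_token_id)  # "<pad>", 1
print(tokenizer.eos_token, tokenizer.eos_token_id)  # "</s>", 2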
example_sequences = [
    "QVQLVQSGPEVRKPGASEKVSCKASGYTFTNFYLHWVRQAPGQGLEWMGIINPSDGSTKFSRKFEGRVAMTRDTYTRTVYMELSSLRSEDTAVYYCTRCQDVVLLPAAQPENYYYGLDVWGQGTTVTVS",
    "QDQLVQSGAEVKNPGASVKVSCKASGYTFTSYGISLVRQAPGQGLEWMGWISAYNGNTNDAQKLQGRVTMTTDTSTSTAYMELRSLRSDDTAVYYCARVNSGSGWYFVPEEYYYYYYGMDVWGQGTTVTVSS",
]
tokens = tokenizer.batch_encode_plus(
    example_sequences,
    add_special_tokens=True,
    max_length=150,
    padding=True,
    truncation=True,
    return_tensors="pt",
    return_special_tokens_mask=False,
)
"""
Output
{
'input_ids': tensor(
[
[ 0, 18, 22, 18, 14, ..., 2, 1, 1, 1],
[ 0, 18, 7, 18, 14, 22, 18, ..., 20, 2]
]
),
'attention_mask': tensor(
[
[1, 1, 1, 1, 1, ..., 1, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, ..., 1, 1]
]
)
}
"""
- To extract sequence embeddings from the model, use the code snippet below.
output = model(**tokens).last_hidden_state
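last_hidden_state has shape (batch, sequence_length, 768), i.e. one vector per token. A common way to obtain a single fixed-size embedding per antibody sequence is to average over the non-padding positions; the mean pooling below is an illustrative choice, not part of the original release.
import torch

with torch.no_grad():
    hidden = model(**tokens).last_hidden_state  # (batch, seq_len, 768)

# Average hidden states over non-padding positions (mean pooling).
mask = tokens["attention_mask"].unsqueeze(-1)   # (batch, seq_len, 1)
sequence_embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(sequence_embeddings.shape)                # torch.Size([2, 768])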
Citation
BibTeX:
@misc{huh2025antibodyfoundationalmodel,
  title={Antibody Foundational Model : Ab-RoBERTa},
  author={Eunna Huh and Hyeonsu Lee and Hyunjin Shin},
  year={2025},
  eprint={2506.13006},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2506.13006},
}
Model tree for mogam-ai/Ab-RoBERTa
- Base model: FacebookAI/roberta-base