Ab-RoBERTa
Ab-RoBERTa is a pretrained masked language model (MLM) built on the RoBERTa architecture and trained on antibody sequences from the Observed Antibody Space (OAS) database. The model was trained on amino acid sequences written as uppercase letters with no spaces between them, so inputs must follow the same format. Ab-RoBERTa is described in detail in this paper, and was originally released at this location.
Model Description
- Developed by: Eunna Huh, Hyeonsu Lee, Hyunjin Shin
- Funded by: Mogam Institute for Biomedical Research
- Model type: RoBERTa
- Trained Database: Observed Antibody Space (OAS)
- License: MIT License
Main configuration
| hidden_size | num_hidden_layers | num_attention_heads | intermediate_size | total_parameters |
|---|---|---|---|---|
| 768 | 12 | 12 | 3,072 | 125M |
Uses
This model can be utilized to extract features from antibody sequences or fine-tuned for various downstream tasks. It is compatible with the Transformers library for easy loading and integration.
Example usage
from transformers import (
    RobertaTokenizer,
    RobertaModel,
    RobertaForMaskedLM,
    RobertaForSequenceClassification,
)
# Load tokenizer (No need to add spaces to the sequence)
tokenizer = RobertaTokenizer.from_pretrained("mogam-ai/Ab-RoBERTa", do_lower_case=False)
# Load pre-trained model (exclude mlm head)
model = RobertaModel.from_pretrained("mogam-ai/Ab-RoBERTa", add_pooling_layer=False)
# Load pre-trained model (include mlm head)
mlm_model = RobertaForMaskedLM.from_pretrained("mogam-ai/Ab-RoBERTa")
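The RobertaForSequenceClassification class imported above is not used in this snippet, but it is a natural starting point for fine-tuning on downstream tasks. The sketch below is only an illustration; the num_labels value is a placeholder that depends on your task.
# Load a sequence-classification head on top of the pre-trained encoder
# for downstream fine-tuning (num_labels=2 is an illustrative placeholder).
cls_model = RobertaForSequenceClassification.from_pretrained(
    "mogam-ai/Ab-RoBERTa", num_labels=2
)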
The tokenizer is designed to process batch inputs without requiring spaces between characters.
The tokenizer adds a start token ("<s>", token ID 0) at the beginning of each sequence and an end token ("</s>", token ID 2) at the end of each sequence.
To standardize sequence lengths within a batch, padding tokens ("<pad>", token ID 1) are added following the end token, extending each sequence to the maximum length observed in the batch.
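These conventions can be verified directly on the loaded tokenizer via the standard Hugging Face tokenizer attributes; the commented values simply restate the token IDs described above.
print(tokenizer.bos_token, tokenizer.bos_token_id)  # "<s>", 0
print(tokenizer.pad_token, tokenizer.pad_token_id)  # "<pad>", 1
print(tokenizer.eos_token, tokenizer.eos_token_id)  # "</s>", 2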
example_sequences = [
    "QVQLVQSGPEVRKPGASEKVSCKASGYTFTNFYLHWVRQAPGQGLEWMGIINPSDGSTKFSRKFEGRVAMTRDTYTRTVYMELSSLRSEDTAVYYCTRCQDVVLLPAAQPENYYYGLDVWGQGTTVTVS",
    "QDQLVQSGAEVKNPGASVKVSCKASGYTFTSYGISLVRQAPGQGLEWMGWISAYNGNTNDAQKLQGRVTMTTDTSTSTAYMELRSLRSDDTAVYYCARVNSGSGWYFVPEEYYYYYYGMDVWGQGTTVTVSS",
]
tokens = tokenizer.batch_encode_plus(
    example_sequences,
    add_special_tokens=True,
    max_length=150,
    padding=True,
    truncation=True,
    return_tensors="pt",
    return_special_tokens_mask=False,
)
"""
Output
{
'input_ids': tensor(
[
[ 0, 18, 22, 18, 14, ..., 2, 1, 1, 1],
[ 0, 18, 7, 18, 14, 22, 18, ..., 20, 2]
]
),
'attention_mask': tensor(
[
[1, 1, 1, 1, 1, ..., 1, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, ..., 1, 1]
]
)
}
"""
- To extract sequence embeddings from the model, use the code snippet below.
output = model(**tokens).last_hidden_state
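last_hidden_state has shape (batch, sequence_length, 768), i.e. one vector per token. A common way to obtain a single fixed-size embedding per antibody sequence is to average over the non-padding positions; the mean pooling below is an illustrative choice, not part of the original release.
import torch

with torch.no_grad():
    hidden = model(**tokens).last_hidden_state  # (batch, seq_len, 768)

# Average hidden states over non-padding positions (mean pooling).
mask = tokens["attention_mask"].unsqueeze(-1)   # (batch, seq_len, 1)
sequence_embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(sequence_embeddings.shape)                # torch.Size([2, 768])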
Citation
BibTeX:
@misc{huh2025antibodyfoundationalmodel,
  title={Antibody Foundational Model : Ab-RoBERTa},
  author={Eunna Huh and Hyeonsu Lee and Hyunjin Shin},
  year={2025},
  eprint={2506.13006},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2506.13006},
}
Model tree for mogam-ai/Ab-RoBERTa
- Base model: FacebookAI/roberta-base