Ab-RoBERTa

Ab-RoBERTa is a pretrained masked language model (MLM) built on the RoBERTa architecture and trained on antibody sequences from the Observed Antibody Space (OAS) database. The model was trained on amino acid sequences written in uppercase letters with no spaces between them, so it only accepts inputs in this format. Ab-RoBERTa is described in detail in the accompanying paper (arXiv:2506.13006) and was originally released at this location.

Model Description

  • Developed by: Eunna Huh, Hyeonsu Lee, Hyunjin Shin
  • Funded by: Mogam Institute for Biomedical Research
  • Model type: RoBERTa
  • Trained Database: Observed Antibody Space (OAS)
  • License: MIT License

Main configuration

| hidden_size | num_hidden_layers | num_attention_heads | intermediate_size | total_parameters |
|-------------|-------------------|---------------------|-------------------|------------------|
| 768         | 12                | 12                  | 3,072             | 125M             |
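
These values can be confirmed directly from the published checkpoint configuration (a minimal sketch using the standard Transformers AutoConfig entry point; the expected numbers are those in the table above).

from transformers import AutoConfig

# Inspect the architecture hyperparameters of the released checkpoint
config = AutoConfig.from_pretrained("mogam-ai/Ab-RoBERTa")
print(config.hidden_size, config.num_hidden_layers,
      config.num_attention_heads, config.intermediate_size)
# Expected per the table above: 768 12 12 3072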

Uses

This model can be used to extract features from antibody sequences or fine-tuned for downstream tasks. It is compatible with the Hugging Face Transformers library for straightforward loading and integration.

Example usage

from transformers import (
    RobertaTokenizer,
    RobertaModel,
    RobertaForMaskedLM,
    RobertaForSequenceClassification
)

# Load tokenizer (No need to add spaces to the sequence)
tokenizer = RobertaTokenizer.from_pretrained("mogam-ai/Ab-RoBERTa", do_lower_case=False)
# Load pre-trained model (exclude mlm head)
model = RobertaModel.from_pretrained("mogam-ai/Ab-RoBERTa", add_pooling_layer=False)
# Load pre-trained model (include mlm head)
mlm_model = RobertaForMaskedLM.from_pretrained("mogam-ai/Ab-RoBERTa")
  • The tokenizer is designed to process batch inputs without requiring spaces between characters.

  • The tokenizer adds a start token ("<s>", token ID 0) at the beginning of each sequence and an end token ("</s>", token ID 2) at the end of each sequence.

  • To standardize sequence lengths within a batch, padding tokens ("<pad>", token ID 1) are added following the end token, extending each sequence to the maximum length observed in the batch.
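
As a quick sanity check (a minimal sketch; these attributes are standard on Hugging Face tokenizers), the special-token IDs described above can be read directly from the loaded tokenizer.

# Confirm the special tokens and IDs listed above (expected: <s>=0, <pad>=1, </s>=2)
print(tokenizer.bos_token, tokenizer.bos_token_id)   # <s> 0
print(tokenizer.pad_token, tokenizer.pad_token_id)   # <pad> 1
print(tokenizer.eos_token, tokenizer.eos_token_id)   # </s> 2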

example_sequences = [
    "QVQLVQSGPEVRKPGASEKVSCKASGYTFTNFYLHWVRQAPGQGLEWMGIINPSDGSTKFSRKFEGRVAMTRDTYTRTVYMELSSLRSEDTAVYYCTRCQDVVLLPAAQPENYYYGLDVWGQGTTVTVS", "QDQLVQSGAEVKNPGASVKVSCKASGYTFTSYGISLVRQAPGQGLEWMGWISAYNGNTNDAQKLQGRVTMTTDTSTSTAYMELRSLRSDDTAVYYCARVNSGSGWYFVPEEYYYYYYGMDVWGQGTTVTVSS"
]

tokens = tokenizer.batch_encode_plus(
    example_sequences, add_special_tokens=True,
    max_length=150,
    padding=True,
    truncation=True,
    return_tensors="pt",
    return_special_tokens_mask=False,
)
"""
Output
{
    'input_ids': tensor(
        [
            [ 0, 18, 22, 18, 14, ...,  2,  1,  1,  1],
            [ 0, 18,  7, 18, 14, 22, 18, ..., 20,  2]
        ]
    ), 
    'attention_mask': tensor(
        [
            [1, 1, 1, 1, 1, ..., 1, 0, 0, 0],
            [1, 1, 1, 1, 1, 1, 1, ..., 1, 1]
        ]
    )
}
"""
  • To extract sequence embeddings from the model, use the code snippet below.
output = model(**tokens).last_hidden_state
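
The last_hidden_state above is a per-residue representation of shape (batch, sequence_length, 768). If a single fixed-length vector per antibody sequence is needed, one common option (shown here as a sketch, not a method prescribed by the model card) is to mean-pool over the non-padding positions using the attention mask.

import torch

# Mean-pool per-residue embeddings over non-padding positions
mask = tokens["attention_mask"].unsqueeze(-1).float()          # (batch, seq_len, 1)
seq_embeddings = (output * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, 768), one vector per sequence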

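For downstream tasks, the same checkpoint can be loaded with a task-specific head. The sketch below uses RobertaForSequenceClassification, already imported above; the task, labels, number of classes, and learning rate are placeholders for illustration, not values from the paper.

import torch
from torch.optim import AdamW

# Hypothetical binary classification head on top of Ab-RoBERTa (e.g. binder vs. non-binder)
clf = RobertaForSequenceClassification.from_pretrained("mogam-ai/Ab-RoBERTa", num_labels=2)
optimizer = AdamW(clf.parameters(), lr=2e-5)

labels = torch.tensor([0, 1])            # placeholder labels for the two example sequences
clf.train()
outputs = clf(**tokens, labels=labels)   # loss is computed internally from logits and labels
outputs.loss.backward()
optimizer.step()
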
Citation

BibTeX:

@misc{huh2025antibodyfoundationalmodel,
      title={Antibody Foundational Model : Ab-RoBERTa},
      author={Eunna Huh and Hyeonsu Lee and Hyunjin Shin},
      year={2025},
      eprint={2506.13006},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2506.13006},
}
