
NanoBodyBERT

NanoBodyBERT is a BERT-based model specifically pre-trained on nanobody sequences for antibody design and analysis tasks.

Model Description

The model is pre-trained with Masked Language Modeling (MLM) on nanobody sequences, with a particular focus on CDR (Complementarity-Determining Region) masking strategies.
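
The exact masking schedule used for pre-training is not spelled out here, but the idea can be sketched as follows. This is a minimal illustration, not the actual training code: the CDR coordinates, masking probabilities, and the mask_sequence helper are all hypothetical.

import random

def mask_sequence(sequence, cdr_ranges, mask_token="[MASK]",
                  cdr_prob=0.5, framework_prob=0.1):
    """Mask CDR positions with a higher probability than framework positions."""
    cdr_positions = {i for start, end in cdr_ranges for i in range(start, end)}
    out = []
    for i, residue in enumerate(sequence):
        prob = cdr_prob if i in cdr_positions else framework_prob
        out.append(mask_token if random.random() < prob else residue)
    return "".join(out)

# Hypothetical CDR coordinates (0-based, end-exclusive); real boundaries depend
# on the numbering scheme (e.g. IMGT or Kabat) and the individual sequence.
sequence = "QVQLVESGGGLVQPGGSLRLSCAASGFTFDDYSIAWFRQAPGKEREGVAAISWGGGSTYYADSVKGRFTISRDNAKNTLYLQMNSLRAEDTAVYYCAKDYWGQGTQVTVSS"
print(mask_sequence(sequence, cdr_ranges=[(26, 33), (50, 59), (97, 103)]))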

Intended Use

This model is designed for:

  • Nanobody sequence analysis
  • CDR region reconstruction
  • Sequence embedding generation
  • Antibody design applications

How to Use

Installation

First, install the required dependencies:

pip install transformers torch

Loading the Model

import torch
from transformers import BertForMaskedLM

# Load the custom tokenizer (AATokenizer)
# You need to have the tokenizer.py file in your project
from tokenizer import AATokenizer

# Load model and tokenizer
model = BertForMaskedLM.from_pretrained("LLMasterLL/nanobodybert")
tokenizer = AATokenizer.from_pretrained("LLMasterLL/nanobodybert")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

Inference Example

# Example nanobody sequence
sequence = "QVQLVESGGGLVQPGGSLRLSCAASGFTFDDYSIAWFRQAPGKEREGVAAISWGGGSTYYADSVKGRFTISRDNAKNTLYLQMNSLRAEDTAVYYCAKDYWGQGTQVTVSS"

# Encode sequence
input_ids = tokenizer.encode(sequence, add_special_tokens=True)
input_ids = torch.tensor([input_ids], dtype=torch.long).to(device)

# Get embeddings
with torch.no_grad():
    outputs = model.bert(input_ids=input_ids, return_dict=True)
    embeddings = outputs.last_hidden_state
    cls_embedding = embeddings[0, 0, :]  # [CLS] token embedding

print(f"CLS embedding shape: {cls_embedding.shape}")

Masked Prediction Example

# Create a masked sequence (mask CDR3 region for example)
masked_sequence = "QVQLVESGGGLVQPGGSLRLSCAASGFTFDDYSIAWFRQAPGKEREGVAAISWGGGSTYYADSVKGRFTISRDNAKNTLYLQMNSLRAEDTAVYYCAK[MASK][MASK][MASK][MASK]QGTQVTVSS"

# Replace the [MASK] placeholders with the tokenizer's mask token, then encode
prepared_sequence = masked_sequence.replace("[MASK]", tokenizer.mask_token)
input_ids = tokenizer.encode(prepared_sequence, add_special_tokens=True)
input_ids = torch.tensor([input_ids], dtype=torch.long).to(device)

# Predict
with torch.no_grad():
    outputs = model(input_ids=input_ids)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Decode the argmax at every position (the model's reconstruction of the full sequence)
predicted_sequence = tokenizer.decode(predictions[0].cpu().tolist(), skip_special_tokens=True)
print(f"Predicted: {predicted_sequence}")

Model Architecture

  • Architecture: BERT (Bidirectional Encoder Representations from Transformers)
  • Vocabulary: 26 tokens (20 amino acids + special tokens; see the sketch after this list)
  • Max sequence length: 256
  • Special tokens:
    • [PAD]: Padding token
    • [CLS]: Classification token (sequence start)
    • [SEP]: Separator token (sequence end)
    • [MASK]: Mask token for MLM
    • [UNK]: Unknown token
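
For reference, the snippet below sketches a plausible vocabulary layout consistent with the description above. The actual token-to-ID mapping is defined by the repository's tokenizer files, so treat these IDs as illustrative only.

# Illustrative vocabulary layout; the real mapping comes from the
# repository's tokenizer files, and token IDs may differ.
special_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
amino_acids = list("ACDEFGHIKLMNPQRSTVWY")  # 20 canonical amino acids
vocab = {token: idx for idx, token in enumerate(special_tokens + amino_acids)}
print(len(vocab))  # 25 here; the 26th token in the released vocabulary is tokenizer-specific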

Training Data

The model was pre-trained on a curated dataset of nanobody sequences using a CDR-focused masking strategy.

Citation

If you use this model in your research, please cite:

@misc{nanobodybert,
  title={NanoBodyBERT: BERT-based Pre-trained Model for Nanobody Sequences},
  author={Ling Luo},
  year={2025},
  howpublished={\url{https://huggingface.co/LLMasterLL/nanobodybert}},
}

License

Apache 2.0

Contact

For questions and feedback, please open an issue on the repository.
