---
#language:
#- en
license: mit
tags:
- biology
- protein
- antibody
- ablang
- transformers
- pytorch
- chemistry
- oas
- cdr
- ablang2 hf implementation
- roberta
- ESM
- ablang2
- antibody-design

# datasets:
# - oas
metrics:
- sequence modeling
- protein language model
library_name: transformers
pipeline_tag: fill-mask
---

# 🧬 AbLang2: Transformer-based Antibody Language Model

This repository provides a Hugging Face-compatible 🤗 implementation of the AbLang2 language model for antibodies. The original AbLang2 model was developed by the [Oxford Protein Informatics Group (OPIG)](https://opig.stats.ox.ac.uk/) and is available at:
- **AbLang2**: [https://github.com/TobiasHeOl/AbLang2](https://github.com/TobiasHeOl/AbLang2)

## 🎯 Available Model

- **ablang2**: AbLang2 model for antibody sequences

## 📦 Installation

Install the required dependencies:

```bash
# Install core dependencies
pip install transformers numpy pandas rotary-embedding-torch

# Install ANARCI from bioconda (required for antibody numbering)
conda install -c bioconda anarci
```

**Note**: ANARCI is required for antibody sequence numbering and alignment features. It must be installed from the bioconda channel.

## 🚀 Loading the Model from the Hugging Face Hub

### Method 1: Load Model and Tokenizer, then Import Adapter
```python
import sys
import os
from transformers import AutoModel, AutoTokenizer
from huggingface_hub import hf_hub_download

# Load model and tokenizer from Hugging Face Hub
model = AutoModel.from_pretrained("hemantn/ablang2", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("hemantn/ablang2", trust_remote_code=True)

# Download adapter and add to path
adapter_path = hf_hub_download(repo_id="hemantn/ablang2", filename="adapter.py")
cached_model_dir = os.path.dirname(adapter_path)
sys.path.insert(0, cached_model_dir)

# Import and create the adapter
from adapter import AbLang2PairedHuggingFaceAdapter
ablang = AbLang2PairedHuggingFaceAdapter(model=model, tokenizer=tokenizer)
```

### Method 2: Using importlib (Alternative)
```python
import importlib.util
from transformers import AutoModel, AutoTokenizer
from huggingface_hub import hf_hub_download

# Load model and tokenizer
model = AutoModel.from_pretrained("hemantn/ablang2", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("hemantn/ablang2", trust_remote_code=True)

# Load adapter dynamically
adapter_path = hf_hub_download(repo_id="hemantn/ablang2", filename="adapter.py")
spec = importlib.util.spec_from_file_location("adapter", adapter_path)
adapter_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(adapter_module)

# Create the adapter
ablang = adapter_module.AbLang2PairedHuggingFaceAdapter(model=model, tokenizer=tokenizer)
```

**Note**: The model automatically uses the GPU when available and falls back to the CPU otherwise.
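If you prefer to control device placement explicitly rather than rely on the automatic choice, the standard PyTorch pattern works with the HF-loaded model (this assumes `torch` is installed, which `transformers` requires for this model anyway):

```python
import torch

# Pick the device explicitly instead of relying on the automatic choice.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

# model = model.to(device)  # move the HF-loaded model before creating the adapter
```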

## ⚙️ Available Utilities

This wrapper translates between HuggingFace's model format and AbLang2's expected input/output structure, making it easy to use AbLang2's antibody analysis tools with a model loaded from HuggingFace.

- **seqcoding**: Sequence-level representations (averaged across residues)
- **rescoding**: Residue-level representations (per-residue embeddings)
- **likelihood**: Raw logits for amino acid prediction at each position
- **probability**: Normalized probabilities for amino acid prediction
- **pseudo_log_likelihood**: Uncertainty scoring with stepwise masking (masks each residue)
- **confidence**: Fast uncertainty scoring (single forward pass, no masking)
- **restore**: Restore masked residues (*) with predicted amino acids

All these utilities work seamlessly with the HuggingFace-loaded model, maintaining the same API as the original AbLang2 implementation.
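As a mental model for the first two utilities: `seqcoding` is, per the description above, the average over residues of the per-residue embeddings that `rescoding` returns. A minimal NumPy sketch of that relationship, using made-up embedding values rather than real model output:

```python
import numpy as np

# Hypothetical per-residue embeddings for one sequence:
# shape (sequence_length, embedding_dim), as rescoding would return.
rescoding = np.array([
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
])

# seqcoding collapses this to a single vector by averaging over residues.
seqcoding = rescoding.mean(axis=0)
print(seqcoding)
```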

The `AbLang2PairedHuggingFaceAdapter` class is a wrapper that lets you use AbLang2's utilities after loading the model from HuggingFace. It enables you to:

- **Access all AbLang2 utilities** (seqcoding, rescoding, likelihood, probability, etc.) with the same interface as the original implementation
- **Work with antibody sequences** (heavy and light chains) seamlessly
- **Maintain compatibility** with the original AbLang2 API while leveraging HuggingFace's model loading and caching capabilities

## 💡 Examples

### 🔗 AbLang2 (Paired Sequences) - Restore Example
```python
import sys
import os
from transformers import AutoModel, AutoTokenizer
from huggingface_hub import hf_hub_download

# 1. Load model and tokenizer from Hugging Face Hub
model = AutoModel.from_pretrained("hemantn/ablang2", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("hemantn/ablang2", trust_remote_code=True)

# 2. Download adapter and add to path
adapter_path = hf_hub_download(repo_id="hemantn/ablang2", filename="adapter.py")
cached_model_dir = os.path.dirname(adapter_path)
sys.path.insert(0, cached_model_dir)
from adapter import AbLang2PairedHuggingFaceAdapter

# 3. Create adapter
ablang = AbLang2PairedHuggingFaceAdapter(model=model, tokenizer=tokenizer)

# 4. Restore masked sequences
masked_seqs = [
    ['EVQ***SGGEVKKPGASVKVSCRASGYTFRNYGLTWVRQAPGQGLEWMGWISAYNGNTNYAQKFQGRVTLTTDTSTSTAYMELRSLRSDDTAVYFCAR**PGHGAAFMDVWGTGTTVTVSS',
     'DIQLTQSPLSLPVTLGQPASISCRSS*SLEASDTNIYLSWFQQRPGQSPRRLIYKI*NRDSGVPDRFSGSGSGTHFTLRISRVEADDVAVYYCMQGTHWPPAFGQGTKVDIK']
]
restored = ablang(masked_seqs, mode='restore')
print(f"Restored sequences: {restored}")
```
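Writing masked sequences by hand is error-prone. A small helper like this (hypothetical, not part of the AbLang2 package) builds a `restore`-ready input by replacing a residue range with the `*` mask character:

```python
def mask_span(seq: str, start: int, end: int) -> str:
    """Replace residues in [start, end) with the '*' mask used by restore."""
    if not (0 <= start < end <= len(seq)):
        raise ValueError("mask span out of range")
    return seq[:start] + "*" * (end - start) + seq[end:]

heavy = "EVQLLESGGEVKKPGASVKVSCRASGYTFRNYGLTW"
masked_heavy = mask_span(heavy, 3, 6)
print(masked_heavy)  # EVQ***SGGEVKKPGASVKVSCRASGYTFRNYGLTW
```

The masked string can then be passed to the adapter as in the example above, e.g. `ablang([[masked_heavy, light]], mode='restore')`.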

## 📚 Detailed Usage

For comprehensive examples of all utilities (seqcoding, rescoding, likelihood, probability, pseudo_log_likelihood, confidence, and more), see:
- **[`test_ablang2_HF_implementation.ipynb`](test_ablang2_HF_implementation.ipynb)** - Complete notebook with all utilities and advanced usage patterns

## 📖 Citation

If you use these models in your research, please cite the original AbLang2 paper:

**AbLang2:**
```
@article{Olsen2024,
  title={Addressing the antibody germline bias and its effect on language models for improved antibody design},
  author={Olsen, Tobias H. and Moal, Iain H. and Deane, Charlotte M.},
  journal={bioRxiv},
  doi={10.1101/2024.02.02.578678},
  year={2024}
}
```