---
language:
  - en
license: apache-2.0
library_name: transformers
tags:
  - patent-classification
  - green-technology
  - text-classification
  - mpnet
  - sentence-transformers
  - fine-tuned
datasets:
  - CTB2001/patents-green-50k
metrics:
  - f1
  - accuracy
  - precision
  - recall
base_model: AI-Growth-Lab/PatentSBERTa
pipeline_tag: text-classification
model-index:
  - name: PatentSBERTa-green-classifier
    results:
      - task:
          type: text-classification
          name: Green Patent Classification
        dataset:
          type: CTB2001/patents-green-50k
          name: patents-green-50k (eval split)
          split: eval
        metrics:
          - type: f1
            value: 0.8104
            name: Green-class F1
          - type: accuracy
            value: 0.8097
          - type: precision
            value: 0.8073
          - type: recall
            value: 0.8136
---

# PatentSBERTa — Green Patent Classifier

A fine-tuned [AI-Growth-Lab/PatentSBERTa](https://huggingface.co/AI-Growth-Lab/PatentSBERTa) model for **binary classification** of patent claims as *green technology* (1) or *not green* (0).

Developed as part of the **Applied Deep Learning (AAU, Spring 2025)** exam assignment on active learning, human-in-the-loop labelling, and multi-agent systems for patent classification.

## Model Details

| Property | Value |
|---|---|
| Architecture | MPNetForSequenceClassification (12 layers, 768 hidden) |
| Parameters | 109.5 M (all trainable) |
| Base model | [AI-Growth-Lab/PatentSBERTa](https://huggingface.co/AI-Growth-Lab/PatentSBERTa) |
| Max sequence length | 512 tokens |
| Labels | `0` — not green, `1` — green |
| Framework | Transformers 5.2.0, PyTorch |

## Training

### Pipeline overview

1. **Part A–B:** Frozen PatentSBERTa baseline + uncertainty-based active-learning pool selection
2. **Part C:** QLoRA-tuned Llama-3.1-8B powering a LangGraph Multi-Agent System (Advocate → Skeptic → Judge → Exception) to generate silver labels on the 15k most uncertain patents
3. **Part D:** Human-in-the-loop review of 100 critical samples → gold labels, then final full-parameter fine-tuning of PatentSBERTa
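
The card only states that pool selection was "uncertainty-based". One common way to do this for a binary classifier is to rank items by how close their predicted probability is to 0.5. The sketch below illustrates that idea; the function name, the scoring rule, and the toy probabilities are assumptions for illustration, not the author's code.

```python
# Hypothetical sketch of uncertainty-based pool selection.
# The scoring rule (distance from 0.5) is an assumption, not the author's method.

def select_most_uncertain(probs, k):
    """Return the indices of the k items whose P(green) is closest to 0.5."""
    # A prediction near 0.5 means the baseline classifier is least sure.
    ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return ranked[:k]

# Made-up P(green) scores from a baseline classifier.
pool_probs = [0.97, 0.52, 0.10, 0.45, 0.80, 0.49]
uncertain_idx = select_most_uncertain(pool_probs, k=3)
print(uncertain_idx)  # [5, 1, 3]
```

In the pipeline described above, `k` would correspond to the 15k patents passed on to the multi-agent labelling step.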

### Training data

| Split | Rows | Source |
|---|---|---|
| train_silver | 25,000 | Silver labels from Parts A–C |
| gold_labels | 100 unique (upsampled 25× to 2,500) | HITL-verified labels |
| **Total training** | **27,500** | Combined |
| eval_silver | 10,000 | Held-out balanced evaluation set |
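
The row counts in the table can be reproduced with a small sketch of the upsampling step. The placeholder records and variable names below are hypothetical; only the counts (25,000 silver rows, 100 gold rows, factor 25) come from this card.

```python
# Hypothetical sketch: placeholder records stand in for the real patent rows.
GOLD_UPSAMPLE_FACTOR = 25

silver = [{"text": f"silver claim {i}", "label": i % 2} for i in range(25_000)]
gold = [{"text": f"gold claim {i}", "label": i % 2} for i in range(100)]

# Repeat every gold row 25 times so human-verified labels carry more weight.
train_rows = silver + gold * GOLD_UPSAMPLE_FACTOR

gold_share = len(gold) * GOLD_UPSAMPLE_FACTOR / len(train_rows)
print(len(train_rows), round(gold_share, 3))  # 27500 0.091
```

After upsampling, the gold rows make up roughly 9% of the training set even though they are only 100 unique examples.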

### Hyperparameters

| Parameter | Value |
|---|---|
| Learning rate | 2e-5 |
| Epochs | 5 |
| Effective batch size | 128 (4 × 16 × grad_accum 2) |
| LR scheduler | Cosine with 6% warmup |
| Weight decay | 0.01 |
| Label smoothing | 0.05 |
| Gold upsample factor | 25× |
| Early stopping patience | 3 |
| Precision | bf16 |
| Seed | 42 |
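
As a rough check on what these settings imply, the sketch below works out the effective batch size and an approximate step count. It assumes the "4 × 16" in the table means 4 GPUs with a per-device batch of 16, that a final partial batch counts as one step, and that training runs all 5 epochs without early stopping; the actual trainer may round differently.

```python
import math

# Values taken from the training-data and hyperparameter tables.
num_gpus = 4            # assumption: the "4" in "4 × 16" is the GPU count
per_device_batch = 16   # assumption: the "16" is the per-device batch size
grad_accum = 2
train_rows = 27_500
epochs = 5
warmup_fraction = 0.06

effective_batch = num_gpus * per_device_batch * grad_accum
# Approximation: the last partial batch still counts as one optimizer step.
steps_per_epoch = math.ceil(train_rows / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(warmup_fraction * total_steps)

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
# 128 215 1075 65
```

Under these assumptions the cosine schedule warms up for about 65 of roughly 1,075 optimizer steps.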

### Hardware

- 4 × NVIDIA L4 (24 GB each), DDP via `torchrun`
- AAU AI-Lab (SLURM cluster)
- Wall-clock time: ~23 minutes

## Evaluation

Evaluated on the held-out `eval_silver` split (10,000 samples, balanced).

|  | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| not-green (0) | 0.8121 | 0.8058 | 0.8090 | 5,000 |
| **green (1)** | **0.8073** | **0.8136** | **0.8104** | **5,000** |
| **Accuracy** | | | **0.8097** | 10,000 |

### Confusion Matrix

|  | Pred not-green | Pred green |
|---|---|---|
| **Actual not-green** | 4,029 | 971 |
| **Actual green** | 932 | 4,068 |
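
The green-class metrics reported above follow directly from these counts. A minimal sketch of the arithmetic, using plain Python and the standard definitions of precision, recall, F1, and accuracy:

```python
# Counts copied from the confusion matrix (rows = actual, columns = predicted).
tn, fp = 4029, 971   # actual not-green
fn, tp = 932, 4068   # actual green

precision = tp / (tp + fp)   # of predicted green, how many are green
recall = tp / (tp + fn)      # of actual green, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"{precision:.4f} {recall:.4f} {f1:.4f} {accuracy:.4f}")
# 0.8073 0.8136 0.8104 0.8097
```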

## Usage

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "CTB2001/PatentSBERTa-green-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # inference mode (disables dropout)

claim = "A wind turbine blade comprising a spar cap formed from pultruded carbon strips..."
# Truncate to the model's 512-token maximum sequence length.
inputs = tokenizer(claim, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():  # no gradients needed for inference
    logits = model(**inputs).logits

# Label ids: 0 = not green, 1 = green.
pred = torch.argmax(logits, dim=-1).item()
print("green" if pred == 1 else "not-green")
```

## Intended Use

- **Primary:** Classifying patent claims as green / not-green technology
- **Domain:** Patent text (US/EP/WO first claims)
- **Not suitable for:** General-purpose NLI, legal advice, or production patent screening without additional validation

## Limitations

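- Trained and evaluated on silver (machine-generated) labels, so the reported metrics measure agreement with those labels rather than with human judgment; some silver labels are likely noisy
- Only 100 gold (human-verified) labels were available; they were upsampled 25× to amplify their signal
- Performance on patent offices outside US/EP/WO or on non-English text has not been tested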

## Citation

```bibtex
@misc{trost-bertelsen2025patentsberta-green,
  author       = {Trøst-Bertelsen, Christian},
  title        = {PatentSBERTa Green Patent Classifier},
  year         = {2025},
  howpublished = {Hugging Face Model Hub},
  url          = {https://huggingface.co/CTB2001/PatentSBERTa-green-classifier}
}
```

## Author

**Christian Trøst-Bertelsen**, Aalborg University, Student ID 20224083

Course: Applied Deep Learning, 8th semester, Spring 2025