---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- patent-classification
- green-technology
- text-classification
- mpnet
- sentence-transformers
- fine-tuned
datasets:
- CTB2001/patents-green-50k
metrics:
- f1
- accuracy
- precision
- recall
base_model: AI-Growth-Lab/PatentSBERTa
pipeline_tag: text-classification
model-index:
- name: PatentSBERTa-green-classifier
  results:
  - task:
      type: text-classification
      name: Green Patent Classification
    dataset:
      type: CTB2001/patents-green-50k
      name: patents-green-50k (eval split)
      split: eval
    metrics:
    - type: f1
      value: 0.8104
      name: Green-class F1
    - type: accuracy
      value: 0.8097
    - type: precision
      value: 0.8073
    - type: recall
      value: 0.8136
---
# PatentSBERTa — Green Patent Classifier
A fine-tuned [AI-Growth-Lab/PatentSBERTa](https://huggingface.co/AI-Growth-Lab/PatentSBERTa) model for **binary classification** of patent claims as *green technology* (1) or *not green* (0).
Developed as part of the **Applied Deep Learning (AAU, Spring 2025)** exam assignment on active learning, human-in-the-loop labelling, and multi-agent systems for patent classification.
## Model Details
| Property | Value |
|---|---|
| Architecture | MPNetForSequenceClassification (12 layers, 768 hidden) |
| Parameters | 109.5 M (all trainable) |
| Base model | [AI-Growth-Lab/PatentSBERTa](https://huggingface.co/AI-Growth-Lab/PatentSBERTa) |
| Max sequence length | 512 tokens |
| Labels | `0` — not green, `1` — green |
| Framework | Transformers 5.2.0, PyTorch |
## Training
### Pipeline overview
1. **Parts A–B:** Frozen PatentSBERTa baseline + uncertainty-based active-learning pool selection
2. **Part C:** QLoRA-tuned Llama-3.1-8B powering a LangGraph Multi-Agent System (Advocate → Skeptic → Judge → Exception) to generate silver labels on the 15k most uncertain patents
3. **Part D:** Human-in-the-loop review of 100 critical samples → gold labels, then final full-parameter fine-tuning of PatentSBERTa
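As a rough illustration of the pool selection in Parts A–B, the sketch below ranks unlabeled patents by how close the baseline's predicted green probability is to 0.5 and keeps the `k` most uncertain. The card does not state which uncertainty measure was used, so the scoring rule, the `select_most_uncertain` name, and the toy probabilities are all assumptions.

```python
def select_most_uncertain(p_green, k):
    """Return indices of the k samples whose P(green) is closest to 0.5."""
    # Distance from 0.5 is 0 for a maximally uncertain binary prediction.
    ranked = sorted(range(len(p_green)), key=lambda i: abs(p_green[i] - 0.5))
    return ranked[:k]

# Toy probabilities standing in for the frozen baseline's outputs (hypothetical).
probs = [0.02, 0.48, 0.91, 0.55, 0.70]
picked = select_most_uncertain(probs, k=2)
print(picked)  # [1, 3]
```

In the pipeline described above, `k` would correspond to the 15k patents passed on to Part C.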
### Training data
| Split | Rows | Source |
|---|---|---|
| train_silver | 25,000 | Silver labels from Parts A–C |
| gold_labels | 100 (× 25 upsampled = 2,500) | HITL-verified labels |
| **Total training** | **27,500** | Combined |
| eval_silver | 10,000 | Held-out balanced evaluation set |
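The row counts above follow from repeating each gold example 25 times and appending the copies to the silver set. A minimal sketch with placeholder records is shown here; the `build_training_set` name and the record layout are hypothetical, and the shuffle is an assumption, since the card does not say how the combined set was ordered.

```python
import random

def build_training_set(silver, gold, gold_factor=25, seed=42):
    """Concatenate silver rows with `gold_factor` copies of each gold row."""
    combined = list(silver) + list(gold) * gold_factor
    random.Random(seed).shuffle(combined)  # shuffling is an assumption
    return combined

# Placeholder records; real rows would hold claim text and a 0/1 label.
silver = [("silver", i) for i in range(25_000)]
gold = [("gold", i) for i in range(100)]
train = build_training_set(silver, gold)
print(len(train))  # 27500
```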
### Hyperparameters
| Parameter | Value |
|---|---|
| Learning rate | 2e-5 |
| Epochs | 5 |
| Effective batch size | 128 (4 GPUs × 16 per device × 2 gradient-accumulation steps) |
| LR scheduler | Cosine with 6% warmup |
| Weight decay | 0.01 |
| Label smoothing | 0.05 |
| Gold upsample factor | 25× |
| Early stopping patience | 3 |
| Precision | bf16 |
| Seed | 42 |
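To make the schedule concrete, the sketch below computes a linear-warmup, cosine-decay learning rate from the values in the table. It assumes 27,500 training rows, a partial last batch that still counts as a step, decay to zero, and no early stop; the exact step count in the real run may differ, and `cosine_lr` is an illustrative helper rather than the training code.

```python
import math

def cosine_lr(step, total_steps, base_lr=2e-5, warmup_frac=0.06):
    """Linear warmup to base_lr, then cosine decay to zero."""
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

steps_per_epoch = math.ceil(27_500 / 128)  # 215 optimizer steps per epoch
total_steps = steps_per_epoch * 5          # 1075 steps over 5 epochs
for step in (0, 64, 500, total_steps):
    print(step, f"{cosine_lr(step, total_steps):.2e}")
```

Under these assumptions the warmup lasts 64 steps, after which the rate decays from 2e-5 towards zero.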
### Hardware
- 4 × NVIDIA L4 (24 GB each), DDP via `torchrun`
- AAU AI-Lab (SLURM cluster)
- Wall-clock time: ~23 minutes
## Evaluation
Evaluated on the held-out `eval_silver` split (10,000 samples, balanced).
| | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| not-green (0) | 0.8121 | 0.8058 | 0.8090 | 5,000 |
| **green (1)** | **0.8073** | **0.8136** | **0.8104** | **5,000** |
| **Accuracy** | | | **0.8097** | 10,000 |
### Confusion Matrix
| | Pred not-green | Pred green |
|---|---|---|
| **Actual not-green** | 4,029 | 971 |
| **Actual green** | 932 | 4,068 |
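The green-class scores in the evaluation table can be reproduced from these four counts. A short sketch follows, treating green (1) as the positive class; `binary_metrics` is an illustrative helper name.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 for the positive class, plus overall accuracy."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Counts read off the confusion matrix, with green (1) as the positive class.
p, r, f1, acc = binary_metrics(tp=4068, fp=971, fn=932, tn=4029)
print(f"{p:.4f} {r:.4f} {f1:.4f} {acc:.4f}")  # 0.8073 0.8136 0.8104 0.8097
```

Swapping the roles of the two classes (`tp=4029, fp=932, fn=971, tn=4068`) gives the not-green row.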
## Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "CTB2001/PatentSBERTa-green-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # inference mode (disables dropout)

claim = "A wind turbine blade comprising a spar cap formed from pultruded carbon strips..."
# Truncate to the model's 512-token limit
inputs = tokenizer(claim, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    logits = model(**inputs).logits

pred = torch.argmax(logits, dim=-1).item()  # 0 = not green, 1 = green
print("green" if pred == 1 else "not-green")
```
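For screening workflows it can be useful to work with a green-class probability rather than the argmax label. The sketch below applies a two-class softmax to a pair of logits and compares the result against a threshold. The `green_probability` and `label_from_logits` helpers, the example logits, and the thresholds are illustrative assumptions, not part of the released model.

```python
import math

def green_probability(logit_not_green, logit_green):
    """Two-class softmax, returned for the green class (index 1)."""
    # Equivalent to a sigmoid of the logit difference, in a stable form.
    diff = logit_green - logit_not_green
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)

def label_from_logits(logits, threshold=0.5):
    """Return (label, P(green)) for one [not-green, green] logit pair."""
    p = green_probability(logits[0], logits[1])
    return ("green" if p >= threshold else "not-green"), p

# Hypothetical logits, e.g. obtained via `model(**inputs).logits[0].tolist()`.
print(label_from_logits([-0.3, 1.1]))
print(label_from_logits([-0.3, 1.1], threshold=0.9))
```

With the default threshold of 0.5 this agrees with the argmax rule apart from exact ties; raising the threshold trades recall for precision on the green class.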
## Intended Use
- **Primary:** Classifying patent claims as green / not-green technology
- **Domain:** Patent text (US/EP/WO first claims)
- **Not suitable for:** General-purpose NLI, legal advice, or production patent screening without additional validation
## Limitations
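- Trained and evaluated on silver (machine-generated) labels, some of which may be noisy; the reported metrics therefore measure agreement with the silver labels, not with human-verified ground truth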
- Only 100 gold (human-verified) labels were available — upsampled 25× to amplify signal
- Performance on out-of-domain patent offices or languages is unknown
## Citation
```bibtex
@misc{trost-bertelsen2025patentsberta-green,
author = {Trøst-Bertelsen, Christian},
title = {PatentSBERTa Green Patent Classifier},
year = {2025},
howpublished = {Hugging Face Model Hub},
url = {https://huggingface.co/CTB2001/PatentSBERTa-green-classifier}
}
```
## Author
**Christian Trøst-Bertelsen** — Aalborg University, Student ID 20224083
Course: Applied Deep Learning, 8th semester, Spring 2025