---
language:
  - en
license: apache-2.0
library_name: transformers
tags:
  - patent-classification
  - green-technology
  - text-classification
  - mpnet
  - sentence-transformers
  - fine-tuned
datasets:
  - CTB2001/patents-green-50k
metrics:
  - f1
  - accuracy
  - precision
  - recall
base_model: AI-Growth-Lab/PatentSBERTa
pipeline_tag: text-classification
model-index:
  - name: PatentSBERTa-green-classifier
    results:
      - task:
          type: text-classification
          name: Green Patent Classification
        dataset:
          type: CTB2001/patents-green-50k
          name: patents-green-50k (eval split)
          split: eval
        metrics:
          - type: f1
            value: 0.8104
            name: Green-class F1
          - type: accuracy
            value: 0.8097
          - type: precision
            value: 0.8073
          - type: recall
            value: 0.8136
---

# PatentSBERTa — Green Patent Classifier

A fine-tuned [AI-Growth-Lab/PatentSBERTa](https://huggingface.co/AI-Growth-Lab/PatentSBERTa) model for **binary classification** of patent claims as *green technology* (1) or *not green* (0).

Developed as part of the **Applied Deep Learning (AAU, Spring 2025)** exam assignment on active learning, human-in-the-loop labelling, and multi-agent systems for patent classification.

## Model Details

| Property | Value |
|---|---|
| Architecture | MPNetForSequenceClassification (12 layers, 768 hidden) |
| Parameters | 109.5 M (all trainable) |
| Base model | [AI-Growth-Lab/PatentSBERTa](https://huggingface.co/AI-Growth-Lab/PatentSBERTa) |
| Max sequence length | 512 tokens |
| Labels | `0` — not green, `1` — green |
| Framework | Transformers 5.2.0, PyTorch |

## Training

### Pipeline overview

1. **Part A–B:** Frozen PatentSBERTa baseline + uncertainty-based active-learning pool selection
2. **Part C:** QLoRA-tuned Llama-3.1-8B powering a LangGraph Multi-Agent System (Advocate → Skeptic → Judge → Exception) to generate silver labels on the 15k most uncertain patents
3. **Part D:** Human-in-the-loop review of 100 critical samples → gold labels, then final full-parameter fine-tuning of PatentSBERTa
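
The uncertainty-based pool selection in Parts A–B can be sketched as follows. This is a minimal illustration, not the project's code: it assumes uncertainty is measured as the distance of the baseline's predicted green probability from 0.5, and the function name `select_most_uncertain` and the toy probabilities are hypothetical.

```python
import numpy as np

def select_most_uncertain(probs_green: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k samples whose P(green) is closest to 0.5."""
    # Assumed uncertainty score: 1.0 at p = 0.5, 0.0 at p = 0 or p = 1.
    uncertainty = 1.0 - 2.0 * np.abs(probs_green - 0.5)
    # Sort by descending uncertainty; a stable sort keeps ties in input order.
    order = np.argsort(-uncertainty, kind="stable")
    return order[:k]

# Toy pool of baseline probabilities (illustrative values only).
pool_probs = np.array([0.02, 0.48, 0.91, 0.55, 0.70, 0.50])
selected = select_most_uncertain(pool_probs, k=3)
print(selected)  # indices of the three predictions nearest 0.5
```

In the pipeline described above, the selected pool would then be passed to the multi-agent system for silver labelling.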

### Training data

| Split | Rows | Source |
|---|---|---|
| train_silver | 25,000 | Silver labels from Parts A–C |
| gold_labels | 100 (upsampled 25× to 2,500) | HITL-verified labels |
| **Total training** | **27,500** | Combined |
| eval_silver | 10,000 | Held-out balanced evaluation set |
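
The row counts in the table follow from repeating each gold row 25 times and concatenating with the silver split. A small sketch of that step, assuming simple repetition followed by a shuffle; the helper `build_training_set` and the placeholder rows are hypothetical and stand in for the real data:

```python
import random

def build_training_set(silver, gold, gold_factor=25, seed=42):
    """Concatenate silver rows with gold rows repeated gold_factor times."""
    combined = list(silver) + list(gold) * gold_factor
    # Shuffle so the repeated gold rows are spread across the training set.
    random.Random(seed).shuffle(combined)
    return combined

# Placeholder rows with the same sizes as the splits in the table.
silver = [{"text": f"silver claim {i}", "label": i % 2} for i in range(25_000)]
gold = [{"text": f"gold claim {i}", "label": i % 2} for i in range(100)]

train = build_training_set(silver, gold)
print(len(train))  # 27500
```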

### Hyperparameters

| Parameter | Value |
|---|---|
| Learning rate | 2e-5 |
| Epochs | 5 |
| Effective batch size | 128 (4 × 16 × grad_accum 2) |
| LR scheduler | Cosine with 6% warmup |
| Weight decay | 0.01 |
| Label smoothing | 0.05 |
| Gold upsample factor | 25× |
| Early stopping patience | 3 |
| Precision | bf16 |
| Seed | 42 |
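
To make the batch-size arithmetic and the learning-rate schedule concrete, here is a rough sketch in plain Python. It assumes linear warmup followed by cosine decay to zero, that steps per epoch are rounded up, and that all 5 epochs run (early stopping could end training sooner); the function names are illustrative and the exact step counts of the real run are not stated in this card.

```python
import math

def effective_batch_size(num_gpus, per_device_batch, grad_accum):
    """Samples contributing to one optimizer step across all GPUs."""
    return num_gpus * per_device_batch * grad_accum

def cosine_lr(step, total_steps, peak_lr=2e-5, warmup_ratio=0.06):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

batch = effective_batch_size(num_gpus=4, per_device_batch=16, grad_accum=2)
steps_per_epoch = math.ceil(27_500 / batch)  # assumes the last partial batch is kept
total_steps = steps_per_epoch * 5            # assumes no early stop

print(batch, steps_per_epoch, total_steps)
for step in (0, 65, total_steps // 2, total_steps):
    print(step, f"{cosine_lr(step, total_steps):.2e}")
```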

### Hardware

- 4 × NVIDIA L4 (24 GB each), DDP via `torchrun`
- AAU AI-Lab (SLURM cluster)
- Wall-clock time: ~23 minutes

## Evaluation

Evaluated on the held-out `eval_silver` split (10,000 samples, balanced).

|  | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| not-green (0) | 0.8121 | 0.8058 | 0.8090 | 5,000 |
| **green (1)** | **0.8073** | **0.8136** | **0.8104** | **5,000** |
| **Accuracy** | | | **0.8097** | 10,000 |

### Confusion Matrix

|  | Pred not-green | Pred green |
|---|---|---|
| **Actual not-green** | 4,029 | 971 |
| **Actual green** | 932 | 4,068 |
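
The green-class metrics and the overall accuracy reported above can be recomputed directly from this confusion matrix. The snippet below is a plain-Python check using the standard definitions; the helper name `binary_metrics` is illustrative.

```python
def binary_metrics(tn, fp, fn, tp):
    """Precision, recall, F1 for the positive class, plus overall accuracy."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Counts from the confusion matrix, with green (1) as the positive class.
metrics = binary_metrics(tn=4029, fp=971, fn=932, tp=4068)
rounded = {name: round(value, 4) for name, value in metrics.items()}
print(rounded)
# {'precision': 0.8073, 'recall': 0.8136, 'f1': 0.8104, 'accuracy': 0.8097}
```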

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned classifier and its tokenizer from the Hub
model_name = "CTB2001/PatentSBERTa-green-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # inference mode (disables dropout)

claim = "A wind turbine blade comprising a spar cap formed from pultruded carbon strips..."
# Truncate to the model's 512-token maximum sequence length
inputs = tokenizer(claim, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    logits = model(**inputs).logits
    pred = torch.argmax(logits, dim=-1).item()  # 0 = not green, 1 = green

print("green" if pred == 1 else "not-green")
```

## Intended Use

- **Primary:** Classifying patent claims as green / not-green technology
- **Domain:** Patent text (US/EP/WO first claims)
- **Not suitable for:** General-purpose NLI, legal advice, or production patent screening without additional validation

## Limitations

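- Trained largely on, and evaluated entirely against, machine-generated silver labels. Some of those labels are likely wrong, so the reported metrics measure agreement with the silver labels rather than human-verified accuracy
- Only 100 gold (human-verified) labels were available; upsampling them 25× increases their weight in training but repeats the same examples and adds no new information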
- Performance on out-of-domain patent offices or languages is unknown

## Citation

```bibtex
@misc{trost-bertelsen2025patentsberta-green,
  author       = {Trøst-Bertelsen, Christian},
  title        = {PatentSBERTa Green Patent Classifier},
  year         = {2025},
  howpublished = {Hugging Face Model Hub},
  url          = {https://huggingface.co/CTB2001/PatentSBERTa-green-classifier}
}
```

## Author

**Christian Trøst-Bertelsen** — Aalborg University, Student ID 20224083
Course: Applied Deep Learning, 8th semester, Spring 2025