---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- patent-classification
- green-technology
- text-classification
- mpnet
- sentence-transformers
- fine-tuned
datasets:
- CTB2001/patents-green-50k
metrics:
- f1
- accuracy
- precision
- recall
base_model: AI-Growth-Lab/PatentSBERTa
pipeline_tag: text-classification
model-index:
- name: PatentSBERTa-green-classifier
  results:
  - task:
      type: text-classification
      name: Green Patent Classification
    dataset:
      type: CTB2001/patents-green-50k
      name: patents-green-50k (eval split)
      split: eval
    metrics:
    - type: f1
      value: 0.8104
      name: Green-class F1
    - type: accuracy
      value: 0.8097
    - type: precision
      value: 0.8073
    - type: recall
      value: 0.8136
---

# PatentSBERTa Green Patent Classifier

A fine-tuned [AI-Growth-Lab/PatentSBERTa](https://huggingface.co/AI-Growth-Lab/PatentSBERTa) model for **binary classification** of patent claims as *green technology* (1) or *not green* (0).

Developed as part of the **Applied Deep Learning (AAU, Spring 2025)** exam assignment on active learning, human-in-the-loop labelling, and multi-agent systems for patent classification.

## Model Details

| Property | Value |
|---|---|
| Architecture | MPNetForSequenceClassification (12 layers, 768 hidden) |
| Parameters | 109.5 M (all trainable) |
| Base model | [AI-Growth-Lab/PatentSBERTa](https://huggingface.co/AI-Growth-Lab/PatentSBERTa) |
| Max sequence length | 512 tokens |
| Labels | `0` = not green, `1` = green |
| Framework | Transformers 5.2.0, PyTorch |

## Training

### Pipeline overview

1. **Part A–B:** Frozen PatentSBERTa baseline + uncertainty-based active-learning pool selection
2. **Part C:** QLoRA-tuned Llama-3.1-8B powering a LangGraph Multi-Agent System (Advocate → Skeptic → Judge → Exception) to generate silver labels on the 15k most uncertain patents
3. **Part D:** Human-in-the-loop review of 100 critical samples → gold labels, then final full-parameter fine-tuning of PatentSBERTa

### Training data

| Split | Rows | Source |
|---|---|---|
| train_silver | 25,000 | Silver labels from Parts A–C |
| gold_labels | 100 (× 25 upsampled = 2,500) | HITL-verified labels |
| **Total training** | **27,500** | Combined |
| eval_silver | 10,000 | Held-out balanced evaluation set |

### Hyperparameters

| Parameter | Value |
|---|---|
| Learning rate | 2e-5 |
| Epochs | 5 |
| Effective batch size | 128 (4 × 16 × grad_accum 2) |
| LR scheduler | Cosine with 6% warmup |
| Weight decay | 0.01 |
| Label smoothing | 0.05 |
| Gold upsample factor | 25× |
| Early stopping patience | 3 |
| Precision | bf16 |
| Seed | 42 |

### Hardware

- 4 × NVIDIA L4 (24 GB each), DDP via `torchrun`
- AAU AI-Lab (SLURM cluster)
- Wall-clock time: ~23 minutes

## Evaluation

Evaluated on the held-out `eval_silver` split (10,000 samples, balanced).

| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| not-green (0) | 0.8121 | 0.8058 | 0.8090 | 5,000 |
| **green (1)** | **0.8073** | **0.8136** | **0.8104** | **5,000** |
| **Accuracy** | | | **0.8097** | 10,000 |

### Confusion Matrix

| | Pred not-green | Pred green |
|---|---|---|
| **Actual not-green** | 4,029 | 971 |
| **Actual green** | 932 | 4,068 |

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "CTB2001/PatentSBERTa-green-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # inference mode (disables dropout)

claim = "A wind turbine blade comprising a spar cap formed from pultruded carbon strips..."

# Truncate to the model's 512-token maximum sequence length
inputs = tokenizer(claim, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    logits = model(**inputs).logits

# Label 1 = green, label 0 = not green
pred = torch.argmax(logits, dim=-1).item()
print("green" if pred == 1 else "not-green")
```

## Intended Use

- **Primary:** Classifying patent claims as green / not-green technology
- **Domain:** Patent text (US/EP/WO first claims)
- **Not suitable for:** General-purpose NLI, legal advice, or production patent screening without additional validation

## Limitations

- Trained and evaluated on silver labels (machine-generated); a small fraction may be noisy, so the reported metrics measure agreement with silver labels rather than with human-verified ground truth
- Only 100 gold (human-verified) labels were available; they were upsampled 25× to amplify their signal
- Performance on out-of-domain patent offices or languages is unknown

## Citation

```bibtex
@misc{trost-bertelsen2025patentsberta-green,
  author       = {Trøst-Bertelsen, Christian},
  title        = {PatentSBERTa Green Patent Classifier},
  year         = {2025},
  howpublished = {Hugging Face Model Hub},
  url          = {https://huggingface.co/CTB2001/PatentSBERTa-green-classifier}
}
```

## Author

**Christian Trøst-Bertelsen**, Aalborg University, Student ID 20224083

Course: Applied Deep Learning, 8th semester, Spring 2025
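
## Appendix: Recomputing the reported metrics

The per-class precision, recall, F1, and accuracy in the Evaluation section follow directly from the four confusion-matrix counts. The sketch below recomputes them in plain Python as a sanity check. It is not the evaluation script used for this model; the helper name `prf` and the variable names are illustrative only.

```python
# Counts copied from the confusion matrix in this card
# (rows = actual class, columns = predicted class).
tn, fp = 4029, 971   # actual not-green: predicted not-green / predicted green
fn, tp = 932, 4068   # actual green:     predicted not-green / predicted green


def prf(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for one class from raw counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Green is the positive class.
green = prf(tp, fp, fn)
# For the not-green class the roles swap: TN acts as its "true positives",
# FN as its false positives, and FP as its false negatives.
not_green = prf(tn, fn, fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("green     P/R/F1:", [round(v, 4) for v in green])
print("not-green P/R/F1:", [round(v, 4) for v in not_green])
print("accuracy:", round(accuracy, 4))
```

Rounded to four decimals, these values agree with the evaluation table. Because the split is balanced (5,000 samples per class), accuracy equals the mean of the two per-class recalls.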