# PatenTEB: A Comprehensive Benchmark for Patent Text Embeddings
I'm very excited to finally announce the (partial) release of **PatenTEB**, the first comprehensive benchmark designed specifically for evaluating text embedding models on patent-specific tasks!
## What's Released
### 15 Benchmark Datasets (NEW to MTEB)
All tasks are **completely new** and not previously available in MTEB or other benchmarks:
- **3 Classification tasks**: Patent citation timing, NLI directionality, IPC3 technology classification
- **2 Clustering tasks**: IPC-based and inventor-based patent grouping
- **8 Retrieval tasks**: 3 symmetric (IN/MIXED/OUT domain) + 5 asymmetric (fragment-to-document matching)
- **2 Paraphrase tasks**: Problem and solution semantic similarity detection
**All datasets**: [huggingface.co/datalyes](https://huggingface.co/datalyes)
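To make the asymmetric (fragment-to-document) retrieval setting concrete, here is a minimal sketch of how such a task can be scored: embed a fragment, rank candidate patents by cosine similarity, and compute NDCG@k, a metric commonly used for retrieval in MTEB. The vectors, document ids and helper names below are hypothetical stand-ins, not the PatenTEB data or evaluation code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_documents(query_vec, doc_vecs):
    """Return document ids sorted by descending similarity to the query."""
    scores = {doc_id: cosine(query_vec, vec) for doc_id, vec in doc_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k with binary relevance (a document is relevant or it is not)."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


# Toy, hand-made vectors standing in for model embeddings (hypothetical data).
doc_vecs = {
    "patent_A": [0.9, 0.1, 0.0],
    "patent_B": [0.1, 0.9, 0.1],
    "patent_C": [0.0, 0.2, 0.9],
}
fragment_vec = [0.8, 0.2, 0.1]  # stands in for an embedded fragment of patent_A

ranking = rank_documents(fragment_vec, doc_vecs)
score = ndcg_at_k(ranking, {"patent_A"}, k=10)
print(ranking, score)
```

In a real run, the toy vectors would be replaced by embeddings produced by the model under evaluation, and the relevance labels would come from the dataset.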
### 13 Trained Models
The **patembed model family** (67M-344M parameters):
- 6 core models (large, base, base_small, small, mini, nano)
- 3 long-context variants (1024, 2048, 4096 tokens)
- 3 ablation models (no prompts, retrieval-only, no classification)
**All models**: [huggingface.co/datalyes](https://huggingface.co/datalyes)
## Resources
- **Paper**: [arXiv:2510.22264](https://arxiv.org/abs/2510.22264)
- **Datasets**: [huggingface.co/datalyes](https://huggingface.co/datalyes) (15 tasks)
- **Models**: [huggingface.co/datalyes](https://huggingface.co/datalyes)
- **GitHub**: [github.com/iliass-y/patenteb](https://github.com/iliass-y/patenteb)
- **License**: CC BY-NC-SA 4.0 (non-commercial research use)
## Acknowledgments
Big thanks to:
- **Lens.org** for providing access to raw patent data at a cost that small labs like ours can afford
- The **MTEB community** for the excellent benchmark framework and the inspiration
- The **Sentence Transformers** team for the powerful embedding library
#patent #nlp #embeddings #benchmark #retrieval #classification #mteb #sentence-transformers