arxiv:2510.22264

PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text Embedding

Published on Oct 25

Submitted by iliass ayaou on Oct 29

Authors: Iliass Ayaou

Abstract

PatenTEB is a comprehensive benchmark for patent text embeddings with 15 tasks, and the patembed model family demonstrates strong generalization across various patent-specific challenges.

AI-generated summary

Patent text embeddings enable prior art search, technology landscaping, and patent analysis, yet existing benchmarks inadequately capture patent-specific challenges. We introduce PatenTEB, a comprehensive benchmark comprising 15 tasks across retrieval, classification, paraphrase, and clustering, with 2.06 million examples. PatenTEB employs domain-stratified splits, domain-specific hard negative mining, and systematic coverage of asymmetric fragment-to-document matching scenarios absent from general embedding benchmarks. We develop the patembed model family through multi-task training, spanning 67M to 344M parameters with context lengths up to 4096 tokens. External validation shows strong generalization: patembed-base achieves state-of-the-art results on MTEB BigPatentClustering.v2 (0.494 V-measure vs. 0.445 previous best), while patembed-large achieves 0.377 NDCG@100 on DAPFAM. Systematic ablations reveal that multi-task training improves external generalization despite minor benchmark costs, and that domain-pretrained initialization provides consistent advantages across task families. All resources will be made available at https://github.com/iliass-y/patenteb.

Keywords: patent retrieval, sentence embeddings, multi-task learning, asymmetric retrieval, benchmark evaluation, contrastive learning.
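The external results above are reported as V-measure (clustering) and NDCG@100 (retrieval). For readers unfamiliar with these metrics, the following is a minimal sketch of their standard textbook definitions. It is not the authors' evaluation code, and the function names `ndcg_at_k` and `v_measure` are hypothetical. The NDCG sketch assumes the input list holds the relevance of every judged document for the query, so the ideal ranking can be derived from it.

```python
import math
from collections import Counter


def ndcg_at_k(relevances, k=100):
    """NDCG@k for one query.

    `relevances` lists graded relevance scores in ranked order.
    Assumption: it covers every judged document for the query,
    so sorting it gives the ideal ranking.
    """
    def dcg(rels):
        # Standard log2 position discount; rank 1 has discount log2(2) = 1.
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(target, given):
    """H(target | given) estimated from label co-occurrence counts."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum(
        (c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items()
    )


def v_measure(true_labels, pred_clusters):
    """Harmonic mean of homogeneity and completeness (beta = 1)."""
    h_c = _entropy(true_labels)
    h_k = _entropy(pred_clusters)
    homogeneity = (
        1.0 if h_c == 0
        else 1.0 - _conditional_entropy(true_labels, pred_clusters) / h_c
    )
    completeness = (
        1.0 if h_k == 0
        else 1.0 - _conditional_entropy(pred_clusters, true_labels) / h_k
    )
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)
```

Both metrics lie in [0, 1], with higher being better. V-measure is invariant to how cluster IDs are named, which is why it suits comparing embedding-based clusterings against patent class labels.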

Community

datalyes

Paper author Paper submitter about 6 hours ago

PatenTEB addresses a critical gap in patent text understanding by providing the first comprehensive benchmark specifically designed for patent text embeddings, along with a family of state-of-the-art models.

Key Highlights

🎯 15 benchmark tasks across retrieval, classification, paraphrase, and clustering
📊 319K test examples (publicly released) + 1.74M train/val (planned release)
🚀 12 trained models (67M-344M parameters) achieving SOTA on PatenTEB and on external patent benchmarks
🔄 MTEB integration upcoming