arxiv:2510.22264

PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text Embedding

Published on Oct 25 · Submitted by iliass ayaou on Oct 29

Abstract

PatenTEB is a comprehensive benchmark for patent text embeddings with 15 tasks, and the patembed model family demonstrates strong generalization across various patent-specific challenges.

AI-generated summary

Patent text embeddings enable prior art search, technology landscaping, and patent analysis, yet existing benchmarks inadequately capture patent-specific challenges. We introduce PatenTEB, a comprehensive benchmark comprising 15 tasks across retrieval, classification, paraphrase, and clustering, with 2.06 million examples. PatenTEB employs domain-stratified splits, domain-specific hard negative mining, and systematic coverage of asymmetric fragment-to-document matching scenarios absent from general embedding benchmarks. We develop the patembed model family through multi-task training, spanning 67M to 344M parameters with context lengths up to 4096 tokens. External validation shows strong generalization: patembed-base achieves state-of-the-art on MTEB BigPatentClustering.v2 (0.494 V-measure vs. 0.445 previous best), while patembed-large achieves 0.377 NDCG@100 on DAPFAM. Systematic ablations reveal that multi-task training improves external generalization despite minor benchmark costs, and that domain-pretrained initialization provides consistent advantages across task families. All resources will be made available at https://github.com/iliass-y/patenteb.

Keywords: patent retrieval, sentence embeddings, multi-task learning, asymmetric retrieval, benchmark evaluation, contrastive learning.
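The retrieval result above is reported as NDCG@100. As a minimal sketch of how a score of that kind is computed, the snippet below implements NDCG@k with linear gain. The function names `dcg_at_k` and `ndcg_at_k` are illustrative, and the linear-gain variant is an assumption; the paper's evaluation code may use a different gain function or library.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked relevance grades."""
    # Rank positions are 1-based in the formula, so position i (0-based)
    # is discounted by log2(i + 2).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k, all_relevances=None):
    """NDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    ranked_relevances: relevance grades in the order the system returned them.
    all_relevances: grades of every judged document for the query (assumed
        optional here); defaults to the returned list when omitted.
    """
    pool = ranked_relevances if all_relevances is None else all_relevances
    ideal_dcg = dcg_at_k(sorted(pool, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents: define the score as 0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# One relevant document, returned at rank 2 instead of rank 1.
score = ndcg_at_k([0, 1, 0], k=100)
```

A benchmark-level figure is typically the mean of this per-query score over all test queries.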

Community


PatenTEB addresses a critical gap in patent text understanding by providing the first comprehensive benchmark specifically designed for patent text embeddings, along with a family of state-of-the-art models.

Key Highlights

  • 🎯 15 benchmark tasks across retrieval, classification, paraphrase, and clustering
  • 📊 319K test examples (publicly released) + 1.74M train/val (planned release)
  • 🚀 12 trained models (67M-344M parameters) achieving SOTA on PatenTEB and on external patent benchmarks
  • 🔄 MTEB integration upcoming

