Abstract
A new benchmark evaluates embedding models' ability to handle long-horizon memory retrieval tasks, revealing that performance in traditional passage retrieval does not generalize to complex memory retrieval scenarios.
Memory embeddings are crucial for memory-augmented systems, such as OpenClaw, but their evaluation is underexplored in current text embedding benchmarks, which narrowly focus on traditional passage retrieval and fail to assess models' ability to handle long-horizon memory retrieval tasks involving fragmented, context-dependent, and temporally distant information. To address this, we introduce the Long-horizon Memory Embedding Benchmark (LMEB), a comprehensive framework that evaluates embedding models' capabilities on complex, long-horizon memory retrieval tasks. LMEB spans 22 datasets and 193 zero-shot retrieval tasks across 4 memory types (episodic, dialogue, semantic, and procedural), with both AI-generated and human-annotated data. These memory types differ in their level of abstraction and temporal dependency, capturing distinct aspects of memory retrieval that reflect diverse real-world challenges. We evaluate 15 widely used embedding models, ranging from hundreds of millions to ten billion parameters. The results reveal that (1) LMEB provides a reasonable level of difficulty; (2) larger models do not always perform better; and (3) LMEB and MTEB scores are largely orthogonal. This suggests that the field has yet to converge on a universal model capable of excelling across all memory retrieval tasks, and that performance in traditional passage retrieval may not generalize to long-horizon memory retrieval. In summary, by providing a standardized and reproducible evaluation framework, LMEB fills a crucial gap in memory embedding evaluation, driving further advancements in text embedding for long-term, context-dependent memory retrieval. LMEB is available at https://github.com/KaLM-Embedding/LMEB.
Community
Welcome to the Long-horizon Memory Embedding Benchmark (LMEB)! Unlike existing text embedding benchmarks that narrowly focus on passage retrieval, LMEB is designed to evaluate embedding models' ability to handle complex, long-horizon memory retrieval tasks involving fragmented, context-dependent, and temporally distant information. LMEB spans 22 diverse datasets and 193 retrieval tasks across 4 memory types.
Memory retrieval is a crucial capability for memory-augmented systems such as OpenClaw. By evaluating this capability across embedding models, LMEB helps OpenClaw identify the most suitable embedding models, enhancing its ability to adapt, remember, and make personalized, user-aware decisions.
The code is being prepared.
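In the meantime, here is a minimal sketch of the kind of zero-shot memory retrieval evaluation LMEB targets: embed a query and a set of memory entries, then rank the entries by similarity. The model name, toy memory entries, and scoring below are illustrative assumptions, not LMEB's actual datasets or evaluation protocol.

```python
# A minimal sketch of a zero-shot memory retrieval setup (assumed workflow,
# not LMEB's official evaluation code; model name and data are placeholders).
from sentence_transformers import SentenceTransformer, util

# Any embedding model under evaluation could be plugged in here.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy memory store with fragmented, temporally distant entries (invented examples).
memories = [
    "2023-01-04: User mentioned they are allergic to peanuts.",
    "2023-03-19: User booked a trip to Kyoto for cherry blossom season.",
    "2023-07-02: User switched their project from Python 2 to Python 3.",
]
query = "What food should I avoid serving this user?"

# Embed the query and memory entries, then rank entries by cosine similarity.
memory_emb = model.encode(memories, convert_to_tensor=True, normalize_embeddings=True)
query_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)
scores = util.cos_sim(query_emb, memory_emb)[0]

for rank, idx in enumerate(scores.argsort(descending=True), start=1):
    print(f"{rank}. ({scores[idx].item():.3f}) {memories[int(idx)]}")
```

Aggregating a ranking metric such as nDCG@10 over many such queries would then yield a benchmark-style score per task.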
Interesting breakdown of this paper on arXivLens: https://arxivlens.com/PaperView/Details/lmeb-long-horizon-memory-embedding-benchmark-6649-f33fe845
Covers the executive summary, detailed methodology, and practical applications.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- According to Me: Long-Term Personalized Referential Memory QA (2026)
- ES-MemEval: Benchmarking Conversational Agents on Personalized Long-Term Emotional Support (2026)
- HiNS: Hierarchical Negative Sampling for More Comprehensive Memory Retrieval Embedding Model (2026)
- Query-focused and Memory-aware Reranker for Long Context Processing (2026)
- Learning to Remember: End-to-End Training of Memory Agents for Long-Context Reasoning (2026)
- Diffusion-Pretrained Dense and Contextual Embeddings (2026)
- HyMem: Hybrid Memory Architecture with Dynamic Retrieval Scheduling (2026)