Abstract
AgentSearchBench is a large-scale benchmark for agent search that addresses the challenge of identifying suitable AI agents for complex tasks by evaluating relevance through execution-grounded performance signals rather than textual descriptions alone.
The rapid growth of AI agent ecosystems is transforming how complex tasks are delegated and executed, creating a new challenge of identifying suitable agents for a given task. Unlike traditional tools, agents have capabilities that are often compositional and execution-dependent, making them difficult to assess from textual descriptions alone. However, existing research and benchmarks typically assume well-specified functionalities, controlled candidate pools, or only executable task queries, leaving realistic agent search scenarios insufficiently studied. We introduce AgentSearchBench, a large-scale benchmark for agent search in the wild, built from nearly 10,000 real-world agents across multiple providers. The benchmark formalizes agent search as retrieval and reranking problems under both executable task queries and high-level task descriptions, and evaluates relevance using execution-grounded performance signals. Experiments reveal a consistent gap between semantic similarity and actual agent performance, exposing the limitations of description-based retrieval and reranking methods. We further show that lightweight behavioral signals, including execution-aware probing, can substantially improve ranking quality, highlighting the importance of incorporating execution signals into agent discovery. Our code is available at https://github.com/Bingo-W/AgentSearchBench.
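The abstract outlines a two-stage pipeline: description-based retrieval over an agent pool, followed by reranking that folds in lightweight execution signals. Below is a minimal sketch of what such a pipeline could look like; the `Agent` class, the `probe_success` field, and the `alpha` blending scheme are illustrative assumptions, not AgentSearchBench's actual API (see the linked repository for the real implementation).

```python
# Hypothetical sketch of description-based retrieval plus execution-aware
# reranking, as described in the abstract. All names here are illustrative.
from dataclasses import dataclass

import numpy as np


@dataclass
class Agent:
    name: str
    embedding: np.ndarray      # embedding of the agent's textual description
    probe_success: float = 0.0 # fraction of cheap probe tasks the agent passed


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))


def retrieve(query: np.ndarray, pool: list[Agent], k: int) -> list[Agent]:
    """Stage 1: semantic retrieval by similarity between the task query
    embedding and each agent's description embedding."""
    return sorted(pool, key=lambda ag: cosine(query, ag.embedding), reverse=True)[:k]


def rerank(query: np.ndarray, pool: list[Agent], k: int = 5, alpha: float = 0.5) -> list[Agent]:
    """Stage 2: blend description similarity with an execution-grounded
    signal. Here the probe signal is a precomputed success rate; in practice
    it would come from running each candidate on small executable tasks and
    grading the resulting trajectories."""
    candidates = retrieve(query, pool, k)

    def score(ag: Agent) -> float:
        return alpha * cosine(query, ag.embedding) + (1 - alpha) * ag.probe_success

    return sorted(candidates, key=score, reverse=True)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pool = [Agent(f"agent-{i}", rng.normal(size=8), probe_success=rng.uniform())
            for i in range(100)]
    query = rng.normal(size=8)
    for ag in rerank(query, pool):
        print(ag.name, round(ag.probe_success, 2))
```

The `alpha` weight makes the trade-off explicit: at `alpha=1.0` the ranking collapses to pure description similarity, which is exactly the regime the paper reports as diverging from actual agent performance.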
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- AgentSelect: Benchmark for Narrative Query-to-Agent Recommendation (2026)
- AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation (2026)
- Learning to Retrieve from Agent Trajectories (2026)
- Tool-Genesis: A Task-Driven Tool Creation Benchmark for Self-Evolving Language Agent (2026)
- A Reference Architecture for Agentic Hybrid Retrieval in Dataset Search (2026)
- AgentWebBench: Benchmarking Multi-Agent Coordination in Agentic Web (2026)
- LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks (2026)