Abstract
AgentSearchBench is a large-scale benchmark for agent search that addresses the challenge of identifying suitable AI agents for complex tasks by evaluating relevance through execution-grounded performance signals rather than textual descriptions alone.
The rapid growth of AI agent ecosystems is transforming how complex tasks are delegated and executed, creating a new challenge of identifying suitable agents for a given task. Unlike traditional tools, agents have capabilities that are often compositional and execution-dependent, making them difficult to assess from textual descriptions alone. However, existing research and benchmarks typically assume well-specified functionalities, controlled candidate pools, or only executable task queries, leaving realistic agent search scenarios insufficiently studied. We introduce AgentSearchBench, a large-scale benchmark for agent search in the wild, built from nearly 10,000 real-world agents across multiple providers. The benchmark formalizes agent search as retrieval and reranking problems under both executable task queries and high-level task descriptions, and evaluates relevance using execution-grounded performance signals. Experiments reveal a consistent gap between semantic similarity and actual agent performance, exposing the limitations of description-based retrieval and reranking methods. We further show that lightweight behavioral signals, including execution-aware probing, can substantially improve ranking quality, highlighting the importance of incorporating execution signals into agent discovery. Our code is available at https://github.com/Bingo-W/AgentSearchBench.
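The abstract outlines a two-stage pipeline: description-based retrieval over an agent pool, followed by reranking that folds in lightweight execution signals. Below is a minimal sketch of what such a pipeline could look like; the `Agent` class, the `probe_success` field, and the `alpha` blending scheme are illustrative assumptions, not AgentSearchBench's actual API (see the linked repository for the real implementation).

```python
# Hypothetical sketch of description-based retrieval plus execution-aware
# reranking, as described in the abstract. All names here are illustrative.
from dataclasses import dataclass

import numpy as np


@dataclass
class Agent:
    name: str
    embedding: np.ndarray      # embedding of the agent's textual description
    probe_success: float = 0.0 # fraction of cheap probe tasks the agent passed


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))


def retrieve(query: np.ndarray, pool: list[Agent], k: int) -> list[Agent]:
    """Stage 1: semantic retrieval by similarity between the task query
    embedding and each agent's description embedding."""
    return sorted(pool, key=lambda ag: cosine(query, ag.embedding), reverse=True)[:k]


def rerank(query: np.ndarray, pool: list[Agent], k: int = 5, alpha: float = 0.5) -> list[Agent]:
    """Stage 2: blend description similarity with an execution-grounded
    signal. Here the probe signal is a precomputed success rate; in practice
    it would come from running each candidate on small executable tasks and
    grading the resulting trajectories."""
    candidates = retrieve(query, pool, k)

    def score(ag: Agent) -> float:
        return alpha * cosine(query, ag.embedding) + (1 - alpha) * ag.probe_success

    return sorted(candidates, key=score, reverse=True)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pool = [Agent(f"agent-{i}", rng.normal(size=8), probe_success=rng.uniform())
            for i in range(100)]
    query = rng.normal(size=8)
    for ag in rerank(query, pool):
        print(ag.name, round(ag.probe_success, 2))
```

The `alpha` weight makes the trade-off explicit: at `alpha=1.0` the ranking collapses to pure description similarity, which is exactly the regime the paper reports as diverging from actual agent performance.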
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- AgentSelect: Benchmark for Narrative Query-to-Agent Recommendation (2026)
- AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation (2026)
- Learning to Retrieve from Agent Trajectories (2026)
- Tool-Genesis: A Task-Driven Tool Creation Benchmark for Self-Evolving Language Agent (2026)
- A Reference Architecture for Agentic Hybrid Retrieval in Dataset Search (2026)
- AgentWebBench: Benchmarking Multi-Agent Coordination in Agentic Web (2026)
- LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks (2026)