arxiv:2511.21689

ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration

Published on Nov 26 · Submitted by Shizhe Diao on Dec 3
#2 Paper of the day

AI-generated summary

A small orchestrator trained with the ToolOrchestra method coordinates models and a variety of tools via reinforcement learning, solving complex tasks such as Humanity's Last Exam with higher accuracy and efficiency than much larger models.

Abstract

Large language models are powerful generalists, yet solving deep and complex problems such as those of Humanity's Last Exam (HLE) remains both conceptually challenging and computationally expensive. We show that small orchestrators managing other models and a variety of tools can both push the upper bound of intelligence and improve efficiency in solving difficult agentic tasks. We introduce ToolOrchestra, a method for training small orchestrators that coordinate intelligent tools. ToolOrchestra explicitly uses reinforcement learning with outcome-, efficiency-, and user-preference-aware rewards. Using ToolOrchestra, we produce Orchestrator, an 8B model that achieves higher accuracy at lower cost than previous tool-use agents while aligning with user preferences on which tools are to be used for a given query. On HLE, Orchestrator achieves a score of 37.1%, outperforming GPT-5 (35.1%) while being 2.5x more efficient. On tau2-Bench and FRAMES, Orchestrator surpasses GPT-5 by a wide margin while using only about 30% of the cost. Extensive analysis shows that Orchestrator achieves the best trade-off between performance and cost under multiple metrics, and generalizes robustly to unseen tools. These results demonstrate that composing diverse tools with a lightweight orchestration model is both more efficient and more effective than existing methods, paving the way for practical and scalable tool-augmented reasoning systems.

Community


🔥 Agent fine-tuning is back: an 8B orchestrator carries GPT-5, hitting 37.1% on HLE
Over the past year, we’ve seen an explosion of “AI agents” built by chaining tools, APIs, and LLMs together — with ever-more-sophisticated routing logic and workflows.
But here’s the problem:
Wiring components together ≠ intelligence.
Workflows alone don’t learn, don’t adapt, and don’t optimize.

At some point, agents have to learn how to reason, not just be told when to call a model or how to retrieve a document.
That’s where fine-tuning comes back — this time, reinvented through reinforcement learning (RL).
🧠 From orchestration to optimization
We at NVIDIA Research introduced ToolOrchestra, a framework that uses long-horizon RL to train small models (“orchestrators”) to manage big ones.
Instead of hand-coded heuristics or fixed workflows, the Orchestrator learns through RL how to decide (see the sketch after this list):
→ Which model to call, including frontier ones (GPT-5, Claude, etc.)
→ When to invoke a tool (search, code interpreter, API call)
→ How long to reason before acting
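Concretely, the decision loop looks something like the sketch below. This is a minimal, hypothetical illustration, not the released ToolOrchestra code: the `Action` type, `toy_policy`, and the tool names are assumptions made for exposition.

```python
# Hypothetical sketch of an orchestration episode: a small policy repeatedly
# picks a model/tool to call until it decides to answer. Illustrative only.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Action:
    kind: str      # "call" (invoke a tool/model) or "answer" (stop)
    target: str    # e.g. "gpt-5", "web_search", "code_interpreter"
    payload: str   # sub-query, code to run, or the final answer

def toy_policy(context: List) -> Action:
    # Stand-in for the trained 8B orchestrator: search once, then answer.
    if len(context) == 1:
        return Action("call", "web_search", context[0])
    return Action("answer", "", f"answer based on {len(context) - 1} tool result(s)")

def orchestrate(policy: Callable[[List], Action], query: str,
                tools: Dict[str, Callable[[str], str]], max_steps: int = 16) -> str:
    """One episode: the orchestrator keeps choosing tools/models until it answers."""
    context: List = [query]
    for _ in range(max_steps):
        action = policy(context)
        if action.kind == "answer":
            return action.payload
        result = tools[action.target](action.payload)  # invoke the chosen tool
        context.append((action.target, result))        # observation for the next decision
    return "step budget exhausted"

tools = {"web_search": lambda q: f"snippets for: {q}"}
print(orchestrate(toy_policy, "Who proved the four color theorem?", tools))
```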

The orchestrator is rewarded not just for getting the right answer, but also for efficiency: balancing accuracy, latency, and cost.
This makes it a truly adaptive controller, not just a scripted pipeline.
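As a rough illustration of such a reward, the function below combines the three kinds of terms the paper names (outcome-, efficiency-, and user-preference-aware). The weights and exact functional form are assumptions for the sketch, not the paper's formulation.

```python
# Toy multi-objective episode reward: outcome + efficiency + preference terms.
# The weights below are illustrative assumptions, not values from the paper.
def episode_reward(correct: bool, dollar_cost: float, latency_s: float,
                   matched_user_preference: bool,
                   w_cost: float = 0.1, w_latency: float = 0.01,
                   w_pref: float = 0.2) -> float:
    outcome = 1.0 if correct else 0.0                            # outcome-aware
    efficiency = -w_cost * dollar_cost - w_latency * latency_s   # efficiency-aware
    preference = w_pref if matched_user_preference else 0.0      # preference-aware
    return outcome + efficiency + preference

# A correct answer that cost $0.30 and took 12 s, using the user's preferred tools:
print(episode_reward(True, 0.30, 12.0, True))  # 1.0 - 0.03 - 0.12 + 0.2 = 1.05
```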
⚙️ RL optimizing your workflow
This is a direction the community has barely explored: reinforcement learning applied to orchestration itself.
It’s long-horizon, multi-objective RL — optimizing workflows, not just single-step predictions.
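To make "long-horizon" concrete: a single scalar return for the whole workflow has to assign credit to every routing decision inside it. The toy REINFORCE update below illustrates that credit assignment; plain REINFORCE is an assumption for exposition, not necessarily the algorithm the paper uses.

```python
# Toy long-horizon policy-gradient step: one end-of-episode return updates
# every tool choice the policy made. Illustrative REINFORCE, not the paper's code.
import torch

policy = torch.nn.Linear(4, 3)                       # toy state -> logits over 3 tools
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def run_episode(steps: int = 5):
    log_probs = []
    for _ in range(steps):
        state = torch.randn(4)                        # stand-in for the dialogue state
        dist = torch.distributions.Categorical(logits=policy(state))
        action = dist.sample()                        # which tool/model to call
        log_probs.append(dist.log_prob(action))
    # Stand-in multi-objective return: outcome minus a per-step cost penalty.
    return torch.stack(log_probs), 1.0 - 0.05 * steps

log_probs, episode_return = run_episode()
loss = -(log_probs.sum() * episode_return)            # all decisions share final credit
opt.zero_grad(); loss.backward(); opt.step()
```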
And the results are striking.
Our Orchestrator-8B outperforms frontier LLMs like GPT-5, Claude Opus 4.1, and Llama-3.3-70B on hard reasoning benchmarks (Humanity’s Last Exam, FRAMES, τ²-Bench) while being cheaper and faster. It outperforms GPT-5 on HLE (37.1% vs. 35.1%) while being 2.5× faster and 70% cheaper.

💡 “Fine-tuning is dead”? Think again.
There’s been a popular narrative lately: that fine-tuning is over, and that prompt engineering or workflow composition is enough.
Our work proves that fine-tuning didn’t die — it evolved.
Now it’s RL-based, multi-objective, and long-horizon.

ToolOrchestra marks a shift from “monolithic LLMs” to compound AI systems — modular, adaptive, and self-optimizing.

🚀 The rise of compound AI systems
We’re entering a new phase of AI system design:
→ From monolithic LLMs to compound AI systems.
→ From static workflows to RL-trained orchestration.
→ From brute-force scale to intelligent coordination.
Orchestration is no longer just a systems problem — it’s a learning problem.

Paper: https://lnkd.in/grjWkh6J
Homepage: https://lnkd.in/gkyiZEaJ
Model: https://lnkd.in/gBmtif_f
Data: https://lnkd.in/g2HbQbcg
Code: https://lnkd.in/gWF_m6Je
