MATPO: Multi-Agent Tool-Integrated Policy Optimization
Train Multiple Agent Roles Within a Single LLM via Reinforcement Learning.
Result figures: GAIA, FRAMES, and WebWalkerQA.
MATPO allows planner and worker agents to coexist within a single LLM and be trained jointly via RL, achieving an average relative improvement of 18.38% over single-agent baselines on GAIA-text, FRAMES, and WebWalkerQA.
News & Updates
- [2025-Oct-08] MATPO-Qwen3-14B checkpoints and rollouts released
- [2025-Oct-08] Code and training scripts released
- [2025-Oct-06] arXiv paper released
Overview
MATPO (Multi-Agent Tool-Integrated Policy Optimization) is a novel reinforcement learning framework that enables training multiple specialized agent roles (planner and worker agents) within a single large language model.
The Problem
Current single-agent approaches for multi-turn tool-integrated planning face critical limitations:
- Context Length Bottleneck: Tool responses (e.g., web scraping) consume excessive tokens, making long-range planning prohibitive
- Noisy Tool Responses: Raw tool responses interfere with the model's attention and planning capabilities
Our Solution
MATPO introduces a multi-agent-in-one-model architecture where:
- A planner-agent orchestrates high-level planning and delegates subtasks
- Worker-agents handle specific browsing and search tasks with isolated contexts
- Both roles are trained within a single LLM using role-specific prompts via reinforcement learning
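As a rough illustration of the multi-agent-in-one-model idea (not the repo's actual prompts or API), the same policy can be queried under two role-specific system prompts; `chat(messages)` stands in for whatever chat-completion call your serving stack exposes:

```python
# Minimal sketch of the multi-agent-in-one-model idea. The prompts below are
# illustrative placeholders, not the ones shipped with MATPO.

PLANNER_SYSTEM_PROMPT = (
    "You are the planner. Decompose the user's question into search/browse "
    "subtasks, delegate each subtask to a worker, and synthesize a final answer."
)

WORKER_SYSTEM_PROMPT = (
    "You are the worker. Solve the given subtask with search and scrape tools, "
    "then return a concise summary of your findings."
)

def run_planner(chat, user_query: str) -> str:
    """Single shared model, planner role: only the system prompt differs."""
    messages = [
        {"role": "system", "content": PLANNER_SYSTEM_PROMPT},
        {"role": "user", "content": user_query},
    ]
    return chat(messages)

def run_worker(chat, subtask: str) -> str:
    """Single shared model, worker role: fresh, isolated context per subtask."""
    messages = [
        {"role": "system", "content": WORKER_SYSTEM_PROMPT},
        {"role": "user", "content": subtask},
    ]
    return chat(messages)
```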
Key Features
- Multi-Agent-in-One-Model: Train planner and worker agents within a single LLM using role-specific system prompts
- Principled Credit Assignment: Extends GRPO with theoretically grounded reward distribution across planner and worker rollouts
- Easy Integration: Built on top of veRL, compatible with existing RL training frameworks
- Robust Training: More stable learning curves compared to single-agent approaches, especially with noisy tool responses
- Infrastructure Efficient: No need to deploy separate models or additional rollout engines
MATPO Architecture
MATPO employs a hierarchical multi-agent framework where a single LLM serves multiple roles:
User Query → Planner Agent → Subtask 1 → Worker Agent → Result 1
                           → Subtask 2 → Worker Agent → Result 2
                           → ...
                           → Final Answer
Comparison of the rollout trajectories of single-agent GRPO (top) and multi-agent MATPO (bottom).
Multi-Agent Rollout Process
Planner Agent:
- Receives user query with planner-specific system prompt
- Generates high-level plan and decomposes it into subtasks
- Delegates subtasks to worker agents
- Synthesizes worker responses into final answer
Worker Agent:
- Receives subtask with worker-specific system prompt
- Performs multi-turn tool-integrated planning (search, scrape, analyze)
- Returns summarized result to planner
- Maintains isolated context to prevent token overflow
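A simplified sketch of this planner-worker loop, assuming text markers such as "SUBTASK:" and "FINAL ANSWER:" for delegation and termination; the markers and the `generate` callable are placeholders for illustration, not the verl implementation:

```python
# Simplified sketch of one planner-worker rollout. `generate(system_prompt,
# messages)` stands in for sampling from the single shared policy.

def parse_marker(text: str, marker: str):
    """Return the text after `marker` if present, else None."""
    idx = text.find(marker)
    return text[idx + len(marker):].strip() if idx != -1 else None

def multi_agent_rollout(generate, user_query, planner_prompt, worker_prompt,
                        max_planner_turns=10):
    planner_messages = [{"role": "user", "content": user_query}]
    for _ in range(max_planner_turns):
        planner_text = generate(planner_prompt, planner_messages)
        planner_messages.append({"role": "assistant", "content": planner_text})

        answer = parse_marker(planner_text, "FINAL ANSWER:")
        if answer is not None:
            return answer  # planner terminates with its synthesized answer

        subtask = parse_marker(planner_text, "SUBTASK:")
        if subtask is None:
            continue  # no delegation this turn

        # Worker runs in an isolated context: only the subtask (plus a recap of
        # the original query, per the tips below) enters its prompt, so raw tool
        # responses never bloat the planner's context.
        worker_messages = [{
            "role": "user",
            "content": f"Original question: {user_query}\nSubtask: {subtask}",
        }]
        summary = generate(worker_prompt, worker_messages)  # multi-turn tool use elided
        planner_messages.append({"role": "user", "content": f"Worker result: {summary}"})
    return None  # planner turn budget exhausted
```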
Credit Assignment:
- Final answer accuracy determines the reward
- Reward is normalized across all planner-worker rollout groups
- Gradient flows to both planner actions and worker actions proportionally
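A minimal sketch of this group-relative scheme, assuming the standard GRPO normalization (reward minus group mean, divided by group standard deviation) and that each worker trajectory inherits its parent planner rollout's advantage; this is an illustration, not the exact formula from the paper:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """rewards: one final-answer reward per planner rollout in the group."""
    r = np.asarray(rewards, dtype=np.float32)
    return (r - r.mean()) / (r.std() + eps)

# Example: 8 rollouts per query (matching the training config below).
rewards = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)

# Each worker trajectory spawned by planner rollout i trains with the same
# advantage as its parent planner rollout (token-level details elided).
for i, adv in enumerate(advantages):
    planner_adv = worker_adv = float(adv)
    print(f"rollout {i}: planner/worker advantage = {planner_adv:+.3f}")
```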
Visualization of MATPO implementation.
Quick Start
Prerequisites:
- Python 3.10 or higher
- CUDA 12.4+ (for GPU support)
- 16 nodes of 8 x 80GB A800 GPUs (for training with Qwen3-14B-base)
Clone the repository.
git clone https://github.com/mzf666/MATPO.git
cd MATPO
For installing the prerequisites (CUDA, cuDNN, Apex), we recommend following the verl prerequisites guide, which provides detailed instructions for:
- CUDA: Version >= 12.4
- cuDNN: Version >= 9.8.0
- Apex
Setup environment and install dependencies.
conda create -n matpo python==3.10 -y
conda activate matpo
bash examples/sglang_multiturn/install.sh
Setup Node.js for Serper API support.
MCP (Model Context Protocol) requires Node.js to run MCP servers. Node.js version 18+ is recommended for optimal compatibility with MCP tools.
target_path=YOUR_TARGET_PATH
# Download Node.js binary (example for Linux x64)
wget https://nodejs.org/dist/v24.2.0/node-v24.2.0-linux-x64.tar.xz
# Extract to your target path
tar -xf node-v24.2.0-linux-x64.tar.xz -C $target_path
# Add to PATH
export NODEJS_HOME=$target_path/node-v24.2.0-linux-x64
export PATH=$NODEJS_HOME/bin:$PATH
export NODE_SHARED=$target_path/node-shared/node_modules
export PATH=$NODE_SHARED/.bin:$PATH
# Verify installation
node --version
npm --version
# Install serper mcp server
mkdir -p $target_path/node-shared
cd $target_path/node-shared
npm init -y
npm install serper-search-scrape-mcp-server
Configure the Node.js paths and the HTTP/HTTPS proxies (if necessary) in the examples/sglang_multiturn/launch.sh script.
Download the training and testing datasets to the data directory. The preprocessed datasets can be downloaded here.
Train a Qwen3-14B-base model with MATPO on the MuSiQue dataset and evaluate on the GAIA-text dataset:
# tested on 16 x (8 x 80G-A800) nodes
export SERPER_API_KEY="YOUR_SERPER_API_KEY" && \
export OPENAI_API_KEY="YOUR_OPENAI_API_KEY" && \
export WANDB_API_KEY="YOUR_WANDB_API_KEY" && \
export SINGLENODE=true && \
export RAY_DEBUG=legacy && \
export HYDRA_FULL_ERROR=1 && \
source YOUR_CONDA_PATH activate matpo && \
cd YOUR_PROJECT_PATH && \
bash examples/sglang_multiturn/launch.sh \
examples/sglang_multiturn/qwen3-14b_musique_MATPO.sh
Experiments and Results
Main Results
MATPO consistently outperforms single-agent GRPO baselines across all benchmarks:
| Method | GAIA-text | WebWalkerQA | FRAMES | Relative Average Improvement |
| --- | --- | --- | --- | --- |
| Single-Agent GRPO | 32.16% | 30.14% | 56.22% | - |
| MATPO (Ours) | 42.60% | 33.00% | 63.64% | +18.38% |
Training Configuration
- Base Model: Qwen3-14B-base
- Training Dataset: Filtered MuSiQue dataset.
- Training Steps: 180 steps
- Rollouts per Query: 8 (for group normalization)
- Reward Function: 0.9 × accuracy + 0.1 × tool_format_reward
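As a sanity check, the reward composition above is a simple weighted sum; how the accuracy and tool-format terms are scored (exact match, judge model, format checks) is not shown here:

```python
# Illustrative composition of the reward described above; the underlying
# scorers for `accuracy` and `tool_format_reward` are assumptions, not the
# repo's exact implementation.

def matpo_reward(accuracy: float, tool_format_reward: float) -> float:
    """Both inputs are assumed to lie in [0, 1]."""
    return 0.9 * accuracy + 0.1 * tool_format_reward

assert matpo_reward(1.0, 1.0) == 1.0
assert abs(matpo_reward(0.0, 1.0) - 0.1) < 1e-9
```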
Model Checkpoints and Rollouts
We release the trained Qwen3-14B-base model checkpoints at the 180th training step for both single-agent GRPO and MATPO.
The associated model rollouts across various training steps can be found here.
Key Findings
- More Stable Training: MATPO exhibits more stable learning curves and avoids catastrophic performance drops observed in single-agent training
- Robustness to Noise: Multi-agent decomposition effectively isolates noisy tool responses, preventing them from interfering with high-level planning
- Better Credit Assignment: Principled reward distribution across planner and worker rollouts leads to more effective learning
Practical Implementation Tips
Based on our experiments, we recommend:
- Final Summary: Final summaries from worker agents are critical for clean planner-worker interfaces
- Query Recap: Recapping original user query in worker prompt significantly improves performance
- URL Blocking: Remember to block HuggingFace search results to avoid data leakage (see the sketch below)
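For the URL-blocking tip above, a minimal filtering sketch, assuming the search tool returns result dicts with a "url" field; the blocklist and result schema are illustrative, not the repo's actual filter:

```python
from urllib.parse import urlparse

# Drop HuggingFace results from search/scrape output before it reaches the model.
BLOCKED_DOMAINS = ("huggingface.co",)

def filter_search_results(results):
    """Keep only results whose host does not match a blocked domain."""
    kept = []
    for item in results:  # each item assumed to carry a "url" field
        host = urlparse(item["url"]).netloc.lower()
        if not any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS):
            kept.append(item)
    return kept

print(filter_search_results([
    {"url": "https://huggingface.co/datasets/gaia-benchmark"},
    {"url": "https://example.com/some-article"},
]))  # only the example.com entry survives
```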
Citation
If you find MATPO helpful in your research, please consider citing our paper:
@misc{mo2025multiagenttoolintegratedpolicyoptimization,
title={Multi-Agent Tool-Integrated Policy Optimization},
author={Zhanfeng Mo and Xingxuan Li and Yuntao Chen and Lidong Bing},
year={2025},
eprint={2510.04678},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2510.04678},
}
Acknowledgments
We would like to thank:
- VolcEngine for developing and open-sourcing veRL, the RL training framework that powers MATPO
- Alibaba Cloud for the Qwen3 model series
- Google for the Serper API that enables web search capabilities
- The authors of GAIA, WebWalkerQA, FRAMES, and MuSiQue datasets
- The open-source community for valuable feedback and contributions
FAQ
Q: What's the difference between MATPO and traditional multi-agent systems?
MATPO uses a single LLM to play multiple agent roles via different system prompts, rather than deploying separate models. This offers:
- Lower infrastructure complexity
- Better parameter efficiency
- Easier deployment and maintenance
- Compatible with existing RL frameworks
Q: Can I use MATPO with models other than Qwen3?
Yes! MATPO is model-agnostic. You can use any decoder-only LLM that supports tool calling and multi-turn conversations. We've tested with Qwen3-14B-base, but models like Llama 3, Mistral, or other reasoning-capable LLMs should work.
Q: How many GPUs do I need for training?
For Qwen3-14B-base, we recommend:
- Training: 8x A100/A800 GPUs (80GB)
- Inference: 1-2x A100/A800 GPUs (40GB/80GB)
Q: How does MATPO handle credit assignment?
MATPO extends GRPO with principled credit assignment:
- The planner's final answer determines the accuracy reward
- This reward is normalized across all rollouts in a group
- Gradients flow proportionally to both planner and worker actions
- Worker agents receive the same advantage value as their parent planner rollout
See our paper for more details.
Q: Can I use MATPO for tasks other than web search?
Absolutely! While our paper focuses on web search, MATPO's framework is general. You can extend it to:
- Code generation with execution feedback
- Scientific reasoning with calculator tools
- Data analysis with pandas/SQL tools
- Any multi-turn task with verifiable rewards
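As one hypothetical example outside web search, a verifiable reward for code generation could score a candidate function against unit tests; nothing below is part of the MATPO codebase:

```python
# Hedged sketch of a verifiable reward for code generation with execution
# feedback, as a stand-in for the web-search accuracy reward.

def unit_test_reward(candidate_fn, test_cases) -> float:
    """Fraction of (args, expected) pairs the candidate solves."""
    passed = 0
    for args, expected in test_cases:
        try:
            if candidate_fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # runtime errors count as failures
    return passed / len(test_cases)

# e.g. reward a generated `add` implementation:
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
print(unit_test_reward(lambda a, b: a + b, tests))  # 1.0
```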
Q: How stable is MATPO training compared to single-agent RL?
MATPO is significantly more stable. Our experiments show:
- Single-agent GRPO often suffers catastrophic drops after step 120
- MATPO maintains steady improvement throughout training
- Multi-agent structure isolates noisy tool responses, preventing interference
See Figure 4 in our paper for training curves.
Q: Do I need to block HuggingFace URLs during training?
For research integrity, yes - especially if your evaluation benchmarks are hosted on HuggingFace. This prevents models from "cheating" by finding ground-truth answers online.
For production systems with no data leakage concerns, this is optional.
Star ⭐ this repository if you find it helpful!