Abstract
LatentMAS enables efficient and effective collaboration among LLM agents using latent space representations, enhancing reasoning quality and reducing computational costs.
Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediation for reasoning and communication, we take a step forward by enabling models to collaborate directly within the continuous latent space. We introduce LatentMAS, an end-to-end training-free framework that enables pure latent collaboration among LLM agents. In LatentMAS, each agent first performs auto-regressive latent thought generation through its last-layer hidden embeddings. A shared latent working memory then preserves and transfers each agent's internal representations, ensuring lossless information exchange. We provide theoretical analyses establishing that LatentMAS attains higher expressiveness and lossless information preservation with substantially lower complexity than vanilla text-based MAS. In addition, empirical evaluations across 9 comprehensive benchmarks spanning math and science reasoning, commonsense understanding, and code generation show that LatentMAS consistently outperforms strong single-model and text-based MAS baselines, achieving up to 14.6% higher accuracy, reducing output token usage by 70.8%-83.7%, and providing 4x-4.3x faster end-to-end inference. These results demonstrate that our new latent collaboration framework enhances system-level reasoning quality while offering substantial efficiency gains without any additional training. Code and data are fully open-sourced at https://github.com/Gen-Verse/LatentMAS.
Community
Code and Data are released here: https://github.com/Gen-Verse/LatentMAS
X/Twitter Cover: https://x.com/LingYang_PU/status/1993510834245714001
LinkedIn Cover: https://www.linkedin.com/feed/update/urn:li:activity:7399636490164559872
Very exciting work!
How does this affect bandwidth? Does it trade token efficiency for bandwidth inefficiency?
Hi Michael,
Thanks for your excellent question on our LatentMAS work. I will provide a detailed response below. Let me know if you want to discuss more!
How does this affect bandwidth? Does it trade token efficiency for bandwidth inefficiency?
TL;DR
Short answer: No. LatentMAS does not trade token efficiency for bandwidth inefficiency. Its “bandwidth” is in latent working memory, and since each latent step is far more expressive than a token, LatentMAS requires many fewer steps, making it both token-efficient and bandwidth-efficient, with faster inference.
Detailed Response
In normal TextMAS, bandwidth = #tokens × |V| (vocabulary-level information throughput)
In LatentMAS, bandwidth = #latent steps × dₕ × L (hidden-state KV transfer).
Note: Here, “bandwidth” refers to internal GPU memory movement of latent working memory (stored in KV caches) between agents.
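The two formulas above can be compared with a quick back-of-the-envelope calculation. This is a sketch using the comment's own definitions; the vocabulary size, hidden width, layer count, and step counts below are illustrative assumptions, not values reported in the paper.

```python
# Back-of-the-envelope bandwidth comparison, following the formulas above.
# All numbers are assumed for illustration.

V = 128_000       # assumed vocabulary size |V|
d_h = 4_096       # assumed hidden size d_h
L = 32            # assumed number of layers
T_tokens = 2_000  # assumed length of a text-based handoff, in tokens
m = 20            # assumed number of latent steps, with m << T_tokens

text_bandwidth = T_tokens * V   # TextMAS:  #tokens × |V|
latent_bandwidth = m * d_h * L  # LatentMAS: #latent steps × d_h × L

print(f"TextMAS   bandwidth: {text_bandwidth:,}")
print(f"LatentMAS bandwidth: {latent_bandwidth:,}")
print(f"ratio: {text_bandwidth / latent_bandwidth:.1f}x")
```

Under these assumed numbers the latent handoff moves roughly two orders of magnitude less than the text-based one, which is the "fewer, denser steps" point made above.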
As Theorem 3.1 in the paper establishes, each latent step is strictly more expressive than a single discrete token. This means:
- One latent step carries the semantic information of hundreds of tokens.
- You only need m≪T_tokens to reach the same reasoning depth.
Thus, while each latent step transmits dense vectors, each step carries far more information than a token, drastically reducing the number of required communication steps. According to the paper, both theoretical complexity analysis and empirical results (70–80% fewer tokens, 4× speedup) later demonstrate that LatentMAS is strictly more bandwidth-efficient at the system level than text-based multi-agent communication.
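The latent generation loop described above can be sketched in a few lines: instead of decoding a token and re-embedding it, the last-layer hidden state is fed straight back as the next input embedding. The tiny model below is a stand-in, not the paper's architecture; the hidden size, prompt length, and step count are assumptions for illustration.

```python
# Toy sketch of auto-regressive latent generation: the last position's
# last-layer hidden state is appended to the input sequence each step,
# bypassing token decoding entirely. Dimensions are assumed.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_h = 64  # assumed hidden size

# Stand-in for a transformer forward pass over the current sequence.
layer = nn.TransformerEncoderLayer(d_model=d_h, nhead=4, batch_first=True)
model = nn.TransformerEncoder(layer, num_layers=2)

prompt = torch.randn(1, 8, d_h)  # pretend prompt embeddings: batch 1, 8 positions
seq = prompt
for _ in range(20):                      # 20 latent steps
    hidden = model(seq)                  # forward pass (no unembedding, no sampling)
    next_latent = hidden[:, -1:, :]      # last position's last-layer state
    seq = torch.cat([seq, next_latent], dim=1)  # feed it back as the next input

print(seq.shape)  # prompt length 8 + 20 latent steps
```

In a real implementation, a KV cache would let each latent step attend only over cached keys/values rather than re-running the full sequence, so per-step cost stays close to ordinary single-token decoding.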
Thank you for the detailed response. Let's say we can do the latent reasoning in 20 steps. I'm assuming this is 20 forward passes minus the last layer, with a bit of KV-cache magic to plug the output back into the input, and so on; after these 20 steps it switches to standard autoregressive token generation for the final answer (which could be several thousand tokens). If my understanding is correct, I'm wondering whether each step is faster than generating a single token, more akin to prefill, or whether each step is slower than a standard single pass?
Also, have you considered reinforcement learning on top of this? If we don't care about the "reasoning" being interpretable, this might prove advantageous for RL, since it doesn't need to decode at every step. I'm wondering how many steps we can go before it plateaus.
Finally, do you share the latent representations at each step, or only after the last step? I'm wondering whether, for parallel agents, model B could coerce model A mid-flight, thereby saving even more tokens.
Sorry for all the questions, but this is incredible. Thank you!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Planner and Executor: Collaboration between Discrete Diffusion And Autoregressive Models in Reasoning (2025)
- Thought Communication in Multiagent Collaboration (2025)
- LaDiR: Latent Diffusion Enhances LLMs for Text Reasoning (2025)
- Exploring System 1 and 2 communication for latent reasoning in LLMs (2025)
- Unlocking the Power of Multi-Agent LLM for Reasoning: From Lazy Agents to Deliberation (2025)
- CoCoVa: Chain of Continuous Vision-Language Thought for Latent Space Reasoning (2025)
- Think Consistently, Reason Efficiently: Energy-Based Calibration for Implicit Chain-of-Thought (2025)
arXiv explained breakdown of this paper 👉 https://arxivexplained.com/papers/latent-collaboration-in-multi-agent-systems