arxiv:2511.20639

Latent Collaboration in Multi-Agent Systems

Published on Nov 25
· Submitted by Jiaru Zou on Nov 27
#2 Paper of the day
Authors:
Pan Lu, et al.
Abstract

AI-generated summary: LatentMAS enables efficient and effective collaboration among LLM agents using latent space representations, enhancing reasoning quality and reducing computational costs.

Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediation for reasoning and communication, we take a step forward by enabling models to collaborate directly within the continuous latent space. We introduce LatentMAS, an end-to-end training-free framework that enables pure latent collaboration among LLM agents. In LatentMAS, each agent first performs auto-regressive latent thoughts generation through last-layer hidden embeddings. A shared latent working memory then preserves and transfers each agent's internal representations, ensuring lossless information exchange. We provide theoretical analyses establishing that LatentMAS attains higher expressiveness and lossless information preservation with substantially lower complexity than vanilla text-based MAS. In addition, empirical evaluations across 9 comprehensive benchmarks spanning math and science reasoning, commonsense understanding, and code generation show that LatentMAS consistently outperforms strong single-model and text-based MAS baselines, achieving up to 14.6% higher accuracy, reducing output token usage by 70.8%-83.7%, and providing 4x-4.3x faster end-to-end inference. These results demonstrate that our new latent collaboration framework enhances system-level reasoning quality while offering substantial efficiency gains without any additional training. Code and data are fully open-sourced at https://github.com/Gen-Verse/LatentMAS.

Community

Very exciting work!
How does this affect bandwidth? Does it trade token efficiency for bandwidth inefficiency?

Paper author (submitter)

Hi Michael,

Thanks for your excellent question on our LatentMAS work. I will provide a detailed response below. Let me know if you want to discuss more!

How does this affect bandwidth? Does it trade token efficiency for bandwidth inefficiency?

TL;DR

Short answer: No. LatentMAS does not trade token efficiency for bandwidth inefficiency. Its “bandwidth” lives in the latent working memory, and because each latent step is far more expressive than a single token, LatentMAS needs many fewer steps, making it both token-efficient and bandwidth-efficient, with faster inference.

Detailed Response

In standard TextMAS, bandwidth = #tokens × |V| (vocabulary-level information throughput).

In LatentMAS, bandwidth = #latent steps × d_h × L (hidden-state KV transfer).
Note: here, “bandwidth” refers to the internal GPU memory movement of the latent working memory (stored in KV caches) between agents.
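
For intuition, here is a toy comparison of the two quantities as defined above. Every number below (vocabulary size, hidden size, layer count, token and step counts) is a hypothetical placeholder for illustration, not a measurement from the paper:

```python
# Toy comparison of the two "bandwidth" notions defined above.
# All numbers are hypothetical placeholders, not paper measurements.
vocab_size   = 128_000   # |V|
d_h          = 4_096     # last-layer hidden size
n_layers     = 32        # L, layers whose KV caches form the working memory

text_tokens  = 800       # tokens a text-based agent might emit per message
latent_steps = 20        # latent steps for comparable reasoning depth

text_mas_bandwidth   = text_tokens * vocab_size        # #tokens x |V|
latent_mas_bandwidth = latent_steps * d_h * n_layers   # #steps x d_h x L

print(f"TextMAS   : {text_mas_bandwidth:,}")
print(f"LatentMAS : {latent_mas_bandwidth:,}")
print(f"ratio (TextMAS / LatentMAS): {text_mas_bandwidth / latent_mas_bandwidth:.1f}x")
```

The two totals are not in identical units (discrete vocabulary entries versus float dimensions per layer), so the exact ratio matters less than the shape of the argument: the number of communication steps collapses far faster than the per-step payload grows.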

As we know from Theorem 3.1:
$$\text{Latent expressiveness} = \Omega\!\left(\frac{d_h}{\log |V|}\right) \times \text{Text}$$

This means:

  • One latent step carries the semantic information of hundreds of tokens.
  • You only need m ≪ T_tokens latent steps to reach the same reasoning depth.

Thus, while each latent step transmits dense vectors, each step carries far more information than a token, drastically reducing the number of required communication steps. Both the theoretical complexity analysis and the empirical results in the paper (70.8%–83.7% fewer output tokens, 4×–4.3× faster end-to-end inference) demonstrate that LatentMAS is strictly more bandwidth-efficient at the system level than text-based multi-agent communication.
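
As a quick back-of-envelope check on the "hundreds of tokens" bullet, one can plug hypothetical numbers into the Theorem 3.1 ratio (the hidden size and vocabulary size below are assumptions for illustration, not the paper's settings):

```python
import math

d_h = 4_096      # hypothetical last-layer hidden size
V   = 128_000    # hypothetical vocabulary size

# Theorem 3.1 (up to constants): information in one latent step vs. one token
tokens_per_latent_step = d_h / math.log2(V)
print(f"one latent step ~ {tokens_per_latent_step:.0f} tokens of information")
# prints roughly 240, i.e. on the order of hundreds of tokens
```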


Thank you for the detailed response. Let's say we can do the latent reasoning in 20 steps. I'm assuming this means 20 forward passes (minus the last layer) with a bit of KV-cache magic to plug the output back into the input, and so on, and that after these 20 steps it switches to standard autoregressive token generation for the final answer (which could be several thousand tokens). If my understanding is correct, I'm wondering whether each latent step is faster than generating a single token, more akin to prefill, or slower than a standard single pass?

Also, have you considered reinforcement learning on top of this? If we don't care about the "reasoning" being interpretable, this might prove advantageous for RL, since it doesn't need to decode at every step. I'm wondering how many latent steps we can go before performance plateaus.

Finally, do you share the latent representations at each step or only after the last step? I'm wondering whether, for parallel agents, model B could coerce model A mid-flight, thereby saving even more tokens.

Sorry for all the questions, but this is incredible. Thank you.
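
For readers following the first and third questions above, here is a minimal, heavily simplified sketch of what such a loop could look like with Hugging Face transformers: each latent step is one forward pass whose last-layer hidden state is fed straight back as the next inputs_embeds, the resulting KV cache acts as the shared latent working memory, and a second agent with the same base weights reuses that cache before decoding a normal text answer. The model name, the projection-free feedback, and the share-once-at-the-end choice are assumptions for illustration, not necessarily the paper's exact implementation.

```python
# Minimal sketch, NOT the authors' implementation. Assumptions: both "agents"
# share the same base weights (so KV caches are compatible), the last-layer
# hidden state is fed back as the next input embedding with no projection,
# and the latent working memory is handed over once, after the last step.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # illustrative choice of causal LM
device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device).eval()


@torch.no_grad()
def latent_steps(prompt: str, n_steps: int = 20):
    """Agent A: run n_steps latent 'thought' passes; return the KV cache."""
    enc = tok(prompt, return_tensors="pt").to(device)
    out = model(**enc, use_cache=True, output_hidden_states=True)
    past = out.past_key_values
    h = out.hidden_states[-1][:, -1:, :]  # last-layer state at last position
    for _ in range(n_steps):
        # One forward pass per latent step; the hidden state re-enters as
        # inputs_embeds, so nothing is ever decoded into vocabulary space.
        out = model(inputs_embeds=h, past_key_values=past,
                    use_cache=True, output_hidden_states=True)
        past = out.past_key_values
        h = out.hidden_states[-1][:, -1:, :]
    return past  # latent working memory (prompt + latent steps)


@torch.no_grad()
def answer_from_memory(past, question: str, max_new_tokens: int = 128) -> str:
    """Agent B: condition on A's KV cache, then decode plain text greedily."""
    ids = tok(question, return_tensors="pt").input_ids.to(device)
    out = model(input_ids=ids, past_key_values=past, use_cache=True)
    pieces = []
    for _ in range(max_new_tokens):
        next_id = out.logits[:, -1:].argmax(dim=-1)
        if next_id.item() == tok.eos_token_id:
            break
        pieces.append(next_id)
        out = model(input_ids=next_id, past_key_values=out.past_key_values,
                    use_cache=True)
    if not pieces:
        return ""
    return tok.decode(torch.cat(pieces, dim=1)[0], skip_special_tokens=True)


memory = latent_steps("Agent A, reason about: what is 17 * 24?", n_steps=20)
print(answer_from_memory(memory, "Agent B, state the final answer:"))
```

In this sketch each latent step is roughly one single-position decoding pass (the LM-head projection it still performs could be skipped by using the base model without the head), and whether the cache is handed over once at the end or streamed per step is a design choice the sketch leaves open.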

