Abstract
LatentMAS enables efficient and effective collaboration among LLM agents using latent space representations, enhancing reasoning quality and reducing computational costs.
Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediation for reasoning and communication, we take a step forward by enabling models to collaborate directly within the continuous latent space. We introduce LatentMAS, an end-to-end training-free framework that enables pure latent collaboration among LLM agents. In LatentMAS, each agent first performs auto-regressive latent thought generation through its last-layer hidden embeddings. A shared latent working memory then preserves and transfers each agent's internal representations, ensuring lossless information exchange. We provide theoretical analyses establishing that LatentMAS attains higher expressiveness and lossless information preservation with substantially lower complexity than vanilla text-based MAS. In addition, empirical evaluations across 9 comprehensive benchmarks spanning math and science reasoning, commonsense understanding, and code generation show that LatentMAS consistently outperforms strong single-model and text-based MAS baselines, achieving up to 14.6% higher accuracy, reducing output token usage by 70.8%-83.7%, and providing 4x-4.3x faster end-to-end inference. These results demonstrate that our new latent collaboration framework enhances system-level reasoning quality while offering substantial efficiency gains without any additional training. Code and data are fully open-sourced at https://github.com/Gen-Verse/LatentMAS.
Community
Code and Data are released here: https://github.com/Gen-Verse/LatentMAS
X/Twitter Cover: https://x.com/LingYang_PU/status/1993510834245714001
LinkedIn Cover: https://www.linkedin.com/feed/update/urn:li:activity:7399636490164559872
Very exciting work!
How does this affect bandwidth? Does it trade token efficiency for bandwidth inefficiency?
Hi Michael,
Thanks for your excellent question on our LatentMAS work. I will provide a detailed response below. Let me know if you want to discuss more!
How does this affect bandwidth? Does it trade token efficiency for bandwidth inefficiency?
TL;DR
Short answer: No. LatentMAS does not trade token efficiency for bandwidth inefficiency. Its “bandwidth” is in latent working memory, and since each latent step is far more expressive than a token, LatentMAS requires many fewer steps, making it both token-efficient and bandwidth-efficient, with faster inference.
Detailed Response
In normal TextMAS, bandwidth = #tokens × |V| (vocabulary-level information throughput)
In LatentMAS, bandwidth = #latent steps × dₕ × L (hidden-state KV transfer).
Note: Here, “bandwidth” refers to internal GPU memory movement of latent working memory (stored in KV caches) between agents.
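The two formulas above can be compared with a quick back-of-the-envelope calculation. This is a sketch using the comment's own definitions; the vocabulary size, hidden width, layer count, and step counts below are illustrative assumptions, not values reported in the paper.

```python
# Back-of-the-envelope bandwidth comparison, following the formulas above.
# All numbers are assumed for illustration.

V = 128_000       # assumed vocabulary size |V|
d_h = 4_096       # assumed hidden size d_h
L = 32            # assumed number of layers
T_tokens = 2_000  # assumed length of a text-based handoff, in tokens
m = 20            # assumed number of latent steps, with m << T_tokens

text_bandwidth = T_tokens * V   # TextMAS:  #tokens × |V|
latent_bandwidth = m * d_h * L  # LatentMAS: #latent steps × d_h × L

print(f"TextMAS   bandwidth: {text_bandwidth:,}")
print(f"LatentMAS bandwidth: {latent_bandwidth:,}")
print(f"ratio: {text_bandwidth / latent_bandwidth:.1f}x")
```

Under these assumed numbers the latent handoff moves roughly two orders of magnitude less than the text-based one, which is the "fewer, denser steps" point made above.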
As Theorem 3.1 in the paper establishes, each latent step is strictly more expressive than a single discrete token. This means:
- One latent step carries the semantic information of hundreds of tokens.
- You only need m≪T_tokens to reach the same reasoning depth.
Thus, while each latent step transmits dense vectors, each step carries far more information than a token, drastically reducing the number of required communication steps. According to the paper, both theoretical complexity analysis and empirical results (70–80% fewer tokens, 4× speedup) later demonstrate that LatentMAS is strictly more bandwidth-efficient at the system level than text-based multi-agent communication.
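The latent generation loop described above can be sketched in a few lines: instead of decoding a token and re-embedding it, the last-layer hidden state is fed straight back as the next input embedding. The tiny model below is a stand-in, not the paper's architecture; the hidden size, prompt length, and step count are assumptions for illustration.

```python
# Toy sketch of auto-regressive latent generation: the last position's
# last-layer hidden state is appended to the input sequence each step,
# bypassing token decoding entirely. Dimensions are assumed.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_h = 64  # assumed hidden size

# Stand-in for a transformer forward pass over the current sequence.
layer = nn.TransformerEncoderLayer(d_model=d_h, nhead=4, batch_first=True)
model = nn.TransformerEncoder(layer, num_layers=2)

prompt = torch.randn(1, 8, d_h)  # pretend prompt embeddings: batch 1, 8 positions
seq = prompt
for _ in range(20):                      # 20 latent steps
    hidden = model(seq)                  # forward pass (no unembedding, no sampling)
    next_latent = hidden[:, -1:, :]      # last position's last-layer state
    seq = torch.cat([seq, next_latent], dim=1)  # feed it back as the next input

print(seq.shape)  # prompt length 8 + 20 latent steps
```

In a real implementation, a KV cache would let each latent step attend only over cached keys/values rather than re-running the full sequence, so per-step cost stays close to ordinary single-token decoding.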
Thank you for the detailed response. Let's say we can do the latent reasoning in 20 steps. I'm assuming this is 20 forward passes minus the last layer, with a bit of KV-cache magic to plug the output back into the input, and so on; after these 20 steps it switches to standard autoregressive token generation for the final answer (which could be several thousand tokens). If my understanding is correct, I'm wondering whether each step is faster than generating a single token, more akin to prefill, or whether each step is slower than a standard single pass?
Also, have you considered reinforcement learning on top of this? If we don't care about the "reasoning" being interpretable, this might prove advantageous for RL, since it doesn't need to decode at every step. I'm wondering how many steps we can go before it plateaus.
Finally, do you share the latent representations at each step, or only after the last step? I'm wondering whether, for parallel agents, model B could coerce model A mid-flight, thereby saving even more tokens.
Sorry for all the questions, but this is incredible. Thank you!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Planner and Executor: Collaboration between Discrete Diffusion And Autoregressive Models in Reasoning (2025)
- Thought Communication in Multiagent Collaboration (2025)
- LaDiR: Latent Diffusion Enhances LLMs for Text Reasoning (2025)
- Exploring System 1 and 2 communication for latent reasoning in LLMs (2025)
- Unlocking the Power of Multi-Agent LLM for Reasoning: From Lazy Agents to Deliberation (2025)
- CoCoVa: Chain of Continuous Vision-Language Thought for Latent Space Reasoning (2025)
- Think Consistently, Reason Efficiently: Energy-Based Calibration for Implicit Chain-of-Thought (2025)
arXiv explained breakdown of this paper 👉 https://arxivexplained.com/papers/latent-collaboration-in-multi-agent-systems