arxiv:2511.20639

Latent Collaboration in Multi-Agent Systems

Published on Nov 25
· Submitted by Jiaru Zou on Nov 27
#1 Paper of the day
Authors:
Pan Lu, et al.
Abstract

LatentMAS enables efficient and effective collaboration among LLM agents using latent space representations, enhancing reasoning quality and reducing computational costs.

AI-generated summary

Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediation for reasoning and communication, we take a step forward by enabling models to collaborate directly within the continuous latent space. We introduce LatentMAS, an end-to-end training-free framework that enables pure latent collaboration among LLM agents. In LatentMAS, each agent first performs auto-regressive latent thoughts generation through last-layer hidden embeddings. A shared latent working memory then preserves and transfers each agent's internal representations, ensuring lossless information exchange. We provide theoretical analyses establishing that LatentMAS attains higher expressiveness and lossless information preservation with substantially lower complexity than vanilla text-based MAS. In addition, empirical evaluations across 9 comprehensive benchmarks spanning math and science reasoning, commonsense understanding, and code generation show that LatentMAS consistently outperforms strong single-model and text-based MAS baselines, achieving up to 14.6% higher accuracy, reducing output token usage by 70.8%-83.7%, and providing 4x-4.3x faster end-to-end inference. These results demonstrate that our new latent collaboration framework enhances system-level reasoning quality while offering substantial efficiency gains without any additional training. Code and data are fully open-sourced at https://github.com/Gen-Verse/LatentMAS.

Community

Very exciting work!
How does this affect bandwidth? Does it trade token efficiency for bandwidth inefficiency?

Paper author Paper submitter

Hi Michael,

Thanks for your excellent question on our LatentMAS work. I will provide a detailed response below. Let me know if you want to discuss more!

How does this affect bandwidth? Does it trade token efficiency for bandwidth inefficiency?

TL;DR

Short answer: No. LatentMAS does not trade token efficiency for bandwidth inefficiency. Its “bandwidth” cost lies in transferring latent working memory, and since each latent step is far more expressive than a token, LatentMAS needs many fewer steps, making it both token-efficient and bandwidth-efficient, with faster inference.

Detailed Response

In normal TextMAS, bandwidth = #tokens × log|V| (vocabulary-level information throughput, since each token carries at most log|V| bits).

In LatentMAS, bandwidth = #latent steps × dₕ × L (hidden-state KV transfer, where dₕ is the hidden dimension and L is the number of transformer layers).
Note: Here, “bandwidth” refers to internal GPU memory movement of latent working memory (stored in KV caches) between agents.
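The two formulas can be made concrete with a quick back-of-envelope sketch. All numbers below (token count, vocabulary size, hidden dimension, layer count) are illustrative assumptions, not values from the paper: the latent channel moves a larger raw payload per step, but that payload is internal KV-cache traffic on the GPU, and the step count m is orders of magnitude smaller than the token count.

```python
import math

# Illustrative comparison of the two "bandwidth" notions above.
# Every concrete number here is a hypothetical assumption.

n_tokens = 8000       # tokens in a text-based exchange (assumed)
vocab = 150_000       # vocabulary size |V| (assumed)
text_bits = n_tokens * math.log2(vocab)   # vocabulary-level information, in bits

m_steps = 50          # latent steps (assumed)
d_h = 4096            # hidden dimension d_h (assumed)
n_layers = 32         # transformer layer count L (assumed)
latent_values = m_steps * d_h * n_layers  # hidden-state values moved via KV cache

print(f"text channel:  {text_bits:,.0f} bits over {n_tokens} tokens")
print(f"latent memory: {latent_values:,} values over {m_steps} steps")
```

The point of the sketch is the step-count asymmetry: the latent side finishes in m ≈ 50 internal transfers, while the text side must serialize, detokenize, and re-encode thousands of tokens between agents.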

As we know from Theorem 3.1:

$$\text{Latent expressiveness} = \Omega\!\left(\frac{d_h}{\log |V|}\right) \times \text{Text expressiveness}$$

This means:

  • One latent step carries the semantic information of hundreds of tokens.
  • You only need m ≪ T latent steps, where T is the number of tokens an equivalent text-based exchange would require, to reach the same reasoning depth.
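Plugging representative numbers into the Theorem 3.1 bound makes the bullets above concrete. Both values below (hidden dimension, vocabulary size) are assumptions chosen for illustration:

```python
import math

d_h = 4096        # hidden dimension (assumed, roughly 7B-scale)
vocab = 150_000   # vocabulary size |V| (assumed)

# Theorem 3.1 lower bound, up to constants: information carried by one
# latent step relative to one discrete token.
ratio = d_h / math.log2(vocab)
print(f"one latent step carries on the order of {ratio:.0f} tokens of information")
```

Under these assumptions the ratio comes out in the low hundreds, which is where the "hundreds of tokens per latent step" intuition comes from.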

Thus, while each latent step transmits dense vectors, it carries far more information than a token, drastically reducing the number of required communication steps. The paper's theoretical complexity analysis and empirical results (70–80% fewer tokens, 4× speedup) both demonstrate that LatentMAS is strictly more bandwidth-efficient at the system level than text-based multi-agent communication.


Thank you. So we're looking at perhaps 50 latent steps, which ought to take about the same time as 50 tokens, perhaps ~10% less. Then it will output perhaps 1,000 tokens which would be the answer minus the verbose reasoning, reaching a solution that would ordinarily take 8,000+ tokens. So without a multi-agent setup it's effectively a Latent Recurrent Transformer.

Assuming my understanding is correct I have a few more questions if you don't mind answering.

Can we mix and match LLMs?
Qwen3-4B with Qwen3-14B? How about mixing with other families? Does it support quantization? Does it work with LoRA?

Is the hierarchical agent setup concurrent? I.e., can two small models work together in real time, with instant access to each other's steps, or does each agent need to fully complete all its steps before they are shared?

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API


Paper author

Wow, that's fantastic, thanks for the support!

