arxiv:2511.20639

Latent Collaboration in Multi-Agent Systems

Published on Nov 25
· Submitted by Jiaru Zou on Nov 27
#1 Paper of the day
Authors:
Pan Lu, et al.
Abstract

LatentMAS enables efficient and effective collaboration among LLM agents using latent space representations, enhancing reasoning quality and reducing computational costs.

AI-generated summary

Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediation for reasoning and communication, we take a step forward by enabling models to collaborate directly within the continuous latent space. We introduce LatentMAS, an end-to-end training-free framework that enables pure latent collaboration among LLM agents. In LatentMAS, each agent first performs auto-regressive latent thoughts generation through last-layer hidden embeddings. A shared latent working memory then preserves and transfers each agent's internal representations, ensuring lossless information exchange. We provide theoretical analyses establishing that LatentMAS attains higher expressiveness and lossless information preservation with substantially lower complexity than vanilla text-based MAS. In addition, empirical evaluations across 9 comprehensive benchmarks spanning math and science reasoning, commonsense understanding, and code generation show that LatentMAS consistently outperforms strong single-model and text-based MAS baselines, achieving up to 14.6% higher accuracy, reducing output token usage by 70.8%-83.7%, and providing 4x-4.3x faster end-to-end inference. These results demonstrate that our new latent collaboration framework enhances system-level reasoning quality while offering substantial efficiency gains without any additional training. Code and data are fully open-sourced at https://github.com/Gen-Verse/LatentMAS.

Community

Very exciting work!
How does this affect bandwidth? Does it trade token efficiency for bandwidth inefficiency?

Paper author Paper submitter

Hi Michael,

Thanks for your excellent question on our LatentMAS work. I will provide a detailed response below. Let me know if you want to discuss more!

How does this affect bandwidth? Does it trade token efficiency for bandwidth inefficiency?

TL;DR

Short answer: No. LatentMAS does not trade token efficiency for bandwidth inefficiency. Its “bandwidth” cost lies in transferring latent working memory, and since each latent step is far more expressive than a token, LatentMAS needs many fewer steps, making it both token-efficient and bandwidth-efficient, with faster inference.

Detailed Response

In normal TextMAS, bandwidth = #tokens × log|V| (vocabulary-level information throughput, since each token carries at most log|V| bits).

In LatentMAS, bandwidth = #latent steps × dₕ × L (hidden-state KV transfer, where dₕ is the hidden dimension and L is the number of transformer layers).
Note: Here, “bandwidth” refers to internal GPU memory movement of latent working memory (stored in KV caches) between agents.
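The two formulas can be made concrete with a quick back-of-envelope sketch. All numbers below (token count, vocabulary size, hidden dimension, layer count) are illustrative assumptions, not values from the paper: the latent channel moves a larger raw payload per step, but that payload is internal KV-cache traffic on the GPU, and the step count m is orders of magnitude smaller than the token count.

```python
import math

# Illustrative comparison of the two "bandwidth" notions above.
# Every concrete number here is a hypothetical assumption.

n_tokens = 8000       # tokens in a text-based exchange (assumed)
vocab = 150_000       # vocabulary size |V| (assumed)
text_bits = n_tokens * math.log2(vocab)   # vocabulary-level information, in bits

m_steps = 50          # latent steps (assumed)
d_h = 4096            # hidden dimension d_h (assumed)
n_layers = 32         # transformer layer count L (assumed)
latent_values = m_steps * d_h * n_layers  # hidden-state values moved via KV cache

print(f"text channel:  {text_bits:,.0f} bits over {n_tokens} tokens")
print(f"latent memory: {latent_values:,} values over {m_steps} steps")
```

The point of the sketch is the step-count asymmetry: the latent side finishes in m ≈ 50 internal transfers, while the text side must serialize, detokenize, and re-encode thousands of tokens between agents.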

As we know from Theorem 3.1:

$$\text{Latent expressiveness} = \Omega\!\left(\frac{d_h}{\log |V|}\right) \times \text{Text expressiveness}$$

This means:

  • One latent step carries the semantic information of hundreds of tokens.
  • You only need m ≪ T latent steps, where T is the number of tokens an equivalent text-based exchange would require, to reach the same reasoning depth.
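Plugging representative numbers into the Theorem 3.1 bound makes the bullets above concrete. Both values below (hidden dimension, vocabulary size) are assumptions chosen for illustration:

```python
import math

d_h = 4096        # hidden dimension (assumed, roughly 7B-scale)
vocab = 150_000   # vocabulary size |V| (assumed)

# Theorem 3.1 lower bound, up to constants: information carried by one
# latent step relative to one discrete token.
ratio = d_h / math.log2(vocab)
print(f"one latent step carries on the order of {ratio:.0f} tokens of information")
```

Under these assumptions the ratio comes out in the low hundreds, which is where the "hundreds of tokens per latent step" intuition comes from.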

Thus, while each latent step transmits dense vectors, it carries far more information than a token, drastically reducing the number of required communication steps. The paper's theoretical complexity analysis and empirical results (70–80% fewer tokens, 4× speedup) both demonstrate that LatentMAS is strictly more bandwidth-efficient at the system level than text-based multi-agent communication.


Thank you. So we're looking at perhaps 50 latent steps, which ought to take about the same time as 50 tokens, perhaps ~10% less. Then it will output perhaps 1,000 tokens which would be the answer minus the verbose reasoning, reaching a solution that would ordinarily take 8,000+ tokens. So without a multi-agent setup it's effectively a Latent Recurrent Transformer.

Assuming my understanding is correct I have a few more questions if you don't mind answering.

Can we mix and match LLMs?
Qwen3-4B with Qwen3-14B? How about mixing with other families? Does it support quantization? Does it work with LoRA?

Is the hierarchical agent setup concurrent? I.e., can two small models work together in real time, with instant access to each other's steps, or does each agent need to fully complete all its steps before they are shared?

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API


Paper author

Wow, that's fantastic, thanks for the support!

