arxiv:2510.03215

Cache-to-Cache: Direct Semantic Communication Between Large Language Models

Published on Oct 3
Submitted by Tianyu Fu on Oct 9
#1 Paper of the day

Abstract

AI-generated summary

Cache-to-Cache (C2C) enables direct semantic communication between LLMs using neural network projections, improving accuracy and reducing latency compared to text-based communication.

Multi-LLM systems harness the complementary strengths of diverse Large Language Models, achieving performance and efficiency gains unattainable by a single model. In existing designs, LLMs communicate through text, forcing internal representations to be transformed into output token sequences. This process both loses rich semantic information and incurs token-by-token generation latency. Motivated by these limitations, we ask: Can LLMs communicate beyond text? Oracle experiments show that enriching the KV-Cache semantics can improve response quality without increasing cache size, supporting KV-Cache as an effective medium for inter-model communication. Thus, we propose Cache-to-Cache (C2C), a new paradigm for direct semantic communication between LLMs. C2C uses a neural network to project and fuse the source model's KV-cache with that of the target model to enable direct semantic transfer. A learnable gating mechanism selects the target layers that benefit from cache communication. Compared with text communication, C2C utilizes the deep, specialized semantics from both models, while avoiding explicit intermediate text generation. Experiments show that C2C achieves 8.5-10.5% higher average accuracy than individual models. It further outperforms the text communication paradigm by approximately 3.0-5.0%, while delivering an average 2.0x speedup in latency. Our code is available at https://github.com/thu-nics/C2C.
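
For a concrete picture of the project-fuse-gate pipeline the abstract describes, here is a minimal PyTorch sketch. The module name (C2CFuser), the choice of linear projections, the scalar per-layer gate, and the assumption that the two models' batch, head, and sequence dimensions already line up are illustrative simplifications, not the paper's actual architecture.

```python
# Minimal sketch (not the authors' implementation) of one per-layer C2C fuser:
# project the source model's KV tensors into the target model's KV space,
# fuse them with the target's own cache, and blend via a learnable gate.
import torch
import torch.nn as nn


class C2CFuser(nn.Module):
    """Hypothetical per-layer fuser. Assumes the batch, head, and sequence
    dimensions of the two caches already match; only head_dim differs."""

    def __init__(self, src_dim: int, tgt_dim: int):
        super().__init__()
        self.k_proj = nn.Linear(src_dim, tgt_dim)   # source K -> target K space
        self.v_proj = nn.Linear(src_dim, tgt_dim)   # source V -> target V space
        self.k_fuse = nn.Linear(2 * tgt_dim, tgt_dim)
        self.v_fuse = nn.Linear(2 * tgt_dim, tgt_dim)
        self.gate = nn.Parameter(torch.zeros(1))    # scalar gate for this layer

    def forward(self, src_k, src_v, tgt_k, tgt_v):
        # src_*: [batch, heads, seq, src_dim]; tgt_*: [batch, heads, seq, tgt_dim]
        fused_k = self.k_fuse(torch.cat([tgt_k, self.k_proj(src_k)], dim=-1))
        fused_v = self.v_fuse(torch.cat([tgt_v, self.v_proj(src_v)], dim=-1))
        g = torch.sigmoid(self.gate)
        # Layers that do not benefit from cache communication can learn g -> 0
        # and fall back to the target model's original cache.
        new_k = g * fused_k + (1 - g) * tgt_k
        new_v = g * fused_v + (1 - g) * tgt_v
        return new_k, new_v
```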

Community

Paper author Paper submitter

Can LLMs communicate beyond text? We explore Cache-to-Cache (C2C) as a new multi-LLM communication paradigm. It directly projects and fuses KV-caches between models to transfer semantics, achieving ~8.5–10.5% average accuracy gains over single models, ~3.0–5.0% over text-based exchange, and ~2× lower latency. Code is open-sourced.
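
As a rough illustration of the end-to-end flow, the sketch below applies one such fuser per target layer after both models have prefilled the same prompt; the one-to-one layer pairing and the C2CFuser module from the sketch above are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical wiring: fuse the per-layer caches produced by prefilling the
# same prompt with the source and target models, then let the target model
# decode from the fused cache instead of consuming intermediate text.
import torch.nn as nn


def fuse_caches(fusers: nn.ModuleList, source_cache, target_cache):
    """source_cache / target_cache: lists of (key, value) tensors, one per
    target layer, e.g. extracted from each model's prefill pass."""
    fused = []
    for fuser, (src_k, src_v), (tgt_k, tgt_v) in zip(fusers, source_cache, target_cache):
        fused.append(fuser(src_k, src_v, tgt_k, tgt_v))
    return fused
```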

I love the concept of a learnable gating mechanism for selective layer fusion. The 2× latency speedup is great progress.

Paper author

Thank you so much for your interest and kind words about our work! 😊
We’re really excited to see how the community will build on Cache-to-Cache communication and push this direction further.

Models citing this paper 1

Datasets citing this paper 0

Spaces citing this paper 0

Collections including this paper 12