Cache-to-Cache: Direct Semantic Communication Between Large Language Models
Abstract
Cache-to-Cache (C2C) enables direct semantic communication between LLMs using neural network projections, improving accuracy and reducing latency compared to text-based communication.
Multi-LLM systems harness the complementary strengths of diverse Large Language Models, achieving performance and efficiency gains unattainable by a single model. In existing designs, LLMs communicate through text, forcing internal representations to be transformed into output token sequences. This process both loses rich semantic information and incurs token-by-token generation latency. Motivated by these limitations, we ask: Can LLMs communicate beyond text? Oracle experiments show that enriching the KV-Cache semantics can improve response quality without increasing cache size, supporting KV-Cache as an effective medium for inter-model communication. Thus, we propose Cache-to-Cache (C2C), a new paradigm for direct semantic communication between LLMs. C2C uses a neural network to project and fuse the source model's KV-cache with that of the target model to enable direct semantic transfer. A learnable gating mechanism selects the target layers that benefit from cache communication. Compared with text communication, C2C utilizes the deep, specialized semantics from both models, while avoiding explicit intermediate text generation. Experiments show that C2C achieves 8.5-10.5% higher average accuracy than individual models. It further outperforms the text communication paradigm by approximately 3.0-5.0%, while delivering an average 2.0x speedup in latency. Our code is available at https://github.com/thu-nics/C2C.
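For intuition, here is a minimal PyTorch sketch of the core idea described above: project a source model's per-layer KV-cache into the target model's cache space and fuse the two through a learnable, layer-wise gate. The `CacheFuser` class, its parameter names, and the residual-style fusion are illustrative assumptions rather than the authors' implementation, and the sketch assumes both caches cover the same token sequence with compatible head layouts.

```python
import torch
import torch.nn as nn


class CacheFuser(nn.Module):
    """Project a source model's KV-cache into the target model's cache space
    and fuse the two with a learnable, layer-wise gate (illustrative sketch)."""

    def __init__(self, src_dim: int, tgt_dim: int, num_layers: int):
        super().__init__()
        # One small MLP per target layer: source cache dim -> target cache dim.
        self.proj = nn.ModuleList([
            nn.Sequential(nn.Linear(src_dim, tgt_dim), nn.SiLU(), nn.Linear(tgt_dim, tgt_dim))
            for _ in range(num_layers)
        ])
        # Learnable per-layer gate logits; the sigmoid decides how strongly each
        # target layer takes in the projected source semantics.
        self.gate_logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, src_kv, tgt_kv):
        """src_kv / tgt_kv: lists of (key, value) tensors per layer, each of shape
        (batch, num_heads, seq_len, head_dim). Returns a fused cache shaped like tgt_kv."""
        fused = []
        for layer_idx, ((src_k, src_v), (tgt_k, tgt_v)) in enumerate(zip(src_kv, tgt_kv)):
            gate = torch.sigmoid(self.gate_logits[layer_idx])
            # Project source keys/values into the target representation space.
            proj_k = self.proj[layer_idx](src_k)
            proj_v = self.proj[layer_idx](src_v)
            # Gated residual fusion: keep the target cache, blend in source semantics.
            fused.append((tgt_k + gate * proj_k, tgt_v + gate * proj_v))
        return fused
```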
Community
Can LLMs communicate beyond text? We explore Cache-to-Cache (C2C) as a new multi-LLM communication paradigm. It directly projects and fuses KV-caches between models to transfer semantics, achieving ~8.5–10.5% average accuracy gains over single models, ~3.0–5.0% over text-based exchange, and ~2× lower latency. Code is open-sourced.
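As a toy usage of the `CacheFuser` sketch above, the snippet below fuses randomly generated stand-in caches. The shapes (2 layers, 4 heads, 16 tokens, head dimension 64) are arbitrary assumptions; in an actual pipeline, the caches would come from each model's prefill pass over the shared prompt, after which the target model decodes from the fused cache.

```python
import torch

# Hypothetical cache shapes standing in for two models' prefill caches.
num_layers, batch, heads, seq, head_dim = 2, 1, 4, 16, 64
src_kv = [(torch.randn(batch, heads, seq, head_dim),
           torch.randn(batch, heads, seq, head_dim)) for _ in range(num_layers)]
tgt_kv = [(torch.randn(batch, heads, seq, head_dim),
           torch.randn(batch, heads, seq, head_dim)) for _ in range(num_layers)]

fuser = CacheFuser(src_dim=head_dim, tgt_dim=head_dim, num_layers=num_layers)
fused_kv = fuser(src_kv, tgt_kv)
# The target model would then decode its answer conditioned on fused_kv instead of tgt_kv.
print(fused_kv[0][0].shape)  # torch.Size([1, 4, 16, 64])
```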
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- EpiCache: Episodic KV Cache Management for Long Conversational Question Answering (2025)
- SemShareKV: Efficient KVCache Sharing for Semantically Similar Prompts via Token-Level LSH Matching (2025)
- PagedEviction: Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference (2025)
- ILRe: Intermediate Layer Retrieval for Context Compression in Causal Language Models (2025)
- KaVa: Latent Reasoning via Compressed KV-Cache Distillation (2025)
- Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction (2025)
- CCF: A Context Compression Framework for Efficient Long-Sequence Language Modeling (2025)
I love the concept of a learnable gating mechanism for selective layer fusion. The 2× speedup in latency is great progress.
Thank you so much for your interest and kind words about our work! 😊
We’re really excited to see how the community will build on Cache-to-Cache communication and push this direction further.