Cache-to-Cache: Direct Semantic Communication Between Large Language Models
Abstract
Cache-to-Cache (C2C) enables direct semantic communication between LLMs using neural network projections, improving accuracy and reducing latency compared to text-based communication.
Multi-LLM systems harness the complementary strengths of diverse Large Language Models, achieving performance and efficiency gains unattainable by a single model. In existing designs, LLMs communicate through text, forcing internal representations to be transformed into output token sequences. This process both loses rich semantic information and incurs token-by-token generation latency. Motivated by these limitations, we ask: Can LLMs communicate beyond text? Oracle experiments show that enriching the KV-Cache semantics can improve response quality without increasing cache size, supporting KV-Cache as an effective medium for inter-model communication. Thus, we propose Cache-to-Cache (C2C), a new paradigm for direct semantic communication between LLMs. C2C uses a neural network to project and fuse the source model's KV-cache with that of the target model to enable direct semantic transfer. A learnable gating mechanism selects the target layers that benefit from cache communication. Compared with text communication, C2C utilizes the deep, specialized semantics from both models, while avoiding explicit intermediate text generation. Experiments show that C2C achieves 8.5-10.5% higher average accuracy than individual models. It further outperforms the text communication paradigm by approximately 3.0-5.0%, while delivering an average 2.0x speedup in latency. Our code is available at https://github.com/thu-nics/C2C.
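For intuition, here is a minimal PyTorch sketch of the core idea described above: project a source model's per-layer KV-cache into the target model's cache space and fuse the two through a learnable, layer-wise gate. The `CacheFuser` class, its parameter names, and the residual-style fusion are illustrative assumptions rather than the authors' implementation, and the sketch assumes both caches cover the same token sequence with compatible head layouts.

```python
import torch
import torch.nn as nn


class CacheFuser(nn.Module):
    """Project a source model's KV-cache into the target model's cache space
    and fuse the two with a learnable, layer-wise gate (illustrative sketch)."""

    def __init__(self, src_dim: int, tgt_dim: int, num_layers: int):
        super().__init__()
        # One small MLP per target layer: source cache dim -> target cache dim.
        self.proj = nn.ModuleList([
            nn.Sequential(nn.Linear(src_dim, tgt_dim), nn.SiLU(), nn.Linear(tgt_dim, tgt_dim))
            for _ in range(num_layers)
        ])
        # Learnable per-layer gate logits; the sigmoid decides how strongly each
        # target layer takes in the projected source semantics.
        self.gate_logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, src_kv, tgt_kv):
        """src_kv / tgt_kv: lists of (key, value) tensors per layer, each of shape
        (batch, num_heads, seq_len, head_dim). Returns a fused cache shaped like tgt_kv."""
        fused = []
        for layer_idx, ((src_k, src_v), (tgt_k, tgt_v)) in enumerate(zip(src_kv, tgt_kv)):
            gate = torch.sigmoid(self.gate_logits[layer_idx])
            # Project source keys/values into the target representation space.
            proj_k = self.proj[layer_idx](src_k)
            proj_v = self.proj[layer_idx](src_v)
            # Gated residual fusion: keep the target cache, blend in source semantics.
            fused.append((tgt_k + gate * proj_k, tgt_v + gate * proj_v))
        return fused
```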
Community
Can LLMs communicate beyond text? We explore Cache-to-Cache (C2C) as a new multi-LLM communication paradigm. It directly projects and fuses KV-caches between models to transfer semantics, achieving ~8.5–10.5% average accuracy gains over single models, ~3.0–5.0% over text-based exchange, and ~2× lower latency. Code is open-sourced.
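As a toy usage of the `CacheFuser` sketch above, the snippet below fuses randomly generated stand-in caches. The shapes (2 layers, 4 heads, 16 tokens, head dimension 64) are arbitrary assumptions; in an actual pipeline, the caches would come from each model's prefill pass over the shared prompt, after which the target model decodes from the fused cache.

```python
import torch

# Hypothetical cache shapes standing in for two models' prefill caches.
num_layers, batch, heads, seq, head_dim = 2, 1, 4, 16, 64
src_kv = [(torch.randn(batch, heads, seq, head_dim),
           torch.randn(batch, heads, seq, head_dim)) for _ in range(num_layers)]
tgt_kv = [(torch.randn(batch, heads, seq, head_dim),
           torch.randn(batch, heads, seq, head_dim)) for _ in range(num_layers)]

fuser = CacheFuser(src_dim=head_dim, tgt_dim=head_dim, num_layers=num_layers)
fused_kv = fuser(src_kv, tgt_kv)
# The target model would then decode its answer conditioned on fused_kv instead of tgt_kv.
print(fused_kv[0][0].shape)  # torch.Size([1, 4, 16, 64])
```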
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- EpiCache: Episodic KV Cache Management for Long Conversational Question Answering (2025)
- SemShareKV: Efficient KVCache Sharing for Semantically Similar Prompts via Token-Level LSH Matching (2025)
- PagedEviction: Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference (2025)
- ILRe: Intermediate Layer Retrieval for Context Compression in Causal Language Models (2025)
- KaVa: Latent Reasoning via Compressed KV-Cache Distillation (2025)
- Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction (2025)
- CCF: A Context Compression Framework for Efficient Long-Sequence Language Modeling (2025)
I love the concept of a learnable gating mechanism for selective layer fusion. The 2× speedup in latency is great progress.
Thank you so much for your interest and kind words about our work! 😊
We’re really excited to see how the community will build on Cache-to-Cache communication and push this direction further.