Cache‑to‑Cache: LLMs Share Knowledge Without Text Tokens

Cache-to-Cache (C2C) is a new communication paradigm that lets large language models (LLMs) share knowledge without exchanging token-level text. Instead of sending strings of words, each model exposes its key-value (KV) cache, the intermediate attention activations stored during inference. By fusing the caches of two or more LLMs, the models can directly ingest each other's semantic representations, effectively creating a low-latency, bandwidth-efficient communication channel. This sidesteps the overhead of generating and re-encoding tokens, making collaborative inference faster and more scalable, especially in edge or distributed settings where network constraints are a concern.
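
To make the idea concrete, the following PyTorch sketch shows one way per-layer cache fusion could look: a small module projects a sharer model's key and value tensors into the receiver's hidden dimension and blends them in with a learned gate. The module name `C2CFuser`, the dimensions, and the gating design are illustrative assumptions for this sketch, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class C2CFuser(nn.Module):
    """Hypothetical per-layer fusion module: projects a sharer model's
    KV-cache entries into the receiver's space and blends them in with a
    learned gate. A sketch of the general idea, not the published design."""

    def __init__(self, sharer_dim: int, receiver_dim: int):
        super().__init__()
        self.proj_k = nn.Linear(sharer_dim, receiver_dim)
        self.proj_v = nn.Linear(sharer_dim, receiver_dim)
        # The gate decides, per hidden unit, how much projected cache to inject.
        self.gate = nn.Sequential(
            nn.Linear(2 * receiver_dim, receiver_dim), nn.Sigmoid()
        )

    def forward(self, recv_k, recv_v, shr_k, shr_v):
        # recv_*: [batch, seq, receiver_dim]; shr_*: [batch, seq, sharer_dim]
        # (attention heads are flattened into the hidden dimension for simplicity)
        pk, pv = self.proj_k(shr_k), self.proj_v(shr_v)
        gk = self.gate(torch.cat([recv_k, pk], dim=-1))
        gv = self.gate(torch.cat([recv_v, pv], dim=-1))
        # Enrich, rather than replace, the receiver's own cache.
        return recv_k + gk * pk, recv_v + gv * pv

# Toy usage: fuse one layer's caches over an aligned 16-token span.
fuser = C2CFuser(sharer_dim=512, receiver_dim=768)
recv_k, recv_v = torch.randn(1, 16, 768), torch.randn(1, 16, 768)
shr_k, shr_v = torch.randn(1, 16, 512), torch.randn(1, 16, 512)
fused_k, fused_v = fuser(recv_k, recv_v, shr_k, shr_v)
print(fused_k.shape)  # torch.Size([1, 16, 768])
```

Gated addition is only one plausible fusion rule; the key property is that the receiver keeps attending over its own cache while absorbing whatever the sharer's representations add.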

The study, conducted by a team of researchers from Tsinghua University, Infinigence AI, The Chinese University of Hong Kong, Shanghai AI Laboratory, and Shanghai Jiao Tong University, demonstrates how KV-cache fusion can be orchestrated through lightweight protocol exchanges that annotate cache slots with metadata. The authors implemented a prototype in which two GPT-style models jointly generate a paragraph: one model's cache is merged into the other's, yielding a reported 30% reduction in token-level communication and a 15% decrease in overall inference time. Beyond speed, the approach preserves privacy, since no raw text is transmitted, only abstract activations. The team also envisions broader applications, such as collaborative fine-tuning across institutions, real-time multi-model dialogue systems, and federated learning scenarios where bandwidth and privacy constraints dominate.
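
The description of "lightweight protocol exchanges that annotate cache slots with metadata" suggests that each transmitted KV slice travels inside a small envelope. The sketch below is one guess at what such an envelope might contain; every field name is hypothetical, chosen only to show what a receiver would plausibly need (source model, layer index, token range, hidden size) before attempting fusion.

```python
from dataclasses import dataclass
import torch

@dataclass
class CacheSlotMessage:
    """Illustrative envelope for one layer's KV-cache slice. The field set
    is an assumption for this sketch, not the authors' wire format."""
    sender_model: str      # identifier of the sharing model
    layer_index: int       # transformer layer the slice belongs to
    token_range: tuple     # (start, end) positions the slice covers
    hidden_dim: int        # sharer hidden size, used to pick a projection
    keys: torch.Tensor     # [seq, hidden_dim]
    values: torch.Tensor   # [seq, hidden_dim]

def pack_layer_cache(model_name, layer_idx, k, v, start_pos=0):
    """Wrap a layer's KV slice with the metadata a receiver needs for fusion."""
    seq_len, dim = k.shape
    return CacheSlotMessage(model_name, layer_idx,
                            (start_pos, start_pos + seq_len), dim, k, v)

msg = pack_layer_cache("sharer-llm", layer_idx=0,
                       k=torch.randn(16, 512), v=torch.randn(16, 512))
print(msg.layer_index, msg.token_range)  # 0 (0, 16)
```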

While C2C shows impressive gains, the authors acknowledge several open challenges. First, aligning cache formats across heterogeneous architectures requires standardization; mismatched tokenizers or layer depths can lead to noisy fusion. Second, exchanging cached activations raises security concerns, since the activations themselves might leak sensitive information if intercepted; the team proposes lightweight encryption and differential-privacy masks as mitigations (a rough sketch of such a mask follows below). Finally, scaling C2C beyond two models will demand efficient scheduler logic to orchestrate cache swaps without disrupting inference pipelines. Despite these hurdles, the research opens a fresh avenue for LLM ecosystems, hinting at a future where models collaborate as semantic partners rather than text generators. As the field moves toward more efficient, privacy-preserving AI, cache-to-cache communication could become a cornerstone technique for distributed intelligence.
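
As one illustration of the privacy mitigation mentioned above, a differential-privacy-style mask could clip and noise the cached activations before they leave the sender. The helper below is a rough sketch under that assumption; clip_norm and sigma are placeholder parameters, not values from the paper.

```python
import torch

def dp_mask_cache(k: torch.Tensor, v: torch.Tensor,
                  clip_norm: float = 1.0, sigma: float = 0.1):
    """Illustrative DP-style masking of KV activations prior to transmission:
    clip each token's activation norm, then add Gaussian noise. A sketch of
    the mitigation the article mentions, not the authors' published scheme."""
    def mask(x):
        norms = x.norm(dim=-1, keepdim=True).clamp(min=1e-6)
        x = x * (clip_norm / norms).clamp(max=1.0)   # per-token norm clipping
        return x + sigma * torch.randn_like(x)       # Gaussian noise mask
    return mask(k), mask(v)

masked_k, masked_v = dp_mask_cache(torch.randn(16, 512), torch.randn(16, 512))
```

A real deployment would calibrate the noise to a formal privacy budget; the point of the sketch is only that the masking happens entirely on the sender's side, before any activation crosses the network.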
