MarkTechPost

kvcached: Elastic KV Cache for LLM Serving on Shared GPUs

Large language model (LLM) inference workloads typically reserve a large, static key‑value (KV) cache region on the GPU for each model, even when requests are bursty or idle. This strategy leads to significant memory waste and limits the number of models that can be served concurrently on a shared GPU. UC Berkeley’s Sky Computing Lab has addressed this inefficiency with kvcached, a lightweight library that introduces a virtualized, elastic KV cache layer.
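To see the scale of the problem, a back‑of‑the‑envelope calculation helps. The sketch below uses a hypothetical 32‑layer model with 8 KV heads of dimension 128 served in fp16; the context length and batch cap are illustrative assumptions, not figures from the kvcached project.

```c
/* Back-of-the-envelope KV-cache footprint for a hypothetical model.
 * All shapes and limits below are illustrative assumptions. */
#include <stdio.h>

int main(void) {
    const long long layers = 32, kv_heads = 8, head_dim = 128, bytes_per_elem = 2; /* fp16 */
    const long long ctx_len = 4096;   /* context reserved per sequence          */
    const long long max_seqs = 16;    /* static reservation sized for peak load */

    /* K and V per token, across all layers */
    long long per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem;
    long long reserved  = per_token * ctx_len * max_seqs;

    printf("KV bytes per token: %lld (%.0f KiB)\n", per_token, per_token / 1024.0);
    printf("Static reservation for peak load: %.1f GiB\n",
           reserved / (1024.0 * 1024.0 * 1024.0));
    /* ~8 GiB stays reserved even when only a handful of requests are active,
     * blocking other models from using that memory on a shared GPU. */
    return 0;
}
```

Under these assumptions, the cache region alone ties up roughly 8 GiB whether or not any requests are in flight, which is exactly the waste kvcached targets.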

kvcached works by abstracting physical GPU memory into a virtual KV cache pool that multiple models can share. Each model reserves a large virtual address range for its cache, but physical memory is mapped into that range only on demand: when a request needs KV blocks, the library backs just the required pages from the shared pool and releases them once inference completes. This on-demand mapping ensures that physical memory is consumed only while it is actually in use, allowing a single GPU to host more models or larger batch sizes without exceeding its physical limits. The implementation builds on CUDA's virtual memory management APIs and integrates with popular inference engines such as SGLang and vLLM, requiring only minimal changes to existing deployment pipelines.
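The underlying pattern can be illustrated with the CUDA driver's virtual memory management calls: reserve a large virtual address range once, then map and unmap physical chunks as KV demand grows and shrinks. The following is a minimal, self‑contained sketch of that pattern, not kvcached's actual code; sizes and error handling are simplified assumptions.

```c
/* Minimal sketch: grow and shrink a KV-cache region by mapping physical
 * memory into a pre-reserved virtual range (CUDA driver VMM API). */
#include <cuda.h>
#include <stdio.h>

#define CHECK(call) do { CUresult r = (call); if (r != CUDA_SUCCESS) { \
    fprintf(stderr, "CUDA error %d at line %d\n", (int)r, __LINE__); return 1; } } while (0)

int main(void) {
    CHECK(cuInit(0));
    CUdevice dev;  CHECK(cuDeviceGet(&dev, 0));
    CUcontext ctx; CHECK(cuCtxCreate(&ctx, 0, dev));

    /* Physical allocations: pinned device memory on this GPU. */
    CUmemAllocationProp prop = {0};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;

    size_t gran;
    CHECK(cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM));

    /* 1) Reserve a large virtual range up front -- this costs no GPU memory yet. */
    size_t va_size = 64ull * gran;              /* illustrative cap for this cache */
    CUdeviceptr va_base;
    CHECK(cuMemAddressReserve(&va_base, va_size, 0, 0, 0));

    /* 2) When a request needs more KV blocks, back one chunk of the range
     *    with physical memory and make it accessible. */
    size_t chunk = gran;
    CUmemGenericAllocationHandle handle;
    CHECK(cuMemCreate(&handle, chunk, &prop, 0));
    CHECK(cuMemMap(va_base, chunk, 0, handle, 0));

    CUmemAccessDesc access = {0};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    CHECK(cuMemSetAccess(va_base, chunk, &access, 1));

    /* ... KV tensors for active requests live inside [va_base, va_base + chunk) ... */

    /* 3) When requests finish, unmap and release the chunk so other models
     *    sharing the GPU can reuse the physical memory; the virtual range stays. */
    CHECK(cuMemUnmap(va_base, chunk));
    CHECK(cuMemRelease(handle));

    CHECK(cuMemAddressFree(va_base, va_size));
    CHECK(cuCtxDestroy(ctx));
    printf("grew and shrank an elastic KV region (chunk granularity: %zu bytes)\n", gran);
    return 0;
}
```

Because unmapped portions of a reserved range consume no physical memory, each model can hold a full‑size virtual cache while the GPU's physical memory is shared elastically among all of them.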

The impact of kvcached extends beyond memory savings. By eliminating over‑provisioned cache reservations, memory utilization improves, which can translate into lower inference latency and higher overall GPU throughput. Early benchmarks show up to a 30% increase in the number of concurrent LLM requests on a 16‑GB GPU, with inference times comparable to or better than static caching. kvcached's design also makes it suitable for both cloud data centers and edge deployments, where GPU memory is scarce and expensive. Future work aims to incorporate predictive cache sizing based on traffic patterns and to explore multi‑GPU scaling with hierarchical cache layers.

In summary, kvcached represents a practical step toward more elastic, efficient LLM serving infrastructures, enabling developers to maximize GPU utilization without compromising performance.
