Stochastic KV Routing Enables Intelligent Depth-Wise Cache Sharing

Serving transformer language models at high throughput requires caching Key-Value (KV) states to avoid redundant computation during autoregressive generation. The memory footprint of the KV cache is substantial and has a significant impact on serving costs; this work aims to reduce those memory requirements. While recent work has mainly addressed KV cache reduction through compression and eviction along the temporal (sequence) axis, we argue that the depth dimension offers an orthogonal avenue for optimization. Although prior research suggests that caching the KV states of every layer is redundant, implementing cross-layer cache sharing remains a practical challenge: existing methods often suffer from degraded output quality or increased time-to-first-token. In this paper, we show that per-layer KV caches can be discarded without losing information or sacrificing performance. We propose a simple training technique, randomized cross-layer attention: during training, each layer randomly chooses to attend either to its own KV states or to those of a preceding layer. This stochastic process makes the model robust to a variety of cache-sharing strategies, providing flexibility under hardware constraints that are unknown at deployment time. Our experiments show that applying this technique during pre-training or fine-tuning enables depth-wise cache sharing across different model families. Moreover, for larger models in data-constrained settings, the approach exhibits a regularization-like effect, often preserving or even improving performance while significantly reducing cache memory.
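
To make the mechanism concrete, the sketch below shows one plausible form of randomized cross-layer attention in PyTorch. It is a minimal sketch under our own assumptions, not the authors' implementation: the class name `StochasticKVAttention`, the `share_prob` parameter, and the route-from-the-immediately-preceding-layer pattern are hypothetical choices made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StochasticKVAttention(nn.Module):
    """Self-attention layer that, during training, randomly attends either to
    its own KV states or to KV states routed from the preceding layer.

    A hypothetical sketch: the names, share probability, and routing pattern
    are illustrative assumptions, not the paper's actual implementation.
    """

    def __init__(self, d_model: int, n_heads: int, share_prob: float = 0.5):
        super().__init__()
        assert d_model % n_heads == 0
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)
        self.n_heads = n_heads
        self.share_prob = share_prob  # chance of reusing the previous layer's KV

    def forward(self, x, prev_kv=None, share=None):
        B, T, D = x.shape
        H, hd = self.n_heads, D // self.n_heads

        q = self.q_proj(x).view(B, T, H, hd).transpose(1, 2)  # (B, H, T, hd)
        k = self.k_proj(x).view(B, T, H, hd).transpose(1, 2)
        v = self.v_proj(x).view(B, T, H, hd).transpose(1, 2)

        # Stochastic routing: during training, flip a coin; when it lands on
        # "share", discard this layer's fresh KV and attend to the previous
        # layer's states instead. At deployment, `share` is instead fixed by
        # whichever cache-sharing plan the hardware budget dictates.
        if share is None:
            can_share = self.training and prev_kv is not None
            share = can_share and torch.rand(()).item() < self.share_prob
        if share and prev_kv is not None:
            k, v = prev_kv

        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(B, T, D)
        # Return the KV actually attended to, so the next layer may reuse it.
        return self.o_proj(out), (k, v)


# Usage: stack layers and thread each layer's KV states into the next one.
layers = nn.ModuleList(StochasticKVAttention(512, 8) for _ in range(4))
x, kv = torch.randn(2, 16, 512), None
for layer in layers:
    x, kv = layer(x, prev_kv=kv)
```

Under this reading, any layer deployed with `share=True` needs no KV-cache slot of its own, so cache memory shrinks roughly in proportion to the fraction of sharing layers; because training exposed the model to many random routings, the exact sharing pattern can be chosen only after the hardware budget is known.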



