Learning Key-Value Caching

The growing size of Large Language Models (LLMs) makes efficient inference challenging, primarily due to the memory footprint of the Key-Value (KV) cache used in autoregressive decoding. Existing eviction and compression methods reduce this cost but rely on heuristics, such as recency or accumulated attention scores, which are only indirect proxies for a token's future usefulness and introduce computational overhead. We reformulate KV cache eviction as a reinforcement learning (RL) problem: learning to score tokens by their predicted utility in future decoding steps. To this end, we present KV Policy (KVP), a framework of lightweight per-head RL agents trained offline on generation traces, using only the key and value vectors themselves. Each agent learns a specialized eviction policy guided by a future-utility reward, yielding token rankings that hold across all cache budgets while requiring no modification to the base LLM and no additional inputs at inference time. Evaluated on two different model families on the long-context benchmark RULER and the multi-turn chat benchmark OASST2-4k, KVP significantly outperforms the baselines. Furthermore, zero-shot evaluations on held-out tasks (e.g., LongBench, BoolQ, ARC) show that KVP generalizes beyond its training distribution and to longer context lengths. These results indicate that learning to predict future token utility is a powerful and promising paradigm for dynamic KV cache management.
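
The abstract leaves the mechanics implicit, so here is a minimal sketch of the core idea at inference time. Everything in it is an assumption for illustration, not the authors' implementation: the UtilityScorer module, the evict helper, and the layer sizes are hypothetical. The point it demonstrates is the one the abstract makes: a tiny per-head network scores each cached token from its key and value vectors alone, and because the scorer produces a full ranking, the same learned policy can serve any cache budget.

```python
# Hypothetical sketch of per-head utility scoring and budgeted KV eviction.
# Not the paper's code; names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class UtilityScorer(nn.Module):
    """Lightweight per-head agent: scores a cached token from its K/V vectors."""

    def __init__(self, head_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * head_dim, hidden),  # concatenated key and value
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, keys: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
        # keys, values: (seq_len, head_dim) -> predicted utilities: (seq_len,)
        return self.net(torch.cat([keys, values], dim=-1)).squeeze(-1)


def evict(keys, values, scorer, budget: int):
    """Keep the `budget` tokens with the highest predicted future utility.

    The scorer yields a total ranking over cached tokens, so one trained
    policy applies unchanged at any budget.
    """
    scores = scorer(keys, values)                        # (seq_len,)
    keep = torch.topk(scores, k=min(budget, len(scores))).indices
    keep, _ = torch.sort(keep)                           # preserve token order
    return keys[keep], values[keep]


# Toy usage: one attention head with 128 cached tokens, keep the top 32.
head_dim = 64
scorer = UtilityScorer(head_dim)
k, v = torch.randn(128, head_dim), torch.randn(128, head_dim)
k_kept, v_kept = evict(k, v, scorer, budget=32)
print(k_kept.shape)  # torch.Size([32, 64])
```

Note the design choice this sketch highlights: the scorer sees only keys and values, never attention maps, which is what lets such a policy run without modifying the base LLM or collecting extra statistics during decoding.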
