KV Cache Compression Race: TurboQuant vs OSCAR vs EpiCache

Large language models (LLMs) face a memory problem that has nothing to do with model weights. During decoding, transformers store the key and value (KV) vectors of every token in every layer so that they do not have to be recomputed. This cache grows linearly with sequence length and batch size, and in long-context settings with large batches it can dwarf the footprint of the model itself.

Consider Llama-3.1-70B in BF16. Its KV cache costs about 0.31 MB per token (80 layers × 8 KV heads × 128 head dims × 2 tensors × 2 bytes). For 128K tokens that is ~40 GB; for 1M tokens it exceeds 300 GB, more than the ~140 GB of weights themselves. Worse, each newly decoded token must stream the entire cache out of high-bandwidth memory (HBM), making decoding bandwidth-bound rather than compute-bound. Shrinking the KV cache is therefore a direct lever for cutting both cost and decoding latency.
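The arithmetic above can be checked in a few lines. This is a minimal sketch: the function name is hypothetical, and it assumes the quoted "MB" and "GB" figures are binary units (MiB and GiB), which is the reading under which the numbers match.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    # The factor of 2 covers the two tensors stored per layer: keys and values.
    return layers * kv_heads * head_dim * 2 * bytes_per_elem

MIB, GIB = 2**20, 2**30
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)

print(f"per token:   {per_token / MIB:.2f} MiB")
print(f"128K tokens: {128 * 1024 * per_token / GIB:.0f} GiB")
print(f"1M tokens:   {1024 * 1024 * per_token / GIB:.0f} GiB")
```

The same function makes it easy to see why grouped-query attention matters: the cost scales with the number of KV heads, not the number of query heads.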

Current methods fall into five families: token eviction (H2O, SnapKV), quantization (KIVI, GEAR), low-rank projection (Palu), merging (KVMerger), and architectural sharing (MLA). Recent work in 2026 has been pushing hard at the frontier of ultra-low-bit quantization. Google and NYU's TurboQuant (ICLR 2026) and Together AI's OSCAR attack the same problem from different angles, while Apple's EpiCache tackles a problem that neither of them addresses.

Most KV quantizers fight the same basic enemy: outlier channels, a handful of channels with disproportionately large magnitude that dominate the quantization range and squeeze the rest of the signal into a few representable levels. That is why naive INT2 quantization (only four levels) drops accuracy to almost zero.

KIVI established the baseline here. It showed that key vectors have persistent outlier channels across all tokens while value vectors do not, so it quantizes keys per channel and values per token. That tuning-free 2-bit recipe cuts peak memory (weights included) by about 2.6×, and is the reference point on which newer methods build.
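The effect of an outlier channel on grouping choice is easy to reproduce. The following is a minimal sketch on synthetic data, not KIVI's implementation: the helper name, the tensor shape, and the single shifted channel are all assumptions made for illustration.

```python
import numpy as np

def quantize_dequantize(x, bits, axis):
    """Asymmetric uniform quantization with one (min, scale) pair per group.

    axis=0 reduces over tokens, giving one range per channel;
    axis=1 reduces over channels, giving one range per token.
    """
    levels = 2**bits - 1
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / levels
    codes = np.round((x - lo) / scale)
    return codes * scale + lo

rng = np.random.default_rng(0)
keys = rng.standard_normal((256, 64))   # (tokens, channels), synthetic
keys[:, 5] += 40.0                      # hypothetical persistent outlier channel

err_per_channel = np.mean((keys - quantize_dequantize(keys, 2, axis=0)) ** 2)
err_per_token = np.mean((keys - quantize_dequantize(keys, 2, axis=1)) ** 2)
print(f"INT2 MSE per channel: {err_per_channel:.3f}, per token: {err_per_token:.3f}")
```

With per-token grouping, every token's range is stretched by the outlier, so the ordinary channels collapse onto a single level. Per-channel grouping confines the outlier to its own range.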

TurboQuant: data-agnostic and theoretically grounded

TurboQuant handles outliers without looking at your data, in two stages:

First stage: each vector is randomly rotated so that its coordinates are nearly identically distributed and nearly independent, which allows a precomputed (Lloyd–Max) scalar quantizer to be applied to each coordinate.
Second stage: a 1-bit Quantized Johnson–Lindenstrauss (QJL) transform is applied to the residual, which yields an unbiased estimate of the attention inner product without storing per-block quantization constants.
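The two stages can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the paper's implementation: all function names are hypothetical, the codebook is fitted by running Lloyd's algorithm on standard normal samples (a Gaussian approximation of the rotated coordinates), and a dense Gaussian matrix stands in for whatever projection a production kernel would use.

```python
import numpy as np

def random_rotation(d, rng):
    # QR of a Gaussian matrix yields a random orthogonal matrix.
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def lloyd_max_levels(bits, rng, n=50_000, iters=30):
    # Lloyd's algorithm on N(0, 1) samples approximates a Lloyd-Max codebook.
    samples = rng.standard_normal(n)
    levels = np.quantile(samples, (np.arange(2**bits) + 0.5) / 2**bits)
    for _ in range(iters):
        nearest = np.abs(samples[:, None] - levels).argmin(axis=1)
        levels = np.array([samples[nearest == j].mean() for j in range(levels.size)])
    return levels

def stage1(x, rot, levels):
    # Rotate, rescale so coordinates look like N(0, 1), snap to the codebook.
    norm = np.linalg.norm(x)
    z = rot @ x / norm * np.sqrt(x.size)
    codes = np.abs(z[:, None] - levels).argmin(axis=1)
    x_hat = norm / np.sqrt(x.size) * (rot.T @ levels[codes])
    return codes, x_hat

def qjl_encode(residual, sketch):
    # Keep one sign bit per projected coordinate, plus the residual norm.
    return np.sign(sketch @ residual), np.linalg.norm(residual)

def qjl_inner(query, signs, res_norm, sketch):
    # Unbiased estimate of <query, residual> recovered from the sign bits.
    m = sketch.shape[0]
    return np.sqrt(np.pi / 2) / m * res_norm * np.dot(sketch @ query, signs)

rng = np.random.default_rng(0)
d = 128
key, query = rng.standard_normal(d), rng.standard_normal(d)
rot = random_rotation(d, rng)
levels = lloyd_max_levels(bits=2, rng=rng)
sketch = rng.standard_normal((d, d))    # Gaussian projection shared by all keys

codes, key_hat = stage1(key, rot, levels)
signs, res_norm = qjl_encode(key - key_hat, sketch)
estimate = query @ key_hat + qjl_inner(query, signs, res_norm, sketch)
print(f"true score {query @ key:.3f}, estimate {estimate:.3f}")
```

Nothing in the sketch depends on the data: the rotation, the codebook, and the projection are all drawn once up front, which is what makes the approach calibration-free.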

The selling point is theoretical: TurboQuant's distortion is provably within a small constant factor (≈ 2.7×) of the information-theoretic lower bound. In practice it reaches perfect Needle-in-a-Haystack recall at 4× compression, and the paper reports absolute quality neutrality at 3.5 bits and only marginal degradation at 2.5 bits per channel. Because it requires no calibration, it works on any model without modification and doubles as a fast vector-database quantizer.

One caveat to flag: the oft-repeated figure of “8× faster attention on the H100” appears on the Google blog, not in the paper, and refers to a microbenchmark of the attention computation alone. TurboQuant's sweet spot is the 3–4 bit near-lossless regime.

OSCAR: attention-aware and ready for deployment

OSCAR bets differently. Its premise is that at the four levels of INT2, data-oblivious rotation is the wrong tool: blindly smoothing the range is not enough when almost no precision is left. So OSCAR derives attention-aware rotations from a one-time offline calibration pass: keys are rotated into the eigenbasis of the query covariance, and values into the eigenbasis of the covariance of the attention-score-weighted values. A Hadamard transform followed by a permutation then spreads channel importance evenly across all quantization groups.
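The key-side rotation can be sketched as follows. This is a minimal illustration on synthetic calibration data, not OSCAR's code: the function names are hypothetical, the uncentered second moment stands in for the covariance, and the permutation step is omitted.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction of a normalised Hadamard matrix; n must be a power of two.
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def key_rotation(calib_queries):
    # Eigenbasis of the query second-moment matrix, followed by Hadamard mixing.
    cov = calib_queries.T @ calib_queries / len(calib_queries)
    _, eigvecs = np.linalg.eigh(cov)
    return hadamard(cov.shape[0]) @ eigvecs.T

rng = np.random.default_rng(0)
d = 64
# Synthetic calibration queries with strongly uneven channel energy.
calib_q = rng.standard_normal((4096, d)) * np.linspace(0.1, 5.0, d)
rot = key_rotation(calib_q)

q, k = rng.standard_normal(d), rng.standard_normal(d)
# Applying the same orthogonal matrix to queries and keys preserves q . k.
print(np.isclose(q @ k, (rot @ q) @ (rot @ k)))

energy = ((calib_q @ rot.T) ** 2).mean(axis=0)
print(f"channel energy after rotation: min {energy.min():.3f}, max {energy.max():.3f}")
```

Because the rotation is orthogonal, it can be folded into the projection weights offline. On the calibration set, the Hadamard step leaves every rotated channel with the same average energy, which is the property that matters when only four levels are available per group.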

What sets OSCAR apart is that it ships as a complete system, not just an algorithm:

Mixed-precision paged cache: sink and recent tokens stay in BF16 while the history is packed into INT2; in a 128K context only ~0.24% of tokens remain in BF16.
Fused Triton kernels with full SGLang integration (compatible with paged attention and prefix caching).
Precomputed rotations (“RotationZoo”) for Qwen3-4B/8B/32B, GLM-4.7-FP8, and MiniMax-M2.7, so no recalibration is required.
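To get a feel for the mixed-precision split, here is a small back-of-the-envelope sketch. The sink and recent window sizes are hypothetical values chosen only because they land near the ~0.24% figure; they are not taken from OSCAR. The blended bit count ignores scale and zero-point metadata, so it is not expected to reproduce the effective-bit figures reported for the system.

```python
def bf16_fraction(context_len, sink_tokens, recent_tokens):
    # Share of tokens kept in BF16; everything else is stored in INT2.
    return min(context_len, sink_tokens + recent_tokens) / context_len

def blended_bits(frac_bf16, low_bits=2, high_bits=16):
    # Average payload bits per element, ignoring quantization metadata.
    return frac_bf16 * high_bits + (1 - frac_bf16) * low_bits

# Hypothetical window sizes, for illustration only.
frac = bf16_fraction(context_len=128 * 1024, sink_tokens=64, recent_tokens=256)
print(f"BF16 share: {frac:.2%}, blended payload bits: {blended_bits(frac):.2f}")
```

The takeaway is that the high-precision window is a fixed number of tokens, so its share of the cache shrinks as the context grows.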

At 2.28 effective bits, OSCAR sits within 1.42 points of BF16 on Qwen3-8B and is effectively on par on Qwen3-32B (0.02-point gap). On GLM-4.7-FP8, where naive INT2 collapses to zero and data-oblivious baselines reach only low scores, OSCAR matches BF16 and even edges slightly ahead on the reported benchmarks (within noise). Together AI reports up to 7.83× higher throughput and 8× KV cache memory reduction at 100K-token contexts, with up to ~3× faster decoding.

So which one wins?

Neither, really, and that's the honest answer. For usable INT2 at 128K tokens on supported models, OSCAR is currently the only featured option that doesn't collapse, and it comes with production-ready SGLang support. For training-free, model-agnostic quantization in the 3–4 bit regime, TurboQuant offers the broadest generality.

The OSCAR paper reports that TurboQuant falls by more than 40 points at a comparable budget. But that test runs inside the OSCAR framework, quantizes all layers, uses a single random seed, and operates below TurboQuant's target range, so it is a weak basis for a head-to-head verdict. The more interesting possibility is that the two are compatible: pairing an attention-aware rotation with a theoretically grounded scalar quantizer is a promising combination that no one has published yet. (Both teams have said as much publicly.)

Third axis: EpiCache

TurboQuant and OSCAR are both designed for a single long context. Neither handles multi-turn conversations, where history accumulates over many exchanges. Apple's EpiCache is a training-free KV cache management framework aimed squarely at that gap:

Block-wise prefill processes the history in blocks to keep peak memory bounded.
Episodic clustering divides the conversation into semantically coherent “episodes”, each with its own compressed cache.
Episode-matched retrieval routes each query to the most relevant episode at inference time.
Adaptive layer-wise budget allocation measures each layer's sensitivity to eviction and distributes the memory budget accordingly.
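The clustering, routing, and allocation steps listed above can be sketched as follows. This is a toy illustration, not EpiCache's implementation: the function names are hypothetical, plain k-means on synthetic turn embeddings stands in for the episode clustering, and a simple proportional split stands in for the layer-wise allocation rule.

```python
import numpy as np

def cluster_episodes(embeddings, n_episodes, iters=20):
    # Farthest-point initialisation followed by plain k-means.
    centroids = [embeddings[0]]
    for _ in range(n_episodes - 1):
        dists = np.min([np.linalg.norm(embeddings - c, axis=1) for c in centroids], axis=0)
        centroids.append(embeddings[dists.argmax()])
    centroids = np.array(centroids)
    for _ in range(iters):
        labels = np.linalg.norm(embeddings[:, None] - centroids, axis=2).argmin(axis=1)
        centroids = np.array([
            embeddings[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(n_episodes)
        ])
    return labels, centroids

def route_query(query_embedding, centroids):
    # Pick the episode whose centroid is closest to the query.
    return int(np.linalg.norm(centroids - query_embedding, axis=1).argmin())

def allocate_budget(sensitivity, total_tokens):
    # Proportional split: more sensitive layers keep more tokens.
    share = np.asarray(sensitivity, dtype=float)
    return np.floor(total_tokens * share / share.sum()).astype(int)

rng = np.random.default_rng(0)
topics = 10.0 * np.eye(3, 8)                      # three well-separated synthetic topics
turns = np.repeat(topics, 5, axis=0) + 0.1 * rng.standard_normal((15, 8))

labels, centroids = cluster_episodes(turns, n_episodes=3)
print("episode per turn:", labels)
print("query routed to episode:", route_query(topics[2], centroids))
print("per-layer token budget:", allocate_budget([1.0, 3.0, 4.0], total_tokens=800))
```

In a real system each episode would own its own compressed KV cache, and routing would select which cache to attend over for the incoming query.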

Across LongMemEval, RealTalk, and LoCoMo, EpiCache reports up to 40% higher accuracy than eviction baselines, near-full-cache accuracy at 4–6× compression, and up to 3.5× lower memory (and ~2.4× lower latency). Because it decides which tokens to keep rather than how precisely to store them, it composes directly with OSCAR or TurboQuant for compounded savings.

Key Takeaways

TurboQuant pushes the theoretical, model-agnostic frontier: 3–4 bit near-lossless compression on any model.
OSCAR leads on usable INT2, achieving up to 7.83× throughput and ~8× memory reduction at 100K-token contexts on supported models.
EpiCache addresses conversational memory across turns, with up to 40% higher accuracy than eviction baselines and up to 3.5× lower peak memory, and composes with any quantizer.
Choose by your binding constraint (bit budget, model portability, or conversation length) and combine the orthogonal methods that fit. These methods are complementary rather than competing.

Arnav is currently a student at Rochester Institute of Technology pursuing a Bachelor's degree in Computer Science and a minor in Economics with experience in backend development, and contributes to Marktechpost, where he writes about AI/ML research.
