Long-context large language models (LLMs) face a memory bottleneck that has nothing to do with model weights. During decoding, transformers cache the key and value (KV) vectors for every token at every layer so they don’t have to recompute attention. This cache grows linearly with sequence length and batch size, and at long context with high concurrency it can dwarf the model’s own footprint.
Consider Llama-3.1-70B in BF16. Its KV cache costs about 0.31 MB per token (80 layers × 8 KV heads × 128 head-dim × 2 tensors × 2 bytes). At 128K tokens that’s ~40 GB; at 1M tokens it exceeds 300 GB, more than the 140 GB of weights themselves. Worse, every newly decoded token has to stream the entire cache out of high-bandwidth memory (HBM), which makes decoding memory-bandwidth-bound rather than compute-bound. Shrinking the KV cache is therefore the most direct lever for cutting both cost and decode latency.
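The per-token figure is easy to check by hand. The short sketch below multiplies out the factors quoted above; it assumes binary units (1K = 1024 tokens, 1 MB = 1024² bytes), which is the reading that reproduces the numbers in the text. The function name is ours, not from any library.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache that one token occupies across all layers."""
    # The factor 2 covers the two cached tensors per layer: one key, one value.
    return layers * kv_heads * head_dim * 2 * bytes_per_elem


# Llama-3.1-70B shape as quoted above; BF16 stores 2 bytes per element.
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_elem=2)

MIB = 1024 ** 2
GIB = 1024 ** 3

per_token_mib = per_token / MIB
cache_128k_gib = per_token * 128 * 1024 / GIB
cache_1m_gib = per_token * 1024 * 1024 / GIB

print(f"{per_token_mib:.4f} MiB per token")
print(f"{cache_128k_gib:.0f} GiB at 128K tokens")
print(f"{cache_1m_gib:.0f} GiB at 1M tokens")
```

This prints 0.3125 MiB, 40 GiB, and 320 GiB for a single sequence. Multiply by the batch size to get the footprint under concurrency.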
Existing approaches fall into roughly five families: token eviction (H2O, SnapKV), quantization (KIVI, GEAR), low-rank projection (Palu), merging (KVMerger), and architectural sharing (MLA). Recent 2026 work has pushed hard on the ultra-low-bit quantization frontier. Google and NYU’s TurboQuant (ICLR 2026) and Together AI’s OSCAR attack the same problem from opposite directions, while Apple’s EpiCache tackles a problem neither one addresses.
Most KV quantizers are fighting the same underlying enemy: outlier channels, a handful of channels with disproportionately large magnitudes that dominate the quantization range and squeeze the rest of the signal into only a few representable levels. This is why naive INT2 quantization (only 4 levels) collapses to near-zero accuracy.
KIVI established the standard baseline here. It showed that key vectors have fixed outlier channels across tokens while value vectors don’t, so it quantizes keys per-channel and values per-token. That tuning-free 2-bit recipe cuts end-to-end peak memory (weights included) by about 2.6×, and it’s the reference point the newer methods build on.
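To make the outlier problem concrete, here is a minimal NumPy sketch on synthetic data. It is not KIVI's implementation (which also uses grouping and a full-precision residual window); it only contrasts the two grouping directions at 2 bits. The outlier channel index, its 50× scale, and the helper name are all illustrative assumptions.

```python
import numpy as np


def quantize_dequantize(x: np.ndarray, bits: int, axis: int) -> np.ndarray:
    """Asymmetric uniform quantization with one (min, scale) pair per slice along `axis`."""
    levels = 2 ** bits - 1
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    codes = np.clip(np.round((x - lo) / scale), 0, levels)
    return codes * scale + lo


rng = np.random.default_rng(0)
tokens, channels = 256, 64

# Synthetic keys: unit-variance noise plus one fixed high-magnitude channel.
keys = rng.standard_normal((tokens, channels))
keys[:, 7] *= 50.0  # hypothetical outlier channel

# axis=0 reduces over tokens   -> one range per channel
# axis=1 reduces over channels -> one range per token
per_channel = quantize_dequantize(keys, bits=2, axis=0)
per_token = quantize_dequantize(keys, bits=2, axis=1)

# Measure the damage on the 63 ordinary channels only.
normal = np.arange(channels) != 7
mse_per_channel = float(np.mean((keys - per_channel)[:, normal] ** 2))
mse_per_token = float(np.mean((keys - per_token)[:, normal] ** 2))

print(f"per-channel MSE on normal channels: {mse_per_channel:.3f}")
print(f"per-token   MSE on normal channels: {mse_per_token:.3f}")
```

With per-token grouping the outlier stretches every token's range, so the ordinary channels collapse onto one or two of the four levels. Per-channel grouping confines the outlier to its own range, which is why it suits keys.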
TurboQuant: data-oblivious and theoretically optimal
TurboQuant handles outliers without ever looking at your data, in two stages:
- Stage one: each vector is randomly rotated so its coordinates become nearly independent and approximately Gaussian, which lets an optimal precomputed scalar (Lloyd–Max) quantizer be applied per coordinate.
- Stage two: a 1-bit Quantized Johnson–Lindenstrauss (QJL) transform is applied to the residual, giving a provably unbiased estimate of attention logits with no normalization-constant overhead.
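Stage one can be illustrated in a few lines of NumPy. This is a sketch of the general idea only, not TurboQuant's code: it uses a dense random orthogonal matrix from a QR decomposition, a single RMS scale per vector, and the standard 4-level Lloyd–Max reconstruction points for a unit Gaussian (quoted to about four digits). Stage two, the QJL residual correction, is left out entirely, and the test vector with one planted outlier is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 128

# Data-independent random rotation: the Q factor of a Gaussian matrix is orthogonal.
rotation, _ = np.linalg.qr(rng.standard_normal((dim, dim)))

# Reconstruction points of the 4-level (2-bit) Lloyd-Max quantizer for N(0, 1).
LEVELS = np.array([-1.510, -0.4528, 0.4528, 1.510])


def snap(y: np.ndarray) -> np.ndarray:
    """Replace each coordinate with its nearest reconstruction point."""
    codes = np.abs(y[:, None] - LEVELS[None, :]).argmin(axis=1)
    return LEVELS[codes]


def quantize(x: np.ndarray, rotate: bool) -> np.ndarray:
    """2-bit scalar quantization of x, optionally carried out in the rotated basis."""
    rms = np.linalg.norm(x) / np.sqrt(x.size)  # one scale per vector
    y = (rotation @ x if rotate else x) / rms
    y_hat = snap(y) * rms
    return rotation.T @ y_hat if rotate else y_hat


def relative_error(x: np.ndarray, x_hat: np.ndarray) -> float:
    return float(np.sum((x - x_hat) ** 2) / np.sum(x ** 2))


# Synthetic vector with one planted outlier coordinate.
x = rng.standard_normal(dim)
x[3] = 50.0

err_plain = relative_error(x, quantize(x, rotate=False))
err_rotated = relative_error(x, quantize(x, rotate=True))

print(f"relative error without rotation: {err_plain:.3f}")
print(f"relative error with rotation:    {err_rotated:.3f}")
```

The rotation spreads the outlier's energy across all coordinates, so the fixed Gaussian grid fits the rotated vector far better than the raw one. Nothing here depends on the data, which is the sense in which the method is data-oblivious.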
The selling point is theoretical: TurboQuant’s distortion is provably within a small constant factor (≈ 2.7×) of the information-theoretic lower bound. In practice it reaches essentially full-precision recall on Needle-in-a-Haystack at 4× compression, and the paper reports absolute quality neutrality at 3.5 bits and only marginal degradation at 2.5 bits per channel. Because it needs no calibration, it works on any model untouched and doubles as a fast vector-database quantizer.
One caveat worth flagging: the widely repeated “8× faster attention on H100” figure comes from Google’s blog, not the paper, and refers to a narrow attention-logit microbenchmark. TurboQuant’s documented sweet spot is the 3–4 bit near-lossless regime.
OSCAR: attention-aware and deployment-ready
OSCAR bets the opposite way. Its premise is that at INT2’s four levels, a data-oblivious rotation is the wrong tool: blindly smoothing ranges isn’t enough when there’s almost no precision to spare. So OSCAR computes an attention-aware rotation from a one-time offline calibration pass: keys are rotated into the eigenbasis of the query covariance, values into that of the score-weighted value covariance. A Hadamard transform plus a bit-reversal permutation then spread channel importance evenly across the quantization groups.
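The key-side idea can be sketched with NumPy under some loud assumptions: the calibration queries below are synthetic, the Hadamard matrix is a plain Sylvester construction, and the bit-reversal permutation and the value-side rotation are omitted. This is an illustration of the linear algebra, not OSCAR's implementation. It shows two things: an orthogonal rotation applied to both queries and keys leaves attention logits unchanged, and composing the eigenbasis with a Hadamard matrix equalizes query energy across channels.

```python
import numpy as np


def hadamard(n: int) -> np.ndarray:
    """Orthonormal Sylvester Hadamard matrix; n must be a power of two."""
    h = np.ones((1, 1))
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


rng = np.random.default_rng(0)
head_dim, n_calib = 64, 4096

# Hypothetical calibration queries with strongly uneven energy per direction.
mixing = rng.standard_normal((head_dim, head_dim))
spectrum = np.logspace(0, -2, head_dim)
queries = (rng.standard_normal((n_calib, head_dim)) * spectrum) @ mixing
keys = rng.standard_normal((n_calib, head_dim))

# Offline step: eigenbasis of the query covariance.
cov = queries.T @ queries / n_calib
eigvals, eigvecs = np.linalg.eigh(cov)

# Compose with a Hadamard so that no single channel carries most of the energy.
rotation = eigvecs @ hadamard(head_dim)

q_rot, k_rot = queries @ rotation, keys @ rotation

# Orthogonal rotations cancel inside the dot product, so logits are unchanged.
logits_before = queries[:8] @ keys[:8].T
logits_after = q_rot[:8] @ k_rot[:8].T

# Query energy per channel: eigenvalues in the eigenbasis, flat after the Hadamard.
energy_eigen = np.diag(eigvecs.T @ cov @ eigvecs)
energy_final = np.diag(rotation.T @ cov @ rotation)

print("max logit change:", float(np.abs(logits_before - logits_after).max()))
print("energy spread in eigenbasis:", float(energy_eigen.max() / energy_eigen.min()))
print("energy spread after Hadamard:", float(energy_final.max() / energy_final.min()))
```

Because the rotation is fixed after calibration, it can be computed once per model and reused at serving time, which matches the offline calibration pass described above.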
What sets OSCAR apart is that it ships as a complete system, not just an algorithm:
- Mixed-precision paged cache: sink and recent tokens stay in BF16 while the history compresses to INT2; at 128K context only ~0.24% of tokens remain in BF16.
- Fused Triton kernels with full SGLang integration (paged-attention and prefix-cache compatible).
- Precomputed rotations (a “RotationZoo”) for Qwen3-4B/8B/32B, GLM-4.7-FP8, and MiniMax-M2.7; no recalibration needed.
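A quick back-of-the-envelope calculation shows how little the BF16 window costs. The sketch below counts payload bits only, using the 0.24% share quoted above; it assumes 128K means 128 × 1024 tokens and deliberately ignores quantization scales and any other metadata, so it should not be expected to match the effective bit-width reported for the full system.

```python
def payload_bits_per_element(frac_bf16: float, low_bits: int = 2, high_bits: int = 16) -> float:
    """Average stored bits per cache element, counting payload only (no scales or metadata)."""
    return frac_bf16 * high_bits + (1.0 - frac_bf16) * low_bits


context = 128 * 1024   # assumed meaning of "128K"
frac_bf16 = 0.0024     # share of tokens kept in BF16, as quoted above

bf16_tokens = round(context * frac_bf16)
avg_bits = payload_bits_per_element(frac_bf16)

print(f"{bf16_tokens} tokens stay in BF16")
print(f"{avg_bits:.4f} payload bits per element on average")
```

Roughly 315 tokens stay in full precision, and they add only a few hundredths of a bit to the average payload.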
At an effective 2.28 bits, OSCAR lands within 1.42 points of BF16 on Qwen3-8B and is essentially on par on Qwen3-32B (a 0.02-point gap). On GLM-4.7-FP8, where naive INT2 collapses to zero and data-oblivious baselines reach only low single digits, OSCAR matches BF16 and even edges slightly ahead on the reported benchmarks (within noise). Together AI reports up to 7.83× task-level throughput and roughly 8× KV-cache memory reduction at 100K context, with up to ~3× faster decoding.
So which one wins?
Neither, and that’s the honest answer. For deployable INT2 at 128K tokens on supported models, OSCAR is currently the only demonstrated option that doesn’t collapse, and it comes with production-ready SGLang support. For training-free, model-agnostic quantization in the 3–4 bit regime, TurboQuant offers far broader generality.
OSCAR’s paper reports that TurboQuant drops by more than 40 points at a comparable budget, but that evaluation runs inside OSCAR’s own framework, quantizes all layers, uses a single random seed, and operates well below TurboQuant’s intended bit-width, so it’s a weak basis for a head-to-head verdict. The more interesting possibility is that the two are complementary: pairing a calibration-aware rotation with an optimal scalar quantizer is a promising combination nobody has shipped yet. (Both teams have publicly noted the same idea.)
The third axis: EpiCache
TurboQuant and OSCAR are both built for a single long context. Neither handles extended multi-turn conversations, where history piles up across many exchanges. Apple’s EpiCache is a training-free KV-cache management framework aimed exactly at that gap:
- Block-wise prefill processes history in blocks to keep peak memory bounded.
- Episodic clustering segments the conversation into coherent semantic “episodes,” each with its own compressed cache.
- Episode-matched retrieval routes each query to the most relevant episode at inference time.
- Adaptive layer-wise budget allocation measures each layer’s sensitivity to eviction and distributes the memory budget accordingly.
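Two of these steps reduce to small, familiar operations, sketched below. Everything here is an assumption for illustration: the function names are ours, routing is done by cosine similarity against episode centroids that are taken as given (the clustering itself is not shown), and the budget is split in simple proportion to a per-layer sensitivity score. EpiCache's actual scoring and allocation rules may differ.

```python
import numpy as np


def route_to_episode(query_emb: np.ndarray, centroids: np.ndarray) -> int:
    """Index of the episode centroid with the highest cosine similarity to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return int(np.argmax(c @ q))


def allocate_budget(sensitivity: np.ndarray, total_tokens: int) -> np.ndarray:
    """Split a token budget across layers in proportion to their sensitivity scores."""
    share = sensitivity / sensitivity.sum()
    budget = np.floor(share * total_tokens).astype(int)
    # Rounding down can leave a few tokens unassigned; give them to the most sensitive layers.
    leftover = int(total_tokens - budget.sum())
    budget[np.argsort(-sensitivity)[:leftover]] += 1
    return budget


# Three hypothetical episode centroids in a toy 3-dimensional embedding space.
centroids = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
query = np.array([0.1, 0.9, 0.2])
episode = route_to_episode(query, centroids)

# Four hypothetical layers; the first is the most sensitive to eviction.
layer_sensitivity = np.array([4.0, 2.0, 1.0, 1.0])
layer_budget = allocate_budget(layer_sensitivity, total_tokens=1000)

print("query routed to episode", episode)
print("per-layer token budget:", layer_budget.tolist())
```

The query lands on episode 1, and the 1000-token budget splits 500/250/125/125 across the four layers. At inference time only the selected episode's compressed cache would need to be attended over.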
Across LongMemEval, RealTalk, and LoCoMo, EpiCache reports up to 40% higher accuracy than eviction baselines, near-full-cache accuracy at 4–6× compression, and up to 3.5× lower peak memory (and ~2.4× lower latency). Because it decides which tokens to keep rather than how precisely to store them, it composes directly with OSCAR or TurboQuant for compounding savings.
Key Takeaways
- TurboQuant pushes the theoretical, model-agnostic frontier: the go-to for 3–4 bit near-lossless compression on any model.
- OSCAR leads on deployable INT2, with up to 7.83× throughput and ~8× memory reduction at 100K context on supported models.
- EpiCache solves conversational memory across turns (up to 40% accuracy gains over eviction and 3.5× lower peak memory) and composes with either quantizer.
- Pick by constraint: bit-width budget, model portability, or conversation length, then combine the orthogonal methods that fit. These approaches are more complementary than competitive.