AimactGrow

The KV Cache Compression Race: TurboQuant vs OSCAR vs EpiCache

Admin by Admin
June 18, 2026


Long-context large language models (LLMs) face a memory bottleneck that has nothing to do with model weights. During decoding, transformers cache the key and value (KV) vectors for every token at every layer so they don't have to recompute attention. This cache grows linearly with sequence length and batch size, and at long context with high concurrency it can dwarf the model's own footprint.

Consider Llama-3.1-70B in BF16. Its KV cache costs about 0.31 MB per token (80 layers × 8 KV heads × 128 head-dim × 2 tensors × 2 bytes). At 128K tokens that's ~40 GB; at 1M tokens it exceeds 300 GB, more than the 140 GB of weights themselves. Worse, every newly decoded token has to stream the entire cache out of high-bandwidth memory (HBM), which makes decoding memory-bandwidth-bound rather than compute-bound. Shrinking the KV cache is therefore the most direct lever for cutting both cost and decode latency.
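The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function name is made up for illustration, the model dimensions are the ones quoted in the paragraph, and the sizes are reported in binary units (MiB, GiB), which is what the rounded figures correspond to.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies across all layers."""
    tensors = 2  # one key tensor and one value tensor per layer
    return layers * kv_heads * head_dim * tensors * bytes_per_elem


# Dimensions quoted in the text for Llama-3.1-70B with BF16 (2-byte) elements.
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)

mib_per_token = per_token / 2**20
gib_at_128k = per_token * 128 * 1024 / 2**30
gib_at_1m = per_token * 1024 * 1024 / 2**30

print(f"{mib_per_token:.4f} MiB per token")
print(f"{gib_at_128k:.1f} GiB at 128K tokens")
print(f"{gib_at_1m:.1f} GiB at 1M tokens")
```

These numbers are per sequence, so a batch of concurrent requests multiplies them again.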

Existing approaches fall into roughly five families: token eviction (H2O, SnapKV), quantization (KIVI, GEAR), low-rank projection (Palu), merging (KVMerger), and architectural sharing (MLA). Recent 2026 work has pushed hard on the ultra-low-bit quantization frontier. Google and NYU's TurboQuant (ICLR 2026) and Together AI's OSCAR attack the same problem from opposite directions, while Apple's EpiCache tackles a problem neither one addresses.

Most KV quantizers are fighting the same underlying enemy: outlier channels, a handful of channels with disproportionately large magnitudes that dominate the quantization range and squeeze the rest of the signal into just a few representable levels. This is why naive INT2 quantization (only 4 levels) collapses to near-zero accuracy.
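A toy experiment makes the collapse concrete. The sketch below is an illustration on synthetic data, not any paper's method: it applies plain min-max 2-bit quantization to a random vector, then repeats it after planting one large value in channel 7 (an arbitrary choice) and counts how many of the 4 levels the remaining channels still use.

```python
import numpy as np


def quantize_uniform(x, bits):
    """Min-max uniform quantization over the whole array.

    Returns the dequantized values and the integer codes.
    """
    levels = 2**bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels
    codes = np.round((x - lo) / scale)
    return codes * scale + lo, codes


rng = np.random.default_rng(0)
clean = rng.standard_normal(128)

with_outlier = clean.copy()
with_outlier[7] = 100.0  # synthetic outlier channel

_, codes_clean = quantize_uniform(clean, bits=2)
_, codes_outlier = quantize_uniform(with_outlier, bits=2)

# Count the distinct levels used by the channels that are NOT the outlier.
normal_channels = np.arange(128) != 7
levels_used_clean = len(np.unique(codes_clean))
levels_used_outlier = len(np.unique(codes_outlier[normal_channels]))

print("levels used without outlier:", levels_used_clean)
print("levels used by normal channels with outlier:", levels_used_outlier)
```

With the outlier present, the step size becomes larger than the whole spread of the ordinary channels, so they all land on the same code and their information is lost.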

KIVI established the standard baseline here. It showed that key vectors have fixed outlier channels across tokens while value vectors don't, so it quantizes keys per-channel and values per-token. That tuning-free 2-bit recipe cuts end-to-end peak memory (weights included) by about 2.6×, and it's the reference point the newer methods build on.
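The difference between the two grouping axes can be sketched on synthetic keys. This is an assumption-laden toy, not KIVI's implementation: the key matrix is random, channel 5 is scaled up to play the fixed outlier channel, and error is measured per channel relative to that channel's energy so every channel counts equally.

```python
import numpy as np


def quantize_along(x, bits, axis):
    """Min-max uniform quantization with one range per slice along `axis`."""
    levels = 2**bits - 1
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = (hi - lo) / levels
    codes = np.round((x - lo) / scale)
    return codes * scale + lo


def mean_relative_error(original, approx):
    """Squared error per channel, normalized by channel energy, then averaged."""
    err = ((original - approx) ** 2).mean(axis=0)
    energy = (original**2).mean(axis=0)
    return float((err / energy).mean())


rng = np.random.default_rng(0)
keys = rng.standard_normal((256, 64))  # (tokens, channels), synthetic
keys[:, 5] *= 50.0                     # same outlier channel for every token

# Per-channel: statistics are taken over tokens, so each channel has its own range.
per_channel = quantize_along(keys, bits=2, axis=0)
# Per-token: statistics are taken over channels, so the outlier sets every range.
per_token = quantize_along(keys, bits=2, axis=1)

err_per_channel = mean_relative_error(keys, per_channel)
err_per_token = mean_relative_error(keys, per_token)

print(f"per-channel relative error: {err_per_channel:.3f}")
print(f"per-token relative error:   {err_per_token:.3f}")
```

Grouping by channel confines the outlier's large range to its own channel, which is the intuition behind quantizing keys that way.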

TurboQuant: data-oblivious and theoretically optimal

TurboQuant handles outliers without ever looking at your data, in two stages:

  • Stage one: each vector is randomly rotated so its coordinates become nearly independent and approximately Gaussian, which lets an optimal precomputed scalar (Lloyd–Max) quantizer be applied per coordinate.
  • Stage two: a 1-bit Quantized Johnson–Lindenstrauss (QJL) transform is applied to the residual, giving a provably unbiased estimate of attention logits with no normalization-constant overhead.
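The two stages can be sketched in NumPy as follows. This is a rough illustration under stated assumptions rather than the paper's implementation: the function names are invented, the codebook is fitted to Gaussian samples with Lloyd's algorithm instead of being precomputed analytically, the residual norm is stored alongside the sign bits, and the QJL step uses far more projection rows (m = 20000) than a real cache would, only so that a single draw lands visibly close to the true logit.

```python
import numpy as np


def random_rotation(d, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # sign fix so the rotation is uniformly random


def lloyd_max_codebook(bits, rng, n_samples=20000, iters=30):
    """Fit 2**bits reconstruction levels to N(0, 1) samples with Lloyd's algorithm."""
    samples = rng.standard_normal(n_samples)
    k = 2**bits
    levels = np.quantile(samples, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        idx = np.abs(samples[:, None] - levels[None, :]).argmin(axis=1)
        levels = np.array([samples[idx == j].mean() for j in range(k)])
    return levels


def quantize_rotated(x, rotation, codebook):
    """Stage one: rotate, snap each coordinate to the nearest level, rotate back."""
    sigma = np.linalg.norm(x) / np.sqrt(x.shape[0])  # approx. std of rotated coords
    y = rotation @ x
    idx = np.abs(y[:, None] / sigma - codebook[None, :]).argmin(axis=1)
    return rotation.T @ (codebook[idx] * sigma)


def qjl_inner_product(q, r, m, rng):
    """Stage two: estimate <q, r> from the sign bits of a Gaussian projection of r."""
    s = rng.standard_normal((m, r.shape[0]))
    signs = np.sign(s @ r)  # the only per-coordinate data kept about r
    return np.sqrt(np.pi / 2) * np.linalg.norm(r) / m * ((s @ q) @ signs)


rng = np.random.default_rng(0)
d = 128
key = rng.standard_normal(d)
key[3] = 50.0  # synthetic outlier channel
query = rng.standard_normal(d)

rotation = random_rotation(d, rng)
codebook = lloyd_max_codebook(bits=3, rng=rng)

key_hat = quantize_rotated(key, rotation, codebook)
rel_err = np.sum((key - key_hat) ** 2) / np.sum(key**2)

residual = key - key_hat
true_logit = query @ key
estimate = query @ key_hat + qjl_inner_product(query, residual, m=20000, rng=rng)

print(f"relative reconstruction error at 3 bits: {rel_err:.4f}")
print(f"true logit {true_logit:.3f}, estimated logit {estimate:.3f}")
```

Note that nothing in the sketch inspects calibration data: the rotation and the codebook depend only on the dimension and the bit-width, which is what "data-oblivious" means here.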

The selling point is theoretical: TurboQuant's distortion is provably within a small constant factor (≈ 2.7×) of the information-theoretic lower bound. In practice it reaches essentially full-precision recall on Needle-in-a-Haystack at 4× compression, and the paper reports absolute quality neutrality at 3.5 bits and only marginal degradation at 2.5 bits per channel. Because it needs no calibration, it works on any model untouched and doubles as a fast vector-database quantizer.

One caveat worth flagging: the widely repeated "8× faster attention on H100" figure comes from Google's blog, not the paper, and refers to a narrow attention-logit microbenchmark. TurboQuant's documented sweet spot is the 3–4 bit near-lossless regime.

OSCAR: attention-aware and deployment-ready

OSCAR bets the opposite way. Its premise is that at INT2's 4 levels, a data-oblivious rotation is the wrong tool: blindly smoothing ranges isn't enough when there's almost no precision to spare. So OSCAR computes an attention-aware rotation from a one-time offline calibration pass: keys are rotated into the eigenbasis of the query covariance, values into the eigenbasis of the score-weighted value covariance. A Hadamard transform plus a bit-reversal permutation then spread channel importance evenly across the quantization groups.
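The key-side idea can be illustrated with a small NumPy sketch. Everything here is an assumption made for illustration: the calibration queries are synthetic, the function names are invented, the Hadamard factor is left out, and the way the eigenbasis and the bit-reversal permutation are combined is a guess rather than OSCAR's actual construction. What it does show is that any orthogonal rotation applied to both queries and keys leaves attention logits unchanged, and that a bit-reversal ordering interleaves high- and low-importance directions across contiguous groups.

```python
import numpy as np


def query_eigenbasis(queries):
    """Eigenvectors of the query covariance, largest eigenvalue first."""
    cov = queries.T @ queries / queries.shape[0]
    _, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    return eigvecs[:, ::-1]


def bit_reversal_permutation(d):
    """Indices 0..d-1 reordered by reversing their binary digits (d a power of 2)."""
    bits = d.bit_length() - 1
    return np.array([int(format(i, f"0{bits}b")[::-1], 2) for i in range(d)])


rng = np.random.default_rng(0)
d = 8
# Hypothetical calibration queries with very uneven energy per direction.
energy = np.array([8.0, 4.0, 2.0, 1.0, 1.0, 0.5, 0.5, 0.25])
calib_queries = rng.standard_normal((1024, d)) * energy

basis = query_eigenbasis(calib_queries)
perm = bit_reversal_permutation(d)
rotation = basis[:, perm]  # reorder directions; still an orthogonal matrix

q = rng.standard_normal(d)
k = rng.standard_normal(d)
logit_before = q @ k
logit_after = (q @ rotation) @ (k @ rotation)

print("bit-reversal order:", perm.tolist())
print(f"logit before {logit_before:.6f}, after {logit_after:.6f}")
```

With groups of two channels, the order printed above pairs the strongest direction with a weak one in every group, which is the "spread importance evenly" effect the paragraph describes.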

What sets OSCAR apart is that it ships as a complete system, not just an algorithm:

  • Mixed-precision paged cache: sink and recent tokens stay in BF16 while the history compresses to INT2; at 128K context only ~0.24% of tokens remain in BF16.
  • Fused Triton kernels with full SGLang integration (paged-attention and prefix-cache compatible).
  • Precomputed rotations (a "RotationZoo") for Qwen3-4B/8B/32B, GLM-4.7-FP8, and MiniMax-M2.7, with no recalibration needed.
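To get a feel for how a mixed-precision cache translates into an average bit-width, here is a back-of-the-envelope sketch. All parameter values are hypothetical: the sink and recent-window sizes are picked only so the BF16 share lands near the ~0.24% quoted above, and the group size and per-group metadata bits are generic placeholders, not OSCAR's documented layout.

```python
def bf16_fraction(context_len, sink_tokens, recent_tokens):
    """Share of cached tokens that stay in BF16 (sink plus recent window)."""
    return min(context_len, sink_tokens + recent_tokens) / context_len


def effective_bits(frac_bf16, low_bits=2, group_size=128, metadata_bits=32):
    """Average bits per cached element, counting per-group quantization metadata."""
    quantized_bits = low_bits + metadata_bits / group_size
    return frac_bf16 * 16 + (1 - frac_bf16) * quantized_bits


# Hypothetical window sizes, chosen for illustration only.
frac = bf16_fraction(128 * 1024, sink_tokens=64, recent_tokens=256)
bits = effective_bits(frac)

print(f"BF16 share: {frac:.4%}")
print(f"effective bits per element: {bits:.3f}")
```

The point of the exercise is that the full-precision tokens cost almost nothing at long context, while per-group scales and offsets are what push the average above a flat 2 bits.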

At an effective 2.28 bits, OSCAR lands within 1.42 points of BF16 on Qwen3-8B and is essentially on par on Qwen3-32B (a 0.02-point gap). On GLM-4.7-FP8, where naive INT2 collapses to zero and data-oblivious baselines reach only low single digits, OSCAR matches BF16 and even edges slightly ahead on the reported benchmarks (within noise). Together AI reports up to 7.83× job-level throughput and roughly 8× KV-cache memory reduction at 100K context, with up to ~3× faster decoding.

So which one wins?

Neither, and that's the honest answer. For deployable INT2 at 128K tokens on supported models, OSCAR is currently the only demonstrated option that doesn't collapse, and it comes with production-ready SGLang support. For training-free, model-agnostic quantization in the 3–4 bit regime, TurboQuant offers far broader generality.

OSCAR's paper reports that TurboQuant drops by more than 40 points at a comparable budget, but that evaluation runs inside OSCAR's own framework, quantizes all layers, uses a single random seed, and operates well below TurboQuant's intended bit-width, so it's a weak basis for a head-to-head verdict. The more interesting possibility is that the two are complementary: pairing a calibration-aware rotation with an optimal scalar quantizer is a promising combination nobody has shipped yet. (Both teams have publicly noted the same idea.)

The third axis: EpiCache

TurboQuant and OSCAR are both built for a single long context. Neither handles extended multi-turn conversations, where history piles up across many exchanges. Apple's EpiCache is a training-free KV-cache management framework aimed exactly at that gap:

  • Block-wise prefill processes history in blocks to keep peak memory bounded.
  • Episodic clustering segments the conversation into coherent semantic "episodes," each with its own compressed cache.
  • Episode-matched retrieval routes each query to the most relevant episode at inference time.
  • Adaptive layer-wise budget allocation measures each layer's sensitivity to eviction and distributes the memory budget accordingly.
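The clustering, routing, and budgeting steps listed above can be sketched in a few functions. This is a toy under loud assumptions, not EpiCache's code: the turn embeddings are synthetic 4-dimensional vectors, the clustering is a plain k-means with farthest-point initialization, routing uses cosine similarity to episode centroids, and the layer budget is split in proportion to a made-up sensitivity score. All function names are hypothetical.

```python
import numpy as np


def cluster_episodes(turn_embeddings, k, iters=20):
    """Group conversation turns into k episodes with a simple k-means."""
    x = turn_embeddings
    centroids = [x[0]]
    for _ in range(k - 1):  # farthest-point initialization
        dist = np.min([np.linalg.norm(x - c, axis=1) for c in centroids], axis=0)
        centroids.append(x[dist.argmax()])
    centroids = np.stack(centroids)
    for _ in range(iters):
        labels = np.linalg.norm(x[:, None] - centroids[None], axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = x[labels == j].mean(axis=0)
    labels = np.linalg.norm(x[:, None] - centroids[None], axis=2).argmin(axis=1)
    return centroids, labels


def route_query(query_embedding, centroids):
    """Pick the episode whose centroid has the highest cosine similarity."""
    norms = np.linalg.norm(centroids, axis=1) * np.linalg.norm(query_embedding)
    return int((centroids @ query_embedding / norms).argmax())


def allocate_layer_budget(sensitivity, total_tokens):
    """Split a token budget across layers in proportion to their sensitivity."""
    share = np.asarray(sensitivity, dtype=float)
    share = share / share.sum()
    budget = np.floor(share * total_tokens).astype(int)
    leftover = int(total_tokens - budget.sum())
    budget[np.argsort(-share)[:leftover]] += 1  # hand out rounding remainder
    return budget


rng = np.random.default_rng(0)
# Synthetic embeddings: five turns about one topic, then five about another.
topic_a = np.array([1.0, 0.0, 0.0, 0.0]) + 0.05 * rng.standard_normal((5, 4))
topic_b = np.array([0.0, 1.0, 0.0, 0.0]) + 0.05 * rng.standard_normal((5, 4))
turns = np.vstack([topic_a, topic_b])

centroids, labels = cluster_episodes(turns, k=2)
episode = route_query(np.array([0.1, 0.9, 0.0, 0.0]), centroids)
budgets = allocate_layer_budget([4, 2, 1, 1], total_tokens=1000)

print("episode label per turn:", labels.tolist())
print("query routed to episode:", episode)
print("per-layer token budgets:", budgets.tolist())
```

In a real system each episode would own a separately compressed KV cache, and the routed episode's cache is the one loaded to answer the query.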

Across LongMemEval, RealTalk, and LoCoMo, EpiCache reports up to 40% higher accuracy than eviction baselines, near-full-cache accuracy at 4–6× compression, and up to 3.5× lower peak memory (and ~2.4× lower latency). Because it decides which tokens to keep rather than how precisely to store them, it composes directly with OSCAR or TurboQuant for compounding savings.

Key Takeaways

  • TurboQuant pushes the theoretical, model-agnostic frontier: the go-to for 3–4 bit near-lossless compression on any model.
  • OSCAR leads on deployable INT2, with up to 7.83× throughput and ~8× memory reduction at 100K context on supported models.
  • EpiCache solves conversational memory across turns, with up to 40% accuracy gains over eviction and 3.5× lower peak memory, and composes with either quantizer.
  • Pick by constraint: bit-width budget, model portability, or conversation length, then combine the orthogonal methods that fit. These approaches are more complementary than competitive.

Arnav Rai

Arnav is currently a student at Rochester Institute of Technology pursuing a Bachelor's degree in Computer Science and a minor in Economics with hands-on backend development experience, and is a contributor at Marktechpost, where he writes about AI/ML research.
