Speculative KV Coding: 4× Lossless Cache Compression
Disclosure: This post was researched and drafted with AI assistance. Primary source: "kkm", "Speculative KV coding: losslessly compressing KV cache by up to ~4× using a predictor model", fergusfinn.com, posted 8 May 2026; surfaced on the HN front page on 4 June 2026. The arithmetic-coder framing, the 11-bits-per-scalar bf16 cache entropy number, the ~4× lossless / ~8× gross compression claim, and the analogy to Leviathan et al.'s speculative decoding (2022) are all from the post. The "predictor is the product" framing in the original-take section is the author's synthesis. The comments quoted in the discussion section are real HN comments on that thread, permalinked to the right authors; we did not paraphrase around them. Benchmarks were not independently reproduced.
In 2026, the bottleneck in long-context LLM serving is VRAM holding the KV cache and PCIe moving it — not flops. As agentic workflows (coding agents, long-document RAG, multi-hour research sessions) push average context windows past the 200K mark, the cache stops being "a little memory" and starts being the dominant line item on the inference bill. A new write-up from kkm on fergusfinn.com describes a method called Speculative KV coding that gets you up to ~4× lossless compression of the cache using a cheaper predictor model, stacking on top of the lossy FP8 compression everyone is already doing for a gross ~8× reduction. The post hit the HN front page on June 4 with 79 points and a comment thread that is, unusually, a real engineering discussion rather than a flame war. It deserves more attention than the ranking suggests.
The headline is "4×." The interesting number is buried in the setup cost.
What speculative KV coding actually does
The classical way to make a KV cache smaller is lossy quantization: drop K and V from bf16 to FP8 (or FP4), accept the quality hit, and run evals until your benchmarks stop screaming. TurboQuant is the most-discussed recent example of this family, and sits in the same conceptual neighborhood as the fergusfinn post.
Speculative KV coding is a different move. It is lossless — the reconstructed cache is bit-identical to the original — and it works by analogy with speculative decoding (Leviathan, Kalman, Matias, 2022):
- Pick a predictor model — a smaller, faster model whose forward pass on the same prompt gives a per-scalar guess μ and a calibrated uncertainty σ² of the target model's KV cache.
- Both the encoder (who has access to the target model) and the decoder (who will reconstruct the cache) run the predictor on the prompt. The predictor is cheap, so running it twice is fine. Both sides end up with the same (μ, σ) per scalar.
- The encoder runs the target model to get the real KV cache, then feeds (KV_full, μ, σ) into an arithmetic coder (the same family of coders behind rANS/tANS; the post links to prior work on both). The coder emits a bitstream whose length is bounded by the cross-entropy H(p, q) = H(p) + KL(p || q). Because the KV cache is a deterministic function of weights and prompt, its "true" entropy is zero; every bit the coder emits is pure KL against the predictor.
- The decoder consumes the bitstream alongside its locally reconstructed (μ, σ) and recovers KV_full exactly.
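The steps above can be sketched as a toy round trip. This is a minimal illustration, not the post's implementation: it assumes each cache scalar is already an integer code in [0, 256), it stands in for the predictor with a made-up list of (μ, σ) pairs, and it uses an exact-rational arithmetic coder built on Python's `Fraction` instead of a production rANS coder. The names `freq_table`, `encode`, and `decode` are hypothetical. It also assumes both sides compute bit-identical predictor outputs, which a real deployment would have to guarantee.

```python
import math
import random
from fractions import Fraction

ALPHABET = 256  # toy assumption: each cache scalar is an integer code in [0, 256)

def freq_table(mu, sigma):
    """Integer frequencies of a discretized Gaussian centred on mu.
    The +1 floor keeps every symbol codable, so a bad guess costs bits, not correctness."""
    freqs = [1 + int(10_000 * math.exp(-0.5 * ((s - mu) / sigma) ** 2))
             for s in range(ALPHABET)]
    return freqs, sum(freqs)

def encode(symbols, preds):
    """Narrow the interval [low, low + width) once per symbol, then name a point in it."""
    low, width = Fraction(0), Fraction(1)
    for s, (mu, sigma) in zip(symbols, preds):
        freqs, total = freq_table(mu, sigma)
        low += width * Fraction(sum(freqs[:s]), total)
        width *= Fraction(freqs[s], total)
    n = 0
    while True:  # shortest n-bit dyadic fraction k / 2**n that lies inside the interval
        k = math.ceil(low * 2**n)
        if Fraction(k, 2**n) < low + width:
            return k, n
        n += 1

def decode(k, n, preds):
    """Replay the same interval narrowing, using only the bitstream and the predictions."""
    value = Fraction(k, 2**n)
    low, width = Fraction(0), Fraction(1)
    out = []
    for mu, sigma in preds:
        freqs, total = freq_table(mu, sigma)
        target = math.floor((value - low) / width * total)
        cum = 0
        for s, f in enumerate(freqs):
            if cum + f > target:  # value falls inside symbol s's slice
                break
            cum += f
        out.append(s)
        low += width * Fraction(cum, total)
        width *= Fraction(f, total)
    return out

# Toy data: the "target cache" is the prediction plus small Gaussian error.
random.seed(0)
preds = [(random.randrange(32, 224), 3.0) for _ in range(200)]
kv = [min(ALPHABET - 1, max(0, round(random.gauss(mu, sigma)))) for mu, sigma in preds]

k, n = encode(kv, preds)
assert decode(k, n, preds) == kv  # bit-identical reconstruction
print(f"{n / len(kv):.2f} bits per scalar vs 8 raw")
```

The decoder never sees the target values, only the integer `k`, its bit length `n`, and the shared predictions. A worse predictor widens the gap between guess and truth and the bitstream grows, but the round trip stays exact.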
The whole point is the split cost. The encoder pays one full target-model forward pass (it has to, that's the only way to get the real cache). The decoder pays a predictor forward pass per token and some arithmetic. The bandwidth between them is just the bitstream. In a long-context agent session, the decoder side is the side that runs over and over: the encoder is prefill-once, the decoder is decode-many, so the asymmetry is what makes the method pay off.
The numbers that matter
The post gives three numbers worth keeping in your head.
- bf16 KV cache is about 11 bits per scalar of bytewise entropy, roughly 30% smaller than the raw 16-bit format. So even a perfect general-purpose entropy coder, with no model of the cache at all, gets you ~1.45×. That is the floor.
- ~4× lossless compression of a bf16 cache with the predictor-model approach. The author is explicit that this stacks with any lossy FP8 quantization you were already doing: the bf16→FP8 step saves 2× on its own, so losslessly compressing the resulting FP8 cache gives a gross ~8× reduction relative to raw bf16.
- The bitrate is ~½ log₂(2πe σ²) bits per scalar in expectation (with σ measured in quantization steps), which is just log₂(typical error magnitude) plus a constant. Better predictor → smaller typical error → fewer bits. The marginal cost of a smarter predictor is paid in flops; the marginal benefit is paid in VRAM and bandwidth. The arbitrage is in the ratio.
That last equation is the reason this is interesting. The cost of a forward pass through a predictor model scales with the predictor's parameters. The savings scale with how well that predictor's μ matches the target's KV_full. There is a break-even point, and the post is careful to say it does not yet know exactly where.
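A quick worked version of the arithmetic, as a sketch under stated assumptions: the residual is treated as Gaussian, σ is expressed in units of the quantization step, and the logarithm is taken base 2 so the result comes out in bits. The helper name `gaussian_bits` is hypothetical.

```python
import math

def gaussian_bits(sigma, step=1.0):
    """Approximate bits per scalar for a Gaussian residual with standard
    deviation `sigma`, quantized to bins of width `step` (assumes sigma >> step)."""
    return 0.5 * math.log2(2 * math.pi * math.e * sigma**2) - math.log2(step)

# Model-free floor from the post: 16 raw bits vs ~11 bits of entropy.
print(f"floor ratio: {16 / 11:.2f}x")

# Each halving of the predictor's typical error saves exactly one bit per scalar.
for sigma in (8.0, 4.0, 2.0):
    print(f"sigma={sigma}: {gaussian_bits(sigma):.2f} bits")
```

The one-bit-per-halving relationship is why predictor quality shows up logarithmically in the bill: a predictor that is twice as accurate does not halve the cache, it shaves one bit off every scalar.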
The comment thread is the real story
The HN discussion is roughly a dozen comments long and unusually high-signal. Three exchanges in particular are worth quoting at length.
wongarsu lays out the cost curve: "The tradeoff gets better the bigger your primary model, and probably with bigger batch sizes. The KV cache can consume a lot of expensive VRAM, and the VRAM and compute costs of the predictor model become a small fraction of the cost of the primary model. For serving a 1T model with 16 concurrent requests this could make a lot of sense. For a 8B model with a single request far less so."
That is the post in one sentence. The economics only flip in your favor when the cache is genuinely expensive, which today means frontier models in production serving, not your laptop running Llama 8B.
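To put rough numbers on "genuinely expensive", here is a back-of-envelope sizing sketch. The model shape (80 layers, 8 KV heads, head dimension 128, roughly a Llama-3-70B-class configuration with grouped-query attention) and the 200K-token, 16-request workload are illustrative assumptions, not figures from the post; the 2× and ~8× ratios are the post's claims applied as plain division.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_scalar):
    """Size of one request's KV cache; the factor 2 covers the K and V tensors."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_scalar

GIB = 2**30

# Assumed shape and workload (hypothetical, for illustration only).
raw_bf16 = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                          tokens=200_000, bytes_per_scalar=2)
concurrent_requests = 16

for label, ratio in [("raw bf16", 1), ("FP8 (lossy, 2x)", 2),
                     ("FP8 + lossless coding (~8x gross, per the post)", 8)]:
    per_request = raw_bf16 / ratio / GIB
    print(f"{label}: {per_request:.1f} GiB per request, "
          f"{per_request * concurrent_requests:.1f} GiB for {concurrent_requests} requests")
```

Under these assumptions the raw cache for a batch of long-context requests is far larger than a single accelerator's memory, which is the regime where spending a predictor forward pass to shrink it starts to look cheap.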
0-_-0 raises the obvious follow-up: "You can use the original model to compress the kv cache and get ∞x compression, since the prediction is perfect. The cost is time, and I don't see how this could be worth it." That is the trivial upper bound the post walks you through, and yes, paying a full target forward pass to predict your own forward pass is silly. The author's framing is that the predictor needs to be cheaper than the target — and the choice of predictor is the cost-versus-bits tradeoff the whole post is organized around.
saagarjha makes the cleaner point: "Speculation is only worth it if you can profit from it. Not every context allows this or has a similar idea of what can be speculated." A predictor model only helps if its forward pass on the prompt is correlated with the target's forward pass. If you pick a predictor that is just a bag of weights with no shared structure, you get the floor (the 11-bit entropy, ~1.45×). The post's choice — "an optimised version of the same model" — is the obvious, principled answer: same architecture, same prompt, same attention pattern, just a cheaper optimization. That is what makes the conditional entropy H(KV_full | M_pred(prompt)) actually small.
The original take: the cheap predictor is the product, not the compression
Most coverage of compression releases treats the codec as the product and the predictor as a black box. For this method that framing is exactly backwards, and that is the part most people will miss.
The codec is two pages of rANS. It is the part that has been solved for twenty years. The predictor is the part that has just become cheap enough to use, because in 2026 you can serve a small open-weights model in a few hundred milliseconds on a single GPU. The cost of running a 1B-parameter predictor model on a 200K-token prompt, in 2026, is in the range of seconds. The cost of not compressing your 1T-parameter target model's KV cache is in the range of not-fitting-it-in-memory.
That cost curve is what makes the method timely. Two years ago, the predictor would have been a quarter of the target's flops and the arbitrage would not have closed. Two years from now, the predictor will be a single forward pass of a distilled version of the same model trained specifically to predict the target's cache, and the 4× number will probably be 6×. The interesting number is the one we will get when someone trains that predictor end-to-end.
Expect the first production deployments to stay within a single model family — same architecture, same tokenizer, same training data. Cross-family prediction is possible (the arithmetic coder is still lossless) but the bitrate will be much higher because the conditional entropy gets larger as the predictor and target diverge.
What this means for you
Four reader profiles, four different calls:
- If you serve frontier models in production (≥ 70B, long contexts, batched traffic): the 4× lossless number is real and the 8× gross number is the one that matters for your VRAM bill. This is the deployment profile wongarsu describes, and it is the only profile where the cost curve is unambiguously in your favor. The integration cost is one predictor-model forward pass per request, which is in the noise relative to a 1T-class prefill.
- If you run smaller models locally (8B–13B, single-user, sub-100K context): you are on the wrong side of the break-even. The predictor model would cost you a meaningful fraction of the target's flops, and the cache is not the bottleneck yet. Hold off.
- If you build agentic systems: this is the workload that should make you care the most. An agent loop that holds a 500K-token context across many turns is paying the cache cost on every decode. The 4× compression is bandwidth between your LLM provider and your agent runtime, which today is the single biggest cap on agent session length. Watch for vendor support here first; it will land in the inference stacks that already do speculative decoding (vLLM, TensorRT-LLM, SGLang) before it shows up in any closed API.
- If you build ML systems for a living: the predictor-quality story is the next thing to pay attention to. A predictor model trained end-to-end to minimize the cross-entropy of (target KV | predictor forward pass) is a much smaller research project than the codec work was, and the marginal value is large. The post is essentially an open call for that work.
What to do this week
# 1. If you maintain a vLLM / TensorRT-LLM / SGLang fork:
# - Find the existing speculative-decoding code path.
# - The (encoder, decoder) asymmetry it implements is structurally
# identical to what Speculative KV coding needs.
# - The 4× number is a VRAM win, not a flops win. Plan the benchmark
# for batched traffic, not single-stream.
# 2. If you serve a frontier model with > 200K context windows:
# - Measure the share of your inference cost that is cache storage
# and cache transfer. If it is < 20%, skip. If it is > 40%, this
# is worth a prototype.
# - Start with same-family predictor (e.g., target = Llama-3 70B,
# predictor = Llama-3 8B at INT4). Cross-family is a research
# project, not a deploy.
# 3. If you build agents: be ready to switch inference providers
# the day one of them ships this. An 8× cache-size win is the
# difference between a 30-minute session and a 4-hour session
# on the same hardware, and whichever provider gets there first
# is the one whose API key ends up in your framework's default.
# 4. If you write ML systems posts: do not lead with "4× lossless
# compression." Lead with "the predictor model is the product."
# That is the framing nobody else has and it is the part of
# the post that will still be true in 2028.
The bottom line
Speculative KV coding is not a clever codec trick. It is a cost-curve observation: that a 1B-parameter predictor model in 2026 is cheap enough to run as a side computation, that a frontier model's KV cache is expensive enough to make that side computation worth it, and that the gap between those two facts has been closing for three years and will continue to close. The 4× number is real. The interesting question is what the number will be in twelve months, when the predictor is trained end-to-end against the target's actual cache distribution, and the answer is almost certainly "larger than 4×, and the predictor itself is the thing someone ships as a model."
This is the post to send to the engineer on your team who keeps saying "we'll just quantize harder." It is also the post to send to the person who keeps saying "VRAM is the new FLOPS." Both of them are right. The 2026 argument is about which side of the cost curve you are on, and this Speculative KV coding write-up is the cleanest published version of that argument I have read.
Related reads from this blog
- Microsoft Just Put a Workflow Engine Inside Postgres — same week, different bottleneck: durable execution in the database. The structural similarity is that both moves relocate work from where it is expensive (a separate orchestrator, separate VRAM) to where it is already paid for (the database, the predictor model).
- Redis 8.8: Your Lua Rate Limiter Is Now Obsolete — both posts are about a vendor deciding your separate layer is now their default. Redis 8.8 ate the rate-limiter; whoever ships Speculative KV coding in vLLM eats your cache-cost budget.
Sources
- kkm, "Speculative KV coding: losslessly compressing KV cache by up to ~4× using a predictor model", fergusfinn.com, 8 May 2026. https://fergusfinn.com/blog/kv-entropy-coder/
- Hacker News discussion, front page 4 June 2026. https://news.ycombinator.com/item?id=48400151
- Yaniv Leviathan, Matan Kalman, Yossi Matias, "Fast Inference from Transformers via Speculative Decoding", ICML 2023 (arXiv 2211.17192, Nov 2022). https://arxiv.org/abs/2211.17192
