KV caching looks like a bad trade on paper. Memory, complexity, and operational surface area, all spent to shave a few percent off a request.
The benchmarks do not rescue it.
We’ve seen teams leave it at that. KV cache is necessary inside a single forward pass, so you keep it for the life of the request and move on. Keeping it alive beyond that, reused across requests, starts to sound like a serving-layer luxury. You picture the memory it would pin, the eviction logic, the extra moving parts in a stack that is already hard enough to operate. Set that against a few percent of latency and the trade does not look worth making.
Understandably so. Run the numbers on a single request and long-lived KV caching underwhelms. We ran them, and at first it was a very unflattering story. But a single request is the wrong unit to judge this on, and once you measure at production load the economics turn.
The single-request savings are bounded
In one of our Qwen3-30B-A3B runs, a 1K input / 512 output request came in around 135 ms TTFT and about 2.5 seconds of total request latency. TTFT carries scheduling and queueing overhead on top of raw prefill compute, so call the prefill portion roughly 100 ms. Erase it completely, the most a perfect cache hit can do, and you save about 4 percent of the request. Treat that as the ceiling, not the everyday case.
As context grows, so does prefill’s share of the total request latency. At 16K input / 512 output, TTFT was about 769 ms of 3,200 ms total, which puts it near 24 percent. That is a real step up from the 1K case. The input/output ratio affects prefill more than the context length does. KV cache earns the most when a request carries a large input and returns a small output, because prefill is then a bigger share of the bill. In scenarios where you have short input and long output, decode takes over while the cache has little room to help.
On its own, a 4 or 24 percent share looks modest. Latency is measured at a target throughput, GPU capacity is scarce, and as throughput climbs, utilization, queueing, and pipeline stalls increase the cost of redoing prefill. So in production, prefill becomes a capacity and tail-latency problem once thousands of requests compete for the same accelerators.
So the skepticism is fair, for the single request. If the only question is whether one cache hit meaningfully cuts one request’s latency, the answer is often no, and it depends on the input/output ratio and how much of the request prefill actually owns. At that level, KV caching is not an automatic win.
Large-input, small-output workloads are getting more common, not less. Agentic workloads are multi-turn by nature. Context grows as chat history, tool-call results, and retrieval chunks pile up, so each new turn carries a larger input against a small output. Exactly the type of workload where prefill dominates and where reusing the cache has the most to give.
For application teams, reusing the cache shows up as lower TTFT, tighter p95, and lower cost per request. For the platform teams running the GPUs, it shows up as higher utilization and more capacity per dollar.
Expensive to hold, hard to reuse
That said, the KV cache is not free to keep. Hold it in GPU memory and it competes with active inference for the scarcest space you have. Move it to host memory and it is cheaper but still bounded, with DRAM prices trending the wrong way. Push it out to remote memory or storage and you take on transfer latency, placement problems, and more operational surface. KV caching is a bet. You are spending scarce memory on the wager that future requests reuse the work you are holding.
But that’s only half the problem. Even when you are willing to pay for the memory, the reuse you get back is limited. The production-friendly option today is prefix caching: if a later request begins with the exact same prefix, the engine reuses the KV cache already computed for it. The rule is strict, exact prefix match or nothing. Plenty of real workloads share meaning without sharing a prefix. Reordered retrieval chunks, varying tool results, and shifting user context carry the same semantic content in different positions, and none of it counts as a hit, so hit rates suffer.
It might seem like KV caching has a lot going against it. The single-node latency win is bounded. The memory cost is high. The reuse model is narrow. Evaluate it as an isolated optimization on one node and the honest question is whether the complexity earns its place. In isolation, often it does not. But isolation is the wrong frame because inference is not the system it was when those objections were formed. Each of them was measured against a single node running prefill and decode together, holding a cache that was large and expensive to keep. Two things have shifted since. The first is structural, in where prefill and decode run. The second is economic, in what the cache costs to hold and to move. Each one undercuts a different piece of the case against.
Inference is becoming a distributed systems problem
Prefill and decode are not the same kind of work. Prefill is compute-heavy. Decode is sensitive to memory bandwidth and to latency. Put both on the same accelerator and you force a compromise on one to serve the other. Split them, and you create a clean boundary between two workloads that want different things. The KV cache, however, has to cross that boundary.
When prefill and decode live on different hardware, the KV cache becomes a first-class distributed systems primitive, something you transfer, place, and manage a lifecycle for. NVIDIA Dynamo and the disaggregated stacks coming out of AWS and Cerebras are building the split into the infrastructure itself, which is what forces developers to think about how the KV cache moves, where it lives, and how long it stays alive.
The economics are starting to move
The second shift is economic. The KV cache itself is getting more efficient to store and to move. A surprising amount of recent model progress is really KV cache innovation, and the last 18 months have been striking.
DeepSeek-V2 and V3 introduced Multi-head Latent Attention (MLA). MLA compresses keys and values into a shared low-rank latent vector before anything gets cached. For V3, that takes the per-token cache from roughly 16,384 scalar dimensions under standard multi-head attention down to 576, a 512-dimensional latent plus 64 dimensions for decoupled RoPE. Against an MHA baseline that is about a 28x reduction. Against the GQA baseline most modern models already use, it is smaller, roughly 4 to 8x depending on group size, and MLA gets there while holding MHA-level quality, which GQA gives up.
Qwen 3.5 goes a different way with a Gated DeltaNet hybrid. It replaces 75 percent of its attention layers with Gated DeltaNet linear attention, layers that hold a fixed-size state matrix, 128 by 128 per head, and update it incrementally with each token. The state does not grow with sequence length. Only the remaining 25 percent, full softmax attention with GQA, still needs a traditional KV cache. At long contexts, where the KV cache usually dominates memory, this removes most of the growth. The payoff scales with context: substantial at 256K tokens, modest at 1K, where a fixed-size state costs about what a small KV cache would anyway.
TurboQuant and PolarQuant, from Google at ICLR 2026, take yet another angle. Instead of changing the attention mechanism, they quantize the KV cache itself to 3 or 4 bits per coordinate with no measurable accuracy loss on standard benchmarks. PolarQuant rotates vectors with a random orthogonal matrix so the coordinates follow a known distribution, then applies an optimal Lloyd-Max scalar quantizer, and QJL adds a 1-bit residual correction. At 4 bits the paper reports up to 8x faster attention on an H100. At 3 bits, roughly 6x memory reduction.
The exact numbers depend on baselines and configurations, but the direction is obvious. MLA shrinks the cache dramatically. Hybrid architectures such as Qwen’s Gated DeltaNet reduce cache growth across much of the network. Quantization approaches such as TurboQuant reduce memory requirements further without changing the model architecture. Different tradeoffs, same trend: the object is getting smaller.
Memory cost is the usual objection to KV caching, but this recent work almost makes it moot. Shrink the cache by 6x to an order of magnitude and the economics look very different. More entries fit in the same budget. Transfers from remote memory, SSD, or another node get faster. There’s a misconception that the cache has to become trivially small. But it only has to get small enough that the economics cross over for the workloads people run in production.
The storage hierarchy is changing as well. The previous thought was if the KV cache is not in GPU memory, it is too slow to matter. That is getting harder to say. Fast interconnects and local NVMe continue to improve. Now the question is whether moving or loading the cache can free the decode GPU from repeated prefill work and keep it pointed at the latency-sensitive part of the pipeline. If a storage-backed cache lowers prefill pressure and keeps accelerator capacity on decode, it can pay off quickly.
The workload is the main success driver. For a given model and context length, the system weighs the time to recompute prefill against the time to move the cache over the network, the time to read it from SSD, the cost of reserving the memory or storage, and the odds the cache gets reused at all. When transfer or load time comes in well under recompute time and reuse is likely enough, the cache earns its place. When it does not, the cache is a cost with no return.
Prefix caching under load
We saw this behavior in an experiment we recently ran. We used Qwen3-1.7B on an L40S, a 10K-token prompt, and the number of shared prefix tokens varied from 0 to 10K across several concurrency levels. As the shared prefix grows, the vLLM prefix cache hit ratio climbs from 0 to 1. Throughput and request latency were monitored at each concurrency level.
Prefix cache hit ratio grows linearly with shared prefix length, from 0 to 10K tokens against the fixed 10K-token prompt.
Throughput rises as the shared prefix grows and is sharpest at high concurrency, where skipping repeated prefill lets the system sustain more requests per second.
Mean request latency falls as reuse increases, with the high-concurrency settings improving most.
At low concurrency, a higher hit ratio helps, but only a little. At high concurrency, the same increase results in a much larger system effect. Requests per second climb sharply as more of the prompt comes from cache, and mean latency decreases along with it, dropping fastest at the higher concurrency levels. That is the behavior one would expect if KV caching is a systems optimization rather than a single-request latency trick.
The experiment is a small model on a single node, so it does not prove the disaggregated-architecture argument on its own. But it does verify that prefix cache hits remove prefill work from the serving path, and the system-level benefit grows with concurrency. The disaggregation thesis is that this gets stronger when prefill and decode run on separate hardware and the KV cache moves between them as a first-class object.
Prefix caching already works for the right workloads
Prefix caching generally has a narrow sharing mode. For agentic workflows it fits more naturally than you might expect, though the reason it fits changes by category of context.
System prompts are the easy case. They are stable across requests, they sit at the front of the prompt, and are a textbook prefix hit. An agent making a series of tool calls against the same backend reuses the same 2K to 8K token system prompt on every request. A multi-turn conversation with a fixed system prompt reuses the whole instruction block. A code-generation agent with stable repository context reuses the project description and file summaries. For this category, cross-request caching is straightforward.
Other kinds of context ask for more care. Chat history grows and shifts from turn to turn. Tool-call exemplars get reordered or swapped. Retrieval chunks change with every RAG query. These often share material across requests without sharing an exact prefix, so the effectiveness of the cache comes down to how much of the context is positionally stable (which prefix caching handles), versus variable (which needs something like CacheBlend to unlock).
Research for a better solution
For messier patterns, like RAG with retrieval chunks that vary by query or tool results that differ between calls, two requests can share a great deal of material without sharing the exact same prefix, and classic prefix caching returns a miss in those cases even when most of the computation could have been reused.
CacheBlend is one of the research directions in this area. It is exploring the idea of cache repair, which takes a semantically similar cached entry to what the current request needs, and selectively recomputes only the parts that differ. If repair is cheap enough, individual caches become reusable across more requests and hit rates rise without spending more memory.
This is still in open research, it’s not solved yet. No major inference framework ships chunk-level KV reuse today. The selective recomputation carries its own latency, quality preservation depends on the workload, and the methodology for measuring these tradeoffs is still maturing. But the direction is promising. More flexible cached entries raise the effective hit rate inside the same memory budget.
Prefix caching answers “does caching work?” for a growing number of workloads. The open question is how much of the rest can be brought into the cacheable regime, and cache repair is where we are working that out.
From per-request state to systems primitive
If you evaluate KV caching as an isolated optimization on a single node, it doesn’t make much sense. The memory cost is high and prefix-based reuse is limited.
But the architecture underneath is changing. Disaggregated prefill and decode create the right interface. Better attention mechanisms shrink the object you have to store and move. Faster networks and SSDs reduce transfer costs. Cache repair could push reuse beyond strict prefixes. Scarce GPU capacity makes repeated prefill work harder to justify.
Together, these shifts are turning the KV cache from a temporary intermediate state into an inference systems primitive.

