< back to blog

vLLM's Hash Chain and Why Prefix Caching Is Still Prefix Caching

Automatic prefix caching, hash chains, radix trees. The data structures get cleverer, but we haven’t moved past shared prefixes.

Automatic Prefix Caching sounds like it should solve a bigger problem than plain prefix caching. Requests are hashed, cache entries are discovered automatically, and shared work never has to be tracked by hand. It looks like a system that can find reusable computation wherever it appears.

But it’s still the same type of reuse we’ve always had. Shared prefixes are reusable. Shared content that is not a prefix is not. That rule sets the biggest limit on what today’s inference infrastructure can reuse.

Most of the recent work in KV caching has focused on finding prefixes more efficiently. Hash chains, radix trees, and automatic discovery all improve the mechanics of reuse. But they don’t change what can be reused. The workloads where it succeeds, and where it falls short, show why.

When prefix caching is enough

Reusing the KV cache only pays off when requests share computation, so the question is how much of a real workload today’s infrastructure can reuse.

For many agentic workloads, the answer is often “enough”. Stable system prompts, tool definitions, and conversation scaffolding create long shared prefixes, and inference engines are good at finding and reusing them.

Plenty of workloads share more than just prefixes, though. RAG pipelines are where it breaks down. The system prompt stays fixed while the retrieved documents change from request to request. Two requests might carry the same five documents in a different order. The meaning is almost identical. The token sequence is not, and the cache matches on the token sequence. Same content, different positions, no reuse.

Why prefix caching remains prefix-bound

vLLM’s Automatic Prefix Caching uses content hashing to remove the need for explicit prefix tracking. The KV cache is divided into fixed-size blocks, and each block gets a SHA-256 hash. The hash for block N folds in the hash of every preceding block along with the content of the current block. Chained together, every block fingerprints not only its own content but the entire token history before it.

When a request arrives, vLLM hashes each block-sized chunk of input and checks whether a matching block already exists. Matching blocks reuse previously computed KV cache. Missing blocks trigger fresh computation. Lookup is effectively constant-time, eviction is straightforward, and fixed-size blocks map cleanly onto GPU memory.

But because each block hash depends on every block before it, one divergence changes every hash that follows. Take two requests that share the first 3,000 tokens and split at token 3,001. The block holding token 3,001 hashes differently, and so does every block after it. Reuse stops at the point of divergence. The system discovers shared prefixes on its own, and it can’t discover shared content that appears once requests have diverged.

Reuse happens only at block boundaries. If two requests share 1,000 tokens and a block holds 16, vLLM reuses 62 whole blocks, or 992 tokens, and recomputes the remaining 8. For long prefixes that waste is negligible. For short or irregular shared segments it’s more substantial.

There is no matching inside a block, either. Two blocks that share 15 of their 16 tokens still hash to completely different values, so reuse is all-or-nothing at the block level. These are reasonable tradeoffs that keep the implementation simple and fast, but they still leave you with the same limitation: reuse follows exact prefix structure.

A different cache structure might help. SGLang takes that route. Instead of hashing fixed-size blocks, it keeps cached state in a radix tree indexed by token sequences. When a request arrives, the runtime walks the tree and finds the longest matching cached prefix, and matches can fall on arbitrary token boundaries rather than fixed block ones. That helps workloads with variable-length turns, branching conversations, and irregular prefix lengths.

The radix tree still searches for the longest shared prefix, though. Once two requests diverge, the content they share later in the sequence stays out of reach. SGLang improves how prefixes are discovered, but it does not extend reuse past prefixes.

Beyond prefixes

Prefix caching tells us that KV reuse clearly works. The more open question now is how much reuse survives divergence. Today’s runtimes are tuned for shared prefixes. The next generation goes after shared segments, cache repair, and the reuse that prefix matching cannot reach.

So much of the current research now focuses on cache repair and segment-level reuse. The goal has shifted from proving that KV reuse is valuable to recovering the work that today’s prefix-based systems still leave behind.