Momento Japan Feed

The concurrency cliff is a memory limit

Khawaja Shams — Fri, 26 Jun 2026 00:00:00 GMT

Add concurrent users to an inference server and latency usually creeps up. KV cache serving does not creep. It holds flat, then falls off a cliff.

On a single L4 running a Qwen3-4B coding agent, the server holds 12 concurrent sessions at a p99 time to first token (TTFT) of about 2.6 seconds. Add two more and p99 jumps to 39 seconds, a 15× increase in a single step. The cliff is the moment the KV cache fills.

An agentic coding workload on commodity hardware

Our setup is a single g6.4xlarge EC2 instance with one NVIDIA L4 GPU (24 GB VRAM), running Qwen3-4B with FP8 weights on vLLM 0.20.2, automatic prefix caching (APC) enabled.

The workload models a lightweight coding agent mid-task. Each session opens with a 10,000-token shared system prompt (repository context plus agent instructions), followed by a 12-turn conversation where each turn appends about 1,000 tokens of unique context (tool call inputs, code snippets, responses). At turn 12, total context reaches roughly 22,800 tokens per session. Output is capped at 75 tokens per turn. The workload is heavily input-dominated, as agentic workloads tend to be.

System prompt:          10,000 tokens  (shared across sessions → APC cached)
Per-session turns:      ~12,800 tokens across 12 turns  (unique per session)
Output per turn:        75 tokens
—————————————————
Total at turn 12:       ~22,800 tokens

APC caches the system prompt once and amortizes its KV cost across all sessions (the blocks still occupy GPU memory, but only one copy exists). Within a session, APC also caches the growing turn history. Turn n+1 extends the exact prefix from turn n, so each turn only prefills the new ~1,000 tokens, as long as the prior turns’ KV blocks survive in cache. But the per-session turn history is unique (different code, different tool outputs) and cannot be shared across sessions. When concurrency pressure forces eviction of a session’s blocks, the next request in that session has to re-prefill the full accumulated history (up to ~12,800 tokens at turn 12). That re-prefill cost drives the TTFT cliff.
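The per-turn cost described above can be sketched in a few lines. This is a simplified model, not a measurement: it assumes the shared system prompt is always a cache hit, that eviction drops a session's whole turn history at once, and that the ~12,800 unique tokens are spread evenly over the 12 turns. The function name and constants are illustrative.

```python
UNIQUE_TOKENS_AT_TURN_12 = 12_800   # per-session unique context at peak depth
TURNS = 12
TOKENS_PER_TURN = UNIQUE_TOKENS_AT_TURN_12 / TURNS  # roughly 1,067


def prefill_tokens(turn: int, history_cached: bool) -> int:
    """Tokens to prefill for a request at `turn` (1-based).

    Assumption: the shared system prompt is always served from cache,
    so only per-session tokens are counted here.
    """
    if not 1 <= turn <= TURNS:
        raise ValueError("turn must be between 1 and 12")
    if history_cached:
        # Prior turns survive in cache: only the newest turn is computed.
        return round(TOKENS_PER_TURN)
    # History was evicted: the whole accumulated history is recomputed.
    return round(TOKENS_PER_TURN * turn)


for turn in (1, 6, 12):
    print(turn, prefill_tokens(turn, True), prefill_tokens(turn, False))
```

The warm column stays flat while the evicted column grows with session depth, which is why an eviction late in a session is the expensive one.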

We swept concurrency from 1 to 48 sessions, measuring TTFT at each level, across three KV cache precisions: fp16, fp8, and TurboQuant 4-bit. We also ran best-case (all context cached) and worst-case (all context re-prefilled) bounds to bracket where realistic performance should land.

The concurrency cliff

The chart below shows TTFT percentiles (p50, p95, p99) and throughput across the full concurrency sweep with fp8 KV cache. The Y axis is logarithmic, and even on a log scale the cliff is sharp.

TTFT p50, p95, and p99, with throughput on the right axis, versus concurrent sessions. fp8 KV cache, Qwen3-4B FP8 on a g6.4xlarge (L4). The shaded band marks the 12 to 16 session collapse, and the dashed line marks the KV cache filling at 14c.

Through 12 concurrent sessions, p99 TTFT stays at or below 2.6 seconds and throughput climbs to a peak of 1.36 req/s. At 14 sessions, p99 jumps to 38.9 seconds in a single step, a 15× increase, and throughput drops 23 percent at the same time.

At 12 sessions, everything fits in KV cache. At 14, the cache fills, and making room for one session’s new tokens forces eviction of another session’s blocks. That evicted session has to re-prefill its accumulated unique context, up to about 12,800 tokens, at its next turn, and the cascade collapses latency.

The cliff sits between 12 and 14 sessions, and throughput never recovers past the peak at 12c. For a 2-second p99 SLA, the safe operating point is 6 concurrent sessions.

After the cliff, p50 lies to you

Above the cliff, p50 and p99 live in different regimes. At 20c, p50 is 2.2 seconds, which looks manageable, while p99 is 44.8 seconds, which is not. The distribution is bimodal, because APC creates two populations of requests. Lucky requests hit warm cache entries for their session context, prefill only the latest turn, and finish fast. Unlucky requests arrive after their session’s blocks were evicted, re-prefill the full 12,800 tokens, and take 30 to 50 seconds under load.

The p50 reflects the lucky cohort and the p99 reflects the unlucky one. A single TTFT average is meaningless past the cliff. You have to look at the tail to see the failure.

A few points from the sweep show the whole shape: flat through 12 sessions, the cliff at 14, and a widening p50/p99 gap past it.

Sessions	TTFT p50	TTFT p95	TTFT p99	req/s
1	388 ms	487 ms	493 ms	0.42
6	1.06s	1.43s	1.75s	1.18
10	1.19s	1.89s	2.31s	1.33
12	1.25s	2.16s	2.60s	1.36
14	1.29s	8.93s	38.9s	1.05
20	2.25s	25.5s	44.8s	0.68
48	5.74s	40.5s	43.2s	0.89
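To see why p50 and the mean both hide the failure, here is a small sketch with a made-up bimodal sample. The 90/10 split and the 2-second and 40-second values are hypothetical, chosen only to resemble the two cohorts described above, and nearest-rank is just one of several percentile conventions.

```python
def nearest_rank(samples, pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids float rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Hypothetical run of 200 requests: 90% hit warm cache, 10% were evicted.
warm = [2.0] * 180    # seconds, only the newest turn prefilled
cold = [40.0] * 20    # seconds, full history re-prefilled under load
ttft = warm + cold

p50 = nearest_rank(ttft, 50)
p99 = nearest_rank(ttft, 99)
mean = sum(ttft) / len(ttft)
print(f"p50={p50}s  p99={p99}s  mean={mean:.1f}s")
```

The median lands in the warm cohort, the p99 lands in the cold one, and the mean describes a latency that no request actually experienced.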

KV cache precision moves the knee

Running the same workload with 16-bit KV cache (vLLM defaults to the model’s dtype, bf16 for Qwen3, when --kv-cache-dtype is not set) halves the token capacity, and the knee shifts left in proportion.

KV dtype	Bits/element	Token capacity	Knee (sessions)	2s p99 ceiling
bf16/fp16	16	~89K	~8	~4
fp8	8	~178K	~14	~6
TurboQuant 4-bit	~4.2	~275K	~23 (est.)	pending

The fp16 to fp8 shift is confirmed, with the fp16 knee at about 8 sessions and the fp8 knee at about 14, a 1.75× shift for a 2× capacity increase. The slight compression below 2× is expected, since KV management overhead and block table fragmentation consume some of the headroom regardless of precision.

TTFT p99 for fp8 KV against fp16 KV, same workload and GPU. The markers sit at the observed knees, 8 sessions for fp16 and 14 for fp8. The tq4 knee near 23 is a capacity-based estimate, pending the experiment.

Quantization buys concurrency headroom directly. Halving KV precision from fp16 to fp8 nearly doubles how many concurrent sessions fit before the cliff. For TurboQuant 4-bit (about 3.8× fewer bytes per element than fp16, partially offset by the lower gpu_memory_utilization it needs for autotuning scratch space), the capacity arithmetic predicts a knee at about 23c, roughly 3× more concurrent sessions than fp16.

The accuracy tradeoff may be small. KV cache quantization at 4-bit typically reports low single-digit perplexity impact, though the exact effect depends on model and task. The knee shifts from about 8 sessions (fp16) to about 14 (fp8) to an estimated 23 (tq4), roughly 3× more sessions before eviction onset, from the same GPU.

Best case, worst case, realistic

To separate the latency budget that is fundamental (prefill compute) from the part that is avoidable (cache misses), we ran two controlled bounds alongside the realistic workload.

In the best case (miss_rate=0.0), every request hits the same cached content. APC holds the full 12,800-token session context, so only about 200 unique tokens need prefilling, which is perfect KV utilization.

The worst case (miss_rate=1.0) gives every request a unique prefix that breaks APC for the user context. The 10K system prompt still hits the cache, but all ~12,800 tokens of per-session turn history are re-prefilled on every request. Every miss lands at peak session depth (turn 12), forcing the maximum re-prefill cost each time.

TTFT p50 for best (fully cached), realistic (natural APC), and worst (always re-prefilled), on a log scale. The band between best and worst is the envelope any real workload lands in.

What the bounds tell us

At a single concurrent session, with zero contention, the three bounds separate cleanly.

Workload	TTFT p50	What’s happening
Best (cached)	57 ms	Only ~200 unique tokens prefilled; rest is cached
Realistic (APC)	388 ms	System prompt and prior turns cached; each request prefills its newest turn of ~1,000 tokens (~12,800 unique tokens over the session)
Worst (evicted)	2,500 ms	System prompt cached; ~12,800 user tokens re-prefilled at peak depth every request

On an absolute scale, realistic (388 ms) is much closer to best (57 ms) than to worst (2,500 ms). But realistic is still about 7× slower than best. That gap is the cost of prefilling per-session unique context: with no contention, each request prefills only its newest turn, about 1,000 tokens against about 200 in the best case, which adds up to about 12,800 unique tokens over the session. APC removes the system prompt cost, but the per-session turn history still has to be computed once.

The gap between realistic and worst is about miss depth. In the realistic workload, cache misses happen at any turn. A session evicted at turn 3 re-prefills about 3,000 tokens, while eviction at turn 12 costs about 12,800. The worst case forces every miss to peak session depth, paying the maximum re-prefill on every request. Real traffic produces a distribution of miss depths, which is why realistic latency stays close to best.

The best-case result is the surprising one. With perfectly cached session context, the L4 handles more than 48 concurrent sessions within a 2-second p99 SLA. The 6 session realistic ceiling is the cost of per-session context uniqueness, the turn histories that cannot be shared. It is not a GPU compute limit.

The worst case grows linearly at about 2.1 seconds per additional concurrent session, reaching 102 seconds at 48c. Throughput saturates at 0.32 req/s from 6 sessions onward. The GPU is fully consumed re-prefilling 12,800 tokens per request, and extra concurrency just lengthens the queue.

Why the knee is where it is

The L4 has 24 GB of VRAM, but far less than that is available for KV cache. The memory that actually holds KV cache is roughly half of raw VRAM.

Where the memory goes

vLLM’s gpu_memory_utilization was set to 0.9 for the fp8 and fp16 experiments, reserving about 21.6 GB. After model weights, CUDA graph capture, activation tensors, and block table overhead, about 13 GB remains for KV cache. The TurboQuant experiment used 0.8 (it needs about 2 GB of extra scratch for torch.inductor autotuning at startup), leaving about 10.6 GB.

KV cache per token

Qwen3-4B uses GQA with 36 layers, 8 KV heads, and head_dim 128. The per-token KV cache size depends on precision.

2 (K+V) × 36 layers × 8 KV heads × 128 head_dim × bytes_per_element

FP16/BF16: ... × 2 bytes = 147,456 bytes/token  → ~89K tokens in ~13 GB
FP8:       ... × 1 byte  =  73,728 bytes/token  → ~178K tokens in ~13 GB
TQ4:       ~0.53 B effective (4-bit + quantization metadata)
           =  ~38,700 bytes/token  → ~275K tokens in ~10.6 GB
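The same arithmetic as a short script, for plugging in a different model or budget. The layer, head, and dimension values are the ones quoted above; the helper names are illustrative, and the budgets are the rounded "about 13 GB" and "about 10.6 GB" figures, so the capacities come out close to the numbers above rather than identical to them.

```python
LAYERS, KV_HEADS, HEAD_DIM = 36, 8, 128   # Qwen3-4B values quoted above


def kv_bytes_per_token(bytes_per_element: float) -> float:
    """Bytes of KV cache one token occupies across all layers (K and V)."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_element


def token_capacity(kv_budget_gb: float, bytes_per_element: float) -> int:
    """Tokens that fit in a KV budget given in decimal GB (rounded input)."""
    return int(kv_budget_gb * 1e9 / kv_bytes_per_token(bytes_per_element))


for name, budget_gb, bpe in [
    ("fp16/bf16", 13.0, 2.0),
    ("fp8", 13.0, 1.0),
    ("tq4 (approx.)", 10.6, 0.53),   # effective bytes incl. metadata
]:
    print(name, int(kv_bytes_per_token(bpe)), token_capacity(budget_gb, bpe))
```

For an exact figure, use the KV cache size vLLM reports at startup instead of the rounded budget.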

The capacity arithmetic (with APC)

With APC, the 10,000-token system prompt is stored once and shared. Only the per-session unique context (about 12,800 tokens at peak depth) needs its own blocks.

FP8 KV
Token budget:     ~178K
Shared prefix:     10K (1×)
Available:        ~168K
Per-session:      ~12.8K
Max sessions:   168K / 12.8K ≈ 13      Observed knee: ~14c

FP16 KV
Token budget:      ~89K
Shared prefix:     10K (1×)
Available:         ~79K
Per-session:      ~12.8K
Max sessions:    79K / 12.8K ≈ 6       Observed knee: ~8c

TQ4 KV (estimated)
Token budget:     ~275K (0.8 util)
Shared prefix:     10K (1×)
Available:        ~265K
Per-session:      ~12.8K
Max sessions:   265K / 12.8K ≈ 21      Predicted knee: ~23c

The arithmetic predicts the knees within 1 to 2 sessions of the observed values. The slight overshoot (observed 14 sessions against predicted 13) is because sessions are not all at peak depth at once. Earlier turns have smaller contexts, which buys a few extra sessions before capacity runs out.

In this setup, the concurrency cliff is a memory limit. The binding constraint is how many sessions’ KV caches fit in VRAM at once. The best-case bound supports this. With perfect caching, the same GPU handles more than 48 sessions within 2-second p99. Compute, scheduling, and continuous batching also contribute, but memory capacity sets the ceiling.

How to find the knee for your workload

The knee location depends on three variables. Available KV cache memory is total VRAM minus model weights, CUDA graphs, activations, and fragmentation, typically about half of raw VRAM, and vLLM reports the exact number at startup. Per-session unique context is the total session tokens at peak depth, minus any shared prefix cached by APC. KV precision is the bytes per element, where halving it from fp16 to fp8 to 4-bit roughly doubles token capacity at each step and shifts the knee right.

The estimate is max concurrent sessions ≈ (token capacity − shared prefix) / per-session unique context.
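As a sketch, the estimate is a one-line function. The function name is illustrative, and the inputs below are the token capacities from this post.

```python
def estimate_max_sessions(token_capacity: float,
                          shared_prefix: float,
                          per_session_unique: float) -> float:
    """First-order knee estimate: sessions whose unique KV fits at once."""
    if per_session_unique <= 0:
        raise ValueError("per_session_unique must be positive")
    available = max(token_capacity - shared_prefix, 0)
    return available / per_session_unique


fp8 = estimate_max_sessions(178_000, 10_000, 12_800)
fp16 = estimate_max_sessions(89_000, 10_000, 12_800)
tq4 = estimate_max_sessions(275_000, 10_000, 12_800)
print(round(fp8), round(fp16), round(tq4))
```

The result is a starting point for choosing sweep values, not a substitute for the sweep itself.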

For this setup (Qwen3-4B, L4, 22.8K-token agentic sessions with a 10K shared prefix), the arithmetic predicts about 13 sessions (fp8) and 6 (fp16). The observed knees are about 14 and 8. The arithmetic gives a first-order estimate, and a concurrency sweep gives the precise number. The gap between estimate and observation comes from session depth staggering, block fragmentation, and APC reuse patterns.

Different workloads shift each variable. A single-turn QA workload with 2K tokens per session has a much higher knee. A code review agent with 50K-token inputs has a much lower one. A GPU with more VRAM (A100, H100) raises the budget. The method is the same. Estimate the budget, divide by per-session cost, then verify with a sweep.

What this means for deployment

Know your KV budget before you set your concurrency limit. Below the knee you get the best throughput with stable latency and effective caching. Above it you get worse throughput, worse latency, and a wasted APC investment.

KV quantization is a direct concurrency multiplier. On this L4, switching from fp16 to fp8 KV cache moves the 2-second p99 SLA ceiling from about 4 to 6 (50 percent more sessions) and the eviction knee from about 8 to 14 (75 percent more sessions). The gain is a direct consequence of halving the bytes per KV element. Quantization buys memory, and memory buys concurrency.

Monitor tail latency, not averages. After the cliff, p50 looks manageable while p99 is catastrophic. The bimodal distribution means some users get sub-second responses while others wait 40-plus seconds, and an average-based dashboard hides it until users complain.

If some of your sessions are latency-tolerant background work, run them off the interactive path. They do not need to compete for cache memory, and keeping them off it frees KV budget for the sessions that need low TTFT.

Several caveats temper the numbers. These experiments use synthetic token content, not real code. The workload has a fixed 12-turn structure, while real agent sessions vary widely in depth. Poisson arrivals do not capture bursty agentic traffic, where agents send follow-up requests immediately. p99 at high concurrency is noisy, since with about 200 requests per run it is only the second-worst request. Chunked prefill, not enabled here, could smooth the knee transition. The numbers are specific to a single L4 with Qwen3-4B. Larger models, multi-GPU setups, and different context lengths shift the absolute numbers while the pattern holds.

KV cache behaves like a systems problem, and the concurrency knee is where that meets a specific GPU, a specific model, and a specific workload shape. The math is simple. The discipline is running it before production tells you the hard way.

Your KV cache benchmark is “hi hi hi”

Khawaja Shams — Wed, 24 Jun 2026 00:00:00 GMT

Before you commit to a KV cache offloading system, you benchmark it to make sure it performs well within your SLA. You see that it has excellent compression and cheap transfers. Seems like an easy win.

But there’s some trouble with what standard KV cache benchmarks run on.

LMCache ships with a long-document benchmark for measuring KV cache offloading performance. Run it without a corpus file and it generates documents like this:

warmup_prompts = [
  str(i) + " " + " ".join(["hi"] * args.document_length)
  for i in range(args.num_documents)
]

A 10,000-token document comes out looking a little underwhelming.

0 hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi …

Technically speaking, it is the 10K token count you were looking for, but it doesn’t represent a real 10K token workload.

KV cache systems do not run on token count alone. Compression ratios, activation patterns, transfer sizes, and cache behavior all depend on the shape of the input. Two documents of the same length can be two entirely different workloads.

Unfortunately, much of the current KV cache ecosystem is benchmarked on synthetic inputs that look nothing like the workloads people run.

The benchmark is not representative

The default benchmark document contains a numeric identifier and the token “hi” repeated thousands of times.

Transformers do not produce identical activations for repeated tokens. Positional encoding, attention mixing, and residual connections keep every position distinct. To an LLM, though, distinct and varied are not the same thing. Repeating a single token produces far more regular activation patterns than diverse text does.

With inputs this uniform, compression improves, transfer sizes shrink, and cache behavior becomes easier to predict. The benchmark is measuring something, but it is not measuring a realistic production workload.

Benchmarking KV cache offloading with “hi hi hi” is like benchmarking a database with SELECT 1. The numbers come back fast, but they do not tell you much about real workloads.

The difference shows up immediately

We compared the default benchmark document against a realistic medical document using Qwen3-8B-FP8. Both ran about 10,000 tokens. The synthetic one carried two unique tokens, a token diversity of 0.02 percent. The medical one carried 1,329, or 13.3 percent. The token count is the same. Everything else is different.
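Token diversity here is unique tokens divided by total tokens. A minimal sketch of the metric follows, with one loud assumption: it uses whitespace splitting as a stand-in for a real tokenizer, so it reproduces the idea rather than the exact counts a model tokenizer would give.

```python
def token_diversity(tokens) -> float:
    """Fraction of tokens that are unique: len(set) / len(list)."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    return len(set(tokens)) / len(tokens)


# Stand-in for the default benchmark document: an id plus repeated "hi".
# Assumption: whitespace split approximates tokenization for this input.
synthetic = ("0 " + " ".join(["hi"] * 9_999)).split()

print(len(synthetic), len(set(synthetic)))
print(f"{token_diversity(synthetic):.2%}")
```

Running the same function over token IDs from your own tokenizer and traffic gives a quick check on whether a benchmark corpus is in the right range.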

Token diversity affects activation patterns. Activation patterns affect tensor value distributions. Tensor distributions affect compression ratios and transfer sizes. A benchmark built from highly repetitive inputs can diverge sharply from one built on realistic text.

Compression behaves differently too

We first ran into this while building a Valkey-backed KV cache connector. We followed the LMCache tutorials and used dummy weights, and compression ratios looked excellent. Then we switched to Qwen3-8B-FP8 with real trained weights, and the results were very different: the compression ratios fell well short of what the dummy weights had shown.

Real model weights produce tensor values that behave more like high-entropy floating-point data. General-purpose compression still helps, but the gains are smaller than they appear with dummy weights. Repetitive inputs create more structured activation patterns that compress more effectively, making the benchmark results appear better than they actually are.
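The effect is easy to reproduce in miniature with a general-purpose compressor. This sketch does not use KV tensors at all: it assumes a repeated byte pattern as a stand-in for highly structured data and random bytes as the extreme high-entropy case. Real KV tensors fall somewhere between the two.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Original size divided by zlib-compressed size (higher is better)."""
    return len(data) / len(zlib.compress(data, level=6))


repetitive = b"hi " * 100_000          # highly structured stand-in
high_entropy = os.urandom(300_000)     # extreme high-entropy stand-in

print(f"repetitive:   {compression_ratio(repetitive):.0f}x")
print(f"high entropy: {compression_ratio(high_entropy):.2f}x")
```

The same compressor looks excellent or useless depending only on the input, which is the trap a repetitive benchmark document sets.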

So we built a more realistic corpus

To see how KV cache systems behave under representative inputs, we built a corpus of 30 long-form documents across medical and legal domains. We wanted documents that resemble the structure, formatting, vocabulary, and variability that real systems process.

Medical documents averaged 14.2 percent token diversity. Legal documents averaged 9.4 percent. Both are hundreds of times more diverse than the synthetic baseline at 0.02 percent.

The corpus holds 300,000 tokens and was generated with Qwen3-8B-FP8. The corpus and generation scripts are open source for those interested.

Benchmarking the workload you actually have

Token count is the easiest property of a workload to reproduce, and often the least informative. Match the diversity, structure, and vocabulary of the text your system serves, and the compression ratios and transfer sizes start to represent values you can trust.

The input comes first. Before ranking cache connectors, compression schemes, or storage backends, run them on inputs that look like your traffic. Compare them on “hi hi hi” and you are ranking them on a workload nobody runs.

Before you evaluate your next KV cache offloading system, be sure to ask “Are the benchmark documents representative of the workload I actually run?”

vLLM's Hash Chain and Why Prefix Caching Is Still Prefix Caching

Khawaja Shams — Mon, 22 Jun 2026 00:00:00 GMT

Automatic Prefix Caching sounds like it should solve a bigger problem than plain prefix caching. Requests are hashed, cache entries are discovered automatically, and shared work never has to be tracked by hand. It looks like a system that can find reusable computation wherever it appears.

But it’s still the same type of reuse we’ve always had. Shared prefixes are reusable. Shared content that is not a prefix is not. That rule sets the biggest limit on what today’s inference infrastructure can reuse.

Most of the recent work in KV caching has focused on finding prefixes more efficiently. Hash chains, radix trees, and automatic discovery all improve the mechanics of reuse. But they don’t change what can be reused. The workloads where it succeeds, and where it falls short, show why.

When prefix caching is enough

Reusing the KV cache only pays off when requests share computation, so the question is how much of a real workload today’s infrastructure can reuse.

For many agentic workloads, the answer is often “enough”. Stable system prompts, tool definitions, and conversation scaffolding create long shared prefixes, and inference engines are good at finding and reusing them.

Plenty of workloads share more than just prefixes, though. RAG pipelines are where it breaks down. The system prompt stays fixed while the retrieved documents change from request to request. Two requests might carry the same five documents in a different order. The meaning is almost identical. The token sequence is not, and the cache matches on the token sequence. Same content, different positions, no reuse.

Why prefix caching remains prefix-bound

vLLM’s Automatic Prefix Caching uses content hashing to remove the need for explicit prefix tracking. The KV cache is divided into fixed-size blocks, and each block gets a SHA-256 hash. The hash for block N folds in the hash of block N−1 along with the content of the current block, and because block N−1’s hash was built the same way, it transitively covers every preceding block. Chained together, every block fingerprints not only its own content but the entire token history before it.

When a request arrives, vLLM hashes each block-sized chunk of input and checks whether a matching block already exists. Matching blocks reuse previously computed KV cache. Missing blocks trigger fresh computation. Lookup is effectively constant-time, eviction is straightforward, and fixed-size blocks map cleanly onto GPU memory.

But because each block hash depends on every block before it, one divergence changes every hash that follows. Take two requests that share the first 3,000 tokens and split at token 3,001. The block holding token 3,001 hashes differently, and so does every block after it. Reuse stops at the point of divergence. The system discovers shared prefixes on its own, and it can’t discover shared content that appears once requests have diverged.
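A toy version of the chain makes the divergence behavior concrete. This is a simplified illustration, not vLLM's implementation: the payload encoding, the helper names, and the choice to hash only full blocks are assumptions made for the sketch.

```python
import hashlib

BLOCK_SIZE = 16


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Chained SHA-256 digests, one per full block of token IDs."""
    hashes = []
    parent = b""  # the first block has no parent
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        # Each digest covers the parent digest plus this block's tokens.
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent)
    return hashes


def shared_blocks(a, b) -> int:
    """Number of leading blocks whose chained hashes match."""
    count = 0
    for ha, hb in zip(block_hashes(a), block_hashes(b)):
        if ha != hb:
            break
        count += 1
    return count


base = list(range(64))                      # four blocks of 16 tokens
other = base[:40] + [999] + base[41:]       # one token differs, in block 2
print(shared_blocks(base, other))           # blocks 0 and 1 still match
print(block_hashes(base)[3] == block_hashes(other)[3])
```

Block 3 holds identical tokens in both requests, yet its hash differs because its parent differs, so it cannot be reused.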

Reuse happens only at block boundaries. If two requests share 1,000 tokens and a block holds 16, vLLM reuses 62 whole blocks, or 992 tokens, and recomputes the remaining 8. For long prefixes that waste is negligible. For short or irregular shared segments it’s more substantial.
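The boundary arithmetic is integer division. A small sketch, with an illustrative helper name and the 16-token block size from the example above:

```python
def reusable_tokens(shared_prefix_tokens: int, block_size: int = 16):
    """Return (full blocks reused, tokens reused, tokens recomputed)."""
    full_blocks = shared_prefix_tokens // block_size
    reused = full_blocks * block_size
    return full_blocks, reused, shared_prefix_tokens - reused


print(reusable_tokens(1_000))   # the 1,000-token example above
print(reusable_tokens(20))      # a short shared segment loses a larger share
```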

There is no matching inside a block, either. Two blocks that share 15 of their 16 tokens still hash to completely different values, so reuse is all-or-nothing at the block level. These are reasonable tradeoffs that keep the implementation simple and fast, but they still leave you with the same limitation: reuse follows exact prefix structure.

A different cache structure might help. SGLang takes that route. Instead of hashing fixed-size blocks, it keeps cached state in a radix tree indexed by token sequences. When a request arrives, the runtime walks the tree and finds the longest matching cached prefix, and matches can fall on arbitrary token boundaries rather than fixed block ones. That helps workloads with variable-length turns, branching conversations, and irregular prefix lengths.

The radix tree still searches for the longest shared prefix, though. Once two requests diverge, the content they share later in the sequence stays out of reach. SGLang improves how prefixes are discovered, but it does not extend reuse past prefixes.
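The longest-prefix lookup can be sketched with a plain token-level trie. This is a stand-in for the idea, not SGLang's data structure: a real radix tree compresses runs of tokens into single edges and stores KV state on the nodes, and the class and method names here are illustrative.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # token id -> TrieNode


class PrefixIndex:
    """Uncompressed trie over token IDs, used for longest-prefix lookup."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, token_ids):
        node = self.root
        for tok in token_ids:
            node = node.children.setdefault(tok, TrieNode())

    def longest_prefix(self, token_ids) -> int:
        """Number of leading tokens already present in the index."""
        node = self.root
        matched = 0
        for tok in token_ids:
            node = node.children.get(tok)
            if node is None:
                break
            matched += 1
        return matched


index = PrefixIndex()
index.insert([1, 2, 3, 4, 5, 6, 7, 8])
print(index.longest_prefix([1, 2, 3, 4, 5]))           # any token boundary
print(index.longest_prefix([1, 2, 3, 9, 5, 6, 7, 8]))  # stops at divergence
```

The second lookup shares tokens 5 through 8 with the cached sequence, but the walk ends at the first mismatch, so that later overlap is never found.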

Beyond prefixes

Prefix caching tells us that KV reuse clearly works. The more open question now is how much reuse survives divergence. Today’s runtimes are tuned for shared prefixes. The next generation goes after shared segments, cache repair, and the reuse that prefix matching cannot reach.

Much of the current research now focuses on cache repair and segment-level reuse. The goal has shifted from proving that KV reuse is valuable to recovering the work that today’s prefix-based systems still leave behind.

Disaggregation makes KV cache a system primitive

Khawaja Shams — Fri, 19 Jun 2026 00:00:00 GMT

Inference is scaling faster than the serving architectures around it. Prefill and decode are often treated as one pipeline, but they are fundamentally different workloads.

Prefill is compute-heavy, with an order of magnitude more arithmetic intensity than decode. Decode is sensitive to memory bandwidth and to latency. Prefill wants high-FLOPS accelerators. Decode wants large, fast memory. Put both on the same GPU and you tune it for one profile while it absorbs the other, and under load prefill interferes with decode. Neither phase gets the hardware it would choose.

The cache becomes the connection

Disaggregation separates these two phases. Prefill nodes do prefill, decode nodes do decode, each on hardware suited to its bottleneck.

Separation removes the interference, but it opens a gap. Inside one accelerator, prefill flows straight into decode and the intermediate state never leaves the chip. Pull the two onto different machines and that state, the KV cache, has to be handed across. Prefill produces it, decode consumes it, and disaggregation turns it into the object that connects them.

On a single node, the KV cache is largely an implementation detail the inference engine manages for you. But once prefill and decode are separate systems, the cache is the connection between them, and every request depends on getting it from one to the other.

What the split asks of the cache

Once the cache has to travel between machines, it takes on the requirements of any object moving through a distributed system.

The cache has to move from prefill nodes to decode nodes, which makes transfer latency, serialization format, and network bandwidth first-order concerns. It has to land in the right place, so deciding which decode node receives which cache turns routing into a scheduling problem. It has to expire, which raises the question of who evicts an entry and when, work the engine handled in a colocated system and that now needs coordination. And it has to live somewhere across GPU memory, host memory, NVMe, and remote storage, each tier with its own latency and capacity tradeoffs.

These are distributed systems problems that present themselves whenever you separate storage and compute. What was colocated becomes independently addressable, connected by a transfer layer. The techniques for solving them, placement, routing, eviction, and tiered storage, are well understood. What is new is applying them to the KV cache inside inference serving.

From implementation detail to system design

The major inference stacks are already built around this split. NVIDIA Dynamo describes disaggregated inference as a prefill engine that computes the prefill phase and generates KV cache, hands that cache to a decode engine, and lets the decode engine run the decode phase. AWS is building the split into its infrastructure, with Trainium for compute-heavy prefill, and the Cerebras partnership likely points the same way, since wafer-scale SRAM suits memory-bound decode.

In each of these, the KV cache is the object the tiers hand between them. On a single node it was an optimization you could run, skip, or tune, and the architecture did not care. Disaggregation moves those same problems up a level, from implementation details the engine used to hide to system design concerns the architecture has to own. The cache now has to be transferred, routed, stored, and expired across machines, and the whole system is built around getting it from prefill to decode.

What began as transient state inside a single request becomes the handoff between two systems. Once we have that handoff, cache management becomes a first-class primitive of the architecture.

KV Caching Pays Off Under Load

Khawaja Shams — Tue, 16 Jun 2026 07:00:00 GMT

KV caching looks like a bad trade on paper. Memory, complexity, and operational surface area, all spent to shave a few percent off a request.

The benchmarks do not rescue it.

We’ve seen teams leave it at that. KV cache is necessary inside a single forward pass, so you keep it for the life of the request and move on. Keeping it alive beyond that, reused across requests, starts to sound like a serving-layer luxury. You picture the memory it would pin, the eviction logic, the extra moving parts in a stack that is already hard enough to operate. Set that against a few percent of latency and the trade does not look worth making.

Understandably so. Run the numbers on a single request and long-lived KV caching underwhelms. We ran them, and at first it was a very unflattering story. But a single request is the wrong unit to judge this on, and once you measure at production load the economics turn.

The single-request savings are bounded

In one of our Qwen3-30B-A3B runs, a 1K input / 512 output request came in around 135 ms TTFT and about 2.5 seconds of total request latency. TTFT carries scheduling and queueing overhead on top of raw prefill compute, so call the prefill portion roughly 100 ms. Erase it completely, the most a perfect cache hit can do, and you save about 4 percent of the request. Treat that as the ceiling, not the everyday case.

As context grows, so does prefill’s share of the total request latency. At 16K input / 512 output, TTFT was about 769 ms of 3,200 ms total, which puts it near 24 percent. That is a real step up from the 1K case. What drives prefill’s share is the input/output ratio more than the absolute context length. KV cache earns the most when a request carries a large input and returns a small output, because prefill is then a bigger share of the bill. When the input is short and the output is long, decode dominates and the cache has little room to help.
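The two shares quoted above are simple ratios, and the same calculation gives the ceiling on what a perfect cache hit can save for any request. A minimal sketch, with an illustrative function name and the figures from the two runs:

```python
def prefill_share(prefill_ms: float, total_ms: float) -> float:
    """Upper bound on the latency a perfect prefix-cache hit can remove."""
    if total_ms <= 0:
        raise ValueError("total_ms must be positive")
    return prefill_ms / total_ms


short_input = prefill_share(100, 2_500)   # 1K in / 512 out, ~100 ms prefill
long_input = prefill_share(769, 3_200)    # 16K in / 512 out, TTFT as proxy
print(f"{short_input:.0%}  {long_input:.0%}")
```

Note that the 16K case uses TTFT as the prefill figure, so it includes some scheduling and queueing overhead and slightly overstates the ceiling.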

On its own, a 4 or 24 percent share looks modest. But latency is measured at a target throughput, and GPU capacity is scarce. As throughput climbs, utilization, queueing, and pipeline stalls raise the cost of redoing prefill. In production, prefill becomes a capacity and tail-latency problem once thousands of requests compete for the same accelerators.

So the skepticism is fair, for the single request. If the only question is whether one cache hit meaningfully cuts one request’s latency, the answer is often no, and it depends on the input/output ratio and how much of the request prefill actually owns. At that level, KV caching is not an automatic win.

Large-input, small-output workloads are getting more common, not less. Agentic workloads are multi-turn by nature. Context grows as chat history, tool-call results, and retrieval chunks pile up, so each new turn carries a larger input against a small output. Exactly the type of workload where prefill dominates and where reusing the cache has the most to give.

For application teams, reusing the cache shows up as lower TTFT, tighter p95, and lower cost per request. For the platform teams running the GPUs, it shows up as higher utilization and more capacity per dollar.

Expensive to hold, hard to reuse

That said, the KV cache is not free to keep. Hold it in GPU memory and it competes with active inference for the scarcest space you have. Move it to host memory and it is cheaper but still bounded, with DRAM prices trending the wrong way. Push it out to remote memory or storage and you take on transfer latency, placement problems, and more operational surface. KV caching is a bet. You are spending scarce memory on the wager that future requests reuse the work you are holding.

But that’s only half the problem. Even when you are willing to pay for the memory, the reuse you get back is limited. The production-friendly option today is prefix caching: if a later request begins with the exact same prefix, the engine reuses the KV cache already computed for it. The rule is strict, exact prefix match or nothing. Plenty of real workloads share meaning without sharing a prefix. Reordered retrieval chunks, varying tool results, and shifting user context carry the same semantic content in different positions, and none of it counts as a hit, so hit rates suffer.

It might seem like KV caching has a lot going against it. The single-node latency win is bounded. The memory cost is high. The reuse model is narrow. Evaluate it as an isolated optimization on one node and the honest question is whether the complexity earns its place. In isolation, often it does not. But isolation is the wrong frame because inference is not the system it was when those objections were formed. Each of them was measured against a single node running prefill and decode together, holding a cache that was large and expensive to keep. Two things have shifted since. The first is structural, in where prefill and decode run. The second is economic, in what the cache costs to hold and to move. Each one undercuts a different piece of the case against.

Inference is becoming a distributed systems problem

Prefill and decode are not the same kind of work. Prefill is compute-heavy. Decode is sensitive to memory bandwidth and to latency. Put both on the same accelerator and you force a compromise on one to serve the other. Split them, and you create a clean boundary between two workloads that want different things. The KV cache, however, has to cross that boundary.

When prefill and decode live on different hardware, the KV cache becomes a first-class distributed systems primitive, something you transfer, place, and manage a lifecycle for. NVIDIA Dynamo and the disaggregated stacks coming out of AWS and Cerebras are building the split into the infrastructure itself, which is what forces developers to think about how the KV cache moves, where it lives, and how long it stays alive.

The economics are starting to move

The second shift is economic. The KV cache itself is getting more efficient to store and to move. A surprising amount of recent model progress is really KV cache innovation, and the last 18 months have been striking.

DeepSeek-V2 and V3 introduced Multi-head Latent Attention (MLA). MLA compresses keys and values into a shared low-rank latent vector before anything gets cached. For V3, that takes the per-token cache from roughly 16,384 scalar dimensions under standard multi-head attention down to 576, a 512-dimensional latent plus 64 dimensions for decoupled RoPE. Against an MHA baseline that is about a 28x reduction. Against the GQA baseline most modern models already use, it is smaller, roughly 4 to 8x depending on group size, and MLA gets there while holding MHA-level quality, which GQA gives up.
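As a quick check of that arithmetic, the sketch below recomputes the reduction using only the figures quoted in the paragraph above; the variable names are illustrative.

```python
# Per-token cache width, using the figures quoted in the text above.
mha_dims = 16_384        # standard multi-head attention baseline
mla_latent = 512         # compressed latent
mla_rope = 64            # decoupled RoPE dimensions
mla_dims = mla_latent + mla_rope

reduction = mha_dims / mla_dims
print(mla_dims, round(reduction, 1))  # 576 28.4
```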

Qwen 3.5 goes a different way with a Gated DeltaNet hybrid. It replaces 75 percent of its attention layers with Gated DeltaNet linear attention, layers that hold a fixed-size state matrix, 128 by 128 per head, and update it incrementally with each token. The state does not grow with sequence length. Only the remaining 25 percent, full softmax attention with GQA, still needs a traditional KV cache. At long contexts, where the KV cache usually dominates memory, this removes most of the growth. The payoff scales with context: substantial at 256K tokens, modest at 1K, where a fixed-size state costs about what a small KV cache would anyway.
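A rough way to see why the payoff scales with context is to count cached elements for a full-attention model and for a 75/25 hybrid. The layer count, head counts, and head dimension below are hypothetical values chosen only to show the trend; they are not Qwen 3.5's actual configuration.

```python
def full_attention_elems(layers, kv_heads, head_dim, seq_len):
    """KV cache elements when every layer uses softmax attention with GQA."""
    return layers * 2 * kv_heads * head_dim * seq_len


def hybrid_elems(layers, linear_heads, kv_heads, head_dim, seq_len,
                 linear_fraction=0.75, state_dim=128):
    """Fixed-size state for the linear layers plus a KV cache for the rest."""
    linear_layers = int(layers * linear_fraction)
    attn_layers = layers - linear_layers
    state = linear_layers * linear_heads * state_dim * state_dim  # no seq_len term
    kv = attn_layers * 2 * kv_heads * head_dim * seq_len
    return state + kv


# Hypothetical model shape, for illustration only.
SHAPE = dict(layers=48, kv_heads=4, head_dim=128)


def ratio(seq_len):
    """How many times smaller the hybrid footprint is at a given context length."""
    full = full_attention_elems(seq_len=seq_len, **SHAPE)
    hybrid = hybrid_elems(linear_heads=16, seq_len=seq_len, **SHAPE)
    return full / hybrid


for seq_len in (1_024, 262_144):
    print(seq_len, round(ratio(seq_len), 2))
```

Under these assumptions the ratio climbs with context length and approaches, but never reaches, 4x, which is the ceiling set by the 25 percent of layers that still keep a growing KV cache.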

TurboQuant and PolarQuant, from Google at ICLR 2026, take yet another angle. Instead of changing the attention mechanism, they quantize the KV cache itself to 3 or 4 bits per coordinate with no measurable accuracy loss on standard benchmarks. PolarQuant rotates vectors with a random orthogonal matrix so the coordinates follow a known distribution, then applies an optimal Lloyd-Max scalar quantizer, and QJL adds a 1-bit residual correction. At 4 bits the paper reports up to 8x faster attention on an H100. At 3 bits, roughly 6x memory reduction.

The exact numbers depend on baselines and configurations, but the direction is obvious. MLA shrinks the cache dramatically. Hybrid architectures such as Qwen’s Gated DeltaNet reduce cache growth across much of the network. Quantization approaches such as TurboQuant reduce memory requirements further without changing the model architecture. Different tradeoffs, same trend: the object is getting smaller.

Memory cost is the usual objection to KV caching, but this recent work almost makes it moot. Shrink the cache by 6x to an order of magnitude and the economics look very different. More entries fit in the same budget. Transfers from remote memory, SSD, or another node get faster. There’s a misconception that the cache has to become trivially small. But it only has to get small enough that the economics cross over for the workloads people run in production.

The storage hierarchy is changing as well. The old assumption was that if the KV cache is not in GPU memory, it is too slow to matter. That is getting harder to say. Fast interconnects and local NVMe continue to improve. Now the question is whether moving or loading the cache can free the decode GPU from repeated prefill work and keep it pointed at the latency-sensitive part of the pipeline. If a storage-backed cache lowers prefill pressure and keeps accelerator capacity on decode, it can pay off quickly.

The workload is the main success driver. For a given model and context length, the system weighs the time to recompute prefill against the time to move the cache over the network, the time to read it from SSD, the cost of reserving the memory or storage, and the odds the cache gets reused at all. When transfer or load time comes in well under recompute time and reuse is likely enough, the cache earns its place. When it does not, the cache is a cost with no return.
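That tradeoff can be written down as a simple break-even check. This is a minimal sketch under stated assumptions: the function name is hypothetical, all inputs are illustrative numbers, and the cost of holding the cache is expressed in the same time units as the savings, which is a simplification.

```python
def cache_is_worth_it(recompute_s, load_s, reuse_prob, hold_cost_s=0.0):
    """Return True when the expected time saved by reuse exceeds the holding cost.

    recompute_s : time to redo prefill for the cached tokens
    load_s      : time to fetch the cache over the network or from SSD
    reuse_prob  : probability the entry is requested again before eviction
    hold_cost_s : cost of reserving memory or storage, in the same time units
                  (a simplifying assumption; real systems price this differently)
    """
    expected_saving = reuse_prob * (recompute_s - load_s)
    return expected_saving > hold_cost_s


# Hypothetical numbers for illustration only.
fast_load = cache_is_worth_it(recompute_s=2.0, load_s=0.2, reuse_prob=0.6, hold_cost_s=0.5)
slow_load = cache_is_worth_it(recompute_s=2.0, load_s=1.8, reuse_prob=0.6, hold_cost_s=0.5)
print(fast_load, slow_load)  # True False
```

With the same reuse probability, the entry pays for itself when loading is much faster than recomputing and becomes pure cost when the two times are close.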

Prefix caching under load

We saw this behavior in an experiment we recently ran. We used Qwen3-1.7B on an L40S, a 10K-token prompt, and the number of shared prefix tokens varied from 0 to 10K across several concurrency levels. As the shared prefix grows, the vLLM prefix cache hit ratio climbs from 0 to 1. Throughput and request latency were monitored at each concurrency level.

Prefix cache hit ratio grows linearly with shared prefix length, from 0 to 10K tokens against the fixed 10K-token prompt.

Throughput rises as the shared prefix grows and is sharpest at high concurrency, where skipping repeated prefill lets the system sustain more requests per second.

Mean request latency falls as reuse increases, with the high-concurrency settings improving most.

At low concurrency, a higher hit ratio helps, but only a little. At high concurrency, the same increase results in a much larger system effect. Requests per second climb sharply as more of the prompt comes from cache, and mean latency decreases along with it, dropping fastest at the higher concurrency levels. That is the behavior one would expect if KV caching is a systems optimization rather than a single-request latency trick.

The experiment is a small model on a single node, so it does not prove the disaggregated-architecture argument on its own. But it does verify that prefix cache hits remove prefill work from the serving path, and the system-level benefit grows with concurrency. The disaggregation thesis is that this gets stronger when prefill and decode run on separate hardware and the KV cache moves between them as a first-class object.

Prefix caching already works for the right workloads

Prefix caching generally has a narrow sharing mode. For agentic workflows it fits more naturally than you might expect, though the reason it fits changes by category of context.

System prompts are the easy case. They are stable across requests, they sit at the front of the prompt, and they are a textbook prefix hit. An agent making a series of tool calls against the same backend reuses the same 2K to 8K token system prompt on every request. A multi-turn conversation with a fixed system prompt reuses the whole instruction block. A code-generation agent with stable repository context reuses the project description and file summaries. For this category, cross-request caching is straightforward.

Other kinds of context ask for more care. Chat history grows and shifts from turn to turn. Tool-call exemplars get reordered or swapped. Retrieval chunks change with every RAG query. These often share material across requests without sharing an exact prefix, so the effectiveness of the cache comes down to how much of the context is positionally stable (which prefix caching handles), versus variable (which needs something like CacheBlend to unlock).

Research for a better solution

For messier patterns, like RAG with retrieval chunks that vary by query or tool results that differ between calls, two requests can share a great deal of material without sharing the exact same prefix, and classic prefix caching returns a miss in those cases even when most of the computation could have been reused.

CacheBlend is one of the research directions in this area. It explores the idea of cache repair: take a cached entry that is semantically similar to what the current request needs, and selectively recompute only the parts that differ. If repair is cheap enough, individual caches become reusable across more requests and hit rates rise without spending more memory.

This is still open research and is not solved yet. No major inference framework ships chunk-level KV reuse today. The selective recomputation carries its own latency, quality preservation depends on the workload, and the methodology for measuring these tradeoffs is still maturing. But the direction is promising. More flexible cached entries raise the effective hit rate inside the same memory budget.

Prefix caching answers “does caching work?” for a growing number of workloads. The open question is how much of the rest can be brought into the cacheable regime, and cache repair is where we are working that out.

From per-request state to systems primitive

If you evaluate KV caching as an isolated optimization on a single node, it doesn’t make much sense. The memory cost is high and prefix-based reuse is limited.

But the architecture underneath is changing. Disaggregated prefill and decode create the right interface. Better attention mechanisms shrink the object you have to store and move. Faster networks and SSDs reduce transfer costs. Cache repair could push reuse beyond strict prefixes. Scarce GPU capacity makes repeated prefill work harder to justify.

Together, these shifts are turning the KV cache from a temporary intermediate state into an inference systems primitive.

Beyond the Goals, Three Ways Momento Scales the Football World Cup in Real Time

Lionel Bringuier — Wed, 03 Jun 2026 16:14:30 GMT

Don’t have five minutes? This infographic summarizes the key takeaways from this blog.

The FIFA World Cup 2026 is not just a tournament for Momento. It is a live fire test of what real-time data at global scale actually means.

In stadiums and on sofas, hundreds of millions of fans will see goals, cards, and heartbreak. Behind the scenes, three of Momento’s largest customers will be doing something just as intense: pushing a real-time data platform to the limit across live origination, content protection, and AI-powered personalization.

Beneath these vastly different workloads lies the same fundamental principle: decisions must be made instantly, at global scale, with zero excuses.

What Do We Mean by “Momento is a Real-Time Data Platform”?

“Data platform” is one of those phrases that can mean anything from a gigantic static data warehouse to a firehose of events in flight. When we say Momento is a real-time data platform, we mean something specific. We combine:

A sub-millisecond in-memory data engine: This is the RAM-cache based data plane that serves reads and writes in less than a millisecond, even under massive load.
An intelligent control plane: This layer automatically handles sharding, scaling, partitioning, and hot-key management, so app teams do not have to be distributed systems experts.

In practice, customers can treat Momento like a simple API to store and retrieve state in real time: segments and manifests, concurrency counters, user events, AI embeddings, and more. They describe their data model and policy. Momento makes it fast, durable, and observable.

The FIFA World Cup is a perfect way to show what that actually looks like when the stakes are highest. Think of it as a hat-trick of real-time use cases.

1/ A Live Origin That Just Does Not Flinch

Who: A large UK broadcaster holding FIFA rights

Problem: Their live origin service (AWS Elemental MediaStore) was deprecated before the World Cup. They needed the same low latency, failover behavior and observability, at World Cup scale, without rewriting their encoder, packager, video player or CDN configurations.

How Momento helps: Momento Media Storage is their new live origin. It gives them:

Predictable, low latency: Consistent latency for reads and writes at the live edge.
Granular TTL control: TTL settings on manifests/segments to preserve automatic failover capabilities.
Per-Service Limits and Metrics: Observability and metrics are built in so high traffic events don’t impact the rest of their 24/7 channels.

For viewers, nothing “looks” different. For their ops teams, the origin is now actively developed, supported, faster, and ready for tens of millions of concurrent fans.

🔗 Deep dive on How a major UK broadcaster moved its World Cup live origin to Momento ↗

2/ Content Protection Through Concurrency Tracking

Who: A major US broadcaster holding FIFA rights

Problem: Pirates leverage stolen accounts and run illegal restreaming operations on top of the broadcaster’s own CDN. Traditional DRM and short-lived tokens verify a device can decrypt, but are completely blind to whether an account is behaving like a bot farm. Content rights holders are concerned about piracy and are asking broadcasters to implement server-side control over account behavior to shut down the illegal streams at the source.

How Momento helps: The broadcaster runs a server-side concurrency service backed by Momento. For each stream, a lightweight verification loop executes three steps:

Receive the heartbeat: The player calls a Momento-powered service with identifiers for the subscriber’s account, their device, the content they want to access, their geography, etc.
Enforce stream limits: Verify limits on concurrent streams per account and per event.
Deliver instant decisions: Make allow/deny decisions in single-digit milliseconds, inline with playback.

If one account suddenly powers hundreds of devices on the same game, the system can automatically shut it down, protecting revenue, CDN bills, and QoE for legitimate fans.
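The verification loop above can be sketched with an in-memory stand-in. This is not Momento's API: the class, method names, limits, and TTL are all hypothetical, and the sketch enforces only a per-account limit, whereas the service described above also limits per event.

```python
class ConcurrencyTracker:
    """In-memory stand-in for a heartbeat-based stream limiter (hypothetical)."""

    def __init__(self, max_streams, heartbeat_ttl_s):
        self.max_streams = max_streams
        self.heartbeat_ttl_s = heartbeat_ttl_s
        self._last_seen = {}  # account_id -> {device_id: last heartbeat time}

    def heartbeat(self, account_id, device_id, now):
        """Record a heartbeat and return True (allow) or False (deny)."""
        devices = self._last_seen.setdefault(account_id, {})
        # Drop devices whose heartbeat has expired, so closed streams free a slot.
        expired = [d for d, t in devices.items() if now - t >= self.heartbeat_ttl_s]
        for dev in expired:
            del devices[dev]
        if device_id in devices or len(devices) < self.max_streams:
            devices[device_id] = now
            return True
        return False


tracker = ConcurrencyTracker(max_streams=2, heartbeat_ttl_s=30)
decisions = [
    tracker.heartbeat("acct-1", "tv", now=0),
    tracker.heartbeat("acct-1", "phone", now=5),
    tracker.heartbeat("acct-1", "laptop", now=10),  # third device: over the limit
    tracker.heartbeat("acct-1", "laptop", now=40),  # tv and phone have expired
]
print(decisions)  # [True, True, False, True]
```

Passing the clock in as `now` keeps the sketch deterministic; a production service would rely on the store's own TTLs instead of scanning for expired entries on each call.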

🔗 Details on the architecture and anti-leeching approach: Stop CDN Leeching with Concurrency Tracking ↗

3/ AI-Powered Personalized Feeds for a Sports App

Who: A popular US-based sports content super app

Problem: Just ahead of the World Cup, this content provider wanted to relaunch their app with a brand-new user experience. Beyond the traditional “click to watch” from their editorial content, they needed a real-time, personalized feed that mixes editorial pieces, highlights, YouTube clips, and content from popular social networks, tuned to each fan’s behavior as it happens. Think of it as a TikTok “For You Page” for sports.

How Momento helps: Momento serves as the real-time event collection and embedding layer, making real-time machine learning models visible to the app users at massive scale:

Dynamic content ingestion: New content from newsrooms, creators, social networks, and athletes is ingested and turned into AI embeddings.
Real-time signal streaming: User signals including emoji reactions, comments, watch time, and scroll depth stream into Momento in real time.
Sub-millisecond personalization: Recommendation services query Momento’s sub-millisecond data plane to match fresh content to each fan’s evolving interests.

The result: A feed that feels instantly relevant and keeps improving as fans interact with their personalized feed, at the scale of millions of concurrent users.
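The matching step can be illustrated with a toy example. Everything here is an assumption for illustration: the three-dimensional embeddings, the update weight, and the function names are hypothetical, and the customer's actual models and storage calls are not described in this post.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def update_profile(profile, item_vec, weight=0.2):
    """Nudge the fan's interest vector toward content they just engaged with."""
    return [(1 - weight) * p + weight * v for p, v in zip(profile, item_vec)]


def rank(profile, catalog):
    """Return content ids ordered by similarity to the fan's current interests."""
    return sorted(catalog, key=lambda cid: cosine(profile, catalog[cid]), reverse=True)


# Hypothetical embeddings with axes [football, basketball, tennis].
catalog = {
    "goal-highlight": [0.9, 0.1, 0.0],
    "dunk-compilation": [0.1, 0.9, 0.0],
    "wimbledon-recap": [0.0, 0.1, 0.9],
}
profile = [0.2, 0.7, 0.1]          # a basketball-leaning fan
before = rank(profile, catalog)
for _ in range(5):                 # the fan reacts to several football clips
    profile = update_profile(profile, catalog["goal-highlight"])
after = rank(profile, catalog)
print(before[0], after[0])  # dunk-compilation goal-highlight
```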

🔗 Video explainer on real-time embeddings with Momento: Momento AI embeddings & event collection explainer ↗

Let’s Watch Football, Not Infrastructure

Three very different workloads, all powered by the same underlying platform: a low latency data plane with an intelligent control layer that simplifies real-time data usage for application teams.

When the first match kicks off, the fans will be watching football, not infrastructure. But inside control rooms, NOCs, and product teams, our customers will know. They will see cleaner dashboards, stronger protections, and faster feedback loops. They will see a real-time data platform doing exactly what it was built to do.

We are excited to be part of their World Cup story. And once the final whistle blows, these capabilities will not disappear. They will become the new baseline for what fans expect from live sports, everywhere.

A New Live Streaming Origin Built for Global Scale

Lionel Bringuier — Thu, 28 May 2026 20:14:36 GMT

It’s the world’s most watched sport. In the US, it’s soccer; everywhere else, it’s football; and in the UK, it’s nearly a religion. In the run-up to the FIFA World Cup 2026, a large UK-based broadcaster made a big bet: they migrated their live streaming origin off the deprecated AWS Elemental MediaStore service to Momento, and they did it in time to serve tens of millions of fans.

This wasn’t a simple lift and shift. The broadcaster’s live stack is a mature, battle-tested system that has evolved over years of 24×7 live linear channels, major tournaments and peak news moments. Replacing the core media storage and origin layer under that stack meant Momento had to meet an exacting bar on latency, reliability and operational visibility, while continuing to support an existing fleet of encoders and packagers, CDNs, and control-plane tooling.

This post walks through the existing architecture, adapting Momento as a drop-in replacement for AWS Elemental MediaStore, and then hardening the infrastructure for the World Cup and beyond.

The Starting Point: Live Origin at Scale

The broadcaster runs a large portfolio of 24×7 simulcast channels plus frequent pop-up live events in both HD and UHD, delivered via DASH and HLS. Streams are published into multiple AWS regions for redundancy, and each stream fans out across multiple CDNs.

To give a rough estimate of the data volume, their live origin manages 40+ live 24×7 channels, with some additional seasonal pop-up channels (up to 40 concurrent ones for large sport events). The live channels are encoded in H.264 and HEVC, with typically 8 to 11 video encoding profiles and 4 audio tracks in the ABR ladder. Segments and manifests are pushed into a media storage service deployed in two AWS regions, with cross-AZ replication in each region. On the playback side, CDNs pull from that origin, either directly or via an internal cache concentration layer to distribute traffic across providers and geographies.

At first, the broadcaster evaluated moving the origin to a standard S3 bucket. However, their architecture and operations depended on certain capabilities that would not be guaranteed.

When AWS Elemental MediaStore was deprecated, the broadcaster faced a classic “you have to rebuild the airplane while flying it” challenge: replace a key service in their workflow without rewriting their packagers, CDN configs, video players, or control planes, and without introducing new failure modes at the worst possible time, a year ahead of the global football tournament. What were the tenets for the new service?

Fast live-edge reads and writes from London-based clients: tight time-to-first-byte and time-to-last-byte for both PUT (publication from the encoders) and GET (playback).
Transient data policies to aggressively expire stale manifests, forcing automatic failover to backup origins when the primary stopped publishing.
Per-container request limits to prevent one high-traffic service from drowning others.
Per-container access policies and CORS for secure origin access from CDNs and packagers.
Detailed access logging and metrics for publication latency, error codes, empty object publications, and regional breakdowns.
Lifecycle management to trim historical content and control storage costs for the content in the hot cache.
S3-backed durability under the hood, without compromising the consistency of access latency.

Design Goal: A Drop-In MediaStore Replacement

The joint design goal was straightforward to state but hard to achieve: “just swap the origin to Momento Media Storage with minimal application changes, while preserving, or improving, the operational semantics we rely on today”.

Concretely, that translated into a few core requirements for Momento:

Equivalent HTTP surface area: Keep the same style of HTTP PUT/GET semantics, 404/50x behavior for missing segments, and origin-side access control via headers and tokens.
Performance parity or better from London: For UK-based clients, Momento had to deliver GET and PUT latencies that matched or beat their existing origin across both eu-west-1 (Dublin) and eu-west-2 (London).
Configurable object TTLs to emulate transient data policies: Instead of path-based lifecycle rules on containers, the broadcaster wanted fine-grained TTL control per object class (e.g., manifests vs segments) to preserve their model where a stale manifest can trigger failover.
Operational observability that matched their current dashboards: Publication latency, error codes, throttling, per-path metrics, and near-real-time access logs had to remain available for the broadcaster’s existing monitoring and alerting workflows.
Sensible multi-tenant safety rails: Per-service request limits and regional SLAs needed to be enforced in a way that matched their mental model from the previous platform.
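The TTL-driven failover model in the requirements above can be sketched with a toy origin. This is an illustration under stated assumptions, not the broadcaster's implementation: the class, the six-second TTL, the path, and the in-process "origins" are all hypothetical.

```python
class OriginStore:
    """Toy origin with per-object TTLs and an injected clock (hypothetical)."""

    def __init__(self):
        self._objects = {}  # path -> (body, expires_at)

    def put(self, path, body, ttl_s, now):
        self._objects[path] = (body, now + ttl_s)

    def get(self, path, now):
        """Return (status, body); expired objects behave like missing ones."""
        entry = self._objects.get(path)
        if entry is None or now >= entry[1]:
            return 404, None
        return 200, entry[0]


def fetch_manifest(origins, path, now):
    """Try origins in priority order and return the first live manifest."""
    for name, origin in origins:
        status, body = origin.get(path, now)
        if status == 200:
            return name, body
    return None, None


primary, backup = OriginStore(), OriginStore()
origins = [("primary", primary), ("backup", backup)]
MANIFEST_TTL_S = 6  # hypothetical: manifests expire quickly

# Both origins publish at t=0; only the backup keeps publishing at t=6.
primary.put("ch1/index.m3u8", "manifest-v1", MANIFEST_TTL_S, now=0)
backup.put("ch1/index.m3u8", "manifest-v1", MANIFEST_TTL_S, now=0)
backup.put("ch1/index.m3u8", "manifest-v2", MANIFEST_TTL_S, now=6)

early = fetch_manifest(origins, "ch1/index.m3u8", now=3)
late = fetch_manifest(origins, "ch1/index.m3u8", now=8)
print(early, late)  # ('primary', 'manifest-v1') ('backup', 'manifest-v2')
```

Because the stale manifest expires instead of lingering, the primary starts returning 404 shortly after it stops publishing, and the client falls through to the backup without any explicit health check.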

Today: We Are Ready for Kick-Off

The journey from early performance tests to full production lasted almost a year. During that time, the broadcaster and Momento ran a substantial battery of tests:

Distributed publication and playback across dozens of channels in parallel.
Comparative latency benchmarks from London-based clients to both eu-west-1 and eu-west-2, under varying bitrates and ladders.
Load and failover drills to confirm that short manifest TTLs and 404 semantics still triggered the right automatic reactions in their CDN and player stack.
SDK vs native HTTP tests to iron out any client-side inefficiencies and eliminate measurement artifacts.
Reproducibility and automation, with a single orchestration layer for the whole video stack.
Operational observability at every step of the workflow that slots into their existing CloudWatch-based dashboards and alerting.

Most importantly, they are now running on an origin layer that is actively developed, not deprecated, and that can evolve with their roadmap and future needs.

For now the focus is simple: when the referee blows the whistle to start the first World Cup match, tens of millions of fans across the UK and beyond will be watching through a new origin, built on Momento, and they won’t even notice the difference. And that’s exactly how it should be.

Introducing valkey-lab: Stop Guessing When Your Cache Hits Its Limit

Khawaja Shams — Tue, 26 May 2026 20:14:51 GMT

Pop quiz: how many requests per second can your cache take before it stops meeting your latency SLO?

Chances are good you don’t know the answer, and that’s not a knock on you. It’s a genuinely hard number to come by. The standard tool for this is valkey-benchmark, and it’s great at exactly one thing: you point it at a server, you hammer it with some commands at full speed, and it prints a throughput number at the end. That number tells you the box is alive and roughly how fast it goes flat out.

From a production standpoint, that’s not as useful as it sounds.

How does the p999 hold up under an 80:20 read/write mix when half the requests land on hot keys? What does the tail look like at the rate you actually plan to run? How much headroom do you have before the SLO breaks? Was that latency spike at 10:32 a fluke or the ceiling? A single summary number printed after a sixty-second run can’t answer any of those, because it threw away the useful bits on the way to computing the average.

valkey-lab was built to answer these questions.

It’s a high-performance Valkey and Redis benchmark that uses io_uring for low-overhead asynchronous I/O, per-connection pipelining, and multi-threaded workers. The defaults are deliberately familiar. Run it without any arguments:

valkey-lab

and you get a sixty-second run against localhost:6379 with an 80:20 GET/SET ratio and a million keys. Same shape as the tool you already know, same short flags (-h, -p, -c, -P, -r). The interesting part starts once you begin asking harder questions.

Saturation search: the headroom number, found for you

Here’s how you find your headroom number today, by hand. You run a benchmark at some rate, read the p999, decide it looks healthy, bump the rate, run it again, read it again. You do this five or ten times, squinting at each result, trying to find the rate where the tail skyrockets. Somewhere in that loop you lose track of which run had which config. Eventually you settle on a number you’re “pretty sure” is right and plan capacity around it. valkey-lab does that whole search for you with one command.

valkey-lab saturate --slo-p999 1ms -c 16 -P 32

A synthetic benchmark might say a cache can handle 2M requests per second. But when you add a realistic read/write mix, hot keys, and a warm cache, your p999 suddenly crosses your SLO at 1.2M. Technically speaking, the server is still processing 2M requests per second, but the usable ceiling is significantly lower.

saturate starts issuing requests at whatever is provided in --start-rate (or 1000 if not provided) and multiplies the request rate by the --step factor on every step. The default step is 1.05, so the load compounds over time. Each step holds its rate for a sample window, measures the full percentile spread, and checks it against your SLO. The moment a percentile crosses the line, the ramp stops and reports the last rate that held.

When a step fails, valkey-lab tells you how it failed, either throughput-limited or latency-exceeded, which helps you tune your clusters more accurately.

Throughput-limited means the server couldn’t sustain the requested rate at all. It topped out below the target. That’s a capacity problem: you need more CPU, more shards, or a different topology.

Latency-exceeded means the server kept up with the rate, but the tail blew past the SLO. The server can sustain the requested rate, but something in the path is introducing tail spikes under load. Could be a hot key, a GC pause, a scheduler stall, network jitter. You fix that by chasing the spike, and adding hardware won’t help.

So based on your failure, your mitigation strategy varies wildly. And it would be impossible to know which one to pursue if all you had was the throughput number.
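The ramp and the two failure modes described above can be sketched as follows. This is not valkey-lab's implementation: the function names, the 95 percent tolerance used to decide that a target rate was not achieved, and the step cap are assumptions, and the fake server simply mirrors the 2M and 1.2M figures from the earlier example.

```python
def saturate(measure, slo_p999_us, start_rate=1000, step=1.05,
             max_steps=500, achieved_tolerance=0.95):
    """Geometric ramp: raise the target rate until a step fails.

    measure(rate) -> (achieved_rate, p999_us) for one sample window.
    Returns (last_rate_that_held, failure_reason).
    """
    last_good = None
    rate = float(start_rate)
    for _ in range(max_steps):
        achieved, p999_us = measure(rate)
        if achieved < rate * achieved_tolerance:
            return last_good, "throughput-limited"
        if p999_us > slo_p999_us:
            return last_good, "latency-exceeded"
        last_good = rate
        rate *= step  # the load compounds on every step
    return last_good, "no-failure-within-max-steps"


def fake_server(rate):
    """Hypothetical server: keeps up to 2M req/s, but the tail degrades past 1.2M."""
    achieved = min(rate, 2_000_000)
    p999_us = 300 if rate <= 1_200_000 else 5_000
    return achieved, p999_us


held, reason = saturate(fake_server, slo_p999_us=1_000)
print(round(held), reason)
```

With a 1.05 step the reported rate lands within 5 percent below the true breaking point, which is the resolution tradeoff of a geometric ramp: smaller steps give a tighter answer at the cost of more sample windows.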

Averages hide the interesting failures

The next problem surfaces when the benchmark completes. Summary statistics hide the behavior you’re usually trying to find. If your p999 was 312µs for fifty-nine seconds and 4.2 ms for one second, the run-level p999 still looks fine. The spike is the important part that you need to focus on.
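A small synthetic run shows the effect. The request counts and latency values below are made up to mirror the numbers in the paragraph above; only the percentile arithmetic is the point.

```python
def p999(samples):
    """Nearest-rank 99.9th percentile, using integer arithmetic for the rank."""
    ordered = sorted(samples)
    rank = (len(ordered) * 999 + 999) // 1000  # ceil(n * 0.999)
    return ordered[rank - 1]


def one_second(tail_us, requests=10_000, tail_count=20):
    """Synthetic second: most requests take 100 µs, a 0.2% tail takes tail_us."""
    return [100] * (requests - tail_count) + [tail_us] * tail_count


seconds = [one_second(312) for _ in range(59)]
seconds.insert(31, one_second(4_200))  # one bad second in the middle of the run

per_second = [p999(s) for s in seconds]
whole_run = p999([x for s in seconds for x in s])

print(whole_run, max(per_second), per_second.index(max(per_second)))  # 312 4200 31
```

The run-level p999 reports 312 µs as if nothing happened, while the per-second view pins the 4.2 ms spike to second 31.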

valkey-lab streams one row per second with the full latency spread: every major percentile from p50 to p99.99 plus the max, the error count, and the cache hit rate. A spike that lasts one second appears as one row with a tall tail, a vast improvement over the executive summary at the end of a run. When you need it machine-readable instead, --output json gives you newline-delimited JSON you can pipe straight into something else, and --output quiet collapses the whole run to a single summary line.

Make the benchmark look like your workload

There’s an important gotcha with the saturation number, or any benchmark number. A ceiling is only as good as the load that produced it, and the default load most tools run is unrealistic.

Think about what a stock benchmark actually does. It sends all reads, or close to it, because a 100% GET run posts the biggest number (or it’s the easiest to simulate). It picks keys uniformly at random, so every key is equally cold and nothing is ever hot. It runs flat out, measuring throughput at saturation. And it normally starts against an empty cache. Now think about your production traffic. It’s a read-write mix. It has hot keys, with a small fraction of the keyspace taking most of the requests. And the cache is warm. Every one of those differences takes away from the realism of the benchmark run.

valkey-lab addresses each one of these gaps. Set the real read-write split with -r so you’re measuring the write path your cache actually carries. Turn on --distribution zipf so a small fraction of keys receives most of the traffic, like production systems often do. Uniform access patterns avoid contention and hide the behavior of your actual hot paths.
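To get a feel for how different a skewed keyspace is from a uniform one, the sketch below computes the share of requests that land on the hottest 1 percent of keys. The Zipf exponent of 1.0 is an assumption for illustration; the exponent valkey-lab uses is not stated here.

```python
def top_share_zipf(num_keys, top_fraction, exponent=1.0):
    """Fraction of requests that hit the most popular keys under a Zipf law."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_keys + 1)]
    top_n = int(num_keys * top_fraction)
    return sum(weights[:top_n]) / sum(weights)


num_keys = 1_000_000
zipf_share = top_share_zipf(num_keys, top_fraction=0.01)
uniform_share = 0.01  # under uniform access, 1% of keys get 1% of requests
print(f"top 1% of keys: zipf={zipf_share:.0%}, uniform={uniform_share:.0%}")
```

Under this assumed exponent, roughly two thirds of all requests hit the hottest 1 percent of a million keys, which is the kind of contention a uniform benchmark never exercises.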

Pin the load with --rate-limit to track latency at the rate you plan to run. And warm the cache with --prefill, or model a read-through cache that fills on miss with --backfill, so a GET benchmark measures hits the way production would.

valkey-lab --prefill -r 100:0 --distribution zipf -c 16 -P 32

Stack those and the ceiling you measure is a ceiling that meaningfully tracks production. There’s more depth when you need it (warmup tuning, RESP3, pinning workers to cores with --cpu-list, TLS, full TOML configs), but the move that matters is making the four big assumptions match your reality before you trust the number.

Getting the important data from a run

Now that we have realistic benchmark data, we have to make sure it’s useful after the run ends.

--parquet results.parquet saves the full dataset to disk. It stores the full metric set per snapshot: the counters, the gauges, and the latency distributions as actual nanosecond histograms. Combine this with the visualization functionality in valkey-lab, and you have a rich experience that lets you dig into every tiny detail.

valkey-lab --parquet results.parquet
valkey-lab view results.parquet

view opens an interactive dashboard against the file, with a synchronized time axis you can zoom and pan through dimensions like throughput, hit rate, error rate, and latency split out by GET, SET, and combined, all on a log scale. Scrub to the exact second p999 jumped and read every other metric in that same window.

One use case for this is regression testing. Because every run is a Parquet file with the same schema, runs are directly comparable to each other. Benchmark before a Valkey upgrade and after, and the question “did this move my tail latency” is easily answered with a diff. The viewer is one way to read these files, but using your own queries is another easy way to act on changes in performance. DuckDB, pandas, and Polars all read Parquet directly, so a few lines of SQL across a directory of runs is a regression suite for cache performance. Point DuckDB at a folder of recorded runs and let it compute total responses and errors per file:

SELECT
  filename,
  max(responses_received) AS total_responses,
  max(request_errors)     AS errors
FROM read_parquet('runs/*.parquet', filename = true)
GROUP BY filename
ORDER BY filename;

That is a before-and-after table for every benchmark you have ever saved, built from data you already recorded.

Another use case for the Parquet output is root cause analysis. A spike on the latency chart tells you when something went wrong, not why. Point view at a Rezolus capture from the server or the client and it overlays system telemetry (CPU utilization, network, scheduler behavior) aligned to the same benchmark timeline. When a p999 spike lines up exactly with a scheduler stall or a network hiccup on the axis above it, you have your answer.

Stop guessing

Back to the pop quiz. The reason it’s so hard to answer is that traditionally the tool you use to measure max RPS reports a summary and throws the important bits away. valkey-lab changes the approach. It remembers the mix, the hot keys, the per-second tail, and records your runs so you can come back to them. The headroom number that used to take an afternoon of manual ramping is now a single command, and it comes with the failure mode attached so you know what to do about it.

valkey-lab is built on top of cachecannon, inheriting its workload generation, saturation search, telemetry collection, and analysis capabilities. It needs Linux for io_uring (kernel 6.0+) and builds with Rust, under your choice of Apache-2.0 or MIT. Here is the whole getting-started path:

cargo install --path . --bin valkey-lab
valkey-lab saturate --slo-p999 1ms

Run that against a Valkey server and see what number comes back. Stop asking “how fast can my cache go” and start asking “how fast can it go before my production workload breaks?” That’s the number you capacity-plan around if you want predictable systems at 3 AM.

Why Snap Was Willing to Fork, and Why They Still Came Back

Allen Helton — Thu, 21 May 2026 19:36:05 GMT

I have no intention of ever forking a database. The amount of bravery and engineering mastery that goes into it scares me to no end. But Snap did. They committed to it so hard that they acquired the company building it, open sourced the entire commercial codebase, and ran 100% of their caching infrastructure on it for years. KeyDB powered Snapchat at a scale most companies can only dream of.

And then they migrated to Valkey anyway.

At Unlocked San Jose, Ovais Khan, Principal Software Engineer at Snap, walked through that migration. As interesting as it was to hear how they did it, it was all the more interesting to hear why. Why it happened, why it wasn’t worth staying on the fork, and why when they came back, they came back to Valkey.

The case for forking in 2019

KeyDB started in 2019 as a project by John Sully and Ben Schermel at EQ Alpha Technology. The premise was simple. Redis ran a single-threaded event loop. Modern servers had 32, 64, 96 cores. To get peak throughput out of a single machine, you had to run a cluster of Redis nodes on it. That was wasteful, and Salvatore Sanfilippo, the creator of Redis, was on record arguing against changing it: “I/O threading is not going to happen in Redis AFAIK, because after much consideration I think it’s a lot of complexity without a good reason.” Simplicity of the codebase was a value he was actively protecting.

KeyDB took the other side of that bet. It added real multithreading, with per-thread event loops and lock-based synchronization on shared state. It also added active-active replication and FLASH storage for cost-efficient large datasets. On the same hardware, it could move several times the operations per second that Redis could.

This is the textbook case for forking. The upstream project had made a deliberate architectural choice. That choice was the right one for them and the wrong one for a certain kind of user (Snap) who needed to push a single node harder. A fork was the only way forward.

By 2021, Snap was running KeyDB across enough of their caching infrastructure to want a permanent stake in it. They acquired the team in May 2022 and brought the formerly commercial KeyDB Pro features into the open source codebase under BSD-3. For about two years after that, all of Snap was running on KeyDB.

What forking buys you

The benefits of forking are easy to articulate when you ship. Snap got features that were important for their specific operating model:

Multithreaded command execution, which let them get more out of every node
Zone-aware read routing, which kept cross-AZ traffic down and cut data transfer costs considerably
Forkless background saves, which made snapshots predictable at high memory
Same-zone replica behavior that reduced timeout blast radius during upgrades

These features weren’t going to make it into Redis on Snap’s timeline. The fork gave them room to build them as soon as they were ready.

As far as forking goes, that’s usually the part written up in blog posts and talked about on the conference circuit. You wanted a feature, the upstream said no, you built it yourself, and now it works. Forking feels like freedom.

What forking costs you

Every change to upstream Redis after the fork point became a decision. Does it get ported over? Rewritten? Skipped? There’s a long tail at the end of whatever decision was made. Porting means you carry merge conflicts forever. Rewriting means you have two implementations of the same idea drifting apart. Skipping means your fork stops being a superset of upstream and starts being something else.

Ovais addressed this specifically in his talk. Snap could not easily move from KeyDB’s Redis 6.2 base to Redis 7.2. The cost of staying current with upstream had become high enough that they were stuck on a flavor of 6.2 while everyone else moved on. That meant they were also stuck without features the broader community had built on top of 7.2.

The same goes for the ecosystem. Every client library, operator, monitoring tool, and benchmark gets tested against upstream first. Your fork either matches upstream behavior closely enough that those tools just work, or it doesn’t, and you start maintaining your own.

While forking might have started off feeling like an accelerator, it quickly became a drag.

The Redis license change

In March 2024, Redis Ltd. changed the Redis license from BSD-3 to a dual SSPL and RSALv2 model. Neither license is OSI-approved. For any company offering Redis as a managed service, this was an immediate problem. AWS, Google Cloud, Oracle, and Ericsson responded by forking the last BSD release, Redis 7.2.4, and donating it to the Linux Foundation. Eight days after the license change, Valkey existed.

Up until then, the case for staying on KeyDB was obvious. The KeyDB team was inside Snap. The codebase was theirs. The performance was what they needed.

But Valkey made them pause. The project had open governance under the Linux Foundation, with a Technical Steering Committee across multiple companies and no single controlling vendor. It was BSD-licensed and would stay that way. Its roadmap included the things Snap had previously forked to get: I/O threading, dual-channel replication, and a path toward features Snap wanted. And every major cloud provider was committing serious engineering effort to it.

The KeyDB story also got more complicated from the inside. In January 2025, John Sully, KeyDB’s original creator, left Snap. His parting note on the KeyDB repository said it plainly:

“When we made KeyDB we wanted to prove that caches should have great performance and I think we succeeded. Now there are many options, including Valkey which is fully open source and based on my testing has matched KeyDB’s performance. I’m not sure what Snap will do with the project, but I think that development effort should move to Valkey moving forward as they have clear momentum and are the most up to date.”

When the person who started the fork tells you the fork is done, the fork is done.

The secret migration back

Snap runs caching at a scale where you can’t just swap a binary. The migration had to be invisible to application teams, comparable in cost, and safe across radically different workload types. Ovais walked through the major decisions that made their migration as easy as possible.

Abstraction layers are key to managing workloads at scale

Snap had built a storage abstraction with a RESP proxy in front of every cluster. Applications never talked to KeyDB directly. They talked to the proxy, which spoke Redis wire protocol back to whatever was running behind it. That layer of indirection made this migration possible. Without it, every application team at Snap would have needed to know about the change. With it, nobody had to.

These layers let them migrate around 30 caches per week. By the time Ovais gave this talk, 70 to 80 percent of workloads were on Valkey.

Do a gap analysis before changing any code

Snap did a feature-by-feature comparison between KeyDB and Valkey before touching anything in production. KeyDB’s multithreading and Valkey’s I/O threading work differently, so they benchmarked carefully to confirm comparable throughput.

Some KeyDB features were blockers and had to be ported to Valkey. Zone awareness was the first one Snap contributed. Replica MOVED behavior during upgrades was another. CPU throttling at high utilization was a third.

A hidden gap that wasn’t found until much later was with MGET. KeyDB supported it across slots, but Valkey does not. So after moving to Valkey, Snap had issues with command parsing pressure in large batching workloads. They quickly ported cross-slot MGET to their internal build, and are working with the core maintainers to get it added upstream.

Pick a stable version for a base, not a new one

Snap started on Valkey 8.2 RC, ported the features they needed, and immediately ran into crashes at 9 to 10k QPS. The root cause was new TLS offloading work. They rolled back to 8.0.2, ported the necessary fixes onto that, and benchmarked from there. New releases need a baking period, and a migration is the wrong time to find out.

Categorize and prioritize your workloads

Snap divided their caches into three categories: CPU-bound, high-memory, and high-write-rate. Each category needed different validation. CPU-bound workloads were primarily a throughput question. High-memory workloads were really about replication buffer behavior during full syncs, because if the buffer fills before a snapshot completes, you enter a sync loop that never finishes. High-write workloads required tuning replica buffer sizes and primary write throttling, because Valkey’s dual-channel replication puts buffers on replicas rather than primaries. Inside each category, they went lowest-criticality first, highest-criticality last.

Lessons from going full circle

The fork was the right call in 2019. Redis was not going to go multithreaded, and the workloads Snap was running needed it. KeyDB was a solid piece of engineering that pushed the ceiling on what a single Redis-compatible node could do.

The migration back was the right call in 2025 because the conditions that justified the fork had changed. The upstream that resisted features they needed was no longer the upstream they cared about. Valkey’s governance was open. Its roadmap included the work Snap had previously done alone. And every additional year on a Redis 6.2 build was another year of compounding distance from where the ecosystem was going.

Forks are leverage. They are also debt. Be honest with yourself about which one you are accumulating at any given moment. Snap was. They forked when forking gave them speed, and they came back when the fork started to cost more than it earned.

I don’t want you to take away from this that forking is bad. Sometimes it’s the right thing to do. The decision to fork is not permanent, and treating it like it is permanent is how you end up running a five-year-old codebase while your competitors are shipping on a roadmap you helped fund.

When the world moves, move with it.

Happy coding!

Why Large Payloads Break Caches at Scale

Allen Helton — Thu, 21 May 2026 16:52:11 GMT

If you’re running Valkey in production, you’ve probably configured a payload limit somewhere and assumed it protects the cache from oversized objects. But cache failures at scale don’t happen because of a single giant payload. They come from thousands of requests arriving in the wrong shape at the wrong time.

For example, ping latency on a system run by Apple jumped from 300ms to over a second during a burst of large payloads. Input throughput on that instance climbed from 7.5 MB/s to 75 MB/s, and output from 29 MB/s to 185 MB/s. No single item was anywhere close to the default 512MB limit. But with a surge in medium-sized items, things started to crumble because the system’s request processing path was becoming saturated.

Valkey 9.0’s copy avoidance update addressed one failure mode on the read path. Cumulative payload volume, command shape, and write-path allocation pressure are separate sources of pressure that can still degrade cluster performance.

At Unlocked, Yiwen Zhang from Apple and Ovais Khan from Snap described production systems where payload size created pressure in different parts of their infrastructure. Both were describing things they had to build their way out of. Here’s what they shared, and where the guardrail logic actually needs to live.

The event loop bottleneck

Everything in Valkey runs through a single-threaded main event loop: reads, writes, command parsing, and reply construction all compete for execution time. To understand why large payloads cause the problems they do, you have to start here.

Yiwen’s framing was that Valkey latency is a result of how long the event loop stays busy and how long it gets blocked. When large payloads enter the picture, they add pressure to the event loop and reduce the time available for other work.

On the read path, the expensive operation for a large GET is reply construction. Before Valkey 9.0, that meant a full memory copy of the value into a reply buffer on the main thread before any I/O thread could take over. A 5MB GET meant a 5MB memcpy blocking the loop, which causes every other client to wait.

Valkey 9.0’s copy avoidance helps with this for several use cases (normal client, RAW encoding, object size above the 16KB or 64KB limit). The main thread writes a reference instead of copying the value, and I/O threads handle the transfer. This solves the most visible noisy-neighbor failure, where a handful of large GETs would block small ones on the main thread.

Valkey gives operators the ability to cap payload size for writes via proto-max-bulk-len. A large GET has no corresponding limit. A 10MB value (like a list that has grown over time) can be returned and contribute to event-loop pressure regardless of what write-side limits are configured.

Event-loop pressure can build gradually from many requests or spike because of a single outlier request. Enough large GETs arriving close together can cause serialization time to dominate the event loop. No individual request needs to be anywhere near the configured limit for the system to degrade.

Large SETs have a copy problem

In theory, Valkey can skip the allocation and copy on the way in. For payloads at or above 32KB, it can reuse the query buffer directly as the stored value, but the buffer must be aligned at offset zero, and it must contain exactly that value with nothing else. In pipelined workloads the buffer almost always holds something else as well, so most large SETs end up doing a full allocation and memcpy in the parse phase.

Valkey 8.0’s I/O threading helps with this problem. The parse-phase allocation and copy now run on I/O threads instead of the main thread. Yiwen’s measurement on 256KB SETs (128 clients, pipeline depth 4) showed event-loop time per cycle dropping from ~171 µs with one I/O thread to ~93 µs with four. Throughput stayed flat around 13.3k RPS, and p99 improved roughly 10% (67ms to 60ms). The main thread got its time back even with the copy still occurring.

Shape degrades before size does

Ovais described large batch MGET commands creating regressions at Snap, caused by how the commands interacted with Valkey’s cluster architecture regardless of individual value size.

Valkey’s MGET is slot-based. Each key maps to a hash slot, and in a cluster, slots live on specific nodes. A multi-key MGET that spans multiple slots cannot be served by a single node and returns a CROSSSLOT error. Clients have to crack the batch into per-slot sub-batches and reassemble the result.
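To make the per-slot cracking concrete, here is a minimal Python sketch of what a cluster-aware client has to do before sending a multi-key read. It relies on the hash-slot rule from the cluster specification: CRC16 of the key modulo 16384, hashing only the `{...}` hash tag when a non-empty one is present. The function names `key_slot` and `split_by_slot` are illustrative and not taken from any client library.

```python
import binascii
from collections import defaultdict

SLOT_COUNT = 16384  # number of hash slots in a cluster


def key_slot(key: str) -> int:
    """Map a key to its hash slot: CRC16 (XMODEM) of the key, modulo 16384."""
    data = key.encode()
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # A non-empty hash tag means only the text between the braces is hashed.
        if end > start + 1:
            data = data[start + 1:end]
    # binascii.crc_hqx with an initial value of 0 is the CRC16-XMODEM variant.
    return binascii.crc_hqx(data, 0) % SLOT_COUNT


def split_by_slot(keys):
    """Group keys into per-slot sub-batches, preserving request order within each."""
    batches = defaultdict(list)
    for key in keys:
        batches[key_slot(key)].append(key)
    return dict(batches)


batches = split_by_slot(["foo", "bar", "{user1000}.following", "{user1000}.followers"])
for slot, keys in sorted(batches.items()):
    print(slot, keys)
```

The two `{user1000}` keys share a hash tag, so they land in the same sub-batch. Every other key becomes its own round trip unless it happens to share a slot, and the client still has to stitch the replies back into the order the caller asked for.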

KeyDB, a Redis fork backed by Snap, had a proprietary capability called Cross-Slot MGET: a single command could query all the data hosted on a given node regardless of which slots those keys belonged to. Some of Snap’s workloads relied on this for throughput. When they ported to Valkey, those workloads regressed. Snap ported Cross-Slot MGET to Valkey internally to restore parity and are working with upstream maintainers to bring it to the project.

The shape of a command affects system pressure differently than payload byte count does. How many keys it touches, how those keys are distributed across slots, and how the cluster has to coordinate the response all affect performance. At the scale Snap operates, serving hundreds of millions of commands per second, that distinction adds up quickly.

A RESP proxy in front of the cluster introduces an additional pressure point. One slow command can hold up others behind it in the pipeline. Snap mitigated head-of-line blocking by increasing connection count and limiting pipelining for critical use cases. The root issue is command shape. Large batch commands occupy proxy pipelines longer than smaller individual requests.
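A toy model shows why adding connections helps with head-of-line blocking and why it does not fully fix it. The sketch below assumes each connection is a FIFO pipeline that drains independently of the others, and it uses made-up service times in microseconds. It is not a model of Snap's proxy, only of the queueing effect.

```python
def completion_times(service_us, connections):
    """Toy model: commands are dealt round-robin onto FIFO connections, and each
    connection drains in order, independently of the others."""
    clock = [0] * connections          # time at which each connection frees up
    done = []
    for i, cost in enumerate(service_us):
        conn = i % connections
        clock[conn] += cost            # a command waits for everything queued ahead of it
        done.append(clock[conn])
    return done


# One 10 ms batch command followed by five 100 µs point reads (hypothetical costs).
workload = [10_000, 100, 100, 100, 100, 100]

print(completion_times(workload, connections=1))
print(completion_times(workload, connections=3))
```

On one connection every point read finishes after the 10 ms batch. On three connections most of them finish in 100 to 200 µs, but the read that shares a pipeline with the batch still waits behind it, which is why limiting pipelining matters for critical paths.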

Even if you understand these pressure points, they are still hard to identify in production because existing metrics don’t capture how workloads change over time.

Payload drift is invisible

Valkey provides solid system-level metrics: bytes per second, operations per second, event-loop health, SLOWLOG, large request and large reply diagnostics. But continuous visibility into payload size distribution is missing. Throughput at 100MB/s looks identical whether that’s 100 operations at 1MB each or 20 operations at 5MB each. However, they have very different costs to the event loop.

Yiwen proposed adding two payload size bucketing metrics (request_payload_bytes_bucket and reply_payload_bytes_bucket) to Valkey’s existing metrics. Bucketing requests and replies by size ranges gives operators the ability to detect when traffic shape shifts. Median payload sizes can slowly increase across a workload over weeks with no alert firing.
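As a rough illustration of what size bucketing buys you, the sketch below counts payloads per size range. The bucket boundaries and labels are hypothetical, picked for the example rather than taken from the proposed metrics.

```python
import bisect
from collections import Counter

KB, MB = 1024, 1024 * 1024
# Hypothetical upper bounds for each bucket; the last bucket is open-ended.
BOUNDS = [1 * KB, 16 * KB, 64 * KB, 256 * KB, 1 * MB, 4 * MB]
LABELS = ["<=1KB", "<=16KB", "<=64KB", "<=256KB", "<=1MB", "<=4MB", ">4MB"]


def bucket_label(size_bytes: int) -> str:
    """Return the label of the first bucket whose upper bound covers the size."""
    return LABELS[bisect.bisect_left(BOUNDS, size_bytes)]


def histogram(sizes):
    """Count payloads per size bucket."""
    return Counter(bucket_label(s) for s in sizes)


many_medium = [1 * MB] * 100   # 100 operations at 1MB each
few_large = [5 * MB] * 20      # 20 operations at 5MB each

print(sum(many_medium) == sum(few_large))  # same total bytes
print(histogram(many_medium))
print(histogram(few_large))
```

Both workloads move the same number of bytes, so a bytes-per-second graph cannot tell them apart. The histograms can: one puts every operation in the 1MB bucket, the other puts every operation above 4MB.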

Runtime guardrails respond to distributions and trends, while static limits react to absolute values at the moment of ingestion. Avoiding payload drift requires both runtime guardrails and static limits.

Put guardrails above the engine

In Snap’s scenario, applications don’t connect directly to Valkey. They connect through a storage abstraction layer that connects to Valkey using a RESP proxy. This let them switch from KeyDB to Valkey without application teams being aware of the change. This architecture has allowed them to customize the system’s behavior, including partial MGET failure handling and zone-aware routing, without modifying application code.

Snap also added runtime guardrails inside the engine itself. They ported CPU throttling into their Valkey deployment so that when CPU utilization crosses 95%, write requests are throttled to preserve capacity for administrative commands. Protection activates based on what the system is experiencing, something a byte-count threshold can’t do.

The proxy layer gave them a place to absorb large-payload problems like custom MGET handling, partial failure semantics, and connection count tuning to reduce head-of-line blocking. By moving these concerns into infrastructure, application teams no longer needed to own or implement them individually. Large-payload problems became contained to the infrastructure layer instead of every application’s codebase.

Ovais stated that strong abstractions turn risky migrations into operational exercises. The Snap migration peaked at around 30 caches per week through fully hands-off tooling. By the time of his talk, 70 to 80% of roughly 350 caches had moved, with application teams largely unaware it was happening. That’s what becomes possible when guardrails live above the engine.

Prioritize observability

The most practical step right now is adding payload size visibility to your cache observability stack, with the goal of catching payload drift. Aim to identify that the workload that had a 1KB median payload six months ago is now at 40KB, before it becomes a late-night latency incident.
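A drift check can be as small as comparing medians between two sampling windows. The sketch below assumes you already sample payload sizes somewhere in your stack. The `max_ratio` threshold and the sample data are hypothetical.

```python
from statistics import median


def drift_ratio(baseline_sizes, current_sizes):
    """Ratio of the current median payload size to the baseline median."""
    return median(current_sizes) / median(baseline_sizes)


def has_drifted(baseline_sizes, current_sizes, max_ratio=4.0):
    """Flag a workload whose median payload grew past max_ratio (a hypothetical threshold)."""
    return drift_ratio(baseline_sizes, current_sizes) > max_ratio


KB = 1024
six_months_ago = [1 * KB] * 9 + [512 * KB]   # 1KB median with one large outlier
today = [40 * KB] * 9 + [512 * KB]           # 40KB median, same outlier

print(drift_ratio(six_months_ago, today))
print(has_drifted(six_months_ago, today))
```

The median is used on purpose. The single 512KB outlier dominates the mean in both windows, so a mean-based check would understate the shift that the median reports as a 40× change.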

Valkey 8.0 and 9.0 improved caching architecture meaningfully. I/O threading on the write path and copy avoidance on the read path both reduce event-loop pressure from large payloads. These alone change the performance curve. But knowing where your workload sits on that curve still requires instrumentation, and guardrails that respond to live system state still need to be built on top.

Payload limits are still important, but they’re increasingly a first line of defense rather than the entire strategy. Modern cache systems need observability into traffic shape and guardrails that respond to changing conditions in real time.

Based on sessions by Yiwen Zhang (Apple) on large-payload guardrails and observability in Valkey, and Ovais Khan (Snap) on migrating 350+ cache clusters from KeyDB to Valkey. Watch the full replays at unlockedconf.io/san-jose-replays.

Disaggregated LLM Inference, Part 3: Why Your Networking Stack May Not Be Ready

Hien Luu — Wed, 13 May 2026 19:32:15 GMT

Parts 1 and 2 covered when to disaggregate, how requests find the right GPU, and how the KV cache moves between phases. All of that assumes a data plane that can actually carry the bytes.

Even with smart routing above and orchestration policy in the middle, you still have to move multiple gigabytes between GPUs in milliseconds.

The baseline is unambiguously bad: standard PyTorch serialization tops out below 1 GB/s, or three-plus seconds of dead air for a 3 GB KV cache. The next instinct is NCCL, but NCCL was built for training: collective patterns (AllReduce, AllGather) across a static topology known at startup. Disaggregated inference wants point-to-point transfers between prefill and decode nodes picked per-request by the router.
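The dead-air figure is plain division, and it is worth keeping as a function so you can plug in your own cache size and transport. This sketch assumes sustained bandwidth and ignores connection setup and per-message overhead. The 50 GB/s entry is simply the line rate of one 400 Gbps link, included for contrast.

```python
def transfer_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Time to move a payload at a sustained bandwidth, ignoring setup latency."""
    return size_gb / bandwidth_gb_per_s


KV_CACHE_GB = 3.0
for label, bandwidth in [("1 GB/s serialization ceiling", 1.0),
                         ("50 GB/s link at line rate", 50.0)]:
    print(f"{label}: {transfer_seconds(KV_CACHE_GB, bandwidth) * 1000:.0f} ms")
```

Since serialization tops out below 1 GB/s, the 3,000 ms result is a best case for that path. The same 3 GB at line rate on a single 400 Gbps link is about 60 ms.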

Three purpose-built alternatives have emerged, each solving a different pain point. NIXL (NVIDIA Inference Xfer Library) is the memory-abstraction layer: it exposes HBM, DRAM, NVMe, and S3 as uniform “memory sections” and negotiates the optimal transport underneath (NVLink, GPUDirect Storage, InfiniBand, RoCE) without the application caring which one is used. UCCL (Unified Collective Communication Library) handles P2P transfers without burning GPU compute on data movement: NCCL uses streaming multiprocessors (SMs) for the transfer itself, and UCCL doesn’t. It supports both NVIDIA and AMD GPUs and exposes both NCCL-style collective and explicit read/write APIs, so you get NCCL’s ergonomics without its SM tax and skip the out-of-band metadata coordination that NIXL’s read/write path requires. Mooncake’s Transfer Engine is the bandwidth specialist, consulting a hardware topology matrix to pick the most proximate NIC per transfer and route around NUMA bottlenecks. It hits 87 GB/s on 4×200 Gbps and up to 190 GB/s on 8×400 Gbps, roughly 2.4× to 4.6× faster than optimized TCP.

How do they actually stack up? The UCCL team benchmarked all four on a pair of 8-GPU AMD MI300X nodes wired with 400 Gbps NICs (50 GB/s per link). On the 256KB–1MB messages typical of KV transfers, NIXL and UCCL P2P both saturate the link; NCCL/RCCL runs 30–50% slower: the SM tax. The gap closes at 10MB+ messages, where SM overhead amortizes. The surprise: Mooncake TE couldn’t saturate even a single 50 GB/s link at 100MB messages, a tension with its impressive multi-NIC aggregate numbers that the UCCL authors flagged but couldn’t fully explain.
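To compare the figures in this section on one scale, it helps to convert NIC rates from gigabits to gigabytes per second and express each measured throughput as a fraction of aggregate line rate. The helper names below are illustrative, and the calculation is arithmetic on the numbers quoted above, not a new measurement.

```python
def line_rate_gb_per_s(nics: int, gbps_per_nic: float) -> float:
    """Aggregate line rate of several NICs, converted from gigabits to gigabytes per second."""
    return nics * gbps_per_nic / 8


def utilization(measured_gb_per_s: float, nics: int, gbps_per_nic: float) -> float:
    """Fraction of the aggregate line rate that a measured throughput represents."""
    return measured_gb_per_s / line_rate_gb_per_s(nics, gbps_per_nic)


print(line_rate_gb_per_s(1, 400))    # 50.0 GB/s for one 400 Gbps link
print(utilization(87, 4, 200))       # 0.87 of 4x200 Gbps
print(utilization(190, 8, 400))      # 0.475 of 8x400 Gbps
```

Read this way, the Mooncake aggregate numbers are 87% of line rate on the smaller configuration and a little under half on the larger one.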

Which one you pick depends on your actual pain: NIXL for heterogeneous memory tiers, UCCL for GPU-efficient P2P transfers across vendors, Mooncake TE for raw aggregate bandwidth when you can parallelize across many NICs.

Where This Is All Going

Disaggregation doesn’t eliminate the bottleneck in LLM serving. It moves it. The monolithic era of “buy a bigger GPU” is turning into a distributed-systems era where your scheduling sophistication, the policies governing your cache tiers, and the quality of your data plane matter more than which accelerator you bought. Cache-aware routing, content-hashed tiered storage, RDMA transports: none of these are ML concepts. They’re the same primitives distributed caching platforms have been refining for a decade.

The clearest production signal so far: in March, AWS and Cerebras announced a Bedrock service that runs prefill on AWS Trainium and decode on the Cerebras CS-3 (different vendor, different chip, different phase), stitched together over Elastic Fabric Adapter. The architectural pattern this series has been describing is shipping as a managed product, not a research demo. And when the chips on either side of the wire are built by different companies, the data plane connecting them stops being an implementation detail and becomes the product.

LLM serving infrastructure is becoming a globally-addressable, tiered cache network with a compute layer bolted on top. The teams that ship the best user experiences won’t be the ones with the most H100s. They’ll be the ones who treat their inference pipeline like a cache first and a model-execution engine second, with all the operational discipline that implies. At Momento we’ve watched that discipline emerge in caching systems the hard way, over years of chasing tail latency, congestion, and hit-rate math. It’s surprisingly familiar watching it happen again, one layer up.

Disaggregated Inference, Part 2: Moving the KV Cache Without Stalling the Decode

Hien Luu — Wed, 06 May 2026 18:09:57 GMT

In Part 1 we covered when disaggregation is worth the trouble and how requests find the right GPU. Once a request is routed, the next problem is moving the KV cache between prefill and decode without stalling the decode stream.

A fast transport moves bytes from prefill to decode. It doesn’t tell you when to start sending, where the cache lives between hops, or how to reuse it across requests. Those are orchestration questions, and they’re where most of the practical performance lives.

The first move is layer-wise streaming. Mooncake calls it “Layer-wise Prefill”: as soon as prefill finishes computing the KV cache for layer 0, that layer’s blocks start streaming to the decode node while the prefill GPU is still computing layer N. The transfer hides behind compute that would otherwise stall, dropping visible transfer latency significantly on long prompts.
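A small timing model makes the overlap visible. The sketch below assumes a fixed compute time and a fixed transfer time per layer, and a single link that sends one layer at a time. The layer count and millisecond values are hypothetical.

```python
def serial_handoff(layers: int, compute_ms: int, transfer_ms: int) -> int:
    """Prefill every layer first, then ship the whole KV cache."""
    return layers * compute_ms + layers * transfer_ms


def layerwise_handoff(layers: int, compute_ms: int, transfer_ms: int) -> int:
    """Ship each layer's KV blocks as soon as that layer finishes computing."""
    compute_done = 0   # when the current layer's KV blocks are ready
    link_free = 0      # when the link finishes the previous layer's transfer
    for _ in range(layers):
        compute_done += compute_ms
        # A transfer starts once its layer is computed and the link is idle.
        link_free = max(link_free, compute_done) + transfer_ms
    return link_free


# Hypothetical numbers: 32 layers, 10 ms of prefill and 4 ms of transfer per layer.
print(serial_handoff(32, 10, 4))      # 448
print(layerwise_handoff(32, 10, 4))   # 324
```

When transfer is cheaper than compute, only the last layer's transfer remains visible (4 ms here instead of 128 ms). When transfer is the slower side, the link becomes the bottleneck and streaming hides the compute instead.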

The second move is treating the KV cache as a tiered store, with hot blocks in HBM, warm blocks in CPU DRAM, and cold blocks on networked SSDs. Mooncake Store keys blocks by content hash, evicts LRU, and replicates hot blocks across nodes so a popular system prompt doesn’t bottleneck on one location. DeepSeek’s Fire-Flyer File System (3FS) takes it further, using NVMe SSDs over RDMA to break the dependency on node-local DRAM. A 180-node 3FS cluster delivers 6.6 TiB/s aggregate read throughput and 40 GiB/s per client for KV lookups, fast enough that “the cache lives on disk” becomes viable rather than a fallback.
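The tiering idea can be sketched in a few dozen lines. This is a toy model and not Mooncake Store's implementation: blocks are keyed by a SHA-256 of their token IDs, each tier evicts its least recently used block into the next colder tier, and a hit promotes the block back to the hottest tier. Replication across nodes is left out, and the class name and capacities are hypothetical.

```python
import hashlib
from collections import OrderedDict


class TieredKVStore:
    """Toy tiered block store: content-hashed keys, LRU eviction that demotes one tier down."""

    def __init__(self, capacities):
        # One ordered map per tier, hottest first (for example HBM, DRAM, SSD).
        self.capacities = capacities
        self.tiers = [OrderedDict() for _ in capacities]

    @staticmethod
    def block_key(token_ids) -> str:
        """Identical token blocks hash to the same key, so they are stored once."""
        return hashlib.sha256(repr(tuple(token_ids)).encode()).hexdigest()

    def put(self, token_ids, kv_block, tier=0):
        key = self.block_key(token_ids)
        for t in self.tiers:
            t.pop(key, None)  # keep a single copy across tiers
        self._insert(key, kv_block, tier)
        return key

    def _insert(self, key, kv_block, tier):
        if tier >= len(self.tiers):
            return  # fell off the coldest tier: the block is dropped
        self.tiers[tier][key] = kv_block
        if len(self.tiers[tier]) > self.capacities[tier]:
            old_key, old_block = self.tiers[tier].popitem(last=False)  # least recently used
            self._insert(old_key, old_block, tier + 1)

    def get(self, token_ids):
        key = self.block_key(token_ids)
        for t in self.tiers:
            if key in t:
                block = t.pop(key)
                self._insert(key, block, 0)  # promote on hit
                return block
        return None  # miss: the caller has to prefill these tokens again


store = TieredKVStore(capacities=[2, 2, 4])   # tiny hypothetical capacities
for i in range(5):
    store.put([i], f"kv-{i}")
print([len(t) for t in store.tiers])          # [2, 2, 1]
```

After five inserts the oldest block has been pushed down to the coldest tier but is still retrievable, which is the whole point: eviction from HBM becomes a demotion and not a recompute.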

The third move is choosing the handoff itself. DistServe pulls: decode fetches on demand, using prefill memory as a queuing buffer. Mooncake’s Conductor pushes: prefill streams each layer to decode and frees its memory. 3FS-backed designs use shared storage. Push minimizes decode-side latency at the cost of memory pressure on the prefill pool. Pull keeps memory pressure where the work is queued. Shared storage decouples both, at the cost of a network hop.

Perplexity’s KV Messenger shows what this looks like in production. Built on RDMA via libfabric, it polls a counter that’s incremented after the output projection of each layer. Because that projection reduces across tensor-parallel ranks, the counter implicitly synchronizes them, letting the system track per-layer completion without breaking CUDA graphs. The moment the counter ticks, RDMA writes start flying. The decoder doesn’t even need an explicit completion signal: it counts incoming RDMA operations against the expected total. Serving DeepSeek-R1, mixed prefill-decode struggled to exceed 50 TPS due to prefill interruptions; after disaggregating, a single prefiller kept three decoders saturated at 90+ TPS for a 100ms TTFT cost.

The payoff is reuse. Most tokens in a typical prompt have been processed before, by someone: a system prompt shared across all users of an app, earlier turns of an active conversation, the boilerplate of a few-shot template. When the KV store is content-hashed and tiered across HBM, DRAM, and SSD, those tokens don’t have to be prefilled again. The prefill work that would have been redone on every turn is instead amortized across every request that shares a prefix.
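One common way to make prefixes addressable is to hash each block of tokens together with the hash of the block before it, so a single hash stands for the whole prefix up to that point. The sketch below illustrates that idea under simplifying assumptions (a tiny block size, SHA-256, an in-memory set as the cache). It is not the exact scheme used by any system named here.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block; deliberately tiny for the example


def block_hashes(token_ids):
    """Hash each full block together with its parent hash, so a hash identifies a whole prefix."""
    hashes, parent = [], ""
    for start in range(0, len(token_ids) - BLOCK_SIZE + 1, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        parent = hashlib.sha256(f"{parent}:{block}".encode()).hexdigest()
        hashes.append(parent)
    return hashes


def tokens_to_prefill(token_ids, cached_hashes):
    """Count tokens past the longest cached prefix; only those need prefill."""
    reused = 0
    for h in block_hashes(token_ids):
        if h not in cached_hashes:
            break
        reused += BLOCK_SIZE
    return len(token_ids) - reused


system_prompt = list(range(100, 112))            # 12 shared tokens
first = system_prompt + [1, 2, 3, 4]
second = system_prompt + [9, 8, 7, 6, 5]

cache = set(block_hashes(first))                 # blocks left behind by the first request
print(tokens_to_prefill(second, cache))          # 5
```

The second request shares the 12-token system prompt with the first, so only its 5 unique tokens need prefill. With an empty cache it would pay for all 17.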

Combine that with the layer-wise streaming above, and the whole shape of the workload changes. This is how Moonshot AI runs Kimi at 100+ billion tokens per day, handling 115% more requests than non-disaggregated baselines on the same A800 hardware: orchestration turns what would be a compute problem into a cache problem.

Up next

Layer-wise streaming, tiered storage, and clever handoff semantics all assume the data plane underneath can actually deliver multiple gigabytes between GPUs in milliseconds. Picking the right transport for that, and understanding why the obvious choices fall short, is its own problem.

Next in Part 3: Why Your Networking Stack May Not Be Ready.