Inference is scaling faster than the serving architectures around it. Prefill and decode are often treated as one pipeline, but they are fundamentally different workloads.
Prefill is compute-bound: it processes the whole prompt in parallel, so its arithmetic intensity is roughly an order of magnitude higher than decode's. Decode generates one token at a time and re-reads the model weights and the KV cache at every step, so it is sensitive to memory bandwidth and to latency. Prefill wants high-FLOPS accelerators. Decode wants large, fast memory. Put both on the same GPU and you tune it for one profile while it absorbs the other, and under load prefill interferes with decode. Neither phase gets the hardware it would choose.
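To make the gap concrete, here is a back-of-the-envelope sketch of arithmetic intensity (FLOPs per byte moved) for a single square weight matrix. It is a toy model, not a profile of any real system: the function name, the 4096 hidden size, the 2048-token prompt, and the decode batch of 64 are all assumptions chosen for illustration, and the model ignores attention and KV cache reads entirely.

```python
def arithmetic_intensity(tokens: int, d_model: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for one d_model x d_model matmul applied to `tokens` rows."""
    # One multiply and one add per weight, per token.
    flops = 2 * tokens * d_model * d_model
    # The weights are read once regardless of how many tokens share them.
    weight_bytes = d_model * d_model * bytes_per_elem
    # Activations are read on the way in and written on the way out.
    activation_bytes = 2 * tokens * d_model * bytes_per_elem
    return flops / (weight_bytes + activation_bytes)


# Assumed shapes: a 2048-token prompt, and decode steps with 1 or 64 sequences.
prefill = arithmetic_intensity(tokens=2048, d_model=4096)
decode_single = arithmetic_intensity(tokens=1, d_model=4096)
decode_batched = arithmetic_intensity(tokens=64, d_model=4096)

print(f"prefill:          {prefill:8.1f} FLOPs/byte")
print(f"decode, batch 1:  {decode_single:8.1f} FLOPs/byte")
print(f"decode, batch 64: {decode_batched:8.1f} FLOPs/byte")
```

Under these assumptions the weights are amortized over thousands of tokens in prefill and over only a handful in decode, which is why batching narrows the gap but does not close it.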
The cache becomes the connection
Disaggregation separates these two phases. Prefill nodes do prefill, decode nodes do decode, each on hardware suited to its bottleneck.
Separation removes the interference, but it opens a gap. Inside one accelerator, prefill flows straight into decode and the intermediate state never leaves the chip. Pull the two onto different machines and that state, the KV cache, has to be handed across. Prefill produces it, decode consumes it, and disaggregation turns it into the object that connects them.
On a single node, the KV cache is largely an implementation detail the inference engine manages for you. But once prefill and decode are separate systems, the cache is the connection between them, and every request depends on getting it from one to the other.
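How much state is that? A rough sizing sketch helps. The formula below counts one key tensor and one value tensor per layer. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) and the 25 GB/s link are hypothetical numbers picked for illustration, not measurements of any particular model or network.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_elem: int = 2) -> int:
    """Size of the KV cache for one sequence: keys plus values across all layers."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem


def transfer_seconds(num_bytes: int, link_gbytes_per_s: float) -> float:
    """Best-case wire time, ignoring serialization, protocol overhead, and contention."""
    return num_bytes / (link_gbytes_per_s * 1e9)


# Hypothetical model shape and an 8192-token prompt.
size = kv_cache_bytes(tokens=8192, layers=32, kv_heads=8, head_dim=128)
# Assumed 25 GB/s link, roughly a 200 Gb/s network interface.
wire_time = transfer_seconds(size, link_gbytes_per_s=25.0)

print(f"cache size: {size / 2**30:.2f} GiB")
print(f"best-case transfer: {wire_time * 1000:.1f} ms")
```

Even as a best case, the handoff for a long prompt is large enough that the transfer sits on the critical path of every request.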
What the split asks of the cache
Once the cache has to travel between machines, it takes on the requirements of any object moving through a distributed system.
The cache has to move from prefill nodes to decode nodes, which makes transfer latency, serialization format, and network bandwidth first-order concerns. It has to land in the right place, so deciding which decode node receives which cache turns routing into a scheduling problem. It has to expire, which raises the question of who evicts an entry and when. The engine handled that work in a colocated system, and it now needs coordination across machines. And it has to live somewhere across GPU memory, host memory, NVMe, and remote storage, each tier with its own latency and capacity tradeoffs.
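The routing question can be sketched as a small placement function. This is one simple policy under stated assumptions, not how any particular serving stack schedules: the `DecodeNode` fields, the node names, and the "least loaded node that fits" rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DecodeNode:
    name: str
    free_kv_bytes: int     # memory still available for incoming caches
    active_requests: int   # rough proxy for current decode load


def pick_decode_node(nodes: list, cache_bytes: int) -> Optional[DecodeNode]:
    """Return the least-loaded node that can hold the cache, or None if none can."""
    candidates = [n for n in nodes if n.free_kv_bytes >= cache_bytes]
    if not candidates:
        return None  # caller must queue, spill to another tier, or reject
    # Prefer fewer active requests; break ties by more free memory.
    return min(candidates, key=lambda n: (n.active_requests, -n.free_kv_bytes))


GIB = 1 << 30
nodes = [
    DecodeNode("decode-0", free_kv_bytes=GIB // 2, active_requests=2),
    DecodeNode("decode-1", free_kv_bytes=4 * GIB, active_requests=9),
    DecodeNode("decode-2", free_kv_bytes=2 * GIB, active_requests=5),
]

chosen = pick_decode_node(nodes, cache_bytes=GIB)
print(chosen.name if chosen else "no node has room")
```

The idle node is skipped because the cache does not fit, which is the sense in which routing becomes scheduling: load alone is not enough to place a request.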
These are distributed-systems problems that present themselves whenever you separate storage and compute. What was colocated becomes independently addressable, connected by a transfer layer. The techniques for solving them (placement, routing, eviction, and tiered storage) are well understood. What is new is applying them to the KV cache inside inference serving.
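Eviction and tiering, two of the techniques named above, fit in a few dozen lines when reduced to a toy. The sketch below assumes a two-tier store in which the fast tier spills its least recently used entry to a slow tier and entries expire after a fixed time to live. The class name, the capacity counted in entries rather than bytes, and the injectable clock are all illustrative choices, not a description of any real cache manager.

```python
import time
from collections import OrderedDict


class TieredKVStore:
    """Toy two-tier store: LRU spill from a fast tier to a slow tier, plus TTL expiry."""

    def __init__(self, fast_capacity: int, ttl_seconds: float, clock=time.monotonic):
        self.fast = OrderedDict()  # key -> (blob, expires_at), least recently used first
        self.slow = {}             # stand-in for host memory, NVMe, or remote storage
        self.fast_capacity = fast_capacity
        self.ttl = ttl_seconds
        self.clock = clock

    def _admit(self, key, entry):
        # Insert as most recently used, then spill the oldest entries if over capacity.
        self.fast[key] = entry
        self.fast.move_to_end(key)
        while len(self.fast) > self.fast_capacity:
            old_key, old_entry = self.fast.popitem(last=False)
            self.slow[old_key] = old_entry

    def put(self, key, blob):
        self.fast.pop(key, None)
        self.slow.pop(key, None)
        self._admit(key, (blob, self.clock() + self.ttl))

    def get(self, key):
        entry = self.fast.pop(key, None)
        if entry is None:
            entry = self.slow.pop(key, None)
        if entry is None:
            return None
        blob, expires_at = entry
        if self.clock() >= expires_at:
            return None            # expired: the entry is already gone from both tiers
        self._admit(key, entry)    # a hit promotes the entry to the fast tier
        return blob


# A fake clock keeps the example deterministic.
now = [0.0]
store = TieredKVStore(fast_capacity=2, ttl_seconds=30, clock=lambda: now[0])
store.put("req-a", b"kv-a")
store.put("req-b", b"kv-b")
store.put("req-c", b"kv-c")  # fast tier is full, so the oldest entry spills
print(sorted(store.fast), sorted(store.slow))
```

A real system would track bytes per tier, move data asynchronously, and coordinate expiry across nodes, but the shape of the decisions is the same.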
From implementation detail to system design
The major inference stacks are already built around this split. NVIDIA Dynamo describes disaggregated inference as a prefill engine that computes the prefill phase and generates KV cache, hands that cache to a decode engine, and lets the decode engine run the decode phase. AWS is building the split into its infrastructure, with Trainium for compute-heavy prefill, and the Cerebras partnership likely points the same way, since wafer-scale SRAM suits memory-bound decode.
In each of these, the KV cache is the object the tiers pass between them. On a single node, caching was an optimization you could run, skip, or tune, and the architecture did not care. Disaggregation moves cache management up a level, from an implementation detail the engine used to hide to a system design concern the architecture has to own. The cache now has to be transferred, routed, stored, and expired across machines, and the whole system is built around getting it from prefill to decode.
What began as transient state inside a single request becomes the handoff between two systems. Once we have that handoff, cache management becomes a first-class primitive of the architecture.