< back to blog

The concurrency cliff is a memory limit

Latency looks fine on p50 until the KV cache fills. Then p99 jumps 15x in one step while your average dashboard shows nothing wrong. The cliff is memory.

Add concurrent users to an inference server and latency usually creeps up. KV cache serving does not creep. It holds flat, then falls off a cliff.

On a single L4 running a Qwen3-4B coding agent, the server holds 12 concurrent sessions at p99 under 2.6 seconds. Add two more and p99 jumps to 39 seconds, a 15× increase in a single step. The cliff is the moment the KV cache fills.

An agentic coding workload on commodity hardware

Our setup is a single g6.4xlarge EC2 instance with one NVIDIA L4 GPU (24 GB VRAM), running Qwen3-4B with FP8 weights on vLLM 0.20.2, automatic prefix caching (APC) enabled.

The workload models a lightweight coding agent mid-task. Each session opens with a 10,000-token shared system prompt (repository context plus agent instructions), followed by a 12-turn conversation where each turn appends about 1,000 tokens of unique context (tool call inputs, code snippets, responses). At turn 12, total context reaches roughly 22,800 tokens per session. Output is capped at 75 tokens per turn. The workload is heavily input-dominated, as agentic workloads tend to be.

System prompt:          10,000 tokens  (shared across sessions → APC cached)
Per-session turns:      ~12,800 tokens across 12 turns  (unique per session)
Output per turn:        75 tokens
—————————————————
Total at turn 12:       ~22,800 tokens

APC caches the system prompt once and amortizes its KV cost across all sessions (the blocks still occupy GPU memory, but only one copy exists). Within a session, APC also caches the growing turn history. Turn n+1 extends the exact prefix from turn n, so each turn only prefills the new ~1,000 tokens, as long as the prior turns’ KV blocks survive in cache. But the per-session turn history is unique (different code, different tool outputs) and cannot be shared across sessions. When concurrency pressure forces eviction of a session’s blocks, the next request in that session has to re-prefill the full accumulated history (up to ~12,800 tokens at turn 12). That re-prefill cost drives the TTFT cliff.

We swept concurrency from 1 to 48 sessions, measuring TTFT at each level, across three KV cache precisions, fp16, fp8, and TurboQuant 4-bit. We also ran best-case (all context cached) and worst-case (all context re-prefilled) bounds to bracket where realistic performance should land.

The concurrency cliff

The chart below shows TTFT percentiles (p50, p95, p99) and throughput across the full concurrency sweep with fp8 KV cache. The Y axis is logarithmic, and even on a log scale the cliff is sharp.

chart: TTFT p50/p95/p99 + throughput vs concurrent sessions, fp8 KV TTFT p50, p95, and p99, with throughput on the right axis, versus concurrent sessions. fp8 KV cache, Qwen3-4B FP8 on a g6.4xlarge (L4). The shaded band marks the 12 to 16 session collapse, and the dashed line marks the KV cache filling at 14c.

Below 12 concurrent sessions, p99 TTFT stays under 2.6 seconds and throughput climbs to a peak of 1.36 req/s. At 14 sessions, p99 jumps to 38.9 seconds in a single step, a 15× increase, and throughput drops 23 percent at the same time.

At 12 sessions, everything fits in KV cache. At 14, filling it forces eviction of another session’s blocks. That evicted session has to re-prefill its full 12,800 tokens of unique context at its next turn, and the cascade collapses latency.

The cliff sits between 12 and 14 sessions, and throughput never recovers past the peak at 12c. For a 2-second p99 SLA, the safe operating point is 6 concurrent sessions.

After the cliff, p50 lies to you

Above the cliff, p50 and p99 live in different regimes. At 20c, p50 is 2.2 seconds, which looks manageable, while p99 is 44.8 seconds, which is not. The distribution is bimodal, because APC creates two populations of requests. Lucky requests hit warm cache entries for their session context, prefill only the latest turn, and finish fast. Unlucky requests arrive after their session’s blocks were evicted, re-prefill the full 12,800 tokens, and take 30 to 50 seconds under load.

The p50 reflects the lucky cohort and the p99 reflects the unlucky one. A single TTFT average is meaningless past the cliff. You have to look at the tail to see the failure.

A few points from the sweep show the whole shape, flat through 12 sessions then the cliff at 14 and the widening p50/p99 gap past it.

SessionsTTFT p50TTFT p95TTFT p99req/s
13884874930.42
61.06s1.43s1.75s1.18
101.19s1.89s2.31s1.33
121.25s2.16s2.60s1.36
141.29s8.93s38.9s1.05
202.25s25.5s44.8s0.68
485.74s40.5s43.2s0.89

KV cache precision moves the knee

Running the same workload with 16-bit KV cache (vLLM defaults to the model’s dtype, bf16 for Qwen3, when --kv-cache-dtype is not set) halves the token capacity, and the knee shifts left in proportion.

KV dtypeBits/elementToken capacityKnee (sessions)2s p99 ceiling
bf16/fp1616~89K~8~4
fp88~178K~14~6
TurboQuant 4-bit~4.2~275K~23 (est.)pending

The fp16 to fp8 shift is confirmed, with fp16 knees at about 8 and fp8 at about 14, a 1.75× shift for a 2× capacity increase. The slight compression below 2× is expected, since KV management overhead and block table fragmentation consume some of the headroom regardless of precision.

chart: TTFT p99, fp8 KV vs fp16 KV, knees marked TTFT p99 for fp8 KV against fp16 KV, same workload and GPU. The markers sit at the observed knees, 8 sessions for fp16 and 14 for fp8. The tq4 knee near 23 is a capacity-based estimate, pending the experiment.

Quantization buys concurrency headroom directly. Halving KV precision from fp16 to fp8 nearly doubles how many concurrent sessions fit before the cliff. TurboQuant 4-bit (about 3.8× fewer bytes per element than fp16, partially offset by the lower gpu_memory_utilization it needs for autotuning scratch space) predicts a knee at about 23c, roughly 3× more concurrent sessions than fp16.

The accuracy tradeoff may be small. KV cache quantization at 4-bit typically reports low single-digit perplexity impact, though the exact effect depends on model and task. The knee shifts from about 8 sessions (fp16) to about 14 (fp8) to an estimated 23 (tq4), roughly 3× more sessions before eviction onset, from the same GPU.

Best case, worst case, realistic

To separate the latency budget that is fundamental (prefill compute) from the part that is avoidable (cache misses), we ran two controlled bounds alongside the realistic workload.

In the best case (miss_rate=0.0), every request hits the same cached content. APC holds the full 12,800-token session context, so only about 200 unique tokens need prefilling, which is perfect KV utilization.

The worst case (miss_rate=1.0) gives every request a unique prefix that breaks APC for the user context. The 10K system prompt still hits the cache, but all ~12,800 tokens of per-session turn history are re-prefilled on every request. Every miss lands at peak session depth (turn 12), forcing the maximum re-prefill cost each time.

chart: TTFT p50, best vs realistic vs worst, log scale TTFT p50 for best (fully cached), realistic (natural APC), and worst (always re-prefilled), on a log scale. The band between best and worst is the envelope any real workload lands in.

What the bounds tell us

At a single concurrent session, with zero contention, the three bounds separate cleanly.

WorkloadTTFT p50What’s happening
Best (cached)57 msOnly ~200 unique tokens prefilled; rest is cached
Realistic (APC)388 msSystem prompt cached; 12,800 unique tokens prefilled
Worst (evicted)2,500 msSystem prompt cached; ~12,800 user tokens re-prefilled at peak depth every request

On an absolute scale, realistic (388 ms) is much closer to best (57 ms) than to worst (2,500 ms). But realistic is still 7× slower than best. That gap is the cost of prefilling about 12,800 tokens of per-session unique context on each request. APC removes the system prompt cost, but the per-session turn history still has to be computed.

The gap between realistic and worst is about miss depth. In the realistic workload, cache misses happen at any turn. A session evicted at turn 3 re-prefills about 3,000 tokens, while eviction at turn 12 costs about 12,800. The worst case forces every miss to peak session depth, paying the maximum re-prefill on every request. Real traffic produces a distribution of miss depths, which is why realistic latency stays close to best.

The best-case result is the surprising one. With perfectly cached session context, the L4 handles more than 48 concurrent sessions within a 2-second p99 SLA. The 6 session realistic ceiling is the cost of per-session context uniqueness, the turn histories that cannot be shared. It is not a GPU compute limit.

The worst case grows linearly at about 2.1 seconds per additional concurrent session, reaching 102 seconds at 48c. Throughput saturates at 0.32 req/s from 6 sessions onward. The GPU is fully consumed re-prefilling 12,800 tokens per request, and extra concurrency just lengthens the queue.

Why the knee is where it is

The L4 has 24 GB of VRAM, but far less than that is available for KV cache. The memory that actually holds KV cache is roughly half of raw VRAM.

Where the memory goes

vLLM’s gpu_memory_utilization was set to 0.9 for the fp8 and fp16 experiments, reserving about 21.6 GB. After model weights, CUDA graph capture, activation tensors, and block table overhead, about 13 GB remains for KV cache. The TurboQuant experiment used 0.8 (it needs about 2 GB of extra scratch for torch.inductor autotuning at startup), leaving about 10.6 GB.

KV cache per token

Qwen3-4B uses GQA with 36 layers, 8 KV heads, and head_dim 128. The per-token KV cache size depends on precision.

2 (K+V) × 36 layers × 8 KV heads × 128 head_dim × bytes_per_element

FP16/BF16: ... × 2 bytes = 147,456 bytes/token  → ~89K tokens in ~13 GB
FP8:       ... × 1 byte  =  73,728 bytes/token  → ~178K tokens in ~13 GB
TQ4:       ~0.53 B effective (4-bit + quantization metadata)
           =  ~38,700 bytes/token  → ~275K tokens in ~10.6 GB

The capacity arithmetic (with APC)

With APC, the 10,000-token system prompt is stored once and shared. Only the per-session unique context (about 12,800 tokens at peak depth) needs its own blocks.

FP8 KV
Token budget:     ~178K
Shared prefix:     10K (1×)
Available:        ~168K
Per-session:      ~12.8K
Max sessions:   168K / 12.8K ≈ 13      Observed knee: ~14c

FP16 KV
Token budget:      ~89K
Shared prefix:     10K (1×)
Available:         ~79K
Per-session:      ~12.8K
Max sessions:    79K / 12.8K ≈ 6       Observed knee: ~8c

TQ4 KV (estimated)
Token budget:     ~275K (0.8 util)
Shared prefix:     10K (1×)
Available:        ~265K
Per-session:      ~12.8K
Max sessions:   265K / 12.8K ≈ 21      Predicted knee: ~23c

The arithmetic predicts the knees within 1 to 2 sessions of the observed values. The slight overshoot (observed 14 sessions against predicted 13) is because sessions are not all at peak depth at once. Earlier turns have smaller contexts, which buys a few extra sessions before capacity runs out.

In this setup, the concurrency cliff is a memory limit. The binding constraint is how many sessions’ KV caches fit in VRAM at once. The best-case bound supports this. With perfect caching, the same GPU handles more than 48 sessions within 2-second p99. Compute, scheduling, and continuous batching also contribute, but memory capacity sets the ceiling.

How to find the knee for your workload

The knee location depends on three variables. Available KV cache memory is total VRAM minus model weights, CUDA graphs, activations, and fragmentation, typically about half of raw VRAM, and vLLM reports the exact number at startup. Per-session unique context is the total session tokens at peak depth, minus any shared prefix cached by APC. KV precision is the bytes per element, where halving it from fp16 to fp8 to 4-bit roughly doubles token capacity at each step and shifts the knee right.

The estimate is max concurrent sessions ≈ (token capacity − shared prefix) / per-session unique context.

For this setup (Qwen3-4B, L4, 22.8K-token agentic sessions with a 10K shared prefix), the arithmetic predicts about 13 sessions (fp8) and 6 (fp16). The observed knees are about 14 and 8. The arithmetic gives a first-order estimate, and a concurrency sweep gives the precise number. The gap between estimate and observation comes from session depth staggering, block fragmentation, and APC reuse patterns.

Different workloads shift each variable. A single-turn QA workload with 2K tokens per session has a much higher knee. A code review agent with 50K-token inputs has a much lower one. A GPU with more VRAM (A100, H100) raises the budget. The method is the same. Estimate the budget, divide by per-session cost, then verify with a sweep.

What this means for deployment

Know your KV budget before you set your concurrency limit. Below the knee you get the best throughput with stable latency and effective caching. Above it you get worse throughput, worse latency, and a wasted APC investment.

KV quantization is a direct concurrency multiplier. On this L4, switching from fp16 to fp8 KV cache moves the 2-second p99 SLA ceiling from about 4 to 6 (50 percent more sessions) and the eviction knee from about 8 to 14 (75 percent more sessions). The gain is a direct consequence of halving the bytes per KV element. Quantization buys memory, and memory buys concurrency.

Monitor tail latency, not averages. After the cliff, p50 looks manageable while p99 is catastrophic. The bimodal distribution means some users get sub-second responses while others wait 40-plus seconds, and an average-based dashboard hides it until users complain.

If some of your sessions are latency-tolerant background work, run them off the interactive path. They do not need to compete for cache memory, and keeping them off it frees KV budget for the sessions that need low TTFT.

Several caveats temper the numbers. These experiments use synthetic token content, not real code. The workload has a fixed 12-turn structure, while real agent sessions vary widely in depth. Poisson arrivals do not capture bursty agentic traffic, where agents send follow-up requests immediately. p99 at high concurrency is noisy, since with about 200 requests per run it is only the second-worst request. Chunked prefill, not enabled here, could smooth the knee transition. The numbers are specific to a single L4 with Qwen3-4B. Larger models, multi-GPU setups, and different context lengths shift the absolute numbers while the pattern holds.

KV cache behaves like a systems problem, and the concurrency knee is where that meets a specific GPU, a specific model, and a specific workload shape. The math is simple. The discipline is running it before production tells you the hard way.