The Concurrency Cliff is a Memory Limit

June 08, 2026

Inference

Khawaja Shams

An agentic coding workload on commodity hardware

The setup is a single g6.4xlarge EC2 instance with one NVIDIA L4 GPU (24 GB VRAM), running Qwen3-4B with FP8 weights on vLLM 0.20.2, automatic prefix caching (APC) enabled.

The workload models a lightweight coding agent mid-task: a 10,000-token shared system prompt (repository context plus agent instructions) followed by a 12-turn conversation where each turn appends ~1,000 tokens of unique context (tool call inputs, code snippets, responses). At turn 12, total context reaches roughly 22,800 tokens per session. Output is capped at 75 tokens per turn. This is heavily input-dominated, as agentic workloads tend to be.

System prompt:          10,000 tokens  (shared across sessions → APC cached)
Per-session turns:      ~12,800 tokens across 12 turns  (unique per session)
Output per turn:        75 tokens
—————————————————
Total at turn 12:       ~22,800 tokens

APC caches the system prompt once and amortizes its KV cost across all sessions (the blocks still occupy GPU memory, but only one copy exists). Within a session, APC also caches the growing turn history: turn n+1 extends the exact prefix from turn n, so each turn only prefills the new ~1,000 tokens if the prior turns’ KV blocks survive in cache. But the per-session turn history is unique (different code, different tool outputs) and cannot be shared across sessions. When concurrency pressure forces eviction of a session’s blocks, the next request in that session must re-prefill the full accumulated history (up to ~12,800 tokens at turn 12). That re-prefill cost is what drives the TTFT cliff.
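The prefill accounting above can be sketched in a few lines. This is a simplified model, not vLLM's scheduler: it assumes exactly 1,000 tokens per turn (the post says roughly 1,000, reaching about 12,800 by turn 12), it treats eviction as all-or-nothing for a session, and the function name and parameters are hypothetical.

```python
SYSTEM_PROMPT_TOKENS = 10_000  # shared prefix, from the workload description
TOKENS_PER_TURN = 1_000        # assumption: the post says "~1,000" per turn


def prefill_tokens(turn: int, system_cached: bool, history_cached: bool) -> int:
    """Tokens that must be prefilled for the request at `turn` (1-based).

    Hypothetical helper: real caches can lose only part of a session's
    blocks, which this sketch does not model.
    """
    if turn < 1:
        raise ValueError("turn is 1-based")
    new_tokens = TOKENS_PER_TURN
    prior_history = (turn - 1) * TOKENS_PER_TURN
    if not system_cached:
        # Prefix matching starts at the beginning of the sequence, so
        # without the system prompt nothing after it can hit either.
        return SYSTEM_PROMPT_TOKENS + prior_history + new_tokens
    if not history_cached:
        # Session blocks were evicted: the accumulated history is re-prefilled.
        return prior_history + new_tokens
    # Warm session: only the newest turn is prefilled.
    return new_tokens


if __name__ == "__main__":
    print("warm:   ", prefill_tokens(12, True, True))
    print("evicted:", prefill_tokens(12, True, False))
    print("cold:   ", prefill_tokens(12, False, False))
```

Under these assumptions, an evicted session at turn 12 prefills twelve times as many tokens as a warm one, which is the gap between the two request populations described later.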

We swept concurrency from 1 to 48 concurrent sessions, measuring TTFT at each level, across three KV cache precisions: fp16, fp8, and TurboQuant 4-bit. We also ran best-case (all context cached) and worst-case (all context re-prefilled) bounds to bracket where realistic performance should land.

The concurrency cliff

The chart below shows TTFT percentiles (p50, p95, p99) and throughput across the full concurrency sweep with fp8 KV cache. The Y axis is logarithmic. The cliff is sharp.

TTFT p50 / p95 / p99 and throughput (req/s, green dashed, right axis) vs concurrent sessions. fp8 KV cache, Qwen3-4B FP8 on g6.4xlarge (L4). Shaded zone: 12c–16c collapse transition. Dashed line: KV cache fills at 14c.

Up to 12 concurrent sessions, p99 TTFT stays at or below 2.6 seconds and throughput climbs steadily to a peak of 1.36 req/s. At 14 sessions, p99 jumps to 38.9 seconds in a single step, a 15× increase, and throughput drops 23% at the same time.

At 12c, every session’s context fits in the KV cache. At 14c, admitting the 14th session forces eviction of another session’s blocks. The evicted session must then re-prefill its accumulated history (up to ~12,800 tokens at turn 12) on its next turn. That re-prefill needs cache blocks of its own, so it pushes out yet another session, and the resulting cascade collapses latency.
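A back-of-envelope capacity estimate shows why the limit is a memory limit. This is a sketch under stated assumptions: the model dimensions (36 layers, 8 KV heads, head dimension 128) are what I understand Qwen3-4B's published config to be and should be checked against the model's `config.json`; the 14 GiB KV budget is purely hypothetical, since the post does not state how much memory vLLM reserved for the cache; and block granularity, output tokens, and quantization metadata are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVShape:
    num_layers: int
    num_kv_heads: int
    head_dim: int


# Assumed Qwen3-4B dimensions; verify against the model's config.json.
ASSUMED_SHAPE = KVShape(num_layers=36, num_kv_heads=8, head_dim=128)


def kv_bytes_per_token(shape: KVShape, bits_per_element: int) -> int:
    """Bytes of KV cache per token: one K and one V vector per layer."""
    elements = 2 * shape.num_layers * shape.num_kv_heads * shape.head_dim
    return elements * bits_per_element // 8


def max_resident_sessions(budget_bytes: int, shape: KVShape,
                          bits_per_element: int,
                          shared_tokens: int = 10_000,
                          unique_tokens: int = 12_800) -> int:
    """Sessions whose full history fits next to one shared-prefix copy."""
    per_token = kv_bytes_per_token(shape, bits_per_element)
    remaining = budget_bytes - shared_tokens * per_token
    if remaining <= 0:
        return 0
    return remaining // (unique_tokens * per_token)


if __name__ == "__main__":
    budget = 14 * 2**30  # hypothetical KV budget, not a measured value
    for name, bits in [("fp16", 16), ("fp8", 8), ("4-bit", 4)]:
        n = max_resident_sessions(budget, ASSUMED_SHAPE, bits)
        print(f"{name}: {n} sessions resident at turn 12")
```

The absolute numbers depend entirely on the assumed budget and are not meant to reproduce the measurements. The proportionality is the point: halving the bytes per KV element roughly doubles the number of sessions that stay resident, which moves the cliff to a higher concurrency rather than removing it.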

The cliff sits between 12c and 14c. At or below 12c, every session fits in KV cache and p99 stays at or below 2.6s. At 14c, p99 jumps to 38.9s. Throughput peaks at 12c (1.36 req/s) and does not recover at any higher concurrency tested. For a 2-second p99 SLA, the safe operating point is 6 concurrent sessions (p99 of 1.75s, versus 2.16s at 8).

After the cliff: p50 lies to you

Above the cliff, p50 and p99 live in completely different regimes. At 20c, p50 is 2.25s (looks manageable) while p99 is 44.8s (catastrophic). The distribution is bimodal because APC creates two populations of requests:

Lucky requests hit warm cache entries for their session context. They prefill only the latest turn and complete quickly.
Unlucky requests arrive after their session’s blocks were evicted. They re-prefill the session’s accumulated history (up to ~12,800 tokens), taking 30–50 seconds under load.

The p50 reflects the lucky cohort. The p99 reflects the unlucky cohort. A single TTFT average is meaningless past the cliff. You must look at tail latency to see the failure.
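A toy mixture makes the point concrete. The samples below are synthetic, not measurements from the sweep: 90 "lucky" requests at 2.0s and 10 "unlucky" requests at 40.0s, with the split chosen only for illustration. The percentile helper uses the nearest-rank definition.

```python
import math
import statistics


def nearest_rank_percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Synthetic TTFTs in seconds (illustrative only, not from the benchmark).
lucky = [2.0] * 90     # warm cache: prefill only the newest turn
unlucky = [40.0] * 10  # evicted: re-prefill the accumulated history
ttft = lucky + unlucky

if __name__ == "__main__":
    print("mean:", statistics.fmean(ttft))
    print("p50: ", nearest_rank_percentile(ttft, 50))
    print("p99: ", nearest_rank_percentile(ttft, 99))
```

In this toy case the mean lands at 5.8s, a latency that no request actually experienced, while p50 and p99 each describe one of the two cohorts.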

Full sweep data

Concurrent sessions	TTFT p50	TTFT p95	TTFT p99	Throughput (req/s)
1	388ms	487ms	493ms	0.42
2	541ms	931ms	965ms	0.68
4	717ms	990ms	1.13s	1.01
6	1.06s	1.43s	1.75s	1.18
8	1.14s	1.73s	2.16s	1.23
10	1.19s	1.89s	2.31s	1.33
12	1.25s	2.16s	2.60s	1.36
14	1.29s	8.93s	38.9s	1.05
16	2.10s	25.4s	35.4s	0.78
18	2.27s	24.3s	50.5s	0.66
20	2.25s	25.5s	44.8s	0.68
24	2.40s	32.5s	45.4s	0.72
28	2.55s	31.5s	34.9s	0.74
32	2.58s	34.1s	49.7s	0.77
40	3.00s	39.2s	50.5s	0.84
48	5.74s	40.5s	43.2s	0.89
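The headline numbers can be rederived from the table. The sketch below hardcodes the p99 and throughput columns from the sweep; the function names are hypothetical, and "safe" here means the highest tested concurrency before the first level that violates the SLA.

```python
# (concurrent sessions, TTFT p99 in seconds, throughput in req/s),
# copied from the fp8 sweep table.
SWEEP = [
    (1, 0.493, 0.42), (2, 0.965, 0.68), (4, 1.13, 1.01), (6, 1.75, 1.18),
    (8, 2.16, 1.23), (10, 2.31, 1.33), (12, 2.60, 1.36), (14, 38.9, 1.05),
    (16, 35.4, 0.78), (18, 50.5, 0.66), (20, 44.8, 0.68), (24, 45.4, 0.72),
    (28, 34.9, 0.74), (32, 49.7, 0.77), (40, 50.5, 0.84), (48, 43.2, 0.89),
]


def safe_concurrency(sweep, p99_sla_s: float) -> int:
    """Highest tested concurrency before the first p99 SLA violation."""
    safe = 0
    for sessions, p99_s, _ in sweep:
        if p99_s > p99_sla_s:
            break
        safe = sessions
    return safe


def largest_p99_jump(sweep):
    """Adjacent pair of levels with the biggest p99 ratio: (low, high, ratio)."""
    best = None
    for (c0, p0, _), (c1, p1, _) in zip(sweep, sweep[1:]):
        ratio = p1 / p0
        if best is None or ratio > best[2]:
            best = (c0, c1, ratio)
    return best


if __name__ == "__main__":
    low, high, ratio = largest_p99_jump(SWEEP)
    print(f"cliff between {low}c and {high}c, p99 ratio {ratio:.1f}x")
    print("safe concurrency for a 2s p99 SLA:", safe_concurrency(SWEEP, 2.0))
    rps = {c: r for c, _, r in SWEEP}
    print(f"throughput drop 12c to 14c: {(rps[12] - rps[14]) / rps[12]:.0%}")
```

Running the same two functions against a sweep from different hardware or a different KV precision is a quick way to locate that configuration's cliff and operating point.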
