An agentic coding workload on commodity hardware
The setup is a single g6.4xlarge EC2 instance with one NVIDIA L4 GPU (24 GB VRAM), running Qwen3-4B with FP8 weights on vLLM 0.20.2, automatic prefix caching (APC) enabled.
The workload models a lightweight coding agent mid-task: a 10,000-token shared system prompt (repository context plus agent instructions) followed by a 12-turn conversation in which each turn appends roughly 1,000 tokens of unique context (tool call inputs, code snippets, responses), about 12,800 tokens in total across the 12 turns. At turn 12, total context reaches roughly 22,800 tokens per session. Output is capped at 75 tokens per turn. The workload is heavily input-dominated, as agentic workloads tend to be.
System prompt: 10,000 tokens (shared across sessions → APC cached)
Per-session turns: ~12,800 tokens across 12 turns (unique per session)
Output per turn: 75 tokens
—————————————————
Total at turn 12: ~22,800 tokens
APC caches the system prompt once and amortizes its KV cost across all sessions (the blocks still occupy GPU memory, but only one copy exists). Within a session, APC also caches the growing turn history: turn n+1 extends the exact prefix from turn n, so each turn only prefills the new ~1,000 tokens if the prior turns’ KV blocks survive in cache. But the per-session turn history is unique (different code, different tool outputs) and cannot be shared across sessions. When concurrency pressure forces eviction of a session’s blocks, the next request in that session must re-prefill the full accumulated history (up to ~12,800 tokens at turn 12). That re-prefill cost is what drives the TTFT cliff.
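The hit and miss cases described above can be put into numbers with a small sketch. It is a simplified model, not vLLM's actual accounting: it assumes the ~12,800 session tokens grow uniformly across the 12 turns, ignores block granularity, and the function name `prefill_tokens` is made up for illustration.

```python
# Simplified model of how many tokens one request must prefill under APC.
# Assumption: the ~12,800 unique session tokens accumulate uniformly over 12 turns.
SYSTEM_PROMPT_TOKENS = 10_000
SESSION_TOKENS_AT_LAST_TURN = 12_800
TURNS = 12


def session_tokens_through(turn: int) -> int:
    """Unique session tokens accumulated after `turn` turns (integer model)."""
    return SESSION_TOKENS_AT_LAST_TURN * turn // TURNS


def prefill_tokens(turn: int, session_cached: bool, system_cached: bool = True) -> int:
    """Tokens to prefill for the request at `turn` (1-based).

    session_cached: the KV blocks for turns 1..turn-1 are still in cache.
    system_cached:  the shared system prompt blocks are still in cache.
    """
    if not 1 <= turn <= TURNS:
        raise ValueError("turn must be between 1 and 12")
    # Session blocks extend the system prompt prefix, so losing the
    # prefix means the session history cannot be reused either.
    if not system_cached:
        session_cached = False

    total = session_tokens_through(turn) - session_tokens_through(turn - 1)
    if not session_cached:
        total += session_tokens_through(turn - 1)
    if not system_cached:
        total += SYSTEM_PROMPT_TOKENS
    return total


if __name__ == "__main__":
    print(prefill_tokens(12, session_cached=True))    # 1067: only the new turn
    print(prefill_tokens(12, session_cached=False))   # 12800: full session history
    print(prefill_tokens(12, session_cached=False, system_cached=False))  # 22800
```

Under these assumptions, an evicted session at turn 12 prefills about 12 times as many tokens as a warm one, which is the gap that shows up in the tail latency.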
We swept concurrency from 1 to 48 concurrent sessions, measuring TTFT at each level, across three KV cache precisions: fp16, fp8, and turboquant 4-bit. We also ran best-case (all context cached) and worst-case (all context re-prefilled) bounds to bracket where realistic performance should land.
The concurrency cliff
The chart below shows TTFT percentiles (p50, p95, p99) and throughput across the full concurrency sweep with fp8 KV cache. The Y axis is logarithmic. The cliff is sharp.
Up to 12 concurrent sessions, p99 TTFT stays at or below 2.6 seconds and throughput climbs steadily to a peak of 1.36 req/s. At 14 sessions, p99 jumps to 38.9 seconds in a single step, a 15× increase. Throughput drops 23% at the same step, from 1.36 to 1.05 req/s.
At 12c, everything fits in KV cache. At 14c, it no longer does: admitting one more session forces eviction of another session’s blocks. The evicted session must re-prefill its unique context (up to ~12,800 tokens) at its next turn. Those re-prefilled blocks need cache space of their own, so they push out yet another session’s blocks, and the evictions cascade until latency collapses.
The cliff is between 12c and 14c. Up to 12c, every session fits in KV cache and p99 stays at or below 2.6s. At 14c, p99 explodes to 39s. Throughput peaks at 12c (1.36 req/s) and never recovers. For a 2-second p99 SLA, the safe operating point is 6 concurrent sessions (p99 of 1.75s); at 8 sessions, p99 is already 2.16s.
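The eviction argument above amounts to a capacity calculation, sketched below. The KV cache capacity in tokens is not reported here, so `cache_capacity_tokens` is a hypothetical input. The model also assumes every session sits at its full turn-12 context at once and ignores block granularity, so treat the numbers as rough bounds rather than measurements.

```python
# Back-of-envelope KV cache capacity model for the workload described above.
SHARED_TOKENS = 10_000        # system prompt, stored once for all sessions
PER_SESSION_TOKENS = 12_800   # unique context per session at turn 12


def max_resident_sessions(cache_capacity_tokens: int) -> int:
    """Sessions whose full context fits alongside one copy of the system prompt."""
    if cache_capacity_tokens < SHARED_TOKENS:
        return 0
    return (cache_capacity_tokens - SHARED_TOKENS) // PER_SESSION_TOKENS


def implied_capacity_range(last_good: int, first_bad: int) -> tuple[int, int]:
    """Token capacity range consistent with a cliff between two concurrency levels.

    Lower bound: enough room for `last_good` full sessions.
    Upper bound: not enough room for `first_bad` full sessions.
    """
    low = SHARED_TOKENS + last_good * PER_SESSION_TOKENS
    high = SHARED_TOKENS + first_bad * PER_SESSION_TOKENS
    return low, high


if __name__ == "__main__":
    # A cliff between 12c and 14c suggests a capacity in roughly this range.
    print(implied_capacity_range(12, 14))   # (163600, 189200)
    # Hypothetical capacity of 170,000 tokens: 12 sessions fit, a 13th does not.
    print(max_resident_sessions(170_000))   # 12
```

The system prompt is counted once because only one copy of its blocks exists, so each additional session costs only its unique ~12,800 tokens.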
After the cliff: p50 lies to you
Above the cliff, p50 and p99 live in completely different regimes. At 20c, p50 is 2.25s (looks manageable) while p99 is 44.8s (catastrophic). This bimodal distribution arises because APC creates two populations of requests:
- Lucky requests hit warm cache entries for their session context. They prefill only the latest turn and complete quickly.
- Unlucky requests arrive after their session’s blocks were evicted. They re-prefill the full session history (up to ~12,800 tokens), taking 30–50 seconds under load.
The p50 reflects the lucky cohort. The p99 reflects the unlucky cohort. A single TTFT average is meaningless past the cliff. You must look at tail latency to see the failure.
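A toy example makes the point concrete. The latencies below are invented for illustration, not taken from the sweep: 90 warm requests at 2.0s and 10 evicted requests at 40.0s. The `percentile` helper uses the nearest-rank method, which is an assumption; the benchmark's own percentile method is not stated.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical bimodal TTFT sample, in seconds.
warm = [2.0] * 90       # cache hit: only the newest turn is prefilled
evicted = [40.0] * 10   # cache miss: the full session history is re-prefilled
ttft = warm + evicted

mean = sum(ttft) / len(ttft)

if __name__ == "__main__":
    print(percentile(ttft, 50))   # 2.0  -> looks healthy
    print(percentile(ttft, 99))   # 40.0 -> the real user-facing failure
    print(mean)                   # 5.8, a latency no request actually experienced
```

The mean lands between the two modes and describes neither population, which is why the tail percentiles are the ones to watch past the cliff.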
Full sweep data
| Concurrency | TTFT p50 | TTFT p95 | TTFT p99 | Throughput (req/s) |
|---|---|---|---|---|
| 1 | 388ms | 487ms | 493ms | 0.42 |
| 2 | 541ms | 931ms | 965ms | 0.68 |
| 4 | 717ms | 990ms | 1.13s | 1.01 |
| 6 | 1.06s | 1.43s | 1.75s | 1.18 |
| 8 | 1.14s | 1.73s | 2.16s | 1.23 |
| 10 | 1.19s | 1.89s | 2.31s | 1.33 |
| 12 | 1.25s | 2.16s | 2.60s | 1.36 |
| 14 | 1.29s | 8.93s | 38.9s | 1.05 |
| 16 | 2.10s | 25.4s | 35.4s | 0.78 |
| 18 | 2.27s | 24.3s | 50.5s | 0.66 |
| 20 | 2.25s | 25.5s | 44.8s | 0.68 |
| 24 | 2.40s | 32.5s | 45.4s | 0.72 |
| 28 | 2.55s | 31.5s | 34.9s | 0.74 |
| 32 | 2.58s | 34.1s | 49.7s | 0.77 |
| 40 | 3.00s | 39.2s | 50.5s | 0.84 |
| 48 | 5.74s | 40.5s | 43.2s | 0.89 |
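The operating points quoted earlier can be re-derived from the table with a short script. The p99 and throughput columns are copied from the sweep above, with the sub-second entries read as milliseconds and converted to seconds. The helper names are illustrative only.

```python
# (concurrency, p99 TTFT in seconds, throughput in req/s), copied from the table.
SWEEP = [
    (1, 0.493, 0.42), (2, 0.965, 0.68), (4, 1.13, 1.01), (6, 1.75, 1.18),
    (8, 2.16, 1.23), (10, 2.31, 1.33), (12, 2.60, 1.36), (14, 38.9, 1.05),
    (16, 35.4, 0.78), (18, 50.5, 0.66), (20, 44.8, 0.68), (24, 45.4, 0.72),
    (28, 34.9, 0.74), (32, 49.7, 0.77), (40, 50.5, 0.84), (48, 43.2, 0.89),
]


def max_concurrency_within_sla(sla_seconds: float):
    """Highest measured concurrency before p99 first exceeds the SLA (None if none)."""
    best = None
    for concurrency, p99, _ in SWEEP:
        if p99 > sla_seconds:
            break  # stop at the first violation; higher levels are not safe
        best = concurrency
    return best


def steepest_step():
    """Adjacent pair of levels with the largest p99 ratio: (lower, upper, ratio)."""
    lower, upper = max(zip(SWEEP, SWEEP[1:]), key=lambda pair: pair[1][1] / pair[0][1])
    return lower[0], upper[0], upper[1] / lower[1]


def peak_throughput():
    """Measured level with the highest throughput: (concurrency, req/s)."""
    concurrency, _, rps = max(SWEEP, key=lambda row: row[2])
    return concurrency, rps


if __name__ == "__main__":
    print(max_concurrency_within_sla(2.0))  # 6
    print(steepest_step()[:2])              # (12, 14)
    print(peak_throughput())                # (12, 1.36)
```

Because the sweep only samples certain concurrency levels, these helpers return the nearest measured level rather than an interpolated value; the true limit for a given SLA may sit between two sampled points.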
Khawaja Shams

