Your KV Cache Benchmark Is “hi hi hi”

Khawaja Shams headshot Khawaja Shams

The default benchmark document

LMCache ships with a long-document QA benchmark (long_doc_qa.py) for measuring KV cache offloading performance. When run without a corpus file, it generates documents like this:

🌐
filename.html
warmup_prompts = [
    str(i) + " " + " ".join(["hi"] * args.document_length)
    for i in range(args.num_documents)
]

A “10,000-token document” is literally:

🌐
filename.html
0 hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi ...

The token count is accurate: the Qwen3-8B tokenizer encodes "hi" as a single token, so 10,000 repetitions produce 10,001 tokens. The problem is what the tokens represent. The entire document contains only 2 unique tokens. This matters for KV cache benchmarks in three ways.

Compression and activation patterns

KV cache connectors often compress tensors before sending them to remote storage. Repetitive inputs like hi hi hi produce more compressible data at every level. The raw text is obviously repetitive, but the KV tensors are also significantly more compressible. Transformers do not produce identical activations for repeated tokens — positional encoding, attention mixing, and residual connections ensure each position’s KV tensors are distinct — but repetitive inputs produce more structured activation patterns than diverse text, which lowers tensor entropy and inflates compression ratios. Benchmarks using only synthetic prompts overestimate how much compression helps in a network-bound pipeline.

Cache access patterns

Real workloads involve documents with different prefixes, shared terminology, and overlapping but not identical content. The hi hi hi pattern creates a workload that shares nothing meaningful between documents except the same two tokens.

Benchmarking KV cache offloading with hi hi hi is like benchmarking a database with SELECT 1. The numbers come back fast, but they do not tell you how the system performs under realistic load.

Synthetic vs. realistic: measured on Qwen3-8B-FP8

We compared the default hi hi hi document against a medical corpus document, both at 10,000 tokens on Qwen3-8B-FP8 with real trained weights.

MetricDefault (hi hi hi…)Medical Corpus
Tokens10,00110,000
Unique tokens21,329
Token diversity0.02%13.3%
Characters30,00138,202
Chars/token3.003.82

The token count is the same. Everything else is different. Token diversity affects activation patterns, which affects tensor value distributions, which affects compression ratios and transfer sizes. Benchmarks that only test the synthetic case are measuring a different workload than the one that will show up in production.

Compression behavior with real model outputs

When we first started working with LMCache, we followed the tutorial setup and used --load-format dummy to skip real weight loading. In our Valkey connector, we were using Zstandard as the compression layer. With dummy weights, KV tensors compressed extremely well. The numbers were encouraging.

When we switched to Qwen3-8B-FP8 with real trained weights, naive Zstandard compression provided only modest gains. Real model weights produce tensor values that behave more like high-entropy floating-point data. Where dummy weights might produce low-variance activations that compress easily, trained weights produce the kind of value distributions that general-purpose compressors struggle with. (LMCache addresses this with CacheGen, a learned compressor specifically designed for KV tensor distributions. Our experiments used a general-purpose compressor, which is a harder baseline.)

Separately, input text diversity also affected compression. As discussed above, repetitive inputs produce more structured activation patterns that compress better even though the activations are not identical. Realistic documents with varied vocabulary, numeric values, and formatting produced weaker compression ratios.

The net effect: benchmarks using dummy weights or highly repetitive prompts can both overestimate how much compression reduces bytes on the wire. For a system where KV cache offloading is network-bound, that difference matters.

Making float tensors more compressible

Since naive Zstandard on raw FP16 tensor data was not giving us useful compression with real weights in our Valkey connector, we introduced bit shuffling as a pre-compression step.

Each FP16 value is 16 bits: 1 sign bit, 5 exponent bits, and 10 mantissa bits. When stored as raw bytes, the bit fields from many FP16 values are interleaved in memory, which often appears high-entropy to byte-oriented compressors. Bit shuffling rearranges the data so that corresponding bit positions from consecutive values are grouped together: all the sign bits in one block, all the exponent bits together, then the mantissa bits. Grouping similar bit positions across values exposes low-entropy regions that byte-oriented compressors like Zstandard can exploit effectively.

We used AVX2 intrinsics to keep the shuffle and unshuffle operations fast enough that CPU time did not become the new bottleneck. The goal was to reduce bytes on the wire without moving the constraint from network to CPU.

This is a practical tradeoff, not a silver bullet. Bitshuffle + Zstandard reduced transfer sizes enough to be useful on real model outputs, but the gains are more modest than what you see with dummy weights or synthetic inputs. That gap is exactly why we needed benchmarks that use realistic workloads.

There are also purpose-built approaches that work at the quantization level rather than treating tensors as opaque bytes. TurboQuant (Google Research, ICLR 2026) applies a random orthogonal rotation to each KV vector, which reshapes the coordinate distribution into a known Beta distribution (approximately Gaussian in high dimensions). Because the post-rotation distribution is predictable regardless of the input data, an optimal scalar quantizer can be precomputed once and applied to every model without calibration. The practical operating point is ~3-3.5 bits per coordinate on average (asymmetric: 3-4 bit keys + 2-bit values, reflecting the large norm gap between K and V), yielding roughly 6x KV cache compression with minimal quality loss. Other approaches like KIVI (ICML 2024) target similar bit widths with different tradeoffs. Both TurboQuant and KIVI are now integrated into vLLM and SGLang. These quantization-aware methods are complementary to transfer compression: they reduce the tensor size before it ever reaches the wire.

Why we built this corpus

While optimizing KV cache offloading performance with Valkey as the remote backend, we ran into the benchmarking problem described above. The KV cache benchmarks we were using (long_doc_qa.py and multi_doc_qa.py) generate documents from synthetic hi hi hi sequences by default. LMCache does include a RAG benchmark that accepts real datasets and a multi-round QA benchmark that supports ShareGPT conversations, but the long-document benchmarks used for measuring offloading performance do not.

We needed a reproducible corpus of realistic documents to understand what performance actually looks like under representative conditions. We observed that compression ratios, tensor value distributions, and cache access patterns all changed materially when we moved from synthetic inputs to realistic text, and again when we moved from dummy weights to real trained weights.

Reproducibility with open weights

We use Qwen/Qwen3-8B-FP8 for both document generation and benchmarking. Using an open model means anyone can reproduce the corpus exactly: same tokenizer, same generation behavior, same KV cache tensor shapes. No proprietary API dependency.

The FP8-quantized variant fits on a single NVIDIA L4 (24 GB VRAM). The vLLM server exposes an OpenAI-compatible API at localhost:8000/v1. Generation takes roughly 7 minutes per 10K-token document.

🌐
filename.html
# Install vLLM (0.15.0 was used for corpus generation)
pip install vllm

# Start the model server (single GPU, ~21 GB VRAM)
vllm serve Qwen/Qwen3-8B-FP8 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9

# Verify it's running
curl http://localhost:8000/health

For the corpus included in the repo, we ran on an AWS g6.8xlarge instance (NVIDIA L4, 24 GB VRAM). Any GPU with 24+ GB VRAM should work.

Preview the documents

The corpus spans two domains: 15 medical documents and 15 legal documents, totaling 300,000 tokens. All documents are 10,000 tokens each, generated by Qwen3-8B-FP8 with noise injection enabled.

Medical documents average 14.2% token diversity (1,424 unique tokens per 10K). Legal documents average 9.4% (lower, reflecting the repetitive nature of formal legal language). Both are roughly 700x more diverse than the hi hi hi baseline at 0.02%.

Why this matters for the rest of the series

Posts 01 and 02 in this series discuss the systems argument for KV caching and how prefix caching works in vLLM and SGLang. Both assume the reader understands that benchmark methodology matters: that the shape and diversity of input documents affect prefill cost, cache behavior, and transfer performance in ways that synthetic inputs do not capture.

This post fills in that assumption. The prefix caching experiments in Post 01 used realistic documents for exactly this reason. When the previous posts refer to compression ratios or cache access patterns, they are talking about the behavior you get with a diverse corpus, not the behavior you get with hi hi hi.

The corpus, generation scripts, and benchmark tooling are all available in the kvbenchdocs repository.

On this page