< back to blog

Your KV cache benchmark is “hi hi hi”

Default KV cache benchmarks run on repetitive "hi hi hi" text, which makes compression and transfer look far better than they do on real workloads.

Before you commit to a KV cache offloading system, you benchmark it to make sure it performs well below your SLA. You see that it has excellent compression and cheap transfers. Seems like an easy win.

But there’s some trouble with what standard KV cache benchmarks run on.

LMCache ships with a long-document benchmark for measuring KV cache offloading performance. Run it without a corpus file and it generates documents like this:

warmup_prompts = [
  str(i) + " " + " ".join(["hi"] * args.document_length)
  for i in range(args.num_documents)
]

A 10,000-token document comes out looking a little underwhelming.

0 hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi …

Technically speaking, it is the 10K token count you were looking for, but it doesn’t represent a real 10K token workload.

KV cache systems do not run on token count alone. Compression ratios, activation patterns, transfer sizes, and cache behavior all depend on the shape of the input. Two documents of the same length can be two entirely different workloads.

Unfortunately, much of the current KV cache ecosystem is benchmarked on synthetic inputs that look nothing like the workloads people run.

The benchmark is not representative

The default benchmark document contains a numeric identifier and the token “hi” repeated thousands of times.

Transformers do not produce identical activations for repeated tokens. Positional encoding, attention mixing, and residual connections keep every position distinct. To an LLM, distinct and varied are not the same thing. Repeating a single token produces far more regular activation patterns than diverse text does.

Compression improves, transfer sizes shrink, and cache behavior becomes easier to predict with non-varied workloads. The benchmark is measuring something, but it is not measuring a realistic production workload.

Benchmarking KV cache offloading with “hi hi hi” is like benchmarking a database with SELECT 1. The numbers come back fast, but they do not tell you much about real workloads.

The difference shows up immediately

We compared the default benchmark document against a realistic medical document using Qwen3-8B-FP8. Both ran about 10,000 tokens. The synthetic one carried two unique tokens, a token diversity of 0.02 percent. The medical one carried 1,329, or 13.3 percent. The token count is the same. Everything else is different.

Token diversity affects activation patterns. Activation patterns affect tensor value distributions. Tensor distributions affect compression ratios and transfer sizes. A benchmark built from highly repetitive inputs can diverge sharply from one built on realistic text.

Compression behaves differently too

We first ran into this while building a Valkey-backed KV cache connector. We followed the LMCache tutorials and used dummy weights, and compression ratios looked excellent. Then we switched to Qwen3-8B-FP8 with real trained weights, and the results were night and day.

Real model weights produce tensor values that behave more like high-entropy floating-point data. General-purpose compression still helps, but the gains are smaller than they appear with dummy weights. Repetitive inputs create more structured activation patterns that compress more effectively, making the benchmark results appear better than they actually are.

So we built a more realistic corpus

To see how KV cache systems behave under representative inputs, we built a corpus of 30 long-form documents across medical and legal domains. We wanted documents that resemble the structure, formatting, vocabulary, and variability that real systems process.

Example medical and legal document text

Medical documents averaged 14.2 percent token diversity. Legal documents averaged 9.4 percent. Both are hundreds of times more diverse than the synthetic baseline at 0.02 percent.

The corpus holds 300,000 tokens and was generated with Qwen3-8B-FP8. The corpus and generation scripts are open source for those interested.

Benchmarking the workload you actually have

While token count is easy to generate, it can also be the least informative. Match the diversity, structure, and vocabulary of the text your system serves, and the compression ratios and transfer sizes start to represent values you can trust.

The input comes first. Before ranking cache connectors, compression schemes, or storage backends, run them on inputs that look like your traffic. Compare them on “hi hi hi” and you are ranking them on a workload nobody runs.

Before you evaluate your next KV cache offloading system, be sure to ask “Are the benchmark documents representative of the workload I actually run?”