
Large Objects Ruin the Party – Valkey 9 Tames Them

Khawaja Shams
Brian Martin

We had a Valkey cluster humming along at 100K requests/second serving 1KB objects. Latency was tight. Then someone started fetching a few 10MB blobs. Ten requests per second. The small object workload fell apart.

10MB items are common in media use cases, such as Momento acting as a live origin caching video segments. Tail latencies really matter here: a p99 spike means buffering, and buffering means users leaving. This is especially problematic in multi-tenant systems, where one workload’s large items can ruin the experience for everyone else.

The Problem


Our baseline: 100K req/s total of 1KB GETs distributed across 256 connections, each pipelining 32 requests. Then we introduced 10 req/s of 10MB GETs as background traffic. Just 10 requests per second of large objects.

Here’s what happened to the 1KB request latency on Valkey 8.1:

  1KB Latency         p50      p90      p99      p99.9     p99.99    Max
  8.1 baseline        295μs    352μs    416μs    489μs     578μs     2.8ms
  8.1 + 10MB noise    289μs    350μs    500μs    26.2ms    30.1ms    37.2ms

p50, p90, and p99 barely moved. But tail latencies exploded. p99.9 went from 489μs to 26.2ms. That’s 53x worse. A handful of large object fetches were destroying the experience for everyone else.

But wait. Valkey has I/O threads. We saw throughput scale nearly linearly with I/O thread count. Shouldn’t they handle the network traffic without blocking the main thread?

The Hypothesis


We knew Valkey 9.0 shipped with reply copy avoidance. The idea: instead of memcpy’ing large objects into reply buffers on the main thread, just pass a pointer reference and let the I/O threads handle the actual data transfer.

If the main thread was blocking on 10MB memcpy operations, that would explain why small requests were getting stuck. Remove the copy, remove the block, problem solved. That was the theory.

How Copy Avoidance Works


Prior to 9.0, returning a large string meant the main thread memcpy’d the entire object into a reply buffer before moving on. Two copies per GET:

BEFORE (Valkey 8.1)

  ┌─────────────────┐
  │   Object Store  │
  │  ┌───────────┐  │
  │  │ obj->ptr  │  │     10MB string data
  │  │ ██████████│──┼─────────────────────────────────┐
  │  └───────────┘  │                                 │
  └─────────────────┘                                 │
                                                      │ COPY #1 (memcpy)
                                                      │ main thread
                                                      ▼
                                          ┌───────────────────┐
                                          │   Reply Buffer    │
                                          │ ██████████████████│  10MB copy
                                          └───────────────────┘
                                                      │
                                                      │ COPY #2 (write)
                                                      │ I/O thread
                                                      ▼
                                          ┌───────────────────┐
                                          │      Socket       │
                                          └───────────────────┘
                                                      │
                                                      ▼
                                                   Client

  Total memory bandwidth: 20MB per GET
 

Valkey 9.0 flips the script. Instead of copying 10MB, the main thread writes a 16-byte reference and moves on:

AFTER (Valkey 9.0)

  ┌─────────────────┐
  │   Object Store  │
  │  ┌───────────┐  │
  │  │ obj->ptr  │  │     10MB string data
  │  │ ██████████│──┼──────────────────────────────────────────┐
  │  └───────────┘  │                                          │
  └─────────────────┘                                          │
         │                                                     │
         │ write 16-byte reference                              │
         │ (no data copy)                                      │
         ▼                                                     │
  ┌─────────────────┐                                          │
  │  Reply Buffer   │                                          │
  │ ┌─────────────┐ │                                          │
  │ │ bulkStrRef  │ │  just 16 bytes:                          │
  │ │ {obj, str}──┼─┼──────────────────────────────────────────┤
  │ └─────────────┘ │                                          │
  └─────────────────┘                                          │
                                                               │
                        I/O thread builds iovec ───────────────┤
                        pointing directly to obj->ptr          │
                                                               │
                                          ┌───────────────────┐│
                                          │      Socket       ││
                                          │    writev(iov) ◄──┼┘
                                          └───────────────────┘
                                                      │
                                                      ▼
                                                   Client

  Main-thread copy bandwidth: ~0 (just pointer/reference management)
 

The I/O thread builds an iovec pointing directly to obj->ptr and calls writev(). The heavy lifting happens in parallel with command execution.
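The shape of that write path can be sketched in a few lines of C. The RESP bulk-string framing (`$<len>\r\n...\r\n`) is the real wire format; the function and struct layout below are our simplification of what the I/O thread does, not Valkey’s actual internals.

```c
/* Sketch of the 9.0-style write path: no memcpy of the payload into a
 * reply buffer. Build an iovec whose middle entry points straight at
 * the stored object's bytes and hand everything to writev(). */
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

/* Write one RESP bulk string to fd without copying the payload. */
ssize_t write_bulk_noncopy(int fd, const char *data, size_t len) {
    char hdr[32];
    int hdr_len = snprintf(hdr, sizeof hdr, "$%zu\r\n", len);

    struct iovec iov[3] = {
        { .iov_base = hdr,          .iov_len = (size_t)hdr_len }, /* "$<len>\r\n" */
        { .iov_base = (void *)data, .iov_len = len },             /* payload, zero-copy */
        { .iov_base = "\r\n",       .iov_len = 2 },               /* trailer */
    };
    return writev(fd, iov, 3);  /* kernel gathers all three pieces */
}
```

Only the tiny header is built per reply; the 10MB payload is read once by the kernel during the gather write, instead of being copied into a reply buffer first and read again. (A production path also has to handle partial writes and keep the object alive until the write completes, which is what the 16-byte reference in the reply buffer is for.)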

Back to the Party


Would copy avoidance fix the noisy neighbor problem? We ran the mixed workload test on both versions.

In 9.0, the large object traffic has essentially no impact on the small object workload. p99 stays at 8ms. The main thread isn’t blocking on memcpy, so small requests keep flowing even while large ones are in flight.

The party crashers got kicked out. Well, not really. They were ushered to the dance floor where they now play nicely with everyone else.

The Secret Menu


You don’t need to tune anything to get these gains. The optimization is controlled by three configs that aren’t in the default config file. The secret menu, if you will. The defaults are sane:

  Config                                      Default   Effect
  min-io-threads-avoid-copy-reply             7         With 7+ I/O threads, always use copy avoidance
  min-string-size-avoid-copy-reply            16KB      Size threshold in single-threaded mode
  min-string-size-avoid-copy-reply-threaded   64KB      Size threshold with I/O threads enabled

The defaults work for most use cases, but now you know where to look if you want to tune for your specific workload.

This is one of many community-driven optimizations in Valkey. Individually, they’re incremental. Together, they compound. I’m excited about upcoming changes like PR #2976, which offloads read commands to worker threads entirely, removing the main thread from the read path.

If you’re hitting performance problems with your cache in real-world scenarios, I’d love to hear about them. It helps me get better at characterizing these workloads, and I enjoy sharing reproducible results with everyone.

Special thanks to Madelyn Olson for technical guidance, Valkey knowledge, and inspiring this blog 💚
