
Why Snap Was Willing to Fork, and Why They Still Came Back

Allen Helton

I have no intention of ever forking a database. The amount of bravery and engineering mastery that goes into it scares me to no end. But Snap did. They committed to it so hard that they acquired the company building it, open sourced the entire commercial codebase, and ran 100% of their caching infrastructure on it for years. KeyDB powered Snapchat at a scale most companies can only dream of.

And then they migrated to Valkey anyway.

At Unlocked San Jose, Ovais Khan, Principal Software Engineer at Snap, walked through that migration. As interesting as it was to hear how they did it, it was all the more interesting to hear why. Why it happened, why it wasn’t worth staying on the fork, and why when they came back, they came back to Valkey. 

The case for forking in 2019

KeyDB started in 2019 as a project by John Sully and Ben Schermel at EQ Alpha Technology. The premise was simple. Redis ran a single-threaded event loop. Modern servers had 32, 64, 96 cores. To get peak throughput out of a single machine, you had to run a cluster of Redis nodes on it. That was wasteful, and Salvatore Sanfilippo, the creator of Redis, was on record arguing against changing it: “I/O threading is not going to happen in Redis AFAIK, because after much consideration I think it’s a lot of complexity without a good reason.” Simplicity of the codebase was a value he was actively protecting. 

KeyDB took the other side of that bet. It added real multithreading, with per-thread event loops and lock-based synchronization on shared state. It also added active-active replication and FLASH storage for cost-efficient large datasets. On the same hardware, it could move several times the operations per second that Redis could.

This is the textbook case for forking. The upstream project had made a deliberate architectural choice. That choice was the right one for them and the wrong one for a certain kind of user (Snap) who needed to push a single node harder. A fork was the only way forward.

By 2021, Snap was running KeyDB across enough of their caching infrastructure to want a permanent stake in it. They acquired the team in May 2022 and brought the formerly commercial KeyDB Pro features into the open source codebase under BSD-3. For about two years after that, all of Snap was running on KeyDB.

What forking buys you

The benefits of forking are easy to articulate when you ship. Snap got features that were important for their specific operating model:

  • Multithreaded command execution, which let them get more out of every node
  • Zone-aware read routing, which kept cross-AZ traffic down and cut data transfer costs considerably
  • Forkless background saves, which made snapshots predictable at high memory
  • Same-zone replica behavior that reduced timeout blast radius during upgrades
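
To make the zone-aware routing idea concrete, here is a minimal sketch of the selection rule: read from a replica in your own zone when one exists, and only cross zones when you have to. The `Replica` type, the `pick_read_replica` name, and the zone labels are all hypothetical, made up for illustration. This is not KeyDB's or Snap's implementation.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    host: str  # hypothetical hostname
    zone: str  # availability zone label, e.g. "zone-a" (illustrative)


def pick_read_replica(replicas, client_zone, rng=random):
    """Prefer a replica in the caller's zone; fall back to any replica."""
    if not replicas:
        raise ValueError("no replicas available")
    same_zone = [r for r in replicas if r.zone == client_zone]
    # Only cross zones (and pay for the transfer) when no local replica exists.
    return rng.choice(same_zone or list(replicas))
```

With replicas in `zone-a` and `zone-b`, a client in `zone-a` always reads locally, and a client in a zone with no replica still gets an answer from somewhere else.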

These features weren’t going to make it into Redis on Snap’s timeline. The fork gave them room to build them as soon as they were ready.

As far as forking goes, that’s usually the part that gets written up in blog posts and talked about on the conference circuit. You wanted a feature, the upstream said no, you built it yourself, and now it works. Forking feels like freedom.

What forking costs you

Every change to upstream Redis after the fork point became a decision. Does it get ported over? Rewritten? Skipped? Whichever you pick, it comes with a long tail. Porting means you carry merge conflicts forever. Rewriting means you have two implementations of the same idea drifting apart. Skipping means your fork stops being a superset of upstream and starts being something else.

Ovais addressed this specifically in his talk. Snap could not easily move from KeyDB’s Redis 6.2 base to Redis 7.2. The cost of staying current with upstream had become high enough that they were stuck on a flavor of 6.2 while everyone else moved on. That meant they were also stuck without features the broader community had built on top of 7.2.

The same goes for the ecosystem. Every client library, operator, monitoring tool, and benchmark gets tested against upstream first. Your fork either matches upstream behavior closely enough that those tools just work, or it doesn’t, and you start maintaining your own.

While forking might have started off feeling like an accelerator, it quickly became a drag.

The Redis license change

In March 2024, Redis Ltd. changed the Redis license from BSD-3 to a dual SSPL and RSALv2 model. Neither license is OSI-approved. For any company offering Redis as a managed service, this was an immediate problem. AWS, Google Cloud, Oracle, and Ericsson responded by forking the last BSD release, Redis 7.2.4, and donating it to the Linux Foundation. Eight days after the license change, Valkey existed.

Up until then, the case for staying on KeyDB was obvious. The KeyDB team was inside Snap. The codebase was theirs. The performance was what they needed. 

But Valkey made them pause. The project had open governance under the Linux Foundation, with a Technical Steering Committee across multiple companies and no single controlling vendor. It was BSD-licensed and would stay that way. Its roadmap included the things Snap had previously forked to get: I/O threading, dual-channel replication, and a path toward features Snap wanted. And every major cloud provider was committing serious engineering effort to it.

The KeyDB story also got more complicated from the inside. In January 2025, John Sully, KeyDB’s original creator, left Snap. His parting note on the KeyDB repository said it plainly:

“When we made KeyDB we wanted to prove that caches should have great performance and I think we succeeded. Now there are many options, including Valkey which is fully open source and based on my testing has matched KeyDB’s performance. I’m not sure what Snap will do with the project, but I think that development effort should move to Valkey moving forward as they have clear momentum and are the most up to date.”

When the person who started the fork tells you the fork is done, the fork is done.

The secret migration back

Snap runs caching at a scale where you can’t just swap a binary. The migration had to be invisible to application teams, comparable in cost, and safe across radically different workload types. Ovais walked through the major decisions that made their migration as easy as possible.

Abstraction layers are key to managing workloads at scale

Snap had built a storage abstraction with a RESP proxy in front of every cluster. Applications never talked to KeyDB directly. They talked to the proxy, which spoke Redis wire protocol back to whatever was running behind it. That layer of indirection made this migration possible. Without it, every application team at Snap would have needed to know about the change. With it, nobody had to.
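
The value of that indirection is easier to see in code. Below is a toy sketch of the idea, not a RESP proxy and not Snap's design: applications ask for a cache by logical name, and the mapping from name to backend is the only thing that changes during a migration. `CacheRouter`, `Backend`, `cut_over`, and the addresses are hypothetical names I made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    engine: str   # "keydb" or "valkey"; just a label in this sketch
    address: str  # hypothetical host:port


class CacheRouter:
    """Maps logical cache names to whatever backend currently serves them."""

    def __init__(self):
        self._routes = {}

    def register(self, cache_name, backend):
        self._routes[cache_name] = backend

    def resolve(self, cache_name):
        return self._routes[cache_name]

    def cut_over(self, cache_name, new_backend):
        """Repoint one cache; callers keep using the same logical name."""
        old = self._routes[cache_name]
        self._routes[cache_name] = new_backend
        return old  # returned so the caller can roll back if needed


router = CacheRouter()
router.register("sessions", Backend("keydb", "keydb-sessions:6379"))
previous = router.cut_over("sessions", Backend("valkey", "valkey-sessions:6379"))
```

The application code that calls `resolve("sessions")` is identical before and after the cutover, which is the property that let the migration stay invisible.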

These layers let them migrate around 30 caches per week. By the time Ovais gave this talk, 70 to 80 percent of workloads were on Valkey.

Do a gap analysis before changing any code

Snap did a feature-by-feature comparison between KeyDB and Valkey before touching anything in production. KeyDB’s multithreading and Valkey’s I/O threading work differently, so they benchmarked carefully to confirm comparable throughput. 

Some KeyDB features were blockers and had to be ported to Valkey. Zone awareness was the first one Snap contributed. Replica MOVED behavior during upgrades was another. CPU throttling at high utilization was a third. 

One gap stayed hidden until much later: MGET. KeyDB supported it across slots, but Valkey does not. In cluster mode, Valkey requires every key in a multi-key command to hash to the same slot. So after moving to Valkey, Snap had issues with command parsing pressure in large batching workloads. They quickly ported cross-slot MGET to their internal build, and are working with the core maintainers to get it added upstream.
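
For a sense of what the slot restriction means in practice, here is a small sketch of the usual client-side workaround: compute each key's hash slot and split one large batch into per-slot groups, each of which is safe to send as its own MGET. The slot function follows the cluster specification (CRC16/XMODEM of the key, modulo 16384, with `{hash tag}` handling). The helper names are my own, and this is not Snap's server-side port.

```python
from collections import defaultdict


def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and initial value 0, as used for cluster slots."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: str) -> int:
    """Hash slot for a key, honoring a non-empty {hash tag} if one is present."""
    raw = key.encode()
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        # An empty tag ("{}") is ignored and the whole key is hashed.
        if end != -1 and end != start + 1:
            raw = raw[start + 1:end]
    return crc16_xmodem(raw) % 16384


def group_keys_by_slot(keys):
    """Split a batch of keys into per-slot groups, one MGET per group."""
    groups = defaultdict(list)
    for key in keys:
        groups[key_slot(key)].append(key)
    return dict(groups)
```

A batch of a few hundred unrelated keys can fan out into nearly as many single-slot commands, which gives a feel for where the extra parsing pressure comes from.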

Pick a stable version for a base, not a new one

Snap started on Valkey 8.2 RC, ported the features they needed, and immediately ran into crashes at 9k to 10k QPS. The root cause was new TLS offloading work. They rolled back to 8.0.2, ported the necessary fixes onto that, and benchmarked from there. New releases need a baking period, and a migration is the wrong time to find out.

Categorize and prioritize your workloads

Snap divided their caches into three categories: CPU-bound, high-memory, and high-write-rate. Each category needed different validation. CPU-bound workloads were primarily a throughput question. High-memory workloads were really about replication buffer behavior during full syncs, because if the buffer fills before a snapshot completes, you enter a sync loop that never finishes. High-write workloads required tuning replica buffer sizes and primary write throttling, because Valkey’s dual-channel replication puts buffers on replicas rather than primaries. Inside each category, they went lowest-criticality first, highest-criticality last.
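
The sync-loop condition for high-memory workloads comes down to simple arithmetic, so here is a back-of-the-envelope sketch. It assumes a constant write rate and a fixed snapshot duration, both simplifications, and the function names and numbers are hypothetical rather than anything from Snap's tooling.

```python
def bytes_buffered_during_sync(write_bytes_per_sec, snapshot_seconds):
    """Writes that pile up in the replication buffer while the snapshot runs."""
    return write_bytes_per_sec * snapshot_seconds


def full_sync_can_finish(write_bytes_per_sec, snapshot_seconds, buffer_limit_bytes):
    """False means the buffer overflows first and the sync restarts from scratch."""
    buffered = bytes_buffered_during_sync(write_bytes_per_sec, snapshot_seconds)
    return buffered <= buffer_limit_bytes


MB = 1024 * 1024

# Illustrative numbers only: 20 MB/s of writes and a 120-second snapshot.
needed = bytes_buffered_during_sync(20 * MB, 120)       # 2400 MB of backlog
fits_in_1gb = full_sync_can_finish(20 * MB, 120, 1024 * MB)
fits_in_4gb = full_sync_can_finish(20 * MB, 120, 4096 * MB)
```

In this made-up case a 1 GB buffer overflows before the snapshot completes, so the sync would loop, while a 4 GB buffer leaves headroom. The same inequality is what you would check when sizing replica buffers for the high-write category.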

Lessons from going full circle

The fork was the right call in 2019. Redis was not going to go multithreaded, and the workloads Snap was running needed it. KeyDB was a solid piece of engineering that pushed the ceiling on what a single Redis-compatible node could do.

The migration back was the right call in 2025 because the conditions that justified the fork had changed. The upstream that resisted features they needed was no longer the upstream they cared about. Valkey’s governance was open. Its roadmap included the work Snap had previously done alone. And every additional year on a Redis 6.2 build was another year of compounding distance from where the ecosystem was going.

Forks are leverage. They are also debt. Be honest with yourself about which one you are accumulating at any given moment. Snap was. They forked when forking gave them speed, and they came back when the fork started to cost more than it earned.

I don’t want you to take away from this that forking is bad. Sometimes it’s the right thing to do. But the decision to fork is not permanent, and treating it as if it were is how you end up running a five-year-old codebase while your competitors are shipping on a roadmap you helped fund.

When the world moves, move with it.

Happy coding!
