Cache hit rate (CHR) is a popular metric that captures the percentage of requests that were found in the cache. Unfortunately, over-indexing on CHR risks overlooking some critical implications.
Real-life workloads often have highly skewed data access, allowing production caches to routinely achieve hit rates over 90%, sometimes exceeding 99.9%. The recent DynamoDB paper, for instance, cites a 99.75% cache hit rate for its metadata cache. The paper also covers a profound implication of a high cache hit rate: the bimodal behavior that emerges if the cache goes cold. Even the distributed systems experts on the DynamoDB team have had firsthand experience with the consequences of the cache hit rate dropping. They are not alone. Many teams end up overlooking the downside of seductively high cache hit rates.
A simple reframing can help engineers internalize the core implications: track the cache miss rate (CMR) instead. CMR spikes correlate directly with load spikes on the backing database, making CMR a powerful metric for keeping teams honest about the implications of increased load.
Consider the simple example of a CMR of 1%. Doubling that to 2% sounds alarming: it is a 100% increase in the requests that now need to hit your backend. The same scenario framed with the CHR metric shows a drop from 99% to 98%, and a 1% drop sounds trivial. Doubling the load on your database is anything but.
On its own, CHR requires arithmetic to figure out the load:
Load multiplier = (1 - CURRENT_CHR) / (1 - NOMINAL_CHR)
This is not hard arithmetic, but it is complicated enough to trick our brains into overlooking the profound implications of a CMR spike when it is perceived as a CHR drop. Paying attention to CMR helps you quickly contextualize the impact on your broader system of seemingly minor issues during maintenance windows or scaling events.
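To make that arithmetic concrete, here is a minimal sketch (the helper name is ours) that converts a CHR reading into a backend load multiplier:

```python
def load_multiplier(current_chr: float, nominal_chr: float) -> float:
    """How many times more requests reach the backend when the hit
    rate sits at current_chr instead of the nominal nominal_chr."""
    return (1 - current_chr) / (1 - nominal_chr)

# A "trivial" one-point CHR drop, from 99% to 98%:
print(load_multiplier(current_chr=0.98, nominal_chr=0.99))  # ~2x backend load
```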
Take a look at your operational dashboard. Almost all the alarms are on spikes (latencies, errors, etc). While there are some good use cases for alarming on the dips (e.g. request rates), we are trained to look for spikes! If your cache is cold, cache miss rates will spike.
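A spike-oriented alarm on CMR fits that training directly. A hypothetical sketch (the function name and threshold are illustrative, not a recommendation):

```python
def cmr_spike_alarm(observed_cmr: float, nominal_cmr: float,
                    spike_factor: float = 2.0) -> bool:
    """Fire when the miss rate exceeds spike_factor times its baseline."""
    return observed_cmr > spike_factor * nominal_cmr

# A cold node pushes a 1% nominal CMR to 3%: the alarm fires.
print(cmr_spike_alarm(observed_cmr=0.03, nominal_cmr=0.01))  # True
```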
What causes CMR spikes?
Restarts can be caused by events like deployments and failed nodes. In both cases, a new node is introduced into the ecosystem without any data (a cold cache). On AWS ElastiCache, deployments are limited to maintenance windows selected by users, while failed nodes can occur at any time as EC2 instances fail. Unfortunately, even with replication and a multi-AZ setup, node failures and maintenance windows impact your CMR. Furthermore, the cache may take meaningfully longer to warm up than the maintenance window lasts, and in that scenario the impact on CMR lasts meaningfully longer as well.
Rehashing typically occurs when you scale your cache in or out. Introducing or removing a node changes the cache topology, causing keys to be redistributed. In the early phase of a rehash, cache clients may disagree on where a particular key ought to live: if client A `set` a key at node X, but client B reads it from node Y, client B gets a cache miss. Even after the early phase, some keys may not yet have been re-set at their newly designated location, so clients will continue getting misses from the new location until some client issues a `set` on the key there. Such a miss also often triggers `set` commands from multiple clients that each observe the item missing, which increases load on the backend while eroding cache performance through redundant `set` commands.
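The disagreement window can be sketched with naive modulo placement (real clients typically use consistent hashing, which remaps far fewer keys, but still more than zero); the names below are illustrative:

```python
import hashlib

def node_for(key: str, num_nodes: int) -> int:
    # Deterministic key-to-node placement shared by all clients.
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return digest % num_nodes

# During a scale-out from 3 to 4 nodes, a client still on the old
# topology and one on the new topology can route the same key to
# different nodes -- the read on the new node comes back as a miss.
moved = sum(node_for(f"key:{i}", 3) != node_for(f"key:{i}", 4)
            for i in range(10_000))
print(f"{moved / 10_000:.0%} of keys changed nodes")  # roughly 75% with modulo
```

Consistent hashing shrinks that remapped fraction to roughly 1/num_nodes, which is exactly why cache clients use it, but the disagreement window during the topology change remains.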
How can CMR spikes be minimized?
Replication creates multiple copies of the data. This increases the potential read throughput on a given key, but it also creates multiple locations where a key can be found. For example, if there are 3 replicas in a system, with requests evenly distributed among them, a single node failure would only cause about 33% of the requests to land on a cold node and return a miss.
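The arithmetic generalizes: with N evenly loaded replicas, a single cold node captures only 1/N of reads, diluting the CMR spike accordingly. A quick sketch (the function name is ours, and it assumes the cold node misses everything until warm):

```python
def cmr_after_node_loss(nominal_cmr: float, num_replicas: int) -> float:
    """Observed CMR right after one of num_replicas nodes comes back
    cold, assuming even read distribution and a 100% miss rate on it."""
    cold_share = 1 / num_replicas
    return cold_share * 1.0 + (1 - cold_share) * nominal_cmr

# Three replicas, 1% nominal CMR: the spike peaks around 34%, not 100%.
print(f"{cmr_after_node_loss(0.01, 3):.0%}")
```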
Cache warming allows a new node to observe a few `set` commands before becoming available on the read path. Facebook’s mcrouter, for instance, warms new nodes as they come online. Pinterest uses a similar pattern on their self-managed caching nodes on EC2. Cache warming is simple: instead of abruptly terminating a node and throwing in a cold replacement, you replicate a portion of the traffic to the new node for a limited period of time until it holds a meaningful portion of the keys. The technique is elegant because popular keys often have the biggest impact on customers, and the more popular the key, the more likely it is to get populated in the new node. Cache warming does not eliminate CMR spikes, but in most cases it meaningfully dilutes them.
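A toy sketch of the idea (class and parameter names are ours, not mcrouter’s): mirror a fraction of live writes to the warming node, and observe that hot keys are the most likely to be populated before cutover:

```python
import random

class Node:
    """Trivial in-memory cache node."""
    def __init__(self):
        self.store = {}
    def set(self, key, value):
        self.store[key] = value

def warm(live: Node, warming: Node, writes, mirror_fraction=0.3):
    """Apply writes to the live node; mirror a fraction to the warming node."""
    for key, value in writes:
        live.set(key, value)
        if random.random() < mirror_fraction:
            warming.set(key, value)

random.seed(7)
live, warming = Node(), Node()
# Skewed workload: "hot" is written far more often than the cold keys.
writes = [("hot", 1)] * 50 + [(f"cold:{i}", i) for i in range(50)]
random.shuffle(writes)
warm(live, warming, writes)
# The hot key almost certainly landed on the warming node before cutover,
# while many rarely-written cold keys did not.
print("hot" in warming.store, len(warming.store), len(live.store))
```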
What happens when CHR and CMR impacts are trivialized?
Customers and providers expect some misses as nominal behavior. Nevertheless, a lack of clear specification of miss scenarios blurs the ownership that cache teams ought to feel for them. Misses caused by node failures (or deployments), topology changes (scale-in/scale-out), and memory-pressure-driven eviction can be largely mitigated. Unfortunately, caching services do not invest the engineering effort required to mitigate the CMR spikes caused by deployments, scale-in, and scale-out. Instead, we commonly see providers categorically blaming customers for all cache misses: “If a customer did not give me the key, how can I be expected to return it?” Services like ElastiCache go as far as encouraging customers to disable scale-in as a best practice to reduce the impact on CMR of repetitive scale-in and scale-out:
Disable Scale-In – Auto scaling on Target Tracking is best suited for clusters with gradual increase/decrease of workloads as spikes/dip in metrics can trigger consecutive scale-out/in oscillations. In order to avoid such oscillations, you can start with scale-in disabled and later you can always manually scale-in to your need.
Minimizing scale events does have a positive impact on CMR, but it comes at the cost of a wastefully overprovisioned cluster with under 10% utilization. In the end, customers suffer a higher bill.
Momento’s take on CMR
CMR is a fundamental focus at Momento. We understand that a spike in CMR corresponds to greater load on our customers’ databases, and if we can minimize CMR spikes, we can give our customers true elasticity and continuous availability without any maintenance windows. This has been a driving insight in architecting Momento.
Momento’s proxy fleet is on a low-latency, gRPC-backed messaging bus
This allows our proxy fleet to rapidly learn about state changes in the cache topology, minimizing the period of disagreement between the nodes on where a key ought to reside.
Momento SDKs are oblivious to server-side cache topology changes
Today, modern cache clients (SDKs) have to keep track of the cache topology. By getting rid of this leaky abstraction, we are able to ship much simpler SDKs that do not even notice topology changes behind the scenes.
Momento warms nodes during deployments, scale-in, and scale-out
Not only do we abstract away topology changes, we bring in nodes after they have warmed up. This eliminates maintenance windows, enabling us to continuously deploy without impacting CMRs.
Next time you see your team putting up CHR on a chart or discussing it in a design review, consider reframing it as a conversation around CMR and see if it unveils new insights. Meanwhile, if you see a caching vendor offering “elasticity” or “autoscaling,” ask what happens to your CMRs during deployments, scale-out, and scale-in.