Building and maintaining high-performance systems is hard! Here are four practices we live by to maintain and improve performance at Momento.
Focus on tail latencies instead of median or average
When building performance-sensitive software or a service, it is always tempting to advertise your P50 or average latencies. Those numbers typically look far better than P99 or P999; we often see latencies climb steeply at the tail, hitting 10x to 100x the average at P99 or P999. Unfortunately, P50 latencies are not representative of a typical (or even median) customer experience. Untamed tail latencies in a cache can actually be worse than those of the core database, often defeating the purpose of the cache entirely. Modern applications often make multiple calls to the same backend (sometimes dozens) per user request, which meaningfully amplifies the impact of tail latencies because customers wait on the slowest call to respond.
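To make the fan-out point concrete, here is a small simulation (a hypothetical sketch with made-up latency numbers, not measurements from any real service) of a request that waits on N parallel backend calls, each of which hits its roughly-P99 latency 1% of the time:

```python
import random

def page_latency_ms(n_calls, fast_ms=1.0, slow_ms=100.0, p_slow=0.01, rng=random):
    """Latency of one page view that waits on n_calls parallel backend calls.
    Each call is fast (~P50) 99% of the time and slow (~P99) 1% of the time."""
    return max(slow_ms if rng.random() < p_slow else fast_ms
               for _ in range(n_calls))

def fraction_hitting_tail(n_calls, trials=20_000):
    """Fraction of simulated page views whose latency is set by a tail event."""
    rng = random.Random(42)  # fixed seed so the simulation is repeatable
    slow = sum(1 for _ in range(trials)
               if page_latency_ms(n_calls, rng=rng) >= 100.0)
    return slow / trials

# With 1 backend call, ~1% of page views see the tail latency;
# with 40 calls, 1 - 0.99**40 ≈ 33% of page views do.
```

This is why a service whose P99 looks like a rare outlier in isolation can dominate the experience of a third of your users once requests fan out.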
Invest in a test harness to rapidly evaluate new techniques or technologies
Having a solid test harness enables teams to quickly evaluate a broad range of performance optimizations. At Twitter, Brian Martin built rpc-perf, a general-purpose tool for benchmarking RPC services that was used extensively to evaluate Twitter's caching services. rpc-perf features high-resolution latency metrics, support for multiple protocols including Memcached and Redis, powerful test configurations, and waterfall visualizations of latency.
We have added support for Momento's gRPC protocol to rpc-perf, along with OpenTelemetry integration, enabling us to visualize benchmark results in Grafana and LightStep. We also have infrastructure-as-code to rapidly deploy a fleet of rpc-perf instances that send at-scale load toward Momento; the results are aggregated in our metrics provider, giving us instant visibility into service-level changes. We are also adding a performance qualification stage to our deployment pipeline to measure the performance impact of new changes before they hit production.
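One subtlety when aggregating results from a fleet of load generators: percentiles cannot simply be averaged across instances. The sketch below (illustrative only, not Momento's actual pipeline) merges raw samples before computing the quantile, and shows how averaging per-instance P99s can understate the fleet-wide tail:

```python
def percentile(samples, q):
    """Nearest-rank percentile of a list of latency samples, q in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, round(q / 100.0 * len(ordered)))
    return ordered[rank - 1]

# Two hypothetical load-generator instances with very different tails:
instance_a = [10] * 100             # uniform 10 ms: P99 = 10 ms
instance_b = [1] * 95 + [100] * 5   # mostly 1 ms with a 100 ms tail: P99 = 100 ms

naive_average = (percentile(instance_a, 99) + percentile(instance_b, 99)) / 2  # 55 ms
merged_p99 = percentile(instance_a + instance_b, 99)                           # 100 ms

# The averaged figure (55 ms) hides the true fleet-wide tail (100 ms).
```

In practice this is why latency histograms (rather than pre-computed percentiles) are the right thing to ship to your metrics provider.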
Have service-level objectives (SLOs) for each component in the system
SLOs are important in grounding the outcomes we are looking for in each test. We use SLOs in two ways. First, there is a minimum bar that each component must meet to qualify to be placed into the service. This applies, for example, to code changes or engine upgrades. Second, we have aspirational SLO milestones that we strive towards reaching. Once those milestones are hit, we turn them into our new minimum SLOs. This approach allows us to be balanced and pragmatic around our pace of innovation and iterative performance improvements that we want to deliver to our customers over a longer period of time.
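The two-tier scheme above can be sketched as a simple check; the class, component name, and thresholds below are illustrative, not Momento's actual configuration:

```python
from dataclasses import dataclass

@dataclass
class ComponentSlo:
    name: str
    minimum_p999_ms: float       # hard bar: a change must pass this to ship
    aspirational_p999_ms: float  # milestone we are driving toward

    def evaluate(self, measured_p999_ms: float) -> str:
        if measured_p999_ms > self.minimum_p999_ms:
            return "fail"        # blocks the change from entering the service
        if measured_p999_ms <= self.aspirational_p999_ms:
            return "milestone"   # aspirational target hit
        return "pass"

    def ratchet(self) -> None:
        """Once the aspirational milestone is met, make it the new minimum."""
        self.minimum_p999_ms = self.aspirational_p999_ms

# Hypothetical component and numbers:
cache_get = ComponentSlo("cache-get", minimum_p999_ms=10.0, aspirational_p999_ms=5.0)
assert cache_get.evaluate(12.0) == "fail"
assert cache_get.evaluate(8.0) == "pass"
assert cache_get.evaluate(4.5) == "milestone"
```

Ratcheting the minimum only after a milestone is consistently met is what keeps the bar pragmatic rather than aspirational.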
Run continuous performance testing
Test harnesses and SLOs are great for assessing the efficacy of major updates or tuning. Unfortunately, every deployment, no matter how minor, carries a meaningful risk of degrading performance. A performance or soak stage in your CI/CD pipeline ensures that regressions don't make it to production, while performance canaries go a long way toward ensuring that regressions which do reach production are detected promptly.
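A performance canary can be as simple as comparing the tail latency of instances running the new build against the stable fleet before rolling forward. A minimal sketch (names, numbers, and the 20% budget are all hypothetical):

```python
def p99(samples_ms):
    """Nearest-rank P99 of a list of latency samples in milliseconds."""
    ordered = sorted(samples_ms)
    return ordered[max(0, int(len(ordered) * 0.99) - 1)]

def canary_healthy(baseline_ms, canary_ms, max_ratio=1.2):
    """Pass only if the canary's P99 is within 20% of the baseline's P99."""
    return p99(canary_ms) <= max_ratio * p99(baseline_ms)

baseline  = [1.0] * 980 + [8.0] * 20    # stable fleet: P99 = 8 ms
healthy   = [1.0] * 980 + [9.0] * 20    # new build, P99 = 9 ms: within budget
regressed = [1.0] * 980 + [40.0] * 20   # new build, P99 = 40 ms: roll back

assert canary_healthy(baseline, healthy)
assert not canary_healthy(baseline, regressed)
```

Running a check like this continuously, rather than only around major releases, is what catches the "minor" deployment that quietly regresses your tail.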
Read about how we employed this framework in our deep dive on optimizing Pelikan for Google’s Tau T2A VMs.