4 tips for building high-performance systems

Keep these best practices in mind when performance is the primary goal.

Khawaja Shams

Author

Daniela Miao

Author

Performance

Building and maintaining high-performance systems is hard! Here are four practices we live by to maintain and improve performance at Momento.

Focus on tail latencies instead of median or average

When building performance-sensitive software or a service, it is always tempting to advertise your P50 or average latencies. Those numbers typically look far better than P99 or P999: we often see latencies climb steeply in the tail, hitting 10x to 100x the average. Unfortunately, P50 latencies are not representative of a typical (or even P50) customer experience. Modern applications often make multiple calls to the same backend, sometimes dozens, and the customer waits for the slowest request to respond, which meaningfully amplifies the impact of tail latencies. In a cache, untamed tail latencies can actually be worse than the latency of the core database, often defeating the purpose of the cache entirely.
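To make the fan-out effect concrete, here is a small sketch in Python. It assumes that the calls are independent and share the same latency distribution, which is a simplification; the function name is ours and purely illustrative.

```python
def prob_any_call_in_tail(fanout: int, percentile: float = 0.99) -> float:
    """Chance that at least one of `fanout` calls is slower than `percentile`.

    Assumption: calls are independent and identically distributed.
    """
    return 1.0 - percentile ** fanout


for calls in (1, 10, 50, 100):
    share = prob_any_call_in_tail(calls)
    print(f"{calls:>3} backend calls: {share:.1%} of requests wait on a P99-or-worse call")
```

Under these assumptions, a request that fans out to 100 backend calls waits on at least one P99-or-worse response roughly 63% of the time, so the "tail" becomes the common case.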

Invest in a test harness to rapidly evaluate new techniques or technologies

Having a solid test harness enables teams to quickly evaluate a broad range of performance optimizations. At Twitter, Brian Martin built rpc-perf, a general-purpose tool for benchmarking RPC services that was used extensively to evaluate caching services there. rpc-perf features high-resolution latency metrics, support for multiple protocols including Memcached and Redis, powerful test configurations, and waterfall visualization of latencies.

We have added support for Momento’s gRPC protocol to rpc-perf, as well as integration with OpenTelemetry, which lets us visualize the results of our performance benchmarks in Grafana and Lightstep. We also have infrastructure-as-code to rapidly deploy a fleet of rpc-perf instances that send at-scale load towards Momento. The results are aggregated in our metrics provider, giving us instant visibility into service-level changes. We are actively adding a performance qualification stage to our deployments to qualify the performance impact of any new change before it hits production.
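For readers without a harness yet, the core loop is small. The following is a minimal single-threaded sketch in Python, not rpc-perf and not our production tooling: `measure` and `percentile` are hypothetical helpers, and the workload passed in is a stand-in for a real client call.

```python
import math
import time
from typing import Callable, Dict, List


def percentile(sorted_samples: List[float], p: float) -> float:
    """Nearest-rank percentile of pre-sorted samples; p is in (0, 100]."""
    rank = math.ceil(p * len(sorted_samples) / 100)
    rank = min(len(sorted_samples), max(1, rank))
    return sorted_samples[rank - 1]


def measure(call: Callable[[], object], iterations: int = 1000) -> Dict[str, float]:
    """Time `call` repeatedly and summarize its latency in milliseconds."""
    samples: List[float] = []
    for _ in range(iterations):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return {
        "p50": percentile(samples, 50),
        "p99": percentile(samples, 99),
        "p999": percentile(samples, 99.9),
        "max": samples[-1],
    }


# Stand-in workload; replace with a real request to the service under test.
summary = measure(lambda: sum(range(1000)), iterations=500)
print(summary)
```

A real harness also needs concurrency, warm-up, rate control, and metrics export, which is where a purpose-built tool earns its keep.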

Have service-level objectives (SLOs) for each component in the system

SLOs are important in grounding the outcomes we are looking for in each test. We use SLOs in two ways. First, there is a minimum bar that each component must meet to qualify for placement in the service. This applies, for example, to code changes or engine upgrades. Second, we have aspirational SLO milestones that we strive to reach. Once those milestones are hit, we turn them into our new minimum SLOs. This approach keeps us balanced and pragmatic about our pace of innovation and lets us deliver iterative performance improvements to our customers over a longer period of time.
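The two-tier scheme can be sketched as a small data structure. This is an illustration in Python with made-up names and thresholds (`LatencySlo`, `evaluate`, `promote`, and the millisecond values are all assumptions), not a description of our actual tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySlo:
    minimum_p99_ms: float       # bar a component must meet to be placed in service
    aspirational_p99_ms: float  # milestone the team is working towards


def evaluate(slo: LatencySlo, measured_p99_ms: float) -> str:
    """Classify a test result against both tiers of the SLO."""
    if measured_p99_ms > slo.minimum_p99_ms:
        return "reject"
    if measured_p99_ms <= slo.aspirational_p99_ms:
        return "milestone_reached"
    return "accept"


def promote(slo: LatencySlo, next_aspirational_p99_ms: float) -> LatencySlo:
    """Once the milestone is hit, it becomes the new minimum bar."""
    return LatencySlo(
        minimum_p99_ms=slo.aspirational_p99_ms,
        aspirational_p99_ms=next_aspirational_p99_ms,
    )


slo = LatencySlo(minimum_p99_ms=5.0, aspirational_p99_ms=3.0)  # hypothetical values
print(evaluate(slo, 4.2))  # meets the minimum, milestone not yet reached
print(promote(slo, 2.0))   # the old aspiration is now the minimum
```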

Continuous perf testing

Test harnesses and SLOs are great for assessing the efficacy of major updates or tuning. Unfortunately, each deployment, no matter how minor, carries a meaningful risk of degrading your performance. Having a performance or soak stage in your CI/CD pipeline helps keep regressions from making it to prod, while performance canaries go a long way towards ensuring that any regressions that do reach prod are detected promptly.
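A pipeline stage like this boils down to comparing a candidate build against a known-good baseline. Here is a minimal sketch in Python; the function names, the 10% tolerance, and the sample numbers are all assumptions chosen for illustration.

```python
from typing import Dict, List


def is_regression(baseline_ms: float, candidate_ms: float, tolerance: float = 0.10) -> bool:
    """True if the candidate is slower than the baseline by more than `tolerance`."""
    return candidate_ms > baseline_ms * (1.0 + tolerance)


def regressed_percentiles(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.10,
) -> List[str]:
    """Names of tracked percentiles that regressed; a missing value counts as a failure."""
    return [
        name
        for name, base in baseline.items()
        if is_regression(base, candidate.get(name, float("inf")), tolerance)
    ]


# Hypothetical numbers from a soak run of the current and the new build.
baseline = {"p50": 1.0, "p99": 5.0, "p999": 12.0}
candidate = {"p50": 1.05, "p99": 6.0, "p999": 12.5}

failures = regressed_percentiles(baseline, candidate)
print("block deployment:" if failures else "ok to deploy:", failures)
```

Note that the P50 barely moves in this example while the P99 regresses by 20%, which is exactly the kind of change a median-only check would miss.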

Read about how we employed this framework in our deep dive on optimizing Pelikan for Google’s Tau T2A VMs.

Khawaja Shams

Author

Khawaja is the CEO and Co-Founder of Momento. He is passionate about investing in people, setting a bold vision, and team execution. Khawaja has experience at AWS where he owned DynamoDB, and subsequently owned product and engineering for all 7 of the AWS Media Services. He was awarded the prestigious NASA Early Career Medal for his contributions to the Mars Rovers.

Daniela Miao

Author

Daniela Miao is the Co-Founder of Momento, where she’s humbled every day by her awesome teammates! Previously, she led Platform Engineering at Lightstep, where she launched their new Metrics Product. She was also tech lead at AWS DynamoDB, where she released cross-region replication. Daniela has spoken at many events including re:Invent, QCon, and KubeCon. At Momento, she works on distributed system performance, observability, security, and the intersection of engineering with business.
