This post is part of a series detailing how Momento thinks about our developer experience. Getting started with our serverless cache is shockingly simple, but that is only half of the story.
We believe a delightful user experience includes writing and running code that interacts with your cache, and making our client libraries shockingly simple takes deliberate effort. To say it another way: our cache clients are a first-class concern.
This post is for those of you who want to geek out with me about p999 latencies and gRPC channels. For everyone else, if you’re just interested in getting up and running with Momento…you’re good to go! Check out our Getting Started guide to get up and running with a free serverless cache in minutes, or head to our GitHub org to download the Momento client for your favorite language!
Previously, on Shockingly Simple
- Momento clients are built using gRPC, a high-performance framework for remote procedure calls, created by Google.
- As such, a lot of our tuning efforts will be focused on gRPC configuration.
- Like Node.js, the default Python runtime only runs user code on a single CPU core, so CPU is likely to become our first and primary performance bottleneck.
- The more concurrent requests executed against our client, the more CPU is required to handle all of their network callbacks.
- Thus, it’s important for us to identify the maximum number of concurrent requests we can handle before a CPU bottleneck starts to impact our client-side latencies.
- gRPC “channels” are connections between the client and the server. Most gRPC servers are configured to limit the maximum number of concurrent requests per channel to 100.
- Therefore, when executing more than 100 concurrent requests, it can sometimes help to create more than one gRPC channel between the client and server.
Tuning the Momento Python gRPC Client
Now we know a single channel is sufficient. However, these latencies are still unacceptably high, so we need to continue tuning.
Varying the number of concurrent requests
As expected, the best values are in the 50-100 range. Latencies increase slightly when moving from 50 to 100, but throughput increases quite a bit, so 100 looks like a good starting point for further testing.
Moving from the laptop to the cloud
Now we have an idea of how performance looks in a dev environment, let’s test things out from an AWS EC2 instance running in the same region as the cache service. This should eliminate most of the network latency, so we expect the client-side latency numbers to look much closer to the server-side numbers. (We use a c6i.4xlarge instance type because we have observed more consistent network performance in this class than with smaller instances.)
- Client-side p999 latency of 20ms: This is a reasonable target for applications which are caching data that is extremely expensive to compute (e.g. a big JOIN query in a complex relational database schema) and when the application’s own latency requirements are lenient.
- Client-side p999 latency of 5ms: This is a better target for an application whose own latency requirements are absolutely critical.
asyncio and uvloop
Since we were tuning for maximum performance on a single Python process and we knew the bottleneck is CPU, we were looking for opportunities to reduce CPU usage. A lot of the overhead in our load generator program is driven through Python’s asyncio library, for which there are different engines available. Thus far in this post. we have been using the default engine.
We decided to experiment with the uvloop engine, which moves some of the work from Python into native code; specifically, it uses the same libuv C library that is used by the node.js event loop. Let’s take a look at how performance was impacted. We’ll compare the performance of 50 and 100 concurrent requests, with and without uvloop:
Revisiting the number of concurrent requests
Here are the data for client-side latency and throughput when running on the EC2 instance using uvloop with varying numbers of concurrent requests:
From these charts we can see we are able to hit our first latency target (p999 latency less than 20ms) with 50 concurrent requests.
For an application whose latency requirements are more stringent, we are able to hit our target of a p999 latency of less than 5ms with 5 concurrent requests. This does reduce the throughput from 9100 requests per second to about 3500, but the tradeoff will be the right choice for certain applications.
There are also options for increasing overall throughput on a machine by running multiple instances of your Python process. For today, we have focused on optimizing a single Python process, but in a future post we may do another deep dive to explore the performance tradeoffs that come into play with those options.
With these findings, we now have a solid basis for defining pre-built configurations for dev and prod environments. We used these to initialize the default settings in the Momento Python client, and in an upcoming release, you’ll be able to choose between the pre-built configurations like Configurations.Laptop, Configurations.InRegion, and Configurations.InRegion.LowLatency and ProdEnvironmentConfig.Of course, you’ll also be able to manually tune the settings as you see fit, but our goal is to cover the 90% cases so you don’t have to!
Next up: C#
The next post in this series will cover our tuning work for the C# client. This will be the first where we cover a language that supports using multiple CPU cores out of the box. This should provide some interesting twists, since CPU won’t become a bottleneck as quickly as it has in the first two explorations. Stay tuned!
Or…don’t! This is all just for fun and insight, and our position is that you shouldn’t actually need to worry about this stuff with Momento. We want to take care of it for you—so you can focus on the things unique to your business. Go check out our Getting Started guide, and see how easy it is to create a cache for free in just minutes.
And if you want to talk shop on any of this, join our Discord server!