Latency is a well-known culprit in subpar performance. It is a measure of request response time: for example, how long it took the directions in your favorite GPS app to load, or for your browser or CDN's cache to serve the webpage data. Along with traffic, error rate, and saturation, latency is one of the four golden signals of observability. With many of today's applications handling requests in dynamic, cloud-based, multi-tenant environments, there are more potential sources of latency than ever.
For these highly scaled, distributed applications, it is critical to implement performance solutions that involve monitoring high percentile latency—also known as tail latency. Though relatively rare in frequency, tail latencies can lead to significant performance impacts, especially in applications handling hundreds of thousands or millions of requests per second. Developing a clear picture of tail latency in your application and how it is affecting user experience is a key part of delivering software that delights customers and drives business.
What is tail latency?
Tail latency is high-percentile latency: the response times of the slowest requests a service or application handles, typically those above the 98th or 99th percentile (p98, p99, p99.9, and so on) of all requests.
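To make the definition concrete, here is a minimal sketch of computing percentile latencies from a batch of response times. The sample data is synthetic and purely illustrative: most requests cluster around 10 ms, while 1% land around 1 second.

```python
import random

# Hypothetical sample: 100,000 request latencies in milliseconds.
# 99% are fast (~10 ms), 1% are slow (~1,000 ms) -- illustrative only.
random.seed(42)
latencies = [random.gauss(10, 2) for _ in range(99_000)] + \
            [random.gauss(1_000, 200) for _ in range(1_000)]

def percentile(samples, p):
    """Return the p-th percentile (0-100) of samples, nearest-rank method."""
    ordered = sorted(samples)
    rank = round(p / 100 * len(ordered)) - 1
    return ordered[max(0, min(len(ordered) - 1, rank))]

for p in (50, 90, 99, 99.9):
    print(f"p{p}: {percentile(latencies, p):.1f} ms")
```

Notice how p50 and p90 look healthy while p99.9 tells a very different story; that gap is the tail.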
Tail latencies can stem from almost any part of a request's execution: networking, page faults, garbage collection, sudden spikes in load, resource contention, or issues in downstream dependencies.
Imagine you are working on a payment processing service and have reduced average latency to 10 ms for 90% of requests. Another 2% of requests average 1 second of latency. All of this falls well within performance expectations. But 0.01% of requests are experiencing an average of 10 seconds of latency. These heavily delayed requests are your tail latencies.
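A quick back-of-the-envelope calculation shows why averages hide this problem. Using the illustrative traffic mix above (the example's percentages don't cover all traffic, so the remainder is simply left out of this sketch):

```python
# Hypothetical traffic mix from the payment service example.
# (share of requests, average latency in ms) -- illustrative numbers only.
buckets = [
    (0.90,   10),      # 90% of requests average 10 ms
    (0.02,   1_000),   # 2% average 1 second
    (0.0001, 10_000),  # 0.01% average 10 seconds -- the tail
]

total_share = sum(share for share, _ in buckets)
weighted_mean = sum(share * ms for share, ms in buckets) / total_share

print(f"blended average: {weighted_mean:.1f} ms")   # looks perfectly healthy
print(f"tail latency:    {buckets[-1][1]} ms")      # what 1 in 10,000 users sees
```

The blended average comes out in the tens of milliseconds, which looks great on a dashboard, while one in every ten thousand users waits ten full seconds.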
Why does tail latency matter?
When looking to optimize latency in an application, a tempting, and not wholly incorrect, reaction is to focus on engineering solutions for the most frequent and consistent problems. But this approach by itself can leave big blind spots in your preparedness. Services deep in the call chain experiencing tail latencies can snowball, degrading performance as more services and resources are tied up in pending requests.
Take the above example of the payment processing service. Save for the occasional alert, users and clients seem happy. All the while, 0.01% of requests have an average latency of 10 seconds. As it happens, the team is only monitoring latencies up to the 95th percentile, not wanting to devote the time or resources to resolve or better understand the service's tail latencies.
Now, imagine that the marketing team runs a campaign that goes viral and traffic starts to increase dramatically. As traffic climbs, the long-running transactions that had been hiding at lower loads begin exhausting the total available request limit of an upstream dependency. Fifteen minutes into the marketing push, transactions start failing, cascading across all transactions and escalating into a much larger availability outage. By the time the on-call engineers can respond and scale up the dependency to handle the excess concurrent requests, the viral moment has passed. When it mattered most, the tail latencies were the system's Achilles' heel.
Tail latencies matter because they can negatively affect user experience and client confidence. Proactively addressing tail latencies is important for site reliability, and site reliability is important for making (and keeping) money.
Ultimately, all the engineering power in the world doesn't amount to anything if it can't be channeled into an engaging and performant user experience. Studies have shown just how important latency is to web users, finding that lower latency correlates with higher engagement, especially when users can quickly switch to a competing service with less latency. Google has developed the RAIL model to help developers keep up with user expectations around app performance. For example, they have found that users are particularly sensitive to motion: for elements with animation or motion, the budget for producing each frame of work is only about 10 ms.
Since we are considering user experience, we need to ask ourselves: who is going to care most about the performance of just the 99.99th percentile of requests? Typically, a service's most frequent or most demanding users. As their usage trends further from the mean, they become more likely to encounter the tail latencies in the system, directly threatening the continued engagement of your application's most important users.
Clients are a different breed of user, deploying your software within or alongside a service of their own. To help ensure quality performance for *their* users, clients will often only enter into business agreements with certain performance guarantees put in place. These will typically take the form of service level agreements (SLAs), where the performance of a service level indicator (SLI), like latency, is contractually agreed upon.
Understanding and providing solutions for the tail latencies in your system will make it more robust and performant. To that end, establish service-level objectives (SLOs) with stricter demands around tail latencies to help insulate against performance influences outside of your control. If sub-second latency is guaranteed for 99% of requests, maintaining an SLO of sub-second latency for 99.9% or 99.99% of requests will help ensure client expectations are met.
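The headroom between an internal SLO and a contractual SLA can be checked with a few lines of code. This is a hedged sketch with made-up numbers: the SLA guarantees sub-second latency for 99% of requests, while the internal SLO holds the same threshold at 99.9%.

```python
def meets_objective(latencies_ms, threshold_ms, target_fraction):
    """True if at least target_fraction of requests finish under threshold_ms."""
    under = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return under / len(latencies_ms) >= target_fraction

# Hypothetical observation window: 1,000 requests, two slow outliers.
window = [100] * 998 + [3_100, 2_500]

# SLA promised to clients: sub-second latency for 99% of requests.
sla_ok = meets_objective(window, 1_000, 0.99)
# Internal SLO, held tighter: sub-second latency for 99.9% of requests.
slo_ok = meets_objective(window, 1_000, 0.999)

print(f"SLA (99% < 1s):   {'met' if sla_ok else 'violated'}")
print(f"SLO (99.9% < 1s): {'met' if slo_ok else 'violated'}")
```

In this window the SLA is still met, but the stricter SLO has already been breached: the SLO violation is the early warning that lets the team act before the contractual guarantee is at risk.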
High-percentile latencies are an important performance target for systems that need to be resilient at scale. Because a system's most influential users tend to initiate the most requests, they are the most likely to experience these painful tail latencies. Analyze and optimize your system's tail latencies to protect the user experience of your system's most valuable users. This work will allow your team to set more accurate performance guarantees for clients and add performance robustness during high-stress incidents.