This is part 2 of a 2-part series on scaling cloud applications. In part 1, we introduced some scalability concepts, including horizontal scaling, vertical scaling, and stateless applications.
In this post, we will talk about techniques that can be used to approach “linear” scaling for your application.
Linear Scaling: aka the End of the Rainbow!
If you have designed the perfect application, then your application will be able to scale “linearly”. Awesome!
> Wait, what does that mean?
Linear scaling means that every time you add `x` more resources to your application, you can handle the same `y` amount of additional load, no matter how large the system already is.
> Blank stare
Here’s an example. If you have a web application deployed on Kubernetes and one Docker container (running a stateless instance of your web service) can handle 100 requests per second (RPS), then, ideally, scaling horizontally by adding a second Docker container would increase your maximum throughput to 200 RPS. 10 containers would give you 1,000 RPS, 100 containers would give you 10,000 RPS, and so on.
That is the ideal of linear scaling in a nutshell!
However, in the real world, it is vanishingly unlikely that this can be achieved. Why? Because linear scaling is like the pot of gold at the end of the rainbow. You can never quite reach it: there is always a “Next Bottleneck™”!
To build from the example above: if you have a stateless web service that runs in a container, which you can scale horizontally as we described above, it’s extremely likely that it is not the only component of your application. There is probably a database or other central persistence store that these containers are interacting with. And at some point if you have too many of these containers running at maximum capacity, they will likely overwhelm that persistence store and you will have to figure out how to scale it.
The load balancer is another example: depending on how many resources you have devoted to your load balancer, it may reach a point where it can no longer handle the inbound traffic no matter how many containers you put behind it.
In any of the above scenarios: you can achieve linear scaling! For a while. :)
There will be one component of your system that is the current bottleneck. You will figure out how to scale it (usually horizontally), and you will be able to scale linearly… for a while. Then you will hit a different bottleneck elsewhere in your system, and the game begins anew! Not quite to the end of the rainbow yet!
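The “Next Bottleneck™” idea can be sketched as a tiny capacity model: the system as a whole can only go as fast as its slowest component. The 100 RPS per container comes from the example above; the load balancer and database limits are made-up numbers chosen purely for illustration.

```python
def max_throughput_rps(containers: int,
                       per_container_rps: int = 100,
                       load_balancer_limit_rps: int = 50_000,   # hypothetical limit
                       database_limit_rps: int = 5_000) -> int:  # hypothetical limit
    """Return the throughput of the whole system, capped by its slowest part."""
    web_tier_rps = containers * per_container_rps
    return min(web_tier_rps, load_balancer_limit_rps, database_limit_rps)


for n in (1, 10, 50, 100):
    print(n, max_throughput_rps(n))
# 1 100
# 10 1000
# 50 5000
# 100 5000   <- the (hypothetical) database is now the bottleneck
```

Under these assumed limits, throughput grows linearly up to 50 containers and then flattens. Raising the database limit would restore linear growth until the load balancer limit is reached.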
Essential Practices for Achieving Linear Scaling (for a while)
Any part of your system that you can build to be stateless is going to make your scaling life easier. If there are parts of your system that handle web requests, read data from a database, do some computation on it, and return a response, these are great candidates for being stateless. For a given request, they get all of the state they need from the database. They don’t need to maintain any local state. When they finish handling a request, they are ready to handle the next one with no state carrying over. This is the easiest type of component to scale horizontally, and platforms like Kubernetes were largely created for exactly this type of component.
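Here is a minimal sketch of what a stateless request handler might look like. The `DATABASE` dictionary is a stand-in for a real external database, and the record layout and the `handle_request` name are hypothetical.

```python
from dataclasses import dataclass

# Stand-in for a shared, external database (hypothetical data).
DATABASE = {"user-1": {"name": "Ada", "orders": [20, 35, 45]}}


@dataclass
class Response:
    status: int
    body: dict


def handle_request(user_id: str, db=DATABASE) -> Response:
    """Handle one request using only its input and the shared database.

    Nothing is kept in module-level or instance variables between calls,
    so any container running this code can serve any request.
    """
    record = db.get(user_id)
    if record is None:
        return Response(404, {"error": "unknown user"})
    order_total = sum(record["orders"])
    return Response(200, {"name": record["name"], "order_total": order_total})


print(handle_request("user-1"))
```

Because the handler keeps nothing between calls, running ten copies of it behind a load balancer behaves the same as running one, just with more capacity.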
If you can build parts of your system using serverless compute services, such as AWS Lambda or Google Cloud Run, then you no longer have to think about how many VMs or containers you need in order to handle your traffic. These services will automatically detect when more capacity is needed or when excess capacity can be shed, and they will scale themselves up and down accordingly!
If you want any parts of your system to scale horizontally, you will most likely need to introduce a load balancer. The load balancer is responsible for accepting all of the incoming traffic and routing it to your application resources to do the work. When you add more nodes behind the load balancer, it distributes the load across them so that you can scale.
Platforms like Kubernetes provide built-in solutions for load balancing, and cloud providers like AWS offer feature-rich load balancing services as well so that you don’t have to create and manage your own.
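To make the idea concrete, here is a toy round-robin balancer. It is only a sketch of one simple routing strategy; real load balancers also handle health checks, connection draining, and more. The class name and backend names are hypothetical.

```python
class RoundRobinBalancer:
    """Hand out backends in rotation so each receives an equal share."""

    def __init__(self, backends):
        self._backends = list(backends)
        self._index = 0

    def add_backend(self, backend):
        # Scaling out: new capacity joins the rotation immediately.
        self._backends.append(backend)

    def pick(self):
        if not self._backends:
            raise RuntimeError("no backends registered")
        backend = self._backends[self._index % len(self._backends)]
        self._index += 1
        return backend


balancer = RoundRobinBalancer(["container-1", "container-2"])
first_four = [balancer.pick() for _ in range(4)]
print(first_four)
# ['container-1', 'container-2', 'container-1', 'container-2']

balancer.add_backend("container-3")  # later picks rotate across all three
```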
The best way to improve the scalability of your application is to find ways to reduce the amount of work it needs to do. Caching is a great way to achieve this.
Any computation that your application needs to execute more than once might be a candidate for caching. If you do the computation once, store the result in a cache, and then re-use the data from the cache rather than performing the computation again, you can free up compute resources so that your application can handle more load.
Database queries are a prime example. Suppose you have a sophisticated query that joins data from several tables and takes seconds to execute. If you can cache the results and re-use them across multiple requests, you will dramatically reduce the load on your database and improve the scalability of your application.
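A common way to apply this is to check the cache first and only run the expensive work on a miss. The sketch below uses an in-process dictionary with a time-to-live (TTL) purely for illustration; `TTLCache`, `expensive_report_query`, and the 60-second TTL are hypothetical, and a real deployment would typically use a shared cache service instead.

```python
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # cache hit: skip the expensive work
        value = compute()            # cache miss: do the work once
        self._entries[key] = (now + self._ttl, value)
        return value


calls = 0


def expensive_report_query():
    """Stand-in for a slow multi-table database query."""
    global calls
    calls += 1
    return [("2024-01", 1200), ("2024-02", 1350)]


cache = TTLCache(ttl_seconds=60)
for _ in range(5):
    rows = cache.get_or_compute("monthly_report", expensive_report_query)

print(calls)  # 1: five requests were served by a single query
```

The TTL is the trade-off to tune: a longer TTL removes more load from the database, but callers may see older data.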
There are many options for caching:
- You can set up a cache cluster yourself using software such as Redis or Memcached.
- You can use a service that manages a cache cluster for you, such as Redis Cloud or AWS ElastiCache.
- You can use a serverless distributed cache such as Momento or AWS ElastiCache Serverless.
Content Delivery Networks (CDNs)
CDNs such as Cloudflare, Fastly, and Amazon CloudFront provide “edge caching”. Edge caching allows you to build your application such that your end users’ requests first go through the “edge cache”, a regional cache located as close as possible to where the request originates, before they even reach your application. If the value needed for handling the request is available in the edge cache, then the request can be handled without ever reaching your application. Again, there is no better way to make your application scalable than to reduce the amount of work that it needs to do!
Edge caching is most often used by applications that serve large volumes of data, repeatedly, to many requesters. Video streaming services like Netflix are a great example; large binary video files can be cached at the edge to reduce the load on Netflix’s core servers.
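One detail this article does not cover is how the edge knows what it may cache. CDNs generally honor the standard HTTP `Cache-Control` response header, where `s-maxage` applies to shared caches such as a CDN and `max-age` applies to the browser. The helper below is a hypothetical sketch of how an origin might choose that header; the function name and TTL values are assumptions, and each CDN has its own additional settings.

```python
def edge_cache_headers(shareable: bool,
                       edge_ttl_seconds: int = 3600,
                       browser_ttl_seconds: int = 60) -> dict:
    """Build a Cache-Control header for an origin response.

    shareable=True  -> the same bytes can be served to every user,
                       so shared caches (the CDN edge) may store them.
    shareable=False -> per-user content; ask caches not to store it.
    """
    if not shareable:
        return {"Cache-Control": "private, no-store"}
    return {
        "Cache-Control": (
            f"public, max-age={browser_ttl_seconds}, s-maxage={edge_ttl_seconds}"
        )
    }


print(edge_cache_headers(shareable=True))
print(edge_cache_headers(shareable=False))
```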
Horizontal and vertical scaling are great, but when do they happen? And how?
If you’ve come up with a great design that allows you to scale a component of your application horizontally, you need to know when it is time to do that. You may have created some metrics that alert you when you are getting close to maximum capacity so that an operator can take action to add capacity to your system. But this relies on human intervention and can be error-prone.
Auto-scaling is a technology that allows you to configure key metrics as indicators that you are approaching your maximum capacity, and then the system can automatically detect when it is time to add additional capacity. It can also be used to detect when it is safe to scale back down, to save costs.
This is another capability that was front-of-mind in the design of Kubernetes; it provides many out-of-the-box options for auto-scaling. You can also find auto-scaling features in many services provided by the major cloud providers.
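As a rough illustration, the Kubernetes Horizontal Pod Autoscaler documentation describes its core calculation as `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`. The sketch below applies that formula with hypothetical minimum and maximum bounds; the real autoscaler also applies a tolerance band and stabilization rules that are left out here.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,     # hypothetical bound
                     max_replicas: int = 10) -> int:  # hypothetical bound
    """Scale the replica count in proportion to how far the metric is from target."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# 4 replicas averaging 90% CPU against a 60% target -> scale out to 6.
print(desired_replicas(4, current_metric=90, target_metric=60))  # 6
# 4 replicas averaging 30% CPU against a 60% target -> scale in to 2.
print(desired_replicas(4, current_metric=30, target_metric=60))  # 2
```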
Databases are very often the most challenging part of an application to scale. If you rely on a relational database, it’s important to understand how certain types of queries can be optimized; for example, by creating an index. Query optimizations can result in dramatic scalability improvements. Most databases provide tooling that you can use to monitor query performance, and often even get recommendations about particular queries that could be optimized.
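The effect of an index is easy to see with SQLite, which ships with Python. This sketch uses a hypothetical `orders` table and asks SQLite for its query plan before and after creating an index. The exact plan wording varies by SQLite version, but the first plan should report a full table scan and the second a search using the index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 1000, float(i)) for i in range(10_000)],
)


def query_plan(connection):
    """Return SQLite's plan for a lookup of one customer's orders."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN SELECT total FROM orders WHERE customer_id = ?",
        (42,),
    ).fetchall()
    # The human-readable description is the last column of each plan row.
    return " | ".join(row[3] for row in rows)


plan_before = query_plan(conn)  # expected to mention a SCAN of the table
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn)   # expected to mention idx_orders_customer

print(plan_before)
print(plan_after)
```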
It is also worth thinking about whether it is possible to define your database schemas such that your data can be sharded or partitioned. For example, if you know that all of your database queries only need access to records from a given month, or only records for a single customer, you can partition your database such that each partition can be run on a separate server. This gives you some ability to scale your database horizontally, rather than vertically.
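For the per-customer case, routing can be as simple as hashing the customer ID to pick a shard. This is a minimal sketch with hypothetical shard names. Note that plain modulo hashing moves most keys to a different shard when the number of shards changes, so production systems often use consistent hashing or a lookup table instead.

```python
import hashlib

# Hypothetical shard names; each would be a separate database server.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]


def shard_for_customer(customer_id: str, shards=SHARDS) -> str:
    """Map a customer to one shard, always the same one for the same ID."""
    # Use a stable hash: Python's built-in hash() of a str varies between runs.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


for customer in ("customer-17", "customer-18", "customer-19"):
    print(customer, "->", shard_for_customer(customer))
```

Every query for a given customer goes to a single server, so adding shards adds database capacity as long as load is spread reasonably evenly across customers.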
Better still, let someone else worry about your database scaling problems. Serverless databases such as Amazon DynamoDB and Aurora Serverless provide auto-scaling features that can take care of the most difficult scaling challenges for you, so that you can spend your time on other things. These services may cost more than managing your own database, but the savings in engineering costs for maintenance and scaling problems can be enormous.
We have mentioned Kubernetes several times already in this article, but it is worth calling out explicitly: container orchestration platforms were designed specifically for these types of scaling challenges. They provide a lot of functionality out of the box so that you can spend most of your time worrying about what you will run in your containers, rather than how you will load balance them or scale them up and down.
Platforms like Kubernetes can come with a steep learning curve, but they are very powerful and can shield you from the need to invent your own solutions for many scaling problems. They are also becoming more ubiquitous and more portable across cloud providers as time goes on.
Monitor performance metrics diligently
Last, but definitely not least: the most important thing that you need to know about scaling is when it is time to scale. Make sure that you have good metrics illustrating the health of your application, and concise, easy-to-understand dashboards visualizing these metrics and showing you whether or not your users are having a good experience. The earlier you notice that you are approaching your Next Bottleneck™, the better your odds of finding a scaling solution for it before it impacts your users!
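As one example of a user-experience metric, many teams track a high-percentile request latency rather than the average, because the average hides the slowest requests. The sketch below computes a nearest-rank percentile and compares it to a threshold; the choice of p95 and the 500 ms threshold are assumptions for illustration, not recommendations.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample >= pct% of all samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def latency_alert(samples_ms, threshold_ms=500, pct=95):
    """Return True when the chosen latency percentile exceeds the threshold."""
    return percentile(samples_ms, pct) > threshold_ms


# 90 fast requests and 10 slow ones: the average looks fine, p95 does not.
latencies_ms = [80] * 90 + [900] * 10
print(sum(latencies_ms) / len(latencies_ms))  # 162.0
print(percentile(latencies_ms, 95))           # 900
print(latency_alert(latencies_ms))            # True
```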
Momento's Role in Scaling Your Application
In a few places in this article, we’ve talked about how one of the best ways to solve your scaling problems is to let someone else solve them for you :)
Using a serverless database like DynamoDB can save you all sorts of scaling headaches compared to managing your own database cluster. Using a platform like Kubernetes can give you load balancing and auto-scaling with far less effort than trying to build those things yourself. And using a serverless compute platform like AWS Lambda can prevent you from needing to think about scaling your compute resources in and out at all.
Momento serverless cache can play a similar role to meet your caching needs. With Momento, you simply create your cache and then start reading and writing to it from wherever you please! No resources to manage, and no need to worry about scaling up and down as your traffic patterns fluctuate.
Serverless technologies can take the hassle out of scaling so that you can focus on the things that actually differentiate your business. When you’re building your next application or tackling your next scaling challenge, make sure you look around to see if there is a serverless product that might fit your needs!