
Scaling Up (or Down) Smartly with the Kubernetes Horizontal Pod Autoscaler

If you’ve been in a modern high-rise, you may have encountered a ‘smart elevator’. Like older elevators, they’re designed to move people up and down between floors, but that’s where the similarities end.

With smart elevators, there’s no pushing an Up or Down button, and no hoping that others won’t get on and turn your express ride into a local. Instead, you first go to a separate kiosk and select your desired floor. Then the smart elevator system directs you to the specific elevator that will get you there the fastest. The system is also efficient, combining your ride with others who have the same destination, minimizing the number of stops. 

For your cloud-native technology stack, Kubernetes, with its Horizontal Pod Autoscaler (HPA) using default settings, is the traditional elevator in our analogy. It will get you where you want to go (scaling up or down) eventually. 

What many people don’t realize, however, is that they already have smart elevator capabilities with their K8s/HPA resources.  They just need to unlock those features with intelligent HPA tuning. With that tuning, you can automatically and rapidly scale up your K8s pods to meet increases in demand, and automatically spin them down when demand wanes.

With intelligent, automated, and more granular tuning, HPA helps Kubernetes to deliver on its key value promises, which include flexible, scalable, efficient and cost-effective provisioning.

There’s a catch, however. All that smart spin-up and spin-down requires Kubernetes HPA to be tuned properly, and that’s a tall order for mere mortals. If the tuning results in too-thin provisioning, performance can suffer and clusters can fail. If tuning results in over-provisioning, your cloud costs can go way up. 

Let’s look more closely at Kubernetes HPA tuning challenges and how they can be solved. But first, some level-setting is appropriate.

What Kubernetes HPA Is and How It Works

The HPA is one of the scalability mechanisms built into Kubernetes. It’s a tool designed to help users manage the automated scaling of cluster resources in their deployments. Specifically, the HPA automatically scales up or down the number of pods in a replication controller, replica set, stateful set, or deployment.

The HPA conducts its autoscaling based on metrics set by the user. A common choice of DevOps teams is to use CPU and memory utilization as the triggers to scale more or fewer pod replicas. However, HPA does allow users to scale their pods based on custom or external metrics. 

Whatever metrics are chosen, the user sets a target for the average utilization across all the replicas in a deployment, and then the HPA takes over. It handles adding or deleting replicas as needed to keep the utilization rates at the target values. 
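To make the "HPA takes over" step concrete, here is a minimal Python sketch of the scaling rule described in the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance of 1.0. The function name and the example numbers are ours, for illustration only, and the sketch ignores min/max replica limits, pod readiness, and stabilization windows.

```python
import math


def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float,
                     tolerance: float = 0.1) -> int:
    """Approximate the replica count the HPA would aim for.

    Simplified: real HPA behavior also accounts for min/max replicas,
    unready pods, missing metrics, and stabilization windows.
    """
    ratio = current_utilization / target_utilization
    # If the ratio is close enough to 1.0, the HPA leaves things alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


if __name__ == "__main__":
    # Four replicas averaging 90% CPU against a 60% target.
    print(desired_replicas(4, 90, 60))
```

With four replicas averaging 90% CPU against a 60% target, the ratio is 1.5, so the sketch asks for six replicas. At 62% against the same target, the ratio falls inside the tolerance and the count stays at four.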

For teams that are early in their transitions to Kubernetes, it’s easy to take a “set it and forget it” approach with HPA. In many cases, they don’t even “set it”. They simply go with the ‘out-of-the-box’ HPA settings – the default target CPU utilization or memory utilization. Teams that want to do more with HPA can use custom metrics, but that can quickly get complicated. Furthermore, for teams using a hosted K8s service, some of those customization options may not be available from their providers. 

The Problem:  Subpar Performance and Opportunities Missed

Applications are not created equal. To deliver exceptional user experiences, applications need to be fed with different types of resources at varying rates, and at different times. Since the HPA’s default settings have a ‘one size fits all’ orientation, they’re certainly not optimal for individual applications and their specific workloads.

If the application is a web server for example, the speed at which the HPA adds replicas is critical in accommodating bursts in traffic. Without the ability to set a higher speed for replica additions for that specific app, the result would be slower scaling, which in turn, could negatively impact the user experience.
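As a rough illustration of what "a higher speed for replica additions" can look like, the Python sketch below builds an HPA manifest using the `behavior` field of the `autoscaling/v2` API. It assumes a cluster version that supports that field. The deployment name, replica limits, CPU target, and policy values are all hypothetical and would need to be tuned per application.

```python
import json


def web_hpa_manifest(name: str, max_replicas: int) -> dict:
    """Build an HPA manifest with an aggressive scale-up policy.

    All numeric values here are illustrative assumptions, not recommendations.
    """
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": name,
            },
            "minReplicas": 2,
            "maxReplicas": max_replicas,
            "metrics": [{
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {"type": "Utilization", "averageUtilization": 70},
                },
            }],
            "behavior": {
                # Scale up quickly: no stabilization delay, and allow either
                # tripling the pod count or adding 10 pods every 15 seconds,
                # whichever permits the larger change.
                "scaleUp": {
                    "stabilizationWindowSeconds": 0,
                    "selectPolicy": "Max",
                    "policies": [
                        {"type": "Percent", "value": 200, "periodSeconds": 15},
                        {"type": "Pods", "value": 10, "periodSeconds": 15},
                    ],
                },
                # Scale down cautiously: wait 5 minutes, then remove
                # at most one pod per minute.
                "scaleDown": {
                    "stabilizationWindowSeconds": 300,
                    "policies": [
                        {"type": "Pods", "value": 1, "periodSeconds": 60},
                    ],
                },
            },
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests, e.g. piped into `kubectl apply -f -`.
    print(json.dumps(web_hpa_manifest("web-frontend", 20), indent=2))
```

The asymmetry is the design choice worth noting: fast scale-up protects the user experience during a burst, while slow scale-down avoids thrashing when traffic dips briefly.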

Without being able to change the policies for scaling up, the user is left with only a small number of values to work with, mainly a CPU utilization target and the number of maximum replicas. Work-arounds can be found, but they nearly always have drawbacks. With our web app, a simple fix would be to drop the CPU utilization target to a much lower value, like 20-25%. That way, slow-downs would be avoided because the HPA would be triggered early during an upswing in traffic. And there’s the drawback. With premature scaling, apps get overprovisioned, replicas get underutilized, and cloud costs increase significantly.
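To put rough numbers on that drawback, the sketch below estimates how many replicas a steady load needs at different CPU utilization targets. It assumes each pod requests a fixed amount of CPU and that utilization is measured against that request. The load and request figures are made up for illustration.

```python
def replicas_for_load(total_cpu_millicores: int,
                      request_millicores: int,
                      target_percent: int) -> int:
    """Replicas needed so average CPU utilization stays at or below the target.

    Uses integer arithmetic to get an exact ceiling division.
    """
    numerator = total_cpu_millicores * 100
    denominator = request_millicores * target_percent
    return -(-numerator // denominator)  # ceiling division


if __name__ == "__main__":
    # Hypothetical web app: 3500m of total CPU demand, 500m requested per pod.
    for target in (70, 25):
        count = replicas_for_load(3500, 500, target)
        print(f"target {target}%: {count} replicas")
```

Under these assumed numbers, a 70% target needs 10 replicas while a 25% target needs 28, which is nearly three times the footprint for the same traffic.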

A Smarter Approach – Expanded and Pre-tested HPA Settings  

Only using a couple of settings, and only using their default values with HPA will not get you optimal performance, nor will it get you the highest cost-effectiveness across all the varied apps you’re transitioning to Kubernetes. Instead, you need your settings in HPA to reflect the nature (needs and wants) of the various apps. 

CPU and memory aren’t the full picture of how an app behaves, however. They may stay flat or be very spiky, but neither pattern is necessarily an indication of performance. Things like latency and throughput are much better signals for many apps. This is where custom metrics come into play. Many times the real power comes from understanding the performance, not just the footprint, of the application. Every app behaves differently, and therefore every HPA needs to be tuned to the app. One size certainly does not fit all, not even close.
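For a sense of what a custom-metric target looks like, here is a small Python sketch that builds one entry for the `metrics` list of an `autoscaling/v2` HPA spec, using the `Pods` metric type. The metric name `http_requests_per_second` and the target value are hypothetical. In practice, the cluster also needs a metrics adapter that exposes the metric through the custom metrics API, which this sketch simply assumes is in place.

```python
def pods_metric(metric_name: str, target_average: str) -> dict:
    """Build a per-pod custom metric entry for an HPA `metrics` list.

    The HPA compares the metric's average across pods with `target_average`.
    """
    return {
        "type": "Pods",
        "pods": {
            "metric": {"name": metric_name},
            "target": {"type": "AverageValue", "averageValue": target_average},
        },
    }


if __name__ == "__main__":
    # Hypothetical throughput signal: aim for about 100 requests/second per pod.
    print(pods_metric("http_requests_per_second", "100"))
```

An entry like this can sit alongside a CPU target in the same `metrics` list. When several metrics are configured, the HPA computes a replica count for each and uses the largest.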

But doing that is crazy-complicated. You need tests (experiments) that yield precise and actionable results. You also have to run the experiments continually, because apps behave differently and need different things as they run through their cycles, and Day 1 conditions differ from Day 2 and Day N. You also need to run the experiments at high speed (with automation), interpret the results at high speed (w/automation), and lastly apply the recommendations at high speed (w/automation…are we seeing a trend here?).

Carbon Relay’s Red Sky Ops solution does all of this.  Teams can use it to support their HPA efforts to significantly improve what they have now, which is probably a hot mess. And here’s a standard, 1-paragraph description about how RSO does it.

Red Sky Ops by Carbon Relay approaches this in a very scientific way. We create experiments that test your applications under expected or unexpectedly high load; each test run is called a trial. For each experiment, we then find the HPA settings that best balance the performance you desire against some other metric, like cost. If you are like most organizations, cost is a factor. If not, then just scale like crazy and leave it up high for best performance (and come back when you have to cut back). 

With more intelligence and automation, smart elevators get you to your floor faster and more efficiently. It’s the same deal with HPA and app performance and reliability in Kubernetes – intelligent automation is clearly the way to go. Want a deeper dive into HPA tuning? Here’s a blog by one of our engineers on the topic.


Using Machine Learning to Find the Elasticsearch Performance Peak

How do you know where the highest peak of a mountain range is? Well, probably through a quick Google search. But in the absence of the internet, you pick a random place to start, climb to the top and look around (Okay, yes, you can use a helicopter, plane or maybe even a drone, but stay with me a second). Now that you are at the top, you see that some of the mountains are still taller and make an informed decision on where to go next. Climb down, go to the next mountain and climb up again. Look around, see there are still taller peaks and try again. Tired yet?

The truth is, no one has time to climb all of the mountains to find out which one is the tallest, and certainly no one has time to try different parameter combinations to reach the highest performance of their application (…you knew we’d get here). So how do you know when you’ve got the best parameters for your application? If you’re anything like us, most of the time you try a lot of different options, and once you find one that performs pretty well, you keep it. Behavioral economists call this satisficing. When the trade-off is more time spent searching for a better configuration, it’s better to stop when you’ve crossed a reasonable threshold. That’s why we built Red Sky Ops, to help you find the tallest mountain peak without ever climbing a mountain yourself.

We put our machine learning (ML) to the test against the preset values in the Elasticsearch Helm chart and compared the performance and cost to the configurations our ML found (view our Elasticsearch example “recipe” here). Performance is measured as the time, in seconds, it takes to process a dataset configured with the Rally benchmarking tool. For our example, cost is specific to Google Kubernetes Engine, measured as the total cost of CPU and memory per month. Here’s what we got:

The above graphic shows the cost and duration measurements for each configuration the ML tried. The ML searched the parameter space, trying 80 different configurations of four parameters (memory, CPU, heap size, and replicas). The triangle represents the Helm default, while the pink dots represent the best options of all the configurations the ML tried. Unstable configurations are parameter combinations that prevented the application from deploying; in this case, they accounted for 28% of trials. Each trial helped inform where to search next, ultimately giving us 16 best options to choose from. Since we are optimizing for more than one metric, there is not a single “peak”, but at a given cost, there is a single best duration. 
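The idea of "best options" when optimizing two metrics at once can be sketched as a Pareto front: a trial is kept only if no other trial is both cheaper and faster. The Python below is a generic illustration of that filter, not a description of how Red Sky Ops works internally, and the trial data is invented for the example.

```python
from typing import List, NamedTuple


class Trial(NamedTuple):
    name: str
    cost: float      # dollars per month
    duration: float  # seconds to finish the benchmark


def pareto_front(trials: List[Trial]) -> List[Trial]:
    """Return the trials that no other trial beats on both cost and duration."""
    front = []
    best_duration = float("inf")
    # Walk from cheapest to most expensive; keep a trial only if it is
    # faster than every cheaper (or equally priced) trial seen so far.
    for trial in sorted(trials, key=lambda t: (t.cost, t.duration)):
        if trial.duration < best_duration:
            front.append(trial)
            best_duration = trial.duration
    return front


if __name__ == "__main__":
    # Hypothetical measurements, not the results from the article.
    trials = [
        Trial("t1", 70, 900),
        Trial("t2", 80, 700),
        Trial("t3", 85, 860),
        Trial("t4", 95, 600),
        Trial("t5", 100, 650),
    ]
    for trial in pareto_front(trials):
        print(trial)
```

In this made-up data, t3 and t5 drop out because a cheaper trial is also faster. What remains is a menu of trade-offs: pay more to go faster, or accept a longer duration to save money.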

The Helm configuration had a duration of 864 seconds and a cost of $85 per month. Point A has a duration of 628 seconds and a cost of $76 per month. So that’s a 10% reduction in cost with an 8% boost in performance. Point B has a duration of 720 seconds and a cost of $81 per month, meaning that for a slightly lower cost you can get a 17% improvement in performance. If you are willing to increase your budget a bit, Point C costs $92 per month and gives you a 32% improvement in performance (589 seconds). Not bad options to have without ever changing a configuration yourself. 

Want to see how Red Sky Ops does against your current configuration? Sign up to optimize your first application free-for-life and view our quick-start for documentation on adding your own baseline.


Steph Rifai
Product Manager


With Kubernetes, It’s Not All About Horsepower

I make my living in the software business. I started out as a software developer, moved up to systems engineering and product management, and then on to solutions architecture. 

It has made for a great career, but I’ll let you in on a little secret.  My true passion is cars and racing. That’s right, I’m a race car guy at heart.

Spending time in both worlds, I started thinking about the changing requirements we often get when we’re creating applications or solutions. Having built and modified a few cars of my own, I also considered what a similar approach would be like for teams building race cars. Challenging is the word that comes to mind.  Here’s why.

Let’s say our sponsor comes to us with a simple directive. “Build the fastest car you can.” So, you go and build a top-fuel dragster. Designed for short, straight-line races from a standing start, your light, high-powered car has the quickest acceleration in the world, reaching speeds of over 339 mph (546 kph) in less than three seconds.

In the world of containerized apps, that raw power and speed is the equivalent of major scaling capabilities.

It’s good, but the sponsor wants more. They want the car to be able to handle curves in the track, so the aerodynamics, suspension and steering need to be different. And it needs to be able to handle much longer races, so it needs to be fuel efficient. It also needs real-time monitoring of things like its tires, brakes, clutch, transmission, etc.

You’re essentially being asked to turn your dragster into a Formula One race car.  In software, it’s the equivalent of handling completely different functions and workloads by adding all new functionality.

You go back to the drawing board in your garage to see if there’s a way to modify your dragster to meet the new requirement. But as any car person will tell you, it’s just not possible to morph a dragster into an F1 kit. So, you start from scratch and build a car that Jackie Stewart would be proud of.

The sponsor is happy for a few minutes, but then comes back with the need for your car to compete in the Pikes Peak International Hill Climb Race (a rain or shine event). So, you need to retool with an engine that can handle, among other things, the rapidly decreasing air density on a course with rapidly rising altitude. Or is electric the way to go? Last year, a new course record was set by an electric car. If your experience so far has been in fuel cars, how will you handle the very different challenges of electric design?  

You get the idea; I don’t want to overdo the car-building analogy. But if you and your team are building applications and solutions today, you’re likely having to deal with these types of fundamental, unanticipated requirement changes. Maybe it’s competitive pressure or advantage, an acquisition, or maybe a complete pivot to fundamentally new technology.

In the old days, these sorts of radical changes in direction would utterly disrupt the development process. But today, using cloud-native architectures and dynamically assembled microservices, you can change a dragster app into an F1 solution, and then pivot your design so the resulting vehicle can get to the top of the mountain first.

But here’s the rub. You can’t do any of this type of transformative work without understanding what the new direction will require in terms of performance and architectural changes. What was once a finely tuned and largely upfront effort now needs to be tuned again. What are the peaks and valleys of high traffic, throughput, latency, or concurrent users going to do to your application? How do you best optimize for those scenarios? When the pivot comes, how do you yet again validate the proper optimization? Is the new architecture even relevant to the new tasks at hand?

What if your toolkit included a solution that simplifies and automates the process of continuous Kubernetes optimization?

Brad Ascar
Sr. Solutions Architect, Carbon Relay
