How do you know where the highest peak of a mountain range is? Well, probably through a quick Google search. But in the absence of the internet, you pick a random place to start, climb to the top, and look around. (Okay, yes, you could use a helicopter, a plane, or maybe even a drone, but stay with me a second.) Now that you are at the top, you see that some of the mountains are still taller, and you make an informed decision on where to go next. Climb down, go to the next mountain, and climb up again. Look around, see there are still taller peaks, and try again. Tired yet?
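The mountain-climbing procedure above is, in optimization terms, random-restart hill climbing: climb greedily from several random trailheads and keep the highest summit you find. Here is a toy sketch over a one-dimensional "mountain range" (the function, step size, and restart count are all made up for illustration):

```python
import math
import random

def hill_climb(f, start, step=0.1, iters=1000):
    """Greedy local search: take a random step, keep it only if it goes uphill."""
    x = start
    for _ in range(iters):
        candidate = x + random.uniform(-step, step)
        if f(candidate) > f(x):
            x = candidate
    return x  # the summit of whichever mountain we happened to start on

def random_restart(f, n_starts=30, lo=-10, hi=10):
    """Climb from several random trailheads and keep the highest summit found."""
    peaks = [hill_climb(f, random.uniform(lo, hi)) for _ in range(n_starts)]
    return max(peaks, key=f)

# A toy "mountain range": several peaks, with the tallest near x ~ 4.2.
def range_height(x):
    return -0.1 * (x - 4) ** 2 + math.cos(3 * x)

best = random_restart(range_height)
```

A single climb gets stuck on whichever peak is nearest; only the restarts give you a shot at the tallest one, which is exactly why the search is so tedious to do by hand.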
The truth is, no one has time to climb all of the mountains to find out which one is the tallest, and certainly no one has time to try every parameter combination to reach the highest performance of their application (…you knew we’d get here). So how do you know when you’ve got the best parameters for your application? If you’re anything like us, most of the time you try a lot of different options, and once you find one that performs pretty well, you keep it. Behavioral economists call this satisficing: when the price of a better configuration is more time spent searching, it makes sense to stop once you’ve crossed a reasonable threshold. That’s why we built Red Sky Ops: to help you find the tallest mountain peak without ever climbing a mountain yourself.
We put our machine learning (ML) to the test against the preset values in the Elasticsearch Helm chart, comparing the performance and cost of those defaults to the configurations our ML found (view our Elasticsearch example “recipe” here). Performance is measured as the time it takes to process a dataset configured with the Rally benchmarking tool (in seconds). For our example, cost is specific to Google Kubernetes Engine, measured as the total cost of CPU and memory per month. Here’s what we got:
The above graphic shows the cost and duration measurements for each configuration the ML tried. The ML searched the parameter space, trying 80 different configurations of four parameters (memory, CPU, heap size, and replicas). The triangle represents the Helm default, while the pink dots represent the best options of all the configurations the ML tried. Unstable configurations are parameter combinations that prevented the application from deploying; in this case, they accounted for 28% of trials. Each trial helped inform where to search next, ultimately giving us 16 best options to choose from. Since we are optimizing for more than one metric, there is no single “peak”; instead, at any given cost there is a single best duration.
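Those “best options” are what optimization researchers call a Pareto front: the configurations for which no other trial is at least as cheap and at least as fast. A minimal sketch of the idea, using made-up (cost, duration) pairs rather than the actual trial data:

```python
def pareto_front(trials):
    """Keep only non-dominated points: no other trial has lower-or-equal
    cost AND lower-or-equal duration while differing in at least one."""
    def dominates(a, b):
        return a[0] <= b[0] and a[1] <= b[1] and a != b
    return [t for t in trials if not any(dominates(o, t) for o in trials)]

# Hypothetical (cost $/month, duration s) trial results, not the real data.
trials = [(70, 900), (80, 700), (90, 600), (85, 800), (95, 950)]
front = pareto_front(trials)  # (85, 800) and (95, 950) are dominated
```

Every point on the front is a defensible choice; picking among them is a budget decision rather than a performance one.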
The Helm configuration had a duration of 864 seconds and a cost of $85 per month. Point A has a duration of 628 seconds and a cost of $76 per month. That’s a 10% reduction in cost with a 27% boost in performance. Point B has a duration of 720 seconds and a cost of $81 per month, meaning for almost no change in cost you get a 17% improvement in performance. If you are willing to increase your budget a bit, Point C costs $92/month and gives you a 32% improvement in performance (589 seconds). Not bad options to have without ever changing a configuration yourself.
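As a quick back-of-the-envelope check, the performance percentages follow directly from the durations quoted above (the rounding here is ours):

```python
def pct_faster(baseline_s, new_s):
    """Percent reduction in duration relative to the baseline run."""
    return round(100 * (baseline_s - new_s) / baseline_s)

helm = 864  # seconds for the Helm-default configuration

print(pct_faster(helm, 628))  # Point A -> 27
print(pct_faster(helm, 720))  # Point B -> 17
print(pct_faster(helm, 589))  # Point C -> 32
```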