How ProphetStor Uses Reinforcement Learning for Optimizing Resource Management in Kubernetes

Kubernetes has become the de facto standard orchestration platform for automating the deployment, scaling, and operation of application containers across clusters of hosts in a cloud environment. However, most application workloads running on Kubernetes are highly dynamic, and enterprises have limited knowledge of the resources (CPU, memory, etc.) needed to support each of these workloads. Consequently, enterprises often over-provision resources to ensure application performance and service levels, or sometimes under-provision them, leading to application outages. This makes resource management in Kubernetes a critical and challenging problem. Federator.ai, the flagship product of ProphetStor, is an AI-enabled solution providing resource and performance optimization for applications running on Kubernetes.  In this white paper, we explain the essential ideas underlying the algorithms used in Federator.ai, based on the theoretical framework of reinforcement learning.  This framework allows us to address the challenging problem of resource management for different applications and system environments with the same technical approach, not limited to Kubernetes.

Reinforcement Learning

One of the best-known applications of reinforcement learning in recent years is the design of the computer program AlphaGo (and its subsequent versions) by Google’s DeepMind Technologies team, which plays the board game Go and beat the top-ranked human player in the world in 2017.  By using a similar reinforcement learning framework, DeepMind’s machine learning solution has been applied to Google’s data centers and managed to reduce the amount of energy used for cooling by up to 40 percent.

As an academic subject, reinforcement learning has been studied in many disciplines under different names in the past. The key ideas can be traced back to the 1950s, when R.E. Bellman studied the problem of dynamic programming and came up with the Bellman Equation, together with an iterative algorithm for solving it that provides a solution to the problem.
Figure 1: A general reinforcement learning framework
The objective of reinforcement learning is to design a sequence of actions based on the observations of a system, so that as a result of the actions, the outcomes of the system are optimized (with respect to certain metrics).

The general framework is illustrated in Figure 1 above, and involves 1) an Agent, and 2) an Environment, in which at every time step t:
1) Agent
      a) observes the state St
      b) executes an action At
      c) receives a reward Rt
2) Environment
      a) receives an action At
      b) gives out a reward Rt+1
      c) results in a state St+1
For each state St with action At, the state St+1 at the next time step can be a single, deterministic value, or St+1 could take multiple possible values, in which case the transition from state St to St+1 under action At is a random event with an underlying distribution defining the transition probability.
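Such a stochastic transition can be sketched as sampling from an explicit transition-probability table; the states, actions, and probabilities below are invented purely for illustration:

```python
import random

random.seed(1)

# Hypothetical transition-probability table:
# P[(state, action)] -> list of (next_state, probability) pairs
P = {
    ("low_load", "scale_down"): [("low_load", 0.8), ("high_load", 0.2)],
    ("low_load", "scale_up"):   [("low_load", 0.95), ("high_load", 0.05)],
}

def next_state(state, action):
    # Sample the next state according to the transition probabilities.
    outcomes = P[(state, action)]
    states, probs = zip(*outcomes)
    return random.choices(states, weights=probs)[0]

print(next_state("low_load", "scale_down"))
```

When every (state, action) pair maps to exactly one next state with probability 1, the process reduces to the deterministic case.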

The objective is to determine an optimal policy, i.e. a sequence of actions At that maximizes the cumulative rewards (weighted sum of Rt for all t), called the value function. In practice, it is sufficient to find a good enough policy that could approximate closely the maximum of the value function.
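The weighted sum of rewards is commonly computed as a discounted return, where a discount factor gamma < 1 weights near-term rewards more heavily; the sketch below shows one common discounting scheme, not necessarily the one used in any particular product:

```python
# Discounted cumulative reward (return) for a finite reward sequence:
# G = R0 + gamma * R1 + gamma^2 * R2 + ...
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    # Accumulate from the last reward backwards so each step
    # applies one more factor of gamma to the running total.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0]))  # ≈ 2.71 (1 + 0.9 + 0.81)
```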

Surprisingly, many seemingly different problems can all be addressed successfully by the above reinforcement learning framework. These include many computer games, control of cooling power in data centers (as in Google’s), navigating an autonomous vehicle, and many other robotic applications.  Each of these problems can be formulated mathematically as a Markov Decision Process (MDP), and there are well-known algorithms to solve or approximate the solutions of MDPs. In order to apply reinforcement learning:
  1. First formulate your problem by defining the possible actions At and the observable states St.
  2. Come up with the reward function Rt, so that when your designed sequence of actions (policy) maximizes the cumulative rewards, it will lead to the desired outcome.
Once your problem is formulated with a reinforcement learning framework, there are well-studied algorithms that provide solutions to your problem, including Monte-Carlo, Temporal-Difference, SARSA, Q-learning, etc. All these algorithms involve an update of the estimated cumulative rewards (the value function) at each time step t based on At and St, and then use this update to guide the policy and determine action At+1 at the next time step. We will not go into any detail about how these algorithms work.  Under a general set of assumptions about the problem, all of these algorithms can be proven mathematically to determine (or approximate closely) the optimal solution.
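To illustrate the update step these algorithms share, here is a minimal tabular Q-learning sketch on a hypothetical two-state MDP; the MDP, learning rate, and step count are all invented for illustration:

```python
import random
from collections import defaultdict

random.seed(0)

# Hypothetical toy MDP: taking action 1 in state 0 moves to state 1
# with reward 1; any other (state, action) pair returns to state 0
# with reward 0.
def step(state, action):
    if state == 0 and action == 1:
        return 1, 1.0
    return 0, 0.0

Q = defaultdict(float)          # Q[(state, action)] -> estimated value
alpha, gamma, eps = 0.5, 0.9, 0.1
state = 0
for _ in range(500):
    # Epsilon-greedy policy: mostly exploit the current estimate,
    # occasionally explore a random action.
    if random.random() < eps:
        action = random.choice([0, 1])
    else:
        action = max([0, 1], key=lambda a: Q[(state, a)])
    s_next, reward = step(state, action)
    # Core Q-learning update:
    # Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(Q[(s_next, a)] for a in [0, 1])
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    state = s_next

# After training, action 1 should look better than action 0 in state 0.
print(Q[(0, 1)] > Q[(0, 0)])
```

Monte-Carlo, Temporal-Difference, and SARSA differ mainly in which target value replaces the `reward + gamma * best_next` term above.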

However, different formulations of the problem and different approximate algorithms will yield solutions with different efficiency (time to reach the solution) and computational complexity (computing resources needed to execute the algorithm), even if they all solve the same problem with the desired outcome. The key challenge is to formulate your problem correctly and select an efficient algorithm so that the solution to your problem reflects the desired outcome and requires the least amount of computing time and resources.

Formulating Resource Management as Reinforcement Learning

Resource management problems in IT systems often represent difficult online decision tasks where appropriate solutions depend on understanding the workload and the system environment.  These problems are difficult because the workloads are typically unknown and highly dynamic, and their impact on the system environments and related parameters are often hard to characterize.

Imagine an application running on a Kubernetes cluster with a highly fluctuating workload. The objective is to determine the right amount of resources, such as CPU and memory, as the workload varies over time, while maintaining the right level of application performance assurance.

Let us now formulate this resource management problem on Kubernetes into a reinforcement learning framework, as shown in Figure 2 below:
Figure 2: How Federator.ai employs reinforcement learning for resource management
The Kubernetes cluster becomes the Environment, and Federator.ai becomes the Agent. At every time step t:
  1. Federator.ai (Agent)
    1. observes the workloads, latency, CPU and memory utilization, etc. (state St)
    2. allocates the amount of resources (action At)
    3. receives a reward Rt. For example, Rt can be defined as (performance / amount of resources), i.e. a performance-cost ratio.  Performance can be defined as the application throughput or (negative) latency.
  2. Kubernetes cluster (Environment)
    1. executes the resource allocation recommended by Federator.ai (executes the action At)
    2. gives out the reward Rt+1 and results in the next state St+1 (workload and other metrics), which are observed to determine the next action At+1.
The objective is to design an algorithm (the policy) that specifies the sequence of actions At such that the performance-cost ratio (the reward) is maximized over the life of the application or over a specified time interval. As discussed earlier, there are many well-studied algorithms for solving this problem, and the challenge is to find the most efficient algorithm that determines (or closely approximates) the optimal solution quickly.
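As a rough sketch of such a reward signal, assuming throughput as the performance metric and allocated CPU as the cost; the function name and units are illustrative, not Federator.ai's actual implementation:

```python
# Performance-cost ratio as a reward: throughput achieved per unit of
# CPU allocated (names and units are illustrative).
def reward(throughput_rps, cpu_millicores):
    if cpu_millicores <= 0:
        return 0.0
    return throughput_rps / cpu_millicores

# Allocating twice the CPU for the same throughput halves the reward,
# so maximizing it pushes the agent toward the smallest allocation
# that still sustains performance.
print(reward(500.0, 250))  # 2.0
print(reward(500.0, 500))  # 1.0
```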

Kubernetes Applications and Resource Optimization

The above reinforcement learning formulation gives only a high-level view of how Federator.ai addresses the challenging problem of resource management in Kubernetes. Many details of our algorithm design have been omitted from the above discussion.

A key consideration in our algorithm design is the trade-off among a) the application performance, b) the computing resources needed to execute our algorithm, and c) the Kubernetes resources saved by applying our algorithm.  This trade-off depends heavily on our choice of the observable states St, the possible actions At, and the reward Rt.  If there are too many observable states (i.e., if the state space is too large), executing our algorithm would require too much CPU or take too long.  The choice of the reward function should reflect the system metrics to be optimized.  The set of possible actions also determines how long the algorithm takes to settle on a solution, and how close that solution is to the optimal value.
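To see why the size of the state space matters, note that in a tabular formulation it grows multiplicatively with each observed metric. A back-of-the-envelope sketch, with bin counts invented purely for illustration:

```python
# Discretizing each observed metric into coarse bins keeps a tabular
# state space tractable; every added metric multiplies the table size.
# The metrics and bin counts below are illustrative only.
bins_per_metric = {"cpu_util": 10, "mem_util": 10, "latency": 5, "workload": 20}

size = 1
for n in bins_per_metric.values():
    size *= n
print(size)  # 10 * 10 * 5 * 20 = 10000 discrete states
```

Doubling the resolution of every metric would multiply the table by 2 per metric (16x here), which is one reason the choice of observable states directly drives the computing cost of the algorithm.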

We have successfully applied reinforcement learning, taking the above trade-offs into consideration, to come up with an efficient algorithm that can optimize and auto-scale the amount of CPU and memory resources for many applications running on Kubernetes, including Kafka.

Using the Kafka application as an example: given that the producer rate (external traffic into the Kafka system) is unknown and fluctuating, our objective is to minimize the latency, or consumer lag, by allocating the right amount of consumer pod resources. The challenge is that in Kafka, increasing or decreasing the consumer pod resources triggers a consumer-group rebalance, and therefore a delay before the allocated resources take effect.
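One way to picture the rebalancing issue: a naive lag-based scaler must avoid reacting again while a rebalance is still settling. The controller below is a hypothetical illustration with an invented capacity model and cooldown length, not ProphetStor's algorithm:

```python
# Hypothetical lag-based replica recommendation with a cooldown that
# models the Kafka rebalancing delay: after a scaling action, wait a
# few intervals before acting on lag again.
def recommend_replicas(current, lag, lag_per_pod_capacity, cooldown_left):
    if cooldown_left > 0:
        # A rebalance is still settling; hold the current replica count.
        return current, cooldown_left - 1
    desired = max(1, -(-lag // lag_per_pod_capacity))  # ceiling division
    if desired != current:
        return desired, 3  # start a new cooldown (e.g. 3 intervals)
    return current, 0

replicas, cooldown = recommend_replicas(
    current=2, lag=5000, lag_per_pod_capacity=1000, cooldown_left=0)
print(replicas, cooldown)  # 5 3
```

Without the cooldown, the controller would keep reacting to lag that the previous scaling action has not yet had a chance to drain, oscillating instead of converging.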

By appropriately selecting the observable states St, the allowable set of actions At, and the reward Rt, we have designed an algorithm that effectively auto-scales the resources for Kafka and general applications on Kubernetes.  For Kafka, our algorithm can reduce the latency by up to 90% compared with the default horizontal pod autoscaling (HPA) algorithm in native Kubernetes.