...

ProphetStor’s CrystalClear Time Series Analysis Engine: Analytical Excellence Is All about Speed

Introduction

Deploying applications on a serverless cloud platform or on a Kubernetes system in the cloud has become a popular option. Figure 1 illustrates an example microservice architecture. Typically, web requests sent to a web server are forwarded to the microservices behind it, where they trigger a series of communications between related microservices. Because the number of web requests to the web server represents the workload of the entire application, we call it the Primary Workload. As shown in Figure 1, the microservices include stateless microservices such as web servers and stateful microservices such as databases. The communications between them are complex and take several forms: inter-process communications, remote invocations, and indirect communications [1]. A web server (MS-A) sending data to the backend database (MS-B) is an example of inter-process communication. The two-way communication between the downstream microservice (MS-C) and its upstream microservice (MS-D) is an example of remote invocation, and the traffic from message consumers (MS-F and MS-G) consuming messages from a message queue (MS-E) is an example of indirect communication. These communication behaviors introduce a variety of workload patterns, especially on a serverless cloud platform [2].
Figure 1. An Illustration for a General Microservice Architecture 

Prediction-based Resource Management

In typical resource management for an application with many microservices deployed in a Kubernetes system, the goal is to allocate the right amount of resources to every microservice, so that performance does not suffer from resource constraints and, at the same time, no money is wasted on excessive allocations. Using predictions for the resource management of microservice-based applications has several benefits. First, predictions based on past workload patterns give a better forecast of the resource usage patterns of microservices, showing how many resources a microservice will use and when. Second, workload prediction is very useful for autoscaling stateless microservices. For example, the system can predict the workload of a stateless microservice every few minutes and proactively scale the microservice in time for the upcoming workload.

There are some factors to consider when adopting prediction-based resource management for applications deployed in a Kubernetes environment. For example, generating 100 workload predictions every minute requires an average runtime of about 0.6 s per prediction (60 s / 100 predictions), and the predictions should add only a minor computation cost. In addition, microservices in Kubernetes may restart, fail, or be terminated for various reasons, which causes unusual patterns such as change points or unpredictable events. Therefore, prediction-based resource management in a Kubernetes system must handle these unusual patterns to produce good results.
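The 0.6 s figure is a simple time budget: the length of the prediction window divided by the number of predictions that must finish inside it. The sketch below reproduces that arithmetic. The function name and the `workers` parameter are illustrative assumptions, not part of the engine, and the calculation assumes predictions run one after another on each worker.

```python
def per_prediction_budget(window_seconds: float, num_predictions: int, workers: int = 1) -> float:
    """Average time available per prediction when `num_predictions` must
    complete within `window_seconds`, split evenly across `workers`
    (hypothetical parameter; each worker runs its predictions sequentially)."""
    if num_predictions <= 0 or workers <= 0:
        raise ValueError("num_predictions and workers must be positive")
    return window_seconds * workers / num_predictions


# 100 predictions every minute on a single worker, as in the text.
budget = per_prediction_budget(window_seconds=60, num_predictions=100)
print(f"{budget:.1f} s per prediction")
```

Any model-fitting time has to fit inside the same budget, which is why the per-prediction runtime matters as much as the accuracy.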

Figure 2 shows the prediction steps of traditional time series schemes. Modeling time series data involves four steps: model identification, model estimation, residual diagnostic checking, and forecasting [3]. To generate a forecast for a microservice workload, users must first preprocess missing values and identify the model orders for the workload metric. Next, they estimate the model parameters, find a suitable model for prediction, and use the residual diagnostic checking step to verify whether the model is overfitting. Finally, they generate predictions for the microservice workload.
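To make the four steps concrete, here is a minimal sketch using a plain autoregressive (AR) model fitted by least squares with NumPy. It is only an illustration of the traditional workflow, not the method of any scheme discussed here. All function names are hypothetical, and the AIC-based order selection and the lag-1 residual check are simplified stand-ins for the fuller procedures a practitioner would use.

```python
import numpy as np

def fill_missing(series):
    """Preprocessing: linearly interpolate NaN gaps."""
    y = np.asarray(series, dtype=float)
    idx = np.arange(len(y))
    known = ~np.isnan(y)
    return np.interp(idx, idx[known], y[known])

def fit_ar(series, order):
    """Model estimation: least-squares fit of y[t] = c + sum_k a[k] * y[t-1-k]."""
    y = np.asarray(series, dtype=float)
    n = len(y)
    lags = [y[order - 1 - k : n - 1 - k] for k in range(order)]
    X = np.column_stack([np.ones(n - order)] + lags)
    target = y[order:]
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef, target - X @ coef

def select_order(series, max_order=5):
    """Model identification: pick the AR order with the lowest (simplified) AIC."""
    best_aic, best_order = None, 1
    for p in range(1, max_order + 1):
        _, res = fit_ar(series, p)
        aic = len(res) * np.log(np.mean(res ** 2) + 1e-12) + 2 * (p + 1)
        if best_aic is None or aic < best_aic:
            best_aic, best_order = aic, p
    return best_order

def residuals_look_white(residuals):
    """Residual diagnostic: lag-1 autocorrelation should be close to zero."""
    r = residuals - residuals.mean()
    denom = float(np.dot(r, r))
    if denom == 0.0:
        return True
    r1 = float(np.dot(r[1:], r[:-1])) / denom
    return abs(r1) < 2 / np.sqrt(len(r))

def forecast_ar(series, coef, steps):
    """Forecasting: iterate the fitted recursion forward."""
    order = len(coef) - 1
    history = list(np.asarray(series, dtype=float)[-order:])
    out = []
    for _ in range(steps):
        recent_first = history[::-1][:order]
        nxt = float(coef[0] + np.dot(coef[1:], recent_first))
        out.append(nxt)
        history.append(nxt)
    return np.array(out)

# Synthetic workload with a daily-like cycle, noise, and a few missing samples.
rng = np.random.default_rng(0)
t = np.arange(240)
workload = 100 + 20 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 2, t.size)
workload[[50, 51, 120]] = np.nan

clean = fill_missing(workload)
order = select_order(clean)
coef, residuals = fit_ar(clean, order)
print("order:", order, "| residuals look white:", residuals_look_white(residuals))
print("next 5 steps:", np.round(forecast_ar(clean, coef, 5), 1))
```

Each of these steps needs a decision per metric, which is what makes the traditional workflow hard to repeat for hundreds of metrics.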

Figure 2. Traditional Time Series Schemes

FBProphet [4] significantly simplifies the process of modeling time series data. With parameters similar to the example in Figure 4-a, users can easily generate predictions and then visually inspect and adjust the forecasts. Greykite [5][6] integrates more regressors and adaptable functions to offer more flexible capabilities, such as forecasting and anomaly detection. When users input time series data, Greykite analyzes the workload features (i.e., seasonality, trend growth, holidays, or recurring events), detects change points by considering trend and seasonality, finds a fitted model using machine learning (ML) techniques, and generates predictions for a specific microservice workload. However, users still need to visually inspect the features of microservice workloads over a period of time and manually adjust the model parameters to match the application requirements. Therefore, these schemes might not be practical for automatically generating hundreds of workload predictions in a typical Kubernetes system.

Predictions Based on Cross-correlations

Figure 1 shows that the communications among microservices form a communication graph, which illustrates the inter-dependency among these microservices. It is easy to see that the resource usage of these microservices is affected by changes in the primary workload, that is, the web requests from users. Here we propose a novel concept of prediction based on application correlations. More specifically, the prediction is based on the cross-correlation between the workload patterns of the microservices and the application’s primary workload. The proposed prediction algorithm has been implemented in ProphetStor’s CrystalClear Time Series Analysis Engine. Figure 3 illustrates the primary construct of the proposed prediction algorithm. Given the primary workload metric and the resource usage metrics of the microservices as input, the algorithm uses cross-correlations to analyze the dependencies between the primary workload and these microservice metrics. When a resource usage metric has a high correlation with the primary workload metric, the algorithm chooses a fitted model and generates resource usage predictions based on the correlation and the features of the primary workload metric. When a resource usage metric has a medium or low correlation with the primary workload metric, the system analyzes the application’s features (such as trend, seasonality, change points, and events) and builds a suitable model for generating predictions.

Figure 3. The Architecture Diagram for ProphetStor’s CrystalClear Time Series Analysis Engine
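The routing idea described above can be sketched as follows. This is an illustrative approximation under stated assumptions, not the engine's actual implementation. The function names are hypothetical. The 0.7 threshold is the one the article uses later to label a correlation as high. The linear mapping from the primary workload and the mean-based fallback are simple stand-ins for the fitted models, which the article does not specify.

```python
import numpy as np

HIGH_CORRELATION = 0.7  # threshold the article uses for a "high" cross-correlation

def best_lagged_correlation(primary, metric, max_lag=10):
    """Return (r, lag) with the largest |Pearson r| between metric[t] and primary[t - lag]."""
    primary = np.asarray(primary, dtype=float)
    metric = np.asarray(metric, dtype=float)
    best_r, best_lag = 0.0, 0
    for lag in range(max_lag + 1):
        x = primary[: len(primary) - lag]
        y = metric[lag:]
        if x.std() == 0 or y.std() == 0:
            continue  # correlation is undefined for a constant series
        r = float(np.corrcoef(x, y)[0, 1])
        if abs(r) > abs(best_r):
            best_r, best_lag = r, lag
    return best_r, best_lag

def predict_metric(primary, metric, primary_forecast, max_lag=10):
    """Route a metric to a correlation-driven or a standalone prediction."""
    primary = np.asarray(primary, dtype=float)
    metric = np.asarray(metric, dtype=float)
    primary_forecast = np.asarray(primary_forecast, dtype=float)
    n, horizon = len(primary), len(primary_forecast)
    r, lag = best_lagged_correlation(primary, metric, max_lag)
    if abs(r) >= HIGH_CORRELATION:
        # Assumption: a linear map from the lagged primary workload to the metric.
        slope, intercept = np.polyfit(primary[: n - lag], metric[lag:], 1)
        extended = np.concatenate([primary, primary_forecast])
        drivers = extended[n - lag : n - lag + horizon]  # primary[t - lag] for future t
        return "correlated", intercept + slope * drivers
    # Placeholder for the feature-based model (trend, seasonality, change points, events).
    return "standalone", np.full(horizon, metric[-24:].mean())

# Toy data: one metric follows the primary workload with a 3-step delay.
t = np.arange(212)
signal = 100 + 50 * np.sin(2 * np.pi * t / 24)
primary_history, primary_future = signal[:200], signal[200:]
cpu_usage = 2 * np.roll(primary_history, 3) + 5
kind, prediction = predict_metric(primary_history, cpu_usage, primary_future)
print(kind, np.round(prediction[:3], 1))
```

The design point is that highly correlated metrics reuse the work already done on the primary workload, so only the remaining metrics need a model of their own.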

Test Setups

The experiments are executed on a system equipped with 32 GB of RAM and an Intel Xeon CPU E5-2640 @ 2.40 GHz. The CPU has one core and one thread. We compare the prediction accuracy of the proposed algorithm with that of Facebook Prophet [4] and LinkedIn Greykite [5][6], using datasets whose time series have different correlations with the primary workload. Figures 4-a and 4-b show examples of running Facebook Prophet and LinkedIn Greykite. Prediction accuracy is measured by Mean Absolute Percentage Error (MAPE).
Figure 4-a. An Example of Running FBProphet 
Figure 4-b. An Example of Running Greykite
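For reference, MAPE is the mean of |actual - predicted| / |actual|, expressed as a percentage. The sketch below is a minimal version of that metric. Skipping points where the actual value is zero is an assumption made here so the ratio stays defined. The article does not say how zero values are handled.

```python
import numpy as np

def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent.

    Points where the actual value is zero are skipped (an assumption of
    this sketch), because the percentage error is undefined there.
    """
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mask = actual != 0
    if not mask.any():
        raise ValueError("MAPE is undefined when every actual value is zero")
    errors = np.abs((actual[mask] - predicted[mask]) / actual[mask])
    return float(errors.mean() * 100)


# Errors of 10%, 10%, and 0% average to about 6.67%.
print(round(mape([100, 200, 400], [110, 180, 400]), 2))
```

How zero values are treated matters most for sparse series, where many intervals have no activity at all.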

We use the following datasets to validate the accuracy of our proposed algorithm:

1. Azure Function Traces in 2019 (called Azure2019 Dataset here).

    • We use 3446 time series from the traces, each recording the invocations per function over two weeks [7]. Each owner has several applications and functions, which are triggered by different trigger types: HTTP, Timer, Event, Queue, Storage, Orchestration, and Others, as shown in Figures 5-a and 5-c.
    • We choose the time series with the maximum invocations as the primary workload time series, as shown in Figure 5-a.

2. Microservices traces of Alibaba production clusters in 2021 (called Alibaba2021 Dataset here).

    • We use 3899 time series, including CPU, memory, and consumer metrics from 1303 microservices over 12 hours [8] in a production cluster, as shown in Figures 5-b and 5-d.
    • We choose the CPU time series of the microservice with the highest CPU usage as the primary workload time series, as shown in Figure 5-b.

3. Figures 5-e and 5-f show the percentage of time series data with high cross-correlations between the primary workload time series and other time-series data in Azure2019 Dataset and Alibaba2021 Dataset, respectively.

    • For Azure2019 Dataset, as shown in Figure 5-e, only 53% of the time series have high cross-correlations (>=0.7) with the primary workload, and 47% have medium or low cross-correlations (<0.7). This is because the time series come from serverless functions that belong to different owners.
    • For Alibaba2021 Dataset, as shown in Figure 5-f, almost 95% of the time series have high cross-correlations (>=0.7) with the primary workload, and 5% have medium or low cross-correlations (<0.7). The reason is that these microservices all work in the same production system and form a communication graph with strong dependencies.

4. Figures 5-g and 5-h show the characteristics of the time series data in the two datasets.

    • For Azure2019 Dataset, as shown in Figure 5-g, about 52% of the time series are discrete, because users simply upload the code of their functions to the cloud and the functions run only when triggered by events, such as receiving a message or a timer going off.
    • For Alibaba2021 Dataset, as shown in Figure 5-h, only 0.5% of the time series are discrete, and most are continuous. Discrete time series bring multiple change points or unpredicted events, which makes prediction more difficult and reduces prediction accuracy.
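The dataset preparation described in the list above (choosing a primary workload and measuring how many series correlate highly with it) can be sketched as follows. The function names and the toy data are hypothetical. The sketch assumes that "maximum invocations" and "highest CPU usage" mean the largest total over the trace. It uses zero-lag Pearson correlation for brevity, which may differ from the cross-correlation measure used in the experiments.

```python
import numpy as np

def pick_primary(series_by_name):
    """Name of the series with the largest total (assumed meaning of 'maximum')."""
    return max(series_by_name, key=lambda name: float(np.sum(series_by_name[name])))

def high_correlation_share(series_by_name, primary_name, threshold=0.7):
    """Fraction of the other series whose |Pearson r| with the primary is >= threshold."""
    primary = np.asarray(series_by_name[primary_name], dtype=float)
    others = [np.asarray(values, dtype=float)
              for name, values in series_by_name.items() if name != primary_name]
    if not others:
        return 0.0
    high = 0
    for series in others:
        if primary.std() == 0 or series.std() == 0:
            continue  # constant series: correlation undefined, counted as not high
        if abs(np.corrcoef(primary, series)[0, 1]) >= threshold:
            high += 1
    return high / len(others)


# Toy traces standing in for per-function invocation counts.
traces = {
    "fn-a": [10, 20, 30, 40],
    "fn-b": [1, 2, 3, 5],
    "fn-c": [5, 1, 4, 2],
}
primary_name = pick_primary(traces)
share = high_correlation_share(traces, primary_name)
print(primary_name, f"{share:.0%} of other series are highly correlated")
```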
Figure 5-a
Figure 5-b
Figure 5-c
Figure 5-d