A GPU-based Supercomputer Sees 60% Increase in Utilization with Federator.ai

Overview

A national-level supercomputing center in Taiwan operates large-scale computing and networking facilities for use by domestic academia and the general public. A major supercomputer at the center provides the Taiwan computing cloud service through managed container services on OpenShift. ProphetStor entered into a contract with the center to deliver Federator.ai on those servers, in the first stage solely to monitor and predict GPU usage. The center is expected to standardize on Federator.ai as part of its resource optimization solution across all clouds.

Challenge: Tremendous Computing Power Waste

As a government organization, the center provides an AI development platform to support the research and development of the “digital country” for Taiwan. The center also promotes industry-university collaboration, conducts forward-looking research and development in smart technologies, big data, and artificial intelligence applications, and plans to build a national AI R&D center and cloud service foundation to provide related application services.

One of the critical issues is that the AI initiatives’ workloads and their assigned GPU core resources are often mismatched, resulting in a massive waste of computing power. Resource planning and scheduling are done by guesswork or based on users’ requests, which mostly over-provision, rather than on the actual future workloads. Maximizing utilization while keeping operations cost-effective has become the central pain point, and the center needs to resolve it quickly.

Solution: Federator.ai Provides an AIOps Solution for Optimized Planning and Allocation

ProphetStor’s Federator.ai® is a patented, AI-enabled solution that provides the predictive analytics needed to operate the center’s GPU-based supercomputer effectively. The CrystalClear Time Series Analysis Engine, which handles data correlation and impact prediction, is the core of Federator.ai. It provides predictive analytics, such as event/accident correlation, anomaly detection, and impact analysis, to improve resource utilization. Major features include:

  1. Aggregate the real-time and non-real-time system record data gathered from the center’s existing GPU computing resources.
  2. Provide a maintenance management system with an API to connect to the center. The alarm messages generated by the system are fed into the current maintenance management system, including integration with Grafana and other maintenance services.
  3. Analyze the behavior of GPU containers configured by users (or tenants) with deep learning and predict the future resource usage of each workload.
  4. Provide data cleansing and ETL (extract, transform, load) mechanisms for training the CrystalClear Time Series Analysis Engine.
  5. Detect workload anomalies and raise alarms for maintenance work.
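
The idea behind features 3 and 5 (predict usage, then raise an alarm when actual usage departs from the prediction) can be illustrated with a minimal Python sketch. This is not Federator.ai’s implementation: a simple moving average stands in for the deep-learning prediction, and the function names, the sample data, and the 0.25 tolerance are all hypothetical.

```python
from statistics import mean


def moving_average_forecast(history, window=3):
    """Predict the next sample as the mean of the last `window` samples.

    Assumption: a moving average is only a stand-in for a real
    time-series model; it is not what Federator.ai uses.
    """
    if not history:
        raise ValueError("history must not be empty")
    return mean(history[-window:])


def detect_anomalies(usage, window=3, tolerance=0.25):
    """Return the indices of samples that differ from their forecast
    by more than `tolerance` (an absolute utilization fraction)."""
    anomalies = []
    for i in range(window, len(usage)):
        predicted = moving_average_forecast(usage[:i], window)
        if abs(usage[i] - predicted) > tolerance:
            anomalies.append(i)
    return anomalies


# Hypothetical GPU utilization samples (fraction of one GPU).
gpu_usage = [0.30, 0.32, 0.31, 0.33, 0.95, 0.34, 0.32]
print(detect_anomalies(gpu_usage))  # prints [4]
```

Only the spike at index 4 is flagged. In a real deployment the flagged indices would be turned into alarm messages for the maintenance system rather than printed.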

Results: Improved Operational Efficiency with Automation and Intelligence

Beyond giving a broader and deeper understanding of the entire data center stack, which reduces MTTR (mean time to recovery), Federator.ai can prevent incidents, enhance resource optimization, and meet future business needs. The benefits include:
  1. Save up to 60% of resources: ProphetStor Federator.ai® reduces unnecessary expenditures and improves service quality for enterprises and cloud providers. It can recommend Just-In-Time-Fitted resources for each workload so that both cost and performance are optimized.
  2. Enhance resource allocation visibility: Through the graphical interface provided by Federator.ai, users can view predictions of future GPU usage, which helps customers plan and adjust correctly.
  3. Improve operational efficiency by detecting anomalies: Resource usage in operation is continuously checked against the predicted and assigned resources to catch any out-of-the-ordinary usage and ensure that operational efficiency is maintained.
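
As a rough illustration of the first benefit, the sketch below shows how an allocation fitted to predicted demand can be compared with what a user requested. It is a simplified assumption, not the Just-In-Time-Fitted algorithm itself: the 20% headroom, the function names, and the example numbers are hypothetical.

```python
import math


def recommend_gpus(requested, predicted_peak, headroom=0.2):
    """Recommend a GPU count: predicted peak demand plus headroom,
    rounded up, at least 1 and never more than the user requested."""
    needed = math.ceil(predicted_peak * (1 + headroom))
    return max(1, min(requested, needed))


def saved_fraction(requested, recommended):
    """Fraction of the requested GPUs freed by the recommendation."""
    return (requested - recommended) / requested


# Hypothetical tenant: requested 8 GPUs, predicted peak demand of 2.6 GPUs.
recommended = recommend_gpus(requested=8, predicted_peak=2.6)
print(recommended)                     # prints 4
print(saved_fraction(8, recommended))  # prints 0.5
```

Capping the recommendation at the requested amount keeps the sketch conservative: it only ever frees over-provisioned GPUs and never grants more than the tenant asked for.
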
The resource planning provided by Federator.ai helps the data center make smart decisions on allocation and billing, resulting in a more than 60% increase in utilization, while also helping operations align with Green IT and ESG compliance trends. The GPU-based supercomputer now runs smoothly and serves far more users than initially planned. ProphetStor continues to work with the center on its future cloud resource planning needs.
