Challenge: Tremendous Computing Power Waste
As an official government organization, the center provides an AI development platform to support research and development for Taiwan’s “digital country” initiative. The center also promotes industry-university collaboration, conducts forward-looking research and development in smart technologies, big data, and artificial intelligence applications, and plans to build a national AI R&D center and cloud service foundation to provide related application services.
One critical issue is that the workloads of AI initiatives and the GPU resources assigned to them are often mismatched, which wastes a massive amount of computing power. Resource planning and scheduling are done by guesswork or based on users’ requests, which mostly over-provision, rather than on the actual future workloads. Maximizing utilization while keeping operation cost-effective has become the center’s main pain point, and one it needs to resolve quickly.
Solution: Federator.ai Provides an AIOps Solution for Optimized Planning and Allocation
ProphetStor’s Federator.ai® is a patented, AI-enabled solution that provides the predictive analytics needed to operate the center’s GPU-based supercomputer effectively. Its core is the CrystalClear Time Series Analysis Engine, which handles data correlation and impact prediction and delivers predictive analytics such as event/accident correlation, anomaly detection, and impact analysis to improve resource utilization. Major features include:
- Aggregate the real-time and non-real-time system records gathered from the center’s existing GPU computing resources.
- Provide a maintenance management system with an API that connects to the center. Alarm messages generated by the system are routed to the center’s current maintenance management system, including integration with Grafana and other maintenance services.
- Analyze the behavior of GPU containers configured by users (or tenants) with deep learning and predict each workload’s future resource usage.
- Provide data cleansing and ETL (extract, transform, load) mechanisms for training the CrystalClear Time Series Analysis Engine.
- Detect workload anomalies and raise alarms for maintenance work.
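To make the cleansing, prediction, and anomaly-detection features above more concrete, the following is a minimal Python sketch. It is not Federator.ai’s implementation: the source states that the product uses deep learning, whereas this sketch swaps in simple exponential smoothing and a standard-deviation threshold as stand-ins. All function names, parameters, and sample values are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def cleanse(samples):
    """Replace missing (None) or out-of-range GPU utilization readings
    (valid range assumed to be 0-100 percent) with the last valid value.
    Leading invalid readings are dropped because there is nothing to carry forward."""
    cleaned, last = [], None
    for value in samples:
        if value is None or not 0.0 <= value <= 100.0:
            value = last  # carry forward the last valid reading, if any
        if value is not None:
            cleaned.append(value)
            last = value
    return cleaned


def forecast_next(history, alpha=0.5):
    """One-step-ahead forecast using simple exponential smoothing.
    This is a stand-in for the deep learning model the product uses."""
    level = history[0]
    for value in history[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


def is_anomaly(history, observed, threshold=3.0):
    """Flag an observation that lies more than `threshold` standard
    deviations away from the mean of the cleansed history."""
    spread = pstdev(history)
    if spread == 0:
        return observed != history[0]
    return abs(observed - mean(history)) / spread > threshold


# Illustrative readings (percent GPU utilization), not real measurements.
raw = [None, 40.0, None, 150.0, 44.0]
history = cleanse(raw)               # [40.0, 40.0, 40.0, 44.0]
prediction = forecast_next(history)  # 42.0
print(history, prediction, is_anomaly(history, 95.0))
```

In a real deployment the alarm decision would presumably be sent to the maintenance management system rather than printed; that integration is outside the scope of this sketch.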
Results: Improved Operational Efficiency with Automation and Intelligence
- Save up to 60% of resources: ProphetStor Federator.ai® reduces unnecessary expenditures and improves service quality for enterprises and cloud providers. It recommends Just-In-Time-Fitted resources for each workload so that both cost and performance are optimized.
- Enhance resource allocation visibility: Users can view predicted future GPU usage through the graphical interface provided by Federator.ai, which helps them plan and adjust allocations correctly.
- Improve operational efficiency by detecting anomalies: Resource usage in operation is continuously checked against the predicted and assigned resources to catch any out-of-the-ordinary use and ensure that operational efficiency is maintained.
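The right-sizing and usage-checking ideas in the results above can be sketched as follows. This is an illustrative assumption of how such logic might look, not the product’s algorithm: the `Workload` class, the 20% headroom, the 25% tolerance, and the sample numbers are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    assigned_gpus: int          # what the user requested and was given
    predicted_peak_gpus: float  # forecast peak demand, in GPUs


def recommend_gpus(workload, headroom=0.2):
    """Recommend the predicted peak plus headroom, rounded up, at least 1 GPU."""
    return max(1, math.ceil(workload.predicted_peak_gpus * (1 + headroom)))


def savings_ratio(workloads, headroom=0.2):
    """Fraction of assigned GPUs freed if every workload were resized.
    A negative value would mean the workloads are under-provisioned overall."""
    assigned = sum(w.assigned_gpus for w in workloads)
    recommended = sum(recommend_gpus(w, headroom) for w in workloads)
    return (assigned - recommended) / assigned


def usage_status(actual, predicted, assigned, tolerance=0.25):
    """Compare live usage against both the assigned and the predicted resources."""
    if actual > assigned:
        return "over_assigned"
    if predicted > 0 and abs(actual - predicted) / predicted > tolerance:
        return "deviates_from_prediction"
    return "ok"


# Made-up example workloads; these are not figures from the center.
jobs = [Workload("train-a", 8, 3.0), Workload("infer-b", 4, 2.0)]
for job in jobs:
    print(job.name, job.assigned_gpus, "->", recommend_gpus(job))
print(f"savings: {savings_ratio(jobs):.1%}")
print(usage_status(actual=1.0, predicted=3.0, assigned=4))
```

The headroom parameter reflects a design trade-off: a larger value lowers the risk of starving a workload when the forecast is too low, at the cost of smaller savings.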