High Performance Computing

Use Case Descriptions

Integration Environment
HPC cluster for data processing with over 250 nodes.
Customer / Partner Types
China National Petroleum Corporation (CNPC). Oil and gas industry, land exploration and surveying data processing.
Company and Solution Background
A Fortune 500 company, one of the largest energy companies in the world, relies heavily on the collection and analysis of geophysical data for oil exploration. Recognizing the importance of natural resources, it centers its mission on sustainable development and environmental protection while providing quality geophysical products and services. Its operations are powered by an HPC data center with over 3000 servers, each connected to direct-attached storage (DAS).

Requirements and Challenges

Geophysical data is processed in parallel across all servers selected for a job in the data center. Hundreds or thousands of nodes need to work uninterrupted, and even one node failure can force some or all of a job's tasks to be reloaded. Hardware failures, especially disk failures, are unavoidable in large-scale, high-density clusters because of the intensive data access during computation. To minimize disruption from hardware failures, the company can rely only on new and abundant hardware to process its jobs. These selection criteria result in more than 30% waste in hardware utilization.


Solution Benefits

With Federator.ai® disk failure prediction, the HPC data center reliably selects qualified hardware for any job without delaying service delivery. Integration with task schedulers prevents jobs from being loaded onto risky nodes before a task starts, protecting the health of the entire cluster for the lifespan of the job. Data center operators perform hardware maintenance between jobs to prepare servers for upcoming tasks. Federator.ai® also keeps performance metrics for all hosts and disks, which can be used to track unusual performance patterns at any point in time.
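
The scheduler integration described above can be illustrated with a minimal sketch. It is not Federator.ai's actual API or the customer's scheduler code: the `NodeHealth` record, the `select_nodes` and `maintenance_queue` functions, the per-disk risk scores, and the 0.05 threshold are all hypothetical names and values chosen for illustration. The idea is simply that nodes whose predicted disk failure risk exceeds a threshold are kept out of the job and queued for maintenance instead.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class NodeHealth:
    """Hypothetical health record for one compute node."""
    node: str
    # Assumed input: predicted failure probability per disk for the job window.
    disk_failure_risk: dict[str, float] = field(default_factory=dict)

    @property
    def worst_risk(self) -> float:
        # A node is only as healthy as its riskiest disk.
        return max(self.disk_failure_risk.values(), default=0.0)


def select_nodes(candidates: list[NodeHealth], needed: int,
                 max_risk: float = 0.05) -> list[str]:
    """Pick the `needed` lowest-risk nodes whose worst disk is under `max_risk`."""
    healthy = [n for n in candidates if n.worst_risk <= max_risk]
    if len(healthy) < needed:
        raise RuntimeError(
            f"only {len(healthy)} healthy nodes available, {needed} requested"
        )
    healthy.sort(key=lambda n: n.worst_risk)
    return [n.node for n in healthy[:needed]]


def maintenance_queue(candidates: list[NodeHealth],
                      max_risk: float = 0.05) -> list[str]:
    """Nodes excluded from scheduling, to be serviced between jobs."""
    return sorted(n.node for n in candidates if n.worst_risk > max_risk)
```

In a real deployment the threshold and the risk scores would come from the prediction service and the operator's own policy. The sketch only shows where the filtering step sits: before the scheduler commits nodes to a job.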


Shorten data processing time by more than 30% by eliminating task reloading


Leverage aged hardware through accurate disk failure predictions. No more swapping out aged but healthy hardware


Save money by reducing redundancy, so that other nodes can be used for active production tasks


Simplify hardware management and maintenance by transforming unexpected failures into planned events