Google, AWS, and Azure Are Done Letting Vendors Break Their AI Servers

Written by hacker8143714 | Published 2025/04/17
Tech Story Tags: ai | ai-hardware-diagnostics | cloud-gpu-health-checks | predictive-hardware-failures | gpu-telemetry-monitoring | automated-server-repair | gpu-thermal-anomaly-check | hardware-risk-scoring-ai

TL;DR: Cloud companies can't afford to treat AI hardware like traditional infrastructure. The new game is proactive, AI-powered diagnostics: run in-house, at scale, and in real time.

With the recent boom in AI, the footprint of AI workloads and the AI-capable server hardware deployed in cloud data centers has grown exponentially, spread across many regions and data centers worldwide. To support this growth and to stay ahead of the competition, cloud providers such as Azure, AWS, and GCP have started building fleets of specialized high-performance computing servers. AI workloads that perform massive amounts of data processing, model training, and inference require a special kind of hardware, unlike traditional general-purpose compute servers. Hence, all cloud service providers are investing heavily in GPU-, TPU-, and NPU-based servers that are effective at hosting AI workloads.

The majority of these servers follow a Buy model, leaving cloud service providers dependent on the Original Equipment Manufacturer (OEM) for hardware diagnostics and maintenance. This dependency has caused a lot of pain for cloud providers: repair SLAs are uncertain and expensive, impacting fleet availability. Hence, cloud providers are shifting from Buy to Make (from OEM-managed server maintenance to in-house server maintenance). This shift in business model has driven a transition in the data-center service model from OEM-reliant to self-maintainer. To support this self-reliance and the growth of the AI hardware fleet, every cloud service provider is aiming to reduce the cost of service and build swift, remote, accurate, automated, and economical hardware diagnostics.

Why Hardware Diagnostics Matter for AI

AI workloads are unique in nature: they require parallel processing and compute-intensive hardware that is reliable and stable. However, hardware components often fail, and at times without notice. A single degraded GPU or memory failure can derail hours of training or crash real-time inference endpoints. Some common hardware-related issues impacting AI workloads include:

  • GPU memory errors (ECC failures, tray issues)
  • GPU thermal throttling
  • GPU InfiniBand failures
  • CPU IERRs and uncorrectable errors

So, to meet customer needs for highly available, uninterrupted service, cloud providers need accurate hardware diagnostics that pinpoint the faulty component.

A hardware diagnostics engine for AI hardware can be broken down into the following components:

1) Telemetry collection layer: This layer focuses on collecting real-time hardware telemetry from various components:

  • GPU drivers
  • Firmware versions and error logs (BMC, BIOS)
  • On-node data (temperature, utilization, power draw)
  • OS-level counters (oom-kill, system crashes, dmesg logs)

The platform will use cloud-based agents to collect hardware telemetry and publish it to a centralized location, as sketched below.
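As a rough illustration, here is a minimal Python sketch of a node-level agent that samples GPU telemetry via nvidia-smi and publishes it to a central collector. The collector endpoint, the exact query fields (which vary by driver version), and the publish cadence are assumptions for illustration, not a description of any specific provider's agent.

```python
import json
import socket
import subprocess
import time
import urllib.request

# Hypothetical central collector endpoint (illustrative only).
COLLECTOR_URL = "https://telemetry.example.internal/ingest"

# nvidia-smi query fields; availability varies by driver/GPU generation.
GPU_FIELDS = ["index", "temperature.gpu", "utilization.gpu", "power.draw"]


def sample_gpu_telemetry():
    """Query per-GPU telemetry using nvidia-smi in CSV mode."""
    out = subprocess.run(
        ["nvidia-smi",
         f"--query-gpu={','.join(GPU_FIELDS)}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    samples = []
    for line in out.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        samples.append(dict(zip(GPU_FIELDS, values)))
    return samples


def publish(samples):
    """Push one telemetry batch to the central collector."""
    payload = json.dumps({
        "host": socket.gethostname(),
        "timestamp": time.time(),
        "gpus": samples,
    }).encode()
    req = urllib.request.Request(
        COLLECTOR_URL, data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)


if __name__ == "__main__":
    while True:
        publish(sample_gpu_telemetry())
        time.sleep(30)  # sampling interval is an arbitrary choice here
```

In practice, this agent would also gather the BMC/BIOS, dmesg, and OS-level counters listed above, but the collect-then-publish loop is the core pattern.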

2) Hardware risk scoring layer: This layer lets the diagnostics engine compute a hardware risk score based on hardware failure patterns. The engine samples signals such as ECC error rates over time, thermal headroom across workloads, GPU performance degradation from baseline, firmware mismatches versus the golden config, and hardware retry count per VM allocation.

Sample logic: Node_Health_Score = weighted_sum (ECC_rate, Thermal_Throttle, Firmware_Drift, Allocation_Retry)

The diagnostics engine uses the risk score to predict and mitigate hardware failures; a minimal sketch of the scoring follows.
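A minimal sketch of the weighted-sum logic above. The feature names follow the sample formula; the weights and normalization are illustrative assumptions, since in practice they would be tuned against historical failure data.

```python
from dataclasses import dataclass

# Illustrative weights; real weights would be learned from historical failures.
WEIGHTS = {
    "ecc_rate": 0.4,           # normalized ECC errors per hour
    "thermal_throttle": 0.25,  # fraction of samples spent throttling
    "firmware_drift": 0.15,    # 1.0 if firmware differs from golden config
    "allocation_retry": 0.2,   # normalized VM allocation retries
}


@dataclass
class NodeSignals:
    ecc_rate: float
    thermal_throttle: float
    firmware_drift: float
    allocation_retry: float


def node_health_score(signals: NodeSignals) -> float:
    """Weighted sum of normalized risk signals; higher means riskier."""
    return sum(
        WEIGHTS[name] * min(max(getattr(signals, name), 0.0), 1.0)
        for name in WEIGHTS
    )


# Example: a node with elevated ECC errors and some throttling.
score = node_health_score(
    NodeSignals(ecc_rate=0.8, thermal_throttle=0.3,
                firmware_drift=0.0, allocation_retry=0.1)
)
print(f"Node_Health_Score = {score:.2f}")
```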

3) Prediction, Mitigation and Remediation Layer: The diagnostics engine uses telemetry data from the various hardware components, together with the risk scores, to take mitigation and remediation actions.

A. Prediction of Hardware Faults

  • Takes place during the LIVE state of the server, with LIVE customer workloads running on it.
  • The hardware diagnostics engine collects hardware health attributes (i.e., hardware telemetry) from the telemetry layer and collaborates with other platform-level machine learning services to predict hardware faults.
  • It also performs predictive failure analysis to anticipate impending hardware faults based on the risk scores and proactively moves the AI workload to a healthy server without interrupting it (a minimal sketch follows this list).
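A minimal sketch of that proactive path, assuming a threshold on the risk score and a hypothetical live-migration API exposed by the platform's orchestration layer (the threshold value and the `migrate_workloads`/`cordon` calls are illustrative, not real platform APIs).

```python
# Risk threshold above which we act proactively; illustrative value.
PREDICTIVE_THRESHOLD = 0.7


def evaluate_node(node_id: str, score: float, orchestrator) -> str:
    """Decide whether to proactively drain a node based on its risk score.

    `orchestrator` is a hypothetical client for the platform's scheduling
    layer, assumed to expose a live-migration primitive.
    """
    if score < PREDICTIVE_THRESHOLD:
        return "healthy"  # keep serving customer workloads

    # Predicted impending fault: live-migrate workloads to a healthy node
    # without interrupting them, then stop new allocations landing here.
    orchestrator.migrate_workloads(source=node_id, reason="predicted_hw_fault")
    orchestrator.cordon(node_id)
    return "drained_for_diagnostics"
```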

B. Mitigation of Hardware Faults

  • Takes place during the LIVE state of the node, with LIVE customer workloads running on it.
  • If a hardware fault cannot be predicted in advance, the diagnostics engine attempts to mitigate the fault to ensure continuity of service. Mitigation actions currently in use include disk mirroring, memory page off-lining, error detection and correction, and automatic reset of the GPU driver on fault (see the sketch after this list).
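A minimal sketch of a mitigation dispatcher that maps fault types to live actions. The memory off-lining path uses the standard Linux memory-hotplug sysfs interface; the GPU path shells out to `nvidia-smi --gpu-reset`, which typically requires the GPU to be idle. The fault-type names and the surrounding plumbing are assumptions for illustration.

```python
import subprocess
from pathlib import Path


def offline_memory_block(block_id: int) -> None:
    """Off-line a failing memory block via the Linux memory-hotplug sysfs API."""
    state = Path(f"/sys/devices/system/memory/memory{block_id}/state")
    state.write_text("offline\n")


def reset_gpu(gpu_index: int) -> None:
    """Reset a faulted GPU; typically requires the GPU to be idle."""
    subprocess.run(
        ["nvidia-smi", "--gpu-reset", "-i", str(gpu_index)], check=True
    )


# Illustrative mapping from detected fault type to mitigation action.
MITIGATIONS = {
    "memory_page_fault": offline_memory_block,
    "gpu_driver_fault": reset_gpu,
}


def mitigate(fault_type: str, target: int) -> bool:
    """Run the mitigation for a fault, returning False if none applies."""
    action = MITIGATIONS.get(fault_type)
    if action is None:
        return False  # no live mitigation possible; escalate to remediation
    action(target)
    return True
```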

C. Remediation of Hardware Faults

  • Takes place during the OFFLINE state of the node, once customer workloads have been vacated.
  • If mitigation is not feasible, hardware diagnostics attributes the failure to a specific component based on the device telemetry collected in the telemetry layer. Once attribution is complete, the node goes through service and component repair at the data center (a minimal attribution sketch follows this list).
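A minimal sketch of rule-based failure attribution from telemetry signatures. The signature keys, thresholds, and component labels are illustrative assumptions; a production engine would combine many more signals and learned models.

```python
def attribute_failure(node_telemetry: dict) -> str:
    """Map an offline node's telemetry signature to a suspect component.

    `node_telemetry` is assumed to be the aggregated record produced by the
    telemetry layer; keys and thresholds here are illustrative.
    """
    if node_telemetry.get("uncorrectable_ecc_errors", 0) > 0:
        return "GPU memory (replace GPU / HBM tray)"
    if node_telemetry.get("ib_link_errors", 0) > 0:
        return "InfiniBand interconnect cabling or NIC"
    if node_telemetry.get("cpu_ierr_count", 0) > 0:
        return "CPU or motherboard"
    if node_telemetry.get("max_gpu_temp_c", 0) > 95:
        return "Cooling (fans, heatsink, thermal interface)"
    return "Unattributed: route to manual bench diagnostics"
```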

4) Diagnostics metrics and AI hardware fleet-wide insights

Build a reporting dashboard to expose GPU/node health metrics (a minimal aggregation sketch follows this list):

  • Failure rate trends by GPU SKU, zone, or region
  • Repeat-failure nodes
  • Heatmaps of thermal or utilization anomalies
  • Top SKUs and hosts contributing to model training failures
  • Correlated workload impact analysis (e.g., job retry trends, latencies)
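As a sketch of the kind of aggregation behind such a dashboard, the snippet below computes failure-rate trends by GPU SKU and region and flags repeat-failure nodes from a flat table of per-node incident records. The column names and sample rows are assumptions about the telemetry schema, not real data.

```python
import pandas as pd

# Illustrative schema: one row per node-day of fleet telemetry/incidents.
records = pd.DataFrame([
    {"node": "n1", "gpu_sku": "H100", "region": "us-east", "failed": 1},
    {"node": "n1", "gpu_sku": "H100", "region": "us-east", "failed": 1},
    {"node": "n2", "gpu_sku": "H100", "region": "us-east", "failed": 0},
    {"node": "n3", "gpu_sku": "A100", "region": "eu-west", "failed": 0},
])

# Failure rate trends by GPU SKU and region.
failure_rate = (
    records.groupby(["gpu_sku", "region"])["failed"]
    .mean()
    .rename("failure_rate")
)

# Repeat-failure nodes: more than one recorded failure.
repeat_failures = (
    records.groupby("node")["failed"].sum().loc[lambda s: s > 1].index.tolist()
)

print(failure_rate)
print("Repeat-failure nodes:", repeat_failures)
```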

Conclusion

Building robust and reliable diagnostics helps baseline AI hardware health and shows how that health varies across GPU SKUs and host models. It also lets us correlate hardware failure events with AI model degradation.


Written by hacker8143714 | Currently serving as a Principal Program Manager at Microsoft Azure, focusing on platform reliability and availability
Published by HackerNoon on 2025/04/17