Key Insight: “100% GPU Util” ≠ “100% Heat”

What "GPU Util" Actually Measures

The utilization.gpu metric in NVML/DCGM reflects the percentage of time at least one CUDA kernel was resident on the Streaming Multiprocessors (SMs).
It does not account for:

  • Functional unit activity (FP32, Tensor Cores, memory controllers).
  • SM occupancy (active warps per cycle).
  • Voltage/frequency state (DVFS or clock-gating).
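
The divergence is easy to see by sampling both counters side by side. The sketch below is minimal and assumes the pynvml (nvidia-ml-py) bindings and GPU index 0; on a memory-bound or copy-heavy workload it will typically print 100 % utilization next to a small fraction of the power limit.

```python
# Minimal sketch (assumes pynvml / nvidia-ml-py and GPU index 0):
# sample utilization.gpu next to board power to see how far they diverge.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
limit_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000.0  # mW -> W

for _ in range(10):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu      # "kernel resident" %
    power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0    # board power, W
    print(f"util={util:3d}%  power={power_w:6.1f} W  ({100 * power_w / limit_w:5.1f}% of limit)")
    time.sleep(1)

pynvml.nvmlShutdown()
```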

Example cases where "100% Util" masks variable heat output

| Case | What the counter sees | What the silicon does | Resulting power / heat |
|---|---|---|---|
| Compute-bound GEMM (FP16/FP8 Tensor Cores) | Kernel always resident → 100 % | SMs at P0 clocks, all tensor pipes switching, memory traffic modest | ~TDP (e.g., 700 W on an H100) |
| Memory-bound BFS / inference decode | Kernel resident, but SMs stall waiting on HBM | GPU down-clocks to keep DRAM running full; only 30–40 % of functional units toggling | 30–50 % of TDP |
| PCIe copy / encode / decode | Busy copy engine still raises "GPU util" | Core SMs mostly gated off | < 20 % of TDP |
| DVFS power cap (data-center power limit) | Util flag ≡ 100 % | Clocks limited to stay under the cap | Power exactly at the cap; temperature often 10–15 °C lower |
| MIG partition (⅛ of an H100) | MIG instance reports 100 %; physical GPU sees ~12 % | Remaining 7/8 of the SMs idle, gated | < 20 % of TDP |

Metrics to Track Real Heat Generation

| Metric | NVML / DCGM field | What it tells you |
|---|---|---|
| Instantaneous board power | nvmlDeviceGetPowerUsage | Direct proxy for heat → use for pump control |
| SM active cycles (occupancy) | DCGM field 1002 (sm_active) | % of cycles in which at least one warp is active on an SM |
| Tensor Core active | DCGM field 1004 (tensor_active) | Distinguishes GEMM jobs from memory-bound jobs |
| Memory controller active | DCGM field 1005 (dram_active) | Flags memory-bound kernels |
| Clocks & P-state | nvmlDeviceGetClockInfo, pstate | Shows DVFS throttling in real time |
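
One minimal way to sample the profiling counters above, assuming the dcgmi CLI that ships with DCGM and a running nv-hostengine, is to wrap dcgmi dmon from Python; the field IDs match the table, and the interpretation in the comments mirrors the example cases earlier.

```python
# Minimal sketch (assumes the dcgmi CLI from NVIDIA DCGM and a running nv-hostengine):
# sample the profiling counters from the table above for GPU 0.
import subprocess

FIELDS = "1002,1004,1005"          # sm_active, tensor_active, dram_active

# 5 samples, one per second (-d is the delay in ms), GPU 0 only.
out = subprocess.run(
    ["dcgmi", "dmon", "-i", "0", "-e", FIELDS, "-d", "1000", "-c", "5"],
    capture_output=True, text=True, check=True,
)
print(out.stdout)
# High tensor_active with modest dram_active -> compute-bound GEMM (near-TDP heat);
# low sm/tensor_active with high dram_active -> memory-bound kernel (well below TDP).
```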

Ultimately, the best single metric for gauging thermal load is nvmlDeviceGetPowerUsage. Combined with pstate, it tells us how much heat the workloads running on a GPU are generating and whether thermal throttling has occurred because of insufficient cooling.
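
A minimal sketch of that combination, assuming pynvml and GPU index 0; the 80 % threshold is an illustrative policy choice, not part of Federator.ai's logic:

```python
# Minimal sketch (pynvml, GPU 0): combine board power with the P-state, as recommended
# above. The 80 % threshold is an illustrative policy choice, not a fixed rule.
import pynvml

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)

power_w = pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0          # board power, W
limit_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(h) / 1000.0  # active power cap, W
pstate  = pynvml.nvmlDeviceGetPerformanceState(h)             # P0 = highest clocks

if power_w > 0.8 * limit_w and pstate == 0:
    verdict = "high heat, clocks healthy"
elif power_w > 0.8 * limit_w:
    verdict = "high heat with reduced clocks -- possible thermal or power throttling"
else:
    verdict = "low heat, whatever utilization.gpu reports"
print(f"{power_w:.0f} W of {limit_w:.0f} W cap, P{pstate}: {verdict}")

pynvml.nvmlShutdown()
```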

How Federator.ai monitors and manages thermal energy generated by GPU workloads

It is useful to define a heat index that models the generated heat independently of the GPU model. A reasonable definition of such a heat index (HI) is

HI = (GPU Power Draw − GPU Idle Power) / (GPU Max Power − GPU Idle Power)

By this definition, the heat index ranges between 0 and 1. Federator.ai monitors the scheduling and orchestration of GPU workloads and the fluctuation of the heat index across the GPUs of servers in the same rack, which are cooled by the same CDU. It also monitors the CDU's temperature sensors, coolant flow rate, and other CDU metrics in real time. With this information, Federator.ai dynamically adjusts the CDU coolant flow rate to keep GPUs within their optimal operating temperature range while reducing the energy used by the CDU.
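
A minimal sketch of this definition, assuming pynvml and GPU index 0. Using the enforced power limit as "GPU Max Power" and a one-off measured idle baseline are assumptions for illustration, not necessarily Federator.ai's exact implementation:

```python
# Minimal sketch (pynvml, GPU 0): the Heat Index as defined above.
# "GPU Max Power" is taken from the enforced power limit; "GPU Idle Power" must be
# measured once per GPU model with no workload running (the value below is a placeholder).
import pynvml

IDLE_POWER_W = 70.0   # placeholder: measure the actual idle draw of your GPU model

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)

power_w = pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0
max_w   = pynvml.nvmlDeviceGetEnforcedPowerLimit(h) / 1000.0

hi = (power_w - IDLE_POWER_W) / (max_w - IDLE_POWER_W)
hi = min(max(hi, 0.0), 1.0)     # clamp to the [0, 1] range of the definition
print(f"Heat Index = {hi:.2f}")

pynvml.nvmlShutdown()
```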

It is also important to raise alerts and notifications when any GPU reaches its maximum operating temperature and experiences thermal throttling. Federator.ai monitors the GPU's pstate metric for this purpose.
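
Alongside pstate, NVML also exposes temperature thresholds and clock-throttle-reason bits that can back the same alert. The sketch below assumes pynvml and GPU index 0, and the 5 °C margin is a placeholder policy:

```python
# Minimal sketch (pynvml, GPU 0): alert when the GPU nears its slowdown temperature or
# reports a thermal/HW slowdown in the clock-throttle-reason bitmask.
import pynvml

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)

temp_c = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
slow_c = pynvml.nvmlDeviceGetTemperatureThreshold(h, pynvml.NVML_TEMPERATURE_THRESHOLD_SLOWDOWN)
reasons = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(h)
throttled = reasons & (pynvml.nvmlClocksThrottleReasonSwThermalSlowdown |
                       pynvml.nvmlClocksThrottleReasonHwSlowdown)

if throttled or temp_c >= slow_c - 5:   # 5 C margin is a placeholder policy
    print(f"ALERT: GPU at {temp_c} C (slowdown threshold {slow_c} C), throttle bits {reasons:#x}")

pynvml.nvmlShutdown()
```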

The Federator.ai Smart Cooling system consists of three management planes for efficient thermal management.

  1. Real-time GPU Metrics Monitoring at the Edge
    An edge agent installed on each GPU server collects and monitors DCGM metrics (power usage, temperatures, pstate) and computes the heat index of each GPU at a 1-second interval. An alert is triggered if GPU thermal throttling occurs or a GPU temperature reaches a predefined maximum boundary.
  2. Thermal-aware Workload Placement
    Using metrics collected from DCGM as well as from the liquid cooling system (e.g., the CDU), Federator.ai places new GPU workloads on appropriate GPU servers so that it avoids hotspots while making the most efficient use of CDU energy (a conceptual placement sketch follows this list).
  3. Intelligent Smart Cooling Control
    Federator.ai interfaces with external liquid cooling hardware, such as rack-based or in-row CDUs, and adjusts flow rates and valves so that GPUs operate in the optimal temperature range with the least amount of energy.
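
The following is a purely illustrative sketch of how a thermal-aware placement decision could be scored; it is not Federator.ai's actual algorithm, and the server list, CDU headroom values, and scoring weights are hypothetical:

```python
# Purely illustrative sketch of thermal-aware placement -- not Federator.ai's algorithm.
# Assumes per-server Heat Index readings and per-rack CDU headroom have already been
# collected by the edge agents (hypothetical data below).
servers = [
    # name,       rack,     mean HI of its GPUs,  free GPUs
    ("node-a1",  "rack-1",  0.85, 2),
    ("node-b3",  "rack-2",  0.35, 4),
    ("node-c2",  "rack-2",  0.50, 1),
]
cdu_headroom = {"rack-1": 0.10, "rack-2": 0.55}   # fraction of cooling capacity left

def placement_score(rack, heat_index, job_hi_estimate):
    """Prefer racks with cooling headroom and servers whose added heat stays lowest."""
    return cdu_headroom[rack] - (heat_index + job_hi_estimate)

job_hi = 0.4   # estimated Heat Index of the incoming job (worst case if unknown)
candidates = [s for s in servers if s[3] > 0]
best = max(candidates, key=lambda s: placement_score(s[1], s[2], job_hi))
print(f"place job on {best[0]} ({best[1]})")
```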

The following table summarizes how Federator.ai GPU Booster integrates the workload-aware IT plane and the liquid-cooling facility plane into an intelligent smart cooling solution.

Intelligent Smart Cooling Control

| Layer | Concrete action | Why it matters in the "100 % util but low heat" reality |
|---|---|---|
| 1. Telemetry ingestion | Edge agent pulls DCGM board power, GPU temperature, and pstate every 1 s and computes the Heat Index | Board power and functional-unit counters track real joule generation; utilization.gpu does not |
| 2. GPU Booster – workload placement | Tags every pod / Slurm job with a heat budget (watts) and heat pattern (flat, bursty, decode); a new pod or Slurm job with no prior data is assumed to use its assigned resource (whole GPU or MIG slice) at full power. Packs memory-bound or MIG-slice jobs together so a single rack can run at lower pump RPM while compute-bound jobs fill a high-flow rack. Schedules gradient-sync phases out of phase across racks to flatten the ~10 % duty ripple | Separating "hot" and "cool" jobs raises total cluster throughput without over-cooling cold racks |
| 3. Smart Liquid Cooling – rack loop control | Switches the pump PID from ΔT feedback to Heat Index feed-forward (sketched after this table) | Flow adapts to actual heat, not the misleading 100 % util flag |
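
The feed-forward switch described in layer 3 can be pictured with a short sketch; the controller below is illustrative only, with a hypothetical flow range and ΔT target, and is not the actual Federator.ai implementation:

```python
# Illustrative sketch of Heat-Index feed-forward pump control -- not the actual
# Federator.ai controller. The rack-level flow setpoint is driven by the average Heat
# Index of the rack's GPUs, with a small Delta-T feedback trim on top.
MIN_FLOW_LPM, MAX_FLOW_LPM = 20.0, 120.0   # hypothetical CDU flow range, L/min
TARGET_DELTA_T = 10.0                      # hypothetical coolant outlet-inlet target, C

def flow_setpoint(rack_heat_index_sum, n_gpus, delta_t, kp=2.0):
    """Feed-forward from heat, feedback trim from measured coolant Delta-T."""
    feed_forward = MIN_FLOW_LPM + (MAX_FLOW_LPM - MIN_FLOW_LPM) * (rack_heat_index_sum / n_gpus)
    trim = kp * (delta_t - TARGET_DELTA_T)      # pump faster if coolant Delta-T runs high
    return max(MIN_FLOW_LPM, min(MAX_FLOW_LPM, feed_forward + trim))

# Example: 8 GPUs averaging HI = 0.3 (memory-bound jobs), coolant Delta-T at 11 C.
print(f"{flow_setpoint(8 * 0.3, 8, 11.0):.1f} L/min")
```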


Bottom line: a single "100 % GPU util" flag is a poor proxy for thermal load; Federator.ai therefore keys its cooling logic on power and functional-unit activity, not the coarse utilization bit.
