What "GPU Util" Actually Measures
The utilization.gpu metric in NVML/DCGM reflects the percentage of time over the sample period during which at least one CUDA kernel was executing on the GPU; the short probe after the list below shows how far this flag can drift from actual power draw.
It does not account for:
- Functional unit activity (FP32, Tensor Cores, memory controllers).
- SM occupancy (active warps per cycle).
- Voltage/frequency state (DVFS or clock-gating).
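A quick way to see the gap for yourself is to read the utilization flag and the board power side by side. The sketch below is a minimal probe using the nvidia-ml-py (pynvml) bindings, which this post does not prescribe; the device index and the print-only output are assumptions. On a memory-bound or I/O-bound workload it will routinely report 100 % util while the board draws only a fraction of its enforced power limit.

```python
# Minimal NVML probe (assumes the nvidia-ml-py package: pip install nvidia-ml-py).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumption: GPU 0

util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu            # coarse "GPU util", percent
power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0          # board power, watts
cap_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000.0    # enforced power limit, watts

print(f"util={util}%  power={power_w:.0f} W  "
      f"({100.0 * power_w / cap_w:.0f}% of the {cap_w:.0f} W cap)")

pynvml.nvmlShutdown()
```

Running this next to a bandwidth-bound kernel and then next to a large FP16 GEMM makes the divergence obvious.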
Example cases where "100% Util" masks variable heat output

| Case | What the counter sees | What the silicon does | Resulting power / heat |
|---|---|---|---|
| Compute-bound GEMM (FP16/FP8 Tensor Cores) | 100 % util | Tensor Cores saturated | ~TDP (e.g., 700 W H100) |
| Memory-bound BFS / inference decode | 100 % util | SMs mostly stalled waiting on memory | |
| PCIe copy / encode / decode | 100 % util | Copy and codec engines busy, SMs largely idle | |
| DVFS power cap (data-center powerLimit) | 100 % util | Clocks limited to stay < cap | Power exactly at the cap, temperature often 10–15 °C lower |
| MIG partition (⅛ H100) | 100 % util | Only the slice's share of SMs is running | |
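The power-cap row above can be confirmed straight from NVML: compare the enforced limit with the board's default limit and check whether the software power-cap throttle reason is set. A minimal sketch with the nvidia-ml-py bindings follows; the device index and print-only handling are assumptions, not part of any product described here.

```python
# Sketch: confirm the "DVFS power cap" case directly from NVML.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

default_w = pynvml.nvmlDeviceGetPowerManagementDefaultLimit(handle) / 1000.0  # factory limit, W
enforced_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000.0          # currently enforced cap, W
reasons = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)
capped = bool(reasons & pynvml.nvmlClocksThrottleReasonSwPowerCap)

print(f"default limit={default_w:.0f} W  enforced cap={enforced_w:.0f} W  "
      f"clocks held back by power cap: {capped}")

pynvml.nvmlShutdown()
```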
Metrics to Track Real Heat Generation

| Metric | NVML / DCGM field | What it tells you |
|---|---|---|
| Instantaneous board power | nvmlDeviceGetPowerUsage | Direct proxy for heat → use for pump control |
| SM active cycles (occupancy) | DCGM field 1002 (sm_active) | Fraction of cycles at least one warp is resident on an SM |
| Tensor Core active | DCGM field 1004 (tensor_active) | Fraction of cycles the Tensor Core pipes are busy |
| Memory controller active | DCGM field 1005 (dram_active) | Fraction of cycles the memory interface is moving data |
| Clocks & P-state | nvmlDeviceGetClockInfo, pstate | See DVFS throttling in real time |
Ultimately, the best single metric for gauging thermal load is nvmlDeviceGetPowerUsage. Together with the pstate, it tells us how much heat the workloads running on a GPU are generating and whether thermal throttling has occurred because of insufficient cooling.
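As a concrete illustration, the loop below samples exactly those signals (board power, SM clock, and P-state) once per second through the nvidia-ml-py bindings. The 1-second period, the device index, and the print-only output are assumptions for the sketch; the DCGM profiling fields (sm_active, tensor_active, dram_active) are collected through DCGM and are not shown here.

```python
# Sketch: 1 Hz sampling of board power, SM clock, and P-state via nvidia-ml-py.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # assumption: monitor GPU 0

try:
    while True:
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0            # watts
        sm_mhz = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)  # current SM clock, MHz
        pstate = pynvml.nvmlDeviceGetPerformanceState(handle)                 # 0 = P0 (max performance)
        print(f"power={power_w:.0f} W  sm_clock={sm_mhz} MHz  pstate=P{pstate}")
        time.sleep(1.0)
except KeyboardInterrupt:
    pynvml.nvmlShutdown()
```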
How Federator.ai monitors and manages thermal energy generated by GPU workloads
Federator.ai computes a per-GPU Heat Index (HI) from the power readings:

HI = (GPU Power Draw − GPU Idle Power) / (GPU Max Power − GPU Idle Power)
By this definition, the heat index ranges from 0 to 1. Federator.ai monitors the scheduling and orchestration of GPU workloads and the fluctuation of the heat index across the GPUs of servers in the same rack, which are cooled by the same CDU. It also monitors the CDU's temperature sensors, coolant flow rate, and other CDU metrics in real time. With this information, Federator.ai dynamically adjusts the CDU coolant flow rate to keep GPUs in their optimal operating temperature range while reducing the energy used by the CDU.
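A minimal sketch of the calculation, assuming per-model idle and maximum power values (the illustrative 70 W and 700 W below are not taken from this post) and board power from nvmlDeviceGetPowerUsage:

```python
# Sketch of the Heat Index (HI) computation defined above.
import pynvml

# Assumed per-model calibration values (illustrative numbers only).
IDLE_POWER_W = 70.0   # measured idle draw
MAX_POWER_W = 700.0   # board power limit / TDP

def heat_index(power_draw_w: float,
               idle_power_w: float = IDLE_POWER_W,
               max_power_w: float = MAX_POWER_W) -> float:
    """Normalize board power into a 0..1 Heat Index."""
    hi = (power_draw_w - idle_power_w) / (max_power_w - idle_power_w)
    return min(max(hi, 0.0), 1.0)   # clamp readings that fall slightly outside the bounds

if __name__ == "__main__":
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
    print(f"power={power_w:.0f} W  heat_index={heat_index(power_w):.2f}")
    pynvml.nvmlShutdown()
```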
It is also important to raise alerts and notifications when any GPU reaches its maximum operating temperature and experiences thermal throttling. Federator.ai monitors the GPU's pstate metric for this purpose.
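A hedged sketch of such a check, combining the P-state with NVML's temperature and clock-throttle-reason queries; the 83 °C threshold and the print-only alert are placeholders, not Federator.ai's actual values or notification path:

```python
# Sketch: detect thermal throttling from P-state, temperature, and throttle reasons.
import pynvml

MAX_TEMP_C = 83  # placeholder alert threshold; set per GPU model / site policy

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

pstate = pynvml.nvmlDeviceGetPerformanceState(handle)   # 0 = P0 (highest performance state)
temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
reasons = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)

thermal_bits = (pynvml.nvmlClocksThrottleReasonSwThermalSlowdown
                | pynvml.nvmlClocksThrottleReasonHwThermalSlowdown)

if (reasons & thermal_bits) or temp_c >= MAX_TEMP_C:
    # Placeholder for the real notification path (webhook, pager, etc.).
    print(f"ALERT: GPU at {temp_c} °C, pstate=P{pstate}, thermal throttling suspected")

pynvml.nvmlShutdown()
```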
The Federator.ai Smart Cooling system consists of the following three management planes for efficient thermal management.
- Real-time GPU Metrics Monitoring at the Edge
An edge agent is installed on each GPU server to collect and monitor DCGM metrics (power usage, temperature, pstate) and compute the heat index of each GPU at a 1-second interval. An alert is triggered if GPU thermal throttling occurs or the GPU temperature reaches a predefined maximum threshold.
- Thermal-aware Workload Placement
Using metrics collected from DCGM as well as from the liquid cooling system (e.g., CDUs), Federator.ai places new GPU workloads on appropriate GPU servers so that it avoids hotspots while making the most energy-efficient use of the CDUs.
- Intelligent Smart Cooling Control
Federator.ai interfaces with external liquid cooling hardware, such as rack-based or in-row CDUs, and adjusts flow rates and valves so that GPUs operate in their optimal temperature range with the least amount of cooling energy, as sketched below.
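As a rough illustration of the feed-forward idea rather than Federator.ai's actual controller, the sketch below maps a rack-level aggregate of per-GPU heat indexes to a coolant-flow setpoint; the flow bounds, the mean aggregation, and the smoothing factor are all arbitrary assumptions.

```python
# Sketch: Heat-Index feed-forward for a CDU flow setpoint (illustrative only).

MIN_FLOW_LPM = 20.0    # assumed lower flow bound for the rack loop, litres/min
MAX_FLOW_LPM = 120.0   # assumed upper flow bound
SMOOTHING = 0.3        # exponential smoothing to avoid chasing short power bursts

def rack_heat_index(gpu_heat_indexes: list[float]) -> float:
    """Aggregate the per-GPU heat indexes of one CDU-cooled rack (simple mean)."""
    return sum(gpu_heat_indexes) / len(gpu_heat_indexes)

def flow_setpoint(rack_hi: float, previous_lpm: float) -> float:
    """Feed-forward: scale flow between the bounds by the rack heat index, then smooth."""
    target = MIN_FLOW_LPM + rack_hi * (MAX_FLOW_LPM - MIN_FLOW_LPM)
    return previous_lpm + SMOOTHING * (target - previous_lpm)

# Example: a rack of mostly memory-bound jobs reporting "100 % util" but modest heat.
setpoint = flow_setpoint(rack_heat_index([0.35, 0.4, 0.3, 0.45]), previous_lpm=60.0)
print(f"next flow setpoint: {setpoint:.1f} L/min")
```

In practice a feed-forward term like this would still be combined with temperature feedback so that sensor drift or a mis-calibrated heat index cannot leave a rack under-cooled.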
The following table summarizes how Federator.ai GPU Booster integrates the workload-aware IT plane and the liquid-cooling facility plane into an intelligent smart cooling solution.

| Layer | Concrete action | Why it matters in the “100 % util but low heat” reality |
|---|---|---|
| 1. Telemetry ingestion | Edge agent pulls DCGM board power, GPU temperature, and pstate every 1 s; computes the Heat Index. | |
| 2. GPU Booster – workload placement | Tags every pod / Slurm job with a heat budget (watts) and a heat pattern (flat, bursty, decode); a new pod / Slurm job with no prior data is assumed to fully use its assigned resource (whole GPU or MIG slice). Packs memory-bound or MIG-slice jobs together so a single rack can run at lower pump RPM while compute-bound jobs fill a high-flow rack. Schedules gradient-sync phases out of phase across racks to flatten the 10 % duty ripple. | |
| 3. Smart Liquid Cooling – rack loop control | Switches the pump PID from ΔT feedback to Heat Index feed-forward. | Flow adapts to actual heat, not the misleading 100 % util flag. |
References
- NVIDIA Developer Forums, “Nvidia-SMI reporting 0% GPU utilization,” 2023. [Online]. Available: https://forums.developer.nvidia.com/t/nvidia-smi-reporting-0-gpu-utilization/261878
- NVIDIA Developer, “System Management Interface SMI,” NVIDIA. [Online]. Available: https://developer.nvidia.com/system-management-interface
- NVIDIA Developer Blog, “Measuring the GPU Occupancy of Multi-stream Workloads,” 2024. [Online]. Available: https://developer.nvidia.com/blog/measuring-the-gpu-occupancy-of-multi-stream-workloads/
- Wang, “DSO: A GPU Energy Efficiency Optimizer by Fusing Dynamic and Static Information,” arXiv preprint arXiv:2407.13096, 2024. [Online]. Available: https://arxiv.org/abs/2407.13096
- Open Compute Project Cooling Environments Project, “Reservoir and Pumping Unit (RPU) Specification,” Version 1.0, Nov. 2022. [Online]. Available: https://www.opencompute.org/documents/ocp-reservoir-and-pumping-unit-specification-v1-0-pdf
Bottom line: a single “100 % GPU util” flag is a poor proxy for thermal load; Federator.ai should key its cooling logic on power and functional-unit activity, not the coarse utilization bit.