The “Everything Is Green” Problem

Here’s a realistic scenario I’ve seen in different forms across fleets (it’s a composite, not a single incident with exact figures):

A training run is supposed to take ~3–4 weeks.

Two weeks in, someone notices the timeline slipping. Not a crash. Not a failure. Just… slow. The job is running 10–30% behind plan, and nobody can point to a smoking gun.

The dashboards look perfect:

  • GPU utilization: 98–100%
  • Temperatures: 72–76°C
  • Power: steady
  • Memory: fine
  • No alerts, no incidents, nothing “red”

Then someone checks the one thing most teams don’t graph: delivered clocks.

They run:

nvidia-smi -q -d CLOCK

And see something like:

Clocks
    Graphics                         : 1095 MHz
    SM                               : 1095 MHz
    Memory                           : 1593 MHz
Max Clocks
    Graphics                         : 1410 MHz
    SM                               : 1410 MHz
    Memory                           : 1593 MHz

Nothing is “overheating.” Nothing is “down.”

But the GPUs are clearly not running where they normally run for this workload. They’re busy—fully utilized—while delivering less work per second than expected.
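
If you want to watch for that drift continuously instead of catching it in a one-off snapshot, a simple loop over delivered vs. maximum SM clocks is enough. A minimal sketch (query field names as exposed by recent drivers; verify against nvidia-smi --help-query-gpu on your version):

# Print utilization plus delivered vs. maximum SM clock for every GPU, every 5 seconds.
nvidia-smi --query-gpu=index,utilization.gpu,clocks.sm,clocks.max.sm \
           --format=csv,noheader,nounits -l 5

A GPU reporting “100% utilized” at 1095 MHz against a 1410 MHz maximum shows up here immediately, long before the schedule slips.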

In a real fleet, that can happen because of:

  • a power cap policy applied to a subset of nodes
  • a sync-boost / group behavior holding clocks to the slowest GPU in a chassis
  • application clocks locked below normal
  • driver/daemon state changes after maintenance (persistence toggles, MIG changes, driver reloads)
  • a subtle airflow issue that never crosses a hard threshold but eats boost headroom
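
Each of those causes can be checked from the node in a minute or two. A rough checklist, assuming recent nvidia-smi query fields (names can shift between driver versions):

# Is a power cap in effect? Compare enforced vs. default limits.
nvidia-smi --query-gpu=index,power.limit,enforced.power.limit,power.default_limit \
           --format=csv

# Are application clocks locked below normal?
nvidia-smi -q -d CLOCK | grep -A 3 "Applications Clocks"

# Did persistence or MIG state change after maintenance?
nvidia-smi --query-gpu=index,persistence_mode,mig.mode.current --format=csv

# Are sync-boost, power, or thermal limit reasons active right now?
nvidia-smi -q -d PERFORMANCE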

The fix in these cases is often boring:

  • remove the cap, correct the policy
  • reset the GPU
  • normalize clock settings
  • update a driver / fabric manager
  • fix airflow in one rack that’s dragging clocks down across a group
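
The commands behind those boring fixes are equally boring. A hedged sketch (all of these change live GPU state, so treat them as examples to adapt, not a runbook):

# Undo locked or forced clocks.
sudo nvidia-smi -rgc      # reset locked GPU clocks
sudo nvidia-smi -rac      # reset application clocks to defaults

# Restore the power limit: read the board default first, then set it explicitly.
nvidia-smi -q -d POWER | grep "Default Power Limit"
sudo nvidia-smi -pl 700   # example value; use the default reported above

# Last resort for a GPU stuck in a bad state (drain workloads first; the GPU must be idle).
sudo nvidia-smi --gpu-reset -i 0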

The expensive part isn’t the fix. The expensive part is how long you can run “green” while being slow.

That’s the theme of this article:

Telemetry is easy to collect and hard to trust.

That’s also why this post sits right next to the first two articles in this series: HVAC Doesn’t Create Cold — It Removes Heat and Predictive Power Conditioning for GPU Clusters. If your thermal behavior shifts (inlet drift, airflow pockets, containment leaks, a “normal day” turning into a hot day), you can trigger the exact “everything is green but we’re slower” conditions described below. And as I argue in the predictive-power-conditioning piece, data centers behave like organic systems: coupled layers, laggy control loops, and local anomalies that only show up once you correlate facility + node + GPU telemetry.

The Metrics Everyone Collects

Most monitoring stacks grab the obvious stuff:

  • GPU temperature (junction / memory / board)
  • GPU utilization (SM activity %)
  • power draw (watts)
  • memory usage
  • “a clock number” (sometimes)

They’re available in nvidia-smi, DCGM, exporters, dashboards. They look great in Grafana.
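
For reference, this is roughly the query those dashboards are built on. A minimal sketch using nvidia-smi (DCGM-based exporters collect equivalent fields):

# The "standard" metrics: temperature, utilization, power, memory. Sampled every 10 s.
nvidia-smi --query-gpu=timestamp,index,temperature.gpu,utilization.gpu,power.draw,memory.used \
           --format=csv -l 10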

And they lie by omission.

What the Metrics Don’t Tell You

Temperature in isolation is incomplete

A GPU at 76°C can mean “healthy” or “one degree away from losing performance,” depending on the environment.

76°C with 20–22°C inlet is one world.
76°C with 30–35°C inlet is another world.

If you don’t know inlet conditions, you don’t know your margin.

Utilization is not performance

Utilization answers: “Were the SMs busy?”

It does not answer: “How fast were they running?”
A GPU can report 100% utilization while running at lower clocks, lower power, or constrained boost behavior.

Same utilization. Different throughput.
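
One way to make that visible is to print the clock deficit next to the utilization number. A small sketch (assumes the nounits CSV output parses as plain numbers):

# Utilization alongside the gap between delivered and maximum SM clock.
nvidia-smi --query-gpu=utilization.gpu,clocks.sm,clocks.max.sm \
           --format=csv,noheader,nounits \
  | awk -F', ' '{printf "util=%s%%  sm=%s MHz  max=%s MHz  deficit=%.0f%%\n", $1, $2, $3, 100*(1-$2/$3)}'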

Power draw is ambiguous

A GPU pulling 400W instead of 700W might mean:

  • the workload is lighter
  • a power cap is applied
  • thermal or power limits are constraining boost
  • clocks are locked lower than normal
  • the node is in a configuration state you didn’t intend

Power alone doesn’t tell you why. It’s a clue, not a diagnosis.
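
The fastest way to separate “lighter workload” from “capped” is to look at draw next to the limits. A sketch, again assuming recent query field names:

# Draw far below an unchanged limit suggests a lighter workload;
# a limit sitting below the board default suggests a cap or policy.
nvidia-smi --query-gpu=index,power.draw,power.limit,enforced.power.limit,power.default_limit \
           --format=csv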

Delivered clocks are one of the few “truth signals”

Clocks are not the whole story (memory-bound kernels exist, comm stalls exist), but they are one of the few signals that directly reflect the delivered compute rate under sustained load.

If your fleet normally sustains X MHz for this job and it suddenly sustains 20% less, that’s not “normal variation.” That’s a problem, even if temps are “fine.”
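
Sustained is the operative word: an instantaneous reading catches boost spikes, not delivery. A rough way to get a windowed average for one GPU (60 samples at 5-second intervals, i.e. a 5-minute window) that you can compare against your own baseline:

# Average SM clock on GPU 0 over ~5 minutes.
nvidia-smi -i 0 --query-gpu=clocks.sm --format=csv,noheader,nounits -l 5 \
  | head -n 60 \
  | awk '{sum += $1; n++} END {printf "sustained SM clock: %.0f MHz over %d samples\n", sum/n, n}'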

The P-State Trap (and Why It’s Confusing)

P-states exist to manage performance and power. But two points matter:

  1. P-state labels aren’t enough.
    Seeing P2 vs P0 isn’t a verdict by itself. Different products and modes behave differently.

  2. The failure mode isn’t “wrong P-state.”
    The failure mode is: busy GPU + lower-than-expected sustained clocks + no obvious thermal alarm.

That’s the silent killer: you’re “fully utilized,” but slower.
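
That combination is easy to flag once you query for it directly: high utilization, SM clock well below max, and no thermal slowdown reported. A sketch using the throttle-reason query fields (exposed on most recent drivers; names may vary):

# "Busy but slow": utilization, clock gap, and the common limit reasons, one line per GPU.
nvidia-smi --query-gpu=index,utilization.gpu,clocks.sm,clocks.max.sm,\
clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.sw_thermal_slowdown,\
clocks_throttle_reasons.hw_thermal_slowdown,clocks_throttle_reasons.sync_boost,\
clocks_throttle_reasons.applications_clocks_setting --format=csv

A large clock gap with every reason column reading “Not Active” points at configuration (locked clocks, caps applied elsewhere in the stack) rather than thermals.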

Versioning and Correctness Make It Worse

Now add the uncomfortable layer: sometimes it’s not just speed. It can be correctness.

Driver release notes have documented cases where a GPU can end up in an invalid state after certain operations (driver reloads, toggling persistence, recoveries after errors). In edge cases, behavior can be unpredictable: performance can change, and correctness can be at risk.

That’s not “thermal telemetry.” That’s the bigger point:

your fleet is a hardware + firmware + driver + config system.
If you don’t track versions and state transitions, you don’t know what you’re measuring.
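
In practice that means snapshotting software state with the same discipline as temperatures. A minimal sketch of what to capture per node (the Xid grep may need root, and log locations vary by distro):

# Driver, persistence, and MIG state as nvidia-smi reports them.
nvidia-smi --query-gpu=index,driver_version,persistence_mode,mig.mode.current --format=csv

# Driver version as the kernel sees it (catches mismatches after partial reloads).
cat /sys/module/nvidia/version

# Recent XID events, if any made it into the kernel log.
dmesg | grep -i xid | tail -n 20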

The Signals You Actually Need

Trustworthy monitoring isn’t “more metrics.” It’s the right correlations.

Environment (what the building is doing)

  • inlet temp (front-of-rack / front-of-node if possible)
  • outlet temp
  • rack/row ΔT trends
  • airflow proxies (fan RPM, pressure differential, containment leakage indicators)

Power (what the node is allowed to do)

  • GPU power draw
  • node power draw
  • enforced power limits / caps
  • any “sync boost” / group clocking policy

Thermal response (how fast you’re losing margin)

  • junction + memory temps
  • board/ambient temps
  • dT/dt (rate of temperature rise)
  • hotspots by position (top of rack, end of row, near returns)

Performance delivery (what you’re actually getting)

  • sustained SM clocks (not just instantaneous spikes)
  • clocks vs your own fleet baseline for this workload
  • utilization at those clocks
  • throttle / limit reason codes (thermal, power, reliability, etc.)
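
DCGM is the natural tool for streaming this group, since it exposes clocks and the throttle-reason bitmask as first-class fields. A sketch (numeric field IDs come from dcgm_fields.h in your installed version; confirm them with dcgmi dmon -l before relying on these):

# 100 = SM clock, 101 = memory clock, 112 = clock throttle reasons (bitmask),
# 150 = GPU temp, 155 = power draw, 203 = GPU utilization. Sampled every 5 s.
dcgmi dmon -e 100,101,112,150,155,203 -d 5000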

Software state (the stuff that “shouldn’t matter” but does)

  • driver version / fabric manager version
  • persistence mode state
  • MIG mode state
  • recent resets, XIDs, ECC events, compute engine errors

The magic is correlation, not collection.
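
As a concrete (if crude) example of correlation over collection: log the node’s inlet reading from the BMC on the same line as its GPU clocks and power, so a warm aisle and a clock sag land in the same file. The sensor name "Inlet Temp" and the output path are assumptions; both vary by platform:

# Append a timestamped line per GPU: inlet temp (BMC), SM clock, power draw, GPU temp.
while true; do
  ts=$(date -u +%FT%TZ)
  inlet=$(ipmitool sdr get "Inlet Temp" 2>/dev/null | awk '/Sensor Reading/ {print $4}')
  nvidia-smi --query-gpu=index,clocks.sm,power.draw,temperature.gpu \
             --format=csv,noheader,nounits \
    | while IFS= read -r gpu_line; do
        echo "$ts, ${inlet:-NA}, $gpu_line"
      done
  sleep 30
done >> gpu_env_correlation.csv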

What Correlation Reveals (Four Common Patterns)

1) Local airflow restriction (rack problem)

  • inlet is normal
  • GPU temps creep up
  • ΔT looks weird
  • clocks sag
  • thermal throttle flags appear

Fix is airflow/containment in that rack—not “turn the CRAC down.”

2) Facility supply is warm (building problem)

  • inlet is high across a zone
  • GPUs run “within spec” but lose boost headroom
  • more nodes drift slower together

Fix is facility-side (capacity, containment leaks, control loops), not per-node fan tuning.

3) Policy/config cap (the silent fleet killer)

  • inlet looks fine
  • temps look fine
  • utilization looks fine
  • power is lower than expected
  • clocks are lower than expected
  • limit/throttle flags may be empty (because it’s not “throttling,” it’s “limited”)

Fix is policy/config: caps, sync-boost grouping, locked application clocks, etc.

4) “No incident, just expensive”

Nothing crosses a threshold, but you’re leaving 5–15% performance on the table for weeks.

That’s not an outage. That’s a budget leak.

Why This Costs Real Money

GPU economics are time × hardware.

Even a 10–15% slowdown can turn into:

  • a longer training calendar
  • missed internal deadlines
  • more GPU-hours burned
  • more staff time spent “tuning the model” when the platform is the bottleneck

Example math (illustrative):

  • 256 GPUs × $30/GPU-hour = $7,680/hour
  • 15% slowdown = $1,152/hour of waste
  • over 2 weeks of continuous run = six figures of overhead
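
If you want to sanity-check that arithmetic, or rerun it with your own rates, it is three multiplications. A throwaway awk version, using 336 hours for two weeks of continuous running:

awk 'BEGIN {
  rate  = 256 * 30          # 256 GPUs at $30/GPU-hour = $7,680/hour
  waste = rate * 0.15       # 15% slowdown             = $1,152/hour
  print waste * 336         # two weeks (336 h)        = $387,072
}'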

And the scariest part?

Most teams never notice—because the dashboards stay green.

Building Telemetry You Can Trust

1) Measure inlet, not just GPU junction.
2) Track sustained clocks and compare to baseline for that workload.
3) Capture throttle/limit reason codes, not just temps.
4) Track power caps / policies as first-class telemetry.
5) Track versions (driver, fabric manager, firmware) and state transitions.
6) Correlate the stack in one view: environment + power + thermals + clocks + reasons + version.

The Uncomfortable Truth

Most GPU fleets are running slower than they should be.

Not because the GPUs are “bad.”
Because the telemetry is incomplete, and the interpretation is hard.

If you want trustworthy thermal monitoring, you need to stop asking “what’s the temperature?” and start asking:

“Are we delivering the clocks we should be delivering, and do we know why?”