Context Drift Kills AI Agents Before Latency Does

A few weeks ago we hit a production issue in a cloud environment — one XCP-ng host was showing IOPS contention caused by a single guest VM. The classic noisy-neighbor problem on shared storage. The diagnostic path was obvious: cross-reference the dom0 guest list with iostat output on the host, find the VM hammering the disk, and work the problem from there. Straightforward correlation — the kind of thing an experienced operator resolves in fifteen minutes with two terminal windows. ...
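The correlation step can be sketched as follows. This is a minimal illustration, not the incident's real data: the sample table and device names are invented, and on the actual XCP-ng host the inputs would come from `iostat -dx` and `xe vm-list` / `xe vbd-list` in dom0.

```python
# Illustrative sketch: find the busiest block device in iostat-style output,
# which on an XCP-ng host you would then map back to a guest VBD/VM.
# The sample data below is hypothetical.
IOSTAT_SAMPLE = """\
Device  r/s    w/s    rkB/s   wkB/s    %util
xvda    12.0   30.0   480.0   900.0    22.5
tda     5.0    410.0  80.0    52000.0  98.7
tdb     1.0    2.0    16.0    64.0     1.2
"""

def busiest_device(iostat_text: str) -> str:
    """Return the device name with the highest %util (last column)."""
    rows = [line.split() for line in iostat_text.splitlines()[1:] if line.strip()]
    return max(rows, key=lambda r: float(r[-1]))[0]

print(busiest_device(IOSTAT_SAMPLE))  # → tda
```

From there, the busiest device is traced back to its owning VM — on XCP-ng that mapping goes through the VBD records (`xe vbd-list`), which is the "two terminal windows" part of the job.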

Local LLM Bench: MoE vs Dense on One RTX 3090

I went looking for sustained-load benchmarks comparing MoE and Dense coding models on consumer GPUs. Not demo bursts on a Mac Mini — sustained autoregressive generation on real coding tasks, where architecture and interconnect are the only variables. I found plenty of one-shot numbers. Nobody had published the comparison that matters: same hardware, same quantization, same inference engine, MoE versus Dense, across GPU configurations. Methodology visible. Numbers verifiable. So I ran the tests. Dual RTX 3090s with NVLink, custom liquid cooling, a 6 kW isolation transformer feeding a double-conversion UPS. Not elegant, but thermally and electrically honest — sustained inference loads without throttling, no measurement fiction. The hardware details are below. ...
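The distinction between a demo burst and sustained generation comes down to how you measure. A minimal sketch of the measurement loop, with `generate_one_token` as a stand-in for the real inference engine call (the function name and timings are assumptions, not the post's actual harness):

```python
# Hedged sketch: measure steady-state tokens/sec, not first-burst speed.
# generate_one_token() is a placeholder for one autoregressive decode step.
import time

def generate_one_token():
    time.sleep(0.0005)  # stand-in for a real decode step's latency
    return "tok"

def sustained_tps(total_tokens: int = 2000, warmup: int = 200) -> float:
    """Warm up first so caches, allocators, and clocks settle, then time a
    long run -- this is what exposes thermal throttling that a short
    demo burst never sees."""
    for _ in range(warmup):
        generate_one_token()
    t0 = time.perf_counter()
    for _ in range(total_tokens):
        generate_one_token()
    return total_tokens / (time.perf_counter() - t0)

print(f"{sustained_tps(total_tokens=500, warmup=50):.0f} tok/s under the stub")
```

The warmup and the long timed window are the point: a throttling GPU looks fast for the first thirty seconds and slow for the next thirty minutes, and only the sustained number catches that.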

Telemetry That Lies: GPU Thermal Monitoring

The “Everything Is Green” Problem

Here’s a realistic scenario I’ve seen in different forms across fleets (this is a composite, not a single true story with exact numbers): a training run is supposed to take ~3–4 weeks. Two weeks in, someone notices the timeline slipping. Not a crash. Not a failure. Just… slow. The job is running 10–30% behind plan, and nobody can point to a smoking gun. The dashboards look perfect: ...