Welcome to URE#
GPU fleet reliability is an emerging discipline—built under pressure by teams who can’t afford to learn on the job.
URE is my research log on high-density compute operations: patterns, failures, and the gap between “it worked in the lab” and “it survives production.”
→ Start here: About · GPU, Cloud & Data Center · Articles
Featured: Atlas#
Atlas is a single pane of glass for multi-cloud spend visibility and planning—drilldowns by provider, region, and product, with capacity signals for what’s coming next.
→ Visit Atlas Demo: /atlas/
What URE covers#
URE (Usage • Resources • Economics) connects three things most teams track separately:
- Usage: what workloads actually do (demand, bursts, variability)
- Resources: what the system can actually provide (compute, network, storage, power, cooling)
- Economics: what the gap costs—in money, risk, time, and opportunity
How the site is organized#
- GPU, Cloud & Data Center: research pillars, lab setup, equipment, and methods
- Articles: working papers and benchmarks
- About: background and scope
In 2021, I bought a home in Florida. The closing was in August: picture hot summer days with temperatures over 100°F and humidity over 80%.
When we selected the builder, I noted two things: a 15 SEER HVAC system and R-39 insulation. My house would be minimally energy-efficient. I had no option to upgrade the HVAC, but 15 SEER is “good enough”.
During our first week in the house, my wife noticed I was getting bothered every time the compressor kicked in: a subtle, almost imperceptible hit on the lights. Nobody else noticed it, but I did. A battle-proven engineer with experience in thermal and power transients. What could go wrong?
...
I’m currently working on the design of a framework for GPU fleet management.
We’re living in a crowded data center reality where everybody wants “hero” compute — dense GPUs, fast networking, and delivery that’s closer to the edge. We’re in a land-grab phase where every business wants to be everywhere, but most teams are discovering the same thing: buying GPUs is the easy part. Operating them as a coherent fleet is the hard part.
...
Atlas is a single pane of glass for multi-cloud cost visibility. This post documents the pipeline: ingestion, streaming, storage, query, forecasting, and visualization.
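To make those stages concrete, here is a minimal sketch of how such a pipeline can be wired end to end, with streaming and durable storage elided. Every name in it (`CostRecord`, `normalize`, `rollup`, `forecast`) is an illustrative assumption for this post, not Atlas's actual schema or modules.

```python
# Minimal sketch of a multi-cloud cost pipeline (illustrative only).
# Field and function names are assumptions, not Atlas's real schema.
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

@dataclass
class CostRecord:
    provider: str   # e.g. "aws", "gcp", "azure"
    region: str
    product: str
    day: str        # ISO date
    usd: float

def normalize(raw: dict) -> CostRecord:
    """Ingestion: map a provider-specific billing row to a common schema."""
    return CostRecord(
        provider=raw["provider"],
        region=raw["region"],
        product=raw["product"],
        day=raw["day"],
        usd=float(raw["cost"]),
    )

def rollup(records, keys=("provider", "region")):
    """Query: aggregate spend by the requested drilldown keys."""
    totals = defaultdict(float)
    for r in records:
        totals[tuple(getattr(r, k) for k in keys)] += r.usd
    return dict(totals)

def forecast(daily_usd, window=7):
    """Forecasting: naive moving average over the last `window` days."""
    recent = daily_usd[-window:]
    return mean(recent) if recent else 0.0

if __name__ == "__main__":
    raw_rows = [
        {"provider": "aws", "region": "us-east-1", "product": "p5", "day": "2025-01-01", "cost": "980.0"},
        {"provider": "gcp", "region": "us-central1", "product": "a3", "day": "2025-01-01", "cost": "1210.0"},
    ]
    records = [normalize(r) for r in raw_rows]
    print(rollup(records))                      # drilldown by provider/region
    print(forecast([r.usd for r in records]))   # next-period spend estimate
```

Streaming and storage sit between ingestion and query in the stack described above; they are omitted here only to keep the sketch self-contained.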
Right now, I’m working on an InfiniBand topology design for a GPU cluster. The math keeps pointing to the same conclusion: scale-out only makes sense when scale-in has topped out.
It’s not about CUDA cores. It’s not about tensor throughput. It’s about tail latency.
NVLink keeps GPU-to-GPU communication on-package or over short copper links — no NIC, no PCIe host traversal, no protocol stack. For small messages, that means sub-microsecond latency in the hundreds-of-nanoseconds range. InfiniBand NDR switches add sub-microsecond port-to-port latency, but once you include the full path — PCIe to the NIC, driver overhead, fabric hops, and back — real-world GPU-to-GPU latency across nodes often lands in the 3-10μs range depending on message size and topology.
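One way to see why the latency term, not bandwidth, is the deciding factor is a simple alpha-beta transfer model, t(n) ≈ α + n/β. The α values below follow the figures quoted above; the bandwidths and message sizes are illustrative assumptions, not measurements.

```python
# Alpha-beta transfer model: t(n) = alpha + n / beta
# Latencies follow the figures quoted above; bandwidths and message
# sizes are illustrative assumptions, not measured values.

def transfer_us(n_bytes, alpha_us, beta_gbps):
    # 1 Gb/s = 1000 bits per microsecond
    return alpha_us + (n_bytes * 8) / (beta_gbps * 1e3)

LINKS = {
    "NVLink (intra-node)":         {"alpha_us": 0.5, "beta_gbps": 3600},  # ~450 GB/s assumed
    "InfiniBand NDR (cross-node)": {"alpha_us": 5.0, "beta_gbps": 400},
}

for size in (1_024, 65_536, 16_777_216):  # 1 KiB, 64 KiB, 16 MiB
    for name, p in LINKS.items():
        t = transfer_us(size, **p)
        print(f"{name:30s} {size/1024:10.0f} KiB  ~{t:8.2f} us")
```

At KiB scale the fixed α term is essentially the whole cost, which is why small, frequent collectives feel the cross-node penalty hardest.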
...
The “Everything Is Green” Problem
Here’s a realistic scenario I’ve seen in different forms across fleets (a composite, not a single literal incident with exact numbers):
A training run is supposed to take ~3–4 weeks.
Two weeks in, someone notices the timeline slipping. Not a crash. Not a failure. Just… slow. The job is running 10–30% behind plan, and nobody can point to a smoking gun.
The dashboards look perfect:
...
GPU clusters don’t fail from sustained load. They fail on transitions.
A pod idling at 20 kW can step toward 300 kW quickly when training begins. The peak matters, but the killer is the step: the dP/dt that forces every layer of the electrical path to react at once.
Thermals matter too, but they’re secondary and collateral. Power transients can push protection and control systems into repeated cycles. Thermal consequences show up later as throttling, efficiency loss, and “mysteriously slower training” that looks like a software problem until you instrument the facility.
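To put rough numbers on the step: the same 280 kW swing becomes a very different event depending on how fast it arrives. The ramp durations in this sketch are illustrative assumptions; only the 20 kW and 300 kW endpoints come from the scenario above.

```python
# Same power step (20 kW idle -> 300 kW under load), different ramp times.
# Ramp durations are illustrative assumptions; the 20/300 kW figures
# come from the scenario above.
P_IDLE_KW, P_PEAK_KW = 20.0, 300.0
step_kw = P_PEAK_KW - P_IDLE_KW  # 280 kW swing

for ramp_s in (10.0, 1.0, 0.1):  # slow controlled ramp vs. abrupt job start
    dp_dt_mw_per_s = (step_kw / 1000.0) / ramp_s
    print(f"ramp {ramp_s:>5.1f} s -> dP/dt ~ {dp_dt_mw_per_s:.2f} MW/s")
```

A sub-second ramp lands in the MW/s range, which is the kind of dP/dt every layer of the electrical path has to absorb at once.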
...
Traditional internet architecture solved latency with caching. Static content, images, JavaScript bundles—all pushed to edge nodes milliseconds from users. CDNs achieve 95-99% cache hit rates. The compute stays centralized; the content moves to the edge.
AI breaks this model completely.
Every inference requires real GPU cycles. You can’t cache a conversation. You can’t pre-compute a response to a question that hasn’t been asked. The token that completes a sentence depends on every token before it.
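The dependency chain is visible in the shape of autoregressive decoding itself. This is a toy sketch: `next_token` below is a hypothetical stand-in for a real model forward pass, not any actual inference API.

```python
# Why responses can't be pre-computed: each token is a function of the
# entire prefix, which includes a prompt that doesn't exist until the
# user sends it. `next_token` is a hypothetical stand-in for a model.

def next_token(prefix: list[str]) -> str:
    # A real system runs a GPU forward pass over the whole context here.
    return "<tok>" if len(prefix) < 8 else "<eos>"

def generate(prompt: list[str]) -> list[str]:
    context = list(prompt)
    while True:
        tok = next_token(context)   # depends on *every* token before it
        if tok == "<eos>":
            return context
        context.append(tok)         # output feeds back into the next step

print(generate(["why", "can't", "we", "cache", "this", "?"]))
```

Every iteration of that loop costs GPU cycles at request time, so there is nothing reusable to park at the edge the way a JavaScript bundle can be.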
...
This is the first of a series of URE articles about thermal management in data center environments—not theory, not “best practices,” but what actually happens when heat meets physics and scale.
Here’s a simple puzzle from two idle machines.
ai01 — home lab, Threadripper 32-core with 2× NVIDIA GPUs (NVLink), rack-level liquid cooling loop, used for ML training and vLLM inference:
Tctl:  +33.0°C
Tccd1: +33.2°C
Tccd5: +31.5°C

nj01 — third-party datacenter (colo), Ryzen 12-core, air-cooled:
...
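For anyone who wants to reproduce the comparison, here is one way to collect CPU temperature readings in the same shape on both machines. It assumes lm-sensors with JSON output (`sensors -j`) is available; the parser walks the output generically rather than hard-coding chip or label names.

```python
# Collect CPU temperature readings in one comparable format on both hosts.
# Assumes lm-sensors with JSON output (`sensors -j`); chips and labels
# are discovered generically rather than hard-coded.
import json
import subprocess

def read_temps() -> dict[str, float]:
    out = subprocess.run(["sensors", "-j"], capture_output=True, text=True, check=True)
    data = json.loads(out.stdout)
    temps = {}
    for chip, features in data.items():
        if not isinstance(features, dict):
            continue
        for label, readings in features.items():
            if not isinstance(readings, dict):
                continue
            for key, value in readings.items():
                # lm-sensors exposes current values as "temp*_input" fields
                if key.startswith("temp") and key.endswith("_input"):
                    temps[f"{chip}/{label}"] = float(value)
    return temps

if __name__ == "__main__":
    for name, celsius in sorted(read_temps().items()):
        print(f"{name:40s} +{celsius:.1f}°C")
```

Running the same script on ai01 and nj01 keeps the two readouts directly comparable.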