I’m currently working on the design of a framework for GPU fleet management.

We’re living in a crowded data center reality where everybody wants “hero” compute — dense GPUs, fast networking, and delivery that’s closer to the edge. We’re in a land-grab phase where every business wants to be everywhere, but most teams are discovering the same thing: buying GPUs is the easy part. Operating them as a coherent fleet is the hard part.

Inflection point #1: Footprint truth

This is the Day-0 onboarding schema: it produces footprint truth (spaces, layouts, responsibility boundaries) so the DCIM has a clean base before any telemetry or automation exists.

Footprint truth: the base mapping for GPU fleet presence.

This is the base mapping for footprint presence. Operations needs, at minimum, an answer to the business-critical question: “Where do we have computing resources?” Not where we think we have them — where they actually are, across colo, owned facilities, cages, suites, halls, edge rooms, whatever shape the footprint took over time. And for some multi-billion-dollar corporations, this isn’t a nice-to-have inventory view. It’s revenue-critical.
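To make that concrete, here is a minimal sketch of what a Day-0 footprint record could look like, assuming a simple Python data model. Every name here (Site, Space, SpaceKind, Responsibility) is a hypothetical illustration rather than any particular DCIM’s schema; the point is only that spaces, layouts, and responsibility boundaries become structured data before any telemetry or automation exists.

```python
from dataclasses import dataclass, field
from enum import Enum


class SpaceKind(Enum):
    """Shapes a footprint takes over time (illustrative taxonomy)."""
    CAGE = "cage"
    SUITE = "suite"
    HALL = "hall"
    EDGE_ROOM = "edge_room"


class Responsibility(Enum):
    """Who operates inside a space: us, the colo provider, or a shared split."""
    OWNED = "owned"    # our facility, our hands
    COLO = "colo"      # provider-operated, contract-bound
    SHARED = "shared"  # split responsibility; needs an explicit boundary note


@dataclass
class Space:
    """One bounded area inside a site: a cage, suite, hall, or edge room."""
    space_id: str
    kind: SpaceKind
    responsibility: Responsibility
    rack_positions: int  # layout capacity, not installed count
    notes: str = ""


@dataclass
class Site:
    """Day-0 footprint record: where compute physically is."""
    site_id: str
    operator: str  # "self" or the colo provider's name
    region: str
    spaces: list[Space] = field(default_factory=list)


# Example: a colo site inherited through an acquisition, mapped at onboarding.
site = Site(
    site_id="ams-colo-03",
    operator="ExampleColo",
    region="eu-west",
    spaces=[
        Space("ams-colo-03-cage-a", SpaceKind.CAGE, Responsibility.COLO,
              rack_positions=24),
        Space("ams-colo-03-hall-2", SpaceKind.HALL, Responsibility.SHARED,
              rack_positions=120,
              notes="power/cooling by provider; cabling and smart hands by us"),
    ],
)
```

The one design choice worth calling out: responsibility boundaries are first-class fields, not free-text notes, so whatever automation comes later has something unambiguous to key off.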

AI/ML fleets also don’t behave like the server fleets most of us grew up operating. A few years ago, deployment was relatively static: a golden image, a rebuild path, and you could run that pattern for most of a server’s lifetime. Telemetry was largely there to catch rare erratic behavior — usually thermal or power excursions that stood out from an otherwise predictable baseline.

Today, what used to be the outlier is the norm.

Now layer on the real world: M&A after M&A, each one bringing its own “DCIM” (or lack of it), its own naming conventions, its own cabling standards, its own runbooks — and somehow we call that a fleet. You inherit non-standard tools, spreadsheets that became systems, and the most dangerous pattern of all: multiple “Single Sources of Truth.” None of them is entirely wrong, none of them is fully correct, and all of them diverge over time.

At scale, it gets messier fast: dozens of facilities, hundreds of technicians, varying levels of rigor, mixed vendor contracts, and uneven power quality. Some sites brown out more often than others, and nobody correlates that with job slowdowns — until a training run slips a week and the blame game starts. Meanwhile, you’re juggling RMAs across manufacturers and suppliers, swapping parts in the field, dealing with lead times, firmware mismatches, and “it passed diagnostics” ghosts that come back the moment the rack is loaded.
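As a rough illustration of the correlation step that usually never happens, here is a sketch (all names are hypothetical: PowerEvent, JobRun, slow_jobs_near_power_events) that joins a site’s power-quality log against training-job runtimes and flags slow runs that overlapped a brownout at the same site:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PowerEvent:
    """A brownout or voltage sag logged at a site (illustrative facility feed)."""
    site_id: str
    at: datetime


@dataclass
class JobRun:
    """A training run with its expected vs. observed wall-clock duration."""
    job_id: str
    site_id: str
    start: datetime
    end: datetime
    expected: timedelta


def slow_jobs_near_power_events(jobs, events, slowdown=1.2, window=timedelta(hours=6)):
    """Flag jobs that ran noticeably long and overlapped (or closely followed)
    a power event at the same site."""
    flagged = []
    for job in jobs:
        actual = job.end - job.start
        if actual < job.expected * slowdown:
            continue  # not meaningfully slow
        for event in events:
            if event.site_id != job.site_id:
                continue
            # Count the event if it fell inside the run, or shortly before it started.
            if job.start - window <= event.at <= job.end:
                flagged.append((job.job_id, event.at, actual))
                break
    return flagged
```

Nothing here is sophisticated; the hard part is that the facility event log and the job history usually live in different systems owned by different teams.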

That’s the environment GPU operations live in now: not a clean lab, not a single blueprint — a constantly shifting map of assets, responsibility boundaries, and drift.

GPU fleets at enterprise scale remain largely uncharted territory; many of the old playbooks don’t apply to today’s demands. We’re hitting a data center operations wall that echoes the last decade’s SRE/DevOps inflection — but instead of “it works on my machine,” the sentence is: “it worked where I used to work.”

To be clear: some playbooks are excellent. NVIDIA’s DGX SuperPOD guidance is one of the best I’ve seen — it’s basically:

“Here’s the stack. Here are 10 racks. Here are the power, cooling, and airflow specs. Give me a room.”

It’s a polite way of forcing containment: creating a controlled operating bubble inside a larger environment that’s full of entropy — mixed standards, mixed facilities, mixed constraints, and constant drift.

NVIDIA is unusual in one important way: the DGX SuperPOD playbook is public. The hyperscalers — Google, Meta, Amazon, and others — obviously have their own playbooks too, forged in real incidents and scale. You just won’t find them in a PDF, and you shouldn’t expect to.

The old playbooks assume stability. But as Oren Harari put it: “Electric light didn’t come from the continuous improvement of candles.” New challenges require new approaches. The fleet is an organism. If your source of truth doesn’t breathe with it, it becomes fiction.


If you want to share annotations, sanity-check assumptions, or just start a conversation about this framework, see About.