Articles - GPU Fleet Ops & Resilience Notes


GPU Fleet AIOps: 7 LLM Backends, 6 Failure Scenarios

Abstract: After measuring a 292x cost gap between a rented B200 and frontier API providers on a batch inference workload, the logical next question was whether the same pattern held for operational intelligence: could a smaller, dedicated model handle the continuous judgment calls required to run a GPU fleet? The batch inference test had exposed KV cache contention as the dominant bottleneck on shared API infrastructure. Processing similar structured data at scale, but this time continuously rather than in batch, seemed like a valid test of whether that contention would degrade operational quality the same way it degraded throughput. ...

GPU Fleet AIOps: The Augmented Operator

Two in the morning, eighteen hours into the run. Seven LLM backends processing the same stream of GPU cluster anomalies. Same thermal cascades, same NVLink errors, same KV cache evictions. I’m watching the scoring dashboard update in real time and the numbers are breaking my assumptions faster than I can take notes. The $32-per-day model is getting the diagnosis wrong more often than a free one running on my workstation. ...

292x: Why Batch Inference Breaks on API Pricing

292x. That’s not a rounding error. That’s the cost multiplier between running a batch inference job on a rented B200 GPU and sending the same workload through Claude Opus 4.6’s API. The job was straightforward: generate one or two contextual sentences for each of a million documents, all of them JSON extracted from the corporate PDF archive I’ve been building a RAG pipeline around. Those sentences get prepended to each chunk before embedding into Qdrant as 768-dimensional vectors with BM25 sparse indexing. It’s the contextual layer that makes retrieval actually work, the step I described in the previous article about why a million PDFs won’t organize themselves. ...

A Million PDFs Won't Organize Themselves

Three months ago, I plugged a Blackwell GPU into my lab bench and pointed it at a million PDFs. Corporate documents: contracts, engineering reports, compliance filings, vendor proposals, maintenance logs, insurance certificates. The kind of archive that accumulates over two decades of running critical infrastructure. A million files. Not sampled, not curated, not cleaned. Raw. The plan was straightforward: parse the documents, chunk them, embed them into a vector store, wire up a retrieval layer, and start asking questions no keyword search could answer. Retrieval-Augmented Generation. The acronym that launched a thousand vendor decks. ...

Context Drift Kills AI Agents Before Latency Does

A few weeks ago we hit a production issue in a cloud environment: one XCP-ng host was showing IOPS contention caused by a single guest VM. The classic noisy-neighbor problem on shared storage. The diagnostic path was obvious: cross-reference the dom0 guest list with iostat on the host, find the VM hammering the disk, and work the problem from there. Straightforward correlation, the kind of thing an experienced operator resolves in fifteen minutes with two terminal windows. ...

Local LLM Bench: Scaling Swarms Beyond Four

Part 2 ended with a promise: find the cliff. Run the MoE model from four concurrent agents upward until the physics says stop. We scaled to eight. The cliff never came. This is Part 3 of the Local LLM Bench series. Part 1 covers the single-request baseline. Part 2 established the MoE advantage under concurrent load. The model: Qwen3-Coder-30B-A3B — a Mixture-of-Experts architecture that activates only 3.3B of its 30B parameters per token. On consumer GPUs, that sparse activation leaves ~90% of memory bandwidth idle at batch size 1, creating headroom that concurrent agents fill. Dense models activate all 32B parameters on every token — already at the bandwidth ceiling before the second agent connects. Part 1 explains why these specific models were chosen (best in class for each architecture); Part 2 conclusively eliminated Dense under concurrent load. This benchmark tests MoE only. ...
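A quick sketch of the sparse-activation arithmetic, using only the parameter counts quoted above (3.3B active of 30B for MoE, 32B for Dense). The function name is hypothetical, and treating per-token weight traffic as proportional to active parameters is a simplifying assumption, not a measurement from the benchmark.

```python
def active_fraction(active_params_b: float, total_params_b: float) -> float:
    """Share of a model's weights that are read to produce one token."""
    if total_params_b <= 0:
        raise ValueError("total_params_b must be positive")
    return active_params_b / total_params_b


# Figures quoted in the text, in billions of parameters.
moe = active_fraction(3.3, 30.0)     # MoE: 3.3B of 30B active per token
dense = active_fraction(32.0, 32.0)  # Dense: every weight, every token

print(f"MoE weights touched per token:   {moe:.0%}")
print(f"MoE weights untouched per token: {1 - moe:.0%}")
print(f"Dense weights touched per token: {dense:.0%}")
```

Under that assumption the MoE model touches about 11% of its weights per token, which is one way to read the ~90% headroom figure; the Dense model has no such slack.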

Local LLM Bench: Best Model for Coding Swarms

In Part 1, we established the baseline: MoE delivers 168 tok/s on a single RTX 3090, 4.1x faster than Dense. Clean single-request numbers. One prompt in, one response out. That’s not how swarms work. An orchestrator like Claude Code dispatches four coding tasks simultaneously. The local model serves all four. Under concurrency, memory bandwidth saturates, per-task throughput drops, and the architecture of the model — not the GPU, the model — determines whether you get useful parallelism or just contention. ...

The Heat Nobody Counts - PUE Ends at the Meter

Meta’s Prometheus data center in New Albany, Ohio is scaling to 1.2 GW. To get there, they’re building behind-the-meter natural gas turbines — two 200 MW Socrates generation facilities, supplied by dedicated gas pipelines, isolated from the grid. In Virginia, the same story plays out with diesel generators, enough of them that it became the top legislative concern entering the 2026 session. The industry talks about PUE as if it were a verdict on environmental efficiency. It isn’t. PUE measures one envelope — the data center facility. Total facility power divided by IT equipment power. A PUE of 1.3 means 30% overhead for cooling, lighting, and support systems. That’s the metric everyone optimizes, the number that shows up in sustainability reports, the figure that earns applause at conferences. ...
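The PUE definition above fits in a few lines. This is a minimal sketch: the function names and the 1,300 kW / 1,000 kW figures are hypothetical, chosen only to reproduce the 1.3 example from the text.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness: total facility power / IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("it_equipment_kw must be positive")
    return total_facility_kw / it_equipment_kw


def overhead_vs_it_load(pue_value: float) -> float:
    """Non-IT power (cooling, lighting, support) as a fraction of IT power."""
    return pue_value - 1.0


# Hypothetical facility: 1,000 kW of IT load drawing 1,300 kW at the meter.
site_pue = pue(total_facility_kw=1300.0, it_equipment_kw=1000.0)
print(f"PUE: {site_pue:.2f}")
print(f"Overhead relative to IT load: {overhead_vs_it_load(site_pue):.0%}")
```

Note that both inputs are measured inside the facility envelope. Nothing upstream of the meter, such as losses at a generation plant, enters the ratio.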

Local LLM Bench: MoE vs Dense on One RTX 3090

I went looking for sustained-load benchmarks comparing MoE and Dense coding models on consumer GPUs. Not demo bursts on a Mac Mini — sustained autoregressive generation on real coding tasks, where architecture and interconnect are the only variables. I found plenty of one-shot numbers. Nobody had published the comparison that matters: same hardware, same quantization, same inference engine, MoE versus Dense, across GPU configurations. Methodology visible. Numbers verifiable. So I ran the tests. Dual RTX 3090s with NVLink, custom liquid cooling, a 6 kW isolation transformer feeding a double-conversion UPS. Not elegant, but thermally and electrically honest — sustained inference loads without throttling, no measurement fiction. The hardware details are below. ...

Kudos to Anthropic - Governments Bury Ecosystems

Last Friday, the White House ordered every federal agency to stop using Anthropic products within six months. The Defense Secretary designated the company a “supply chain risk to national security” — a label normally reserved for foreign adversaries like Huawei or Kaspersky. Anthropic’s crime: they refused to remove two safety guardrails from Claude before deploying it on classified Pentagon networks. No AI for mass domestic surveillance of American citizens. No fully autonomous weapons without human oversight. ...

Everybody Spies: Sovereignty and the AI Land Grab

In Brazil, when advising a customer on endpoint security, there was a mental model we never said out loud. The technical discussion would cover detection rates, false positives, memory footprint — the usual. But underneath it ran a question that never made it into the RFP: who do you want knowing what you’re doing? Russians or Americans? Kaspersky was the default for most of the market — and not because of ideology. Norton and Symantec had spent years earning their reputation for turning Windows machines into molasses, and McAfee was McAfee. Kaspersky worked. It was lighter, faster, cheaper. The fact that its telemetry flowed to Moscow rather than Langley was a feature, not a bug, depending on which side of the table you sat on. ...

The Concorde Problem in AI Infrastructure

The Concorde burned one ton of fuel per passenger to cross the Atlantic. One hundred seats. Three and a half hours. Mach 2. The most advanced commercial aircraft ever built — and every engineer who saw it wanted to believe it was the future. The 747 did the same crossing in seven hours. Four hundred seats. A quarter of the fuel per passenger. No afterburners. No sonic boom. No government subsidies keeping it alive. ...

Building Trust in Security: Part 3

This is the third and final part of a series based on a real-world engagement: a company that scaled from $40M to $1B in annual revenue in just five years, and the security program that had to grow with it. This is a story about building high-performance operating systems where security, standards, architecture, and performance act as enablers rather than constraints. Part 1: Earning credibility before you’ve earned authority. Part 2: Blurring the lines — Security at the SRE and Operations level. Part 3: Wrapping the gift — Transparency and agency. The Quality That Can’t Be Purchased I’ve been writing around this idea for a while — in Cold Aisle Trenches, in why standards fail when you try to impose them, in how defense in depth actually works at scale. The thread is always the same: security can’t be bought. You can’t swipe a credit card and receive “secure” in a box. It’s a quality that emerges — like the lights-out data center you don’t chase but eventually arrive at, because every other piece fell into place first. ...

Building Trust in Security: Part 2

This is the second of a three-part series based on a real-world engagement: a company that scaled from $40M to $1B in annual revenue in just five years, and the security program that had to grow with it. This is a story about building high-performance operating systems where security, standards, architecture, and performance act as enablers rather than constraints. Part 1: Earning credibility before you’ve earned authority. Part 2: Blurring the lines - Security at the SRE and Operations level. Part 3: Wrapping the gift — Transparency and agency. From Trust to Reliance ...

Building Trust in Security: Part 1

This is the first of a three-part series based on a real-world engagement: a company that scaled from $40M to $1B in annual revenue in just five years, and the security program that had to grow with it. This is a story about building high-performance operating systems where security, standards, architecture, and performance act as enablers rather than constraints. Part 1: Earning credibility before you’ve earned authority. Part 2: Blurring the lines - Security at the SRE and Operations level. Part 3: Wrapping the gift — Transparency and agency. The Inflection Point A few years back, AMTI was at the heart of a fascinating corporate challenge. I was serving as a fractional CISO and advisor for a company standing at a critical inflection point. ...

Why Foreign AI Specialists Keep Failing

Context got commoditized. Translation is next. When my company’s acquisition closed in 2024, I thought about pursuing a psychology degree in the US. The impulse was the same one that drives URE: wanting to understand how things are wired under the hood. My wife shut it down—“Really? You know that’s not going to work”—and she was right, though neither of us fully understood why at the time. What I was actually chasing wasn’t psychology. It was context. ...

Cold Aisle Trenches: When Theory Hits the Asphalt

A bricked storage array, a 2+4 SLA that technically performed, and a technician asking about lunch while executives circled. We learned that risk transfer is an illusion when your blood is on the floor.
January 2026 · Stefano Schotten
The contract was honored. The business still bled. My case manager called me from the customer site. I could hear the tension before he said a word. “The VPs are pacing. Four of them, maybe five. They’re all just… standing around IT, watching.” ...

Cold Aisle Trenches: You Don't Chase Lights-Out

It was 2017. We had just deployed an additional ScaleIO cluster to handle the onboarding of a new customer with hundreds of VMs. Eight nodes, each with 40 Gbps at the backend. Beautiful. Efficient. The whole rack was a work of art: Dell R740s with MD1220 expansions, bezels removed so you could see all those drives blinking in perfect synchronization. The cluster had been in place for less than two weeks. I told the customer to “burn it.” ...

AI and Society: Three Phases of Tech Adoption

I see people everywhere anxious about whether AI will disrupt their jobs, their industries, their lives. I’ve always approached this with calm. Not indifference—calm. The future rarely sends advance notice, but it is always arriving. This isn’t news. It’s the human condition. A few years ago, I attended a keynote by Michio Kaku where he framed—perfectly, for me—the relationship between humanity and technological change. What follows is my version. I can’t claim novelty, and I’m not a domain expert in sociology or economics. I’m an infrastructure builder observing the same pattern from the inside. ...

The Entropy of Sovereign AI: Map vs. Territory

A few years ago, I was having dinner with the Americas VP of a European energy supermajor — one of those companies that extracts oil from war zones, negotiates with regimes that don’t appear on polite lists, and operates in places where “political risk” means your assets might get nationalized or your personnel kidnapped. Seventy-plus countries. Active operations in Libya, Nigeria, Angola, Myanmar, Yemen. The kinds of places where security briefings come before breakfast. ...

The Lone Wolf Starves First

A few months ago I read Project Hail Mary and found myself thinking about observation and agency. Einstein didn’t “invent” time dilation; he created the conditions to perceive it. Without the means to observe, you’re just touching walls in complete darkness. Trial and error, yes, but you never truly know the depth of what you’re sensing. Saturday mornings I take my son to flag football. He’s been in martial arts for half his life, and his coach loves his resilience. But something surfaced in team sports that doesn’t appear on the mat. ...

It Took a Pandemic to Learn Why Standards Failed

In 2015, I did what seemed like the mature thing to do. I created a Production Engineering department. My college foundation was production engineering. I was a true believer: if we formalized standards and assigned a dedicated group to own operational rigor, the organization would naturally converge toward consistency. The mandate: Create SOPs. Define standards. Reduce variance. Improve reliability. On paper, it was textbook. In practice, it was a slow-motion collision with reality. ...

From Security to Resilience: Defense in Depth

Most security programs are built around preventing bad things from happening. That’s necessary but insufficient. At AMTI, where I served as CTO and led infrastructure security for a multi-tenant cloud serving customers from single-VM deployments to enterprise DRaaS contracts spanning hundreds of miles of metro fiber, I learned that mature security is about resilience: the capacity to detect, contain, and recover faster than adversaries can escalate.
The Visibility Problem at Scale
Operating a cloud service provider on your own ASN creates a specific governance challenge: you’re the abuse contact, but in a GDPR-compliant architecture, you have no visibility into customer data. Encrypted traffic is opaque by design. This constraint forced architectural discipline: we couldn’t inspect our way to security, so we had to instrument our way there. ...

When Lack of Guardrails Hurt the Business

Every company says security is a core value. Few embed it as a design constraint. The difference shows up when things break. I get a call from a co-founder I’ve known for years. His company just raised a $400M+ Series D. His voice is flat: “We have a problem.” Same day, we’re on a call. He’s a skilled engineer, and he’s personally devastated. They leaked over 2 million user records. Home addresses. Phone numbers. The full profile. The data had been publicly accessible for three weeks before anyone noticed. ...

When the Constraint Isn’t Capacity

A few years ago, as Field CTO for an enterprise customer, I was pulled into a rescue effort that started the way these stories usually start: pain, urgency, and a narrative that felt convenient. The application hit a bootstorm—150,000+ users slamming it in a short window—and then the predictable second-order effect: every day after that, more tickets piled up. Instability. Session timeouts. Intermittent failures. The kind of symptoms that turn a service into a rumor. ...

Security Assurance - URE Case - 1/5 - The Inception

This is the first of five short posts on Security Assurance Engineering. The goal is simple: separate security intent from security proof, and show what “assurance” looks like when you treat a system as real (owned, changing, and measurable). I’ll use URE as the working surface. URE is the platform where I publish research notes and operating practice generated in my lab, work that started as a few shared threads with friends and peers, and eventually became worth “productizing” into something durable and navigable. ...

Security Assurance - URE Case - 2/5 - Trust Boundaries

In mature environments, we don’t start with implementation. We start with boundaries and ownership. Before anyone spins up “a simple website/blog,” we make three things explicit: What is the system? (scope and components) Who can change it? (identities and permissions) What must always remain true? (invariants + guardrails) Security should be intentional. The goal is to create guardrails the rest of the team can rely on, so delivery is fast and the system stays trustworthy under change. ...

Security Assurance - URE Case - 3/5 - The Design

Design is where “a simple website” becomes a real system. Not because the pages are complex, but because the moment you publish, you inherit real dependencies: DNS, build pipelines, third parties, telemetry, and the drift that comes with change. So before we build anything, we do one unglamorous thing: ...

Security Assurance - URE Case - 4/5 - Enabler

Security enables the business when it shows up with agency: not just identifying risk, but carrying enough context to propose solutions that preserve the mission. That requires a maturity shift. When security arrives late, it often speaks in “non-English.” It blocks because the system is already committed to choices no one can defend. ...

Security Assurance - URE Case - 5/5 - Conclusion

Security Assurance Engineering is not a side quest. It’s not a compliance ritual. And it’s not a “security team thing.” It’s what turns security from intent into proof, in systems that are owned, changing, and measurable. Across these chapters, the arc is consistent: Part 1/5 (Inception): Architecture sets the invariants. Assurance proves they still hold under change. Part 2/5 (Trust Boundaries): If the boundary isn’t explicit, you don’t have a system; you have assumptions. Part 3/5 (Design): The tedious questions aren’t bureaucracy; they are how you prevent accidental scope and irreversible drift. Part 4/5 (Security as Enabler): Done well, security doesn’t slow delivery; it restores optionality and keeps the mission intact under real pressure. The takeaway is simple: ...

Business Resiliency Through Security Assurance

Every company says security is a priority. Every company also ships under pressure. The gap between those two statements is where businesses bleed. I’ve watched organizations with excellent engineers and serious budgets still get humbled by the same pattern: teams optimize locally (features, velocity, “my backlog”), while the system pays globally (incidents, outages, churn, reputational drag). When things go south, it rarely takes a cinematic attacker or a once-in-a-decade failure. ...

MEP Providers Are Never in the Postmortem

In 2021, I bought a home in Florida. The closing was in August, so imagine hot summer days with temperatures over 100°F and humidity over 80%. When we selected the builder, I noted two things: HVAC rated at 15 SEER and R-39 insulation. My house would be minimally energy efficient. I had no option to upgrade the HVAC, but 15 SEER is “good enough.” In the first week in the house, my wife realized I was getting bothered every time the compressor kicked in: there was a subtle, almost imperceptible dip in the lights. Nobody else noticed it, but I did. Battle-proven engineer with experience in thermal and power transients. What could happen? ...

Why GPU Fleet Control Starts with a Map

I’m currently working on the design of a framework for GPU fleet management. We’re living in a crowded data center reality where everybody wants “hero” compute — dense GPUs, fast networking, and delivery that’s closer to the edge. We’re in a land-grab phase where every business wants to be everywhere, but most teams are discovering the same thing: buying GPUs is the easy part. Operating them as a coherent fleet is the hard part. ...

Project Atlas: Technical Stack

Atlas is a single pane of glass for multi-cloud cost visibility. This post documents the pipeline: ingestion, streaming, storage, query, forecasting, and visualization.

Tail Latency Killed My Beowulf Cluster in 2006

Right now, I’m working on an InfiniBand topology design for a GPU cluster. The math keeps pointing to the same conclusion: scale-out only makes sense when scale-in has topped out. It’s not about CUDA cores. It’s not about tensor throughput. It’s about tail latency. NVLink keeps GPU-to-GPU communication on-package or over short copper links: no NIC, no PCIe host traversal, no protocol stack. For small messages, that means latency in the hundreds of nanoseconds. InfiniBand NDR switches add sub-microsecond port-to-port latency, but once you include the full path (PCIe to the NIC, driver overhead, fabric hops, and back), real-world GPU-to-GPU latency across nodes often lands in the 3-10 μs range depending on message size and topology. ...

Telemetry That Lies: GPU Thermal Monitoring

The “Everything Is Green” Problem
Here’s a realistic scenario I’ve seen in different forms across fleets (this is a composite, not a single true story with exact numbers): A training run is supposed to take ~3–4 weeks. Two weeks in, someone notices the timeline slipping. Not a crash. Not a failure. Just… slow. The job is running 10–30% behind plan, and nobody can point to a smoking gun. The dashboards look perfect: ...

Predictive Power Conditioning for GPU Clusters

GPU clusters don’t fail from sustained load. They fail on transitions. A pod idling at 20 kW can step toward 300 kW quickly when training begins. The peak matters, but the killer is the step: the dP/dt that forces every layer of the electrical path to react at once. Thermals matter too—but they’re secondary and collateral. Power transients can push protection and control behavior in cycles. Thermal consequences show up later as throttling, efficiency loss, and “mysteriously slower training” that looks like a software problem until you instrument the facility. ...
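To make the step concrete, here is a minimal sketch of the average dP/dt for the 20 kW to 300 kW transition described above. The ramp durations are hypothetical placeholders (the text only says the step happens "quickly"), and the function name is illustrative.

```python
def power_step_rate_kw_per_s(idle_kw: float, peak_kw: float, ramp_seconds: float) -> float:
    """Average dP/dt over a load step, in kW per second."""
    if ramp_seconds <= 0:
        raise ValueError("ramp_seconds must be positive")
    return (peak_kw - idle_kw) / ramp_seconds


IDLE_KW = 20.0   # pod at idle, from the text
PEAK_KW = 300.0  # pod under training load, from the text

# Hypothetical ramp times: the same 280 kW step, spread over less and less time.
for ramp_s in (60.0, 1.0, 0.1):
    rate = power_step_rate_kw_per_s(IDLE_KW, PEAK_KW, ramp_s)
    print(f"ramp of {ramp_s:>5.1f} s -> average dP/dt of {rate:,.0f} kW/s")
```

The peak is identical in all three cases. Only the rate changes, and the rate is what the electrical path has to absorb.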

AI Infrastructure Placement Is a Business Decision

Traditional internet architecture solved latency with caching. Static content, images, JavaScript bundles—all pushed to edge nodes milliseconds from users. CDNs achieve 95-99% cache hit rates. The compute stays centralized; the content moves to the edge. AI breaks this model completely. Every inference requires real GPU cycles. You can’t cache a conversation. You can’t pre-compute a response to a question that hasn’t been asked. The token that completes a sentence depends on every token before it. ...

HVAC Doesn't Create Cold - It Removes Heat

This is the first of a series of URE articles about thermal management in data center environments: not theory, not “best practices,” but what actually happens when heat meets physics and scale. Here’s a simple puzzle from two idle machines.
ai01, a home lab machine (Threadripper 32-core with 2× NVIDIA GPUs on NVLink, rack-level liquid cooling loop, used for ML training and vLLM inference):
Tctl: +33.0°C
Tccd1: +33.2°C
Tccd5: +31.5°C
nj01, in a third-party datacenter (colo), Ryzen 12-core, air-cooled: ...