URE — Unified Resilience Engineering
The human brain is hardwired for the tangible. We understand what we’ve touched, carried, walked through. Ask an American to picture 2.3 kilos of meat and you’ll get a blank stare. Tell a European to walk 80 feet down a corridor instead of counting three doors, and they’ll overshoot it. We don’t process abstract units — we process experience.
Now scale that problem up.
Austin, Texas sprawls across 305 square miles. San José, California covers 180. Each one draws roughly one gigawatt from the grid. That’s an entire city — hospitals, traffic lights, air conditioning, schools, everything humming at once. One gigawatt.
Outside Houston, a single data center campus sits on less than one square mile. It draws the same gigawatt.
Condense everything Austin consumes — every house, every hospital, every streetlight — into a footprint smaller than a neighborhood park. That’s what a hyperscale data center is. And more than a dozen of them are being built across the United States right now.
Want to get serious?
A grizzly bear weighs about 600 pounds. The rat behind your local dumpster weighs about two. That’s a 300-to-1 ratio — roughly the same ratio between the sprawl of Austin, Texas, and the footprint of a single gigawatt data center campus. Except the grizzly doesn’t consume 300 times more oxygen. The data center consumes every watt the city does.
That’s the beast we’ve built. A rat with a grizzly’s appetite — lab-made, power-dense, and nothing in the old playbook was designed to feed it. What goes in as power comes out as heat. Every watt. No exceptions.
This is a physics problem. When you compress a gigawatt into a square mile, everything downstream — power conditioning, thermal capacity, transient management, reliability — behaves differently than anything we’ve operated before. The rules that governed traditional data centers don’t scale to AI infrastructure energy density. The playbooks don’t transfer. The dashboards lie.
There are no playbooks written for the AI era — not for AI Factories, Token Factories, or Deep Training Facilities. We’re talking about ML and RL training jobs spanning thousands of nodes on a typical Tuesday.
For perspective: five years ago, the most powerful computer on Earth was Fugaku — 159,000 nodes, 432 racks, a billion dollars of Japanese national investment, drawing 30 megawatts. Countries bragged about it. Scientists waited in line for access. It was a generational achievement.
Today, a NeoCloud launches a “modest” 300-megawatt facility — ten Fugakus worth of power — and it doesn’t even make headlines.
Physics doesn’t negotiate. URE starts there.
My work — building data centers from scratch in highly constrained environments — shaped a particular vision of how the stack is actually built. Not how it’s drawn in architecture diagrams, but how it behaves under load, under budget pressure, and under the laws of thermodynamics.
This world isn’t one-size-fits-all. URE connects the seams between operational layers — power, thermal, compute, network, cost, compliance — and turns them into reliable deliverables with real economic value. Not another dashboard. Not another abstraction. A reasoning method built from twenty years of scars for an era that doesn’t yet have playbooks.
I break the infrastructure stack into five operational layers: foundation, AI-era baseline, infrastructure and hardware, software and orchestration, and economics — each one inheriting the failures of the layer below it. That framework is the backbone of everything I publish here.
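The layer names above are the framework itself; everything else in this sketch (the fields, the failure example, the propagation helper) is an illustrative assumption, included only to show the one property that matters: each layer inherits the unresolved failures of the layer below it.

```python
from dataclasses import dataclass, field

@dataclass
class Layer:
    """One operational layer in the five-layer stack."""
    name: str
    concerns: list[str]
    failures: list[str] = field(default_factory=list)

# Ordered bottom-up: index 0 is the physical foundation, index 4 is economics.
STACK = [
    Layer("foundation", ["utility feed", "power conditioning", "thermal capacity"]),
    Layer("AI-era baseline", ["density limits", "transient behavior", "reliability targets"]),
    Layer("infrastructure & hardware", ["GPU nodes", "fabric", "storage"]),
    Layer("software & orchestration", ["schedulers", "drivers", "observability"]),
    Layer("economics", ["cost per token", "SLA exposure", "capacity planning"]),
]

def propagate(stack: list[Layer]) -> None:
    """Each layer inherits the unresolved failures of the layer directly below it."""
    for below, above in zip(stack, stack[1:]):
        above.failures.extend(below.failures)

# Hypothetical example: a thermal fault at the foundation eventually surfaces
# as an economics problem at the top of the stack.
STACK[0].failures.append("chilled-water setpoint drift")
propagate(STACK)
print(STACK[-1].failures)  # ['chilled-water setpoint drift']
```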
This is the first of a three-part series based on a real-world engagement: a company that scaled from $40M to $1B in annual revenue in just five years, and the security program that had to grow with it.
This is a story about building high-performance operating systems where security, standards, architecture, and performance act as enablers rather than constraints.
Part 1: Earning credibility before you’ve earned authority.
Part 2: Blurring the lines — Security at the SRE and Operations level.
Part 3: Wrapping the gift — Transparency and agency.

The Inflection Point

A few years back, AMTI was at the heart of a fascinating corporate challenge. I was serving as a fractional CISO and advisor for a company standing at a critical inflection point.
...
This is the second of a three-part series based on a real-world engagement: a company that scaled from $40M to $1B in annual revenue in just five years, and the security program that had to grow with it.
This is a story about building high-performance operating systems where security, standards, architecture, and performance act as enablers rather than constraints.
Part 1: Earning credibility before you’ve earned authority.
Part 2: Blurring the lines — Security at the SRE and Operations level.
Part 3: Wrapping the gift — Transparency and agency.

From Trust to Reliance
...
This is the third and final part of a series based on a real-world engagement: a company that scaled from $40M to $1B in annual revenue in just five years, and the security program that had to grow with it.
This is a story about building high-performance operating systems where security, standards, architecture, and performance act as enablers rather than constraints.
Part 1: Earning credibility before you’ve earned authority.
Part 2: Blurring the lines — Security at the SRE and Operations level.
Part 3: Wrapping the gift — Transparency and agency.

The Quality That Can’t Be Purchased

I’ve been writing around this idea for a while — in Cold Aisle Trenches, in why standards fail when you try to impose them, in how defense in depth actually works at scale. The thread is always the same: security can’t be bought. You can’t swipe a credit card and receive “secure” in a box. It’s a quality that emerges — like the lights-out data center you don’t chase but eventually arrive at, because every other piece fell into place first.
...
It was 2017. We had just deployed an additional ScaleIO cluster to handle the onboarding of a new customer with hundreds of VMs. Eight nodes, each with 40 Gbps at the backend. Beautiful. Efficient. The whole rack was a work of art—Dell R740s with MD1220 expansions, bezels removed so you could see all those drives blinking in perfect synchronization.
The cluster had been deployed less than two weeks earlier. I told the customer to “burn it.”
...
A few years ago, I was having dinner with the Americas VP of a European energy supermajor — one of those companies that extracts oil from war zones, negotiates with regimes that don’t appear on polite lists, and operates in places where “political risk” means your assets might get nationalized or your personnel kidnapped.
Seventy-plus countries. Active operations in Libya, Nigeria, Angola, Myanmar, Yemen. The kinds of places where security briefings come before breakfast.
...
Context got commoditized. Translation is next.
When my company’s acquisition closed in 2024, I thought about pursuing a psychology degree in the US. The impulse was the same one that drives URE: wanting to understand how things are wired under the hood. My wife shut it down—“Really? You know that’s not going to work”—and she was right, though neither of us fully understood why at the time.
What I was actually chasing wasn’t psychology. It was context.
...
A bricked storage array, a 2+4 SLA that technically performed, and a technician asking about lunch while executives circled. We learned that risk transfer is an illusion when your blood is on the floor.
January 2026 · Stefano Schotten
The contract was honored. The business still bled.
My case manager called me from the customer site. I could hear the tension before he said a word.
“The VPs are pacing. Four of them, maybe five. They’re all just… standing around IT, watching.”
...
I see people everywhere anxious about whether AI will disrupt their jobs, their industries, their lives. I’ve always approached this with calm. Not indifference—calm.
The future rarely sends advance notice, but it is always arriving. This isn’t news. It’s the human condition.
A few years ago, I attended a keynote by Michio Kaku where he framed—perfectly, for me—the relationship between humanity and technological change. What follows is my version. I can’t claim novelty, and I’m not a domain expert in sociology or economics. I’m an infrastructure builder observing the same pattern from the inside.
...
A few months ago I read Project Hail Mary and found myself thinking about observation and agency. Einstein didn’t “invent” spacetime dilation—he created the conditions to perceive it. Without the means to observe, you’re just touching walls in complete darkness. Trial and error, yes, but you never truly know the depth of what you’re sensing.
Saturday mornings I take my son to flag football. He’s been in martial arts for half his life—his coach loves his resilience. But something surfaced in team sports that doesn’t appear on the mat.
...
In 2015, I did what seemed like the mature thing to do. I created a Production Engineering department.
My college foundation was production engineering. I was a true believer: if we formalized standards and assigned a dedicated group to own operational rigor, the organization would naturally converge toward consistency.
The mandate: Create SOPs. Define standards. Reduce variance. Improve reliability.
On paper, it was textbook. In practice, it was a slow-motion collision with reality.
...
Most security programs are built around preventing bad things from happening. That’s necessary but insufficient. At AMTI, where I served as CTO and led infrastructure security for a multi-tenant cloud serving customers from single-VM deployments to enterprise DRaaS contracts spanning hundreds of miles of metro fiber, I learned that mature security is about resilience: the capacity to detect, contain, and recover faster than adversaries can escalate.
The Visibility Problem at Scale

Operating a cloud service provider on your own ASN creates a specific governance challenge: you’re the abuse contact, but in a GDPR-compliant architecture, you have no visibility into customer data. Encrypted traffic is opaque by design. This constraint forced architectural discipline: we couldn’t inspect our way to security, so we had to instrument our way there.
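As a hedged sketch of what instrumenting without inspecting can look like: aggregate flow metadata per tenant and flag abuse signals from traffic shape alone. The record fields and thresholds below are assumptions for illustration, not AMTI's actual tooling.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class FlowRecord:
    """Metadata only: no payloads, no decryption, no customer content."""
    tenant: str
    dst_port: int
    bytes_out: int

def abuse_signals(flows: list[FlowRecord],
                  egress_limit_bytes: int = 10**10,
                  smtp_flow_limit: int = 1_000) -> dict[str, list[str]]:
    """Flag tenants purely from traffic shape (illustrative thresholds)."""
    egress = defaultdict(int)
    smtp_flows = defaultdict(int)
    for f in flows:
        egress[f.tenant] += f.bytes_out
        if f.dst_port == 25:
            smtp_flows[f.tenant] += 1

    signals: dict[str, list[str]] = defaultdict(list)
    for tenant, total in egress.items():
        if total > egress_limit_bytes:
            signals[tenant].append("egress volume anomaly")
    for tenant, count in smtp_flows.items():
        if count > smtp_flow_limit:
            signals[tenant].append("possible outbound spam (port 25 fan-out)")
    return dict(signals)
```

The point is not the thresholds; it is that every signal above is derivable without decrypting a single packet.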
...
Every company says security is a core value. Few embed it as a design constraint. The difference shows up when things break.
I get a call from a co-founder I’ve known for years. His company just raised $400M+ Series D. His voice is flat: “We have a problem.” Same day, we’re on a call. He’s a skilled engineer — personally devastated. They leaked over 2 million user records. Home addresses. Phone numbers. The full profile. The data had been publicly accessible for three weeks before anyone noticed.
...
A few years ago, as Field CTO for an enterprise customer, I was pulled into a rescue effort that started the way these stories usually start: pain, urgency, and a narrative that felt convenient.
The application hit a bootstorm—150,000+ users slamming it in a short window—and then the predictable second-order effect: every day after that, more tickets piled up. Instability. Session timeouts. Intermittent failures. The kind of symptoms that turn a service into a rumor.
...
1/5 — The Inception
Series: Security Assurance — URE Case — 1/5
This is the first of five short posts on Security Assurance Engineering. The goal is simple: separate security intent from security proof, and show what “assurance” looks like when you treat a system as real—owned, changing, and measurable.
I’ll use URE as the working surface. URE is the platform where I publish research notes and operating practice generated in my lab—work that started as a few shared threads with friends and peers, and eventually became worth “productizing” into something durable and navigable.
...
2/5 — Trust Boundaries
Series: Security Assurance — URE Case — 2/5
In mature environments, we don’t start with implementation. We start with boundaries and ownership.
Before anyone spins up “a simple website/blog,” we make three things explicit:
1. What is the system? (scope and components)
2. Who can change it? (identities and permissions)
3. What must always remain true? (invariants + guardrails)

Security should be intentional. The goal is to create guardrails the rest of the team can rely on—so delivery is fast and the system stays trustworthy under change.
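One way to keep those three answers from living only in people's heads is to write them down in a machine-checkable form and assert them on every change. The manifest and checks below are a hypothetical sketch under that assumption, not the actual URE pipeline.

```python
# Hypothetical manifest: the three questions above, made explicit and checkable.
MANIFEST = {
    "system": {   # What is the system?
        "name": "ure-site",
        "components": ["static site", "build pipeline", "DNS zone", "analytics"],
    },
    "owners": {   # Who can change it?
        "content": ["author"],
        "infrastructure": ["author", "ci-bot"],
    },
    "invariants": [   # What must always remain true?
        "served over HTTPS only",
        "no third-party trackers",
        "deploys happen only from the main branch via CI",
    ],
}

def check_invariants(observed: dict) -> list[str]:
    """Return violations; run in CI and fail the build if the list is non-empty."""
    violations = []
    if not observed.get("https_only", False):
        violations.append("HTTPS-only invariant broken")
    if observed.get("third_party_trackers", 0) > 0:
        violations.append("tracker invariant broken")
    if observed.get("deploy_source") != "ci/main":
        violations.append("deploy-path invariant broken")
    return violations

if __name__ == "__main__":
    ok = {"https_only": True, "third_party_trackers": 0, "deploy_source": "ci/main"}
    print(check_invariants(ok))  # [] means the guardrails still hold
```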
...
3/5 — The Design
Series: Security Assurance — URE Case — 3/5
Design is where “a simple website” becomes a real system.
Not because the pages are complex—but because the moment you publish, you inherit real dependencies: DNS, build pipelines, third parties, telemetry, and the drift that comes with change. So before we build anything, we do one unglamorous thing:
...
4/5 — Security as an Enabler (and “forward agency”)
Series: Security Assurance — URE Case — 4/5
Security enables the business when it shows up with agency: not just identifying risk, but carrying enough context to propose solutions that preserve the mission.
That requires a maturity shift.
When security arrives late, it often speaks in “non-English.” It blocks because the system is already committed to choices no one can defend.
...
5/5 — Conclusion — Assurance Without Theater
Series: Security Assurance — URE Case — 5/5
Security Assurance Engineering is not a side quest. It’s not a compliance ritual. And it’s not a “security team thing.”
It’s what turns security from intent into proof—in systems that are owned, changing, and measurable.
Across these chapters, the arc is consistent:
Part 1/5 (Inception): Architecture sets the invariants. Assurance proves they still hold under change.
Part 2/5 (Trust Boundaries): If the boundary isn’t explicit, you don’t have a system—you have assumptions.
Part 3/5 (Design): The tedious questions aren’t bureaucracy; they are how you prevent accidental scope and irreversible drift.
Part 4/5 (Security as Enabler): Done well, security doesn’t slow delivery—it restores optionality and keeps the mission intact under real pressure.

The takeaway is simple:
...
Every company says security is a priority. Every company also ships under pressure.
The gap between those two statements is where businesses bleed.
I’ve watched organizations with excellent engineers and serious budgets still get humbled by the same pattern: teams optimize locally (features, velocity, “my backlog”), while the system pays globally (incidents, outages, churn, reputational drag). When things go south, it rarely takes a cinematic attacker or a once-in-a-decade failure.
...
In 2021, I bought a home in Florida. The closing was in August, so imagine the hot summer days with temperatures over 100 degrees and humidity over 80%.
When we selected the builder, I noted two things: HVAC at 15 SEER and R-39 insulation. My house would be at least minimally energy-efficient. I had no option to upgrade the HVAC, but 15 SEER is “good enough”.
First week in the house, my wife realized I was getting bothered every time the compressor kicked in: a subtle, almost imperceptible dip in the lights. Nobody else noticed it, but I did. Battle-proven engineer with experience in thermal and power transients. What could happen?
...
I’m currently working on the design of a framework for GPU fleet management.
We’re living in a crowded data center reality where everybody wants “hero” compute — dense GPUs, fast networking, and delivery that’s closer to the edge. We’re in a land-grab phase where every business wants to be everywhere, but most teams are discovering the same thing: buying GPUs is the easy part. Operating them as a coherent fleet is the hard part.
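As a sketch of what a coherent fleet means operationally: one health record per node, and one function that answers what is actually schedulable right now. The fields and checks are illustrative assumptions, not the framework itself.

```python
from dataclasses import dataclass

@dataclass
class GpuNode:
    name: str
    gpus_total: int
    gpus_healthy: int        # after ECC/XID/thermal checks, however you collect them
    fabric_ok: bool          # NIC and link validation passed
    firmware_baseline: bool  # matches the fleet-wide pin

def schedulable_gpus(fleet: list[GpuNode]) -> int:
    """A node contributes capacity only when every layer agrees it is usable."""
    return sum(n.gpus_healthy for n in fleet if n.fabric_ok and n.firmware_baseline)

fleet = [
    GpuNode("node-a", 8, 8, True, True),
    GpuNode("node-b", 8, 7, True, False),   # firmware drift: capacity on paper only
    GpuNode("node-c", 8, 8, False, True),   # flapping link: same story
]
print(schedulable_gpus(fleet))  # 8 schedulable out of 24 installed
```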
...
Atlas is a single pane of glass for multi-cloud cost visibility. This post documents the pipeline: ingestion, streaming, storage, query, forecasting, and visualization.
Right now, I’m working on an InfiniBand topology design for a GPU cluster. The math keeps pointing to the same conclusion: scale-out only makes sense when scale-in has topped out.
It’s not about CUDA cores. It’s not about tensor throughput. It’s about tail latency.
NVLink keeps GPU-to-GPU communication on-package or over short copper links — no NIC, no PCIe host traversal, no protocol stack. For small messages, that means sub-microsecond latency in the hundreds-of-nanoseconds range. InfiniBand NDR switches add sub-microsecond port-to-port latency, but once you include the full path — PCIe to the NIC, driver overhead, fabric hops, and back — real-world GPU-to-GPU latency across nodes often lands in the 3-10μs range depending on message size and topology.
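A back-of-the-envelope version of that budget, using assumed per-component numbers rather than measurements, shows where the microseconds go:

```python
# Assumed component latencies in microseconds; real values vary with hardware
# generation, message size, and topology. Only the shape of the result matters.
nvlink_path = {
    "on-package / short copper hop": 0.3,
}
infiniband_path = {
    "PCIe host-to-NIC traversal": 0.8,
    "driver / verbs overhead":    0.7,
    "NIC serialization":          0.5,
    "switch hops (x2)":           0.6,
    "remote PCIe + delivery":     0.8,
}

def total_us(path: dict[str, float]) -> float:
    return sum(path.values())

print(f"NVLink (scale-in):      ~{total_us(nvlink_path):.1f} µs")
print(f"InfiniBand (scale-out): ~{total_us(infiniband_path):.1f} µs")
# Scale-out pays a fixed multi-microsecond toll per message, which is why it
# only makes sense once the scale-in (NVLink) domain has topped out.
```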
...
The “Everything Is Green” Problem

Here’s a realistic scenario I’ve seen in different forms across fleets (this is a composite, not a single true story with exact numbers):
A training run is supposed to take ~3–4 weeks.
Two weeks in, someone notices the timeline slipping. Not a crash. Not a failure. Just… slow. The job is running 10–30% behind plan, and nobody can point to a smoking gun.
The dashboards look perfect:
...
GPU clusters don’t fail from sustained load. They fail on transitions.
A pod idling at 20 kW can step toward 300 kW quickly when training begins. The peak matters, but the killer is the step: the dP/dt that forces every layer of the electrical path to react at once.
Thermals matter too—but they’re secondary and collateral. Power transients can push protection and control systems into cyclic behavior. Thermal consequences show up later as throttling, efficiency loss, and “mysteriously slower training” that looks like a software problem until you instrument the facility.
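A quick sketch of why the step matters more than the peak, using the 20 kW to 300 kW pod figures above and assumed ramp times:

```python
def ramp_rate_kw_per_s(p_start_kw: float, p_end_kw: float, ramp_time_s: float) -> float:
    """Average dP/dt for a load step."""
    return (p_end_kw - p_start_kw) / ramp_time_s

# Hypothetical pod: idles at 20 kW, training starts, and it climbs to 300 kW.
for ramp_s in (60.0, 5.0, 0.5):
    rate = ramp_rate_kw_per_s(20, 300, ramp_s)
    print(f"ramp over {ramp_s:>5.1f} s -> {rate:7.1f} kW/s")

# Same 280 kW step, very different stress on the electrical path:
#   60 s  ->   ~4.7 kW/s (UPS, PDUs, and cooling can follow)
#    5 s  ->    56 kW/s  (protection coordination starts to matter)
#  0.5 s  ->   560 kW/s  (every layer has to react at once)
```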
...
Traditional internet architecture solved latency with caching. Static content, images, JavaScript bundles—all pushed to edge nodes milliseconds from users. CDNs achieve 95-99% cache hit rates. The compute stays centralized; the content moves to the edge.
AI breaks this model completely.
Every inference requires real GPU cycles. You can’t cache a conversation. You can’t pre-compute a response to a question that hasn’t been asked. The token that completes a sentence depends on every token before it.
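To put rough numbers on the difference, here is a sketch with an assumed request rate and cache hit ratio; only the shape of the result matters:

```python
requests_per_s = 100_000   # assumed front-door load

# Classic content delivery: the origin only sees cache misses.
cdn_hit_rate = 0.98
origin_rps = requests_per_s * (1 - cdn_hit_rate)

# Autoregressive inference: responses depend on the full token history,
# so the effective "cache hit rate" is ~0 and every request reaches a GPU.
inference_hit_rate = 0.0
gpu_rps = requests_per_s * (1 - inference_hit_rate)

print(f"Origin load behind a CDN:  {origin_rps:,.0f} req/s")   # 2,000
print(f"GPU load behind inference: {gpu_rps:,.0f} req/s")      # 100,000
# Fifty times the backend work for the same front-door traffic, and each of
# those requests costs real GPU seconds rather than a cached read.
```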
...
This is the first of a series of URE articles about thermal management in data center environments—not theory, not “best practices,” but what actually happens when heat meets physics and scale.
Here’s a simple puzzle from two idle machines.
ai01 — home lab, Threadripper 32-core with 2× NVIDIA GPUs (NVLink), rack-level liquid cooling loop, used for ML training and vLLM inference:
Tctl:  +33.0°C
Tccd1: +33.2°C
Tccd5: +31.5°C

nj01 — third-party datacenter (colo), Ryzen 12-core, air-cooled:
...
The discipline and the lab. How systems survive production — and where we prove it.
Board Member | Advisory Services (AMTI)
Principal Infrastructure Advisor, post-acquisition.
SDDC operations, greenfield Tier 3 design, zero-touch ops continuity.
Sovereign Data Center Ops: full-stack CSP with integrated Security-as-a-Service.
Performance-Optimized Computing — GPU-dense and HPC workloads.

Founder | Head of R&D (URE)
Unified Resilience Engineering. R&D, field notes, and operational frameworks for GPU-dense infrastructure.
Hands-on AI Lab — automation use cases and integration patterns.
Mapping and building solutions for AI-era infrastructure challenges.
Flushing the seams between physics and tokens.

20 years building mission-critical infrastructure - from concrete to kernel. Resilience is an emergent property. Operational safety is earned — physics doesn’t negotiate, and neither should your architecture.
...
Five operational layers, from the utility connection to the billing system. The backbone of everything URE publishes.