URE — Resilience Engineering for AI-Era High-Density Compute
Keeping mission-critical compute online—with security that holds under pressure.
Written for engineering and security leaders. Lessons from incidents and hard trade-offs: risk-based controls, change integrity, and resilient recovery that works in production.
Decision-grade notes on operating AI-era infrastructure: GPUs, power/thermals, and the security and reliability coupling that determines outcomes.
Notes by Stefano Schotten
What is URE
URE is where I publish field-tested practice for infrastructure that must remain available when constraints are real: power, cooling, supply chain, data sovereignty, regulatory compliance, and human error. Usage • Resources • Economics is the lens. Reliability isn’t a feature of any one layer; it emerges from how the layers interact.
I’ve spent 20+ years leading infrastructure teams across facilities/MEP, power and thermal systems, networking, security, and workload placement—from edge to core. Career history on LinkedIn.
Right now I’m using a controlled lab to pressure-test AI-era failure domains—power density, thermal limits, grid constraints, and operational coupling—so I can write what survives production, not what sounds good in a deck.
What you’ll find here
- Patterns — The recurring shapes of outages across power, network, compute, and organizations
- Mechanisms — Change controls, containment, and recovery automation
- Economics — Risk-weighted decisions: cost of downtime, capacity trade-offs, and constraints you can defend
Where to start
- Building GPU infrastructure: GPU, Cloud & Data Center → Predictive Power Conditioning → HVAC Removes Heat
- Operating production systems: Resilience Engineering → Telemetry That Lies → Why Fleet Control Starts with a Map
- Making capacity / placement decisions: Atlas → Placement as a Business Decision → Articles
Featured artifacts
Atlas
Atlas is a research framework for capacity planning and workload placement—built from 20 years of infrastructure operations experience.
It uses real-world constraints (latency, population, economic footprint) to make placement decisions visible and defensible. In practice: framing placement with public data like GDP and census population, and making the “who are we serving at what latency” question legible.
Example analyses Atlas supports:
- Given a region, what population and GDP fall within 5 ms latency?
- Where are the largest low-latency service areas for inference, and where are the gaps?
- If a workload moves from Region A to Region B, what changes in reach and blast radius?
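A minimal sketch of the first question above, not Atlas's actual method. Every name here (`reach`, `rtt_ms`, `AREAS`) and every number is a hypothetical placeholder. It assumes the 5 ms budget is round-trip time, that light in fiber covers roughly 200 km per millisecond, and that real routes run about 1.5x longer than the great-circle path. A real analysis would load census and GDP data instead of the made-up rows below.

```python
import math

# Assumption: propagation in fiber is roughly 200 km per millisecond (about 2/3 of c).
FIBER_KM_PER_MS = 200.0
# Assumption: fiber routes are longer than the straight great-circle path.
ROUTE_INFLATION = 1.5


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def rtt_ms(distance_km):
    """Estimated round-trip propagation delay; ignores queuing and processing time."""
    return 2 * distance_km * ROUTE_INFLATION / FIBER_KM_PER_MS


def reach(site_lat, site_lon, areas, budget_ms):
    """Sum population and GDP of the areas reachable within the latency budget."""
    covered = [
        area for area in areas
        if rtt_ms(haversine_km(site_lat, site_lon, area["lat"], area["lon"])) <= budget_ms
    ]
    return {
        "areas": [area["name"] for area in covered],
        "population": sum(area["population"] for area in covered),
        "gdp": sum(area["gdp"] for area in covered),
    }


# Placeholder rows, not real census or GDP figures.
AREAS = [
    {"name": "area_a", "lat": 0.0, "lon": 1.0, "population": 1_000_000, "gdp": 50.0},
    {"name": "area_b", "lat": 0.0, "lon": 2.5, "population": 500_000, "gdp": 20.0},
    {"name": "area_c", "lat": 0.0, "lon": 4.0, "population": 2_000_000, "gdp": 90.0},
]

if __name__ == "__main__":
    print(reach(0.0, 0.0, AREAS, budget_ms=5.0))
```

Running `reach` for two candidate sites and comparing the results is one way to frame the Region A to Region B question: the difference in covered areas is the change in reach.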
Why it matters: AI inference cannot be cached like static objects. Placement is user experience.

→ /atlas/
Lab
Where I test what I write about: a GPU-dense rig with NVLink, data center switching, RDMA-capable NICs, and controlled power/thermal experiments.
→ Lab notes and methods: GPU, Cloud & Data Center
Cold Aisle Trenches: You Don't Chase Lights-Out—You Earn It
It was 2017. We had just deployed an additional ScaleIO cluster to handle the onboarding of a new customer with hundreds of VMs. Eight nodes, each with 40 Gbps at the backend. Beautiful. Efficient. The whole rack was a work of art: Dell R740s with MD1220 expansions, bezels removed so you could see all those drives blinking in perfect synchronization. The cluster had been deployed less than two weeks earlier. I told the customer to “burn it.” ...
The Entropy of Sovereign AI: Why the Map is Not the Territory
A few years ago, I was having dinner with the Americas VP of a European energy supermajor — one of those companies that extracts oil from war zones, negotiates with regimes that don’t appear on polite lists, and operates in places where “political risk” means your assets might get nationalized or your personnel kidnapped. Seventy-plus countries. Active operations in Libya, Nigeria, Angola, Myanmar, Yemen. The kinds of places where security briefings come before breakfast. ...
Why Foreign AI Specialists Keep Failing (And What Just Changed)
Context got commoditized. Translation is next. When my company’s acquisition closed in 2024, I thought about pursuing a psychology degree in the US. The impulse was the same one that drives URE: wanting to understand how things are wired under the hood. My wife shut it down—“Really? You know that’s not going to work”—and she was right, though neither of us fully understood why at the time. What I was actually chasing wasn’t psychology. It was context. ...
Cold Aisle Trenches: When Theory Hits the Asphalt
A bricked storage array, a 2+4 SLA that technically performed, and a technician asking about lunch while executives circled. We learned that risk transfer is an illusion when your blood is on the floor.
January 2026 · Stefano Schotten
The contract was honored. The business still bled. My case manager called me from the customer site. I could hear the tension before he said a word. “The VPs are pacing. Four of them, maybe five. They’re all just… standing around IT, watching.” ...
AI and Society: The Three Phases of Technological Adoption
I see people everywhere anxious about whether AI will disrupt their jobs, their industries, their lives. I’ve always approached this with calm. Not indifference—calm. The future rarely sends advance notice, but it is always arriving. This isn’t news. It’s the human condition. A few years ago, I attended a keynote by Michio Kaku where he framed—perfectly, for me—the relationship between humanity and technological change. What follows is my version. I can’t claim novelty, and I’m not a domain expert in sociology or economics. I’m an infrastructure builder observing the same pattern from the inside. ...
The Lone Wolf Starves First
A few months ago I read Project Hail Mary and found myself thinking about observation and agency. Einstein didn’t “invent” time dilation; he created the conditions to perceive it. Without the means to observe, you’re just touching walls in complete darkness. Trial and error, yes, but you never truly know the depth of what you’re sensing. Saturday mornings I take my son to flag football. He’s been in martial arts for half his life, and his coach loves his resilience. But something surfaced in team sports that doesn’t appear on the mat. ...
We Tried to Enforce Standards for Six Years. It Took a Pandemic to Learn Why It Failed.
In 2015, I did what seemed like the mature thing to do. I created a Production Engineering department. My college foundation was production engineering. I was a true believer: if we formalized standards and assigned a dedicated group to own operational rigor, the organization would naturally converge toward consistency. The mandate: Create SOPs. Define standards. Reduce variance. Improve reliability. On paper, it was textbook. In practice, it was a slow-motion collision with reality. ...
From Security to Resilience: What Running a Multi-Tenant Cloud Taught Me About Defense in Depth
Most security programs are built around preventing bad things from happening. That’s necessary but insufficient. At AMTI, where I served as CTO and led infrastructure security for a multi-tenant cloud serving customers from single-VM deployments to enterprise DRaaS contracts spanning hundreds of miles of metro fiber, I learned that mature security is about resilience: the capacity to detect, contain, and recover faster than adversaries can escalate.
The Visibility Problem at Scale
Operating a cloud service provider on your own ASN creates a specific governance challenge: you’re the abuse contact, but in a GDPR-compliant architecture, you have no visibility into customer data. Encrypted traffic is opaque by design. This constraint forced architectural discipline: we couldn’t inspect our way to security, so we had to instrument our way there. ...
When Lack of Guardrails Hurt the Business
Every company says security is a core value. Few embed it as a design constraint. The difference shows up when things break. I get a call from a co-founder I’ve known for years. His company just raised a $400M+ Series D. His voice is flat: “We have a problem.” Same day, we’re on a call. He’s a skilled engineer, and personally devastated. They leaked over 2 million user records. Home addresses. Phone numbers. The full profile. The data had been publicly accessible for three weeks before anyone noticed. ...
When the Constraint Isn’t Capacity
A few years ago, as Field CTO for an enterprise customer, I was pulled into a rescue effort that started the way these stories usually start: pain, urgency, and a narrative that felt convenient. The application hit a bootstorm—150,000+ users slamming it in a short window—and then the predictable second-order effect: every day after that, more tickets piled up. Instability. Session timeouts. Intermittent failures. The kind of symptoms that turn a service into a rumor. ...