Resilience Engineering

The academic discipline of resilience engineering (Hollnagel, Woods, Dekker) has not been systematically applied to data center operations in published work. Existing publications overwhelmingly treat “resilience” as a synonym for “redundancy.” URE redefines the term.

The concept of physical infrastructure SRE is particularly powerful. Software SRE (site reliability engineering) is one of the most searched and best-documented operational disciplines in technology, yet the application of its principles (error budgets, blameless postmortems, observability, incident response frameworks) to physical facility operations remains largely unaddressed. This bridge concept connects the large SRE audience to the physical infrastructure world.

URE’s resilience engineering articles explore what happens when standards fail, when vendor SLAs don’t match reality, when the constraint isn’t capacity but incentives, and when the only way to understand a system is to watch it break. These are the operating principles that survive contact with production.

Cold Aisle Trenches: When Theory Hits the Asphalt

A bricked storage array, a 2+4 SLA that technically performed, and a technician asking about lunch while executives circled. We learned that risk transfer is an illusion when your blood is on the floor.

January 2026 · Stefano Schotten

The contract was honored. The business still bled.

My case manager called me from the customer site. I could hear the tension before he said a word. “The VPs are pacing. Four of them, maybe five. They’re all just… standing around IT, watching.” ...

The Lone Wolf Starves First

A few months ago I read Project Hail Mary and found myself thinking about observation and agency. Einstein didn’t “invent” time dilation; he created the conditions to perceive it. Without the means to observe, you’re just touching walls in complete darkness. Trial and error, yes, but you never truly know the depth of what you’re sensing.

Saturday mornings I take my son to flag football. He’s been in martial arts for half his life, and his coach loves his resilience. But something surfaced in team sports that doesn’t appear on the mat. ...

It Took a Pandemic to Learn Why Standards Failed

In 2015, I did what seemed like the mature thing to do. I created a Production Engineering department. My college foundation was production engineering. I was a true believer: if we formalized standards and assigned a dedicated group to own operational rigor, the organization would naturally converge toward consistency. The mandate: Create SOPs. Define standards. Reduce variance. Improve reliability. On paper, it was textbook. In practice, it was a slow-motion collision with reality. ...

When the Constraint Isn’t Capacity

A few years ago, as Field CTO for an enterprise customer, I was pulled into a rescue effort that started the way these stories usually start: pain, urgency, and a narrative that felt convenient. The application hit a bootstorm—150,000+ users slamming it in a short window—and then the predictable second-order effect: every day after that, more tickets piled up. Instability. Session timeouts. Intermittent failures. The kind of symptoms that turn a service into a rumor. ...