URE — Unified Resilience Engineering
The human brain is hardwired for the tangible. We understand what we’ve touched, carried, walked through. Ask an American to picture 2.3 kilos of meat and you’ll get a blank stare. Tell a European to walk 80 feet down a corridor instead of counting three doors, and they’ll overshoot it. We don’t process abstract units — we process experience.
Now scale that problem up.
Austin, Texas sprawls across 305 square miles. San José, California covers 180. Each one draws roughly one gigawatt from the grid. That’s an entire city — hospitals, traffic lights, air conditioning, schools, everything humming at once. One gigawatt.
Outside Houston, a single data center campus sits on less than one square mile. It draws the same gigawatt.
Condense everything Austin consumes — every house, every hospital, every streetlight — into a footprint smaller than a neighborhood park. That’s what a hyperscale data center is. And more than a dozen of them are being built across the United States right now.
Want to get serious?
A grizzly bear weighs about 600 pounds. The rat behind your local dumpster weighs about two. That’s a 300-to-1 ratio — roughly the same ratio between the sprawl of Austin, Texas, and the footprint of a single gigawatt data center campus. Except the grizzly doesn’t consume 300 times more oxygen. The data center consumes every watt the city does.
That’s the beast we’ve built. A rat with a grizzly’s appetite — lab-made, power-dense, and nothing in the old playbook was designed to feed it. What goes in as power comes out as heat. Every watt. No exceptions.
This is a physics problem. When you compress a gigawatt into a square mile, everything downstream — power conditioning, thermal capacity, transient management, reliability — behaves differently than anything we’ve operated before. The rules that governed traditional data centers don’t scale to AI infrastructure energy density. The playbooks don’t transfer. The dashboards lie.
There are no playbooks written for the AI era — not for AI Factories, Token Factories, or Deep Training Facilities. We’re talking about ML and RL training jobs spanning thousands of nodes on a typical Tuesday.
For perspective: five years ago, the most powerful computer on Earth was Fugaku — 159,000 nodes, 432 racks, a billion dollars of Japanese national investment, drawing 30 megawatts. Countries bragged about it. Scientists waited in line for access. It was a generational achievement.
Today, a NeoCloud launches a “modest” 300-megawatt facility — ten Fugakus worth of power — and it doesn’t even make headlines.
Physics doesn’t negotiate. URE starts there.
My work — building data centers from scratch in highly constrained environments — shaped a particular vision of how the stack is actually built. Not how it’s drawn in architecture diagrams, but how it behaves under load, under budget pressure, and under the laws of thermodynamics.
This world isn’t one-size-fits-all. URE connects the seams between operational layers — power, thermal, compute, network, cost, compliance — and turns them into reliable deliverables with real economic value. Not another dashboard. Not another abstraction. A reasoning method built from twenty years of scars for an era that doesn’t yet have playbooks.
I break the infrastructure stack into five operational layers: foundation, AI-era baseline, infrastructure and hardware, software and orchestration, and economics — each one inheriting the failures of the layer below it. That framework is the backbone of everything I publish here.
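The layer names above are the framework itself; everything else in this sketch (the fields, the failure example, the propagation helper) is an illustrative assumption, included only to show the one property that matters: each layer inherits the unresolved failures of the layer below it.

```python
from dataclasses import dataclass, field

@dataclass
class Layer:
    """One operational layer in the five-layer stack."""
    name: str
    concerns: list[str]
    failures: list[str] = field(default_factory=list)

# Ordered bottom-up: index 0 is the physical foundation, index 4 is economics.
STACK = [
    Layer("foundation", ["utility feed", "power conditioning", "thermal capacity"]),
    Layer("AI-era baseline", ["density limits", "transient behavior", "reliability targets"]),
    Layer("infrastructure & hardware", ["GPU nodes", "fabric", "storage"]),
    Layer("software & orchestration", ["schedulers", "drivers", "observability"]),
    Layer("economics", ["cost per token", "SLA exposure", "capacity planning"]),
]

def propagate(stack: list[Layer]) -> None:
    """Each layer inherits the unresolved failures of the layer directly below it."""
    for below, above in zip(stack, stack[1:]):
        above.failures.extend(below.failures)

# Hypothetical example: a thermal fault at the foundation eventually surfaces
# as an economics problem at the top of the stack.
STACK[0].failures.append("chilled-water setpoint drift")
propagate(STACK)
print(STACK[-1].failures)  # ['chilled-water setpoint drift']
```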
This is the first of a three-part series based on a real-world engagement: a company that scaled from $40M to $1B in annual revenue in just five years, and the security program that had to grow with it.
This is a story about building high-performance operating systems where security, standards, architecture, and performance act as enablers rather than constraints.
Part 1: Earning credibility before you’ve earned authority.
Part 2: Blurring the lines — Security at the SRE and Operations level.
Part 3: Wrapping the gift — Transparency and agency.

The Inflection Point

A few years back, AMTI was at the heart of a fascinating corporate challenge. I was serving as a fractional CISO and advisor for a company standing at a critical inflection point.
...
This is the second of a three-part series based on a real-world engagement: a company that scaled from $40M to $1B in annual revenue in just five years, and the security program that had to grow with it.
This is a story about building high-performance operating systems where security, standards, architecture, and performance act as enablers rather than constraints.
Part 1: Earning credibility before you’ve earned authority.
Part 2: Blurring the lines — Security at the SRE and Operations level.
Part 3: Wrapping the gift — Transparency and agency.

From Trust to Reliance
...
This is the third and final part of a series based on a real-world engagement: a company that scaled from $40M to $1B in annual revenue in just five years, and the security program that had to grow with it.
This is a story about building high-performance operating systems where security, standards, architecture, and performance act as enablers rather than constraints.
Part 1: Earning credibility before you’ve earned authority.
Part 2: Blurring the lines — Security at the SRE and Operations level.
Part 3: Wrapping the gift — Transparency and agency.

The Quality That Can’t Be Purchased

I’ve been writing around this idea for a while — in Cold Aisle Trenches, in why standards fail when you try to impose them, in how defense in depth actually works at scale. The thread is always the same: security can’t be bought. You can’t swipe a credit card and receive “secure” in a box. It’s a quality that emerges — like the lights-out data center you don’t chase but eventually arrive at, because every other piece fell into place first.
...
It was 2017. We had just deployed an additional ScaleIO cluster to handle the onboarding of a new customer with hundreds of VMs. Eight nodes, each with 40 Gbps at the backend. Beautiful. Efficient. The whole rack was a work of art—Dell R740s with MD1220 expansions, bezels removed so you could see all those drives blinking in perfect synchronization.
The cluster had been deployed less than two weeks earlier. I told the customer to “burn it.”
...
A few years ago, I was having dinner with the Americas VP of a European energy supermajor — one of those companies that extracts oil from war zones, negotiates with regimes that don’t appear on polite lists, and operates in places where “political risk” means your assets might get nationalized or your personnel kidnapped.
Seventy-plus countries. Active operations in Libya, Nigeria, Angola, Myanmar, Yemen. The kinds of places where security briefings come before breakfast.
...
Context got commoditized. Translation is next.
When my company’s acquisition closed in 2024, I thought about pursuing a psychology degree in the US. The impulse was the same one that drives URE: wanting to understand how things are wired under the hood. My wife shut it down—“Really? You know that’s not going to work”—and she was right, though neither of us fully understood why at the time.
What I was actually chasing wasn’t psychology. It was context.
...
A bricked storage array, a 2+4 SLA that technically performed, and a technician asking about lunch while executives circled. We learned that risk transfer is an illusion when your blood is on the floor.
January 2026 · Stefano Schotten
The contract was honored. The business still bled.
My case manager called me from the customer site. I could hear the tension before he said a word.
“The VPs are pacing. Four of them, maybe five. They’re all just… standing around IT, watching.”
...
I see people everywhere anxious about whether AI will disrupt their jobs, their industries, their lives. I’ve always approached this with calm. Not indifference—calm.
The future rarely sends advance notice, but it is always arriving. This isn’t news. It’s the human condition.
A few years ago, I attended a keynote by Michio Kaku where he framed—perfectly, for me—the relationship between humanity and technological change. What follows is my version. I can’t claim novelty, and I’m not a domain expert in sociology or economics. I’m an infrastructure builder observing the same pattern from the inside.
...
A few months ago I read Project Hail Mary and found myself thinking about observation and agency. Einstein didn’t “invent” spacetime dilation—he created the conditions to perceive it. Without the means to observe, you’re just touching walls in complete darkness. Trial and error, yes, but you never truly know the depth of what you’re sensing.
Saturday mornings I take my son to flag football. He’s been in martial arts for half his life—his coach loves his resilience. But something surfaced in team sports that doesn’t appear on the mat.
...
In 2015, I did what seemed like the mature thing to do. I created a Production Engineering department.
My college foundation was production engineering. I was a true believer: if we formalized standards and assigned a dedicated group to own operational rigor, the organization would naturally converge toward consistency.
The mandate: Create SOPs. Define standards. Reduce variance. Improve reliability.
On paper, it was textbook. In practice, it was a slow-motion collision with reality.
...
Most security programs are built around preventing bad things from happening. That’s necessary but insufficient. At AMTI, where I served as CTO and led infrastructure security for a multi-tenant cloud serving customers from single-VM deployments to enterprise DRaaS contracts spanning hundreds of miles of metro fiber, I learned that mature security is about resilience: the capacity to detect, contain, and recover faster than adversaries can escalate.
The Visibility Problem at Scale

Operating a cloud service provider on your own ASN creates a specific governance challenge: you’re the abuse contact, but in a GDPR-compliant architecture, you have no visibility into customer data. Encrypted traffic is opaque by design. This constraint forced architectural discipline: we couldn’t inspect our way to security, so we had to instrument our way there.
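As a hedged sketch of what instrumenting without inspecting can look like: aggregate flow metadata per tenant and flag abuse signals from traffic shape alone. The record fields and thresholds below are assumptions for illustration, not AMTI's actual tooling.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class FlowRecord:
    """Metadata only: no payloads, no decryption, no customer content."""
    tenant: str
    dst_port: int
    bytes_out: int

def abuse_signals(flows: list[FlowRecord],
                  egress_limit_bytes: int = 10**10,
                  smtp_flow_limit: int = 1_000) -> dict[str, list[str]]:
    """Flag tenants purely from traffic shape (illustrative thresholds)."""
    egress = defaultdict(int)
    smtp_flows = defaultdict(int)
    for f in flows:
        egress[f.tenant] += f.bytes_out
        if f.dst_port == 25:
            smtp_flows[f.tenant] += 1

    signals: dict[str, list[str]] = defaultdict(list)
    for tenant, total in egress.items():
        if total > egress_limit_bytes:
            signals[tenant].append("egress volume anomaly")
    for tenant, count in smtp_flows.items():
        if count > smtp_flow_limit:
            signals[tenant].append("possible outbound spam (port 25 fan-out)")
    return dict(signals)
```

The point is not the thresholds; it is that every signal above is derivable without decrypting a single packet.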
...
Every company says security is a core value. Few embed it as a design constraint. The difference shows up when things break.
I get a call from a co-founder I’ve known for years. His company just raised $400M+ Series D. His voice is flat: “We have a problem.” Same day, we’re on a call. He’s a skilled engineer — personally devastated. They leaked over 2 million user records. Home addresses. Phone numbers. The full profile. The data had been publicly accessible for three weeks before anyone noticed.
...
A few years ago, as Field CTO for an enterprise customer, I was pulled into a rescue effort that started the way these stories usually start: pain, urgency, and a narrative that felt convenient.
The application hit a bootstorm—150,000+ users slamming it in a short window—and then the predictable second-order effect: every day after that, more tickets piled up. Instability. Session timeouts. Intermittent failures. The kind of symptoms that turn a service into a rumor.
...
1/5 — The Inception
Series: Security Assurance — URE Case — 1/5
This is the first of five short posts on Security Assurance Engineering. The goal is simple: separate security intent from security proof, and show what “assurance” looks like when you treat a system as real—owned, changing, and measurable.
I’ll use URE as the working surface. URE is the platform where I publish research notes and operating practice generated in my lab—work that started as a few shared threads with friends and peers, and eventually became worth “productizing” into something durable and navigable.
...
2/5 — Trust Boundaries
Series: Security Assurance — URE Case — 2/5
In mature environments, we don’t start with implementation. We start with boundaries and ownership.
Before anyone spins up “a simple website/blog,” we make three things explicit:
1. What is the system? (scope and components)
2. Who can change it? (identities and permissions)
3. What must always remain true? (invariants + guardrails)

Security should be intentional. The goal is to create guardrails the rest of the team can rely on—so delivery is fast and the system stays trustworthy under change.
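One way to keep those three answers from living only in people's heads is to write them down in a machine-checkable form and assert them on every change. The manifest and checks below are a hypothetical sketch under that assumption, not the actual URE pipeline.

```python
# Hypothetical manifest: the three questions above, made explicit and checkable.
MANIFEST = {
    "system": {   # What is the system?
        "name": "ure-site",
        "components": ["static site", "build pipeline", "DNS zone", "analytics"],
    },
    "owners": {   # Who can change it?
        "content": ["author"],
        "infrastructure": ["author", "ci-bot"],
    },
    "invariants": [   # What must always remain true?
        "served over HTTPS only",
        "no third-party trackers",
        "deploys happen only from the main branch via CI",
    ],
}

def check_invariants(observed: dict) -> list[str]:
    """Return violations; run in CI and fail the build if the list is non-empty."""
    violations = []
    if not observed.get("https_only", False):
        violations.append("HTTPS-only invariant broken")
    if observed.get("third_party_trackers", 0) > 0:
        violations.append("tracker invariant broken")
    if observed.get("deploy_source") != "ci/main":
        violations.append("deploy-path invariant broken")
    return violations

if __name__ == "__main__":
    ok = {"https_only": True, "third_party_trackers": 0, "deploy_source": "ci/main"}
    print(check_invariants(ok))  # [] means the guardrails still hold
```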
...
3/5 — The Design
Series: Security Assurance — URE Case — 3/5
Design is where “a simple website” becomes a real system.
Not because the pages are complex—but because the moment you publish, you inherit real dependencies: DNS, build pipelines, third parties, telemetry, and the drift that comes with change. So before we build anything, we do one unglamorous thing:
...
4/5 — Security as an Enabler (and “forward agency”)
Series: Security Assurance — URE Case — 4/5
Security enables the business when it shows up with agency: not just identifying risk, but carrying enough context to propose solutions that preserve the mission.
That requires a maturity shift.
When security arrives late, it often speaks in “non-English.” It blocks because the system is already committed to choices no one can defend.
...
5/5 — Conclusion — Assurance Without Theater
Series: Security Assurance — URE Case — 5/5
Security Assurance Engineering is not a side quest. It’s not a compliance ritual. And it’s not a “security team thing.”
It’s what turns security from intent into proof—in systems that are owned, changing, and measurable.
Across these chapters, the arc is consistent:
Part 1/5 (Inception): Architecture sets the invariants. Assurance proves they still hold under change.
Part 2/5 (Trust Boundaries): If the boundary isn’t explicit, you don’t have a system—you have assumptions.
Part 3/5 (Design): The tedious questions aren’t bureaucracy; they are how you prevent accidental scope and irreversible drift.
Part 4/5 (Security as Enabler): Done well, security doesn’t slow delivery—it restores optionality and keeps the mission intact under real pressure.

The takeaway is simple:
...
Every company says security is a priority. Every company also ships under pressure.
The gap between those two statements is where businesses bleed.
I’ve watched organizations with excellent engineers and serious budgets still get humbled by the same pattern: teams optimize locally (features, velocity, “my backlog”), while the system pays globally (incidents, outages, churn, reputational drag). When things go south, it rarely takes a cinematic attacker or a once-in-a-decade failure.
...
In 2021, I bought a home in Florida. The closing was in August, so imagine the hot summer days with temperatures over 100 degrees and humidity over 80%.
When we selected the builder, I noted two things: HVAC at 15 SEER and R-39 insulation. My house would be at least minimally energy-efficient. I had no option to upgrade the HVAC, but 15 SEER is “good enough”.
First week in the house, my wife realized I was getting bothered every time the compressor kicked in: a subtle, almost imperceptible dip in the lights. Nobody else noticed it, but I did. Battle-proven engineer with experience in thermal and power transients. What could happen?
...
I’m currently working on the design of a framework for GPU fleet management.
We’re living in a crowded data center reality where everybody wants “hero” compute — dense GPUs, fast networking, and delivery that’s closer to the edge. We’re in a land-grab phase where every business wants to be everywhere, but most teams are discovering the same thing: buying GPUs is the easy part. Operating them as a coherent fleet is the hard part.
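As a sketch of what a coherent fleet means operationally: one health record per node, and one function that answers what is actually schedulable right now. The fields and checks are illustrative assumptions, not the framework itself.

```python
from dataclasses import dataclass

@dataclass
class GpuNode:
    name: str
    gpus_total: int
    gpus_healthy: int        # after ECC/XID/thermal checks, however you collect them
    fabric_ok: bool          # NIC and link validation passed
    firmware_baseline: bool  # matches the fleet-wide pin

def schedulable_gpus(fleet: list[GpuNode]) -> int:
    """A node contributes capacity only when every layer agrees it is usable."""
    return sum(n.gpus_healthy for n in fleet if n.fabric_ok and n.firmware_baseline)

fleet = [
    GpuNode("node-a", 8, 8, True, True),
    GpuNode("node-b", 8, 7, True, False),   # firmware drift: capacity on paper only
    GpuNode("node-c", 8, 8, False, True),   # flapping link: same story
]
print(schedulable_gpus(fleet))  # 8 schedulable out of 24 installed
```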
...
Atlas is a single pane of glass for multi-cloud cost visibility. This post documents the pipeline: ingestion, streaming, storage, query, forecasting, and visualization.
Right now, I’m working on an InfiniBand topology design for a GPU cluster. The math keeps pointing to the same conclusion: scale-out only makes sense when scale-in has topped out.
It’s not about CUDA cores. It’s not about tensor throughput. It’s about tail latency.
NVLink keeps GPU-to-GPU communication on-package or over short copper links — no NIC, no PCIe host traversal, no protocol stack. For small messages, that means sub-microsecond latency in the hundreds-of-nanoseconds range. InfiniBand NDR switches add sub-microsecond port-to-port latency, but once you include the full path — PCIe to the NIC, driver overhead, fabric hops, and back — real-world GPU-to-GPU latency across nodes often lands in the 3-10μs range depending on message size and topology.
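A back-of-the-envelope version of that budget, using assumed per-component numbers rather than measurements, shows where the microseconds go:

```python
# Assumed component latencies in microseconds; real values vary with hardware
# generation, message size, and topology. Only the shape of the result matters.
nvlink_path = {
    "on-package / short copper hop": 0.3,
}
infiniband_path = {
    "PCIe host-to-NIC traversal": 0.8,
    "driver / verbs overhead":    0.7,
    "NIC serialization":          0.5,
    "switch hops (x2)":           0.6,
    "remote PCIe + delivery":     0.8,
}

def total_us(path: dict[str, float]) -> float:
    return sum(path.values())

print(f"NVLink (scale-in):      ~{total_us(nvlink_path):.1f} µs")
print(f"InfiniBand (scale-out): ~{total_us(infiniband_path):.1f} µs")
# Scale-out pays a fixed multi-microsecond toll per message, which is why it
# only makes sense once the scale-in (NVLink) domain has topped out.
```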
...
The “Everything Is Green” Problem

Here’s a realistic scenario I’ve seen in different forms across fleets (this is a composite, not a single true story with exact numbers):
A training run is supposed to take ~3–4 weeks.
Two weeks in, someone notices the timeline slipping. Not a crash. Not a failure. Just… slow. The job is running 10–30% behind plan, and nobody can point to a smoking gun.
The dashboards look perfect:
...
GPU clusters don’t fail from sustained load. They fail on transitions.
A pod idling at 20 kW can step toward 300 kW quickly when training begins. The peak matters, but the killer is the step: the dP/dt that forces every layer of the electrical path to react at once.
Thermals matter too—but they’re secondary and collateral. Power transients can push protection and control systems into cyclic behavior. Thermal consequences show up later as throttling, efficiency loss, and “mysteriously slower training” that looks like a software problem until you instrument the facility.
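A quick sketch of why the step matters more than the peak, using the 20 kW to 300 kW pod figures above and assumed ramp times:

```python
def ramp_rate_kw_per_s(p_start_kw: float, p_end_kw: float, ramp_time_s: float) -> float:
    """Average dP/dt for a load step."""
    return (p_end_kw - p_start_kw) / ramp_time_s

# Hypothetical pod: idles at 20 kW, training starts, and it climbs to 300 kW.
for ramp_s in (60.0, 5.0, 0.5):
    rate = ramp_rate_kw_per_s(20, 300, ramp_s)
    print(f"ramp over {ramp_s:>5.1f} s -> {rate:7.1f} kW/s")

# Same 280 kW step, very different stress on the electrical path:
#   60 s  ->   ~4.7 kW/s (UPS, PDUs, and cooling can follow)
#    5 s  ->    56 kW/s  (protection coordination starts to matter)
#  0.5 s  ->   560 kW/s  (every layer has to react at once)
```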
...
Traditional internet architecture solved latency with caching. Static content, images, JavaScript bundles—all pushed to edge nodes milliseconds from users. CDNs achieve 95-99% cache hit rates. The compute stays centralized; the content moves to the edge.
AI breaks this model completely.
Every inference requires real GPU cycles. You can’t cache a conversation. You can’t pre-compute a response to a question that hasn’t been asked. The token that completes a sentence depends on every token before it.
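To put rough numbers on the difference, here is a sketch with an assumed request rate and cache hit ratio; only the shape of the result matters:

```python
requests_per_s = 100_000   # assumed front-door load

# Classic content delivery: the origin only sees cache misses.
cdn_hit_rate = 0.98
origin_rps = requests_per_s * (1 - cdn_hit_rate)

# Autoregressive inference: responses depend on the full token history,
# so the effective "cache hit rate" is ~0 and every request reaches a GPU.
inference_hit_rate = 0.0
gpu_rps = requests_per_s * (1 - inference_hit_rate)

print(f"Origin load behind a CDN:  {origin_rps:,.0f} req/s")   # 2,000
print(f"GPU load behind inference: {gpu_rps:,.0f} req/s")      # 100,000
# Fifty times the backend work for the same front-door traffic, and each of
# those requests costs real GPU seconds rather than a cached read.
```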
...
This is the first of a series of URE articles about thermal management in data center environments—not theory, not “best practices,” but what actually happens when heat meets physics and scale.
Here’s a simple puzzle from two idle machines.
ai01 — home lab, Threadripper 32-core with 2× NVIDIA GPUs (NVLink), rack-level liquid cooling loop, used for ML training and vLLM inference:
Tctl:  +33.0°C
Tccd1: +33.2°C
Tccd5: +31.5°C

nj01 — third-party datacenter (colo), Ryzen 12-core, air-cooled:
...
The discipline and the lab. How systems survive production — and where we prove it.
Board Member | Advisory Services (AMTI)
Principal Infrastructure Advisor, post-acquisition.
SDDC operations, greenfield Tier 3 design, zero-touch ops continuity.
Sovereign Data Center Ops: full-stack CSP with integrated Security-as-a-Service.
Performance-Optimized Computing — GPU-dense and HPC workloads.

Founder | Head of R&D (URE)
Unified Resilience Engineering. R&D, field notes, and operational frameworks for GPU-dense infrastructure.
Hands-on AI Lab — automation use cases and integration patterns.
Mapping and building solutions for AI-era infrastructure challenges.
Flushing the seams between physics and tokens.

20 years building mission-critical infrastructure - from concrete to kernel. Resilience is an emergent property. Operational safety is earned — physics doesn’t negotiate, and neither should your architecture.
...
Five operational layers, from the utility connection to the billing system. The backbone of everything URE publishes.