URE — Unified Resilience Engineering
“Reliability is the property you get when the seams are understood, the changes are safe, and recovery is real.”
Unified Resilience Engineering is cross-layer by necessity: the ability to operate systems where physics, supply chains, control planes, and incentives all participate in the outage.
It’s not QA. It’s not “buy a tool.” It’s not a slogan. It’s the craft of designing and operating for failure: knowing where the seams are, making change integrity provable, containing blast radius by default, and keeping measurement honest enough to steer by.
The work spans facilities ↔ hardware/firmware ↔ storage/network control planes ↔ security controls (identity/PKI/middleboxes) ↔ software releases ↔ incentives/economics—because production failures happily route around org charts.
The Discipline
Six pillars define how URE thinks about resilience. Each is a lens, not a silo — real incidents cross all of them.
Seams & Dependencies
Thesis: Most outages are dependency failures, not component failures. Resilience starts by mapping seams across vendors, layers, and contracts—then designing for their failure modes.
Stories it captures: Hidden coupling shows up as “random” production behavior.
Mechanisms: dependency graphs, fault-domain boundaries, interface contracts, vendor edge cases, time/ordering.
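One way to make a seam concrete is to walk the dependency graph and ask what a single failure reaches. A minimal sketch, with an invented service graph (all names are illustrative, not a real topology):

```python
from collections import deque

# Hypothetical dependency graph: edges point from a component to the
# components it depends on. Every name here is illustrative.
DEPENDS_ON = {
    "checkout": ["payments", "inventory"],
    "payments": ["pki", "db-primary"],
    "inventory": ["db-primary"],
    "reports": ["db-replica"],
}

def blast_radius(failed: str) -> set:
    """Return every component that transitively depends on `failed`."""
    # Invert the graph: who depends on me?
    dependents = {}
    for svc, deps in DEPENDS_ON.items():
        for d in deps:
            dependents.setdefault(d, []).append(svc)
    impacted, queue = set(), deque([failed])
    while queue:
        node = queue.popleft()
        for upstream in dependents.get(node, []):
            if upstream not in impacted:
                impacted.add(upstream)
                queue.append(upstream)
    return impacted

# A db-primary failure reaches checkout through two separate seams.
print(sorted(blast_radius("db-primary")))  # ['checkout', 'inventory', 'payments']
```

The inverted-graph walk is the whole trick: the "random" production behavior in the story above is usually a path in this graph that nobody had drawn.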
Change Integrity
Thesis: Change is the dominant cause of unreliability. The goal is not fewer changes—it’s changes that are constrained, attributable, and reversible under pressure.
Stories it captures: A “small tweak” crosses a seam and becomes a fleet incident.
Mechanisms: version/firmware gates, staged rollout, guardrails, provenance, rollback criteria, approvals that match blast radius.
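The staged-rollout-plus-rollback-criteria mechanism can be sketched as a gate loop. Stage names, the error-rate threshold, and the observation function are assumptions for illustration, not a real pipeline:

```python
# Illustrative staged-rollout gate: promote a change through stages only
# while its measured error rate stays under a rollback threshold.
STAGES = ["canary", "one-rack", "one-zone", "fleet"]
ROLLBACK_ERROR_RATE = 0.01  # abort if more than 1% of probes fail

def run_rollout(observe):
    """`observe(stage)` returns the measured error rate at that stage.
    Returns ('rolled_back' or 'complete', list of stages reached)."""
    reached = []
    for stage in STAGES:
        reached.append(stage)
        if observe(stage) > ROLLBACK_ERROR_RATE:
            return "rolled_back", reached  # reversible under pressure
    return "complete", reached

# A regression that only appears at zone scale stops here,
# before the "small tweak" becomes a fleet incident.
rates = {"canary": 0.0, "one-rack": 0.002, "one-zone": 0.04}
status, reached = run_rollout(lambda s: rates.get(s, 0.0))
print(status, reached)  # rolled_back ['canary', 'one-rack', 'one-zone']
```

The design point is that the rollback criterion is written down before the rollout starts, so the abort decision is mechanical, not negotiated mid-incident.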
Containment by Design
Thesis: The system should fail in ways you can afford. Containment is architecture and policy that turns unknown failure into bounded impact.
Stories it captures: One mistake should not have permission to become a multi-domain outage.
Mechanisms: segmentation, rate limits, circuit breakers, kill switches, least privilege, fault isolation, policy-as-code.
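Of these mechanisms, the circuit breaker is the easiest to show in a few lines. A minimal sketch (thresholds and timings are invented defaults):

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch: after `max_failures` consecutive
    failures the breaker opens and fails fast for `reset_after` seconds,
    bounding how hard a broken dependency can be hammered."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures, self.reset_after, self.clock = max_failures, reset_after, clock
        self.failures, self.opened_at = 0, None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at, self.failures = None, 0  # half-open: try again
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip: contain the blast
            raise
        self.failures = 0
        return result
```

Injecting the clock makes the open/half-open transition testable without sleeping, which is the same property that makes the containment behavior rehearsable.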
Recovery Readiness
Thesis: Recovery is a capability, not a hope. If you can’t rehearse it, time it, and staff it, you don’t have it.
Stories it captures: The incident ends when the system is stable—not when the ticket is closed.
Mechanisms: runbooks that work, restore drills, dependency-aware RTO/RPO, spares strategy, break-glass access, failure-mode rehearsals.
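"Dependency-aware RTO" has a simple arithmetic core: a service is only recovered once everything beneath it is back, so its effective RTO is its own restore time plus the longest dependency chain. A sketch with invented drill timings (minutes):

```python
# Restore times as measured in (hypothetical) drills, in minutes.
RESTORE_MIN = {"db": 30, "auth": 10, "api": 5, "frontend": 5}
DEPENDS_ON = {"api": ["db", "auth"], "frontend": ["api"], "db": [], "auth": []}

def effective_rto(service: str) -> int:
    """Own restore time plus the slowest dependency's effective RTO."""
    deps = DEPENDS_ON.get(service, [])
    return RESTORE_MIN[service] + max((effective_rto(d) for d in deps), default=0)

print(effective_rto("frontend"))  # 40: frontend(5) + api(5) + db(30)
```

If the drill-measured numbers make this total exceed the promised RTO, the recovery capability does not exist yet, whatever the runbook says.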
Measurement & Truth
Thesis: You can’t operate what you can’t measure honestly. The hardest part is not collecting signals—it’s ensuring they’re time-aligned, calibrated, and decision-grade.
Stories it captures: Everything is green while the system is quietly degrading.
Mechanisms: sensor validation, baselines, time sync, cross-checks, SLOs tied to physics, anomaly triage.
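Sensor cross-checking can be sketched as pairing two streams by timestamp and flagging disagreement. The probes, tolerances, and readings below are invented; the point is that a reading without a time-aligned counterpart is excluded rather than trusted:

```python
# Cross-check two independent sensors that should agree, e.g. two
# inlet-temperature probes on the same rack. Readings are
# (seconds, value) pairs; skew and delta tolerances are illustrative.
def cross_check(a, b, max_skew_s=2.0, max_delta=1.5):
    """Pair each reading in `a` with the nearest-in-time reading in `b`;
    return pairs whose values disagree by more than `max_delta`."""
    suspect = []
    for ts, val in a:
        nearest = min(b, key=lambda r: abs(r[0] - ts))
        if abs(nearest[0] - ts) > max_skew_s:
            continue  # no time-aligned counterpart; can't compare honestly
        if abs(nearest[1] - val) > max_delta:
            suspect.append((ts, val, nearest[1]))
    return suspect

probe_a = [(0, 21.0), (10, 21.2), (20, 27.9)]  # drifting high?
probe_b = [(0, 21.1), (11, 21.3), (21, 22.0)]
print(cross_check(probe_a, probe_b))  # [(20, 27.9, 22.0)]
```

A dashboard fed only by probe_a stays green while probe_b disagrees; the cross-check is what makes the signal decision-grade.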
Incentives & Economics
Thesis: Reliability is negotiated. Systems fail where incentives make risk cheap and outages externalized.
Stories it captures: The root cause is an ownership boundary—priced into behavior.
Mechanisms: contracts, SLAs that map to reality, capex/opex tradeoffs, staffing, accountability, cost of delay, “who pays” modeling.
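"Who pays" modeling often reduces to comparing a mitigation's annual cost against the expected annual cost of the outage it prevents. All numbers below are illustrative; the value is in making the tradeoff explicit, not in the figures:

```python
# Expected annual outage cost vs. annual cost of preventing it.
def expected_outage_cost(outages_per_year, hours_per_outage, cost_per_hour):
    return outages_per_year * hours_per_outage * cost_per_hour

risk = expected_outage_cost(outages_per_year=2, hours_per_outage=4,
                            cost_per_hour=25_000)
mitigation = 60_000  # e.g. redundant feed plus drills, per year (hypothetical)
print(risk, mitigation, "mitigate" if mitigation < risk else "accept")
# 200000 60000 mitigate
```

When the party who bears `risk` is not the party who pays `mitigation`, this arithmetic is exactly where the ownership-boundary root cause gets priced into behavior.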
The Lab
The discipline describes how systems survive. The lab is where we prove it.
Credible results rarely come from uncontrolled conditions. They come from discipline: controlled variables, measured baselines, and experiments that isolate causality. This lab is designed to make results repeatable and production less surprising.
Applied research pillars
These are the testing domains where the six pillars above are validated against real hardware, real power, and real thermal behavior.
Compute & Workload Performance
Training throughput, GPU utilization, NCCL tradeoffs, latency/jitter, virtualization overhead, scheduling effects, Slurm, Kubernetes.
Thermal & Cooling Engineering (Liquid + Air)
Loop design, radiator efficiency, fan/pump control, hotspot mapping, airflow characterization (CFM), CFM-per-watt efficiency, thermal drift, throttling behavior, ambient sensitivity, compute-to-thermal efficiency ratio.
Power & Electrical Engineering
Power transients/load steps, PSU behavior, UPS interaction, power quality (sag/ripple/noise), power capping strategies, perf/watt, rack-level power efficiency, stability under burst loads, brownouts.
I/O, Storage & Data Path
NVMe behavior under load, dataset streaming, ingest pipelines, page cache behavior, PCIe contention, storage-induced jitter, NVMe-oF, HPC storage, shared volumes, data sovereignty, fault-domain storage architecture.
Network & Interconnect
Distributed training sensitivity, congestion and microbursts, topology impacts, RDMA (RoCE/IB when applicable), horizontal scaling, automated security perimeters, mesh/fabric architectures for high throughput, load-balancing algorithms.
Observability & Measurement Science
Instrumentation plan, sensor validation, sampling, time alignment, baselines, confidence intervals, repeatability, experiment hygiene.
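Baselines and confidence intervals together are what separate a regression from run-to-run noise. A sketch using a crude ~95% interval (mean ± 2 standard errors); the throughput samples are invented:

```python
import statistics as st

# Report a baseline as mean +/- a ~95% half-width (t ~ 2 approximation)
# rather than a single number, so later runs are judged against noise.
def baseline(samples):
    mean = st.mean(samples)
    sem = st.stdev(samples) / len(samples) ** 0.5  # standard error of the mean
    return mean, 2.0 * sem

runs = [412.0, 418.5, 409.8, 415.2, 413.6]  # e.g. images/sec over 5 identical runs
mean, half = baseline(runs)

def is_regression(new_value):
    return new_value < mean - half

print(round(mean, 1), round(half, 1))  # 413.8 2.9
```

Five runs and a t≈2 approximation are the floor, not the standard; the experiment-hygiene point is that the noise band is measured before the comparison, not argued about after.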
Platform Automation & Fleet Operations
Auto-inventory (Redfish-first; IPMI fallback), config drift detection, golden images, reproducible provisioning, Infrastructure as Code (IaC), firmware/driver compliance, zero-touch onboarding.
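The "Redfish-first; IPMI fallback" pattern is a small control-flow idea. The sketch below abstracts the transport: `redfish_get` and `ipmi_get` are hypothetical stand-ins for a real Redfish HTTP query and an ipmitool call, so the fallback logic can be shown without inventing API details:

```python
# Redfish-first, IPMI-fallback inventory: try the structured path,
# degrade gracefully, and record which source the data came from so
# drift detection knows how much to trust each field.
def inventory(bmc, redfish_get, ipmi_get):
    try:
        data = redfish_get(bmc)   # rich, schema-backed JSON
        return {"source": "redfish", **data}
    except Exception:
        data = ipmi_get(bmc)      # sparse but widely supported
        return {"source": "ipmi", **data}

# Usage with fake transports standing in for real BMCs:
ok = inventory("10.0.0.5", lambda b: {"model": "R740"}, lambda b: {})

def redfish_down(bmc):
    raise ConnectionError("BMC too old for Redfish")

old = inventory("10.0.0.6", redfish_down, lambda b: {"model": "unknown"})
print(ok["source"], old["source"])  # redfish ipmi
```

Tagging every record with its source is the part that matters for fleet operations: an "ipmi"-sourced field changing is a different signal than a "redfish"-sourced one.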
Reliability & Failure Engineering
Controlled fault injection, thermal/power stress tests, degradation over time, MTBF signals, rollback criteria, postmortems, brownout survivability.
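An MTBF signal, at its simplest, is the mean gap between observed failures. A sketch with invented timestamps; the honest caveat is that the estimate only means something once there are enough events under a stable workload:

```python
# Failure timestamps in hours since the start of a stress campaign
# (values invented). MTBF here is the mean gap between failures.
def mtbf_hours(failure_times):
    gaps = [b - a for a, b in zip(failure_times, failure_times[1:])]
    return sum(gaps) / len(gaps)

failures = [12.0, 100.0, 260.0, 300.0]
print(mtbf_hours(failures))  # 96.0
```

With three gaps this is a signal to watch, not a number to plan around; it becomes decision-grade only as the campaign accumulates events.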
Economics & Capacity (FinOps-style)
Cost per training hour, energy sensitivity, utilization vs waste, capex/opex tradeoffs, constraints-to-dollars narratives, rack efficiency (kW per useful work), throughput per kW, throughput per CFM.
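The constraints-to-dollars arithmetic is short enough to show directly: measured rack power and throughput turn into throughput-per-kW and cost-per-hour. All inputs below are illustrative, not lab measurements:

```python
# Turn measured rack power and throughput into FinOps-style numbers.
def rack_economics(samples_per_sec, rack_kw, usd_per_kwh, other_usd_per_hour):
    energy_cost_per_hour = rack_kw * usd_per_kwh
    return {
        "throughput_per_kw": samples_per_sec / rack_kw,
        "usd_per_hour": energy_cost_per_hour + other_usd_per_hour,
    }

e = rack_economics(samples_per_sec=4200, rack_kw=10.5, usd_per_kwh=0.18,
                   other_usd_per_hour=3.11)  # amortized capex, hypothetical
print(e["throughput_per_kw"], round(e["usd_per_hour"], 2))  # 400.0 5.0
```

Dividing a training job's sample budget by `throughput_per_kw * rack_kw` then gives cost per training hour; the same frame prices thermal headroom (throughput per CFM) identically.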
Current equipment
- GPUs: 4× NVIDIA GPUs with NVLink (2×2 training/inference rig)
- CPUs: Intel and AMD platforms (Threadripper, Core i9-14900, and Xeon-class systems)
- Switching: Arista and Dell datacenter switches (up to 100 Gbps)
- NICs/CNAs: Mellanox CNAs (50 Gbps, RDMA-capable)
- Interconnect: NVLink; PCIe topology and contention under test
- Cooling: rack-level closed-loop liquid cooling, plus controlled airflow
- Power: wye-fed PSU setup; UPS interaction and load-step behavior under test
- Containment: lab-scale hot aisle containment for repeatable thermal experiments
- Instrumentation: multiple probes/sensors for thermal, airflow, and power measurements
- OOB management: Dell iDRAC (Redfish-first; IPMI fallback) on an isolated OOB network
Current projects
Atlas
Atlas is a research framework for capacity planning and workload placement—built from 20 years of infrastructure operations experience. It uses real-world constraints (latency, population, economic footprint) to make placement decisions visible and defensible.
- Prototype: Project Atlas
- Technical stack: Project Atlas: Technical Stack