URE — Unified Resilience Engineering
“Reliability is the property you get when the seams are understood, the changes are safe, and recovery is real.”
Unified Resilience Engineering is cross-layer by necessity: the ability to operate systems where physics, supply chains, control planes, and incentives all participate in the outage.
It’s not QA. It’s not “buy a tool.” It’s not a slogan. It’s the craft of designing and operating for failure: knowing where the seams are, making change integrity provable, containing blast radius by default, and keeping measurement honest enough to steer by.
The work spans facilities ↔ hardware/firmware ↔ storage/network control planes ↔ security controls (identity/PKI/middleboxes) ↔ software releases ↔ incentives/economics—because production failures happily route around org charts.
The Discipline
Six pillars define how URE thinks about resilience. Each is a lens, not a silo — real incidents cross all of them.
Seams & Dependencies
Thesis: Most outages are dependency failures, not component failures. Resilience starts by mapping seams across vendors, layers, and contracts—then designing for their failure modes.
Stories it captures: Hidden coupling shows up as “random” production behavior.
Mechanisms: dependency graphs, fault-domain boundaries, interface contracts, vendor edge cases, time/ordering.
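One way to make a seam concrete is to walk the dependency graph and ask what a single failure reaches. A minimal sketch, with an invented service graph (all names are illustrative, not a real topology):

```python
from collections import deque

# Hypothetical dependency graph: edges point from a component to the
# components it depends on. Every name here is illustrative.
DEPENDS_ON = {
    "checkout": ["payments", "inventory"],
    "payments": ["pki", "db-primary"],
    "inventory": ["db-primary"],
    "reports": ["db-replica"],
}

def blast_radius(failed: str) -> set:
    """Return every component that transitively depends on `failed`."""
    # Invert the graph: who depends on me?
    dependents = {}
    for svc, deps in DEPENDS_ON.items():
        for d in deps:
            dependents.setdefault(d, []).append(svc)
    impacted, queue = set(), deque([failed])
    while queue:
        node = queue.popleft()
        for upstream in dependents.get(node, []):
            if upstream not in impacted:
                impacted.add(upstream)
                queue.append(upstream)
    return impacted

# A db-primary failure reaches checkout through two separate seams.
print(sorted(blast_radius("db-primary")))  # ['checkout', 'inventory', 'payments']
```

The inverted-graph walk is the whole trick: the "random" production behavior in the story above is usually a path in this graph that nobody had drawn.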
Change Integrity
Thesis: Change is the dominant cause of unreliability. The goal is not fewer changes—it’s changes that are constrained, attributable, and reversible under pressure.
Stories it captures: A “small tweak” crosses a seam and becomes a fleet incident.
Mechanisms: version/firmware gates, staged rollout, guardrails, provenance, rollback criteria, approvals that match blast radius.
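The staged-rollout-plus-rollback-criteria mechanism can be sketched as a gate loop. Stage names, the error-rate threshold, and the observation function are assumptions for illustration, not a real pipeline:

```python
# Illustrative staged-rollout gate: promote a change through stages only
# while its measured error rate stays under a rollback threshold.
STAGES = ["canary", "one-rack", "one-zone", "fleet"]
ROLLBACK_ERROR_RATE = 0.01  # abort if more than 1% of probes fail

def run_rollout(observe):
    """`observe(stage)` returns the measured error rate at that stage.
    Returns ('rolled_back' or 'complete', list of stages reached)."""
    reached = []
    for stage in STAGES:
        reached.append(stage)
        if observe(stage) > ROLLBACK_ERROR_RATE:
            return "rolled_back", reached  # reversible under pressure
    return "complete", reached

# A regression that only appears at zone scale stops here,
# before the "small tweak" becomes a fleet incident.
rates = {"canary": 0.0, "one-rack": 0.002, "one-zone": 0.04}
status, reached = run_rollout(lambda s: rates.get(s, 0.0))
print(status, reached)  # rolled_back ['canary', 'one-rack', 'one-zone']
```

The design point is that the rollback criterion is written down before the rollout starts, so the abort decision is mechanical, not negotiated mid-incident.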
Containment by Design
Thesis: The system should fail in ways you can afford. Containment is architecture and policy that turns unknown failure into bounded impact.
Stories it captures: One mistake should not have permission to become a multi-domain outage.
Mechanisms: segmentation, rate limits, circuit breakers, kill switches, least privilege, fault isolation, policy-as-code.
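Of these mechanisms, the circuit breaker is the easiest to show in a few lines. A minimal sketch (thresholds and timings are invented defaults):

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch: after `max_failures` consecutive
    failures the breaker opens and fails fast for `reset_after` seconds,
    bounding how hard a broken dependency can be hammered."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures, self.reset_after, self.clock = max_failures, reset_after, clock
        self.failures, self.opened_at = 0, None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at, self.failures = None, 0  # half-open: try again
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip: contain the blast
            raise
        self.failures = 0
        return result
```

Injecting the clock makes the open/half-open transition testable without sleeping, which is the same property that makes the containment behavior rehearsable.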
Recovery Readiness
Thesis: Recovery is a capability, not a hope. If you can’t rehearse it, time it, and staff it, you don’t have it.
Stories it captures: The incident ends when the system is stable—not when the ticket is closed.
Mechanisms: runbooks that work, restore drills, dependency-aware RTO/RPO, spares strategy, break-glass access, failure-mode rehearsals.
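"Dependency-aware RTO" has a simple arithmetic core: a service is only recovered once everything beneath it is back, so its effective RTO is its own restore time plus the longest dependency chain. A sketch with invented drill timings (minutes):

```python
# Restore times as measured in (hypothetical) drills, in minutes.
RESTORE_MIN = {"db": 30, "auth": 10, "api": 5, "frontend": 5}
DEPENDS_ON = {"api": ["db", "auth"], "frontend": ["api"], "db": [], "auth": []}

def effective_rto(service: str) -> int:
    """Own restore time plus the slowest dependency's effective RTO."""
    deps = DEPENDS_ON.get(service, [])
    return RESTORE_MIN[service] + max((effective_rto(d) for d in deps), default=0)

print(effective_rto("frontend"))  # 40: frontend(5) + api(5) + db(30)
```

If the drill-measured numbers make this total exceed the promised RTO, the recovery capability does not exist yet, whatever the runbook says.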
Measurement & Truth
Thesis: You can’t operate what you can’t measure honestly. The hardest part is not collecting signals—it’s ensuring they’re time-aligned, calibrated, and decision-grade.
Stories it captures: Everything is green while the system is quietly degrading.
Mechanisms: sensor validation, baselines, time sync, cross-checks, SLOs tied to physics, anomaly triage.
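Sensor cross-checking can be sketched as pairing two streams by timestamp and flagging disagreement. The probes, tolerances, and readings below are invented; the point is that a reading without a time-aligned counterpart is excluded rather than trusted:

```python
# Cross-check two independent sensors that should agree, e.g. two
# inlet-temperature probes on the same rack. Readings are
# (seconds, value) pairs; skew and delta tolerances are illustrative.
def cross_check(a, b, max_skew_s=2.0, max_delta=1.5):
    """Pair each reading in `a` with the nearest-in-time reading in `b`;
    return pairs whose values disagree by more than `max_delta`."""
    suspect = []
    for ts, val in a:
        nearest = min(b, key=lambda r: abs(r[0] - ts))
        if abs(nearest[0] - ts) > max_skew_s:
            continue  # no time-aligned counterpart; can't compare honestly
        if abs(nearest[1] - val) > max_delta:
            suspect.append((ts, val, nearest[1]))
    return suspect

probe_a = [(0, 21.0), (10, 21.2), (20, 27.9)]  # drifting high?
probe_b = [(0, 21.1), (11, 21.3), (21, 22.0)]
print(cross_check(probe_a, probe_b))  # [(20, 27.9, 22.0)]
```

A dashboard fed only by probe_a stays green while probe_b disagrees; the cross-check is what makes the signal decision-grade.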
Incentives & Economics
Thesis: Reliability is negotiated. Systems fail where incentives make risk cheap and outages externalized.
Stories it captures: The root cause is an ownership boundary—priced into behavior.
Mechanisms: contracts, SLAs that map to reality, capex/opex tradeoffs, staffing, accountability, cost of delay, “who pays” modeling.
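"Who pays" modeling often reduces to comparing a mitigation's annual cost against the expected annual cost of the outage it prevents. All numbers below are illustrative; the value is in making the tradeoff explicit, not in the figures:

```python
# Expected annual outage cost vs. annual cost of preventing it.
def expected_outage_cost(outages_per_year, hours_per_outage, cost_per_hour):
    return outages_per_year * hours_per_outage * cost_per_hour

risk = expected_outage_cost(outages_per_year=2, hours_per_outage=4,
                            cost_per_hour=25_000)
mitigation = 60_000  # e.g. redundant feed plus drills, per year (hypothetical)
print(risk, mitigation, "mitigate" if mitigation < risk else "accept")
# 200000 60000 mitigate
```

When the party who bears `risk` is not the party who pays `mitigation`, this arithmetic is exactly where the ownership-boundary root cause gets priced into behavior.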
The Lab
The discipline describes how systems survive. The lab is where we prove it.
Credible results rarely come from uncontrolled conditions. They come from discipline: controlled variables, measured baselines, and experiments that isolate causality. This lab is designed to make results repeatable and production less surprising.
Applied research pillars
These are the testing domains where the six pillars above are validated against real hardware, real power, and real thermal behavior.
Compute & Workload Performance
Training throughput, GPU utilization, NCCL tradeoffs, latency/jitter, virtualization overhead, scheduling effects, Slurm, Kubernetes.
Thermal & Cooling Engineering (Liquid + Air)
Loop design, radiator efficiency, fan/pump control, hotspot mapping, airflow characterization (CFM), CFM-per-watt efficiency, thermal drift, throttling behavior, ambient sensitivity, compute-to-thermal efficiency ratio.
Power & Electrical Engineering
Power transients/load steps, PSU behavior, UPS interaction, power quality (sag/ripple/noise), power capping strategies, perf/watt, rack-level power efficiency, stability under burst loads, brownouts.
I/O, Storage & Data Path
NVMe behavior under load, dataset streaming, ingest pipelines, page cache behavior, PCIe contention, storage-induced jitter, NVMe-oF, HPC storage, shared volumes, data sovereignty, fault-domain storage architecture.
Network & Interconnect
Distributed training sensitivity, congestion and microbursts, topology impacts, RDMA (RoCE/IB when applicable), horizontal scaling, automated security perimeters, mesh/fabric architectures for high throughput, load-balancing algorithms.
Observability & Measurement Science
Instrumentation plan, sensor validation, sampling, time alignment, baselines, confidence intervals, repeatability, experiment hygiene.
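Baselines and confidence intervals together are what separate a regression from run-to-run noise. A sketch using a crude ~95% interval (mean ± 2 standard errors); the throughput samples are invented:

```python
import statistics as st

# Report a baseline as mean +/- a ~95% half-width (t ~ 2 approximation)
# rather than a single number, so later runs are judged against noise.
def baseline(samples):
    mean = st.mean(samples)
    sem = st.stdev(samples) / len(samples) ** 0.5  # standard error of the mean
    return mean, 2.0 * sem

runs = [412.0, 418.5, 409.8, 415.2, 413.6]  # e.g. images/sec over 5 identical runs
mean, half = baseline(runs)

def is_regression(new_value):
    return new_value < mean - half

print(round(mean, 1), round(half, 1))  # 413.8 2.9
```

Five runs and a t≈2 approximation are the floor, not the standard; the experiment-hygiene point is that the noise band is measured before the comparison, not argued about after.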
Platform Automation & Fleet Operations
Auto-inventory (Redfish-first; IPMI fallback), config drift detection, golden images, reproducible provisioning, Infrastructure as Code (IaC), firmware/driver compliance, zero-touch onboarding.
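The "Redfish-first; IPMI fallback" pattern is a small control-flow idea. The sketch below abstracts the transport: `redfish_get` and `ipmi_get` are hypothetical stand-ins for a real Redfish HTTP query and an ipmitool call, so the fallback logic can be shown without inventing API details:

```python
# Redfish-first, IPMI-fallback inventory: try the structured path,
# degrade gracefully, and record which source the data came from so
# drift detection knows how much to trust each field.
def inventory(bmc, redfish_get, ipmi_get):
    try:
        data = redfish_get(bmc)   # rich, schema-backed JSON
        return {"source": "redfish", **data}
    except Exception:
        data = ipmi_get(bmc)      # sparse but widely supported
        return {"source": "ipmi", **data}

# Usage with fake transports standing in for real BMCs:
ok = inventory("10.0.0.5", lambda b: {"model": "R740"}, lambda b: {})

def redfish_down(bmc):
    raise ConnectionError("BMC too old for Redfish")

old = inventory("10.0.0.6", redfish_down, lambda b: {"model": "unknown"})
print(ok["source"], old["source"])  # redfish ipmi
```

Tagging every record with its source is the part that matters for fleet operations: an "ipmi"-sourced field changing is a different signal than a "redfish"-sourced one.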
Reliability & Failure Engineering
Controlled fault injection, thermal/power stress tests, degradation over time, MTBF signals, rollback criteria, postmortems, brownout survivability.
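An MTBF signal, at its simplest, is the mean gap between observed failures. A sketch with invented timestamps; the honest caveat is that the estimate only means something once there are enough events under a stable workload:

```python
# Failure timestamps in hours since the start of a stress campaign
# (values invented). MTBF here is the mean gap between failures.
def mtbf_hours(failure_times):
    gaps = [b - a for a, b in zip(failure_times, failure_times[1:])]
    return sum(gaps) / len(gaps)

failures = [12.0, 100.0, 260.0, 300.0]
print(mtbf_hours(failures))  # 96.0
```

With three gaps this is a signal to watch, not a number to plan around; it becomes decision-grade only as the campaign accumulates events.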
Economics & Capacity (FinOps-style)
Cost per training hour, energy sensitivity, utilization vs waste, capex/opex tradeoffs, constraints-to-dollars narratives, rack efficiency (kW per useful work), throughput per kW, throughput per CFM.
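The constraints-to-dollars arithmetic is short enough to show directly: measured rack power and throughput turn into throughput-per-kW and cost-per-hour. All inputs below are illustrative, not lab measurements:

```python
# Turn measured rack power and throughput into FinOps-style numbers.
def rack_economics(samples_per_sec, rack_kw, usd_per_kwh, other_usd_per_hour):
    energy_cost_per_hour = rack_kw * usd_per_kwh
    return {
        "throughput_per_kw": samples_per_sec / rack_kw,
        "usd_per_hour": energy_cost_per_hour + other_usd_per_hour,
    }

e = rack_economics(samples_per_sec=4200, rack_kw=10.5, usd_per_kwh=0.18,
                   other_usd_per_hour=3.11)  # amortized capex, hypothetical
print(e["throughput_per_kw"], round(e["usd_per_hour"], 2))  # 400.0 5.0
```

Dividing a training job's sample budget by `throughput_per_kw * rack_kw` then gives cost per training hour; the same frame prices thermal headroom (throughput per CFM) identically.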
Current equipment
- GPUs: 4× NVIDIA GPUs with NVLink (2×2 training/inference rig)
- CPUs: Intel and AMD platforms (Threadripper, Core i9-14900, and Xeon-class systems)
- Switching: Arista and Dell datacenter switches (up to 100 Gbps)
- NICs/CNAs: Mellanox CNAs (50 Gbps, RDMA-capable)
- Interconnect: NVLink; PCIe topology and contention under test
- Cooling: rack-level closed-loop liquid cooling, plus controlled airflow
- Power: wye-fed PSU setup; UPS interaction and load-step behavior under test
- Containment: lab-scale hot aisle containment for repeatable thermal experiments
- Instrumentation: multiple probes/sensors for thermal, airflow, and power measurements
- OOB management: Dell iDRAC (Redfish-first; IPMI fallback) on an isolated OOB network
Current projects
Atlas
Atlas is a research framework for capacity planning and workload placement—built from 20 years of infrastructure operations experience. It uses real-world constraints (latency, population, economic footprint) to make placement decisions visible and defensible.
- Prototype: Project Atlas
- Technical stack: Project Atlas: Technical Stack