Resilience Engineering

“Reliability is the property you get when the seams are understood, the changes are safe, and recovery is real.”

Resilience Engineering is a cross-layer discipline: the ability to operate systems in which physics, supply chains, control planes, and incentives can all contribute to an outage.

It’s not QA. It’s not “buy a tool.” It’s not a slogan. It’s the craft of designing and operating for failure: knowing where the seams are, making change integrity provable, containing blast radius by default, and keeping measurement honest enough to steer by.

The work spans facilities ↔ hardware/firmware ↔ storage/network control planes ↔ security controls (identity/PKI/middleboxes) ↔ software releases ↔ incentives/economics—because production failures happily route around org charts.

→ Start here: pick a pillar and browse the linked tag page.

Research Pillars

Seams & Dependencies

Thesis: Most outages are dependency failures, not component failures. Resilience starts by mapping seams across vendors, layers, and contracts, then designing for their failure modes.
Stories it captures: Hidden coupling shows up as “random” production behavior.
Mechanisms: dependency graphs, fault-domain boundaries, interface contracts, vendor edge cases, time/ordering.
Keywords: seams, coupling, shared fate, split-brain, dependency mapping, fault domains, transitive risk
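To make "dependency mapping" and "transitive risk" concrete, here is a minimal sketch. The service names and the `DEPENDS_ON` map are hypothetical, invented purely for illustration; a real map would come from inventory, tracing, or contracts.

```python
from collections import deque

# Hypothetical dependency map: service -> what it directly depends on.
DEPENDS_ON = {
    "checkout": ["payments", "auth"],
    "payments": ["auth", "vendor-gateway"],
    "auth": ["pki"],
    "search": ["index"],
    "vendor-gateway": [],
    "pki": [],
    "index": [],
}

def transitive_deps(graph, service):
    """Return every direct and indirect dependency of `service`."""
    seen = set()
    queue = deque(graph.get(service, []))
    while queue:
        dep = queue.popleft()
        if dep in seen:
            continue  # already visited; also protects against cycles
        seen.add(dep)
        queue.extend(graph.get(dep, []))
    return seen

def shared_fate(graph, a, b):
    """Dependencies two services have in common (shared-fate candidates)."""
    return transitive_deps(graph, a) & transitive_deps(graph, b)

def blast_radius(graph, failed):
    """Services that depend on `failed`, directly or transitively."""
    return {s for s in graph if failed in transitive_deps(graph, s)}
```

In this toy map, `blast_radius(DEPENDS_ON, "pki")` includes `checkout`, even though `checkout` never names `pki` directly. That indirect path is the kind of seam this pillar is about.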

Change Integrity

Thesis: Change is the dominant cause of unreliability. The goal is not fewer changes; it is changes that are constrained, attributable, and reversible under pressure.
Stories it captures: A “small tweak” crosses a seam and becomes a fleet incident.
Mechanisms: version/firmware gates, staged rollout, guardrails, provenance, rollback criteria, approvals that match blast radius.
Keywords: safe change, provenance, drift, rollout, rollback, guardrails, compliance-by-default
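One way to picture "staged rollout" with explicit "rollback criteria" is the sketch below. The stage names, fleet fractions, thresholds, and the `observe_error_rate` callback are all assumptions made up for this example, not a prescribed process.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    fraction: float        # share of the fleet exposed at this stage
    max_error_rate: float  # rollback criterion agreed before the change starts

# Hypothetical plan: exposure grows only after the previous stage stays healthy.
STAGES = [
    Stage("canary", 0.01, 0.001),
    Stage("one-fault-domain", 0.10, 0.002),
    Stage("fleet", 1.00, 0.005),
]

def run_rollout(stages, observe_error_rate):
    """Advance stage by stage; stop and report a rollback at the first breach.

    `observe_error_rate(stage)` is a caller-supplied function returning the
    error rate measured while that stage was exposed.
    """
    completed = []
    for stage in stages:
        rate = observe_error_rate(stage)
        if rate > stage.max_error_rate:
            return {"status": "rolled_back", "failed_stage": stage.name,
                    "completed": completed}
        completed.append(stage.name)
    return {"status": "complete", "failed_stage": None, "completed": completed}
```

The point of the design is that the rollback threshold is written down per stage before the rollout begins, so the decision under pressure is a comparison rather than a debate.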

Containment by Design

Thesis: The system should fail in ways you can afford. Containment is architecture and policy that turns unknown failure into bounded impact.
Stories it captures: One mistake should not have permission to become a multi-domain outage.
Mechanisms: segmentation, rate limits, circuit breakers, kill switches, least privilege, fault isolation, policy-as-code.
Keywords: blast radius, segmentation, least privilege, kill switch, isolation, policy, safety rails
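Of the mechanisms listed, the circuit breaker is the easiest to show in a few lines. The following is a simplified sketch, not a production implementation: the class name, thresholds, and injectable `clock` parameter are illustrative assumptions, and it is not thread-safe.

```python
import time

class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_after = reset_after              # cool-down in seconds
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call rejected")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers fail fast instead of piling load onto a dependency that is already struggling, which is one way to keep a local fault from spreading.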

Recovery Readiness

Thesis: Recovery is a capability, not a hope. If you can’t rehearse it, time it, and staff it, you don’t have it.
Stories it captures: The incident ends when the system is stable, not when the ticket is closed.
Mechanisms: runbooks that work, restore drills, dependency-aware RTO/RPO, spares strategy, break-glass access, failure-mode rehearsals.
Keywords: recovery, restore, drills, RTO/RPO, runbooks, spares, break-glass
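"Dependency-aware RTO" can be illustrated with a small calculation. The sketch below assumes a simplified model: a service can only begin restoring once all of its dependencies are back, independent dependencies restore in parallel, and the graph has no cycles. The service names and restore times are hypothetical.

```python
from functools import lru_cache

# Hypothetical per-service restore times in minutes, as measured in drills.
RESTORE_MINUTES = {"storage": 45, "database": 30, "auth": 10, "api": 5}

# Hypothetical map: service -> what must be restored before it can start.
DEPENDS_ON = {
    "storage": [],
    "database": ["storage"],
    "auth": ["database"],
    "api": ["database", "auth"],
}

def effective_rto(service, restore=RESTORE_MINUTES, deps=DEPENDS_ON):
    """Minutes until `service` is back, counting everything it waits on.

    Assumes an acyclic graph and parallel restoration of independent
    dependencies, so the slowest upstream chain dominates.
    """
    @lru_cache(maxsize=None)
    def walk(s):
        upstream = max((walk(d) for d in deps[s]), default=0)
        return upstream + restore[s]
    return walk(service)
```

Here `api` takes 5 minutes to restore on its own, but `effective_rto("api")` is 90 minutes, because it waits on `auth`, which waits on `database`, which waits on `storage`. An RTO quoted per component hides that chain.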

Measurement & Truth

Thesis: You can’t operate what you can’t measure honestly. The hardest part is not collecting signals; it is ensuring they’re time-aligned, calibrated, and decision-grade.
Stories it captures: Everything is green while the system is quietly degrading.
Mechanisms: sensor validation, baselines, time sync, cross-checks, SLOs tied to physics, anomaly triage.
Keywords: telemetry, baselines, calibration, time alignment, observability, truth, leading indicators
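A small sketch of what "decision-grade" might mean in practice: before trusting a reading, check that it is fresh, that it is time-aligned with an independent reference, and that the two sources agree. The `Reading` type, the `judge` function, and every threshold here are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Reading:
    source: str
    value: float
    timestamp: float  # seconds, on a clock shared by both sources

def judge(primary, reference, now, max_age_s=60.0, max_skew_s=5.0, tolerance=2.0):
    """Classify a primary reading by cross-checking it against a reference."""
    # Staleness: an old "green" value says nothing about the present.
    for r in (primary, reference):
        if now - r.timestamp > max_age_s:
            return f"unknown: {r.source} is stale"
    # Time alignment: comparing samples taken far apart is misleading.
    if abs(primary.timestamp - reference.timestamp) > max_skew_s:
        return "unknown: readings not time-aligned"
    # Cross-check: independent sources should roughly agree.
    if abs(primary.value - reference.value) > tolerance:
        return "suspect: sources disagree"
    return "trusted"
```

The useful property is that the function can answer "unknown" or "suspect" instead of forcing every signal into green or red, which is how a quietly degrading system stays green on a dashboard.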

Incentives & Economics

Thesis: Reliability is negotiated. Systems fail where incentives make risk cheap and outages externalized.
Stories it captures: The root cause is an ownership boundary, priced into behavior.
Mechanisms: contracts, SLAs that map to reality, capex/opex tradeoffs, staffing, accountability, cost of delay, “who pays” modeling.
Keywords: incentives, economics, SLAs, contracts, risk pricing, ownership, externalities