Reliability & Failure Engineering on URE

Recent content in Reliability & Failure Engineering on URE
https://ure.us/pillars/reliability--failure-engineering/
Generator: Hugo 0.162.1, language: en-us, last updated: Sun, 31 May 2026 00:00:00 +0000

Security Research Is Not a Crime
https://ure.us/articles/security-research-is-not-a-crime/
Sun, 31 May 2026 00:00:00 +0000
Microsoft threatened a security researcher with criminal prosecution. The deeper lesson: NIST and ISO mandate independent adversarial testing for a reason.

GPU Fleet AIOps: 7 LLM Backends, 6 Failure Scenarios
https://ure.us/articles/gpu-fleet-aiops-llm-backend-benchmark/
Tue, 07 Apr 2026 00:00:00 +0000
Benchmarking seven LLM backends as autonomous operators for an 8,000-GPU cluster with six realistic failure scenarios and deterministic checklist scoring.

Context Drift Kills AI Agents Before Latency Does
https://ure.us/articles/context-drift-kills-agents-before-latency/
Wed, 11 Mar 2026 00:00:00 +0000
LLM agents on remote hosts drown in unfiltered SSH output. Context drift -- not latency, not cost -- is what kills autonomous fleet operations at scale.

Local LLM Bench: Scaling Swarms Beyond Four
https://ure.us/articles/best-local-llm-scaling-coding-swarms/
Mon, 09 Mar 2026 00:00:00 +0000
Per-task throughput plateaus at four concurrent agents and holds flat through eight. Agents five through eight are free. The contention wall is a floor.

Local LLM Bench: Best Model for Coding Swarms
https://ure.us/articles/best-local-llm-coding-agent-swarm/
Sat, 07 Mar 2026 00:00:00 +0000
MoE is 4.9x faster than Dense when four coding agents share one GPU. We ran the concurrent-load benchmark nobody published - single-request numbers lied.

Local LLM Bench: MoE vs Dense on One RTX 3090
https://ure.us/articles/best-local-llm-agentic-coding/
Fri, 06 Mar 2026 00:00:00 +0000
Real benchmarks on dual RTX 3090: the best local setup for agentic coding is one GPU and an MoE model. 168 tok/s, NVLink optional. Data and recommendations.

The Lone Wolf Starves First
https://ure.us/articles/the-lone-wolf-starves-first/
Sun, 25 Jan 2026 00:00:00 +0000
Blame has structure. Resilient teams distribute load, accountability, and recovery instead of creating a heroic single point of failure.

It Took a Pandemic to Learn Why Standards Failed
https://ure.us/articles/it-took-a-pandemic-to-learn-why-standards-failed/
Fri, 23 Jan 2026 00:00:00 +0000
Outside-in SOPs drift, create friction, and weaken shared fate. Resilient standards are generated in workflow by the people who operate them.

When the Constraint Isn’t Capacity
https://ure.us/articles/when-the-constraint-isnt-capacity/
Tue, 20 Jan 2026 00:00:00 +0000
A bootstorm incident that looked like capacity pressure, until instrumentation revealed a non-existent SQL dependency stalling every request path.

Security Assurance - URE Case - 4/5 - Enabler
https://ure.us/articles/security-assurance-engineering-practical-example-ure-chapter-4/
Thu, 15 Jan 2026 00:00:00 +0000
URE Case 4/5. How security enables business by arriving early with solutions, not vetoes, and reshaping systems to preserve the mission.

MEP Providers Are Never in the Postmortem
https://ure.us/articles/mep-providers-are-never-in-the-postmortem/
Wed, 07 Jan 2026 00:00:00 +0000
Why AI-era data center reliability fails between 'designed' and 'installed' - and how contract chains erase ownership of MEP details when incidents happen.

Tail Latency Killed My Beowulf Cluster in 2006
https://ure.us/articles/tail-latency-killed-beowulf-cluster-2006/
Sun, 04 Jan 2026 00:00:00 +0000
In 2006, I learned that scaling out doesn't work when the interconnect is the bottleneck. Twenty years later, the same physics governs GPU infrastructure.

Predictive Power Conditioning for GPU Clusters
https://ure.us/articles/predictive-power-conditioning-gpu-clusters/
Thu, 18 Dec 2025 00:00:00 +0000
GPU clusters fail on transitions, not sustained load. Predicting step-loads from workload telemetry helps pre-position power controls and reduce surprise.

HVAC Doesn't Create Cold - It Removes Heat
https://ure.us/articles/hvac-doesnt-create-cold-removes-heat/
Sun, 07 Dec 2025 00:00:00 +0000
This is the first in a series on thermal management in data center environments. Cooling isn't magic - it's heat removal, at scale.