<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Reliability &amp; Failure Engineering on URE</title><link>https://ure.us/pillars/reliability--failure-engineering/</link><description>Recent content in Reliability &amp; Failure Engineering on URE</description><generator>Hugo -- 0.162.1</generator><language>en-us</language><lastBuildDate>Sun, 31 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://ure.us/pillars/reliability--failure-engineering/index.xml" rel="self" type="application/rss+xml"/><item><title>Security Research Is Not a Crime</title><link>https://ure.us/articles/security-research-is-not-a-crime/</link><pubDate>Sun, 31 May 2026 00:00:00 +0000</pubDate><guid>https://ure.us/articles/security-research-is-not-a-crime/</guid><description>Microsoft threatened a security researcher with criminal prosecution. The deeper lesson: NIST and ISO mandate independent adversarial testing for a reason.</description></item><item><title>GPU Fleet AIOps: 7 LLM Backends, 6 Failure Scenarios</title><link>https://ure.us/articles/gpu-fleet-aiops-llm-backend-benchmark/</link><pubDate>Tue, 07 Apr 2026 00:00:00 +0000</pubDate><guid>https://ure.us/articles/gpu-fleet-aiops-llm-backend-benchmark/</guid><description>Benchmarking seven LLM backends as autonomous operators for an 8,000-GPU cluster with six realistic failure scenarios and deterministic checklist scoring.</description></item><item><title>Context Drift Kills AI Agents Before Latency Does</title><link>https://ure.us/articles/context-drift-kills-agents-before-latency/</link><pubDate>Wed, 11 Mar 2026 00:00:00 +0000</pubDate><guid>https://ure.us/articles/context-drift-kills-agents-before-latency/</guid><description>LLM agents on remote hosts drown in unfiltered SSH output. 
Context drift -- not latency, not cost -- is what kills autonomous fleet operations at scale.</description></item><item><title>Local LLM Bench: Scaling Swarms Beyond Four</title><link>https://ure.us/articles/best-local-llm-scaling-coding-swarms/</link><pubDate>Mon, 09 Mar 2026 00:00:00 +0000</pubDate><guid>https://ure.us/articles/best-local-llm-scaling-coding-swarms/</guid><description>Per-task throughput plateaus at four concurrent agents and holds flat through eight. Agents five through eight are free. The contention wall is a floor.</description></item><item><title>Local LLM Bench: Best Model for Coding Swarms</title><link>https://ure.us/articles/best-local-llm-coding-agent-swarm/</link><pubDate>Sat, 07 Mar 2026 00:00:00 +0000</pubDate><guid>https://ure.us/articles/best-local-llm-coding-agent-swarm/</guid><description>MoE is 4.9x faster than Dense when four coding agents share one GPU. We ran the concurrent-load benchmark nobody published - single-request numbers lied.</description></item><item><title>Local LLM Bench: MoE vs Dense on One RTX 3090</title><link>https://ure.us/articles/best-local-llm-agentic-coding/</link><pubDate>Fri, 06 Mar 2026 00:00:00 +0000</pubDate><guid>https://ure.us/articles/best-local-llm-agentic-coding/</guid><description>Real benchmarks on dual RTX 3090: the best local setup for agentic coding is one GPU and an MoE model. 168 tok/s, NVLink optional. Data and recommendations.</description></item><item><title>The Lone Wolf Starves First</title><link>https://ure.us/articles/the-lone-wolf-starves-first/</link><pubDate>Sun, 25 Jan 2026 00:00:00 +0000</pubDate><guid>https://ure.us/articles/the-lone-wolf-starves-first/</guid><description>Blame has structure. 
Resilient teams distribute load, accountability, and recovery instead of creating a heroic single point of failure.</description></item><item><title>It Took a Pandemic to Learn Why Standards Failed</title><link>https://ure.us/articles/it-took-a-pandemic-to-learn-why-standards-failed/</link><pubDate>Fri, 23 Jan 2026 00:00:00 +0000</pubDate><guid>https://ure.us/articles/it-took-a-pandemic-to-learn-why-standards-failed/</guid><description>Outside-in SOPs drift, create friction, and weaken shared fate. Resilient standards are generated in workflow by the people who operate them.</description></item><item><title>When the Constraint Isn’t Capacity</title><link>https://ure.us/articles/when-the-constraint-isnt-capacity/</link><pubDate>Tue, 20 Jan 2026 00:00:00 +0000</pubDate><guid>https://ure.us/articles/when-the-constraint-isnt-capacity/</guid><description>A bootstorm incident that looked like capacity pressure, until instrumentation revealed a non-existent SQL dependency stalling every request path.</description></item><item><title>Security Assurance - URE Case - 4/5 - Enabler</title><link>https://ure.us/articles/security-assurance-engineering-practical-example-ure-chapter-4/</link><pubDate>Thu, 15 Jan 2026 00:00:00 +0000</pubDate><guid>https://ure.us/articles/security-assurance-engineering-practical-example-ure-chapter-4/</guid><description>URE Case 4/5. 
How security enables business by arriving early with solutions, not vetoes, and reshaping systems to preserve the mission.</description></item><item><title>MEP Providers Are Never in the Postmortem</title><link>https://ure.us/articles/mep-providers-are-never-in-the-postmortem/</link><pubDate>Wed, 07 Jan 2026 00:00:00 +0000</pubDate><guid>https://ure.us/articles/mep-providers-are-never-in-the-postmortem/</guid><description>Why AI-era data center reliability fails between &amp;#39;designed&amp;#39; and &amp;#39;installed&amp;#39; - and how contract chains erase ownership of MEP details when incidents happen.</description></item><item><title>Tail Latency Killed My Beowulf Cluster in 2006</title><link>https://ure.us/articles/tail-latency-killed-beowulf-cluster-2006/</link><pubDate>Sun, 04 Jan 2026 00:00:00 +0000</pubDate><guid>https://ure.us/articles/tail-latency-killed-beowulf-cluster-2006/</guid><description>In 2006, I learned that scaling out doesn&amp;#39;t work when the interconnect is the bottleneck. Twenty years later, the same physics governs GPU infrastructure.</description></item><item><title>Predictive Power Conditioning for GPU Clusters</title><link>https://ure.us/articles/predictive-power-conditioning-gpu-clusters/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://ure.us/articles/predictive-power-conditioning-gpu-clusters/</guid><description>GPU clusters fail on transitions, not sustained load. Predicting step-loads from workload telemetry helps pre-position power controls and reduce surprise.</description></item><item><title>HVAC Doesn't Create Cold - It Removes Heat</title><link>https://ure.us/articles/hvac-doesnt-create-cold-removes-heat/</link><pubDate>Sun, 07 Dec 2025 00:00:00 +0000</pubDate><guid>https://ure.us/articles/hvac-doesnt-create-cold-removes-heat/</guid><description>This is the first in a series on thermal management in data center environments. 
Cooling isn&amp;#39;t magic - it&amp;#39;s heat removal, at scale.</description></item></channel></rss>