Reliability & Failure Engineering

Security Research Is Not a Crime

It was around 2000. I was running Legion across entire Class B ranges, watching open Windows shares scroll up the screen faster than I could read them. C$. ADMIN$. Whole NT4 boxes answering null sessions like a door with no lock and a welcome mat on the floor. You didn’t need a password. You needed curiosity and a free afternoon. The Microsoft of that era had no Patch Tuesday. No Security Response Center worth the name. Security was a feature request that lost to the ship date, every quarter, on purpose. The company that today runs one of the most disciplined vulnerability programs on the planet once shipped operating systems to hospitals and banks with the equivalent of the front door propped open. ...

GPU Fleet AIOps: 7 LLM Backends, 6 Failure Scenarios

Abstract: After measuring a 292x cost gap between a rented B200 and frontier API providers on a batch inference workload, the logical next question was whether the same pattern held for operational intelligence: could a smaller, dedicated model handle the continuous judgment calls required to run a GPU fleet? The batch inference test had exposed KV cache contention as the dominant bottleneck on shared API infrastructure. Processing similar structured data at scale, but this time continuously rather than in batch, seemed like a valid test of whether that contention would degrade operational quality the same way it degraded throughput. ...

Context Drift Kills AI Agents Before Latency Does

A few weeks ago we hit a production issue on a cloud environment: one XCP-ng host was showing IOPS contention caused by a single guest VM. The classic noisy-neighbor problem on shared storage. The diagnostic path was obvious: cross-reference the dom0 guest list with iostat on the host, find the VM hammering the disk, and work the problem from there. Straightforward correlation, the kind of thing an experienced operator resolves in fifteen minutes with two terminal windows. ...
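The correlation step described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure actually used on that host: the device names, VM names, and sample values are all hypothetical, and the device-to-VM mapping is assumed to have been collected already from the hypervisor's own tooling. Only the join itself is shown: sum read and write operations per second for each device, attribute them to the owning guest, and rank.

```python
from collections import defaultdict

# Hypothetical inputs. On a real host the mapping would come from the
# hypervisor's tooling and the samples from iostat; both are hard-coded here.
DEVICE_TO_VM = {"dev-a": "vm-web", "dev-b": "vm-db", "dev-c": "vm-db"}
IOSTAT_SAMPLES = [
    {"device": "dev-a", "r_s": 120.0, "w_s": 80.0},
    {"device": "dev-b", "r_s": 2500.0, "w_s": 1900.0},
    {"device": "dev-c", "r_s": 300.0, "w_s": 150.0},
    {"device": "dev-x", "r_s": 5.0, "w_s": 1.0},  # no guest mapped to it
]


def iops_by_vm(samples, device_to_vm):
    """Sum read+write IOPS per VM and return (vm, iops) pairs, busiest first."""
    totals = defaultdict(float)
    for row in samples:
        # Devices with no known owner are grouped rather than dropped,
        # so unexplained I/O stays visible in the ranking.
        vm = device_to_vm.get(row["device"], "(unmapped)")
        totals[vm] += row["r_s"] + row["w_s"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


ranking = iops_by_vm(IOSTAT_SAMPLES, DEVICE_TO_VM)
for vm, iops in ranking:
    print(f"{vm:12s} {iops:8.0f} IOPS")
```

The first entry in `ranking` is the candidate noisy neighbor. Keeping an "(unmapped)" bucket is a design choice for this sketch: it flags I/O that the guest list cannot explain.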

Local LLM Bench: Scaling Swarms Beyond Four

Part 2 ended with a promise: find the cliff. Run the MoE model from four concurrent agents upward until the physics says stop. We scaled to eight. The cliff never came. This is Part 3 of the Local LLM Bench series. Part 1 covers the single-request baseline. Part 2 established the MoE advantage under concurrent load. The model: Qwen3-Coder-30B-A3B — a Mixture-of-Experts architecture that activates only 3.3B of its 30B parameters per token. On consumer GPUs, that sparse activation leaves ~90% of memory bandwidth idle at batch size 1, creating headroom that concurrent agents fill. Dense models activate all 32B parameters on every token — already at the bandwidth ceiling before the second agent connects. Part 1 explains why these specific models were chosen (best in class for each architecture); Part 2 conclusively eliminated Dense under concurrent load. This benchmark tests MoE only. ...
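The "~90%" figure follows from the two parameter counts quoted above. A minimal sketch of that arithmetic, assuming per-token weight traffic scales with the number of active parameters (a simplification that ignores KV cache reads, shared layers, and quantization details):

```python
# Figures quoted in the text; the function and variable names are illustrative.
TOTAL_PARAMS_B = 30.0   # total parameters, in billions
ACTIVE_PARAMS_B = 3.3   # parameters activated per token, in billions


def active_fraction(active_b: float, total_b: float) -> float:
    """Fraction of the model's weights touched for one generated token."""
    if total_b <= 0:
        raise ValueError("total_b must be positive")
    return active_b / total_b


frac = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
print(f"active per token:    {frac:.1%}")
print(f"untouched per token: {1 - frac:.1%}")
```

Under that assumption, roughly 11% of the weights are read per token and about 89% are not, which is where the headroom for additional concurrent agents comes from.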

Local LLM Bench: Best Model for Coding Swarms

In Part 1, we established the baseline: MoE delivers 168 tok/s on a single RTX 3090, 4.1x faster than Dense. Clean single-request numbers. One prompt in, one response out. That’s not how swarms work. An orchestrator like Claude Code dispatches four coding tasks simultaneously. The local model serves all four. Under concurrency, memory bandwidth saturates, per-task throughput drops, and the architecture of the model — not the GPU, the model — determines whether you get useful parallelism or just contention. ...

Local LLM Bench: MoE vs Dense on One RTX 3090

I went looking for sustained-load benchmarks comparing MoE and Dense coding models on consumer GPUs. Not demo bursts on a Mac Mini — sustained autoregressive generation on real coding tasks, where architecture and interconnect are the only variables. I found plenty of one-shot numbers. Nobody had published the comparison that matters: same hardware, same quantization, same inference engine, MoE versus Dense, across GPU configurations. Methodology visible. Numbers verifiable. So I ran the tests. Dual RTX 3090s with NVLink, custom liquid cooling, a 6 kW isolation transformer feeding a double-conversion UPS. Not elegant, but thermally and electrically honest — sustained inference loads without throttling, no measurement fiction. The hardware details are below. ...

The Lone Wolf Starves First

A few months ago I read Project Hail Mary and found myself thinking about observation and agency. Einstein didn’t “invent” spacetime dilation—he created the conditions to perceive it. Without the means to observe, you’re just touching walls in complete darkness. Trial and error, yes, but you never truly know the depth of what you’re sensing. Saturday mornings I take my son to flag football. He’s been in martial arts for half his life—his coach loves his resilience. But something surfaced in team sports that doesn’t appear on the mat. ...

It Took a Pandemic to Learn Why Standards Failed

In 2015, I did what seemed like the mature thing to do. I created a Production Engineering department. My college foundation was production engineering. I was a true believer: if we formalized standards and assigned a dedicated group to own operational rigor, the organization would naturally converge toward consistency. The mandate: Create SOPs. Define standards. Reduce variance. Improve reliability. On paper, it was textbook. In practice, it was a slow-motion collision with reality. ...

When the Constraint Isn’t Capacity

A few years ago, as Field CTO for an enterprise customer, I was pulled into a rescue effort that started the way these stories usually start: pain, urgency, and a narrative that felt convenient. The application hit a bootstorm—150,000+ users slamming it in a short window—and then the predictable second-order effect: every day after that, more tickets piled up. Instability. Session timeouts. Intermittent failures. The kind of symptoms that turn a service into a rumor. ...

Security Assurance - URE Case - 4/5 - Enabler

Security enables the business when it shows up with agency: not just identifying risk, but carrying enough context to propose solutions that preserve the mission. That requires a maturity shift. When security arrives late, it often speaks in “non-English.” It blocks because the system is already committed to choices no one can defend. ...

MEP Providers Are Never in the Postmortem

In 2021, I bought a home in Florida. The closing was in August, so imagine hot summer days with temperatures over 100 degrees and humidity over 80%. When we selected the builder, I noted two things: a 15 SEER HVAC system and R-39 insulation. My house would be minimally energy efficient. I had no option to upgrade the HVAC, but 15 SEER is “good enough”. In the first week in the house, my wife noticed I was bothered every time the compressor kicked in: there was a subtle, almost imperceptible dip in the lights. Nobody else noticed it, but I did. Battle-proven engineer with experience in thermal and power transients. What could happen? ...

Tail Latency Killed My Beowulf Cluster in 2006

Right now, I’m working on an InfiniBand topology design for a GPU cluster. The math keeps pointing to the same conclusion: scale-out only makes sense when scale-in has topped out. It’s not about CUDA cores. It’s not about tensor throughput. It’s about tail latency. NVLink keeps GPU-to-GPU communication on-package or over short copper links: no NIC, no PCIe host traversal, no protocol stack. For small messages, that means latency in the hundreds of nanoseconds. InfiniBand NDR switches add sub-microsecond port-to-port latency, but once you include the full path (PCIe to the NIC, driver overhead, fabric hops, and back), real-world GPU-to-GPU latency across nodes often lands in the 3-10 μs range, depending on message size and topology. ...