Improving Business Resiliency Through Security Assurance

Every company says security is a priority. Every company also ships under pressure. The gap between those two statements is where businesses bleed. I’ve watched organizations with excellent engineers and serious budgets still get humbled by the same pattern: teams optimize locally (features, velocity, “my backlog”), while the system pays globally (incidents, outages, churn, reputational drag). When things go south, it rarely takes a cinematic attacker or a once-in-a-decade failure. ...

January 13, 2026 · Stefano Schotten

MEP Providers Are Never in the Postmortem

In 2021, I bought a home in Florida. The closing was in August, so imagine the hot summer days with temperatures over 100 degrees and humidity over 80%. When we selected the builder, I noted two things: HVAC with 15 SEER and insulation at R-39. My house would be at least minimally energy efficient. I had no option to upgrade the HVAC, but 15 SEER is “good enough”. In our first week in the house, my wife realized I was getting bothered every time the compressor kicked in - there was a subtle, almost imperceptible, hit on the lights. Nobody else noticed it, but I did: a battle-proven engineer with experience in thermal and power transients. What could happen? ...

January 7, 2026 · Stefano Schotten

Why GPU Fleet Control Starts with a Map

I’m currently working on the design of a framework for GPU fleet management. We’re living in a crowded data center reality where everybody wants “hero” compute — dense GPUs, fast networking, and delivery that’s closer to the edge. We’re in a land-grab phase where every business wants to be everywhere, but most teams are discovering the same thing: buying GPUs is the easy part. Operating them as a coherent fleet is the hard part. ...

January 7, 2026 · Stefano Schotten

Tail Latency Killed My Beowulf Cluster in 2006 — It's Killing Your GPU Fleet Today

Right now, I’m working on an InfiniBand topology design for a GPU cluster. The math keeps pointing to the same conclusion: scale-out only makes sense when scale-in has topped out. It’s not about CUDA cores. It’s not about tensor throughput. It’s about tail latency. NVLink keeps GPU-to-GPU communication on-package or over short copper links - no NIC, no PCIe host traversal, no protocol stack. For small messages, that means sub-microsecond latency in the hundreds-of-nanoseconds range. InfiniBand NDR switches add sub-microsecond port-to-port latency, but once you include the full path - PCIe to the NIC, driver overhead, fabric hops, and back - real-world GPU-to-GPU latency across nodes often lands in the 3-10 μs range, depending on message size and topology. ...

January 4, 2026 · Stefano Schotten