GPU Fleet AIOps: 7 LLM Backends, 6 Failure Scenarios

Abstract After measuring a 292x cost gap between a rented B200 and frontier API providers on a batch inference workload, the logical next question was whether the same pattern held for operational intelligence: could a smaller, dedicated model handle the continuous judgment calls required to run a GPU fleet? The batch inference test had exposed KV cache contention as the dominant bottleneck on shared API infrastructure. Processing similar structured data at scale, this time continuously rather than in batch, seemed like a valid test of whether that contention would degrade operational quality the same way it had degraded throughput. ...

Tail Latency Killed My Beowulf Cluster in 2006

Right now, I’m working on an InfiniBand topology design for a GPU cluster. The math keeps pointing to the same conclusion: scale-out only makes sense once scale-up has topped out. It’s not about CUDA cores. It’s not about tensor throughput. It’s about tail latency. NVLink keeps GPU-to-GPU communication on-package or over short copper links — no NIC, no PCIe host traversal, no protocol stack. For small messages, that means latency in the hundreds of nanoseconds. InfiniBand NDR switches add sub-microsecond port-to-port latency, but once you include the full path — PCIe to the NIC, driver overhead, fabric hops, and back — real-world GPU-to-GPU latency across nodes often lands in the 3-10μs range depending on message size and topology. ...
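As a rough sanity check on why the full path dominates the switch, here is a toy latency budget in Python. Every component value is an assumption chosen for illustration, not a measurement; the point is that the NDR switch hops are a minor term next to PCIe traversal and driver overhead.

```python
# Illustrative latency budget for a small cross-node GPU-to-GPU message
# over InfiniBand NDR. All values in microseconds and all are assumed
# round numbers for illustration, not benchmarks.
LATENCY_US = {
    "pcie_to_nic": 1.0,        # host-side PCIe traversal to the HCA (assumed)
    "driver_and_stack": 0.7,   # verbs/driver/protocol overhead (assumed)
    "switch_hops": 2 * 0.3,    # two fabric hops at ~300 ns port-to-port (assumed)
    "wire": 0.1,               # propagation over short links (assumed)
    "remote_nic_to_pcie": 1.0, # remote-side PCIe traversal to the GPU (assumed)
}

total = sum(LATENCY_US.values())
print(f"estimated cross-node latency: {total:.1f} us")

# NVLink stays on-package or on short copper, so assume a few hundred ns.
NVLINK_US = 0.3
print(f"ratio vs NVLink: {total / NVLINK_US:.0f}x")
```

Even with these charitable assumptions the cross-node total lands at the low end of the 3-10μs range, roughly an order of magnitude above NVLink, and the switch itself contributes only a fraction of it.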