GPU Cluster Tail Latency Analysis

Local LLM Bench: Scaling Swarms Beyond Four

Part 2 ended with a promise: find the cliff. Run the MoE model from four concurrent agents upward until the physics says stop. We scaled to eight. The cliff never came. This is Part 3 of the Local LLM Bench series. Part 1 covers the single-request baseline. Part 2 established the MoE advantage under concurrent load. The model: Qwen3-Coder-30B-A3B — a Mixture-of-Experts architecture that activates only 3.3B of its 30B parameters per token. On consumer GPUs, that sparse activation leaves ~90% of memory bandwidth idle at batch size 1, creating headroom that concurrent agents fill. Dense models activate all 32B parameters on every token — already at the bandwidth ceiling before the second agent connects. Part 1 explains why these specific models were chosen (best in class for each architecture); Part 2 conclusively eliminated Dense under concurrent load. This benchmark tests MoE only. ...

Local LLM Bench: Best Model for Coding Swarms

In Part 1, we established the baseline: MoE delivers 168 tok/s on a single RTX 3090, 4.1x faster than Dense. Clean single-request numbers. One prompt in, one response out. That’s not how swarms work. An orchestrator like Claude Code dispatches four coding tasks simultaneously. The local model serves all four. Under concurrency, memory bandwidth saturates, per-task throughput drops, and the architecture of the model — not the GPU, the model — determines whether you get useful parallelism or just contention. ...

Local LLM Bench: MoE vs Dense on One RTX 3090

I went looking for sustained-load benchmarks comparing MoE and Dense coding models on consumer GPUs. Not demo bursts on a Mac Mini — sustained autoregressive generation on real coding tasks, where architecture and interconnect are the only variables. I found plenty of one-shot numbers. Nobody had published the comparison that matters: same hardware, same quantization, same inference engine, MoE versus Dense, across GPU configurations. Methodology visible. Numbers verifiable. So I ran the tests. Dual RTX 3090s with NVLink, custom liquid cooling, a 6 kW isolation transformer feeding a double-conversion UPS. Not elegant, but thermally and electrically honest — sustained inference loads without throttling, no measurement fiction. The hardware details are below. ...

Tail Latency Killed My Beowulf Cluster in 2006

Right now, I’m working on an InfiniBand topology design for a GPU cluster. The math keeps pointing to the same conclusion: scale-out only makes sense when scale-in has topped out. It’s not about CUDA cores. It’s not about tensor throughput. It’s about tail latency. NVLink keeps GPU-to-GPU communication on-package or over short copper links — no NIC, no PCIe host traversal, no protocol stack. For small messages, that means sub-microsecond latency in the hundreds-of-nanoseconds range. InfiniBand NDR switches add sub-microsecond port-to-port latency, but once you include the full path — PCIe to the NIC, driver overhead, fabric hops, and back — real-world GPU-to-GPU latency across nodes often lands in the 3-10μs range depending on message size and topology. ...