292x: Why Batch Inference Breaks on API Pricing

292x. That’s not a rounding error. That’s the cost multiplier between running a batch inference job on a rented B200 GPU and sending the same workload through Claude Opus 4.6’s API. The job was straightforward: generate one or two contextual sentences for each of a million documents, JSON records extracted from the corporate PDF archive I’ve been building a RAG pipeline around. Those sentences get prepended to each chunk before embedding into Qdrant as 768-dimensional dense vectors with BM25 sparse indexing. It’s the contextual layer that makes retrieval actually work, the step I described in the previous article about why a million PDFs won’t organize themselves. ...
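The contextualization step itself is small; the cost lives in generating a million context sentences. A minimal sketch of the chunk-preparation side, with hypothetical function and field names (the real job batches these through an embedding model before upsert, and the payload layout is illustrative, not Qdrant's required schema):

```python
def contextualize(context: str, chunk: str) -> str:
    # Prepend the generated one-to-two-sentence context to the raw
    # chunk; the combined text is what gets embedded, not the chunk alone.
    return f"{context.strip()}\n\n{chunk.strip()}"

def to_point(doc_id: str, idx: int, context: str, chunk: str,
             dense_vec: list[float]) -> dict:
    # Shape of one upsert record: a 768-dim dense vector, plus the
    # contextualized text kept in the payload so sparse (BM25) indexing
    # sees the same enriched text. Field names here are illustrative.
    assert len(dense_vec) == 768, "embedding model must emit 768 dims"
    return {
        "id": f"{doc_id}:{idx}",
        "vector": dense_vec,
        "payload": {"doc_id": doc_id, "text": contextualize(context, chunk)},
    }
```

The design point: context is baked in before embedding, so both the dense and sparse representations carry it; nothing downstream has to reassemble document context at query time.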

Local LLM Bench: Scaling Swarms Beyond Four

Part 2 ended with a promise: find the cliff. Run the MoE model from four concurrent agents upward until the physics says stop. We scaled to eight. The cliff never came. This is Part 3 of the Local LLM Bench series. Part 1 covers the single-request baseline and explains why these specific models were chosen (best in class for each architecture); Part 2 established the MoE advantage under concurrent load and eliminated Dense from contention. The model: Qwen3-Coder-30B-A3B — a Mixture-of-Experts architecture that activates only 3.3B of its 30B parameters per token. On consumer GPUs, that sparse activation leaves ~90% of memory bandwidth idle at batch size 1, creating headroom that concurrent agents fill. Dense models activate all 32B parameters on every token — already at the bandwidth ceiling before the second agent connects. This benchmark tests MoE only. ...
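Scaling "four concurrent agents upward" is just firing N requests at once and timing the aggregate. A minimal harness sketch, with the transport injected so it works against any local OpenAI-compatible endpoint (the `send` callable and result keys are my names, not the benchmark's actual tooling):

```python
import asyncio
import time
from typing import Awaitable, Callable

async def run_swarm(n_agents: int,
                    send: Callable[[str], Awaitable[int]],
                    prompts: list[str]) -> dict:
    # Dispatch n_agents prompts concurrently; `send` performs one
    # completion request and returns the number of tokens generated.
    start = time.perf_counter()
    tokens = await asyncio.gather(*(send(p) for p in prompts[:n_agents]))
    elapsed = time.perf_counter() - start
    total = sum(tokens)
    return {
        "agents": n_agents,
        "aggregate_tok_s": total / elapsed,            # whole-swarm throughput
        "per_agent_tok_s": total / elapsed / n_agents,  # what each agent feels
    }
```

The cliff, if it exists, shows up as `per_agent_tok_s` collapsing faster than `agents` grows; sweeping `n_agents` from 4 to 8 and plotting both numbers is the whole experiment.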

Local LLM Bench: Best Model for Coding Swarms

In Part 1, we established the baseline: MoE delivers 168 tok/s on a single RTX 3090, 4.1x faster than Dense. Clean single-request numbers. One prompt in, one response out. That’s not how swarms work. An orchestrator like Claude Code dispatches four coding tasks simultaneously. The local model serves all four. Under concurrency, memory bandwidth saturates, per-task throughput drops, and the architecture of the model — not the GPU, the model — determines whether you get useful parallelism or just contention. ...
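Why the model, not the GPU, decides this comes down to back-of-envelope arithmetic: each decoded token must stream every active weight through VRAM once. A sketch under stated assumptions (4-bit quantization at roughly 0.55 bytes per parameter including overhead, and a nominal ~936 GB/s for an RTX 3090; illustrative round numbers, not benchmark measurements):

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float,
                         active_params_b: float,
                         bytes_per_param: float = 0.55) -> float:
    # Upper bound on single-stream decode speed: every token reads all
    # active weights from VRAM once, so tok/s <= bandwidth / bytes-per-token.
    return bandwidth_gb_s / (active_params_b * bytes_per_param)

# Dense reads all 32B parameters per token; the MoE activates only ~3.3B.
dense_cap = decode_ceiling_tok_s(936, 32)   # roughly 53 tok/s
moe_cap   = decode_ceiling_tok_s(936, 3.3)  # roughly 516 tok/s
```

Dense is already pinned near its ceiling with one request in flight; the MoE's measured single-request throughput sits far below its ceiling, and that gap is the bandwidth headroom concurrent agents consume.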

The Heat Nobody Counts: PUE Ends at the Meter

Meta’s Prometheus data center in New Albany, Ohio, is scaling to 1.2 GW. To get there, they’re building behind-the-meter natural gas turbines — two 200 MW Socrates generation facilities, supplied by dedicated gas pipelines, isolated from the grid. In Virginia, the same story plays out with diesel generators, enough of them that it became the top legislative concern entering the 2026 session. The industry talks about PUE as if it were a verdict on environmental efficiency. It isn’t. PUE measures one envelope — the data center facility. Total facility power divided by IT equipment power. A PUE of 1.3 means 30% overhead for cooling, lighting, and support systems. That’s the metric everyone optimizes, the number that shows up in sustainability reports, the figure that earns applause at conferences. ...
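The definition quoted above is a single ratio, which is exactly why it says nothing about what happens before the meter. As arithmetic (function names mine):

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    # Power Usage Effectiveness: total facility power / IT equipment power.
    # Everything upstream of the facility meter -- generation losses,
    # turbine waste heat, grid transmission -- is outside the envelope.
    return total_facility_kw / it_equipment_kw

def overhead_fraction(p: float) -> float:
    # A PUE of 1.3 means 30% of IT power again goes to cooling,
    # lighting, and support systems inside the facility.
    return p - 1.0
```

A behind-the-meter gas turbine at ~40% thermal efficiency rejects more energy as heat than it delivers as electricity, yet none of that appears in this ratio; that is the blind spot the article is about.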

Local LLM Bench: MoE vs Dense on One RTX 3090

I went looking for sustained-load benchmarks comparing MoE and Dense coding models on consumer GPUs. Not demo bursts on a Mac Mini — sustained autoregressive generation on real coding tasks, where architecture and interconnect are the only variables. I found plenty of one-shot numbers. Nobody had published the comparison that matters: same hardware, same quantization, same inference engine, MoE versus Dense, across GPU configurations. Methodology visible. Numbers verifiable. So I ran the tests. Dual RTX 3090s with NVLink, custom liquid cooling, a 6 kW isolation transformer feeding a double-conversion UPS. Not elegant, but thermally and electrically honest — sustained inference loads without throttling, no measurement fiction. The hardware details are below. ...

The Concorde Problem in AI Infrastructure

The Concorde burned one ton of fuel per passenger to cross the Atlantic. One hundred seats. Three and a half hours. Mach 2. The most advanced commercial aircraft ever built — and every engineer who saw it wanted to believe it was the future. The 747 did the same crossing in seven hours. Four hundred seats. A quarter of the fuel per passenger. No afterburners. No sonic boom. No government subsidies keeping it alive. ...

AI and Society: Three Phases of Tech Adoption

I see people everywhere anxious about whether AI will disrupt their jobs, their industries, their lives. I’ve always approached this with calm. Not indifference—calm. The future rarely sends advance notice, but it is always arriving. This isn’t news. It’s the human condition. A few years ago, I attended a keynote by Michio Kaku where he framed—perfectly, for me—the relationship between humanity and technological change. What follows is my version. I can’t claim novelty, and I’m not a domain expert in sociology or economics. I’m an infrastructure builder observing the same pattern from the inside. ...

The Entropy of Sovereign AI: Map vs. Territory

A few years ago, I was having dinner with the Americas VP of a European energy supermajor — one of those companies that extracts oil from war zones, negotiates with regimes that don’t appear on polite lists, and operates in places where “political risk” means your assets might get nationalized or your personnel kidnapped. Seventy-plus countries. Active operations in Libya, Nigeria, Angola, Myanmar, Yemen. The kinds of places where security briefings come before breakfast. ...