Cost Per Token Infrastructure Design

In the Long Run, Economics Wins

Two postures have hardened around the cost of AI, and most leaders have already picked one without registering it as a choice. The first says zero dollars per token. Own the silicon, run the weights locally, drive the marginal cost of a query to nothing. Apple’s M3 through M5 put a capable model on a machine that fits in a backpack, NVIDIA’s GB10 desktop box puts a small token factory under the desk, and the appeal is clean: no meter, no vendor, no bill that grows every time the team does its job. ...

Frontier AI Is a System, Not a Model

Yesterday a code editor sold for sixty billion dollars. SpaceX exercised an option it had struck back in April. The terms were unusually clean: buy Anysphere, the company behind the Cursor editor, outright for $60 billion in stock, or walk away and pay $10 billion just to partner. It bought. CBS reported the deal the same week SpaceX went public. Cursor leans heavily on Anthropic’s models today, and the new owner has already said it will drop its own models and Grok’s coding agent into that seat. ...

GPU Fleet AIOps: The Augmented Operator

Two in the morning, eighteen hours into the run. Seven LLM backends processing the same stream of GPU cluster anomalies. Same thermal cascades, same NVLink errors, same KV cache evictions. I’m watching the scoring dashboard update in real time and the numbers are breaking my assumptions faster than I can take notes. The $32-per-day model is getting the diagnosis wrong more often than a free one running on my workstation. ...

The Concorde Problem in AI Infrastructure

The Concorde burned one ton of fuel per passenger to cross the Atlantic. One hundred seats. Three and a half hours. Mach 2. The most advanced commercial aircraft ever built — and every engineer who saw it wanted to believe it was the future. The 747 did the same crossing in seven hours. Four hundred seats. A quarter of the fuel per passenger. No afterburners. No sonic boom. No government subsidies keeping it alive. ...

AI Infrastructure Placement Is a Business Decision

Traditional internet architecture solved latency with caching. Static content, images, JavaScript bundles: all pushed to edge nodes milliseconds from users. CDNs routinely achieve 95-99% cache hit rates on that kind of content. The compute stays centralized; the content moves to the edge. AI breaks this model completely. Every inference requires real GPU cycles. You can’t cache a conversation. You can’t pre-compute a response to a question that hasn’t been asked. The token that completes a sentence depends on every token before it. ...