In the Long Run, Economics Wins

Two postures have hardened around the cost of AI, and most leaders have already picked one without registering it as a choice. The first says zero dollars per token. Own the silicon, run the weights locally, drive the marginal cost of a query to nothing. Apple’s M3 through M5 put a capable model on a machine that fits in a backpack, NVIDIA’s GB10 desktop box puts a small token factory under the desk, and the appeal is clean: no meter, no vendor, no bill that grows every time the team does its job. ...

292x: Why Batch Inference Breaks on API Pricing

292x. That’s not a rounding error. That’s the cost multiplier between running a batch inference job on a rented B200 GPU and sending the same workload through Claude Opus 4.6’s API. The job was straightforward: generate one or two contextual sentences for each of a million documents, extracted JSON from the corporate PDF archive I’ve been building a RAG pipeline around. Those sentences get prepended to each chunk before embedding into Qdrant’s 768-dimensional vectors with BM25 sparse indexing. It’s the contextual layer that makes retrieval actually work, the step I described in the previous article about why a million PDFs won’t organize themselves. ...