292x: Why Batch Inference Breaks on API Pricing

292x. That’s not a rounding error. That’s the cost multiplier between running a batch inference job on a rented B200 GPU and sending the same workload through Claude Opus 4.6’s API. The job was straightforward: generate one or two contextual sentences for each of a million documents, JSON extracted from the corporate PDF archive I’ve been building a RAG pipeline around. Those sentences get prepended to each chunk before it’s embedded into Qdrant as a 768-dimensional dense vector alongside a BM25 sparse index. It’s the contextual layer that makes retrieval actually work, the step I described in the previous article about why a million PDFs won’t organize themselves. ...
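The prepend step itself is simple to sketch. Here's a minimal illustration, with the caveat that the function names and point shape are my own assumptions, not the article's actual code: the generated sentences are joined onto the chunk text, and that combined string is what both the dense embedder and the BM25 tokenizer see.

```python
def contextualize(chunk: str, context: str) -> str:
    """Prepend generated contextual sentence(s) to a chunk so both the
    dense embedding and the BM25 sparse index see document-level context."""
    return f"{context.strip()}\n\n{chunk.strip()}"


def build_point(point_id: str, chunk: str, context: str) -> dict:
    """Hypothetical shape of a record before upsert into a vector store
    configured with a 768-dim dense vector and a sparse (BM25) vector."""
    text = contextualize(chunk, context)
    return {
        "id": point_id,
        "text": text,  # embedded downstream, dense and sparse
        "payload": {"context": context},
    }
```

The point is that the contextual sentence is baked into the indexed text itself, which is why it has to be generated once per chunk up front, at batch-inference scale.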

A Million PDFs Won't Organize Themselves

Three months ago, I plugged a Blackwell GPU into my lab bench and pointed it at a million PDFs. Corporate documents: contracts, engineering reports, compliance filings, vendor proposals, maintenance logs, insurance certificates. The kind of archive that accumulates over two decades of running critical infrastructure. A million files. Not sampled, not curated, not cleaned. Raw. The plan was straightforward: parse the documents, chunk them, embed them into a vector store, wire up a retrieval layer, and start asking questions no keyword search could answer. Retrieval-Augmented Generation. The acronym that launched a thousand vendor decks. ...
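The plan above reduces to a few composable steps. A skeleton, under stated assumptions (the chunk size, overlap, and function names are illustrative; embedding and the vector-store upsert are stubbed out):

```python
def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Naive fixed-window chunker with overlap. A real pipeline would
    split on structural boundaries (sections, paragraphs) instead."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks, start = [], 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back so adjacent chunks share context
    return chunks


def build_index(docs: dict[str, str]) -> list[dict]:
    """Parse -> chunk -> (embed, store) skeleton: fan each document out
    into addressable chunks ready for embedding and upsert."""
    points = []
    for doc_id, text in docs.items():
        for i, chunk in enumerate(chunk_text(text)):
            points.append({"id": f"{doc_id}#{i}", "text": chunk})
    return points
```

At a million raw, uncurated PDFs, the hard part isn't this loop; it's the parsing that has to happen before it, and keeping the whole thing cheap enough to run end to end.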