292x. That’s not a rounding error. That’s the cost multiplier between running a batch inference job on a rented B200 GPU and sending the same workload through Claude Opus 4.6’s API.

The job was straightforward: generate one or two contextual sentences for each of a million documents, JSON extracted from the corporate PDF archive I've been building a RAG pipeline around. Those sentences get prepended to each chunk before embedding into Qdrant as 768-dimensional dense vectors with BM25 sparse indexing. It's the contextual layer that makes retrieval actually work, the step I described in the previous article about why a million PDFs won't organize themselves.
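The prepend-then-embed step is simple enough to show directly. This is a minimal sketch, not the pipeline's actual code: the function name, the sample strings, and the comment about what happens downstream are all illustrative assumptions.

```python
def build_contextual_chunk(context_sentences: str, chunk_text: str) -> str:
    """Prepend the model-generated context to a chunk before embedding.

    The one or two contextual sentences situate the chunk within its
    source document, which is what lets dense + BM25 retrieval land on
    the right passage instead of an orphaned fragment.
    """
    return f"{context_sentences.strip()}\n\n{chunk_text.strip()}"


# Hypothetical example data, not from the actual corpus.
chunk = "Q3 revenue rose 12% on the strength of the industrial segment."
context = (
    "This chunk is from the annual report of a manufacturing firm, "
    "in the section discussing segment performance."
)

payload = build_contextual_chunk(context, chunk)
# `payload` is what gets embedded as a 768-dimensional dense vector
# (plus a BM25 sparse vector) and upserted into Qdrant.
```

The point of the wrapper is that the context travels with the chunk into the embedding, not as separate metadata: the retriever scores the combined text.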

One B200. Eleven hours. Seventy dollars.

I ran the same workload estimate against every major API provider I have an account with. The numbers broke something in my understanding of how people are building these systems.

The Numbers

| Provider | Cost | vs B200 | Wall Clock |
| --- | --- | --- | --- |
| B200 self-hosted (Qwen-3.5-27B-NVFP4) | $70 | 1x | 11 hours |
| Qwen3.5 (DashScope, 120b-a10b) | $140 | 2x | 23 days |
| GPT-4.1-mini | $508 | 7x | 58 days |
| GLM-5 (Together) | $599 | 9x | 39 days |
| Claude Sonnet 4.6 | $4,084 | 58x | 77 days |
| GPT-5.4 | $6,353 | 91x | 96 days |
| Claude Opus 4.6 | $20,419 | 292x | 144 days |

All APIs tested were the platform-level endpoints, the ones an enterprise integration would consume. Not the consumer-facing Max or Pro plans running behind proprietary agent frameworks. Direct API, programmatic access, the way a production pipeline would call them.

I didn’t test Google Vertex AI, Amazon Bedrock, or Azure AI Studio. Those are white-label distribution layers for the same underlying models. If I’m testing Claude, I go to Anthropic’s API, not Bedrock’s resale of it. If I’m testing GPT, I go to OpenAI, not Azure’s wrapper. The hyperscaler platforms may add latency, may add markup, may route through different infrastructure, but the model is the model. I was already wired directly to the providers I wanted to compare. Even if, behind the scenes, some of those providers are themselves running on the same cloud hardware the resellers would offer.

Each provider was tested with a 50-JSON sample from the actual corpus. Double-pass runs, no caching warmup to game the numbers. Tested between 10 and 11 AM on a weekday, because that’s when real workloads run, not at 3 AM when the servers are empty. The full-corpus numbers are extrapolations from those test runs.
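The extrapolation itself is plain linear scaling. A minimal sketch of the arithmetic; the sample cost and timing figures below are illustrative placeholders, not the measured per-provider values:

```python
def extrapolate(sample_cost_usd: float, sample_seconds: float,
                sample_size: int, corpus_size: int) -> tuple[float, float]:
    """Linearly scale a timed sample run to the full corpus.

    Assumes cost and wall clock grow linearly with document count,
    i.e. no throttling or volume discounts kick in at sustained load.
    Returns (total cost in USD, total wall clock in days).
    """
    scale = corpus_size / sample_size
    return sample_cost_usd * scale, sample_seconds * scale / 86400


# Hypothetical sample numbers for a 50-document test run.
cost, days = extrapolate(sample_cost_usd=1.02, sample_seconds=250,
                         sample_size=50, corpus_size=1_000_000)
# cost -> 20400.0 USD, days -> ~57.9 wall-clock days at this rate
```

The linearity assumption is the weak point, as noted later: a provider that throttles differently at sustained volume would bend these lines.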

The model on the B200 was Qwen-3.5-27B-NVFP4, a 27-billion parameter model quantized to NVIDIA’s FP4 format, capped at 2048 output tokens and tuned specifically for this task. The output quality was perfect for my goal. Not “good enough.” Perfect. Every contextual sentence was accurate, grounded in the document, and concise enough to embed without bloating the chunk envelope.

That’s the first lesson: you don’t need a frontier model for every inference task. GPT-4.1-mini was included as the lower control, the cheapest serious API option in the comparison. The high-shelf models (Claude Sonnet 4.6, GPT-5.4, Claude Opus 4.6) are there because that’s what people reach for when they haven’t done the math. When someone asks “which model should I use for my RAG pipeline?” the default answer is usually the most capable one available. Nobody stops to ask whether a 27B model quantized to FP4 could do the job just as well.

The DashScope entry runs Qwen3.5's 120b-a10b MoE variant, not the 27B I used locally. Even the same model family, served through an API at a larger parameter count, costs twice as much and takes 23 days instead of 11 hours. And that $140 is already a dumped price. Alibaba recently opened their Virginia region, the first DashScope availability inside the US, and they're pricing it aggressively to gain market share. The same model in the Singapore region costs four times what it costs in Virginia. So the cheapest API option in this table is running on a promotional price in a region that just launched. The non-promotional price would put it at $560, 8x the B200. Even when a hyperscaler is actively subsidizing your inference to win your business, the self-hosted GPU still comes in at half the price.

The 3090 Detour

I tried the workstation first. Three RTX 3090s, the same rig I use for local LLM inference and embedding experiments. Ran a timing estimate on a subset and extrapolated: 42 days.

I stopped the job within the hour. I’d already tasted B200 speed on the parsing phase of this pipeline and I knew what was available. Rented a single B200, loaded the NVFP4 model, and had my results before lunch the next day.

The Cache You Don’t Control

The cost gap is striking. The wall clock gap is the part that needs explaining.

A B200 running one workload has one thing API providers will never give you: exclusive access to the KV cache. When you run inference on your own GPU, the key-value cache that stores attention state between tokens belongs entirely to your job. No eviction. No contention. No other customer’s request pushing your cached context out of memory to make room for theirs.
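To make that concrete: the KV cache for a decoder-only transformer grows linearly with context length and batch size, so it is a real slice of GPU memory that either belongs to your job or gets fought over. A back-of-envelope calculator; the layer and head counts below are placeholders for a roughly 27B-class model, not the actual Qwen architecture:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: float) -> float:
    """Size of the key-value cache: two tensors (K and V) per layer,
    each shaped [batch, n_kv_heads, seq_len, head_dim]."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Placeholder architecture and an FP8 cache (1 byte per element).
gib = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128,
                     seq_len=8192, batch=32, bytes_per_elem=1) / 2**30
# -> 24.0 GiB for this hypothetical configuration
```

Tens of GiB that, on a rented B200, stay resident for the entire job. On a shared endpoint, that same memory is the thing every other tenant's context is competing for.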

Picture a kitchen with one cook and one stove. Every pot is yours. Every burner is yours. The walk-in cooler has your ingredients and nothing else. Now put that same cook in a commercial kitchen where fifty cooks share the same six stoves, the same cooler, the same prep stations. Your dish takes longer not because the recipe changed or the cook got worse, but because every time they reach for a burner, someone else is already on it. That’s what multi-tenant inference looks like from the inside.

I measured KV cache behavior under controlled concurrency in the Local LLM Bench series, where scaling from one to eight concurrent agents on local hardware showed contention stabilizing around 27% on NVLink-connected GPUs. That was with full visibility into what was happening, on my own hardware, with my own scheduling. On a shared API endpoint serving thousands of concurrent users, I have no visibility at all. I don’t know exactly how each provider manages multi-tenant KV cache eviction. Nobody outside those companies does. What I can measure is the wall clock, and the wall clock says my workload takes somewhere between 23 days and 144 days when it shares infrastructure with everyone else.

The tests ran at 10 to 11 AM on a weekday. Peak business hours. Would the numbers be better at 3 AM? Probably. But that’s exactly the point. This is a real-life test, not a synthetic benchmark optimized for a press release. If your production pipeline only performs well during off-peak hours, you don’t have a reliable pipeline. You have a workaround.

The Clock Kills the Pipeline

The cost difference gets the attention. The time difference kills the project.

144 days for Claude Opus to generate contextual sentences for my corpus. That’s not a pipeline step. That’s a geological process. By the time those embeddings land in Qdrant, the underlying documents have changed, the research questions have evolved, and the budget has been reallocated to something that actually delivered results this quarter.

Even GPT-4.1-mini, the lower control in this comparison, needs 58 days, which means the contextual embedding step alone takes two months. Two months for one layer of a multi-layer pipeline. The parsing already took three days on the A100 cluster. The graph construction is still running in Neo4j. The evaluation framework hasn't been built yet. Stack those sequential steps with a two-month embedding phase in the middle and you have a pipeline that can't iterate. A RAG system that can't iterate can't improve. It just sits there, frozen in its first approximation.

Not a Lab Problem

This isn’t about having expensive hardware in a home lab. The B200 was rented. Not purchased, not amortized over years of use, not depreciated across some multi-year capacity plan. Rented for 24 hours. Seventy dollars. Less than a team lunch in most tech companies.

The comparison that matters isn’t “self-hosted versus API.” It’s “dedicated compute versus shared compute for batch workloads.” Any organization processing more than a few thousand documents through an LLM should be running this math. Enterprise RAG pipelines, compliance document classification, regulatory filing analysis, internal knowledge base construction: these are all batch workloads that hit a million documents faster than most teams expect. And at a million documents, the difference between $70 and $20,419 isn’t a cost optimization. It’s the difference between a project that ships and a project that gets cancelled when the first invoice arrives. I’ve seen teams burn through five-figure API budgets in a single sprint without realizing the same workload could run on a rented GPU for the cost of a parking ticket, because nobody on the team had ever priced dedicated compute against their API consumption.

The breakeven isn’t where most people assume. You don’t need a million documents to justify dedicated GPU time. You need a few thousand. After that, every API call is a voluntary surcharge on a workload that doesn’t require shared infrastructure, doesn’t require frontier model capabilities, and doesn’t benefit from multi-tenant scheduling. The API is paying for flexibility you’re not using.
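The breakeven claim is checkable with the article's own numbers. A sketch, assuming hourly GPU rental with a one-hour minimum block and per-document API pricing derived from the full-corpus costs in the table:

```python
def breakeven_docs(gpu_usd_per_hour: float, api_usd_per_doc: float,
                   min_rental_hours: float = 1.0) -> float:
    """Smallest batch size at which a rented GPU beats per-document API
    pricing, assuming the batch fits inside the minimum rental block
    (below that size you pay for the block either way)."""
    return gpu_usd_per_hour * min_rental_hours / api_usd_per_doc


# Rates derived from the article's figures: $70 for 11 hours on the B200,
# and each provider's full-corpus cost spread over a million documents.
b200_rate = 70 / 11                                            # ~$6.36/hour
docs_vs_mini = breakeven_docs(b200_rate, 508 / 1_000_000)      # ~12,500 docs
docs_vs_opus = breakeven_docs(b200_rate, 20_419 / 1_000_000)   # ~312 docs
```

Against the frontier models the GPU wins in the first few hundred documents; even against the budget tier it wins well before twenty thousand, which is where the "a few thousand" breakeven comes from.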

What I Don’t Know

This was one task profile: short contextual summaries from structured JSON, 2048 token cap, batch sequential processing. A different task (long-form generation, multi-turn conversation, retrieval-heavy chains with tool use) might produce different ratios. The gap could narrow. It could also widen.

API pricing changes. Today’s 292x might be 50x next quarter if providers introduce dedicated batch tiers or volume discounting that reflects the actual cost of the compute. Some providers already offer batch APIs with better throughput guarantees at higher latency tolerance. I haven’t tested those yet.

The 50-JSON sample is real data from my actual corpus, but it’s still a sample. The extrapolation assumes linear scaling across the full million documents, which may not hold for all providers if they throttle differently at sustained volume.

What I do know: one GPU, eleven hours, seventy dollars. The pipeline moved. The data is embedded. The research continues.

292x is not a rounding error. It’s a design decision.