Frontier AI Is a System, Not a Model

Yesterday a code editor sold for sixty billion dollars.

SpaceX exercised an option it had struck back in April. The terms were unusually clean: buy Anysphere, the company behind the Cursor editor, outright for $60 billion in stock, or walk away and pay $10 billion just to partner. It bought. CBS reported the deal the same week SpaceX went public. Cursor leans heavily on Anthropic’s models today, and the new owner has already said it will drop its own models and Grok’s coding agent into that seat.

So read the trade carefully, because the price tag is telling you something. Sixty billion dollars did not buy a model. It bought the harness around the model. The editor, the context plumbing, the agent loop, the place where a developer actually does the work. The read that went around the next morning was that Anthropic would be fine regardless, that the durable asset was never the raw weights but the whole machine wrapped around them. I think that read is correct.

A few hours after the headline I was on the phone with a friend. Principal engineer, does real R&D, the kind who reads the paper before he repeats the benchmark. We were doing what half the industry is doing this month: arguing about how close GLM-5.2 has gotten to Opus 4.8. The public coding benchmarks put it within a handful of points. And I am the last person to wave that away. I have a Nemotron-3-Super-120B on the bench right now measuring multi-token prediction, with early numbers suggesting the KV-cache tricks I lean on do not map cleanly onto its Mamba-2 layers, though that is still raw data. I love open weights. I run them for a living.

But here is what the benchmark column hides. You cannot download Opus 4.8. You cannot download ChatGPT 5.5. Nobody can. There is no file.

The thing you can’t download

What sits behind that endpoint is not a model. It is a kitchen.

A request comes in and gets pre-processed, cached, routed through deterministic gates, distilled, handed off between what is very likely a dozen specialized models that each do one job, hydrated back into something coherent, and only then plated as a single output token, all of it in the time it takes you to read this sentence. The weights are the chef’s talent. Real, expensive, hard to replicate. But the product is the whole kitchen around the chef: the mise en place, the prep line, the expediter calling tickets, the muscle memory of a hundred services a night. You can hire the chef. You cannot download the restaurant.

That is why a code editor is worth sixty billion dollars and an open-weight model that scores within a few points of the frontier is worth, on its own, a GitHub release. The score measures the chef. The valuation measures the kitchen.

Your kitchen, not theirs

The same physics applies to you.

Your enterprise AI is also a system. The model is the one piece of it you will always rent, from whoever is at the frontier this quarter, on terms they set and change. Fine. Rent it. But the rest of the kitchen is yours to build, and the one part that quietly decides how often you pay the expensive layer is the retrieval layer. Most teams treat retrieval as a checkbox under the model. It is the opposite. It is the cost-control core of the whole system, and for most of the last decade it was too slow and too expensive to treat as infrastructure you owned with any confidence.

To see why that changed, back up about ten years.

Ten years ago, CUDA was for graphics and a handful of physicists. Then the scientific world figured out that a GPU was a general-purpose math machine, and the floodgates opened. NVIDIA RAPIDS is the heir to that moment: a family of libraries (cuDF for dataframes, cuML for machine learning, cuGraph for graphs) that took the GPU acceleration physicists had been hoarding and pointed it at ordinary data work. The lesson underneath RAPIDS is the one I keep relearning. The bottleneck in applied AI was never the model. It is the data path. When I pointed a Blackwell card at a million corporate PDFs, the wall was never inference. It was everything upstream of the model: parsing, cleaning, embedding, indexing, and the brutal arithmetic of doing all of that again every time something changed.

The newest member of that family aims straight at the retrieval layer. It is called cuVS, and it is the reason this article exists.

What cuVS is

cuVS is GPU-accelerated vector search and clustering. It was spun out of the RAFT library and now lives in RAPIDS as its own thing, and NVIDIA ships it as open source. Its flagship is CAGRA, a graph-based nearest-neighbor index built natively for the GPU, alongside GPU implementations of the IVF family and plain brute force. It builds the index on the GPU and searches it on the GPU, and if you want, it will hand the finished index back to a CPU to serve.

Here is the problem it solves. Vector search is how retrieval works: every document becomes a point in high-dimensional space, and answering a query means finding the nearest points. In Postgres, the default tool for that is pgvector with an HNSW index, and it is genuinely good. Transactional, simple, the thing most teams already run. It is good right up until the corpus crosses from thousands of vectors into tens of millions, and then two walls appear. The first is the index-build wall: HNSW gets painfully slow to construct at scale. The second is the query-throughput wall: queries per second collapse under load. cuVS exists to knock both down by moving the work onto silicon that was built for exactly this shape of math.
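To make "finding the nearest points" concrete, here is a minimal sketch of exact (brute-force) search in plain Python. It is not how pgvector or cuVS work internally; it is the baseline that approximate indexes such as HNSW and CAGRA trade a little recall to avoid. The document names and three-dimensional vectors are toy values invented for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, corpus, k=2):
    """Exact search: score every stored vector, keep the k best."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in corpus.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Toy 3-dimensional "embeddings"; real ones have hundreds of dimensions.
corpus = {
    "invoice-policy": [0.9, 0.1, 0.0],
    "travel-policy": [0.8, 0.3, 0.1],
    "lunch-menu": [0.0, 0.2, 0.9],
}
top = nearest([1.0, 0.2, 0.0], corpus, k=2)
print(top)
```

Every query here touches every vector, so the cost grows linearly with the corpus. That linear scan is what an index exists to avoid, and building that index is where the first wall shows up.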

The financial benefit

Now the part finance actually cares about, which is not speed for its own sake.

The vendor numbers are loud, and I will quote them without worshipping them: NVIDIA’s own page claims 21x faster indexing, 29x higher throughput, and 12.5x lower cost for the same build. Treat those as marketing until someone independent reproduces them. Someone did. Meta folded cuVS into FAISS and published the run: on an H100 against a high-end Xeon, at matched 95 percent recall, GPU index builds came in 12.3x faster and search latency dropped as much as 8x. And the number that actually reframes the problem comes from a Zilliz benchmark on Milvus: 635 million vectors indexed in about 56 minutes on eight DGX H100s, against roughly 6.22 days on CPU. Read that again as a project manager, not an engineer. Six days becomes one lunch. That is not an optimization. That is the difference between a thing being on the roadmap and a thing being a Tuesday.

There is a quieter financial lever in here too, and it is dimensions. A lot of teams reach for the biggest embedding model and inherit its 4096-dimensional vectors by default. For a frontier lab that has to answer how fast light moves through different media, fine, buy the width. For a mission-focused corpus that answers operational questions about your own business, 4096 dimensions is vanity. In my own pipelines 384 or 768 dimensions cluster cleanly, and every dimension you do not store is index you do not build, VRAM you do not rent, and latency you do not pay on every single query for the life of the system. Match the vector width to the question you are actually asking.
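The width argument is plain arithmetic, sketched below. It assumes vectors are stored as 32-bit floats (4 bytes per dimension) and counts only the raw vectors, not the index built on top of them. Both are assumptions of this sketch, and the function name is made up for illustration.

```python
BYTES_PER_FLOAT32 = 4  # assumption: embeddings stored as 32-bit floats
GIB = 1024 ** 3

def raw_vector_gib(num_vectors, dims, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw storage for the vectors alone, before any index overhead."""
    return num_vectors * dims * bytes_per_value / GIB

for dims in (384, 768, 4096):
    print(f"{dims:>4} dims: {raw_vector_gib(14_000_000, dims):6.1f} GiB")
```

Under those assumptions, 14 million vectors at 384 dimensions come to about 20 GiB, while the same corpus at 4096 dimensions is roughly 214 GiB, more than ten times the footprint for the same number of documents.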

The point of the financial argument is not that the GPU is faster. Everyone knows the GPU is faster. The point is that the retrieval layer is the cheapest seat in the house, and it is the seat that decides how often the expensive seats get used.

The benchmark, and where I stop trusting slides

I do not buy a kitchen off a slide. Nobody who has shipped hardware does. In manufacturing there is a discipline called New Product Introduction: before you commit a production line, you build a unit, you abuse it, you measure it cold, and only then do you sign. I have too many scars from PDFs and PowerPoints that did not survive contact with a loading dock. So before I tell anyone the GPU route is real, I run it on my own iron.

One box. A single RTX PRO 6000 Blackwell, 96 GB. Too expensive to justify as a personal toy, but very attainable for a hybrid corporate build, and with no hedging, a genuinely good on-prem AI gateway. Around ten thousand dollars on the market today, not a row of DGX pods. The matchup: PostgreSQL 18 with pgvector 0.8.2 and its HNSW index, against cuVS 26.06 running CAGRA. Synthetic 384-dimensional embeddings, clustered to approximate the shape of real ones. Three corpus sizes. Both engines measured in-process, no network in the way, so the numbers are the engines and nothing else.

Corpus	pgvector build	pgvector QPS	cuVS build	cuVS QPS
100K	19 s	712	1.3 s	308,888
1M	228 s	453	11.6 s	87,185
14M (20 GiB)	did not finish in 10h+	~0.3 (seq scan)	35 s	20,000 to 99,000

At 14 million vectors, cuVS held 99.9 percent recall at roughly 20,000 queries per second, which is 49 microseconds a query, and if you wanted perfect recall it would do exact brute-force search in 1.42 milliseconds. pgvector, on this box, never produced a usable 14M index at all, so every query fell back to a 3-second sequential scan. Twenty gigabytes of vectors. cuVS built the index in 35 seconds.
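For readers who have not measured recall before, here is a minimal sketch of the two conversions that paragraph leans on. It assumes recall is computed the usual way, as the share of the true top-k neighbors (from an exact search) that the approximate index also returned. The function names and the tiny ID lists are illustrative, not the harness used for the table.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k neighbors the approximate search returned."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact)

def mean_recall(approx_results, exact_results):
    """Average recall over a batch of queries."""
    pairs = list(zip(approx_results, exact_results))
    return sum(recall_at_k(a, e) for a, e in pairs) / len(pairs)

def microseconds_per_query(qps):
    """Amortized time per query at a given throughput."""
    return 1_000_000 / qps

# Two toy queries with k=4: the second one misses a single true neighbor.
exact = [[1, 2, 3, 4], [5, 6, 7, 8]]
approx = [[1, 2, 3, 4], [5, 6, 7, 9]]
print(mean_recall(approx, exact))
print(microseconds_per_query(20_000))
```

Exactly 20,000 queries per second is 50 microseconds each; 49 microseconds corresponds to about 20,400 per second, which is what "roughly 20,000" is rounding. The figure is amortized over a stream of queries, so it should not be read as the round-trip latency of one isolated request.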

Now the honesty, because the magnitude needs an asterisk and I would rather put it there myself. pgvector’s build here ran effectively single-threaded, sitting around 10 percent CPU, and I left its parallel-build settings off on purpose. That is not a handicap I invented to flatter the GPU. It is the configuration the thing ships in, and the configuration most teams actually run. The reflex in a modern shop is not to sit with the query planner and tune the local engine. It is to shrug and add another container, because horizontal elasticity is easier to reason about and easier to expense than a week spent learning what a parallel build worker does. A performance engineer who actually loved those knobs could cut this indexing time tenfold, no question, and a tuned pgvector would have finished, shrinking the build gap from the absurd figure in that table to something more like ten or a hundred times rather than a thousand. But I am measuring what gets deployed in the wild, not what an expert could coax out of the same binary on a good afternoon. The data is synthetic. It is one box, one run, no error bars. So I am not going to carve a thousand-times build number into a chart and sell it, because it would not survive a tuned rematch, and a number that does not survive contact is worse than no number. What does survive is the shape: build times that go from a maintenance window to a coffee break, and query throughput two orders of magnitude apart that barely moves when you tune the loser. That part is not close.
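The ratios behind that paragraph can be recomputed straight from the table. The sketch below uses only the figures printed there; for the 14M row it treats the 10 hours as a lower bound, since the pgvector build never finished.

```python
def ratio(slow, fast):
    """How many times larger `slow` is than `fast`."""
    return slow / fast

# (corpus, pgvector build s, pgvector QPS, cuVS build s, cuVS QPS) from the table.
rows = [
    ("100K", 19.0, 712, 1.3, 308_888),
    ("1M", 228.0, 453, 11.6, 87_185),
]
for corpus, pg_build, pg_qps, gpu_build, gpu_qps in rows:
    build_x = ratio(pg_build, gpu_build)
    qps_x = ratio(gpu_qps, pg_qps)
    print(f"{corpus}: build {build_x:.1f}x faster, throughput {qps_x:.0f}x higher")

# 14M: the CPU build was abandoned after 10+ hours, so this is only a floor.
lower_bound_14m = ratio(10 * 3600, 35)
print(f"14M: build at least {lower_bound_14m:.0f}x faster")
```

On the two rows where both engines finished, the build gap is about 15x and 20x and the throughput gap is in the hundreds. The thousand-times figure exists only on the row where one side did not finish, which is why it carries the asterisk.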

And the most useful thing the benchmark gave me is the crossover, which is the opposite of a sales pitch. Below roughly 100,000 vectors in-process, or a few hundred thousand once a real network hop sits between your app and the GPU, the CPU wins. I measured that too: at a few hundred vectors, a remote cuVS service answered in 3.72 milliseconds over the LAN while local pgvector answered in 0.70. The GPU has a fixed cost, VRAM residency and a warm process, and below the crossover that cost never amortizes. A good NVIDIA answer includes the regime where you should not buy the NVIDIA. Match the hardware to the workload you actually have, not the one in the keynote.
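One way to encode that crossover is as a routing rule, sketched below. The 100,000 threshold is the in-process figure from the paragraph above. The 300,000 remote threshold is an assumed stand-in for "a few hundred thousand", and the class and function names are hypothetical. Treat the numbers as placeholders to replace with your own measurements.

```python
from dataclasses import dataclass

# Illustrative thresholds read off the prose, not measured constants.
IN_PROCESS_CROSSOVER = 100_000
REMOTE_CROSSOVER = 300_000  # assumed value for "a few hundred thousand"

@dataclass(frozen=True)
class Workload:
    num_vectors: int
    gpu_is_remote: bool  # True when a network hop sits between app and GPU

def choose_backend(workload: Workload) -> str:
    """Send small corpora to the CPU index, large ones to the GPU index."""
    if workload.gpu_is_remote:
        crossover = REMOTE_CROSSOVER
    else:
        crossover = IN_PROCESS_CROSSOVER
    return "gpu" if workload.num_vectors >= crossover else "cpu"

print(choose_backend(Workload(num_vectors=500, gpu_is_remote=True)))
print(choose_backend(Workload(num_vectors=14_000_000, gpu_is_remote=False)))
```

The first call lands on the CPU and the second on the GPU. A corpus of 200,000 vectors flips depending on whether the network hop is there, which is the whole point of measuring the crossover on your own topology.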

What you actually win

Strip the multipliers away and three things change underneath you.

The first is feasibility. “Index everything in our email and our SharePoint” stops being the slide that disappoints and becomes a job that finishes. I watched Microsoft 365 Copilot make that promise to a client two years ago and watched the retrieval quality on real corporate data fall well short of it. The promise was not a lie. The data path just was not cheap enough yet to make it true. At 35 seconds for 20 gigabytes on one card, the economics that made it a fantasy are gone.

The second is that re-indexing becomes a non-event. Change your embedder, change your chunking, change your schema, and on the CPU path you have just bought yourself an overnight maintenance window and a reason to avoid improving the system. On the GPU path you rebuild over lunch. That changes how often a team dares to make the thing better, which over a year is worth more than the raw latency.

The third is an old trick that the GPU just made new again. Keep the index resident. Gray-haired operators know this one: you make a 100 GB database thrive on a humble disk array by keeping the hot working set warm in memory so the spindles never become the bottleneck. Same law here. My own numbers show a cold cuVS process paying a 109-millisecond one-time tax to load its CUDA kernels, then answering in under 2 milliseconds once warm. Pre-warm the retrieval layer and you never pay the cold tax in front of a user. None of this is glamorous. That is usually where the real money is.
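Pre-warming can be as simple as running one throwaway query at startup. The sketch below shows the pattern with a fake backend standing in for the real search engine. `RetrievalService`, `FakeBackend`, and the probe query are all hypothetical names, and nothing here calls cuVS itself.

```python
import time

class RetrievalService:
    """Wraps a search backend and pays its one-time startup cost up front."""

    def __init__(self, backend, probe_query):
        self._backend = backend
        self._probe_query = probe_query
        self.warm = False

    def warm_up(self):
        """Run one throwaway query so cold-start work happens before traffic."""
        start = time.perf_counter()
        self._backend(self._probe_query)
        self.warm = True
        return time.perf_counter() - start

    def search(self, query):
        if not self.warm:
            self.warm_up()  # fallback: the first caller pays the cold tax
        return self._backend(query)

class FakeBackend:
    """Stand-in engine that records how often it had to cold-start."""

    def __init__(self):
        self.loaded = False
        self.cold_loads = 0

    def __call__(self, query):
        if not self.loaded:
            self.cold_loads += 1  # a real engine would load its kernels here
            self.loaded = True
        return [f"hit-for-{query}"]

backend = FakeBackend()
service = RetrievalService(backend, probe_query="warm-up")
service.warm_up()                 # call this at deploy time, not on first request
result = service.search("q3 revenue")
print(result, backend.cold_loads)
```

The design choice is only about who pays: the deploy script or the first user. Calling `warm_up()` during startup moves the one-time cost out of the request path.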

The cheapest token is the one you never spend

Here is where I want you to stop counting tokens and start counting what the tokens buy.

Take an ordinary corporate ask: “analyze the last four reports.” The naive agent does the obvious thing. It pulls thirty pages into context, runs OCR, extracts the text, makes a second pass for cohesion, performs the gestures of understanding at a confidence it cannot really justify, thinks for eight minutes, and returns a final report with a wide-open surface for error and hallucination. Every run. From scratch. As if it had never seen those reports before, because it hadn’t.

Now the disciplined system. The reports were indexed the day they landed. Their meaning is already computed and sitting in the retrieval layer. Before you spend a single frontier token, you ask the cheap question first, in microseconds: do we already know this? Most of the time the answer is yes, and you assemble the response from your own indexed truth instead of re-deriving it. When you do call the expensive endpoint, you call it with tight, grounded context instead of thirty raw pages. The result is roughly a fiftieth of the tokens, a fraction of the latency, a smaller blast radius for error, and usually a better answer, because it is anchored in what you actually have rather than what the model can reconstruct on the fly.
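The "cheap question first" flow can be sketched as a small gate in front of the expensive endpoint. Everything below is illustrative: `lookup_answer`, `retrieve_chunks`, and `call_model` are hypothetical callables you would supply, the 0.9 threshold and top-3 cutoff are assumed defaults, and the stub data exists only so the sketch runs.

```python
def answer(question, lookup_answer, retrieve_chunks, call_model,
           hit_threshold=0.9, top_k=3):
    """Ask the index first; escalate to the model only with tight context."""
    score, known = lookup_answer(question)
    if known is not None and score >= hit_threshold:
        return {"source": "index", "text": known, "chunks_sent": 0}
    chunks = retrieve_chunks(question, top_k)
    prompt = "\n\n".join(list(chunks) + [f"Question: {question}"])
    return {"source": "model", "text": call_model(prompt), "chunks_sent": len(chunks)}

# Toy stand-ins so the sketch runs; none of this is a real API.
KNOWN = {"what changed in the q3 report?": "Summary computed at ingest time."}
model_calls = []

def lookup_answer(question):
    known = KNOWN.get(question.lower())
    return (1.0, known) if known else (0.0, None)

def retrieve_chunks(question, top_k):
    return ["chunk A", "chunk B", "chunk C", "chunk D"][:top_k]

def call_model(prompt):
    model_calls.append(prompt)  # count how often the expensive layer is used
    return "model answer"

hit = answer("What changed in the Q3 report?", lookup_answer, retrieve_chunks, call_model)
miss = answer("Compare the last four reports.", lookup_answer, retrieve_chunks, call_model)
print(hit["source"], miss["source"], len(model_calls))
```

Two questions go in and the expensive layer is called once, with three short chunks instead of the raw documents. The real savings depend on your hit rate and chunk sizes, which this sketch does not model.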

This is the same lesson as a distilled retrieval corpus beating a three-billion-parameter vision model, and the same lesson as the arithmetic of self-hosting against the API meter. Spend the expensive layer where it earns its keep, and build the cheap layer so well that the expensive one becomes the exception. The retrieval layer is how you write your own knowledge down in a form the machine can check faster than it can ask. That is not a cost center. That is the moat from the first page of this article, brought home to your own building.

SpaceX paid sixty billion dollars for a kitchen. You do not have to. But you do have to build one, because the model is rented and always will be, and the retrieval layer is the part that is actually yours. Build the part you own.
