Three months ago, I plugged a Blackwell GPU into my lab bench and pointed it at a million PDFs.
Corporate documents: contracts, engineering reports, compliance filings, vendor proposals, maintenance logs, insurance certificates. The kind of archive that accumulates over two decades of running critical infrastructure. A million files. Not sampled, not curated, not cleaned. Raw.
The plan was straightforward: parse the documents, chunk them, embed them into a vector store, wire up a retrieval layer, and start asking questions no keyword search could answer. Retrieval-Augmented Generation. The acronym that launched a thousand vendor decks.
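On paper, the whole plan is four function calls in a row. A minimal sketch of that naive assembly line, with stubs standing in for the real parser and embedding model (every name here is illustrative, not the actual stack, and the fixed-size chunker is exactly the shortcut that fails later):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    text: str
    vector: list  # filled in by the embed stage

def parse(path: str) -> str:
    """Stub for a real parser (e.g. Docling). Returns plain text."""
    return open(path, encoding="utf-8", errors="ignore").read()

def chunk(doc_id: str, text: str, size: int = 500) -> list:
    """Naive fixed-size splitting, no awareness of document structure."""
    return [Chunk(doc_id, text[i:i + size], []) for i in range(0, len(text), size)]

def embed(chunks: list) -> list:
    """Stub embedder: a real model call goes here."""
    for c in chunks:
        c.vector = [float(len(c.text))]  # placeholder vector
    return chunks

def ingest(doc_id: str, text: str) -> list:
    return embed(chunk(doc_id, text))
```

Parse, chunk, embed, serve, in fifteen lines. Which is precisely why the tutorials make it look easy.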
That plan lasted about four days.
The Assembly Line That Wasn’t
Every RAG tutorial I’ve read (and I’ve read more than I’d like to admit) presents the pipeline as a production line. Documents go in one end, vectors come out the other. Parse, chunk, embed, serve. The implication is that the hard part is choosing the right model and tuning your chunk size.
It’s not.
Two years ago, Microsoft shipped Microsoft 365 Copilot. The pitch: connect it to SharePoint, OneDrive, Exchange, Teams, and it indexes everything. Your entire corporate knowledge base, searchable by natural language. I evaluated it for a client at the time. The retrieval quality on real corporate data was nowhere close to the promise. Pure marketing. But I knew the underlying problem was genuinely hard, and the whole point of this sprint is to find out exactly how hard. To build the pipeline myself, layer by layer, and see where the walls are. Because the question nobody seems willing to answer is this: if RAG is the architecture that unlocks enterprise knowledge, why hasn't anyone solved it in a clean, repeatable way? The deeper I dig, the more obvious the answer becomes.
The hard part is that a million corporate PDFs have nothing in common except the file extension. Scanned images from 2004 sitting next to machine-generated compliance reports from last quarter. Tables that span pages. Headers and footers that repeat on every page and poison your chunks. Boilerplate legal language that drowns the three sentences of actual content. Documents in Portuguese, English, and Spanish. Sometimes all three in the same file.
I started with Docling on an 8×A100 cluster. Seventy-two hours of continuous parsing. Just to get clean text out of the PDFs. Not embeddings. Not vectors. Not retrieval. Just text. And “clean” is generous. It was the cleanest I could get without manual review of a million files, which is to say it was full of artifacts, misrecognized characters, table structures that collapsed into word soup, and headers that got concatenated with the first sentence of every section.
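One artifact class at least yields to a cheap heuristic: a line that repeats across most pages of a document is almost certainly a header or footer, not content. A rough sketch of that filter (the 60% threshold is my assumption, not a tuned value):

```python
from collections import Counter

def strip_repeated_lines(pages, threshold=0.6):
    """Drop lines that repeat across most pages (headers, footers).

    pages: list of per-page text strings. Returns cleaned page texts.
    """
    counts = Counter()
    for page in pages:
        # Count each distinct line at most once per page
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    cutoff = threshold * len(pages)
    boilerplate = {line for line, n in counts.items() if n >= cutoff}
    cleaned = []
    for page in pages:
        kept = [l for l in page.splitlines() if l.strip() not in boilerplate]
        cleaned.append("\n".join(kept))
    return cleaned
```

It catches the obvious cases. It does nothing for headers that carry page numbers or dates and therefore never repeat verbatim; those need fuzzier matching.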
A significant chunk of the corpus needed OCR. Scanned pages, photographed documents, faxes that were printed and re-scanned because someone in 2009 thought that was a reasonable workflow. The question becomes: which OCR backend, and where do you run it? I tested CPU-based pipelines, GPU-accelerated pipelines, and hybrid approaches where Docling handled the layout detection on GPU and offloaded the OCR to CPU to free VRAM for the next batch. Every combination had a different throughput profile, a different failure mode, and a different cost per page. You’re not just picking an OCR engine. You’re scheduling compute across heterogeneous hardware and hoping the bottleneck stays where you planned it.
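The routing decision itself is simple to state: pages with a usable embedded text layer skip OCR entirely, and the rest go to whichever backend has capacity. A toy dispatcher over per-page metadata (field names and thresholds are hypothetical, not from any real scheduler):

```python
def route_page(page_meta, gpu_queue_depth, gpu_limit=1000):
    """Decide how a page gets its text.

    page_meta: dict with 'has_text_layer' (bool) and 'char_count' (int).
    Returns one of 'skip_ocr', 'gpu_ocr', 'cpu_ocr'.
    """
    # A text layer with enough characters means the PDF was born digital
    if page_meta["has_text_layer"] and page_meta["char_count"] > 50:
        return "skip_ocr"
    # Scanned page: prefer GPU OCR until the queue backs up, then spill to CPU
    if gpu_queue_depth < gpu_limit:
        return "gpu_ocr"
    return "cpu_ocr"
```

Three branches. The hard part is everything around them: the queue depths move, the per-page cost differs by backend, and the spill threshold that was right at hour one is wrong at hour ten.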
While I was in the middle of this, NVIDIA was pushing cuVS and fresh releases of cuDF: GPU-accelerated vector search and DataFrames, purpose-built for the kind of data wrangling I was doing by hand. Come on, NVIDIA. Are you reading my lab notes? But that’s the point. This isn’t a niche problem that only shows up in some hobbyist’s garage. When a GPU vendor ships dedicated libraries for data pipeline acceleration, they’re telling you something about where the industry bottleneck actually lives. It’s not in the model. It’s not in the inference. It’s in the data preparation, and it’s painful enough that silicon companies are building dedicated tooling to address it.
When I was twelve, I kept a plastic bin of electronic components: resistors, capacitors, old circuit boards I’d salvaged from broken radios. I could find any part in seconds because I’d touched every one. I knew the bin. Scale that to a warehouse with a million parts from a hundred different suppliers, packed in boxes with labels in six languages, some with no labels at all. Suddenly “I know where everything is” isn’t a system. It’s a fantasy. That’s what a million PDFs feel like the moment you try to make them queryable.
Embedding Is Not Understanding
Once I had text, I started experimenting with embeddings. Qdrant as the vector store. I tested 256-dimensional vectors, 512, 768, with BM25 as a baseline alongside every configuration, because if your semantic search can’t beat keyword matching, you don’t have a retrieval system. You have a demo.
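Keeping BM25 in the loop costs almost nothing; the scoring function itself fits on one screen. A self-contained sketch of classic BM25 with the usual k1=1.5, b=0.75 defaults (real deployments would use a library or the keyword index built into the vector store, but the math is this small):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score tokenized docs against a tokenized query with classic BM25.

    query: list of terms. docs: list of token lists. Returns one score per doc.
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency per query term
    df = {t: sum(1 for d in docs if t in d) for t in set(query)}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if df.get(t, 0) == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

Thirty lines of keyword matching. That’s the bar every embedding configuration had to clear before it earned a place in the pipeline.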
The dimension testing wasn’t academic. I started playing with 5,000 PDFs. Then scaled to 100,000. Now I’m at a million. At 5,000 documents, switching from 256 to 768 dimensions was a configuration change and a coffee break. At a million, “let me reorganize the vectors this way” is ten hours of compute and $200. Every experiment has a price tag, and the price tag grows linearly with the corpus while the insight grows logarithmically. You learn fast to be deliberate about what you test.
I’m not pretending I have experience with billion-document indexes. I suspect only a handful of people on Earth have built truly scalable systems that can handle that kind of transformational data pipeline end to end. But I can tell you this with certainty: at just one million documents, the sweat is real. The operational decisions are real. The cost is real. If a million is this hard, I have a new respect for whatever is running behind the scenes at the places that index the entire web.
The results were humbling. Dimension count mattered less than I expected. What mattered, what mattered enormously, was the quality of what went into the embedder. Chunks created by splitting on token count, with no awareness of document structure, produced vectors that clustered in meaningless ways. A chunk containing half a contract clause and half a page footer would embed somewhere in vector space, retrieve on the right query, and deliver garbage to the language model.
I spent two weeks tuning chunk boundaries, overlap ratios, token limits. Every adjustment moved the needle on some queries and broke others. The problem wasn’t the embedder. The problem was that I was asking the embedder to create structure from text that had none. I was hoping the model would infer context that I hadn’t provided.
Hope. The word I’ve spent years telling clients to remove from their operational vocabulary. And here I was, building a retrieval system on it.
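One direction that at least attacks the boundary problem, which I’ll present as a sketch rather than a solution, is to let document structure pick the cut points: split on paragraphs first, pack paragraphs greedily up to the budget, and only hard-cut when a single paragraph is itself over budget. In simplified form (character counts standing in for tokens):

```python
def structural_chunks(text, max_chars=800):
    """Greedy paragraph-aware chunking: never cut inside a paragraph
    unless a single paragraph is itself over budget."""
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for p in paras:
        if len(p) > max_chars:
            # Oversized paragraph: flush what we have, then hard-split it
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(p[i:i + max_chars] for i in range(0, len(p), max_chars))
        elif len(current) + len(p) + 2 <= max_chars:
            current = (current + "\n\n" + p) if current else p
        else:
            chunks.append(current)
            current = p
    if current:
        chunks.append(current)
    return chunks
```

This keeps clauses whole and footers out of the middle of sentences. It still can’t conjure context the text never contained, which is what the rebuild below is actually about.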
The Infrastructure Migrates
I had originally used SQLite3 for the metadata index. Document IDs, file paths, parsed timestamps, chunk offsets. SQLite is a beautiful tool for what it is, but a million documents with multiple chunks each, cross-referenced against a vector store and a growing set of classification labels, is not what it is.
I moved to PostgreSQL. The migration itself was uneventful. The schema was simple enough. But PostgreSQL opened a door I’d been avoiding: graph relationships between documents. Contracts reference engineering reports. Compliance filings cite vendor proposals. Maintenance logs connect to insurance certificates through asset IDs that don’t appear in any structured field. They’re buried in paragraph text, written differently by every author who ever touched the system.
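Before any graph can exist, those buried asset IDs have to be pulled out of free text, which is plain entity extraction. A sketch with a single hypothetical ID convention (real corpora need one pattern per convention that authors actually used, and fuzzy matching for the authors who invented their own):

```python
import re
from itertools import combinations

# Hypothetical asset-ID convention, e.g. "AST-00421". Invented for
# illustration; a real corpus has several competing conventions.
ASSET_ID = re.compile(r"\bAST-\d{3,6}\b")

def asset_edges(docs):
    """docs: {doc_id: full_text}. Returns document pairs that share
    at least one asset ID -- the raw edge candidates for the graph layer."""
    mentions = {doc_id: set(ASSET_ID.findall(text)) for doc_id, text in docs.items()}
    edges = []
    for a, b in combinations(sorted(mentions), 2):
        shared = mentions[a] & mentions[b]
        if shared:
            edges.append((a, b, sorted(shared)))
    return edges
```

The pairwise comparison is quadratic in document count, so at a million documents you invert it: index documents by asset ID first, then emit edges per ID. The sketch shows the relationship being extracted, not the production join.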
I tried the AGE extension first. Apache AGE gives you Cypher queries inside PostgreSQL, which is appealing if you want to keep everything in one database. I ran tests. For the relationship density I was seeing, where a single maintenance report might connect to fifteen other documents across three entity types, AGE worked but didn’t feel right. The query patterns I needed were graph-native, not relational queries with graph bolted on. I went to Neo4j. Separate system, separate operational surface, but purpose-built for the thing I was actually doing.
Sometimes the right answer is another database, not another extension.
The Rebuild
This is where the economics changed.
After weeks of tuning embeddings on text that was fundamentally context-poor, I made the decision to rebuild the entire pipeline around contextual embedding. The concept is straightforward: before you embed a chunk, you give the model the full document and ask it to generate a short contextual summary, one or two sentences that explain what this chunk is about within the context of the document it belongs to. Then you embed the chunk with that context prepended.
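In code, the contextual step is just a prompt and a string prepend; all of the cost lives inside the model call. A sketch with a stub in place of the generative model (the prompt wording is mine, not a canonical template):

```python
def contextualize(document_text, chunk_text, generate):
    """Prepend an LLM-written situating sentence to a chunk before embedding.

    generate: callable (prompt -> str), standing in for the LLM call.
    """
    prompt = (
        "Here is a document:\n" + document_text + "\n\n"
        "Here is a chunk from it:\n" + chunk_text + "\n\n"
        "In one or two sentences, state what this chunk is about "
        "within the overall document."
    )
    context = generate(prompt).strip()
    # The embedder sees context + chunk as a single input
    return context + "\n\n" + chunk_text
```

Note what the prompt requires: the full document, every time, for every chunk. That is where the generative-model bill comes from.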
The concept is straightforward. The compute is not.
One to two sentences of context for each chunk, across a million documents, means running a large language model against every document in the corpus. Not an embedding model. A generative model. At scale, that’s not a pipeline step you run on a workstation. I rented a B200 GPU for twenty-four hours. That’s on top of the A100 cluster that already spent three days parsing.
What started as a research project on my lab bench with one Blackwell card, a local Qdrant instance, and a SQLite database had become a multi-cluster operation burning through cloud GPU hours. And this was still research. Still testing. Still iterating on whether contextual chunking actually improved retrieval quality enough to justify the cost. Early results say yes, significantly, but I’m not ready to publish numbers I can’t reproduce.
The Neo4j graph is the other half of the rebuild. GraphRAG, using the graph structure to inform retrieval rather than just vector similarity, gives me something pure vector search can’t: relationships. When someone asks “which maintenance events preceded this compliance failure,” vector search returns documents that talk about maintenance and compliance. Graph traversal returns the actual chain of events, connected by shared entities, temporal sequence, and document references that the parsing layer extracted and the graph layer preserved.
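The structural difference from vector search is that the answer is a path, not a ranked list. A toy version of that backwards traversal over an in-memory adjacency map (node names and the edge label are invented; the real system does this in Neo4j with Cypher):

```python
from collections import deque

def events_before(graph, target, relation="PRECEDES"):
    """Walk typed edges backwards from a target node.

    graph: {node: [(relation, neighbor), ...]} with edges pointing
    forward in time. Returns prior events in breadth-first order.
    """
    # Build reverse adjacency for the chosen relation
    reverse = {}
    for src, edges in graph.items():
        for rel, dst in edges:
            if rel == relation:
                reverse.setdefault(dst, []).append(src)
    # BFS backwards from the target
    seen, order, queue = {target}, [], deque([target])
    while queue:
        node = queue.popleft()
        for prev in reverse.get(node, []):
            if prev not in seen:
                seen.add(prev)
                order.append(prev)
                queue.append(prev)
    return order
```

No similarity score anywhere in that function. The retrieval quality depends entirely on how well the extraction layer built the edges, which is why the parsing work at the top of this post never stops mattering.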
What I’m Learning
I’m deep enough into this to know what I don’t know, and what I don’t know could fill another million PDFs.
But a few things have crystallized.
The first is that heuristics don’t survive contact with real data at scale. I took shortcuts: splitting on token count, embedding without context, using SQLite because it was fast to set up, avoiding graph storage because it added complexity. Every one of those shortcuts produced a system that worked on a thousand documents and collapsed on a million. Not gracefully. Not with degraded performance. It just stopped returning useful answers, which is worse than returning no answers at all, because it looked like it was working.
The second is that this is labor. Not just compute labor, though the GPU hours are real and the invoices are mounting. It’s intellectual labor. Deciding what constitutes a “document” when some PDFs contain a single page and others contain three hundred. Deciding how to handle multilingual content. Deciding which entity types matter for graph construction and which are noise. Deciding whether a table should be chunked as text or preserved as structured data. Every one of those decisions propagates through the entire pipeline, and every one of them is a judgment call that no model can make for you. Not yet.
The third is that the infrastructure stack under a production RAG system is far deeper than the vector store. It’s the parser. The metadata index. The graph. The contextual enrichment layer. The evaluation framework that tells you whether your changes actually improved retrieval or just made different mistakes. Each layer has its own failure modes, its own scaling characteristics, its own operational surface. I started with one GPU and a Python script. I’m now running across three database systems, two GPU clusters, and a pipeline with more moving parts than some production applications I’ve deployed in data centers.
The Territory
I don’t have a neat conclusion because I’m not at the end. I’m somewhere in the middle, probably closer to the beginning than I’d like to admit. The research is ongoing. The B200 invoices are real. The Neo4j graph is growing. The contextual embeddings are promising but unproven at the scale I need them to hold.
What I can say is this: if you’re planning to build a RAG system on a real corporate corpus, not a demo, not a curated dataset, not a hundred carefully selected documents, plan for the labor. Plan for the infrastructure. Plan for the rebuild, because your first version will teach you everything the tutorials didn’t.
Data doesn’t organize itself. At a million documents, that’s not an observation. That’s the entire job.