A Million PDFs Won't Organize Themselves
Three months ago, I plugged a Blackwell GPU into my lab bench and pointed it at a million PDFs. Corporate documents: contracts, engineering reports, compliance filings, vendor proposals, maintenance logs, insurance certificates. The kind of archive that accumulates over two decades of running critical infrastructure. A million files. Not sampled, not curated, not cleaned. Raw. The plan was straightforward: parse the documents, chunk them, embed them into a vector store, wire up a retrieval layer, and start asking questions no keyword search could answer. Retrieval-Augmented Generation. The acronym that launched a thousand vendor decks. ...