Measurement & Truth

Who Holds the Keys to Confidential Computing

A friend called last week with a familiar complaint. He had built his workload inside AWS Nitro Enclaves, and he wanted out. His words, not mine: “Pretty easy to get in. Pretty costly to get up. Impossible to get out.” A friendly onboarding pipeline had generated his key for him and left it sitting right there in the console, and he honestly could not tell you whether it was his to take somewhere else. AWS ran the attestation. AWS decided, on every request, whether his own code was allowed to touch his own secrets. Then he asked the question that started this article. How do I port this to another provider? ...

In the Long Run, Economics Wins

Two postures have hardened around the cost of AI, and most leaders have already picked one without registering it as a choice. The first says zero dollars per token. Own the silicon, run the weights locally, drive the marginal cost of a query to nothing. Apple’s M3 through M5 put a capable model on a machine that fits in a backpack, NVIDIA’s GB10 desktop box puts a small token factory under the desk, and the appeal is clean: no meter, no vendor, no bill that grows every time the team does its job. ...

Frontier AI Is a System, Not a Model

Yesterday a code editor sold for sixty billion dollars. SpaceX exercised an option it had struck back in April. The terms were unusually clean: buy Anysphere, the company behind the Cursor editor, outright for $60 billion in stock, or walk away and pay $10 billion just to partner. It bought. CBS reported the deal the same week SpaceX went public. Cursor leans heavily on Anthropic’s models today, and the new owner has already said it will drop its own models and Grok’s coding agent into that seat. ...

Visual RAG Beats the Vision Model

A three-billion-parameter vision model looked at a reCAPTCHA tile and got it right 89 percent of the time. It took 128 milliseconds. A lookup over a few hundred megabytes got it right 95 percent of the time. It took seven-tenths of a millisecond. Same tiles. Same held-out set. One of those is how almost everyone is wiring computer vision into their stack this year. The other is how you should. ...
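To make "a lookup" concrete, here is a minimal sketch of retrieval-based classification. It assumes each tile has already been turned into an embedding vector and that the lookup is a nearest-neighbour search by cosine similarity over labelled examples. The teaser does not say how the article's lookup is built, so the function names, the three-dimensional toy vectors and the labels are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def lookup_label(query, index):
    """Return the label of the stored embedding most similar to the query."""
    best_vec, best_label = max(index, key=lambda item: cosine(query, item[0]))
    return best_label

# Hypothetical labelled index: (embedding, label) pairs. Real embeddings
# would have hundreds of dimensions and come from an image encoder.
index = [
    ([1.0, 0.0, 0.0], "bus"),
    ([0.0, 1.0, 0.0], "crosswalk"),
    ([0.0, 0.0, 1.0], "traffic light"),
]

predicted = lookup_label([0.9, 0.1, 0.0], index)
print(predicted)
```

No model runs at query time in this sketch: classifying a tile is one pass over stored vectors, which is why a lookup can be orders of magnitude faster than a forward pass through a vision model.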

NVFP4: What 4-Bit Really Costs on Blackwell

A reproducible, independent quality-and-throughput study of FP8, INT4-AWQ and NVFP4 against BF16, across two dense and two Mixture-of-Experts models, measured with no access to NVIDIA’s harness.

Reproduce it yourself. Every number below traces to a committed run log, and the entire pipeline is public and MIT-licensed: github.com/sch0tten/nvfp4-benchmark. Clone it, run make all, dispute a number, add a model (see §3.7).

Abstract

We benchmark four numeric formats (BF16, FP8, INT4-AWQ and NVFP4) across sixteen arms (two dense and two Mixture-of-Experts instruction-tuned models, each in all four formats) on a single 96 GB NVIDIA Blackwell workstation, using the most-downloaded real-world quantization of each model rather than idealized in-house ones.

On quality, measured generatively under one identical protocol with the EleutherAI harness, four bits is nearly free: averaged over five tasks, NVFP4’s cost is at most 0.6 points (the dense models) and the MoE models give up even less, and that cost is concentrated almost entirely in knowledge (MMLU-Pro); math, code and instruction-following sit at a ceiling. NVFP4 and INT4-AWQ are a wash at equal ~½ byte per parameter: which one wins is decided by the quantization recipe, not the number format.

On throughput in the single-stream regime, the dominant lever is architecture: the MoE arms decode 3–7× faster than the dense ones, and within a model INT4-AWQ’s mature kernels usually edge NVFP4 on decode while NVFP4 holds the smallest weight footprint. With no access to NVIDIA’s harness, our independently-measured BF16→NVFP4 deltas reproduce NVIDIA’s published deltas to within 0.6 points on three of four benchmarks, and to 0.03 on the Qwen-MoE.

The practical verdict for a local agentic deployment: run a 4-bit MoE; take INT4-AWQ for peak tokens-per-second today and the official NVFP4 for the smallest memory and the format Blackwell was built around. ...
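The footprint side of that trade-off is simple arithmetic. The sketch below uses nominal bytes per parameter: 2 for BF16, 1 for FP8, and the abstract's ~½ byte for both 4-bit formats. It deliberately ignores scale and zero-point overhead, so it is a rough estimate, not a figure from the study. The 30-billion parameter count is a hypothetical chosen for illustration.

```python
# Nominal storage cost per weight, in bytes. The 4-bit entries use the
# abstract's "~1/2 byte per parameter" and ignore scale/zero-point overhead.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "INT4-AWQ": 0.5, "NVFP4": 0.5}

def weight_footprint_gb(num_params: float, fmt: str) -> float:
    """Approximate weight memory in GB (10^9 bytes) for a given format."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9

# Hypothetical model size, not one of the benchmarked arms.
HYPOTHETICAL_PARAMS = 30e9

for fmt in BYTES_PER_PARAM:
    print(f"{fmt:9s} {weight_footprint_gb(HYPOTHETICAL_PARAMS, fmt):6.1f} GB")
```

At this level of approximation the two 4-bit formats tie. Any real difference between them comes from the per-recipe overhead that the sketch leaves out.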

The Call Is Coming From Your Update Server

In September 2006, a Debian maintainer did everything right and broke the world’s trust for a year and a half. He was cleaning up the OpenSSL package. Valgrind and Purify, the memory checkers every careful engineer is supposed to listen to, kept flagging two lines in md_rand.c. The lines read uninitialized memory. That’s a sin. Undefined behavior, the kind of thing you delete without a second thought. So he deleted it. ...

Security Research Is Not a Crime

It was around 2000. I was running Legion across entire Class B ranges, watching open Windows shares scroll up the screen faster than I could read them. C$. ADMIN$. Whole NT4 boxes answering null sessions like a door with no lock and a welcome mat on the floor. You didn’t need a password. You needed curiosity and a free afternoon. The Microsoft of that era had no Patch Tuesday. No Security Response Center worth the name. Security was a feature request that lost to the ship date, every quarter, on purpose. The company that today runs one of the most disciplined vulnerability programs on the planet once shipped operating systems to hospitals and banks with the equivalent of the front door propped open. ...

Maria Pennacchi Schotten's Rubik's Cube wolf mosaic, more than a thousand cubes

Applied AI Is Human Augmentation, Not Replacement

Since 2023, I’ve been studying applied AI almost exclusively. I don’t pretend to be a data scientist or ML engineer. Honestly, I don’t think giving up more than twenty years of infrastructure, performance, and security engineering would be smart. I’d end up like a duck: swims, flies, and walks, but doesn’t outperform at any of them. It’s impossible not to get caught up in the vibe-coding thing. I’m not here to criticize anyone shipping and prototyping. A few months back, I heard one of the smartest things anyone’s said about AI, from Naval Ravikant. I’ve been listening to him for a few years now, and his takes are consistently good. I don’t remember the exact words, and I’m not going to chase videos or quotes to nail them down, but it was close to this: “There is no disruption caused by AI. The novelty we’re seeing is the abstraction and conversion of human language into computing language.” Brilliant. ...

GPU Fleet AIOps: The Augmented Operator

Two in the morning, eighteen hours into the run. Seven LLM backends processing the same stream of GPU cluster anomalies. Same thermal cascades, same NVLink errors, same KV cache evictions. I’m watching the scoring dashboard update in real time and the numbers are breaking my assumptions faster than I can take notes. The $32-per-day model is getting the diagnosis wrong more often than a free one running on my workstation. ...

292x: Why Batch Inference Breaks on API Pricing

292x. That’s not a rounding error. That’s the cost multiplier between running a batch inference job on a rented B200 GPU and sending the same workload through Claude Opus 4.6’s API. The job was straightforward: generate one or two contextual sentences for each of a million documents, each one JSON extracted from the corporate PDF archive I’ve been building a RAG pipeline around. Those sentences get prepended to each chunk before it is embedded as a 768-dimensional vector in Qdrant, alongside BM25 sparse indexing. It’s the contextual layer that makes retrieval actually work, the step I described in the previous article about why a million PDFs won’t organize themselves. ...

Local LLM Bench: Scaling Swarms Beyond Four

Part 2 ended with a promise: find the cliff. Run the MoE model from four concurrent agents upward until the physics says stop. We scaled to eight. The cliff never came. This is Part 3 of the Local LLM Bench series. Part 1 covers the single-request baseline. Part 2 established the MoE advantage under concurrent load. The model: Qwen3-Coder-30B-A3B — a Mixture-of-Experts architecture that activates only 3.3B of its 30B parameters per token. On consumer GPUs, that sparse activation leaves ~90% of memory bandwidth idle at batch size 1, creating headroom that concurrent agents fill. Dense models activate all 32B parameters on every token — already at the bandwidth ceiling before the second agent connects. Part 1 explains why these specific models were chosen (best in class for each architecture); Part 2 conclusively eliminated Dense under concurrent load. This benchmark tests MoE only. ...
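The ~90% figure follows from the active-parameter ratio. This back-of-the-envelope sketch assumes that decode at batch size 1 is memory-bandwidth bound and that weight traffic per token scales with the number of active parameters. That is a simplification which ignores KV cache, attention and router traffic. The function name is illustrative.

```python
def idle_weight_fraction(active_params_b: float, total_params_b: float) -> float:
    """Fraction of the model's weights that are NOT read for a given token."""
    return 1.0 - active_params_b / total_params_b

# Parameter counts (in billions) as quoted in the text.
moe_idle = idle_weight_fraction(3.3, 30.0)     # Qwen3-Coder-30B-A3B
dense_idle = idle_weight_fraction(32.0, 32.0)  # dense 32B model

print(f"MoE idle fraction:   {moe_idle:.0%}")
print(f"Dense idle fraction: {dense_idle:.0%}")
```

Under these assumptions the MoE model leaves roughly 89% of its weights untouched per token, which is the headroom concurrent agents can fill. The dense model leaves none.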

Local LLM Bench: Best Model for Coding Swarms

In Part 1, we established the baseline: MoE delivers 168 tok/s on a single RTX 3090, 4.1x faster than Dense. Clean single-request numbers. One prompt in, one response out. That’s not how swarms work. An orchestrator like Claude Code dispatches four coding tasks simultaneously. The local model serves all four. Under concurrency, memory bandwidth saturates, per-task throughput drops, and the architecture of the model — not the GPU, the model — determines whether you get useful parallelism or just contention. ...