NVFP4: What 4-Bit Really Costs on Blackwell
A reproducible, independent quality-and-throughput study of FP8, INT4-AWQ and NVFP4 against BF16, across two dense and two Mixture-of-Experts models, measured with no access to NVIDIA’s harness.

Reproduce it yourself. Every number below traces to a committed run log, and the entire pipeline is public and MIT-licensed: github.com/sch0tten/nvfp4-benchmark. Clone it, run make all, dispute a number, add a model (see §3.7).

Abstract

We benchmark four numeric formats (BF16, FP8, INT4-AWQ and NVFP4) across sixteen arms: two dense and two Mixture-of-Experts instruction-tuned models, each in all four formats. Everything runs on a single 96 GB NVIDIA Blackwell workstation, using the most-downloaded real-world quantization of each model rather than idealized in-house ones.

On quality, measured generatively under one identical protocol with the EleutherAI harness, four bits is nearly free: averaged over five tasks, NVFP4’s cost is at most 0.6 points (the dense models) and the MoE models give up even less. That cost is concentrated almost entirely in knowledge (MMLU-Pro); math, code and instruction-following sit at a ceiling. NVFP4 and INT4-AWQ are a wash at equal ~½ byte per parameter: which one wins is decided by the quantization recipe, not the number format.

On throughput in the single-stream regime, the dominant lever is architecture: the MoE arms decode 3–7× faster than the dense ones, and within a model INT4-AWQ’s mature kernels usually edge NVFP4 on decode while NVFP4 holds the smallest weight footprint.

With no access to NVIDIA’s harness, our independently measured BF16→NVFP4 deltas reproduce NVIDIA’s published deltas to within 0.6 points on three of four benchmarks, and to 0.03 on the Qwen-MoE.

The practical verdict for a local agentic deployment: run a 4-bit MoE; take INT4-AWQ for peak tokens-per-second today and the official NVFP4 for the smallest memory and the format Blackwell was built around.

...
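To make the "~½ byte per parameter" figure concrete, here is a minimal back-of-envelope sketch of weight-only footprint per format. It is not part of the published pipeline. The helper name `weight_footprint_gb` and the 70-billion parameter count are hypothetical, chosen only for illustration. The byte widths for BF16 and FP8 follow from their bit widths, while the 0.5 figure for both 4-bit formats is the abstract's approximation and ignores scale factors, unquantized layers, KV cache and activations.

```python
# Nominal bytes per parameter. BF16 and FP8 follow from their bit widths;
# the ~0.5 figure for both 4-bit formats is the abstract's approximation and
# ignores scale factors and any layers left unquantized.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "INT4-AWQ": 0.5, "NVFP4": 0.5}

GPU_MEMORY_GB = 96  # workstation memory stated in the abstract (treated as decimal GB)


def weight_footprint_gb(n_params: float, fmt: str) -> float:
    """Approximate weight-only footprint in decimal gigabytes (hypothetical helper)."""
    return n_params * BYTES_PER_PARAM[fmt] / 1e9


if __name__ == "__main__":
    n_params = 70e9  # hypothetical parameter count, not a model from the study
    for fmt in BYTES_PER_PARAM:
        gb = weight_footprint_gb(n_params, fmt)
        verdict = "fits" if gb < GPU_MEMORY_GB else "does not fit"
        print(f"{fmt:9s} {gb:6.1f} GB  {verdict} in {GPU_MEMORY_GB} GB (weights only)")
```

Under these assumptions the two 4-bit formats come out identical by construction, so the sketch cannot say which of them has the smaller real footprint. That depends on the quantization recipe and has to be read from the measured numbers.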