The technology that actually matters in AI compute — SXM socket GPUs — doesn’t exist in consumer hardware. The market has struggled with GPU availability for over a decade; the supply dynamics haven’t changed, they’ve just shifted upstream, away from consumers entirely.
I spent 20 years architecting mission-critical enterprise systems — chasing performance through every layer from L3 cache to SAN fabric. When I moved from Brazil to the US and lost access to the data center I’d built at AMTI, I rebuilt a near-production lab at home: dual RTX 3090s with NVLink, custom liquid cooling, and a 6 kW UPS with faithful power delivery. Not elegant, but thermally and electrically honest. Sustained inference loads without throttling, no measurement fiction. The hardware details are below.
There’s a wave of enthusiasm around local agentic AI on consumer hardware — Mac Minis, NVIDIA GB10 Sparks, GGUF on CPU. People are doing the math on canceling API subscriptions and running local. The math works on paper. I couldn’t find anyone publishing sustained-load benchmarks that isolate what actually matters: how architecture and interconnect affect throughput under real conditions, not demo bursts.
So I ran the tests myself. Here’s what nobody measured for you.
What This Article Is — and What It Isn’t
Inference-only. The NVLink results, the single-GPU recommendations — all of it applies to autoregressive token generation for coding agents. ML training with parallel gradient synchronization is a different workload; interconnect bandwidth is critical there. NVSwitch fabrics and RDMA over NCCL change the calculus entirely. That’s a separate article.
This piece is a hardware and architecture comparison: throughput on real coding tasks across GPU configurations and model architectures. Tokens-per-second alone doesn’t determine real-world agent quality. Code correctness, tool-calling reliability, context utilization, and orchestrator compatibility all matter — those dimensions are for the next article in this series, where we run these same models through real agentic coding tasks against Claude Code with Opus and Sonnet.
This article measures one thing with discipline. The rest follows.
If you want to suggest another workload to test, drop me an email.
What You Should Build
| If you have… | MoE tok/s | Dense tok/s | Recommendation |
|---|---|---|---|
| 1x RTX 3090 | 168 | 41 | MoE. No contest. |
| 2x RTX 3090 (no NVLink) | 164 | 64 | MoE for speed. Dense only if you need its quality edge. |
| 2x RTX 3090 (NVLink) | 170 | 65 | NVLink adds <4%. Save the money. |
One used RTX 3090 (~$800), AWQ-4bit MoE at TP=1: 168 tok/s with 32K context. Enough for most coding agent tasks. Best performance-per-dollar for local inference. Need context beyond 32K? Add a second GPU over plain PCIe — skip NVLink.
Now, the data behind those numbers.
Why This Matters: Local Coding Agent Swarms
The emerging pattern is orchestrator + swarm: a powerful model (Claude Opus, GPT) plans and decomposes work, then delegates parallelizable tasks to faster, cheaper models. Code generation, test writing, refactoring, documentation — embarrassingly parallel.
Running those swarm agents locally gives you cost (zero API spend for bulk generation), latency (no network round-trip, sub-second time-to-first-token), and privacy (code never leaves your machine). There’s a fourth advantage that isn’t obvious from the numbers: data sovereignty. Frontier models handle reasoning and orchestration; the local model handles execution. Your proprietary code and business data stay on your network. The frontier model sees task descriptions and reviews outputs; the local model sees the actual codebase. That boundary only gets more relevant as data residency requirements tighten.
For this to be practical, the local model needs >100 tok/s for interactive agent loops, native tool calling, good code quality, and affordable hardware — ideally a single consumer GPU. As the results show, MoE models hit all four on an $800 used RTX 3090.
The Models
| MoE | Dense | |
|---|---|---|
| Model | Qwen3-Coder-30B-A3B-Instruct | Qwen2.5-Coder-32B-Instruct |
| Architecture | Mixture-of-Experts | Dense transformer |
| Total parameters | 30.5B | 32.5B |
| Active parameters/token | 3.3B (8 of 128 experts) | 32.5B (all weights) |
| Quantization | AWQ 4-bit (Marlin kernels) | AWQ 4-bit (Marlin kernels) |
| Model size on disk | 16.9 GB | 19.5 GB |
| Context | 256K native | 128K (configured at 32K for VRAM) |
| Native tool calling | Yes | No |
| Designed for | Agentic coding, SWE-bench, terminal-bench | General coding |
Both use identical Marlin AWQ kernels on Ampere Tensor Cores. The only variable is architecture.
Qwen3-Coder is one generation newer than Qwen2.5-Coder — intentional. These are the best available coding models in each architecture class. The comparison answers which model should I run? not the theoretical is MoE faster at equal quality? Some portion of the throughput gap reflects Qwen3’s architectural optimizations beyond MoE alone.
The Hardware
| Component | Spec |
|---|---|
| GPUs | 2x NVIDIA RTX 3090 24GB (Ampere, SM 8.6) |
| Interconnect | NV3 NVLink (3 lanes, ~112 GB/s bidirectional) |
| PCIe | Gen 4.0 x16 (~25 GB/s per GPU) |
| CPU | AMD Threadripper (Gigabyte TRX40 Aorus Master) |
| RAM | 64GB DDR4 |
| Storage | PM1735 enterprise NVMe (5.4TB ZFS pool) |
| Cooling | Custom loop, waterblocked GPUs, 360mm radiator |
| Power | 6 kW UPS, 6 kW transformer (wye phase+neutral) |
| OS | Ubuntu 24.04, CUDA 12.8, driver 570.133.20 |
The Benchmark
Five coding tasks — the kind agents actually do:
| Prompt | Task Type | What It Tests |
|---|---|---|
| Implement LRU Cache | Algorithm implementation | Data structure design, O(1) complexity, type hints |
| Debug Merge Sort | Bug finding & fixing | Code comprehension, error analysis, corrected output |
| Pytest Suite | Test generation | Parametrize, edge cases, 12+ test cases for a cron parser |
| Refactor Flask Route | Architecture refactoring | Service layer, repository pattern, separation of concerns |
| Rate Limiter Design | System design + implementation | Token bucket, thread safety, decorator pattern |
Methodology: Each prompt at max_tokens=4096, temperature=0.7. Thinking mode disabled. Served via vLLM’s OpenAI-compatible chat completions API. Two runs per config (warmup then measurement). Tokens/second = API-reported completion_tokens / wall-clock time. All responses completed naturally except one test-generation prompt that hit the 4096-token cap. vLLM 0.17.0rc1 with --enable-chunked-prefill, --max-num-seqs 4, --gpu-memory-utilization 0.92.
Configurations: (1) TP=2, NVLink — both GPUs, NVLink active. (2) TP=1, Single GPU — one GPU only. (3) TP=2, PCIe only — both GPUs, NVLink disabled via NCCL_P2P_DISABLE=1.
Results
Note on “Max Context”: Maximum supported context lengths at each configuration given available VRAM — not the context length used during testing.
MoE: Qwen3-Coder-30B-A3B AWQ-4bit (3.3B active)
| Configuration | Impl | Debug | Test | Refactor | Design | Average | Max Context |
|---|---|---|---|---|---|---|---|
| TP=2, NVLink | 170.0 | 167.3 | 169.6 | 168.5 | 170.7 | 169.7 tok/s | 131K |
| TP=1, Single GPU | 169.2 | 168.8 | 168.0 | 168.1 | 168.5 | 168.4 tok/s | 32K |
| TP=2, PCIe only | 165.9 | 162.9 | 163.2 | 160.7 | 164.5 | 163.5 tok/s | 131K |
Dense: Qwen2.5-Coder-32B AWQ-4bit (32B active)
| Configuration | Impl | Debug | Test | Refactor | Design | Average | Max Context |
|---|---|---|---|---|---|---|---|
| TP=2, NVLink | 65.6 | 65.1 | 65.5 | 64.9 | 65.6 | 65.4 tok/s | 32K |
| TP=1, Single GPU | 41.0 | 40.9 | 40.9 | 41.0 | 41.0 | 41.0 tok/s | 8K |
| TP=2, PCIe only | 63.9 | 63.5 | 63.7 | 63.4 | 63.8 | 63.7 tok/s | 32K |
Head-to-Head
| Configuration | MoE | Dense | MoE Advantage |
|---|---|---|---|
| TP=2 NVLink | 169.7 | 65.4 | 2.6x |
| TP=1 Single GPU | 168.4 | 41.0 | 4.1x |
| TP=2 PCIe only | 163.5 | 63.7 | 2.6x |
Analysis
MoE doesn’t care about your GPU topology. The MoE model delivers 164–170 tok/s regardless of configuration. Single GPU, dual GPU, NVLink, PCIe — spread is 3.8%. It activates 3.3B parameters per token; one RTX 3090’s 936 GB/s memory bandwidth serves that without breaking a sweat. The only reason to add a second GPU is context length: one 24GB GPU fits the 17GB model plus ~7GB KV cache (~32K context). Two GPUs give you 131K.
Dense needs two GPUs — but NVLink barely matters. Going from one GPU to two with NVLink gives a 59% throughput boost (41 → 65 tok/s). PCIe-only dual GPU (63.7 tok/s) comes within 2.7% of NVLink (65.4 tok/s). At single-request batch size, the all-reduce payload is small enough that PCIe Gen 4 handles it. NVLink’s advantage would grow at higher concurrency; for sequential agent loops it’s effectively invisible.
Single GPU is where MoE dominates. One RTX 3090: MoE at 168 tok/s, dense at 41 tok/s — 4.1x. At 168 tok/s, a 1000-token function completes in ~6 seconds. At 41 tok/s, the same task takes 24 seconds — too slow for interactive agent loops.
Throughput is consistent across task types. Both models show near-flat throughput regardless of prompt complexity. You can reliably predict completion times for agent task scheduling.
A Note on Thinking Mode
Both models were benchmarked with thinking mode disabled. Qwen3-Coder supports an optional thinking mode (<think>...</think> reasoning tokens) that may improve code quality at the cost of effective throughput. The numbers above are raw generation speed; with thinking enabled, a portion of tokens are internal reasoning the agent doesn’t use. That trade-off is for a follow-up benchmark.
Why AWQ 4-bit, Not FP8?
The RTX 3090 (Ampere, SM 8.6) has Tensor Cores for FP16, BF16, INT8, and INT4. Native FP8 was introduced in Ada Lovelace and Hopper. On a 3090, FP8 models are decompressed to FP16 on the fly — a compute tax. Our earlier testing showed FP8 13% slower than AWQ 4-bit on the same MoE model. AWQ 4-bit with Marlin is mature on Ampere: smaller files (17GB vs 30GB), more VRAM for KV cache, higher throughput. Quality impact is negligible (~0.7% perplexity). On Ada/Hopper, native FP8 would change the calculus; we haven’t measured that.
The Agent Architecture
At 168 tok/s, a 500-token code generation completes in ~3 seconds. Three parallel subagents can each produce a complete file in the time it takes to review the output. The orchestrator (Claude Code) handles planning, architectural reasoning, and quality review; the local swarm handles volume — boilerplate, test scaffolding, documentation, single-function implementations. Faster iteration, lower cost.
What to Build — Recommendations
One GPU, ~$800: the sweet spot. Used RTX 3090, AWQ-4bit MoE at TP=1. Best performance-per-dollar; 32K context is enough for most agent tasks.
Two GPUs, ~$1600: only if you need long context. Skip NVLink. Two RTX 3090s over PCIe deliver 164 tok/s with 131K context. The 4% NVLink speedup doesn’t justify bridges and compatible motherboards.
Quantization: AWQ-4bit wins on Ampere. Faster than FP8, smaller on disk, quality loss negligible. FP8 makes sense on GPUs with native FP8 compute — not here.
Dense models still have their place. If you need the best code quality and have two GPUs, a dense 32B at 65 tok/s is viable for non-interactive workloads — background batch generation, overnight code review. For interactive agent loops, MoE wins.
The Stack
| Component | What | Why |
|---|---|---|
| vLLM 0.17.0 nightly | Inference engine | Required for Qwen3 MoE support |
--disable-custom-all-reduce | vLLM flag | Custom all-reduce crashes on SM 8.6 (RTX 3090) |
--tool-call-parser qwen3_coder | vLLM flag | Enables native tool calling for agent workflows |
NCCL_P2P_DISABLE=1 | Environment variable | Forces PCIe transport (for benchmarking or non-NVLink systems) |
| LiteLLM | API proxy | Translates Anthropic API to OpenAI API for Claude Code integration |
This benchmark uses vLLM 0.17.0rc1 nightly (Qwen3 MoE). Qwen3.5 MoE (e.g. Qwen3.5-35B-A3B) uses a different architecture not yet available in this build.
Conclusion
MoE architectures change the economics of local LLM inference. The old playbook — multiple GPUs, NVLink, maximize memory bandwidth — was written for dense models. With MoE, one $800 RTX 3090 serves a 30B-parameter coding model at 168 tokens per second on real coding tasks. Fast enough to power a swarm of agents that generate, test, and refactor code in parallel — all local.
As the industry moves toward larger MoE models (Qwen3-Coder-Next at 80B/3B active, DeepSeek-V3 at 671B/37B active), the NVLink premium matters less. PCIe is sufficient for all-reduce in MoE inference. NVLink remains valuable for prefill-heavy and high-concurrency serving; for autoregressive generation — the bread and butter of coding agents — it’s a luxury, not a necessity.
Bottom line: if you’re building local infrastructure for AI coding agents, start with one RTX 3090, an AWQ-4bit MoE model, and vLLM. You get 168 tok/s, native tool calling, and a capable coding model for the price of two months of API credits.
Benchmarked March 6, 2026. Hardware: 2x NVIDIA RTX 3090 24GB, NV3 NVLink, AMD Threadripper, 64GB RAM, PM1735 NVMe. Software: vLLM 0.17.0rc1, Ubuntu 24.04, CUDA 12.8, driver 570.133.20.