Best Local LLM for Agentic Coding — Real Benchmarks

The technology that actually matters in AI compute — SXM socket GPUs — doesn’t exist in consumer hardware. The market has struggled with GPU availability for over a decade; the supply dynamics haven’t changed, they’ve just shifted upstream, away from consumers entirely.

I spent 20 years architecting mission-critical enterprise systems — chasing performance through every layer from L3 cache to SAN fabric. When I moved from Brazil to the US and lost access to the data center I’d built at AMTI, I rebuilt a near-production lab at home: dual RTX 3090s with NVLink, custom liquid cooling, and a 6 kW UPS with faithful power delivery. Not elegant, but thermally and electrically honest. Sustained inference loads without throttling, no measurement fiction. The hardware details are below.

There’s a wave of enthusiasm around local agentic AI on consumer hardware — Mac Minis, NVIDIA GB10 Sparks, GGUF on CPU. People are doing the math on canceling API subscriptions and running local. The math works on paper. I couldn’t find anyone publishing sustained-load benchmarks that isolate what actually matters: how architecture and interconnect affect throughput under real conditions, not demo bursts.

So I ran the tests myself. Here’s what nobody measured for you.

What This Article Is — and What It Isn’t

Inference-only. The NVLink results, the single-GPU recommendations — all of it applies to autoregressive token generation for coding agents. ML training with parallel gradient synchronization is a different workload; interconnect bandwidth is critical there. NVSwitch fabrics and RDMA over NCCL change the calculus entirely. That’s a separate article.

This piece is a hardware and architecture comparison: throughput on real coding tasks across GPU configurations and model architectures. Tokens-per-second alone doesn’t determine real-world agent quality. Code correctness, tool-calling reliability, context utilization, and orchestrator compatibility all matter — those dimensions are for the next article in this series, where we run these same models through real agentic coding tasks against Claude Code with Opus and Sonnet.

This article measures one thing with discipline. The rest follows.

If you want to suggest another workload to test, drop me an email.

What You Should Build

If you have…	MoE tok/s	Dense tok/s	Recommendation
1x RTX 3090	168	41	MoE. No contest.
2x RTX 3090 (no NVLink)	164	64	MoE for speed. Dense only if you need its quality edge.
2x RTX 3090 (NVLink)	170	65	NVLink adds <4%. Save the money.

One used RTX 3090 (~$800), AWQ-4bit MoE at TP=1: 168 tok/s with 32K context. Enough for most coding agent tasks. Best performance-per-dollar for local inference. Need context beyond 32K? Add a second GPU over plain PCIe — skip NVLink.

Now, the data behind those numbers.

Why This Matters: Local Coding Agent Swarms

The emerging pattern is orchestrator + swarm: a powerful model (Claude Opus, GPT) plans and decomposes work, then delegates parallelizable tasks to faster, cheaper models. Code generation, test writing, refactoring, documentation — embarrassingly parallel.

Running those swarm agents locally gives you cost (zero API spend for bulk generation), latency (no network round-trip, sub-second time-to-first-token), and privacy (code never leaves your machine). There’s a fourth advantage that isn’t obvious from the numbers: data sovereignty. Frontier models handle reasoning and orchestration; the local model handles execution. Your proprietary code and business data stay on your network. The frontier model sees task descriptions and reviews outputs; the local model sees the actual codebase. That boundary only gets more relevant as data residency requirements tighten.

For this to be practical, the local model needs >100 tok/s for interactive agent loops, native tool calling, good code quality, and affordable hardware — ideally a single consumer GPU. As the results show, MoE models hit all four on an $800 used RTX 3090.

The Models

	MoE	Dense
Model	Qwen3-Coder-30B-A3B-Instruct	Qwen2.5-Coder-32B-Instruct
Architecture	Mixture-of-Experts	Dense transformer
Total parameters	30.5B	32.5B
Active parameters/token	3.3B (8 of 128 experts)	32.5B (all weights)
Quantization	AWQ 4-bit (Marlin kernels)	AWQ 4-bit (Marlin kernels)
Model size on disk	16.9 GB	19.5 GB
Context	256K native	128K (configured at 32K for VRAM)
Native tool calling	Yes	No
Designed for	Agentic coding, SWE-bench, terminal-bench	General coding

Both use identical Marlin AWQ kernels on Ampere Tensor Cores. The only variable is architecture.

Qwen3-Coder is one generation newer than Qwen2.5-Coder — intentional. These are the best available coding models in each architecture class. The comparison answers which model should I run? not the theoretical is MoE faster at equal quality? Some portion of the throughput gap reflects Qwen3’s architectural optimizations beyond MoE alone.

The Hardware

Component	Spec
GPUs	2x NVIDIA RTX 3090 24GB (Ampere, SM 8.6)
Interconnect	NV3 NVLink (3 lanes, ~112 GB/s bidirectional)
PCIe	Gen 4.0 x16 (~25 GB/s per GPU)
CPU	AMD Threadripper (Gigabyte TRX40 Aorus Master)
RAM	64GB DDR4
Storage	PM1735 enterprise NVMe (5.4TB ZFS pool)
Cooling	Custom loop, waterblocked GPUs, 360mm radiator
Power	6 kW UPS, 6 kW transformer (wye phase+neutral)
OS	Ubuntu 24.04, CUDA 12.8, driver 570.133.20

The Benchmark

Five coding tasks — the kind agents actually do:

Prompt	Task Type	What It Tests
Implement LRU Cache	Algorithm implementation	Data structure design, O(1) complexity, type hints
Debug Merge Sort	Bug finding & fixing	Code comprehension, error analysis, corrected output
Pytest Suite	Test generation	Parametrize, edge cases, 12+ test cases for a cron parser
Refactor Flask Route	Architecture refactoring	Service layer, repository pattern, separation of concerns
Rate Limiter Design	System design + implementation	Token bucket, thread safety, decorator pattern

Methodology: Each prompt at max_tokens=4096, temperature=0.7. Thinking mode disabled. Served via vLLM’s OpenAI-compatible chat completions API. Two runs per config (warmup then measurement). Tokens/second = API-reported completion_tokens / wall-clock time. All responses completed naturally except one test-generation prompt that hit the 4096-token cap. vLLM 0.17.0rc1 with --enable-chunked-prefill, --max-num-seqs 4, --gpu-memory-utilization 0.92.

Configurations: (1) TP=2, NVLink — both GPUs, NVLink active. (2) TP=1, Single GPU — one GPU only. (3) TP=2, PCIe only — both GPUs, NVLink disabled via NCCL_P2P_DISABLE=1.

Results

Note on “Max Context”: Maximum supported context lengths at each configuration given available VRAM — not the context length used during testing.

MoE: Qwen3-Coder-30B-A3B AWQ-4bit (3.3B active)

Configuration	Impl	Debug	Test	Refactor	Design	Average	Max Context
TP=2, NVLink	170.0	167.3	169.6	168.5	170.7	169.7 tok/s	131K
TP=1, Single GPU	169.2	168.8	168.0	168.1	168.5	168.4 tok/s	32K
TP=2, PCIe only	165.9	162.9	163.2	160.7	164.5	163.5 tok/s	131K

Dense: Qwen2.5-Coder-32B AWQ-4bit (32B active)

Configuration	Impl	Debug	Test	Refactor	Design	Average	Max Context
TP=2, NVLink	65.6	65.1	65.5	64.9	65.6	65.4 tok/s	32K
TP=1, Single GPU	41.0	40.9	40.9	41.0	41.0	41.0 tok/s	8K
TP=2, PCIe only	63.9	63.5	63.7	63.4	63.8	63.7 tok/s	32K

Head-to-Head

Configuration	MoE	Dense	MoE Advantage
TP=2 NVLink	169.7	65.4	2.6x
TP=1 Single GPU	168.4	41.0	4.1x
TP=2 PCIe only	163.5	63.7	2.6x

Analysis

MoE doesn’t care about your GPU topology. The MoE model delivers 164–170 tok/s regardless of configuration. Single GPU, dual GPU, NVLink, PCIe — spread is 3.8%. It activates 3.3B parameters per token; one RTX 3090’s 936 GB/s memory bandwidth serves that without breaking a sweat. The only reason to add a second GPU is context length: one 24GB GPU fits the 17GB model plus ~7GB KV cache (~32K context). Two GPUs give you 131K.

Dense needs two GPUs — but NVLink barely matters. Going from one GPU to two with NVLink gives a 59% throughput boost (41 → 65 tok/s). PCIe-only dual GPU (63.7 tok/s) comes within 2.7% of NVLink (65.4 tok/s). At single-request batch size, the all-reduce payload is small enough that PCIe Gen 4 handles it. NVLink’s advantage would grow at higher concurrency; for sequential agent loops it’s effectively invisible.

Single GPU is where MoE dominates. One RTX 3090: MoE at 168 tok/s, dense at 41 tok/s — 4.1x. At 168 tok/s, a 1000-token function completes in ~6 seconds. At 41 tok/s, the same task takes 24 seconds — too slow for interactive agent loops.

Throughput is consistent across task types. Both models show near-flat throughput regardless of prompt complexity. You can reliably predict completion times for agent task scheduling.

A Note on Thinking Mode

Both models were benchmarked with thinking mode disabled. Qwen3-Coder supports an optional thinking mode (<think>...</think> reasoning tokens) that may improve code quality at the cost of effective throughput. The numbers above are raw generation speed; with thinking enabled, a portion of tokens are internal reasoning the agent doesn’t use. That trade-off is for a follow-up benchmark.

Why AWQ 4-bit, Not FP8?

The RTX 3090 (Ampere, SM 8.6) has Tensor Cores for FP16, BF16, INT8, and INT4. Native FP8 was introduced in Ada Lovelace and Hopper. On a 3090, FP8 models are decompressed to FP16 on the fly — a compute tax. Our earlier testing showed FP8 13% slower than AWQ 4-bit on the same MoE model. AWQ 4-bit with Marlin is mature on Ampere: smaller files (17GB vs 30GB), more VRAM for KV cache, higher throughput. Quality impact is negligible (~0.7% perplexity). On Ada/Hopper, native FP8 would change the calculus; we haven’t measured that.

The Agent Architecture

At 168 tok/s, a 500-token code generation completes in ~3 seconds. Three parallel subagents can each produce a complete file in the time it takes to review the output. The orchestrator (Claude Code) handles planning, architectural reasoning, and quality review; the local swarm handles volume — boilerplate, test scaffolding, documentation, single-function implementations. Faster iteration, lower cost.

What to Build — Recommendations

One GPU, ~$800: the sweet spot. Used RTX 3090, AWQ-4bit MoE at TP=1. Best performance-per-dollar; 32K context is enough for most agent tasks.

Two GPUs, ~$1600: only if you need long context. Skip NVLink. Two RTX 3090s over PCIe deliver 164 tok/s with 131K context. The 4% NVLink speedup doesn’t justify bridges and compatible motherboards.

Quantization: AWQ-4bit wins on Ampere. Faster than FP8, smaller on disk, quality loss negligible. FP8 makes sense on GPUs with native FP8 compute — not here.

Dense models still have their place. If you need the best code quality and have two GPUs, a dense 32B at 65 tok/s is viable for non-interactive workloads — background batch generation, overnight code review. For interactive agent loops, MoE wins.

The Stack

Component	What	Why
vLLM 0.17.0 nightly	Inference engine	Required for Qwen3 MoE support
`--disable-custom-all-reduce`	vLLM flag	Custom all-reduce crashes on SM 8.6 (RTX 3090)
`--tool-call-parser qwen3_coder`	vLLM flag	Enables native tool calling for agent workflows
`NCCL_P2P_DISABLE=1`	Environment variable	Forces PCIe transport (for benchmarking or non-NVLink systems)
LiteLLM	API proxy	Translates Anthropic API to OpenAI API for Claude Code integration

This benchmark uses vLLM 0.17.0rc1 nightly (Qwen3 MoE). Qwen3.5 MoE (e.g. Qwen3.5-35B-A3B) uses a different architecture not yet available in this build.

Conclusion

MoE architectures change the economics of local LLM inference. The old playbook — multiple GPUs, NVLink, maximize memory bandwidth — was written for dense models. With MoE, one $800 RTX 3090 serves a 30B-parameter coding model at 168 tokens per second on real coding tasks. Fast enough to power a swarm of agents that generate, test, and refactor code in parallel — all local.

As the industry moves toward larger MoE models (Qwen3-Coder-Next at 80B/3B active, DeepSeek-V3 at 671B/37B active), the NVLink premium matters less. PCIe is sufficient for all-reduce in MoE inference. NVLink remains valuable for prefill-heavy and high-concurrency serving; for autoregressive generation — the bread and butter of coding agents — it’s a luxury, not a necessity.

Bottom line: if you’re building local infrastructure for AI coding agents, start with one RTX 3090, an AWQ-4bit MoE model, and vLLM. You get 168 tok/s, native tool calling, and a capable coding model for the price of two months of API credits.

Benchmarked March 6, 2026. Hardware: 2x NVIDIA RTX 3090 24GB, NV3 NVLink, AMD Threadripper, 64GB RAM, PM1735 NVMe. Software: vLLM 0.17.0rc1, Ubuntu 24.04, CUDA 12.8, driver 570.133.20.

What This Article Is — and What It Isn’t#

What You Should Build#

Why This Matters: Local Coding Agent Swarms#

The Models#

The Hardware#

The Benchmark#

Results#

MoE: Qwen3-Coder-30B-A3B AWQ-4bit (3.3B active)#

Dense: Qwen2.5-Coder-32B AWQ-4bit (32B active)#

Head-to-Head#

Analysis#

A Note on Thinking Mode#

Why AWQ 4-bit, Not FP8?#

The Agent Architecture#

What to Build — Recommendations#

The Stack#

Conclusion#