Local LLM Bench: MoE vs Dense on One RTX 3090

I went looking for sustained-load benchmarks comparing MoE and Dense coding models on consumer GPUs. Not demo bursts on a Mac Mini — sustained autoregressive generation on real coding tasks, where architecture and interconnect are the only variables.

I found plenty of one-shot numbers. Nobody had published the comparison that matters: same hardware, same quantization, same inference engine, MoE versus Dense, across GPU configurations. Methodology visible. Numbers verifiable.

So I ran the tests. Dual RTX 3090s with NVLink, custom liquid cooling, a 6 kW isolation transformer feeding a double-conversion UPS. Not elegant, but thermally and electrically honest — sustained inference loads without throttling, no measurement fiction. The hardware details are below.

Scope: Inference-only. Autoregressive token generation for coding agents. Training workloads where interconnect bandwidth dominates gradient synchronization are a different discipline entirely. Code quality, tool-calling reliability, orchestrator compatibility — those dimensions matter too, but this piece measures one thing with discipline. The rest follows.

This is Part 1 of the Local LLM Bench series. Part 2 tests what happens when four agents hit the same GPU simultaneously. Part 3 scales to eight.

What You Should Build

If you have…	MoE tok/s	Dense tok/s	Recommendation
1x RTX 3090	168	41	MoE. No contest.
2x RTX 3090 (no NVLink)	164	64	MoE for speed. Dense only if you need its quality edge.
2x RTX 3090 (NVLink)	170	65	NVLink adds <4%. Save the money.

One used RTX 3090 (~$800), AWQ-4bit MoE at TP=1: 168 tok/s with 32K context. Enough for most coding agent tasks. Best performance-per-dollar for local inference. Need context beyond 32K? Add a second GPU over plain PCIe — skip NVLink.

Now, the data behind those numbers.

Why This Matters: Local Coding Agent Swarms

The emerging pattern is orchestrator + swarm: a powerful model (Claude Opus, GPT) plans and decomposes work, then delegates parallelizable tasks to faster, cheaper models. Code generation, test writing, refactoring, documentation — embarrassingly parallel.

Running those swarm agents locally gives you cost (zero API spend for bulk generation), latency (no network round-trip, sub-second time-to-first-token), and privacy (code never leaves your machine). There’s a fourth advantage that isn’t obvious from the numbers: data sovereignty. Frontier models handle reasoning and orchestration; the local model handles execution. Your proprietary code and business data stay on your network. The frontier model sees task descriptions and reviews outputs; the local model sees the actual codebase. That boundary only gets more relevant as data residency requirements tighten.

For this to be practical, the local model needs >100 tok/s for interactive agent loops, native tool calling, good code quality, and affordable hardware — ideally a single consumer GPU. As the results show, MoE models hit all four on an $800 used RTX 3090.

The Models

	MoE	Dense
Model	Qwen3-Coder-30B-A3B-Instruct	Qwen2.5-Coder-32B-Instruct
Architecture	Mixture-of-Experts	Dense transformer
Total parameters	30.5B	32.5B
Active parameters/token	3.3B (8 of 128 experts)	32.5B (all weights)
Quantization	AWQ 4-bit (Marlin kernels)	AWQ 4-bit (Marlin kernels)
Model size on disk	16.9 GB	19.5 GB
Context	256K native	128K (configured at 32K for VRAM)
Native tool calling	Yes	No
Designed for	Agentic coding, SWE-bench, terminal-bench	General coding

Both use identical Marlin AWQ kernels on Ampere Tensor Cores. The only variable is architecture.

Qwen3-Coder is one generation newer than Qwen2.5-Coder — intentional. These are the best available coding models in each architecture class. The comparison answers which model should I run? not the theoretical is MoE faster at equal quality? Some portion of the throughput gap reflects Qwen3’s architectural optimizations beyond MoE alone.

The Hardware

Component	Spec
GPUs	2x NVIDIA RTX 3090 24GB (Ampere, SM 8.6)
Interconnect	NV3 NVLink (3 lanes, ~112 GB/s bidirectional)
PCIe	Gen 4.0 x16 (~25 GB/s per GPU)
CPU	AMD Threadripper (Gigabyte TRX40 Aorus Master)
RAM	64GB DDR4
Storage	PM1735 enterprise NVMe (5.4TB ZFS pool)
Cooling	Custom loop, waterblocked GPUs, 360mm radiator
Power	6 kW UPS, 6 kW transformer (wye phase+neutral)
OS	Ubuntu 24.04, CUDA 12.8, driver 570.133.20

The Benchmark

Five coding tasks — the kind agents actually do:

Prompt	Task Type	What It Tests
Implement LRU Cache	Algorithm implementation	Data structure design, O(1) complexity, type hints
Debug Merge Sort	Bug finding & fixing	Code comprehension, error analysis, corrected output
Pytest Suite	Test generation	Parametrize, edge cases, 12+ test cases for a cron parser
Refactor Flask Route	Architecture refactoring	Service layer, repository pattern, separation of concerns
Rate Limiter Design	System design + implementation	Token bucket, thread safety, decorator pattern

Methodology: Each prompt at max_tokens=4096, temperature=0.7. Thinking mode disabled. Served via vLLM’s OpenAI-compatible chat completions API. Two runs per config (warmup then measurement). Tokens/second = API-reported completion_tokens / wall-clock time. All responses completed naturally except one test-generation prompt that hit the 4096-token cap. vLLM 0.17.0rc1 with --enable-chunked-prefill, --max-num-seqs 4, --gpu-memory-utilization 0.92.

Configurations: (1) TP=2, NVLink — both GPUs, NVLink active. (2) TP=1, Single GPU — one GPU only. (3) TP=2, PCIe only — both GPUs, NVLink disabled via NCCL_P2P_DISABLE=1.

Results

Note on “Max Context”: Maximum supported context lengths at each configuration given available VRAM — not the context length used during testing.

MoE: Qwen3-Coder-30B-A3B AWQ-4bit (3.3B active)

Configuration	Impl	Debug	Test	Refactor	Design	Average	Max Context
TP=2, NVLink	170.0	167.3	169.6	168.5	170.7	169.7 tok/s	131K
TP=1, Single GPU	169.2	168.8	168.0	168.1	168.5	168.4 tok/s	32K
TP=2, PCIe only	165.9	162.9	163.2	160.7	164.5	163.5 tok/s	131K

Dense: Qwen2.5-Coder-32B AWQ-4bit (32B active)

Configuration	Impl	Debug	Test	Refactor	Design	Average	Max Context
TP=2, NVLink	65.6	65.1	65.5	64.9	65.6	65.4 tok/s	32K
TP=1, Single GPU	41.0	40.9	40.9	41.0	41.0	41.0 tok/s	8K
TP=2, PCIe only	63.9	63.5	63.7	63.4	63.8	63.7 tok/s	32K

Head-to-Head

Configuration	MoE	Dense	MoE Advantage
TP=2 NVLink	169.7	65.4	2.6x
TP=1 Single GPU	168.4	41.0	4.1x
TP=2 PCIe only	163.5	63.7	2.6x

MoE vs Dense: Average Throughput by Configuration

Analysis

MoE doesn’t care about your GPU topology. The MoE model delivers 164–170 tok/s regardless of configuration. Single GPU, dual GPU, NVLink, PCIe — spread is 3.8%. It activates 3.3B parameters per token; one RTX 3090’s 936 GB/s memory bandwidth serves that without breaking a sweat. The only reason to add a second GPU is context length: one 24GB GPU fits the 17GB model plus ~7GB KV cache (~32K context). Two GPUs give you 131K.

Configuration Impact: MoE Doesn’t Care, Dense Does

Dense needs two GPUs — but NVLink barely matters. Going from one GPU to two with NVLink gives a 59% throughput boost (41 → 65 tok/s). PCIe-only dual GPU (63.7 tok/s) comes within 2.7% of NVLink (65.4 tok/s). At single-request batch size, the all-reduce payload is small enough that PCIe Gen 4 handles it. NVLink’s advantage grows at higher concurrency — we measure that in Part 2.

Single GPU is where MoE dominates. One RTX 3090: MoE at 168 tok/s, dense at 41 tok/s — 4.1x. At 168 tok/s, a 1000-token function completes in ~6 seconds. At 41 tok/s, the same task takes 24 seconds — too slow for interactive agent loops.

Throughput is consistent across task types. Both models show near-flat throughput regardless of prompt complexity. You can reliably predict completion times for agent task scheduling.

Throughput Consistency Across Task Types

A Note on Thinking Mode

Both models were benchmarked with thinking mode disabled. Qwen3-Coder supports an optional thinking mode (<think>...</think> reasoning tokens) that may improve code quality at the cost of effective throughput. The numbers above are raw generation speed; with thinking enabled, a portion of tokens are internal reasoning the agent doesn’t use. That trade-off is for a follow-up benchmark.

Why AWQ 4-bit, Not FP8?

The RTX 3090 (Ampere, SM 8.6) has Tensor Cores for FP16, BF16, INT8, and INT4. Native FP8 was introduced in Ada Lovelace and Hopper. On a 3090, FP8 models are decompressed to FP16 on the fly — a compute tax. Our earlier testing showed FP8 13% slower than AWQ 4-bit on the same MoE model. AWQ 4-bit with Marlin is mature on Ampere: smaller files (17GB vs 30GB), more VRAM for KV cache, higher throughput. Quality impact is negligible (~0.7% perplexity). On Ada/Hopper, native FP8 would change the calculus; we haven’t measured that.

The Agent Architecture

At 168 tok/s, a 500-token code generation completes in ~3 seconds. Three parallel subagents can each produce a complete file in the time it takes to review the output. The orchestrator (Claude Code) handles planning, architectural reasoning, and quality review; the local swarm handles volume — boilerplate, test scaffolding, documentation, single-function implementations. Faster iteration, lower cost.

What to Build

One GPU, ~$800: the sweet spot. Used RTX 3090, AWQ-4bit MoE at TP=1. Best performance-per-dollar; 32K context is enough for most agent tasks.

Two GPUs, ~$1600: only if you need long context. Skip NVLink. Two RTX 3090s over PCIe deliver 164 tok/s with 131K context. The 4% NVLink speedup doesn’t justify bridges and compatible motherboards.

Quantization: AWQ-4bit wins on Ampere. Faster than FP8, smaller on disk, quality loss negligible. FP8 makes sense on GPUs with native FP8 compute — not here.

Dense models still have their place. If you need the best code quality and have two GPUs, a dense 32B at 65 tok/s is viable for non-interactive workloads — background batch generation, overnight code review. For interactive agent loops, MoE wins.

The Stack

Component	What	Why
vLLM 0.17.0 nightly	Inference engine	Required for Qwen3 MoE support
`--disable-custom-all-reduce`	vLLM flag	Custom all-reduce crashes on SM 8.6 (RTX 3090)
`--tool-call-parser qwen3_coder`	vLLM flag	Enables native tool calling for agent workflows
`NCCL_P2P_DISABLE=1`	Environment variable	Forces PCIe transport (for benchmarking or non-NVLink systems)
LiteLLM	API proxy	Translates Anthropic API to OpenAI API for Claude Code integration

This benchmark uses vLLM 0.17.0rc1 nightly (Qwen3 MoE). Qwen3.5 MoE (e.g. Qwen3.5-35B-A3B) uses a different architecture not yet available in this build.

What Comes Next

These are single-request numbers. One prompt in, one response out. Clean, isolated, reproducible.

That’s not how swarms work. An orchestrator dispatches four tasks simultaneously. The local model serves all four. Memory bandwidth contends. Per-task throughput drops. The question isn’t whether it drops — it’s whether MoE’s bandwidth headroom absorbs the penalty better than Dense.

Part 2 of this series answers that. The short version: the advantage gets wider under load, not narrower. One GPU, four agents, 336 effective tok/s. The numbers change the recommendation from “MoE is faster” to “MoE is the only architecture that makes local swarms practical on consumer hardware.”

Benchmarked March 6, 2026. Hardware: 2x NVIDIA RTX 3090 24GB, NV3 NVLink, AMD Threadripper, 64GB RAM, PM1735 NVMe. Software: vLLM 0.17.0rc1, Ubuntu 24.04, CUDA 12.8, driver 570.133.20. All benchmark code, raw CSV data, and chart generation scripts at github.com/sch0tten/local-llm-eval.

What You Should Build#

Why This Matters: Local Coding Agent Swarms#

The Models#

The Hardware#

The Benchmark#

Results#

MoE: Qwen3-Coder-30B-A3B AWQ-4bit (3.3B active)#

Dense: Qwen2.5-Coder-32B AWQ-4bit (32B active)#

Head-to-Head#

Analysis#

A Note on Thinking Mode#

Why AWQ 4-bit, Not FP8?#

The Agent Architecture#

What to Build#

The Stack#

What Comes Next#