Part 2 ended with a promise: find the cliff. Run the MoE model from four concurrent agents upward until the physics says stop.
We scaled to eight. The cliff never came.
This is Part 3 of the Local LLM Bench series. Part 1 covers the single-request baseline. Part 2 established the MoE advantage under concurrent load.
The model: Qwen3-Coder-30B-A3B, a Mixture-of-Experts architecture that activates only 3.3B of its 30B parameters per token. On consumer GPUs, that sparse activation leaves roughly 90% of memory bandwidth idle at batch size 1, creating headroom that concurrent agents fill. The dense comparison model activates all of its 32B parameters on every token, so it is already at the bandwidth ceiling before the second agent connects. Part 1 explains why these specific models were chosen (best in class for each architecture); Part 2 conclusively eliminated the dense model under concurrent load. This benchmark tests MoE only.
The Finding
Per-task throughput drops 27% between one and four concurrent agents. That penalty was established in Part 2. What Part 2 didn’t test was whether the degradation continues — whether agent five costs another 9%, and agent six another, until the GPU pair is producing tokens too slowly to be useful.
It doesn’t. The degradation stops at four.
| Concurrent Agents | Per-task tok/s | Effective tok/s | Contention vs C=1 |
|---|---|---|---|
| 1 | 169 | 169 | — |
| 2 | 154 | 220 | -9% |
| 3 | 137 | 247 | -19% |
| 4 | 125 | 404 | -26% |
| 5 | 130 | 316 | -23% |
| 6 | 124 | 372 | -27% |
| 7 | 126 | 369 | -26% |
| 8 | 123 | 388 | -27% |
Read the C=4 to C=8 rows. Per-task throughput oscillates between 123 and 130 tok/s. That’s measurement noise, not degradation. The contention penalty that matters — 169 down to 125 tok/s, roughly 27% — happens between C=1 and C=4. After that, the floor is reached.
Agents five through eight are free.

The 500 tok/s Ceiling
Part 2 called the per-task slowdown “contention” and attributed it to memory bandwidth saturation. That’s half the story. The vLLM engine logs during the benchmark runs tell the other half:
Running: 4 reqs → Avg generation throughput: 498.4 tokens/s
Running: 4 reqs → Avg generation throughput: 500.3 tokens/s
Running: 8 reqs → Avg generation throughput: 496.0 tokens/s
The GPU pair has a fixed token budget. Approximately 500 tok/s aggregate, regardless of how many requests are in flight. At C=4, each task gets ~125 tok/s. At C=8, you’d expect 500 divided by 8 — 62.5 tok/s per agent. But that’s not what happens.
Think of a highway at capacity. Once traffic reaches saturated flow, adding more cars doesn't make everyone slower; the cars just run closer together, and every exit creates a momentary gap that a new car fills. vLLM's continuous batching scheduler works the same way. Shorter tasks finish and yield their compute slots; longer tasks absorb the freed bandwidth. The aggregate rate stays pinned at ~500 tok/s; what changes moment to moment is how many sequences are splitting it. While all eight requests are in flight, each one really does generate more slowly, but the batch spends much of its lifetime partially drained. Averaged over a full task, each of eight agents still gets 123 tok/s instead of the 62.5 that static division predicts.
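The budget-sharing dynamic can be illustrated with a toy simulation. This is not vLLM's scheduler, just a sketch under simplifying assumptions: a fixed aggregate budget split evenly among in-flight requests, with finished requests leaving the batch immediately.

```python
# Toy model of continuous batching under a fixed aggregate token budget.
# Assumption: the budget is split evenly among whatever is in flight.

CEILING = 500.0  # aggregate tok/s the GPU pair sustains (from the engine logs)

def simulate(task_lengths, dt=0.01):
    """Advance all running tasks in small time steps, splitting the
    aggregate budget evenly among in-flight requests each tick."""
    remaining = list(task_lengths)
    elapsed = [0.0] * len(task_lengths)
    while any(r > 0 for r in remaining):
        running = [i for i, r in enumerate(remaining) if r > 0]
        share = CEILING / len(running)  # each in-flight request's slice
        for i in running:
            remaining[i] -= share * dt
            elapsed[i] += dt
    # per-task throughput = tokens generated / time that task was in flight
    return [length / t for length, t in zip(task_lengths, elapsed)]

# Eight tasks with a mix of short and long outputs (token counts)
rates = simulate([1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500])
print([round(r) for r in rates])
```

Even in this crude model, only the shortest task is stuck at the static 62.5 tok/s share; every longer task finishes well above it because the batch drains as it runs. Real prompts, prefill, and cache effects move the exact numbers, but the mechanism is the same.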

A Measurement Nuance
Our benchmark uses different prompt variants at each concurrency level to defeat vLLM’s prefix caching — every level gets completely different prompts so no cached KV blocks inflate the throughput. But within each level, the two measurement runs use the same prompts.
The engine logs show prefix cache hit rates climbing from 0% on run 1 to 54% on run 2 at C=1, then stabilizing around 60-67% at higher concurrency as cache eviction balances accumulation. That C=1 baseline of 169 tok/s averages a cold run and a warm run. The “27% contention penalty” at C=4 is measured against a baseline that’s already boosted by cache on its second pass.
This doesn’t invalidate the numbers — in production, prefix cache is always active, and your agent prompts will share system prompts, tool definitions, and partial context. Cache hit rates of 50-70% are realistic. The point is that the 27% is an upper bound. Real compute contention is lower.
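Defeating the cache between levels hinges on one property of vLLM's automatic prefix caching: it reuses KV blocks only for a shared token *prefix*. A hedged sketch of the idea (the function name and header format are hypothetical, not the benchmark's actual code):

```python
import uuid

def make_uncached_prompt(task_prompt: str) -> str:
    """Prefix caching matches from the first token onward, so a unique
    header -- not a unique suffix -- is what prevents cache hits."""
    # Hypothetical illustration: a random run ID changes the leading
    # tokens, so no previously cached KV blocks can match this prompt.
    return f"[run-id {uuid.uuid4().hex}]\n{task_prompt}"

a = make_uncached_prompt("Implement an LRU cache in Python.")
b = make_uncached_prompt("Implement an LRU cache in Python.")
print(a == b)  # False: distinct prefixes, so neither hits the other's cache
```

The flip side is the production advice below: if you *want* cache hits, put the shared material (system prompt, tool definitions) first and the task-specific text last.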
The Full Curve

The effective throughput numbers above C=4 look noisy: 316 at C=5, 372 at C=6, 388 at C=8. The variance comes from output length, not from throughput degradation. Different variant sets produce different total token counts; variant E prompts average 10,357 tokens across four tasks, while variant H averages 12,759. Since effective tok/s equals total tokens divided by wall time, output-length variation masks the underlying stability.
Per-task tok/s tells the real story. Flat from C=4 to C=8. And it holds across all four task types, not just in aggregate:
| Task Type | C=1 | C=4 | C=8 | C=4 to C=8 delta |
|---|---|---|---|---|
| Algorithm | 170 | 127 | 125 | -2 tok/s |
| Testing | 170 | 123 | 122 | -1 tok/s |
| Refactoring | 166 | 124 | 123 | -1 tok/s |
| System Design | 169 | 127 | 122 | -5 tok/s |
No task type is disproportionately affected. The MoE architecture distributes the compute budget evenly regardless of workload mix.
The Contention Floor
Part 2 called the 27% degradation at C=4 a “contention wall.” The extended data reframes it: this is a contention floor. A stable operating point where GPU compute is saturated but not overloaded.

Two distinct regions. From C=1 to C=4: linear degradation as concurrent sequences compete for GDDR6X bandwidth at 936 GB/s per GPU. Each additional agent costs roughly 9% per-task throughput. From C=4 to C=8: plateau. Per-task throughput stabilizes at 123-130 tok/s. Additional agents are effectively free.
The transition aligns with the ~500 tok/s engine ceiling. Once the GPU pair is saturated, vLLM’s continuous batching manages the queue efficiently enough that adding sequences doesn’t increase per-token latency. It just increases total tokens in flight.
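The two regions reduce to a small piecewise model. This is a descriptive fit to the table above, not a general law; the baseline, floor, and knee are the measured dual-GPU values.

```python
def per_task_toks(c, baseline=169.0, floor=125.0, knee=4):
    """Two-region fit to the dual-GPU sweep: a linear ramp from the C=1
    baseline down to the contention floor at C=knee, then a flat plateau."""
    if c >= knee:
        return floor
    # linear interpolation between the C=1 baseline and the C=knee floor
    return baseline + (floor - baseline) * (c - 1) / (knee - 1)

print([round(per_task_toks(c)) for c in range(1, 9)])
# measured column for comparison: 169, 154, 137, 125, 130, 124, 126, 123
```

The fit lands within a few tok/s of every measured point, which is about as much as a two-parameter ramp-and-floor model can claim.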
Single GPU — Different Floor, Same Shape
All the numbers above are dual RTX 3090 over NVLink. We ran the same sweep on a single GPU.
| Concurrent Agents | Per-task tok/s (TP=1) | Eff. tok/s (TP=1) | Per-task tok/s (TP=2 NVLink) |
|---|---|---|---|
| 1 | 167 | 165 | 169 |
| 2 | 142 | 193 | 154 |
| 4 | 100 | 323 | 125 |
| 6 | 102 | 300 | 124 |
| 8 | 100 | 303 | 123 |
Same shape. Contention ramp from C=1 to C=4, then the floor appears. The floor is lower, ~100 tok/s versus ~125, and the engine ceiling is ~330 tok/s versus ~500. The two numbers track: half the GPUs deliver roughly 65% of the throughput, because tensor parallelism splits each token's weight traffic across two 936 GB/s memory buses but pays NVLink communication overhead for it, while a single GPU serves every sequence from one bus.
The contention penalty is steeper on a single GPU: 40% versus 26%. Part 2 showed this at C=4. The extended data confirms it holds through C=8 — the floor stays at 40% contention, it doesn’t deepen.
One $800 GPU. Eight concurrent agents at 100 tok/s each. A 2,000-token function implementation completes in 20 seconds. Eight of them complete in 20 seconds.
What This Means for Your Swarm
Set --max-num-seqs 8. On our dual-GPU setup, there's no per-task penalty beyond C=4. The only cost is KV cache memory, and with GPU memory utilization set to 0.92 on 48GB total, the benchmark showed peak KV cache usage of 4% even at C=8. Not a constraint.
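A launch command matching the setup described here might look like the following. This is a sketch, not the benchmark's actual invocation: the exact model repository name is an assumption, and flag behavior should be verified against your installed vLLM version.

```shell
# Hypothetical vLLM launch for the eight-agent swarm described above.
# --tensor-parallel-size 2      : shard across the dual RTX 3090s over NVLink
# --max-num-seqs 8              : admit the full eight-agent swarm
# --gpu-memory-utilization 0.92 : the utilization used in this benchmark
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
  --tensor-parallel-size 2 \
  --max-num-seqs 8 \
  --gpu-memory-utilization 0.92
```

For the single-GPU configuration, drop `--tensor-parallel-size 2` and expect the ~100 tok/s floor instead.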
Don’t cap your agent count at four. If your orchestrator can decompose work into six or eight independent tasks — generate a module, write its tests, refactor a dependency, draft the docs — dispatch them all. Each agent gets 123 tok/s on dual GPU, 100 tok/s on single. Eight tasks finish in the wall time it takes one agent to complete one task serially, plus 27-40% overhead.
Prefix cache is your ally in production. We deliberately defeat it for measurement purity. In a real deployment, all subagents share a system prompt, tool definitions, and often partial context. Your actual per-task throughput under load will be higher than what we measured.
| Agents | Per-task tok/s | Time for 2K tokens | Aggregate tok/s | Tasks per minute |
|---|---|---|---|---|
| 1 | 169 | 11.8s | 169 | 5.1 |
| 4 | 125 | 16.0s | 500 | 15.0 |
| 8 | 123 | 16.3s | 984 | 29.5 |
Nearly 30 coding tasks per minute at C=8. Each producing a full module, test suite, or refactored component. The per-task latency increase from serial to eight-agent swarm: 4.5 seconds. That’s the total cost of 8x parallelism.
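The last three columns of that table follow from two formulas: a task's wall time is its token count divided by per-task throughput, and tasks per minute is agent count times 60 divided by that wall time. A quick check:

```python
def swarm_rates(agents, per_task_toks, task_tokens=2000):
    """Derive the table's latency and throughput columns from the
    measured per-task tok/s at a given concurrency level."""
    seconds_per_task = task_tokens / per_task_toks
    tasks_per_minute = agents * 60 / seconds_per_task
    return round(seconds_per_task, 1), round(tasks_per_minute, 1)

print(swarm_rates(1, 169))  # (11.8, 5.1)
print(swarm_rates(4, 125))  # (16.0, 15.0)
print(swarm_rates(8, 123))  # (16.3, 29.5)
```

The same function answers planning questions directly, e.g. how a longer 4,000-token task budget changes the per-minute rate at C=8.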
These Are Floor Numbers
The RTX 3090 is a 2020 GPU. GDDR6X memory, 6 MB L2 cache. Every limitation we hit in this benchmark is a memory subsystem constraint that newer hardware directly addresses.
| GPU | Bandwidth | L2 Cache | What changes |
|---|---|---|---|
| RTX 3090 (this bench) | 936 GB/s | 6 MB | Baseline — 500 tok/s ceiling |
| RTX 4090 | 1,008 GB/s | 72 MB | 12x L2: expert weights stay on-chip, contention floor drops |
| RTX 5090 | 1,792 GB/s | 128 MB | 1.9x bandwidth + 21x L2: ceiling shifts to ~900+ tok/s |
| H100 | 3,350 GB/s | 50 MB | 3.6x bandwidth + native FP8: different regime entirely |
The L2 cache matters more than raw bandwidth numbers suggest for MoE inference. The model activates 3.3B parameters per token — at 4-bit quantization, roughly 1.7 GB of expert weights. The RTX 3090’s 6 MB L2 caches virtually nothing. Every expert fetch goes to GDDR6X. An RTX 5090 with 128 MB L2 would cache significant portions of the hot expert set, reducing effective memory traffic per token and pushing the contention floor lower, the engine ceiling higher, or both.
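The arithmetic in that paragraph can be checked back-of-envelope. This is a rough bandwidth-bound estimate under loud simplifying assumptions: expert-weight fetches dominate memory traffic, no cache reuse, no KV cache reads.

```python
# Back-of-envelope memory traffic per token for the MoE model.
active_params = 3.3e9   # parameters activated per token
bytes_per_param = 0.5   # 4-bit quantization
gb_per_token = active_params * bytes_per_param / 1e9
print(round(gb_per_token, 2))  # ~1.65 GB of expert weights per token

# If every fetch goes to GDDR6X, one RTX 3090's 936 GB/s bandwidth
# bounds single-GPU generation at roughly:
bound = 936 / gb_per_token
print(round(bound))  # ~567 tok/s upper bound
```

The measured single-GPU ceiling of ~330 tok/s sits at roughly 60% of that idealized bound, which is plausible once KV cache traffic, activations, and kernel overheads are counted; the estimate supports the claim that this is a memory-subsystem limit rather than a compute limit.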
The scaling patterns we measured — the contention ramp, the plateau, the prefix cache dynamics — are architectural behaviors of MoE inference on saturated memory bandwidth. They hold across hardware. What changes is where the transitions land. The C=4 plateau on GDDR6X might become C=8 or C=12 on HBM. The 27% contention floor might drop to 10-15% with 20x more L2 cache. The 500 tok/s engine ceiling is a GDDR6X number, not a model architecture limit.
If an MoE model on 2020 consumer GPUs delivers eight concurrent agents at 123 tok/s each, the same architecture on current hardware isn’t incrementally better. It’s a different class of performance.
What We Still Haven’t Measured
Code quality under concurrency. All three parts of this series measure throughput — tokens per second, wall-clock time, contention penalties. We haven’t measured whether the quality of generated code degrades under concurrent load. Temperature and sampling are independent of throughput, so the hypothesis is no. But hypotheses without data are just hope.
Context length effects. These prompts are ~200 tokens of input generating 2,000-4,000 tokens of output. Agent prompts with 8K or 16K context windows consume more KV cache and may shift the plateau to a lower concurrency level.
Heterogeneous workloads. Real swarms mix short tasks with long ones — a code review alongside a full module generation. Whether the plateau holds when task lengths vary by 10x is an open question.
Benchmarked March 9, 2026. Hardware: 2x NVIDIA RTX 3090 24GB, NV3 NVLink, AMD Threadripper, 64GB RAM, PM1735 NVMe. Software: vLLM 0.17.0rc1, Ubuntu 24.04, CUDA 12.8, driver 570.133.20. All benchmark code, raw CSV data, and chart generation scripts at github.com/sch0tten/local-llm-eval.