A benchmark lab for local models only makes sense when the rules do not change between runs. That is why every model in this article uses the same GPU, the same context length, the same backend and the same five prompts. The goal is simple: measure what a local user actually feels. That means throughput in tok/s, time to first token, model size on disk and VRAM after loading.
This version of the lab already covers eight completed runs in the 26B to 35B class. It includes MoE and dense models, plus one MTP result that shows how large the real-world gap can be when speculative decoding works well. I also map every tested model to a realistic home GPU, because that is the first question most readers ask after seeing the charts.
Fixed hardware for the benchmark lab for local models
The whole comparison uses one unchanged test bed:
- GPU: AMD Radeon PRO W7900 48 GB VRAM
- Backend: llama-server Vulkan, local build for gfx1100
- OS: Windows 11 Pro 24H2
- Context: 8192 tokens
- Quantization: Q4_K_M for standard runs
- Protocol: 10 runs x 5 prompts per model
I keep this setup fixed on purpose. A benchmark lab for local models becomes useless when drivers, context or runtime change between test runs. The W7900 also gives enough headroom to load every tested model fully on GPU, so the comparison is not distorted by CPU offload.
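One way to keep those settings from drifting is to pin them in a single object and build every server launch from it. The sketch below is only an illustration: the article does not list the exact command line, so the model path, the port and the `-ngl 99` value for full GPU offload are assumptions. The Vulkan backend is typically selected when llama.cpp is built, so it does not appear as a flag here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabConfig:
    """Settings that stay identical for every model in the lab."""
    context_tokens: int = 8192   # from the test bed list
    runs_per_prompt: int = 10    # from the protocol line
    gpu_layers: int = 99         # assumption: offload every layer to the GPU
    port: int = 8080             # assumption: port chosen for illustration


def server_command(model_path: str, cfg: LabConfig = LabConfig()) -> list[str]:
    """Build a llama-server argument list for one model file."""
    return [
        "llama-server",
        "-m", model_path,
        "-c", str(cfg.context_tokens),
        "-ngl", str(cfg.gpu_layers),
        "--port", str(cfg.port),
    ]


# Hypothetical file name, used only to show the resulting command.
print(" ".join(server_command("models/example-Q4_K_M.gguf")))
```

Because the dataclass is frozen, a run cannot quietly change the context length halfway through a multi-model session.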
What the benchmark lab for local models measures
Throughput in tok/s
This is the sustained generation rate after the first token. It matters most for long answers, code generation and batch jobs.
TTFT in milliseconds
Time to first token controls how responsive a model feels in chat or in an editor. Lower TTFT usually feels better than a small raw tok/s gain.
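Both metrics can be derived from the arrival times of streamed tokens. The function below is a minimal sketch of that arithmetic, not the lab's actual measurement code. It assumes you already have a clock reading for the moment the request was sent and one for each token that arrived.

```python
def stream_metrics(request_start: float, token_times: list[float]) -> tuple[float, float]:
    """Return (ttft_ms, tok_per_s) for one streamed response.

    request_start: clock reading in seconds when the request was sent.
    token_times:   clock reading in seconds at the arrival of each token.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure a generation rate")
    ttft_ms = (token_times[0] - request_start) * 1000.0
    # Throughput counts only the tokens after the first, over the time they took.
    decode_seconds = token_times[-1] - token_times[0]
    tok_per_s = (len(token_times) - 1) / decode_seconds
    return ttft_ms, tok_per_s


# Five tokens: the first after 0.25 s, then one every 0.05 s.
ttft_ms, tok_per_s = stream_metrics(0.0, [0.25, 0.30, 0.35, 0.40, 0.45])
print(f"TTFT {ttft_ms:.0f} ms, {tok_per_s:.0f} tok/s")
```

Excluding the first token from the rate keeps prompt processing time out of the throughput figure, which matches the definition of tok/s used above.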
Size on disk
This is the GGUF file size for the exact or nearest Q4_K_M class artifact used in the lab. In practice, publisher builds differ slightly. MTP variants also tend to be larger.
VRAM after loading
This is the approximate VRAM footprint recorded in the benchmark metadata once the model is loaded at a context of 8192 tokens. For home GPUs, this number matters more than parameter count.
Prompt suite and scoring
The test set has five prompts. Two are general tasks, two are code tasks and one is a longer generation task. Code prompts get a higher weight in the final composite score because this series targets local developer workflows.
- T1 summarize: technical summary
- T2 reasoning: multi-step reasoning
- C1 debug: bug finding and repair
- C2 generate: code generation
- L1 long: long technical answer
Because of that weighting, the composite score is the best quick view of how a model behaves in a local developer workflow. It captures speed where people actually spend time.
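To make the weighting concrete, here is a sketch of a weighted composite. The article does not publish the exact weights or the formula, so both are assumptions here: a weight of 2 for each code prompt, 1 for the others, combined as a weighted arithmetic mean.

```python
# Hypothetical weights: the article only says that code prompts count more.
WEIGHTS = {"T1": 1.0, "T2": 1.0, "C1": 2.0, "C2": 2.0, "L1": 1.0}


def composite_tok_s(per_prompt: dict[str, float],
                    weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted arithmetic mean of per-prompt tok/s."""
    missing = set(weights) - set(per_prompt)
    if missing:
        raise KeyError(f"missing prompt results: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(per_prompt[p] * w for p, w in weights.items()) / total_weight


# Made-up per-prompt rates, only to show the call.
example = {"T1": 10.0, "T2": 10.0, "C1": 20.0, "C2": 20.0, "L1": 10.0}
print(round(composite_tok_s(example), 2))
```

With these assumed weights the two code prompts contribute four sevenths of the score, so a model that is fast on code pulls the composite up more than one that is fast only on summaries.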
Results: benchmark lab for local models leaderboard
| Model | Architecture | Disk size | VRAM after load | Composite tok/s | Avg TTFT | Minimum home GPU |
|---|---|---|---|---|---|---|
| Gemma-4-26B-A4B | MoE, 26B total / 4B active | ~15 GB | 18 GB | 87.82 | 272.3 ms | RX 7900 XT 20 GB or better |
| Qwen3.6-35B-A3B MTP | MoE + MTP, 35B / 3B active | ~21 GB | 24 GB | 82.71 | 364.4 ms | RTX 4090 24 GB or RX 7900 XTX 24 GB |
| Qwen3.5-35B-A3B | MoE, 35B / 3B active | ~18 GB | 24 GB | 45.37 | 342.0 ms | RTX 4090 24 GB or RX 7900 XTX 24 GB |
| Qwen3.6-35B-A3B | MoE, 35B / 3B active | ~18 GB | 24 GB | 40.79 | 350.9 ms | RTX 4090 24 GB or RX 7900 XTX 24 GB |
| Nemotron-Nano-Omni-30B | MoE, 30B / 3B active | ~18 GB | 24 GB | 37.06 | 247.7 ms | RTX 4090 24 GB or RX 7900 XTX 24 GB |
| Qwen3.6-27B | Dense, 27B / 27B active | 15.4 GB measured | 17 GB | 33.13 | 683.4 ms | RX 7900 XT 20 GB or better |
| Gemma-4-31B | Dense, 31B / 31B active | ~18 GB | 20 GB | 28.18 | 918.2 ms | RTX 4090 24 GB or RX 7900 XTX 24 GB |
| Granite-4.1-30B | Dense, 30B / 30B active | ~17 GB | 17 GB | 27.41 | 1250.0 ms | RX 7900 XT 20 GB or better |
The first big takeaway is obvious. Gemma-4-26B-A4B is the throughput winner in this benchmark lab for local models. The second takeaway is more interesting: Nemotron-Nano-Omni-30B has the best TTFT of the whole tested set. The third is practical: at 15.4 GB, Qwen3.6-27B is smaller on disk than the 35B class (roughly 18 to 21 GB), but its TTFT and throughput do not beat the best MoE results in this setup.

Methodology of the completed tests
Every finished result in this article comes from the same measurement loop. Each model gets five prompts, each prompt runs ten times and every run uses the same backend, context and quantization class. This matters because a local benchmark can look precise while still being unfair. I only compare numbers that were collected under the same rules.
The prompt pack mixes short and long outputs on purpose. T1 and T2 capture summary and reasoning. C1 and C2 represent debugging and generation. L1 checks long-form generation. The composite score then gives extra weight to the code prompts, because this series is aimed at developer workloads rather than pure chat.
How each result is measured
| Element | Method used in the test lab |
|---|---|
| Backend | llama-server Vulkan on AMD PRO W7900 |
| Context length | 8192 tokens for every completed run |
| Quantization | Q4_K_M for standard models in this article |
| Prompt count | 5 prompts per model |
| Repetitions | 10 runs per prompt |
| Main metrics | Composite tok/s and average TTFT |
| Storage policy | Model downloaded, tested, result saved, GGUF removed to recover disk space |
This workflow gives two benefits. First, it keeps the comparison honest. Second, it prevents the disk from filling up during a long multi-model run. The downside is that you have to rely on saved JSON results after the test completes, because the model file itself may already be deleted.
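The save-then-delete step can be sketched as follows. This is an illustration of the storage policy, not the lab's own script: the result fields, the file naming and the directory layout are assumptions. The one design point that matters is the order, because the GGUF is removed only after the JSON has been written and read back.

```python
import json
from pathlib import Path


def save_and_clean(result: dict, model_file: Path, results_dir: Path) -> Path:
    """Write the result JSON first, then delete the GGUF to free disk space."""
    results_dir.mkdir(parents=True, exist_ok=True)
    out_path = results_dir / f"{model_file.stem}.json"
    out_path.write_text(json.dumps(result, indent=2), encoding="utf-8")
    # Read the file back so a failed write cannot cost us the model as well.
    json.loads(out_path.read_text(encoding="utf-8"))
    model_file.unlink(missing_ok=True)
    return out_path
```

Calling `save_and_clean(result, Path("model.gguf"), Path("results"))` after each model keeps peak disk usage close to the size of one GGUF file.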
How to read the results correctly
A higher tok/s result does not always mean a better user experience. Nemotron-Nano-Omni-30B is a good example. It does not win the throughput chart, but it is the quickest to first token. On the other hand, Granite-4.1-30B fits in memory, yet its latency is high enough to feel slower in real work. That is why the article keeps both metrics side by side.
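A rough way to combine the two metrics is to estimate the wall time of a whole reply: the wait for the first token plus the time to stream the rest. The sketch below uses the leaderboard values for two models. Treat it as an approximation, since composite tok/s is an average over all five prompts and not the rate of any single reply.

```python
def estimated_seconds(ttft_ms: float, tok_per_s: float, output_tokens: int) -> float:
    """Approximate wall time: wait for the first token, then stream the rest."""
    return ttft_ms / 1000.0 + max(output_tokens - 1, 0) / tok_per_s


# Leaderboard values: (average TTFT in ms, composite tok/s).
nemotron = (247.7, 37.06)
granite = (1250.0, 27.41)

for tokens in (20, 500):
    print(
        tokens,
        round(estimated_seconds(*nemotron, tokens), 2),
        round(estimated_seconds(*granite, tokens), 2),
    )
```

For short replies the TTFT term dominates the estimate, and for long replies the throughput term does, which is the reason to read both columns together.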
There is also a hardware caveat. These numbers describe one specific setup: one Vulkan backend on one 48 GB GPU. Change the backend, context length or quantization and the ranking can move. MTP is especially sensitive to that. In some paths it is a major accelerator. In other paths it is only a small gain or a deployment complication.
VRAM after loading: what actually fits on home GPUs
This is the section many synthetic benchmarks skip. They compare models without saying which card can really hold them. In a benchmark lab for local models, that omission is fatal. A model that scores well but does not fit your GPU is not a candidate.
For the exact setup used here, 16 GB cards are not enough for any tested model. That includes RTX 4060 Ti 16 GB, RTX 4080 16 GB and RX 7900 GRE 16 GB. The floor for this class starts at 20 GB, and even there only the lighter 17-18 GB load class is comfortable.
| GPU | VRAM | What it can run from this lab | Comment |
|---|---|---|---|
| RTX 4060 | 8 GB | None | Far below the 26B-35B class used here |
| RTX 4070 / 4070 Super | 12 GB | None | Good for smaller models, not this benchmark set |
| RTX 4060 Ti / RTX 4080 / RX 7900 GRE | 16 GB | None at these settings | Borderline or insufficient for 17-18 GB load classes |
| RX 7900 XT | 20 GB | Granite-4.1-30B, Qwen3.6-27B, Gemma-4-26B-A4B | Best entry point for this exact article setup |
| RTX 4090 / RX 7900 XTX | 24 GB | All currently completed standard models, plus Qwen3.6-35B-A3B MTP | The practical sweet spot for 30B-35B local work |
| AMD PRO W7900 | 48 GB | Everything in this article with comfortable headroom | Best for repeatable benchmarking and future larger tests |
There is one important caveat. Smaller cards can sometimes run these models with lower context, CPU offload or a more aggressive quant. But that is no longer the same benchmark lab for local models. It becomes a different test profile.
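The card mapping can also be expressed as a small lookup. The sketch below copies the minimum card size for each model from the "Minimum home GPU" column of the leaderboard and filters by the VRAM of your card. It applies only to the settings used in this article (Q4_K_M, context 8192, no CPU offload).

```python
# Minimum card VRAM in GB, taken from the "Minimum home GPU" leaderboard column.
MIN_CARD_GB = {
    "Gemma-4-26B-A4B": 20,
    "Qwen3.6-35B-A3B MTP": 24,
    "Qwen3.5-35B-A3B": 24,
    "Qwen3.6-35B-A3B": 24,
    "Nemotron-Nano-Omni-30B": 24,
    "Qwen3.6-27B": 20,
    "Gemma-4-31B": 24,
    "Granite-4.1-30B": 20,
}


def models_for_card(card_vram_gb: int) -> list[str]:
    """Models this lab treats as runnable on a card at the article's settings."""
    return sorted(m for m, need in MIN_CARD_GB.items() if need <= card_vram_gb)


for vram in (16, 20, 24):
    print(vram, models_for_card(vram))
```

A 16 GB card returns an empty list, a 20 GB card returns the three lighter models and a 24 GB card returns all eight, which matches the table above.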
Dense vs MoE: what the charts say
The dense models in this set do not automatically lose. Granite-4.1-30B loads in the 17 GB class and Gemma-4-31B in the 20 GB class, so both fit on a 24 GB card. The problem is not loading them. The problem is latency and sustained speed. Granite is the clearest example: it runs, but it feels slow at 1250 ms TTFT.
MoE models tell a more mixed story. Gemma-4-26B-A4B is extremely fast. Qwen3.6-35B-A3B standard is solid but no longer dominant. Qwen3.6-35B-A3B MTP is the outlier that proves how strong speculative decoding can be when the runtime path is favorable.
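The size of the MTP effect follows directly from the two Qwen3.6-35B-A3B rows in the leaderboard. The short calculation below uses only those two composite scores.

```python
def speedup(accelerated_tok_s: float, baseline_tok_s: float) -> float:
    """Ratio of accelerated throughput to the baseline throughput."""
    return accelerated_tok_s / baseline_tok_s


# Composite tok/s from the leaderboard: MTP run against the standard run.
mtp_gain = speedup(82.71, 40.79)
print(f"MTP speedup: {mtp_gain:.2f}x")  # prints "MTP speedup: 2.03x"
```

That ratio holds for this backend and this GPU only, and it compares composite scores, not individual prompts.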

Which model to choose for which job
| Use case | Best current pick | Why |
|---|---|---|
| Fastest overall throughput | Gemma-4-26B-A4B | Highest composite score in the whole lab |
| Best first-token responsiveness | Nemotron-Nano-Omni-30B | Lowest measured TTFT at 247.7 ms |
| Best MTP result so far | Qwen3.6-35B-A3B MTP | 82.71 tok/s with 24 GB load class |
| Smallest dense model in this test band | Qwen3.6-27B | 15.4 GB measured on disk and 17 GB VRAM class |
| Best fit for a 20 GB card | Gemma-4-26B-A4B | Faster than the other models that fit the same VRAM class |
| Best fit for a 24 GB enthusiast card | Qwen3.6-35B-A3B MTP or Gemma-4-26B-A4B | Choose MTP for peak speedups, Gemma for simpler deployment |
Methodology notes and caveats
All numbers in this benchmark lab for local models come from one backend and one machine. That is a strength because it makes the comparison fair. It is also a limitation because another backend can change the ranking. MTP is the best example. A good implementation can double speed. A weak path can add complexity without enough gain.
Exact disk size also depends on the publisher artifact. A Q4_K_M build from one source can differ from another by hundreds of megabytes or more. For that reason, the leaderboard uses measured values where available and marks the rest as conservative approximations with a tilde.
If you want adjacent practical context, see the companion guides on Multi-Token Prediction on AMD W7900, running local LLMs with OpenAI-compatible APIs and installing llama.cpp on Windows.
Summary
The updated benchmark lab for local models now gives a cleaner answer than the earlier two-model draft. The raw throughput winner is Gemma-4-26B-A4B. The responsiveness winner is Nemotron-Nano-Omni-30B. The biggest acceleration result still belongs to Qwen3.6-35B-A3B MTP. Qwen3.6-27B adds a smaller dense option, but in this exact setup it does not take the performance crown.
The hardware conclusion is just as important as the model ranking. If you want to reproduce these results at home, 20 GB is the practical entry level and 24 GB is the real comfort tier for this 26B-35B class. That single fact is often more useful than another synthetic benchmark score.