10 min read

Benchmark lab for local models: real 27B-35B results on one 48 GB GPU

Benchmark lab for local models: real 27B-35B results on one 48 GB GPU
✅ REAL DATA — This benchmark lab for local models uses one machine, one backend and one prompt suite. The numbers below come from AMD PRO W7900 48 GB, llama-server Vulkan, Q4_K_M quantization, context 8192 and 10 runs per prompt.

A benchmark lab for local models only makes sense when the rules do not change between runs. That is why every model in this article uses the same GPU, the same context length, the same backend and the same five prompts. The goal is simple: measure what a local user actually feels. That means throughput in tok/s, time to first token, model size on disk and VRAM after loading.

This version of the lab already covers eight completed runs in the 27B to 35B class. It includes MoE and dense models, plus one MTP result that shows how large the real-world gap can be when speculative decoding works well. I also map every tested model to a realistic home GPU, because that is the first question most readers ask after seeing the charts.

Fixed hardware for the benchmark lab for local models

The whole comparison uses one unchanged test bed:

  • GPU: AMD Radeon PRO W7900 48 GB VRAM
  • Backend: llama-server Vulkan, local build for gfx1100
  • OS: Windows 11 Pro 24H2
  • Context: 8192 tokens
  • Quantization: Q4_K_M for standard runs
  • Protocol: 10 runs x 5 prompts per model

I keep this setup fixed on purpose. A benchmark lab for local models becomes useless when drivers, context or runtime change between test runs. The W7900 also gives enough headroom to load every tested model fully on GPU, so the comparison is not distorted by CPU offload.

What the benchmark lab for local models measures

Throughput in tok/s

This is the sustained generation rate after the first token. It matters most for long answers, code generation and batch jobs.

TTFT in milliseconds

Time to first token controls how responsive a model feels in chat or in an editor. Lower TTFT usually feels better than a small raw tok/s gain.

Size on disk

This is the GGUF file size for the exact or nearest Q4_K_M class artifact used in the lab. In practice, publisher builds differ slightly. MTP variants also tend to be larger.

VRAM after loading

This is the observed memory class used in the benchmark metadata after the model is loaded at context 8192. For home GPUs, this number matters more than parameter count.

Prompt suite and scoring

The test set has five prompts. Two are general tasks, two are code tasks and one is a longer generation task. Code prompts get a higher weight in the final composite score because this series targets local developer workflows.

  • T1 summarize: technical summary
  • T2 reasoning: multi-step reasoning
  • C1 debug: bug finding and repair
  • C2 generate: code generation
  • L1 long: long technical answer

Because of that weighting, the composite score is the best quick view of how a model behaves in a real benchmark lab for local models. It captures speed where people actually spend time.

Results: benchmark lab for local models leaderboard

Model Architecture Disk size VRAM after load Composite tok/s Avg TTFT Minimum home GPU
Gemma-4-26B-A4BMoE, 26B total / 4B active~15 GB18 GB87.82272.3 msRX 7900 XT 20 GB or better
Qwen3.6-35B-A3B MTPMoE + MTP, 35B / 3B active~21 GB24 GB82.71364.4 msRTX 4090 24 GB or RX 7900 XTX 24 GB
Qwen3.5-35B-A3BMoE, 35B / 3B active~18 GB24 GB45.37342.0 msRTX 4090 24 GB or RX 7900 XTX 24 GB
Qwen3.6-35B-A3BMoE, 35B / 3B active~18 GB24 GB40.79350.9 msRTX 4090 24 GB or RX 7900 XTX 24 GB
Nemotron-Nano-Omni-30BMoE, 30B / 3B active~18 GB24 GB37.06247.7 msRTX 4090 24 GB or RX 7900 XTX 24 GB
Qwen3.6-27BDense, 27B / 27B active15.4 GB measured17 GB33.13683.4 msRX 7900 XT 20 GB or better
Gemma-4-31BDense, 31B / 31B active~18 GB20 GB28.18918.2 msRTX 4090 24 GB or RX 7900 XTX 24 GB
Granite-4.1-30BDense, 30B / 30B active~17 GB17 GB27.411250.0 msRX 7900 XT 20 GB or better

The first big takeaway is obvious. Gemma-4-26B-A4B is the throughput winner in this benchmark lab for local models. The second takeaway is more interesting: Nemotron-Nano-Omni-30B has the best TTFT of the whole tested set. The third is practical: Qwen3.6-27B is much smaller on disk than the 35B class, but its TTFT and throughput do not beat the best MoE results in this setup.

Benchmark lab for local models on AMD W7900 - throughput chart for tested 27B to 35B models
Chart 1: Composite throughput in the benchmark lab for local models. Gemma-4-26B-A4B leads on raw speed, while Qwen3.6-35B MTP shows how large the MTP uplift can be when the backend supports it well.

Methodology of the completed tests

Every finished result in this article comes from the same measurement loop. Each model gets five prompts, each prompt runs ten times and every run uses the same backend, context and quantization class. This matters because a local benchmark can look precise while still being unfair. I only compare numbers that were collected under the same rules.

The prompt pack mixes short and long outputs on purpose. T1 and T2 capture summary and reasoning. C1 and C2 represent debugging and generation. L1 checks long-form generation. The composite score then gives extra weight to the code prompts, because this series is aimed at developer workloads rather than pure chat.

How each result is measured

ElementMethod used in the test lab
Backendllama-server Vulkan on AMD PRO W7900
Context length8192 tokens for every completed run
QuantizationQ4_K_M for standard models in this article
Prompt count5 prompts per model
Repetitions10 runs per prompt
Main metricsComposite tok/s and average TTFT
Storage policyModel downloaded, tested, result saved, GGUF removed to recover disk space

This workflow gives two benefits. First, it keeps the comparison honest. Second, it prevents the disk from filling up during a long multi-model run. The downside is that you have to rely on saved JSON results after the test completes, because the model file itself may already be deleted.

How to read the results correctly

A higher tok/s result does not always mean a better user experience. Nemotron-Nano-Omni-30B is a good example. It does not win the throughput chart, but it is the quickest to first token. On the other hand, Granite-4.1-30B fits in memory, yet its latency is high enough to feel slower in real work. That is why the article keeps both metrics side by side.

There is also a hardware caveat. These numbers describe one specific setup: one Vulkan backend on one 48 GB GPU. Change the backend, context length or quantization and the ranking can move. MTP is especially sensitive to that. In some paths it is a major accelerator. In other paths it is only a small gain or a deployment complication.

VRAM after loading: what actually fits on home GPUs

This is the section many synthetic benchmarks skip. They compare models without saying which card can really hold them. In a benchmark lab for local models, that omission is fatal. A model that scores well but does not fit your GPU is not a candidate.

For the exact setup used here, 16 GB cards are not enough for any tested model. That includes RTX 4060 Ti 16 GB, RTX 4080 16 GB and RX 7900 GRE 16 GB. The floor for this class starts at 20 GB, and even there only the lighter 17-18 GB load class is comfortable.

GPUVRAMWhat it can run from this labComment
RTX 40608 GBNoneFar below the 27B-35B class used here
RTX 4070 / 4070 Super12 GBNoneGood for smaller models, not this benchmark set
RTX 4060 Ti / RTX 4080 / RX 7900 GRE16 GBNone at these settingsBorderline or insufficient for 17-18 GB load classes
RX 7900 XT20 GBGranite-4.1-30B, Qwen3.6-27B, Gemma-4-26B-A4BBest entry point for this exact article setup
RTX 4090 / RX 7900 XTX24 GBAll currently completed standard models, plus Qwen3.6-35B MTPThe practical sweet spot for 30B-35B local work
AMD PRO W790048 GBEverything in this article with comfortable headroomBest for repeatable benchmarking and future larger tests

There is one important caveat. Smaller cards can sometimes run these models with lower context, CPU offload or a more aggressive quant. But that is no longer the same benchmark lab for local models. It becomes a different test profile.

Dense vs MoE: what the charts say

The dense models in this set do not automatically lose. Gemma-4-31B and Granite-4.1-30B both fit under 20 GB VRAM. The problem is not loading them. The problem is latency and sustained speed. Granite is the clearest example: it runs, but it feels slow at 1250 ms TTFT.

MoE models tell a more mixed story. Gemma-4-26B-A4B is extremely fast. Qwen3.6-35B-A3B standard is solid but no longer dominant. Qwen3.6-35B-A3B MTP is the outlier that proves how strong speculative decoding can be when the runtime path is favorable.

TTFT vs throughput scatter plot for local 27B to 35B language models on AMD W7900
Chart 2: TTFT versus throughput. Nemotron-Nano-Omni-30B owns responsiveness, Gemma-4-26B-A4B owns raw speed, and Qwen3.6-27B sits in the middle as a smaller dense option.

Which model to choose for which job

Use caseBest current pickWhy
Fastest overall throughputGemma-4-26B-A4BHighest composite score in the whole lab
Best first-token responsivenessNemotron-Nano-Omni-30BLowest measured TTFT at 247.7 ms
Best MTP result so farQwen3.6-35B-A3B MTP82.71 tok/s with 24 GB load class
Smallest dense model in this test bandQwen3.6-27B15.4 GB measured on disk and 17 GB VRAM class
Best fit for a 20 GB cardGemma-4-26B-A4BFaster than the other models that fit the same VRAM class
Best fit for a 24 GB enthusiast cardQwen3.6-35B-A3B MTP or Gemma-4-26B-A4BChoose MTP for peak speedups, Gemma for simpler deployment

Methodology notes and caveats

All numbers in this benchmark lab for local models come from one backend and one machine. That is a strength because it makes the comparison fair. It is also a limitation because another backend can change the ranking. MTP is the best example. A good implementation can double speed. A weak path can add complexity without enough gain.

Exact disk size also depends on the publisher artifact. A Q4_K_M build from one source can differ from another by hundreds of megabytes or more. For that reason, the table above uses measured values where available and conservative approximations elsewhere.

If you want adjacent practical context, see the companion guides on Multi-Token Prediction on AMD W7900, running local LLMs with OpenAI-compatible APIs and installing llama.cpp on Windows.

Summary

The updated benchmark lab for local models now gives a cleaner answer than the earlier two-model draft. The raw throughput winner is Gemma-4-26B-A4B. The responsiveness winner is Nemotron-Nano-Omni-30B. The biggest acceleration result still belongs to Qwen3.6-35B-A3B MTP. Qwen3.6-27B adds a smaller dense option, but in this exact setup it does not take the performance crown.

The hardware conclusion is just as important as the model ranking. If you want to reproduce these results at home, 20 GB is the practical entry level and 24 GB is the real comfort tier for this 27B-35B class. That single fact is often more useful than another synthetic benchmark score.

Artur Poniedziałek
Artur Poniedziałek
IT Expert & Project Manager
🤖 AI ⚡ PM 🐍 Python 🖥️ Local AI

IT Expert & Project Manager with 15+ years of experience. Exploring practical AI applications — from local LLMs and RAG systems to workflow automation. Writing to share knowledge and inspire others to experiment with new technologies.

Leave a Reply

Your email address will not be published. Required fields are marked *