
Multi-Token Prediction on AMD W7900: 1.85× speedup with llama-server Vulkan (Qwen3.6-35B benchmark)

✅ REAL DATA — All benchmark results in this article come from actual measurements on AMD PRO W7900 48 GB using llama-server (llama.cpp) with Vulkan backend. Baseline vs MTP comparison, both using identical server configuration. 10 runs per prompt (short/medium), 5 runs (long). Scripts: github.com/bestin-it/lm-studio-benchmark-lab
LM Studio 0.4.14 — model settings panel with the Multi-Token Prediction option. Photo: Pixabay / CC0.

Multi-Token Prediction (MTP) on AMD Radeon PRO W7900 — real numbers, real hardware. Qwen3.6-35B-A3B is one of the first openly available models with a built-in MTP prediction head in the GGUF format. This article presents a controlled benchmark: the same model, the same llama.cpp server (Vulkan backend), two configurations — standard decoding vs. MTP speculative decoding with --spec-type draft-mtp --spec-draft-n-max 4.

What is Multi-Token Prediction

Standard autoregressive decoding generates one token at a time. Each step requires a full forward pass through all model layers. With a 35B-parameter model, even on a 48 GB GPU, this sequential process limits throughput: the baseline measured below reaches about 50 tok/s.

MTP (Multi-Token Prediction) changes this approach. Qwen3 models, including Qwen3.6-35B-A3B, have a built-in additional prediction head that tries to predict several consecutive tokens at once. The llama.cpp implementation uses this head as a lightweight draft model for speculative decoding: “guess K tokens ahead, verify them with the main model in a single forward pass”.

The key advantage over classic speculative decoding: you don’t need a separate small draft model. The MTP head is part of the GGUF file and loads together with the main model.
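The draft-and-verify loop can be sketched in a few lines of Python. This is a toy illustration of greedy speculative decoding, not llama.cpp's implementation: `toy_draft` and `toy_target` are hypothetical stand-ins for the MTP head and the main model, and verification is written as a loop where a real engine would use one batched forward pass.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def speculative_step(context: List[Token], draft: NextToken,
                     target: NextToken, k: int) -> List[Token]:
    """Return the tokens emitted by one draft-and-verify step (greedy case)."""
    # 1. The cheap draft head proposes k tokens autoregressively.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target model checks each proposed position in order.
    emitted: List[Token] = []
    for tok in proposed:
        expected = target(context + emitted)
        if tok != expected:
            # First mismatch: keep the target's own token and stop.
            emitted.append(expected)
            return emitted
        emitted.append(tok)

    # 3. All k drafts were accepted, so the target adds one bonus token.
    emitted.append(target(context + emitted))
    return emitted


# Hypothetical toy models: the target counts upward modulo 10,
# the draft agrees except that it guesses wrong right after a 3.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


print(speculative_step([0], toy_draft, toy_target, k=4))
```

With `k=4` and context `[0]`, the draft proposes 1, 2, 3, 0. The target accepts the first three and replaces the fourth with 4, so the step emits four tokens for a single verification pass. The result matches what plain greedy decoding with the target alone would produce.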

Test environment and configuration

All tests ran on a single machine:

  • GPU: AMD Radeon PRO W7900 48 GB VRAM (RDNA3, gfx1100)
  • OS: Windows 11 Pro 24H2
  • Runtime: llama-server (llama.cpp build b5437, MSVC x64, Vulkan backend)
  • Model: Qwen3.6-35B-A3B-UD-Q4_K_S.gguf — Q4_K_S quantization, MTP variant (19.92 GB)
  • Context: 8192 tokens, parallel slots: 1
  • Baseline flags: -ngl 99 --ctx-size 8192 -np 1 -fit off --no-warmup
  • MTP flags: as above + --spec-type draft-mtp --spec-draft-n-max 4

The only difference between the two runs is the presence of --spec-type draft-mtp --spec-draft-n-max 4. Everything else is identical — same model file, same context, same hardware, same benchmark script.

Each test uses 3 prompt categories:

  • short_code — code generation task, 200 max tokens, 10 runs
  • medium_reasoning — multi-step reasoning, 400 max tokens, 10 runs
  • long_generation — technical document generation, 800 max tokens, 5 runs

Benchmark results: throughput (tok/s)

Chart 1: Throughput (tok/s) — baseline (standard decoding) vs MTP (speculative decoding with draft-mtp, spec-draft-n-max=4). AMD PRO W7900 48 GB, llama-server Vulkan gfx1100, Qwen3.6-35B-A3B Q4_K_S.
| Prompt type | Baseline tok/s (avg ± σ) | MTP tok/s (avg ± σ) | Speedup |
|---|---|---|---|
| short_code (code generation, 200 tokens) | 49.7 ± 0.1 | 99.4 ± 6.2 | 2.00× |
| medium_reasoning (multi-step, 400 tokens) | 49.5 ± 0.1 | 92.3 ± 19.6 | 1.86× |
| long_generation (technical doc, 800 tokens) | 49.9 ± 0.4 | 83.7 ± 4.6 | 1.68× |
| Overall average | 49.7 | 91.8 | 1.85× |

MTP delivers a consistent and significant throughput gain across all three prompt categories. The highest speedup appears on code generation (2.00×) — which makes sense, as code has repetitive patterns (indentation, keywords, variable names) that the MTP prediction head handles well. Long document generation still shows a respectable 1.68× speedup.
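The speedup column can be reproduced from the averages in the table. The short sketch below only redoes that arithmetic; the variable and function names are illustrative and are not taken from the benchmark scripts.

```python
# Average tok/s values copied from the throughput table: (baseline, MTP).
results = {
    "short_code": (49.7, 99.4),
    "medium_reasoning": (49.5, 92.3),
    "long_generation": (49.9, 83.7),
}


def speedup(baseline_tok_s: float, mtp_tok_s: float) -> float:
    """Throughput ratio of the MTP run over the baseline run."""
    return mtp_tok_s / baseline_tok_s


for name, (baseline, mtp) in results.items():
    print(f"{name}: {speedup(baseline, mtp):.2f}x")

# Overall figure: ratio of the two column averages.
overall_baseline = sum(b for b, _ in results.values()) / len(results)
overall_mtp = sum(m for _, m in results.values()) / len(results)
print(f"overall: {speedup(overall_baseline, overall_mtp):.2f}x")
```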

Benchmark results: TTFT (Time to First Token)

Chart 2: Time to First Token (ms) per prompt type — baseline vs MTP. TTFT is virtually identical between both configurations.
| Prompt type | Baseline TTFT (avg) | MTP TTFT (avg) |
|---|---|---|
| short_code | 2278 ms | 2313 ms |
| medium_reasoning | 2305 ms | 2368 ms |
| long_generation | 2734 ms | 2422 ms |
| Overall average | 2439 ms | 2368 ms |

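TTFT is nearly identical between the baseline and MTP configurations for short_code and medium_reasoning (within about 65 ms). The larger gap on long_generation (2734 ms vs 2422 ms) rests on only 5 runs per configuration and probably reflects run-to-run noise rather than an MTP benefit. Loading the MTP head adds no measurable delay before the first token. MTP accelerates generation only after the first token is produced, so for latency-sensitive applications (real-time chat) the improvement is entirely in throughput, not TTFT.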

How to enable MTP in llama-server

To use MTP in llama-server (llama.cpp), you need:

  1. A GGUF model that includes an MTP head — for Qwen3, the -MTP- variants (e.g., Qwen3.6-35B-A3B-MTP-GGUF). Standard Qwen3 GGUFs without the MTP variant do not contain the prediction head.
  2. A recent llama.cpp build (b5437 or newer) with the --spec-type flag available.

Start the server with these additional flags:

llama-server.exe ^
  --model Qwen3.6-35B-A3B-UD-Q4_K_S.gguf ^
  -ngl 99 --ctx-size 8192 --port 8080 ^
  --spec-type draft-mtp ^
  --spec-draft-n-max 4

The --spec-draft-n-max 4 parameter controls how many tokens the MTP head tries to predict speculatively in each step. Values of 3–5 typically work well; higher values may decrease acceptance rate and not improve throughput.

Note on --no-warmup: When using --no-warmup, the server accepts HTTP connections before GPU kernels are fully compiled. The very first inference request will cause a connection reset. Always send one non-streaming warmup request ({"prompt":"Hi","n_predict":5,"stream":false}) before starting a benchmark or production workload.
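A warmup call like this could be scripted with the Python standard library. The sketch below assumes the server listens on 127.0.0.1:8080 (the --port value from the command above) and that the payload is posted to llama-server's /completion endpoint. The retry count and delay are arbitrary illustrative values, not settings taken from the benchmark scripts.

```python
import json
import time
import urllib.request

# Assumption: local server on the port used in the start command above.
SERVER_URL = "http://127.0.0.1:8080/completion"


def build_warmup_request(url: str = SERVER_URL) -> urllib.request.Request:
    """Build the non-streaming warmup request described in the note."""
    payload = {"prompt": "Hi", "n_predict": 5, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def warm_up(url: str = SERVER_URL, attempts: int = 5,
            delay_s: float = 2.0) -> bool:
    """Send the warmup request, retrying if the connection is reset.

    Returns True once a request completes, False if every attempt fails.
    """
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(build_warmup_request(url),
                                        timeout=120) as response:
                response.read()  # discard the generated text
                return True
        except OSError:
            # Covers connection resets, refused connections, timeouts
            # and HTTP error statuses (URLError is an OSError subclass).
            time.sleep(delay_s)
    return False
```

Calling `warm_up()` once before the first timed run keeps the connection reset out of the measurements. A `False` return means the server never answered, and the benchmark should not start.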

Why code generation benefits most from MTP

The 2.00× speedup on code generation is the most striking result. The explanation lies in the entropy of the token distribution:

  • Low entropy = high acceptance rate. Code has predictable patterns: closing braces follow opening braces, return statements end functions, variable names repeat. The MTP head predicts these with high accuracy, so most speculative tokens pass verification by the main model and are accepted.
  • High entropy = low acceptance rate. Creative text, dialog, complex reasoning — here the next token is less predictable. Speculative tokens are rejected more often, and the overhead of the draft-verify cycle reduces the effective speedup.

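This is consistent with medium_reasoning (1.86×) and long_generation (1.68×) showing lower speedups than short_code (2.00×): they involve less predictable token sequences. Acceptance rates are not reported here, so the explanation is an inference from throughput alone.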
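The link between acceptance rate and output per step can be made concrete with a simplified model. The sketch below assumes every draft token is accepted independently with the same probability `p`, which real text does not satisfy, and counts the one token that the main model always contributes. It gives tokens per verification pass, not wall-clock speedup, because drafting and verification have their own cost.

```python
def expected_tokens_per_step(p: float, k: int) -> float:
    """Expected tokens emitted per verification pass.

    p: probability that a single draft token is accepted (assumed i.i.d.)
    k: number of draft tokens per step (the --spec-draft-n-max value)
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    # Draft i is emitted only if all drafts before it were accepted, and the
    # main model always adds one token: 1 + p + p^2 + ... + p^k.
    return sum(p ** i for i in range(k + 1))


for p in (0.5, 0.7, 0.9):
    print(f"p={p}: {expected_tokens_per_step(p, 4):.2f} tokens per step")
```

Under this model, raising `k` adds less and less once `p` is low, because later drafts are rarely reached. That fits the observation that larger --spec-draft-n-max values may not improve throughput.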

When MTP helps and when it doesn’t

  • Code generation — highest gain, up to 2×. Strongly recommended.
  • Document/report generation — consistent 1.68× speedup. Worth enabling for batch tasks.
  • Multi-step reasoning — 1.86× on average, but higher variance (σ = 19.6 tok/s). MTP acceptance rate fluctuates with reasoning complexity.
  • Interactive chat / real-time — TTFT (~2300–2400 ms) dominates total response time for short exchanges. The MTP throughput gain is still present but matters less the shorter the reply.
  • Short Q&A (<100 tokens) — TTFT dominates; throughput gain has minimal practical impact.
  • Models without MTP head — most models outside Qwen3 and DeepSeek-V3 families have no MTP GGUF. The flag will be ignored or cause an error.
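A rough end-to-end estimate shows why short answers gain less than the throughput figures suggest. The sketch below assumes total response time is TTFT plus the token count divided by generation throughput, and it applies the overall averages from the tables above to every response length. Both are simplifications, and the function and constant names are illustrative.

```python
def response_time_s(ttft_ms: float, n_tokens: int, tok_per_s: float) -> float:
    """Estimated time until the full response has been generated."""
    return ttft_ms / 1000.0 + n_tokens / tok_per_s


# Overall averages from the TTFT and throughput tables.
BASELINE = {"ttft_ms": 2439, "tok_per_s": 49.7}
MTP = {"ttft_ms": 2368, "tok_per_s": 91.8}

for n_tokens in (50, 100, 400, 800):
    base = response_time_s(BASELINE["ttft_ms"], n_tokens, BASELINE["tok_per_s"])
    mtp = response_time_s(MTP["ttft_ms"], n_tokens, MTP["tok_per_s"])
    print(f"{n_tokens:>4} tokens: {base:.2f} s vs {mtp:.2f} s "
          f"({base / mtp:.2f}x end to end)")
```

For a 100-token reply the estimate is roughly 4.5 s without MTP and 3.5 s with it, so the end-to-end gain is well below the 1.85× throughput figure. At 800 tokens, generation time outweighs TTFT and the end-to-end ratio moves closer to the throughput speedup.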

LM Studio vs llama-server on AMD hardware

LM Studio 0.4.14 includes MTP as a GUI option. However, as of testing, enabling MTP on the qwen3.6-35b-a3b-mtp variant in LM Studio 0.4.14 Build 4 resulted in model load failures on AMD gfx1100 (Error: Failed to load model). The feature works correctly in llama-server with Vulkan backend when the correct flags are passed.

If you want MTP to work on AMD hardware today:

  1. Use llama-server (llama.cpp) with Vulkan backend, not LM Studio.
  2. Use the -MTP-GGUF model variant (contains the prediction head).
  3. Pass --spec-type draft-mtp --spec-draft-n-max 4.

Summary and recommendations

MTP in llama.cpp with Vulkan on AMD PRO W7900 delivers a real 1.7–2.0× throughput increase with no meaningful TTFT penalty. The gain is consistent across all tested prompt types, with the highest benefit for code generation tasks.

| Scenario | MTP recommendation | Measured speedup |
|---|---|---|
| Code generation (long blocks) | ✅ Enable | 2.00× |
| Document batch processing | ✅ Enable | 1.68× |
| Multi-step reasoning | ✅ Enable (higher variance) | 1.86× avg |
| Interactive chat / real-time | ⚠️ Optional | ~1.85× throughput, TTFT unchanged |
| Short Q&A (<100 tokens) | ❌ Standard decoding | TTFT dominates; gain minimal |
| Models without MTP head | ❌ N/A | Flag ignored or error |

Next article in the series: How to build an independent benchmark lab for local models on a single 48 GB GPU — complete methodology and scripts.

Artur Poniedziałek
IT Expert & Project Manager

IT Expert & Project Manager with 15+ years of experience. Exploring practical AI applications — from local LLMs and RAG systems to workflow automation. Writing to share knowledge and inspire others to experiment with new technologies.
