
Multi-Token Prediction (MTP) on AMD Radeon PRO W7900: real numbers, real hardware
Qwen3.6-35B-A3B is one of the first openly available models with a built-in MTP prediction head in the GGUF format. This article presents a controlled benchmark: the same model and the same llama.cpp server (Vulkan backend) in two configurations, standard decoding vs. MTP speculative decoding with --spec-type draft-mtp --spec-draft-n-max 4.
What is Multi-Token Prediction
Standard autoregressive decoding generates one token at a time. Each step requires a full forward pass through all model layers. With a 35B-parameter model, even on a 48 GB GPU, this process is a bottleneck.
MTP (Multi-Token Prediction) changes this approach. Qwen3 models, including Qwen3.6-35B-A3B, have a built-in additional prediction head that tries to predict several consecutive tokens at once. The llama.cpp implementation uses this head as a lightweight draft model for speculative decoding: “guess K tokens ahead, verify them with the main model in a single forward pass”.
The key advantage over classic speculative decoding: you don’t need a separate small draft model. The MTP head is part of the GGUF file and loads together with the main model.
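The draft-verify cycle described above can be sketched with toy stand-ins for the two predictors. This is a minimal illustration with greedy acceptance, not llama.cpp's implementation: `toy_draft` and `toy_target` are hypothetical functions that play the roles of the MTP head and the main model, and a real engine verifies all drafted positions in one batched forward pass rather than in a Python loop.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def speculative_step(context: List[Token], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[Token]:
    """Run one draft-verify step and return the tokens that get committed."""
    # 1. Draft k tokens autoregressively with the cheap predictor.
    drafted: List[Token] = []
    for _ in range(k):
        drafted.append(draft_next(context + drafted))

    # 2. Verify against the main model. Here it is called once per position
    #    for clarity; a real engine scores all k positions in a single pass.
    committed: List[Token] = []
    for tok in drafted:
        expected = target_next(context + committed)
        if tok != expected:
            # First mismatch: keep the main model's token, drop the rest.
            committed.append(expected)
            return committed
        committed.append(tok)

    # 3. Every draft was accepted: the verification pass yields a bonus token.
    committed.append(target_next(context + committed))
    return committed


def generate(prompt: List[Token], draft_next: NextToken, target_next: NextToken,
             k: int, n_tokens: int) -> Tuple[List[Token], int]:
    """Generate n_tokens and report how many verification steps were needed."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
        steps += 1
    return out[len(prompt):len(prompt) + n_tokens], steps


# Hypothetical toy predictors: the "main model" counts modulo 10,
# the "draft head" agrees except right after a 4.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] % 5 == 4 else (ctx[-1] + 1) % 10
```

Calling `generate([0], toy_draft, toy_target, k=3, n_tokens=20)` returns the same tokens the toy main model would produce on its own, in fewer verification steps than tokens. That is the property that matters: with greedy acceptance the output is unchanged, and only the number of main-model passes drops.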
Test environment and configuration
All tests run on a single specific machine:
- GPU: AMD Radeon PRO W7900 48 GB VRAM (RDNA3, gfx1100)
- OS: Windows 11 Pro 24H2
- Runtime: llama-server (llama.cpp build b5437, MSVC x64, Vulkan backend)
- Model: Qwen3.6-35B-A3B-UD-Q4_K_S.gguf (Q4_K_S quantization, MTP variant, 19.92 GB)
- Context: 8192 tokens, parallel slots: 1
- Baseline flags: -ngl 99 --ctx-size 8192 -np 1 -fit off --no-warmup
- MTP flags: as above + --spec-type draft-mtp --spec-draft-n-max 4
The only difference between the two runs is the presence of --spec-type draft-mtp --spec-draft-n-max 4. Everything else is identical — same model file, same context, same hardware, same benchmark script.
Each test uses 3 prompt categories:
- short_code — code generation task, 200 max tokens, 10 runs
- medium_reasoning — multi-step reasoning, 400 max tokens, 10 runs
- long_generation — technical document generation, 800 max tokens, 5 runs
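The test plan above can be written down as data, which makes the workload per configuration explicit. This is a sketch only: the article's benchmark script is not shown here, the `PromptCategory` and `summarize` names are hypothetical, and using the sample standard deviation for σ is an assumption.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import List, Tuple


@dataclass(frozen=True)
class PromptCategory:
    name: str
    max_tokens: int
    runs: int


# Values taken from the list above.
CATEGORIES = [
    PromptCategory("short_code", 200, 10),
    PromptCategory("medium_reasoning", 400, 10),
    PromptCategory("long_generation", 800, 5),
]


def summarize(samples: List[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation), rounded to one decimal."""
    return round(mean(samples), 1), round(stdev(samples), 1)


# Workload per configuration (baseline and MTP each run the full plan).
total_requests = sum(c.runs for c in CATEGORIES)
max_generated_tokens = sum(c.runs * c.max_tokens for c in CATEGORIES)
```

Each configuration therefore issues 25 requests and generates at most 10,000 tokens. Note that long_generation has only 5 runs, so its averages rest on fewer samples than the other two categories.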
Benchmark results: throughput (tok/s)

| Prompt type | Baseline tok/s (avg ± σ) | MTP tok/s (avg ± σ) | Speedup |
|---|---|---|---|
| short_code (code generation, 200 tokens) | 49.7 ± 0.1 | 99.4 ± 6.2 | 2.00× |
| medium_reasoning (multi-step, 400 tokens) | 49.5 ± 0.1 | 92.3 ± 19.6 | 1.86× |
| long_generation (technical doc, 800 tokens) | 49.9 ± 0.4 | 83.7 ± 4.6 | 1.68× |
| Overall average | 49.7 tok/s | 91.8 tok/s | 1.85× |
MTP delivers a consistent and significant throughput gain across all three prompt categories. The highest speedup appears on code generation (2.00×) — which makes sense, as code has repetitive patterns (indentation, keywords, variable names) that the MTP prediction head handles well. Long document generation still shows a respectable 1.68× speedup.
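The Speedup column and the Overall average row can be reproduced from the per-category averages. The short sketch below assumes the overall row is an unweighted mean over the three categories, which is consistent with the published numbers; the function names are illustrative.

```python
from typing import Dict, Tuple

# (baseline tok/s, MTP tok/s) averages from the table above.
RESULTS: Dict[str, Tuple[float, float]] = {
    "short_code": (49.7, 99.4),
    "medium_reasoning": (49.5, 92.3),
    "long_generation": (49.9, 83.7),
}


def speedups(results: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Per-category speedup: MTP throughput divided by baseline throughput."""
    return {name: round(mtp / base, 2) for name, (base, mtp) in results.items()}


def overall(results: Dict[str, Tuple[float, float]]) -> Tuple[float, float, float]:
    """Unweighted averages across categories, plus the ratio of those averages."""
    base_avg = sum(base for base, _ in results.values()) / len(results)
    mtp_avg = sum(mtp for _, mtp in results.values()) / len(results)
    return round(base_avg, 1), round(mtp_avg, 1), round(mtp_avg / base_avg, 2)
```

Because the categories have different run counts and token budgets, a token-weighted average would give a somewhat different overall figure; the unweighted form is used here only because it matches the table.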
Benchmark results: TTFT (Time to First Token)

| Prompt type | Baseline TTFT (avg) | MTP TTFT (avg) |
|---|---|---|
| short_code | 2278 ms | 2313 ms |
| medium_reasoning | 2305 ms | 2368 ms |
| long_generation | 2734 ms | 2422 ms |
| Overall average | 2439 ms | 2368 ms |
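TTFT is nearly identical between the baseline and MTP configurations: the overall averages differ by about 3% (2439 ms vs. 2368 ms). The largest gap, 312 ms on long_generation, is based on only 5 runs per configuration, so it should not be read as an MTP advantage. The MTP head adds no measurable overhead to time to first token. MTP only accelerates generation after the first token is produced, so for latency-sensitive applications (real-time chat) the improvement is entirely in throughput, not TTFT.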
How to enable MTP in llama-server
To use MTP in llama-server (llama.cpp), you need:
- A GGUF model that includes an MTP head. For Qwen3, these are the -MTP- variants (e.g., Qwen3.6-35B-A3B-MTP-GGUF). Standard Qwen3 GGUFs without the MTP variant do not contain the prediction head.
- A recent llama.cpp build (b5437 or newer) with the --spec-type flag available.
Start the server with these additional flags:
llama-server.exe ^
--model Qwen3.6-35B-A3B-UD-Q4_K_S.gguf ^
-ngl 99 --ctx-size 8192 --port 8080 ^
--spec-type draft-mtp ^
--spec-draft-n-max 4
The --spec-draft-n-max 4 parameter controls how many tokens the MTP head tries to predict speculatively in each step. Values of 3–5 typically work well; higher values may decrease acceptance rate and not improve throughput.
Note: with --no-warmup, the server accepts HTTP connections before GPU kernels are fully compiled. The very first inference request will cause a connection reset. Always send one non-streaming warmup request ({"prompt":"Hi","n_predict":5,"stream":false}) before starting a benchmark or production workload.
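A warmup request of this kind could be sent with the Python standard library as sketched below. The payload is the one quoted above; the `/completion` path, the `127.0.0.1:8080` address, and the retry settings are assumptions, so adjust them to your server. The helper names are hypothetical.

```python
import json
import time
import urllib.error
import urllib.request


def build_warmup_request(base_url: str = "http://127.0.0.1:8080") -> urllib.request.Request:
    """Build the non-streaming warmup POST (endpoint path is an assumption)."""
    payload = {"prompt": "Hi", "n_predict": 5, "stream": False}
    return urllib.request.Request(
        f"{base_url}/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def warm_up(base_url: str = "http://127.0.0.1:8080", attempts: int = 3,
            timeout: float = 120.0, delay: float = 2.0) -> bool:
    """Send the warmup request, retrying if the connection is reset."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(build_warmup_request(base_url),
                                        timeout=timeout) as resp:
                return resp.status == 200
        except (urllib.error.URLError, ConnectionError):
            # Expected on the first try when kernels are still compiling.
            time.sleep(delay)
    return False
```

Call `warm_up()` once after the server starts and begin the benchmark only when it returns `True`, so the reset on the first request never lands inside a measured run.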
Why code generation benefits most from MTP
The 2.00× speedup on code generation is the most striking result. The likely explanation lies in the entropy of the token distribution:
- Low entropy = high acceptance rate. Code has predictable patterns: closing braces follow opening braces, return statements end functions, variable names repeat. The MTP head predicts these with high accuracy, so most speculative tokens pass verification and each forward pass of the main model commits several tokens.
- High entropy = low acceptance rate. Creative text, dialog, complex reasoning: here the next token is less predictable. Speculative tokens are rejected more often, and the overhead of the draft-verify cycle reduces the effective speedup.
This is why medium_reasoning (1.86×) and long_generation (1.68×) show lower speedups than short_code (2.00×) — they involve more unpredictable token sequences.
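The link between acceptance rate and throughput can be made concrete with the usual simplified model of speculative decoding. It assumes each drafted token is accepted independently with probability `alpha`, which real text does not satisfy exactly. The article does not report acceptance rates, so any `alpha` values you plug in are illustrative, not measured.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens committed per main-model pass.

    alpha: per-token acceptance probability (assumed independent).
    k:     number of drafted tokens per step (--spec-draft-n-max).

    Sums 1 + alpha + alpha^2 + ... + alpha^k: the main model always
    contributes one token, and the i-th draft survives only if all
    drafts before it were accepted too.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
```

With `k = 4`, an acceptance probability of 0.5 yields 1.9375 tokens per pass, while perfect acceptance yields 5. This counts tokens per verification pass, not wall-clock speedup: drafting and verifying extra positions have their own cost, so measured gains such as the 1.68× to 2.00× above sit below this upper bound.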
When MTP helps and when it doesn’t
- Code generation — highest gain, up to 2×. Strongly recommended.
- Document/report generation — consistent 1.68× speedup. Worth enabling for batch tasks.
- Multi-step reasoning — 1.86× on average, but higher variance (σ = 19.6 tok/s). MTP acceptance rate fluctuates with reasoning complexity.
- Interactive chat / real-time — TTFT (~2300–2400 ms) dominates total response time for short exchanges. MTP throughput gain still present.
- Short Q&A (<100 tokens) — TTFT dominates; throughput gain has minimal practical impact.
- Models without MTP head — most models outside Qwen3 and DeepSeek-V3 families have no MTP GGUF. The flag will be ignored or cause an error.
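To see how much TTFT dilutes the throughput gain for short responses, a rough end-to-end model is total time = TTFT + tokens / throughput. The sketch below plugs in the overall averages from the two results tables. It is an approximation: it ignores whether the first token is already counted inside TTFT and assumes throughput is constant over the response.

```python
# (TTFT in ms, throughput in tok/s): overall averages from the results tables.
BASELINE = (2439.0, 49.7)
MTP = (2368.0, 91.8)


def response_time_ms(ttft_ms: float, tok_per_s: float, n_tokens: int) -> float:
    """Approximate total response time: time to first token plus generation time."""
    return ttft_ms + 1000.0 * n_tokens / tok_per_s


def end_to_end_speedup(n_tokens: int) -> float:
    """Ratio of baseline to MTP total response time for a given output length."""
    return response_time_ms(*BASELINE, n_tokens) / response_time_ms(*MTP, n_tokens)


if __name__ == "__main__":
    for n in (20, 100, 400, 800):
        print(f"{n:4d} tokens: {end_to_end_speedup(n):.2f}x end to end")
```

Under this model the end-to-end speedup grows with output length and approaches the raw throughput ratio only for long generations, which is why the recommendation differs between short Q&A and batch document generation.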
LM Studio vs llama-server on AMD hardware
LM Studio 0.4.14 includes MTP as a GUI option. However, as of testing, enabling MTP on the qwen3.6-35b-a3b-mtp variant in LM Studio 0.4.14 Build 4 resulted in model load failures on AMD gfx1100 (Error: Failed to load model). The feature works correctly in llama-server with Vulkan backend when the correct flags are passed.
If you want MTP to work on AMD hardware today:
- Use llama-server (llama.cpp) with Vulkan backend, not LM Studio.
- Use the -MTP-GGUF model variant (contains the prediction head).
- Pass --spec-type draft-mtp --spec-draft-n-max 4.
Summary and recommendations
MTP in llama.cpp with Vulkan on AMD PRO W7900 delivers a real 1.7–2.0× throughput increase with no meaningful TTFT penalty. The gain is consistent across all tested prompt types, with the highest benefit for code generation tasks.
| Scenario | MTP recommendation | Measured speedup |
|---|---|---|
| Code generation (long blocks) | ✅ Enable | 2.00× |
| Document batch processing | ✅ Enable | 1.68× |
| Multi-step reasoning | ✅ Enable (higher variance) | 1.86× avg |
| Interactive chat / real-time | ⚠️ Optional | ~1.85× throughput, TTFT unchanged |
| Short Q&A (<100 tokens) | ❌ Standard decoding | TTFT dominates; gain minimal |
| Models without MTP head | ❌ N/A | Flag ignored or error |
Next article in the series: How to build an independent benchmark lab for local models on a single 48 GB GPU — complete methodology and scripts.