TurboQuant + MTP: Get 40.6 tok/s Out of Qwen3.6

Last week I had a perfectly functional setup: Qwen3.6 35B-A3B running at 28 tok/s with 262K context on a consumer RTX 3070. Most people would stop there. Most people are smarter than me.

Instead, I decided to rebase a 194-commit experimental CUDA fork onto a moving upstream target, resolve conflicts in flash attention kernels at 11 PM, and bolt on Multi-Token Prediction — all to answer one question: how fast can this thing actually go?

The answer: 40.6 tok/s. That's a 42% speedup. For free. Let me show you how.

📦 Code: lucad87/llama-cpp-turboquant — branch feature/turboquant-mtp. Fork of TheTom's turboquant rebased onto upstream llama.cpp with MTP support. Clone, build, run.

🤔 Wait, Weren't We Doing TurboQuant?

Yes. And we still are. Let me recap the layers of this lasagna:

Layer	What it does
TurboQuant (TheTom fork)	4-bit KV cache compression via custom CUDA kernels. Drops KV cache VRAM from ~6 GB to ~1.5 GB at 262K context.
auto-asymmetric	Detects GQA ratio and automatically upgrades K cache from `turbo4` to `q8_0` to prevent quality degradation on models with high head ratios (like Qwen's 8:1).
MTP (upstream llama.cpp)	Multi-Token Prediction. The model predicts 2-4 tokens at once instead of one, verifies them, and only keeps the correct ones. Native to Qwen3.6's architecture.

The problem: TurboQuant lived on a fork from April 2026. MTP was merged into upstream llama.cpp in May. They had never met. I needed to introduce them.

⚙️ The Hardware (Still the Same, Still Defying Physics)

Component	Spec
CPU	AMD Ryzen 7 5800X3D (8c/16t)
RAM	32 GB DDR4
GPU	NVIDIA GeForce RTX 3070 (8 GB VRAM)
CUDA	13.2
Compiler	Visual Studio 2026 Community (MSVC 14.50)
CMake	4.3.2

If you're new here: yes, running a 35B model on 8GB VRAM is absurd. The MoE architecture (3B active out of 35B total) plus aggressive offloading makes it possible. But adding speculative decoding on top? That's the part I wasn't sure about.

🧬 The Model: Qwen3.6 35B-A3B (Now With MTP Layers)

The model we're using is Qwen3.6 35B-A3B — a Mixture-of-Experts beast with 256 experts (8 active + 1 shared) and a native 262K context window. Architecturally, it uses Gated DeltaNet + Gated Attention layers interleaved with MoE feed-forward blocks.

Here's what matters for this article:

Property	Value
Total parameters	35B
Active parameters	~3B
Context window	262,144 tokens
Quantization	UD-Q4_K_M (Unsloth Dynamic 2.0)
File size	22.1 GB
MTP heads	Trained with multi-step prediction
MTP context VRAM	~1,004 MiB

The critical detail: Qwen3.6 was trained with MTP — the model literally has extra "next-token heads" that predict multiple future tokens simultaneously. But the original Unsloth GGUF from a month ago was converted before MTP support existed in llama.cpp. I needed the new Unsloth Dynamic 2.0 conversion with MTP layers baked in.

Unsloth released a separate MTP-enabled GGUF just 4 days ago: unsloth/Qwen3.6-35B-A3B-MTP-GGUF. This is the one you want. The regular Qwen3.6-35B-A3B-GGUF will throw context type MTP requested but model doesn't contain MTP layers and then crash. Ask me how I know.

🔀 The Merge: Rebase or Die Trying

The TheTom fork (feature/turboquant-kv-cache) was 435 commits behind upstream master and 194 commits ahead with its own turbo4 CUDA kernels. Merging was not going to be a git merge and a prayer.

You can grab the final result directly — no need to redo the rebase yourself:

powershell

git clone https://github.com/lucad87/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout feature/turboquant-mtp

But if you're curious how the sausage is made, here's the full autopsy:

Step 1: Add upstream remote

powershell

git remote add upstream https://github.com/ggml-org/llama.cpp
git fetch upstream master

Step 2: Rebase (this is where it gets sporty)

powershell

git checkout -b feature/turboquant-mtp
git rebase upstream/master

194 commits to replay. 22 conflicts. Here's the damage report:

File	Conflict type	Resolution
`ggml-metal/*.cpp` ×3	Metal (Apple) — irrelevant for CUDA	Kept upstream
`ggml-vulkan/*` ×4	Vulkan — irrelevant	Kept upstream
`vendors/hip.h`	HIP/ROCm macros	Kept turboquant (more complete)
`fattn.cu`	Flash attention dispatch	Both sides merged
`fattn-mma-f16.cuh`	Matrix multiply templates	Both sides merged
`ggml-cuda.cu`	Op support table	Both sides kept
`common/*.cpp/.h` ×2	Speculative decoding cherry-picks	Skipped (already upstream)
`perplexity.cpp`	Comment-only	Kept turboquant comment

The key skips: two commits were cherry-picks of speculative decoding from an older upstream. Since the new upstream already has MTP (which is speculative decoding on steroids), these became redundant.

Step 3: Build

powershell

cmake -S . -B build -G "Visual Studio 18 2026" -A x64 `
  -DGGML_CUDA=ON `
  -DCMAKE_CUDA_ARCHITECTURES=86 `
  -DGGML_CUDA_FA=ON `
  -DGGML_CUDA_GRAPHS=ON `
  -DGGML_CUDA_COMPRESSION_MODE=size

cmake --build build --config Release --target llama-server llama-cli -j 8

~10 minutes later:

llama-server.vcxproj -> llama-server.exe
llama-cli.vcxproj -> llama-cli.exe
ggml-cuda.vcxproj -> ggml-cuda.dll

No errors. No warnings that mattered. The turbo4 kernels, auto-asymmetric, and MTP speculative decoding all living in the same binary. I almost didn't believe it.

🚀 Launch: MTP + TurboQuant, Together at Last

powershell

C:\llama-turboquant-thetom\build\bin\Release\llama-server.exe `
  -m "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" `
  --host 127.0.0.1 --port 8080 `
  -fa auto -ngl 999 --n-cpu-moe 248 `
  --no-mmap --mlock `
  --cache-type-k turbo4 --cache-type-v turbo4 `
  --ctx-size 262144 `
  --spec-type draft-mtp `
  --temp 0.1 --repeat-penalty 1.05 --threads 16 -b 512

Flag breakdown: what's new

Flag	Value	Why
`--spec-type`	`draft-mtp`	Enable native MTP speculative decoding using the model's own prediction heads
`--n-cpu-moe`	`248`	256 experts total, 8 active. Keep 248 on CPU pinned RAM, 8 on GPU.

Flag breakdown: the turboquant classics

Flag	What it does	Why you need it
`-ngl 999`	All layers to GPU	"Just do it"
`--n-cpu-moe 248`	248 experts on CPU	Only 8 active experts live on GPU (~2 GB). The other 248 sit in pinned RAM at PCIe speed.
`--no-mmap --mlock`	No memory map, pin memory	Required with `--n-cpu-moe`. Prevents the OS from swapping 14 GB of expert weights to disk.
`--cache-type-k/v turbo4`	4-bit KV cache	~1.5 GB KV cache at 262K instead of ~6 GB.
`-fa auto`	Flash Attention	Forces efficient attention computation.

What happened at load time

The log tells the story:

auto-asymmetric: GQA ratio 8:1 — upgrading K from turbo4 to q8_0
warming up the model with an empty run...
creating MTP draft context against the target model...
[spec] estimated memory usage of MTP context is 1004.02 MiB
server is listening on http://127.0.0.1:8080

Three things of note:

auto-asymmetric still works. K→q8_0, V→turbo4. Qwen's 8:1 attention head ratio would corrupt output with pure turbo4, but the auto-upgrade saves us transparently.
MTP context created successfully. The 1,004 MiB is the cost of the speculative decoding infrastructure — worth every byte.
No mmproj this time. With MTP adding ~1 GB, the vision encoder (another ~1.7 GB) would OOM. Trade-off: text-only but faster, or vision but slower. I chose speed.

📊 Benchmark: The Numbers That Made Me Grin

Same prompt, same model, same hardware. The only variable: --spec-type draft-mtp.

The prompt

"Write a Python function that computes prime numbers using Sieve of Eratosthenes. Output only code, no explanation."

Results

Metric	Without MTP	With MTP	Delta
Generation speed	28.6 tok/s	40.6 tok/s	+42%
Time for 127 tokens	4,444 ms	3,128 ms	-30%
Draft tokens generated	—	102	—
Draft tokens accepted	—	92	90.2%
Output correctness	✅	✅	Identical

Ninety percent draft acceptance. That's the number that matters. MTP generates candidate tokens speculatively and verifies them against the main model. At 90% acceptance, you're getting nearly 2x throughput with zero quality loss — because the tokens that get rejected are simply regenerated by the main model, and the ones that pass are identical to what the model would have produced anyway.

What 42% means in practice

At 28 tok/s, generating a 1,000-token response takes 35 seconds. At 40 tok/s, it takes 25 seconds. Over a coding session with 50 requests, that's 8 minutes saved. Over a month? Hours.

And remember: this is still a 35B MoE model with 262K context running on an 8 GB consumer GPU. The fact that we're discussing 40+ tok/s at all is absurd.

🐛 Troubleshooting (The "I Did This So You Don't Have To" Section)

`context type MTP requested but model doesn't contain MTP layers`

Cause: You're using the old Unsloth GGUF (the one without -MTP in the filename).
Fix: Download the MTP-specific GGUF from unsloth/Qwen3.6-35B-A3B-MTP-GGUF. The model card explicitly says "MTP: trained with multi-steps" — but the GGUF needs to include those layers during conversion, and the regular Unsloth quants from a month ago don't.

`cudaMalloc failed: out of memory` with `--mmproj`

Cause: MTP context + vision encoder = too much VRAM.
Fix: Drop --mmproj (saves ~1.7 GB). If you absolutely need vision, reduce --ctx-size to ~196K.
Alternative: Use the non-MTP server for vision tasks and the MTP server for coding. Two servers, different ports, best of both worlds.

Rebase conflicts in speculative decoding files

Cause: The TheTom fork had cherry-picked older speculative decoding code that now conflicts with upstream's newer MTP implementation.
Fix: Skip those commits. Upstream has better speculative decoding now anyway. git rebase --skip is your friend.

Server takes 30+ seconds to load

That's normal. --no-mmap copies ~14 GB of expert weights into pinned CUDA host memory. The alternative (memory-mapped I/O) doesn't work with --n-cpu-moe. Go make coffee.

🔗 OpenCode Integration

If you're using this with OpenCode (and you should be), here's the provider config:

json
{
  "provider": {
    "turboquant-mtp": {
      "baseURL": "http://localhost:8080/v1",
      "apiKey": "none",
      "models": {
        "qwen36-35b-mtp": {
          "context": 262144,
          "output": 16384,
          "reasoning": true,
          "tool_call": true
        }
      }
    }
  }
}

The model supports thinking mode by default, which is great for agentic coding tasks. To disable thinking for quick answers, set "chat_template_kwargs": {"enable_thinking": false} in your API request.

🎉 Conclusion

What started as "can I bolt MTP onto TurboQuant" turned into:

✅ A successful rebase of 194 turboquant commits onto latest upstream
✅ MTP native speculative decoding with 90.2% draft acceptance
✅ 40.6 tok/s on an RTX 3070 — a 42% speedup over the previous 28.6 tok/s
✅ Full 262K context preserved (auto-asymmetric turbo4/q8_0 hybrid KV cache)
✅ All CUDA kernels intact (turbo4, flash attention, warp-cooperative MoE)

The evolution of this setup

Configuration	Context	Speed	Key innovation
Stock llama.cpp + q8_0	131K	27 tok/s	Baseline
TheTom fork + turbo4	262K	28 tok/s	4-bit KV cache, auto-asymmetric
TheTom fork + turbo4 + MTP	262K	40.6 tok/s	Native multi-token prediction

What's next?

The obvious missing piece is vision support with MTP. Right now it's either/or — MTP adds ~1 GB of VRAM, and the vision encoder needs another ~1.7 GB. With some memory optimization or reduced context, both should fit. That's a problem for next weekend.

The other frontier: MTP with --spec-draft-n-max tuned higher than the default. Qwen3.6 was trained with multi-step prediction — the theoretical limit might be higher than the current draft acceptance suggests. More benchmarks needed.

🛠️ Try it yourself:

lucad87/llama-cpp-turboquant — branch feature/turboquant-mtp

TurboQuant + MTP: Get 40.6 tok/s Out of Qwen3.6

🤔 Wait, Weren't We Doing TurboQuant?

⚙️ The Hardware (Still the Same, Still Defying Physics)

🧬 The Model: Qwen3.6 35B-A3B (Now With MTP Layers)

🔀 The Merge: Rebase or Die Trying

Step 1: Add upstream remote

Step 2: Rebase (this is where it gets sporty)

Step 3: Build

🚀 Launch: MTP + TurboQuant, Together at Last

Flag breakdown: what's new

Flag breakdown: the turboquant classics

What happened at load time

📊 Benchmark: The Numbers That Made Me Grin

The prompt

Results

What 42% means in practice

🐛 Troubleshooting (The "I Did This So You Don't Have To" Section)

`context type MTP requested but model doesn't contain MTP layers`

`cudaMalloc failed: out of memory` with `--mmproj`

Rebase conflicts in speculative decoding files

Server takes 30+ seconds to load

🔗 OpenCode Integration

🎉 Conclusion

The evolution of this setup

What's next?

Luca

Keep Reading

Taming a 26B MoE Model on 8GB VRAM with TurboQuant

Deploying Qwen3-Coder-30B-A3B on 8GB GPU with Docker

🤔 Wait, Weren't We Doing TurboQuant?

⚙️ The Hardware (Still the Same, Still Defying Physics)

🧬 The Model: Qwen3.6 35B-A3B (Now With MTP Layers)

🔀 The Merge: Rebase or Die Trying

Step 1: Add upstream remote

Step 2: Rebase (this is where it gets sporty)

Step 3: Build

🚀 Launch: MTP + TurboQuant, Together at Last

Flag breakdown: what's new

Flag breakdown: the turboquant classics

What happened at load time

📊 Benchmark: The Numbers That Made Me Grin

The prompt

Results

What 42% means in practice

🐛 Troubleshooting (The "I Did This So You Don't Have To" Section)

context type MTP requested but model doesn't contain MTP layers

cudaMalloc failed: out of memory with --mmproj

Rebase conflicts in speculative decoding files

Server takes 30+ seconds to load

🔗 OpenCode Integration

🎉 Conclusion

The evolution of this setup

What's next?

Luca

Keep Reading

Taming a 26B MoE Model on 8GB VRAM with TurboQuant

Deploying Qwen3-Coder-30B-A3B on 8GB GPU with Docker

`context type MTP requested but model doesn't contain MTP layers`

`cudaMalloc failed: out of memory` with `--mmproj`