Published May 25, 2026• Updated Jun 9, 2026

TurboQuant + MTP: Get 40.6 tok/s Out of Qwen3.6

How to build llama.cpp w/ TurboQuant and MTP and use it on consumer HW - 32GB RAM + 8GB VRAM GPU.

TurboQuant + MTP: Get 40.6 tok/s Out of Qwen3.6

Last week I had a perfectly functional setup: Qwen3.6 35B-A3B running at 28 tok/s with 262K context on a consumer RTX 3070. Most people would stop there. Most people are smarter than me.

Instead, I decided to rebase a 194-commit experimental CUDA fork onto a moving upstream target, resolve conflicts in flash attention kernels at 11 PM, and bolt on Multi-Token Prediction — all to answer one question: how fast can this thing actually go?

The answer: 40.6 tok/s. That's a 42% speedup. For free. Let me show you how.

📦 Code: lucad87/llama-cpp-turboquant — branch feature/turboquant-mtp. Fork of TheTom's turboquant rebased onto upstream llama.cpp with MTP support. Clone, build, run.


🤔 Wait, Weren't We Doing TurboQuant?

Yes. And we still are. Let me recap the layers of this lasagna:

LayerWhat it does
TurboQuant (TheTom fork)4-bit KV cache compression via custom CUDA kernels. Drops KV cache VRAM from ~6 GB to ~1.5 GB at 262K context.
auto-asymmetricDetects GQA ratio and automatically upgrades K cache from turbo4 to q8_0 to prevent quality degradation on models with high head ratios (like Qwen's 8:1).
MTP (upstream llama.cpp)Multi-Token Prediction. The model predicts 2-4 tokens at once instead of one, verifies them, and only keeps the correct ones. Native to Qwen3.6's architecture.

The problem: TurboQuant lived on a fork from April 2026. MTP was merged into upstream llama.cpp in May. They had never met. I needed to introduce them.


⚙️ The Hardware (Still the Same, Still Defying Physics)

ComponentSpec
CPUAMD Ryzen 7 5800X3D (8c/16t)
RAM32 GB DDR4
GPUNVIDIA GeForce RTX 3070 (8 GB VRAM)
CUDA13.2
CompilerVisual Studio 2026 Community (MSVC 14.50)
CMake4.3.2

If you're new here: yes, running a 35B model on 8GB VRAM is absurd. The MoE architecture (3B active out of 35B total) plus aggressive offloading makes it possible. But adding speculative decoding on top? That's the part I wasn't sure about.


🧬 The Model: Qwen3.6 35B-A3B (Now With MTP Layers)

The model we're using is Qwen3.6 35B-A3B — a Mixture-of-Experts beast with 256 experts (8 active + 1 shared) and a native 262K context window. Architecturally, it uses Gated DeltaNet + Gated Attention layers interleaved with MoE feed-forward blocks.

Here's what matters for this article:

PropertyValue
Total parameters35B
Active parameters~3B
Context window262,144 tokens
QuantizationUD-Q4_K_M (Unsloth Dynamic 2.0)
File size22.1 GB
MTP headsTrained with multi-step prediction
MTP context VRAM~1,004 MiB

The critical detail: Qwen3.6 was trained with MTP — the model literally has extra "next-token heads" that predict multiple future tokens simultaneously. But the original Unsloth GGUF from a month ago was converted before MTP support existed in llama.cpp. I needed the new Unsloth Dynamic 2.0 conversion with MTP layers baked in.

Unsloth released a separate MTP-enabled GGUF just 4 days ago: unsloth/Qwen3.6-35B-A3B-MTP-GGUF. This is the one you want. The regular Qwen3.6-35B-A3B-GGUF will throw context type MTP requested but model doesn't contain MTP layers and then crash. Ask me how I know.


🔀 The Merge: Rebase or Die Trying

The TheTom fork (feature/turboquant-kv-cache) was 435 commits behind upstream master and 194 commits ahead with its own turbo4 CUDA kernels. Merging was not going to be a git merge and a prayer.

You can grab the final result directly — no need to redo the rebase yourself:

powershell
git clone https://github.com/lucad87/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout feature/turboquant-mtp

But if you're curious how the sausage is made, here's the full autopsy:

Step 1: Add upstream remote

powershell
git remote add upstream https://github.com/ggml-org/llama.cpp
git fetch upstream master

Step 2: Rebase (this is where it gets sporty)

powershell
git checkout -b feature/turboquant-mtp
git rebase upstream/master

194 commits to replay. 22 conflicts. Here's the damage report:

FileConflict typeResolution
ggml-metal/*.cpp ×3Metal (Apple) — irrelevant for CUDAKept upstream
ggml-vulkan/* ×4Vulkan — irrelevantKept upstream
vendors/hip.hHIP/ROCm macrosKept turboquant (more complete)
fattn.cuFlash attention dispatchBoth sides merged
fattn-mma-f16.cuhMatrix multiply templatesBoth sides merged
ggml-cuda.cuOp support tableBoth sides kept
common/*.cpp/.h ×2Speculative decoding cherry-picksSkipped (already upstream)
perplexity.cppComment-onlyKept turboquant comment

The key skips: two commits were cherry-picks of speculative decoding from an older upstream. Since the new upstream already has MTP (which is speculative decoding on steroids), these became redundant.

Step 3: Build

powershell
cmake -S . -B build -G "Visual Studio 18 2026" -A x64 `
  -DGGML_CUDA=ON `
  -DCMAKE_CUDA_ARCHITECTURES=86 `
  -DGGML_CUDA_FA=ON `
  -DGGML_CUDA_GRAPHS=ON `
  -DGGML_CUDA_COMPRESSION_MODE=size

cmake --build build --config Release --target llama-server llama-cli -j 8

~10 minutes later:

llama-server.vcxproj -> llama-server.exe
llama-cli.vcxproj -> llama-cli.exe
ggml-cuda.vcxproj -> ggml-cuda.dll

No errors. No warnings that mattered. The turbo4 kernels, auto-asymmetric, and MTP speculative decoding all living in the same binary. I almost didn't believe it.


🚀 Launch: MTP + TurboQuant, Together at Last

powershell
C:\llama-turboquant-thetom\build\bin\Release\llama-server.exe `
  -m "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" `
  --host 127.0.0.1 --port 8080 `
  -fa auto -ngl 999 --n-cpu-moe 248 `
  --no-mmap --mlock `
  --cache-type-k turbo4 --cache-type-v turbo4 `
  --ctx-size 262144 `
  --spec-type draft-mtp `
  --temp 0.1 --repeat-penalty 1.05 --threads 16 -b 512

Flag breakdown: what's new

FlagValueWhy
--spec-typedraft-mtpEnable native MTP speculative decoding using the model's own prediction heads
--n-cpu-moe248256 experts total, 8 active. Keep 248 on CPU pinned RAM, 8 on GPU.

Flag breakdown: the turboquant classics

FlagWhat it doesWhy you need it
-ngl 999All layers to GPU"Just do it"
--n-cpu-moe 248248 experts on CPUOnly 8 active experts live on GPU (~2 GB). The other 248 sit in pinned RAM at PCIe speed.
--no-mmap --mlockNo memory map, pin memoryRequired with --n-cpu-moe. Prevents the OS from swapping 14 GB of expert weights to disk.
--cache-type-k/v turbo44-bit KV cache~1.5 GB KV cache at 262K instead of ~6 GB.
-fa autoFlash AttentionForces efficient attention computation.

What happened at load time

The log tells the story:

auto-asymmetric: GQA ratio 8:1 — upgrading K from turbo4 to q8_0
warming up the model with an empty run...
creating MTP draft context against the target model...
[spec] estimated memory usage of MTP context is 1004.02 MiB
server is listening on http://127.0.0.1:8080

Three things of note:

  1. auto-asymmetric still works. K→q8_0, V→turbo4. Qwen's 8:1 attention head ratio would corrupt output with pure turbo4, but the auto-upgrade saves us transparently.

  2. MTP context created successfully. The 1,004 MiB is the cost of the speculative decoding infrastructure — worth every byte.

  3. No mmproj this time. With MTP adding ~1 GB, the vision encoder (another ~1.7 GB) would OOM. Trade-off: text-only but faster, or vision but slower. I chose speed.


📊 Benchmark: The Numbers That Made Me Grin

Same prompt, same model, same hardware. The only variable: --spec-type draft-mtp.

The prompt

"Write a Python function that computes prime numbers using Sieve of Eratosthenes. Output only code, no explanation."

Results

MetricWithout MTPWith MTPDelta
Generation speed28.6 tok/s40.6 tok/s+42%
Time for 127 tokens4,444 ms3,128 ms-30%
Draft tokens generated102
Draft tokens accepted9290.2%
Output correctnessIdentical

Ninety percent draft acceptance. That's the number that matters. MTP generates candidate tokens speculatively and verifies them against the main model. At 90% acceptance, you're getting nearly 2x throughput with zero quality loss — because the tokens that get rejected are simply regenerated by the main model, and the ones that pass are identical to what the model would have produced anyway.

What 42% means in practice

At 28 tok/s, generating a 1,000-token response takes 35 seconds. At 40 tok/s, it takes 25 seconds. Over a coding session with 50 requests, that's 8 minutes saved. Over a month? Hours.

And remember: this is still a 35B MoE model with 262K context running on an 8 GB consumer GPU. The fact that we're discussing 40+ tok/s at all is absurd.


🐛 Troubleshooting (The "I Did This So You Don't Have To" Section)

context type MTP requested but model doesn't contain MTP layers

Cause: You're using the old Unsloth GGUF (the one without -MTP in the filename).
Fix: Download the MTP-specific GGUF from unsloth/Qwen3.6-35B-A3B-MTP-GGUF. The model card explicitly says "MTP: trained with multi-steps" — but the GGUF needs to include those layers during conversion, and the regular Unsloth quants from a month ago don't.

cudaMalloc failed: out of memory with --mmproj

Cause: MTP context + vision encoder = too much VRAM.
Fix: Drop --mmproj (saves ~1.7 GB). If you absolutely need vision, reduce --ctx-size to ~196K.
Alternative: Use the non-MTP server for vision tasks and the MTP server for coding. Two servers, different ports, best of both worlds.

Rebase conflicts in speculative decoding files

Cause: The TheTom fork had cherry-picked older speculative decoding code that now conflicts with upstream's newer MTP implementation.
Fix: Skip those commits. Upstream has better speculative decoding now anyway. git rebase --skip is your friend.

Server takes 30+ seconds to load

That's normal. --no-mmap copies ~14 GB of expert weights into pinned CUDA host memory. The alternative (memory-mapped I/O) doesn't work with --n-cpu-moe. Go make coffee.


🔗 OpenCode Integration

If you're using this with OpenCode (and you should be), here's the provider config:

json
{
  "provider": {
    "turboquant-mtp": {
      "baseURL": "http://localhost:8080/v1",
      "apiKey": "none",
      "models": {
        "qwen36-35b-mtp": {
          "context": 262144,
          "output": 16384,
          "reasoning": true,
          "tool_call": true
        }
      }
    }
  }
}

The model supports thinking mode by default, which is great for agentic coding tasks. To disable thinking for quick answers, set "chat_template_kwargs": {"enable_thinking": false} in your API request.


🎉 Conclusion

What started as "can I bolt MTP onto TurboQuant" turned into:

  • ✅ A successful rebase of 194 turboquant commits onto latest upstream
  • ✅ MTP native speculative decoding with 90.2% draft acceptance
  • 40.6 tok/s on an RTX 3070 — a 42% speedup over the previous 28.6 tok/s
  • ✅ Full 262K context preserved (auto-asymmetric turbo4/q8_0 hybrid KV cache)
  • ✅ All CUDA kernels intact (turbo4, flash attention, warp-cooperative MoE)

The evolution of this setup

ConfigurationContextSpeedKey innovation
Stock llama.cpp + q8_0131K27 tok/sBaseline
TheTom fork + turbo4262K28 tok/s4-bit KV cache, auto-asymmetric
TheTom fork + turbo4 + MTP262K40.6 tok/sNative multi-token prediction

What's next?

The obvious missing piece is vision support with MTP. Right now it's either/or — MTP adds ~1 GB of VRAM, and the vision encoder needs another ~1.7 GB. With some memory optimization or reduced context, both should fit. That's a problem for next weekend.

The other frontier: MTP with --spec-draft-n-max tuned higher than the default. Qwen3.6 was trained with multi-step prediction — the theoretical limit might be higher than the current draft acceptance suggests. More benchmarks needed.


🛠️ Try it yourself:

lucad87/llama-cpp-turboquant — branch feature/turboquant-mtp

WRITTEN BY

Luca

Exploring the future of quality assurance and testing automation through deep technical insights.