Last week I had a perfectly functional setup: Qwen3.6 35B-A3B running at 28 tok/s with 262K context on a consumer RTX 3070. Most people would stop there. Most people are smarter than me.
Instead, I decided to rebase a 194-commit experimental CUDA fork onto a moving upstream target, resolve conflicts in flash attention kernels at 11 PM, and bolt on Multi-Token Prediction — all to answer one question: how fast can this thing actually go?
The answer: 40.6 tok/s. That's a 42% speedup. For free. Let me show you how.
📦 Code:
lucad87/llama-cpp-turboquant— branchfeature/turboquant-mtp. Fork of TheTom's turboquant rebased onto upstream llama.cpp with MTP support. Clone, build, run.
🤔 Wait, Weren't We Doing TurboQuant?
Yes. And we still are. Let me recap the layers of this lasagna:
| Layer | What it does |
|---|---|
| TurboQuant (TheTom fork) | 4-bit KV cache compression via custom CUDA kernels. Drops KV cache VRAM from ~6 GB to ~1.5 GB at 262K context. |
| auto-asymmetric | Detects GQA ratio and automatically upgrades K cache from turbo4 to q8_0 to prevent quality degradation on models with high head ratios (like Qwen's 8:1). |
| MTP (upstream llama.cpp) | Multi-Token Prediction. The model predicts 2-4 tokens at once instead of one, verifies them, and only keeps the correct ones. Native to Qwen3.6's architecture. |
The problem: TurboQuant lived on a fork from April 2026. MTP was merged into upstream llama.cpp in May. They had never met. I needed to introduce them.
⚙️ The Hardware (Still the Same, Still Defying Physics)
| Component | Spec |
|---|---|
| CPU | AMD Ryzen 7 5800X3D (8c/16t) |
| RAM | 32 GB DDR4 |
| GPU | NVIDIA GeForce RTX 3070 (8 GB VRAM) |
| CUDA | 13.2 |
| Compiler | Visual Studio 2026 Community (MSVC 14.50) |
| CMake | 4.3.2 |
If you're new here: yes, running a 35B model on 8GB VRAM is absurd. The MoE architecture (3B active out of 35B total) plus aggressive offloading makes it possible. But adding speculative decoding on top? That's the part I wasn't sure about.
🧬 The Model: Qwen3.6 35B-A3B (Now With MTP Layers)
The model we're using is Qwen3.6 35B-A3B — a Mixture-of-Experts beast with 256 experts (8 active + 1 shared) and a native 262K context window. Architecturally, it uses Gated DeltaNet + Gated Attention layers interleaved with MoE feed-forward blocks.
Here's what matters for this article:
| Property | Value |
|---|---|
| Total parameters | 35B |
| Active parameters | ~3B |
| Context window | 262,144 tokens |
| Quantization | UD-Q4_K_M (Unsloth Dynamic 2.0) |
| File size | 22.1 GB |
| MTP heads | Trained with multi-step prediction |
| MTP context VRAM | ~1,004 MiB |
The critical detail: Qwen3.6 was trained with MTP — the model literally has extra "next-token heads" that predict multiple future tokens simultaneously. But the original Unsloth GGUF from a month ago was converted before MTP support existed in llama.cpp. I needed the new Unsloth Dynamic 2.0 conversion with MTP layers baked in.
Unsloth released a separate MTP-enabled GGUF just 4 days ago:
unsloth/Qwen3.6-35B-A3B-MTP-GGUF. This is the one you want. The regularQwen3.6-35B-A3B-GGUFwill throwcontext type MTP requested but model doesn't contain MTP layersand then crash. Ask me how I know.
🔀 The Merge: Rebase or Die Trying
The TheTom fork (feature/turboquant-kv-cache) was 435 commits behind upstream master and 194 commits ahead with its own turbo4 CUDA kernels. Merging was not going to be a git merge and a prayer.
You can grab the final result directly — no need to redo the rebase yourself:
git clone https://github.com/lucad87/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout feature/turboquant-mtp
But if you're curious how the sausage is made, here's the full autopsy:
Step 1: Add upstream remote
git remote add upstream https://github.com/ggml-org/llama.cpp
git fetch upstream master
Step 2: Rebase (this is where it gets sporty)
git checkout -b feature/turboquant-mtp
git rebase upstream/master
194 commits to replay. 22 conflicts. Here's the damage report:
| File | Conflict type | Resolution |
|---|---|---|
ggml-metal/*.cpp ×3 | Metal (Apple) — irrelevant for CUDA | Kept upstream |
ggml-vulkan/* ×4 | Vulkan — irrelevant | Kept upstream |
vendors/hip.h | HIP/ROCm macros | Kept turboquant (more complete) |
fattn.cu | Flash attention dispatch | Both sides merged |
fattn-mma-f16.cuh | Matrix multiply templates | Both sides merged |
ggml-cuda.cu | Op support table | Both sides kept |
common/*.cpp/.h ×2 | Speculative decoding cherry-picks | Skipped (already upstream) |
perplexity.cpp | Comment-only | Kept turboquant comment |
The key skips: two commits were cherry-picks of speculative decoding from an older upstream. Since the new upstream already has MTP (which is speculative decoding on steroids), these became redundant.
Step 3: Build
cmake -S . -B build -G "Visual Studio 18 2026" -A x64 `
-DGGML_CUDA=ON `
-DCMAKE_CUDA_ARCHITECTURES=86 `
-DGGML_CUDA_FA=ON `
-DGGML_CUDA_GRAPHS=ON `
-DGGML_CUDA_COMPRESSION_MODE=size
cmake --build build --config Release --target llama-server llama-cli -j 8
~10 minutes later:
llama-server.vcxproj -> llama-server.exe
llama-cli.vcxproj -> llama-cli.exe
ggml-cuda.vcxproj -> ggml-cuda.dll
No errors. No warnings that mattered. The turbo4 kernels, auto-asymmetric, and MTP speculative decoding all living in the same binary. I almost didn't believe it.
🚀 Launch: MTP + TurboQuant, Together at Last
C:\llama-turboquant-thetom\build\bin\Release\llama-server.exe `
-m "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" `
--host 127.0.0.1 --port 8080 `
-fa auto -ngl 999 --n-cpu-moe 248 `
--no-mmap --mlock `
--cache-type-k turbo4 --cache-type-v turbo4 `
--ctx-size 262144 `
--spec-type draft-mtp `
--temp 0.1 --repeat-penalty 1.05 --threads 16 -b 512
Flag breakdown: what's new
| Flag | Value | Why |
|---|---|---|
--spec-type | draft-mtp | Enable native MTP speculative decoding using the model's own prediction heads |
--n-cpu-moe | 248 | 256 experts total, 8 active. Keep 248 on CPU pinned RAM, 8 on GPU. |
Flag breakdown: the turboquant classics
| Flag | What it does | Why you need it |
|---|---|---|
-ngl 999 | All layers to GPU | "Just do it" |
--n-cpu-moe 248 | 248 experts on CPU | Only 8 active experts live on GPU (~2 GB). The other 248 sit in pinned RAM at PCIe speed. |
--no-mmap --mlock | No memory map, pin memory | Required with --n-cpu-moe. Prevents the OS from swapping 14 GB of expert weights to disk. |
--cache-type-k/v turbo4 | 4-bit KV cache | ~1.5 GB KV cache at 262K instead of ~6 GB. |
-fa auto | Flash Attention | Forces efficient attention computation. |
What happened at load time
The log tells the story:
auto-asymmetric: GQA ratio 8:1 — upgrading K from turbo4 to q8_0
warming up the model with an empty run...
creating MTP draft context against the target model...
[spec] estimated memory usage of MTP context is 1004.02 MiB
server is listening on http://127.0.0.1:8080
Three things of note:
-
auto-asymmetric still works. K→q8_0, V→turbo4. Qwen's 8:1 attention head ratio would corrupt output with pure turbo4, but the auto-upgrade saves us transparently.
-
MTP context created successfully. The 1,004 MiB is the cost of the speculative decoding infrastructure — worth every byte.
-
No mmproj this time. With MTP adding ~1 GB, the vision encoder (another ~1.7 GB) would OOM. Trade-off: text-only but faster, or vision but slower. I chose speed.
📊 Benchmark: The Numbers That Made Me Grin
Same prompt, same model, same hardware. The only variable: --spec-type draft-mtp.
The prompt
"Write a Python function that computes prime numbers using Sieve of Eratosthenes. Output only code, no explanation."
Results
| Metric | Without MTP | With MTP | Delta |
|---|---|---|---|
| Generation speed | 28.6 tok/s | 40.6 tok/s | +42% |
| Time for 127 tokens | 4,444 ms | 3,128 ms | -30% |
| Draft tokens generated | — | 102 | — |
| Draft tokens accepted | — | 92 | 90.2% |
| Output correctness | ✅ | ✅ | Identical |
Ninety percent draft acceptance. That's the number that matters. MTP generates candidate tokens speculatively and verifies them against the main model. At 90% acceptance, you're getting nearly 2x throughput with zero quality loss — because the tokens that get rejected are simply regenerated by the main model, and the ones that pass are identical to what the model would have produced anyway.
What 42% means in practice
At 28 tok/s, generating a 1,000-token response takes 35 seconds. At 40 tok/s, it takes 25 seconds. Over a coding session with 50 requests, that's 8 minutes saved. Over a month? Hours.
And remember: this is still a 35B MoE model with 262K context running on an 8 GB consumer GPU. The fact that we're discussing 40+ tok/s at all is absurd.
🐛 Troubleshooting (The "I Did This So You Don't Have To" Section)
context type MTP requested but model doesn't contain MTP layers
Cause: You're using the old Unsloth GGUF (the one without -MTP in the filename).
Fix: Download the MTP-specific GGUF from unsloth/Qwen3.6-35B-A3B-MTP-GGUF. The model card explicitly says "MTP: trained with multi-steps" — but the GGUF needs to include those layers during conversion, and the regular Unsloth quants from a month ago don't.
cudaMalloc failed: out of memory with --mmproj
Cause: MTP context + vision encoder = too much VRAM.
Fix: Drop --mmproj (saves ~1.7 GB). If you absolutely need vision, reduce --ctx-size to ~196K.
Alternative: Use the non-MTP server for vision tasks and the MTP server for coding. Two servers, different ports, best of both worlds.
Rebase conflicts in speculative decoding files
Cause: The TheTom fork had cherry-picked older speculative decoding code that now conflicts with upstream's newer MTP implementation.
Fix: Skip those commits. Upstream has better speculative decoding now anyway. git rebase --skip is your friend.
Server takes 30+ seconds to load
That's normal. --no-mmap copies ~14 GB of expert weights into pinned CUDA host memory. The alternative (memory-mapped I/O) doesn't work with --n-cpu-moe. Go make coffee.
🔗 OpenCode Integration
If you're using this with OpenCode (and you should be), here's the provider config:
{
"provider": {
"turboquant-mtp": {
"baseURL": "http://localhost:8080/v1",
"apiKey": "none",
"models": {
"qwen36-35b-mtp": {
"context": 262144,
"output": 16384,
"reasoning": true,
"tool_call": true
}
}
}
}
}
The model supports thinking mode by default, which is great for agentic coding tasks. To disable thinking for quick answers, set "chat_template_kwargs": {"enable_thinking": false} in your API request.
🎉 Conclusion
What started as "can I bolt MTP onto TurboQuant" turned into:
- ✅ A successful rebase of 194 turboquant commits onto latest upstream
- ✅ MTP native speculative decoding with 90.2% draft acceptance
- ✅ 40.6 tok/s on an RTX 3070 — a 42% speedup over the previous 28.6 tok/s
- ✅ Full 262K context preserved (auto-asymmetric turbo4/q8_0 hybrid KV cache)
- ✅ All CUDA kernels intact (turbo4, flash attention, warp-cooperative MoE)
The evolution of this setup
| Configuration | Context | Speed | Key innovation |
|---|---|---|---|
| Stock llama.cpp + q8_0 | 131K | 27 tok/s | Baseline |
| TheTom fork + turbo4 | 262K | 28 tok/s | 4-bit KV cache, auto-asymmetric |
| TheTom fork + turbo4 + MTP | 262K | 40.6 tok/s | Native multi-token prediction |
What's next?
The obvious missing piece is vision support with MTP. Right now it's either/or — MTP adds ~1 GB of VRAM, and the vision encoder needs another ~1.7 GB. With some memory optimization or reduced context, both should fit. That's a problem for next weekend.
The other frontier: MTP with --spec-draft-n-max tuned higher than the default. Qwen3.6 was trained with multi-step prediction — the theoretical limit might be higher than the current draft acceptance suggests. More benchmarks needed.
🛠️ Try it yourself:
lucad87/llama-cpp-turboquant — branch feature/turboquant-mtp