Taming a 26B MoE Model on 8GB VRAM with TurboQuant

Yes, you can run a 26-billion-parameter Mixture-of-Experts model on 8GB of VRAM. No, this is not black magic. Yes, it does involve compiling C++ code at 2 AM while questioning your life choices.

Most people would look at Gemma 4 26B (A4B) — a model with 128 experts per layer and a native 262K context window — and say "nice try, call me when you have an A100." But I'm stubborn. And I have caffeine. So I forked llama.cpp, welded in some experimental KV cache compression called TurboQuant, and turned my RTX 3070 into something that should not exist.

This is that story.

🤔 Wait, Why Not Just Use Docker?

Good question. The official llama.cpp Docker images stopped publishing CUDA builds. And even if they did, TurboQuant isn't upstream — it's sitting on a feature branch called feature/turboquant-kv-cache that adds 3-bit and 4-bit KV cache compression specifically optimized for Gemma 4's weird attention architecture (256-dim and 512-dim heads, sliding window, the works).

So: no Docker, no pre-built binaries, no shortcuts. Just me, CMake, and a mild caffeine addiction.

⚙️ The Hardware (aka "This Shouldn't Work")

Before we dive into the technical trenches, let's establish what we're working with:

Component	Spec
CPU	AMD Ryzen 7 5800X3D (8c/16t)
RAM	32 GB DDR4
GPU	NVIDIA GeForce RTX 3070 (8 GB VRAM)
CUDA	13.2 (driver 595.97)
Compiler	Visual Studio 2026 Community (MSVC 14.50)
CMake	4.3.2

If you're reading this thinking "that's less VRAM than my phone," you're right. We are about to park a semi-truck in a compact car space.

🧬 The Model: Gemma 4 26B MoE (A4B)

This isn't your average 7B dense model. Gemma 4 26B is a Mixture-of-Experts beast:

Property	Value
Total parameters	25.23 billion
Active parameters	~4 billion (8 experts out of 128)
Layers	30
Experts per layer	128
Context window	262,144 tokens
Attention	16 heads, KV heads 2-8 per layer
Head dims	K=512 / V=512 (full), K=256 / V=256 (SWA)
Sliding window	1024 tokens (25 SWA layers, 5 full-attention)
File size (Q4_K_M)	16.8 GB

The secret sauce: only 8 experts are active per token. The other 120 are just... there. On disk. In RAM. Waiting for their moment. We exploit this brutally.
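
If "8 out of 128" feels abstract, here is a minimal Python sketch of top-k expert routing. It is a generic illustration, not Gemma's actual router: the random scores, the `route_top_k` name, and the choice to softmax over only the selected experts are all assumptions made for the example.

```python
import heapq
import math
import random

def route_top_k(router_logits, k=8):
    """Pick the k highest-scoring experts and renormalize their weights."""
    top = heapq.nlargest(k, range(len(router_logits)), key=router_logits.__getitem__)
    # Softmax over the selected logits only, so the k weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(128)]  # made-up router scores
chosen = route_top_k(logits)
print(sorted(chosen))        # indices of the 8 experts used for this token
print(128 - len(chosen))     # experts that sit idle for this token
```

The idle count is what makes offloading attractive: most expert weights are untouched on any given token.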

🍴 The Fork: What Exactly Is TurboQuant?

TurboQuant is a set of custom CUDA kernels and memory layouts that compress the KV cache to 4 bits per element instead of the standard 8 or 16. It lives on a feature branch of llama.cpp:

Repo: test1111111111111112/llama-cpp-turboquant-gemma4 (branch: feature/turboquant-kv-cache)

Branch: feature/turboquant-kv-cache
Commit: e93b7c56f
Key patches:
  - Gemma 4 D=256/512 head dimension support (attn_iswa)
  - Lazy K/V quantization (decompress on-the-fly)
  - Batch decode optimization
  - Warp-cooperative write for MoE gating

The net effect? At 262K context with an unquantized f16 KV cache, you'd need roughly 6 GB just for the KV cache. With turbo4, that drops to ~1.5 GB. That's the difference between "cudaMalloc failed: out of memory" and "oh, I still have 4 GB free."
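
To see where numbers of that size come from, here is a back-of-the-envelope estimate in Python. Treat it as a sketch under assumptions: it uses the 5 full-attention / 25 sliding-window layer split and the 4.6K-cell SWA cache reported in the Step 4 memory breakdown, assumes 2 KV heads on full-attention layers and 8 on SWA layers, and ignores the per-block scale overhead that real quantized formats carry.

```python
MIB = 1024 * 1024

# Head dims come from the model table; the KV-head split (2 full / 8 SWA)
# and the 4608-cell SWA cache are assumptions made for this estimate.
FULL = {"layers": 5, "cells": 262_144, "kv_heads": 2, "head_dim": 512}
SWA = {"layers": 25, "cells": 4_608, "kv_heads": 8, "head_dim": 256}

def kv_cache_mib(bits_per_element):
    """Estimate K+V cache size in MiB, ignoring block-scale overhead."""
    total_bytes = 0.0
    for group in (FULL, SWA):
        elements_per_cell = 2 * group["kv_heads"] * group["head_dim"]  # K and V
        total_bytes += (group["layers"] * group["cells"]
                        * elements_per_cell * bits_per_element / 8)
    return total_bytes / MIB

for label, bits in (("16-bit", 16), ("8-bit", 8), ("4-bit", 4)):
    print(f"{label:>6}: {kv_cache_mib(bits):,.0f} MiB")
```

Under these assumptions the ~6 GB figure corresponds to a 16-bit cache, an 8-bit cache lands around 3 GB, and the 4-bit estimate comes out close to the ~1.5 GB quoted above.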

⚠️ Caveat emptor: TurboQuant is not upstream. It may never merge. It is experimental, opinionated, and requires compiling from source. If you were hoping for pip install happiness, this is not your stop.

🔨 Step 1: Clone, Configure, Compile

Clone the fork

powershell

git clone https://github.com/test1111111111111112/llama-cpp-turboquant-gemma4.git C:\Users\<your-user>\llama-cpp-turboquant-gemma4
cd C:\Users\<your-user>\llama-cpp-turboquant-gemma4
git checkout feature/turboquant-kv-cache

Configure with CMake

This is where things get spicy. TurboQuant needs specific CUDA arch flags:

powershell

mkdir build
cd build

# No -G flag: on Windows, CMake defaults to the newest installed Visual Studio generator
cmake .. `
  -DGGML_CUDA=ON `
  -DCMAKE_CUDA_ARCHITECTURES=86 `
  -DGGML_CUDA_FA=ON `
  -DGGML_CUDA_GRAPHS=ON `
  -DGGML_CUDA_COMPRESSION_MODE=size

What each flag does:

Flag	Why
`GGML_CUDA=ON`	"Yes, I have an NVIDIA GPU. Please use it."
`CMAKE_CUDA_ARCHITECTURES=86`	Target `sm_86` (RTX 3070/3080/3090). Don't set this to `75` unless you enjoy debugging segfaults.
`GGML_CUDA_FA=ON`	Flash Attention. Without this, the VRAM math doesn't close.
`GGML_CUDA_GRAPHS=ON`	CUDA graphs reduce kernel launch overhead for repeated operations. Free perf.
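`GGML_CUDA_COMPRESSION_MODE=size`	In upstream llama.cpp this selects nvcc's binary compression mode (CUDA 12.8+), optimizing for a smaller `ggml-cuda.dll`. It shrinks the file on disk, not the VRAM footprint.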

Build

powershell

cmake --build . --config Release -j 16

This takes about 5-8 minutes on a 5800X3D. Go make coffee. Or two. The first compile is always "exciting" — a euphemism for "I hope CMake didn't silently ignore my CUDA flags again."

Binary output:

File	Size	Role
`llama-server.exe`	8.4 MB	HTTP server (OpenAI-compatible)
`ggml-cuda.dll`	56.5 MB	TurboQuant CUDA kernels
`llama.dll`	2.5 MB	Core library
`ggml.dll`	68 KB	Backend dispatcher

If you see ggml-cuda.dll at ~56 MB, congratulations: the TurboQuant kernels compiled successfully. If it's 5 MB... you're running on CPU and you won't even notice until you try loading the model and your VRAM sits at 0%.
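
If you'd rather not eyeball file sizes, a tiny Python check can do it for you. This is only a sketch: the 20 MB threshold is a guess sitting between the two sizes quoted above, and the DLL path is hypothetical, so adjust both to your build.

```python
from pathlib import Path

# Threshold is a guess based on the sizes quoted above (~56 MB vs ~5 MB).
MIN_CUDA_DLL_MB = 20

def looks_like_cuda_build(dll_path):
    """Return True if the DLL exists and is large enough to plausibly hold CUDA kernels."""
    path = Path(dll_path)
    if not path.is_file():
        return False
    return path.stat().st_size / (1024 * 1024) >= MIN_CUDA_DLL_MB

if __name__ == "__main__":
    # Hypothetical path; point it at wherever your build drops its binaries.
    print(looks_like_cuda_build(r"build\bin\Release\ggml-cuda.dll"))
```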

📥 Step 2: Get the Model

powershell

# LM Studio users (model already downloaded):
# Path: ~/.lmstudio/models/lmstudio-community/gemma-4-26B-A4B-it-GGUF/

# Direct download (16.8 GB — get a snack):
# curl.exe ships with Windows 10+; -L follows redirects, -O keeps the remote file name
curl.exe -L -O https://huggingface.co/bartowski/gemma-4-26B-A4B-it-GGUF/resolve/main/gemma-4-26B-A4B-it-Q4_K_M.gguf

For image support, also grab the multimodal projector:

powershell

curl.exe -L -O https://huggingface.co/bartowski/gemma-4-26B-A4B-it-GGUF/resolve/main/mmproj-gemma-4-26B-A4B-it-BF16.gguf

This adds ~1.1 GB and consumes an extra 141 MB of VRAM at runtime. Totally worth it for the "describe this meme" feature.

🚀 Step 3: Launch the Server

Here's the command. Take a deep breath — it's long, but every flag earns its keep:

powershell

C:\path\to\llama-server.exe `
  -m "gemma-4-26B-A4B-it-Q4_K_M.gguf" `
  --mmproj "mmproj-gemma-4-26B-A4B-it-BF16.gguf" `
  --host 127.0.0.1 --port 8080 `
  -fa auto -ngl 999 `
  --n-cpu-moe 120 `
  --no-mmap --mlock `
  --cache-type-k turbo4 `
  --cache-type-v turbo4 `
  --ctx-size 262144

Flag-by-flag breakdown

Flag	What it does	Why you need it
`-ngl 999`	Offload ALL layers to GPU	"999" = "yes, all 30 layers, stop asking"
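`--n-cpu-moe 120`	Keep the MoE expert weights in host RAM	This is the trick. In upstream llama.cpp the number counts layers, not experts: it keeps the expert weights of the first N layers off the GPU. With 30 layers, any value of 30 or more (120 included) leaves all ~14 GB of experts in pinned RAM (`CUDA_Host`), and only the ~1.6 GB of attention and shared weights lives on the GPU.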
`--no-mmap`	Disable memory mapping	Required with `--n-cpu-moe`. Experts on CPU must be explicitly copied into `CUDA_Host` buffers.
`--mlock`	Pin memory, prevent swap	Those 14 GB of `CUDA_Host` tensors better not touch the page file.
`--cache-type-k turbo4`	4-bit KV cache for keys	Drops KV cache from 6 GB to 1.5 GB at 262K context.
`--cache-type-v turbo4`	4-bit KV cache for values	Use the same type as K (or `f16`). Mixing `turbo4` K with `turbo3` V = crash at `fattn.cu:322`. Don't ask how I know.
`-fa auto`	Flash Attention	Reduces VRAM for attention computation. TurboQuant forces this internally anyway.
`--ctx-size 262144`	262K context	Native limit for Gemma 4 26B. Yes, the entire KV cache fits in VRAM now.
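
If you launch the server from a script, it can help to assemble the argument list in one place. The sketch below is one way to do that in Python; the `build_server_args` helper is hypothetical, and it only uses flags that already appear in the command above.

```python
import subprocess

def build_server_args(exe, model, mmproj=None, ctx_size=262_144, cache_type="turbo4"):
    """Assemble the llama-server argument list used in this post."""
    args = [
        exe, "-m", model,
        "--host", "127.0.0.1", "--port", "8080",
        "-fa", "auto", "-ngl", "999",
        "--n-cpu-moe", "120",
        "--no-mmap", "--mlock",
        "--cache-type-k", cache_type,
        "--cache-type-v", cache_type,   # same type for K and V
        "--ctx-size", str(ctx_size),
    ]
    if mmproj:                          # the vision projector is optional
        args += ["--mmproj", mmproj]
    return args

if __name__ == "__main__":
    argv = build_server_args(r"C:\path\to\llama-server.exe",
                             "gemma-4-26B-A4B-it-Q4_K_M.gguf")
    print(" ".join(argv))
    # subprocess.run(argv, check=True)  # uncomment to actually start the server
```

Passing a list to `subprocess.run` avoids quoting headaches with Windows paths.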

🧠 The MoE offload trick explained:
In a dense 26B model, every parameter is used for every token, so all of it wants to sit on the GPU and you're dead. In a MoE model, only 8 of 128 experts fire per token, so the expert weights can live in CUDA_Host (pinned system RAM that the GPU can reach over PCIe) while the shared layers (attention, norms) and the KV cache stay in VRAM. When the router picks different experts for the next token, nothing has to be shuffled in or out of VRAM: every expert is already resident in host memory. The result: 14 GB of model weights that never occupy VRAM.
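
One caveat worth checking against your own build: upstream llama.cpp documents `--n-cpu-moe N` as keeping the expert weights of the first N layers on the CPU, so the count is layers, not experts. Here is a minimal Python sketch of that placement rule. The tensor names follow the usual GGUF naming for MoE models, which is an assumption about this particular file, and the sketch assumes `-ngl` sends everything else to the GPU.

```python
import re

# Assumption: expert tensors are named like "blk.<layer>.ffn_up_exps.weight".
EXPERT_TENSOR = re.compile(r"^blk\.(\d+)\.ffn_(up|down|gate)_exps\.")

def placement(tensor_name, n_cpu_moe):
    """Return "CPU" for expert tensors in the first n_cpu_moe layers, else "GPU"."""
    match = EXPERT_TENSOR.match(tensor_name)
    if match and int(match.group(1)) < n_cpu_moe:
        return "CPU"
    return "GPU"

for name in ("blk.0.attn_q.weight",
             "blk.0.ffn_up_exps.weight",
             "blk.29.ffn_down_exps.weight"):
    print(name, placement(name, n_cpu_moe=120))
```

With 30 layers, any value of 30 or more sends every expert tensor to the CPU side, which matches the 14 GB `CUDA_Host` buffer in the memory breakdown.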

📊 Step 4: Memory Allocation (Where Did the VRAM Go?)

Here's the breakdown at 262K context with turbo4 KV cache:

Buffer	Location	Size	Notes
Model (attention + norms)	CUDA0	1,574 MB	Non-expert layers always on GPU
Model (128 experts)	CUDA_Host	14,429 MB	All expert weights, in pinned host RAM
Model (output layer)	CUDA0	(included above)
Model (CPU tensors)	CPU	578 MB	Embeddings, tokenizer
KV cache non-SWA	CUDA0	1,240 MB	262K cells × 5 full-attention layers
KV cache SWA	CUDA0	218 MB	4.6K cells × 25 SWA layers (window=1024)
Compute buffers	CUDA0	964 MB	Temporary GPU workspace
CLIP/vision (opt)	CUDA0	141 MB	Only with `--mmproj`
Compute buffers	CUDA_Host	532 MB
Total VRAM (text)	CUDA0	~3,996 MB	Under 4 GB!
Total VRAM (+images)	CUDA0	~4,137 MB	Still under 5 GB!

Yes, you read that right. A 26B MoE model with a 262K context window fits in under 4 GB of actual VRAM. The magic is entirely in --n-cpu-moe and turbo4 KV cache.
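
The totals are easy to sanity-check. A few lines of Python, using only the CUDA0 rows from the table and the 8,191 MB total VRAM reported in the benchmarks:

```python
# Sizes in MB, copied from the CUDA0 rows of the table above.
cuda0 = {
    "model (attention + norms)": 1574,
    "KV cache non-SWA": 1240,
    "KV cache SWA": 218,
    "compute buffers": 964,
}
vision = 141        # only with --mmproj
total_vram = 8191   # RTX 3070, as reported in the benchmark table

text_only = sum(cuda0.values())
with_images = text_only + vision
print(text_only, with_images, total_vram - with_images)  # prints: 3996 4137 4054
```

The last number matches the free VRAM measured after load, so the table accounts for essentially everything on the card.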

🏎️ Step 5: Benchmarks

Here's what I measured on the RTX 3070 / 5800X3D:

Metric	Value	Notes
Prompt processing	~83 tok/s	Text-only
Prompt processing (images)	~170 tok/s	Vision encoder + CLIP compute
Token generation	~27 tok/s	Consistent regardless of context or modality
Model loading	30-40 seconds	`--no-mmap` requires explicit copy of 14 GB
Warmup	~5 seconds	First inference after load
Free VRAM after load	~4,054 MB	Out of 8,191 MB total

27 tok/s isn't going to win any drag races against GPT-4o, but for a locally-hosted 26B MoE model that absolutely should not fit on this hardware? It's borderline miraculous.
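
To turn those rates into wall-clock expectations, here is a rough Python estimator. It assumes throughput stays flat at any context length, which is optimistic for prompt processing, so read the results as lower bounds.

```python
PROMPT_TOK_PER_S = 83   # text-only prompt processing, from the table
GEN_TOK_PER_S = 27      # token generation, from the table

def estimate_seconds(prompt_tokens, output_tokens):
    """Rough wall-clock estimate, assuming throughput is constant."""
    return prompt_tokens / PROMPT_TOK_PER_S + output_tokens / GEN_TOK_PER_S

for prompt in (2_000, 32_000, 262_144):
    minutes = estimate_seconds(prompt, 400) / 60
    print(f"{prompt:>7} prompt tokens + 400 output: {minutes:5.1f} min")
```

Filling the whole 262K window works out to a little under an hour of prompt processing at these rates, so "fits in VRAM" and "fast to fill" are separate questions.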

Comparison with the standard (non-fork) llama.cpp running the smaller Gemma 4 E4B (4B dense):

Aspect	Standard E4B	TurboQuant 26B MoE
Model params	4B (dense)	25B total, ~4B active
File size	5 GB	16.8 GB
Context	128K	262K
KV cache type	q8_0 (8-bit)	turbo4 (4-bit)
KV cache VRAM	~6.8 GB	~1.5 GB
Gen speed	~62 tok/s	~27 tok/s
RAM needed	~6 GB overflow	~15 GB CUDA_Host

The trade-off is clear: you lose about half the generation speed, but you gain 2x the context window at 4.5x less KV cache VRAM, with a model that has roughly 6x the total parameters.

🖼️ Step 6: Image Support (Optional)

Gemma 4 26B supports vision input. Add --mmproj and the CLIP vision encoder springs to life:

bash

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What do you see in this image?"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
      ]
    }],
    "max_tokens": 400,
    "temperature": 0.7
  }'

Convert images to base64 with PowerShell:

powershell

[Convert]::ToBase64String([IO.File]::ReadAllBytes("cat.png"))

The vision pipeline adds ~141 MB of VRAM for CLIP compute buffers. At 262K context you can feed dozens of images before running out of tokens.
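
The same request can be sent from Python with nothing but the standard library. This is a sketch mirroring the curl example: `build_image_request` and `ask` are hypothetical helper names, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` shape.

```python
import base64
import json
import urllib.request

def build_image_request(image_bytes, prompt, mime="image/png"):
    """Build the same chat-completions payload as the curl example."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "gemma4",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:{mime};base64,{encoded}"}},
            ],
        }],
        "max_tokens": 400,
        "temperature": 0.7,
    }

def ask(payload, base_url="http://localhost:8080/v1"):
    """POST the payload to the local server and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Usage (needs the server from Step 3 running and a local cat.png):
#   with open("cat.png", "rb") as fh:
#       print(ask(build_image_request(fh.read(), "What do you see in this image?")))
```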

🐛 Step 7: Troubleshooting (The "I Learned This The Hard Way" Section)

Crash at `fattn.cu:322`

GGML_ABORT("fatal error")

Cause: Incompatible K/V cache type combination for TurboQuant Flash Attention kernels.
Fix: Use --cache-type-k turbo4 --cache-type-v turbo4 (must match!) or --cache-type-k turbo4 --cache-type-v f16.

Supported combinations:

K	V	Supported
turbo4	turbo4	✅
turbo4	f16	✅
f16	turbo4	✅
turbo4	turbo3	❌ (crash)

Yes, I spent 40 minutes debugging this before reading the actual CUDA kernel code. Learn from my pain.
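
If you script your launches, you can catch the bad combination before the server does. A small Python sketch that encodes only the four rows of the table above; the `check_cache_types` name is made up, and anything outside the table is reported as unknown rather than guessed.

```python
# Combinations from the table above; anything not listed is untested here.
KNOWN_COMBOS = {
    ("turbo4", "turbo4"): True,
    ("turbo4", "f16"): True,
    ("f16", "turbo4"): True,
    ("turbo4", "turbo3"): False,
}

def check_cache_types(k_type, v_type):
    """Return True/False for combinations listed in the table, None if unknown."""
    return KNOWN_COMBOS.get((k_type, v_type))

for combo in (("turbo4", "turbo4"), ("turbo4", "turbo3"), ("turbo3", "turbo3")):
    print(combo, check_cache_types(*combo))
```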

`cudaMalloc failed: out of memory`

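Cause: Too many expert weights on the GPU (--n-cpu-moe too low).
Fix: In upstream llama.cpp, --n-cpu-moe counts layers: it keeps the expert weights of the first N layers in host memory. This model has 30 layers, so any value of 30 or more (such as 120) keeps every expert off the GPU. Drop below 30 and each layer you let back onto the GPU brings all 128 of its experts along: 14,429 MB spread over 30 layers is roughly 480 MB per layer. It adds up.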
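
For a feel of the sizes involved, here is the arithmetic in Python, using the 14,429 MB expert buffer from Step 4. It assumes the expert weights are spread evenly across layers, and it follows the upstream convention that `--n-cpu-moe` counts layers; treat the fork's behavior as an assumption until you have checked it.

```python
EXPERT_BUFFER_MB = 14_429   # "Model (128 experts)" row from Step 4
LAYERS = 30
EXPERTS_PER_LAYER = 128

# All 128 experts of a single layer.
per_layer_mb = EXPERT_BUFFER_MB / LAYERS
# One expert index across all 30 layers.
per_expert_slot_mb = EXPERT_BUFFER_MB / EXPERTS_PER_LAYER

def extra_vram_mb(layers_on_gpu):
    """VRAM added if the expert weights of this many layers move onto the GPU."""
    return layers_on_gpu * per_layer_mb

print(round(per_layer_mb), round(per_expert_slot_mb), round(extra_vram_mb(4)))
# prints: 481 113 1924
```

Four layers' worth of experts already costs close to 2 GB, which is why a slightly too-low value can tip an 8 GB card over the edge.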

Model loading takes 30-40 seconds

That's normal. --no-mmap forces explicit copy of ~14 GB into CUDA_Host pinned memory. The alternative (mmap) doesn't work with --n-cpu-moe. Grab a snack.

🔗 Step 8: Integrate with OpenCode

OpenCode can connect to your local TurboQuant server as if it were OpenAI:

json

{
  "provider": {
    "turboquant": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Gemma4 26B MoE (TurboQuant)",
      "options": {
        "baseURL": "http://localhost:8080/v1"
      },
      "models": {
        "gemma4-26b-moe": {
          "name": "gemma4",
          "limit": {
            "context": 262144,
            "output": 16384
          },
          "modalities": {
            "input": ["text", "image"],
            "output": ["text"]
          },
          "reasoning": true,
          "tool_call": true,
          "interleaved": true
        }
      }
    }
  }
}

Then select it with /model turboquant/gemma4-26b-moe — and you're coding with a 26B MoE model that lives entirely on your desk.

🎉 Conclusion

What started as "I wonder if this fits" turned into:

✅ A custom-compiled llama.cpp fork with TurboQuant
✅ A 26B MoE model running at 27 tok/s
✅ 262K context window entirely in VRAM (~1.5 GB KV cache)
✅ 128-expert MoE offloading (all expert weights in pinned host RAM)
✅ Multimodal support with CLIP vision encoder
✅ OpenAI-compatible API endpoint
✅ Under 4 GB VRAM usage at full context

The key innovations that make this work:

--n-cpu-moe 120: The killer feature. Without it, the model doesn't load. With it, you have a 26B model that behaves like a 4B model in terms of VRAM but thinks like a much larger one.
--cache-type-k/v turbo4: 4-bit KV cache compression. At 262K context, this saves you ~4.5 GB of VRAM compared to an f16 cache.
ggml-cuda.dll (56.5 MB): The TurboQuant CUDA kernels handle the dequantization on-the-fly, so you never pay the memory cost of the full-precision KV cache.

Could this run on a 6 GB GPU? Possibly: the expert weights are already fully offloaded, so the remaining lever is a smaller --ctx-size to shrink the KV cache. Could it run on a 4060 Ti 16GB? Comfortably, with room to spare. The ceiling is higher than you think.

If you want to try this yourself: clone the fork, grab the model from HuggingFace, compile with the flags above, and prepare to be surprised at how much you can squeeze out of consumer hardware.

Now if you'll excuse me, I need to explain to my RTX 3070 why it's running a model that's twice its VRAM capacity. It's not speaking to me.

Luca
