Yes, you can run a 26-billion-parameter Mixture-of-Experts model on 8GB of VRAM. No, this is not black magic. Yes, it does involve compiling C++ code at 2 AM while questioning your life choices.
Most people would look at Gemma 4 26B (A4B) — a model with 128 experts per layer and a native 262K context window — and say "nice try, call me when you have an A100." But I'm stubborn. And I have caffeine. So I forked llama.cpp, welded in some experimental KV cache compression called TurboQuant, and turned my RTX 3070 into something that should not exist.
This is that story.
🤔 Wait, Why Not Just Use Docker?
Good question. The official llama.cpp Docker images stopped publishing CUDA builds. And even if they did, TurboQuant isn't upstream — it's sitting on a feature branch called feature/turboquant-kv-cache that adds 3-bit and 4-bit KV cache compression specifically optimized for Gemma 4's weird attention architecture (256-dim and 512-dim heads, sliding window, the works).
So: no Docker, no pre-built binaries, no shortcuts. Just me, CMake, and a mild caffeine addiction.
⚙️ The Hardware (aka "This Shouldn't Work")
Before we dive into the technical trenches, let's establish what we're working with:
| Component | Spec |
|---|---|
| CPU | AMD Ryzen 7 5800X3D (8c/16t) |
| RAM | 32 GB DDR4 |
| GPU | NVIDIA GeForce RTX 3070 (8 GB VRAM) |
| CUDA | 13.2 (driver 595.97) |
| Compiler | Visual Studio 2026 Community (MSVC 14.50) |
| CMake | 4.3.2 |
If you're reading this thinking "that's less VRAM than my phone," you're right. We are about to park a semi-truck in a compact car space.
🧬 The Model: Gemma 4 26B MoE (A4B)
This isn't your average 7B dense model. Gemma 4 26B is a Mixture-of-Experts beast:
| Property | Value |
|---|---|
| Total parameters | 25.23 billion |
| Active parameters | ~4 billion (8 experts out of 128) |
| Layers | 30 |
| Experts per layer | 128 |
| Context window | 262,144 tokens |
| Attention | 16 heads, KV heads 2-8 per layer |
| Head dims | K=512 / V=512 (full), K=256 / V=256 (SWA) |
| Sliding window | 1024 tokens (25 SWA layers, 5 full-attention) |
| File size (Q4_K_M) | 16.8 GB |
The secret sauce: only 8 experts are active per token. The other 120 are just... there. On disk. In RAM. Waiting for their moment. We exploit this brutally.
🍴 The Fork: What Exactly Is TurboQuant?
TurboQuant is a set of custom CUDA kernels and memory layouts that compress the KV cache to 3 or 4 bits per element (turbo3 and turbo4) instead of the standard 8 or 16. It lives on a feature branch of llama.cpp:
Repo: test1111111111111112/llama-cpp-turboquant-gemma4
Branch: feature/turboquant-kv-cache
Commit: e93b7c56f
Key patches:
- Gemma 4 D=256/512 head dimension support (attn_iswa)
- Lazy K/V quantization (decompress on-the-fly)
- Batch decode optimization
- Warp-cooperative write for MoE gating
The net effect? At 262K context with an uncompressed f16 KV cache, you'd need roughly 6 GB just for the KV cache. With turbo4, that drops to ~1.5 GB. That's the difference between "cudaMalloc failed: out of memory" and "oh, I still have 4 GB free."
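A rough back-of-envelope check on those numbers, as a minimal sketch. It assumes 2 KV heads on the 5 full-attention layers, 8 KV heads on the 25 sliding-window layers, and 4,608 cells in the sliding-window cache. None of these are read from the fork's source; they are assumptions chosen because they reproduce the buffer sizes reported in the memory table later in this post.

```python
MIB = 1024 * 1024

def kv_cache_mib(cells, layers, kv_heads, head_dim, bits_per_elem):
    """Size of the K and V tensors together, in MiB."""
    elems = 2 * cells * layers * kv_heads * head_dim  # 2 = one K and one V tensor
    return elems * bits_per_elem / 8 / MIB

# Assumed shapes (hypothetical, not taken from the fork's source).
FULL = dict(cells=262_144, layers=5, kv_heads=2, head_dim=512)
SWA = dict(cells=4_608, layers=25, kv_heads=8, head_dim=256)

# What the same cache would cost at 16 bits per element.
f16_total = kv_cache_mib(**FULL, bits_per_elem=16) + kv_cache_mib(**SWA, bits_per_elem=16)

# turbo4 KV buffer sizes reported in the memory table of this post (MB).
measured_turbo4 = 1_240 + 218

# Effective bits per element implied by the measured buffers.
effective_bits = 16 * measured_turbo4 / f16_total

print(f"16-bit KV cache:  {f16_total:,.0f} MiB")
print(f"turbo4 measured:  {measured_turbo4:,} MiB")
print(f"implied bits per element: {effective_bits:.2f}")
```

Under these assumptions a 16-bit cache lands near 6 GB, and the measured turbo4 buffers work out to a little under 4 bits per element.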
⚠️ Caveat emptor: TurboQuant is not upstream. It may never merge. It is experimental, opinionated, and requires compiling from source. If you were hoping for pip install happiness, this is not your stop.
🔨 Step 1: Clone, Configure, Compile
Clone the fork
git clone https://github.com/test1111111111111112/llama-cpp-turboquant-gemma4.git C:\Users\<your-user>\llama-cpp-turboquant-gemma4
cd C:\Users\<your-user>\llama-cpp-turboquant-gemma4
git checkout feature/turboquant-kv-cache
Configure with CMake
This is where things get spicy. TurboQuant needs specific CUDA arch flags:
mkdir build && cd build
cmake .. -G "Unix Makefiles" -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 -DGGML_CUDA_FA=ON -DGGML_CUDA_GRAPHS=ON -DGGML_CUDA_COMPRESSION_MODE=size
What each flag does:
| Flag | Why |
|---|---|
| GGML_CUDA=ON | "Yes, I have an NVIDIA GPU. Please use it." |
| CMAKE_CUDA_ARCHITECTURES=86 | Target sm_86 (RTX 3070/3080/3090). Don't set this to 75 unless you enjoy debugging segfaults. |
| GGML_CUDA_FA=ON | Flash Attention. Without this, the VRAM math doesn't close. |
| GGML_CUDA_GRAPHS=ON | CUDA graphs reduce kernel launch overhead for repeated operations. Free perf. |
| GGML_CUDA_COMPRESSION_MODE=size | In upstream llama.cpp this sets nvcc's binary compression mode: size shrinks the compiled CUDA library on disk. It does not change VRAM use at runtime. |
Build
cmake --build . --config Release -j 16
This takes about 5-8 minutes on a 5800X3D. Go make coffee. Or two. The first compile is always "exciting" — a euphemism for "I hope CMake didn't silently ignore my CUDA flags again."
Binary output:
| File | Size | Role |
|---|---|---|
| llama-server.exe | 8.4 MB | HTTP server (OpenAI-compatible) |
| ggml-cuda.dll | 56.5 MB | TurboQuant CUDA kernels |
| llama.dll | 2.5 MB | Core library |
| ggml.dll | 68 KB | Backend dispatcher |
If you see ggml-cuda.dll at ~56 MB, congratulations: the TurboQuant kernels compiled successfully. If it's 5 MB... you're running on CPU and you won't even notice until you try loading the model and your VRAM sits at 0%.
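The size check above can be scripted so it runs right after the build. This is a minimal sketch: the 20 MB threshold is an assumption picked to sit between the ~5 MB and ~56 MB sizes mentioned above, and the path in the usage note is hypothetical, so adjust both to your build tree.

```python
from pathlib import Path

# Assumed cut-off between a stub library and a full CUDA build (hypothetical value).
MIN_CUDA_DLL_MB = 20

def check_cuda_dll(dll_path, min_mb=MIN_CUDA_DLL_MB):
    """Return (ok, message) based on whether the DLL exists and how big it is."""
    path = Path(dll_path)
    if not path.is_file():
        return False, f"{path.name} not found: the CUDA backend was not built"
    size_mb = path.stat().st_size / (1024 * 1024)
    if size_mb < min_mb:
        return False, f"{path.name} is only {size_mb:.1f} MB: kernels are probably missing"
    return True, f"{path.name} is {size_mb:.1f} MB: looks like a CUDA build"
```

For example, `print(check_cuda_dll(r"build\bin\Release\ggml-cuda.dll")[1])`, assuming that is where your generator puts the binaries.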
📥 Step 2: Get the Model
# LM Studio users (model already downloaded):
# Path: ~/.lmstudio/models/lmstudio-community/gemma-4-26B-A4B-it-GGUF/
# Direct download (16.8 GB — get a snack):
wget https://huggingface.co/bartowski/gemma-4-26B-A4B-it-GGUF/resolve/main/gemma-4-26B-A4B-it-Q4_K_M.gguf
For image support, also grab the multimodal projector:
wget https://huggingface.co/bartowski/gemma-4-26B-A4B-it-GGUF/resolve/main/mmproj-gemma-4-26B-A4B-it-BF16.gguf
This adds ~1.1 GB and consumes an extra 141 MB of VRAM at runtime. Totally worth it for the "describe this meme" feature.
🚀 Step 3: Launch the Server
Here's the command. Take a deep breath — it's long, but every flag earns its keep:
C:\path\to\llama-server.exe -m "gemma-4-26B-A4B-it-Q4_K_M.gguf" --mmproj "mmproj-gemma-4-26B-A4B-it-BF16.gguf" --host 127.0.0.1 --port 8080 -fa auto -ngl 999 --n-cpu-moe 120 --no-mmap --mlock --cache-type-k turbo4 --cache-type-v turbo4 --ctx-size 262144
Flag-by-flag breakdown
| Flag | What it does | Why you need it |
|---|---|---|
| -ngl 999 | Offload ALL layers to GPU | "999" = "yes, all 30 layers, stop asking" |
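| --n-cpu-moe 120 | Keep MoE expert weights in host RAM | This is the trick. In upstream llama.cpp the number counts layers, not experts: the expert weights of the first N layers stay on the CPU side. With 30 layers, 120 covers all of them, so all ~14 GB of experts land in pinned RAM (CUDA_Host) and only the shared weights (~1.6 GB) live on GPU. |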
| --no-mmap | Disable memory mapping | Required with --n-cpu-moe. Experts on CPU must be explicitly copied into CUDA_Host buffers. |
| --mlock | Pin memory, prevent swap | Those 14 GB of CUDA_Host tensors better not touch the page file. |
| --cache-type-k turbo4 | 4-bit KV cache for keys | Drops KV cache from 6 GB to 1.5 GB at 262K context. |
| --cache-type-v turbo4 | 4-bit KV cache for values | Must match K. Mixing turbo4 K with turbo3 V = crash at fattn.cu:322. Don't ask how I know. |
| -fa auto | Flash Attention | Reduces VRAM for attention computation. TurboQuant forces this internally anyway. |
| --ctx-size 262144 | 262K context | Native limit for Gemma 4 26B. Yes, the entire thing fits in VRAM now. |
🧠 The MoE offload trick explained:
In a dense 26B model, all parameters sit on GPU and you're dead. In a MoE model, each token only uses 8 of the 128 experts per layer, so the GPU only has to hold the shared layers (attention, norms) permanently. The expert weights sit in CUDA_Host: pinned system RAM that the GPU can reach over PCIe. When the gating mechanism selects different experts for the next token, their weights are already resident in host memory, so nothing has to be reloaded from disk. The result: 14 GB of "GPU memory" that isn't actually on the GPU.
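To make the "8 of 128" idea concrete, here is a toy sketch of top-k expert routing. It illustrates the general technique only and is not llama.cpp's or Gemma's implementation: the random scores stand in for real router logits, and `route_token` is a hypothetical helper.

```python
import random

N_EXPERTS = 128  # experts per layer (from the model table)
TOP_K = 8        # experts activated per token

def route_token(router_scores, top_k=TOP_K):
    """Return the indices of the top_k highest-scoring experts, in ascending order."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i],
                    reverse=True)
    return sorted(ranked[:top_k])

rng = random.Random(0)  # fixed seed so the sketch is repeatable
for token in range(3):
    # Stand-in for the router's per-expert scores for this token.
    scores = [rng.random() for _ in range(N_EXPERTS)]
    active = route_token(scores)
    print(f"token {token}: experts {active} ({len(active)}/{N_EXPERTS} used)")
```

Each token picks a different handful of experts, which is why the full expert set has to be resident somewhere fast even though only a small slice is touched per token.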
📊 Step 4: Memory Allocation (Where Did the VRAM Go?)
Here's the breakdown at 262K context with turbo4 KV cache:
| Buffer | Location | Size | Notes |
|---|---|---|---|
| Model (attention + norms) | CUDA0 | 1,574 MB | Non-expert layers always on GPU |
| Model (128 experts) | CUDA_Host | 14,429 MB | Inactive experts on pinned RAM |
| Model (output layer) | CUDA0 | (included above) | |
| Model (CPU tensors) | CPU | 578 MB | Embeddings, tokenizer |
| KV cache non-SWA | CUDA0 | 1,240 MB | 262K cells × 5 full-attention layers |
| KV cache SWA | CUDA0 | 218 MB | 4.6K cells × 25 SWA layers (window=1024) |
| Compute buffers | CUDA0 | 964 MB | Temporary GPU workspace |
| CLIP/vision (opt) | CUDA0 | 141 MB | Only with --mmproj |
| Compute buffers | CUDA_Host | 532 MB | |
| Total VRAM (text) | CUDA0 | ~3,996 MB | Under 4 GB! |
| Total VRAM (+images) | CUDA0 | ~4,137 MB | Still under 5 GB! |
Yes, you read that right. A 26B MoE model with a 262K context window fits in under 4 GB of actual VRAM. The magic is entirely in --n-cpu-moe and turbo4 KV cache.
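The totals in the table can be re-derived from its rows. A small sketch using only numbers from this post (the 8,191 MB card total comes from the benchmark table further down):

```python
# CUDA0 buffers from the table above, in MB.
cuda0_text = {
    "model (attention + norms)": 1_574,
    "KV cache non-SWA": 1_240,
    "KV cache SWA": 218,
    "compute buffers": 964,
}
clip_vision_mb = 141   # only allocated with --mmproj
total_vram_mb = 8_191  # RTX 3070 total, as reported in the benchmark table

text_total = sum(cuda0_text.values())
image_total = text_total + clip_vision_mb

print(f"text only:   {text_total:,} MB")
print(f"with images: {image_total:,} MB")
print(f"free after load (with images): {total_vram_mb - image_total:,} MB")
```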
🏎️ Step 5: Benchmarks
Here's what I measured on the RTX 3070 / 5800X3D:
| Metric | Value | Notes |
|---|---|---|
| Prompt processing | ~83 tok/s | Text-only |
| Prompt processing (images) | ~170 tok/s | Vision encoder + CLIP compute |
| Token generation | ~27 tok/s | Consistent regardless of context or modality |
| Model loading | 30-40 seconds | --no-mmap requires explicit copy of 14 GB |
| Warmup | ~5 seconds | First inference after load |
| Free VRAM after load | ~4,054 MB | Out of 8,191 MB total |
27 tok/s isn't going to win any drag races against GPT-4o, but for a locally-hosted 26B MoE model that absolutely should not fit on this hardware? It's borderline miraculous.
Comparison with the standard (non-fork) llama.cpp running the smaller Gemma 4 E4B (4B dense):
| Aspect | Standard E4B | TurboQuant 26B MoE |
|---|---|---|
| Model params | 4B (dense) | 25B total, ~4B active |
| File size | 5 GB | 16.8 GB |
| Context | 128K | 262K |
| KV cache type | q8_0 (8-bit) | turbo4 (4-bit) |
| KV cache VRAM | ~6.8 GB | ~1.5 GB |
| Gen speed | ~62 tok/s | ~27 tok/s |
| RAM needed | ~6 GB overflow | ~15 GB CUDA_Host |
The trade-off is clear: you lose a bit more than half the generation speed, but you gain 2x the context window with a KV cache that's about 4.5x smaller, on a model with roughly 6x the total parameters.
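Those ratios come straight from the comparison table. A quick sketch of the arithmetic, assuming "128K" means 131,072 tokens:

```python
# Values from the comparison table; 128K is assumed to mean 128 * 1024 tokens.
e4b = {"context": 128 * 1024, "kv_gb": 6.8, "tok_s": 62, "params_b": 4}
moe = {"context": 262_144, "kv_gb": 1.5, "tok_s": 27, "params_b": 25.23}

context_gain = moe["context"] / e4b["context"]   # how much more context
kv_reduction = e4b["kv_gb"] / moe["kv_gb"]       # how much smaller the KV cache is
speed_ratio = moe["tok_s"] / e4b["tok_s"]        # fraction of generation speed kept
param_ratio = moe["params_b"] / e4b["params_b"]  # total parameter ratio

print(f"context:    {context_gain:.1f}x")
print(f"KV cache:   {kv_reduction:.1f}x smaller")
print(f"gen speed:  {speed_ratio:.0%} of E4B")
print(f"parameters: {param_ratio:.1f}x")
```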
🖼️ Step 6: Image Support (Optional)
Gemma 4 26B supports vision input. Add --mmproj and the CLIP vision encoder springs to life:
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What do you see in this image?"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
      ]
    }],
    "max_tokens": 400,
    "temperature": 0.7
  }'
Convert images to base64 with PowerShell:
[Convert]::ToBase64String([IO.File]::ReadAllBytes("cat.png"))
The vision pipeline adds ~141 MB of VRAM for CLIP compute buffers. At 262K context you can feed dozens of images before running out of tokens.
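If you'd rather not paste base64 by hand, the same request can be built in a few lines of standard-library Python. This is a sketch mirroring the curl call above: `build_image_request` and `send_chat` are hypothetical helper names, and the base URL assumes the server from Step 3 is listening on port 8080.

```python
import base64
import json
import urllib.request

def build_image_request(image_bytes, prompt, mime="image/png", max_tokens=400):
    """Build an OpenAI-style chat payload with one inline base64 image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "gemma4",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:{mime};base64,{b64}"}},
            ],
        }],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }

def send_chat(payload, base_url="http://localhost:8080/v1"):
    """POST the payload to the local server and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, read the file with `open("cat.png", "rb").read()`, pass the bytes to `build_image_request` along with your prompt, and hand the result to `send_chat`.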
🐛 Step 7: Troubleshooting (The "I Learned This The Hard Way" Section)
Crash at fattn.cu:322
GGML_ABORT("fatal error")
Cause: Incompatible K/V cache type combination for TurboQuant Flash Attention kernels.
Fix: Use --cache-type-k turbo4 --cache-type-v turbo4 (must match!) or --cache-type-k turbo4 --cache-type-v f16.
Supported combinations:
| K | V | Supported |
|---|---|---|
| turbo4 | turbo4 | ✅ |
| turbo4 | f16 | ✅ |
| f16 | turbo4 | ✅ |
| turbo4 | turbo3 | ❌ (crash) |
Yes, I spent 40 minutes debugging this before reading the actual CUDA kernel code. Learn from my pain.
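A tiny pre-flight check can encode that table so a bad pair gets caught before the server starts. This sketch only knows the four combinations listed above and reports everything else as untested; `check_cache_types` is a hypothetical helper, not part of the fork.

```python
# Combinations taken from the table above; nothing else has been verified.
SUPPORTED = {("turbo4", "turbo4"), ("turbo4", "f16"), ("f16", "turbo4")}
KNOWN_BAD = {("turbo4", "turbo3")}

def check_cache_types(k_type, v_type):
    """Classify a (K, V) cache type pair as 'ok', 'crash', or 'untested'."""
    pair = (k_type, v_type)
    if pair in SUPPORTED:
        return "ok"
    if pair in KNOWN_BAD:
        return "crash"
    return "untested"

for k, v in [("turbo4", "turbo4"), ("turbo4", "turbo3"), ("q8_0", "turbo4")]:
    print(f"K={k:<7} V={v:<7} -> {check_cache_types(k, v)}")
```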
cudaMalloc failed: out of memory
Cause: Too many experts on GPU (--n-cpu-moe too low).
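Fix: In upstream llama.cpp, --n-cpu-moe counts layers, not experts: it keeps the expert weights of the first N layers in host RAM. With 30 layers, any value of 30 or more (120 included) keeps all of them off the GPU. Go below 30 and each layer that moves onto the GPU brings all 128 of its experts with it, roughly 480 MB per layer (14,429 MB / 30). It adds up.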
Model loading takes 30-40 seconds
That's normal. --no-mmap forces explicit copy of ~14 GB into CUDA_Host pinned memory. The alternative (mmap) doesn't work with --n-cpu-moe. Grab a snack.
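Scripts that talk to the server should wait out that load time instead of failing on the first request. A minimal polling sketch: it assumes the fork keeps upstream llama-server's /health endpoint (HTTP 200 once the model is ready), and the timeout values are arbitrary.

```python
import time
import urllib.request

def wait_until_ready(probe, timeout_s=120, interval_s=2, sleep=time.sleep):
    """Call probe() until it returns True or roughly timeout_s seconds have passed."""
    waited = 0
    while waited <= timeout_s:
        if probe():
            return True
        sleep(interval_s)
        waited += interval_s
    return False

def health_probe(url="http://127.0.0.1:8080/health"):
    """True once the server answers HTTP 200, False while it is still loading."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, or a non-200 HTTP error
        return False
```

Call `wait_until_ready(health_probe)` after launching llama-server and only send requests once it returns True.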
🔗 Step 8: Integrate with OpenCode
OpenCode can connect to your local TurboQuant server as if it were OpenAI:
{
  "provider": {
    "turboquant": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Gemma4 26B MoE (TurboQuant)",
      "options": {
        "baseURL": "http://localhost:8080/v1"
      },
      "models": {
        "gemma4-26b-moe": {
          "name": "gemma4",
          "limit": {
            "context": 262144,
            "output": 16384
          },
          "modalities": {
            "input": ["text", "image"],
            "output": ["text"]
          },
          "reasoning": true,
          "tool_call": true,
          "interleaved": true
        }
      }
    }
  }
}
Then select it with /model turboquant/gemma4-26b-moe — and you're coding with a 26B MoE model that lives entirely on your desk.
🎉 Conclusion
What started as "I wonder if this fits" turned into:
- ✅ A custom-compiled llama.cpp fork with TurboQuant
- ✅ A 26B MoE model running at 27 tok/s
- ✅ 262K context window entirely in VRAM (~1.5 GB KV cache)
- ✅ 128-expert MoE offloading (expert weights in pinned host RAM, shared layers on GPU)
- ✅ Multimodal support with CLIP vision encoder
- ✅ OpenAI-compatible API endpoint
- ✅ Under 4 GB VRAM usage at full context
The key innovations that make this work:
- --n-cpu-moe 120: The killer feature. Without it, the model doesn't load. With it, you have a 26B model that behaves like a 4B model in terms of VRAM but thinks like a much larger one.
- --cache-type-k/v turbo4: 4-bit KV cache compression. At 262K context, this saves you ~4.5 GB of VRAM compared to an uncompressed f16 cache.
- ggml-cuda.dll (56.5 MB): The TurboQuant CUDA kernels handle the dequantization on-the-fly, so you never pay the memory cost of the full-precision KV cache.
Could this run on a 6 GB GPU? Possibly: the text-only setup above already sits at about 4 GB of VRAM. Could it run on a 4060 Ti 16GB? Comfortably, with room to spare. The ceiling is higher than you think.
If you want to try this yourself: clone the fork, grab the model from HuggingFace, compile with the flags above, and prepare to be surprised at how much you can squeeze out of consumer hardware.
Now if you'll excuse me, I need to explain to my RTX 3070 why it's running a model that's more than twice its VRAM capacity. It's not speaking to me.