Published May 14, 2026 • Updated Jun 10, 2026

Taming a 26B MoE Model on 8GB VRAM with TurboQuant

Forget Docker images and pre-built binaries. Here's how I compiled a custom llama.cpp fork with TurboQuant, ran a 26B Gemma 4 MoE model on a consumer RTX 3070, and squeezed 262K context into less than 4GB of VRAM.

Yes, you can run a 26-billion-parameter Mixture-of-Experts model on 8GB of VRAM. No, this is not black magic. Yes, it does involve compiling C++ code at 2 AM while questioning your life choices.

Most people would look at Gemma 4 26B (A4B) — a model with 128 experts per layer and a native 262K context window — and say "nice try, call me when you have an A100." But I'm stubborn. And I have caffeine. So I forked llama.cpp, welded in some experimental KV cache compression called TurboQuant, and turned my RTX 3070 into something that should not exist.

This is that story.


🤔 Wait, Why Not Just Use Docker?

Good question. The official llama.cpp Docker images stopped publishing CUDA builds. And even if they did, TurboQuant isn't upstream — it's sitting on a feature branch called feature/turboquant-kv-cache that adds 3-bit and 4-bit KV cache compression specifically optimized for Gemma 4's weird attention architecture (256-dim and 512-dim heads, sliding window, the works).

So: no Docker, no pre-built binaries, no shortcuts. Just me, CMake, and a mild caffeine addiction.


⚙️ The Hardware (aka "This Shouldn't Work")

Before we dive into the technical trenches, let's establish what we're working with:

| Component | Spec |
| --- | --- |
| CPU | AMD Ryzen 7 5800X3D (8c/16t) |
| RAM | 32 GB DDR4 |
| GPU | NVIDIA GeForce RTX 3070 (8 GB VRAM) |
| CUDA | 13.2 (driver 595.97) |
| Compiler | Visual Studio 2026 Community (MSVC 14.50) |
| CMake | 4.3.2 |

If you're reading this thinking "that's less VRAM than my phone," you're right. We are about to park a semi-truck in a compact car space.


🧬 The Model: Gemma 4 26B MoE (A4B)

This isn't your average 7B dense model. Gemma 4 26B is a Mixture-of-Experts beast:

| Property | Value |
| --- | --- |
| Total parameters | 25.23 billion |
| Active parameters | ~4 billion (8 experts out of 128) |
| Layers | 30 |
| Experts per layer | 128 |
| Context window | 262,144 tokens |
| Attention | 16 heads, KV heads 2-8 per layer |
| Head dims | K=512 / V=512 (full), K=256 / V=256 (SWA) |
| Sliding window | 1024 tokens (25 SWA layers, 5 full-attention) |
| File size (Q4_K_M) | 16.8 GB |

The secret sauce: only 8 experts are active per token. The other 120 are just... there. On disk. In RAM. Waiting for their moment. We exploit this brutally.


🍴 The Fork: What Exactly Is TurboQuant?

TurboQuant is a set of custom CUDA kernels and memory layouts that compress the KV cache to 3 or 4 bits per element instead of the standard 8 or 16. It lives on a feature branch of llama.cpp:

Repo: test1111111111111112/llama-cpp-turboquant-gemma4 (branch: feature/turboquant-kv-cache)

Branch: feature/turboquant-kv-cache
Commit: e93b7c56f
Key patches:
  - Gemma 4 D=256/512 head dimension support (attn_iswa)
  - Lazy K/V quantization (decompress on-the-fly)
  - Batch decode optimization
  - Warp-cooperative write for MoE gating

The net effect? At 262K context with q8_0 KV cache, you'd need roughly 6 GB just for the KV cache. With turbo4, that drops to ~1.5 GB. That's the difference between "cudaMalloc failed: out of memory" and "oh, I still have 4 GB free."

⚠️ Caveat emptor: TurboQuant is not upstream. It may never merge. It is experimental, opinionated, and requires compiling from source. If you were hoping for pip install happiness, this is not your stop.


🔨 Step 1: Clone, Configure, Compile

Clone the fork

powershell
git clone https://github.com/test1111111111111112/llama-cpp-turboquant-gemma4.git C:\Users\<your-user>\llama-cpp-turboquant-gemma4
cd C:\Users\<your-user>\llama-cpp-turboquant-gemma4
git checkout feature/turboquant-kv-cache

Configure with CMake

This is where things get spicy. TurboQuant needs specific CUDA arch flags:

powershell
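mkdir build
cd build

# No -G flag: CMake picks the installed Visual Studio generator,
# which matches the MSVC toolchain above and --config Release below.
cmake .. `
  -DGGML_CUDA=ON `
  -DCMAKE_CUDA_ARCHITECTURES=86 `
  -DGGML_CUDA_FA=ON `
  -DGGML_CUDA_GRAPHS=ON `
  -DGGML_CUDA_COMPRESSION_MODE=size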

What each flag does:

| Flag | Why |
| --- | --- |
| GGML_CUDA=ON | "Yes, I have an NVIDIA GPU. Please use it." |
| CMAKE_CUDA_ARCHITECTURES=86 | Target sm_86 (RTX 3070/3080/3090). Don't set this to 75 unless you enjoy debugging segfaults. |
| GGML_CUDA_FA=ON | Flash Attention. Without this, the VRAM math doesn't close. |
| GGML_CUDA_GRAPHS=ON | CUDA graphs reduce kernel launch overhead for repeated operations. Free perf. |
| GGML_CUDA_COMPRESSION_MODE=size | In upstream llama.cpp this is passed to nvcc as the binary compression mode (needs CUDA 12.8+). It shrinks the compiled kernels inside ggml-cuda.dll; it does not change VRAM use at runtime. |

Build

powershell
cmake --build . --config Release -j 16

This takes about 5-8 minutes on a 5800X3D. Go make coffee. Or two. The first compile is always "exciting" — a euphemism for "I hope CMake didn't silently ignore my CUDA flags again."

Binary output:

| File | Size | Role |
| --- | --- | --- |
| llama-server.exe | 8.4 MB | HTTP server (OpenAI-compatible) |
| ggml-cuda.dll | 56.5 MB | TurboQuant CUDA kernels |
| llama.dll | 2.5 MB | Core library |
| ggml.dll | 68 KB | Backend dispatcher |

If you see ggml-cuda.dll at ~56 MB, congratulations: the TurboQuant kernels compiled successfully. If it's 5 MB... you're running on CPU and you won't even notice until you try loading the model and your VRAM sits at 0%.
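If you'd rather not eyeball file sizes, the same sanity check can be scripted. This is a minimal sketch, not part of the fork: the 20 MB cutoff is my own assumption (anything between the ~5 MB and ~56 MB figures above would do), and the `build\bin\Release` path is hypothetical and depends on your generator.

```python
from pathlib import Path

# Assumed cutoff: the article reports ~56 MB for a CUDA build and ~5 MB for a
# CPU-only one, so anything under 20 MB is treated as suspicious.
MIN_CUDA_DLL_BYTES = 20 * 1024 * 1024


def looks_like_cuda_build(size_bytes: int, threshold: int = MIN_CUDA_DLL_BYTES) -> bool:
    """Return True when the DLL is large enough to plausibly hold CUDA kernels."""
    return size_bytes >= threshold


def check_build(bin_dir: str) -> str:
    """Describe the state of ggml-cuda.dll inside a build output directory."""
    dll = Path(bin_dir) / "ggml-cuda.dll"
    if not dll.is_file():
        return "missing: ggml-cuda.dll not found"
    size = dll.stat().st_size
    verdict = "ok" if looks_like_cuda_build(size) else "suspicious"
    return f"{verdict}: ggml-cuda.dll is {size / 1_000_000:.1f} MB"


if __name__ == "__main__":
    # Hypothetical output directory; adjust to wherever your build puts binaries.
    print(check_build(r"build\bin\Release"))
```

A "suspicious" verdict only means the DLL is small; the real confirmation is still VRAM usage climbing when the model loads.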


📥 Step 2: Get the Model

powershell
# LM Studio users (model already downloaded):
# Path: ~/.lmstudio/models/lmstudio-community/gemma-4-26B-A4B-it-GGUF/

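# Direct download (16.8 GB, get a snack). curl.exe ships with Windows 10 and later;
# -L follows redirects and -O keeps the remote file name:
curl.exe -L -O https://huggingface.co/bartowski/gemma-4-26B-A4B-it-GGUF/resolve/main/gemma-4-26B-A4B-it-Q4_K_M.gguf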

For image support, also grab the multimodal projector:

powershell
curl.exe -L -O https://huggingface.co/bartowski/gemma-4-26B-A4B-it-GGUF/resolve/main/mmproj-gemma-4-26B-A4B-it-BF16.gguf

This adds ~1.1 GB and consumes an extra 141 MB of VRAM at runtime. Totally worth it for the "describe this meme" feature.


🚀 Step 3: Launch the Server

Here's the command. Take a deep breath — it's long, but every flag earns its keep:

powershell
C:\path\to\llama-server.exe `
  -m "gemma-4-26B-A4B-it-Q4_K_M.gguf" `
  --mmproj "mmproj-gemma-4-26B-A4B-it-BF16.gguf" `
  --host 127.0.0.1 --port 8080 `
  -fa auto -ngl 999 `
  --n-cpu-moe 120 `
  --no-mmap --mlock `
  --cache-type-k turbo4 `
  --cache-type-v turbo4 `
  --ctx-size 262144

Flag-by-flag breakdown

| Flag | What it does | Why you need it |
| --- | --- | --- |
| -ngl 999 | Offload ALL layers to GPU | "999" = "yes, all 30 layers, stop asking" |
| --n-cpu-moe 120 | Keep MoE expert weights in CPU memory | This is the trick. In upstream llama.cpp the value counts layers, not experts: it keeps the expert weights of the first N layers on the CPU. With 30 layers, 120 covers all of them, so all ~14 GB of experts sit in pinned RAM (CUDA_Host) and only the ~1.6 GB of attention and norm weights live on the GPU. |
| --no-mmap | Disable memory mapping | Required with --n-cpu-moe. Experts on CPU must be explicitly copied into CUDA_Host buffers. |
| --mlock | Pin memory, prevent swap | Those 14 GB of CUDA_Host tensors better not touch the page file. |
| --cache-type-k turbo4 | 4-bit KV cache for keys | Drops KV cache from 6 GB to 1.5 GB at 262K context. |
| --cache-type-v turbo4 | 4-bit KV cache for values | Should match K (f16 also works). Mixing turbo4 K with turbo3 V = crash at fattn.cu:322. Don't ask how I know. |
| -fa auto | Flash Attention | Reduces VRAM for attention computation. TurboQuant forces this internally anyway. |
| --ctx-size 262144 | 262K context | Native limit for Gemma 4 26B. Yes, the entire KV cache fits in VRAM now. |

🧠 The MoE offload trick explained:
In a dense 26B model, every parameter is used for every token, so all of it wants to be on the GPU and you're dead. In a MoE model, only the shared layers (attention, norms) are used for every token; each token touches just 8 of the 128 experts per layer. That makes the expert weights the natural thing to leave in CUDA_Host, pinned system RAM that the GPU can reach over PCIe. When the router picks different experts for the next token, nothing has to be loaded from disk, because every expert is already resident in host memory. The result: 14 GB of model weights that never occupy VRAM.
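To make the "8 of 128" idea concrete, here is a toy top-k router. It is a generic illustration of MoE gating, not Gemma 4's actual router and not code from llama.cpp: the random logits stand in for a learned gating layer, and only the 128 and 8 come from the model table.

```python
import heapq
import math
import random

N_EXPERTS = 128   # experts per layer, from the model table
TOP_K = 8         # experts activated per token, from the model table


def route(logits: list[float], k: int = TOP_K) -> dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalise their scores."""
    top = heapq.nlargest(k, range(len(logits)), key=logits.__getitem__)
    peak = max(logits[i] for i in top)          # subtract max for stability
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


if __name__ == "__main__":
    rng = random.Random(0)
    touched = set()
    for _ in range(4):
        # Random scores stand in for the output of a learned gating layer.
        logits = [rng.gauss(0.0, 1.0) for _ in range(N_EXPERTS)]
        chosen = route(logits)
        touched.update(chosen)
        print(sorted(chosen))
    print(f"{len(touched)} of {N_EXPERTS} experts touched across 4 tokens")
```

Each token picks a different set of 8, which is why every expert has to be resident somewhere: you can't know ahead of time which ones the next token will want.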


📊 Step 4: Memory Allocation (Where Did the VRAM Go?)

Here's the breakdown at 262K context with turbo4 KV cache:

| Buffer | Location | Size | Notes |
| --- | --- | --- | --- |
| Model (attention + norms) | CUDA0 | 1,574 MB | Non-expert layers always on GPU |
| Model (128 experts) | CUDA_Host | 14,429 MB | Expert tensors in pinned RAM |
| Model (output layer) | CUDA0 | (included above) | |
| Model (CPU tensors) | CPU | 578 MB | Embeddings, tokenizer |
| KV cache non-SWA | CUDA0 | 1,240 MB | 262K cells × 5 full-attention layers |
| KV cache SWA | CUDA0 | 218 MB | 4.6K cells × 25 SWA layers (window=1024) |
| Compute buffers | CUDA0 | 964 MB | Temporary GPU workspace |
| CLIP/vision (opt) | CUDA0 | 141 MB | Only with --mmproj |
| Compute buffers | CUDA_Host | 532 MB | |
| Total VRAM (text) | CUDA0 | ~3,996 MB | Under 4 GB! |
| Total VRAM (+images) | CUDA0 | ~4,137 MB | Still under 5 GB! |

Yes, you read that right. A 26B MoE model with a 262K context window fits in under 4 GB of actual VRAM. The magic is entirely in --n-cpu-moe and turbo4 KV cache.
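The two KV cache rows can be roughly reproduced from the model's shape. This is a back-of-envelope sketch under assumptions of mine: 2 KV heads on the full-attention layers and 8 on the SWA layers (picked from the "2-8 per layer" range), 4,608 cells standing in for "4.6K", and exactly 4 bits per element with no block overhead. Layer counts and head dims come from the tables.

```python
def kv_cache_mib(cells: int, layers: int, kv_heads: int, head_dim: int,
                 bits_per_element: float) -> float:
    """Size of K plus V for one group of layers, in MiB."""
    elements = cells * layers * kv_heads * head_dim * 2  # 2 = K and V
    return elements * bits_per_element / 8 / (1024 ** 2)


# Full-attention layers: 262,144 cells, 5 layers, head dim 512 (from the tables).
# kv_heads=2 is an assumption.
full = kv_cache_mib(262_144, 5, kv_heads=2, head_dim=512, bits_per_element=4.0)

# Sliding-window layers: 25 layers, head dim 256 (from the tables).
# 4,608 cells and kv_heads=8 are assumptions.
swa = kv_cache_mib(4_608, 25, kv_heads=8, head_dim=256, bits_per_element=4.0)

print(f"full-attention: {full:.0f} MiB, SWA: {swa:.0f} MiB, total: {full + swa:.0f} MiB")
```

Under those assumptions the estimate comes out at about 1,280 MiB and 225 MiB, within a few percent of the 1,240 MB and 218 MB the server reported. The takeaway is that the full-attention layers dominate: the sliding-window layers only ever cache a few thousand cells, no matter how long the context gets.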


🏎️ Step 5: Benchmarks

Here's what I measured on the RTX 3070 / 5800X3D:

| Metric | Value | Notes |
| --- | --- | --- |
| Prompt processing | ~83 tok/s | Text-only |
| Prompt processing (images) | ~170 tok/s | Vision encoder + CLIP compute |
| Token generation | ~27 tok/s | Consistent regardless of context or modality |
| Model loading | 30-40 seconds | --no-mmap requires explicit copy of 14 GB |
| Warmup | ~5 seconds | First inference after load |
| Free VRAM after load | ~4,054 MB | Out of 8,191 MB total |

27 tok/s isn't going to win any drag races against GPT-4o, but for a locally-hosted 26B MoE model that absolutely should not fit on this hardware? It's borderline miraculous.
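The "Free VRAM after load" figure can be cross-checked against the Step 4 memory breakdown. A small sketch that just adds up the CUDA0 rows; every number is copied from the tables, and only the variable and function names are mine.

```python
# CUDA0 buffers from the memory table, in MB.
CUDA0_BUFFERS_MB = {
    "model (attention + norms)": 1574,
    "kv cache (non-SWA)": 1240,
    "kv cache (SWA)": 218,
    "compute buffers": 964,
}
VISION_MB = 141        # CLIP buffers, only with --mmproj
TOTAL_VRAM_MB = 8191   # reported total for the RTX 3070


def vram_used_mb(with_vision: bool) -> int:
    """Sum the GPU-resident buffers, optionally including the vision encoder."""
    return sum(CUDA0_BUFFERS_MB.values()) + (VISION_MB if with_vision else 0)


text_only = vram_used_mb(with_vision=False)
with_images = vram_used_mb(with_vision=True)
print(f"text only:   {text_only} MB")
print(f"with images: {with_images} MB")
print(f"free VRAM:   {TOTAL_VRAM_MB - with_images} MB")
```

The free-VRAM number in the benchmark table matches the with-images total, so that measurement was taken with --mmproj loaded.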

Comparison with the standard (non-fork) llama.cpp running the smaller Gemma 4 E4B (4B dense):

| Aspect | Standard E4B | TurboQuant 26B MoE |
| --- | --- | --- |
| Model params | 4B (dense) | 25B total, ~4B active |
| File size | 5 GB | 16.8 GB |
| Context | 128K | 262K |
| KV cache type | q8_0 (8-bit) | turbo4 (4-bit) |
| KV cache VRAM | ~6.8 GB | ~1.5 GB |
| Gen speed | ~62 tok/s | ~27 tok/s |
| RAM needed | ~6 GB overflow | ~15 GB CUDA_Host |

The trade-off is clear: generation runs at a bit less than half the speed, but you get 2x the context window with 4.5x less KV cache VRAM, on a model with roughly 6x the total parameters.
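For anyone who wants to check the multipliers, they fall straight out of the comparison table. A tiny sketch; the dictionary keys are names I made up, and the values are the table's rounded figures.

```python
# Figures copied from the comparison table (rounded as published).
e4b = {"params_b": 4.0, "context_k": 128, "kv_cache_gb": 6.8, "gen_tok_s": 62.0}
moe = {"params_b": 25.0, "context_k": 262, "kv_cache_gb": 1.5, "gen_tok_s": 27.0}


def ratio(key: str, numerator: dict, denominator: dict) -> float:
    """Ratio of one table column between the two setups."""
    return numerator[key] / denominator[key]


print(f"context window:   {ratio('context_k', moe, e4b):.2f}x larger")
print(f"KV cache VRAM:    {ratio('kv_cache_gb', e4b, moe):.2f}x smaller")
print(f"total parameters: {ratio('params_b', moe, e4b):.2f}x more")
print(f"generation speed: {ratio('gen_tok_s', moe, e4b):.2f}x of E4B")
```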


🖼️ Step 6: Image Support (Optional)

Gemma 4 26B supports vision input. Add --mmproj and the CLIP vision encoder springs to life:

bash
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What do you see in this image?"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
      ]
    }],
    "max_tokens": 400,
    "temperature": 0.7
  }'

Convert images to base64 with PowerShell:

powershell
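# Resolve-Path gives .NET an absolute path, so the file is found
# even when the process working directory differs from the PowerShell location.
[Convert]::ToBase64String([IO.File]::ReadAllBytes((Resolve-Path "cat.png").Path))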

The vision pipeline adds ~141 MB of VRAM for CLIP compute buffers. At 262K context you can feed dozens of images before running out of tokens.
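The curl call and the base64 step can be folded into one script. A minimal sketch using only the Python standard library: it builds the same JSON body as the curl example and posts it to the same endpoint. The function names are mine, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` shape.

```python
import base64
import json
import urllib.request


def build_image_request(image_bytes: bytes, prompt: str, mime: str = "image/png") -> dict:
    """Build a chat-completions body with one text part and one inline image."""
    data_url = f"data:{mime};base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "gemma4",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
        "max_tokens": 400,
        "temperature": 0.7,
    }


def describe_image(path: str, base_url: str = "http://localhost:8080/v1") -> str:
    """Send an image file to the local server and return the model's reply."""
    with open(path, "rb") as fh:
        payload = build_image_request(fh.read(), "What do you see in this image?")
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        body = json.load(resp)
    # Assumes an OpenAI-style response shape.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(describe_image("cat.png"))` should give you the same answer as the curl version.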


🐛 Step 7: Troubleshooting (The "I Learned This The Hard Way" Section)

Crash at fattn.cu:322

GGML_ABORT("fatal error")

Cause: Incompatible K/V cache type combination for TurboQuant Flash Attention kernels.
Fix: Use --cache-type-k turbo4 --cache-type-v turbo4 (must match!) or --cache-type-k turbo4 --cache-type-v f16.

Supported combinations:

| K | V | Supported |
| --- | --- | --- |
| turbo4 | turbo4 | ✅ |
| turbo4 | f16 | ✅ |
| f16 | turbo4 | ✅ |
| turbo4 | turbo3 | ❌ (crash) |

Yes, I spent 40 minutes debugging this before reading the actual CUDA kernel code. Learn from my pain.
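If you launch the server from a script, a pre-flight check is cheaper than another 40 minutes in the debugger. This sketch only encodes the combinations listed in the table; pairs that aren't listed (f16 with f16, for instance) are reported as "not listed" rather than as broken, since I haven't documented them here. The function name is hypothetical.

```python
# Combinations marked as working in the table above.
LISTED_KV_PAIRS = {
    ("turbo4", "turbo4"),
    ("turbo4", "f16"),
    ("f16", "turbo4"),
}


def is_listed_pair(cache_type_k: str, cache_type_v: str) -> bool:
    """Return True when the K/V cache types appear in the supported table."""
    return (cache_type_k, cache_type_v) in LISTED_KV_PAIRS


for k, v in [("turbo4", "turbo4"), ("turbo4", "turbo3")]:
    status = "listed as supported" if is_listed_pair(k, v) else "not listed, expect trouble"
    print(f"--cache-type-k {k} --cache-type-v {v}: {status}")
```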

cudaMalloc failed: out of memory

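Cause: Too many expert weights on GPU (--n-cpu-moe too low).
Fix: In upstream llama.cpp, --n-cpu-moe N keeps the expert weights of the first N layers on the CPU, so with 30 layers any value of 30 or more (including 120) keeps every expert off the GPU. Each step below 30 moves one more layer's experts into VRAM, roughly 480 MB per layer (14,429 MB / 30 layers). It adds up.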

Model loading takes 30-40 seconds

That's normal. --no-mmap forces explicit copy of ~14 GB into CUDA_Host pinned memory. The alternative (mmap) doesn't work with --n-cpu-moe. Grab a snack.
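Rather than guessing when the load has finished, a script can poll the server. Upstream llama-server exposes a `/health` endpoint that returns HTTP 200 once the model is ready; this sketch assumes the fork kept it. The function names, the 120-second timeout and the one-second interval are my own choices.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def server_is_ready(base_url: str) -> bool:
    """Probe /health; anything other than HTTP 200 counts as not ready."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an error status while still loading.
        return False


def wait_until_ready(base_url: str, timeout_s: float = 120.0, interval_s: float = 1.0,
                     probe: Callable[[str], bool] = server_is_ready,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until the probe succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(base_url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    ok = wait_until_ready("http://127.0.0.1:8080")
    print("server ready" if ok else "gave up waiting")
```

The `probe` and `sleep` parameters are there so the loop can be exercised without a running server.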


🔗 Step 8: Integrate with OpenCode

OpenCode can connect to your local TurboQuant server as if it were OpenAI:

json
{
  "provider": {
    "turboquant": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Gemma4 26B MoE (TurboQuant)",
      "options": {
        "baseURL": "http://localhost:8080/v1"
      },
      "models": {
        "gemma4-26b-moe": {
          "name": "gemma4",
          "limit": {
            "context": 262144,
            "output": 16384
          },
          "modalities": {
            "input": ["text", "image"],
            "output": ["text"]
          },
          "reasoning": true,
          "tool_call": true,
          "interleaved": true
        }
      }
    }
  }
}

Then select it with /model turboquant/gemma4-26b-moe — and you're coding with a 26B MoE model that lives entirely on your desk.


🎉 Conclusion

What started as "I wonder if this fits" turned into:

  • ✅ A custom-compiled llama.cpp fork with TurboQuant
  • ✅ A 26B MoE model running at 27 tok/s
  • ✅ 262K context window entirely in VRAM (~1.5 GB KV cache)
  • ✅ 128-expert MoE offloading (expert weights in pinned RAM, shared layers on GPU)
  • ✅ Multimodal support with CLIP vision encoder
  • ✅ OpenAI-compatible API endpoint
  • ✅ Under 4 GB VRAM usage at full context

The key innovations that make this work:

  1. --n-cpu-moe 120: The killer feature. Without it, the model doesn't load. With it, you have a 26B model that behaves like a 4B model in terms of VRAM but thinks like a much larger one.

  2. --cache-type-k/v turbo4: 4-bit KV cache compression. At 262K context, this saves you ~4.5 GB of VRAM compared to q8_0.

  3. ggml-cuda.dll (56.5 MB): The TurboQuant CUDA kernels handle the dequantization on-the-fly, so you never pay the memory cost of the full-precision KV cache.

Could this run on a 6 GB GPU? Possibly: the measured footprint is about 4 GB at full context, and a smaller --ctx-size would shrink the KV cache further. Could it run on a 4060 Ti 16GB? Comfortably, with room to spare. The ceiling is higher than you think.

If you want to try this yourself: clone the fork, grab the model from HuggingFace, compile with the flags above, and prepare to be surprised at how much you can squeeze out of consumer hardware.


Now if you'll excuse me, I need to explain to my RTX 3070 why it's running a model that's twice its VRAM capacity. It's not speaking to me.

WRITTEN BY

Luca

Exploring the future of quality assurance and testing automation through deep technical insights.