Published Mar 16, 2026 • Updated Apr 22, 2026

Deploying Qwen3-Coder-30B-A3B on 8GB GPU with Docker

A 30B model on an 8GB GPU sounds impossible, but quantization and llama.cpp make it work. This guide shows how to run it with Docker and use it in OpenCode.


Yes, you can run a 30B model on “normal‑people hardware”. No, it shouldn’t work. Yes, it does anyway.

Most people assume that running a 30‑billion‑parameter LLM requires a data center, a nuclear reactor, or at least a GPU that costs more than your car. But thanks to quantization, clever engineering, and a bit of “LLM black magic”, you can run Qwen3‑Coder‑30B‑A3B on a humble RTX 3070 with just 8GB of VRAM.

And the best part? No Conda environments. No Python dependency hell. No “why is pip uninstalling my system packages?” moments.

Just Docker, a quantized model, and your machine.

⚙️ Prerequisites

You don’t need a supercomputer, but you do need:

  • 8GB+ VRAM Nvidia GPU (CUDA required — sorry AMD friends)
  • 32GB+ RAM (because the CPU will be doing a lot of heavy lifting)
  • NVMe/SSD (loading 30B parameters from a spinning HDD is a form of self‑harm)

If you have these, congratulations: you’re ready to run a model that absolutely should not fit on your hardware.

📥 Step 1: Download the Quantized Model

This is the “diet version” of the model — the one that makes the impossible possible.

```bash
mkdir -p ~/models
wget https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF/resolve/main/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf -P ~/models
```

Why Q4_K_M?

Because it’s the quantization format that says:

“I’ll keep the model smart enough to code, but small enough to not set your GPU on fire.”
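As a rough back-of-envelope check (assuming ~4.85 bits per weight on average for Q4_K_M, which mixes 4- and 6-bit blocks plus scales; the real GGUF will differ slightly), the quantized file lands around 18–19 GB, still far bigger than 8GB of VRAM, which is why the offloading tricks in the next step matter:

```python
# Back-of-envelope size estimate for a Q4_K_M quantization.
# Assumption: ~4.85 bits per weight on average; the actual GGUF
# on Hugging Face will be slightly different.
params = 30.5e9          # Qwen3-Coder-30B-A3B total parameter count
bits_per_weight = 4.85
size_gb = params * bits_per_weight / 8 / 1e9
print(f"~{size_gb:.1f} GB on disk")  # prints: ~18.5 GB on disk
```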

🐳 Step 2: Launch the Model with Docker

This is where the magic happens. This command spins up a full LLM server using llama.cpp with CUDA acceleration.

```bash
docker run --name qwen3-coder-docker -d \
  --gpus all \
  -p 8080:8080 \
  -v ~/models:/models \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  --model /models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
  --no-mmproj \
  --jinja --ctx-size 32768 \
  --fit on \
  --kv-unified \
  --split-mode none \
  --flash-attn on \
  --gpu-layers 35 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --override-tensor ".ffn_(up|down)_exps.=CPU" \
  --host 0.0.0.0 --port 8080 \
  --temp 0.8 \
  --top-k 40 \
  --top-p 0.95 \
  --min-p 0.05 \
  --repeat-last-n 64 \
  --repeat-penalty 1.0 \
  --dry-multiplier 0 \
  --samplers "penalties,dry,top_n_sigma,top_k,typ_p,top_p,min_p,xtc,temperature" \
  --alias "qwen3-coder-docker"
```
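Once the container is up (the first load can take a minute or two), it's worth smoke-testing the server before wiring up any tooling. The llama.cpp server exposes a `/health` endpoint and an OpenAI-compatible `/v1/chat/completions` route; here's a standard-library sketch (the model name must match the `--alias` above):

```python
import json
import urllib.error
import urllib.request

BASE = "http://localhost:8080"

def server_ready(base: str = BASE) -> bool:
    """Return True once llama.cpp reports the model is loaded."""
    try:
        with urllib.request.urlopen(base + "/health", timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def ask(prompt: str, base: str = BASE) -> str:
    """Send one chat turn to the OpenAI-compatible endpoint."""
    payload = {
        "model": "qwen3-coder-docker",  # must match --alias
        "messages": [{"role": "user", "content": prompt}],
    }
    req = urllib.request.Request(
        base + "/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    if server_ready():
        print(ask("Write a Python one-liner that reverses a string."))
    else:
        print("Server not ready yet; give the model a moment to load.")
```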

🧩 Why This Works (Even Though It Shouldn’t)

Running a 30B model on 8GB VRAM sounds like a joke, but here’s why it actually works:

  • Quantization (Q4_K_M)
    Shrinks the model dramatically while keeping it surprisingly capable.

  • Partial GPU Offload (--gpu-layers 35)
    Only 35 of the model's layers run on the GPU. The rest run on the CPU, which is slower but has far more memory to work with.

  • Tensor Overrides (--override-tensor)
    Forces the biggest tensors (FFN expansions) onto the CPU. This avoids VRAM overflow — the enemy of all 8GB GPUs.

  • Flash Attention (--flash-attn on)
    Faster, more memory‑efficient attention. Basically: “Do more with less.”

  • Unified KV Cache
    Reduces fragmentation and improves memory reuse.

  • Split‑mode none
    Keeps memory layout predictable and efficient.

The result? A 30B model that runs at 8–10 tokens per second on consumer hardware. Not bad for something that shouldn’t fit in the first place.
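To see why the KV cache settings matter, here's a hedged back-of-envelope calculation, assuming the published Qwen3-30B-A3B shape (48 layers, 4 KV heads via GQA, head dimension 128) and roughly 1 byte per element for q8_0, ignoring block scales:

```python
# Approximate KV-cache footprint at 32768 tokens of context.
# Assumptions: 48 layers, 4 KV heads (GQA), head_dim 128 -- the
# published Qwen3-30B-A3B shape -- and ~1 byte/element for q8_0.
layers, kv_heads, head_dim, ctx = 48, 4, 128, 32768

def kv_bytes(bytes_per_elem: float) -> float:
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem  # 2 = K + V

print(f"q8_0 cache: ~{kv_bytes(1.0) / 2**30:.1f} GiB")  # prints: q8_0 cache: ~1.5 GiB
print(f"f16  cache: ~{kv_bytes(2.0) / 2**30:.1f} GiB")  # prints: f16  cache: ~3.0 GiB
```

Quantizing the cache to q8_0 roughly halves it versus f16, saving about 1.5 GiB, which is a meaningful chunk when the whole budget is 8GB.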

🧩 Step 3: Configure OpenCode

OpenCode can treat your Docker container like a local OpenAI‑compatible endpoint. Add this to your configuration:

```json
"provider": {
  "docker": {
    "npm": "@ai-sdk/openai-compatible",
    "name": "Docker",
    "options": {
      "baseURL": "http://localhost:8080"
    },
    "models": {
      "qwen3-coder-docker": {
        "name": "qwen3-coder-docker",
        "modalities": {
          "input": ["text"],
          "output": ["text"]
        },
        "interleaved": true,
        "tool_call": true
      }
    }
  }
}
```
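A common snag is a mismatch between the model key in this config and the server's `--alias`. The llama.cpp server exposes a `/v1/models` endpoint you can use to double-check which name it's actually serving (standard-library sketch; adjust the URL if you changed the port):

```python
import json
import urllib.error
import urllib.request

def served_models(base: str = "http://localhost:8080") -> list:
    """List the model ids the llama.cpp server advertises, or [] if unreachable."""
    try:
        with urllib.request.urlopen(base + "/v1/models", timeout=5) as resp:
            return [m["id"] for m in json.load(resp)["data"]]
    except (urllib.error.URLError, OSError, KeyError, ValueError):
        return []

print(served_models())  # expect the --alias name when the container is up
```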

Now OpenCode cheerfully sends prompts to your local model, not to the cloud.

No API keys, no rate limits, no billing surprises (unless your electric bill develops a personality).

📊 Step 4: Monitor GPU Usage

You will want to keep an eye on your GPU, because it will be working hard.

  • Use nvtop for a live graphical view: nvtop
  • Or use nvidia-smi with watch: watch -n 1 nvidia-smi

These tools help you verify that your GPU is not melting and that VRAM stays under the 8GB limit.
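If you'd rather log VRAM over time than eyeball it, nvidia-smi's query mode emits machine-readable CSV (e.g. `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader`). A small parser sketch, assuming that output format:

```python
def parse_mem(csv_line: str):
    """Parse one nvidia-smi CSV line like '7028 MiB, 8192 MiB'
    into (used_mib, total_mib, fraction_used)."""
    used, total = (int(field.split()[0]) for field in csv_line.split(","))
    return used, total, used / total

used, total, frac = parse_mem("7028 MiB, 8192 MiB")
print(f"{used}/{total} MiB ({frac:.0%})")  # prints: 7028/8192 MiB (86%)
```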

📈 Benchmarks

  • VRAM under load: ~7028MiB / 8192MiB
  • Throughput: ~8–10 tokens/s

🎉 Conclusion

Running Qwen3‑Coder‑30B‑A3B on an 8GB GPU feels like cheating physics, but with quantization, partial GPU offload, and the efficiency of llama.cpp, it’s absolutely doable. With Docker and OpenCode integration, you can now use this powerful model locally with minimal resources.

You get:

  • A powerful 30B coding model
  • Running locally
  • With Docker
  • On consumer hardware
  • Without touching Python

If you want to push this further, you could explore an f16 KV cache, CPU‑only modes, or even multi‑GPU setups, but this setup already gives you a shockingly capable local LLM workstation. One caveat: I don't recommend a Q4 (or smaller) KV cache if you want to use tools, because in my experience tool calling breaks with it.

WRITTEN BY

Luca

Exploring the future of quality assurance and testing automation through deep technical insights.