AI, LLM
1 min read
TurboQuant + MTP: Get 40.6 tok/s Out of Qwen3.6
How to build llama.cpp with TurboQuant and MTP and run it on consumer hardware: 32GB of RAM and a GPU with 8GB of VRAM.
Browse all articles tagged with llama.cpp. Found 3 articles covering this topic.
Forget Docker images and pre-built binaries. Here's how I compiled a custom llama.cpp fork with TurboQuant, ran a 26B Gemma 4 MoE model on a consumer RTX 3070, and squeezed 262K context into less than 4GB of VRAM.
A 30B model on an 8GB GPU sounds impossible, but quantization and llama.cpp make it work. This guide shows how to run it with Docker and use it in OpenCode.