AI, LLM
TurboQuant + MTP: Get 40.6 tok/s Out of Qwen3.6
How to build llama.cpp with TurboQuant and MTP and run it on consumer hardware: 32GB of RAM and a GPU with 8GB of VRAM.
Browse all articles tagged with AI. Found 4 articles covering this topic.
Forget Docker images and pre-built binaries. Here's how I compiled a custom llama.cpp fork with TurboQuant, ran a 26B Gemma 4 MoE model on a consumer RTX 3070, and squeezed 262K context into less than 4GB of VRAM.
A 30B model on an 8GB GPU sounds impossible, but quantization and llama.cpp make it work. This guide shows how to run it with Docker and use it in OpenCode.
A private, self-healing Playwright test loop on WSL2, built with llama.cpp running in Docker, GPU acceleration, and OpenCode agents.