Your AI test suite shouldn't need a credit card. This guide builds a fully private pipeline on WSL2: llama.cpp in Docker with CUDA acceleration, OpenCode for agent orchestration, and Playwright for browser automation. Tests that break get fixed automatically. No cloud APIs, no bills.
AI-generated Playwright tests fail constantly not because the technology is bad, but because we use it like a crutch instead of a circulatory system. You're handed a brittle script and left to debug it alone. The real power emerges when you architect a closed loop where the AI agent generates tests, observes their execution, and autonomously repairs failures, all without leaking sensitive data to the cloud or racking up API bills. This isn't futuristic speculation; it's a practical pipeline you can build today using WSL2, a local LLM, and Playwright's built-in agents.
We'll construct this system in three distinct phases. First, the baseline: an isolated, reproducible environment on WSL2 with all necessary runtimes. Second, the improvement: wiring a local LLM into Playwright's Model Context Protocol (MCP) to generate actionable test code from plain English. Third, production-hardening: implementing the healer agent for automatic repair and slotting the whole loop into a CI/CD pipeline that runs on commit. By the end, your test suite will possess a rudimentary form of immunity.
Laying the Unbreakable Foundation: WSL2 and Prerequisites
Your local machine is a liability if your setup isn't replicable. WSL2 provides a consistent Linux environment on Windows, crucial for running Linux-optimized local LLM servers and ensuring your pipeline works the same everywhere. Skip this and you'll waste hours debugging "it works on my machine" failures when moving to CI.
Install WSL2 with Ubuntu from an elevated PowerShell, then install the core dependencies. This isn't just about having Node.js; it's about pinning versions so your automation doesn't break on a random update.
OFC if you're already on a UNIX-based OS, you don't need WSL. Gratz, you're already in the cool kids' club. Still, most of the points in this article will apply to you too.
3.. 2.. 1..
# In PowerShell (Admin)
wsl --install -d Ubuntu
# After reboot, launch Ubuntu from Start menu, then run:
sudo apt update && sudo apt upgrade -y
sudo apt install -y nodejs npm python3 python3-pip docker.io
# Install Playwright and its browsers globally to avoid project-specific issues
npm init -y
npm install --global @playwright/test
npx playwright install
The docker.io package is for running llama.cpp in a container later, which simplifies GPU management and model serving. Global Playwright installation ensures the CLI and agents are always available, reducing path errors.
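Two small quality-of-life steps worth doing right after installing docker.io on WSL2 (both are standard Docker/WSL commands):

```shell
# Let your user run docker without sudo (takes effect after you log out and back in)
sudo usermod -aG docker "$USER"
# On WSL2 the Docker daemon may not start automatically unless systemd is enabled
# in /etc/wsl.conf; start it manually if `docker info` complains:
sudo service docker start
```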
Gotcha #1: Don't use the default Node.js version from Ubuntu's repo; it's often outdated. Instead, use NodeSource for a current LTS version. The Playwright agents may fail silently with older Node.
# Replace the default Node.js
curl -fsSL https://deb.nodesource.com/setup_lts.x | sudo -E bash -
sudo apt-get install -y nodejs
node --version # Should output v20.x or higher
Powering the Brain: llama.cpp via Docker with GPU Acceleration
Cloud LLM APIs are expensive and problematic for proprietary application data. llama.cpp is a high-performance C++ inference engine that runs quantized models locally with excellent GPU utilization. We'll run it in Docker with CUDA support: no Python dependencies, no bloat, just raw inference speed.
First, make sure you have the NVIDIA Container Toolkit installed for GPU passthrough:
# Install NVIDIA Container Toolkit (if not already installed)
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
Download a quantized model into a local directory:
mkdir -p ~/models
# Download Qwen3.5-9B Q4_K_M quantization (~5.5GB)
# Get it from HuggingFace or your preferred model hub
wget -O ~/models/Qwen3.5-9B-Q4_K_M.gguf \
"https://huggingface.co/unsloth/Qwen3.5-9B-GGUF/resolve/main/Qwen3.5-9B-Q4_K_M.gguf"
Now spin up the llama.cpp server with GPU acceleration:
docker run --name llm --rm -d \
--gpus all \
-p 8080:8080 \
-v ~/models:/models \
ghcr.io/ggml-org/llama.cpp:server-cuda \
--model /models/Qwen3.5-9B-Q4_K_M.gguf \
--n-predict 2048 \
--jinja --ctx-size 84000 \
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 \
--repeat-penalty 1.05 \
--fit on \
--alias "qwen3.5-9B-docker" \
--kv-unified \
--gpu-layers auto \
--split-mode none \
--flash-attn on \
--cache-type-k q4_0 --cache-type-v q4_0 \
--chat-template-kwargs '{"enable_thinking":true}' \
--host 0.0.0.0 --port 8080
Let's break down the key flags:
| Flag | What it does |
|---|---|
| --gpus all | Passes all NVIDIA GPUs to the container |
| --ctx-size 84000 | Sets the context window to ~84K tokens, large enough for complex code analysis |
| --flash-attn on | Enables Flash Attention for faster inference |
| --cache-type-k q4_0 --cache-type-v q4_0 | Quantizes the KV cache to save VRAM |
| --kv-unified | Uses a unified KV cache for better memory efficiency |
| --jinja | Enables Jinja2 chat templates (required for Qwen's chat format) |
| --alias "qwen3.5-9B-docker" | Names the model for the API |
| --chat-template-kwargs '{"enable_thinking":true}' | Enables the model's chain-of-thought reasoning |
Verify the server is running:
curl http://localhost:8080/v1/models
# Should return a JSON with your model listed
The llama.cpp server exposes an OpenAI-compatible API at http://localhost:8080/v1, and this is key. Any tool that speaks the OpenAI API format can talk to it out of the box.
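Beyond listing models, a quick chat-completion round trip proves the whole stack works. A sketch using the alias set above (the prompt and max_tokens are arbitrary):

```shell
# Build the request payload (the model name must match the server's --alias)
cat > /tmp/chat.json <<'EOF'
{
  "model": "qwen3.5-9B-docker",
  "messages": [
    {"role": "user", "content": "Reply with the single word: pong"}
  ],
  "max_tokens": 32
}
EOF
# Fire it at the OpenAI-compatible endpoint
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @/tmp/chat.json || echo "server not reachable; is the container running?"
```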
Trade-Off Alert: Local models are slower and less capable than GPT-4 or Claude Sonnet. You sacrifice some reasoning fluency for zero cost and total privacy. For test generation, this is an acceptable trade. With Qwen3.5-9B on a modern GPU, you'll get decent code generation at ~40-80 tokens/second, fast enough for interactive use.
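If you prefer compose, the same server can be captured in a docker-compose.yml for easy replication. This is a sketch mirroring the docker run flags above, trimmed to the essentials; adjust the model path and flags to your setup:

```yaml
# docker-compose.yml: illustrative sketch mirroring the docker run command above
services:
  llm:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    ports:
      - "8080:8080"
    volumes:
      - ~/models:/models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: >
      --model /models/Qwen3.5-9B-Q4_K_M.gguf
      --alias qwen3.5-9B-docker
      --jinja --ctx-size 84000
      --flash-attn on
      --host 0.0.0.0 --port 8080
```

Start it with `docker compose up -d` and verify with the same curl check as before.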
Connecting the Dots: Playwright MCP and the OpenCode Agent
Playwright's Model Context Protocol (MCP) is the spine of this operation. It provides a standardized way for AI agents to interact with a browser context (clicking, typing, extracting text) without writing ad-hoc prompts. The opencode agent is a ready-to-use implementation that leverages MCP.
Install OpenCode and configure it to use your local llama.cpp server. In your project's opencode.json (or globally at ~/.config/opencode/opencode.json):
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"llama": {
"npm": "@ai-sdk/openai-compatible",
"name": "llama.cpp (local)",
"options": {
"baseURL": "http://localhost:8080/v1"
},
"models": {
"qwen3.5-9B-docker": {
"name": "Qwen 3.5-9B (local)",
"tool_call": true,
"temperature": true
}
}
}
}
}
The key here is @ai-sdk/openai-compatible: since llama.cpp exposes an OpenAI-compatible API, OpenCode can talk to it using the standard OpenAI SDK adapter. No custom integration needed.
Verify the connection by launching OpenCode and selecting the local model:
opencode
# Switch to your local model with Ctrl+K, select "Qwen 3.5-9B (local)"
If the model responds, your local LLM is wired up. Now any agent you configure in OpenCode (build, plan, or custom agents) can use this model for reasoning while having full access to Playwright's browser automation.
Installing the Playwright Toolchain
Before generating tests, you need two Playwright tools installed:
1. Playwright Test (the test runner + built-in agents)
npm install --save-dev @playwright/test
npx playwright install # downloads browser binaries
Playwright v1.56+ includes three built-in AI agents: planner, generator, and healer. Initialize them for OpenCode:
npx playwright init-agents --loop=opencode
This creates agent definitions in your project that OpenCode can use. Regenerate them whenever you update Playwright to pick up new tools and instructions.
2. Playwright CLI (standalone browser automation)
@playwright/cli is a separate, token-efficient CLI for direct browser control: snapshotting, clicking, filling forms, all from the terminal:
npm install -g @playwright/cli
playwright-cli install-browser
Verify it works:
playwright-cli open https://example.com
playwright-cli snapshot # prints accessibility tree
playwright-cli close
MCP vs CLI: when to use which?
- Playwright Test Agents (planner/generator/healer): structured test generation and self-healing workflows
- playwright-cli: quick browser inspection, debugging UI issues, ad-hoc automation from any agent with bash access
From English to Executable Code: Playwright Test Agents
Playwright's built-in agents form a pipeline: plan → generate → heal. Each agent has a specific job and they chain together.
Step 1: Create a Seed Test
The seed test sets up the environment (fixtures, auth, base URL) and serves as an example for generated tests:
// tests/seed.spec.ts
import { test, expect } from '@playwright/test';
test('seed', async ({ page }) => {
await page.goto('http://localhost:3000');
// Any setup: login, accept cookies, navigate to starting point
});
Step 2: Planner – Explore and Plan
Ask the planner to explore your app and produce a structured Markdown test plan. In OpenCode:
@planner Generate a test plan for the login flow. Use tests/seed.spec.ts as the seed test.
The planner opens the browser, explores the page, and outputs a Markdown plan in specs/:
# Login Flow Test Plan
## Test Scenarios
### 1. Successful Login
**Seed:** tests/seed.spec.ts
**Steps:**
1. Navigate to /login
2. Fill username field with "testuser"
3. Fill password field with "validpass"
4. Click submit button
**Expected Results:**
- Redirected to /dashboard
- Welcome message is visible
### 2. Failed Login with Invalid Password
**Steps:**
1. Navigate to /login
2. Fill username with "testuser"
3. Fill password with "wrongpass"
4. Click submit button
**Expected Results:**
- Error message is visible
- Stays on /login page
The plan is human-readable: review and edit it before generation. This is where you add edge cases the AI might miss.
Step 3: Generator – Plan to Code
Feed the plan to the generator:
@generator Generate Playwright tests from specs/login-flow.md
The generator reads the plan, interacts with the live app to verify selectors, and produces executable test files:
// tests/login/successful-login.spec.ts
import { test, expect } from '@playwright/test';
test.describe('Login Flow', () => {
test('successful login with valid credentials', async ({ page }) => {
await page.goto('http://localhost:3000/login');
const usernameInput = page.getByRole('textbox', { name: 'Username' });
await usernameInput.fill('testuser');
const passwordInput = page.getByRole('textbox', { name: 'Password' });
await passwordInput.fill('validpass');
await page.getByRole('button', { name: 'Sign in' }).click();
await expect(page).toHaveURL('http://localhost:3000/dashboard');
await expect(page.getByText('Welcome')).toBeVisible();
});
test('failed login with invalid password', async ({ page }) => {
await page.goto('http://localhost:3000/login');
await page.getByRole('textbox', { name: 'Username' }).fill('testuser');
await page.getByRole('textbox', { name: 'Password' }).fill('wrongpass');
await page.getByRole('button', { name: 'Sign in' }).click();
await expect(page.getByText('Invalid credentials')).toBeVisible();
});
});
Notice the generator uses role-based selectors (getByRole, getByText) instead of fragile CSS paths; these survive CSS refactors and layout changes.
Gotcha #2: Generated tests may include initial errors. That's expected β the healer handles them in the next step.
Introducing Immunity: The Healer Agent
Tests break. Locators change. The healer agent closes the loop: it runs failing tests, inspects the current UI, repairs the selectors, and re-runs until they pass.
When a test fails, invoke the healer:
@healer Fix the failing test in tests/login/successful-login.spec.ts
The healer will:
- Run the test and observe the failure
- Replay the failing steps in the browser
- Inspect the current UI to find equivalent elements
- Patch the test (locator update, wait adjustment, data fix)
- Re-run to confirm the fix works
- If the functionality itself is broken (not just the test), it marks the test as test.skip() with a reason
For example, if a refactor changed the submit button from "Sign in" to "Log in":
// Before (failing):
await page.getByRole('button', { name: 'Sign in' }).click();
// After (healer's fix):
await page.getByRole('button', { name: 'Log in' }).click();
Using playwright-cli for Manual Debugging
For quick, ad-hoc browser inspection outside the plan/generate/heal pipeline, use playwright-cli directly:
playwright-cli open http://localhost:3000/login
playwright-cli snapshot # see the page structure
playwright-cli console error # check for JS errors
playwright-cli network # check for failed API calls
playwright-cli close
This is useful when you need to visually debug an issue before deciding which agent to invoke, or when setting up OpenCode's @ui-inspector agent (see my OpenCode Agents & Skills guide).
Automating the Automation: CI/CD Pipeline
A self-healing loop is useless if it only runs on your machine. Embed it in CI/CD to trigger on every commit. For CI, you have two options: use a cloud LLM API (faster, simpler) or use a self-hosted runner with GPU access for full local inference.
For teams without GPUs in CI: Use a cloud LLM (OpenAI, Anthropic, etc.) for the CI pipeline and keep the local llama.cpp setup for development. You'll pay a few cents per run but avoid the self-hosted runner complexity. Just swap the provider in opencode.json for the CI environment.
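As an illustration, a cloud-backed CI job might look like the following sketch. The workflow name, secret name, and the non-interactive OpenCode invocation are assumptions; adapt them to your runner and provider:

```yaml
# .github/workflows/e2e.yml: illustrative sketch, not a drop-in workflow
name: e2e-selfhealing
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx playwright install --with-deps
      - name: Run tests
        run: npx playwright test
      - name: Heal on failure (assumes OpenCode's non-interactive run mode)
        if: failure()
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: opencode run "@healer Fix the failing tests" || true
```

A sensible refinement is to have the healer step open a pull request with its patches rather than pushing fixes directly, so a human reviews every repair.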
Scaling and Customizing the Loop
Now that you have the full pipeline (planner → generator → healer), you can customize each stage for your application's specific patterns.
Teach the Agents Your Domain
Create a PRD (Product Requirements Document) in your project root to give agents context about your app's specific patterns:
<!-- prd.md -->
# My App Testing PRD
## Custom Components
- **DatePicker**: uses `[data-testid="calendar"]`, click a date cell to select
- **DataTable**: sortable columns via header click, pagination via `.pagination-next`
- **Toast notifications**: appear in `.toast-container`, auto-dismiss after 5s
## Auth Flow
- Login via /login, session stored in httpOnly cookie
- API calls to /api/* require Bearer token in Authorization header
## Test Data
- Test user: testuser / validpass
- Admin user: admin / adminpass
Reference this PRD when invoking the planner:
@planner Generate a test plan for the admin dashboard.
PRD: prd.md
Seed: tests/seed.spec.ts
The planner will use your domain knowledge to produce more accurate test plans, and the generator will follow suit with proper selectors and flows.
Monitoring the Loop
Track metrics to tune effectiveness:
- Heal rate: % of failures the healer fixes without human intervention
- False fixes: tests that pass after healing but test the wrong behavior (review diffs!)
- Generation accuracy: % of generated tests that pass on first run
A well-tuned loop can reduce maintenance time by 70%, but only if you review critical changes and keep the PRD updated as your app evolves.
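Even something as simple as a CSV of healer outcomes gives you the heal rate. A toy sketch (the log format and file path are made up for illustration):

```shell
# Hypothetical outcome log: one line per test the healer touched
cat > /tmp/heal-log.csv <<'EOF'
login-success,healed
login-failure,healed
checkout,unfixed
EOF
# Heal rate = healed / total
awk -F, '
  { total++ }
  $2 == "healed" { healed++ }
  END { printf "heal rate: %.0f%%\n", 100 * healed / total }
' /tmp/heal-log.csv
# prints: heal rate: 67%
```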
What to Do on Monday Morning
Your immediate action: set up the llama.cpp Docker container and wire it to OpenCode. Pull a quantized model, run the docker run command, and verify with curl http://localhost:8080/v1/models. Then configure opencode.json with the provider block and install the Playwright toolchain:
npm install --save-dev @playwright/test
npx playwright install
npm install -g @playwright/cli
npx playwright init-agents --loop=opencode
Ask the planner to generate a test plan for one page. Review it, feed it to the generator, run the tests, and let the healer fix whatever breaks. That first cycle will prove the value without a full infrastructure commitment.
The long-term play: build a custom multi-agent setup (see my OpenCode Agents & Skills guide) where the planner, generator, and healer chain together automatically on every commit. Containerize the entire loop (llama.cpp for inference, Playwright for browser automation, OpenCode for orchestration) and plug it into your CI pipeline. That's how you turn a clever script into a resilient, company-wide asset.
The pipeline won't write perfect tests on the first try. But it will learn, adapt, and reduce the grunt work to near zero. Your job shifts from writing repetitive code to curating the system that writes it, a far better use of a senior engineer's time.
Notes
I tried Ollama, LM Studio, and llama.cpp locally, but I struggled to achieve a stable workflow with the first two. In the end, running llama.cpp on Docker worked great and is the easiest to replicate elsewhere: just a docker command (or a docker compose file, which is super portable).
My HW configuration:
- CPU: AMD Ryzen 7 5800X3D
- GPU: NVIDIA RTX 3070 8GB
- RAM: 32GB DDR4