// the find
ngxson/wllama
WebAssembly binding for llama.cpp - Enabling on-browser LLM inference
wllama is a TypeScript wrapper around llama.cpp compiled to WebAssembly, letting you run GGUF models entirely in the browser with no backend. V3 added WebGPU acceleration, multimodal inputs, and tool calling. It's for developers building privacy-first or offline-capable AI features in web apps.
The OpenAI-compatible API means you can swap it in without rewriting call sites. Worker-based inference keeps the UI thread unblocked, which is the obvious right call, and they made it. Parallel chunk downloading is a real UX win for large models: 512MB shards load noticeably faster than a monolithic 4GB file. WebGPU support lands with sane fallback behavior, defaulting to WASM SIMD when a GPU isn't available rather than hard-failing.
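To make the shard point concrete, here is a minimal sketch of bounded-concurrency downloading. It is not wllama's loader. `shardNames` and `mapWithConcurrency` are hypothetical helpers, and the file naming assumes the `-00001-of-0000N.gguf` pattern produced by llama.cpp's gguf-split tool. The download itself is passed in as a callback, so the sketch stays independent of any particular fetch API.

```typescript
// Hypothetical helper: build shard file names following the gguf-split
// naming pattern (assumed here), e.g. "model-00001-of-00008.gguf".
function shardNames(prefix: string, total: number): string[] {
  const pad = (n: number): string => String(n).padStart(5, "0");
  return Array.from(
    { length: total },
    (_, i) => `${prefix}-${pad(i + 1)}-of-${pad(total)}.gguf`,
  );
}

// Hypothetical helper: run `task` over every item with at most `limit`
// tasks in flight at once, returning results in the original order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  task: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array<R>(items.length);
  let next = 0; // index of the next item nobody has claimed yet

  // Each worker claims the next unclaimed index until none are left.
  const worker = async (): Promise<void> => {
    while (next < items.length) {
      const index = next++;
      results[index] = await task(items[index], index);
    }
  };

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

A 4GB model split into 512MB pieces gives `shardNames("model", 8)`, and `mapWithConcurrency(names, 3, download)` would keep at most three requests open at a time, where `download` is whatever fetch wrapper the app already uses.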
The COOP/COEP header requirement for multi-threading is a deployment landmine: it breaks most third-party iframes and analytics scripts, and the docs treat it as a footnote. WASM inference on CPU is genuinely slow for anything beyond tiny quantized models; even Q4 Mistral-7B will test users' patience on mid-range hardware. No multi-sequence support means you can't run parallel requests, so any use case beyond a single chat session is blocked. Pre-built WASM binaries in the repo mean you're trusting the maintainer's Docker build unless you compile it yourself, which requires Docker and a full llama.cpp submodule checkout.
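For reference, the header requirement boils down to two response headers on the page that hosts the model. The sketch below shows them plus a single-thread fallback for pages that can't be isolated. `withIsolationHeaders` and `pickThreadCount` are hypothetical helpers, not part of wllama, and the halve-the-cores heuristic is an arbitrary assumption for illustration.

```typescript
// Response headers that make a page cross-origin isolated. Browsers
// require isolation before exposing SharedArrayBuffer, which WASM
// threads depend on.
const ISOLATION_HEADERS: Readonly<Record<string, string>> = {
  "Cross-Origin-Opener-Policy": "same-origin",
  "Cross-Origin-Embedder-Policy": "require-corp",
};

// Hypothetical helper: merge the isolation headers into whatever headers
// a server or edge function was already going to send.
function withIsolationHeaders(
  base: Record<string, string>,
): Record<string, string> {
  return { ...base, ...ISOLATION_HEADERS };
}

// Hypothetical helper: choose a thread count. Without isolation there is
// no shared memory, so fall back to a single thread. With isolation, use
// half the reported cores (an arbitrary heuristic), never less than one.
function pickThreadCount(isolated: boolean, hardwareConcurrency: number): number {
  if (!isolated) return 1;
  return Math.max(1, Math.floor(hardwareConcurrency / 2));
}
```

In a browser the inputs would come from `globalThis.crossOriginIsolated` and `navigator.hardwareConcurrency`. The `require-corp` value is what breaks third-party embeds: every cross-origin resource then has to opt in via CORS or a `Cross-Origin-Resource-Policy` header.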