// the find

mudler/LocalAI

★ 46,821 · Go · MIT · updated Jun 2026

LocalAI is the open-source AI engine. Run any model - LLMs, vision, voice, image, video - on any hardware. No GPU required.

LocalAI is a self-hosted inference server that exposes OpenAI-compatible (and now Anthropic/ElevenLabs-compatible) REST APIs across a modular set of backends — llama.cpp, vLLM, whisper.cpp, stable diffusion, and 30+ others. Each backend ships as a separate OCI image pulled on demand, so you don't bloat the install with engines you don't use. It's aimed at anyone who needs a drop-in OpenAI replacement running entirely on their own hardware, from a Raspberry Pi to a multi-GPU cluster.

The composable backend architecture is genuinely well-designed — backends are gRPC servers in separate images, pulled only when needed, which means the core binary stays small and adding a new engine doesn't require recompiling anything. The hardware coverage is unusually broad: CUDA 12/13, ROCm, oneAPI, Metal, Vulkan, and CPU-only all work through the same API surface, which is rare for a single project. The prompt cache now on by default in 4.3.0 is a real quality-of-life win for repeated system prompts — that's the kind of production detail that matters. The distributed mode with VRAM-aware smart routing and per-request replica selection is a legitimately interesting feature that most self-hosted inference projects don't touch.

The scope has grown so fast that the project is showing seams — 36+ backends means the testing matrix is enormous and issues in less-popular backends (ROCm, Intel oneAPI) tend to linger. The macOS DMG isn't Apple-signed, requiring users to run xattr to clear the quarantine flag, which is a friction point that will trip up non-technical users and signals the release process isn't fully automated end-to-end. The distributed mode requires PostgreSQL and NATS as dependencies, which is a significant operational burden compared to the 'just run a container' story the README sells. Documentation quality is uneven across backends — the core llama.cpp path is well-documented, but newer backends like voxtral or rfdetr-cpp have almost no usage examples.

View on GitHub → Homepage ↗