// the find

raullenchai/Rapid-MLX

★ 3,148 · Python · Apache-2.0 · updated Jun 2026

The fastest local AI engine for Apple Silicon. 4.2x faster than Ollama, 0.08s cached TTFT, 100% tool calling. 17 tool parsers, prompt cache, reasoning separation, cloud routing. Drop-in OpenAI replacement. Works with Claude Code, Cursor, Aider.

Rapid-MLX is an OpenAI-compatible inference server wrapping Apple's MLX framework with continuous batching, prompt caching, and per-family tool-call parsers. It targets Mac developers who want to run local LLMs as a backend for coding assistants like Claude Code, Cursor, and Cline without cloud API costs. The throughput gains over vanilla mlx-lm are real; the Ollama comparisons in the README are not like-for-like.

The continuous batching implementation delivers a genuine 1.2–1.5x throughput improvement over mlx-lm on identical weights — that's a real win for coding agents that fire parallel tool-use requests. The 17 tool-call parsers with per-family auto-detection and broken-output recovery are the most practically useful part of this project: local models emit malformed JSON constantly, and having a parser that knows Qwen3's Hermes format vs. DeepSeek's format vs. GLM's format saves hours of debugging. The `rapid-mlx launch` IDE-wiring command does atomic config patching (write-temp + rename) with timestamped backups before overwriting, which is the right way to touch files you don't own. The 0.08s cached TTFT on repeat prompts matters for coding agents where the system prompt is large and static across turns.
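The write-temp + rename pattern with a timestamped backup can be sketched as follows. This is a minimal illustration of the general technique, not Rapid-MLX's actual code: the function name `patch_json_config` and the assumption that the target is a flat JSON file are hypothetical.

```python
import contextlib
import json
import os
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def patch_json_config(path, updates):
    """Merge `updates` into the JSON file at `path`; return the backup path, or None.

    Hypothetical helper: illustrates backup + write-temp + rename, nothing more.
    """
    path = Path(path)
    backup = None
    config = {}
    if path.exists():
        config = json.loads(path.read_text(encoding="utf-8"))
        # Timestamped copy of the user's original, made before anything is changed.
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
        backup = path.with_name(f"{path.name}.{stamp}.bak")
        shutil.copy2(path, backup)
    config.update(updates)

    # The temp file lives in the same directory so the final rename never
    # crosses a filesystem boundary.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(config, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        # os.replace swaps the file in one step: readers see either the old
        # content or the new content, never a half-written file.
        os.replace(tmp_name, path)
    except BaseException:
        # On any failure, remove the temp file and leave the original untouched.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
    return backup
```

Calling `patch_json_config("settings.json", {"base_url": "http://localhost:8000/v1"})` (both the file name and the key are placeholders) would leave a `settings.json.<timestamp>.bak` next to the patched file. If the process dies mid-write, only the temp file is affected.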

The '4.2x faster than Ollama' headline is benchmarked against Ollama 0.24, which has no in-flight batching, so the B=4 test (four concurrent requests) compares a batching server against a serializing one, not two inference engines. The actual engine-level advantage over mlx-lm is 1.2–1.5x, which is the honest number. Apple Silicon only, full stop: MLX has no Linux or Windows backend, so this is a non-starter for any team where developers aren't all on M-series Macs. The MHI scoring (Model-Harness Index) is entirely self-reported: the prompts, scoring weights, and evaluation scripts all live in the same repo, with no independent audit and a community benchmark directory that has a handful of submissions despite 3k stars. The `rapid-mlx share` tunnel feature silently exposes a local inference server through a third-party relay; the README buries the security implications in a tip, and casual users enabling it for 'live chat' probably don't realize they're publishing an authenticated endpoint to the internet.
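To see why a concurrent benchmark against a serializing server inflates the headline, here is a toy wall-clock model. It is a back-of-the-envelope sketch, not a reconstruction of the README's benchmark: the function names, the `batch_slowdown` parameter, and every number fed into it are hypothetical.

```python
def serialized_makespan(n_requests: int, t_single: float) -> float:
    """One request at a time: n concurrent requests finish after n * t_single."""
    return n_requests * t_single


def batched_makespan(t_single: float, batch_slowdown: float) -> float:
    """All requests decoded together; batch_slowdown >= 1 is the cost of batching."""
    return t_single * batch_slowdown


def apparent_speedup(
    n_requests: int, batch_slowdown: float, engine_speedup: float = 1.0
) -> float:
    """Wall-clock ratio a concurrent benchmark reports for batched vs. serialized."""
    t_baseline = 10.0  # hypothetical seconds per request; cancels out of the ratio
    t_batched_engine = t_baseline / engine_speedup
    return serialized_makespan(n_requests, t_baseline) / batched_makespan(
        t_batched_engine, batch_slowdown
    )
```

With four concurrent requests, identical engines (`engine_speedup=1.0`), and an assumed 25% batching overhead, `apparent_speedup(4, 1.25)` comes out at 3.2. In this model the whole gap comes from request scheduling rather than faster inference, which is why the single-stream comparison against mlx-lm is the more informative figure.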
