// the find

madroidmaq/mlx-omni-server

★ 723 · Python · MIT · updated May 2026

MLX Omni Server is a local inference server powered by Apple's MLX framework, specifically designed for Apple Silicon (M-series) chips. It implements OpenAI-compatible API endpoints, enabling seamless integration with existing OpenAI SDK clients while leveraging the power of local ML inference.

A local inference server for Apple Silicon that wraps MLX-accelerated models behind OpenAI- and Anthropic-compatible REST APIs. Point your existing OpenAI SDK client at localhost:10240 and the calling code stays the same apart from the base URL and model name. Useful if you want private, offline inference on an M-series Mac without switching SDKs.
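To make "point your client at localhost:10240" concrete, here is a minimal standard-library sketch of an OpenAI-style chat request aimed at the local server. It assumes the server mirrors OpenAI's `/v1/chat/completions` path, which is what OpenAI compatibility normally implies. The helper names `build_chat_request` and `chat`, and the placeholder API key, are illustrative and not part of the project.

```python
import json
import urllib.request

# Assumption: the server listens on localhost:10240 and mirrors OpenAI's /v1 paths.
BASE_URL = "http://localhost:10240/v1"


def build_chat_request(model, messages, base_url=BASE_URL):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {"model": model, "messages": messages}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Placeholder value; whether the server checks it is an assumption.
            "Authorization": "Bearer not-needed",
        },
        method="POST",
    )


def chat(model, messages, base_url=BASE_URL):
    """Send the request and return the first choice's text (needs a running server)."""
    request = build_chat_request(model, messages, base_url)
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the official `openai` Python SDK the same switch is a constructor argument, `OpenAI(base_url="http://localhost:10240/v1", api_key="not-needed")`, and the rest of the calling code is unchanged. The model name you pass must be one the local server can load.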

Dual API compatibility (OpenAI + Anthropic) means switching from hosted to local inference needs no client-side changes beyond the endpoint and model name. The per-model tool-call parsers (llama3, mistral, qwen3, glm4.5) absorb the wildly different function-calling formats that different models emit, which is real work that would bite you otherwise. Prompt cache pooling is a thoughtful addition that reduces redundant prefill cost across requests. Test coverage is solid and model-specific, not just happy-path smoke tests.
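The sketch below shows why per-model parsers are needed. It is not the project's parser code: the two formats are simplified illustrations, loosely modeled on a tag-wrapped style (as used by Qwen-family chat templates) and a prefixed-array style (as used by Mistral-family templates). The parser keys and function names are hypothetical.

```python
import json
import re

# Simplified, illustrative formats; real chat templates have more variants.
TAG_WRAPPED = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
ARRAY_PREFIX = "[TOOL_CALLS]"


def _normalize(obj):
    """Map one decoded call onto a common {name, arguments} shape."""
    return {"name": obj["name"], "arguments": obj.get("arguments", {})}


def parse_tag_wrapped(text):
    """Parse calls emitted as <tool_call>{...}</tool_call> blocks."""
    return [_normalize(json.loads(raw)) for raw in TAG_WRAPPED.findall(text)]


def parse_prefixed_array(text):
    """Parse calls emitted as [TOOL_CALLS] followed by a JSON array."""
    _, found, rest = text.partition(ARRAY_PREFIX)
    if not found:
        return []  # plain text answer, no tool call
    return [_normalize(obj) for obj in json.loads(rest.strip())]


# Hypothetical registry: one parser per output format.
PARSERS = {
    "tag_wrapped": parse_tag_wrapped,
    "prefixed_array": parse_prefixed_array,
}


def parse_tool_calls(parser_name, text):
    """Dispatch raw model output to the parser for that model's format."""
    return PARSERS[parser_name](text)
```

Both parsers return the same normalized list, so everything downstream of the dispatch can treat tool calls uniformly regardless of which model produced them.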

Hard Apple Silicon dependency means it's useless for anyone on Linux or Windows, which limits team adoption and CI. There is no GPU memory management surface: if a loaded model OOMs your chip, you're debugging MLX internals. Image generation support is listed as implemented, but the images module is thin and model selection for diffusion is underspecified in the docs. The Anthropic compatibility layer is newer and shallower than the OpenAI one; edge cases like prompt caching headers and batch requests are not handled.
