// the find
trymirai/uzu
A high-performance inference engine for AI models
uzu is a Rust-based on-device inference engine targeting Apple Silicon via Metal, with bindings for Python, Swift, and TypeScript. It handles model downloading, KV cache, speculative decoding, structured output, and TTS in a single library. The target audience is iOS/macOS app developers who want to ship local LLM inference without depending on a cloud API.
1. Speculative decoding is a first-class feature with named presets (classification, summarization) — not something you bolt on later, and the README shows it cutting generation steps meaningfully. 2. The unified memory path on Apple Silicon is real: no CPU↔GPU copy overhead, which matters a lot for 7B+ models on MacBook. 3. Bindings across four languages share the same Rust core, so you're not getting a Python reimplementation that diverges from the Swift one. 4. Structured output via JSON schema grammar constraints is built in, not a prompt hack — the model is forced into valid JSON at the decode level.
1. Apple-only. The Metal backend is the only real accelerated path; non-Apple hardware gets a CPU fallback that will be unusably slow for anything beyond toy models. The repo markets itself as a general inference engine but is effectively an Apple Silicon library. 2. No published benchmarks against llama.cpp or MLX — the README gestures at 'high-performance' but gives no numbers you can compare. The benchmark directory exists but results aren't in the docs. 3. The Rust crate is not published to crates.io; you depend on a git branch. That's fine for experimentation, but it means no semver guarantees and your lockfile will silently drift if the branch moves. 4. Model support is gated through their own registry at trymirai.com/models, not a standard GGUF/ONNX/SafeTensors path. You can't just point it at an arbitrary HuggingFace model — you're dependent on them having added it.