// the find
travisvn/chatterbox-tts-api
Local, OpenAI-compatible text-to-speech (TTS) API using Chatterbox, enabling users to generate voice cloned speech anywhere the OpenAI API is used (e.g. Open WebUI, AnythingLLM, etc.)
A FastAPI wrapper around Resemble AI's Chatterbox TTS model that exposes an OpenAI-compatible `/v1/audio/speech` endpoint, making it a drop-in local replacement for OpenAI TTS in tools like Open WebUI. You get voice cloning from a sample file, 22-language support, and streaming — all self-hosted. Target audience is anyone running a local LLM stack who wants TTS without paying per-character.
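To make the "drop-in replacement" claim concrete, here is a minimal sketch of a request in the shape OpenAI's speech endpoint expects, sent to a local server instead. The base URL and port, the `tts-1` model placeholder, the `alloy` voice name, and the `.wav` output extension are all assumptions for illustration; check the project's README for the values your deployment actually uses.

```python
import json
import urllib.request

# Assumption: the local server listens here; adjust to your deployment.
BASE_URL = "http://localhost:4123"


def build_speech_request(text: str, voice: str = "alloy",
                         base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request in the shape of OpenAI's /v1/audio/speech call."""
    body = {
        "model": "tts-1",  # placeholder model name; a local server may ignore it
        "input": text,
        "voice": voice,    # assumed voice name, not confirmed for this project
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, out_path: str = "speech.wav") -> None:
    """Send the request and write the returned audio bytes to disk."""
    with urllib.request.urlopen(build_speech_request(text)) as resp:
        audio = resp.read()
    with open(out_path, "wb") as fh:
        fh.write(audio)
```

Calling `synthesize("Hello from a local TTS server")` would write the response body to `speech.wav`, provided a server is running at the assumed address. An existing OpenAI client should need only its base URL changed to reach the same endpoint.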
OpenAI API compatibility is the real selling point: clients that already speak that protocol need no code changes, only a base URL pointed at the local server. The Docker story is well thought out, with separate compose files for CPU, GPU, uv, and Blackwell GPUs rather than one bloated file that tries to do everything. The voice library management (upload once, reference by name) is a sensible abstraction over the raw per-request upload pattern. SSE streaming with base64-encoded chunks matches what OpenAI's streaming TTS returns, so clients that handle one should handle the other.
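As a rough illustration of what consuming such a stream looks like, the sketch below decodes base64 audio chunks from SSE `data:` lines. The event names (`speech.audio.delta`, `speech.audio.done`) and the `audio` field follow OpenAI's streaming speech events; whether this project emits exactly the same shape is an assumption, and `iter_audio_chunks` is a hypothetical helper, not part of either API.

```python
import base64
import json
from typing import Iterable, Iterator


def iter_audio_chunks(lines: Iterable[str]) -> Iterator[bytes]:
    """Yield decoded audio bytes from an iterable of SSE text lines."""
    for raw in lines:
        line = raw.strip()
        # Skip blank separators, comments, and non-data SSE fields.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # some servers send this sentinel instead
            break
        event = json.loads(payload)
        kind = event.get("type")
        if kind == "speech.audio.delta":
            # Each delta carries one base64-encoded slice of the audio.
            yield base64.b64decode(event["audio"])
        elif kind == "speech.audio.done":
            break
```

Joining the yielded chunks with `b"".join(...)` gives the full audio payload; a player that supports streaming could instead feed each chunk to its buffer as it arrives.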
The upstream `resemble-ai/chatterbox` is currently broken on non-CUDA setups, and the README admits it prominently. That's a real problem for anyone on a Mac or a CPU-only machine, and this wrapper can't fix it. No endpoint requires authentication by default, so you're one misconfigured firewall rule away from an open TTS endpoint anyone can hammer. The `stable` branch exists as a fallback for the pre-multilingual model, but keeping two diverging branches alive long-term is a maintenance trap waiting to spring. Memory management is done manually with periodic GC calls and CUDA cache clears, which suggests the model isn't being properly unloaded between requests. That's fine for a single-user local box, rough if you try to run this under any real concurrency.
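For readers unfamiliar with the pattern, periodic manual cleanup usually looks something like the sketch below. This is not the project's code: `PeriodicCleanup`, its `every` parameter, and the every-N-requests policy are hypothetical, and the PyTorch step only runs if `torch` is installed and a CUDA device is available.

```python
import gc


class PeriodicCleanup:
    """Run a garbage-collection pass once every `every` requests."""

    def __init__(self, every: int = 5) -> None:
        if every < 1:
            raise ValueError("every must be >= 1")
        self.every = every
        self.count = 0

    def after_request(self) -> bool:
        """Call after each request; returns True when a cleanup pass ran."""
        self.count += 1
        if self.count % self.every != 0:
            return False
        gc.collect()  # drop unreachable Python objects first
        try:
            import torch  # optional dependency in this sketch
        except ImportError:
            return True
        if torch.cuda.is_available():
            # Hand cached, unused GPU blocks back to the driver.
            torch.cuda.empty_cache()
        return True
```

Note that `torch.cuda.empty_cache()` only releases cached blocks that no live tensor references, so it cannot reclaim memory still held by in-flight requests. That limit is one reason this approach copes poorly with concurrent load.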