// the find
2noise/ChatTTS
A generative speech model for daily dialogue.
ChatTTS is a Chinese-first autoregressive TTS model. The full model was trained on 100k+ hours of dialogue audio; the checkpoint published on HuggingFace is a smaller one trained on 40k hours. It's aimed at LLM assistant use cases where you want speech that sounds conversational rather than narrated, with natural pauses, laughter, and interjections. The model weights are licensed CC BY-NC 4.0, so commercial use is blocked.
The prosody control is the real differentiator: inline tokens like [laugh], [uv_break], and [lbreak] let you inject conversational cues at the text level without post-processing. Speaker embeddings are sampled from a Gaussian distribution, so you can reproduce a specific voice by saving the sampled embedding vector and reusing it later. Streaming audio generation is supported. The codebase is well structured, with a clean separation between model, tokenizer, and vocoder (vocos), and it ships a custom vLLM-style scheduler in the velocity module for batched inference.
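To make the two ideas above concrete, here is a minimal sketch in plain Python that does not call ChatTTS at all. It shows what text-level markup with the three tokens named above might look like, and how a Gaussian-sampled vector can be saved and reloaded to get the same "voice" back. The helper names (mark_up, sample_speaker, save_speaker, load_speaker), the token placement rules, and the vector dimension are all assumptions made for illustration, not the library's API.

```python
import json
import random
from pathlib import Path

# Control tokens named in the write-up; the full set ChatTTS accepts is not listed here.
LAUGH, SHORT_BREAK, LINE_BREAK = "[laugh]", "[uv_break]", "[lbreak]"


def mark_up(phrases, laugh_after=frozenset()):
    """Join phrases with [uv_break], add [laugh] after chosen ones, end with [lbreak].

    Hypothetical helper: where the tokens go is an illustrative choice.
    """
    pieces = []
    for index, phrase in enumerate(phrases):
        piece = phrase.strip()
        if index in laugh_after:
            piece += f" {LAUGH}"
        pieces.append(piece)
    return f" {SHORT_BREAK} ".join(pieces) + f" {LINE_BREAK}"


def sample_speaker(dim, seed=None):
    """Stand-in for a speaker embedding: dim independent N(0, 1) draws."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def save_speaker(vector, path):
    """Persist the vector as JSON so the same voice can be reused later."""
    Path(path).write_text(json.dumps(vector), encoding="utf-8")


def load_speaker(path):
    """Read a vector written by save_speaker."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    print(mark_up(["So I tried it", "and it worked"], laugh_after={1}))
    voice = sample_speaker(dim=8, seed=42)  # dim=8 is arbitrary for the demo
    save_speaker(voice, "voice.json")
    print(load_speaker("voice.json") == voice)
```

In ChatTTS itself the speaker would come from the library rather than from random.gauss (the project README shows a sample_random_speaker() call for this), so treat the vector here purely as a placeholder for the save-and-reuse pattern.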
The CC BY-NC 4.0 license limits the weights to non-commercial use, which rules the model out for any commercial product immediately. English is explicitly described as 'experimental': it's a Chinese model that handles English as a secondary capability, not a peer. Because generation is autoregressive and sampled, output is non-deterministic, and you'll get occasional speaker bleed or quality drops. The workaround is to generate multiple samples and keep a good one, which the FAQ admits outright. The open-sourced checkpoint is the pre-trained base without SFT, so quality is below what the team is running internally, and the roadmap items (multi-emotion, ChatTTS.cpp) have been sitting incomplete for over a year.
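The "generate several samples and keep a good one" workaround can be sketched as a small best-of-N loop. This is a generic pattern, not something the ChatTTS codebase is known to ship: generate and score are hypothetical callables you would supply yourself (for example, a wrapper around the TTS call and some quality or speaker-consistency check), and the review does not say how samples should be judged.

```python
def best_of(generate, score, attempts=4, good_enough=None):
    """Call generate(attempt_index) up to `attempts` times and keep the top scorer.

    generate: hypothetical callable returning one candidate (e.g. an audio array).
    score: hypothetical callable mapping a candidate to a float, higher is better.
    good_enough: optional threshold; stop early once the best score reaches it.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    best_sample = generate(0)
    best_score = score(best_sample)
    for attempt in range(1, attempts):
        if good_enough is not None and best_score >= good_enough:
            break  # an acceptable candidate already exists, stop paying for more
        sample = generate(attempt)
        sample_score = score(sample)
        if sample_score > best_score:
            best_sample, best_score = sample, sample_score
    return best_sample, best_score


if __name__ == "__main__":
    # Toy stand-ins: the "samples" are just numbers and the score is the number itself.
    candidates = [3, 9, 5, 7]
    sample, quality = best_of(lambda i: candidates[i], float, attempts=4)
    print(sample, quality)
```

The cost is linear in the number of attempts, so the early-exit threshold matters if each generation is slow.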