// the find

modelscope/FunASR

★ 17,930 · Python · MIT · updated Jun 2026

Industrial-grade speech recognition toolkit: 170x realtime, 50+ languages, speaker diarization, emotion detection, streaming, and OpenAI-compatible API.

FunASR is Alibaba DAMO Academy's production speech recognition toolkit, built around their Paraformer and SenseVoice models. It packages VAD, ASR, punctuation restoration, speaker diarization, and emotion detection into a single Python call, with an OpenAI-compatible API server included. The target audience is teams who want Whisper-level ease of use but need to run on CPU or need Chinese-language accuracy that Whisper can't match.
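To make the "single Python call" concrete, here is a minimal sketch of the bundled pipeline. It assumes the `funasr` package is installed and that the model aliases below (`paraformer-zh`, `fsmn-vad`, `ct-punc`, `cam++`) still match the project's README; the helper names `pipeline_kwargs` and `transcribe` are hypothetical, and the current model zoo may have moved on.

```python
from typing import Any, Dict, List


def pipeline_kwargs(diarize: bool = True) -> Dict[str, str]:
    """Build keyword arguments for funasr.AutoModel, one entry per stage.

    The model aliases are assumptions based on the FunASR README;
    verify them against the current model zoo before relying on them.
    """
    kwargs = {
        "model": "paraformer-zh",  # ASR
        "vad_model": "fsmn-vad",   # voice activity detection
        "punc_model": "ct-punc",   # punctuation restoration
    }
    if diarize:
        kwargs["spk_model"] = "cam++"  # speaker diarization
    return kwargs


def transcribe(audio_path: str, diarize: bool = True) -> List[Dict[str, Any]]:
    """Run the whole pipeline on one file and return FunASR's result list."""
    # Imported lazily so this module loads even without FunASR installed.
    from funasr import AutoModel

    model = AutoModel(**pipeline_kwargs(diarize))
    # batch_size_s caps the total audio duration (seconds) per batch.
    return model.generate(input=audio_path, batch_size_s=300)
```

Calling `transcribe("meeting.wav")` (a hypothetical file) should return a list of dictionaries whose `"text"` field holds the punctuated transcript. The first run downloads every model in the chain, so expect it to be slow.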

The speed numbers are real and matter in practice. SenseVoice-Small running 17x realtime on CPU is the headline: you can transcribe a one-hour meeting in under four minutes on a commodity server with no GPU. The AutoModel abstraction collapses what would normally be a four-model pipeline (VAD → ASR → punctuation → diarization) into one call with sensible defaults. The OpenAI-compatible API server is meant as a drop-in replacement for Whisper API clients, with no client code changes beyond pointing at the new endpoint. Chinese ASR quality is the strongest justification for choosing this over Whisper: Paraformer-zh was trained on Alibaba's production call-center data at a scale Whisper's Chinese training can't touch.
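The "under four minutes" figure follows directly from the 17x realtime factor. As a quick sanity check, the sketch below divides audio duration by the speed-up; the function name `transcription_minutes` is made up for illustration, and the result ignores model load time and I/O.

```python
def transcription_minutes(audio_minutes: float, realtime_factor: float) -> float:
    """Wall-clock minutes needed to transcribe audio at a given speed-up."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_minutes / realtime_factor


one_hour = transcription_minutes(60, 17)
print(f"{one_hour:.2f} minutes")  # 3.53 minutes
```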

The model zoo is split between ModelScope (Alibaba's platform) and HuggingFace, and the ModelScope-first distribution creates friction for teams outside China: first-run downloads can be slow or flaky depending on your region, and some models are ModelScope-only. Streaming support exists but is in practice limited to Paraformer-zh-streaming; Fun-ASR-Nano's streaming is a recent addition still shaking out (the README links to a guide with an appendix titled 'DynamicStreamingVAD', which is not a reassuring sign of stability). English accuracy is solid but not a reason to switch from Whisper-large-v3-turbo unless you specifically need CPU inference or the bundled diarization. The codebase mixes training scripts, inference demos, and production server code without a clear boundary, which makes it hard to tell stable API from internal tooling.
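For readers weighing the streaming caveat, this is roughly what chunked inference with Paraformer-zh-streaming looks like. Treat it as a sketch under assumptions: the chunk configuration (`[0, 10, 5]`, 600 ms chunks at 16 kHz) and the `generate` keyword arguments are recalled from the FunASR README and may have changed, and `iter_chunks` and `stream_transcribe` are hypothetical helper names.

```python
from typing import Iterator, List, Sequence, Tuple

# Assumed README settings: [0, 10, 5] means 600 ms chunks.
CHUNK_SIZE = [0, 10, 5]
CHUNK_STRIDE = CHUNK_SIZE[1] * 960  # 9600 samples = 600 ms at 16 kHz


def iter_chunks(
    samples: Sequence[float], stride: int = CHUNK_STRIDE
) -> Iterator[Tuple[Sequence[float], bool]]:
    """Yield (chunk, is_final) pairs that cover the whole signal in order."""
    total = max(1, -(-len(samples) // stride))  # ceiling division, at least 1
    for i in range(total):
        yield samples[i * stride:(i + 1) * stride], i == total - 1


def stream_transcribe(samples: Sequence[float]) -> List[str]:
    """Feed 16 kHz mono samples to the streaming model chunk by chunk."""
    # Imported lazily so the chunking helper works without FunASR installed.
    from funasr import AutoModel

    model = AutoModel(model="paraformer-zh-streaming")
    cache: dict = {}  # decoder state carried from one chunk to the next
    pieces: List[str] = []
    for chunk, is_final in iter_chunks(samples):
        results = model.generate(
            input=chunk,
            cache=cache,
            is_final=is_final,  # flushes the remaining text on the last chunk
            chunk_size=CHUNK_SIZE,
            encoder_chunk_look_back=4,
            decoder_chunk_look_back=1,
        )
        pieces.extend(r.get("text", "") for r in results)
    return pieces
```

The chunking logic is the part worth noticing: the caller owns the cache and must flag the final chunk, which is more bookkeeping than the one-call offline path requires.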
