// the find

FunAudioLLM/SenseVoice

★ 8,702 · C · NOASSERTION · updated Jun 2026

Multilingual speech understanding: ASR + emotion recognition + audio event detection. 50+ languages, 15x faster than Whisper, non-autoregressive.

SenseVoice is a speech recognition model from Alibaba's ModelScope team that bundles ASR, language identification, emotion detection, and audio event detection into one model. The non-autoregressive architecture makes the Small variant genuinely fast — 70ms for 10 seconds of audio — which is the main reason to choose it over Whisper. It's aimed at developers who need multilingual transcription (especially Mandarin/Cantonese) with emotion tags baked in.
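To put the quoted figures in perspective, here is a small back-of-the-envelope sketch. It uses only the numbers cited in this write-up (70 ms per 10 seconds of audio, 15x faster than Whisper-Large); the implied Whisper-Large time is derived from that ratio, not measured.

```python
# Figures quoted in this write-up, not independent measurements.
AUDIO_SECONDS = 10.0             # length of the audio clip
SENSEVOICE_SECONDS = 0.070       # 70 ms processing time for the Small variant
SPEEDUP_VS_WHISPER_LARGE = 15    # claimed speedup over Whisper-Large


def real_time_factor(processing_s: float, audio_s: float) -> float:
    """Processing time divided by audio duration; lower is faster."""
    return processing_s / audio_s


rtf = real_time_factor(SENSEVOICE_SECONDS, AUDIO_SECONDS)
# If the 15x claim holds, Whisper-Large would need about this long for the same clip.
implied_whisper_s = SENSEVOICE_SECONDS * SPEEDUP_VS_WHISPER_LARGE

print(f"SenseVoice-Small RTF: {rtf:.3f}")
print(f"Implied Whisper-Large time: {implied_whisper_s:.2f} s")
```

A real-time factor well under 0.01 leaves a lot of headroom for batching or CPU-only hardware, which is where the latency argument matters most.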

The Small model's 15x speed advantage over Whisper-Large is real and meaningful for production workloads where latency matters. The new llama.cpp/GGUF path (254MB q8 model, built-in VAD, single binary, no Python) is a significant addition for edge and CPU-only deployments. Emotion and audio event labels come out of the same inference pass with no extra model weights, which is a clean design. The ecosystem around it (ONNX export, libtorch, and Sherpa-onnx with bindings for 10+ programming languages plus mobile platforms) means you're not locked into one deployment path.
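Because the labels arrive in the same pass as the transcript, downstream code mostly has to split tags from text. The sketch below assumes the raw output prefixes the transcript with `<|...|>` tags in the order language, emotion, event, then a text-normalization flag. That layout, the `parse_tagged` helper, and the sample string are illustrative assumptions, so check them against real model output before relying on them.

```python
import re
from typing import NamedTuple

# Assumed tag layout: <|lang|><|EMOTION|><|Event|><|itn-flag|>transcript
TAG_RE = re.compile(r"<\|([^|]*)\|>")


class Parsed(NamedTuple):
    language: str
    emotion: str
    event: str
    text: str


def parse_tagged(raw: str) -> Parsed:
    """Split leading <|...|> tags from the transcript (hypothetical helper)."""
    tags = TAG_RE.findall(raw)
    text = TAG_RE.sub("", raw).strip()
    # Pad with empty strings so outputs with fewer than three tags still parse.
    tags += [""] * (3 - len(tags))
    return Parsed(tags[0], tags[1], tags[2], text)


# Made-up sample string, not actual model output.
sample = "<|en|><|HAPPY|><|Speech|><|withitn|>Thanks for calling."
result = parse_tagged(sample)
print(result)
```

Keeping the labels as separate fields makes it easy to filter by emotion or event without touching the transcript text.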

SenseVoice is deeply entangled with FunASR: basic usage requires `trust_remote_code=True` and loads model code from a local file, which is a supply chain risk most teams should think twice about. Audio event detection is trained on speech data and trails specialized AED models on ESC-50; it detects applause and laughter but isn't a replacement for BEATs or PANNs. Speaker diarization requires installing FunASR from source rather than a stable release, which is a rough edge for anything close to production. The benchmarks are self-reported, and the comparisons against Whisper lean on datasets where Mandarin performance was always going to favor the Alibaba model; English accuracy numbers against Whisper on diverse Common Voice subsets are harder to find.
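One way a team might reduce the risk of executing a local code file is to pin its hash after review and refuse to load the model if the file changes. This is a generic sketch and not something the project documents; `verify_pinned` and the temporary `model.py` stand-in are hypothetical names used only for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def verify_pinned(path: Path, expected_sha256: str) -> None:
    """Raise if the file no longer matches the digest recorded at review time."""
    actual = sha256_of(path)
    if actual != expected_sha256:
        raise RuntimeError(f"{path} changed: expected {expected_sha256}, got {actual}")


# Demo: a temporary file stands in for the reviewed local model code.
with tempfile.TemporaryDirectory() as tmp:
    code_file = Path(tmp) / "model.py"
    code_file.write_text("# reviewed model code\n")
    pinned = sha256_of(code_file)      # record this once, after code review
    verify_pinned(code_file, pinned)   # passes while the file is unchanged

    code_file.write_text("# tampered\n")
    try:
        verify_pinned(code_file, pinned)
        tamper_detected = False
    except RuntimeError:
        tamper_detected = True

print(tamper_detected)
```

In practice the check would run immediately before the model is constructed, with the pinned digest stored in version control alongside the deployment config.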
