// the find

k2-fsa/sherpa-onnx

★ 12,935 · C++ · Apache-2.0 · updated Jun 2026

Speech-to-text, text-to-speech, speaker diarization, speech enhancement, source separation, and VAD using next-gen Kaldi with onnxruntime without Internet connection. Support embedded systems, Android, iOS, HarmonyOS, Raspberry Pi, RISC-V, RK NPU, Axera NPU, Ascend NPU, x86_64 servers, websocket server/client, support 12 programming languages

sherpa-onnx is an offline speech processing library built on top of onnxruntime and next-gen Kaldi. It covers the full stack — ASR (streaming and batch), TTS, VAD, speaker diarization, source separation, keyword spotting, and more — all running locally without any network dependency. The target audience is embedded/mobile developers who need production-grade speech on constrained hardware, and desktop developers who want privacy-preserving speech features without cloud APIs.

1. Platform coverage is genuinely impressive: x86_64, arm32/64, RISC-V, Android, iOS, HarmonyOS, WebAssembly, and explicit support for Rockchip, Qualcomm, Ascend, and Axera NPUs. This is not just claimed: there are separate CI workflows and pre-built binaries for each.

2. Language bindings span 12 languages (C++, C, Python, Go, C#, Java, Kotlin, Swift, Rust, Dart, JavaScript, Pascal) with actual NuGet/pip/npm/pub packages published and tested in CI, not just header wrappers someone committed once.

3. ONNX as the model format is the right call here: it decouples model training from inference, lets you swap in quantized int8 variants (several are listed), and onnxruntime's hardware execution provider (EP) support gives NPU acceleration without custom backends.

4. Active model zoo with real variety — Whisper, Paraformer, Zipformer, SenseVoice, NeMo Parakeet, Moonshine — covering languages well beyond English/Chinese, with export scripts checked in for reproducibility.

1. Model management is entirely manual: you download a tarball, point a config struct at the directory, and hope you got the right variant. There's no version-pinned model registry and no hash verification in the API itself, and mixing up streaming and non-streaming model files silently produces bad output rather than a clear error.

2. The C API is the lingua franca for all the language bindings, which means the abstraction level is very low: you're passing config structs with a dozen string fields by hand. The C# and Go wrappers are thin and expose the same config sprawl directly, so onboarding for anything beyond the example scripts takes real effort.

3. No built-in batching for offline ASR — the API is one audio stream in, one result out. If you're processing a queue of files on a server, you're writing your own batching logic and leaving throughput on the table, especially on GPU.

4. Documentation quality is inconsistent: the k2-fsa.github.io/sherpa docs are reasonably complete for the main paths, but NPU-specific setup (QNN, RKNN, Ascend) is sparse enough that the GitHub issues are often the only place to find what actually works.
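
The missing hash verification noted under model management can be added on the caller's side. The following is a minimal sketch, assuming you keep your own manifest of expected SHA-256 digests next to your deployment config. The helper names (`sha256_of`, `verify_model_dir`) and the manifest layout are hypothetical and are not part of sherpa-onnx.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model_dir(model_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Check every file named in the manifest and return a list of problems.

    `manifest` (a hypothetical format) maps a file name relative to
    `model_dir` to its expected hex digest. An empty list means every
    file was present and matched.
    """
    problems = []
    for name, expected in manifest.items():
        path = model_dir / name
        if not path.is_file():
            problems.append(f"missing: {name}")
        elif sha256_of(path) != expected.lower():
            problems.append(f"hash mismatch: {name}")
    return problems
```

Running a check like this before constructing a recognizer turns a wrong or truncated download into an explicit error at startup. It does not detect a streaming model passed to a non-streaming config unless the manifest is pinned per variant.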
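
For the file-queue case mentioned under batching, the work can at least be spread across worker threads by the caller. This is a rough sketch, assuming a hypothetical `transcribe` callable that wraps whatever single-file recognition call your binding exposes. Nothing here is sherpa-onnx API, and whether threads help in practice depends on the binding releasing the GIL and on how many threads the runtime already uses internally.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, List, Tuple


def chunked(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def transcribe_queue(
    paths: Iterable[str],
    transcribe: Callable[[str], str],  # hypothetical single-file wrapper
    batch_size: int = 8,
    workers: int = 4,
) -> List[Tuple[str, str]]:
    """Run `transcribe` over `paths` and return (path, text) pairs in input order.

    Only one batch is in flight at a time, which bounds how much audio
    is held in memory at once.
    """
    results: List[Tuple[str, str]] = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in chunked(paths, batch_size):
            # pool.map preserves the order of its inputs.
            results.extend(zip(batch, pool.map(transcribe, batch)))
    return results
```

This is request-level parallelism, not batched inference inside the model, so it will not recover the GPU throughput that true batching would.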
