// the find

alesaccoia/VoiceStreamAI

★ 958 · Python · MIT · updated Oct 2024

Near-Realtime audio transcription using self-hosted Whisper and WebSocket in Python/JS

VoiceStreamAI is a self-hosted real-time transcription server that pipes browser microphone audio over WebSocket through VAD (pyannote) and then Whisper (faster-whisper by default). It's for developers who want on-premise speech-to-text without sending audio to a third-party API.
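To make the VAD-then-ASR idea concrete, here is a minimal sketch of gating audio before transcription. It is not the project's code: it swaps pyannote for a crude RMS energy threshold, and the function names, the 16-bit mono PCM chunk format, and the threshold value are all assumptions for illustration.

```python
import array
import math


def rms(pcm: bytes) -> float:
    """Root-mean-square level of a 16-bit PCM chunk (native byte order assumed)."""
    samples = array.array("h", pcm)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def speech_chunks(chunks, threshold: float = 500.0):
    """Keep only chunks loud enough to plausibly contain speech.

    Energy gating is a stand-in for a real VAD model such as pyannote;
    the threshold is an arbitrary illustrative value.
    """
    return [chunk for chunk in chunks if rms(chunk) >= threshold]
```

Only the chunks that survive the gate would be handed to the ASR model, which is where the GPU saving comes from.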

The VAD-before-ASR pipeline is the right architecture: dropping silence before it reaches Whisper saves meaningful GPU time. The factory/strategy pattern for swapping VAD and ASR backends is genuinely clean; adding a new ASR implementation means touching one file. faster-whisper is a good default: its README reports up to 4x faster inference than the original openai/whisper at the same accuracy. The Docker setup uses a named volume for the Hugging Face model cache, so you don't re-download 1.5GB on every container restart.
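For readers unfamiliar with the pattern, a factory over swappable backends can look roughly like the sketch below. The class and function names (ASRBackend, EchoASR, register, create_asr) are hypothetical and are not taken from VoiceStreamAI; the stand-in backend runs no model.

```python
from abc import ABC, abstractmethod


class ASRBackend(ABC):
    """Hypothetical common interface that every ASR backend implements."""

    @abstractmethod
    def transcribe(self, pcm: bytes) -> str:
        ...


class EchoASR(ASRBackend):
    """Stand-in backend: reports the byte count instead of running a model."""

    def transcribe(self, pcm: bytes) -> str:
        return f"<{len(pcm)} bytes>"


# Maps a config name to a backend class (hypothetical registry).
_REGISTRY: dict[str, type[ASRBackend]] = {}


def register(name: str, cls: type[ASRBackend]) -> None:
    _REGISTRY[name] = cls


def create_asr(name: str, **kwargs) -> ASRBackend:
    """Factory: look up a backend by name and instantiate it."""
    cls = _REGISTRY.get(name)
    if cls is None:
        raise ValueError(f"unknown ASR backend: {name!r}")
    return cls(**kwargs)


register("echo", EchoASR)
```

With this shape, a new backend is one new class plus one register call, and the server code only ever sees the ASRBackend interface.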

Audio chunks are written to disk before being fed to the model. The README admits this, and it's a real bottleneck for both latency and I/O once multiple clients connect concurrently. The 'strategies' in buffering_strategies.py are implemented as if/else blocks inside a single file rather than as actual strategy objects, despite the README's claim of an OOP strategy pattern. The last commit was in October 2024 and the project looks stalled: faster-whisper has moved on significantly since then, and pyannote's VAD has had breaking API changes that may not be reflected. There's no auth or rate limiting on the WebSocket server, so exposing port 8765 directly to untrusted networks is a bad idea.
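One way around the disk round-trip, sketched here under assumptions, is to wrap the incoming PCM in a WAV container held in memory. The helper name pcm_to_wav_buffer is hypothetical, and the 16 kHz, 16-bit mono format is an assumption about what the client sends. Whether a given ASR backend accepts a file-like object in place of a path needs checking against that backend's documentation.

```python
import io
import wave


def pcm_to_wav_buffer(pcm: bytes, sample_rate: int = 16000) -> io.BytesIO:
    """Wrap raw 16-bit mono PCM in an in-memory WAV container.

    The sample rate and sample format are assumptions for illustration.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    buf.seek(0)                    # rewind so a consumer reads from the start
    return buf
```

The buffer lives and dies with the request, so concurrent clients no longer contend for filesystem I/O or leave temporary files behind.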
