// the find
jianfch/stable-ts
Transcription, forced alignment, and audio indexing with OpenAI's Whisper
stable-ts patches OpenAI's Whisper to produce more accurate word-level timestamps by suppressing silence, applying VAD, and adjusting segment boundaries post-inference. It also supports forced alignment (syncing existing text to audio) and works as a drop-in wrapper for the faster-whisper, Hugging Face Transformers, and MLX Whisper backends. Useful for anyone building subtitles, captions, or audio indexing pipelines who finds that vanilla Whisper's timestamps drift or land a word off.
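For a sense of the basic call, here is a minimal sketch. It assumes the `stable_whisper` module name and the `load_model`, `transcribe`, and `to_srt_vtt` calls as shown in the project's README; the helper names `subtitle_path` and `transcribe_to_srt` are made up for illustration, and `base` is an arbitrary model size.

```python
from pathlib import Path


def subtitle_path(audio_path: str, ext: str = ".srt") -> str:
    """Return a subtitle path that sits next to the audio file (hypothetical helper)."""
    return str(Path(audio_path).with_suffix(ext))


def transcribe_to_srt(audio_path: str, model_name: str = "base") -> str:
    """Transcribe one file and write subtitles beside it (hypothetical helper)."""
    # Imported lazily so this module loads even where stable-ts is not installed.
    import stable_whisper

    model = stable_whisper.load_model(model_name)  # model size is an arbitrary choice
    result = model.transcribe(audio_path)          # every other argument left at its default
    out_path = subtitle_path(audio_path)
    result.to_srt_vtt(out_path)                    # README usage: the path carries the .srt extension
    return out_path
```

Calling `transcribe_to_srt("talk.mp3")` should leave a `talk.srt` next to the audio, provided stable-ts and its Whisper dependency are installed.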
The silence suppression and VAD-based timestamp adjustment work well: vanilla Whisper regularly stamps words 200–500 ms early, and this fixes most of that. Forced alignment via `model.align()` is genuinely useful, because if you already have a corrected transcript you don't have to re-run full inference just to get tight word timings. Backend agnosticism is solid, with the same postprocessing pipeline working across vanilla Whisper, faster-whisper, HF Transformers, and Apple MLX with minimal changes to the API surface. Saving results as JSON and reprocessing them without re-running inference is a nice touch for iterating on postprocessing settings.
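The align-then-iterate workflow might look roughly like the sketch below. It assumes the `model.align()`, `save_as_json`, `WhisperResult`, and `split_by_length` calls as they appear in the README; the function names, the whitespace cleanup step, and the 42-character limit are illustrative choices, not anything the library prescribes.

```python
def normalize_transcript(text: str) -> str:
    """Collapse line breaks and runs of spaces in a pasted transcript (hypothetical helper)."""
    return " ".join(text.split())


def align_and_save(audio_path: str, transcript: str, json_path: str, language: str = "en"):
    """Align an existing transcript to the audio and cache the result as JSON."""
    # Imported lazily so this module loads even where stable-ts is not installed.
    import stable_whisper

    model = stable_whisper.load_model("base")  # model size is an arbitrary choice
    result = model.align(audio_path, normalize_transcript(transcript), language=language)
    result.save_as_json(json_path)             # cache so later tweaks skip the model entirely
    return result


def regroup_from_json(json_path: str, srt_path: str, max_chars: int = 42) -> None:
    """Reload a cached result and try a different segment length without re-running the model."""
    import stable_whisper

    result = stable_whisper.WhisperResult(json_path)
    result.split_by_length(max_chars=max_chars)  # 42 is an assumed caption width
    result.to_srt_vtt(srt_path)
```

The point of the split is that `align_and_save` is the only step that touches the model; `regroup_from_json` can be re-run with different settings as often as needed.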
Development is indefinitely paused, according to a note in the README. That's not a soft warning: it means bug reports and compatibility issues with newer Whisper or faster-whisper versions will go unaddressed. The library monkey-patches Whisper internals, so any upstream Whisper update can silently break things in ways that are hard to debug. The parameter surface is enormous (`transcribe()` takes 40+ arguments), with defaults scattered across `stable_whisper.default.DEFAULT_VALUES`, which makes it hard to reason about what you actually get from a basic call. There is no streaming or real-time transcription support despite the `stream=True` parameter: it still loads audio in 30-second chunks, which is not true streaming.