WhisperX and CrisperWhisper
WhisperX and CrisperWhisper both extend OpenAI's Whisper, but in different directions, making them complements rather than competitors: WhisperX provides a diarization and word-level timestamping pipeline around Whisper models, while CrisperWhisper is a retrained Whisper model tuned for verbatim transcription and filler detection, and its Faster Whisper compatibility means it can slot into pipelines like WhisperX's.
About whisperX
m-bain/whisperX
WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
Builds on OpenAI's Whisper by combining faster-whisper for batched GPU inference (up to 70x real-time transcription) with wav2vec2 forced phoneme alignment for accurate word-level timestamps. It integrates pyannote-audio for speaker diarization and applies voice activity detection (VAD) preprocessing to reduce hallucinations without sacrificing quality. Supports multiple languages, automatically selecting language-specific alignment models from HuggingFace and torchaudio.
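The final step of that pipeline, attaching speakers to aligned words, can be illustrated with a dependency-free sketch: each word, already carrying start/end times from forced alignment, is assigned to the diarization turn it overlaps most. WhisperX exposes this step as `assign_word_speakers`; the function below is a hypothetical simplification for illustration, not the library's actual code.

```python
# Simplified sketch of WhisperX's word-to-speaker merge step.
# Hypothetical illustration, not the library's implementation.

def assign_word_speakers(words, speaker_segments):
    """words: [{'word': str, 'start': float, 'end': float}, ...]
    speaker_segments: [{'speaker': str, 'start': float, 'end': float}, ...]
    Returns a new word list with a 'speaker' key added where an overlap exists."""
    labeled = []
    for w in words:
        best_speaker, best_overlap = None, 0.0
        for seg in speaker_segments:
            # Temporal overlap between this word and this speaker turn.
            overlap = min(w["end"], seg["end"]) - max(w["start"], seg["start"])
            if overlap > best_overlap:
                best_overlap, best_speaker = overlap, seg["speaker"]
        labeled.append({**w, "speaker": best_speaker})
    return labeled

words = [
    {"word": "hello", "start": 0.1, "end": 0.4},
    {"word": "there", "start": 0.5, "end": 0.9},
]
turns = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 0.45},
    {"speaker": "SPEAKER_01", "start": 0.45, "end": 1.0},
]
labeled = assign_word_speakers(words, turns)
```

In the real pipeline the word timestamps come from the wav2vec2 alignment pass and the turns from pyannote-audio; the merge itself is just this kind of maximum-overlap assignment.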
About CrisperWhisper
nyrahealth/CrisperWhisper
Verbatim Automatic Speech Recognition with improved word-level timestamps and filler detection
Built on OpenAI's Whisper, CrisperWhisper employs a custom tokenizer and attention loss mechanism during training to achieve precise word-level timestamp alignment, particularly around disfluencies and pauses. It integrates seamlessly with both 🤗 Transformers and Faster Whisper pipelines, enabling deployment in existing speech recognition workflows. The model prioritizes verbatim transcription including fillers ("um", "uh") and speech artifacts, ranking first on the OpenASR Leaderboard for verbatim datasets like TED-LIUM and AMI.
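CrisperWhisper's timestamp precision around pauses matters because Hugging Face pipeline output can attach silence to an adjacent word; the project's README ships a small post-processing helper that redistributes such gaps. The sketch below is a simplified version loosely modeled on that idea; the function name, data shapes, and threshold here are assumptions, not the library's exact helper.

```python
# Simplified pause-redistribution sketch, loosely modeled on the helper
# in CrisperWhisper's README. Names and threshold are assumptions.

def adjust_pauses(chunks, split_threshold=0.12):
    """chunks: [{'text': str, 'timestamp': (start, end)}, ...], as an HF
    ASR pipeline emits with return_timestamps='word'. Gaps shorter than
    split_threshold are absorbed into the neighboring words; longer gaps
    donate only split_threshold / 2 to each side, leaving a visible pause."""
    adjusted = [dict(c) for c in chunks]
    for i in range(len(adjusted) - 1):
        cur_start, cur_end = adjusted[i]["timestamp"]
        nxt_start, nxt_end = adjusted[i + 1]["timestamp"]
        pause = nxt_start - cur_end
        if pause <= 0:
            continue
        distribute = pause / 2 if pause <= split_threshold else split_threshold / 2
        adjusted[i]["timestamp"] = (cur_start, cur_end + distribute)
        adjusted[i + 1]["timestamp"] = (nxt_start - distribute, nxt_end)
    return adjusted

chunks = [
    {"text": "so", "timestamp": (0.00, 0.50)},
    {"text": "um", "timestamp": (1.00, 1.20)},    # 0.5 s pause before this word
    {"text": "yeah", "timestamp": (1.25, 1.60)},  # 0.05 s gap, fully absorbed
]
result = adjust_pauses(chunks)
```

After adjustment the half-second silence before "um" survives as a pause (each neighbor gains only 0.06 s), while the 0.05 s gap before "yeah" is split between the two words, which is the behavior a verbatim transcript with explicit pauses wants.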