MahmoudAshraf97/whisper-diarization

Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper

Quality score: 56 / 100 (Established)

Combines Whisper with NVIDIA NeMo's voice activity detection (MarbleNet) and speaker embedding (TitaNet) models to attribute transcribed text to individual speakers. Uses source separation (Demucs) for vocal extraction, CTC forced alignment for precise timestamp correction, and punctuation-based realignment to compensate for temporal drift across segments. Outputs speaker-labeled transcriptions with segment-level timestamps, and supports configurable Whisper models and a parallel inference mode for systems with sufficient VRAM.


No package published · No dependents
Maintenance: 10 / 25
Adoption: 10 / 25
Maturity: 16 / 25
Community: 20 / 25
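As a sanity check on the card's arithmetic, the four sub-scores sum exactly to the overall score. A minimal sketch, assuming the composite is a plain sum of the sub-scores (the scoring formula is not documented here):

```python
# Sub-scores as shown on the card (each out of 25).
subscores = {
    "Maintenance": 10,
    "Adoption": 10,
    "Maturity": 16,
    "Community": 20,
}

# Assumption: the overall score is the plain sum of the sub-scores,
# out of 25 points per category. This reproduces the card's 56 / 100.
total = sum(subscores.values())
maximum = 25 * len(subscores)
print(f"{total} / {maximum}")  # → 56 / 100
```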


Stars: 5,437
Forks: 500
Language: Jupyter Notebook
License: BSD-2-Clause
Last pushed: Feb 23, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/voice-ai/MahmoudAshraf97/whisper-diarization"

Open to everyone: 100 requests/day with no key required. Get a free API key for 1,000 requests/day.
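The endpoint above follows an owner/repo URL pattern, so the same data can be fetched programmatically for any repository. A minimal Python sketch of the URL construction; `quality_url` is a hypothetical helper name, and the response schema is not documented here, so only the request URL is built:

```python
import urllib.parse

# Base endpoint taken from the curl example above.
BASE = "https://pt-edge.onrender.com/api/v1/quality/voice-ai"

def quality_url(owner: str, repo: str) -> str:
    """Build the quality-API URL for a GitHub owner/repo pair.

    The path segments are percent-encoded defensively; for typical
    GitHub names this is a no-op. Fetch the URL with any HTTP client,
    e.g. urllib.request.urlopen(quality_url(...)).
    """
    return f"{BASE}/{urllib.parse.quote(owner)}/{urllib.parse.quote(repo)}"

print(quality_url("MahmoudAshraf97", "whisper-diarization"))
```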