Speech Corpora Datasets Voice AI Tools
Collections and catalogs of annotated speech audio data for training ASR, TTS, and voice AI models. Does NOT include tools for processing/cleaning datasets, annotation pipelines, or model implementations.
63 speech corpus dataset tools are tracked. One scores above 50 (Established tier). The highest-rated is ynop/audiomate at 53/100, with 138 stars and 252 monthly downloads.
Get all 63 projects as JSON
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=voice-ai&subcategory=speech-corpora-datasets&limit=20"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
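For scripted access, the same request can be made from Python. This is a minimal sketch using only the standard library: the endpoint and the `domain`, `subcategory`, and `limit` parameters are copied from the curl command above, while the helper names `build_url` and `fetch_projects` are hypothetical. The response schema is not documented here, so the code returns the decoded JSON without assuming any field names. Whether a larger `limit` returns all 63 projects in one call is also an assumption to verify.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Endpoint taken from the curl example; nothing else about the API is assumed.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="voice-ai", subcategory="speech-corpora-datasets", limit=20):
    """Build the query URL with the same parameters as the curl example."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=30):
    """Send a GET request and return the decoded JSON body as-is.

    The response structure is not documented on this page, so no
    particular keys are accessed here.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


# Example usage (performs a network request, counts against the daily quota):
# data = fetch_projects(build_url())
# print(json.dumps(data, indent=2)[:500])
```

Keeping URL construction separate from the network call makes it easy to inspect or log the request before spending one of the 100 keyless requests per day.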
| # | Tool | Score | Tier |
|---|---|---|---|
| 1 | ynop/audiomate: Python library for handling audio datasets. | 53/100 | Established |
| 2 | davidmartinrius/speech-dataset-generator: 🔊 Create labeled datasets, enhance audio quality, identify speakers, support... | | Emerging |
| 3 | common-voice/cv-dataset: Metadata and versioning details for the Common Voice dataset | | Emerging |
| 4 | reazon-research/ReazonSpeech: Massive open Japanese speech corpus | | Emerging |
| 5 | EgorLakomkin/KTSpeechCrawler: Automatically constructing corpus for automatic speech recognition from... | | Emerging |
| 6 | coqui-ai/open-speech-corpora: 💎 A list of accessible speech corpora for ASR, TTS, and other Speech Technologies | | Emerging |
| 7 | yc9701/pansori: Tools for ASR Corpus Generation from Online Video | | Emerging |
| 8 | Niger-Volta-LTI/yoruba-text: Yorùbá language training text for NLP, ASR and TTS tasks | | Emerging |
| 9 | jim-schwoebel/download_audioset: 📁 This repo makes it easy to download the raw audio files from AudioSet... | | Emerging |
| 10 | Appen/UHV-OTS-Speech: A data annotation pipeline to generate high-quality, large-scale speech... | | Emerging |
| 11 | candlewill/Speech-Corpus-Collection: A Collection of Speech Corpus for ASR and TTS | | Emerging |
| 12 | dsfsi/dsfsi-datasets: Official DSFSI Public Datasets Registry - Comprehensive catalog of 50+... | | Emerging |
| 13 | robmsmt/ASR-Audio-Data-Links: A list of publicly available audio data that anyone can download for ASR... | | Emerging |
| 14 | Umbaji/NMTMD: Official repository for the Opensource Textdataset for NMT for local languages... | | Emerging |
| 15 | unza-speech-lab/zambezi-voice: Repository for multilingual speech data resources for native languages of Zambia. | | Emerging |
| 16 | wspr-ncsu/robocall-audio-dataset: A dataset of real-world robocall audio recordings | | Emerging |
| 17 | AsoSoft/AsoSoft-TTS-Speech-Corpus-for-Central-Kurdish: AsoSoft Speech Corpus for Central-Kurdish Text-To-Speech | | Emerging |
| 18 | silenterus/deepspeech-cleaner: Multi-Language Dataset Cleaner/Creator for Mozilla's DeepSpeech Framework | | Experimental |
| 19 | IS2AI/ISSAI_SAIDA_Kazakh_ASR: The first industrial-scale open-source Kazakh speech corpus. KSC2 corpus... | | Experimental |
| 20 | yc9701/pansori-tedxkr-corpus: Korean ASR Corpus generated from TEDx talks | | Experimental |
| 21 | khuangaf/ITRI-speech-recognition-dataset-generation: Automatic Speech Recognition Dataset Generation | | Experimental |
| 22 | PranavMishra17/VoicePersona-Dataset: A comprehensive voice persona dataset for character consistency in voice... | | Experimental |
| 23 | egorsmkv/asr-corpus-creator: This app is intended to automatically create a corpus for ASR systems using... | | Experimental |
| 24 | xinjli/ucla-phonetic-corpus: Dataset of ICASSP 2021 MULTILINGUAL PHONETIC DATASET FOR LOW RESOURCE SPEECH... | | Experimental |
| 25 | swarms/mozilla-common-voice: Swarms supports the Common Voice Project from Mozilla! This repo contains... | | Experimental |
| 26 | AI-TOOLKIT/VoiceData: Automatic Speech Recognition (ASR) Data Generator Toolkit | | Experimental |
| 27 | 97jamie/public-police-footage: Code for Constructing Datasets From Public Police Body Camera Footage (ICASSP 2025) | | Experimental |
| 28 | csikasote/bigc: This repository contains the data resources for the LacunaFund supported... | | Experimental |
| 29 | cyrta/broadcast-news-videos-dataset: Collection of broadcast news video clips | | Experimental |
| 30 | turinaf/Sagalee: Automatic Speech Recognition Dataset for Oromo Language | | Experimental |
| 31 | jhdeov/armenian-intonation: Repository of question-answer dialogues of Armenian, for an intonation study. | | Experimental |
| 32 | skit-ai/speech-to-intent-dataset: Dataset Release for Intent Classification from Speech | | Experimental |
| 33 | Niger-Volta-LTI/urhobo-asr-spoken-digits: URH-DIGITS is a connected digits speech recognition task | | Experimental |
| 34 | BYO-UPM/Neurovoz_Dababase: Neurovoz corpus of parkinsonian speech | | Experimental |
| 35 | goodmike31/pl-asr-speech-data-survey: Survey of available speech datasets for Polish ASR development | | Experimental |
| 36 | zhongyuchen/DSPSpeech-20: A speech dataset of 20 isolated words each with 680 recordings from 34 individuals | | Experimental |
| 37 | r9y9/jsut-lab: HTS-style full-context labels for JSUT v1.1 | | Experimental |
| 38 | Anwarvic/mTEDx_auxiliary: These are different files I created to do different tasks when I was working... | | Experimental |
| 39 | Prem-kumar27/Fast-KTSpeechCrawler: Parallelized automatic corpus collection for ASR. Forked from... | | Experimental |
| 40 | german-asr/megs: A merged version of multiple open-source German speech datasets. | | Experimental |
| 41 | rusiaaman/PCPM: Presenting Collection of Pretrained Models. Links to pretrained models in... | | Experimental |
| 42 | Pogayo/african-voices-web: Website that hosts the African Voices projects. Users can download datasets... | | Experimental |
| 43 | qcri/Arabic_speech_code_switching: The first Dialectal Arabic Code Switching - DACS corpus from broadcast... | | Experimental |
| 44 | bunyaminergen/awesome-speech-dataset: Awesome Speech Dataset, including download links and a brief explanation for... | | Experimental |
| 45 | apluka34/audio-crawler: A tool for crawling and creating audio dataset | | Experimental |
| 46 | motazsaad/jsc-news-broadcast: JSC news broadcast (speech corpus) | | Experimental |
| 47 | antouanbg/Bulgarian_Linguistic: Collection and resources for Bulgarian Corpus, Datasets and Models used in... | | Experimental |
| 48 | speakingofdata/80_Excerpts: 4 voices x 80 transcripts = 320 audio recordings | | Experimental |
| 49 | labsensacional/ASMRDataset: Recordings and transcriptions of ASMR artists compiled for the purpose of... | | Experimental |
| 50 | jp1924/HF_builders: A repo collecting builder scripts for 🤗 Datasets | | Experimental |
| 51 | czyzi0/the-mc-speech-dataset: Free speech dataset consisting of 24018 short audio clips of a single... | | Experimental |
| 52 | Mormolykos/bedvibe-datasets: Multilingual emotional speech datasets for TTS training | | Experimental |
| 53 | harveenchadha/Speech-Learning-Resources: Repo containing resources to learn about various verticals of speech. ASR, TTS | | Experimental |
| 54 | vislupus/Bulgarian-TTS-dataset: LibriVox dataset for Bulgarian language TTS | | Experimental |
| 55 | Rumeysakeskin/Speech-Datasets-for-ASR: Download speech datasets (English and non-English) for Automatic Speech Recognition | | Experimental |
| 56 | egorsmkv/asr-datasets-cleaner: A pipeline to make ASR datasets better | | Experimental |
| 57 | ubisoft/ubisoft-laforge-french-homograph-dataset: Dataset for La Forge Speech Synthesis System Submission to the Blizzard... | | Experimental |
| 58 | weimeng23/audio-speech-datasets: 📜 A list of various Audio/Speech datasets about Speech Recognition,... | | Experimental |
| 59 | speakingofdata/LJ2_Corpus: Single speaker, 26,200 transcribed audio recordings, 48 hours total | | Experimental |
| 60 | mrcraked/WordAudio: A massive collection of high-quality MP3 word pronunciations. Download,... | | Experimental |
| 61 | navalnica/be_nlp_speech_resources: Links to Belarusian NLP and Speech resources | | Experimental |
| 62 | Umbaji/Yodi: This is the official repository for Yodi, the speech recognition model for 8... | | Experimental |
| 63 | Aditya-ds-1806/Alar-voice-corpus: Voice corpus for the Alar Kannada-English Dictionary | | Experimental |