Speech Corpora Datasets Voice AI Tools

Collections and catalogs of annotated speech audio data for training ASR, TTS, and voice AI models. Does NOT include tools for processing/cleaning datasets, annotation pipelines, or model implementations.

There are 63 speech corpora dataset tools tracked. One scores above 50 (Established tier): ynop/audiomate, the highest-rated at 53/100, with 138 stars and 252 monthly downloads.

Get all 63 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=voice-ai&subcategory=speech-corpora-datasets&limit=20"

Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000/day.
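The same request can be made from a script. The following is a minimal Python sketch using only the standard library; it reuses the endpoint and query parameters from the curl command. The function names `build_url` and `fetch_projects` are illustrative, and the shape of the returned JSON is not documented here, so the sketch returns the decoded payload as-is. Note that the example URL uses `limit=20`; whether a larger `limit` returns all 63 projects in one call is an assumption to check against the API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(limit=20):
    """Build the query URL with the same parameters as the curl example."""
    params = {
        "domain": "voice-ai",
        "subcategory": "speech-corpora-datasets",
        "limit": limit,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_projects(limit=20, timeout=30):
    """Fetch and decode the JSON response.

    The response schema is an unknown here, so the decoded
    object is returned without further interpretation.
    """
    with urlopen(build_url(limit), timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Print the URL only; call fetch_projects() to hit the network.
    print(build_url())
```

Unauthenticated use is limited to 100 requests/day, so it is worth caching the response locally rather than calling `fetch_projects()` in a loop.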

#	Tool	Score	Tier	Stars	Language
1	ynop/audiomate Python library for handling audio datasets.	53	Established	138	Python
2	davidmartinrius/speech-dataset-generator 🔊 Create labeled datasets, enhance audio quality, identify speakers, support...	48	Emerging	257	Python
3	common-voice/cv-dataset Metadata and versioning details for the Common Voice dataset	46	Emerging	168	JavaScript
4	reazon-research/ReazonSpeech Massive open Japanese speech corpus	45	Emerging	373	Python
5	EgorLakomkin/KTSpeechCrawler Automatically constructing corpus for automatic speech recognition from...	40	Emerging	157	Python
6	coqui-ai/open-speech-corpora 💎 A list of accessible speech corpora for ASR, TTS, and other Speech Technologies	39	Emerging	1,390	—
7	yc9701/pansori Tools for ASR Corpus Generation from Online Video	39	Emerging	140	Python
8	Niger-Volta-LTI/yoruba-text Yorùbá language training text for NLP, ASR and TTS tasks	38	Emerging	82	Python
9	jim-schwoebel/download_audioset 📁 This repo makes it easy to download the raw audio files from AudioSet...	37	Emerging	105	Python
10	Appen/UHV-OTS-Speech A data annotation pipeline to generate high-quality, large-scale speech...	36	Emerging	106	Forth
11	candlewill/Speech-Corpus-Collection A Collection of Speech Corpus for ASR and TTS	36	Emerging	113	—
12	dsfsi/dsfsi-datasets Official DSFSI Public Datasets Registry - Comprehensive catalog of 50+...	34	Emerging	6	Jupyter Notebook
13	robmsmt/ASR-Audio-Data-Links A list of publicly available audio data that anyone can download for ASR...	33	Emerging	231	Shell
14	Umbaji/NMTMD Official repository for the Opensource Textdataset for NMT for local langues...	33	Emerging	26	—
15	unza-speech-lab/zambezi-voice Repository for multilingual speech data resources for native languages of Zambia.	32	Emerging	20	—
16	wspr-ncsu/robocall-audio-dataset A dataset of real-world robocall audio recordings	31	Emerging	14	—
17	AsoSoft/AsoSoft-TTS-Speech-Corpus-for-Central-Kurdish AsoSoft Speech Corpus for Central-Kurdish Text-To-Speech	30	Emerging	19	—
18	silenterus/deepspeech-cleaner Multi-Language Dataset Cleaner/Creator for Mozilla's DeepSpeech Framework	29	Experimental	48	Python
19	IS2AI/ISSAI_SAIDA_Kazakh_ASR the first industrial-scale open-source Kazakh speech corpus. KSC2 corpus...	28	Experimental	56	Shell
20	yc9701/pansori-tedxkr-corpus Korean ASR Corpus generated from TEDx talks	28	Experimental	27	—
21	khuangaf/ITRI-speech-recognition-dataset-generation Automatic Speech Recognition Dataset Generation	27	Experimental	37	Jupyter Notebook
22	PranavMishra17/VoicePersona-Dataset A comprehensive voice persona dataset for character consistency in voice...	27	Experimental	5	Python
23	egorsmkv/asr-corpus-creator This app is intended to automatically create a corpus for ASR systems using...	26	Experimental	27	Python
24	xinjli/ucla-phonetic-corpus Dataset of ICASSP 2021 MULTILINGUAL PHONETIC DATASET FOR LOW RESOURCE SPEECH...	26	Experimental	46	Python
25	swarms/mozilla-common-voice Swarms supports the Common Voice Project from Mozilla! This repo contains...	26	Experimental	5	Python
26	AI-TOOLKIT/VoiceData Automatic Speech Recognition (ASR) Data Generator Toolkit	26	Experimental	5	—
27	97jamie/public-police-footage Code for Constructing Datasets From Public Police Body Camera Footage (ICASSP 2025)	25	Experimental	2	Jupyter Notebook
28	csikasote/bigc This repository contains the data resources for the LacunaFund supported...	25	Experimental	10	—
29	cyrta/broadcast-news-videos-dataset Collection of broadcast news video clips	25	Experimental	5	—
30	turinaf/Sagalee Automatic Speech Recognition Dataset for Oromo Language	25	Experimental	28	Python
31	jhdeov/armenian-intonation Repository of question-answer dialogues of Armenian, for an intonation study.	24	Experimental	4	—
32	skit-ai/speech-to-intent-dataset Dataset Release for Intent Classification from Speech	24	Experimental	48	Python
33	Niger-Volta-LTI/urhobo-asr-spoken-digits URH-DIGITS is a connected digits speech recognition task	24	Experimental	4	—
34	BYO-UPM/Neurovoz_Dababase Neurovoz corpus of parkinsonian speech	24	Experimental	9	Python
35	goodmike31/pl-asr-speech-data-survey Survey of available speech datasets for Polish ASR development	24	Experimental	17	Python
36	zhongyuchen/DSPSpeech-20 A speech dataset of 20 isolated words each with 680 recordings from 34 individuals	23	Experimental	2	—
37	r9y9/jsut-lab HTS-style full-context labels for JSUT v1.1	22	Experimental	51	—
38	Anwarvic/mTEDx_auxiliary These are different files I created to do different tasks when I was working...	22	Experimental	1	Python
39	Prem-kumar27/Fast-KTSpeechCrawler Parallelized automatic corpus collection for ASR. Forked from...	22	Experimental	23	Python
40	german-asr/megs A merged version of multiple open-source German speech datasets.	22	Experimental	34	Jupyter Notebook
41	rusiaaman/PCPM Presenting Collection of Pretrained Models. Links to pretrained models in...	22	Experimental	23	—
42	Pogayo/african-voices-web Website that hosts the African Voices projects. Users can download datasets...	20	Experimental	7	Python
43	qcri/Arabic_speech_code_switching The first Dialectal Arabic Code Switching - DACS corpus from broadcast...	20	Experimental	15	—
44	bunyaminergen/awesome-speech-dataset Awesome Speech Dataset, including download links and a brief explanation for...	18	Experimental	26	—
45	apluka34/audio-crawler A tool for crawling and creating audio dataset	16	Experimental	3	Python
46	motazsaad/jsc-news-broadcast JSC news broadcast (speech corpus)	15	Experimental	1	Python
47	antouanbg/Bulgarian_Linguistic Collection and resources for Bulgarian Corpus, Datasets and Models used in...	15	Experimental	25	Java
48	speakingofdata/80_Excerpts 4 voices x 80 transcripts = 320 audio recordings	15	Experimental	—	—
49	labsensacional/ASMRDataset Recordings and transcriptions of ASMR artists compiled for the purpose of...	14	Experimental	12	—
50	jp1924/HF_builders A repo collecting builder scripts for 🤗 Datasets	14	Experimental	3	Python
51	czyzi0/the-mc-speech-dataset Free speech dataset consisting of 24018 short audio clips of a single...	14	Experimental	9	—
52	Mormolykos/bedvibe-datasets Multilingual emotional speech datasets for TTS training	14	Experimental	—	—
53	harveenchadha/Speech-Learning-Resources Repo containing resources to learn about various verticals of speech. ASR , TTS	13	Experimental	5	—
54	vislupus/Bulgarian-TTS-dataset LibriVox dataset for Bulgarian language TTS	13	Experimental	8	—
55	Rumeysakeskin/Speech-Datasets-for-ASR Download speech datasets (English and non-English) for Automatic Speech Recognition	12	Experimental	15	Jupyter Notebook
56	egorsmkv/asr-datasets-cleaner A pipeline to make ASR datasets better	12	Experimental	3	Python
57	ubisoft/ubisoft-laforge-french-homograph-dataset Dataset for La Forge Speech Synthesis System Submission to the Blizzard...	12	Experimental	3	—
58	weimeng23/audio-speech-datasets :scroll: A list of various Audio/Speech datasets about Speech Recognition,...	12	Experimental	3	—
59	speakingofdata/LJ2_Corpus Single speaker, 26,200 transcribed audio recordings, 48 hours total	11	Experimental	—	—
60	mrcraked/WordAudio A massive collection of high-quality MP3 word pronunciations. Download,...	11	Experimental	2	Python
61	navalnica/be_nlp_speech_resources Links to Belarusian NLP and Speech resources	11	Experimental	48	—
62	Umbaji/Yodi This is the official repository for Yodi, the speech recognition model for 8...	10	Experimental	1	Python
63	Aditya-ds-1806/Alar-voice-corpus Voice corpus for the Alar Kannada-English Dictionary	10	Experimental	1	Go
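
The Tier column appears to follow directly from the Score column: in the rows above, 53 is Established, scores from 30 to 48 are Emerging, and scores of 29 and below are Experimental. A minimal Python sketch of that mapping follows. The cut-offs are inferred from this table and from the "above 50" wording in the summary, not from any documented rule, and the function name `tier_for_score` is hypothetical. No row has a score of 49 or 50, so how the boundary is handled there is a guess.

```python
def tier_for_score(score: int) -> str:
    """Map a 0-100 score to a tier label.

    Thresholds are assumptions inferred from the table:
    above 50 is Established, 30 to 50 is Emerging,
    below 30 is Experimental.
    """
    if score > 50:
        return "Established"
    if score >= 30:
        return "Emerging"
    return "Experimental"


# Spot checks against rows from the table.
for score in (53, 48, 30, 29, 10):
    print(score, tier_for_score(score))
```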