Speech Corpora Datasets Voice AI Tools

Collections and catalogs of annotated speech audio data for training ASR, TTS, and voice AI models. Does NOT include tools for processing/cleaning datasets, annotation pipelines, or model implementations.

63 speech corpora dataset tools are tracked. One scores above 50 (Established tier). The highest-rated is ynop/audiomate at 53/100, with 138 stars and 252 monthly downloads.
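The tier labels used in the listing can be read as score bands. The sketch below is an assumption inferred only from the rows on this page (53 is Established, 30 to 48 are Emerging, 29 and below are Experimental); the page does not state the actual cutoffs, and `tier_for` is a hypothetical helper name, not part of any published API.

```python
def tier_for(score: int) -> str:
    """Map a 0-100 quality score to a tier label.

    Cutoffs are ASSUMED from the scores visible in the listing:
    the page only says that Established means "above 50", and the
    lowest Emerging score shown is 30. Scores of 49 to 52 never
    appear, so the exact Established boundary is a guess.
    """
    if score > 50:
        return "Established"
    if score >= 30:
        return "Emerging"
    return "Experimental"


if __name__ == "__main__":
    for sample in (53, 48, 30, 29, 10):
        print(sample, tier_for(sample))
```

Treat the boundaries as placeholders and adjust them if the scoring rules are documented elsewhere.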

Get all 63 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=voice-ai&subcategory=speech-corpora-datasets&limit=20"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
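For scripted access, the same request can be made from Python with the standard library. This is a minimal sketch: the endpoint and the `domain`, `subcategory`, and `limit` parameters are copied from the curl command, while the helper names `build_url` and `fetch_projects` are hypothetical. The shape of the JSON response is not documented on this page, so the code returns it undecoded beyond `json.load`. Whether a larger `limit` returns all 63 projects is also an assumption to verify.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="voice-ai", subcategory="speech-corpora-datasets", limit=20):
    """Return the listing URL with the same query parameters as the curl call."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(limit=20, timeout=30):
    """Send the GET request and parse the JSON body.

    The response structure is not described in the source, so the
    parsed object is returned as-is for the caller to inspect.
    """
    with urlopen(build_url(limit=limit), timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only prints the URL; call fetch_projects() to hit the network.
    print(build_url())
```

Keeping URL construction separate from the network call makes it easy to check the request before spending any of the daily quota.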

# Tool Score Tier
1 ynop/audiomate

Python library for handling audio datasets.

53
Established
2 davidmartinrius/speech-dataset-generator

🔊 Create labeled datasets, enhance audio quality, identify speakers, support...

48
Emerging
3 common-voice/cv-dataset

Metadata and versioning details for the Common Voice dataset

46
Emerging
4 reazon-research/ReazonSpeech

Massive open Japanese speech corpus

45
Emerging
5 EgorLakomkin/KTSpeechCrawler

Automatically constructing corpus for automatic speech recognition from...

40
Emerging
6 coqui-ai/open-speech-corpora

💎 A list of accessible speech corpora for ASR, TTS, and other Speech Technologies

39
Emerging
7 yc9701/pansori

Tools for ASR Corpus Generation from Online Video

39
Emerging
8 Niger-Volta-LTI/yoruba-text

Yorùbá language training text for NLP, ASR and TTS tasks

38
Emerging
9 jim-schwoebel/download_audioset

📁 This repo makes it easy to download the raw audio files from AudioSet...

37
Emerging
10 Appen/UHV-OTS-Speech

A data annotation pipeline to generate high-quality, large-scale speech...

36
Emerging
11 candlewill/Speech-Corpus-Collection

A Collection of Speech Corpus for ASR and TTS

36
Emerging
12 dsfsi/dsfsi-datasets

Official DSFSI Public Datasets Registry - Comprehensive catalog of 50+...

34
Emerging
13 robmsmt/ASR-Audio-Data-Links

A list of publicly available audio data that anyone can download for ASR...

33
Emerging
14 Umbaji/NMTMD

Official repository for the open-source text dataset for NMT for local languages...

33
Emerging
15 unza-speech-lab/zambezi-voice

Repository for multilingual speech data resources for native languages of Zambia.

32
Emerging
16 wspr-ncsu/robocall-audio-dataset

A dataset of real-world robocall audio recordings

31
Emerging
17 AsoSoft/AsoSoft-TTS-Speech-Corpus-for-Central-Kurdish

AsoSoft Speech Corpus for Central-Kurdish Text-To-Speech

30
Emerging
18 silenterus/deepspeech-cleaner

Multi-Language Dataset Cleaner/Creator for Mozilla's DeepSpeech Framework

29
Experimental
19 IS2AI/ISSAI_SAIDA_Kazakh_ASR

The first industrial-scale open-source Kazakh speech corpus. KSC2 corpus...

28
Experimental
20 yc9701/pansori-tedxkr-corpus

Korean ASR Corpus generated from TEDx talks

28
Experimental
21 khuangaf/ITRI-speech-recognition-dataset-generation

Automatic Speech Recognition Dataset Generation

27
Experimental
22 PranavMishra17/VoicePersona-Dataset

A comprehensive voice persona dataset for character consistency in voice...

27
Experimental
23 egorsmkv/asr-corpus-creator

This app is intended to automatically create a corpus for ASR systems using...

26
Experimental
24 xinjli/ucla-phonetic-corpus

Dataset of ICASSP 2021 MULTILINGUAL PHONETIC DATASET FOR LOW RESOURCE SPEECH...

26
Experimental
25 swarms/mozilla-common-voice

Swarms supports the Common Voice Project from Mozilla! This repo contains...

26
Experimental
26 AI-TOOLKIT/VoiceData

Automatic Speech Recognition (ASR) Data Generator Toolkit

26
Experimental
27 97jamie/public-police-footage

Code for Constructing Datasets From Public Police Body Camera Footage (ICASSP 2025)

25
Experimental
28 csikasote/bigc

This repository contains the data resources for the LacunaFund supported...

25
Experimental
29 cyrta/broadcast-news-videos-dataset

Collection of broadcast news video clips

25
Experimental
30 turinaf/Sagalee

Automatic Speech Recognition Dataset for Oromo Language

25
Experimental
31 jhdeov/armenian-intonation

Repository of question-answer dialogues of Armenian, for an intonation study.

24
Experimental
32 skit-ai/speech-to-intent-dataset

Dataset Release for Intent Classification from Speech

24
Experimental
33 Niger-Volta-LTI/urhobo-asr-spoken-digits

URH-DIGITS is a connected digits speech recognition task

24
Experimental
34 BYO-UPM/Neurovoz_Dababase

Neurovoz corpus of Parkinsonian speech

24
Experimental
35 goodmike31/pl-asr-speech-data-survey

Survey of available speech datasets for Polish ASR development

24
Experimental
36 zhongyuchen/DSPSpeech-20

A speech dataset of 20 isolated words each with 680 recordings from 34 individuals

23
Experimental
37 r9y9/jsut-lab

HTS-style full-context labels for JSUT v1.1

22
Experimental
38 Anwarvic/mTEDx_auxiliary

These are different files I created to do different tasks when I was working...

22
Experimental
39 Prem-kumar27/Fast-KTSpeechCrawler

Parallelized automatic corpus collection for ASR. Forked from...

22
Experimental
40 german-asr/megs

A merged version of multiple open-source German speech datasets.

22
Experimental
41 rusiaaman/PCPM

Presenting Collection of Pretrained Models. Links to pretrained models in...

22
Experimental
42 Pogayo/african-voices-web

Website that hosts the African Voices projects. Users can download datasets...

20
Experimental
43 qcri/Arabic_speech_code_switching

The first Dialectal Arabic Code Switching - DACS corpus from broadcast...

20
Experimental
44 bunyaminergen/awesome-speech-dataset

Awesome Speech Dataset, including download links and a brief explanation for...

18
Experimental
45 apluka34/audio-crawler

A tool for crawling and creating audio dataset

16
Experimental
46 motazsaad/jsc-news-broadcast

JSC news broadcast (speech corpus)

15
Experimental
47 antouanbg/Bulgarian_Linguistic

Collection and resources for Bulgarian Corpus, Datasets and Models used in...

15
Experimental
48 speakingofdata/80_Excerpts

4 voices x 80 transcripts = 320 audio recordings

15
Experimental
49 labsensacional/ASMRDataset

Recordings and transcriptions of ASMR artists compiled for the purpose of...

14
Experimental
50 jp1924/HF_builders

A repo collecting builder scripts for 🤗 Datasets

14
Experimental
51 czyzi0/the-mc-speech-dataset

Free speech dataset consisting of 24018 short audio clips of a single...

14
Experimental
52 Mormolykos/bedvibe-datasets

Multilingual emotional speech datasets for TTS training

14
Experimental
53 harveenchadha/Speech-Learning-Resources

Repo containing resources to learn about various verticals of speech: ASR, TTS

13
Experimental
54 vislupus/Bulgarian-TTS-dataset

LibriVox dataset for Bulgarian language TTS

13
Experimental
55 Rumeysakeskin/Speech-Datasets-for-ASR

Download speech datasets (English and non-English) for Automatic Speech Recognition

12
Experimental
56 egorsmkv/asr-datasets-cleaner

A pipeline to make ASR datasets better

12
Experimental
57 ubisoft/ubisoft-laforge-french-homograph-dataset

Dataset for La Forge Speech Synthesis System Submission to the Blizzard...

12
Experimental
58 weimeng23/audio-speech-datasets

📜 A list of various Audio/Speech datasets about Speech Recognition,...

12
Experimental
59 speakingofdata/LJ2_Corpus

Single speaker, 26,200 transcribed audio recordings, 48 hours total

11
Experimental
60 mrcraked/WordAudio

A massive collection of high-quality MP3 word pronunciations. Download,...

11
Experimental
61 navalnica/be_nlp_speech_resources

Links to Belarusian NLP and Speech resources

11
Experimental
62 Umbaji/Yodi

This is the official repository for Yodi, the speech recognition model for 8...

10
Experimental
63 Aditya-ds-1806/Alar-voice-corpus

Voice corpus for the Alar Kannada-English Dictionary

10
Experimental