Speech Recognition Datasets ML Frameworks

Multilingual audio corpora for training speech recognition, synthesis, and conversational AI models. Does NOT include general audio processing tools, music datasets, or non-speech audio collections.

This category tracks 8 speech recognition dataset projects. The highest-rated is Ijwi-ry-Ikirundi-AI/Kirundi_Dataset, with a score of 36/100 and 7 stars.

Get all 8 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=ml-frameworks&subcategory=speech-recognition-datasets&limit=20"
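
The same request can be made from Python. The following is a minimal sketch using only the standard library. The URL and query parameters are copied from the curl command above; the function names `build_url` and `fetch_projects` are hypothetical, and the structure of the JSON response is not documented here, so the code returns the decoded payload without assuming any field names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="ml-frameworks",
              subcategory="speech-recognition-datasets",
              limit=20):
    """Build the query URL with the same parameters as the curl command."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(timeout=30):
    """Fetch and decode the JSON payload (hypothetical helper).

    The response schema is an unknown here, so the decoded object
    is returned as-is rather than being indexed by assumed keys.
    """
    with urlopen(build_url(), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_projects()` performs the network request; `build_url()` alone can be used to inspect the URL or to change `limit` without sending anything.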

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.

#	Framework	Score	Tier	Stars	Language
1	Ijwi-ry-Ikirundi-AI/Kirundi_Dataset 🇧🇮 The first large-scale, open-source speech and text dataset for Kirundi...	36	Emerging	7	Jupyter Notebook
2	hstsethi/in-mob-prefix Dataset, charts, models of 4 digit mobile number prefixes in India by state,...	31	Emerging	5	Jupyter Notebook
3	apple/ml-spatial-librispeech A large synthetic dataset of spatial audio with multiple labels	29	Experimental	125	—
4	Jahangirbd23/WenetSpeech-Yue 📑 Explore WenetSpeech-Yue, a comprehensive Cantonese speech corpus with rich...	22	Experimental	—	Python
5	Nexdata-AI/359-Hours-Indonesian-Speech-Data-by-Mobile-Phone_Reading Indonesian Speech Dataset	18	Experimental	7	—
6	Nexdata-AI/207-Hours-Japanese-Speaking-English-Speech-Data-by-Mobile-Phone Japanese Speaking English Speech Dataset	16	Experimental	2	—
7	Nexdata-AI/338-Hours-Spanish-Speech-Data-by-Mobile-Phone Spanish Speech Dataset	14	Experimental	1	—
8	Nexdata-AI/98-Hours-Taiwan-Mandarin-Speech-Data-by-Mobile-Phone_Reading Taiwan Speech Dataset	14	Experimental	1	—
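
If the listing above is saved as tab-separated text, each row can be loaded into a small record for sorting or filtering. This is a sketch under assumptions: the six columns are taken to appear in the order given by the header row, a dash placeholder is taken to mean "no value", and the name `parse_row` is hypothetical.

```python
from typing import Optional, TypedDict

# Placeholder the table uses for a missing value (an em dash, U+2014).
MISSING = "\u2014"


class Row(TypedDict):
    rank: int
    framework: str          # repository name followed by its description
    score: int
    tier: str
    stars: Optional[int]
    language: Optional[str]


def parse_row(line: str) -> Row:
    """Split one tab-separated table row into typed fields."""
    rank, framework, score, tier, stars, language = line.rstrip("\n").split("\t")
    return {
        "rank": int(rank),
        "framework": framework,
        "score": int(score),
        "tier": tier,
        "stars": None if stars == MISSING else int(stars),
        "language": None if language == MISSING else language,
    }
```

Missing values are mapped to `None` rather than `0` so that a project with no reported star count is not confused with one that has zero stars.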