Multilingual Speech Datasets Voice AI Tools

Curated speech corpora and audio datasets across multiple languages for training ASR and speech processing models. Does NOT include text-to-speech synthesis, voice cloning, or speech recognition inference tools.

This category tracks 11 multilingual speech dataset projects. The highest-rated is apluka34/Bud500, which scores 37/100 and has 69 stars.

Get all 11 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=voice-ai&subcategory=multilingual-speech-datasets&limit=20"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
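The same request can be made from a script. The following is a minimal Python sketch using only the standard library; it assumes the endpoint returns a JSON body, as the "Get all 11 projects as JSON" label suggests. The function names `build_url` and `fetch_projects` are hypothetical helpers introduced for illustration, and the response schema is not documented here, so the code returns the parsed JSON without assuming any particular fields.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="voice-ai",
              subcategory="multilingual-speech-datasets",
              limit=20):
    """Build the query URL with the same parameters as the curl example."""
    query = urlencode({
        "domain": domain,
        "subcategory": subcategory,
        "limit": limit,
    })
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=10):
    """Fetch the URL and parse the body as JSON.

    Assumption: the response is JSON. Its structure is not specified
    on this page, so the parsed object is returned unchanged.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_projects(build_url())` performs one request, which counts against the daily limit described above. Inspect the returned object before relying on any field names.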

#	Tool	Score	Tier	Stars	Language
1	apluka34/Bud500 Bud500: A Comprehensive Vietnamese ASR Dataset	37	Emerging	69	—
2	qianchang/zici Zici: a collection of resources on Chinese classics and the pinyin of Chinese characters and words	36	Emerging	31	—
3	gheyret/UQSpeechDataset Uyghur Single Speaker Speech Dataset (also described in Japanese as "Uyghur speech dataset")	33	Emerging	34	—
4	speechio/BigCiDian Pronunciation lexicon covering both English and Chinese languages for...	33	Emerging	262	Python
5	harisbinzia/PronouncUR PronouncUR: An Urdu Pronunciation Lexicon Generator	32	Emerging	16	Python
6	jonsafari/buckeye_dict Buckeye Pronunciation Dictionary	24	Experimental	2	—
7	Dragon745/urdu-roman-dictionary A growing open-source Urdu → Roman Urdu dictionary and lexicon for...	22	Experimental	—	—
8	gheyret/thuyg20_scripts Script files of THUYG-20 (A free Uyghur speech database Released by...	15	Experimental	19	—
9	Nexdata-AI/100-Hours-Thai-Children-Spontaneous-Speech-Data Thai Child's Spontaneous Speech Data	14	Experimental	1	—
10	Nexdata-AI/650-Hours-Uyghur-Spontaneous-Speech-Data 650-Hours-Uyghur-Spontaneous-Speech-Data	14	Experimental	1	—
11	skit-ai/phone-number-entity-dataset Dataset Release for Phone Number Entity capture task	14	Experimental	14	—