Multilingual Speech Datasets Voice AI Tools
Curated speech corpora and audio datasets across multiple languages for training ASR and speech processing models. This category does not include text-to-speech synthesis, voice cloning, or speech recognition inference tools.
This category tracks 11 multilingual speech dataset projects. The highest-rated is apluka34/Bud500, which scores 37/100 and has 69 stars.
Get all 11 projects as JSON:
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=voice-ai&subcategory=multilingual-speech-datasets&limit=20"
The API is open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
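The same request can be made from a script. The following is a minimal Python sketch, assuming only what the curl command shows: the endpoint URL and the `domain`, `subcategory`, and `limit` query parameters. The helper names `build_url` and `fetch_projects` are hypothetical, the response schema is not documented on this page (so the decoded JSON is returned as-is), and since the page does not say how an API key is passed, the sketch makes keyless requests only.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl command on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain: str, subcategory: str, limit: int = 20) -> str:
    """Assemble the query URL (hypothetical helper, mirrors the curl command)."""
    query = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


def fetch_projects(url: str, timeout: float = 30.0):
    """Fetch the URL and decode the body as JSON.

    The response schema is an unknown here, so the decoded value is
    returned unchanged instead of assuming any field names.
    """
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage (performs a real network request and counts against
# the daily quota, so it is left commented out):
#   data = fetch_projects(build_url("voice-ai", "multilingual-speech-datasets"))
#   print(json.dumps(data, indent=2, ensure_ascii=False))
```

Building the query with `urllib.parse.urlencode` keeps the parameters properly escaped if other domain or subcategory values are substituted. With the keyless limit of 100 requests/day, caching the decoded result locally is likely preferable to refetching on every run.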
| # | Tool | Description | Score | Tier |
|---|---|---|---|---|
| 1 | apluka34/Bud500 | Bud500: A Comprehensive Vietnamese ASR Dataset | 37/100 | Emerging |
| 2 | qianchang/zici | Zici (characters and words): a collection of resources on Chinese classics, Chinese characters and words, and pinyin | | Emerging |
| 3 | gheyret/UQSpeechDataset | Uyghur Single Speaker Speech Dataset (Japanese tagline: "Uyghur speech dataset") | | Emerging |
| 4 | speechio/BigCiDian | Pronunciation lexicon covering both English and Chinese languages for... | | Emerging |
| 5 | harisbinzia/PronouncUR | PronouncUR: An Urdu Pronunciation Lexicon Generator | | Emerging |
| 6 | jonsafari/buckeye_dict | Buckeye Pronunciation Dictionary | | Experimental |
| 7 | Dragon745/urdu-roman-dictionary | A growing open-source Urdu → Roman Urdu dictionary and lexicon for... | | Experimental |
| 8 | gheyret/thuyg20_scripts | Script files of THUYG-20 (A free Uyghur speech database Released by... | | Experimental |
| 9 | Nexdata-AI/100-Hours-Thai-Children-Spontaneous-Speech-Data | Thai Child's Spontaneous Speech Data | | Experimental |
| 10 | Nexdata-AI/650-Hours-Uyghur-Spontaneous-Speech-Data | 650-Hours-Uyghur-Spontaneous-Speech-Data | | Experimental |
| 11 | skit-ai/phone-number-entity-dataset | Dataset Release for Phone Number Entity capture task | | Experimental |