Multilingual Speech Datasets Voice AI Tools

Curated speech corpora and audio datasets across multiple languages for training ASR and speech processing models. Does NOT include text-to-speech synthesis, voice cloning, or speech recognition inference tools.

11 multilingual speech dataset projects are tracked. The highest-rated is apluka34/Bud500, which scores 37/100 and has 69 stars.

Get all 11 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=voice-ai&subcategory=multilingual-speech-datasets&limit=20"

Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000 requests/day.
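The same request can be made from a script. The following is a minimal Python sketch that uses only the standard library and reuses the endpoint and query parameters from the curl command above. The helper names `build_url` and `fetch_projects` are hypothetical, and because the response schema is not documented on this page, the sketch returns the decoded JSON without assuming any field names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint copied from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="voice-ai", subcategory="multilingual-speech-datasets", limit=20):
    """Return the request URL with the same query parameters as the curl command."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=10):
    """Fetch the URL and return the decoded JSON body unchanged.

    The response schema is an unknown here, so nothing is assumed about
    its keys; inspect the result before relying on any field.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_projects(build_url())` issues one request, which counts against the daily limit, so it may be worth caching the result locally rather than re-fetching on every run.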

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | apluka34/Bud500 | Bud500: A Comprehensive Vietnamese ASR Dataset | 37 | Emerging |
| 2 | qianchang/zici | Zici (characters and words): a collection of resources on Chinese classical studies and Chinese characters, words, and pinyin | 36 | Emerging |
| 3 | gheyret/UQSpeechDataset | Uyghur Single Speaker Speech Dataset (Uyghur speech dataset) | 33 | Emerging |
| 4 | speechio/BigCiDian | Pronunciation lexicon covering both English and Chinese languages for... | 33 | Emerging |
| 5 | harisbinzia/PronouncUR | PronouncUR: An Urdu Pronunciation Lexicon Generator | 32 | Emerging |
| 6 | jonsafari/buckeye_dict | Buckeye Pronunciation Dictionary | 24 | Experimental |
| 7 | Dragon745/urdu-roman-dictionary | A growing open-source Urdu → Roman Urdu dictionary and lexicon for... | 22 | Experimental |
| 8 | gheyret/thuyg20_scripts | Script files of THUYG-20 (A free Uyghur speech database Released by... | 15 | Experimental |
| 9 | Nexdata-AI/100-Hours-Thai-Children-Spontaneous-Speech-Data | Thai Child's Spontaneous Speech Data | 14 | Experimental |
| 10 | Nexdata-AI/650-Hours-Uyghur-Spontaneous-Speech-Data | 650-Hours-Uyghur-Spontaneous-Speech-Data | 14 | Experimental |
| 11 | skit-ai/phone-number-entity-dataset | Dataset Release for Phone Number Entity capture task | 14 | Experimental |