apluka34/Bud500
Bud500: A Comprehensive Vietnamese ASR Dataset
Contains about 500 hours of multi-regional Vietnamese speech on diverse topics (podcasts, travel, food), sampled at 16 kHz, with 634K training samples paired with transcriptions. Hosted on Hugging Face Datasets as parquet files that can be loaded in streaming or batch mode via the `datasets` library. Curated by VietAI from publicly sourced material to provide regional accent diversity for reproducible ASR research.
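As a minimal sketch of the streaming and batch loading mentioned above: the helper names below (`load_split`, `preview`, `clip_seconds`) are hypothetical, the Hugging Face repository id is not given in this listing and must be supplied by the reader, and the column names are not assumed, so the usage lines only inspect whatever keys each row carries.

```python
from itertools import islice

SAMPLE_RATE_HZ = 16_000  # sampling rate stated in the dataset description


def load_split(repo_id, split="train", streaming=True):
    """Open one split of a Hugging Face dataset.

    streaming=True iterates over the parquet shards lazily;
    streaming=False downloads and caches the split first (batch loading).
    repo_id is an assumption to be filled in by the reader.
    """
    # Imported lazily so the pure helpers below work without the library.
    from datasets import load_dataset  # requires `pip install datasets`

    return load_dataset(repo_id, split=split, streaming=streaming)


def preview(samples, n=3):
    """Return the first n items of any iterable without consuming the rest."""
    return list(islice(samples, n))


def clip_seconds(num_samples, sample_rate_hz=SAMPLE_RATE_HZ):
    """Duration in seconds of a waveform with num_samples audio samples."""
    return num_samples / sample_rate_hz


# Usage (needs network access and a real repository id):
#   rows = preview(load_split("<hub-user>/<dataset-name>"), n=3)
#   for row in rows:
#       print(sorted(row.keys()))
```

Streaming is usually the more practical mode for a first look, since the full corpus does not have to be downloaded before the first row is available.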
Stars: 69
Forks: 9
Language: not specified
License: Apache-2.0
Category:
Last pushed: Oct 10, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/voice-ai/apluka34/Bud500"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
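The same request can be made from Python. This is a sketch under assumptions: the path layout `quality/<category>/<owner>/<repo>` is inferred from the single URL shown above, the function names `build_url` and `fetch_text` are hypothetical, and the response format is not documented here, so the body is returned as raw text rather than parsed.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base taken from the curl example above.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality"


def build_url(owner, repo, category="voice-ai"):
    """Build the endpoint URL; the path layout is an assumption (see lead-in)."""
    parts = (category, owner, repo)
    return API_BASE + "/" + "/".join(quote(p, safe="") for p in parts)


def fetch_text(url, timeout=10.0):
    """Perform a keyless GET and return the response body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Usage (needs network access; counts against the daily request limit):
#   print(fetch_text(build_url("apluka34", "Bud500")))
```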
Related tools
qianchang/zici
Zici: a collection of resources on Chinese classical studies and on Chinese characters, words, and pinyin
gheyret/UQSpeechDataset
Uyghur Single Speaker Speech Dataset.
speechio/BigCiDian
Pronunciation lexicon covering both English and Chinese languages for Automatic Speech Recognition.
harisbinzia/PronouncUR
PronouncUR: An Urdu Pronunciation Lexicon Generator
jonsafari/buckeye_dict
Buckeye Pronunciation Dictionary