apluka34/Bud500

Bud500: A Comprehensive Vietnamese ASR Dataset

/ 100

Emerging

Bud500 contains 500 hours of Vietnamese speech from multiple regions, covering topics such as podcasts, travel, and food. Audio is sampled at 16 kHz and organized as 634K training samples, each paired with a transcription. The dataset is hosted on Hugging Face Datasets as Parquet files and supports both streaming and batch loading via the `datasets` library. It was curated by VietAI from publicly sourced material to provide regional accent diversity for reproducible ASR research.
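A minimal sketch of the streaming path mentioned in the description, using `datasets.load_dataset` with `streaming=True`. The repository id `linhtran92/viet_bud500` is an assumption that does not appear on this page; substitute the id shown on the dataset's Hugging Face card. Column names are not assumed either, so the helper only prints the keys of the first few records.

```python
from itertools import islice
from typing import Any, Iterable, Iterator

# Assumption: this Hugging Face repo id is not given on this page.
# Replace it with the id from the dataset's Hugging Face card.
ASSUMED_REPO_ID = "linhtran92/viet_bud500"


def first_samples(stream: Iterable[Any], n: int) -> list:
    """Return at most the first n items of an iterable."""
    return list(islice(stream, n))


def stream_train_split(repo_id: str = ASSUMED_REPO_ID) -> Iterator[dict]:
    """Open the train split lazily; records are fetched as they are iterated."""
    # Imported here so first_samples() works without `datasets` installed.
    from datasets import load_dataset

    return iter(load_dataset(repo_id, split="train", streaming=True))


def preview(n: int = 3, repo_id: str = ASSUMED_REPO_ID) -> None:
    """Print the column names of the first n streamed records (needs network)."""
    for sample in first_samples(stream_train_split(repo_id), n):
        print(sorted(sample.keys()))
```

Calling `preview()` reads records on demand instead of downloading the whole split first. Dropping `streaming=True` from the `load_dataset` call switches to batch loading, which materializes the full split locally.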

No Package · No Dependents

Maintenance 6 / 25

Adoption 8 / 25

Maturity 16 / 25

Community 14 / 25


Language: —
License: Apache-2.0
Category: multilingual-speech-datasets
Last pushed: Oct 10, 2025


Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/voice-ai/apluka34/Bud500"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
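The curl request above can also be issued from Python with only the standard library. This is a sketch under two assumptions: that the path follows the pattern `/quality/<domain>/<owner>/<repo>`, which is inferred from the single URL shown here, and that the response body is JSON, whose schema is not documented on this page. The function names are illustrative.

```python
import json
import urllib.request

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(domain: str, repo: str) -> str:
    """Build the endpoint URL, e.g. domain='voice-ai', repo='apluka34/Bud500'."""
    # Assumption: path pattern inferred from the one example URL on this page.
    return f"{BASE_URL}/{domain}/{repo}"


def fetch_quality(domain: str, repo: str, timeout: float = 10.0):
    """GET the endpoint and decode the body.

    Assumption: the response is JSON; its fields are not documented here.
    """
    url = quality_url(domain, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, `fetch_quality("voice-ai", "apluka34/Bud500")` requests the same URL as the curl command. Unauthenticated calls count against the 100 requests/day limit.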

Related tools

qianchang/zici

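Zici (characters and words): a collection of resources on Chinese classics and the pinyin of Chinese characters and words.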

gheyret/UQSpeechDataset

Uyghur Single Speaker Speech Dataset.

speechio/BigCiDian

Pronunciation lexicon covering both English and Chinese languages for Automatic Speech Recognition.

harisbinzia/PronouncUR

PronouncUR: An Urdu Pronunciation Lexicon Generator

jonsafari/buckeye_dict

Buckeye Pronunciation Dictionary
