OpenMOSS/MOSS-Audio-Tokenizer
MOSS-Audio-Tokenizer is a Causal Transformer-based audio tokenizer built on the CAT architecture. Trained on 3M hours of diverse audio, it supports streaming and variable bitrates, delivering SOTA reconstruction and strong performance in generation and understanding—serving as a unified interface for next-generation native audio language models.
The model uses a 32-layer residual vector quantizer (RVQ) with pure Causal Transformer blocks (1.6B parameters in total) to compress 24 kHz audio to a 12.5 Hz frame rate, supporting bitrates from 0.125 to 4 kbps. It is trained end-to-end without pretrained encoders or distillation, jointly optimizing the encoder, quantizer, decoder, discriminator, and an LLM component for semantic alignment. It is available in PyTorch on Hugging Face and ModelScope, with ONNX Runtime and TensorRT backends for deployment-ready inference without PyTorch dependencies.
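The figures above are mutually consistent under one assumption that the summary does not state: each RVQ layer emits a 10-bit code per frame (a 1024-entry codebook), and lower bitrates are obtained by keeping only the first few layers. A minimal sketch of that arithmetic follows; `BITS_PER_CODE` and the function names are illustrative and not taken from the repository.

```python
# Figures stated in the summary.
SAMPLE_RATE_HZ = 24_000
FRAME_RATE_HZ = 12.5
NUM_RVQ_LAYERS = 32

# Assumption (not stated in the source): 10 bits per code, i.e. a
# 1024-entry codebook per RVQ layer. It is inferred from the stated
# 0.125 kbps minimum at a 12.5 Hz frame rate.
BITS_PER_CODE = 10


def samples_per_frame(sample_rate_hz: float = SAMPLE_RATE_HZ,
                      frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Number of input samples summarized by one token frame."""
    return sample_rate_hz / frame_rate_hz


def bitrate_kbps(num_layers: int,
                 frame_rate_hz: float = FRAME_RATE_HZ,
                 bits_per_code: int = BITS_PER_CODE) -> float:
    """Bitrate when only the first `num_layers` RVQ layers are kept."""
    return num_layers * frame_rate_hz * bits_per_code / 1000.0


if __name__ == "__main__":
    print("samples per frame:", samples_per_frame())
    for n in (1, 8, NUM_RVQ_LAYERS):
        print(f"{n:2d} layers -> {bitrate_kbps(n)} kbps")
```

Under this assumption, one layer gives the stated 0.125 kbps floor and all 32 layers give the stated 4 kbps ceiling, with each frame covering 1920 input samples.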
Stars: 162
Forks: 11
Language: Python
License: Apache-2.0
Category:
Last pushed: Mar 06, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/voice-ai/OpenMOSS/MOSS-Audio-Tokenizer"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
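The same request can be made from Python with the standard library. This is a sketch under two assumptions that the page does not confirm: the URL path follows a category/owner/repo pattern (inferred from the single example above), and the endpoint returns JSON. The helper names `quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL; the path layout is inferred from one example."""
    parts = (category, owner, repo)
    return BASE_URL + "/" + "/".join(quote(p, safe="") for p in parts)


def fetch_quality(category: str, owner: str, repo: str,
                  timeout: float = 10.0) -> dict:
    """Fetch the record for one repository (assumes a JSON response body)."""
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    print(quality_url("voice-ai", "OpenMOSS", "MOSS-Audio-Tokenizer"))
```

Calling `fetch_quality("voice-ai", "OpenMOSS", "MOSS-Audio-Tokenizer")` performs the network request; unauthenticated use counts against the 100 requests/day limit mentioned above.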
Higher-rated alternatives
Spr-Aachen/Easy-Voice-Toolkit
A user-friendly audio toolkit for voice recognition, voice transcription, voice conversion etc.
ftyers/commonvoice-utils
Linguistic processing for Common Voice
alphacep/awesome-russian-speech
Russian speech technology links
microsoft/UniSpeech
UniSpeech - Large Scale Self-Supervised Learning for Speech
microsoft/SpeechT5
Unified-Modal Speech-Text Pre-Training for Spoken Language Processing