OpenMOSS/MOSS-Audio-Tokenizer

MOSS-Audio-Tokenizer is a causal Transformer-based audio tokenizer built on the CAT architecture. Trained on 3M hours of diverse audio, it supports streaming and variable bitrates, and reports state-of-the-art reconstruction along with strong generation and understanding performance. It is intended as a unified interface for next-generation native audio language models.

Score: 41 / 100 (Emerging)

The model employs a 32-layer Residual Vector Quantizer with pure causal Transformer blocks (1.6B parameters total) to compress 24 kHz audio to a 12.5 Hz frame rate, supporting bitrates from 0.125 to 4 kbps. It is trained end-to-end without pretrained encoders or distillation, jointly optimizing the encoder, quantizer, decoder, discriminator, and an LLM component for semantic alignment. It is available in PyTorch on Hugging Face and ModelScope, with ONNX Runtime and TensorRT backends for inference without a PyTorch dependency.
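The figures in the summary above fit together arithmetically, and a short sketch makes the relationship explicit. It assumes each quantizer layer emits one 10-bit code per frame (a 1024-entry codebook); that value is not stated on this page and is only inferred from the 0.125 kbps lower bound at a 12.5 Hz frame rate. The function names are illustrative and are not part of the project's API.

```python
# Figures quoted in the summary above.
SAMPLE_RATE_HZ = 24_000
FRAME_RATE_HZ = 12.5
MAX_RVQ_LAYERS = 32

# ASSUMPTION: 10 bits per code (1024-entry codebooks). Not stated in the source;
# inferred from 0.125 kbps / 12.5 Hz = 10 bits per frame for a single layer.
BITS_PER_CODE = 10


def samples_per_frame(sample_rate_hz=SAMPLE_RATE_HZ, frame_rate_hz=FRAME_RATE_HZ):
    """Number of waveform samples summarized by one token frame."""
    return sample_rate_hz / frame_rate_hz


def bitrate_kbps(num_layers, frame_rate_hz=FRAME_RATE_HZ, bits_per_code=BITS_PER_CODE):
    """Bitrate when only the first `num_layers` RVQ layers are kept."""
    if not 1 <= num_layers <= MAX_RVQ_LAYERS:
        raise ValueError(f"num_layers must be in 1..{MAX_RVQ_LAYERS}")
    return num_layers * bits_per_code * frame_rate_hz / 1000


if __name__ == "__main__":
    print("samples per frame:", samples_per_frame())
    for n in (1, 8, 32):
        print(f"{n:2d} layers -> {bitrate_kbps(n)} kbps")
```

Under that assumption, one layer gives the quoted lower bound and all 32 layers give the quoted upper bound, so the variable bitrate would correspond to how many residual layers are kept.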

162 stars.

No package. No dependents.
Maintenance 10 / 25
Adoption 10 / 25
Maturity 11 / 25
Community 10 / 25


Stars: 162
Forks: 11
Language: Python
License: Apache-2.0
Last pushed: Mar 06, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/voice-ai/OpenMOSS/MOSS-Audio-Tokenizer"

Open to everyone: 100 requests/day with no key needed, or 1,000/day with a free key.
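The same request can be made from Python with the standard library. This is a minimal sketch: the URL pattern is taken from the curl command above, while the helper names and the assumption that the endpoint returns JSON are hypothetical and not confirmed by this page.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category, owner, repo):
    """Build the endpoint URL for one repository (hypothetical helper)."""
    return f"{BASE_URL}/{quote(category)}/{quote(owner)}/{quote(repo)}"


def fetch_quality(category, owner, repo, timeout=10.0):
    """Fetch the record for one repository.

    ASSUMPTION: the response body is JSON; the response schema is not
    documented on this page, so the result is returned unparsed beyond that.
    """
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

Calling `fetch_quality("voice-ai", "OpenMOSS", "MOSS-Audio-Tokenizer")` issues the same GET request as the curl command; unauthenticated calls count against the 100 requests/day limit.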