OpenMOSS/MOSS-Audio-Tokenizer

MOSS-Audio-Tokenizer is a Causal Transformer-based audio tokenizer built on the CAT architecture. Trained on 3M hours of diverse audio, it supports streaming and variable bitrates, delivering SOTA reconstruction and strong performance in generation and understanding—serving as a unified interface for next-generation native audio language models.

/ 100

Emerging

According to the README, the model pairs a 32-layer residual vector quantizer (RVQ) with pure causal Transformer blocks (1.6B parameters in total) to compress 24 kHz audio to a 12.5 Hz frame rate, with bitrates from 0.125 to 4 kbps. It is trained end to end without pretrained encoders or distillation, jointly optimizing the encoder, quantizer, decoder, discriminator, and an LLM component for semantic alignment. The model is available in PyTorch on Hugging Face and ModelScope, with ONNX Runtime and TensorRT backends for deployment without a PyTorch dependency.
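The figures above fit together arithmetically, and a short sketch makes the relationship concrete. It assumes each RVQ layer uses a 1024-entry codebook (10 bits per code). That codebook size is not stated in the summary; it is inferred only because it reproduces the quoted 0.125 to 4 kbps range at 12.5 frames per second. The function name `bitrate_bps` is hypothetical and is not part of the project's API.

```python
import math

SAMPLE_RATE_HZ = 24_000   # input audio sample rate (from the summary)
FRAME_RATE_HZ = 12.5      # token frames per second (from the summary)
NUM_RVQ_LAYERS = 32       # residual quantizer depth (from the summary)
CODEBOOK_SIZE = 1024      # ASSUMPTION: 10 bits per code, inferred from the bitrate range


def bitrate_bps(active_layers: int,
                codebook_size: int = CODEBOOK_SIZE,
                frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Bitrate in bits per second when only the first `active_layers` RVQ layers are kept."""
    if not 1 <= active_layers <= NUM_RVQ_LAYERS:
        raise ValueError(f"active_layers must be in 1..{NUM_RVQ_LAYERS}")
    bits_per_code = math.log2(codebook_size)
    return frame_rate_hz * active_layers * bits_per_code


# Each token frame summarizes this many raw audio samples.
samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ

if __name__ == "__main__":
    print(f"samples per frame: {samples_per_frame:.0f}")
    for layers in (1, 8, 16, 32):
        print(f"{layers:2d} layers -> {bitrate_bps(layers) / 1000:.3f} kbps")
```

Under that assumption, keeping only the first layer gives 125 bps and keeping all 32 gives 4000 bps, which is how dropping trailing RVQ layers would yield a variable bitrate.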

162 stars.

No Package · No Dependents

Maintenance 10 / 25

Adoption 10 / 25

Maturity 11 / 25

Community 10 / 25


Stars: 162

Forks

Language: Python

License: Apache-2.0

Higher-rated alternatives

Spr-Aachen/Easy-Voice-Toolkit

A user-friendly audio toolkit for voice recognition, voice transcription, voice conversion etc.

ftyers/commonvoice-utils

Linguistic processing for Common Voice

alphacep/awesome-russian-speech

Russian speech technology links

microsoft/UniSpeech

UniSpeech - Large Scale Self-Supervised Learning for Speech

microsoft/SpeechT5

Unified-Modal Speech-Text Pre-Training for Spoken Language Processing
