MOSS-TTSD and MOSS-Speech
MOSS-TTSD covers the synthesis side (text to spoken dialogue), while MOSS-Speech is an end-to-end model that understands and generates speech directly, making them complementary components of a voice conversation pipeline.
About MOSS-TTSD
OpenMOSS/MOSS-TTSD
MOSS-TTSD is a spoken dialogue generation model designed for expressive multi-speaker synthesis. It features long-context modeling, flexible speaker control, and multilingual support, while enabling zero-shot voice cloning from short audio references.
Built on a transformer architecture with audio tokenization via XY-Tokenizer, the model uses a continuation-based workflow: speaker reference audio and a dialogue script drive seamless multi-speaker synthesis over extended contexts. It is optimized for SGLang inference-engine acceleration (up to a 16x speedup), supports streaming generation and fine-tuning via LoRA or full-parameter training, and integrates with the Hugging Face model hub and Spaces for easy deployment across 20 languages.
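To make the continuation-based workflow concrete, the sketch below assembles one inference item pairing per-speaker reference audio with a tagged dialogue script. This is a minimal illustration, not the repo's API: the field names and the [S1]/[S2] speaker-tag convention are assumptions modeled on MOSS-TTSD's script-driven inference, so check the repo's examples for the exact schema.

```python
import json
from dataclasses import dataclass

@dataclass
class SpeakerRef:
    audio_path: str   # short reference clip used for zero-shot voice cloning
    transcript: str   # transcript of that clip

def build_dialogue_item(spk1: SpeakerRef, spk2: SpeakerRef,
                        turns: list[tuple[int, str]]) -> dict:
    """Assemble one inference item: the reference audio anchors each voice,
    and the tagged script is synthesized as a continuation of those references."""
    script = "".join(f"[S{spk}]{text}" for spk, text in turns)
    return {
        "text": script,                            # full multi-speaker script
        "prompt_audio_speaker1": spk1.audio_path,  # hypothetical field names:
        "prompt_text_speaker1": spk1.transcript,   # they mirror the repo's
        "prompt_audio_speaker2": spk2.audio_path,  # JSONL-driven inference but
        "prompt_text_speaker2": spk2.transcript,   # are not guaranteed to match
    }

item = build_dialogue_item(
    SpeakerRef("refs/alice.wav", "Hi, I'm Alice."),
    SpeakerRef("refs/bob.wav", "And I'm Bob."),
    [(1, "Did you read the release notes?"), (2, "I did, streaming looks great.")],
)
print(json.dumps(item, ensure_ascii=False, indent=2))
```

Because generation continues from the reference audio rather than running a separate cloning step per utterance, speaker identity can stay stable across a long dialogue without re-prompting.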
About MOSS-Speech
OpenMOSS/MOSS-Speech
MOSS-Speech is a true speech-to-speech large language model: it understands and generates speech directly, with no intermediate text guidance.
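To illustrate what "without text guidance" means, the stubbed sketch below contrasts a cascaded ASR -> LLM -> TTS pipeline with a direct speech-to-speech turn. The function bodies are placeholders, not MOSS-Speech's API; they exist only to show where the text bottleneck sits in each design.

```python
def asr(audio: bytes) -> str:
    return "transcript"          # stub: speech -> text (prosody and emotion dropped here)

def llm(text: str) -> str:
    return "reply text"          # stub: text -> text

def tts(text: str) -> bytes:
    return b"synthesized reply"  # stub: text -> speech

def encode_speech(audio: bytes) -> list[int]:
    return list(audio[:8])       # stub: waveform -> discrete speech tokens

def decode_speech(tokens: list[int]) -> bytes:
    return bytes(t % 256 for t in tokens)  # stub: speech tokens -> waveform

def cascaded_turn(audio_in: bytes) -> bytes:
    """Three models chained through text; paralinguistic cues are lost at the
    ASR boundary and must be re-invented by the TTS stage."""
    return tts(llm(asr(audio_in)))

def direct_turn(audio_in: bytes) -> bytes:
    """One speech language model maps input speech tokens straight to output
    speech tokens, with no text in between."""
    tokens_in = encode_speech(audio_in)
    tokens_out = [(t + 1) % 256 for t in tokens_in]  # stands in for generation
    return decode_speech(tokens_out)

print(cascaded_turn(b"hello"), direct_turn(b"hello"))
```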