MOSS-TTSD and MOSS-Speech
MOSS-TTSD covers the synthesis side (text to spoken dialogue), while MOSS-Speech is an end-to-end model that understands and generates speech directly, making them complementary components of a voice conversation pipeline.
About MOSS-TTSD
OpenMOSS/MOSS-TTSD
MOSS-TTSD is a spoken dialogue generation model designed for expressive multi-speaker synthesis. It features long-context modeling, flexible speaker control, and multilingual support, while enabling zero-shot voice cloning from short audio references.
Built on a transformer architecture with audio tokenization via XY-Tokenizer, the model uses a continuation-based workflow: speaker reference audio and a dialogue script drive seamless multi-speaker synthesis over extended contexts. It is optimized for SGLang inference-engine acceleration (up to a 16x speedup), supports streaming generation and fine-tuning via LoRA or full-parameter training, and integrates with the Hugging Face model hub and Spaces for easy deployment across 20 languages.
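To make the continuation-based workflow concrete, the sketch below assembles one inference item pairing per-speaker reference audio with a tagged dialogue script. This is a minimal illustration, not the repo's API: the field names and the [S1]/[S2] speaker-tag convention are assumptions modeled on MOSS-TTSD's script-driven inference, so check the repo's examples for the exact schema.

```python
import json
from dataclasses import dataclass

@dataclass
class SpeakerRef:
    audio_path: str   # short reference clip used for zero-shot voice cloning
    transcript: str   # transcript of that clip

def build_dialogue_item(spk1: SpeakerRef, spk2: SpeakerRef,
                        turns: list[tuple[int, str]]) -> dict:
    """Assemble one inference item: the reference audio anchors each voice,
    and the tagged script is synthesized as a continuation of those references."""
    script = "".join(f"[S{spk}]{text}" for spk, text in turns)
    return {
        "text": script,                            # full multi-speaker script
        "prompt_audio_speaker1": spk1.audio_path,  # hypothetical field names:
        "prompt_text_speaker1": spk1.transcript,   # they mirror the repo's
        "prompt_audio_speaker2": spk2.audio_path,  # JSONL-driven inference but
        "prompt_text_speaker2": spk2.transcript,   # are not guaranteed to match
    }

item = build_dialogue_item(
    SpeakerRef("refs/alice.wav", "Hi, I'm Alice."),
    SpeakerRef("refs/bob.wav", "And I'm Bob."),
    [(1, "Did you read the release notes?"), (2, "I did, streaming looks great.")],
)
print(json.dumps(item, ensure_ascii=False, indent=2))
```

Because generation continues from the reference audio rather than running a separate cloning step per utterance, speaker identity can stay stable across a long dialogue without re-prompting.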
About MOSS-Speech
OpenMOSS/MOSS-Speech
MOSS-Speech is a true speech-to-speech large language model: it understands and generates speech directly, with no intermediate text guidance.
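To illustrate what "without text guidance" means, the stubbed sketch below contrasts a cascaded ASR -> LLM -> TTS pipeline with a direct speech-to-speech turn. The function bodies are placeholders, not MOSS-Speech's API; they exist only to show where the text bottleneck sits in each design.

```python
def asr(audio: bytes) -> str:
    return "transcript"          # stub: speech -> text (prosody and emotion dropped here)

def llm(text: str) -> str:
    return "reply text"          # stub: text -> text

def tts(text: str) -> bytes:
    return b"synthesized reply"  # stub: text -> speech

def encode_speech(audio: bytes) -> list[int]:
    return list(audio[:8])       # stub: waveform -> discrete speech tokens

def decode_speech(tokens: list[int]) -> bytes:
    return bytes(t % 256 for t in tokens)  # stub: speech tokens -> waveform

def cascaded_turn(audio_in: bytes) -> bytes:
    """Three models chained through text; paralinguistic cues are lost at the
    ASR boundary and must be re-invented by the TTS stage."""
    return tts(llm(asr(audio_in)))

def direct_turn(audio_in: bytes) -> bytes:
    """One speech language model maps input speech tokens straight to output
    speech tokens, with no text in between."""
    tokens_in = encode_speech(audio_in)
    tokens_out = [(t + 1) % 256 for t in tokens_in]  # stands in for generation
    return decode_speech(tokens_out)

print(cascaded_turn(b"hello"), direct_turn(b"hello"))
```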