dia and dia2
Dia2 is the successor to Dia: an evolutionary improvement that adds streaming and real-time generation, rather than a parallel alternative.
About dia
nari-labs/dia
A TTS model capable of generating ultra-realistic dialogue in one pass.
Built on a 1.6B-parameter architecture, Dia directly synthesizes multi-speaker dialogue from transcripts, with audio conditioning for voice cloning and emotion control and support for nonverbal tags like laughter and coughing. It integrates with Hugging Face Transformers and provides inference through Python APIs, a CLI, and a Gradio UI, with a real-time factor of 0.9x–2.2x on an RTX 4090 depending on precision. It uses the Descript Audio Codec for audio generation and supports speaker consistency via seed fixing or audio prompts.
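Because Dia generates a whole dialogue in one pass, speaker turns and nonverbal cues are inline markup in a single transcript string. A minimal sketch of assembling such a transcript, assuming Dia's `[S1]`/`[S2]` speaker tags and parenthesized nonverbal cues; the `format_dialogue` helper is illustrative, not part of Dia's API:

```python
# Hypothetical helper illustrating Dia's transcript convention:
# speaker turns tagged [S1]/[S2], nonverbal cues inline in parentheses.
# This is a sketch for illustration, not the Dia API itself.

def format_dialogue(turns):
    """Join (speaker_tag, text) pairs into a single-pass transcript.

    turns: list of tuples like ("S1", "Hello!").
    """
    return " ".join(f"[{tag}] {text}" for tag, text in turns)

transcript = format_dialogue([
    ("S1", "Did you hear the news?"),
    ("S2", "No, tell me! (laughs)"),
])
print(transcript)
# → [S1] Did you hear the news? [S2] No, tell me! (laughs)
```

The resulting string would then be handed to the model's generation call, optionally alongside an audio prompt for voice cloning.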
About dia2
nari-labs/dia2
A TTS model capable of streaming conversational audio in real time.
Dia2 builds on the Kyutai Mimi codec to generate dialogue with speaker conditioning, enabling natural back-and-forth conversations by accepting audio context as input. It supports incremental generation from partial text without waiting for complete input, with 1B and 2B model variants optimized for CUDA inference using bfloat16 precision and optional CUDA graph acceleration. Audio conditioning via Whisper transcription yields stable voice output when generation is prefixed with speaker examples, supporting up to 2 minutes of English generation with word-level timestamps.
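The key difference from Dia is the incremental loop: text fragments arrive over time and audio chunks are emitted before the full input is known. A control-flow sketch under stated assumptions; `stream_tts` is a stand-in stub, not the dia2 API, and the placeholder payloads substitute for real Mimi codec frames:

```python
# Hypothetical sketch of incremental TTS: text arrives in fragments and
# audio chunks are yielded as soon as each fragment is processed.
# stream_tts is an illustrative stub, not the dia2 API.

from typing import Iterable, Iterator

def stream_tts(text_fragments: Iterable[str]) -> Iterator[bytes]:
    """Yield one (placeholder) audio chunk per incoming text fragment."""
    for fragment in text_fragments:
        # A real model would emit Mimi codec frames here; we emit a
        # labeled placeholder so the streaming control flow is visible.
        yield f"<audio:{fragment}>".encode()

def partial_text() -> Iterator[str]:
    # Simulates text arriving incrementally, e.g. from an LLM stream.
    yield "Hello, "
    yield "how are you?"

for chunk in stream_tts(partial_text()):
    # In practice each chunk would be written to an audio sink
    # immediately, rather than buffered until the input completes.
    print(chunk)
```

The point of the structure is that the consumer loop never waits for the complete transcript, which is what makes real-time back-and-forth conversation possible.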