VALL-E-X and vall-e
Both tools are independent PyTorch implementations of Microsoft's VALL-E family of zero-shot text-to-speech models, making them alternative open-source reproductions of closely related research: VALL-E-X targets the cross-lingual VALL-E X variant, while vall-e reproduces the original VALL-E.
About VALL-E-X
Plachtaa/VALL-E-X
An open-source implementation of Microsoft's VALL-E X zero-shot TTS model. A demo is available at https://plachtaa.github.io/vallex/
Supports multilingual synthesis across English, Chinese, and Japanese with emotion and accent control from short acoustic prompts. Uses an autoregressive architecture combining acoustic token prediction with Vocos neural vocoding for high-quality audio reconstruction. Integrates OpenAI's Whisper for speaker embedding extraction and includes Python APIs compatible with PyTorch 2.0+ on CUDA platforms.
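The pipeline described above (prompt encoding, autoregressive acoustic-token prediction, then vocoder reconstruction) can be sketched as a toy Python program. Every function here is an illustrative placeholder, not VALL-E-X's actual API; the real repo extracts prompt features with Whisper and decodes tokens with Vocos.

```python
# Toy sketch of a VALL-E-X-style synthesis pipeline.
# All functions are illustrative stand-ins, NOT the repo's real API.

def encode_prompt(wav):
    """Stand-in for extracting speaker/acoustic context from a short clip
    (the repo derives this from Whisper features)."""
    return [sum(wav) % 7]  # dummy one-element "embedding"

def predict_acoustic_tokens(text, lang, prompt_embedding, n=8):
    """Stand-in for the autoregressive decoder. A language tag steers
    accent, which is how cross-lingual prompting is typically expressed."""
    seed = hash((text, lang, tuple(prompt_embedding))) % 1000
    return [(seed + i) % 1024 for i in range(n)]  # fake codec token ids

def vocode(tokens):
    """Stand-in for Vocos: codec tokens -> waveform samples."""
    return [t / 1024.0 for t in tokens]

prompt = encode_prompt([0.1, -0.2, 0.05])       # short acoustic prompt
tokens = predict_acoustic_tokens("hello", lang="en", prompt_embedding=prompt)
audio = vocode(tokens)
assert len(audio) == len(tokens)
```

The key design point mirrored here is that the speaker identity enters only through the prompt embedding, so any short enrollment clip yields zero-shot voice cloning.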
About vall-e
lifeiteng/vall-e
PyTorch implementation of VALL-E (Zero-Shot Text-To-Speech). Reproduced demo: https://lifeiteng.github.io/valle/index.html
Implements a two-stage autoregressive-then-non-autoregressive decoder architecture that leverages neural codec tokens for speaker-preserving synthesis, trainable on a single GPU. Integrates with the Lhotse dataset framework and k2/Icefall speech processing toolkit for phoneme tokenization, feature extraction, and dataset preparation. Supports both English (LibriTTS) and Mandarin Chinese (AISHELL-1) with configurable acoustic prompt strategies during training.
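The two-stage decoding described above can be illustrated with a minimal sketch: an autoregressive stage predicts the first codebook's tokens frame by frame, then a non-autoregressive stage fills in the remaining codebooks in parallel. These functions are toy stand-ins, not the repo's code; the codebook count and vocabulary size follow EnCodec's common 8 x 1024 configuration.

```python
# Toy sketch of VALL-E's AR-then-NAR decoding over neural codec tokens.
# All functions are illustrative stand-ins, NOT lifeiteng/vall-e's API.
import random

N_CODEBOOKS = 8   # EnCodec-style residual codebooks (assumed config)
VOCAB = 1024      # entries per codebook (assumed config)

def ar_stage(phonemes, prompt_tokens, max_len=16):
    """Stage 1: autoregressively predict the FIRST codebook's tokens,
    conditioned on phonemes and the acoustic prompt."""
    rng = random.Random(0)
    tokens = []
    for _ in range(max_len):
        # a real model samples from p(c_t | phonemes, prompt, c_<t)
        tokens.append(rng.randrange(VOCAB))
    return tokens

def nar_stage(phonemes, prompt_tokens, first_codebook):
    """Stage 2: non-autoregressively predict codebooks 2..8, each layer
    conditioned on all previously predicted layers at once."""
    rng = random.Random(1)
    codebooks = [first_codebook]
    for _ in range(N_CODEBOOKS - 1):
        codebooks.append([rng.randrange(VOCAB) for _ in first_codebook])
    return codebooks  # [N_CODEBOOKS][T], ready for codec decoding

phonemes = ["HH", "AH", "L", "OW"]
prompt = [[3, 14, 159]] * N_CODEBOOKS  # 3 prompt frames per codebook
first = ar_stage(phonemes, prompt)
all_codebooks = nar_stage(phonemes, prompt, first)
assert len(all_codebooks) == N_CODEBOOKS
```

The AR stage gives variable-length, prosody-flexible generation, while the NAR stage keeps total decoding cost low since seven of the eight codebooks are produced without a per-frame loop.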