fluxions-ai/vui
100M parameter lightweight conversational text-to-speech model with breaths, laughter, multi-speaker dialogue, voice cloning, and streaming. Llama-based, on-device.
Built on a Llama-style causal transformer with ByT5 tokenization, it uses a custom audio codec (Fluac) combining Descript's DAC with Finite Scalar Quantization to reduce token rates 4x (21.5 Hz vs 86 Hz), enabling longer context windows and streaming synthesis via CUDA graphs. Trained on 40,000 hours of real conversational audio, it includes specialized checkpoint variants: ABRAHAM for single-speaker context-aware responses and COHOST for two-speaker dialogue synthesis. Integrates with Hugging Face (model hosting, VAD/segmentation pipelines) and torchcodec for audio encoding/decoding.
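To make the codec description concrete, here is a minimal sketch of Finite Scalar Quantization, the scheme Fluac pairs with a DAC-style autoencoder. The per-dimension level counts and function names are illustrative assumptions, not the repo's actual configuration:

```python
import numpy as np

def fsq_quantize(z, levels):
    """Quantize each latent dimension to a fixed number of scalar levels.

    z: array of shape (..., d) with values roughly in [-1, 1] (e.g. tanh output)
    levels: list of d ints, the number of allowed values per dimension
    """
    half = (np.asarray(levels) - 1) / 2.0
    # Scale to [-half, half], round to the nearest integer level, scale back.
    return np.round(z * half) / half

def fsq_codebook_size(levels):
    # FSQ's implicit codebook size is the product of per-dimension level counts;
    # no learned codebook or commitment loss is needed, unlike VQ-VAE.
    return int(np.prod(levels))

z = np.array([[0.37, -0.92, 0.05]])
print(fsq_quantize(z, levels=[7, 5, 5]))   # each entry snapped to an allowed level
print(fsq_codebook_size([7, 5, 5]))        # 175 possible codes
```

The token-rate arithmetic from the description follows directly: at 21.5 Hz, 10 seconds of audio costs 215 codec tokens instead of 860 at 86 Hz, which is what makes the longer context windows and streaming synthesis practical.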
Stars: 641
Forks: 63
Language: Python
License: MIT
Category:
Last pushed: Feb 25, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/fluxions-ai/vui"
Open to everyone: 100 requests/day with no key. A free key raises the limit to 1,000/day.
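For scripted use, the same request can be made from Python. The endpoint URL comes from the curl example above; the `X-API-Key` header name and the helper functions are assumptions for illustration, since this page does not document how a key is sent:

```python
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(ecosystem, owner, repo):
    # Build the API URL used in the curl example above.
    return f"{BASE}/{ecosystem}/{owner}/{repo}"

def quality_request(ecosystem, owner, repo, api_key=None):
    # Without a key the limit is 100 requests/day; a free key raises it to 1,000.
    req = urllib.request.Request(quality_url(ecosystem, owner, repo))
    if api_key:
        # Assumed header name; check the API docs for the real one.
        req.add_header("X-API-Key", api_key)
    return req

req = quality_request("transformers", "fluxions-ai", "vui")
print(req.full_url)
# To actually fetch: json.load(urllib.request.urlopen(req))
```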
Related models
edwko/OuteTTS
Interface for OuteTTS models.
OpenVoiceOS/ovos-audio-transformer-plugin-ggwave
Data-over-sound plugin
mbzuai-oryx/LLMVoX
LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM
inboxpraveen/LLM-Minutes-of-Meeting
🎤📄 An innovative tool that transforms audio or video files into text transcripts and generates...
Aratako/T5Gemma-TTS
Multilingual TTS model with voice cloning and duration control, based on T5Gemma encoder-decoder LLM