Things AI Won't Tell You About Building a Voice App
You'll ask for a TTS library and get one. You won't be told you also need STT, intent handling, orchestration, and evaluation. Here's the full stack, with opinionated recommendations at every layer.
You ask your AI coding assistant "I need to add voice to my app." It recommends a TTS library. You integrate it. It works. Then you realise you also need speech-to-text. You ask again, integrate again. Then you discover your voice agent can't understand intent. Then you learn about latency budgets. Then you find out you have no way to measure whether any of it is actually working.
Each time, the AI cheerfully helps you solve the piece you asked about. It never volunteers that you're working at the wrong level of abstraction — that "adding voice to your app" is actually a five-layer architecture problem, and choosing a TTS library is just one layer.
This guide is the thing the AI should have told you up front. Here's the full stack, what's dominant at each layer, where the real decisions are, and where there's nothing good yet.
Layer 1: Speech-to-text — just use Whisper
This is a solved problem. Whisper won. Don't spend time evaluating alternatives unless you have a very specific constraint that Whisper can't meet.
The only real decision is which Whisper variant fits your deployment:
| Project | Score | Stars | Use when |
|---|---|---|---|
| whisperX | 90/100 | 20,758 | Python app, need word timestamps + speaker diarisation |
| faster-whisper | 65/100 | 21,444 | Python app, need speed (4x faster than OpenAI Whisper) |
| whisper.cpp | 72/100 | 47,665 | C/C++ deployment, edge devices, minimal dependencies |
| sherpa-onnx | 91/100 | 10,885 | Mobile, embedded, cross-platform, runs on anything |
If you're building a Python backend, WhisperX (90/100) is the default choice. If you need it to run on a phone or a Raspberry Pi, use sherpa-onnx (91/100, 138 commits last month). That's it. Move on to the hard decisions.
The one edge case: if you need a lightweight Python library that wraps multiple backends (Google, Sphinx, Whisper) behind a single API, speech_recognition (90/100, 8,959 stars) is the established option.
Layer 2: Text-to-speech — this is where you actually have a decision
Unlike STT, there is no single dominant TTS solution. The right choice depends on what you're optimising for, and the trade-offs are real.
If you need the best quality and don't mind paying
ElevenLabs (92/100). Industry benchmark for voice quality. Well-maintained SDK. The trade-off is cost and vendor dependency. If voice quality is what sells your product, this is the safe choice.
If you need free and good-enough
edge-tts (76/100, 10,304 stars). Wraps Microsoft Edge's TTS API — high quality, zero cost, dozens of voices and languages. An entire ecosystem of wrappers exists around it: edge-tts-universal for cross-platform use, GUI wrappers, video translation tools. The risk: it's an unofficial API. Microsoft tolerates it but doesn't guarantee it.
If you need to run locally
mlx-audio (93/100) for Apple Silicon. sherpa-onnx for everything else. kokoro-onnx (70/100) is a newer option specifically for TTS with ONNX. Local TTS eliminates network latency and per-request cost, but voice quality still trails ElevenLabs.
If you just need something simple in Python
pyttsx3 (75/100) for offline, uses system voices. gTTS (78/100) for Google Translate TTS. Neither sounds amazing. Both work in 3 lines of code.
If you need real-time streaming TTS
RealtimeTTS (84/100, 3,800 stars). Streams audio as it's generated, supports multiple TTS backends. This matters for voice agents where the user is waiting for a response — 200ms to first audio is the difference between feeling natural and feeling broken.
Layer 3: Intent and understanding — the layer AI never mentions
You now have speech-to-text (layer 1) and text-to-speech (layer 2). The AI coding assistant considers the problem solved. It isn't.
Between "the user said something" and "the app responds," you need to understand what the user meant. In a simple case, this might be keyword matching. In a real application, it's semantic understanding: intent classification, entity extraction, context management.
This is where the voice stack intersects with the embeddings and NLP ecosystems:
- For keyword/command recognition: speech_recognition can detect hotwords. sherpa-onnx supports keyword spotting on-device.
- For semantic intent: You need embeddings. Embed user utterances, compare against known intents via cosine similarity. Typesense (74/100) handles this if you're already using it for search.
- For open-ended conversation: Route through an LLM. The user's transcribed speech becomes a prompt. The LLM's response becomes TTS input. This is conceptually simple but the latency budget is brutal: STT (200-500ms) + LLM (500-2000ms) + TTS (200-500ms) = 1-3 seconds of silence before the user hears anything.
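The embedding-based intent approach above can be sketched in a few lines. This toy version uses bag-of-words vectors so it runs with no dependencies; in a real app you'd swap `embed` for a sentence-embedding model (e.g. sentence-transformers) and compare dense vectors the same way. The intent names and example phrases are illustrative:

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding". Replace with a real sentence
    # embedding model in production; the cosine logic stays the same.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# One example phrase per known intent (hypothetical intents).
INTENTS = {
    "set_timer": "set a timer for some minutes",
    "play_music": "play some music or a song",
    "weather": "what is the weather forecast today",
}

def classify(utterance: str) -> str:
    # Pick the intent whose example phrase is most similar to the utterance.
    scores = {name: cosine(embed(utterance), embed(example))
              for name, example in INTENTS.items()}
    return max(scores, key=scores.get)

print(classify("set a timer for five minutes"))  # set_timer
```

In practice you'd also want a similarity threshold below which the utterance falls through to the LLM path instead of being forced into a known intent.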
The latency budget is the thing nobody warns you about. Each layer adds delay, and voice interactions have much lower latency tolerance than text. A chatbot can take 3 seconds to respond. A voice assistant that takes 3 seconds sounds broken. This is why streaming TTS (RealtimeTTS) and local inference (sherpa-onnx, mlx-audio) matter — they compress the latency budget at the edges so the LLM in the middle has more time.
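The budget arithmetic is worth writing down explicitly. The stage names and millisecond ranges below are the illustrative figures from the text, not measurements:

```python
# Hypothetical per-stage latency ranges in milliseconds, taken from the
# ranges quoted above. Your real numbers will vary by model and network.
PIPELINE_MS = {
    "stt": (200, 500),
    "llm": (500, 2000),
    "tts_first_audio": (200, 500),
}

def silence_budget(stages: dict) -> tuple:
    """Best- and worst-case silence before the user hears anything."""
    best = sum(lo for lo, _ in stages.values())
    worst = sum(hi for _, hi in stages.values())
    return best, worst

best, worst = silence_budget(PIPELINE_MS)
print(f"{best}-{worst} ms of silence")  # 900-3000 ms
```

Streaming TTS attacks the last term by replacing "time to full audio" with "time to first chunk", which is why the same pipeline can feel responsive or broken depending on how the edges are built.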
Layer 4: Orchestration — if you're building a voice agent
If you're building a simple TTS feature (read this text aloud), you don't need this layer. If you're building a voice agent — something that listens, understands, acts, and responds — you do.
The honest answer: the tooling here is immature. Most voice agent repos are demos — a script that wires STT → LLM → TTS and calls it a day. The projects worth watching are the ones building infrastructure:
| Project | Score | Stars | What it does |
|---|---|---|---|
| voice-ai | 69/100 | 686 | Rapida, an open-source, end-to-end voice AI orchestration platform |
| voice-devtools | 30/100 | 50 | Developer tools to debug and build realtime voice agents. Supports multiple models. |
| CosyVoice | 64/100 | 19,991 | Multi-lingual large voice generation model with inference and training tooling |
rapida voice-ai (69/100) is building a Go-based voice agent framework. voice-devtools from Outspeed tackles the debugging problem — how do you inspect what's happening in a real-time voice pipeline? CosyVoice (19,991 stars) is a comprehensive voice generation system from the FunAudioLLM team.
If you're building a voice agent today, you're mostly assembling the stack yourself: pick an STT, pick a TTS, wire them through an LLM, handle interruption, manage conversation state, optimise for latency. There's no "Rails for voice agents" yet. That's both the challenge and the opportunity.
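The shape of that DIY assembly is simple enough to sketch. This is not a framework, just the loop every voice agent repo reimplements: the three callables are placeholders for whatever you picked at each layer (faster-whisper, an LLM client, RealtimeTTS), and the names here are illustrative. Interruption handling and audio capture are deliberately omitted:

```python
from typing import Callable

class VoiceAgent:
    """Minimal sketch of the DIY voice-agent loop: STT -> LLM -> TTS."""

    def __init__(self,
                 stt: Callable,   # audio bytes -> transcript
                 llm: Callable,   # message history -> reply text
                 tts: Callable):  # reply text -> audio bytes
        self.stt, self.llm, self.tts = stt, llm, tts
        self.history = []  # conversation state lives here

    def handle_turn(self, audio_in: bytes) -> bytes:
        user_text = self.stt(audio_in)
        self.history.append({"role": "user", "content": user_text})
        reply = self.llm(self.history)
        self.history.append({"role": "assistant", "content": reply})
        return self.tts(reply)

# Dummy components, just to show the shape of the wiring:
agent = VoiceAgent(
    stt=lambda audio: "hello",
    llm=lambda history: f"you said: {history[-1]['content']}",
    tts=lambda text: text.encode(),
)
print(agent.handle_turn(b"...audio..."))  # b'you said: hello'
```

Everything hard about voice agents lives in what this sketch leaves out: streaming each stage instead of running them sequentially, cancelling in-flight TTS when the user barges in, and keeping `history` from growing past the LLM's context window.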
Layer 5: Evaluation — nobody thinks about this until production
You've built the thing. It works on your laptop. How do you know it works for real users? How do you measure whether the STT is accurate enough, the TTS sounds natural enough, the latency is fast enough?
This layer barely exists in open source:
| Project | Score | Stars | What it measures |
|---|---|---|---|
| autovoiceevals | 47/100 | 83 | End-to-end voice AI evaluation |
| werpy | 63/100 | 23 | Word error rate calculation |
| meeteval | 63/100 | 149 | Meeting transcription evaluation (WER, cpWER, ORC-WER) |
Most evaluation tools focus on ASR accuracy (word error rate). Almost nothing exists for evaluating TTS quality, voice agent conversation quality, or end-to-end latency in production. autovoiceevals is one of the few projects attempting end-to-end voice AI evaluation.
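Word error rate itself is small enough to write from scratch, which is why so many teams do. A sketch of the standard definition (word-level edit distance divided by reference length); werpy and similar libraries provide tested, vectorised implementations you should prefer in production:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance, over words not characters.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(ref)

print(wer("turn on the lights", "turn off the light"))  # 0.5
```

Note what WER doesn't capture: a 0.5 on "turn on" vs "turn off" is catastrophic for a voice agent even though half the words are right. That gap between the metric and the user experience is exactly why this layer needs more than ASR accuracy.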
In practice, teams building voice apps are writing custom evaluation scripts. If you're heading toward production, budget time for this — it's the layer that determines whether your voice app feels polished or feels like a demo.
The full picture
Here's what "add voice to my app" actually means:
- STT: Solved. Pick a Whisper variant for your deployment target. 30 seconds of decision-making.
- TTS: Real decision. Free vs paid vs local vs streaming. Spend your evaluation time here.
- Intent/understanding: The layer you don't know you need until you need it. Embeddings for semantic matching, LLM for open conversation, latency budget for both.
- Orchestration: If building a voice agent, you're assembling this yourself. No dominant framework yet.
- Evaluation: Almost nothing exists. Plan to build custom metrics.
An AI coding assistant would have walked you through layer 1, then layer 2 if you asked, and never mentioned layers 3-5 until you hit a wall. Now you know the full landscape before you start.
Go deeper
Every project mentioned here has a quality-scored page in our directory, updated daily:
- Voice AI categories — 166 categories covering TTS, ASR, voice agents, and more
- Trending voice AI projects — what's moving this week
- Embeddings categories — for the intent and understanding layer
- Voice AI landscape deep dive — the full quality assessment of 9,679 voice AI repos
Related analysis
Choosing a Voice AI Library in 2026: What's Actually Worth Building On
TTS, speech recognition, and voice agents, scored on quality daily.
You're Shipping AI You Can't Measure
1,159 repos are building LLM evaluation infrastructure. Most teams are still eyeballing outputs.
Agent Memory in 2026: What Actually Works for Persistent AI
977 repos, 5 domains, 10+ names for the same concept.
AI Agents for Obsidian and Personal Knowledge: What Actually Works in 2026
From established plugins to MCP bridges to experimental agent tools, scored on quality daily.