Things AI Won't Tell You About Building a Voice App

You'll ask for a TTS library and get one. You won't be told you also need STT, intent handling, orchestration, and evaluation. Here's the full stack, with opinionated recommendations at every layer.

Graham Rowe · April 01, 2026 · Updated daily with live data
voice-ai agents embeddings nlp

You ask your AI coding assistant "I need to add voice to my app." It recommends a TTS library. You integrate it. It works. Then you realise you also need speech-to-text. You ask again, integrate again. Then you discover your voice agent can't understand intent. Then you learn about latency budgets. Then you find out you have no way to measure whether any of it is actually working.

Each time, the AI cheerfully helps you solve the piece you asked about. It never volunteers that you're working at the wrong level of abstraction — that "adding voice to your app" is actually a five-layer architecture problem, and choosing a TTS library is just one layer.

This guide is the thing the AI should have told you up front. Here's the full stack, what's dominant at each layer, where the real decisions are, and where there's nothing good yet.

Layer 1: Speech-to-text — just use Whisper

This is a solved problem. Whisper won. Don't spend time evaluating alternatives unless you have a very specific constraint that Whisper can't meet.

The only real decision is which Whisper variant fits your deployment:

| Project | Score | Stars | Use when |
| --- | --- | --- | --- |
| whisperX | 90/100 | 20,758 | Python app, need word timestamps + speaker diarisation |
| faster-whisper | 65/100 | 21,444 | Python app, need speed (4x faster than OpenAI Whisper) |
| whisper.cpp | 72/100 | 47,665 | C/C++ deployment, edge devices, minimal dependencies |
| sherpa-onnx | 91/100 | 10,885 | Mobile, embedded, cross-platform, runs on anything |

If you're building a Python backend, WhisperX (90/100) is the default choice. If you need it to run on a phone or a Raspberry Pi, use sherpa-onnx (91/100, 138 commits last month). That's it. Move on to the hard decisions.
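As a sketch of the Python path, transcription with faster-whisper is a few lines. The `transcribe` helper, the model size, and the file path here are illustrative choices, not part of any project's canon:

```python
# Sketch: transcription with faster-whisper (pip install faster-whisper).
# Model size and audio path are placeholders -- tune for your deployment.

def transcribe(audio_path: str, model_size: str = "small") -> str:
    """Return the full transcript of an audio file as one string."""
    # Imported lazily so this sketch loads even without the package installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, compute_type="int8")  # int8 keeps CPU memory low
    segments, _info = model.transcribe(audio_path)
    return " ".join(segment.text.strip() for segment in segments)

# transcript = transcribe("meeting.wav")
```

Swap in WhisperX instead if you need word-level timestamps or diarisation; the shape of the call is similar.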

The one edge case: if you need a lightweight Python library that wraps multiple backends (Google, Sphinx, Whisper) behind a single API, speech_recognition (90/100, 8,959 stars) is the established option.

Layer 2: Text-to-speech — this is where you actually have a decision

Unlike STT, there is no single dominant TTS solution. The right choice depends on what you're optimising for, and the trade-offs are real.

If you need the best quality and don't mind paying

ElevenLabs (92/100). Industry benchmark for voice quality. Well-maintained SDK. The trade-off is cost and vendor dependency. If voice quality is what sells your product, this is the safe choice.

If you need free and good-enough

edge-tts (76/100, 10,304 stars). Wraps Microsoft Edge's TTS API — high quality, zero cost, dozens of voices and languages. An entire ecosystem of wrappers exists around it: edge-tts-universal for cross-platform use, GUI wrappers, video translation tools. The risk: it's an unofficial API. Microsoft tolerates it but doesn't guarantee it.

If you need to run locally

mlx-audio (93/100) for Apple Silicon. sherpa-onnx for everything else. kokoro-onnx (70/100) is a newer option specifically for TTS with ONNX. Local TTS eliminates network latency and per-request cost, but voice quality trails ElevenLabs.

If you just need something simple in Python

pyttsx3 (75/100) for offline, uses system voices. gTTS (78/100) for Google Translate TTS. Neither sounds amazing. Both work in 3 lines of code.
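Those few lines, roughly, for the pyttsx3 route. Wrapped in a helper (the `speak` name is ours) so it drops into an app; the body is the standard pyttsx3 pattern:

```python
def speak(text: str) -> None:
    """Read text aloud with the system's default voice (offline)."""
    # pip install pyttsx3; imported lazily so the sketch loads without it.
    import pyttsx3

    engine = pyttsx3.init()   # picks the platform driver: SAPI5, NSSpeechSynthesizer, eSpeak
    engine.say(text)
    engine.runAndWait()       # blocks until playback finishes

# speak("Your order has shipped.")
```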

If you need real-time streaming TTS

RealtimeTTS (84/100, 3,800 stars). Streams audio as it's generated, supports multiple TTS backends. This matters for voice agents where the user is waiting for a response — 200ms to first audio is the difference between feeling natural and feeling broken.

Layer 3: Intent and understanding — the layer AI never mentions

You now have speech-to-text (layer 1) and text-to-speech (layer 2). The AI coding assistant considers the problem solved. It isn't.

Between "the user said something" and "the app responds," you need to understand what the user meant. In a simple case, this might be keyword matching. In a real application, it's semantic understanding: intent classification, entity extraction, context management.

This is where the voice stack intersects with the embeddings and NLP ecosystems:

  • For keyword/command recognition: speech_recognition can detect hotwords. sherpa-onnx supports keyword spotting on-device.
  • For semantic intent: You need embeddings. Embed user utterances, compare against known intents via cosine similarity. Typesense (74/100) handles this if you're already using it for search.
  • For open-ended conversation: Route through an LLM. The user's transcribed speech becomes a prompt. The LLM's response becomes TTS input. This is conceptually simple but the latency budget is brutal: STT (200-500ms) + LLM (500-2000ms) + TTS (200-500ms) = 1-3 seconds of silence before the user hears anything.
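The semantic-intent route can be sketched in a few lines. This is a toy: `embed()` here is a bag-of-words counter standing in for a real embedding model (sentence-transformers, a hosted embeddings API, etc.), purely so the cosine-similarity routing is runnable; `classify`, the intent names, and the descriptions are all made up for illustration:

```python
import math

# Known intents, each described in natural language. In production these
# descriptions would be pre-embedded once with a real model.
INTENTS = {
    "set_timer": "set a timer for some minutes",
    "play_music": "play a song or some music",
    "weather": "what is the weather forecast today",
}

def embed(text: str) -> dict[str, float]:
    """Toy embedding: bag-of-words counts. Replace with a real model."""
    vec: dict[str, float] = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0.0) + 1.0
    return vec

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(a[k] * b.get(k, 0.0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(utterance: str) -> str:
    """Return the intent whose description is most similar to the utterance."""
    scores = {name: cosine(embed(utterance), embed(desc))
              for name, desc in INTENTS.items()}
    return max(scores, key=scores.get)
```

With a real embedding model the structure is identical; only `embed()` changes.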

The latency budget is the thing nobody warns you about. Each layer adds delay, and voice interactions have much lower latency tolerance than text. A chatbot can take 3 seconds to respond. A voice assistant that takes 3 seconds sounds broken. This is why streaming TTS (RealtimeTTS) and local inference (sherpa-onnx, mlx-audio) matter — they compress the latency budget at the edges so the LLM in the middle has more time.
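The arithmetic behind that claim, with illustrative numbers (not measurements): streaming doesn't shrink any stage's total work, it moves time-to-first-audio from the sum of complete stages to the sum of each stage's first useful output.

```python
# Back-of-envelope latency budget for one voice-agent turn.
# All numbers are illustrative, drawn from the ranges above.

def time_to_first_audio(stt_ms: int, llm_ms: int, tts_ms: int) -> int:
    """Milliseconds of silence before the user hears anything."""
    return stt_ms + llm_ms + tts_ms

# Batch: wait for the full transcript, the full LLM reply, the full waveform.
batch = time_to_first_audio(500, 2000, 500)      # 3000 ms of silence

# Streaming: STT finalises quickly, the LLM streams its first sentence into a
# streaming TTS (e.g. RealtimeTTS) that emits its first audio chunk fast.
streaming = time_to_first_audio(300, 400, 150)   # 850 ms to first audio
```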

Layer 4: Orchestration — if you're building a voice agent

If you're building a simple TTS feature (read this text aloud), you don't need this layer. If you're building a voice agent — something that listens, understands, acts, and responds — you do.

The honest answer: the tooling here is immature. Most voice agent repos are demos — a script that wires STT → LLM → TTS and calls it a day. The projects worth watching are the ones building infrastructure:

| Project | Score | Stars | What it does |
| --- | --- | --- | --- |
| voice-ai | 69/100 | 686 | Rapida is an open-source, end-to-end voice AI orchestration platform for... |
| voice-devtools | 30/100 | 50 | Developer tools to debug and build realtime voice agents. Supports multiple models. |
| CosyVoice | 64/100 | 19,991 | Multi-lingual large voice generation model, providing inference, training... |

rapida voice-ai (69/100) is building a Go-based voice agent framework. voice-devtools from Outspeed tackles the debugging problem — how do you inspect what's happening in a real-time voice pipeline? CosyVoice (19,991 stars) is a comprehensive voice generation system from the FunAudioLLM team.

If you're building a voice agent today, you're mostly assembling the stack yourself: pick an STT, pick a TTS, wire them through an LLM, handle interruption, manage conversation state, optimise for latency. There's no "Rails for voice agents" yet. That's both the challenge and the opportunity.
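The assembly has a recognisable shape regardless of which components you pick. In this sketch the three stage functions are stubs where Whisper, your LLM call, and your TTS engine would go; `Conversation` and `handle_turn` are names we've invented, and interruption handling and streaming are omitted:

```python
from dataclasses import dataclass, field

@dataclass
class Conversation:
    """Conversation state carried between turns."""
    history: list[tuple[str, str]] = field(default_factory=list)  # (role, text)

def stt(audio: bytes) -> str:
    return audio.decode()           # stub: a Whisper variant goes here

def llm(prompt: str, history: list[tuple[str, str]]) -> str:
    return f"You said: {prompt}"    # stub: your LLM call, with history as context

def tts(text: str) -> bytes:
    return text.encode()            # stub: ElevenLabs / edge-tts / local model

def handle_turn(audio: bytes, convo: Conversation) -> bytes:
    """One full turn: listen, understand, respond."""
    user_text = stt(audio)
    convo.history.append(("user", user_text))
    reply = llm(user_text, convo.history)
    convo.history.append(("assistant", reply))
    return tts(reply)

convo = Conversation()
out = handle_turn(b"turn on the lights", convo)
```

Everything hard about voice agents (barge-in, latency, streaming) lives in the gaps between these five functions, which is exactly what the orchestration projects above are trying to own.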

Layer 5: Evaluation — nobody thinks about this until production

You've built the thing. It works on your laptop. How do you know it works for real users? How do you measure whether the STT is accurate enough, the TTS sounds natural enough, the latency is fast enough?

This layer barely exists in open source:

| Project | Score | Stars | What it measures |
| --- | --- | --- | --- |
| autovoiceevals | 47/100 | 83 | End-to-end voice AI evaluation |
| werpy | 63/100 | 23 | Word error rate calculation |
| meeteval | 63/100 | 149 | Meeting transcription evaluation (WER, cpWER, ORC-WER) |

Most evaluation tools focus on ASR accuracy (word error rate). Almost nothing exists for evaluating TTS quality, voice agent conversation quality, or end-to-end latency in production. autovoiceevals is one of the few projects attempting end-to-end voice AI evaluation.
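Word error rate itself is simple enough to sketch from scratch: Levenshtein distance over words, divided by the reference length. A minimal version (libraries like werpy add normalisation and batching on top of this core):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # d[i][j] = edits to turn the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("turn on the lights", "turn on the light"))  # 0.25
```

That covers STT accuracy. TTS naturalness and conversation quality have no equivalent one-liner, which is why those remain the open gaps.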

In practice, teams building voice apps are writing custom evaluation scripts. If you're heading toward production, budget time for this — it's the layer that determines whether your voice app feels polished or feels like a demo.

The full picture

Here's what "add voice to my app" actually means:

  1. STT: Solved. Pick a Whisper variant for your deployment target. 30 seconds of decision-making.
  2. TTS: Real decision. Free vs paid vs local vs streaming. Spend your evaluation time here.
  3. Intent/understanding: The layer you don't know you need until you need it. Embeddings for semantic matching, LLM for open conversation, latency budget for both.
  4. Orchestration: If building a voice agent, you're assembling this yourself. No dominant framework yet.
  5. Evaluation: Almost nothing exists. Plan to build custom metrics.

An AI coding assistant would have walked you through layer 1, then layer 2 if you asked, and never mentioned layers 3-5 until you hit a wall. Now you know the full landscape before you start.

Go deeper

Every project mentioned here has a quality-scored page in our directory, updated daily.
