Building voice agents that don't feel like they're fighting you comes down to one critical piece: the speech-to-text layer. Get the transcription wrong, and your LLM responds to garbage input. Get the timing wrong, and users feel like they're talking to a broken robot.
I've been building voice agents for two years now, and the difference between a demo that impresses your team and something users actually want to use isn't the LLM choice or the voice synthesis. It's whether your STT can handle real conversation patterns without cutting people off mid-sentence or waiting three seconds after they finish talking.
What Is Speech to Text for Voice Agents?
Speech to text for voice agents is streaming transcription optimized for real-time conversation flow. Unlike batch transcription that processes complete audio files, voice agent STT needs to deliver partial results instantly, detect when speakers finish their turns, and maintain accuracy on the structured data (phone numbers, emails, account IDs) that voice interactions often involve.
Why Traditional STT Falls Short for Voice Agents
Most speech-to-text services were built for transcription workflows, not conversation orchestration. The problems show up in three places:
Turn detection timing. Voice Activity Detection (VAD) systems wait for silence, then guess when someone's done talking. This creates two failure modes: false triggers on natural mid-sentence pauses, and awkward delays when someone trails off instead of stopping cleanly. Neither feels natural.
Accuracy on structured data. Voice agents handle account lookups, appointment scheduling, address collection. Traditional STT models routinely mangle phone numbers, email addresses, and product names. When someone says "The order number is A-B-1-2-3-4," you need "AB1234," not "hey, be one, two, three, four."
Latency consistency. Batch-optimized models can deliver transcripts anywhere from 200ms to 2+ seconds after speech ends. Voice agents need predictable, sub-500ms response times or the conversation feels broken.
These aren't minor UX issues. They're the difference between a voice agent people use and one they hang up on.
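To make the structured-data problem concrete, here is a minimal sketch of a post-processing check that collapses a spelled-out identifier into a lookup key. The function name `normalize_spelled_id` and the `DIGIT_WORDS` table are hypothetical helpers written for this example, not part of any STT provider's API, and a step like this only helps when the transcript already contains the right letters and digits.

```python
import re

# Hypothetical lookup table mapping spoken digit words to characters.
DIGIT_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7",
    "eight": "8", "nine": "9",
}


def normalize_spelled_id(utterance: str) -> str:
    """Collapse a spelled-out identifier such as 'A-B-1-2-3-4' into 'AB1234'.

    Raises ValueError when a token is not a single character or a digit word,
    which is a useful signal that the transcript is not a clean identifier.
    """
    tokens = re.split(r"[\s,\-]+", utterance.strip().lower())
    chars = []
    for token in tokens:
        if not token:
            continue  # skip empty pieces left by leading or trailing separators
        if token in DIGIT_WORDS:
            chars.append(DIGIT_WORDS[token])
        elif len(token) == 1 and token.isalnum():
            chars.append(token.upper())
        else:
            raise ValueError(f"not a spelled character: {token!r}")
    return "".join(chars)
```

A transcript like "hey, be one, two, three, four" raises `ValueError` on the first token, so the agent can ask the caller to repeat the number instead of running a lookup that is bound to fail.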
Real-Time STT Options: What Actually Works
The streaming STT market breaks into three tiers: general-purpose models, conversation-optimized models, and voice-agent-specific offerings.
General-purpose streaming models like Google Speech-to-Text streaming or AWS Transcribe streaming work for basic use cases. They're fast, cheap, and adequate for English-only applications where accuracy isn't critical. But they use silence-based turn detection and struggle with the structured entities that voice agents handle daily.
Conversation-optimized models like Deepgram Nova-2 or AssemblyAI Universal-1 add better noise handling and some domain-specific training. Nova-2 delivers solid performance at competitive pricing but still relies on VAD for turn detection. Universal-1 offers good accuracy but lacks the neural turn detection that makes conversations feel natural.
Voice-agent-specific models like AssemblyAI's Universal-3 Pro use neural networks trained specifically on conversation patterns. Instead of waiting for silence, they analyze linguistic and acoustic cues to understand when someone has actually finished their thought. This eliminates most false triggers and reduces response latency.
In practice, Universal-3 Pro consistently delivers sub-400ms P50 latency with 21% fewer errors on alphanumeric data compared to general-purpose alternatives. At $0.45/hour, it's positioned between budget options (around $0.15/hour) and premium services that can run $1+/hour.
Building Voice Agents with Streaming Transcription
Here's how I set up voice agents using Scriptivox for the transcription pipeline and real-time processing:
Step 1: Audio capture and streaming. Your voice agent needs continuous audio input, either from WebRTC for web-based agents or SIP trunking for phone-based systems. The key is maintaining consistent 16kHz or 24kHz sampling with minimal buffering.
Step 2: STT configuration. Configure your streaming STT with turn detection parameters tuned for conversation. For Universal-3 Pro, I typically set minimum turn silence to 200-300ms for responsive back-and-forth, and maximum turn silence to 1200ms to avoid cutting off deliberate speakers.
Step 3: Transcript processing. As partial transcripts stream in, your system needs to decide when to send text to the LLM. With neural turn detection, you get explicit end-of-turn signals. With VAD-based systems, you're estimating based on silence duration and transcript stability.
Step 4: Response generation. Once you have a complete user turn, the LLM generates a response. This is where accuracy matters most: if the user said "order number AB1234" but your STT delivered "order number hey, be twelve thirty-four," your agent will fail the lookup.
Step 5: Text to speech and playback. Convert the LLM response to audio and stream it back. Most voice agent frameworks handle the audio routing automatically once you provide the generated text.
Step 6: Interruption handling. Voice agents need to handle interruptions gracefully. When someone starts talking while your agent is responding, you need to detect the interruption, stop TTS playback, and switch back to listening mode. This requires coordination between your STT system and audio pipeline.
For transcription accuracy testing, I often upload sample conversations to Scriptivox first to benchmark how different models handle the specific vocabulary and accent patterns my agent will encounter. The word-level timestamps help identify exactly where accuracy breaks down.
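The turn handling in steps 3 through 6 can be sketched as a small state machine. This is an illustration under assumptions, not a framework API: `TurnManager`, its callback names, and the action strings are all hypothetical, and a real agent would call its LLM and TTS clients where this sketch only records an action.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class AgentState(Enum):
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()


@dataclass
class TurnManager:
    """Tracks who holds the floor and records what the pipeline should do next."""
    state: AgentState = AgentState.LISTENING
    partial: str = ""
    actions: list[str] = field(default_factory=list)

    def on_partial(self, text: str) -> None:
        # Speech arriving while the agent is talking is treated as a barge-in.
        if self.state is AgentState.SPEAKING and text.strip():
            self.actions.append("stop_tts")
            self.state = AgentState.LISTENING
        if self.state is AgentState.LISTENING:
            self.partial = text

    def on_end_of_turn(self, text: str) -> None:
        # Only a non-empty, completed user turn is forwarded to the LLM.
        if self.state is not AgentState.LISTENING or not text.strip():
            return
        self.actions.append(f"send_to_llm:{text}")
        self.partial = ""
        self.state = AgentState.THINKING

    def on_response_ready(self) -> None:
        if self.state is AgentState.THINKING:
            self.actions.append("start_tts")
            self.state = AgentState.SPEAKING

    def on_playback_finished(self) -> None:
        if self.state is AgentState.SPEAKING:
            self.state = AgentState.LISTENING
```

In this sketch, partials that arrive while the agent is in `THINKING` are ignored. Whether to cancel the pending LLM call in that case is a design choice that depends on how costly a wasted response is for your agent.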
Comparing Voice Agent STT Providers
I've tested most major streaming STT providers for voice agent workflows. Here's what actually matters in production:
AssemblyAI Universal-3 Pro excels at conversation flow and structured data accuracy. The neural turn detection eliminates most false triggers, and accuracy on phone numbers and email addresses is consistently better than alternatives. Pricing at $0.45/hour is mid-market. Best for customer service, sales, and appointment scheduling where data accuracy is critical.
Deepgram Nova-2 offers solid general accuracy with very competitive pricing around $0.12/hour. The API is well-designed and latency is good, but turn detection relies on VAD which creates more false triggers. Good choice for content-focused voice agents where perfect turn timing isn't essential.
OpenAI Whisper (via their API) provides excellent transcription quality but isn't optimized for streaming. Latency can be inconsistent, and there's no built-in turn detection. Better suited for voice agents that can afford to buffer longer segments.
Google Speech-to-Text streaming works well for basic English-only use cases with predictable vocabulary. Pricing is competitive, but accuracy drops significantly on proper nouns, technical terms, and structured data. Limited language support compared to newer models.
Rev.ai and Trint focus primarily on batch transcription rather than real-time streaming, making them less suitable for voice agent applications.
For most production voice agents, the choice comes down to Universal-3 Pro if you need maximum accuracy and natural conversation flow, or Nova-2 if you're optimizing for cost and can handle some turn detection quirks.
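To put the pricing gap in perspective, a rough estimate helps. The sketch below uses the per-hour figures quoted in this article ($0.45 and $0.12); the function name and the monthly volume are hypothetical, and real invoices depend on each provider's billing increments and volume discounts.

```python
def monthly_stt_cost(audio_minutes: float, price_per_hour: float) -> float:
    """Estimate monthly STT spend from total audio minutes and an hourly price."""
    if audio_minutes < 0 or price_per_hour < 0:
        raise ValueError("minutes and price must be non-negative")
    return round(audio_minutes / 60 * price_per_hour, 2)


# Hypothetical volume: 100,000 minutes of caller audio per month.
minutes = 100_000
universal_3_pro = monthly_stt_cost(minutes, 0.45)
nova_2 = monthly_stt_cost(minutes, 0.12)
print(f"Universal-3 Pro: ${universal_3_pro:,.2f}  Nova-2: ${nova_2:,.2f}")
```

At that volume the difference is roughly $750 versus $200 a month, which is worth weighing against what a failed account lookup or an abandoned call costs you.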
Optimizing STT Performance for Voice Agents

Getting consistently good results from streaming STT requires tuning several parameters that most developers skip:
Audio preprocessing. Clean audio input dramatically improves accuracy. Use noise suppression, automatic gain control, and echo cancellation before sending audio to your STT provider. Poor audio quality amplifies every weakness in the transcription model.
Vocabulary customization. Most STT providers support custom vocabularies or keyword hints. If your voice agent handles specific product names, medical terms, or technical jargon, providing a vocabulary list can reduce errors by 20-30% on domain-specific terms.
Language detection. If you're building multilingual voice agents, test whether automatic language detection works reliably with your user base, or implement explicit language selection. Misdetected language creates catastrophic accuracy failures.
Turn detection tuning. Adjust silence thresholds based on your use case. Customer service agents can use shorter silence windows (200-300ms) for snappy interactions. Healthcare or legal applications might need longer windows (500-800ms) for users who speak more deliberately.
Fallback handling. Plan for STT failures. Network issues, audio problems, or extremely noisy environments can cause transcription to fail entirely. Your agent should detect these conditions and gracefully request clarification.
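One way to keep turn detection tuning explicit is to store the silence windows as named presets rather than scattering numbers through the codebase. The sketch below is an assumption-laden illustration: `TurnSilenceConfig`, its field names, and the preset keys are hypothetical and would need to be mapped onto whatever parameters your STT provider actually exposes. The values are midpoints of the ranges mentioned above, with the 1200ms ceiling from the setup steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnSilenceConfig:
    """Hypothetical container for silence thresholds, in milliseconds."""
    min_silence_ms: int
    max_silence_ms: int = 1200

    def __post_init__(self) -> None:
        # Reject windows that could never trigger or that are inverted.
        if not 0 < self.min_silence_ms <= self.max_silence_ms:
            raise ValueError("need 0 < min_silence_ms <= max_silence_ms")


# Illustrative presets; tune against recordings of your own users.
PRESETS = {
    "customer_service": TurnSilenceConfig(min_silence_ms=250),
    "healthcare": TurnSilenceConfig(min_silence_ms=650),
}


def config_for(use_case: str) -> TurnSilenceConfig:
    """Return the preset for a use case, defaulting to the snappier window."""
    return PRESETS.get(use_case, PRESETS["customer_service"])
```

Validating the values at construction time catches a swapped minimum and maximum before it shows up as an agent that never stops listening.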
Integration Patterns and Architecture
Voice agents typically implement STT in one of three architectural patterns:
Direct streaming integration connects your application directly to the STT provider's WebSocket or gRPC API. This gives you maximum control over audio processing and latency, but requires handling connection management, authentication, and error recovery. Best for custom voice agent frameworks.
Framework-based integration uses platforms like LiveKit Agents, Voiceflow, or similar orchestration frameworks that handle STT integration as a plugin. You configure the STT provider and parameters, but the framework manages the audio pipeline. Faster to implement but less customizable.
API gateway patterns route audio through a custom service layer that can switch between STT providers or implement custom preprocessing. This adds architectural complexity but enables A/B testing different providers or implementing custom accuracy improvements.
For most teams, framework-based integration offers the best balance of development speed and functionality. You can always refactor to direct integration later if you need more control.
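If you do go the gateway route for A/B testing, the core of it is a stable assignment of sessions to providers. Here is a minimal sketch, assuming a simplified interface: `STTProvider`, `FakeProvider`, and `STTGateway` are hypothetical names, and a real streaming client would expose a connection and callbacks rather than a single `transcribe_chunk` call.

```python
import hashlib
from typing import Protocol


class STTProvider(Protocol):
    """Hypothetical minimal interface the gateway expects from any provider."""
    name: str

    def transcribe_chunk(self, audio: bytes) -> str: ...


class FakeProvider:
    """Stand-in for a real provider client; returns a canned transcript."""

    def __init__(self, name: str, canned: str) -> None:
        self.name = name
        self.canned = canned

    def transcribe_chunk(self, audio: bytes) -> str:
        return self.canned


class STTGateway:
    """Routes each session to one provider so an A/B split stays stable."""

    def __init__(self, control: STTProvider, candidate: STTProvider,
                 candidate_share: float) -> None:
        if not 0.0 <= candidate_share <= 1.0:
            raise ValueError("candidate_share must be between 0 and 1")
        self.control = control
        self.candidate = candidate
        self.candidate_share = candidate_share

    def provider_for(self, session_id: str) -> STTProvider:
        # Hash the session id into [0, 1) so the same caller always
        # lands on the same provider for the whole conversation.
        digest = hashlib.sha256(session_id.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") / 2**32
        return self.candidate if bucket < self.candidate_share else self.control
```

Hashing the session id instead of picking at random means a reconnect in the middle of a call does not switch providers, which keeps the comparison between the two arms clean.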
Measuring and Improving STT Accuracy

Voice agent STT accuracy isn't just about word error rates. You need to measure conversation-specific metrics:
Turn detection accuracy. Track false positives (the agent responds mid-sentence) and false negatives (long delays after the user finishes). Aim for a false positive rate under 5% in production.
Entity extraction accuracy. Separately measure accuracy on structured data like phone numbers, emails, and account numbers. These often have lower accuracy than conversational speech but cause more user frustration when wrong.
Language-specific performance. If you support multiple languages, measure accuracy separately for each language and accent combination your users actually speak.
Latency distribution. Monitor P50, P95, and P99 latency from end-of-speech to transcript delivery. Consistent sub-500ms P95 latency is the threshold where conversations feel natural.
To measure these metrics effectively, collect audio samples and ground truth transcripts from real user interactions. Many teams make the mistake of testing only with clean studio audio or artificial test cases that don't represent production conditions.
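Once you have labeled samples, the metrics above reduce to a few small functions. This is a sketch using only the Python standard library; the function names and input shapes are assumptions made for illustration, and how you label an interrupted turn or an entity match is up to your own review process.

```python
import statistics


def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """Return P50, P95, and P99 of end-of-speech to transcript latency.

    Needs at least two samples; uses the inclusive method because the
    list is treated as the full set of observed turns.
    """
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def false_positive_rate(agent_interrupted: list[bool]) -> float:
    """Share of user turns where the agent responded before the user finished."""
    if not agent_interrupted:
        return 0.0
    return sum(agent_interrupted) / len(agent_interrupted)


def entity_accuracy(pairs: list[tuple[str, str]]) -> float:
    """Exact-match accuracy over (ground_truth, transcribed) entity pairs."""
    if not pairs:
        return 0.0
    return sum(expected == got for expected, got in pairs) / len(pairs)
```

Exact match is deliberately strict for entities: a phone number with one wrong digit is as useless to a lookup as one that is entirely wrong, so partial credit would overstate how well the agent is doing.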
Voice Agent STT Provider Comparison
| Provider | Best For | Pricing | Key Strength | Limitation |
|---|---|---|---|---|
| AssemblyAI Universal-3 Pro | Customer service, sales | $0.45/hour | Neural turn detection | Mid-market pricing |
| Deepgram Nova-2 | Content-focused agents | $0.12/hour | Competitive pricing | VAD-based turn detection |
| OpenAI Whisper | Quality over speed | Variable | Excellent accuracy | Inconsistent latency |
| Google Speech-to-Text | Basic English use cases | Competitive | Simple integration | Poor on structured data |
About the author

Abhishek co-founded Scriptivox and built its early optimization and scalability layer — the part that turns a working transcription tool into one that holds up under real load. Today he leads growth and marketing at Scriptivox. He writes about transcription accuracy, multi-language coverage, and what it takes to build an AI transcription product that stays fast and reliable as it scales.



