How accurate does speech-to-text need to be for function calling?

Word accuracy isn't the right metric. You need entity accuracy for the specific data types your functions require. A service with 98% word accuracy might still mangle 50% of order IDs or phone numbers. Test with real examples of the entities you'll encounter.

Should I use streaming or batch transcription for voice agents?

Always streaming. Batch transcription adds too much latency for real-time conversation. Users expect responses within 2-3 seconds. Streaming lets you start processing as soon as you detect turn completion.

How do I handle transcription errors in function parameters?

Implement validation in your functions and provide helpful error messages. When validation fails, ask customers to spell out the problematic information letter by letter. Don't just say "invalid input."

What's the best way to test voice agent accuracy?

Create test recordings with real examples of the entities your agent needs to handle. Include order IDs, phone numbers, addresses, and product codes spoken at normal speed. Measure how often your agent extracts the correct parameters, not just transcription word accuracy.

How do I reduce latency in voice agent responses?

Optimize each component: use streaming transcription, stream LLM responses, implement async function calls, and use fast text-to-speech. Consider caching common function results. Budget 2-3 seconds total response time for complex interactions.

Voice Agent Function Calling: Real-Time Speech Guide 2026

Your voice agent just told a customer their order number is "Z3-4859" when they clearly said "B3-4859." The speech-to-text layer mangled one letter. Your function call failed. The customer is frustrated, and your AI agent looks incompetent.

This is the hidden problem with voice agents that use function calling. Everyone focuses on prompt engineering and LLM orchestration while treating speech-to-text as a commodity. But when your agent needs to capture order IDs, phone numbers, or email addresses as function parameters, transcription accuracy becomes mission-critical.

I've built dozens of voice agents over the past two years. The ones that fail in production almost always fail because of bad transcription, not bad code. Let me show you how to build voice agents that actually work.

What Is Voice Agent Function Calling?

Voice agent function calling lets AI systems execute specific actions based on spoken commands. When a customer says "Check my order AB3792," the system transcribes the speech, extracts the order ID, and calls a function like get_order_status("AB3792") to retrieve real data.

The Speech-to-Text Foundation Problem

Most developers treat transcription as solved. They grab the cheapest API, focus on the LLM layer, then wonder why their agent fails in production.

Function calling puts unique demands on speech recognition that conversational AI doesn't face. When someone says "My order number is A-B-3-7-9-2," your system needs exactly "AB3792" as a parameter. Not "a b 37 92" or "ABE 3792" or "ab three seven nine two."
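One way to get from those variants to a clean parameter is a small normalization pass before validation. The sketch below is illustrative only: `normalize_spoken_id` and `DIGIT_WORDS` are hypothetical names, not part of any transcription API, and the logic only handles separators and spelled-out digits.

```python
import re

# Spoken digit words mapped to digits (assumed English-only input)
DIGIT_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def normalize_spoken_id(text: str) -> str:
    """Collapse a spelled-out ID into one uppercase alphanumeric string."""
    # Split on whitespace and hyphens, the separators transcripts usually add
    tokens = re.split(r"[\s\-]+", text.strip().lower())
    parts = []
    for token in tokens:
        if token in DIGIT_WORDS:
            parts.append(DIGIT_WORDS[token])
        else:
            # Drop stray punctuation such as trailing commas or periods
            parts.append(re.sub(r"[^a-z0-9]", "", token))
    return "".join(parts).upper()


for heard in ["A-B-3-7-9-2", "a b 37 92", "ab three seven nine two", "ABE 3792"]:
    print(heard, "->", normalize_spoken_id(heard))
```

The first three variants collapse to "AB3792". The fourth becomes "ABE3792", which normalization alone can't repair; it has to be caught by format validation and a request to repeat the ID.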

The failure modes are subtle and expensive:

Phone numbers get transcribed with wrong digits
Email addresses lose characters or gain spurious ones
Order IDs flip letters or numbers
Product codes become unrecognizable

Each mistake triggers a failed function call. Your agent says "I can't find that order" when the order exists. The customer repeats themselves. The loop continues until they hang up.

I learned this the hard way building a support agent for an e-commerce client. We used a popular transcription service with 95% word accuracy. Sounds good, right? But 95% word accuracy doesn't guarantee entity accuracy. Our agent was getting customer names right but mangling order IDs constantly.

We switched to Scriptivox for real-time transcription specifically because it handles alphanumeric sequences better. Upload a test file with order numbers and phone numbers. Compare the output. The difference is immediately obvious.

Building Your Voice Agent Stack

Here's the architecture I use for production voice agents:

Speech Input Layer

Real-time transcription with word-level timestamps
Entity optimization for IDs, phone numbers, emails
Turn detection that doesn't cut off mid-sentence

Function Orchestration

LLM that understands tool schemas
Parameter validation before function calls
Graceful error handling for bad transcription

Voice Output

Natural text-to-speech that doesn't sound robotic
Streaming output to reduce latency
Interruption handling for better conversation flow

Let me walk through building each layer.

Real-Time Transcription Setup

The transcription layer needs three capabilities: real-time streaming, entity accuracy, and proper turn detection.

For streaming transcription, you need WebSocket connections that can handle audio chunks and return partial transcripts. Most services provide this, but the quality varies dramatically.

Scriptivox handles this through their real-time API. You stream audio chunks and get back timestamped transcripts with confidence scores. More importantly, you get word-level timestamps for every token, which helps with entity extraction.

Here's the connection pattern I use:

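import json

import websockets  # third-party package: pip install websockets

async def connect_transcription(transcription_url):
    """Async generator that yields transcript text from a streaming WebSocket."""
    # Configure for real-time streaming.
    # Field names vary by provider; check your service's documentation.
    config = {
        "sample_rate": 16000,
        "format": "pcm",
        "enable_timestamps": True,
        "language": "auto"
    }
    
    # Connect and send the session config first
    # (sending the audio chunks themselves is omitted here)
    async with websockets.connect(transcription_url) as ws:
        await ws.send(json.dumps(config))
        
        # Handle responses: yield the text of each transcript message
        async for message in ws:
            data = json.loads(message)
            if data.get("type") == "transcript":
                yield data["text"]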

The key is configuring for entity accuracy. The config above requests timestamps and language auto-detection for mixed languages. If your provider supports them, also enable formatting and request confidence scores so you can filter out low-quality transcriptions.
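Filtering on confidence could look something like the sketch below. The message shape (a `words` list with per-word `confidence` values) is an assumption for illustration, not a documented response format; adapt the field names to whatever your transcription service returns.

```python
def min_word_confidence(message: dict) -> float:
    """Return the lowest per-word confidence, or 0.0 if no words are present."""
    words = message.get("words", [])
    if not words:
        return 0.0
    return min(word.get("confidence", 0.0) for word in words)


def is_reliable(message: dict, threshold: float = 0.85) -> bool:
    """Treat a transcript as usable only if every word clears the threshold."""
    return min_word_confidence(message) >= threshold


# Hypothetical transcript message with one shaky token
sample = {
    "type": "transcript",
    "text": "my order is AB3792",
    "words": [
        {"word": "my", "confidence": 0.99},
        {"word": "order", "confidence": 0.98},
        {"word": "is", "confidence": 0.97},
        {"word": "AB3792", "confidence": 0.62},
    ],
}
print(is_reliable(sample))  # False: the order ID token is below 0.85
```

Using the minimum rather than the average is deliberate: one low-confidence token is enough to break a function parameter, even when the rest of the sentence is clean. The 0.85 threshold is a placeholder to tune against your own recordings.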

Function Definition and Parameter Validation

Once you have clean transcripts, you need functions that the LLM can call. Define them using JSON Schema so the model understands parameter types and constraints.

Here's how I structure customer support functions:

FUNCTIONS = [
    {
        "name": "check_order_status",
        "description": "Look up customer order by ID",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "pattern": "^[A-Z]{2}[0-9]{4}$",
                    "description": "Order ID in format AB1234"
                }
            },
            "required": ["order_id"]
        }
    },
    {
        "name": "schedule_callback",
        "description": "Schedule customer callback", 
        "parameters": {
            "type": "object",
            "properties": {
                "phone": {
                    "type": "string",
                    "description": "10-digit US phone number"
                },
                "name": {
                    "type": "string",
                    "description": "Customer full name"
                }
            },
            "required": ["phone", "name"]
        }
    }
]

Notice the regex pattern for order IDs. It tells the LLM what format to expect, but treat it as a hint: the model can still return arguments that don't match, so validate the value again in your own code before running the lookup.

Implement validation that catches transcription errors:

import re

def validate_order_id(order_id):
    """Return (cleaned_id, None) on success or (None, error_message) on failure."""
    # Clean common transcription artifacts
    cleaned = order_id.replace("-", "").replace(" ", "").upper()
    
    # Check format
    if not re.match(r'^[A-Z]{2}[0-9]{4}$', cleaned):
        return None, f"Invalid order ID format: {order_id}"
    
    return cleaned, None

def check_order_status(order_id):
    cleaned_id, error = validate_order_id(order_id)
    if error:
        return "I couldn't understand that order ID. Please spell it out letter by letter."
    
    # Actual lookup logic here; lookup_order is your own data-access function
    return lookup_order(cleaned_id)

This catches cases where transcription adds spaces or hyphens, normalizes case, and provides helpful error messages when validation fails.
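The same approach extends to the phone parameter of schedule_callback. A minimal sketch, assuming US numbers only; `validate_phone` is a hypothetical helper that mirrors the (value, error) return convention of validate_order_id.

```python
import re
from typing import Optional, Tuple


def validate_phone(phone: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (ten_digit_number, None) on success or (None, error) on failure."""
    # Keep digits only: drops spaces, hyphens, parentheses, and a leading "+"
    digits = re.sub(r"\D", "", phone)

    # Accept an optional US country code
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]

    if len(digits) != 10:
        return None, f"Expected 10 digits, got {len(digits)}"

    return digits, None


print(validate_phone("(555) 013-4477"))
print(validate_phone("+1 555 013 4477"))
print(validate_phone("555-0134"))
```

A dropped or duplicated digit changes the length, so the length check catches it. A substituted digit still passes, which is why reading the number back to the caller is worth the extra turn.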

LLM Integration with Error Recovery

The LLM layer orchestrates everything. It receives transcripts, decides when to call functions, and generates natural responses.

I use OpenAI's function calling API because it's reliable and handles parameter extraction well:

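import json

import openai  # legacy SDK interface (openai<1.0); 1.0+ uses a client object instead

class VoiceAgent:
    def __init__(self):
        self.conversation = [
            {
                "role": "system",
                "content": "You're a customer support agent. Keep responses brief - this is a phone call. When you can't understand an order ID or phone number, ask the customer to spell it out slowly."
            }
        ]
    
    async def process_speech(self, transcript):
        self.conversation.append({"role": "user", "content": transcript})
        
        # acreate is the async variant, so the event loop isn't blocked
        response = await openai.ChatCompletion.acreate(
            model="gpt-4",
            messages=self.conversation,
            functions=FUNCTIONS,
            function_call="auto"
        )
        
        message = response.choices[0].message
        
        if message.get("function_call"):
            return await self.handle_function_call(message)
        
        # Keep the reply in history so later turns have context
        self.conversation.append({"role": "assistant", "content": message.content})
        return message.content
    
    async def handle_function_call(self, message):
        fn_name = message.function_call.name
        fn_args = json.loads(message.function_call.arguments)
        
        # Record the assistant's function-call turn before the function result
        self.conversation.append({
            "role": "assistant",
            "content": None,
            "function_call": {
                "name": fn_name,
                "arguments": message.function_call.arguments
            }
        })
        
        # Call the actual function
        if fn_name == "check_order_status":
            result = check_order_status(fn_args["order_id"])
        elif fn_name == "schedule_callback":
            result = schedule_callback(fn_args["phone"], fn_args["name"])
        else:
            result = f"Unknown function: {fn_name}"
        
        # Generate natural response based on function result
        # (function content must be a string)
        self.conversation.append({
            "role": "function",
            "name": fn_name,
            "content": result if isinstance(result, str) else json.dumps(result)
        })
        
        final_response = await openai.ChatCompletion.acreate(
            model="gpt-4",
            messages=self.conversation
        )
        
        reply = final_response.choices[0].message.content
        self.conversation.append({"role": "assistant", "content": reply})
        return reply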

The two-phase approach is critical. First call decides whether to use a function. Second call generates the natural response. This prevents the agent from reading raw function output to customers.

Speech-to-Text Service Comparison

I've tested most major transcription services for voice agent use cases. Here's what I've learned:

Google Speech-to-Text: Fast and cheap, but struggles with spelled-out alphanumeric sequences. When customers spell "B-3-4-8-5-9," it often returns "be 34859" or similar. Good for conversational AI, problematic for function calling.

Amazon Transcribe: Better entity handling than Google, but inconsistent turn detection. Sometimes cuts off customers mid-sentence when they're reciting long order numbers.

Rev.ai: Excellent accuracy for general speech, but their real-time API has noticeable latency. Works better for post-call analysis than live agents.

Scriptivox: What I use in production now. Handles alphanumeric entities well, provides word-level timestamps, and has reliable turn detection. The API is straightforward and the pricing is transparent at $0.20 per hour of audio.

For voice agents specifically, entity accuracy matters more than conversational accuracy. A 95% word accuracy score means nothing if the service mangles every order ID.
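Entity accuracy is easy to measure once you have a set of test recordings with known answers. The sketch below assumes you already have the expected entity for each recording and whatever your pipeline extracted (or None when it extracted nothing); `entity_accuracy` and `normalize_entity` are hypothetical helper names.

```python
import re
from typing import List, Optional


def normalize_entity(value: str) -> str:
    """Strip everything except letters and digits, then uppercase."""
    return re.sub(r"[^A-Za-z0-9]", "", value).upper()


def entity_accuracy(expected: List[str], extracted: List[Optional[str]]) -> float:
    """Fraction of test cases where the extracted entity matches exactly."""
    if len(expected) != len(extracted):
        raise ValueError("expected and extracted must be the same length")
    if not expected:
        return 0.0
    hits = sum(
        1
        for want, got in zip(expected, extracted)
        if got is not None and normalize_entity(want) == normalize_entity(got)
    )
    return hits / len(expected)


# Made-up results from three test recordings
expected = ["AB3792", "B3-4859", "555-013-4477"]
extracted = ["ab 3792", "Z3-4859", None]
print(round(entity_accuracy(expected, extracted), 2))  # 0.33
```

Matching is all-or-nothing on purpose. An order ID with one wrong character is a failed lookup, so partial credit would overstate how the agent performs.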

Production Deployment Considerations

Building a demo voice agent is different from running one in production. Here are the issues you'll hit:

Latency Stacking: Every component adds delay. Transcription, LLM processing, function calls, and text-to-speech all take time. Budget 2-3 seconds total for complex interactions.
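To see where the budget goes, it helps to time each stage of a turn separately. A minimal sketch, assuming a 3-second budget; `LatencyBudget` is a hypothetical helper, and the `time.sleep` calls stand in for real transcription and LLM work.

```python
import time
from contextlib import contextmanager


class LatencyBudget:
    """Accumulates per-stage timings for one conversational turn."""

    def __init__(self, budget_seconds=3.0):
        self.budget_seconds = budget_seconds
        self.stages = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed time even if the stage raised
            elapsed = time.perf_counter() - start
            self.stages[name] = self.stages.get(name, 0.0) + elapsed

    def total(self):
        return sum(self.stages.values())

    def over_budget(self):
        return self.total() > self.budget_seconds

    def slowest(self):
        """Name of the stage that took longest, or None if nothing was timed."""
        if not self.stages:
            return None
        return max(self.stages, key=self.stages.get)


budget = LatencyBudget(budget_seconds=3.0)
with budget.stage("transcription"):
    time.sleep(0.01)  # placeholder for waiting on the final transcript
with budget.stage("llm"):
    time.sleep(0.03)  # placeholder for the LLM round trip
print(budget.slowest(), budget.over_budget())
```

Logging the slowest stage per turn shows which component to optimize first, instead of guessing from the end-to-end number.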

Audio Quality: Phone calls have different audio characteristics than clean microphone input. Test with actual phone audio, not laptop recordings.

Concurrent Users: Voice agents hold persistent connections. Plan your infrastructure accordingly. One customer call might maintain WebSocket connections to transcription, LLM streaming, and audio output simultaneously.

Graceful Degradation: When function calls fail, your agent needs fallback behaviors. Don't just say "something went wrong." Offer alternatives like "I'm having trouble with our order system. Let me transfer you to someone who can help."
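One simple way to get that fallback behavior is to wrap every backend call so an exception becomes a spoken alternative instead of a crash. This is a sketch under that assumption; `call_with_fallback` and `broken_lookup` are hypothetical names.

```python
import logging

FALLBACK_MESSAGE = (
    "I'm having trouble with our order system. "
    "Let me transfer you to someone who can help."
)


def call_with_fallback(fn, *args, fallback=FALLBACK_MESSAGE, **kwargs):
    """Run fn; on any exception, log it and return a caller-friendly message."""
    try:
        return fn(*args, **kwargs)
    except Exception:
        # Keep the full traceback in the logs, never in the spoken response
        logging.exception("Function call %s failed", getattr(fn, "__name__", fn))
        return fallback


def broken_lookup(order_id):
    # Simulates an order system outage
    raise ConnectionError("order database unreachable")


print(call_with_fallback(broken_lookup, "AB3792"))
```

The returned string goes back to the LLM as the function result, so the agent can phrase the handoff naturally rather than reading out an error.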

Conversation State: Unlike chatbots, voice conversations have natural flow and interruptions. Implement conversation memory that survives brief disconnections.

I deploy using a microservices pattern: separate services for transcription, LLM orchestration, and audio output. This lets me scale components independently and swap providers without rebuilding everything.

Advanced Features Worth Implementing

Once your basic agent works, these features make a real difference:

Interruption Handling: Let customers interrupt the agent mid-response. Requires canceling text-to-speech output and processing new speech immediately.
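With asyncio, cancelling output can be modeled as cancelling the task that plays the audio. The sketch below is a simplified illustration: `Speaker` is a hypothetical class, and `asyncio.sleep` stands in for streaming synthesized audio to the caller.

```python
import asyncio


class Speaker:
    """Plays sentences one at a time and can be cut off mid-response."""

    def __init__(self, seconds_per_sentence=0.2):
        self.seconds_per_sentence = seconds_per_sentence
        self.spoken = []
        self._task = None

    async def _play(self, sentences):
        for sentence in sentences:
            # Stand-in for streaming one sentence of synthesized audio
            await asyncio.sleep(self.seconds_per_sentence)
            self.spoken.append(sentence)

    def start(self, sentences):
        # Must be called from inside a running event loop
        self._task = asyncio.create_task(self._play(sentences))

    async def interrupt(self):
        """Cancel playback; returns True if something was actually cut off."""
        if self._task is None or self._task.done():
            return False
        self._task.cancel()
        try:
            await self._task
        except asyncio.CancelledError:
            pass
        return True


async def demo():
    speaker = Speaker()
    speaker.start(["Your order shipped.", "It arrives Friday.", "Anything else?"])
    await asyncio.sleep(0.05)  # the caller starts talking over the agent
    cut_off = await speaker.interrupt()
    return cut_off, speaker.spoken


print(asyncio.run(demo()))
```

In a real agent the trigger for `interrupt()` would be the transcription layer detecting new caller speech while playback is still running.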

Confidence Scoring: When transcription confidence is low for entities, ask for clarification proactively. "I think I heard order number B3859, is that correct?"

Entity Spell-Out: Train your agent to request spelling for critical information. "Could you spell that order number for me, letter by letter?"
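Confidence scoring and spell-out requests can share one decision function: accept, confirm, or ask the caller to spell it. The thresholds below are placeholder assumptions to tune on your own data, and `next_action` is a hypothetical helper.

```python
def next_action(entity, confidence, accept_at=0.90, confirm_at=0.60):
    """Return (action, prompt) for an extracted entity and its confidence."""
    if confidence >= accept_at:
        return "accept", None
    if confidence >= confirm_at:
        # Read the ID back character by character so mistakes are audible
        spelled = " ".join(entity)
        return "confirm", f"I think I heard order number {spelled}, is that correct?"
    return "respell", "Could you spell that order number for me, letter by letter?"


print(next_action("B3859", 0.95))
print(next_action("B3859", 0.72))
print(next_action("B3859", 0.30))
```

Spacing out the characters in the confirmation prompt also pushes the text-to-speech layer to read them individually rather than as one word.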

Context Carryover: Remember entities mentioned earlier in the conversation. If someone says their order number once, don't ask again.
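Context carryover can start as a per-call dictionary of validated entities that is checked before the agent asks for anything. A minimal sketch; `ConversationContext` is a hypothetical class and the entity names are illustrative.

```python
class ConversationContext:
    """Remembers validated entities for the duration of one call."""

    def __init__(self):
        self._entities = {}

    def remember(self, name, value):
        self._entities[name] = value

    def recall(self, name):
        return self._entities.get(name)

    def missing(self, required):
        """Return the required entity names that still need to be asked for."""
        return [name for name in required if name not in self._entities]


context = ConversationContext()
context.remember("order_id", "AB3792")

# Later in the call, before scheduling a callback about that order
print(context.missing(["order_id", "phone"]))  # ['phone']
```

Store only values that passed validation, so a misheard ID isn't silently reused for the rest of the call.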

Async Function Calls: For slow operations like database lookups, stream a "working on it" response while the function executes in the background.
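A sketch of that pattern with asyncio: start the slow call, wait briefly, and only speak a filler line if the result hasn't arrived yet. `call_with_filler` is a hypothetical helper, `say` stands in for your text-to-speech output, and the 0.7-second delay is an assumed starting point.

```python
import asyncio


async def call_with_filler(slow_call, say,
                           filler="One moment while I look that up.",
                           filler_after=0.7):
    """Await slow_call, speaking a filler line if it takes too long."""
    task = asyncio.ensure_future(slow_call)
    # asyncio.wait returns on timeout without cancelling the task
    done, _pending = await asyncio.wait({task}, timeout=filler_after)
    if not done:
        await say(filler)
    return await task


async def demo():
    spoken = []

    async def say(text):
        spoken.append(text)  # placeholder for sending text to TTS

    async def slow_lookup():
        await asyncio.sleep(0.2)  # placeholder for a slow database query
        return "shipped"

    result = await call_with_filler(slow_lookup(), say, filler_after=0.05)
    return result, spoken


print(asyncio.run(demo()))
```

Fast lookups return without any filler, so the agent only says "one moment" when there is an actual wait to cover.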

The key insight is that voice agents fail differently than text agents. Most failures aren't logic errors - they're communication breakdowns caused by bad transcription or unnatural speech output.

You can test this workflow yourself with Scriptivox. Upload a recording with order numbers and phone numbers. See how the transcription handles entities that your functions would need as parameters.

Speech-to-Text Services for Voice Agents

Service	Entity Handling	Turn Detection	Latency	Best Use
Google Speech-to-Text	Struggles with spelled alphanumeric	Good	Fast	Conversational AI
Amazon Transcribe	Better than Google	Inconsistent	Moderate	General transcription
Rev.ai	Excellent for general speech	Good	High	Post-call analysis
Scriptivox	Handles entities well	Reliable	Low	Voice agent production

Frequently Asked Questions

Abhishek Chauhan, Co-founder, Scriptivox

Abhishek co-founded Scriptivox and built its early optimization and scalability layer — the part that turns a working transcription tool into one that holds up under real load. Today he leads growth and marketing at Scriptivox. He writes about transcription accuracy, multi-language coverage, and what it takes to build an AI transcription product that stays fast and reliable as it scales.