
    Voice Agent Function Calling: Real-Time Speech to Text Guide

    Build voice agents that actually work in production. Learn why speech-to-text accuracy determines function calling success, plus real implementation strategies.

    May 22, 2026 · 7 min read

    Key Takeaways

    • Voice agents fail in production primarily due to bad transcription, not bad code or prompts.
    • Function calling requires entity accuracy for order IDs and phone numbers, not just conversational accuracy.
    • Implement parameter validation in functions and provide helpful error messages for transcription failures.
    • Test transcription services with real examples of entities your agent needs to handle.
    • Use streaming transcription and optimize each component to keep total response time under 2-3 seconds.

    Your voice agent just told a customer their order number is "Z3-4859" when they clearly said "B3-4859." The speech-to-text layer mangled one letter. Your function call failed. The customer is frustrated, and your AI agent looks incompetent.

    This is the hidden problem with voice agents that use function calling. Everyone focuses on prompt engineering and LLM orchestration while treating speech-to-text as a commodity. But when your agent needs to capture order IDs, phone numbers, or email addresses as function parameters, transcription accuracy becomes mission-critical.

    I've built dozens of voice agents over the past two years. The ones that fail in production almost always fail because of bad transcription, not bad code. Let me show you how to build voice agents that actually work.

    What Is Voice Agent Function Calling?

    Voice agent function calling lets AI systems execute specific actions based on spoken commands. When a customer says "Check my order AB3792," the system transcribes the speech, extracts the order ID, and calls a function like check_order_status("AB3792") to retrieve real data.

    The Speech-to-Text Foundation Problem

    Most developers treat transcription as solved. They grab the cheapest API, focus on the LLM layer, then wonder why their agent fails in production.

    Function calling puts unique demands on speech recognition that conversational AI doesn't face. When someone says "My order number is A-B-3-7-9-2," your system needs exactly "AB3792" as a parameter. Not "a b 37 92" or "ABE 3792" or "ab three seven nine two."
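
    Some of those variants can be repaired in code before the function call. Here is a minimal sketch, assuming the candidate ID has already been isolated from the rest of the utterance. The helper name normalize_spoken_id and the digit-word table are illustrative, not part of any transcription API.

```python
import re

# Spoken digit words mapped to digits. "oh" is left out on purpose
# because it is ambiguous with the letter O.
DIGIT_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

def normalize_spoken_id(text: str) -> str:
    """Collapse a spelled-out ID into a single uppercase token."""
    # Split on whitespace and the separators transcribers tend to insert
    tokens = re.split(r"[\s\-.,]+", text.strip().lower())
    # Replace digit words, keep everything else, then join and uppercase
    return "".join(DIGIT_WORDS.get(tok, tok) for tok in tokens).upper()
```

    This recovers "AB3792" from "A-B-3-7-9-2", "a b 37 92", and "ab three seven nine two". It cannot recover "ABE 3792", where the transcription itself added a wrong letter, which is why the quality of the speech-to-text layer still matters.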

    The failure modes are subtle and expensive:

    • Phone numbers get transcribed with wrong digits
    • Email addresses lose characters or gain spurious ones
    • Order IDs flip letters or numbers
    • Product codes become unrecognizable

    Each mistake triggers a failed function call. Your agent says "I can't find that order" when the order exists. The customer repeats themselves. The loop continues until they hang up.

    I learned this the hard way building a support agent for an e-commerce client. We used a popular transcription service with 95% word accuracy. Sounds good, right? But 95% word accuracy doesn't guarantee entity accuracy. Our agent was getting customer names right but mangling order IDs constantly.

    We switched to Scriptivox for real-time transcription specifically because it handles alphanumeric sequences better. Upload a test file with order numbers and phone numbers. Compare the output. The difference is immediately obvious.
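
    To make that comparison repeatable, you can score each service on entity accuracy instead of word accuracy. The sketch below is one rough way to do it, assuming you have pairs of (expected entity, transcript returned by the service). The function names are illustrative.

```python
import re

def normalize(text: str) -> str:
    """Drop everything except letters and digits, then uppercase."""
    return re.sub(r"[^A-Za-z0-9]", "", text).upper()

def entity_accuracy(samples) -> float:
    """Fraction of samples whose transcript contains the expected entity.

    samples: list of (expected_entity, transcript) pairs.
    Caveat: removing spaces can create false matches across word
    boundaries, so treat the score as an upper bound.
    """
    if not samples:
        return 0.0
    hits = sum(
        1 for expected, transcript in samples
        if normalize(expected) in normalize(transcript)
    )
    return hits / len(samples)
```

    Run the same recordings through each candidate service and compare the scores. A transcript like "my order is ABE 3792" gets almost every word right and still counts as a miss here, which is the behavior you want to measure.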

    Building Your Voice Agent Stack

    Here's the architecture I use for production voice agents:

    Speech Input Layer

    • Real-time transcription with word-level timestamps
    • Entity optimization for IDs, phone numbers, emails
    • Turn detection that doesn't cut off mid-sentence

    Function Orchestration

    • LLM that understands tool schemas
    • Parameter validation before function calls
    • Graceful error handling for bad transcription

    Voice Output

    • Natural text-to-speech that doesn't sound robotic
    • Streaming output to reduce latency
    • Interruption handling for better conversation flow

    Let me walk through building each layer.

    Real-Time Transcription Setup

    The transcription layer needs three capabilities: real-time streaming, entity accuracy, and proper turn detection.

    For streaming transcription, you need WebSocket connections that can handle audio chunks and return partial transcripts. Most services provide this, but the quality varies dramatically.
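
    The audio side of that stream is usually a sequence of small fixed-size frames. As a rough sketch, assuming 16 kHz, 16-bit mono PCM and 20 ms frames (the frame length and sample width are my assumptions, so check what your provider expects):

```python
from typing import Iterator

SAMPLE_RATE = 16000    # samples per second
BYTES_PER_SAMPLE = 2   # 16-bit mono PCM (assumed)
FRAME_MS = 20          # frame duration in milliseconds (assumed)

def frame_size_bytes(sample_rate: int = SAMPLE_RATE,
                     frame_ms: int = FRAME_MS,
                     bytes_per_sample: int = BYTES_PER_SAMPLE) -> int:
    """Bytes in one frame: samples per frame times bytes per sample."""
    return sample_rate * frame_ms // 1000 * bytes_per_sample

def iter_frames(pcm: bytes, frame_bytes: int) -> Iterator[bytes]:
    """Yield consecutive frames; the last one may be shorter."""
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]
```

    With these defaults a frame is 640 bytes (320 samples at 2 bytes each). Smaller frames lower latency but increase the number of messages per second on the socket.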

    Scriptivox handles this through their real-time API. You stream audio chunks and get back timestamped transcripts with confidence scores. More importantly, you get word-level timestamps for every token, which helps with entity extraction.

    Here's the connection pattern I use:

    import json
    import websockets
    
    async def connect_transcription(transcription_url):
        """Async generator that yields transcript text as it arrives."""
        # Configure for real-time streaming (field names vary by provider)
        config = {
            "sample_rate": 16000,
            "format": "pcm",
            "enable_timestamps": True,
            "language": "auto"
        }
        
        # Connect and send the configuration first
        async with websockets.connect(transcription_url) as ws:
            await ws.send(json.dumps(config))
            
            # Audio chunks would be sent on this same socket from a separate task
            
            # Handle responses
            async for message in ws:
                data = json.loads(message)
                if data.get("type") == "transcript":
                    yield data["text"]
    

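    The key is configuring for entity accuracy. Enable formatting, use auto-detection for mixed languages, and request confidence scores so you can filter out low-quality transcriptions. Option names differ between providers, so treat the config keys above as placeholders and check your provider's API reference.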
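
    Filtering on confidence could look like the sketch below. It assumes each transcript message carries a list of word objects with "text" and "confidence" fields. That response shape is hypothetical, and the 0.85 threshold is a starting point to tune, not a recommendation from any provider.

```python
def split_by_confidence(words, threshold: float = 0.85):
    """Return (full_text, uncertain_words) for one transcript message.

    words: list of dicts such as {"text": "B3-4859", "confidence": 0.62}.
    A word with no confidence value is treated as uncertain.
    """
    text = " ".join(w["text"] for w in words)
    uncertain = [
        w["text"] for w in words
        if w.get("confidence", 0.0) < threshold
    ]
    return text, uncertain
```

    If any uncertain word falls inside an order ID or phone number, that is a good signal to ask the caller to repeat it before calling a function.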

    Function Definition and Parameter Validation

    Once you have clean transcripts, you need functions that the LLM can call. Define them using JSON Schema so the model understands parameter types and constraints.

    Here's how I structure customer support functions:

    FUNCTIONS = [
        {
            "name": "check_order_status",
            "description": "Look up customer order by ID",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {
                        "type": "string",
                        "pattern": "^[A-Z]{2}[0-9]{4}$",
                        "description": "Order ID in format AB1234"
                    }
                },
                "required": ["order_id"]
            }
        },
        {
            "name": "schedule_callback",
            "description": "Schedule customer callback", 
            "parameters": {
                "type": "object",
                "properties": {
                    "phone": {
                        "type": "string",
                        "description": "10-digit US phone number"
                    },
                    "name": {
                        "type": "string",
                        "description": "Customer full name"
                    }
                },
                "required": ["phone", "name"]
            }
        }
    ]
    

    Notice the regex pattern for order IDs. It tells the LLM what format to expect, but the model is not guaranteed to honor it, so validate the arguments in your own code before running the function.

    Implement validation that catches transcription errors:

    import re
    
    ORDER_ID_RE = re.compile(r"[A-Z]{2}[0-9]{4}")
    
    def validate_order_id(order_id):
        """Return (cleaned_id, None) on success or (None, error_message) on failure."""
        # Clean common transcription artifacts
        cleaned = order_id.replace("-", "").replace(" ", "").upper()
        
        # Check format (fullmatch rejects any leading or trailing characters)
        if not ORDER_ID_RE.fullmatch(cleaned):
            return None, f"Invalid order ID format: {order_id}"
        
        return cleaned, None
    
    def check_order_status(order_id):
        cleaned_id, error = validate_order_id(order_id)
        if error:
            return "I couldn't understand that order ID. Please spell it out letter by letter."
        
        # Actual lookup logic here; lookup_order is your own data-access function
        return lookup_order(cleaned_id)
    

    This catches cases where transcription adds spaces or hyphens, normalizes case, and provides helpful error messages when validation fails.
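
    The phone parameter of schedule_callback deserves the same treatment. A minimal sketch that follows the same (value, error) convention; validate_phone is an illustrative helper, and stripping a leading US country code is my assumption about what callers will say.

```python
import re

def validate_phone(phone: str):
    """Return (ten_digit_number, None) or (None, error_message)."""
    # Keep digits only: drops spaces, dashes, dots, parentheses, and "+"
    digits = re.sub(r"\D", "", phone)

    # Accept a leading US country code ("1") and strip it
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]

    if len(digits) != 10:
        return None, f"Expected 10 digits, got {len(digits)}"

    return digits, None
```

    A digit count is a weak check, since a wrong digit still passes. For callbacks it is worth reading the number back to the customer before scheduling anything.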

    LLM Integration with Error Recovery

    The LLM layer orchestrates everything. It receives transcripts, decides when to call functions, and generates natural responses.

    I use OpenAI's function calling API because it's reliable and handles parameter extraction well:

    import json
    from openai import AsyncOpenAI  # requires openai>=1.0
    
    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
    
    # Wrap the schemas defined earlier in the "tools" format the current API expects
    TOOLS = [{"type": "function", "function": fn} for fn in FUNCTIONS]
    
    class VoiceAgent:
        def __init__(self):
            self.conversation = [
                {
                    "role": "system",
                    "content": "You're a customer support agent. Keep responses brief - this is a phone call. When you can't understand an order ID or phone number, ask the customer to spell it out slowly."
                }
            ]
        
        async def process_speech(self, transcript):
            self.conversation.append({"role": "user", "content": transcript})
            
            response = await client.chat.completions.create(
                model="gpt-4",
                messages=self.conversation,
                tools=TOOLS,
                tool_choice="auto"
            )
            
            message = response.choices[0].message
            
            if message.tool_calls:
                return await self.handle_function_call(message)
            
            self.conversation.append({"role": "assistant", "content": message.content})
            return message.content
        
        async def handle_function_call(self, message):
            # The assistant turn that requested the calls must precede their results
            self.conversation.append(message)
            
            for call in message.tool_calls:
                fn_name = call.function.name
                fn_args = json.loads(call.function.arguments)
                
                # Call the actual function
                if fn_name == "check_order_status":
                    result = check_order_status(fn_args["order_id"])
                elif fn_name == "schedule_callback":
                    result = schedule_callback(fn_args["phone"], fn_args["name"])
                else:
                    result = f"Unknown function: {fn_name}"
                
                # Every tool call needs a matching result message
                self.conversation.append({
                    "role": "tool",
                    "tool_call_id": call.id,
                    "content": str(result)
                })
            
            # Generate natural response based on function result
            final_response = await client.chat.completions.create(
                model="gpt-4",
                messages=self.conversation
            )
            
            reply = final_response.choices[0].message.content
            self.conversation.append({"role": "assistant", "content": reply})
            return reply
    

    The two-phase approach is critical. The first call decides whether to use a function. The second call turns the function result into a natural response. This prevents the agent from reading raw function output to customers.

    Speech-to-Text Service Comparison

    I've tested most major transcription services for voice agent use cases. Here's what I've learned:

    Google Speech-to-Text: Fast and cheap, but struggles with spelled-out alphanumeric sequences. When customers spell "B-3-4-8-5-9," it often returns "be 34859" or similar. Good for conversational AI, problematic for function calling.

    Amazon Transcribe: Better entity handling than Google, but inconsistent turn detection. Sometimes cuts off customers mid-sentence when they're reciting long order numbers.

    Rev.ai: Excellent accuracy for general speech, but their real-time API has noticeable latency. Works better for post-call analysis than live agents.

    Scriptivox: What I use in production now. Handles alphanumeric entities well, provides word-level timestamps, and has reliable turn detection. The API is straightforward and the pricing is transparent at $0.20 per hour of audio.

    For voice agents specifically, entity accuracy matters more than conversational accuracy. A 95% word accuracy score means nothing if the service mangles every order ID.

    Production Deployment Considerations

    Building a demo voice agent is different from running one in production. Here are the issues you'll hit:

    Latency Stacking: Every component adds delay. Transcription, LLM processing, function calls, and text-to-speech all take time. Budget 2-3 seconds total for complex interactions.
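
    You can only hold that budget if you measure each stage. Here is a small sketch of a per-turn timer. The class name LatencyBudget and the stage names are illustrative, and the 3-second default comes from the upper end of the range above.

```python
import time
from contextlib import contextmanager

class LatencyBudget:
    """Accumulates wall-clock time per pipeline stage for one turn."""

    def __init__(self, budget_s: float = 3.0):
        self.budget_s = budget_s
        self.stages = {}

    @contextmanager
    def stage(self, name: str):
        # Time the enclosed block and add it to the named stage
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.stages[name] = self.stages.get(name, 0.0) + elapsed

    @property
    def total(self) -> float:
        return sum(self.stages.values())

    def over_budget(self) -> bool:
        return self.total > self.budget_s
```

    Wrap each step in "with budget.stage('transcription'):", "with budget.stage('llm'):" and so on, then log budget.stages at the end of the turn. The per-stage breakdown shows which component to optimize first.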

    Audio Quality: Phone calls have different audio characteristics than clean microphone input. Test with actual phone audio, not laptop recordings.

    Concurrent Users: Voice agents hold persistent connections. Plan your infrastructure accordingly. One customer call might maintain WebSocket connections to transcription, LLM streaming, and audio output simultaneously.

    Graceful Degradation: When function calls fail, your agent needs fallback behaviors. Don't just say "something went wrong." Offer alternatives like "I'm having trouble with our order system. Let me transfer you to someone who can help."
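
    One way to wire that in is a thin wrapper around every backend call. This is a sketch only: call_with_fallback is a hypothetical helper, and a real agent would likely pick different fallback messages per function and trigger an actual transfer.

```python
import logging

logger = logging.getLogger("voice_agent")

FALLBACK = (
    "I'm having trouble with our order system. "
    "Let me transfer you to someone who can help."
)

def call_with_fallback(fn, *args, fallback: str = FALLBACK, **kwargs):
    """Run fn; on any exception, log it and return a speakable fallback."""
    try:
        return fn(*args, **kwargs)
    except Exception:
        # Keep the stack trace for debugging, but never read it to the caller
        logger.exception("Function %s failed", getattr(fn, "__name__", fn))
        return fallback
```

    The caller hears a useful next step, and the log keeps the details you need to fix the underlying failure.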

    Conversation State: Unlike chatbots, voice conversations have natural flow and interruptions. Implement conversation memory that survives brief disconnections.

    I deploy using a microservices pattern: separate services for transcription, LLM orchestration, and audio output. This lets me scale components independently and swap providers without rebuilding everything.

    Advanced Features Worth Implementing

    Once your basic agent works, these features make a real difference:

    Interruption Handling: Let customers interrupt the agent mid-response. Requires canceling text-to-speech output and processing new speech immediately.
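
    With asyncio, canceling output can be as simple as canceling the playback task. A minimal sketch, where the Speaker class is hypothetical and the sink list and sleep stand in for writing audio to the call:

```python
import asyncio

class Speaker:
    """Plays audio chunks in a background task that can be interrupted."""

    def __init__(self):
        self._task = None

    async def _play(self, chunks, sink):
        for chunk in chunks:
            sink.append(chunk)         # stand-in for sending audio to the caller
            await asyncio.sleep(0.05)  # stand-in for the chunk's playback time

    def speak(self, chunks, sink):
        # Starting a new response always stops the previous one
        self.interrupt()
        self._task = asyncio.create_task(self._play(chunks, sink))
        return self._task

    def interrupt(self):
        # Call this as soon as transcription reports new caller speech
        if self._task and not self._task.done():
            self._task.cancel()
```

    Cancellation takes effect at the next await, so keep chunks short. Long chunks mean the agent keeps talking over the customer for longer.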

    Confidence Scoring: When transcription confidence is low for entities, ask for clarification proactively. "I think I heard order number B3859, is that correct?"
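
    A sketch of how that read-back could be generated, assuming you already have a confidence value for the entity. The helper name and the 0.85 threshold are illustrative.

```python
def confirmation_prompt(label: str, value: str, confidence: float,
                        threshold: float = 0.85):
    """Return a read-back question for a low-confidence entity, else None."""
    if confidence >= threshold:
        return None
    # Space out the characters so text-to-speech reads them one by one
    spelled = " ".join(value)
    return f"I think I heard {label} {spelled}, is that correct?"
```

    Spacing out the characters matters on the output side too: many text-to-speech voices will otherwise try to pronounce an ID as a word or a large number.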

    Entity Spell-Out: Train your agent to request spelling for critical information. "Could you spell that order number for me, letter by letter?"

    Context Carryover: Remember entities mentioned earlier in the conversation. If someone says their order number once, don't ask again.

    Async Function Calls: For slow operations like database lookups, stream a "working on it" response while the function executes in the background.
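
    The last item can be sketched with asyncio.wait and a timeout. Here say is a hypothetical async callable that sends text to your text-to-speech layer, and the 0.7-second patience value is an assumption to tune against real calls.

```python
import asyncio

async def call_with_filler(coro, say,
                           filler: str = "One moment while I look that up.",
                           patience_s: float = 0.7):
    """Await coro; if it is still running after patience_s, speak a filler."""
    task = asyncio.create_task(coro)
    # wait() returns when the task finishes or the timeout expires.
    # Unlike wait_for(), it does not cancel the task on timeout.
    done, _pending = await asyncio.wait({task}, timeout=patience_s)
    if not done:
        await say(filler)
    return await task
```

    Fast lookups return without any filler, so the agent only says "one moment" when the customer would otherwise hear silence.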

    The key insight is that voice agents fail differently than text agents. Most failures aren't logic errors - they're communication breakdowns caused by bad transcription or unnatural speech output.

    You can test this workflow yourself with Scriptivox. Upload a recording with order numbers and phone numbers. See how the transcription handles entities that your functions would need as parameters.

    Speech-to-Text Services for Voice Agents

    Service               | Entity Handling                     | Turn Detection | Latency      | Best Use
    Google Speech-to-Text | Struggles with spelled alphanumeric | Good           | Fast         | Conversational AI
    Amazon Transcribe     | Better than Google                  | Inconsistent   | Moderate     | General transcription
    Rev.ai                | Excellent accuracy                  | Good           | High latency | Post-call analysis
    Scriptivox            | Handles entities well               | Reliable       | Low          | Voice agent production

    About the author

    Abhishek Chauhan, Co-founder, Scriptivox

    Abhishek co-founded Scriptivox and built its early optimization and scalability layer — the part that turns a working transcription tool into one that holds up under real load. Today he leads growth and marketing at Scriptivox. He writes about transcription accuracy, multi-language coverage, and what it takes to build an AI transcription product that stays fast and reliable as it scales.
