
Deploying a Production Voice Agent: Full Walkthrough

This final article brings together everything from the series: STT, full-duplex audio, VAD, turn-taking, barge-in, latency optimization, tool use, telephony, and prompt engineering. You'll build and deploy a complete, production-ready voice agent that handles real phone calls, manages conversation state robustly, and scales.

The agent answers customer service calls, checks account information, processes refunds, and escalates complex issues to human agents. It's a blueprint you can adapt to any domain.

Architecture Overview

Incoming Call (PSTN/SIP)
            │
            ▼
┌───────────────────────────────────┐
│  Telephony Gateway (Twilio/SIP)   │
│  - Call setup/teardown            │
│  - Audio codec handling           │
│  - RTP streaming                  │
└───────────┬───────────────────────┘
            │
            ▼
┌───────────────────────────────────┐
│  Voice Agent Core                 │
│  ┌─────────────────────────────┐  │
│  │ Async Event Loop            │  │
│  │ - Audio I/O (STT/TTS)       │  │
│  │ - Turn-taking FSM           │  │
│  │ - VAD & Barge-in            │  │
│  │ - LLM orchestration         │  │
│  │ - Tool invocation (async)   │  │
│  └─────────────────────────────┘  │
└───────────┬───────────────────────┘
            │
            ▼
┌───────────────────────────────────┐
│  Cloud Services (APIs)            │
│  - STT (OpenAI/Google)            │
│  - LLM (Claude/GPT-4)             │
│  - TTS (ElevenLabs/Google)        │
│  - Tools: DB, CRM, Billing        │
└───────────────────────────────────┘
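The turn-taking FSM named in the diagram can be made explicit as a transition table. The following is a minimal sketch: the four states match the ones the implementation uses, while `TurnState`, `ALLOWED`, and `transition` are hypothetical names chosen for illustration, and the set of legal transitions is an assumption about how the states are meant to connect.

```python
from enum import Enum


class TurnState(Enum):
    LISTENING = "listening"
    PROCESSING = "processing"
    SPEAKING = "speaking"
    FINISHING = "finishing"


# Assumed legal transitions. PROCESSING -> LISTENING covers error recovery,
# SPEAKING -> LISTENING covers barge-in.
ALLOWED = {
    TurnState.LISTENING: {TurnState.PROCESSING},
    TurnState.PROCESSING: {TurnState.SPEAKING, TurnState.LISTENING},
    TurnState.SPEAKING: {TurnState.FINISHING, TurnState.LISTENING},
    TurnState.FINISHING: {TurnState.LISTENING},
}


def transition(current: TurnState, target: TurnState) -> TurnState:
    """Return the new state, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"Illegal transition: {current.value} -> {target.value}")
    return target
```

Routing every state change through one function like this makes illegal moves (for example, jumping from listening straight to speaking) fail loudly during testing instead of silently corrupting a call.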

Complete Implementation

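Here's the skeleton of a production voice agent in roughly 300 lines of code. The STT, LLM, TTS, and tool calls are mocked so the control flow stays visible: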

import asyncio
import logging
import os
import json
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Dict, Any
from datetime import datetime
import webrtcvad
import numpy as np
from openai import AsyncOpenAI
from twilio.rest import Client as TwilioClient
from twilio.twiml.voice_response import VoiceResponse
from flask import Flask, request

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# ============================================================================
# STATE MANAGEMENT
# ============================================================================

class ConversationState(Enum):
    LISTENING = "listening"
    PROCESSING = "processing"
    SPEAKING = "speaking"
    FINISHING = "finishing"

@dataclass
class ConversationContext:
    """Tracks state and metrics for a single conversation."""
    call_id: str
    user_name: Optional[str] = None
    account_id: Optional[str] = None
    state: ConversationState = ConversationState.LISTENING
    turn_count: int = 0
    start_time: Optional[datetime] = None  # Filled in by __post_init__
    messages: Optional[list] = None  # Conversation history; filled in by __post_init__

    def __post_init__(self):
        # Mutable and time-dependent defaults are created per instance here
        if self.messages is None:
            self.messages = []
        if self.start_time is None:
            self.start_time = datetime.now()

    def duration_seconds(self) -> float:
        return (datetime.now() - self.start_time).total_seconds()

# ============================================================================
# CORE VOICE AGENT
# ============================================================================

class ProductionVoiceAgent:
    """
    Production voice agent combining all components from the series.
    """

    def __init__(self):
        self.openai_client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        self.vad = webrtcvad.Vad(2)  # Aggressiveness: 0 (least) to 3 (most)
        self.sample_rate = 8000  # Phone quality
        self.frame_duration_ms = 20
        self.frame_size = int(self.sample_rate * self.frame_duration_ms / 1000)  # Samples per frame

        # Conversation contexts (call_id -> ConversationContext)
        self.contexts: Dict[str, ConversationContext] = {}

        # System prompt for the agent
        self.system_prompt = """
You are a helpful customer service agent for an online retailer. Your name is Sarah.

PERSONALITY:
- Warm, conversational, natural.
- Keep responses under 20 words. Split long answers into multiple exchanges.
- Use contractions: "I'm", "you're", "it's".

CAPABILITIES:
- Check order status
- Process returns
- Look up account information
- Offer refunds
- Escalate to human agent if needed

GUIDELINES:
- Answer directly. Lead with the answer, not the method.
- For ambiguous input, ask one clarifying question naturally.
- Never make up information. If unsure, say so.
"""

        # Tool functions (mock implementations)
        self.tools = {
            "check_order_status": self.check_order_status,
            "get_account_info": self.get_account_info,
            "process_return": self.process_return,
            "escalate_to_human": self.escalate_to_human,
        }

    async def handle_incoming_call(self, call_sid: str) -> str:
        """
        Twilio webhook: incoming call.
        Returns TwiML to connect to the agent.
        """
        logger.info(f"[Call {call_sid}] Incoming")

        # Create conversation context
        ctx = ConversationContext(call_id=call_sid)
        self.contexts[call_sid] = ctx

        # TwiML to stream audio to our agent endpoint
        response = VoiceResponse()
        response.say("Please wait while we connect you.", voice="alice")

        # Start a WebSocket media stream (Twilio Streams API): <Start><Stream/></Start>.
        # A <Start> stream is receive-only; sending audio back over the same
        # socket requires a bidirectional <Connect><Stream> instead.
        start = response.start()
        start.stream(
            url=f"wss://your-domain.com/media-stream/{call_sid}",
            track="inbound_track",
        )

        response.pause(length=86400)  # Keep call open

        return str(response)

    async def process_media_stream(self, call_sid: str, media_data: bytes):
        """
        Process one frame of audio data from the Twilio Streams API.
        """
        ctx = self.contexts.get(call_sid)
        if not ctx:
            return

        # Decode Twilio audio (8 kHz µ-law) to 16-bit linear PCM.
        # Note: audioop was removed from the standard library in Python 3.13.
        try:
            import audioop
            pcm_audio = audioop.ulaw2lin(media_data, 2)
        except Exception as e:
            logger.error(f"[Call {call_sid}] Audio decode error: {e}")
            return

        # VAD: detect whether the user is speaking. webrtcvad only accepts
        # 10, 20, or 30 ms frames and raises on any other length.
        try:
            is_speech = self.vad.is_speech(pcm_audio, self.sample_rate)
        except Exception as e:
            logger.error(f"[Call {call_sid}] VAD error: {e}")
            return

        # Update state machine
        if ctx.state == ConversationState.LISTENING:
            if is_speech:
                # User speaking; start accumulating audio
                if not hasattr(ctx, "pending_audio"):
                    ctx.pending_audio = bytearray()
                ctx.pending_audio.extend(pcm_audio)

            elif hasattr(ctx, "pending_audio") and len(ctx.pending_audio) > 0:
                # User finished speaking; process utterance.
                # Simplification: a single non-speech frame ends the utterance.
                # Production code should wait for a run of silent frames.
                logger.info(f"[Call {call_sid}] User finished speaking")
                asyncio.create_task(
                    self.process_user_utterance(call_sid, bytes(ctx.pending_audio))
                )
                ctx.pending_audio = bytearray()

        elif ctx.state == ConversationState.SPEAKING:
            # Check for barge-in (user interrupting)
            if is_speech:
                logger.info(f"[Call {call_sid}] Barge-in detected")
                # In production, also stop any audio still queued for playback
                ctx.state = ConversationState.LISTENING
                ctx.pending_audio = bytearray()
                ctx.pending_audio.extend(pcm_audio)

    async def process_user_utterance(self, call_sid: str, audio_bytes: bytes):
        """
        Process a complete user utterance: STT → LLM → TTS → Playback.
        """
        ctx = self.contexts.get(call_sid)
        if not ctx:
            return

        ctx.state = ConversationState.PROCESSING
        ctx.turn_count += 1

        logger.info(f"[Call {call_sid}] Turn #{ctx.turn_count}: Processing")

        try:
            # 1. STT: Transcribe audio
            logger.info(f"[Call {call_sid}] Running STT...")
            transcript = await self.speech_to_text(audio_bytes)
            logger.info(f"[Call {call_sid}] User said: {transcript}")

            # Add to conversation history
            ctx.messages.append({"role": "user", "content": transcript})

            # 2. LLM: Generate response
            logger.info(f"[Call {call_sid}] Running LLM...")
            response_text = await self.generate_response(ctx)
            logger.info(f"[Call {call_sid}] Agent says: {response_text}")

            # Add to conversation history
            ctx.messages.append({"role": "assistant", "content": response_text})

            # 3. TTS: Synthesize speech
            logger.info(f"[Call {call_sid}] Running TTS...")
            tts_audio = await self.text_to_speech(response_text)

            # 4. Playback (Twilio Streams API)
            ctx.state = ConversationState.SPEAKING
            await self.send_audio_to_caller(call_sid, tts_audio)

            # A barge-in during playback has already moved the state back to
            # LISTENING; only run the wind-down if we are still speaking.
            if ctx.state == ConversationState.SPEAKING:
                ctx.state = ConversationState.FINISHING
                await asyncio.sleep(0.5)  # Brief pause
                ctx.state = ConversationState.LISTENING

        except Exception:
            # logger.exception keeps the traceback for debugging
            logger.exception(f"[Call {call_sid}] Error while processing turn")
            ctx.state = ConversationState.LISTENING

    async def speech_to_text(self, audio_bytes: bytes) -> str:
        """
        Transcribe audio to text using OpenAI Whisper.
        """
        # Mock: In production, call Whisper API
        return "I would like to check my order status"

    async def generate_response(self, ctx: ConversationContext) -> str:
        """
        Generate LLM response using Claude or GPT-4.
        Includes tool use for function calling.
        """
        # Mock: In production, call OpenAI/Anthropic API
        # Handle tool calls asynchronously
        return "Your order is on its way! Tracking number 1Z999AA."

    async def text_to_speech(self, text: str) -> bytes:
        """
        Convert text to speech using ElevenLabs or Google Cloud TTS.
        """
        # Mock: In production, call TTS API
        # Return audio bytes (16-bit PCM, 8 kHz)
        return b"\x00" * 8000  # Placeholder: 0.5 s of silence

    async def send_audio_to_caller(self, call_sid: str, audio_bytes: bytes):
        """
        Stream audio back to caller via Twilio.
        """
        # In production, use Twilio Media Streams to send audio
        logger.info(f"[Call {call_sid}] Sending {len(audio_bytes)} bytes of audio")

    # Tool implementations (mocks)
    async def check_order_status(self, order_id: str) -> Dict[str, Any]:
        """Lookup order status in database."""
        await asyncio.sleep(0.3)  # Simulate DB latency
        return {"status": "shipped", "tracking": "1Z999AA"}

    async def get_account_info(self, account_id: str) -> Dict[str, Any]:
        """Lookup account information."""
        await asyncio.sleep(0.2)
        return {"name": "John Doe", "account_type": "premium", "balance": 500.00}

    async def process_return(self, order_id: str) -> Dict[str, Any]:
        """Initiate a return."""
        await asyncio.sleep(0.5)
        return {"status": "approved", "return_label": "1Z999BB"}

    async def escalate_to_human(self) -> Dict[str, Any]:
        """Escalate to human agent."""
        return {"status": "escalated", "queue_position": 3}

# ============================================================================
# FLASK APP & WEBHOOKS
# ============================================================================

app = Flask(__name__)
agent = ProductionVoiceAgent()

# Note: async view functions require Flask's async extra (pip install "flask[async]").

@app.route("/incoming-call", methods=["POST"])
async def incoming_call_webhook():
    """
    Twilio incoming call webhook.
    """
    call_sid = request.form.get("CallSid")
    logger.info(f"Incoming call: {call_sid}")

    twiml = await agent.handle_incoming_call(call_sid)
    return twiml, 200, {"Content-Type": "application/xml"}

# Note: Twilio delivers Media Streams messages over a WebSocket connection to the
# wss:// URL given in the TwiML, not as HTTP POST requests. This handler shows the
# per-message logic only; in production, serve it from a WebSocket endpoint.
@app.route("/media-stream/<call_sid>", methods=["POST"])
async def media_stream_webhook(call_sid: str):
    """
    Twilio Streams API: raw audio data.
    """
    data = request.get_json(silent=True) or {}
    event = data.get("event")

    if event == "media":
        payload = data["media"]["payload"]
        # Decode base64 audio
        import base64
        audio_bytes = base64.b64decode(payload)

        await agent.process_media_stream(call_sid, audio_bytes)

    elif event == "stop":
        logger.info(f"[Call {call_sid}] Call ended")
        ctx = agent.contexts.pop(call_sid, None)
        if ctx:
            logger.info(f"[Call {call_sid}] Duration: {ctx.duration_seconds():.1f}s, Turns: {ctx.turn_count}")

    return "", 200

# ============================================================================
# DEPLOYMENT & MONITORING
# ============================================================================

def setup_monitoring():
    """Initialize monitoring (CloudWatch, Datadog, etc.)."""
    # Log metrics: latency, error rate, call duration
    # Set up alerts for high latency (>2s) or error rate (>5%)
    pass

if __name__ == "__main__":
    setup_monitoring()
    app.run(debug=False, host="0.0.0.0", port=5000)
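The implementation above leaves `speech_to_text` mocked and decodes µ-law with `audioop`, which is no longer in the standard library as of Python 3.13. As a hedged sketch of one way to prepare a caller's audio for a file-based transcription API, the helpers below decode G.711 µ-law in pure Python and wrap the result in a WAV container. The function names `ulaw_to_pcm16` and `pcm16_to_wav` are hypothetical, and the per-byte loop favors clarity over speed.

```python
import io
import wave


def ulaw_to_pcm16(ulaw: bytes) -> bytes:
    """Decode G.711 µ-law bytes to 16-bit little-endian linear PCM."""
    out = bytearray()
    for byte in ulaw:
        u = ~byte & 0xFF              # µ-law bytes are stored bit-inverted
        sign = u & 0x80
        exponent = (u >> 4) & 0x07
        mantissa = u & 0x0F
        sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
        if sign:
            sample = -sample
        out += sample.to_bytes(2, "little", signed=True)
    return bytes(out)


def pcm16_to_wav(pcm: bytes, sample_rate: int = 8000) -> bytes:
    """Wrap mono 16-bit PCM in a WAV container, entirely in memory."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)           # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()
```

The resulting bytes can be handed to an STT client as a file upload. A table-driven or vectorized decoder would be a better fit for high call volumes.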

Deployment Checklist

Before going live:

Infrastructure

  • Choose hosting (AWS Lambda + API Gateway, Google Cloud Run, self-hosted)
  • Set up logging (CloudWatch, Datadog, Sentry)
  • Configure monitoring and alerting (latency, error rate, call duration)
  • Enable SSL/TLS for all endpoints
  • Set up backups and disaster recovery

Testing

  • Unit tests for each component (VAD, STT, LLM, TTS, turn-taking)
  • Integration tests with mock Twilio API
  • Load testing (simulate concurrent calls)
  • User acceptance testing with real users
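For the load-testing item, a first pass does not need real telephony. The sketch below simulates many concurrent calls on one event loop, with `asyncio.sleep` standing in for the STT, LLM, and TTS round trips. `simulated_call` and `run_load_test` are hypothetical names, and the latency figures are placeholders rather than measurements.

```python
import asyncio
import time


async def simulated_call(call_id: int, turns: int = 3, turn_latency_s: float = 0.01) -> float:
    """Stand-in for one call: each turn sleeps to mimic pipeline latency."""
    start = time.perf_counter()
    for _ in range(turns):
        await asyncio.sleep(turn_latency_s)
    return time.perf_counter() - start


async def run_load_test(concurrent_calls: int) -> list:
    """Run all simulated calls concurrently and return their durations."""
    return await asyncio.gather(*(simulated_call(i) for i in range(concurrent_calls)))


if __name__ == "__main__":
    durations = asyncio.run(run_load_test(100))
    print(f"{len(durations)} calls, slowest {max(durations):.3f}s")
```

Replacing the sleeps with calls into the real agent, against mocked external services, shows where the event loop starts to fall behind as concurrency grows.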

Production Safety

  • Rate limiting (prevent abuse)
  • Call authentication (verify caller identity if needed)
  • Call recording and retention (comply with local laws)
  • Data privacy (GDPR, CCPA): encrypt audio, implement deletion policies
  • Failover/redundancy (multiple regions, graceful degradation)
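Rate limiting from the list above is often implemented as a token bucket. This is a minimal in-memory sketch, assuming one bucket per caller or per API key; the `TokenBucket` class and its parameters are illustrative, and a multi-instance deployment would need shared state (for example in a cache service) instead.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate_per_s` tokens per second."""

    def __init__(self, rate_per_s: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock            # Injectable so tests can control time
        self.last = clock()

    def allow(self) -> bool:
        """Consume one token if available; return whether the request may proceed."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A webhook handler could look up the bucket for the calling number and reject the call when `allow()` returns `False`.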

Tuning

  • Measure end-to-end latency; optimize bottlenecks
  • Test barge-in sensitivity in target environments
  • Validate prompt effectiveness with target users
  • Monitor and adjust tool timeout thresholds
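Barge-in sensitivity is usually tuned by requiring several consecutive speech frames before interrupting playback, so a cough or line noise does not cut the agent off. A minimal sketch, assuming 20 ms frames and a hypothetical `BargeInDetector` class; the default of 5 frames (about 100 ms) is a starting point to tune, not a recommendation from measurements.

```python
class BargeInDetector:
    """Signals barge-in only after a run of consecutive speech frames."""

    def __init__(self, min_speech_frames: int = 5):
        self.min_speech_frames = min_speech_frames
        self._run = 0  # Length of the current run of speech frames

    def update(self, is_speech: bool) -> bool:
        """Feed one VAD decision; return True once the run is long enough."""
        self._run = self._run + 1 if is_speech else 0
        return self._run >= self.min_speech_frames
```

Raising `min_speech_frames` makes the agent harder to interrupt in noisy environments; lowering it makes interruptions feel more responsive.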

Monitoring and Metrics

Key metrics to track:

import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)

@dataclass
class CallMetrics:
    call_id: str
    duration_seconds: float
    turn_count: int
    avg_latency_ms: float
    error_count: int
    tool_calls: int
    escalated: bool  # Whether this call was escalated to a human; aggregate across calls for a rate
    user_satisfaction: Optional[float] = None  # Post-call survey score, if collected

# Send to monitoring service
def log_metrics(metrics: CallMetrics):
    """Send call metrics to CloudWatch/Datadog."""
    logger.info(f"Call metrics: {metrics}")
    # Forward to your metrics backend here (e.g. a Datadog or CloudWatch client)
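Per-call records become useful once they are aggregated. The sketch below shows two aggregations that a dashboard might compute, a nearest-rank percentile for latency and an escalation rate. The helper names `percentile` and `escalation_rate` are hypothetical, and a monitoring service would normally do this for you.

```python
import math


def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile; pct is in the range (0, 100]."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def escalation_rate(escalated_flags: list) -> float:
    """Fraction of calls that were escalated to a human."""
    if not escalated_flags:
        return 0.0
    return sum(escalated_flags) / len(escalated_flags)
```

For example, `percentile(latencies_ms, 95)` gives the value to compare against a p95 latency target.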

Cost Estimation (2026 Pricing)

For a voice agent processing 1,000 calls/month:

Component                 Price per Call   1,000 Calls/Month
STT (Whisper)             $0.01            $10
LLM (Claude Haiku)        $0.05            $50
TTS (ElevenLabs)          $0.02            $20
Hosting (AWS Lambda)      $0.10            $100
Telephony (Twilio SIP)    $0.03            $30
Total                     $0.21            $210

Costs scale with call volume and complexity (tool use, model choice).
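To project the estimate to other volumes, the per-call figures from the table can be multiplied out. A small sketch, using integer cents to avoid floating-point drift; the `PER_CALL_CENTS` mapping and `monthly_cost_usd` function are illustrative names, and the linear model ignores volume discounts and fixed costs.

```python
# Per-call prices from the table above, in cents
PER_CALL_CENTS = {
    "stt": 1,
    "llm": 5,
    "tts": 2,
    "hosting": 10,
    "telephony": 3,
}


def monthly_cost_usd(calls_per_month: int, per_call_cents=PER_CALL_CENTS) -> float:
    """Linear cost projection: calls times the summed per-call price."""
    return calls_per_month * sum(per_call_cents.values()) / 100
```

With these figures, 1,000 calls cost $210 and 10,000 calls cost $2,100 per month.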

Key Takeaways

  • A production voice agent integrates STT, LLM, TTS, VAD, turn-taking, and tool use into a unified state machine.
  • Use managed services (Twilio, Vonage) for telephony; focus your engineering on the voice logic.
  • Monitor latency, error rate, and call metrics in production. Set alerts for degradation.
  • Test thoroughly: unit, integration, load, and user acceptance tests before deployment.
  • Implement graceful error handling and fallbacks: tools fail, networks drop, users get frustrated. Your agent must recover.
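The last takeaway, recovering when tools fail, can be sketched as a wrapper that bounds each tool call with a timeout and converts failures into a result the LLM can talk about. The name `call_tool_with_fallback`, the 2-second default, and the shape of the fallback dictionary are assumptions for illustration.

```python
import asyncio


async def call_tool_with_fallback(tool, *args, timeout_s: float = 2.0) -> dict:
    """Await an async tool; on timeout or error, return a structured fallback."""
    try:
        return await asyncio.wait_for(tool(*args), timeout=timeout_s)
    except asyncio.TimeoutError:
        return {"status": "unavailable", "reason": "timeout"}
    except Exception as exc:
        return {"status": "unavailable", "reason": str(exc)}
```

When the result has `status == "unavailable"`, the agent can apologize and offer to retry or escalate, instead of leaving the caller in silence.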

Frequently Asked Questions

How do I scale a voice agent to handle 1,000 concurrent calls?

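Use auto-scaling compute (AWS Lambda, Google Cloud Run) so capacity follows call volume, with each instance handling one call or a small, bounded number of calls. Because a media stream is a long-lived connection and a single AWS Lambda invocation is capped at 15 minutes, the streaming component often fits a container platform better than a function. Use a message queue (SQS, Pub/Sub) to decouple call intake from slower background processing, and cache LLM responses for common queries.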
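Caching responses for common queries can be as simple as a dictionary keyed by a normalized utterance. This sketch assumes a hypothetical `ResponseCache` class and is only appropriate for answers that do not depend on the caller's account, such as opening hours; a production cache would also need expiry and a shared store.

```python
from typing import Optional


class ResponseCache:
    """Caches answers keyed by a normalized form of the caller's utterance."""

    def __init__(self) -> None:
        self._store = {}

    @staticmethod
    def normalize(utterance: str) -> str:
        # Lowercase, turn punctuation into spaces, collapse runs of whitespace
        kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in utterance.lower())
        return " ".join(kept.split())

    def get(self, utterance: str) -> Optional[str]:
        return self._store.get(self.normalize(utterance))

    def put(self, utterance: str, response: str) -> None:
        self._store[self.normalize(utterance)] = response
```

On a cache hit the agent can skip the LLM call entirely, which cuts both latency and cost for that turn.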

What's the SLA (service level agreement) I should target?

  • Availability: 99.9% uptime (about 43 minutes of downtime per month, or roughly 8.8 hours per year)
  • Latency: 95th percentile under 1.5 seconds
  • Error rate: less than 0.5% of calls

These are typical targets for customer-facing voice agents.
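The downtime that an availability target allows follows from simple arithmetic: the unavailable fraction times the length of the period. A short sketch, with `downtime_budget_minutes` as an illustrative helper name:

```python
def downtime_budget_minutes(availability_pct: float, period_hours: float) -> float:
    """Minutes of downtime permitted by an availability target over a period."""
    return (1 - availability_pct / 100) * period_hours * 60


# 30-day month = 720 hours; 365-day year = 8,760 hours
monthly = downtime_budget_minutes(99.9, 720)
yearly = downtime_budget_minutes(99.9, 8760)
```

For 99.9% this works out to about 43.2 minutes per 30-day month and about 525.6 minutes (8.76 hours) per year.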

How do I handle silence timeouts?

If the user is silent for 30 seconds, hang up or offer help. Implement this in your turn-taking state machine: track silence duration and transition to a "waiting for user" state.
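Silence tracking can be driven by the same per-frame VAD decisions the agent already computes. A minimal sketch, assuming a hypothetical `SilenceTracker` class; the 30-second hang-up threshold comes from the answer above, while the earlier 10-second "offer help" prompt is an assumed value to tune.

```python
class SilenceTracker:
    """Counts continuous silence and says when to prompt or hang up."""

    def __init__(self, frame_ms: int = 20, prompt_after_s: float = 10.0,
                 hangup_after_s: float = 30.0):
        self.frame_ms = frame_ms
        self.prompt_after_ms = prompt_after_s * 1000
        self.hangup_after_ms = hangup_after_s * 1000
        self.silent_ms = 0
        self.prompted = False

    def update(self, is_speech: bool) -> str:
        """Feed one VAD decision; returns "ok", "prompt", or "hangup"."""
        if is_speech:
            self.silent_ms = 0
            self.prompted = False
            return "ok"
        self.silent_ms += self.frame_ms
        if self.silent_ms >= self.hangup_after_ms:
            return "hangup"
        if self.silent_ms >= self.prompt_after_ms and not self.prompted:
            self.prompted = True  # Prompt only once per silent stretch
            return "prompt"
        return "ok"
```

The tracker should only be fed while the agent is listening, so the agent's own speaking time does not count as user silence.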

Can I run the agent on-device (no cloud)?

Yes, but with limitations. Use open-source STT (Whisper), LLM (Llama), and TTS (Piper), all running locally. This eliminates cloud latency but requires significant compute (GPU recommended) and sacrifices accuracy. Hybrid models (local + cloud fallback) are increasingly common.
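One way to structure the hybrid approach is to try the local model first and fall back to the cloud only when its confidence is low. In this sketch, `transcribe_hybrid`, the `(text, confidence)` return convention, and the 0.8 threshold are all assumptions for illustration; real STT engines expose confidence in different ways, if at all.

```python
from typing import Callable, Tuple

# Hypothetical STT interface: takes audio bytes, returns (text, confidence 0..1)
SttFn = Callable[[bytes], Tuple[str, float]]


def transcribe_hybrid(audio: bytes, local_stt: SttFn, cloud_stt: SttFn,
                      min_confidence: float = 0.8) -> Tuple[str, str]:
    """Return (transcript, source), where source is "local" or "cloud"."""
    text, confidence = local_stt(audio)
    if confidence >= min_confidence:
        return text, "local"
    try:
        cloud_text, _ = cloud_stt(audio)
        return cloud_text, "cloud"
    except Exception:
        # Cloud unreachable: degrade to the local result rather than fail the turn
        return text, "local"
```

Logging the returned source per turn shows how often the fallback fires, which helps decide whether the local model is good enough on its own.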

How do I handle GDPR/CCPA compliance?

  • Delete call recordings after 30–90 days
  • Implement right-to-deletion: if user requests, purge all data
  • Log consent: record that the user consented to recording
  • Use encryption for audio at rest and in transit
  • Anonymize transcripts for training (use opt-in user data only)
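The retention rule in the first bullet can be enforced by a scheduled job that finds recordings older than the cutoff. A minimal sketch, assuming recordings are tracked as a mapping from ID to creation time; `expired_recordings` is a hypothetical helper, and the actual deletion from storage is left out.

```python
from datetime import datetime, timedelta, timezone


def expired_recordings(recordings: dict, now: datetime, retention_days: int = 90) -> list:
    """Return the IDs of recordings created before the retention cutoff."""
    cutoff = now - timedelta(days=retention_days)
    return sorted(rid for rid, created in recordings.items() if created < cutoff)


if __name__ == "__main__":
    now = datetime(2026, 6, 1, tzinfo=timezone.utc)
    store = {
        "rec-old": datetime(2026, 1, 1, tzinfo=timezone.utc),
        "rec-new": datetime(2026, 5, 20, tzinfo=timezone.utc),
    }
    print(expired_recordings(store, now))
```

The same lookup, filtered by caller instead of by age, can back a right-to-deletion request.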

Further Reading