Telephony Integration: Voice Agents on Phone Networks

A voice agent running on a laptop is impressive, but a voice agent answering real phone calls is transformative. Telephony integration means connecting your agent to the Session Initiation Protocol (SIP) or Public Switched Telephone Network (PSTN)—the infrastructure behind phone calls worldwide.

SIP and PSTN introduce constraints your agent must handle: lower audio quality (8 kHz sampling, compressed codecs), network jitter (variable latency), and call control semantics (dial tone, busy signal, voicemail). This article covers the telephony landscape and how to integrate voice agents into phone systems.

Understanding SIP and PSTN

The PSTN (Public Switched Telephone Network) is the legacy phone system: copper wires, analog switching, dial tones. When you call a friend's landline, you're using PSTN.

SIP (Session Initiation Protocol) is the modern, IP-based replacement: it handles call setup and teardown (signaling) while audio travels over RTP (Real-time Transport Protocol). Most carriers and modern PBX systems support SIP.

Key Differences

Aspect     | PSTN                              | SIP
-----------|-----------------------------------|------------------------------
Transport  | Circuit-switched (dedicated line) | Packet-switched (IP network)
Call setup | Dial tone + digits                | SIP INVITE message
Audio      | µ-law (8 kHz)                     | Opus, G.711, G.729 (variable)
Latency    | 50–100 ms (reliable)              | 50–300 ms (variable jitter)
Cost       | Per-minute billing                | Flat-rate or per-call
Example    | Traditional landline              | VoIP (Twilio, Vonage)

SIP Architecture and Call Flow

A typical SIP call:

Caller            SIP Server (Registrar)             Voice Agent
|                           |                             |
|--- INVITE --------------->|                             |
|    (initiate call)        |--- INVITE ----------------->|
|                           |    (ring)                   |
|<-- 180 Ringing -----------|<-- 180 Ringing -------------|
|                           |                             |
|<-- 200 OK ----------------|<-- 200 OK ------------------|
|--- ACK ------------------>|--- ACK -------------------->|
|                                                         |
|<============= RTP audio stream (duplex) ===============>|
|             (conversation happens over RTP)             |
|                                                         |
|--- BYE ------------------>|--- BYE -------------------->|
|<-- 200 OK ----------------|<-- 200 OK ------------------|
|                           |                             |
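The flow above can be read as a small state machine on the agent's side. The following is a minimal sketch, not a full SIP dialog implementation: the state names and the `ANSWER` event (the agent's own decision to send 200 OK) are hypothetical labels chosen for illustration, and retransmissions, timers, and error responses are ignored.

```python
from enum import Enum, auto


class CallState(Enum):
    IDLE = auto()      # no call yet
    RINGING = auto()   # INVITE received, 180 Ringing sent
    ANSWERED = auto()  # 200 OK sent, waiting for the caller's ACK
    ACTIVE = auto()    # ACK received, RTP audio flowing
    ENDED = auto()     # BYE or CANCEL received


# Allowed transitions for the callee, keyed by (current state, event).
# "ANSWER" is a local event (hypothetical name), not a SIP method.
TRANSITIONS = {
    (CallState.IDLE, "INVITE"): CallState.RINGING,
    (CallState.RINGING, "ANSWER"): CallState.ANSWERED,
    (CallState.RINGING, "CANCEL"): CallState.ENDED,
    (CallState.ANSWERED, "ACK"): CallState.ACTIVE,
    (CallState.ACTIVE, "BYE"): CallState.ENDED,
}


def next_state(state: CallState, event: str) -> CallState:
    """Return the state after `event`, or raise if the event is out of order."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"unexpected {event} in state {state.name}") from None


def run_call(events) -> CallState:
    """Feed a sequence of events through the state machine."""
    state = CallState.IDLE
    for event in events:
        state = next_state(state, event)
    return state
```

Calling `run_call(["INVITE", "ANSWER", "ACK", "BYE"])` walks through the same sequence as the diagram. Keeping the transitions in a table makes out-of-order messages (for example a BYE before any INVITE) fail loudly instead of being silently accepted.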

A voice agent handling a SIP call must:

  1. Accept INVITE: Recognize an incoming call request.
  2. Send 100 Trying, 180 Ringing, 200 OK: Signal call progress to the caller.
  3. Open RTP stream: Start receiving and sending audio over RTP.
  4. Process audio: Feed audio to STT, LLM, TTS as in previous articles.
  5. Handle BYE: End the call cleanly when either party disconnects.
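To make steps 1 and 2 concrete, here is a rough sketch of parsing an INVITE and building a provisional response by hand. It assumes simple, unfolded headers in long form, and it omits the Contact header and the SDP body that a real 200 OK needs; the function names, the sample addresses, and the `abc123` tag are made up for illustration. In practice a SIP stack does this for you.

```python
# Headers that a SIP response copies from the request it answers.
COPIED_HEADERS = ("via", "from", "to", "call-id", "cseq")


def parse_sip_request(raw: str):
    """Split a SIP request into (method, uri, headers, body)."""
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    method, uri, _version = lines[0].split(" ", 2)
    headers = []
    for line in lines[1:]:
        name, _, value = line.partition(":")  # split at the first colon only
        headers.append((name.strip(), value.strip()))
    return method, uri, headers, body


def build_response(headers, status: int, reason: str, to_tag: str = "") -> str:
    """Build a body-less response, copying the dialog-identifying headers."""
    out = [f"SIP/2.0 {status} {reason}"]
    for name, value in headers:
        if name.lower() not in COPIED_HEADERS:
            continue
        # The callee adds its own tag to the To header to identify the dialog.
        if name.lower() == "to" and to_tag and "tag=" not in value:
            value = f"{value};tag={to_tag}"
        out.append(f"{name}: {value}")
    out.append("Content-Length: 0")
    return "\r\n".join(out) + "\r\n\r\n"


SAMPLE_INVITE = (
    "INVITE sip:agent@example.com SIP/2.0\r\n"
    "Via: SIP/2.0/UDP 192.0.2.10:5060;branch=z9hG4bK776asdhds\r\n"
    "From: <sip:caller@example.com>;tag=1928301774\r\n"
    "To: <sip:agent@example.com>\r\n"
    "Call-ID: a84b4c76e66710\r\n"
    "CSeq: 314159 INVITE\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)

method, uri, headers, _body = parse_sip_request(SAMPLE_INVITE)
ringing = build_response(headers, 180, "Ringing", to_tag="abc123")
print(ringing)
```

The response reuses the request's Via, From, To, Call-ID, and CSeq values, which is how the caller matches the 180 Ringing to its original INVITE.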

Implementing a SIP Voice Agent

Here's a minimal SIP agent using the pjsua2 Python library:

import asyncio
import logging

import pjsua2 as pj

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class VoiceAgentAccount(pj.Account):
    """Custom Account class to handle SIP account events."""

    def __init__(self):
        super().__init__()
        # Keep references to calls; otherwise Python garbage-collects
        # the Call object as soon as onIncomingCall returns.
        self.calls = {}

    def onRegState(self, prm):
        """Called when account registration state changes."""
        if self.getInfo().regIsActive:
            logger.info("[SIP] Account registered successfully")
        else:
            logger.warning("[SIP] Account registration failed")

    def onIncomingCall(self, prm):
        """Called when an incoming call arrives."""
        call = VoiceAgentCall(self, prm.callId)
        self.calls[prm.callId] = call
        logger.info(f"[SIP] Incoming call from {call.getInfo().remoteUri}")

        # Answer the call automatically with 200 OK
        call_prm = pj.CallOpParam(True)
        call_prm.statusCode = 200
        call.answer(call_prm)


class VoiceAgentCall(pj.Call):
    """Custom Call class to handle call events and audio streaming."""

    def __init__(self, account, call_id):
        super().__init__(account, call_id)
        self.voice_task = None

    def onCallState(self, prm):
        """Called when call state changes (connected, disconnected, etc.)."""
        call_info = self.getInfo()

        if call_info.state == pj.PJSIP_INV_STATE_CONNECTING:
            logger.info(f"[Call] Connecting to {call_info.remoteUri}")

        elif call_info.state == pj.PJSIP_INV_STATE_CONFIRMED:
            logger.info(f"[Call] Connected to {call_info.remoteUri}")
            # Start voice processing. create_task works here only because
            # the server polls pjsua2 from inside the asyncio loop, so this
            # callback runs in the loop's thread.
            self.voice_task = asyncio.create_task(self.process_voice())

        elif call_info.state == pj.PJSIP_INV_STATE_DISCONNECTED:
            logger.info("[Call] Disconnected")
            # Clean up resources
            if self.voice_task is not None:
                self.voice_task.cancel()

    async def process_voice(self):
        """
        Main voice agent processing loop.
        Identical to previous articles, but receives audio via RTP instead of microphone.
        """
        try:
            call_info = self.getInfo()
            media_idx = 0  # First media (audio)

            if len(call_info.media) == 0:
                logger.error("[Call] No media in call info")
                return

            media = call_info.media[media_idx]
            if media.status != pj.PJSUA_CALL_MEDIA_ACTIVE:
                logger.error("[Call] Media not active")
                return

            # The negotiated codec is reported by the stream info
            stream_info = self.getStreamInfo(media_idx)
            logger.info(f"[Call] Audio codec: {stream_info.codecName}")

            # From this point, audio flows through RTP.
            # pjsua2 handles RTP I/O internally; this minimal example does not
            # tap the audio frames and only simulates a greeting.
            await self.play_greeting()

        except Exception as e:
            logger.error(f"[Call] Error processing voice: {e}")

    async def play_greeting(self):
        """Play a greeting message to the caller."""
        message = "Hello, you've reached the voice agent. Please state your question."

        # Generate TTS audio (in a real implementation, convert text to audio bytes)
        tts_audio = await generate_tts(message)  # Mock TTS

        # Play audio to caller (simplified; actual RTP audio handling is complex)
        logger.info(f"[Agent] Speaking: {message} ({len(tts_audio)} bytes)")

        # Wait for user input (simplified; would need audio capture from RTP)
        await asyncio.sleep(3)


async def generate_tts(text: str) -> bytes:
    """Generate TTS audio (mock implementation)."""
    return b"\x00" * 16000  # Placeholder: 1 s of silent 16-bit PCM at 8 kHz


class SIPVoiceAgentServer:
    """
    Main SIP voice agent server.
    Listens for incoming calls and routes them to VoiceAgentCall handlers.
    """

    # The server, account, and password defaults are placeholders.
    def __init__(self, sip_server="sip.twilio.com", account="agent@example.com",
                 password="your_password"):
        self.sip_server = sip_server
        self.account_uri = account
        self.password = password
        self.ep = None
        self.acc = None

    async def start(self):
        """Initialize SIP stack and register account."""
        try:
            # Create endpoint
            self.ep = pj.Endpoint()
            self.ep.libCreate()

            # No pjsua2 worker threads: events are polled from the asyncio
            # loop below, so callbacks run in the loop's thread.
            ep_cfg = pj.EpConfig()
            ep_cfg.uaConfig.threadCnt = 0
            self.ep.libInit(ep_cfg)

            # Create a UDP transport (TCP and TLS can be added the same way)
            self.ep.transportCreate(pj.PJSIP_TRANSPORT_UDP, pj.TransportConfig())

            # Start library
            self.ep.libStart()
            # Servers usually have no sound card, so use the null audio device
            self.ep.audDevManager().setNullDev()
            logger.info("[SIP] Stack started")

            # Create and register account
            acc_cfg = pj.AccountConfig()
            acc_cfg.idUri = f"sip:{self.account_uri}"
            acc_cfg.regConfig.registrarUri = f"sip:{self.sip_server}"
            username = self.account_uri.split("@")[0]
            acc_cfg.sipConfig.authCreds.append(
                pj.AuthCredInfo("digest", "*", username, 0, self.password)
            )

            self.acc = VoiceAgentAccount()
            self.acc.create(acc_cfg)
            # Registration completes asynchronously; see onRegState
            logger.info(f"[SIP] Registration requested for {self.account_uri}")

            # Keep the server running and let pjsua2 process SIP events
            while True:
                self.ep.libHandleEvents(10)  # block for at most 10 ms
                await asyncio.sleep(0.01)

        except Exception as e:
            logger.error(f"[SIP] Startup error: {e}")

    async def stop(self):
        """Shutdown SIP stack."""
        if self.ep:
            self.ep.libDestroy()
            self.ep = None
            logger.info("[SIP] Stack stopped")


# Usage
async def main():
    server = SIPVoiceAgentServer(
        sip_server="sip.twilio.com",
        account="agent@example.com",  # placeholder SIP address
        password="your_password",
    )
    try:
        await server.start()
    finally:
        await server.stop()


if __name__ == "__main__":
    asyncio.run(main())

Using a Telephony Service (Twilio, Vonage)

Instead of building SIP infrastructure from scratch, use a managed service like Twilio:

# Twilio voice agent (incoming call webhook)
from flask import Flask, Response, request
from twilio.twiml.voice_response import VoiceResponse

app = Flask(__name__)


def twiml(response: VoiceResponse) -> Response:
    """Return TwiML with the XML content type Twilio expects."""
    return Response(str(response), mimetype="text/xml")


@app.route("/voice", methods=["POST"])
def voice_webhook():
    """
    Twilio calls this webhook for incoming calls.
    Return TwiML (Twilio Markup Language) to control the call.
    """
    response = VoiceResponse()

    # Play a greeting
    response.say("Hello. You have reached the voice agent.", voice="alice")

    # Gather user input (DTMF tones or transcribed speech).
    # input="dtmf speech" is required for speech; the default is DTMF only.
    gather = response.gather(
        input="dtmf speech",
        num_digits=1,
        action="/handle_input",
        method="POST",
    )
    gather.say("Press 1 for sales, 2 for support, or speak your request.", voice="alice")

    # If no input, hang up
    response.say("Sorry, I didn't hear anything. Goodbye.")
    response.hangup()

    return twiml(response)


@app.route("/handle_input", methods=["POST"])
def handle_input():
    """Process user input (DTMF or speech)."""
    user_input = request.form.get("Digits") or request.form.get("SpeechResult")

    response = VoiceResponse()
    response.say(f"You entered: {user_input}", voice="alice")
    response.hangup()

    return twiml(response)


if __name__ == "__main__":
    app.run(debug=True, port=5000)
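To see what the webhook actually sends back, it can help to build the same TwiML document by hand. The sketch below uses only the standard library's `xml.etree.ElementTree` rather than the Twilio helper; the function name `build_menu_twiml` is made up for this example, and the element and attribute names follow TwiML's `<Gather>`, `<Say>`, and `<Hangup>` verbs.

```python
import xml.etree.ElementTree as ET


def build_menu_twiml(action_url: str = "/handle_input") -> str:
    """Build the menu TwiML as an XML string (illustrative helper)."""
    response = ET.Element("Response")

    # <Gather> collects either one keypad digit or a spoken phrase.
    gather = ET.SubElement(
        response,
        "Gather",
        {"input": "dtmf speech", "numDigits": "1",
         "action": action_url, "method": "POST"},
    )
    prompt = ET.SubElement(gather, "Say", {"voice": "alice"})
    prompt.text = "Press 1 for sales, 2 for support, or speak your request."

    # Verbs after <Gather> run only if the caller gives no input.
    fallback = ET.SubElement(response, "Say")
    fallback.text = "Sorry, I didn't hear anything. Goodbye."
    ET.SubElement(response, "Hangup")

    xml_body = ET.tostring(response, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>' + xml_body


print(build_menu_twiml())
```

Nesting the prompt inside `<Gather>` is what lets the caller interrupt it with a key press; the trailing `<Say>` and `<Hangup>` act as the timeout path.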

Handling Audio Codecs and 8 kHz Audio

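Phone calls typically carry narrowband audio sampled at 8 kHz (half the 16 kHz sampling rate of wideband speech), encoded with codecs such as G.711 µ-law or A-law, or G.729. When your agent terminates the RTP stream itself, rather than letting a cloud telephony service decode the audio, it must handle codec conversion: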

import audioop  # deprecated since Python 3.11 and removed in Python 3.13


def ulaw_to_pcm(ulaw_bytes: bytes) -> bytes:
    """Convert µ-law compressed audio to 16-bit PCM."""
    return audioop.ulaw2lin(ulaw_bytes, 2)


def pcm_to_ulaw(pcm_bytes: bytes) -> bytes:
    """Convert 16-bit PCM to µ-law."""
    return audioop.lin2ulaw(pcm_bytes, 2)


# In your call handler. All five arguments are placeholders for your own
# RTP I/O, STT, TTS, and reply-generation components.
async def handle_turn(receive_rtp_audio, send_rtp_audio, stt_client, tts_client, generate_reply):
    # Incoming RTP audio is µ-law; convert to PCM for STT
    incoming_ulaw = receive_rtp_audio()
    incoming_pcm = ulaw_to_pcm(incoming_ulaw)

    # STT processes 16-bit PCM at 8 kHz
    transcript = await stt_client.transcribe(incoming_pcm, sample_rate=8000)
    response_text = await generate_reply(transcript)

    # TTS outputs 16-bit PCM at 8 kHz
    response_pcm = await tts_client.synthesize(response_text, sample_rate=8000)

    # Convert back to µ-law for RTP transmission
    response_ulaw = pcm_to_ulaw(response_pcm)
    send_rtp_audio(response_ulaw)
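Because `audioop` was deprecated in Python 3.11 and removed in 3.13, a dependency-free fallback can be useful. The following is a minimal pure-Python sketch of G.711 µ-law encoding and decoding for little-endian 16-bit PCM; the function names are chosen for this example, and the per-sample loops are far slower than a C implementation, so treat it as an illustration of the algorithm rather than production code.

```python
import struct

BIAS = 0x84   # 132, added to the magnitude before finding the segment
CLIP = 32635  # largest magnitude that still fits in 15 bits after BIAS


def encode_sample(sample: int) -> int:
    """Compress one signed 16-bit PCM sample to one µ-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    exponent = (magnitude >> 7).bit_length() - 1   # segment number, 0..7
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF  # bits are inverted


def decode_sample(byte: int) -> int:
    """Expand one µ-law byte to a signed 16-bit PCM sample."""
    byte = ~byte & 0xFF
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if byte & 0x80 else magnitude


def pcm16_to_ulaw(pcm: bytes) -> bytes:
    """Convert little-endian 16-bit PCM to µ-law (one byte per sample)."""
    count = len(pcm) // 2
    samples = struct.unpack(f"<{count}h", pcm[: 2 * count])
    return bytes(encode_sample(s) for s in samples)


def ulaw_to_pcm16(ulaw: bytes) -> bytes:
    """Convert µ-law bytes to little-endian 16-bit PCM."""
    return struct.pack(f"<{len(ulaw)}h", *(decode_sample(b) for b in ulaw))
```

Note that digital silence in µ-law is the byte `0xFF`, not `0x00`: `pcm16_to_ulaw(b"\x00\x00")` yields `b"\xff"`, while the byte `0x00` decodes to a large negative sample. A 20 ms packet at 8 kHz holds 160 µ-law bytes, which expand to 320 bytes of PCM.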

Handling Network Jitter and Packet Loss

SIP/RTP over the internet is unreliable. Implement jitter buffering and packet loss concealment:

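from collections import deque

ULAW_SILENCE = b"\xff"  # µ-law encoding of a zero-amplitude sample
BYTES_PER_PACKET = 160  # 20 ms of µ-law audio at 8 kHz (1 byte per sample)


class RTPAudioBuffer:
    """
    Buffers RTP audio to handle jitter and packet loss.
    Assumes µ-law payloads in 20 ms packets; late or duplicate packets are dropped.
    """

    def __init__(self, buffer_size_ms=100, sample_rate=8000):
        self.buffer_size_samples = int(buffer_size_ms * sample_rate / 1000)
        # One byte per µ-law sample; when full, the oldest audio is discarded
        self.buffer = deque(maxlen=self.buffer_size_samples)
        self.last_seq = None

    def add_packet(self, rtp_packet):
        """Add RTP packet to buffer."""
        seq = rtp_packet.sequence_number
        audio_data = rtp_packet.payload

        if self.last_seq is not None:
            # Sequence numbers are 16-bit and wrap around at 65536
            delta = (seq - self.last_seq) % 65536
            if delta == 0 or delta > 32768:
                return  # duplicate or late packet: ignore it

            # Detect packet loss (gap in sequence numbers)
            loss = delta - 1
            if loss:
                print(f"[RTP] Detected {loss} lost packets")
                # Silence fill (the simplest form of concealment),
                # capped at the buffer size
                fill = min(loss * BYTES_PER_PACKET, self.buffer_size_samples)
                self.buffer.extend(ULAW_SILENCE * fill)

        self.buffer.extend(audio_data)
        self.last_seq = seq

    def get_playback_audio(self):
        """Retrieve buffered audio for playback and empty the buffer."""
        audio = bytes(self.buffer)
        self.buffer.clear()
        return audio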
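The buffer above expects an `rtp_packet` object with `sequence_number` and `payload` attributes. As a rough sketch of where those come from, the code below parses the fixed 12-byte RTP header defined in RFC 3550 from a raw UDP datagram; the `RTPPacket` type and `parse_rtp` function are names invented for this example, and RTCP, SRTP, and payload validation are out of scope.

```python
import struct
from typing import NamedTuple


class RTPPacket(NamedTuple):
    payload_type: int      # 0 = PCMU (µ-law), 8 = PCMA (A-law)
    sequence_number: int   # 16-bit, increments by one per packet
    timestamp: int         # in sample units (8000 per second for G.711)
    ssrc: int              # identifies the sending stream
    payload: bytes


def parse_rtp(datagram: bytes) -> RTPPacket:
    """Parse one RTP packet from a UDP datagram."""
    if len(datagram) < 12:
        raise ValueError("RTP header is at least 12 bytes")

    first, second, seq, timestamp, ssrc = struct.unpack("!BBHII", datagram[:12])
    if first >> 6 != 2:
        raise ValueError("not an RTP version 2 packet")

    has_padding = bool(first & 0x20)
    has_extension = bool(first & 0x10)
    csrc_count = first & 0x0F

    # Skip the optional CSRC list (4 bytes per entry).
    offset = 12 + 4 * csrc_count
    if has_extension:
        # Extension header: 16-bit profile id, then length in 32-bit words.
        if len(datagram) < offset + 4:
            raise ValueError("truncated RTP extension header")
        (ext_words,) = struct.unpack("!H", datagram[offset + 2 : offset + 4])
        offset += 4 + 4 * ext_words

    # With padding, the last byte says how many trailing bytes to drop.
    end = len(datagram) - (datagram[-1] if has_padding else 0)
    if offset > end:
        raise ValueError("truncated RTP packet")

    return RTPPacket(second & 0x7F, seq, timestamp, ssrc, datagram[offset:end])
```

A 20 ms µ-law packet is therefore 172 bytes on the wire before UDP and IP headers: 12 bytes of RTP header plus 160 bytes of audio.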

Key Takeaways

  • SIP and PSTN are the standards for phone system integration. SIP is modern and IP-based; PSTN is legacy but ubiquitous.
  • Incoming calls arrive as SIP INVITE requests. Your agent must answer with 200 OK and begin receiving audio over RTP.
  • Phone audio is 8 kHz (not 16 kHz) and often compressed with µ-law or other codecs. Convert before feeding to STT.
  • Managed services (Twilio, Vonage, Amazon Connect) simplify telephony integration; build raw SIP only if you need custom control.
  • Implement jitter buffering and packet loss concealment to handle network unreliability inherent in VoIP.

Frequently Asked Questions

Can I run a voice agent on a traditional landline?

Not directly. Landlines use the analog PSTN. Your agent must connect via a SIP gateway (a device that converts analog PSTN to SIP). Most carriers offer SIP trunks that handle this conversion transparently.

What's the difference between a SIP trunk and a phone line?

A SIP trunk is a virtual phone line over IP. Instead of a physical copper wire, you have a SIP connection to your carrier. You can have hundreds of virtual trunks on a single internet connection, which is why SIP is cheaper and more scalable than PSTN.

How do I get a phone number for my voice agent?

Use a service like Twilio, Vonage, or your carrier. They assign you a number (e.g., +1-234-567-8900) that routes calls to your agent's SIP endpoint or HTTP webhook.

Can I use WebRTC instead of SIP for phone calls?

WebRTC is designed for browser-to-browser or app-to-app media. On its own it does not connect to the PSTN or carrier phone systems; reaching a phone number requires a gateway that bridges WebRTC to SIP. Use WebRTC for click-to-call on a website or inter-app calling; use SIP for PSTN integration.

How do I handle call recording and compliance (GDPR, CCPA)?

Record the RTP audio to disk, and obtain caller consent first. Many jurisdictions require all-party consent (every participant must agree to the recording), so disclose it at the start of the call, for example: "This call may be recorded for quality assurance." Delete recordings after a defined retention period (for example, 30–90 days), and honor GDPR erasure requests.

Further Reading