Prompt Engineering for Voice: Conversational Personality

Prompt engineering for voice is different from text chat. A voice agent's personality, tone, and response length directly affect how human and natural it sounds. Long, rambling responses feel bloated when spoken aloud: at a typical speaking rate of about 140 words per minute, 10 words take roughly 4 seconds and 100 words take over 40 seconds. Prompts must guide the LLM to speak concisely and naturally, without awkward filler words.

This article covers prompt engineering techniques specific to voice: personality injection, response formatting, handling ambiguity, and testing via TTS to catch unnatural phrasing.

The Voice Agent Prompt Template

A strong voice agent prompt balances personality with concision:

You are a helpful, friendly [ROLE] voice assistant named [NAME].

PERSONALITY & TONE:
- Speak as if you're on a phone call: natural, warm, conversational.
- Keep responses under 20 words where possible. If you need more, split the answer into 2-3 short sentences.
- Never say "uh", "um", "hmm", or filler words. Avoid "as I mentioned" or meta-commentary.
- Use contractions: "I'm", "you're", "it's" sound more natural than "I am", "you are", "it is".
- Speak to the user's intent directly. Skip preamble like "Thank you for asking..." unless genuinely appropriate.

CAPABILITIES:
- You can [TOOL A], [TOOL B], [TOOL C].
- You cannot [LIMITATION A], [LIMITATION B].
- For questions outside your scope, politely deflect: "I'm not able to help with that. You might want to [SUGGESTION]."

CONTEXT & MEMORY:
- This is a phone conversation. Keep history to the current call (don't reference past calls).
- User name: [USER_NAME] (use naturally if provided).
- Account status: [STATUS] (use if relevant, e.g., "Your account is in good standing").

RESPONSE GUIDELINES:
1. Understand the user's intent (e.g., "What is my balance?" = inquiry, not a request to change it).
2. Answer directly. Lead with the answer, not the method.
   BAD: "I'll look up your account. One moment. Your balance is $100."
   GOOD: "Your balance is $100."
3. Include actionable next steps if relevant.
   GOOD: "Your balance is $100. Would you like to make a payment?"
4. For ambiguous input, ask clarifying questions naturally.
   GOOD: "Are you asking about your monthly bill or a specific charge?"
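
The bracketed slots in the template above have to be filled before the prompt reaches the LLM. Below is a minimal sketch of one way to do that. The `fill_template` helper, the shortened `VOICE_TEMPLATE` string, and the sample values are illustrative assumptions, not part of any library.

```python
import re

# Shortened copy of the template above, used only for this illustration.
VOICE_TEMPLATE = (
    "You are a helpful, friendly [ROLE] voice assistant named [NAME].\n"
    "User name: [USER_NAME] (use naturally if provided)."
)

def fill_template(template: str, values: dict) -> str:
    """Replace [PLACEHOLDER] slots; raise if a slot has no value."""
    def substitute(match):
        key = match.group(1)
        if key not in values:
            raise KeyError(f"No value supplied for placeholder [{key}]")
        return str(values[key])

    # Matches upper-case slot names such as [ROLE], [USER_NAME] or [TOOL A].
    return re.sub(r"\[([A-Z_ ]+)\]", substitute, template)

prompt = fill_template(
    VOICE_TEMPLATE,
    {"ROLE": "billing", "NAME": "Sam", "USER_NAME": "Alice"},
)
print(prompt)
```

Raising on a missing value is a deliberate choice here: a literal "[NAME]" reaching the TTS engine would be read aloud to the caller.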

Example: Customer Service Voice Agent

customer_service_prompt = """
You are a friendly customer service agent for an online retailer.

PERSONALITY:
- Sound warm and helpful, like a real person on the phone.
- Keep answers brief: typically 10-20 words.
- Use "I'll" and "you" naturally. Avoid robotic language.
- Don't repeat what the customer just said unless clarifying.

TOOLS:
You can:
- Check order status
- Process returns
- Update account information
- Offer refunds or exchanges

You cannot:
- Haggle on price (offer only standard discounts)
- Ship internationally (US only)

EXAMPLES OF GOOD RESPONSES:

User: "What's the status of my order?"
GOOD: "Your order #1234 is out for delivery today."
BAD: "I will check your order status. Your order #1234 is out for delivery."

User: "Can I return this?"
GOOD: "Yes, we accept returns within 30 days. Would you like to start the process?"
BAD: "Our return policy allows returns within 30 days of purchase."

User: "I'm not happy with this product."
GOOD: "I'm sorry to hear that. Can I help with a return or exchange?"
BAD: "I understand your dissatisfaction. Please let me know your issue."

CONTEXT:
- Maintain friendly tone but respect user boundaries (don't over-apologize).
- If the user seems frustrated, empathize briefly then move to solutions.
"""

Response Length Management

The most common mistake is responses that are too long for voice. Test every prompt by reading its responses aloud (or playing them through TTS):

import asyncio

# NOTE: `llm` and `tts` stand in for your own async LLM and TTS clients.

async def test_response_length(prompt, test_input):
    """
    Generate a response and test for voice-friendly length.
    """
    response = await llm.generate(prompt, test_input)

    # Count words
    word_count = len(response.split())

    # Estimate speech duration (average: 140 words/minute)
    estimated_seconds = word_count / (140 / 60)

    print(f"Response: {response}")
    print(f"Word count: {word_count}")
    print(f"Speech duration: {estimated_seconds:.1f} seconds")

    # Warn if too long
    if estimated_seconds > 20:
        print("WARNING: Response is too long for voice (>20 seconds)")

    # Actually test via TTS for naturalness
    audio = await tts.synthesize(response)
    print(f"TTS output: {len(audio)} bytes")

    return response

# Test (top-level await only works in a notebook or async REPL, so use asyncio.run)
test_input = "What's my account balance?"
asyncio.run(test_response_length(customer_service_prompt, test_input))

Handling Ambiguity and Clarification

Voice input is more ambiguous than text: speech recognition can confuse homophones (to/too/two), mis-transcribe accents, or pick up background noise. Prompts should guide graceful clarification:

ambiguity_prompt = """
You are a helpful voice assistant.

WHEN USER INPUT IS AMBIGUOUS:
1. First, assume the most likely intent (80% of the time, you'll be right).
2. If unsure, ask ONE clarifying question naturally.
   GOOD: "Did you mean the monthly bill or a specific charge?"
   BAD: "Please clarify: are you referring to X or Y?"
3. Avoid asking the user to repeat themselves unless it's a genuine mishearing.
   GOOD: "I'm not sure I caught that. Can you say it again?"
   BAD: "I didn't understand. Please repeat."

HOMOPHONES & MISHEARINGS:
- If a name or number is critical, confirm it back.
  GOOD: "Your order number is one-two-three-four. Is that right?"
- For ambiguous words (e.g., "their" vs "there"), rely on context.
"""

# Example interaction (`llm` is a placeholder for your async LLM client)
import asyncio

async def main():
    # Could mean: cancel current order, cancel future orders, cancel account
    user_says = "I want to cancel my order"
    response_prompt = ambiguity_prompt + f"\nUser said: {user_says}\nRespond naturally:"

    response = await llm.generate(response_prompt)
    print(response)
    # Possible output: "I can help with that. Which order would you like to cancel, or do you want to cancel your whole account?"

asyncio.run(main())
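
The confirm-it-back rule in the prompt above can also be enforced in code, so that critical numbers are always read digit by digit regardless of how the LLM phrases them. The following is a small sketch under that assumption; `spell_digits` and `confirmation_line` are hypothetical helper names.

```python
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def spell_digits(number: str) -> str:
    """Turn '1234' into 'one-two-three-four'; non-digit characters are skipped."""
    return "-".join(DIGIT_WORDS[int(ch)] for ch in number if ch in "0123456789")

def confirmation_line(label: str, number: str) -> str:
    """Build a spoken confirmation question for a critical number."""
    return f"Your {label} is {spell_digits(number)}. Is that right?"

print(confirmation_line("order number", "#1234"))
# prints: Your order number is one-two-three-four. Is that right?
```

Reading digits individually avoids the TTS engine choosing between "twelve thirty-four" and "one thousand two hundred thirty-four" on its own.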

Personality Variations

Different roles need different voices. Tailor prompts:

# "stock_phrases" are short on-brand phrases, not hesitation fillers like "um" or "uh".
PERSONALITY_TEMPLATES = {
    "friendly_support": {
        "tone": "warm, helpful, conversational",
        "contractions": True,
        "stock_phrases": ["I'm happy to help", "No problem at all"],
        "examples": [
            "Your order is on its way!",
            "I'll get that sorted for you right away.",
        ],
    },
    "professional_banking": {
        "tone": "clear, professional, trustworthy",
        "contractions": False,  # Use "I am" not "I'm"
        "stock_phrases": [],
        "examples": [
            "Your account balance is $1,234.56.",
            "I can help you with that transaction.",
        ],
    },
    "playful_games": {
        "tone": "enthusiastic, fun, engaging",
        "contractions": True,
        "stock_phrases": ["That's awesome!", "Nice try!"],
        "examples": [
            "You got it right! Next question...",
            "Oh no, that's not it. Want to try again?",
        ],
    },
}

def create_personality_prompt(role: str, base_prompt: str) -> str:
    """
    Customize a prompt with a specific personality.
    Unknown roles return the base prompt unchanged.
    """
    if role not in PERSONALITY_TEMPLATES:
        return base_prompt

    template = PERSONALITY_TEMPLATES[role]
    style = "contractions" if template["contractions"] else "full words"
    phrases = ", ".join(template["stock_phrases"]) or "None"
    examples = "\n".join(f"- {ex}" for ex in template["examples"])

    # Kept flush-left so no indentation leaks into the prompt text.
    personality_section = f"""
PERSONALITY:
Tone: {template["tone"]}
Contractions: Use {style}.
Stock phrases to use naturally: {phrases}

EXAMPLE RESPONSES:
{examples}
"""

    return base_prompt + personality_section

Context and Conversation Memory

Voice agents often work with a limited context window compared to text chat. Manage context effectively:

class ContextManager:
    """
    Manages conversation history for a voice agent.
    Prunes history to fit token limits while preserving critical information.
    """

    def __init__(self, max_tokens=4000):
        self.max_tokens = max_tokens
        self.history = []
        self.user_info = {}  # Persistent data: name, account, preferences

    def add_turn(self, role: str, content: str):
        """Add a user or assistant turn to history."""
        self.history.append({"role": role, "content": content})

    @staticmethod
    def _estimate_tokens(messages):
        """Rough estimate: 1 token ≈ 4 characters."""
        return sum(len(m["content"]) for m in messages) / 4

    def get_messages_for_llm(self):
        """
        Format history for LLM, pruning old messages if needed.
        Keep recent turns + essential user info.
        """
        # Add system message with user info
        system_msg = f"User: {self.user_info.get('name', 'Unknown')}, Account: {self.user_info.get('account', 'N/A')}"
        messages = [{"role": "system", "content": system_msg}]

        # Add recent conversation history: the last 6 turns (3 exchanges)
        messages.extend(self.history[-6:])

        # Prune the oldest history message until under budget,
        # always keeping the system message and the most recent turn
        while len(messages) > 2 and self._estimate_tokens(messages) > self.max_tokens:
            messages.pop(1)

        return messages

# Usage
ctx = ContextManager()
ctx.user_info = {"name": "Alice", "account": "premium"}

ctx.add_turn("user", "What's my balance?")
ctx.add_turn("assistant", "Your balance is $500.")
ctx.add_turn("user", "Can I get a refund?")

messages = ctx.get_messages_for_llm()
# messages = [system, user1, asst1, user2] (history is capped at the last 6 turns, plus system)

Testing Prompts with TTS

Always test prompts by converting LLM responses to speech. Phrasing that looks fine in text can sound awkward aloud:

import asyncio

# NOTE: `llm` and `tts` stand in for your own async LLM and TTS clients;
# the `voice` and `speed` arguments depend on your TTS provider.

async def evaluate_prompt_quality(prompt: str, test_cases: list):
    """
    Test a prompt with multiple inputs via TTS.
    Listen for naturalness, length, clarity.
    """
    results = []

    for test_input in test_cases:
        # Generate response
        response = await llm.generate(prompt, test_input)

        # Synthesize speech
        audio = await tts.synthesize(response, voice="alloy", speed=1.0)

        # Save for manual review
        filename = f"test_{len(results)}.wav"
        with open(filename, "wb") as f:
            f.write(audio)

        # Estimate duration and quality metrics.
        # Assumes raw 16 kHz, 16-bit mono PCM; adjust for your TTS output format.
        duration_sec = len(audio) / (16000 * 2)
        word_count = len(response.split())

        results.append({
            "input": test_input,
            "output": response,
            "duration_sec": duration_sec,
            "word_count": word_count,
            "audio_file": filename,
        })

        print(f"Input: {test_input}")
        print(f"Response: {response}")
        print(f"Duration: {duration_sec:.1f}s, Words: {word_count}")
        print()

    return results

# Test cases for a customer service agent
test_cases = [
    "What's my account balance?",
    "I want to return my order",
    "Can you help me with my billing?",
    "When will my package arrive?",
]

results = asyncio.run(evaluate_prompt_quality(customer_service_prompt, test_cases))

# Manually review audio files; iterate on prompt until satisfied
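
Listening to every file does not scale, so it can help to run a cheap text-level check first and only listen to responses that pass. The sketch below is one possible pre-filter, not a replacement for listening. The `lint_voice_response` name, the filler list, the preamble phrases, and the 20-word default are assumptions drawn from the guidelines earlier in this article.

```python
import re

FILLERS = {"uh", "um", "hmm"}
PREAMBLES = ("i'll look", "i will check", "one moment", "thank you for asking")
WORDS_PER_MINUTE = 140  # same speaking-rate estimate used earlier

def lint_voice_response(text: str, max_words: int = 20) -> list:
    """Return human-readable warnings for a single response (empty list = OK)."""
    warnings = []

    # Length check, with a rough spoken-duration estimate.
    word_count = len(text.split())
    if word_count > max_words:
        seconds = word_count / (WORDS_PER_MINUTE / 60)
        warnings.append(f"{word_count} words (~{seconds:.0f}s spoken); limit is {max_words}")

    # Filler check on whole lower-cased words only.
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    found = sorted(tokens & FILLERS)
    if found:
        warnings.append("filler words: " + ", ".join(found))

    # Preamble check: the answer should come first.
    if text.lower().startswith(PREAMBLES):
        warnings.append("starts with preamble instead of the answer")

    return warnings

print(lint_voice_response("Your balance is $100."))
print(lint_voice_response("Um, I will check. One moment."))
```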

Key Takeaways

  • Voice prompts must enforce brevity (10–20 words typical) and naturalness. Text-optimized prompts often sound robotic aloud.
  • Use contractions, conversational language, and direct answers. Avoid meta-commentary ("I can help by...") and filler words.
  • Personality templates (friendly, professional, playful) can be parameterized into prompts for different use cases.
  • Manage conversation context to stay within token limits. Voice calls accumulate history slowly (15–30 seconds per turn), so context pruning is usually less critical than in text chat.
  • Test all prompts via TTS before deployment. Listen for naturalness, pacing, and clarity. Iterate based on how responses sound aloud, not just text appearance.

Frequently Asked Questions

How long should a voice agent response be?

Aim for 10–30 words, which is roughly 4–13 seconds at 140 words per minute. Simple answers should land near the short end; complex explanations can run to about 20 seconds. Anything over 30 seconds risks boring the user, so break long explanations into multiple exchanges.

Should I add "um" or "uh" for naturalness?

No. Fillers are quirks of human speech that tend to sound artificial when synthesized, and listeners don't expect an AI to hesitate. Keep speech fluid without filler.

How do I handle domain-specific terminology?

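Spell uncommon words the way they should be spoken. For an acronym that is pronounced as a word, write it phonetically (e.g., "HIP-uh" for "HIPAA"); for one that is read letter by letter, separate the letters with hyphens so the TTS engine spells it out. Alternatively, map technical terms to plain language: "LIDAR" -> "light-based radar."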
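
Instead of relying on the prompt alone, the same mapping can be applied to the LLM's text just before it is sent to TTS. This is a minimal sketch of that idea. The `apply_pronunciations` helper and the lexicon entries are illustrative assumptions, and the spoken forms should be tuned by ear for your TTS voice.

```python
import re

# Hypothetical lexicon: written form -> form to be spoken.
PRONUNCIATIONS = {
    "LIDAR": "light-based radar",
    "IEEE": "eye triple E",
}

def apply_pronunciations(text: str, lexicon: dict) -> str:
    """Replace whole-word, case-sensitive matches of each term in the lexicon."""
    if not lexicon:
        return text
    # Longest terms first so overlapping entries resolve predictably.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    return pattern.sub(lambda m: lexicon[m.group(1)], text)

print(apply_pronunciations("The LIDAR unit follows IEEE standards.", PRONUNCIATIONS))
# prints: The light-based radar unit follows eye triple E standards.
```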

Can I adjust TTS voice or speed per response?

Yes. Some TTS APIs allow per-utterance speed and pitch control. Use slower speech for technical information and faster speech for casual conversation. However, consistency is usually better: pick a voice and stick with it.
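
If your TTS engine accepts SSML (many do, but check its documentation), per-utterance speed can be expressed with the standard `prosody` element rather than a provider-specific parameter. The sketch below assumes such an engine. The `to_ssml` helper is a hypothetical name, and some engines additionally require `version` and `xmlns` attributes on `<speak>`.

```python
from xml.sax.saxutils import escape

# Named rate values defined by the SSML prosody element.
ALLOWED_RATES = {"x-slow", "slow", "medium", "fast", "x-fast"}

def to_ssml(text: str, rate: str = "medium") -> str:
    """Wrap text in an SSML prosody element with the given speaking rate."""
    if rate not in ALLOWED_RATES:
        raise ValueError(f"Unsupported rate: {rate!r}")
    # escape() protects &, < and > so the text cannot break the markup.
    return f'<speak><prosody rate="{rate}">{escape(text)}</prosody></speak>'

print(to_ssml("Your confirmation code is 4 7 2 9.", rate="slow"))
```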

How do I handle user frustration or anger?

Prompt the LLM to detect emotional cues and respond empathetically (but not over-apologetically):

If the user sounds angry, respond with:
- ONE empathetic phrase ("I understand this is frustrating")
- Direct problem-solving ("Here's what I can do...")
- No repeated apologies (sounds insincere).

Further Reading