Glossary

Text-to-Speech (TTS)

The technology that generates natural-sounding spoken audio from text — completing the voice AI loop alongside speech-to-text and language models.

Text-to-speech (TTS) converts the language model’s text response into spoken audio delivered to the caller. Modern TTS engines (ElevenLabs, Cartesia, OpenAI Voice, Azure Neural Voices, PlayHT) can be hard to distinguish from a human voice on clean phone audio, support 30+ languages, allow custom voice cloning (your own brand voice, or an actor’s voice with permission), and can deliver first audio in under 300ms, fast enough for natural conversational pacing. The choice of voice strongly shapes caller perception: a "professional warm female voice" versus a "cheery young male voice" can measurably change customer satisfaction (CSAT) scores, and a consistent brand voice across all touchpoints is increasingly a competitive advantage.

Why it matters

  • Voice quality directly drives caller trust and CSAT.
  • Brand-voice consistency across web, app, and phone is a differentiator in mature markets.
  • Multilingual TTS opens markets without bilingual hiring.
  • The TTS latency budget (target: first audio in under 300ms) is critical for conversational feel.
  • Custom voice cloning gives a deployment a distinct personality among competing AI voice products.
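
As a rough illustration of the latency point above, the following Python sketch measures time to first audio on a streaming TTS response and compares it with the 300ms target. It is a minimal sketch under stated assumptions, not any vendor's API: `fake_tts_stream` is a hypothetical stand-in for a real streaming client, and `measure_first_audio_ms` and `within_budget` are illustrative names introduced here.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple

# Target from the list above: first audio in under 300 ms.
FIRST_AUDIO_BUDGET_MS = 300.0


def measure_first_audio_ms(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, bytes]:
    """Return (milliseconds until the first non-empty chunk, that chunk)."""
    start = clock()
    for chunk in chunks:
        if chunk:  # skip empty keep-alive chunks
            return (clock() - start) * 1000.0, chunk
    raise ValueError("stream ended without producing audio")


def within_budget(latency_ms: float, budget_ms: float = FIRST_AUDIO_BUDGET_MS) -> bool:
    """True when first audio arrived strictly inside the budget."""
    return latency_ms < budget_ms


def fake_tts_stream(text: str, startup_delay_s: float = 0.05) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS client (assumption).

    It waits briefly to simulate synthesis start-up, then yields one
    placeholder "audio" chunk per word.
    """
    time.sleep(startup_delay_s)
    for word in text.split():
        yield word.encode("utf-8")


if __name__ == "__main__":
    latency_ms, _first_chunk = measure_first_audio_ms(
        fake_tts_stream("Thanks for calling, how can I help?")
    )
    print(f"first audio after {latency_ms:.0f} ms; within budget: {within_budget(latency_ms)}")
```

In a real deployment the iterable would be the audio chunk stream returned by the chosen TTS provider, and the timer would ideally start when the language model's text is handed to the TTS engine, so that the measurement reflects what the caller actually hears.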