Glossary

Speech-to-Text (ASR)

The technology that transcribes spoken audio into text in real time — also called automatic speech recognition (ASR). One of the three core layers of voice AI.

Speech-to-text (STT), also called ASR, converts the caller’s audio stream into text that the language model can process. Modern STT services (Deepgram, OpenAI Whisper, Google Speech-to-Text, Azure Speech, ElevenLabs Scribe) can operate in streaming mode, transcribing as the caller speaks rather than after they finish. They handle 30+ languages and achieve 95%+ accuracy on clear phone-quality audio, with degradation on noisy lines, heavy accents, or specialized vocabulary. STT quality is the single most underrated determinant of overall AI voice agent performance: if the agent mishears the caller, every downstream layer inherits the error.
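
Accuracy figures like the one above are commonly reported as one minus the word error rate (WER), although this entry does not say which metric its 95%+ figure uses, so that reading is an assumption. The sketch below shows one way WER could be computed: a word-level edit distance (substitutions, insertions, deletions) divided by the number of words in the reference transcript. The function name word_error_rate and the sample sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one misheard word out of eight.
said = "i need to reschedule my appointment for tuesday"
heard = "i need to schedule my appointment for tuesday"
print(word_error_rate(said, heard))  # 0.125
```

A single substitution here gives a WER of 0.125, or 87.5% word accuracy, yet the one wrong word ("schedule" instead of "reschedule") changes what the caller is asking for. This is why a headline accuracy figure alone says little about downstream behaviour.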

Why it matters

  • STT accuracy directly drives intent classification accuracy: a bad transcription produces a bad response.
  • Streaming STT enables sub-second response latency, whereas batch mode must wait until the caller finishes speaking before it can transcribe.
  • Multi-language STT is the foundation of multilingual call handling.
  • Domain-specific vocabulary (medical terms, legal terminology, brand names) often requires STT customization.
  • Phone-line audio (sampled at 8 kHz) is harder to transcribe than studio audio (16+ kHz), so pick STT providers that optimize for telephony.
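
The point about domain-specific vocabulary can be illustrated with a small sketch. Provider-side customization (custom vocabularies, keyword boosting and the like) differs by vendor and is not shown here. Instead, this example swaps in a generic technique: a fuzzy-matching pass over the finished transcript using Python's standard difflib module. The function name correct_terms, the sample vocabulary, and the 0.8 similarity cutoff are all assumptions chosen for illustration.

```python
import difflib


def correct_terms(transcript: str, vocabulary: list[str], cutoff: float = 0.8) -> str:
    """Replace words that closely resemble a known domain term.

    Assumption: single-word terms only, and the transcript is split on
    whitespace with no punctuation handling.
    """
    # Compare in lowercase, but emit each term in its canonical spelling.
    canonical = {term.lower(): term for term in vocabulary}
    candidates = list(canonical)

    corrected = []
    for word in transcript.split():
        match = difflib.get_close_matches(word.lower(), candidates, n=1, cutoff=cutoff)
        corrected.append(canonical[match[0]] if match else word)
    return " ".join(corrected)


# Hypothetical medical vocabulary and a transcript with one misheard term.
terms = ["metformin", "lisinopril"]
print(correct_terms("patient takes metforman daily", terms))
```

A pass like this is only a fallback. It can over-correct ordinary words that happen to resemble a domain term, so the cutoff would need tuning against real call transcripts, and customization inside the STT provider (where available) is likely the better first step.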