Speech-to-Text (ASR)
The technology that transcribes spoken audio into text in real time, also called automatic speech recognition (ASR). It is one of the three core layers of voice AI.
Speech-to-text (STT), also called ASR, converts the caller’s audio stream into text that the language model can process. Modern STT services (Deepgram, OpenAI Whisper, Google Speech-to-Text, Azure Speech, ElevenLabs Scribe) can operate in streaming mode, transcribing as the caller speaks rather than after they finish. They handle 30+ languages and achieve 95%+ accuracy on clear phone-quality audio, where accuracy is usually measured at the word level (1 minus the word error rate). Accuracy degrades on noisy lines, heavy accents, or specialized vocabulary. STT quality is the single most underrated determinant of overall AI voice agent performance: if the agent mishears the caller, every downstream layer works from the wrong input and fails.
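To make the accuracy figure concrete, here is a minimal sketch of how word error rate (WER) is commonly computed: the word-level edit distance between a reference transcript and the STT output, divided by the number of reference words. The function name `word_error_rate` and the sample sentences are illustrative assumptions, not taken from any provider's tooling, and real evaluations usually normalize punctuation and numbers more carefully than this.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Hypothetical helper for illustration; normalization here is only
    lowercasing and whitespace splitting.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: a drug name split into two wrong words.
reference = "i need to refill my metformin prescription today"
hypothesis = "i need to refill my met forming prescription today"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.2f}, word accuracy: {1 - wer:.2f}")
```

In this made-up example a single misheard term costs two word errors out of eight reference words, and it happens to be the one word the agent most needs to get right. That is why an aggregate accuracy figure can hide failures on specialized vocabulary.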
Why it matters
- STT accuracy directly drives intent classification accuracy: a bad transcription produces bad responses.
- Streaming STT enables sub-second response latency, whereas batch mode cannot start transcribing until the caller has finished speaking.
- Multi-language STT is the foundation of multilingual call handling.
- Domain-specific vocabulary (medical terms, legal terminology, brand names) often requires STT customization.
- Phone-line audio (typically sampled at 8 kHz) is harder to transcribe than wideband audio (16 kHz and above), so pick STT providers that optimize for telephony.
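The streaming and telephony points above can be illustrated with some simple arithmetic. The sketch below assumes 8 kHz, 16-bit linear PCM audio cut into 20 ms frames. These are common telephony choices, but they are assumptions here, and the names `frame_size_bytes`, `iter_frames`, and `nyquist_hz` are hypothetical. It shows how small each streamed chunk is and why an 8 kHz signal cannot carry frequencies above 4 kHz.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 8000   # narrowband telephony sampling rate
BYTES_PER_SAMPLE = 2    # assumption: 16-bit linear PCM
FRAME_MS = 20           # assumption: 20 ms per streamed frame


def frame_size_bytes(sample_rate_hz: int = SAMPLE_RATE_HZ,
                     frame_ms: int = FRAME_MS,
                     bytes_per_sample: int = BYTES_PER_SAMPLE) -> int:
    """Number of bytes in one frame of audio."""
    samples_per_frame = sample_rate_hz * frame_ms // 1000
    return samples_per_frame * bytes_per_sample


def iter_frames(pcm: bytes, frame_bytes: int) -> Iterator[bytes]:
    """Yield consecutive frames; the last one may be shorter."""
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]


def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest frequency a sampled signal can represent."""
    return sample_rate_hz / 2


# One second of silence stands in for real caller audio.
one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
frames = list(iter_frames(one_second, frame_size_bytes()))

print(frame_size_bytes())   # bytes per 20 ms frame
print(len(frames))          # frames per second of audio
print(nyquist_hz(8000), nyquist_hz(16000))
```

In a streaming setup, each frame would be sent to the STT service as soon as it is captured, so transcription can begin while the caller is still talking. A batch setup would wait for the whole `one_second` buffer, or the whole utterance, before sending anything.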