Speech-to-Text (STT)

Simple Definition

Speech-to-text (STT), also called automatic speech recognition (ASR) or transcription, converts spoken words into written text. You speak or provide an audio file, and the system returns a text transcript.

It is the speech-recognition layer behind voice assistants such as Siri and Alexa, Google voice search, and meeting transcription tools like Otter.ai.

How It Works

Modern STT systems use deep learning to map audio signals to words. They are trained on large amounts of audio paired with transcripts, from which they learn how speech sounds correspond to words and how surrounding context makes one word more likely than another.
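
As a rough illustration of the "audio to words" step: many neural speech models emit one best-guess symbol for each short slice of audio, and a decoding rule turns those guesses into text. The sketch below shows greedy CTC-style decoding, which is one common rule among several and not necessarily what any tool named on this page uses. The frame labels are made up for illustration, and `ctc_greedy_decode` is a hypothetical helper, not part of any library.

```python
BLANK = "_"  # assumed symbol meaning "no new character in this frame"

def ctc_greedy_decode(frame_labels):
    """Collapse per-frame labels into text: merge repeats, then drop blanks."""
    decoded = []
    previous = None
    for label in frame_labels:
        # Keep a label only when it differs from the previous frame's label
        # and is not the blank symbol.
        if label != previous and label != BLANK:
            decoded.append(label)
        previous = label
    return "".join(decoded)

# Hypothetical best-guess label for each audio frame of someone saying "hello".
frames = ["h", "h", "_", "e", "e", "_", "l", "l", "_", "l", "o", "o", "_"]
print(ctc_greedy_decode(frames))  # prints "hello"
```

The blank between the two "l" runs is what lets the decoder keep a genuine double letter instead of merging it away.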

More advanced systems also handle:

  • Multiple speakers (diarization, that is, labeling who spoke when)
  • Background noise
  • Different accents and languages
  • Real-time transcription
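
To make the diarization item concrete, here is a minimal sketch of one way speaker labels could be attached to a transcript: each transcribed segment is given the speaker whose turn overlaps it the most. The tuple formats, the timings, and the `assign_speakers` helper are all hypothetical. Real diarization systems derive the speaker turns from the audio itself, which is not shown here.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the overlap between two time intervals (0 if none)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(segments, turns):
    """Label each transcript segment with the best-overlapping speaker.

    segments: list of (start, end, text) from a transcriber (assumed format).
    turns: list of (start, end, speaker) from a diarizer (assumed format).
    """
    labeled = []
    for seg_start, seg_end, text in segments:
        best_speaker, best_overlap = "unknown", 0.0
        for turn_start, turn_end, speaker in turns:
            shared = overlap(seg_start, seg_end, turn_start, turn_end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        labeled.append((best_speaker, text))
    return labeled

# Made-up sample data: two transcript segments and two speaker turns.
segments = [(0.0, 2.5, "Shall we start?"), (2.6, 5.0, "Yes, go ahead.")]
turns = [(0.0, 2.55, "Speaker A"), (2.55, 5.2, "Speaker B")]
for speaker, text in assign_speakers(segments, turns):
    print(f"{speaker}: {text}")
```

Segments that overlap no speaker turn are labeled "unknown" rather than guessed.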

Leading STT Tools

  • Whisper (OpenAI) — open-source, highly accurate, supports many languages
  • Otter.ai — meeting transcription and collaboration
  • Fireflies.ai — meeting notes and action items
  • Deepgram — fast, API-first, real-time capable
  • Google Speech-to-Text — enterprise scale, multilingual
  • Rev — high-accuracy transcription service
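
Accuracy claims for transcription tools are commonly expressed as word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the system's transcript into a reference transcript, divided by the number of reference words. The sketch below computes it with a standard edit-distance table. The function name and sample sentences are illustrative only, and no figures for the tools above are implied.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)

# Made-up example: one substituted word ("the" -> "a") and one dropped word ("by").
reference = "please send the report by friday"
hypothesis = "please send a report friday"
print(f"WER: {word_error_rate(reference, hypothesis):.2f}")
```

Lower is better: 0.0 means the transcript matches the reference word for word, and the value can exceed 1.0 when the system inserts many extra words.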

Use Cases

  • Meeting transcription — automatic notes from Zoom, Teams, or Google Meet
  • Podcast and video transcripts — make audio content searchable and accessible
  • Dictation — write by speaking instead of typing
  • Voice assistants — Siri, Alexa, Google Assistant
  • Accessibility — enable hearing-impaired users to follow conversations
  • Customer call analysis — transcribe and analyze support calls
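
As one illustration of making audio "searchable", the sketch below looks up a keyword in a timestamped transcript and reports where in the recording it was said. The segment format (start time in seconds plus text) is an assumption, and `find_mentions` and `format_timestamp` are hypothetical helpers rather than part of any tool listed above.

```python
def format_timestamp(seconds):
    """Render a time offset in seconds as MM:SS (fractions are truncated)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"

def find_mentions(segments, keyword):
    """Return (timestamp, text) for each segment whose text contains the keyword."""
    needle = keyword.lower()
    return [
        (format_timestamp(start), text)
        for start, text in segments
        if needle in text.lower()
    ]

# Made-up transcript segments: (start time in seconds, text).
segments = [
    (12.0, "Welcome back to the show."),
    (75.4, "Today we talk about pricing."),
    (190.9, "So the pricing page needs a redesign."),
]
for stamp, text in find_mentions(segments, "pricing"):
    print(stamp, text)
```

The same lookup applies to support-call transcripts, for example to find every call segment that mentions a given product name.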
