Text-to-Speech (TTS)
Simple Definition
Text-to-speech (TTS) is AI that converts written text into spoken audio. You provide text, and the AI produces a realistic-sounding voice reading it aloud.
Modern AI TTS sounds far more natural than the robotic computer voices of the past. Tools like ElevenLabs can produce voices that listeners often struggle to tell apart from a real human.
How It Works
Modern TTS systems use deep learning to model how human speech sounds: the rhythm, intonation, emphasis, and natural variation in real voices. They’re trained on recordings of human speech, typically paired with their transcripts, and learn to map written text to matching speech patterns.
Some systems can also clone voices: given a sample of a specific person’s voice, they can generate new speech in that voice.
Leading TTS Tools
- ElevenLabs — AI voice platform known for voice quality and realism
- OpenAI TTS — fast, high-quality, available via API
- Google Cloud TTS — extensive language support
- Amazon Polly — AWS-native TTS service
- Play.ht — voice cloning and podcast-focused
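To make "available via API" concrete, here is a minimal sketch of what a TTS request can look like, using OpenAI's speech endpoint and only the Python standard library. The function names `build_speech_request` and `synthesize_to_file` are hypothetical helpers written for this example, and the `tts-1` model and `alloy` voice are assumed defaults that may change; check the provider's current documentation before relying on them.

```python
import json
import urllib.request

# OpenAI's speech endpoint; other providers use different URLs and payloads.
API_URL = "https://api.openai.com/v1/audio/speech"


def build_speech_request(text, api_key, model="tts-1", voice="alloy"):
    """Build (but do not send) an HTTP request asking the API to speak `text`.

    `model` and `voice` are assumed defaults for illustration.
    """
    payload = {"model": model, "voice": voice, "input": text}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(text, out_path, api_key):
    """Send the request and save the returned audio bytes to `out_path`."""
    request = build_speech_request(text, api_key)
    with urllib.request.urlopen(request) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path
```

With a valid key, calling `synthesize_to_file("Hello from text-to-speech.", "hello.mp3", api_key)` would write the spoken audio to `hello.mp3`. The request is built separately from the network call so the payload can be inspected without spending API credits.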
Use Cases
- Video voiceovers — narrate videos without recording equipment
- Podcast content — generate audio from written scripts
- Accessibility — read content aloud for visually impaired users
- E-learning — narrate courses and educational content
- Audiobooks — produce audio versions of written content
- Customer service — power voice bots and IVR systems
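Long-form use cases such as audiobooks and podcast scripts usually exceed what a TTS service accepts in one request, so the text is commonly split into smaller pieces first. The sketch below shows one simple way to do that at sentence boundaries. The helper name `split_script` is hypothetical, and the `max_chars` default of 4096 is an assumption; use the limit your provider documents.

```python
import re


def split_script(text, max_chars=4096):
    """Split text into chunks of at most `max_chars`, breaking at sentence ends.

    `max_chars` is an assumed per-request limit; adjust it for your provider.
    """
    # Split after ., ! or ? followed by whitespace (a rough sentence boundary).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        # Fallback: a single sentence longer than the limit is hard-split.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent to the TTS service in order and the resulting audio files joined. Breaking at sentence ends rather than at a fixed character count helps avoid pauses or cut-offs in the middle of a sentence.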
Important Considerations
Voice cloning raises ethical and legal questions around consent and misuse. Responsible use means only cloning voices you have permission to use.
Related Terms
- Speech-to-Text — the reverse: converting audio into text
- Generative AI — TTS is a form of audio generative AI
- Multimodal AI — AI that handles audio alongside text and images