Training Data
Simple Definition
Training data is the dataset an AI model learns from. During training, the model works through millions or billions of examples from this data and repeatedly adjusts its internal parameters to make better predictions.
For a language model like GPT-4 or Claude, training data includes billions or even trillions of words from books, websites, code repositories, and other text sources.
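The idea of adjusting parameters to make better predictions can be sketched at toy scale. The example below is a minimal illustration, not how any production model is trained: it assumes a made-up dataset that follows y = 2x, a model with a single parameter w, and plain gradient descent. The names train, training_data, and learning_rate are chosen here for illustration only.

```python
# Toy training data: inputs paired with the outputs the model should predict.
# The values are made up for illustration; they follow y = 2x exactly.
training_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]


def train(data, epochs=200, learning_rate=0.01):
    """Fit a one-parameter model y = w * x with plain gradient descent."""
    w = 0.0  # the model's single internal parameter, starting uninformed
    for _ in range(epochs):
        for x, y in data:
            prediction = w * x
            error = prediction - y
            # Gradient of the squared error (error ** 2) with respect to w.
            gradient = 2 * error * x
            # Nudge the parameter in the direction that reduces the error.
            w -= learning_rate * gradient
    return w


learned_w = train(training_data)
print(f"learned weight: {learned_w:.3f}")
```

The learned weight ends up close to 2 because that is the pattern in the data. With different training data, the same loop would learn a different weight, which is the sense in which the data shapes the model.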
Why Training Data Matters
The data shapes everything about a model:
- What it knows — a model’s built-in knowledge comes only from what was in its training data
- What it does well — more examples of a task → better performance on that task
- Its biases — patterns in the data become patterns in the model’s outputs
- Its knowledge cutoff — training data is collected up to a certain date, so the model doesn’t know about events after it
Types of Training Data for LLMs
- Web text — scraped websites (Common Crawl, etc.)
- Books — broad vocabulary and reasoning patterns
- Code — improves coding ability
- Conversations — helps the model respond naturally
- Curated human feedback — used in reinforcement learning from human feedback (RLHF) to align the model with human preferences
Data Quality vs. Quantity
More data isn’t always better. Research on dataset curation suggests that high-quality, curated data often outperforms larger but noisier datasets. This is why newer models focus as much on data curation as on data scale.
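As a rough illustration of what curation can mean in practice, the sketch below drops duplicates and very short documents from a small list of texts. The rules (a five-word minimum, matching duplicates after lowercasing and whitespace normalization) and the function name clean_corpus are assumptions made for this example. Real curation pipelines are far more involved.

```python
def clean_corpus(documents, min_words=5):
    """Drop near-identical duplicates and very short documents.

    The thresholds and rules here are hypothetical, chosen for illustration.
    """
    seen = set()
    kept = []
    for doc in documents:
        # Normalize case and whitespace so trivial variants count as duplicates.
        normalized = " ".join(doc.lower().split())
        if len(normalized.split()) < min_words:
            continue  # too short to carry much signal
        if normalized in seen:
            continue  # already kept an equivalent document
        seen.add(normalized)
        kept.append(doc)
    return kept


raw_documents = [
    "Training data is the dataset an AI model learns from.",
    "training data is the dataset an AI model   learns from.",
    "Click here!",
    "High-quality examples help a model learn useful patterns.",
]
cleaned = clean_corpus(raw_documents)
print(f"kept {len(cleaned)} of {len(raw_documents)} documents")
```

The cleaned list is smaller than the raw one, but every remaining document is distinct and long enough to be informative. That is the trade-off between quantity and quality in miniature.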
Training Data and Bias
If training data over-represents certain viewpoints, demographics, or writing styles, the model tends to reproduce those biases in its outputs. This is one of the core challenges in building fair and reliable AI systems.
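One simple way to spot over-representation is to count how often each category appears in a dataset. The sketch below assumes each example carries a hypothetical "language" label, and the function name category_shares is invented for this example. Real datasets often lack such labels, and measuring bias usually takes much more than counting.

```python
from collections import Counter


def category_shares(examples, key):
    """Return the fraction of examples that fall into each category of `key`."""
    counts = Counter(example[key] for example in examples)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


# Made-up examples; the "language" field is an assumed label for illustration.
examples = [
    {"text": "The weather is nice today.", "language": "en"},
    {"text": "Models learn from data.", "language": "en"},
    {"text": "This is another sentence.", "language": "en"},
    {"text": "One more English example.", "language": "en"},
    {"text": "Los modelos aprenden de los datos.", "language": "es"},
]

shares = category_shares(examples, "language")
for language, share in sorted(shares.items()):
    print(f"{language}: {share:.0%}")
```

A skew like this one, where one category supplies most of the examples, suggests that a model trained on the data would see far more of that category than of the others.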
Related Terms
- Machine Learning — the process that uses training data to build models
- Fine-Tuning — additional training on a smaller, specialized dataset
- Bias in AI — how unbalanced training data produces skewed outputs
- LLM — large language models trained on massive text datasets