Transformer (AI Architecture)
Simple Definition
The transformer is a type of neural network architecture that has become the foundation for almost all modern AI language models. GPT, Claude, Gemini, Llama — they all use transformer-based architectures.
It was introduced in the 2017 paper “Attention Is All You Need” by researchers at Google, and it revolutionized natural language processing.
What Made Transformers Different
Before transformers, most language models were recurrent networks (RNNs and LSTMs) that processed text sequentially, one word at a time, like reading left to right. This made training slow and made it hard to capture long-range relationships between words.
Transformers are built around self-attention: the ability for every word in a sequence to directly consider every other word at the same time. Attention had been used alongside recurrent models before, but the transformer dropped recurrence and relied on attention alone. This lets the model handle a sentence like “The trophy didn’t fit in the bag because it was too big” and work out that “it” refers to “trophy,” not “bag,” by looking at all the words simultaneously.
Key Component: Attention
The “attention mechanism” is the core innovation. It lets the model weigh how relevant each word is to every other word when building its understanding.
For example, in “The cat sat on the mat,” the word “sat” might attend strongly to “cat” (who’s sitting?) and “mat” (sat on what?). The actual weights are learned during training rather than written by hand.
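The weighing described above can be sketched as scaled dot-product attention. This is a minimal illustration in plain Python, not a production implementation: the function names and the 2-dimensional token vectors are made up for the example, and the same vectors are reused as queries, keys, and values, whereas a real transformer derives each of those from learned projections.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Scaled dot-product attention over a whole sequence.

    Each argument is a list of vectors, one per token.
    Returns (outputs, weights); weights[i][j] is how much
    token i attends to token j.
    """
    scale = math.sqrt(len(keys[0]))
    width = len(values[0])
    outputs, weights = [], []
    for q in queries:
        # Score this token against every token, then normalize.
        row = softmax([dot(q, k) / scale for k in keys])
        # The output is a weighted average of the value vectors.
        mixed = [sum(w * v[d] for w, v in zip(row, values))
                 for d in range(width)]
        outputs.append(mixed)
        weights.append(row)
    return outputs, weights

# Hypothetical vectors for three tokens; real models learn these.
tokens = ["cat", "sat", "mat"]
vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
outputs, weights = self_attention(vectors, vectors, vectors)

for token, row in zip(tokens, weights):
    print(token, [round(w, 3) for w in row])
```

Each row of `weights` sums to 1, so every token's output is a blend of all the tokens in the sequence, weighted by relevance. With these toy vectors, "sat" splits its remaining attention evenly between "cat" and "mat"; that symmetry is a property of the chosen numbers, not of language.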
Why Transformers Scale So Well
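During training, transformers process all the tokens in a sequence in parallel, unlike sequential models that must step through one token at a time, so they can be trained much faster on modern hardware such as GPUs. (When generating text, they still produce output one token at a time.) This scalability is why it became practical to train models on hundreds of billions, and more recently trillions, of tokens.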
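The difference comes down to data dependencies, which a toy sketch can make concrete. Both functions below are simplified stand-ins invented for illustration (single numbers instead of vectors, no learned weights): the recurrent loop needs the previous step's result before it can continue, while each attention-style output depends only on the inputs, so the positions can be computed in any order or all at once.

```python
import math

def recurrent_states(inputs, decay=0.5):
    """Toy recurrent model: each state depends on the previous state."""
    state, states = 0.0, []
    for x in inputs:  # must run in order: step t needs step t-1
        state = math.tanh(decay * state + x)
        states.append(state)
    return states

def position_output(inputs, i):
    """Toy attention-style output for position i.

    A softmax-weighted average of all inputs. It reads only the
    inputs, never another position's output.
    """
    scores = [inputs[i] * x for x in inputs]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return sum(e / total * x for e, x in zip(exps, inputs))

inputs = [0.2, -1.0, 0.7, 1.5]
positions = range(len(inputs))

sequential = recurrent_states(inputs)

# Positions are independent, so the order of computation is irrelevant.
in_order = [position_output(inputs, i) for i in positions]
backwards = [position_output(inputs, i) for i in reversed(positions)]
backwards.reverse()

print(in_order == backwards)
```

Because no position waits on another, hardware built for parallel work can compute them all in a single batched operation, which is where the training speedup comes from.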
Related Terms
- LLM — large language models built on transformers
- Neural Network — the broader category transformers belong to
- Deep Learning — the field transformers are central to
- GPT — OpenAI’s transformer-based model family