Paper Overview
“Attention Is All You Need”. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Google Brain / Google Research. Advances in Neural Information Processing Systems (NeurIPS 2017).
Why it matters: This paper introduced the Transformer architecture — the foundation of every modern LLM, including GPT-4, Claude, Gemini, and Llama. Before this paper, sequence modeling relied on RNNs and LSTMs. After it, transformers became the universal architecture for NLP, then vision, then biology, then code.
Citations (as of 2025): 130,000+. Arguably the most impactful ML paper of the 2010s.
The Problem They Were Solving
In 2017, the state of the art for sequence-to-sequence tasks (translation, summarization) was encoder-decoder architectures built on RNNs (Recurrent Neural Networks) and LSTMs.
The problems:
- Sequential processing — RNNs process one token at a time, left to right. You can’t parallelize this, so training is slow.
- Vanishing gradients — Information from early tokens fades through many recurrent steps. Long-range dependencies are hard to learn.
- Limited context — Even with LSTMs and attention mechanisms bolted on, capturing relationships across long sequences was difficult.
The question the authors asked: “What if we remove recurrence entirely and rely solely on attention?”
The Core Contribution: Self-Attention
The key innovation is scaled dot-product attention:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) × V
Where:
- Q (Query): what each position is “looking for”
- K (Key): what each position “offers”
- V (Value): the information each position carries
For each token, this computes a weighted average of the value vectors of all tokens (including itself), where the weights reflect relevance.
Intuition:
"The bank along the river was steep"
For "bank":
- Attends strongly to "river" (key clue → riverbank)
- Attends strongly to "steep" (describes a slope)
- Attends weakly to "The", "along", "was"
Result: "bank" is represented as a weighted blend of all positions,
concentrated on the contextually relevant ones.
This is O(n²) in sequence length but fully parallelizable — every token’s attention to every other token can be computed simultaneously on a GPU.
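The formula above can be sketched in a few lines of NumPy. This is a minimal single-sequence illustration (function and variable names are ours, not from the paper), omitting batching and the learned projections:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one sequence.

    Q, K: (seq_len, d_k); V: (seq_len, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq_len, seq_len) relevance scores
    scores -= scores.max(axis=-1, keepdims=True)     # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax: each row sums to 1
    return weights @ V                               # weighted blend of value vectors

# Tiny smoke test: 3 tokens, d_k = d_v = 4
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4)
```

Note that when all query-key scores are equal (e.g., Q is all zeros), the softmax weights are uniform and each output row is simply the mean of the value vectors, which matches the "weighted average" intuition above.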
Multi-Head Attention
Running attention once captures one type of relationship. The paper introduces multi-head attention — running h parallel attention operations with different learned projections:
MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) × W_O
where head_i = Attention(Q × W_i^Q, K × W_i^K, V × W_i^V)
In their experiments: h=8 heads, each projecting to d_k = d_model/h = 64 dimensions.
Why it helps: Different heads learn different relationship types:
- Head 1 might focus on subject-verb agreement
- Head 2 might focus on coreference (“it” → “the cat”)
- Head 3 might focus on positional relationships
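The project-split-attend-concat pattern can be sketched as follows. This is a simplified illustration, not the paper's implementation: the weight names follow the paper's W^Q, W^K, W^V, W^O, but slicing one fused projection into per-head columns is our shortcut:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """Project X, attend independently in h subspaces, concat, re-project.

    X: (seq_len, d_model); all weight matrices: (d_model, d_model).
    """
    seq_len, d_model = X.shape
    d_k = d_model // h
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    heads = []
    for i in range(h):
        s = slice(i * d_k, (i + 1) * d_k)            # this head's d_k-dim slice
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_k)
        heads.append(softmax(scores) @ V[:, s])       # (seq_len, d_k) per head
    return np.concatenate(heads, axis=-1) @ W_O       # (seq_len, d_model)

rng = np.random.default_rng(0)
d_model, h = 512, 8                                   # paper's base config: d_k = 512/8 = 64
X = rng.normal(size=(5, d_model))
Ws = [rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(4)]
out = multi_head_attention(X, *Ws, h=h)
print(out.shape)  # (5, 512)
```

Because each head attends over its own learned projection, the heads are free to specialize in the different relationship types listed above.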
The Full Architecture
INPUT → Embedding + Positional Encoding
↓
┌─────────────────────┐
│ ENCODER (×6) │
│ ┌─────────────────┐ │
│ │ Multi-Head Attn │ │
│ │ + Residual/Norm │ │
│ ├─────────────────┤ │
│ │ Feed-Forward │ │
│ │ + Residual/Norm │ │
│ └─────────────────┘ │
└─────────────────────┘
↓
┌─────────────────────┐
│ DECODER (×6) │
│ ┌─────────────────┐ │
│ │ Masked MH Attn │ │← causal masking
│ ├─────────────────┤ │
│ │ Cross-Attention │ │← attends to encoder
│ ├─────────────────┤ │
│ │ Feed-Forward │ │
│ └─────────────────┘ │
└─────────────────────┘
↓
Linear + Softmax → output probabilities
The encoder processes the source sequence (e.g., English sentence). The decoder generates the target sequence (e.g., German translation) token by token, attending to both its own previous outputs (masked self-attention) and the full encoder output (cross-attention).
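The causal masking in the decoder's self-attention can be shown concretely: future positions get a score of -inf before the softmax, so their weights become exactly zero. A minimal sketch (names ours):

```python
import numpy as np

def causal_attention(Q, K, V):
    """Self-attention where position i may only attend to positions <= i."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)          # future positions -> zero weight
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                            # exp(-inf) == 0
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = causal_attention(Q, K, V)
```

A useful sanity check: the first token can only attend to itself, so the first output row always equals V's first row regardless of the scores.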
Positional Encoding
Since attention is permutation-invariant (it doesn’t inherently know token order), the paper adds positional information via sinusoidal encoding:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
This allows the model to learn to attend by relative position — “two tokens to my left” — via linear combinations of sin/cos patterns.
Note: Modern LLMs have largely replaced sinusoidal encoding with RoPE (Rotary Positional Embeddings), which extends better to long contexts.
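The two sinusoidal formulas above translate directly into NumPy (a minimal sketch assuming an even d_model; function name is ours):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(max_len)[:, None]              # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices 2i
    angle = pos / np.power(10000.0, i / d_model)   # one frequency per dimension pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                    # even dims get sine
    pe[:, 1::2] = np.cos(angle)                    # odd dims get cosine
    return pe

pe = sinusoidal_pe(64, 128)
print(pe.shape)  # (64, 128)
```

At position 0 every sine dimension is 0 and every cosine dimension is 1; the geometric frequency spread is what lets linear combinations express relative offsets.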
Results
The paper reports results on WMT 2014 English-German and English-French translation:
| Model | EN-DE BLEU | EN-FR BLEU | Training Cost |
|---|---|---|---|
| Best prior models (incl. ensembles) | 26.4 | 41.0 | ~1-2 weeks |
| Transformer (base) | 27.3 | 38.1 | 0.5 day (8 GPUs) |
| Transformer (big) | 28.4 | 41.8 | 3.5 days (8 GPUs) |
Not only did the Transformer outperform RNNs — it trained significantly faster due to parallelization.
What the Paper Got Right (and What Changed)
Still standard in 2025:
- Multi-head self-attention mechanism (unchanged)
- Residual connections and layer normalization
- Feed-forward sublayers in each block
- Scaled dot-product attention formula
Evolved since:
- Positional encoding: sinusoidal → RoPE, ALiBi, learned
- Normalization placement: Post-LN → Pre-LN (more stable)
- Activation function: ReLU → GELU, SiLU
- Architecture: encoder-decoder → decoder-only (GPT lineage)
- Context length: 512 tokens → millions (Gemini 1.5)
- Efficient attention: Flash Attention, sliding window attention
Why Decoder-Only Won
The original paper uses encoder-decoder for translation. Modern LLMs are decoder-only.
Why?
- Decoder-only models can be used for any task by framing it as “completion” — translation, summarization, Q&A, all become next-token prediction
- Simpler architecture → easier to scale
- Pre-training on raw text (predict next token) is straightforward
- Scaling laws favor decoder-only architectures for general capability
GPT-1 (2018, OpenAI) demonstrated this — a decoder-only transformer pre-trained on text was surprisingly capable. The rest is history.
Frequently Asked Questions
Did the authors anticipate what the transformer would become?
The paper’s focus was machine translation. The authors didn’t claim to have solved general AI. Ilya Sutskever and others at OpenAI saw the scaling potential and built GPT on it. The “one ring to rule them all” architecture wasn’t obvious from the original paper.
What is “attention” replacing in RNNs?
RNNs have a hidden state that carries information from previous tokens. Attention replaces this with direct connections — any token can directly attend to any other, with no path length > 1.
Is the Transformer architecture patented?
No. The paper was published openly and the architecture is freely used. Google, Meta, OpenAI, Anthropic, and others have all built on it.
What is the “Transformer” in “Transformer XL”, “Vision Transformer”, etc.?
The transformer architecture generalized beyond NLP. Vision Transformers (ViT) apply self-attention to image patches. AlphaFold uses transformers for protein structure. The core mechanism — self-attention + position-wise FFN + residuals — transfers broadly.
Where can I read the paper?
Free on arXiv: arxiv.org/abs/1706.03762. The paper is well-written and accessible — highly recommended to read directly.
Next Steps
- The Transformer Architecture Explained — Implementation-level walkthrough
- ReAct Paper Explained — How transformers became agents
- What Is a Large Language Model? — From transformer to GPT-4