Paper Overview
“Attention Is All You Need”. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Google Brain / Google Research. Advances in Neural Information Processing Systems (NeurIPS 2017).
Why it matters: This paper introduced the Transformer architecture — the foundation of every modern LLM, including GPT-4, Claude, Gemini, and Llama. Before this paper, sequence modeling relied on RNNs and LSTMs. After it, transformers became the universal architecture for NLP, then vision, then biology, then code.
Citations (as of 2025): 130,000+. Arguably the most impactful ML paper of the 2010s.
The Problem They Were Solving
In 2017, the state of the art for sequence-to-sequence tasks (translation, summarization) was encoder-decoder architectures built on RNNs (Recurrent Neural Networks) and LSTMs.
The problems:
- Sequential processing — RNNs process one token at a time, left to right. You can’t parallelize this, so training is slow.
- Vanishing gradients — Information from early tokens fades through many recurrent steps. Long-range dependencies are hard to learn.
- Limited context — Even with LSTMs and attention mechanisms bolted on, capturing relationships across long sequences was difficult.
The question the authors asked: “What if we remove recurrence entirely and rely solely on attention?”
The Core Contribution: Self-Attention
The key innovation is scaled dot-product attention:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) × V
Where:
- Q (Query): what each position is “looking for”
- K (Key): what each position “offers”
- V (Value): the information each position carries
For each token, this computes a weighted average of the value vectors of all tokens (including itself), where the weights reflect relevance.
Intuition:
"The bank along the river was steep"
For "bank":
- Attends strongly to "river" (key clue → riverbank)
- Attends strongly to "steep" (describes a slope)
- Attends weakly to "The", "along", "was"
Result: "bank" is represented as a weighted blend of all positions,
concentrated on the contextually relevant ones.
This is O(n²) in sequence length but fully parallelizable — every token’s attention to every other token can be computed simultaneously on a GPU.
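The formula above can be sketched in a few lines of NumPy. This is a minimal single-sequence illustration (function and variable names are ours, not from the paper), omitting batching and the learned projections:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one sequence.

    Q, K: (seq_len, d_k); V: (seq_len, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq_len, seq_len) relevance scores
    scores -= scores.max(axis=-1, keepdims=True)     # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax: each row sums to 1
    return weights @ V                               # weighted blend of value vectors

# Tiny smoke test: 3 tokens, d_k = d_v = 4
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4)
```

Note that when all query-key scores are equal (e.g., Q is all zeros), the softmax weights are uniform and each output row is simply the mean of the value vectors, which matches the "weighted average" intuition above.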
Multi-Head Attention
Running attention once captures one type of relationship. The paper introduces multi-head attention — running h parallel attention operations with different learned projections:
MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) × W_O
where head_i = Attention(Q × W_i^Q, K × W_i^K, V × W_i^V)
In their experiments: h=8 heads, each projecting to d_k = d_model/h = 64 dimensions.
Why it helps: Different heads learn different relationship types:
- Head 1 might focus on subject-verb agreement
- Head 2 might focus on coreference (“it” → “the cat”)
- Head 3 might focus on positional relationships
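The project-split-attend-concat pattern can be sketched as follows. This is a simplified illustration, not the paper's implementation: the weight names follow the paper's W^Q, W^K, W^V, W^O, but slicing one fused projection into per-head columns is our shortcut:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """Project X, attend independently in h subspaces, concat, re-project.

    X: (seq_len, d_model); all weight matrices: (d_model, d_model).
    """
    seq_len, d_model = X.shape
    d_k = d_model // h
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    heads = []
    for i in range(h):
        s = slice(i * d_k, (i + 1) * d_k)            # this head's d_k-dim slice
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_k)
        heads.append(softmax(scores) @ V[:, s])       # (seq_len, d_k) per head
    return np.concatenate(heads, axis=-1) @ W_O       # (seq_len, d_model)

rng = np.random.default_rng(0)
d_model, h = 512, 8                                   # paper's base config: d_k = 512/8 = 64
X = rng.normal(size=(5, d_model))
Ws = [rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(4)]
out = multi_head_attention(X, *Ws, h=h)
print(out.shape)  # (5, 512)
```

Because each head attends over its own learned projection, the heads are free to specialize in the different relationship types listed above.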
The Full Architecture
INPUT → Embedding + Positional Encoding
↓
┌─────────────────────┐
│ ENCODER (×6) │
│ ┌─────────────────┐ │
│ │ Multi-Head Attn │ │
│ │ + Residual/Norm │ │
│ ├─────────────────┤ │
│ │ Feed-Forward │ │
│ │ + Residual/Norm │ │
│ └─────────────────┘ │
└─────────────────────┘
↓
┌─────────────────────┐
│ DECODER (×6) │
│ ┌─────────────────┐ │
│ │ Masked MH Attn │ │← causal masking
│ ├─────────────────┤ │
│ │ Cross-Attention │ │← attends to encoder
│ ├─────────────────┤ │
│ │ Feed-Forward │ │
│ └─────────────────┘ │
└─────────────────────┘
↓
Linear + Softmax → output probabilities
The encoder processes the source sequence (e.g., English sentence). The decoder generates the target sequence (e.g., German translation) token by token, attending to both its own previous outputs (masked self-attention) and the full encoder output (cross-attention).
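The causal masking in the decoder's self-attention can be shown concretely: future positions get a score of -inf before the softmax, so their weights become exactly zero. A minimal sketch (names ours):

```python
import numpy as np

def causal_attention(Q, K, V):
    """Self-attention where position i may only attend to positions <= i."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)          # future positions -> zero weight
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                            # exp(-inf) == 0
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = causal_attention(Q, K, V)
```

A useful sanity check: the first token can only attend to itself, so the first output row always equals V's first row regardless of the scores.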
Positional Encoding
Since attention is permutation-invariant (it doesn’t inherently know token order), the paper adds positional information via sinusoidal encoding:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
This allows the model to learn to attend by relative position — “two tokens to my left” — via linear combinations of sin/cos patterns.
Note: Modern LLMs have largely replaced sinusoidal encoding with RoPE (Rotary Positional Embeddings), which extends better to long contexts.
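The two sinusoidal formulas above translate directly into NumPy (a minimal sketch assuming an even d_model; function name is ours):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(max_len)[:, None]              # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices 2i
    angle = pos / np.power(10000.0, i / d_model)   # one frequency per dimension pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                    # even dims get sine
    pe[:, 1::2] = np.cos(angle)                    # odd dims get cosine
    return pe

pe = sinusoidal_pe(64, 128)
print(pe.shape)  # (64, 128)
```

At position 0 every sine dimension is 0 and every cosine dimension is 1; the geometric frequency spread is what lets linear combinations express relative offsets.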
Results
The paper reports results on WMT 2014 English-German and English-French translation:
| Model | EN-DE BLEU | EN-FR BLEU | Training Cost |
|---|---|---|---|
| Best prior models (incl. ensembles) | 26.4 | 41.0 | ~1-2 weeks |
| Transformer (base) | 27.3 | 38.1 | 0.5 day (8 GPUs) |
| Transformer (big) | 28.4 | 41.8 | 3.5 days (8 GPUs) |
Not only did the Transformer outperform RNNs — it trained significantly faster due to parallelization.
What the Paper Got Right (and What Changed)
Still standard in 2025:
- Multi-head self-attention mechanism (unchanged)
- Residual connections and layer normalization
- Feed-forward sublayers in each block
- Scaled dot-product attention formula
Evolved since:
- Positional encoding: sinusoidal → RoPE, ALiBi, learned
- Normalization placement: Post-LN → Pre-LN (more stable)
- Activation function: ReLU → GELU, SiLU
- Architecture: encoder-decoder → decoder-only (GPT lineage)
- Context length: 512 tokens → millions (Gemini 1.5)
- Efficient attention: Flash Attention, sliding window attention
Why Decoder-Only Won
The original paper uses encoder-decoder for translation. Modern LLMs are decoder-only.
Why?
- Decoder-only models can be used for any task by framing it as “completion” — translation, summarization, Q&A, all become next-token prediction
- Simpler architecture → easier to scale
- Pre-training on raw text (predict next token) is straightforward
- Scaling laws favor decoder-only architectures for general capability
GPT-1 (2018, OpenAI) demonstrated this — a decoder-only transformer pre-trained on text was surprisingly capable. The rest is history.
Frequently Asked Questions
Did the authors anticipate what the transformer would become?
The paper’s focus was machine translation. The authors didn’t claim to have solved general AI. Ilya Sutskever and others at OpenAI saw the scaling potential and built GPT on it. The “one ring to rule them all” architecture wasn’t obvious from the original paper.
What is “attention” replacing in RNNs?
RNNs have a hidden state that carries information from previous tokens. Attention replaces this with direct connections — any token can directly attend to any other, with no path length > 1.
Is the Transformer architecture patented?
No. The paper was published openly and the architecture is freely used. Google, Meta, OpenAI, Anthropic, and others have all built on it.
What is the “Transformer” in “Transformer XL”, “Vision Transformer”, etc.?
The transformer architecture generalized beyond NLP. Vision Transformers (ViT) apply self-attention to image patches. AlphaFold uses transformers for protein structure. The core mechanism — self-attention + position-wise FFN + residuals — transfers broadly.
Where can I read the paper?
Free on arXiv: arxiv.org/abs/1706.03762. The paper is well-written and accessible — highly recommended to read directly.
Next Steps
- The Transformer Architecture Explained — Implementation-level walkthrough
- ReAct Paper Explained — How transformers became agents
- What Is a Large Language Model? — From transformer to GPT-4