
The Transformer Architecture Explained for Developers

#transformer #attention #self-attention #neural-network #architecture #llm

Why Transformers Changed Everything

Before 2017, the best NLP models were RNNs (Recurrent Neural Networks) — they processed text word by word, maintaining a hidden state. The problem: information from early in a long sequence would fade by the time the model reached the end.

The 2017 paper “Attention Is All You Need” introduced the Transformer, which processes all tokens simultaneously using a mechanism called self-attention. This enabled:

  • Full parallelization (fast training on GPUs)
  • Direct access to any token in the context, regardless of distance
  • Better handling of long-range dependencies

Every modern LLM — GPT-4, Claude, Gemini, Llama — is built on this architecture.

High-Level Structure

A Transformer has two main components:

  • Encoder — processes the input sequence into a rich representation
  • Decoder — generates the output sequence token by token

For LLMs (GPT-style models), only the decoder is used in “autoregressive” fashion — each generated token is fed back as input to generate the next.
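The loop itself is simple. A minimal sketch of greedy autoregressive decoding — `model` is a stand-in callable returning next-token scores, and the toy scorer below is purely illustrative, not a real API:

```python
import torch

def generate(model, token_ids: list[int], max_new_tokens: int) -> list[int]:
    """Greedy autoregressive decoding: each new token is appended
    and the whole sequence is fed back in."""
    for _ in range(max_new_tokens):
        logits = model(torch.tensor(token_ids))  # [seq_len, vocab_size]
        next_id = int(logits[-1].argmax())       # highest-scoring next token
        token_ids = token_ids + [next_id]
    return token_ids

# Toy "model": always scores token (last_id + 1) highest
toy = lambda ids: torch.nn.functional.one_hot(ids + 1, num_classes=100).float()
print(generate(toy, [5], 3))  # [5, 6, 7, 8]
```

A real model replaces `toy` with the full network; sampling strategies (temperature, top-p) replace the `argmax`.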

Input tokens → [Embedding + Positional Encoding]

              [N × Transformer Blocks]
               ├── Multi-Head Self-Attention
               ├── Layer Norm
               ├── Feed-Forward Network
               └── Layer Norm

              [Linear + Softmax → Token probabilities]

Step 1: Tokenization and Embedding

Text is broken into tokens, then converted to vectors (embeddings):

# Conceptual — not actual LLM code
import torch

vocab_size = 50000
embedding_dim = 4096  # GPT-3 uses 12,288

# Each token ID maps to a learned vector
embedding_table = torch.nn.Embedding(vocab_size, embedding_dim)

token_ids = torch.tensor([15496, 995, 11])  # "Hello world," in GPT-2's tokenizer
token_vectors = embedding_table(token_ids)
# Shape: [3, 4096] — 3 tokens, each 4096-dimensional

Step 2: Positional Encoding

Transformers process tokens in parallel — they don’t inherently know which token came first. Positional encoding injects position information:

import torch
import math

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Sinusoidal positional encoding from the original paper."""
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))

    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# Add to token embeddings
x = token_vectors + positional_encoding(3, 4096)

Modern LLMs often use RoPE (Rotary Positional Embeddings) instead, which scales better to long contexts.
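A minimal sketch of the rotary idea (not a production implementation): each pair of channels is rotated by an angle that grows with the token's position, so dot products between rotated queries and keys depend on their relative offset rather than absolute positions.

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary positional embedding sketch. x: [seq_len, d] with d even.
    Rotates each (even, odd) channel pair by a position-dependent angle."""
    seq_len, d = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)      # [seq_len, 1]
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)  # [d/2]
    angles = pos * freqs                                               # [seq_len, d/2]
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]  # channel pairs
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because it is a pure rotation, vector norms are preserved, and position 0 (angle zero) is left unchanged.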

Step 3: Self-Attention — The Core Mechanism

Self-attention allows every token to look at every other token to determine relevance. The key insight: the meaning of a word depends on its context.

"I saw the bat in the cave"

  "bat" should look at "cave" → animal
  "bat" should look at "hit" → sports equipment

The QKV Formulation

For each token, self-attention computes three vectors:

  • Q (Query): “what am I looking for?”
  • K (Key): “what do I contain?”
  • V (Value): “what information do I provide?”

import math
import torch
import torch.nn.functional as F

d_model = 512
d_k = 64  # dimension per head

# Learned projection matrices
W_Q = torch.nn.Linear(d_model, d_k)
W_K = torch.nn.Linear(d_model, d_k)
W_V = torch.nn.Linear(d_model, d_k)

def scaled_dot_product_attention(x: torch.Tensor) -> torch.Tensor:
    """
    x: [seq_len, d_model]
    Returns: [seq_len, d_k]
    """
    Q = W_Q(x)  # [seq_len, d_k]
    K = W_K(x)  # [seq_len, d_k]
    V = W_V(x)  # [seq_len, d_k]

    # Attention scores: how much each token attends to every other token
    scores = Q @ K.T / math.sqrt(d_k)  # [seq_len, seq_len]

    # Softmax to get probabilities
    weights = F.softmax(scores, dim=-1)  # [seq_len, seq_len]

    # Weighted combination of values
    output = weights @ V  # [seq_len, d_k]
    return output

The division by sqrt(d_k) keeps the dot products from growing with the head dimension; without it, softmax saturates toward a one-hot distribution and the gradients through it vanish.
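A quick demonstration of why the scaling matters — the same relative pattern of scores, pushed to a larger magnitude, collapses softmax to a near one-hot distribution:

```python
import torch
import torch.nn.functional as F

# Same relative pattern of scores, different scales
small = torch.tensor([1.0, 2.0, 3.0])
large = small * 30  # roughly what unscaled dot products look like at high d_k

print(F.softmax(small, dim=-1))  # ~[0.09, 0.24, 0.67] — a soft distribution
print(F.softmax(large, dim=-1))  # ~[0, 0, 1] — nearly one-hot, near-zero gradients
```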

Step 4: Multi-Head Attention

Running attention once captures one type of relationship. Multi-head attention runs multiple attention mechanisms in parallel, each with different learned projections, then combines results:

num_heads = 8
d_k = d_model // num_heads  # 64

class MultiHeadAttention(torch.nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        self.W_Q = torch.nn.Linear(d_model, d_model)
        self.W_K = torch.nn.Linear(d_model, d_model)
        self.W_V = torch.nn.Linear(d_model, d_model)
        self.W_O = torch.nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, d_model = x.shape

        Q = self.W_Q(x).view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_K(x).view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_V(x).view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)

        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)
        weights = F.softmax(scores, dim=-1)
        attended = weights @ V

        # Concatenate heads and project
        concat = attended.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        return self.W_O(concat)

Each head can learn different patterns: one might focus on syntactic relationships, another on semantic similarity, another on coreference.
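For quick experiments, PyTorch ships an equivalent built-in module, `torch.nn.MultiheadAttention`; a shape check of self-attention (passing the same tensor as Q, K, and V):

```python
import torch

mha = torch.nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(1, 5, 512)   # [batch, seq_len, d_model]
out, weights = mha(x, x, x)  # self-attention: Q, K, V all derived from x
print(out.shape)      # torch.Size([1, 5, 512])
print(weights.shape)  # torch.Size([1, 5, 5]) — averaged over heads by default
```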

Step 5: Causal Masking (for Decoder)

When generating text, the model must not “cheat” by looking at future tokens. A causal mask prevents this:

def causal_mask(seq_len: int) -> torch.Tensor:
    """Upper triangular mask — future tokens are -inf (become 0 after softmax)."""
    mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)
    return mask.masked_fill(mask == 1, float('-inf'))

# Apply before softmax in attention
scores = scores + causal_mask(seq_len)
weights = F.softmax(scores, dim=-1)

This ensures token at position i only attends to positions 0..i.

Step 6: Feed-Forward Network

After attention, each position is processed independently through a feed-forward network:

class FeedForward(torch.nn.Module):
    def __init__(self, d_model: int, d_ff: int = 2048):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(d_model, d_ff),
            torch.nn.GELU(),          # activation function
            torch.nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

The FFN is where much of the “knowledge” is stored — it’s roughly 2/3 of the parameters in a transformer.
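That 2/3 figure follows from the standard d_ff = 4 × d_model expansion. Quick arithmetic with GPT-2-small dimensions, counting only the weight matrices of one block:

```python
# Per-block weight counts (biases ignored), GPT-2-small dimensions
d_model, d_ff = 768, 3072  # d_ff = 4 * d_model in the original design

attn_params = 4 * d_model * d_model  # W_Q, W_K, W_V, W_O
ffn_params = 2 * d_model * d_ff      # up-projection and down-projection

print(ffn_params / (attn_params + ffn_params))  # 0.666... — 2/3 of block weights
```

In symbols: attention holds 4·d², the FFN holds 8·d², so the FFN is 8/(4+8) = 2/3 of the block.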

Step 7: Layer Normalization and Residual Connections

Each sub-layer (attention, FFN) is wrapped with:

  1. Residual connection — adds the input directly to the output, enabling gradient flow in deep networks
  2. Layer normalization — normalizes across the feature dimension for training stability

class TransformerBlock(torch.nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.ffn = FeedForward(d_model)
        self.norm1 = torch.nn.LayerNorm(d_model)
        self.norm2 = torch.nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual + norm (Pre-LN variant, used in GPT-2+)
        x = x + self.attention(self.norm1(x))
        x = x + self.ffn(self.norm2(x))
        return x
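Putting the pieces together: a complete decoder-only model is just an embedding, a stack of such blocks, and a final linear projection over the vocabulary. This sketch leans on PyTorch's built-in `TransformerEncoderLayer` (with a causal mask it behaves like a decoder block; positional encoding omitted for brevity):

```python
import torch

class TinyLM(torch.nn.Module):
    """Minimal decoder-only sketch — tiny dimensions, no positional encoding."""
    def __init__(self, vocab_size=100, d_model=64, num_heads=4, num_layers=2):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, d_model)
        layer = torch.nn.TransformerEncoderLayer(
            d_model, num_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)  # norm_first = Pre-LN
        self.blocks = torch.nn.TransformerEncoder(layer, num_layers)
        self.head = torch.nn.Linear(d_model, vocab_size)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        seq_len = token_ids.shape[1]
        mask = torch.nn.Transformer.generate_square_subsequent_mask(seq_len)
        x = self.embed(token_ids)      # [batch, seq_len, d_model]
        x = self.blocks(x, mask=mask)  # causal self-attention
        return self.head(x)            # [batch, seq_len, vocab_size] logits

model = TinyLM()
logits = model(torch.tensor([[1, 2, 3]]))
print(logits.shape)  # torch.Size([1, 3, 100])
```

Applying softmax over the last dimension of `logits` gives the token probabilities from the diagram above.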

Scale: What Changes in Large Models

                   Small (GPT-2)    Large (GPT-4, est.)
Layers (N)         12               ~96+
d_model            768              ~12,288
Attention heads    12               ~96
Parameters         1.5B             ~1.7T
Context window     1,024 tokens     128,000 tokens

More layers, wider dimensions, more heads. The training recipe improves too — better data curation, longer training, RLHF alignment.

Frequently Asked Questions

What’s the difference between encoder-only, decoder-only, and encoder-decoder transformers?

Encoder-only (BERT, RoBERTa): reads full sequence bidirectionally. Best for classification, embeddings, understanding tasks.

Decoder-only (GPT series, Claude, Llama): autoregressive generation with causal masking. Best for text generation, chat.

Encoder-decoder (T5, BART, original Transformer): encoder reads input, decoder generates output. Best for translation, summarization.

Why does attention have O(n²) complexity?

Every token attends to every other token: n×n attention matrix. Doubling the context quadruples the computation. This is why long contexts are expensive. Research into sparse attention and linear attention aims to reduce this.
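Back-of-envelope arithmetic makes the quadratic growth concrete — the attention matrix for a single head in float32:

```python
# Attention-matrix memory for one head, float32 (4 bytes per score)
for n in (1_000, 2_000, 4_000):
    bytes_needed = n * n * 4
    print(f"{n:>5} tokens -> {bytes_needed / 1e6:.0f} MB")
# Each doubling of context quadruples the matrix: 4 MB -> 16 MB -> 64 MB
```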

What is Flash Attention?

Flash Attention is an efficient implementation of attention that avoids materializing the full n×n matrix in GPU memory. Same mathematical result, much faster and more memory-efficient. It is widely used in production LLM inference systems.

What is KV Cache?

During autoregressive generation, the Keys and Values for already-generated tokens don’t change. KV cache stores them to avoid recomputation on each new token. Critical for fast inference — without it, generating each token would require processing the entire history.
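A toy single-head sketch of the idea (all names illustrative): each old token's K and V rows are stored once, and every generation step only computes the new token's query against the cache.

```python
import torch

d_k = 8
K_cache, V_cache = [], []

def attend_with_cache(q_new, k_new, v_new):
    """Append the new token's K and V, then attend from the new query
    over all cached positions — no recomputation of old K/V."""
    K_cache.append(k_new)
    V_cache.append(v_new)
    K = torch.stack(K_cache)               # [cached_len, d_k]
    V = torch.stack(V_cache)               # [cached_len, d_k]
    scores = (q_new @ K.T) / (d_k ** 0.5)  # [cached_len]
    weights = torch.softmax(scores, dim=-1)
    return weights @ V                     # [d_k]

for step in range(3):  # three generation steps
    q = k = v = torch.randn(d_k)
    out = attend_with_cache(q, k, v)
print(len(K_cache))  # 3 — one cached K/V row per generated token
```

Per step this is O(cached_len) work instead of reprocessing the entire history, which is why the cache is critical for inference speed.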

How does the model “know” what task to do?

It doesn’t, intrinsically. The task is conveyed through the system prompt and prompt structure. Fine-tuning on instruction-following data teaches the model to recognize these patterns and respond appropriately.
