Beginner Fundamentals 6 min read

What Is a Large Language Model (LLM)?

#llm #large-language-model #gpt #transformers #nlp #ai-basics

The One-Sentence Definition

A Large Language Model (LLM) is a neural network trained on massive amounts of text that learns to predict the next token — and in doing so, acquires the ability to understand and generate human language.

That’s it. Everything else — reasoning, coding, translation, summarization — emerges from that single training objective applied at enormous scale.

Why “Large”?

The “large” refers to two things:

  1. Parameters — the numerical weights inside the neural network. GPT-3 has 175 billion; GPT-4 is rumored to have ~1.7 trillion. More parameters mean more capacity to represent patterns.
  2. Training data — the text the model learned from. Trillions of tokens from the internet, books, code, scientific papers, and more.

The surprising discovery (confirmed empirically around 2020–2022) is that scale produces qualitative jumps. A model 10x larger isn’t just 10% better — it suddenly acquires capabilities the smaller model never had: reasoning, analogy, few-shot learning.

How LLMs Are Trained

Pre-training: Predict the Next Token

Input:  "The capital of France is"
Target: "Paris"

The model sees billions of these examples. Adjust the weights to predict the next token a little better; repeat across trillions of tokens (GPT-3's training run used on the order of 10^23 floating-point operations). The result is a model that has absorbed vast statistical patterns about how language — and the world it describes — works.
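The training objective above can be sketched with toy numbers. This is a minimal illustration of the next-token loss, not real model code: the probabilities are invented, and a real model computes them over its whole vocabulary.

```python
import math

# Toy illustration: the model assigns a probability to each candidate
# next token; training minimizes the negative log-likelihood
# (cross-entropy) of the token that actually came next.
predicted_probs = {"Paris": 0.72, "Lyon": 0.11, "London": 0.05, "the": 0.12}
actual_next_token = "Paris"

# Loss is low when the model puts high probability on the correct token.
loss = -math.log(predicted_probs[actual_next_token])
print(f"cross-entropy loss: {loss:.3f}")  # lower is better

# A worse prediction produces a higher loss (stronger gradient signal):
worse_loss = -math.log(predicted_probs["Lyon"])
print(f"loss if 'Lyon' had been correct: {worse_loss:.3f}")
```

Training nudges the weights so that, averaged over billions of examples, this loss goes down.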

Tokenization breaks text into subword units first:

"agentscookbook" → ["agents", "cook", "book"]

The model operates on tokens, not characters or words. GPT-4 uses a 100,000-token vocabulary.
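A toy greedy longest-match tokenizer shows the idea. The vocabulary here is invented for illustration; real tokenizers (like the BPE tokenizers GPT models use) learn their subword vocabulary from data and use a merge procedure rather than pure longest-match.

```python
# Invented subword vocabulary, purely for illustration.
VOCAB = {"agents", "agent", "cook", "book", "s", "a", "g", "e", "n",
         "t", "c", "o", "k", "b"}

def tokenize(text: str) -> list[str]:
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible match first, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize at position {i}")
    return tokens

print(tokenize("agentscookbook"))  # ['agents', 'cook', 'book']
```

Note that `"agents"` wins over `"agent"` because longer matches are tried first — the same word can split differently depending on the vocabulary.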

Fine-tuning and RLHF

Raw pre-trained models are powerful but unfocused — they’ll complete your prompt, but not necessarily helpfully. Fine-tuning adjusts the model on curated instruction-following data:

Human: Explain LLMs simply.
Assistant: [good explanation]

RLHF (Reinforcement Learning from Human Feedback) goes further: humans rank model outputs, and those rankings train a reward model that guides further training. This is how models like GPT-4 and Claude became “assistant-like.”
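The reward model at the heart of RLHF is typically trained with a pairwise ranking loss. The sketch below shows the Bradley–Terry-style loss commonly described in the RLHF literature; the reward scores are invented, and real training backpropagates this loss through a large network.

```python
import math

def pairwise_loss(score_preferred: float, score_rejected: float) -> float:
    # -log(sigmoid(r_preferred - r_rejected)): small when the preferred
    # response already scores higher, large when the ranking is violated.
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Reward model already agrees with the human ranking -> low loss.
print(pairwise_loss(2.0, -1.0))
# Reward model disagrees with the human ranking -> high loss.
print(pairwise_loss(-1.0, 2.0))
```

The trained reward model then scores candidate outputs during a reinforcement-learning phase, steering the LLM toward responses humans prefer.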

The Transformer: What Makes It Work

All modern LLMs use the Transformer architecture (introduced in the 2017 paper “Attention Is All You Need”). The key innovation: self-attention.

Self-attention lets each token look at every other token in the context to figure out which are relevant:

"The bank by the river was steep"
       ^--- "bank" looks at "river" and "steep" to understand
            it means a riverbank, not a financial institution

Older architectures (RNNs, LSTMs) processed text sequentially — difficult to parallelize and hard to maintain long-range context. Transformers process all tokens simultaneously, which enables:

  • Massive parallelization (GPUs love this)
  • Capturing relationships across long documents
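The mechanism can be shown in a few lines. This is a minimal scaled dot-product attention over toy 2-d vectors: real models use learned query/key/value projections and many attention heads, and the vectors below are invented so that "bank" and "river" happen to be similar.

```python
import math

tokens = ["The", "bank", "by", "the", "river"]
vectors = [[0.1, 0.0], [0.9, 0.3], [0.0, 0.1], [0.1, 0.0], [0.8, 0.4]]

def softmax(xs: list[float]) -> list[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query_idx: int) -> list[float]:
    d = len(vectors[0])
    q = vectors[query_idx]
    # Each token's score = dot(query, key) / sqrt(d), then softmax.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
              for k in vectors]
    return softmax(scores)

# Apart from itself, "bank" attends most strongly to "river".
weights = attention_weights(1)
for tok, w in zip(tokens, weights):
    print(f"{tok:>6}: {w:.3f}")
```

Every token computes weights like these over every other token, in parallel — that all-pairs comparison is what makes Transformers both powerful and GPU-friendly.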

Context Window

The context window is how much text the model can “see” at once. The original GPT-3: 2,048 tokens (~1,500 words). GPT-4o: 128,000 tokens. Claude: 200,000 tokens.

Practical implications for developers:

  • Your entire prompt + conversation history + response must fit in the context window
  • Longer context = more expensive (cost is roughly proportional to tokens processed)
  • “Lost in the middle” problem: models struggle to use information from the middle of very long contexts
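In practice this means budgeting tokens before each call. A rough sketch, using the common ~4-characters-per-token heuristic (an approximation — exact counts require the model's actual tokenizer), with limits chosen as assumptions:

```python
CONTEXT_WINDOW = 128_000       # assumed limit, e.g. GPT-4o
RESERVED_FOR_RESPONSE = 4_000  # leave room for the model's reply

def estimate_tokens(text: str) -> int:
    return len(text) // 4 + 1  # crude heuristic, not exact

def fits_in_context(system_prompt: str, history: list[str],
                    user_msg: str) -> bool:
    used = sum(estimate_tokens(t)
               for t in [system_prompt, user_msg] + history)
    return used + RESERVED_FOR_RESPONSE <= CONTEXT_WINDOW

print(fits_in_context("You are helpful.", ["hi", "hello!"], "Explain LLMs."))
```

When the budget is exceeded, typical strategies are truncating or summarizing the oldest history before sending the request.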

Emergent Capabilities

As LLMs scale, they develop capabilities no one explicitly trained them for:

| Capability | When it emerged |
| --- | --- |
| Few-shot learning | ~GPT-3 (2020) |
| Chain-of-thought reasoning | ~GPT-3.5 (2022) |
| Reliable code generation | ~GPT-4 (2023) |
| Instruction following | Fine-tuned models (2022+) |

“Emergent” means the capability wasn’t present in smaller models and wasn’t directly targeted by training — it appeared as a side effect of scale.

LLMs vs. Traditional NLP

| | Traditional NLP | LLMs |
| --- | --- | --- |
| Approach | Rule-based + statistical | End-to-end neural |
| Task scope | One task per model | General purpose |
| Training data | Thousands of examples | Trillions of tokens |
| Adaptability | Retrain for each task | Prompt engineering |

The Main LLM Families

| Model | Company | Context | Key Strength |
| --- | --- | --- | --- |
| GPT-4o | OpenAI | 128K | Multimodal, general |
| Claude 3.5/4 | Anthropic | 200K | Long context, safe |
| Gemini 2.5 | Google | 1M | Huge context window |
| Llama 3.x | Meta | 128K | Open weights |
| Mistral | Mistral AI | 32K | Efficient, open |

Limitations Every Developer Should Know

Hallucination — LLMs generate plausible-sounding but incorrect information. They don’t “know” facts; they pattern-match. Always verify factual claims from LLMs against authoritative sources.

Knowledge cutoff — Training data has a cutoff date. The model doesn’t know about events after that date unless you provide them in the prompt.

Context window as memory — LLMs have no persistent memory between API calls. Each call starts fresh (unless you include history in the prompt).
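This is why chat clients re-send the conversation on every call. In the sketch below, `fake_llm_call` is a hypothetical stand-in for a real chat API — the point is that the client, not the model, maintains the history:

```python
def fake_llm_call(messages: list[dict]) -> str:
    # A real API would generate a reply conditioned on *all* messages.
    return f"(reply conditioned on {len(messages)} messages)"

history: list[dict] = [{"role": "system", "content": "You are helpful."}]

def send(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    reply = fake_llm_call(history)  # the whole history goes out each call
    history.append({"role": "assistant", "content": reply})
    return reply

print(send("What is an LLM?"))   # conditioned on 2 messages
print(send("Say it simpler."))   # conditioned on 4 messages
```

Because the history grows with every turn, long conversations eventually collide with the context-window and cost limits described above.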

Sensitivity to phrasing — The same question phrased differently can get very different answers. Prompt engineering matters.

Cost — API calls cost money. A 10,000-token call to GPT-4o costs ~$0.05–$0.15. At scale, this adds up quickly.
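A back-of-envelope cost model makes this concrete. The per-million-token prices below are illustrative assumptions, not current list prices — always check your provider's pricing page:

```python
PRICE_PER_1M_INPUT = 2.50    # USD per 1M input tokens (assumed)
PRICE_PER_1M_OUTPUT = 10.00  # USD per 1M output tokens (assumed)

def call_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * PRICE_PER_1M_INPUT
            + output_tokens * PRICE_PER_1M_OUTPUT) / 1_000_000

# A 10,000-token call (9,000 in, 1,000 out):
cost = call_cost(9_000, 1_000)
print(f"${cost:.4f} per call, ${cost * 100_000:,.0f} per 100k calls")
```

Note that output tokens are typically priced several times higher than input tokens, so verbose responses dominate the bill.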

Frequently Asked Questions

Is an LLM the same as AI?

No. LLMs are one type of AI — specifically, deep learning models for language. AI is a broader field including computer vision, robotics, reinforcement learning, and more. Current LLMs are often called “narrow AI” despite their impressive generality.

Do LLMs understand what they’re saying?

This is actively debated. LLMs produce correct, contextually appropriate responses, but whether that constitutes “understanding” in a philosophical sense is unclear. For practical development purposes: treat them as very powerful pattern-matching machines, not as reasoning agents.

What’s the difference between an LLM and a chatbot?

An LLM is the underlying model. A chatbot is a product built on top of an LLM — with system prompts, memory, UI, and usually guardrails. ChatGPT is a chatbot; GPT-4 is the LLM.

Why can’t I just run an LLM on my laptop?

Frontier models (GPT-4, Claude) require massive GPU clusters to run. However, smaller open-weight models like Llama 3.2 3B can run on a modern laptop via tools like Ollama. Quality is lower but fully functional for many tasks.

What are tokens, exactly?

Tokens are the units the model processes. Roughly: 1 token ≈ 0.75 English words. “Hello world” = 2 tokens. Code and non-English text use more tokens per word. APIs charge per token.
