The Problem with Unstructured LLM Output
LLMs return free-form text by default. For applications that need to parse, store, or process AI outputs — user data extraction, document classification, API responses — you need structured output: JSON that matches a predictable schema.
LangChain structured output forces the LLM to return data matching a Pydantic model, every time. No regex parsing, no fragile string splitting — just reliable, typed Python objects.
Basic Structured Output with Pydantic
Define a Pydantic model and pass it to .with_structured_output():
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field
from typing import Optional
class PersonInfo(BaseModel):
    """Information about a person extracted from text."""
    name: str = Field(description="Full name of the person")
    age: Optional[int] = Field(default=None, description="Age in years, if mentioned")
    occupation: Optional[str] = Field(default=None, description="Job or role, if mentioned")
    location: Optional[str] = Field(default=None, description="City or country, if mentioned")
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
structured_llm = llm.with_structured_output(PersonInfo)
result = structured_llm.invoke(
    "John Smith is a 34-year-old software engineer based in San Francisco."
)
print(result)
# PersonInfo(name='John Smith', age=34, occupation='software engineer', location='San Francisco')
print(result.name) # 'John Smith'
print(result.age) # 34
The response is a validated Python object, not a string. If the LLM omits a required field, Pydantic raises a validation error.
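You can see that guarantee without calling a model at all: the same Pydantic model validates raw dicts directly. A quick local sketch using the Pydantic v2 API:

```python
from typing import Optional
from pydantic import BaseModel, Field, ValidationError

class PersonInfo(BaseModel):
    """Information about a person extracted from text."""
    name: str = Field(description="Full name of the person")
    age: Optional[int] = Field(default=None, description="Age in years, if mentioned")

# A complete payload validates into a typed object
ok = PersonInfo.model_validate({"name": "John Smith", "age": 34})
print(ok.age + 1)  # typed as int, so arithmetic works: 35

# A payload missing the required name field raises ValidationError
try:
    PersonInfo.model_validate({"age": 34})
except ValidationError as e:
    print("missing required field:", e.errors()[0]["loc"])  # ('name',)
```

This is exactly what structured_llm does for you: the model's tool-call arguments are run through the same validation before you ever see them.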
Extracting Lists and Nested Objects
from pydantic import BaseModel, Field
from typing import List
class Product(BaseModel):
    name: str
    price: float = Field(description="Price in USD")
    in_stock: bool

class ProductCatalog(BaseModel):
    """A list of products extracted from the text."""
    products: List[Product]
    currency: str = Field(default="USD")
llm = ChatOpenAI(model="gpt-4o-mini")
extractor = llm.with_structured_output(ProductCatalog)
text = """
Our current inventory:
- MacBook Pro 16": $2,499, available
- iPad Air: $599, out of stock
- AirPods Pro: $249, available
"""
catalog = extractor.invoke(text)
for product in catalog.products:
    status = "✅" if product.in_stock else "❌"
    print(f"{status} {product.name}: ${product.price}")
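The nested schema validates recursively: each dict in products becomes a typed Product instance, and defaults fill in missing fields. A quick local check with Pydantic v2's model_validate:

```python
from typing import List
from pydantic import BaseModel, Field

class Product(BaseModel):
    name: str
    price: float = Field(description="Price in USD")
    in_stock: bool

class ProductCatalog(BaseModel):
    """A list of products extracted from the text."""
    products: List[Product]
    currency: str = Field(default="USD")

# Nested dicts validate recursively into typed sub-models
catalog = ProductCatalog.model_validate({
    "products": [
        {"name": 'MacBook Pro 16"', "price": 2499, "in_stock": True},
        {"name": "iPad Air", "price": 599, "in_stock": False},
    ]
})
print(type(catalog.products[0]).__name__)  # Product
print(catalog.currency)  # default applied: USD
```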
Document Classification
from pydantic import BaseModel, Field
from typing import Literal

class DocumentClassification(BaseModel):
    category: Literal["technical", "legal", "marketing", "support", "financial"]
    confidence: float = Field(ge=0.0, le=1.0, description="Confidence score 0-1")
    summary: str = Field(max_length=200, description="One-sentence summary")
    action_required: bool = Field(description="Does this document require immediate action?")
classifier = ChatOpenAI(model="gpt-4o-mini").with_structured_output(DocumentClassification)
doc = "Customer reports that API endpoint /v2/users returns 500 error since deployment at 14:30 UTC."
result = classifier.invoke(f"Classify this document:\n{doc}")
print(f"Category: {result.category}")
print(f"Confidence: {result.confidence:.0%}")
print(f"Action required: {result.action_required}")
print(f"Summary: {result.summary}")
Extraction Pipeline with Multiple Documents
Process many documents efficiently:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field
from typing import List, Optional
class JobPosting(BaseModel):
    title: str
    company: str
    location: Optional[str] = None
    salary_min: Optional[int] = None
    salary_max: Optional[int] = None
    remote: bool = False
    required_skills: List[str] = Field(default_factory=list)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
prompt = ChatPromptTemplate.from_messages([
    ("system", "Extract job posting information. Be precise and extract only what's stated."),
    ("human", "{posting}"),
])
chain = prompt | llm.with_structured_output(JobPosting)
postings = [
    "Senior Python Engineer at Stripe, remote OK, $180k-$220k. Must know Python, AWS, PostgreSQL.",
    "Data Scientist at Anthropic (San Francisco). Competitive salary. Skills: Python, PyTorch, statistics.",
    "Junior Frontend Dev at startup in NYC. $70k-$90k, on-site. React, TypeScript required.",
]
results = chain.batch([{"posting": p} for p in postings])
for job in results:
    if job.salary_min is not None and job.salary_max is not None:
        salary = f"${job.salary_min:,}-${job.salary_max:,}"
    else:
        salary = "not specified"
    remote = "🏠 Remote" if job.remote else "🏢 On-site"
    print(f"{job.title} @ {job.company} — {salary} — {remote}")
    print(f"  Skills: {', '.join(job.required_skills)}")
Structured Output vs JSON Mode
Two ways to get structured output from OpenAI models:
Method 1: with_structured_output (Recommended)
# Uses function calling under the hood — most reliable
structured_llm = llm.with_structured_output(MyModel)
result = structured_llm.invoke("...")
# result is a validated MyModel instance
Method 2: JSON Mode
from langchain_core.output_parsers import JsonOutputParser
parser = JsonOutputParser(pydantic_object=MyModel)
prompt = ChatPromptTemplate.from_messages([
    ("system", "Return JSON matching this schema: {format_instructions}"),
    ("human", "{text}"),
]).partial(format_instructions=parser.get_format_instructions())
chain = prompt | ChatOpenAI(model="gpt-4o-mini") | parser
result = chain.invoke({"text": "..."})
Use with_structured_output — it’s more reliable and doesn’t require format instructions in the prompt.
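Since with_structured_output uses function calling under the hood, the schema the model sees is just the JSON schema derived from your Pydantic model. You can inspect that schema yourself (a sketch; MyModel is a placeholder):

```python
import json
from pydantic import BaseModel, Field

class MyModel(BaseModel):
    """Example schema."""
    title: str = Field(description="Short title")
    score: float

schema = MyModel.model_json_schema()
print(json.dumps(schema, indent=2))
# Fields without defaults are listed as required, so the model must fill them
print(schema["required"])  # ['title', 'score']
```

This is also why field descriptions matter: they travel with the schema and act as per-field instructions to the model.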
Handling Extraction Failures
Sometimes the LLM can’t extract certain fields. Handle this gracefully:
from pydantic import BaseModel, Field, field_validator
from typing import Optional

class SafeExtraction(BaseModel):
    value: Optional[float] = None
    confidence: float = Field(default=0.0, ge=0.0, le=1.0)
    raw_text: str = Field(description="Original text snippet for this value")

    @field_validator("confidence")
    @classmethod
    def round_confidence(cls, v: float) -> float:
        return round(v, 2)
# For batch processing with error handling:
def safe_extract(text: str) -> Optional[SafeExtraction]:
    try:
        return structured_llm.invoke(text)
    except Exception as e:
        print(f"Extraction failed: {e}")
        return None
results = [safe_extract(t) for t in documents]
valid = [r for r in results if r is not None]
print(f"Successfully extracted: {len(valid)}/{len(documents)}")
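To avoid discarding a document on a transient failure, you can also retry before giving up. In this sketch, extract_with_retry and flaky are hypothetical names; in practice you would pass structured_llm.invoke as the extract callable:

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

def extract_with_retry(extract: Callable[[str], T], text: str,
                       retries: int = 2, delay: float = 0.0) -> Optional[T]:
    """Call extract (e.g. structured_llm.invoke), retrying on any error."""
    for attempt in range(retries + 1):
        try:
            return extract(text)
        except Exception:
            if attempt < retries:
                time.sleep(delay)  # back off before the next attempt
    return None

# Demo with a flaky stand-in that fails once, then succeeds
calls = {"n": 0}
def flaky(text: str) -> str:
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("transient failure")
    return text.upper()

result = extract_with_retry(flaky, "ok")
print(result)  # OK
```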
Using with Claude and Other Models
Structured output works with any model that supports function calling:
from langchain_anthropic import ChatAnthropic
# Claude with structured output
claude = ChatAnthropic(model="claude-sonnet-4-20250514")
structured_claude = claude.with_structured_output(PersonInfo)
result = structured_claude.invoke("Extract person info from: Emma Watson, 34, actress from UK")
print(result)
Frequently Asked Questions
What’s the difference between with_structured_output and a regular prompt asking for JSON?
with_structured_output uses function calling (native model feature) to force structured output. Regular “return JSON” prompts are unreliable — the model might add explanation text, use wrong field names, or miss required fields. with_structured_output produces reliably valid Pydantic objects.
Does structured output work with streaming?
Yes, but it behaves differently: instead of raw tokens, .stream() yields progressively more complete partial objects as the model fills in fields, with the full validated object arriving last:
for partial in structured_llm.stream("Extract info from: ..."):
    print(partial)  # accumulates as the object is built
Can I validate extracted values beyond type checking?
Yes — use Pydantic field validators:
from pydantic import BaseModel, field_validator

class Invoice(BaseModel):
    amount: float
    currency: str

    @field_validator("currency")
    @classmethod
    def valid_currency(cls, v: str) -> str:
        v = v.upper()  # normalize before checking
        if v not in ("USD", "EUR", "GBP"):
            raise ValueError(f"Unknown currency: {v}")
        return v
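Run directly, a validator like this normalizes case and rejects unknown values. A self-contained sketch using Pydantic v2's field_validator (the v2 name for validator):

```python
from pydantic import BaseModel, ValidationError, field_validator

class Invoice(BaseModel):
    amount: float
    currency: str

    @field_validator("currency")
    @classmethod
    def valid_currency(cls, v: str) -> str:
        v = v.upper()  # normalize before checking membership
        if v not in ("USD", "EUR", "GBP"):
            raise ValueError(f"Unknown currency: {v}")
        return v

print(Invoice(amount=99.5, currency="usd").currency)  # USD

# An out-of-range value is rejected at construction time
try:
    Invoice(amount=10, currency="BTC")
except ValidationError as e:
    print("rejected:", e.errors()[0]["msg"])
```

When the LLM returns a value the validator rejects, with_structured_output surfaces it as a validation error, so bad extractions fail loudly instead of silently flowing downstream.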
How do I extract from very long documents?
For documents longer than the context window, chunk the document and extract from each chunk, then merge:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Overlap chunks so a fact split at a boundary appears whole in at least one chunk
splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
chunks = splitter.split_text(long_document)
results = chain.batch([{"text": chunk} for chunk in chunks])
# Merge results (domain-specific logic)
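What "merge" means depends on your domain, but a common pattern is to union list fields and keep the first non-null scalar per field. A minimal sketch over plain dicts (merge_extractions is a hypothetical helper, not a LangChain API):

```python
from typing import Any, Dict, List

def merge_extractions(chunks: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Keep the first non-null scalar per field; union list fields in order."""
    merged: Dict[str, Any] = {}
    for chunk in chunks:
        for key, value in chunk.items():
            if isinstance(value, list):
                existing = merged.setdefault(key, [])
                existing.extend(v for v in value if v not in existing)
            elif merged.get(key) is None and value is not None:
                merged[key] = value
    return merged

parts = [
    {"title": "Senior Python Engineer", "skills": ["Python", "AWS"]},
    {"title": None, "skills": ["AWS", "PostgreSQL"], "remote": True},
]
print(merge_extractions(parts))
# {'title': 'Senior Python Engineer', 'skills': ['Python', 'AWS', 'PostgreSQL'], 'remote': True}
```

For Pydantic results, dump each with model_dump() first, merge, then re-validate the merged dict against the model.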
Next Steps
- LangChain Agents and Tools — Use structured extraction as a tool for your agents
- RAG with Pinecone — Combine structured output with document retrieval
- Getting Started with LlamaIndex — Extract structured data from large document collections