
LangChain Structured Output: Extract Data with Pydantic

#langchain #structured-output #pydantic #json #extraction #python

The Problem with Unstructured LLM Output

LLMs return free-form text by default. For applications that need to parse, store, or process AI outputs — user data extraction, document classification, API responses — you need structured output: JSON that matches a predictable schema.

LangChain structured output forces the LLM to return data matching a Pydantic model, every time. No regex parsing, no fragile string splitting — just reliable, typed Python objects.

Basic Structured Output with Pydantic

Define a Pydantic model, then call .with_structured_output() on the chat model:

from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field
from typing import Optional

class PersonInfo(BaseModel):
    """Information about a person extracted from text."""
    name: str = Field(description="Full name of the person")
    age: Optional[int] = Field(default=None, description="Age in years, if mentioned")
    occupation: Optional[str] = Field(default=None, description="Job or role, if mentioned")
    location: Optional[str] = Field(default=None, description="City or country, if mentioned")

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
structured_llm = llm.with_structured_output(PersonInfo)

result = structured_llm.invoke(
    "John Smith is a 34-year-old software engineer based in San Francisco."
)
print(result)
# PersonInfo(name='John Smith', age=34, occupation='software engineer', location='San Francisco')
print(result.name)  # 'John Smith'
print(result.age)   # 34

The response is a validated Python object, not a string. If the LLM omits a required field, Pydantic raises a validation error.
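You can see the validation behavior without calling a model by instantiating the schema directly (a standalone sketch; PersonInfo is redefined here so the snippet is self-contained):

```python
from pydantic import BaseModel, Field, ValidationError
from typing import Optional

class PersonInfo(BaseModel):
    name: str = Field(description="Full name of the person")
    age: Optional[int] = Field(default=None, description="Age in years, if mentioned")

# Optional fields may be omitted or None
person = PersonInfo(name="John Smith", age=34)
print(person.age)  # 34

# Omitting the required `name` field raises a ValidationError
try:
    PersonInfo(age=34)
except ValidationError as e:
    print(f"Validation failed with {e.error_count()} error(s)")
```

This is exactly the check that runs on the LLM's response: required fields must be present, and every value must coerce to its declared type.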

Extracting Lists and Nested Objects

from pydantic import BaseModel, Field
from typing import List

class Product(BaseModel):
    name: str
    price: float = Field(description="Price in USD")
    in_stock: bool

class ProductCatalog(BaseModel):
    """A list of products extracted from the text."""
    products: List[Product]
    currency: str = Field(default="USD")

llm = ChatOpenAI(model="gpt-4o-mini")
extractor = llm.with_structured_output(ProductCatalog)

text = """
Our current inventory:
- MacBook Pro 16": $2,499, available
- iPad Air: $599, out of stock
- AirPods Pro: $249, available
"""

catalog = extractor.invoke(text)
for product in catalog.products:
    status = "✅" if product.in_stock else "❌"
    print(f"{status} {product.name}: ${product.price}")

Document Classification

from pydantic import BaseModel, Field
from typing import Literal

class DocumentClassification(BaseModel):
    category: Literal["technical", "legal", "marketing", "support", "financial"]
    confidence: float = Field(ge=0.0, le=1.0, description="Confidence score 0-1")
    summary: str = Field(max_length=200, description="One-sentence summary")
    action_required: bool = Field(description="Does this document require immediate action?")

classifier = ChatOpenAI(model="gpt-4o-mini").with_structured_output(DocumentClassification)

doc = "Customer reports that API endpoint /v2/users returns 500 error since deployment at 14:30 UTC."

result = classifier.invoke(f"Classify this document:\n{doc}")
print(f"Category: {result.category}")
print(f"Confidence: {result.confidence:.0%}")
print(f"Action required: {result.action_required}")
print(f"Summary: {result.summary}")

Extraction Pipeline with Multiple Documents

Process many documents efficiently:

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field
from typing import List, Optional

class JobPosting(BaseModel):
    title: str
    company: str
    location: Optional[str] = None
    salary_min: Optional[int] = None
    salary_max: Optional[int] = None
    remote: bool = False
    required_skills: List[str] = Field(default_factory=list)

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system", "Extract job posting information. Be precise and extract only what's stated."),
    ("human", "{posting}"),
])

chain = prompt | llm.with_structured_output(JobPosting)

postings = [
    "Senior Python Engineer at Stripe, remote OK, $180k-$220k. Must know Python, AWS, PostgreSQL.",
    "Data Scientist at Anthropic (San Francisco). Competitive salary. Skills: Python, PyTorch, statistics.",
    "Junior Frontend Dev at startup in NYC. $70k-$90k, on-site. React, TypeScript required.",
]

results = chain.batch([{"posting": p} for p in postings])
for job in results:
    salary = f"${job.salary_min:,}-${job.salary_max:,}" if job.salary_min else "not specified"
    remote = "🏠 Remote" if job.remote else "🏢 On-site"
    print(f"{job.title} @ {job.company} | {salary} | {remote}")
    print(f"  Skills: {', '.join(job.required_skills)}")

Structured Output vs JSON Mode

Two ways to get structured output from OpenAI models:

Method 1: with_structured_output (recommended)

# Uses function calling under the hood — most reliable
structured_llm = llm.with_structured_output(MyModel)
result = structured_llm.invoke("...")
# result is a validated MyModel instance

Method 2: JSON Mode

from langchain_core.output_parsers import JsonOutputParser

parser = JsonOutputParser(pydantic_object=MyModel)
prompt = ChatPromptTemplate.from_messages([
    ("system", "Return JSON matching this schema: {format_instructions}"),
    ("human", "{text}"),
]).partial(format_instructions=parser.get_format_instructions())

chain = prompt | ChatOpenAI(model="gpt-4o-mini") | parser
result = chain.invoke({"text": "..."})

Use with_structured_output — it’s more reliable and doesn’t require format instructions in the prompt.

Handling Extraction Failures

Sometimes the LLM can’t extract certain fields. Handle this gracefully:

from pydantic import BaseModel, Field, field_validator
from typing import Optional

class SafeExtraction(BaseModel):
    value: Optional[float] = None
    confidence: float = Field(default=0.0, ge=0.0, le=1.0)
    raw_text: str = Field(description="Original text snippet for this value")

    @field_validator("confidence")
    @classmethod
    def round_confidence(cls, v):
        return round(v, 2)

# For batch processing with error handling:

def safe_extract(text: str) -> Optional[SafeExtraction]:
    try:
        return structured_llm.invoke(text)
    except Exception as e:
        print(f"Extraction failed: {e}")
        return None

results = [safe_extract(t) for t in documents]
valid = [r for r in results if r is not None]
print(f"Successfully extracted: {len(valid)}/{len(documents)}")
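Transient API errors (rate limits, timeouts) are often worth retrying before giving up entirely. A generic retry wrapper in pure Python, as a sketch (the attempt count and backoff values are illustrative; LangChain runnables also ship a built-in .with_retry() for the same purpose):

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

def with_retries(fn: Callable[[], T], attempts: int = 3, base_delay: float = 1.0) -> Optional[T]:
    """Call fn, retrying with exponential backoff; return None if all attempts fail."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as e:
            if attempt == attempts - 1:
                print(f"Giving up after {attempts} attempts: {e}")
                return None
            time.sleep(base_delay * 2 ** attempt)

# Usage with the extractor above (assumes structured_llm is defined):
# result = with_retries(lambda: structured_llm.invoke(text))
```

Combined with the safe_extract pattern, this separates "retry transient failures" from "skip documents that genuinely can't be extracted".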

Using with Claude and Other Models

Structured output works with any model that supports function calling:

from langchain_anthropic import ChatAnthropic

# Claude with structured output
claude = ChatAnthropic(model="claude-sonnet-4-5")
structured_claude = claude.with_structured_output(PersonInfo)

result = structured_claude.invoke("Extract person info from: Emma Watson, 34, actress from UK")
print(result)

Frequently Asked Questions

What’s the difference between with_structured_output and a regular prompt asking for JSON?

with_structured_output uses function calling (native model feature) to force structured output. Regular “return JSON” prompts are unreliable — the model might add explanation text, use wrong field names, or miss required fields. with_structured_output produces reliably valid Pydantic objects.
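Under the hood, the Pydantic model is converted to a JSON schema and passed to the model as a tool definition. You can inspect that schema yourself with Pydantic v2's model_json_schema() (PersonInfo redefined here for self-containment):

```python
from pydantic import BaseModel, Field
from typing import Optional

class PersonInfo(BaseModel):
    """Information about a person extracted from text."""
    name: str = Field(description="Full name of the person")
    age: Optional[int] = Field(default=None, description="Age in years, if mentioned")

schema = PersonInfo.model_json_schema()
print(schema["properties"]["name"]["description"])  # Full name of the person
print(schema["required"])  # ['name'] -- only fields without defaults are required
```

The field descriptions end up in the schema the model sees, which is why good descriptions measurably improve extraction quality.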

Does structured output work with streaming?

Yes, but not token by token. With structured output, .stream() yields progressively more complete versions of the object as the model fills in fields, rather than raw text tokens:

for partial in structured_llm.stream("Extract info from: ..."):
    print(partial)  # accumulates as the object is built

Can I validate extracted values beyond type checking?

Yes — use Pydantic validators:

from pydantic import BaseModel, field_validator

class Invoice(BaseModel):
    amount: float
    currency: str

    @field_validator("currency")
    @classmethod
    def valid_currency(cls, v):
        v = v.upper()  # normalize before checking
        if v not in ["USD", "EUR", "GBP"]:
            raise ValueError(f"Unknown currency: {v}")
        return v
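You don't need a model call to exercise the validator; constructing the object directly shows the behavior (a self-contained sketch using Pydantic v2's field_validator):

```python
from pydantic import BaseModel, ValidationError, field_validator

class Invoice(BaseModel):
    amount: float
    currency: str

    @field_validator("currency")
    @classmethod
    def valid_currency(cls, v):
        v = v.upper()  # normalize before checking
        if v not in ["USD", "EUR", "GBP"]:
            raise ValueError(f"Unknown currency: {v}")
        return v

print(Invoice(amount=99.5, currency="usd").currency)  # USD

try:
    Invoice(amount=10.0, currency="JPY")
except ValidationError as e:
    print("Rejected:", e.errors()[0]["msg"])
```

When the LLM returns a value the validator rejects, the same ValidationError surfaces from invoke(), so the safe_extract pattern above catches business-rule violations too, not just type mismatches.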

How do I extract from very long documents?

For documents longer than the context window, chunk the document and extract from each chunk, then merge:

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
chunks = splitter.split_text(long_document)
results = chain.batch([{"text": chunk} for chunk in chunks])
# Merge results (domain-specific logic)
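The merge step depends on your schema. For the ProductCatalog example above, a simple dedupe by product name where later chunks win might look like this (a sketch; Product is redefined so the snippet stands alone):

```python
from pydantic import BaseModel
from typing import List

class Product(BaseModel):
    name: str
    price: float
    in_stock: bool

def merge_products(chunk_results: List[List[Product]]) -> List[Product]:
    """Merge per-chunk extractions, deduplicating by name (last chunk wins)."""
    merged = {}
    for products in chunk_results:
        for p in products:
            merged[p.name] = p  # later occurrences overwrite earlier ones
    return list(merged.values())

chunks = [
    [Product(name="iPad Air", price=599, in_stock=False)],
    [Product(name="iPad Air", price=599, in_stock=True),  # later chunk updates stock
     Product(name="AirPods Pro", price=249, in_stock=True)],
]
print([p.name for p in merge_products(chunks)])  # ['iPad Air', 'AirPods Pro']
```

Last-write-wins is just one policy; for conflicting numeric fields you might instead keep the value from the chunk with the most context, or flag the conflict for review.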
