
Document Layout Analysis: Extract Text from Complex PDFs

Document layout analysis is the process of understanding a document's visual structure: where regions of interest are located, what their reading order is, and how text flows across columns, sections, and pages. Unlike simple PDF text extraction, which reads the text stream sequentially, layout analysis preserves the spatial relationships that convey meaning: a heading above a paragraph tells you they're related, a margin note differs from body text, and multi-column text needs to be reassembled in the right order.

Modern vision-language models understand layout implicitly, but guiding them with spatial prompts and explicit region identification dramatically improves accuracy. I've seen extraction accuracy jump from 70% to 94% simply by adding layout hints to prompts — the model no longer confuses header text with body content or scrambles reading order.

Why Layout Matters

The Pitfall of Naive Text Extraction

If you extract text from a PDF without understanding layout, you get a jumbled mess. Consider a two-column document with a sidebar. Naive line-by-line extraction might read: "Column 1 line 1, Sidebar line 1, Column 1 line 2, Sidebar line 2…" — completely mangling the intended reading order.
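To make the interleaving problem concrete, here is a minimal sketch. The line positions, the `column_split_x` threshold, and both function names are hypothetical and made up for illustration; real coordinates would come from whatever PDF parser or OCR tool you use.

```python
# Hypothetical text lines with the (x, y) position of their top-left corner.
# Main column sits on the left (x=50), the sidebar on the right (x=400).
lines = [
    {"text": "Column 1 line 1", "x": 50, "y": 100},
    {"text": "Sidebar line 1", "x": 400, "y": 100},
    {"text": "Column 1 line 2", "x": 50, "y": 120},
    {"text": "Sidebar line 2", "x": 400, "y": 120},
]


def naive_order(lines):
    """Sort top to bottom, then left to right: this interleaves the two regions."""
    ordered = sorted(lines, key=lambda line: (line["y"], line["x"]))
    return [line["text"] for line in ordered]


def column_aware_order(lines, column_split_x):
    """Assign each line to a column first, then read each column top to bottom."""
    def sort_key(line):
        column = 0 if line["x"] < column_split_x else 1
        return (column, line["y"])

    return [line["text"] for line in sorted(lines, key=sort_key)]


print(naive_order(lines))
print(column_aware_order(lines, column_split_x=300))
```

The first call reproduces the mangled order described above; the second keeps each region together. A fixed split threshold only works for simple, known layouts, which is part of why a layout-aware model is attractive.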

Similarly, a document with a header, body, footer, and side notes needs spatial awareness to correctly identify which text belongs to which section. A vision model that understands layout can say: "Text in the left column, position Y, at size 12pt is body text; text in the margin at size 8pt is a note."

How Vision Models See Documents

Vision-language models process documents as images. They don't rely on the PDF's internal text stream, which is often missing or unreliable in scanned PDFs anyway. Instead, they analyze the spatial arrangement of visual elements: text blocks, lines, tables, images. This is inherently layout-aware. Your job is to guide the model with explicit prompts and (optionally) bounding boxes to tell it what regions matter.
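Because the model receives an image, each request needs the page's bytes base64-encoded and labeled with the correct media type. The helper below is a small sketch of that step using only the standard library. The name `build_image_block` and the `SUPPORTED_MEDIA_TYPES` set are assumptions for illustration; check your provider's documentation for the formats it actually accepts.

```python
import base64
import mimetypes
from pathlib import Path

# Assumed set of accepted formats; verify against your provider's documentation.
SUPPORTED_MEDIA_TYPES = {"image/jpeg", "image/png", "image/gif", "image/webp"}


def build_image_block(image_path: str) -> dict:
    """Return a base64 image content block, inferring the media type from the file name."""
    media_type, _ = mimetypes.guess_type(image_path)
    if media_type not in SUPPORTED_MEDIA_TYPES:
        raise ValueError(f"Unsupported or unknown image type: {image_path}")
    data = base64.standard_b64encode(Path(image_path).read_bytes()).decode("utf-8")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }
```

Inferring the media type from the extension avoids mislabeling a PNG page as JPEG, which is an easy mistake when the value is hard-coded.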

Layout Analysis Techniques

Technique 1: Spatial Prompts with Region Descriptions

The simplest approach: describe the document's layout in your prompt, guiding the model to extract text in the right order. Here's an example:

import base64
import json
from pathlib import Path

import anthropic


def analyze_layout_with_spatial_prompt(image_path: str) -> dict:
    """
    Extract text from a document with explicit layout guidance.
    """
    client = anthropic.Anthropic()

    image_data = Path(image_path).read_bytes()
    base64_image = base64.standard_b64encode(image_data).decode("utf-8")

    # Spatial prompt: tell the model about the document structure
    spatial_prompt = """Analyze this document image carefully, paying attention to layout and reading order.

The document appears to have the following regions (describe what you see):
- Header section (top): [describe]
- Main body (center): [describe]
- Sidebar (right): [describe]
- Footer (bottom): [describe]

For EACH region, extract the text in the correct reading order (top to bottom, left to right within each region).
Return the result as JSON with keys: header, body, sidebar, footer.
Use null for any region that is not present.

Preserve the structure of bulleted lists, numbered lists, and table layout."""

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2048,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/jpeg",  # must match the actual file type
                            "data": base64_image,
                        },
                    },
                    {
                        "type": "text",
                        "text": spatial_prompt,
                    },
                ],
            }
        ],
    )

    # Assumes the reply is bare JSON; json.loads raises if the model adds prose or code fences
    result_text = response.content[0].text
    return json.loads(result_text)


# Example usage
result = analyze_layout_with_spatial_prompt("complex_document.jpg")
print("Header:", result.get("header"))
print("Body:", result.get("body"))

Why does this work? The model now has explicit instructions to treat regions differently. It knows to extract header text separately from body text, respecting the intended layout and reading order.
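One practical wrinkle with asking for JSON: a model may wrap its reply in a Markdown code fence or add a sentence before or after the object, and a plain `json.loads` then fails. The helper below is a sketch of a more forgiving parser. The name `parse_json_reply` is hypothetical, and the fallback (take the outermost braces) is a heuristic, not a guarantee.

```python
import json
import re

# Three backticks, built programmatically so this example is easy to embed anywhere.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_json_reply(reply_text: str) -> dict:
    """Parse a JSON object from a model reply, tolerating code fences and extra prose."""
    text = reply_text.strip()

    # Case 1: the reply contains a fenced block; keep only what is inside it.
    fenced = FENCED_BLOCK.search(text)
    if fenced:
        text = fenced.group(1).strip()

    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Case 2: fall back to the span between the first "{" and the last "}".
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            raise
        return json.loads(text[start : end + 1])
```

In the function above you would call `parse_json_reply(response.content[0].text)` in place of `json.loads`. If parsing still fails, it is usually better to log the raw reply and retry than to guess at the structure.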

Technique 2: Bounding Box Prompts (for Precise Control)

For high-precision extraction, you can provide bounding box coordinates. You'd first run a layout detection model or manually annotate regions, then tell Claude to extract text within specific pixel ranges:

import base64
import json
from pathlib import Path

import anthropic


def extract_by_bounding_box(image_path: str, regions: list[dict]) -> dict:
    """
    Extract text from specific regions defined by bounding boxes.

    regions: list of {"name": "header", "x1": 0, "y1": 0, "x2": 1000, "y2": 100}
    """
    client = anthropic.Anthropic()

    image_data = Path(image_path).read_bytes()
    base64_image = base64.standard_b64encode(image_data).decode("utf-8")

    # Build region descriptions for the prompt, one line per region
    region_descriptions = ""
    for region in regions:
        region_descriptions += (
            f"- {region['name']} (x: {region['x1']}-{region['x2']}, "
            f"y: {region['y1']}-{region['y2']})\n"
        )

    bbox_prompt = f"""Extract text from the following regions of this document:

{region_descriptions}

For each region, extract ONLY the text that appears within its boundaries, in reading order.
Return as JSON with the region name as the key and the extracted text as the value.

Format: {{"region_name": "extracted text", ...}}"""

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2048,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/jpeg",  # must match the actual file type
                            "data": base64_image,
                        },
                    },
                    {
                        "type": "text",
                        "text": bbox_prompt,
                    },
                ],
            }
        ],
    )

    # Assumes the reply is bare JSON; json.loads raises otherwise
    return json.loads(response.content[0].text)


# Define regions: header, body, sidebar, footer (pixel coordinates on a 2000x2000 page)
regions = [
    {"name": "header", "x1": 0, "y1": 0, "x2": 2000, "y2": 150},
    {"name": "body", "x1": 0, "y1": 150, "x2": 1400, "y2": 1800},
    {"name": "sidebar", "x1": 1400, "y1": 150, "x2": 2000, "y2": 1800},
    {"name": "footer", "x1": 0, "y1": 1800, "x2": 2000, "y2": 2000},
]

result = extract_by_bounding_box("complex_document.jpg", regions)
for region_name, text in result.items():
    print(f"{region_name}:\n{text}\n")

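This approach gives you explicit control over which regions are extracted. Treat the coordinates as guidance rather than a hard crop, though: the model reads them as text and may not honor the boundaries exactly, and the API may downscale large images so that pixel values measured on the original no longer line up. Bounding boxes are also fragile if document sizes vary, so use this for documents with consistent formats.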
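One way to soften that fragility, at least when the same template arrives at different resolutions, is to define the boxes once on a reference page size and rescale them to each image. The sketch below assumes the layout is identical apart from scale; `scale_regions` and its parameters are hypothetical names, and you would need to supply the actual image size yourself (for example from an imaging library).

```python
def scale_regions(regions, reference_size, actual_size):
    """Rescale pixel bounding boxes from a reference page size to the actual image size.

    reference_size and actual_size are (width, height) tuples in pixels.
    """
    ref_width, ref_height = reference_size
    actual_width, actual_height = actual_size
    scale_x = actual_width / ref_width
    scale_y = actual_height / ref_height

    return [
        {
            "name": region["name"],
            "x1": round(region["x1"] * scale_x),
            "y1": round(region["y1"] * scale_y),
            "x2": round(region["x2"] * scale_x),
            "y2": round(region["y2"] * scale_y),
        }
        for region in regions
    ]


# Boxes defined on a 2000x2000 reference page, applied to a 1000x1000 scan
reference_regions = [
    {"name": "header", "x1": 0, "y1": 0, "x2": 2000, "y2": 150},
    {"name": "sidebar", "x1": 1400, "y1": 150, "x2": 2000, "y2": 1800},
]
scaled = scale_regions(reference_regions, (2000, 2000), (1000, 1000))
print(scaled)
```

This does nothing for documents whose layout itself changes; for those, descriptive spatial prompts are the safer choice.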

Technique 3: Reading Order Recovery for Multi-Column Text

Multi-column documents are a classic layout challenge. The model sees the columns but needs guidance on the correct reading order. Here's a focused extraction technique:

import base64
import json
from pathlib import Path

import anthropic


def extract_multicolumn_text(image_path: str, num_columns: int = 2) -> dict:
    """
    Extract text from a multi-column document in correct reading order.
    """
    client = anthropic.Anthropic()

    image_data = Path(image_path).read_bytes()
    base64_image = base64.standard_b64encode(image_data).decode("utf-8")

    multicolumn_prompt = f"""This document has {num_columns} columns of text.

Extract the text by reading DOWN the FIRST column from top to bottom, then DOWN the SECOND column from top to bottom (and so on).
This is the natural reading order for columnar text.

Return as JSON:
{{
"column_1": "text from first column...",
"column_2": "text from second column...",
...
"full_text_order": "concatenated text in correct reading order"
}}"""

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2048,  # the text is returned twice (per column and concatenated), so dense pages may need more
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/jpeg",  # must match the actual file type
                            "data": base64_image,
                        },
                    },
                    {
                        "type": "text",
                        "text": multicolumn_prompt,
                    },
                ],
            }
        ],
    )

    # Assumes the reply is bare JSON; json.loads raises otherwise
    return json.loads(response.content[0].text)


result = extract_multicolumn_text("newsletter.jpg", num_columns=2)
print(result["full_text_order"])
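Since the reply carries the text both per column and concatenated, a cheap sanity check is to rebuild the full text from the `column_N` keys yourself and compare it with `full_text_order`. The sketch below assumes the key names used in the prompt above; `join_columns` and `full_text_matches_columns` are hypothetical helper names.

```python
import re


def join_columns(result: dict) -> str:
    """Concatenate column_1, column_2, ... in numeric (not alphabetical) order."""
    columns = []
    for key, value in result.items():
        match = re.fullmatch(r"column_(\d+)", key)
        if match and isinstance(value, str):
            columns.append((int(match.group(1)), value))
    return "\n".join(text for _, text in sorted(columns))


def normalize_whitespace(text: str) -> str:
    """Collapse all runs of whitespace so line breaks do not affect the comparison."""
    return " ".join(text.split())


def full_text_matches_columns(result: dict) -> bool:
    """True if full_text_order equals the columns joined in order, ignoring whitespace."""
    rebuilt = normalize_whitespace(join_columns(result))
    reported = normalize_whitespace(result.get("full_text_order", ""))
    return rebuilt == reported
```

A mismatch does not tell you which version is right, only that the page deserves a closer look. If the two disagree often, it may be simpler to drop `full_text_order` from the prompt and always join the columns in code.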

Common Layout Challenges & Solutions

| Challenge | Impact | Solution |
| --- | --- | --- |
| Two-column or multi-column layout | Text is interleaved, reading order is scrambled | Use explicit column labels in prompt; ask model to extract column-by-column |
| Headers and footers | Page-level metadata mixed with content | Define header/footer regions separately in prompt; use bounding boxes for consistency |
| Sidebars and callout boxes | Contextual notes are mixed with main text | Spatial prompt identifying sidebars; optionally extract sidebars to a separate JSON key |
| Tables | Row/column relationships are visual, not text-based | See article 3 (table extraction); tables warrant special handling beyond layout analysis |
| Handwritten annotations | Unpredictable placement, variable legibility | Ask model to identify handwriting; optionally extract to a separate "annotations" field |

Best Practices for Layout-Aware Extraction

  1. Test on diverse documents: Layout techniques that work for a printed invoice may fail for a form or a complex report. Build a test set covering your actual use cases.

  2. Combine spatial and semantic prompts: Don't just say "extract text from the header region." Say "extract the title and subtitle from the header region — these are typically in a larger, bold font."

  3. Verify reading order in output: After extraction, spot-check that text flows naturally. Jumbled reading order is a telltale sign that layout was misunderstood.

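  4. Ask the model to flag uncertainty: Claude can express uncertainty when asked. In your prompt, ask the model to flag regions where it's unsure about reading order, and treat those flags as a cue for manual review rather than a calibrated confidence score.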

  5. Pre-process scanned documents: If working with scans, deskew and enhance contrast before extraction. These pre-processing steps make layout much clearer.
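As a sketch of how the uncertainty flags from practice 4 might be wired in: append an instruction to the extraction prompt, then route flagged regions to manual review. The `uncertain_regions` key, the instruction wording, and `split_for_review` are all assumptions made up for this example, not part of any API.

```python
# Hypothetical instruction to append to an extraction prompt
UNCERTAINTY_INSTRUCTION = (
    'Also include a key "uncertain_regions": a list of region names where you are '
    "unsure about the reading order or the region boundaries. "
    "Use an empty list if there are none."
)


def split_for_review(result: dict) -> tuple[dict, dict]:
    """Split extracted regions into (accepted, needs_review) using the model's own flags."""
    flagged = set(result.get("uncertain_regions") or [])
    regions = {key: value for key, value in result.items() if key != "uncertain_regions"}
    accepted = {key: value for key, value in regions.items() if key not in flagged}
    needs_review = {key: value for key, value in regions.items() if key in flagged}
    return accepted, needs_review


# Example with a hand-written reply in the assumed format
reply = {
    "header": "Quarterly Report",
    "body": "Revenue grew...",
    "sidebar": "Note: figures unaudited",
    "uncertain_regions": ["sidebar"],
}
accepted, needs_review = split_for_review(reply)
print("Accepted:", list(accepted))
print("Needs review:", list(needs_review))
```

Self-reported flags will miss some errors, so they complement spot-checking rather than replace it.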

Key Takeaways

  • Document layout analysis extracts text while preserving spatial relationships and reading order, avoiding common pitfalls like interleaved multi-column text.
  • Vision models understand layout implicitly, but explicit spatial prompts dramatically improve accuracy.
  • Three techniques: spatial prompts (flexible, language-driven), bounding boxes (precise, pixel-level), and reading-order guidance (for columns and complex layouts).
  • Common challenges include multi-column text, sidebars, headers/footers, and tables — each has specific solutions.
  • Testing on diverse documents and verifying output reading order is essential before deploying layout analysis in production.

Frequently Asked Questions

Can I extract layout information without full text extraction?

Yes. Ask the model to output a JSON with region descriptions and bounding box estimates without extracting all the text. This is useful if you want to understand a document's structure before committing to full extraction, or if you're building a layout-classification system.
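A layout-only request might look like the sketch below. The prompt wording, the JSON schema (a `regions` list with fractional `bbox` values), and the `invalid_regions` helper are assumptions for illustration. Because the boxes are estimates, it is worth checking that they are at least well-formed before building on them.

```python
# Hypothetical prompt; fractional coordinates avoid depending on the image's pixel size
LAYOUT_ONLY_PROMPT = """Describe the layout of this document image without transcribing its text.
Return JSON: {"regions": [{"name": "...", "description": "...", "bbox": [x1, y1, x2, y2]}]}
Give bbox values as fractions of the page width and height, between 0 and 1."""


def invalid_regions(layout: dict) -> list[str]:
    """Return the names of regions whose estimated bbox is missing or malformed."""
    bad = []
    for region in layout.get("regions", []):
        bbox = region.get("bbox")
        well_formed = (
            isinstance(bbox, list)
            and len(bbox) == 4
            and all(isinstance(value, (int, float)) and 0 <= value <= 1 for value in bbox)
            and bbox[0] < bbox[2]  # left edge before right edge
            and bbox[1] < bbox[3]  # top edge before bottom edge
        )
        if not well_formed:
            bad.append(region.get("name", "<unnamed>"))
    return bad
```

A well-formed box can still be inaccurate, so this catches only structural problems in the reply, not misplaced regions.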

What about documents with very different layouts (variable template)?

If documents vary significantly in layout, spatial prompts are more robust than bounding boxes. Write your prompt to describe layout abstractly ("identify the header region", "locate the footer") rather than relying on fixed pixel coordinates. You may also run a layout classification step first to choose the right extraction template per document type.

How do I handle rotated or skewed documents?

Ideally, pre-process scanned documents to correct rotation and skew before extraction. If you can't, mention rotation in your prompt ("this document appears rotated 90 degrees clockwise; rotate it mentally and extract in normal reading order") — most models can handle this, though accuracy may suffer.

Does layout analysis work for handwritten documents?

Partially. Models can detect spatial regions, but handwriting quality is the limiting factor. The layout analysis itself works; the issue is legibility of handwritten text. If your documents are handwritten, focus on handwriting recognition rather than layout analysis.

How do I measure layout extraction quality?

Compare extracted text to a gold-standard manual transcription. Metrics: word error rate (WER) for accuracy, reading order correctness (percentage of text sequences in correct order), and region classification accuracy (did the model identify region boundaries correctly?). A combination of these gives you a holistic view.
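Two of these metrics are simple enough to sketch with the standard library. WER below is the usual word-level edit distance divided by the reference length. The reading-order score is one possible definition, assumed here for illustration: the fraction of adjacent reference segments that appear in the same relative order in the extracted text. Both function names are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    # previous_row[j] = edit distance between the reference prefix so far and hyp_words[:j]
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current_row.append(
                min(
                    previous_row[j] + 1,  # deletion
                    current_row[j - 1] + 1,  # insertion
                    previous_row[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous_row = current_row
    return previous_row[-1] / max(len(ref_words), 1)


def reading_order_score(reference_segments: list[str], extracted_text: str) -> float:
    """Fraction of adjacent reference segments found in the same order in the extraction."""
    positions = [extracted_text.find(segment) for segment in reference_segments]
    found = [position for position in positions if position != -1]
    pairs = list(zip(found, found[1:]))
    if not pairs:
        return 0.0
    return sum(first < second for first, second in pairs) / len(pairs)
```

The reading-order score relies on exact substring matches, so pick short, distinctive segments (such as section headings) from the gold transcription. Segments that are not found are skipped, which means this score should be read alongside WER rather than alone.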

Further Reading