
PDF parsing for RAG: Extract text and metadata accurately

PDF extraction is notoriously unreliable. PDFs can be text-based (searchable) or scanned (page images that need OCR), and they may contain complex layouts, embedded fonts, tables, and metadata scattered across pages. A production RAG system must extract text and preserve document structure so chunks remain coherent and queryable. This article covers text extraction libraries, layout detection, table preservation, and when to use vision models.

The challenge is that PDFs are inherently layout-focused: they describe where glyphs are drawn on a page, not the order in which the text should be read. Text extraction flattens that layout into a linear sequence, so a simple extraction might reorder columns, merge table cells, or drop headers entirely. Solving this requires choosing the right extraction tool for your PDF type.

Text-Based vs. Scanned PDFs

Text-based PDFs contain embedded font and text objects; extraction is fast and accurate. Scanned PDFs are images; they require OCR. Most corporate documents, research papers, and recent software documentation are text-based. Older books, faxes, and handwritten documents are scanned.

# Detect whether a PDF is text-based or scanned using PyPDF2
import PyPDF2

def is_pdf_searchable(file_path: str, min_chars: int = 100) -> bool:
    """Check if a PDF contains extractable text (is text-based)."""
    with open(file_path, 'rb') as f:
        reader = PyPDF2.PdfReader(f)
        if len(reader.pages) == 0:
            return False
        first_page = reader.pages[0]
        # Guard against an empty result on image-only pages
        text = first_page.extract_text() or ""
    # If extract_text returns meaningful content, treat the PDF as text-based.
    # Only the first page is sampled, so a scanned cover page can mislead this check.
    return len(text.strip()) > min_chars

# Usage
if is_pdf_searchable("document.pdf"):
    print("Text-based PDF: use PyPDF2 or pdfplumber")
else:
    print("Scanned PDF: use OCR (Tesseract, AWS Textract)")

For text-based PDFs, pdfplumber is a widely used choice for RAG ingestion. It exposes tables, bounding boxes, and layout relationships. For scanned PDFs, use pytesseract (local OCR) or a cloud service like AWS Textract (typically higher accuracy on complex pages, at a higher cost).
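Checking only the first page can misclassify documents that mix text pages and scanned pages. The following is a minimal sketch of a per-page check, assuming you have already collected one extracted-text string per page (for example from `page.extract_text()`). The function name `classify_pdf_pages` and the 100-character threshold are illustrative choices, not part of any library.

```python
def classify_pdf_pages(page_texts, min_chars=100):
    """Label a document 'text', 'scanned', or 'mixed' from per-page extracted text.

    page_texts: one entry per page; None or "" means nothing was extracted.
    min_chars: hypothetical threshold below which a page is treated as scanned.
    """
    # 1-based page numbers whose extracted text is too short to be useful
    needs_ocr = [
        page_num
        for page_num, text in enumerate(page_texts, start=1)
        if len((text or "").strip()) < min_chars
    ]
    if not needs_ocr:
        kind = "text"
    elif len(needs_ocr) == len(page_texts):
        kind = "scanned"
    else:
        kind = "mixed"
    return {"kind": kind, "ocr_pages": needs_ocr}
```

For a "mixed" result, only the pages listed in `ocr_pages` need to go through OCR; the rest can use direct text extraction.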

Extracting Text with pdfplumber

pdfplumber is built on pdfminer.six and excels at preserving structure. Unlike PyPDF2 (which extracts raw text) or pdf2image (which only rasterizes pages to images and extracts no text), pdfplumber recovers object coordinates, tables, and layout.

import pdfplumber

def extract_pdf_with_structure(file_path: str) -> list[dict]:
    """Extract text, tables, and layout from a PDF using pdfplumber."""
    chunks = []
    with pdfplumber.open(file_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            # Extract main text (fall back to "" if nothing is extractable)
            text = page.extract_text() or ""

            # Extract tables (preserve as formatted text)
            tables = page.extract_tables()
            table_text = ""
            for table in tables:
                if not table:
                    continue
                # Convert table to markdown for better chunk readability;
                # empty cells come back as None, so map them to ""
                header = [str(cell or "") for cell in table[0]]
                table_text += "| " + " | ".join(header) + " |\n"
                table_text += "|" + " --- |" * len(header) + "\n"
                for row in table[1:]:
                    table_text += "| " + " | ".join(str(cell or "") for cell in row) + " |\n"

            # Combine text and tables (table cells also appear in `text`)
            full_content = text + "\n\n" + table_text if table_text else text

            chunks.append({
                "text": full_content,
                "page": page_num,
                "source": file_path,
                "has_tables": len(tables) > 0,
                "format": "pdf"
            })
    return chunks

pdfplumber also provides bounding box coordinates (page.chars, page.rects) enabling layout-aware chunking (see Article 4 on table extraction for details).

Handling Metadata and Document Structure

PDFs often contain metadata: author, creation date, title, and embedded bookmarks (outlines). Preserve this as chunk metadata for filtering and ranking.

import PyPDF2

def extract_pdf_metadata(file_path: str) -> dict:
    """Extract PDF metadata: title, author, subject, creation date, page count."""
    with open(file_path, 'rb') as f:
        reader = PyPDF2.PdfReader(f)
        # metadata is None when the PDF has no document information dictionary
        metadata = reader.metadata or {}

        return {
            "title": metadata.get("/Title", ""),
            "author": metadata.get("/Author", ""),
            "subject": metadata.get("/Subject", ""),
            # Raw PDF date string, e.g. "D:20240115093000+02'00'"
            "created": metadata.get("/CreationDate", ""),
            "pages": len(reader.pages),
            "file": file_path
        }

# Attach metadata to chunks
meta = extract_pdf_metadata("whitepaper.pdf")
chunks = extract_pdf_with_structure("whitepaper.pdf")
for chunk in chunks:
    chunk.update({
        "doc_title": meta["title"],
        "doc_author": meta["author"],
        "doc_created": meta["created"]
    })
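The `/CreationDate` value is a raw PDF date string of the form `D:YYYYMMDDHHmmSSOHH'mm'`, which is awkward to filter or sort on. Below is a rough sketch of converting it to a `datetime`, using only the standard library. The helper name `parse_pdf_date` is hypothetical, and real-world files sometimes contain malformed dates, so the function returns `None` when the string does not parse.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYY[MM[DD[HH[mm[SS]]]]] followed by an optional Z, +HH'mm' or -HH'mm'
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+-])(?:(\d{2})'?(\d{2})?'?)?)?$"
)

def parse_pdf_date(raw):
    """Convert a PDF date string to a datetime, or None if it is malformed."""
    match = _PDF_DATE.match(raw.strip()) if raw else None
    if match is None:
        return None
    # Missing fields default to the start of the period (January, day 1, 00:00:00)
    defaults = (0, 1, 1, 0, 0, 0)
    parts = [int(g) if g else d for g, d in zip(match.groups()[:6], defaults)]
    sign, tz_hours, tz_minutes = match.group(7), match.group(8), match.group(9)
    if sign is None:
        tz = None  # no offset given: return a naive datetime
    elif sign in "Zz":
        tz = timezone.utc
    else:
        offset = timedelta(hours=int(tz_hours or 0), minutes=int(tz_minutes or 0))
        tz = timezone(-offset if sign == "-" else offset)
    try:
        return datetime(*parts, tzinfo=tz)
    except ValueError:  # e.g. month 13 or an out-of-range offset
        return None
```

Storing the parsed value (or its ISO 8601 form) as chunk metadata makes date-range filters straightforward in most vector stores.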

OCR for Scanned PDFs

For scanned PDFs, local OCR using pytesseract works for small batches, but cloud OCR (AWS Textract, Google Document AI) is more reliable for production. Textract outputs both text and layout bounding boxes, making it excellent for complex scanned documents.

import pytesseract
from pdf2image import convert_from_path

def ocr_scanned_pdf(file_path: str) -> list[dict]:
    """Extract text from scanned PDFs using Tesseract OCR."""
    chunks = []
    # Rasterize every page; pdf2image needs Poppler installed, and
    # pytesseract needs the Tesseract binary on the PATH
    images = convert_from_path(file_path)

    for page_num, image in enumerate(images, start=1):
        text = pytesseract.image_to_string(image)
        chunks.append({
            "text": text,
            "page": page_num,
            "source": file_path,
            "ocr_used": True,
            "format": "pdf"
        })

    return chunks
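One way to decide which pages to re-run through a cloud service is to look at Tesseract's per-word confidence. The sketch below assumes you obtained word-level results with `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)`, which returns parallel `text` and `conf` lists; the function itself only touches that dictionary. The name `mean_ocr_confidence` and the threshold of 60 are illustrative assumptions, not recommended values.

```python
def mean_ocr_confidence(ocr_data):
    """Average word confidence (0-100) from a word-level OCR result dict.

    ocr_data: dict with parallel "text" and "conf" lists. Entries with a
    confidence of -1 are layout rows (blocks, lines), not recognized words.
    """
    scores = []
    for text, conf in zip(ocr_data["text"], ocr_data["conf"]):
        conf = float(conf)  # some versions report confidences as strings
        if conf >= 0 and text.strip():
            scores.append(conf)
    return sum(scores) / len(scores) if scores else 0.0


def needs_better_ocr(ocr_data, threshold=60.0):
    """Flag a page whose average confidence falls below a hypothetical threshold."""
    return mean_ocr_confidence(ocr_data) < threshold
```

Pages flagged this way are candidates for cloud OCR, while the rest keep the local result.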

Layout-Aware Chunking for Complex PDFs

For documents with multi-column layouts, sidebars, or dense tables, preserve column and spatial structure. pdfplumber's bounding box data enables this.

import pdfplumber

def layout_aware_extraction(file_path: str) -> list[dict]:
    """Extract content respecting column boundaries (assumes two equal columns)."""
    chunks = []

    with pdfplumber.open(file_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            # Split the page at its horizontal midpoint. Grouping single
            # characters by x-coordinate would scramble the reading order,
            # so each column region is extracted as a unit instead.
            x0, top, x1, bottom = page.bbox
            mid = (x0 + x1) / 2
            columns = [
                page.crop((x0, top, mid, bottom)),
                page.crop((mid, top, x1, bottom)),
            ]

            # Combine columns from left to right.
            # Note: full-width headings are split across both halves.
            full_text = "\n".join(col.extract_text() or "" for col in columns)

            chunks.append({
                "text": full_text,
                "page": page_num,
                "source": file_path,
                "layout_aware": True
            })

    return chunks
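Columns are not always equal in width, so it can help to locate the gutter from the word positions themselves. The following sketch works on plain dictionaries shaped like the output of pdfplumber's `page.extract_words()` (keys `text`, `x0`, `x1`, `top`). The function names and the 15-point minimum gap are assumptions for illustration, and the approach falls back to single-column reading when a full-width element bridges the gutter.

```python
def find_gutter(words, min_gap=15.0):
    """Return the x position of the widest empty vertical band, or None."""
    best_gap, gutter, right_edge = 0.0, None, None
    for x0, x1 in sorted((w["x0"], w["x1"]) for w in words):
        # Gap between this word and everything seen so far to its left
        if right_edge is not None and x0 - right_edge > best_gap:
            best_gap, gutter = x0 - right_edge, (x0 + right_edge) / 2
        right_edge = x1 if right_edge is None else max(right_edge, x1)
    return gutter if best_gap >= min_gap else None


def words_to_lines(words, tolerance=3.0):
    """Join words into text lines, top to bottom, then left to right."""
    lines = []  # each entry is [line_top, [words on that line]]
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(word["top"] - lines[-1][0]) <= tolerance:
            lines[-1][1].append(word)
        else:
            lines.append([word["top"], [word]])
    return "\n".join(
        " ".join(w["text"] for w in sorted(ws, key=lambda w: w["x0"]))
        for _, ws in lines
    )


def read_in_column_order(words, min_gap=15.0):
    """Read the left column fully before the right one when a gutter exists."""
    gutter = find_gutter(words, min_gap)
    if gutter is None:
        return words_to_lines(words)
    left = [w for w in words if w["x1"] <= gutter]
    right = [w for w in words if w["x0"] >= gutter]
    return words_to_lines(left) + "\n" + words_to_lines(right)
```

Calling `words_to_lines` on the whole page reproduces the naive behaviour, where lines from both columns are interleaved; `read_in_column_order` keeps each column's text contiguous.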

Comparison of PDF Extraction Tools

| Tool | Speed | Layout | Tables | Cost | Best For |
| --- | --- | --- | --- | --- | --- |
| PyPDF2 | Very fast | Poor | No | Free | Quick extraction, text-only |
| pdfplumber | Fast | Good | Yes (partial) | Free | Most text-based PDFs |
| pdfminer.six | Fast | Good | No | Free | Precise text coordinates |
| AWS Textract | Moderate | Excellent | Yes | Per-page $ | Scanned docs, complex layouts |
| Vision LLM (Claude) | Slow | Excellent | Yes | Per-image $ | Maximum quality, strategic docs |
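The comparison can be turned into a simple routing rule. This is only a sketch of one possible policy: the function name `choose_extractor`, its flags, and the returned labels are hypothetical, and the right thresholds depend on your documents and budget.

```python
def choose_extractor(searchable, text_only=False, production=False, high_stakes=False):
    """Suggest an extraction tool from coarse document traits (heuristic only).

    searchable: the PDF has an extractable text layer.
    text_only: no tables or layout need to be preserved.
    production: accuracy matters more than per-page cost.
    high_stakes: a small set of documents where maximum quality justifies the cost.
    """
    if high_stakes:
        return "vision-llm"
    if not searchable:
        # Scanned input always needs OCR; pick local or cloud by setting
        return "aws-textract" if production else "pytesseract"
    return "pypdf2" if text_only else "pdfplumber"
```

In an ingestion pipeline, the `searchable` flag would typically come from a check like `is_pdf_searchable` shown earlier.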

Key Takeaways

  • Distinguish between text-based PDFs (use pdfplumber) and scanned PDFs (use OCR).
  • pdfplumber preserves layout, bounding boxes, and tables—ideal for RAG ingestion.
  • Extract and attach document metadata (title, author, creation date) for filtering and ranking.
  • For scanned PDFs, use local OCR (pytesseract) for prototypes and cloud OCR (AWS Textract) for production.
  • Layout-aware extraction using bounding boxes prevents column merging and preserves spatial structure.

Frequently Asked Questions

Should I use pdfplumber or PyPDF2?

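Use pdfplumber for RAG. It's built on pdfminer.six and preserves tables and layout. PyPDF2 is fast but loses structure, making it suitable only for simple text-only PDFs where you don't care about organization. Note that PyPDF2 is no longer developed under that name; the project continues as pypdf, so prefer pypdf for new code.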

How do I handle PDFs with images of text?

Use OCR. For small batches, pytesseract (local) is free; for production, AWS Textract or Google Document AI provide better accuracy and layout recovery. Vision LLMs (Claude's vision API) offer maximum quality but are slower and costlier.

What if pdfplumber fails to extract text?

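The PDF may be encrypted, corrupted, or purely image-based. Try opening it in a PDF reader to confirm it's readable. If it is encrypted, pass the password when opening it: pdfplumber.open(path, password=...). If it is readable but extraction still fails, the cause is likely a layout or font-encoding quirk. Inspect page.chars and page.lines directly, or use pdfplumber's visual debugging (page.to_image(), plus debug_tablefinder() for tables) to see which objects were detected.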
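In a pipeline, a failed or empty extraction is usually handled by falling through to the next tool. Here is a minimal sketch of such a chain, assuming each extractor is a callable that takes a file path and returns a list of chunk dictionaries with a `text` key (the shape used throughout this article). The name `extract_with_fallback` is illustrative.

```python
def extract_with_fallback(file_path, extractors, min_chars=1):
    """Try each (name, extractor) pair in order; return the first usable result."""
    errors = {}
    for name, extractor in extractors:
        try:
            chunks = extractor(file_path)
        except Exception as exc:  # e.g. encrypted, corrupted, or unsupported file
            errors[name] = repr(exc)
            continue
        # Count non-whitespace text across all chunks
        total = sum(len((chunk.get("text") or "").strip()) for chunk in chunks)
        if total >= min_chars:
            return {"extractor": name, "chunks": chunks, "errors": errors}
        errors[name] = "no text extracted"
    return {"extractor": None, "chunks": [], "errors": errors}
```

A typical call would pass `[("pdfplumber", extract_pdf_with_structure), ("tesseract", ocr_scanned_pdf)]`, so image-only files end up in OCR and the `errors` entry records why the first attempt was skipped.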

Should I preserve or flatten tables in chunks?

Preserve tables as markdown or HTML inside chunks. Flattening (converting to prose) loses semantic structure. A markdown table like | Col1 | Col2 | is both human-readable and LLM-friendly.
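Extracted tables often contain empty cells, line breaks inside cells, and rows of uneven length, all of which break a naive markdown conversion. The helper below is a sketch that handles those cases, assuming the table arrives as a list of rows where each cell is a string or `None` (the shape returned by pdfplumber's `extract_tables()`). The name `table_to_markdown` is hypothetical, and the first row is treated as the header.

```python
def table_to_markdown(rows):
    """Render a list-of-lists table as a markdown table string."""
    rows = [row for row in rows if row]  # drop empty rows
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def clean(cell):
        # None becomes "", internal newlines collapse to spaces, pipes are escaped
        text = "" if cell is None else str(cell)
        return " ".join(text.split()).replace("|", "\\|")

    def render(row):
        # Pad short rows so every line has the same number of columns
        cells = [clean(cell) for cell in row] + [""] * (width - len(row))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [render(header), "| " + " | ".join(["---"] * width) + " |"]
    lines += [render(row) for row in body]
    return "\n".join(lines)
```

Keeping each table in a single chunk, together with a sentence of surrounding context, tends to work better than splitting it mid-row.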

How do I handle multi-language PDFs?

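If the PDF is text-based, extraction generally works regardless of language, provided the embedded fonts map glyphs to Unicode correctly. For OCR, set the Tesseract language: pytesseract.image_to_string(image, lang='fra+eng') for French + English; the matching Tesseract language data files must be installed. Language coverage differs considerably between cloud OCR services, so check each provider's supported-language list before committing to one.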
