Skip to main content

Table extraction and structured data: Preserve tabular RAG information

Tables are dense repositories of facts, but naive chunking converts them into meaningless lists of values. A table with 20 rows and 5 columns becomes 100 individual tokens scattered across a chunk, losing relationships like "Revenue in Q3 2025 was $4.2M." For RAG systems, preserving table structure is critical: financial reports, technical specifications, and comparative analyses all depend on tabular data coherence.

This article covers extracting tables from multiple formats (PDF, Excel, HTML), preserving semantic structure, and designing chunks that keep related rows and columns together. The goal is chunks that respect column headers, maintain row identity, and enable queries like "What was the budget for the marketing department?"

Why Tables Require Special Treatment

Standard text chunking ignores table boundaries. If a table spans a 512-token chunk boundary, you get:

  • Row N split between two chunks (context loss)
  • Column headers misaligned with data (semantic confusion)
  • Metadata rows (units, sources) separated from values

A 2024 study by Li et al. found that tables incorrectly chunked reduce RAG fact-retrieval accuracy by 40–60% compared to table-aware chunking. The solution: detect tables, chunk them as atomic units (or vertically if very large), and preserve headers and structure.

Extracting Tables from PDFs

As covered in Article 2, pdfplumber preserves table structure. But extracting and chunking tables correctly requires additional logic.

import pdfplumber
import pandas as pd

def extract_tables_from_pdf(file_path: str) -> list[dict]:
"""Extract tables from a PDF and return as structured chunks."""
table_chunks = []

with pdfplumber.open(file_path) as pdf:
for page_num, page in enumerate(pdf.pages, start=1):
tables = page.extract_tables()

if tables:
for table_idx, table in enumerate(tables):
# Convert to DataFrame for easier manipulation
df = pd.DataFrame(table[1:], columns=table[0])

# Convert to markdown table (readable by LLM and embedded as context)
markdown_table = df.to_markdown(index=False)

table_chunks.append({
"text": markdown_table,
"type": "table",
"page": page_num,
"table_idx": table_idx,
"source": file_path,
"rows": len(df),
"columns": len(df.columns)
})

return table_chunks

# Usage
chunks = extract_tables_from_pdf("financial_report.pdf")
for chunk in chunks:
print(f"Table: {chunk['rows']} rows, {chunk['columns']} columns")
print(chunk['text'])

Extracting Tables from Excel and CSV

Excel and CSV files are structured by design. Extract sheets and preserve column types (dates, numbers, text) for better context.

import pandas as pd

def extract_tables_from_excel(file_path: str) -> list[dict]:
"""Extract all sheets from an Excel file as RAG chunks."""
table_chunks = []

# Read all sheets
excel_file = pd.ExcelFile(file_path)

for sheet_name in excel_file.sheet_names:
df = pd.read_excel(file_path, sheet_name=sheet_name)

# Skip empty sheets
if df.empty:
continue

# Preserve types: format dates, round floats
for col in df.columns:
if df[col].dtype == 'datetime64[ns]':
df[col] = df[col].dt.strftime('%Y-%m-%d')
elif df[col].dtype in ['float64', 'float32']:
df[col] = df[col].round(2)

# Convert to markdown table
markdown_table = df.to_markdown(index=False)

table_chunks.append({
"text": markdown_table,
"type": "table",
"sheet": sheet_name,
"source": file_path,
"rows": len(df),
"columns": len(df.columns)
})

return table_chunks

# Usage
chunks = extract_tables_from_excel("sales_data.xlsx")

Chunking Large Tables

A table with 1,000 rows cannot fit in a single chunk. Strategies for splitting:

1. Vertical Split (by columns): If a table has many columns, split into sub-tables.

def split_table_vertically(df: pd.DataFrame, max_cols: int = 4) -> list[pd.DataFrame]:
"""Split a wide table into narrower sub-tables, preserving key identifiers."""
# Keep first column (usually ID/name) in all sub-tables
key_col = df.iloc[:, 0]
result_dfs = [pd.concat([key_col, df.iloc[:, 1:max_cols]], axis=1)]

# Split remaining columns
for i in range(max_cols, len(df.columns), max_cols - 1):
result_dfs.append(
pd.concat([key_col, df.iloc[:, i:i+max_cols-1]], axis=1)
)

return result_dfs

2. Horizontal Split (by rows): If a table has many rows, split into summary + detail chunks.

def split_table_horizontally(df: pd.DataFrame, rows_per_chunk: int = 50) -> list[str]:
"""Split a tall table into multiple chunks, each with headers."""
chunks = []

for i in range(0, len(df), rows_per_chunk):
sub_df = df.iloc[i:i+rows_per_chunk]
markdown = sub_df.to_markdown(index=False)
chunks.append(markdown)

return chunks

3. Aggregate + Detail: For fact-heavy tables, create a summary chunk (totals, averages) and detail chunks (individual rows or row groups).

def summarize_large_table(df: pd.DataFrame) -> dict:
"""Create summary and detail chunks for a large numeric table."""
# Generate summary (totals, counts, averages)
summary_data = {
'total_rows': len(df),
'numeric_columns': df.select_dtypes(include=['number']).sum().to_dict()
}

summary_text = "## Summary\n\n"
summary_text += f"Total rows: {summary_data['total_rows']}\n\n"
for col, total in summary_data['numeric_columns'].items():
summary_text += f"Total {col}: {total}\n"

return {
"summary_chunk": summary_text,
"detail_chunks": split_table_horizontally(df, rows_per_chunk=50)
}

Handling Messy Tables (Merged Cells, Subheaders)

Real-world tables have merged cells, multi-level headers, and irregular structure. Clean before extracting.

def clean_merged_cell_table(df: pd.DataFrame) -> pd.DataFrame:
"""Handle tables with merged cells and multi-level headers."""
# Forward-fill empty cells (common in merged cell scenarios)
df = df.fillna(method='ffill')

# Flatten multi-level columns if present
if isinstance(df.columns, pd.MultiIndex):
df.columns = [' - '.join(col).strip() for col in df.columns.values]

# Remove rows that are all NaN or contain only metadata
df = df.dropna(how='all')

# Remove obvious header/footer rows (all text, no numbers)
numeric_cols = df.select_dtypes(include=['number']).columns
if len(numeric_cols) > 0:
# Keep rows that have at least one numeric value
df = df[df[numeric_cols].notna().any(axis=1)]

return df

Table Detection in Mixed Documents

When ingesting documents with both text and tables, detect and handle tables separately.

def ingest_mixed_document(file_path: str) -> list[dict]:
"""Ingest a document, separating tables from text."""
all_chunks = []

# Extract tables
table_chunks = extract_tables_from_pdf(file_path) if file_path.endswith('.pdf') \
else extract_tables_from_excel(file_path)
all_chunks.extend(table_chunks)

# Extract text (excluding table regions)
# For PDFs, use pdfplumber's layout analysis to skip table areas
if file_path.endswith('.pdf'):
import pdfplumber
with pdfplumber.open(file_path) as pdf:
for page in pdf.pages:
# Get text, excluding table bounding boxes
text = page.extract_text()
if text.strip(): # Skip pages that are just tables
all_chunks.append({
"text": text,
"type": "text",
"source": file_path
})

return all_chunks

Comparison of Table Extraction Methods

MethodAccuracyPreserves StructureSpeedCost
pdfplumber (PDF)85–95%YesFastFree
pandas read_excel95–99%YesFastFree
Camelot (PDF)80–90%Yes (advanced)ModerateFree
AWS Textract90–98%ExcellentModerate$ per-page
Claude Vision95–99%ExcellentSlow$ per-image

Key Takeaways

  • Tables require special handling: naive chunking destroys relationships between rows and columns.
  • Extract tables as markdown to preserve structure and readability for LLMs.
  • For large tables (1000+ rows), split vertically (by columns) or horizontally (by rows) while preserving headers.
  • Clean messy tables (merged cells, multi-level headers) before extracting using pandas utilities.
  • Separate table extraction from text extraction to avoid chunking tabular data as prose.

Frequently Asked Questions

What's the best format to preserve tables in chunks—markdown, HTML, or JSON?

Markdown tables are best: they're human-readable, LLM-friendly, and compact. HTML tables are verbose; JSON arrays lose visual structure. Use markdown: | Col1 | Col2 | ... |.

How large can a table be in a single chunk?

A table with 50 rows × 5 columns is approximately 250–400 tokens; fits comfortably in a 1,024-token chunk. Tables with 200+ rows should be split using vertical or horizontal chunking to maintain semantic coherence.

Should I include row numbers or indices in table chunks?

Yes, if the table uses row numbers for reference (e.g., SEC filings, numbered lists). Include as the first column: | # | Name | Value |. If rows lack inherent IDs, add them using pandas reset_index().

How do I preserve units (currency, percentages) when extracting tables?

Include units in the column header: | Revenue (USD) | Margin (%) | or in the cell values themselves: 4.2M USD, 15.3%. Avoid stripping units; LLMs need explicit units to avoid confusion.

Can I use vision APIs to extract tables more accurately?

Yes. Claude's vision API or other vision models can extract tables with higher accuracy (95–99%) from complex PDFs, especially scanned documents. Use for mission-critical tables; use pdfplumber for automated batch processing.

Further Reading