Invoice and Receipt Extraction: Automate Data Capture
Invoice and receipt extraction is one of the most valuable applications of document AI. Organizations process millions of invoices annually — from vendors, suppliers, and contractors — and manual entry is a major bottleneck. A single invoice extraction system can automate hours of data entry, reduce errors, and enable real-time spending visibility. The challenge is that invoices vary dramatically in format: some have clear line-item tables, others list items in text, some have nested discounts and taxes, and many include logos, QR codes, and other non-text elements.
I've deployed invoice extraction systems processing 50,000+ documents monthly. The difference between a naive extraction (which captures total amount and nothing else) and a robust system (which extracts vendor, all line items, taxes, discounts, and payment terms) translates directly to downstream efficiency.
Invoice Structure: What to Extract
A typical invoice contains:
- Metadata: Vendor name, vendor address, vendor tax ID, invoice number, invoice date, due date
- Customer info: Customer/bill-to company, contact person, address
- Line items: Product/service description, quantity, unit price, total per item
- Totals section: Subtotal, tax (itemized by type), discounts, final total
- Payment info: Payment terms, payment method, bank details
Real invoices vary widely. Some omit customer details. Some have multiple tax rates. Some include shipping costs. Build a flexible schema that captures the most important fields and marks optional ones as such.
Invoice Schema and Extraction
Building an Invoice Schema
from dataclasses import dataclass
from typing import Optional, List
from decimal import Decimal
@dataclass
class LineItem:
description: str
quantity: Decimal
unit_price: Decimal
total: Decimal
tax_rate: Optional[float] = None
@dataclass
class TaxBreakdown:
tax_type: str # "Sales Tax", "VAT", "GST", etc.
amount: Decimal
@dataclass
class Invoice:
# Metadata
invoice_number: str
invoice_date: str # ISO 8601
due_date: Optional[str]
# Vendor info
vendor_name: str
vendor_address: Optional[str]
vendor_tax_id: Optional[str]
# Customer info
customer_name: Optional[str]
customer_address: Optional[str]
# Line items and totals
line_items: List[LineItem]
subtotal: Decimal
tax_breakdown: List[TaxBreakdown]
total_tax: Decimal
discount_amount: Optional[Decimal] = None
final_total: Decimal
# Payment terms
payment_terms: Optional[str] = None
def to_dict(self) -> dict:
"""Convert to JSON-serializable dict."""
return {
"invoice_number": self.invoice_number,
"invoice_date": self.invoice_date,
"due_date": self.due_date,
"vendor": {
"name": self.vendor_name,
"address": self.vendor_address,
"tax_id": self.vendor_tax_id
},
"customer": {
"name": self.customer_name,
"address": self.customer_address
},
"line_items": [
{
"description": item.description,
"quantity": float(item.quantity),
"unit_price": float(item.unit_price),
"total": float(item.total),
"tax_rate": item.tax_rate
}
for item in self.line_items
],
"totals": {
"subtotal": float(self.subtotal),
"tax_breakdown": [
{"type": tax.tax_type, "amount": float(tax.amount)}
for tax in self.tax_breakdown
],
"total_tax": float(self.total_tax),
"discount": float(self.discount_amount) if self.discount_amount else None,
"final_total": float(self.final_total)
},
"payment_terms": self.payment_terms
}
Extraction Prompt for Invoices
import anthropic
import base64
import json
from pathlib import Path
def extract_invoice(image_path: str) -> dict:
"""
Extract invoice data from an image.
"""
client = anthropic.Anthropic()
image_data = Path(image_path).read_bytes()
base64_image = base64.standard_b64encode(image_data).decode("utf-8")
invoice_prompt = """Analyze this invoice image and extract the following data as JSON:
{
"invoice_number": "invoice or reference number",
"invoice_date": "date in YYYY-MM-DD format",
"due_date": "date in YYYY-MM-DD format or null",
"vendor": {
"name": "vendor company name",
"address": "full address or null",
"tax_id": "VAT/Tax ID or null"
},
"customer": {
"name": "customer/bill-to company name or null",
"address": "customer address or null"
},
"line_items": [
{
"description": "item description",
"quantity": number,
"unit_price": number (without currency symbol),
"total": number
}
],
"subtotal": number,
"tax_breakdown": [
{
"type": "Sales Tax, VAT, GST, etc.",
"rate": percentage (e.g., 0.08 for 8%) or null,
"amount": number
}
],
"discount_amount": number or null,
"final_total": number,
"payment_terms": "Net 30, Due on Receipt, etc. or null"
}
Important instructions:
- Extract EXACTLY as shown; do not add extra fields
- All amounts as numbers without currency symbols
- Dates in ISO 8601 (YYYY-MM-DD) format
- If a field is missing or unclear, use null
- For line items, extract EVERY item in the invoice
- Calculate missing values if needed (e.g., if subtotal is missing, sum line items)
- Preserve the original quantity/price (do not calculate per-item if bundled)
- If there are discounts, extract as a separate line item or in "discount_amount"
- Return ONLY valid JSON, no markdown or explanation"""
response = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=2048,
messages=[
{
"role": "user",
"content": [
{
"type": "image",
"source": {
"type": "base64",
"media_type": "image/jpeg",
"data": base64_image
}
},
{
"type": "text",
"text": invoice_prompt
}
]
}
]
)
return json.loads(response.content[0].text)
# Example usage
invoice_data = extract_invoice("invoice_sample.jpg")
print(json.dumps(invoice_data, indent=2))
Validation and Reconciliation
After extraction, validate the invoice for consistency:
def validate_invoice(data: dict) -> tuple[bool, list[str]]:
"""
Validate invoice data for consistency.
"""
errors = []
# Required fields
if not data.get("invoice_number"):
errors.append("Invoice number is missing")
if not data.get("invoice_date"):
errors.append("Invoice date is missing")
if not data.get("vendor", {}).get("name"):
errors.append("Vendor name is missing")
# Validate line items
line_items = data.get("line_items", [])
if not line_items:
errors.append("No line items found")
# Calculate expected subtotal and verify
calculated_subtotal = sum(
item.get("total", 0) for item in line_items
)
reported_subtotal = data.get("subtotal", 0)
if abs(calculated_subtotal - reported_subtotal) > 0.01:
errors.append(
f"Subtotal mismatch: calculated {calculated_subtotal}, "
f"reported {reported_subtotal}"
)
# Validate final total
tax_total = sum(
tax.get("amount", 0) for tax in data.get("tax_breakdown", [])
)
discount = data.get("discount_amount", 0) or 0
calculated_total = reported_subtotal + tax_total - discount
reported_total = data.get("final_total", 0)
if abs(calculated_total - reported_total) > 0.01:
errors.append(
f"Total mismatch: calculated {calculated_total}, "
f"reported {reported_total}"
)
# Validate dates
from datetime import datetime
try:
datetime.fromisoformat(data.get("invoice_date", ""))
except (ValueError, TypeError):
errors.append(f"Invalid invoice date format: {data.get('invoice_date')}")
if data.get("due_date"):
try:
datetime.fromisoformat(data.get("due_date"))
except (ValueError, TypeError):
errors.append(f"Invalid due date format: {data.get('due_date')}")
return len(errors) == 0, errors
# Example usage
is_valid, errors = validate_invoice(invoice_data)
if not is_valid:
print("Validation errors:")
for error in errors:
print(f" - {error}")
else:
print("Invoice is valid!")
Handling Complex Invoices
Multi-Currency Invoices
Some invoices list prices in multiple currencies. Extract each line item with its currency:
def extract_multicurrency_invoice(image_path: str) -> dict:
"""Extract invoices with multiple currencies."""
client = anthropic.Anthropic()
image_data = Path(image_path).read_bytes()
base64_image = base64.standard_b64encode(image_data).decode("utf-8")
prompt = """Extract invoice data. If the invoice has multiple currencies:
- Include 'currency' field for each line item (e.g., "USD", "EUR")
- Include 'currency' in the totals section
- Extract exchange rates if present
JSON format:
{
"line_items": [
{"description": "...", "quantity": N, "unit_price": N, "total": N, "currency": "USD"},
...
],
"totals": {"subtotal": N, "currency": "USD", "exchange_rate_to_primary": N or null}
}"""
response = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=2048,
messages=[
{
"role": "user",
"content": [
{
"type": "image",
"source": {
"type": "base64",
"media_type": "image/jpeg",
"data": base64_image
}
},
{
"type": "text",
"text": prompt
}
]
}
]
)
return json.loads(response.content[0].text)
Recurring Invoices and Subscriptions
For subscription-based services, extract billing period and recurrence information:
def extract_subscription_invoice(image_path: str) -> dict:
"""Extract subscription/recurring billing invoices."""
client = anthropic.Anthropic()
image_data = Path(image_path).read_bytes()
base64_image = base64.standard_b64encode(image_data).decode("utf-8")
prompt = """Extract subscription invoice data:
{
"invoice_number": "...",
"invoice_date": "YYYY-MM-DD",
"billing_period_start": "YYYY-MM-DD",
"billing_period_end": "YYYY-MM-DD",
"subscription_name": "service name",
"monthly_amount": number,
"billing_frequency": "monthly, quarterly, annual, etc.",
"next_billing_date": "YYYY-MM-DD or null",
"line_items": [...]
}"""
response = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=1024,
messages=[
{
"role": "user",
"content": [
{
"type": "image",
"source": {
"type": "base64",
"media_type": "image/jpeg",
"data": base64_image
}
},
{
"type": "text",
"text": prompt
}
]
}
]
)
return json.loads(response.content[0].text)
Key Takeaways
- Invoice extraction automates data entry and integrates with accounting systems; the most common document AI use case.
- Standard invoices contain: metadata (invoice number, date), vendor/customer info, line items, totals, payment terms.
- Always validate extracted invoices for mathematical consistency (subtotal, tax, discount, final total).
- Real invoices vary widely; build flexible schemas that handle multi-currency, multi-tax, discount, and subscription variations.
- Post-extraction, route high-confidence extractions directly to accounting systems; flag low-confidence or invalid invoices for human review.
Frequently Asked Questions
How do I handle invoices with discounts applied to specific line items vs. the whole invoice?
Ask the model to differentiate: line-item discounts go in the line item; invoice-level discounts go in a separate "discount_amount" field. If ambiguous, extract both and flag for review. Document the discount type in the JSON for clarity.
What if the invoice is handwritten or poorly scanned?
Handwritten invoices are much harder. If you must process them, consider: (1) Ask OCR preprocessing before sending to the model, (2) lower confidence thresholds and route more to human review, (3) combine vision extraction with traditional OCR for critical fields. For poor scans, enhance contrast and brightness before extraction.
How do I handle invoices with multiple pages?
Process each page separately, extract line items from each, then merge. In the final step, validate that the total matches the sum of all pages. For multi-page invoices, consider passing all pages in a single request (if token limits allow) with context: "This is page 1 of 3. Continue extracting line items across pages."
Can I extract invoices from PDF files directly?
PDFs vary: some have embedded text (easier), others are image-based (scans). Convert PDFs to images first (one image per page), then use extraction. Libraries like PyPDF2 or pdfplumber can help. For image-based PDFs, use PIL/Pillow to convert to JPG/PNG.
What about vendor-specific invoices with custom layouts?
For high-volume vendors with consistent layouts, consider: (1) Bounding box extraction for key fields (vendor address, invoice number), (2) custom prompts tuned to their format, (3) hybrid approaches (structure + learned extraction). Template-based extraction can be more efficient than generic extraction for consistent formats.