Key-Value Pair Extraction: Understanding Document Fields
Key-value pair extraction is the process of identifying labeled fields in a document and extracting their corresponding values. A form might have a field labeled "Applicant Name:" with value "John Smith"; a financial document might have "Total Revenue" next to a number; a contract might have "Agreement Date" paired with a date. These label-value associations are the building blocks of structured data extraction, and they're fundamental to automating document processing in finance, insurance, healthcare, and beyond.
I've built extraction systems for mortgage applications, insurance claims, and vendor forms, and I can tell you: the most robust approach combines spatial awareness (understanding that labels and values are adjacent) with semantic understanding (knowing that "Applicant Name" refers to a person, not a company). Modern vision-language models excel at this task when guided with clear prompts.
Understanding Key-Value Structure
What Is a Key-Value Pair?
A key-value pair consists of:
- Key (label): The field name or identifier (e.g., "Invoice Number", "Date of Birth")
- Value: The corresponding data (e.g., "INV-2026-00123", "1990-05-15")
Keys can be explicit (printed labels) or implicit (derived from context). Values can be simple (single words or numbers) or complex (addresses with multiple lines, lists of items).
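One way to hold these distinctions in code is a small record type. The sketch below is illustrative only: the `KeyValuePair` class, its field names, and the `agreement_date` value are assumptions made for the example, not part of any library.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class KeyValuePair:
    """Hypothetical record type; the class and field names are illustrative."""
    key: str                       # normalized field name, e.g. "invoice_number"
    value: Union[str, list[str]]   # one string, or a list of lines for complex values
    explicit_label: bool = True    # False when the key was inferred from context


pairs = [
    KeyValuePair("invoice_number", "INV-2026-00123"),
    KeyValuePair("date_of_birth", "1990-05-15"),
    # Complex value: a multi-line address kept as separate lines
    KeyValuePair("address", ["123 Main St", "Boston, MA"]),
    # Implicit key: no printed label, inferred from nearby text (made-up value)
    KeyValuePair("agreement_date", "2026-01-15", explicit_label=False),
]

# Collapse to a plain dict once provenance is no longer needed
record = {pair.key: pair.value for pair in pairs}
inferred_keys = [pair.key for pair in pairs if not pair.explicit_label]
print(record)
print(inferred_keys)
```

Keeping the `explicit_label` flag makes it easy to send inferred keys through stricter review than keys read from a printed label.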
Why Key-Value Extraction Matters
Key-value extraction is the foundation for:
- Form automation: Extracting data from application forms, questionnaires, claim forms.
- Document summarization: Pulling key facts (date, parties, amounts) from contracts or reports.
- Data integration: Feeding extracted values directly into databases, CRM systems, or APIs.
- Compliance: Verifying that required fields are present and extracting values for audit trails.
Core Extraction Patterns
Pattern 1: Label-Adjacent Values
The simplest case: a label appears immediately next to or above its value. Examples:
Name: John Smith
Address: 123 Main St, Boston, MA
Phone: (617) 555-0123
Extraction prompt for this pattern:
import anthropic
import base64
import json
from pathlib import Path


def extract_label_value_pairs(image_path: str, expected_fields: list[str]) -> dict:
    """
    Extract key-value pairs where labels are adjacent to values.

    expected_fields: list of field names to look for (e.g., ["name", "address", "phone"])
    """
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    # Base64-encode the image for the Messages API
    image_data = Path(image_path).read_bytes()
    base64_image = base64.standard_b64encode(image_data).decode("utf-8")

    fields_desc = ", ".join(expected_fields)
    # Build the example JSON from the requested fields so it always matches them
    example_json = json.dumps(
        {field.lower(): "extracted value or null" for field in expected_fields},
        indent=2,
    )

    label_value_prompt = f"""Extract key-value pairs from this document.
Look for the following fields: {fields_desc}
For each field, find the label (e.g., "Name:") and extract the corresponding value.
Return as JSON with field names as keys and extracted values as values.
{example_json}
Important:
- Use the exact field names provided (lowercase)
- Extract ONLY the value, not the label
- If a field is not found, use null
- Preserve formatting (e.g., phone numbers with parentheses)
- Use null for empty or illegible fields"""

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/jpeg",  # must match the actual file type
                            "data": base64_image,
                        },
                    },
                    {"type": "text", "text": label_value_prompt},
                ],
            }
        ],
    )
    # Assumes the reply is bare JSON; json.loads raises if there is any extra text
    return json.loads(response.content[0].text)


# Example usage
extracted = extract_label_value_pairs("form.jpg", ["name", "address", "phone"])
print(extracted)
# Example output (actual values depend on the document):
# {'name': 'John Smith', 'address': '123 Main St, Boston, MA', 'phone': '(617) 555-0123'}
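The functions in this article pass `response.content[0].text` straight to `json.loads`, which raises if the reply contains anything besides JSON. Models sometimes wrap JSON in a Markdown code fence or add a sentence around it, so a tolerant parser can help. The following is a minimal sketch; `parse_json_reply` is a hypothetical helper name, and the fallback of taking the outermost braces is an assumption that suits single-object replies only.

```python
import json
import re

FENCE = "`" * 3  # Markdown code fence marker, built here to keep this listing readable


def parse_json_reply(text: str) -> dict:
    """Parse a JSON object from a model reply, tolerating fences and surrounding prose."""
    # Drop a leading fence (optionally tagged "json") and a trailing fence, if present
    pattern = rf"^\s*{FENCE}(?:json)?\s*|\s*{FENCE}\s*$"
    cleaned = re.sub(pattern, "", text.strip())
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Fall back to the span between the first "{" and the last "}"
        start, end = cleaned.find("{"), cleaned.rfind("}")
        if start == -1 or end <= start:
            raise
        return json.loads(cleaned[start:end + 1])


print(parse_json_reply('{"name": "John Smith"}'))
print(parse_json_reply('Here is the result: {"name": "John Smith"} Hope that helps.'))
```

If the fallback also fails, the original `JSONDecodeError` propagates, which is usually the right signal to retry or to flag the document for review.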
Pattern 2: Contextual Values (No Explicit Label)
Sometimes values appear without explicit labels. Context clues indicate what the value represents. For example, a date appearing near the word "Agreement" is likely the agreement date.
def extract_contextual_values(image_path: str, context_hints: dict) -> dict:
    """
    Extract values using context hints when labels are not explicit.

    context_hints: dict like {
        "company_name": "Look near the company logo or letterhead",
        "invoice_date": "Find the date near the word 'Date' or 'Invoice Date'"
    }
    """
    client = anthropic.Anthropic()
    image_data = Path(image_path).read_bytes()
    base64_image = base64.standard_b64encode(image_data).decode("utf-8")

    # Build hints string
    hints_text = "\n".join([f"- {field}: {hint}" for field, hint in context_hints.items()])

    contextual_prompt = f"""Extract values from this document using the following context clues:
{hints_text}
Even if labels are not explicitly printed, look for visual and semantic context to identify values.
Return as JSON with field names as keys.
Example: If you see a date next to "Effective Date" or in a date-like position, extract it as "effective_date"."""

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/jpeg",
                            "data": base64_image,
                        },
                    },
                    {"type": "text", "text": contextual_prompt},
                ],
            }
        ],
    )
    return json.loads(response.content[0].text)


# Example usage
hints = {
    "company_name": "Located in the document header or associated with a logo",
    "agreement_date": "Date field near 'Agreement Date' or 'Date' label",
    "signatory_name": "Name appearing near a signature or 'By:' label",
}
extracted = extract_contextual_values("contract.jpg", hints)
print(extracted)
Pattern 3: Typed Values (Enforce Data Types)
For production systems, you need to ensure extracted values match expected types (dates, numbers, emails, phone numbers). Add type constraints to your prompts:
from dataclasses import dataclass

# Reuses the anthropic, base64, json, and Path imports from Pattern 1


@dataclass
class TypedField:
    name: str
    field_type: str  # "string", "date", "number", "email", "phone"
    required: bool = False


def extract_typed_fields(image_path: str, fields: list[TypedField]) -> dict:
    """
    Extract key-value pairs with type constraints stated in the prompt.

    Returns a dict mapping each field name to {"value", "type", "confidence"}.
    """
    client = anthropic.Anthropic()
    image_data = Path(image_path).read_bytes()
    base64_image = base64.standard_b64encode(image_data).decode("utf-8")

    # Build field schema for prompt
    field_specs = []
    for field in fields:
        required_str = "required" if field.required else "optional"
        field_specs.append(f"- {field.name} ({field.field_type}, {required_str})")
    field_specs_text = "\n".join(field_specs)

    typed_prompt = f"""Extract the following typed fields from this document:
{field_specs_text}
Format requirements:
- date: ISO 8601 format (YYYY-MM-DD)
- number: numeric value (no currency symbols or commas)
- email: valid email address
- phone: 10-digit US format (NNN) NNN-NNNN or international format
- string: text as-is
Return as JSON, mapping each field name to an object with these keys:
{{"field_name": {{"value": "extracted value", "type": "detected_type", "confidence": 0.0-1.0}}}}
If a field cannot be found or is illegible, use null for the value."""

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/jpeg",
                            "data": base64_image,
                        },
                    },
                    {"type": "text", "text": typed_prompt},
                ],
            }
        ],
    )
    return json.loads(response.content[0].text)


# Example usage
fields = [
    TypedField("applicant_name", "string", required=True),
    TypedField("date_of_birth", "date", required=True),
    TypedField("phone_number", "phone", required=False),
    TypedField("annual_income", "number", required=False),
]
extracted = extract_typed_fields("loan_application.jpg", fields)
# Each entry is an object with "value", "type", and "confidence" keys
for field_name, field_data in extracted.items():
    print(f"{field_name}: {field_data['value']} (confidence: {field_data['confidence']})")
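Since each typed field comes back with a confidence score, one possible next step is to accept high-confidence values and send the rest to manual review. The sketch below assumes the per-field shape used in the loop above (`value`, `type`, `confidence`). The `split_by_confidence` name, the 0.8 threshold, and the sample data are all assumptions for illustration. A model's self-reported confidence is only a rough signal, so any threshold should be tuned against labeled documents.

```python
def split_by_confidence(extracted: dict, threshold: float = 0.8) -> tuple[dict, dict]:
    """Split typed extraction output into accepted values and fields needing review."""
    accepted, needs_review = {}, {}
    for name, data in extracted.items():
        confidence = data.get("confidence") or 0.0  # treat a missing score as 0
        if data.get("value") is not None and confidence >= threshold:
            accepted[name] = data["value"]
        else:
            # Keep the full entry so a reviewer can see the value and the score
            needs_review[name] = data
    return accepted, needs_review


# Made-up sample in the shape the typed prompt asks for
sample = {
    "applicant_name": {"value": "Jane Doe", "type": "string", "confidence": 0.97},
    "date_of_birth": {"value": "1990-05-15", "type": "date", "confidence": 0.55},
    "annual_income": {"value": None, "type": "number", "confidence": 0.0},
}
accepted, needs_review = split_by_confidence(sample)
print(accepted)
print(sorted(needs_review))
```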
Validation and Post-Processing
After extraction, validate values against type constraints and domain rules:
import re
from datetime import datetime


def validate_email(email: str) -> bool:
    """Simple email validation (checks the shape only, not deliverability)."""
    pattern = r"^[^@]+@[^@]+\.[^@]+$"
    return isinstance(email, str) and bool(re.match(pattern, email))


def validate_phone(phone: str) -> bool:
    """Validate US phone format (exactly 10 digits after stripping punctuation)."""
    cleaned = re.sub(r"\D", "", str(phone))
    return len(cleaned) == 10


def validate_date(date_str: str) -> bool:
    """Validate ISO 8601 date format."""
    try:
        datetime.fromisoformat(date_str)
        return True
    except (ValueError, TypeError):
        return False


def validate_extracted_fields(fields: dict, schema: dict) -> tuple[bool, list[str]]:
    """
    Validate extracted fields against a schema.

    schema: {"field_name": {"type": "email", "required": True}, ...}
    """
    errors = []
    for field_name, field_schema in schema.items():
        value = fields.get(field_name)
        field_type = field_schema.get("type")
        required = field_schema.get("required", False)

        # Check if required field is present
        if required and (value is None or value == ""):
            errors.append(f"Required field '{field_name}' is missing")
            continue

        # Skip validation if value is empty and not required
        if value is None or value == "":
            continue

        # Type-specific validation
        if field_type == "email" and not validate_email(value):
            errors.append(f"Field '{field_name}' has invalid email format: {value}")
        elif field_type == "phone" and not validate_phone(value):
            errors.append(f"Field '{field_name}' has invalid phone format: {value}")
        elif field_type == "date" and not validate_date(value):
            errors.append(f"Field '{field_name}' has invalid date format: {value}")
        elif field_type == "number":
            try:
                float(value)
            except (ValueError, TypeError):  # TypeError covers lists, dicts, etc.
                errors.append(f"Field '{field_name}' is not numeric: {value}")
    return len(errors) == 0, errors


# Example usage
schema = {
    "applicant_name": {"type": "string", "required": True},
    "email": {"type": "email", "required": True},
    "phone": {"type": "phone", "required": False},
    "income": {"type": "number", "required": False},
}
extracted = {
    "applicant_name": "Jane Doe",
    "email": "jane.doe@example.com",  # placeholder address
    "phone": "(617) 555-0123",
    "income": "75000",
}
is_valid, errors = validate_extracted_fields(extracted, schema)
print(f"Valid: {is_valid}")
if errors:
    print(f"Errors: {errors}")
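Validation says whether a value is acceptable; post-processing brings accepted values into one canonical form. As a rough sketch of the post-processing half, the function below normalizes US phone numbers to the `(NNN) NNN-NNNN` shape. The `normalize_us_phone` name and the decision to drop a leading country code `1` are assumptions for this example.

```python
import re
from typing import Optional


def normalize_us_phone(raw: str) -> Optional[str]:
    """Return a US phone number as (NNN) NNN-NNNN, or None if it is not 10 digits."""
    digits = re.sub(r"\D", "", raw)
    # Assumption: an 11-digit number starting with 1 carries the US country code
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        return None
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"


for raw in ["(617) 555-0123", "617.555.0123", "+1 617 555 0123", "555-0123"]:
    print(raw, "->", normalize_us_phone(raw))
```

Returning `None` instead of raising lets the caller decide whether a malformed optional field is an error or merely a gap.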
Common Key-Value Extraction Challenges
| Challenge | Example | Solution |
|---|---|---|
| Multiline values | Address spread across 3-4 lines | Use contextual prompts; ask model to identify value boundaries |
| Missing labels | Value without a clear label (implicit field) | Use context clues and field position; ask model to infer field type |
| Label-value separation | Label on one line, value on next (not adjacent) | Expand search radius in spatial prompts; accept non-adjacent pairs |
| Repeated fields | Multiple instances of the same field | Specify which instance to extract (first, last, sum) in the prompt |
| Formatting inconsistency | Dates in different formats (MM/DD/YYYY vs DD-MM-YYYY) | Normalize in post-processing; store original format if needed |
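For the formatting-inconsistency row, a date such as 03/04/2026 cannot be normalized safely without knowing which convention the document uses. The sketch below takes the candidate formats as an explicit argument and keeps the original string alongside the ISO form. The `normalize_date` name and the two format tuples are assumptions for illustration.

```python
from datetime import datetime
from typing import Optional

# Assumed per-source format lists; pick one based on where the document comes from
US_FORMATS = ("%m/%d/%Y", "%Y-%m-%d")
EU_FORMATS = ("%d-%m-%Y", "%Y-%m-%d")


def normalize_date(raw: str, formats: tuple[str, ...]) -> Optional[dict]:
    """Try each format in order; return the ISO date plus the original text."""
    for fmt in formats:
        try:
            parsed = datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue  # this format did not match, try the next one
        return {"iso": parsed.date().isoformat(), "original": raw}
    return None  # no format matched; leave the decision to the caller


print(normalize_date("03/04/2026", US_FORMATS))
print(normalize_date("03-04-2026", EU_FORMATS))
print(normalize_date("2026-13-01", US_FORMATS))
```

The same digits map to March 4 under the US list and April 3 under the European one, which is why the format choice is left to the caller rather than guessed.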
Key Takeaways
- Key-value pair extraction identifies labeled fields and extracts their corresponding values; it is the foundation for form automation and document processing.
- Three patterns: label-adjacent (explicit labels), contextual (implicit labels), and typed (with data type constraints).
- Always validate extracted values: type-check, check required fields, and enforce domain rules (e.g., valid email format).
- Post-process and normalize values (dates, phone numbers) to ensure consistency.
- Real-world forms have inconsistencies; robust extraction uses spatial reasoning, context clues, and flexible field matching.
Frequently Asked Questions
How do I extract data from forms with variable layouts?
Use contextual prompts that describe field types rather than fixed positions. Instead of "the name is at position X,Y", ask the model to "identify the field labeled 'Name' or 'Full Name' and extract the value." This is more flexible than bounding boxes for variable layouts.
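A prompt of this kind can be generated from a map of field names to the labels they might appear under. This is a minimal sketch; `build_alias_prompt` and the alias lists are hypothetical, and the wording of the generated prompt is only one reasonable choice.

```python
def build_alias_prompt(field_aliases: dict[str, list[str]]) -> str:
    """Build a layout-independent prompt from field names and their possible labels."""
    lines = [
        "Extract the following fields from this document.",
        "Each field may appear under any of the listed labels, anywhere on the page:",
    ]
    for field, aliases in field_aliases.items():
        quoted = " or ".join(f"'{alias}'" for alias in aliases)
        lines.append(f"- {field}: identify the field labeled {quoted} and extract the value")
    lines.append("Return JSON with the field names as keys; use null for fields that are not found.")
    return "\n".join(lines)


prompt = build_alias_prompt({
    "name": ["Name", "Full Name"],
    "date_of_birth": ["Date of Birth", "DOB"],
})
print(prompt)
```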
What if the same label appears multiple times (e.g., multiple addresses)?
Specify in the prompt which instance to extract: "Extract the FIRST address field" or "Extract all address fields as an array". Alternatively, include context clues to distinguish them: "the Billing Address is in the first section, the Shipping Address in the second."
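Another option is to ask for every instance as an array and choose in code. The sketch below assumes the model returned a list of values for the repeated field; `select_instance`, its mode names, and the sample values are illustrative assumptions.

```python
def select_instance(values: list, mode: str = "first"):
    """Pick one result from a repeated field: first, last, all, or a numeric sum."""
    if not values:
        return None
    if mode == "first":
        return values[0]
    if mode == "last":
        return values[-1]
    if mode == "all":
        return list(values)
    if mode == "sum":
        # Only meaningful for numeric fields such as repeated amounts
        return sum(float(value) for value in values)
    raise ValueError(f"Unknown mode: {mode}")


# Made-up sample values for a form with two address blocks and two amounts
addresses = ["123 Main St, Boston, MA", "PO Box 12, Boston, MA"]
print(select_instance(addresses, "first"))
print(select_instance(addresses, "last"))
print(select_instance(["10.50", "4.50"], "sum"))
```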
Can I extract key-value pairs from tables within documents?
Yes, but be explicit about the extraction scope. A table is structured data; extract it as a table first, then optionally convert rows to key-value pairs. If mixing table data and form fields, handle them separately: tables → structured rows; form fields → key-value pairs.
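To illustrate the "tables first, then key-value pairs" step, the sketch below converts already-extracted table rows into one dict per row and keeps them apart from the form fields. The `rows_to_key_value` name, the column names, and the sample rows are made up for the example.

```python
def rows_to_key_value(header: list[str], rows: list[list[str]]) -> list[dict]:
    """Convert table rows into one key-value dict per row, keyed by column header."""
    records = []
    for row in rows:
        if len(row) != len(header):
            # A ragged row usually means the table was extracted incorrectly
            raise ValueError(f"Row has {len(row)} cells, expected {len(header)}")
        records.append(dict(zip(header, row)))
    return records


# Form fields and table rows are handled separately, then combined in one document
form_fields = {"invoice_number": "INV-2026-00123"}
header = ["description", "quantity", "unit_price"]
rows = [["Widget", "2", "9.99"], ["Gadget", "1", "24.50"]]
document = {"fields": form_fields, "line_items": rows_to_key_value(header, rows)}
print(document)
```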
How do I handle optional vs. required fields?
Specify in the prompt: "Required fields: name, email. Optional fields: phone, company." Validate after extraction; required fields with null values are validation failures. Depending on your application, decide whether to reject the entire extraction or flag specific missing fields.
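The reject-or-flag decision can be made a parameter. In this sketch, `check_required` and its `strict` flag are hypothetical names: strict mode rejects the whole extraction, while the default returns the data with the missing fields listed.

```python
def check_required(extracted: dict, required: list[str], strict: bool = False) -> dict:
    """Report missing required fields; in strict mode, reject the whole extraction."""
    missing = [name for name in required if extracted.get(name) in (None, "")]
    if missing and strict:
        raise ValueError(f"Missing required fields: {missing}")
    return {"data": extracted, "missing_required": missing, "complete": not missing}


# Made-up extraction result: email is required but came back null
extracted = {"name": "Jane Doe", "email": None, "phone": None, "company": "Acme"}
result = check_required(extracted, required=["name", "email"])
print(result["missing_required"], result["complete"])
```

Optional fields such as `phone` never appear in `missing_required`, so a null there does not block processing.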
What about fields with numeric ranges (e.g., "Income: $50,000-$75,000")?
Ask the model to extract ranges as objects: {"min": 50000, "max": 75000} or as separate fields: {"income_min": 50000, "income_max": 75000}. Be explicit in the prompt about how ranges should be represented.
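If the model returns the range as raw text instead, it can be parsed in post-processing. The sketch below is one possible approach, assuming a range is written as exactly two numbers; `parse_numeric_range` is a hypothetical helper, and anything else (a single number, three numbers) returns None for manual handling.

```python
import re
from typing import Optional


def _to_number(token: str):
    """Convert '50,000' to 50000 and '1,250.75' to 1250.75."""
    cleaned = token.replace(",", "")
    return float(cleaned) if "." in cleaned else int(cleaned)


def parse_numeric_range(text: str) -> Optional[dict]:
    """Parse text containing exactly two numbers into {'min': ..., 'max': ...}."""
    tokens = re.findall(r"\d[\d,]*(?:\.\d+)?", text)
    if len(tokens) != 2:
        return None  # not a two-number range
    low, high = sorted(_to_number(token) for token in tokens)
    return {"min": low, "max": high}


print(parse_numeric_range("Income: $50,000-$75,000"))
print(parse_numeric_range("Income: $60,000"))
```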