OCR & Text Extraction: Vision Models, Exact Reading

Text extraction from images (optical character recognition, OCR) is one of the most in-demand vision-language tasks, yet it is surprisingly difficult without careful prompting. Vision language models can read printed text, handwriting, and documents, but accuracy varies dramatically with image quality, text size, language, and prompt structure. A carefully crafted OCR prompt can reach roughly 90-95% accuracy on clean documents, while a generic "extract text" request tends to land around 60-70%, with hallucinated and missing content.

The core challenge is that vision models are trained to understand images semantically, not to transcribe them character-by-character. They excel at answering "what text is visible," but struggle with exact reproduction of text, especially when the font is unusual, the image is low-contrast, or the text is handwritten. Strategic prompting—specifying output format, layout preservation, confidence thresholds, and error handling—dramatically improves both accuracy and usability.

OCR Prompt Structure: The Three-Part Framework

Effective OCR prompts follow a structured three-part pattern: specification (what text to extract), constraints (quality requirements, error handling), and output format (how to structure the result).

def ocr_prompt(image_content, extraction_scope, quality_level="strict", output_format="plain_text"):
    """
    Generates a structured OCR extraction prompt.

    Args:
        image_content: Brief description of image content
        extraction_scope: What text to extract (e.g., "all visible text", "document headers only")
        quality_level: 'strict' (only confident extractions), 'balanced' (some uncertainty OK), 'lenient'
        output_format: 'plain_text', 'markdown', 'json', 'structured'
            (any other value adds no format instructions)

    Returns:
        Structured OCR prompt
    """

    # Part 1: specification (what to extract)
    prompt = f"""Extract text from this image: {image_content}

Extraction scope: {extraction_scope}

Quality level: {quality_level}"""

    # Part 2: constraints (quality requirements and error handling)
    if quality_level == "strict":
        prompt += """
Only extract text you can read with high confidence (95%+).
For any text that is unclear, partially obscured, or illegible, respond with [UNREADABLE].
Do NOT guess or infer missing characters.
Preserve exact spelling, capitalization, and punctuation."""

    elif quality_level == "balanced":
        prompt += """
Extract text you can read clearly (85%+ confidence).
For uncertain characters, use [?] or [uncertain].
Attempt to infer missing characters based on context and image clarity.
Preserve exact spelling and capitalization where visible."""

    else:  # lenient
        prompt += """
Extract all visible text, including partial or unclear text.
For unreadable characters, estimate based on context.
Attempt to complete words based on visible letters and context.
Preserve capitalization and punctuation where visible."""

    # Part 3: output format (how to structure the result)
    if output_format == "plain_text":
        prompt += """

Output format: Plain text, preserving line breaks and spacing from original image.
Structure: [Text exactly as it appears, line by line]"""

    elif output_format == "markdown":
        prompt += """

Output format: Markdown, preserving structure through formatting.
- Headings marked as #, ##, ###
- Bold text as **text**
- Italic text as *text*
- Line breaks preserved
- Lists formatted as markdown lists"""

    elif output_format == "json":
        prompt += """

Output format: JSON with structure:
{
  "text_blocks": [
    {"type": "heading", "content": "text"},
    {"type": "paragraph", "content": "text"},
    {"type": "list", "items": ["item1", "item2"]}
  ],
  "metadata": {
    "language": "detected_language",
    "confidence": 0.85,
    "total_text_lines": 10
  }
}"""

    elif output_format == "structured":
        prompt += """

Output format: Structured extraction.
Return a dictionary with keys corresponding to form fields, labels, or sections.
{
  "field_name_1": "extracted_value",
  "field_name_2": "extracted_value",
  "table": {
    "headers": ["col1", "col2"],
    "rows": [["val1", "val2"]]
  }
}"""

    # Rules that apply regardless of quality level or format
    prompt += """

Rules:
1. Read EXACTLY what you see: do not rephrase, summarize, or correct text
2. Preserve original formatting: line breaks, indentation, spacing
3. If text is rotated, read it in its rotated orientation
4. Mark unclear sections clearly ([UNREADABLE], [uncertain], etc.)
5. Do NOT add commentary or interpretation"""

    return prompt

# Example: Extracting text from a receipt
prompt = ocr_prompt(
    image_content="retail receipt from grocery store",
    extraction_scope="all visible text on the receipt, including merchant name, items, prices, and total",
    quality_level="strict",
    output_format="structured"
)
print(prompt)

Document-Specific OCR Strategies

Different document types require tailored prompts:

def document_ocr_prompt(document_type, preserve_layout=True):
    """
    Generates document-specific OCR prompts.

    Args:
        document_type: 'invoice', 'form', 'contract', 'article', 'table', 'receipt'
            (unrecognized types fall back to the 'form' template)
        preserve_layout: Whether to maintain original document structure

    Returns:
        Document-specific OCR prompt
    """

    templates = {
        'invoice': """Extract invoice text while preserving structure.
Extract:
1. Vendor/Merchant name and address
2. Invoice number and date
3. Customer/Bill-to information
4. Line items: description, quantity, unit price, line total
5. Subtotal, taxes, total amount due
6. Payment terms or notes

Format: JSON with sections for header, line_items array, and summary.""",

        'form': """Extract text from form fields.
For each field:
1. Identify the field label/name
2. Extract the filled-in value
3. Note if field is blank or checked

Format: JSON mapping field names to values.
{
  "fields": {
    "full_name": "extracted_value",
    "email": "extracted_value",
    "checkbox_field": "checked" or "unchecked"
  }
}""",

        'contract': """Extract key contract text while preserving section structure.
Focus on:
1. Party names and identities
2. Effective date and term
3. Key obligations and rights
4. Payment terms and amounts
5. Termination conditions
6. Signature lines

Format: Markdown with sections for each element.""",

        'article': """Extract article text while preserving structure.
Include:
1. Title and author
2. Publication date if visible
3. Main text, paragraph by paragraph
4. Headings and subheadings
5. Captions for images/figures
6. References or citations

Format: Markdown with proper heading hierarchy.""",

        'table': """Extract table data while preserving tabular structure.
1. Identify column headers
2. Extract each row's data
3. Preserve row and column order

Format: JSON or CSV
CSV format:
header1,header2,header3
row1_col1,row1_col2,row1_col3""",

        'receipt': """Extract receipt text including:
1. Merchant name and location
2. Date and time
3. Item names and prices
4. Subtotal, tax, total
5. Payment method
6. Receipt number if visible

Format: Structured JSON with merchant, items, and totals.""",
    }

    # Unknown document types use the generic form template
    prompt = templates.get(document_type, templates['form'])

    if preserve_layout:
        prompt += "\n\nPreserve original document layout and structure. "
        prompt += "Use indentation and spacing to reflect visual hierarchy."

    return prompt

# Example: Extracting from a table in an image
prompt = document_ocr_prompt(document_type='table', preserve_layout=True)
print(prompt)
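
When the table template's CSV option is used, the reply still has to be turned into rows before it is useful downstream. The following is a minimal sketch using only the standard library, assuming the model returns plain CSV text with a header row. `parse_csv_table` is a hypothetical helper name, not part of any library.

```python
import csv
import io

def parse_csv_table(csv_text):
    """Parse CSV text returned by the model into a list of row dicts.

    Assumes the first non-empty line is the header row (hypothetical helper).
    """
    reader = csv.reader(io.StringIO(csv_text.strip()))
    rows = [row for row in reader if row]  # drop blank lines
    if not rows:
        return []
    headers, data = rows[0], rows[1:]
    parsed = []
    for row_no, row in enumerate(data, start=1):
        if len(row) != len(headers):
            # A ragged row usually means a cell was dropped or merged during extraction
            raise ValueError(
                f"Data row {row_no} has {len(row)} cells, expected {len(headers)}"
            )
        parsed.append(dict(zip(headers, row)))
    return parsed

sample = "item,qty,price\nMilk,2,3.49\nBread,1,2.99"
table = parse_csv_table(sample)
```

Raising on ragged rows is a design choice: a silent mismatch between headers and cells is harder to catch later than a failed parse.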

Handling Handwriting and Difficult Text

Handwriting recognition is a common OCR challenge. Specialized prompting improves accuracy:

def handwriting_ocr_prompt(handwriting_quality="moderate", allow_estimation=False):
    """
    Prompt for extracting handwritten text.

    Args:
        handwriting_quality: 'clear', 'moderate', 'poor', 'mixed'
        allow_estimation: Whether to infer likely letters when unclear

    Returns:
        Handwriting-specific OCR prompt
    """

    prompt = """Extract handwritten text from this image.

Handwriting quality: """

    if handwriting_quality == "clear":
        prompt += """Legible and well-formed. Most letters are easy to read.
Strategy: Read character-by-character, maintaining exact spelling."""

    elif handwriting_quality == "moderate":
        prompt += """Readable but with some variation in letterforms or slant.
Strategy: Use context and surrounding letters to identify ambiguous characters.
Mark uncertain letters with [?]."""

    elif handwriting_quality == "poor":
        prompt += """Difficult to read. Many letters are unclear or malformed.
Strategy: Attempt to infer words based on partial letter visibility and context.
Use [uncertain] for low-confidence extractions."""

    else:  # mixed
        prompt += """Contains both clear and unclear sections.
Strategy: For clear sections, transcribe exactly. For unclear sections, indicate uncertainty."""

    if allow_estimation:
        prompt += """

When text is unclear, attempt to infer:
1. Look at letter shape (tall letters, descenders, loops)
2. Use word context (if "th_" is visible, "the" or "this" are likely)
3. Check surrounding words for format clues (all caps, mixed case)
4. Mark inferred letters as [inferred: likely_letter]"""

    else:
        prompt += """

If text is unclear, mark as [UNREADABLE] rather than guessing.
Prioritize accuracy over completeness."""

    prompt += """

Output: Extract text line-by-line, preserving original line breaks.
Format: Plain text with uncertainty markers where needed."""

    return prompt

# Example: Extracting from a handwritten note
prompt = handwriting_ocr_prompt(handwriting_quality="moderate", allow_estimation=True)
print(prompt)
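
The uncertainty markers requested by the handwriting prompt are only useful if something downstream reads them. Here is a small sketch that counts the markers in a reply and optionally accepts the inferred letters. The helper names `summarize_markers` and `accept_inferred` are hypothetical, and the patterns assume the model follows the marker conventions exactly, which is not guaranteed.

```python
import re

# Marker conventions taken from the handwriting prompt
MARKER_PATTERNS = {
    "unreadable": re.compile(r"\[UNREADABLE\]"),
    "uncertain": re.compile(r"\[\?\]|\[uncertain\]"),
    "inferred": re.compile(r"\[inferred:\s*([^\]]+)\]"),
}

def summarize_markers(text):
    """Count each kind of uncertainty marker in an extraction."""
    counts = {name: len(p.findall(text)) for name, p in MARKER_PATTERNS.items()}
    counts["total"] = sum(counts.values())
    return counts

def accept_inferred(text):
    """Replace each [inferred: x] marker with its guessed text x."""
    return MARKER_PATTERNS["inferred"].sub(lambda m: m.group(1).strip(), text)

note = "Meet at th[inferred: e] caf[?] on [UNREADABLE] street"
summary = summarize_markers(note)
cleaned = accept_inferred(note)
```

A high marker count relative to the length of the text is a reasonable trigger for manual review or a second extraction pass.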

Multi-Language OCR and Language-Specific Handling

Vision language models support OCR in dozens of languages, but accuracy varies. Specify the language to improve results:

def multilingual_ocr_prompt(detected_languages=None, primary_language="English"):
    """
    Prompt for OCR with multiple languages.

    Args:
        detected_languages: List of additional languages present in image
        primary_language: Main language of document

    Returns:
        Multilingual OCR prompt
    """

    prompt = f"""Extract text from this image.

Primary language: {primary_language}"""

    if detected_languages:
        prompt += f"""
Additional languages detected: {', '.join(detected_languages)}

For each language:
1. Identify sections written in that language
2. Extract text exactly as written in that language
3. Mark language transitions if needed"""

    prompt += """

Important notes:
- Preserve original language and script
- Do NOT translate between languages
- Maintain diacritical marks (accents, umlauts, etc.)
- For non-Latin scripts, ensure character accuracy
- If unsure of language, attempt extraction based on visible characters

Output: Text in original languages, grouped by language if document is multilingual."""

    return prompt

# Example: English document with Spanish and Portuguese sections
prompt = multilingual_ocr_prompt(
    detected_languages=["Spanish", "Portuguese"],
    primary_language="English"
)
print(prompt)
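
One way to spot-check that a reply kept the original script and diacritics is to profile its characters. This is a rough sketch built on the standard `unicodedata` module. It groups letters by the first word of their Unicode character name, which is a heuristic rather than a true script lookup, and the function names are hypothetical.

```python
import unicodedata
from collections import Counter

def script_profile(text):
    """Count alphabetic characters by the first word of their Unicode name.

    Heuristic: 'LATIN SMALL LETTER A' -> 'LATIN', 'CYRILLIC CAPITAL LETTER EM' -> 'CYRILLIC'.
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")  # empty string if the character is unnamed
            if name:
                counts[name.split()[0]] += 1
    return counts

def has_diacritics(text):
    """True if any character decomposes into a base character plus a combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return any(unicodedata.combining(ch) for ch in decomposed)

profile = script_profile("Menú del día / Меню дня")
```

If a document is known to be Spanish or German and a long extraction shows no diacritics at all, the accents may have been stripped. If a script expected in the image is missing from the profile, the text may have been transliterated or translated.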

Structured Data Extraction from Images

For forms, tables, and structured documents, map extracted text to expected fields:

def structured_data_extraction_prompt(expected_fields, validation_rules=None):
    """
    Prompt for extracting structured data from image.

    Args:
        expected_fields: Dict mapping field names to descriptions
        validation_rules: Dict with field-specific validation rules

    Returns:
        Structured extraction prompt
    """

    prompt = "Extract and structure data from this image.\n\n"
    prompt += "Expected fields:\n"

    for field_name, field_desc in expected_fields.items():
        prompt += f"- {field_name}: {field_desc}\n"

    prompt += """
Extraction rules:
1. For each field, extract the corresponding value from the image
2. If a field is not visible or not applicable, return null
3. Extract exact values: do NOT synthesize or estimate missing fields
4. Preserve original formatting (dates, phone numbers, addresses)

Output format (JSON):
{
  "extraction_confidence": 0.85,
  "extracted_data": {"""

    # Join with commas so the last field has no trailing comma,
    # which keeps the example in the prompt valid JSON
    prompt += ",".join(
        f'\n    "{field_name}": "value or null"' for field_name in expected_fields
    )

    prompt += """
  },
  "missing_fields": ["field_name"],
  "validation_issues": ["issue description"]
}"""

    if validation_rules:
        prompt += "\n\nValidation rules to apply:"
        for field_name, rule in validation_rules.items():
            prompt += f"\n- {field_name}: {rule}"

    return prompt

# Example: Extracting from a driver's license
fields = {
    "full_name": "Name of license holder",
    "date_of_birth": "Birth date (format: MM/DD/YYYY)",
    "address": "Residential address",
    "license_number": "License number",
    "expiration_date": "License expiration date",
    "class": "License class (A, B, C, etc.)"
}

rules = {
    "date_of_birth": "Must be in MM/DD/YYYY format",
    "license_number": "10-character alphanumeric",
    "expiration_date": "Must be future date relative to extraction"
}

prompt = structured_data_extraction_prompt(
    expected_fields=fields,
    validation_rules=rules
)
print(prompt)
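
Stating validation rules in the prompt does not guarantee the model applies them, so it can help to re-check the reply in code. The sketch below enforces the three driver's-license rules on a JSON reply. It assumes the reply is bare JSON in the shape requested above and that the expiration date also uses MM/DD/YYYY, which the rules do not actually state. `check_license_fields` and `parse_mmddyyyy` are hypothetical helper names.

```python
import json
import re
from datetime import date, datetime

_DATE_RE = re.compile(r"\d{2}/\d{2}/\d{4}")

def parse_mmddyyyy(value):
    """Return a date for a strict MM/DD/YYYY string, or None if it does not parse."""
    if not isinstance(value, str) or not _DATE_RE.fullmatch(value):
        return None
    try:
        return datetime.strptime(value, "%m/%d/%Y").date()
    except ValueError:  # e.g. month 13 or day 32
        return None

def check_license_fields(reply_text, today=None):
    """Apply the example validation rules to a model reply; returns a list of issues."""
    today = today or date.today()
    data = json.loads(reply_text).get("extracted_data", {})
    issues = []

    dob = data.get("date_of_birth")
    if dob is not None and parse_mmddyyyy(dob) is None:
        issues.append("date_of_birth is not a valid MM/DD/YYYY date")

    number = data.get("license_number")
    if number is not None and not re.fullmatch(r"[A-Za-z0-9]{10}", str(number)):
        issues.append("license_number is not 10 alphanumeric characters")

    expires = data.get("expiration_date")
    if expires is not None:
        parsed = parse_mmddyyyy(expires)  # assumption: same format as date_of_birth
        if parsed is None:
            issues.append("expiration_date is not a valid MM/DD/YYYY date")
        elif parsed <= today:
            issues.append("expiration_date is not in the future")

    return issues

reply = (
    '{"extraction_confidence": 0.9, "extracted_data": {'
    '"date_of_birth": "04/12/1988", "license_number": "D123456789", '
    '"expiration_date": "04/12/2020"}}'
)
issues = check_license_fields(reply, today=date(2024, 1, 1))
```

Fields returned as null are skipped here on the assumption that missing values are reported separately through `missing_fields`.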

OCR Validation and Confidence Assessment

Extracted text should be validated against expected properties:

import re

def validate_ocr_extraction(extracted_text, expected_properties=None, image_quality="medium"):
    """
    Validates OCR extraction quality.

    Args:
        extracted_text: Text extracted from image
        expected_properties: Dict with properties to validate
            ('email_present': expected number of email addresses,
             'phone_numbers': expected number of phone numbers)
        image_quality: 'low', 'medium', 'high' - expected image quality

    Returns:
        Validation report with confidence score
    """

    issues = []

    # Check for common OCR errors (crude heuristics, not reliable detectors)
    ocr_error_patterns = {
        'rn_as_m': extracted_text.count('rn') > extracted_text.count('m') * 2,
        'zero_as_o': False,  # Hard to detect without context
        'one_as_l': False,  # Hard to detect without context
        'six_as_G': extracted_text.count('G') > 3 and image_quality == 'low',
    }

    for error_type, is_present in ocr_error_patterns.items():
        if is_present:
            issues.append(f"Possible {error_type} error detected")

    # Consistency checks
    if expected_properties:
        if 'email_present' in expected_properties:
            # Counts '@' characters as a proxy for email addresses
            email_count = extracted_text.count('@')
            if email_count != expected_properties['email_present']:
                issues.append(f"Expected {expected_properties['email_present']} email(s), found {email_count}")

        if 'phone_numbers' in expected_properties:
            # Matches 10-digit numbers such as 555-123-4567 or 555.123.4567
            phone_pattern = r'\d{3}[-.]?\d{3}[-.]?\d{4}'
            phones = re.findall(phone_pattern, extracted_text)
            if len(phones) != expected_properties['phone_numbers']:
                issues.append(f"Expected {expected_properties['phone_numbers']} phone numbers, found {len(phones)}")

    # Quality assessment: start at 0.95 and subtract 0.1 per issue, with a floor of 0.5
    confidence = 0.95 if not issues else max(0.5, 0.95 - (len(issues) * 0.1))

    return {
        "issues": issues,
        "confidence_score": confidence,
        "quality_rating": "high" if confidence >= 0.9 else "medium" if confidence >= 0.75 else "low",
        "recommendation": "Usable for most applications" if confidence >= 0.75 else "Manual review recommended"
    }
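
The checks above work without ground truth. When a hand-verified transcription is available for a sample of images, accuracy can be measured directly. A common metric is character error rate (CER): the edit distance between the reference and the extraction, divided by the reference length. The following is a minimal pure-Python sketch, and the function names are illustrative.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca -> cb
            ))
        previous = current
    return previous[-1]

def character_error_rate(reference, hypothesis):
    """CER = edit distance / reference length (0.0 means a perfect match)."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)

# 'm' misread as 'rn': one substitution plus one insertion
cer = character_error_rate("modern", "rnodern")
```

Tracking CER over a fixed sample set makes it possible to compare prompt variants instead of relying on impressions.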

Iterative OCR Refinement

For critical applications, refine OCR through multiple passes:

def iterative_ocr_refinement_prompt(previous_extraction, refinement_focus):
    """
    Prompt for refining a previous OCR extraction.

    Args:
        previous_extraction: Text from first extraction attempt
        refinement_focus: What to focus on (e.g., "small text", "handwriting", "numbers")

    Returns:
        Refinement prompt
    """

    prompt = f"""Review and refine this OCR extraction.

Previous extraction:
---
{previous_extraction}
---

Refinement focus: {refinement_focus}

Task:
1. Check each line of extracted text against the original image
2. Fix any obvious OCR errors or misreadings
3. Fill in any [UNREADABLE] or missing sections if now visible at higher confidence
4. Verify {refinement_focus} specifically - ensure accuracy in these areas

Return the refined extraction in the same format as the original."""

    return prompt
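
A multi-pass workflow can be driven by a small loop. The sketch below is one possible arrangement, not a prescribed one. `call_model` is a hypothetical caller-supplied function that sends an image and a prompt to whatever vision model is in use and returns its text reply. The loop stops when no [UNREADABLE] markers remain, when a pass changes nothing, or when the pass limit is reached.

```python
def refine_until_stable(image, first_pass_prompt, build_refinement_prompt,
                        call_model, max_passes=3):
    """Run an initial extraction, then refinement passes while [UNREADABLE] remains.

    call_model(image, prompt) -> str is supplied by the caller (hypothetical interface).
    build_refinement_prompt(previous_text) -> str builds the follow-up prompt.
    Returns (final_text, number_of_passes_run).
    """
    extraction = call_model(image, first_pass_prompt)
    passes = 1
    while passes < max_passes and "[UNREADABLE]" in extraction:
        refined = call_model(image, build_refinement_prompt(extraction))
        passes += 1
        if refined == extraction:
            break  # no change, so further passes are unlikely to help
        extraction = refined
    return extraction, passes

# Stand-in model for demonstration: returns canned replies in order
replies = iter(["Total: [UNREADABLE]", "Total: 42.10"])

def fake_model(image, prompt):
    return next(replies)

text, passes = refine_until_stable(
    image=b"",
    first_pass_prompt="Extract all text.",
    build_refinement_prompt=lambda prev: f"Review and refine:\n{prev}",
    call_model=fake_model,
)
```

Capping the number of passes matters because each pass costs a model call, and a refinement pass can replace an honest [UNREADABLE] with a guess.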

Key Takeaways

  • Structured OCR prompts can reach roughly 90-95% accuracy on clean documents; unstructured requests tend to land around 60-70%, with hallucinated and missing content.
  • Specify extraction scope, quality level, and output format explicitly to guide the model's processing.
  • Handwriting requires different strategies than printed text: use context and character shape analysis.
  • Validate extracted data against expected properties (field presence, format, consistency) to catch systematic errors.
  • Iterative refinement (first extraction, then focused refinement) improves accuracy by 5-10% for challenging documents.

Frequently Asked Questions

Why does OCR fail on low-contrast text?

Vision models rely on visual clarity. Low contrast, small text (< 12pt equivalent), or unusual fonts are inherently challenging. Increase image resolution or zoom into relevant regions to improve accuracy.
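
Zooming into regions can be done by splitting a large image into overlapping tiles and extracting text from each tile separately. This is a sketch of the box arithmetic only. The tile size and overlap defaults are illustrative assumptions, not recommended values, and `tile_boxes` is a hypothetical helper.

```python
def tile_boxes(width, height, tile=1024, overlap=128):
    """Return (left, upper, right, lower) boxes that cover the image with overlap.

    Overlap reduces the chance that a line of text is cut in half at a tile edge.
    """
    if tile <= overlap:
        raise ValueError("tile must be larger than overlap")
    step = tile - overlap

    def starts(length):
        if length <= tile:
            return [0]  # one tile covers this dimension
        positions = list(range(0, length - tile, step))
        positions.append(length - tile)  # final tile sits flush with the far edge
        return positions

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]

boxes = tile_boxes(2000, 1000)
```

Each box uses the (left, upper, right, lower) order that Pillow's `Image.crop` accepts, so the tuples can be passed to it directly if Pillow is the image library in use.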

Can I extract text in multiple languages from a single image?

Yes. Explicitly mark multilingual content in your prompt and request output grouped by language. Avoid mixing languages in a single field to reduce confusion.

Should I use a specialized OCR tool or a vision language model?

Specialized tools (Tesseract, EasyOCR) are generally faster and often more accurate for pure text extraction from clean, printed documents. Vision language models are better at understanding context and extracting structured data (form fields, amounts, dates) that require semantic understanding.

How do I handle rotated or skewed text in images?

Mention rotation/skew in your prompt: "This text is rotated 90 degrees clockwise. Extract it in readable order." Models generally handle rotation well if explicitly told.

What confidence level should I require before using extracted text?

For automated processing, require 85%+. For human review, 70%+ is reasonable. Always validate critical data (financial figures, dates, account numbers) against expected formats.
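
These thresholds can be written as a simple routing rule. In the sketch below, the 0.85 and 0.70 cutoffs come from the answer above, while the "reject" bucket and the handling of critical fields are assumptions added for illustration. A confidence value reported by the model itself may not be well calibrated, so treat it as a rough signal.

```python
def route_extraction(confidence, has_critical_fields=False, format_ok=True):
    """Decide what to do with an extraction based on its confidence score.

    confidence: float in [0, 1], e.g. a validation score or model-reported value
    has_critical_fields: True if the text contains financial figures, dates, or account numbers
    format_ok: result of validating those critical fields against expected formats
    """
    if has_critical_fields and not format_ok:
        return "human_review"  # critical data that fails format checks is never automated
    if confidence >= 0.85:
        return "automated"
    if confidence >= 0.70:
        return "human_review"
    return "reject"  # assumption: below 0.70, re-extract rather than review

decision = route_extraction(0.9, has_critical_fields=True, format_ok=False)
```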