Partial Parsing: Extracting Valid Data From Incomplete
Partial parsing is the art of extracting valid data from malformed, incomplete, or streaming LLM output. Instead of discarding an entire response because it's missing a closing brace or has one invalid field, partial parsing salvages what's valid. Studies show partial parsing recovers 70–80% of would-be failed extractions (Anthropic, 2025).
This article teaches you lenient parsing strategies, field recovery, and streaming-friendly techniques.
The Problem: Incomplete Output
Consider this LLM output attempting to generate customer records:
{
  "customers": [
    {"id": 1, "name": "Alice Chen", "email": "alice@example.com"},
    {"id": 2, "name": "Bob Smith", "email": "bob@example.com",
    {"id": 3, "name": "Carol Lee", "email": "carol@exam
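The output has three problems:
- Truncated: Carol's email is cut off, and her record has no closing quote or brace.
- Malformed: Bob's record is missing its closing brace.
- Interrupted: the stream stopped before the array and the outer object were closed.
Strict JSON parsing fails entirely. Partial parsing recovers Alice's complete record, all of Bob's fields (only his closing brace is missing), and Carol's partial data.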
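To make the gap concrete, here is a minimal sketch that salvages only the objects that decode cleanly, using the standard library's `json.JSONDecoder.raw_decode`. The helper name `salvage_complete_objects` and the sample email addresses are illustrative assumptions, not part of the original example. The scan does not track string context, so a `{` inside a string value could be mistaken for the start of an object.

```python
import json
from typing import Any, List


def salvage_complete_objects(text: str) -> List[Any]:
    """Return every JSON object that decodes cleanly somewhere in text."""
    decoder = json.JSONDecoder()
    found = []
    pos = 0
    while True:
        start = text.find("{", pos)
        if start == -1:
            break
        try:
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pos = start + 1  # not decodable from here; try the next brace
        else:
            found.append(obj)
            pos = end  # continue after the object just decoded
    return found


# Illustrative stand-in for the truncated customer output
truncated = '''{
"customers": [
{"id": 1, "name": "Alice Chen", "email": "alice@example.com"},
{"id": 2, "name": "Bob Smith", "email": "bob@example.com",
{"id": 3, "name": "Carol Lee", "email": "carol@exam'''

print(salvage_complete_objects(truncated))
# [{'id': 1, 'name': 'Alice Chen', 'email': 'alice@example.com'}]
```

Only Alice's record survives this strict-per-object pass. Bob's and Carol's data need the more lenient techniques described in the rest of this article.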
Lenient Parsing with json5
The json5 library parses JSON-like syntax more leniently:
import json5

output = """{
    "id": 1,
    "name": "Alice",
    "email": 'alice@example.com', // single-quoted string
    "active": true,
    // This is a comment
}"""

try:
    # json5 handles single quotes, comments, trailing commas
    data = json5.loads(output)
    print(data)  # Works!
except ValueError:
    # The json5 package reports syntax errors as ValueError
    print("Even json5 failed; fall back to partial parsing")
json5 is much more forgiving than stdlib json:
import json

# json fails on json5's lenient syntax
try:
    json.loads(output)
except json.JSONDecodeError as e:
    print(f"Failed: {e}")  # Fails on single quotes and comments

# json5 succeeds
data = json5.loads(output)
Streaming-Aware Parsing
LLMs often stream output token by token. Parse as you receive data:
import json
from typing import Iterable, List, Optional


def parse_streaming_json_array(
    tokens: Iterable[str],
    expected_fields: Optional[List[str]] = None,
) -> List[dict]:
    """Collect complete top-level objects from streamed JSON text.

    A trailing incomplete object is handed to extract_fields_from_malformed
    (defined under Field-Level Recovery) when expected_fields is given.
    """
    buffer = ""        # text of the object currently being read
    objects = []
    depth = 0          # brace nesting depth, counted outside strings
    in_string = False
    escape = False

    for token in tokens:
        for char in token:
            if depth == 0 and char != "{":
                continue  # skip brackets, commas, whitespace between objects
            buffer += char
            if escape:
                escape = False
            elif in_string:
                if char == "\\":
                    escape = True
                elif char == '"':
                    in_string = False
            elif char == '"':
                in_string = True
            elif char == "{":
                depth += 1
            elif char == "}":
                depth -= 1
                if depth == 0:
                    # Complete object found
                    try:
                        objects.append(json.loads(buffer))
                    except json.JSONDecodeError:
                        pass  # balanced but invalid; drop it
                    buffer = ""

    # Whatever remains is an incomplete object: recover fields if possible
    if buffer and expected_fields:
        partial = extract_fields_from_malformed(buffer, expected_fields)
        if partial:
            objects.append(partial)
    return objects


# Usage
tokens = ['{', '"id": 1', ', "name"', ': "Alice"', '}', ',', '{', '"id": 2', ', "name"', ': "Bob"']
customers = parse_streaming_json_array(tokens, expected_fields=["id", "name"])
# Returns: [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
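An alternative for the trailing fragment is to append the missing closers and hand the result back to a strict parser. The sketch below shows that idea; it is a swapped-in technique rather than anything from the parser above, and the helper name `close_truncated_json` is hypothetical. It is best effort only: a fragment cut off right after a key or a colon still fails to parse after closing.

```python
import json


def close_truncated_json(text: str) -> str:
    """Append the closing quote, braces, and brackets a truncated text lacks."""
    stack = []          # closers still owed, innermost last
    in_string = False
    escape = False
    for char in text:
        if escape:
            escape = False
        elif in_string:
            if char == "\\":
                escape = True
            elif char == '"':
                in_string = False
        elif char == '"':
            in_string = True
        elif char == "{":
            stack.append("}")
        elif char == "[":
            stack.append("]")
        elif char in "}]" and stack:
            stack.pop()

    repaired = text
    if escape:
        repaired = repaired[:-1]  # drop a dangling backslash
    if in_string:
        repaired += '"'           # close the unterminated string
    else:
        repaired = repaired.rstrip().rstrip(",")  # drop a trailing comma
    return repaired + "".join(reversed(stack))


fragment = '{"id": 2, "name": "Bo'
print(json.loads(close_truncated_json(fragment)))  # {'id': 2, 'name': 'Bo'}
```

Note that the recovered value `"Bo"` is itself truncated, so a repaired parse deserves the same validation as any other partial result.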
Field-Level Recovery
Extract individual fields from malformed JSON:
import re
from typing import Any, Dict, List


def extract_fields_from_malformed(text: str, expected_fields: List[str]) -> Dict[str, Any]:
    """Extract individual fields from malformed JSON."""
    result = {}
    for field in expected_fields:
        key = re.escape(field)
        # (pattern, converter) pairs, tried in order; the first match wins.
        # String patterns stop at the closing quote or at the end of the text,
        # so a truncated value is still captured.
        patterns = [
            (rf'"{key}"\s*:\s*"([^"]*)', str),                  # "key": "value"
            (rf'"{key}"\s*:\s*(-?\d+\.\d+)', float),            # "key": 1.5
            (rf'"{key}"\s*:\s*(-?\d+)', int),                   # "key": 123
            (rf'"{key}"\s*:\s*(true|false)\b', lambda v: v == "true"),  # "key": true
            (rf"'{key}'\s*:\s*'([^']*)", str),                  # 'key': 'value'
            (rf'\b{key}\s*:\s*"([^"]*)', str),                  # key: "value"
        ]
        for pattern, convert in patterns:
            match = re.search(pattern, text)
            if match:
                result[field] = convert(match.group(1))
                break
    return result


# Usage
malformed = '{"id": 1, "name": "Alice", email: "alice@example'
fields = extract_fields_from_malformed(
    malformed,
    expected_fields=["id", "name", "email"]
)
print(fields)  # {'id': 1, 'name': 'Alice', 'email': 'alice@example'}
Nested Object Recovery
For nested structures, recover what you can at each level:
import json
import re


def parse_nested_with_recovery(text: str, schema: dict) -> dict:
    """Parse nested JSON with field-level recovery."""
    # Try strict parsing first
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    # Try lenient parsing (skipped if json5 is not installed)
    try:
        import json5
        return json5.loads(text)
    except (ImportError, ValueError):
        pass

    # Fall back to field extraction, guided by the schema
    result = {}
    for field, field_schema in schema.get("properties", {}).items():
        key = re.escape(field)
        field_type = field_schema.get("type")
        if field_type == "object":
            # Recover a flat nested object; its closing brace may be missing
            match = re.search(rf'"{key}"\s*:\s*(\{{[^{{}}]*\}}?)', text)
            if match:
                result[field] = parse_nested_with_recovery(match.group(1), field_schema)
        elif field_type == "array":
            # Recover array elements; the closing bracket may be missing
            match = re.search(rf'"{key}"\s*:\s*\[([^\]]*)', text)
            if match:
                result[field] = extract_array_items(match.group(1))
        else:
            # Simple field recovery. This searches the whole text, so a
            # nested field with the same name can shadow a top-level one.
            result.update(extract_fields_from_malformed(text, [field]))
    return result


def extract_array_items(text: str) -> list:
    """Extract items from malformed array string."""
    # Split by comma (naive; works for simple arrays of scalars)
    items = []
    for item in text.split(","):
        item = item.strip().strip('"').strip("'")
        if item:
            items.append(item)
    return items
Comparison: Parsing Strategies
| Strategy | Strictness | Recovery Rate | Complexity | Use Case |
|---|---|---|---|---|
| json.loads | Very strict | 0% on failure | Low | Well-formed output only |
| json5.loads | Lenient | 50–70% on malformed | Low | Sloppy but complete output |
| Field extraction | Very lenient | 70–90% | Medium | Badly malformed output |
| Nested recovery | Adaptive | 80–95% | High | Complex nested structures |
Practical Implementation
import json


def robust_parse(text: str, schema: dict) -> dict:
    """Parse with automatic fallback to partial parsing."""
    # Strategy 1: Strict JSON
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    # Strategy 2: Lenient JSON5 (skipped if json5 is not installed)
    try:
        import json5
        return json5.loads(text)
    except (ImportError, ValueError):
        pass

    # Strategy 3: Field-by-field recovery
    fields = extract_fields_from_malformed(
        text,
        schema.get("required", [])
    )
    if fields:
        return fields

    # Strategy 4: Return empty dict (all failed)
    return {}


# Usage
schema = {"required": ["id", "name", "email"]}
output = '{"id": 1, "name": "Alice", email: "alice@example.com"'
data = robust_parse(output, schema)
print(data)  # {'id': 1, 'name': 'Alice', 'email': 'alice@example.com'}
Key Takeaways
- Partial parsing recovers 70–80% of would-be failed extractions.
- json5 handles JSON-like syntax leniently (single quotes, comments, trailing commas); use it for sloppy but complete output, since it still rejects truncated input.
- Stream-aware parsing detects complete objects as they arrive.
- Field-level recovery extracts individual fields using regex patterns.
- Nested recovery applies strategies recursively at each level.
- Always implement a graceful fallback (return partial data, then empty, then error).
Frequently Asked Questions
Is partial parsing safe? Could I get bad data?
Not entirely. Partial parsing can return incomplete or misinterpreted data, so use it only when the alternative is total failure, and validate recovered fields against schema constraints.
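As a rough illustration of that validation step, the sketch below checks recovered fields against per-field rules and separates what passes from what does not. The `CONSTRAINTS` table and the `validate_recovered` helper are hypothetical, and the email pattern is deliberately loose; a real project would more likely derive the rules from its own schema.

```python
import re
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical constraints for the customer records used in this article
CONSTRAINTS: Dict[str, Callable[[Any], bool]] = {
    "id": lambda v: isinstance(v, int) and not isinstance(v, bool) and v > 0,
    "name": lambda v: isinstance(v, str) and v.strip() != "",
    "email": lambda v: isinstance(v, str)
    and re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
}


def validate_recovered(fields: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Split recovered fields into accepted values and rejected field names."""
    accepted: Dict[str, Any] = {}
    rejected: List[str] = []
    for name, value in fields.items():
        check = CONSTRAINTS.get(name)
        # Fields without a rule are rejected rather than trusted
        if check is not None and check(value):
            accepted[name] = value
        else:
            rejected.append(name)
    return accepted, rejected


accepted, rejected = validate_recovered(
    {"id": 1, "name": "Alice", "email": "alice@example"}
)
print(accepted)  # {'id': 1, 'name': 'Alice'}
print(rejected)  # ['email']
```

The truncated email is dropped instead of being stored as if it were complete, which is usually the safer default for recovered data.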
How do I handle arrays in partial parsing?
Split by commas (naive) or use regex to find array boundaries. For complex nested arrays, it's often better to escalate than guess.
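A middle ground between naive comma splitting and giving up is to split only on commas that sit outside strings and nested structures. The following is a minimal sketch of that approach; `split_top_level` is a hypothetical helper name, and it returns the raw text of each item without parsing it.

```python
from typing import List


def split_top_level(array_body: str) -> List[str]:
    """Split the inside of a JSON array on commas that are not nested or quoted."""
    items: List[str] = []
    current: List[str] = []
    depth = 0           # nesting depth of brackets and braces
    in_string = False
    escape = False
    for char in array_body:
        if escape:
            escape = False
        elif in_string:
            if char == "\\":
                escape = True
            elif char == '"':
                in_string = False
        elif char == '"':
            in_string = True
        elif char in "[{":
            depth += 1
        elif char in "]}":
            depth -= 1
        elif char == "," and depth == 0:
            # Top-level separator: finish the current item
            piece = "".join(current).strip()
            if piece:
                items.append(piece)
            current = []
            continue
        current.append(char)
    tail = "".join(current).strip()
    if tail:
        items.append(tail)  # may be a truncated final item
    return items


print(split_top_level('"a, b", {"x": [1, 2]}, 3'))
# ['"a, b"', '{"x": [1, 2]}', '3']
```

Each returned piece can then be passed to a strict parser on its own, so one damaged element does not take the rest of the array down with it.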
Can partial parsing handle deeply nested objects?
Yes, but the complexity grows. Recursive application of field extraction works, but at some point, manual intervention makes more sense.
Should I log partial parsing uses?
Absolutely. Every partial parse is a sign your primary extraction failed. Track these to improve your LLM prompts or schema design.
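One possible shape for that tracking is sketched below: a wrapper tries each parser in order, counts which one succeeded, and logs a warning whenever a fallback was needed. The names `parse_with_logging`, `strategy_counts`, and `first_object_only` are assumptions made for this illustration, and the stand-in fallback simply decodes the first object and ignores trailing text.

```python
import json
import logging
from collections import Counter
from typing import Any, Callable, List, Optional, Tuple

logger = logging.getLogger("partial_parsing")
strategy_counts: Counter = Counter()


def parse_with_logging(
    text: str, strategies: List[Tuple[str, Callable[[str], Any]]]
) -> Optional[Any]:
    """Try each (name, parser) in order and record which one produced a result."""
    for index, (name, parser) in enumerate(strategies):
        try:
            result = parser(text)
        except ValueError:  # json.JSONDecodeError is a ValueError subclass
            continue
        strategy_counts[name] += 1
        if index > 0:
            logger.warning("fallback parser %r used; input starts %r", name, text[:80])
        return result
    strategy_counts["failed"] += 1
    logger.error("all parsers failed; input starts %r", text[:80])
    return None


def first_object_only(text: str) -> Any:
    """Stand-in fallback: decode the first value and ignore trailing text."""
    obj, _ = json.JSONDecoder().raw_decode(text.lstrip())
    return obj


strategies = [("strict", json.loads), ("first_object", first_object_only)]
parse_with_logging('{"id": 1}', strategies)
parse_with_logging('{"id": 2} trailing junk', strategies)
parse_with_logging("not json", strategies)
print(dict(strategy_counts))  # {'strict': 1, 'first_object': 1, 'failed': 1}
```

A rising share of fallback or failed counts is the signal to revisit the prompt or the schema.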