Bounding Box Output: Spatial Coordinates from Vision
While vision language models excel at understanding images semantically, they are not natively precise at localization, that is, at identifying the exact pixel or normalized coordinates of objects. By carefully structuring your prompts, however, you can get VLMs to produce reasonably accurate bounding boxes in JSON format, enabling programmatic object detection, region cropping, and spatial analysis workflows. This capability bridges the gap between semantic understanding and quantitative spatial reasoning.
The challenge is fundamental: vision language models process images as token sequences, not as coordinate spaces. A model that understands "there is a dog in the image" doesn't inherently reason about pixel coordinates. But modern VLMs can be prompted to estimate spatial positions using percentage-based (normalized) or pixel coordinates, trading some accuracy (roughly 80-85% vs. 95%+ for specialized detectors) for flexibility and no model retraining.
Bounding Box Format Specification
Before prompting, standardize your coordinate system. Two formats dominate:
Normalized coordinates (0-100 scale):
{
  "object": "dog",
  "x_min": 10,
  "y_min": 20,
  "x_max": 45,
  "y_max": 70
}
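Advantages: Resolution-invariant, human-readable percentages. Disadvantages: Coarse granularity; on a 1024-pixel-wide image, one percentage point spans about 10 pixels, which limits precision for small or densely packed objects.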
Pixel coordinates:
{
  "object": "dog",
  "x_min": 120,
  "y_min": 240,
  "x_max": 540,
  "y_max": 840,
  "image_width": 1024,
  "image_height": 1024
}
Advantages: Pixel-precision, directly usable for image cropping. Disadvantages: Requires image dimensions; not portable across resolutions.
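Because the two formats differ only by a scale factor, converting between them is straightforward. The following is a minimal sketch, assuming boxes are dicts with the keys shown above; the helper names `normalized_to_pixel` and `pixel_to_normalized` are illustrative and not part of any library.

```python
def normalized_to_pixel(box, image_width, image_height):
    """Convert a 0-100 normalized box to integer pixel coordinates."""
    return {
        **box,
        "x_min": round(box["x_min"] * image_width / 100),
        "y_min": round(box["y_min"] * image_height / 100),
        "x_max": round(box["x_max"] * image_width / 100),
        "y_max": round(box["y_max"] * image_height / 100),
        "image_width": image_width,
        "image_height": image_height,
    }


def pixel_to_normalized(box):
    """Convert a pixel box (which carries its image dimensions) to the 0-100 scale."""
    width, height = box["image_width"], box["image_height"]
    # Drop the dimension fields: a normalized box no longer needs them.
    out = {k: v for k, v in box.items() if k not in ("image_width", "image_height")}
    out["x_min"] = box["x_min"] / width * 100
    out["x_max"] = box["x_max"] / width * 100
    out["y_min"] = box["y_min"] / height * 100
    out["y_max"] = box["y_max"] / height * 100
    return out


dog = {"object": "dog", "x_min": 10, "y_min": 20, "x_max": 45, "y_max": 70}
print(normalized_to_pixel(dog, 1024, 1024))
```

Rounding to whole pixels loses a little information, so a box converted to pixels and back may differ slightly from the original.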
Here's a practical template for standardized bounding box output:
def bounding_box_prompt(bbox_format="normalized", confidence_threshold=0.8):
    """
    Generates a prompt requesting bounding box output.

    Args:
        bbox_format: 'normalized' (0-100%) or 'pixel' (pixel coordinates)
        confidence_threshold: Minimum confidence to report (0.0-1.0)

    Returns:
        Structured bounding box prompt
    """
    prompt = "Detect and localize all objects in this image.\n\n"

    if bbox_format == "normalized":
        prompt += """Coordinate system: Normalized percentages (0-100)
- x_min, x_max: Horizontal position (0% = left edge, 100% = right edge)
- y_min, y_max: Vertical position (0% = top edge, 100% = bottom edge)

Example format:
{
  "objects": [
    {"class": "dog", "confidence": 0.95, "x_min": 10, "y_min": 20, "x_max": 45, "y_max": 70},
    {"class": "cat", "confidence": 0.87, "x_min": 50, "y_min": 15, "x_max": 80, "y_max": 55}
  ]
}
"""
    else:  # pixel
        prompt += """Coordinate system: Pixel coordinates
- x_min, x_max: Horizontal position in pixels (0 = left edge)
- y_min, y_max: Vertical position in pixels (0 = top edge)
- Include image dimensions for reference

Example format:
{
  "image_width": 1024,
  "image_height": 768,
  "objects": [
    {"class": "dog", "confidence": 0.95, "x_min": 102, "y_min": 154, "x_max": 460, "y_max": 539},
    {"class": "cat", "confidence": 0.87, "x_min": 512, "y_min": 115, "x_max": 819, "y_max": 423}
  ]
}
"""

    prompt += f"""
Instructions:
1. Scan the image systematically from left to right, top to bottom
2. Identify each distinct object or region of interest
3. For each object, estimate its bounding box (smallest rectangle containing it)
4. Assign a confidence score (0.0-1.0) based on how clearly the object is visible
5. Only report objects with confidence >= {confidence_threshold}

For each object, provide:
- class: Object category or name
- confidence: How certain you are (0.0-1.0)
- Bounding box coordinates

Output ONLY valid JSON. Do not include commentary."""
    return prompt


# Example usage
prompt = bounding_box_prompt(bbox_format="normalized", confidence_threshold=0.75)
print(prompt)
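Even when told to output only JSON, a model may wrap its answer in extra text or return malformed boxes, so the reply is worth parsing defensively. The sketch below assumes the "objects" schema from the template above; `parse_detections` is a hypothetical helper, and taking the span from the first `{` to the last `}` is a simple heuristic, not a robust extractor.

```python
import json

COORD_KEYS = ("x_min", "y_min", "x_max", "y_max")


def parse_detections(response_text, confidence_threshold=0.75):
    """Extract the outermost JSON object from a reply and keep only usable boxes."""
    start = response_text.find("{")
    end = response_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in response")
    data = json.loads(response_text[start:end + 1])

    valid = []
    for obj in data.get("objects", []):
        # Skip boxes with missing or non-numeric coordinates.
        if not all(isinstance(obj.get(k), (int, float)) for k in COORD_KEYS):
            continue
        # Skip inverted or zero-area boxes.
        if obj["x_min"] >= obj["x_max"] or obj["y_min"] >= obj["y_max"]:
            continue
        # Re-apply the threshold locally instead of trusting the model to filter.
        if obj.get("confidence", 0.0) < confidence_threshold:
            continue
        valid.append(obj)
    return valid
```

Re-applying the threshold in code keeps the filter deterministic even if the model ignores instruction 5.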
Hierarchical Object Detection
Many images contain objects at different scales and levels of hierarchy. Prompt for hierarchical detection to avoid missing small or nested objects:
def hierarchical_detection_prompt():
    """
    Prompt for multi-level object detection (primary objects, components, details).
    """
    prompt = """Detect objects at multiple levels of hierarchy:

Level 1 (Primary objects): Large, prominent objects that define the scene
- Examples: "car", "building", "person"

Level 2 (Components): Significant parts of primary objects
- Examples: "car tire", "building window", "person arm"

Level 3 (Details): Small or decorative elements
- Examples: "license plate", "window curtain", "person watch"

For each object level, provide bounding boxes in this JSON structure:
{
  "level_1_primary": [
    {"class": "car", "confidence": 0.95, "x_min": 20, "y_min": 30, "x_max": 70, "y_max": 90}
  ],
  "level_2_components": [
    {"class": "car tire", "confidence": 0.88, "x_min": 25, "y_min": 75, "x_max": 35, "y_max": 88}
  ],
  "level_3_details": [
    {"class": "license plate", "confidence": 0.80, "x_min": 40, "y_min": 65, "x_max": 50, "y_max": 72}
  ]
}

Include all objects you can reliably detect. Omit objects with confidence < 0.75."""
    return prompt
Hierarchical prompts help VLMs organize their detection process and avoid conflating objects at different scales.
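One cheap sanity check on hierarchical output is that components and details should sit mostly inside some primary object. The sketch below assumes the three-level JSON structure from the prompt above; `fraction_inside`, `orphaned_components`, and the 0.9 cutoff are illustrative choices, not established conventions.

```python
def fraction_inside(inner, outer):
    """Fraction of the inner box's area that lies within the outer box."""
    overlap_w = max(0, min(inner["x_max"], outer["x_max"]) - max(inner["x_min"], outer["x_min"]))
    overlap_h = max(0, min(inner["y_max"], outer["y_max"]) - max(inner["y_min"], outer["y_min"]))
    area = (inner["x_max"] - inner["x_min"]) * (inner["y_max"] - inner["y_min"])
    return (overlap_w * overlap_h) / area if area > 0 else 0.0


def orphaned_components(result, min_inside=0.9):
    """Return classes of level 2/3 boxes that no primary object contains."""
    primaries = result.get("level_1_primary", [])
    orphans = []
    for level in ("level_2_components", "level_3_details"):
        for box in result.get(level, []):
            if not any(fraction_inside(box, p) >= min_inside for p in primaries):
                orphans.append(box["class"])
    return orphans
```

An orphan is not necessarily an error (the parent may simply have been missed), but it is a useful signal for manual review or a second detection pass.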
Occlusion and Partial Object Handling
Real-world images contain occluded or partially visible objects. Prompt explicitly for handling these edge cases:
def occlusion_aware_bounding_box_prompt():
    """
    Prompt that handles partially visible or occluded objects.
    """
    prompt = """Detect all objects, including those that are partially visible or occluded.

For each object, specify visibility state:
- "fully_visible": Entire object is clearly visible
- "partially_visible": Part of object is cut off by image boundary or another object
- "mostly_occluded": Object is mostly hidden but identifiable by visible parts

For partially visible objects, estimate the full bounding box (where the object *would* be if fully visible), not just the visible portion.

JSON format:
{
  "objects": [
    {
      "class": "person",
      "confidence": 0.92,
      "visibility": "fully_visible",
      "x_min": 10, "y_min": 5, "x_max": 40, "y_max": 95
    },
    {
      "class": "car",
      "confidence": 0.78,
      "visibility": "partially_visible",
      "visible_portion": "right_edge_cut_off",
      "x_min": 40, "y_min": 30, "x_max": 110, "y_max": 70,
      "note": "Right edge extends beyond image boundary"
    }
  ]
}

Rules:
- For occluded objects, base estimation on visible parts and context
- If you cannot estimate the full box, provide only the visible portion and mark as partial
- Confidence should reflect estimation uncertainty for occluded objects"""
    return prompt
Explicit occlusion handling makes the model state when a box is extrapolated rather than directly observed, which reduces silent hallucination of "completed" objects and improves consistency.
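Because the prompt allows estimated boxes to extend past the frame (such as an x_max of 110 on a 0-100 scale), downstream code usually needs to clip them before cropping. Here is a minimal sketch; `clip_to_image` and the added `visible_fraction` field are hypothetical, and the fraction only accounts for cut-off at the image boundary, not occlusion by other objects.

```python
def clip_to_image(box, max_coord=100):
    """Clip a box to the [0, max_coord] frame and record how much of it survives."""
    clipped = dict(box)
    clipped["x_min"] = max(0, box["x_min"])
    clipped["y_min"] = max(0, box["y_min"])
    clipped["x_max"] = min(max_coord, box["x_max"])
    clipped["y_max"] = min(max_coord, box["y_max"])

    full_area = (box["x_max"] - box["x_min"]) * (box["y_max"] - box["y_min"])
    visible_w = max(0, clipped["x_max"] - clipped["x_min"])
    visible_h = max(0, clipped["y_max"] - clipped["y_min"])
    # Share of the estimated full box that lies inside the image frame.
    clipped["visible_fraction"] = (visible_w * visible_h) / full_area if full_area > 0 else 0.0
    return clipped


car = {"class": "car", "x_min": 40, "y_min": 30, "x_max": 110, "y_max": 70}
print(clip_to_image(car))
```

Keeping the unclipped box alongside the clipped one preserves the model's size estimate while giving crop code coordinates that are safe to use.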
Multi-Class and Semantic Bounding Box Output
For complex scenes with many object categories, prompt for organized output by class:
def semantic_bounding_box_prompt(categories):
    """
    Prompts for bounding box output organized by object class/category.

    Args:
        categories: List of object categories to detect

    Returns:
        Organized bounding box prompt
    """
    prompt = "Detect and localize objects from these categories:\n\n"
    for i, category in enumerate(categories, 1):
        prompt += f"{i}. {category}\n"

    prompt += """
For each category, list all detected objects with bounding boxes.

Output format (JSON):
{
  "detection_summary": {
    "total_objects": <int>,
    "categories_found": [list of categories]
  },
  "detections_by_class": {
    "category_1": [
      {"confidence": 0.95, "x_min": 10, "y_min": 20, "x_max": 45, "y_max": 75},
      {"confidence": 0.87, "x_min": 60, "y_min": 30, "x_max": 90, "y_max": 80}
    ],
    "category_2": [
      {"confidence": 0.92, "x_min": 5, "y_min": 5, "x_max": 30, "y_max": 40}
    ]
  }
}

Rules:
- Use exact category names from the list above
- If a category is not found, omit it from detections_by_class
- Sort detections within each class by confidence (highest first)
- Only report objects with confidence >= 0.75"""
    return prompt


# Example: Retail shelf detection
categories = ["product_box", "empty_space", "price_tag", "barcode", "shelf_unit"]
prompt = semantic_bounding_box_prompt(categories)
print(prompt)
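Class-grouped output is convenient for the model but awkward for downstream code, and the summary block gives a free consistency check. The sketch below flattens the response and flags mismatches, assuming the JSON structure from the prompt above; `flatten_detections` is an illustrative name.

```python
def flatten_detections(result, allowed_categories):
    """Flatten detections_by_class into one list and report inconsistencies."""
    flat, problems = [], []
    for category, boxes in result.get("detections_by_class", {}).items():
        if category not in allowed_categories:
            # The model invented or misspelled a category name.
            problems.append(f"unexpected category: {category}")
            continue
        for box in boxes:
            flat.append({"class": category, **box})

    reported = result.get("detection_summary", {}).get("total_objects")
    if reported is not None and reported != len(flat):
        problems.append(f"summary reports {reported} objects, found {len(flat)}")

    # Sort globally by confidence, highest first.
    flat.sort(key=lambda b: b.get("confidence", 0.0), reverse=True)
    return flat, problems
```

A mismatch between the summary count and the actual list is often an early sign that the model truncated or padded its output.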
Spatial Relationship Output
Beyond individual bounding boxes, prompt for spatial relationships between objects:
def spatial_relationship_prompt():
    """
    Prompt for detecting spatial relationships between objects.
    """
    prompt = """Detect objects and their spatial relationships.

For each object, provide its bounding box AND describe its relationship to other objects.

Relationships to report:
- "left_of": Object A is to the left of Object B
- "right_of": Object A is to the right of Object B
- "above": Object A is above Object B
- "below": Object A is below Object B
- "overlapping": Object A overlaps with Object B
- "adjacent": Object A is adjacent (touching) Object B
- "contained": Object A is contained within Object B

JSON format:
{
  "objects": [
    {
      "id": "obj_1",
      "class": "chair",
      "x_min": 10, "y_min": 20, "x_max": 40, "y_max": 75
    },
    {
      "id": "obj_2",
      "class": "table",
      "x_min": 40, "y_min": 25, "x_max": 80, "y_max": 65
    }
  ],
  "relationships": [
    {"subject": "obj_1", "relation": "left_of", "object": "obj_2"},
    {"subject": "obj_1", "relation": "adjacent", "object": "obj_2"}
  ]
}

Provide all detected spatial relationships between any pair of objects."""
    return prompt
Relationship data enables graph-based scene understanding and spatial reasoning.
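Since most of these relations follow directly from the box geometry, the model's claims can be cross-checked against its own coordinates. The following is a rough sketch under simplifying assumptions: relations are judged on box edges only, `tolerance` (in the same units as the boxes) is an arbitrary slack value, and "adjacent" is left unchecked. Both function names are hypothetical.

```python
def geometric_relations(a, b, tolerance=1):
    """Relations of box a relative to box b implied by the coordinates alone."""
    relations = set()
    if a["x_max"] <= b["x_min"] + tolerance:
        relations.add("left_of")
    if a["x_min"] >= b["x_max"] - tolerance:
        relations.add("right_of")
    if a["y_max"] <= b["y_min"] + tolerance:
        relations.add("above")  # y grows downward, so smaller y is higher
    if a["y_min"] >= b["y_max"] - tolerance:
        relations.add("below")

    x_overlap = min(a["x_max"], b["x_max"]) - max(a["x_min"], b["x_min"])
    y_overlap = min(a["y_max"], b["y_max"]) - max(a["y_min"], b["y_min"])
    if x_overlap > tolerance and y_overlap > tolerance:
        inside = (a["x_min"] >= b["x_min"] and a["x_max"] <= b["x_max"]
                  and a["y_min"] >= b["y_min"] and a["y_max"] <= b["y_max"])
        relations.add("contained" if inside else "overlapping")
    return relations


def unsupported_relationships(result, tolerance=1):
    """Return claimed relationships that the boxes do not support."""
    boxes = {obj["id"]: obj for obj in result["objects"]}
    checkable = {"left_of", "right_of", "above", "below", "overlapping", "contained"}
    flagged = []
    for rel in result["relationships"]:
        if rel["relation"] not in checkable:
            continue
        a, b = boxes.get(rel["subject"]), boxes.get(rel["object"])
        if a is None or b is None or rel["relation"] not in geometric_relations(a, b, tolerance):
            flagged.append(rel)
    return flagged
```

A flagged relationship means either the relation or one of the boxes is wrong; the check cannot tell which, so treat it as a prompt for review.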
Confidence Calibration and Validation
Vision language models' confidence scores may not accurately reflect true accuracy. Validate and calibrate before deployment:
def validate_bounding_boxes(detected_boxes, original_image=None, validation_method="non_max_suppression"):
    """
    Validates detected bounding boxes against known or expected properties.

    Args:
        detected_boxes: List of detected bounding box dicts
        original_image: Optional PIL Image or image path (currently unused;
            reserved for pixel-level validation)
        validation_method: 'non_max_suppression' or 'aspect_ratio'

    Returns:
        Validation report listing any issues found

    Raises:
        ValueError: If validation_method is not supported
    """
    issues = []

    if validation_method == "non_max_suppression":
        # Check for excessive overlap, which usually indicates duplicates
        for i, box1 in enumerate(detected_boxes):
            for j, box2 in enumerate(detected_boxes[i+1:], i+1):
                # Calculate intersection over union (IoU)
                x_min = max(box1['x_min'], box2['x_min'])
                x_max = min(box1['x_max'], box2['x_max'])
                y_min = max(box1['y_min'], box2['y_min'])
                y_max = min(box1['y_max'], box2['y_max'])
                if x_max > x_min and y_max > y_min:
                    intersection = (x_max - x_min) * (y_max - y_min)
                    area1 = (box1['x_max'] - box1['x_min']) * (box1['y_max'] - box1['y_min'])
                    area2 = (box2['x_max'] - box2['x_min']) * (box2['y_max'] - box2['y_min'])
                    union = area1 + area2 - intersection
                    iou = intersection / union if union > 0 else 0
                    if iou > 0.5:
                        lower_conf = min(box1.get('confidence', 0.5), box2.get('confidence', 0.5))
                        issues.append(f"High overlap ({iou:.2f}) between boxes {i} and {j}; recommend removing lower-confidence box (confidence {lower_conf})")
    elif validation_method == "aspect_ratio":
        # Check for unreasonable aspect ratios (zero-height boxes are flagged too)
        for i, box in enumerate(detected_boxes):
            width = box['x_max'] - box['x_min']
            height = box['y_max'] - box['y_min']
            aspect_ratio = width / height if height > 0 else 0
            if aspect_ratio < 0.1 or aspect_ratio > 10:
                issues.append(f"Box {i}: unusual aspect ratio {aspect_ratio:.2f} (width/height); validate manually")
    else:
        # Fail loudly rather than report a pass for a check that never ran
        raise ValueError(f"Unsupported validation_method: {validation_method!r}")

    return {
        "issues": issues,
        "validation_passed": len(issues) == 0,
        "recommendation": (
            "Review flagged boxes manually; consider rerunning detection with adjusted confidence threshold"
            if issues else "No issues found"
        )
    }
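The validator above only reports overlapping boxes. To actually remove the duplicates, a greedy non-maximum suppression pass can be applied. This is a minimal sketch of the standard greedy approach; unlike many detector pipelines it is class-agnostic, which is an assumption made here for brevity, and the function names are illustrative.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    overlap_w = max(0, min(a["x_max"], b["x_max"]) - max(a["x_min"], b["x_min"]))
    overlap_h = max(0, min(a["y_max"], b["y_max"]) - max(a["y_min"], b["y_min"]))
    intersection = overlap_w * overlap_h
    area_a = (a["x_max"] - a["x_min"]) * (a["y_max"] - a["y_min"])
    area_b = (b["x_max"] - b["x_min"]) * (b["y_max"] - b["y_min"])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(boxes, iou_threshold=0.5):
    """Keep the highest-confidence box from each cluster of overlapping boxes."""
    kept = []
    # Visit boxes from most to least confident; keep one only if it does not
    # overlap too much with a box that was already kept.
    for box in sorted(boxes, key=lambda b: b.get("confidence", 0.0), reverse=True):
        if all(iou(box, other) <= iou_threshold for other in kept):
            kept.append(box)
    return kept
```

If two different object types legitimately overlap (a person on a bicycle, say), add a same-class condition before suppressing.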
Iterative Refinement Strategy
For critical applications, refine bounding box predictions iteratively:
def iterative_refinement_prompt(previous_detections, refinement_focus):
    """
    Prompt for refining previous bounding box detections.

    Args:
        previous_detections: Bounding boxes from previous detection attempt
        refinement_focus: What to focus on (e.g., "missed objects", "boundary precision", "small objects")

    Returns:
        Refinement prompt
    """
    prompt = f"""Review these previously detected bounding boxes and refine them.

Previous detections (may be incomplete or imprecise):
"""
    for i, det in enumerate(previous_detections, 1):
        prompt += f"{i}. {det['class']}: ({det['x_min']}%, {det['y_min']}%) to ({det['x_max']}%, {det['y_max']}%)\n"

    prompt += f"""
Refinement focus: {refinement_focus}

Task:
1. Verify each existing detection - is the bounding box accurate?
2. If a box is imprecise, provide corrected coordinates
3. Look for objects MISSED in the previous detection round
4. Provide updated detections in JSON format

Return:
{{
  "verified_detections": [{{refined boxes with corrected coordinates}}],
  "new_detections": [{{boxes for missed objects}}],
  "summary": "Brief explanation of changes and refinements"
}}"""
    return prompt
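A refinement loop needs a stopping rule, or it will keep spending model calls on negligible changes. The sketch below stops when a round adds no new objects and moves no coordinate by more than a tolerance. It assumes `refine_once` is a caller-supplied function (hypothetical) that sends the refinement prompt and returns the parsed JSON, and that verified boxes come back in the same order as the input.

```python
COORD_KEYS = ("x_min", "y_min", "x_max", "y_max")


def max_shift(old, new):
    """Largest change in any coordinate between two versions of a box."""
    return max(abs(old[k] - new[k]) for k in COORD_KEYS)


def refine_until_stable(initial, refine_once, max_rounds=3, tolerance=2):
    """Repeat refinement until boxes stop changing or max_rounds is reached."""
    detections = initial
    rounds = 0
    for _ in range(max_rounds):
        rounds += 1
        result = refine_once(detections)
        verified = result.get("verified_detections", [])
        new = result.get("new_detections", [])
        # Stable: nothing added, nothing dropped, no box moved beyond tolerance.
        stable = (
            not new
            and len(verified) == len(detections)
            and all(max_shift(o, n) <= tolerance for o, n in zip(detections, verified))
        )
        detections = verified + new
        if stable:
            break
    return detections, rounds
```

Capping the number of rounds matters because a model may oscillate between two slightly different answers and never formally converge.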
Key Takeaways
- VLMs can produce approximate bounding boxes via structured prompting, achieving 80-85% accuracy compared to specialized detectors at 95%+.
- Use normalized coordinates (0-100%) for portability across resolutions; pixel coordinates for direct image manipulation.
- Explicitly handle occlusion, partial visibility, and hierarchy levels to avoid hallucinating "completed" objects or missing details.
- Validate bounding boxes against expected properties (aspect ratio, overlap, scene composition) before deployment.
- Iterative refinement improves accuracy significantly; a second detection pass focusing on missed objects and boundary precision gains 5-15%.
Frequently Asked Questions
How accurate are vision language model bounding boxes compared to specialized object detectors?
VLMs achieve roughly 80-85% accuracy for coarse localization; specialized detectors (YOLO, Mask R-CNN) reach 95%+. These are rough figures that depend on the dataset and on how a match is scored (for example, the IoU threshold used). VLMs excel at semantic understanding (knowing what an object is) but are weaker at precise spatial localization. Use VLMs for flexible, general-purpose detection; use specialized models for high-precision applications.
Should I use pixel or normalized coordinates?
Normalized (0-100%) for flexibility and portability. Pixel coordinates if you need direct image cropping. Always include image dimensions for pixel coordinates to ensure reproducibility.
How do I detect very small objects (< 1% of image area)?
Small objects are challenging for VLMs. Increase image resolution (1600+ px), zoom into relevant regions (create sub-images), or use multi-region analysis focusing on the area where small objects are expected.
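When zooming into sub-images, boxes reported for a tile must be mapped back to full-image coordinates. The following is a minimal sketch, assuming a regular grid with a small overlap so objects on tile borders are not split, and assuming the model returns 0-100 normalized boxes per tile; `make_tiles` and `tile_box_to_image` are illustrative names.

```python
def make_tiles(image_width, image_height, rows=2, cols=2, overlap=0.1):
    """Return pixel rectangles that cover the image, padded by a fractional overlap."""
    tile_w = image_width / cols
    tile_h = image_height / rows
    pad_x, pad_y = tile_w * overlap, tile_h * overlap
    tiles = []
    for r in range(rows):
        for c in range(cols):
            tiles.append({
                "x_min": max(0, round(c * tile_w - pad_x)),
                "y_min": max(0, round(r * tile_h - pad_y)),
                "x_max": min(image_width, round((c + 1) * tile_w + pad_x)),
                "y_max": min(image_height, round((r + 1) * tile_h + pad_y)),
            })
    return tiles


def tile_box_to_image(box, tile):
    """Map a 0-100 normalized box reported for a tile to full-image pixel coordinates."""
    tile_w = tile["x_max"] - tile["x_min"]
    tile_h = tile["y_max"] - tile["y_min"]
    return {
        **box,
        "x_min": round(tile["x_min"] + box["x_min"] * tile_w / 100),
        "y_min": round(tile["y_min"] + box["y_min"] * tile_h / 100),
        "x_max": round(tile["x_min"] + box["x_max"] * tile_w / 100),
        "y_max": round(tile["y_min"] + box["y_max"] * tile_h / 100),
    }
```

Because tiles overlap, the same object can be reported twice, so the merged results generally need a deduplication pass afterwards.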
Can vision models detect objects they haven't seen in training?
To some extent, yes. VLMs generalize reasonably well to novel object categories if they share visual properties with training objects. For truly novel or domain-specific objects, provide examples or use semantic description ("detect square metal boxes approximately 10cm on each side").
What confidence threshold should I use?
Start at 0.75 for filtering. Lower the threshold if you can tolerate more false positives; raise it if you can tolerate more missed objects. For production, validate thresholds against a held-out test set of your actual images.
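Validating a threshold against a held-out set amounts to sweeping candidate values and comparing precision and recall. The sketch below assumes each prediction has already been marked correct or incorrect (for instance by IoU matching against labeled boxes); `sweep_thresholds` is a hypothetical helper, and the sample numbers are made up for illustration.

```python
def sweep_thresholds(predictions, num_ground_truth, thresholds=(0.5, 0.6, 0.7, 0.75, 0.8, 0.9)):
    """Compute (threshold, precision, recall) for each candidate threshold.

    predictions: list of (confidence, is_correct) pairs from a labeled test set.
    num_ground_truth: total number of labeled objects in that test set.
    """
    rows = []
    for t in thresholds:
        kept = [is_correct for confidence, is_correct in predictions if confidence >= t]
        true_positives = sum(kept)
        # With nothing kept, precision is undefined; 1.0 is used here by convention.
        precision = true_positives / len(kept) if kept else 1.0
        recall = true_positives / num_ground_truth if num_ground_truth else 0.0
        rows.append((t, precision, recall))
    return rows


# Illustrative data only: five predictions scored against four labeled objects.
sample = [(0.95, True), (0.9, True), (0.8, False), (0.7, True), (0.6, False)]
for threshold, precision, recall in sweep_thresholds(sample, num_ground_truth=4):
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Pick the threshold whose precision and recall trade-off matches the cost of each error type in your application.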