Computer Vision for Retail, Part 4: Shelf Obstacle Detection

This article continues our series on computer vision for retail.

What This Article Covers

When a camera captures a shelf, shoppers, carts, or bags often end up between the lens and the products. For the product classification pipeline to work correctly, the system needs to detect these obstacles and determine whether they're actually blocking the merchandise or simply appearing in the background. This article describes a two-level approach to solving that problem.

Shelf Obstacle Detection

Goal: Identify objects that partially or fully obscure the shelf area, so the pipeline can operate reliably. When a product is fully blocked, the system triggers a reshoot; when it's partially blocked, the system isolates the obstacle and excludes it from the analysis zone.

Data: Raw, unprocessed photos from cameras pointed at supermarket shelves.

Our Approach

We use a two-level deep learning pipeline for shelf obstacle detection.

General pipeline of obstacle detection

Level 1: Object Detection and Segmentation

The first level runs YOLO11 in instance segmentation mode, which produces pixel-accurate masks for every detected object.

Object Classes

The model is trained to detect and segment 8 classes:

product — items on shelves
obstacle — removable obstacles blocking products
cart — shopping carts
customer — shoppers
worker — store staff
obstacle_equipment — equipment (shelving units, machinery, etc.)
obstacle_bag — bags
unknown — unidentified objects (non-removable obstacles)

This level of detail lets the system not only flag the presence of an obstacle, but also classify its type — which matters for downstream analysis and decision-making.

Annotation Format

Annotations are in YOLO segmentation format. Each object is stored with:

a class ID (0–7)
an object polygon in normalized coordinates (x, y ∈ [0, 1])
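To make the format concrete, here is a minimal Python sketch that parses one label line into a class ID and a pixel-space polygon. The function name `parse_label_line`, the `CLASS_NAMES` order (taken from the order of the class list above), and the sample line are assumptions for illustration, not the project's actual code.

```python
from typing import List, Tuple

# Assumption: class IDs 0-7 follow the order in which the classes are listed above.
CLASS_NAMES = ["product", "obstacle", "cart", "customer", "worker",
               "obstacle_equipment", "obstacle_bag", "unknown"]


def parse_label_line(line: str, img_w: int, img_h: int) -> Tuple[int, List[Tuple[float, float]]]:
    """Parse one line of a YOLO segmentation label file.

    Expected layout: <class_id> x1 y1 x2 y2 ... with x, y normalized to [0, 1].
    Returns the class ID and the polygon converted to pixel coordinates.
    """
    parts = line.split()
    class_id = int(parts[0])
    coords = [float(v) for v in parts[1:]]
    if not 0 <= class_id < len(CLASS_NAMES):
        raise ValueError(f"class ID out of range: {class_id}")
    if len(coords) < 6 or len(coords) % 2 != 0:
        raise ValueError("a polygon needs at least three (x, y) pairs")
    if any(not 0.0 <= v <= 1.0 for v in coords):
        raise ValueError("coordinates must be normalized to [0, 1]")
    # Pair up x and y values and scale them to the image size.
    polygon = [(x * img_w, y * img_h) for x, y in zip(coords[0::2], coords[1::2])]
    return class_id, polygon


# Hypothetical label line: a triangular "cart" polygon.
class_id, polygon = parse_label_line("2 0.10 0.20 0.50 0.20 0.50 0.80", img_w=1000, img_h=500)
print(CLASS_NAMES[class_id], polygon)
```

Validating the ID range and the coordinate bounds at parse time is a cheap way to catch annotation errors before training.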

The dataset was originally labeled with bounding boxes only. To convert to polygon annotations, we used a semi-automated pipeline: YOLO detected bounding boxes, SAM (Segment Anything Model) generated masks from those regions, and the results were then manually reviewed and corrected.

Why Instance Segmentation Over Bounding Boxes

More precise localization. A mask captures the actual shape of an object, not just its rectangular envelope — this is critical when assessing partial product occlusion.
Accurate overlap estimation. The intersection area between a product and an obstacle is calculated from masks, not bounding boxes. This significantly reduces error, especially for elongated or irregularly shaped objects.
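The difference between the two overlap estimates is easy to see on a toy example. The sketch below uses NumPy and a synthetic scene (a square product and a thin diagonal obstacle); the helper names are hypothetical and the scene is not taken from the dataset.

```python
import numpy as np


def bbox_of(mask: np.ndarray) -> tuple:
    """Tight bounding box (x1, y1, x2, y2) of a mask; x2 and y2 are exclusive."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1


def bbox_overlap_area(a: tuple, b: tuple) -> int:
    """Intersection area of two boxes in pixels."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)


def mask_overlap_area(a: np.ndarray, b: np.ndarray) -> int:
    """Intersection area of two boolean masks in pixels."""
    return int(np.logical_and(a, b).sum())


# Synthetic scene: a 20x20 product and a 1-pixel-wide diagonal obstacle.
product = np.zeros((100, 100), dtype=bool)
product[40:60, 40:60] = True
obstacle = np.eye(100, dtype=bool)

print(bbox_overlap_area(bbox_of(product), bbox_of(obstacle)))  # 400
print(mask_overlap_area(product, obstacle))                    # 20
```

The obstacle's bounding box spans the whole frame, so the box-based estimate reports the product as fully covered, while the mask-based estimate shows that only 20 of its 400 pixels are hidden.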

Level 2: Obstacle Position Estimation

First-level segmentation alone isn't enough for practical use. Masks tell you what an object is and where it projects onto the image — but not whether an obstacle is in front of the product or behind it.

On retail shelves, this distinction is crucial. In a 2D projection, an object behind the shelf — say, a shopper in the aisle or a cart on a neighboring row — can visually "overlap" a product's mask even though it doesn't actually block access to the item. Relying on segmentation alone would flag these as obstacles and generate false positives further down the pipeline.

The second level addresses this by using scene depth information to:

filter out obstacles that lie behind the shelf plane
determine the position of each obstacle relative to the products (in front of / behind / between)

Model Architecture

Depth estimation uses a pretrained Depth Anything V2 model built on a Vision Transformer backbone. The architecture consists of a ViT encoder and a DPT (Dense Prediction Transformer) decoder, which converts multi-scale features into a depth map at the original image resolution. The map is inverted so that objects closer to the camera receive lower values, which matches the comparisons against the product's reference depth used during obstacle classification.

For each product defined by a bounding box, the system computes a reference depth value using an adaptive approach:

if the center of the bounding box isn't obscured by an obstacle, depth is averaged over a 5×5-pixel window around the center
if the center is obscured, depth is averaged across the entire bounding box area, excluding regions covered by obstacles
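The two rules above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the production code: the function name `reference_depth` is hypothetical, the box is given as pixel coordinates with exclusive right and bottom edges, "obscured" is checked at the single center pixel, and `None` is returned when no visible pixel remains.

```python
from typing import Optional, Tuple

import numpy as np


def reference_depth(depth: np.ndarray,
                    box: Tuple[int, int, int, int],
                    obstacle_mask: np.ndarray,
                    window: int = 5) -> Optional[float]:
    """Reference depth of a product.

    depth:         HxW float depth map
    box:           (x1, y1, x2, y2) in pixels, x2 and y2 exclusive
    obstacle_mask: HxW boolean array, True where any obstacle covers the pixel
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) // 2, (y1 + y2) // 2

    if not obstacle_mask[cy, cx]:
        # Center is visible: average a small window around it (clipped at the top/left edge;
        # NumPy slicing clips at the bottom/right edge on its own).
        half = window // 2
        patch = depth[max(cy - half, 0):cy + half + 1, max(cx - half, 0):cx + half + 1]
        return float(patch.mean())

    # Center is obscured: average over the visible part of the whole box.
    region = depth[y1:y2, x1:x2]
    visible = ~obstacle_mask[y1:y2, x1:x2]
    if not visible.any():
        return None  # assumption: a fully covered product has no usable reference depth
    return float(region[visible].mean())
```

Averaging over a window rather than reading a single pixel makes the reference value less sensitive to local noise in the depth map.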

Obstacle Classification

Each obstacle (defined by a polygon) is classified relative to a product in three steps.

Step 1: Geometric intersection. The obstacle's polygon is rasterized into a mask, and the mask is checked for intersection with the product's bounding box. If there's no intersection, the obstacle is excluded from analysis.

Step 2: Geometric heuristics. If the intersection area exceeds 80% of the obstacle's area, or if the obstacle extends below the product's lower boundary, it's classified as being in front of the product.

Step 3: Depth distribution analysis. The system calculates the proportion of pixels in the obstacle region whose depth value is lower than the product's reference depth, then applies thresholds:

more than 70% of such pixels → obstacle is in front of the product
fewer than 30% → obstacle is behind the product
anything in between → between / unknown

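Condition	Classification
No intersection with the product's bounding box	excluded
Intersection exceeds 80% of the obstacle's area	in front
Obstacle extends below the product's lower boundary	in front
More than 70% of obstacle pixels have lower depth than the reference	in front
Fewer than 30% of obstacle pixels have lower depth than the reference	behind
Otherwise	between / unknown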
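The three steps can be combined into one function. The following NumPy sketch is one possible reading of the rules, with several assumptions: the function name and label strings are hypothetical, the product box has exclusive right and bottom edges, image rows grow downward (so "below" means a larger row index), and the depth proportion in Step 3 is taken over the whole obstacle mask.

```python
from typing import Tuple

import numpy as np


def classify_obstacle(obstacle_mask: np.ndarray,
                      product_box: Tuple[int, int, int, int],
                      depth: np.ndarray,
                      ref_depth: float,
                      area_thr: float = 0.8,
                      front_thr: float = 0.7,
                      behind_thr: float = 0.3) -> str:
    """Classify an obstacle relative to one product.

    obstacle_mask: HxW boolean mask of the obstacle
    product_box:   (x1, y1, x2, y2) in pixels, x2 and y2 exclusive
    depth:         HxW depth map; ref_depth is the product's reference depth
    """
    x1, y1, x2, y2 = product_box
    area = int(obstacle_mask.sum())
    inter = int(obstacle_mask[y1:y2, x1:x2].sum())

    # Step 1: no geometric intersection, the obstacle is irrelevant for this product.
    if area == 0 or inter == 0:
        return "excluded"

    # Step 2: geometric heuristics.
    lowest_row = int(np.nonzero(obstacle_mask.any(axis=1))[0].max())
    if inter / area > area_thr or lowest_row >= y2:
        return "in_front"

    # Step 3: share of obstacle pixels with depth lower than the reference.
    lower_share = float((depth[obstacle_mask] < ref_depth).mean())
    if lower_share > front_thr:
        return "in_front"
    if lower_share < behind_thr:
        return "behind"
    return "between"
```

Keeping the thresholds as keyword arguments makes it straightforward to tune them per camera setup without touching the logic.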

Metrics and Alternative Experiments

Model Evaluation

We use a combination of standard computer vision metrics and business metrics that reflect the system's real-world value.

Detection metrics:

Precision (PR) — share of correctly identified obstacles among all model predictions
Recall (RC) — share of detected obstacles among all that actually exist
F1-score — harmonic mean of Precision and Recall
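For reference, the three metrics can be computed from true positive, false positive, and false negative counts as in the short sketch below. How predictions are matched to ground truth objects is not described here, so the counts are assumed to be available already; the function name and the sample counts are hypothetical.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from detection counts.

    tp: obstacles correctly detected
    fp: predictions that match no real obstacle
    fn: real obstacles the model missed
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, not the article's evaluation data.
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The guards against zero denominators keep the function usable on images where the model predicts nothing or where no obstacles exist.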

Best model results:

Metric	Value
Precision	0.84
Recall	0.7461
F1-score	0.7881

Business metrics:

Business metrics only account for obstacles that intersect the product zone — others don't affect the pipeline and are treated as irrelevant.

Accuracy (overall classification accuracy with business constraints applied): 0.9415 (94.15%)
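One way to read this business constraint is as a filter applied before computing accuracy. The sketch below is an interpretation rather than the evaluation code actually used: the `ObstacleRecord` type, its field names, and the label strings are all assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ObstacleRecord:
    intersects_product_zone: bool  # does the obstacle overlap the product zone?
    predicted: str                 # label produced by the pipeline
    expected: str                  # label from manual review


def business_accuracy(records: Iterable[ObstacleRecord]) -> Optional[float]:
    """Accuracy over obstacles that intersect the product zone; others are ignored."""
    relevant = [r for r in records if r.intersects_product_zone]
    if not relevant:
        return None  # nothing relevant to score
    correct = sum(r.predicted == r.expected for r in relevant)
    return correct / len(relevant)


# Hypothetical records: the second one is wrong but irrelevant, so it is not counted.
records = [
    ObstacleRecord(True, "in_front", "in_front"),
    ObstacleRecord(False, "behind", "in_front"),
    ObstacleRecord(True, "behind", "in_front"),
    ObstacleRecord(True, "behind", "behind"),
]
print(business_accuracy(records))
```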

The "Carved-Out Mask" Problem with Overlapping Objects

In real scenes, classes frequently overlap at the pixel level: shoppers, staff, carts, bags, and other obstacles can all cover products. This is critical for our pipeline — downstream we use mask geometry and intersection area to determine whether and how much a product is blocked.

On a subset of the data, we ran into a characteristic artifact: when objects overlapped, unstable (jagged) masks appeared. Specifically, a product's mask could "eat into" an obstacle's mask, or produce malformed geometry in the overlap zone. This degraded occlusion estimates and caused inference errors: the obstacle was visually present, but its mask partially disappeared precisely where it mattered most, on top of the product.

Example of model inference before class priority in labeling

Solution: explicit class hierarchy in annotations. We introduced prioritization: all obstacle classes and "foreground" objects are treated as higher priority than product. The logic is straightforward — pixels occupied by an obstacle shouldn't belong to a product mask, so the obstacle can't be "punched through" by the product mask and its contour remains intact.

This was implemented as an annotation preprocessing step:

A combined occupancy mask is built from all "senior" class objects (union across pixels)
Pixels occupied by obstacles are subtracted from each product mask
If a product mask splits into multiple components after subtraction, contours are found and saved as new annotation polygons
"Senior" class masks are saved unchanged — they "win" any overlap conflict

After introducing the hierarchy, we got stable obstacle masks without carve-outs in overlap zones, more accurate intersection area estimates, and fewer edge cases in the second-level logic. Explicit annotation hierarchy turned ambiguous class overlaps into a deterministic rule — and this noticeably improved instance segmentation stability in practice.
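The annotation preprocessing steps described above can be sketched with NumPy as follows. This is a simplified illustration: the function names are hypothetical, masks are assumed to be boolean arrays of the same shape, components are found with a plain 4-connected flood fill, and the final conversion of each component back into a polygon (for example with OpenCV's `cv2.findContours`) is left out.

```python
from collections import deque
from typing import List, Tuple

import numpy as np


def split_components(mask: np.ndarray) -> List[np.ndarray]:
    """Split a boolean mask into its 4-connected components."""
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    components = []
    for sy, sx in zip(*np.nonzero(mask)):
        if seen[sy, sx]:
            continue
        comp = np.zeros((h, w), dtype=bool)
        queue = deque([(sy, sx)])
        seen[sy, sx] = True
        while queue:
            y, x = queue.popleft()
            comp[y, x] = True
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                    seen[ny, nx] = True
                    queue.append((ny, nx))
        components.append(comp)
    return components


def apply_class_priority(product_masks: List[np.ndarray],
                         senior_masks: List[np.ndarray]
                         ) -> Tuple[List[np.ndarray], List[np.ndarray]]:
    """Remove pixels owned by higher-priority classes from every product mask."""
    if not product_masks:
        return [], senior_masks
    # Union of all higher-priority ("senior") masks.
    occupied = np.zeros(product_masks[0].shape, dtype=bool)
    for m in senior_masks:
        occupied |= m
    new_products = []
    for m in product_masks:
        remainder = m & ~occupied
        # A product cut in two by an obstacle becomes two separate annotations.
        new_products.extend(split_components(remainder))
    # Senior masks are returned unchanged: they win every overlap conflict.
    return new_products, senior_masks
```

For example, a product crossed by a vertical obstacle strip comes back as two product masks, one on each side of the strip, while the strip itself is untouched.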

Example of inference from an overfitted model after class priority in labeling

Architectures Tested

We evaluated several alternative approaches during development:

ModelResNet_RGBA — a ResNet18-based classifier using RGBA channels, where the alpha channel carries additional camera position information. This helps the model better understand the scene context and obstacle positions relative to the shelf. (Source)
PSPNet (Pyramid Scene Parsing Network) — a segmentation model from mmsegmentation that uses pyramid pooling to capture context at multiple scales. It handles semantic segmentation well, but required significant additional adaptation for instance segmentation. (Source)

Why We Chose YOLO Segmentation

After a comparative evaluation, YOLO segmentation came out on top for four reasons:

Optimal accuracy-speed tradeoff, critical for real-time processing
Native instance segmentation support, enabling correct handling of overlapping objects
Straightforward integration into the existing pipeline
Training efficiency: strong results without heavy computational overhead

Visualizations and Examples

First level: comparison of Ground Truth and model predictions

Comparison of model detection results with ground truth annotation. Left panel — Ground Truth, right panel — model predictions.

However, the model is not immune to errors. Here is an example where a fairly obvious person in the frame was not detected:

There are also more unexpected cases — for example, a shopping cart handle.

Second level: determining the position of obstacles

Results of determining the position of obstacles relative to the shelf using the second-level model

Results

We developed and tested a two-level obstacle detection pipeline for retail shelves.

At the first level, the YOLO segmentation model handles detection and instance segmentation, providing pixel-accurate localization of both products and potential obstacles. This enables correct analysis of partial occlusions and handles complex object geometry — something bounding-box approaches simply can't do.

At the second level, depth analysis powered by Depth Anything V2 is integrated into the pipeline. The combination of geometric heuristics and depth distribution analysis determines the position of obstacles relative to the shelf plane and effectively filters out objects located behind it — significantly reducing false positives compared to pure 2D segmentation.

System quality is validated by both standard and business metrics: Precision = 0.84, Recall = 0.746, F1 = 0.788, business accuracy = 94.15%. These results show the system holds up reliably even in dynamic scenes with shoppers, carts, and partially obscured products.

Combining instance segmentation with depth analysis proved to be an effective, scalable approach for this task. The pipeline integrates cleanly into existing infrastructure and can be extended to adjacent retail analytics problems: product availability monitoring, shopper behavior analysis, and store floor condition tracking.

Special thanks to our engineers Alexander Korotaevsky and Artem Smetanin for preparing this article!
