How to Train AI to Recognize Products on Shelves

The Challenge

Why does product classification on shelves matter in retail?

Imagine you are a large supermarket chain. You track which products arrive at your warehouse. Through point-of-sale data, you know exactly how many units of each item are sold. Using overhead cameras, you can roughly gauge how many people visit the store and what aisles they typically walk past. Yet, you likely have very little insight into what actually happens on the shelves themselves:

  • Is there any Coca-Cola left on the beverage shelf?
  • Have the promotional mandarins already sold out by midday?
  • Or maybe someone changed their mind at the checkout and left a bottle of beer in the toy aisle?

Without real-time shelf visibility, these issues go undetected until an employee happens to notice them during routine checks, which could be hours later.
If you had a clearer understanding of what happens on the shelves, you could:
  • Proactively dispatch store staff to address such issues instead of waiting for an employee to notice them during routine rounds.
  • Improve On-Shelf Availability (OSA), which in theory could boost revenue, as more products would be in stock and customers would always find what they are looking for (or make impulse purchases).
We partnered with two supermarket chains. Our journey began with a pilot in a small neighborhood convenience store, which successfully paved the way for scaling the solution to two flagship stores of a major retailer.
We had a straightforward goal: to accurately recognize every product on every shelf. From there, based on client requests, we would build downstream analytics, create "operations" for restocking from the warehouse, and so on. However, for now, we'll focus specifically on product recognition on shelves. There were numerous challenges, which, of course, is what makes it interesting. So let's walk through some of them:
  • First, we had to design and install a camera system capable of capturing shelf images every 30 minutes and reliably transmitting them for processing.
  • Lighting varies dramatically across different store departments, so images from the cameras look completely different. It even gets to the point where the same product on different shelves (for example, Coke in a refrigerator vs. Coke on a promotional display) appears entirely distinct.
  • Then there's the classic cold-start problem: what do we do when launching in a new store?
  • There is also the constant addition of new products and shelf rearrangements – factors beyond our control.
And these were just a few of the obstacles. In this series, we’ll delve deeper into how we approached classification amid such complexity. Meanwhile, my colleagues will cover related challenges (particularly on the hardware and deployment side) in their upcoming articles.

Data and Annotation

The Data Pipeline

Once our cameras were installed and integrated into the infrastructure, they began capturing and transmitting shelf images every 30 minutes directly to an S3 bucket. This gave us a steady stream of raw visual data, but not all images were equally useful for classification.
Here are a few examples of what the camera images might look like:
A standard shot where everything is clearly visible.
Perspective distortions + overexposure (the black edges on the sides are other shelves – they’ve simply been blurred here).
Small, similar-looking products. If you can’t read the text on the packaging, you’re out of luck.
Obstructions like refrigerator doors and glare on glass.
The images from the cameras first need to be filtered, as they often include people, shopping carts, boxes, and other obstructions. However, obstacle detection is a topic for another article.
Once an image is cleared for processing, we run a detector to locate every product on the shelf. In our case, we use a fairly standard, up‑to‑date YOLO model, so we won’t dwell on it here. After this step, we finally have the data required for classification: crops extracted based on detection coordinates from every photo and every camera.
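The crop-extraction step above is simple in principle: each detection box is cut out of the frame and passed downstream. A minimal sketch (the function name and box format are illustrative, not the production code):

```python
def extract_crops(image, detections):
    """Cut one crop per detection box out of a frame.

    image: frame as a list of pixel rows;
    detections: boxes as (x1, y1, x2, y2) tuples in pixel
    coordinates (x = column index, y = row index).
    """
    crops = []
    for x1, y1, x2, y2 in detections:
        # Slice the rows covered by the box, then the columns within each row.
        crops.append([row[x1:x2] for row in image[y1:y2]])
    return crops
```

In practice this is a NumPy slice over the decoded image array; the list-of-lists version just makes the indexing explicit.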

Annotation

Unfortunately, there are no models in the world capable of immediately identifying a product in a photo, even with a full catalog of 20,000+ items at hand. Therefore, we needed human annotation.

We initially started annotation in Label Studio but soon realized we needed far more custom functionality. As a result, we built our own tool, tailored specifically for annotating product classification on store shelves.

How to annotate not too little and not too much will be covered in a separate article. For now, let’s focus on the annotation outcome: from each camera, we annotated one frame every few days (on average), meaning each crop was assigned the correct product_id.

Errors in annotation happen—that’s normal. But at a certain point, we realized that for some product categories, annotation errors were almost the biggest problem, since the rest of the pipeline for those items was working well.

We tried various methods to clean up annotation errors: automatic approaches (like outlier detection per class using Cleanlab, clustering embeddings, etc.) and manual ones (narrowing down the list of classes to the most “suspicious,” visualizing them in FiftyOne, manually removing and re-annotating).

However, because many classes look very similar, automatic cleaning performed inconsistently, and striking a balance between high precision and acceptable recall proved impossible. In the end, manual cleaning turned out to be the most effective approach—unfortunately for our team’s time, but fortunately for the accuracy of our dataset.
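One of the automatic approaches mentioned above can be sketched as distance-to-centroid outlier detection: crops whose embedding lies unusually far from its class center are flagged for manual review rather than deleted outright. This is a simplified illustration (thresholding rule and parameter values are assumptions, not our production settings):

```python
import math

def flag_outliers(class_embeddings, z_thresh=2.0):
    """Flag annotated crops whose embedding lies unusually far from
    the class centroid (distance > mean + z_thresh * std).

    class_embeddings: embeddings of all crops annotated with ONE class.
    Returns a parallel list of booleans (True = suspicious annotation).
    """
    dim = len(class_embeddings[0])
    # Per-dimension mean of all embeddings in the class.
    centroid = [sum(e[i] for e in class_embeddings) / len(class_embeddings)
                for i in range(dim)]
    dists = [math.dist(e, centroid) for e in class_embeddings]
    mean = sum(dists) / len(dists)
    std = math.sqrt(sum((d - mean) ** 2 for d in dists) / len(dists))
    return [d > mean + z_thresh * std for d in dists]
```

As noted above, on visually near-identical classes this kind of automatic cleaning is unreliable, which is why flagged crops went to manual review.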

Solution v0: Embedder + Search Space

Classification vs. Metric Learning

Let's consider how to approach product classification in our environment.

Given:

  • Photos from cameras,
  • Crops from YOLO detections,
  • Annotations every few days.

We need to learn how to classify product crops.

The most intuitive method is to treat this as a conventional classification problem—train a model on a catalog of 20,000 products to output a probability distribution over these classes.

It’s straightforward to implement, but there’s a critical drawback: the product assortment is dynamic. New items appear regularly that the classifier has never seen, causing it to misassign them to known but incorrect classes.

The solution is to move from a classification framework to a metric learning paradigm. Rather than training a model to predict probabilities, we train an embedder that converts each input image into a representation (embedding).

We then compare this embedding against a collection of embeddings from previously annotated crops (our search space) to identify the correct class.

When a new product is introduced, we simply annotate a few examples and add their embeddings to the search space. If the embedder is well-trained, it will generate an embedding close to these new instances, enabling accurate classification.

For a pipeline to work in this paradigm, we need two components: an embedder and a search space.
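The core of the metric-learning setup can be sketched in a few lines: classification becomes a nearest-neighbor lookup by cosine similarity, and adding a new product is just appending a few annotated embeddings to the search space (names and toy 2-D embeddings below are illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def classify(query_emb, search_space):
    """search_space: (embedding, product_id) pairs from annotated crops.
    Returns the product_id of the nearest stored embedding."""
    _, product_id = max(search_space, key=lambda p: cosine(query_emb, p[0]))
    return product_id
```

Note that no retraining is needed when the assortment changes: a new class exists as soon as its embeddings are in the search space.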

Embedder & ArcFace

We experimented extensively with various compact models (ConvNeXt, ViT, Swin, etc.), but based on the balance of speed under constrained resources and overall quality, we selected ViT-Base-32.

But how do you train a model to produce good embeddings? The good news is that an embedder can be trained as a classifier: a classification task, when paired with a properly designed loss function, naturally teaches the model to form compact, well-separated embeddings for each class.

We use ArcFace, which employs an additive angular margin loss—a modified version of softmax that operates not on logits but on the angles between normalized embeddings and class weights. ArcFace “pushes” embeddings of the same class closer to their center on the hypersphere while simultaneously increasing the angular separation between classes, enforcing a more stringent geometric structure in the feature space.

The advantages of this approach are that the embeddings become significantly more discriminative:

  • Classes are separated by angles, not just by Euclidean distance;
  • Intra-class variability is reduced;
  • Inter-class distances are consistently larger.

As a result, an embedder trained with ArcFace is well suited for nearest-neighbor search, deduplication, and clustering tasks, even when the number of classes in production significantly exceeds the number of classes seen during training.
Embeddings from a model trained with ArcFace exhibit large inter-class distances and small intra-class distances, making them easy to separate. Source
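The angular-margin idea can be shown on the logits themselves: for the true class, ArcFace replaces cos(θ) with cos(θ + m) before scaling, which makes the target harder to satisfy and forces embeddings toward their class center. A minimal single-sample sketch (s and m are the standard scale and margin hyperparameters; the training loop and weight normalization are omitted):

```python
import math

def arcface_logits(cosines, target, s=64.0, m=0.5):
    """cosines: cos(theta_j) between the L2-normalized embedding and each
    L2-normalized class-weight vector; target: index of the true class.
    Returns the margin-adjusted, scaled logits fed into softmax."""
    out = []
    for j, c in enumerate(cosines):
        if j == target:
            # Add the angular margin m to the true class only.
            theta = math.acos(max(-1.0, min(1.0, c)))
            out.append(s * math.cos(theta + m))
        else:
            out.append(s * c)
    return out
```

Because the margin lowers the true-class logit, the model must push the embedding closer to the class center (smaller θ) to achieve the same loss as plain softmax.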

A few details on training the embedder:
  • Embedding size: 512 — a fairly standard choice.
  • Batch size: Larger is better, as it allows more product groups to fit into a single batch. When we switched from an RTX 3090 to an H100 (increasing batch size severalfold), we even observed a slight improvement in quality.
  • Sampling: We used batch balancing by company, with four crops per class. Since we had two vendors, each batch contained crops from only one company, and during inference we filtered searches by company as well. We also plan to introduce category-based sampling, so that a batch primarily contains items from a single category (there is little value in distinguishing cola from cookies, but distinguishing one cookie from another is meaningful).
  • Pretraining: Through experimentation, we found it best to pretrain on the rp2k dataset and then fine-tune on our own dataset.
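The balanced sampling described above can be sketched as follows: each batch draws several classes from a single company's catalog and takes a fixed number of crops per class (function and parameter names are illustrative; the real sampler also handles epochs and category-based grouping):

```python
import random

def make_batch(crops_by_class, classes_per_batch=8, crops_per_class=4):
    """crops_by_class: {product_id: [crop, ...]} for ONE company,
    so a batch never mixes vendors. Returns (crop, product_id) pairs."""
    classes = random.sample(list(crops_by_class), classes_per_batch)
    batch = []
    for cls in classes:
        pool = crops_by_class[cls]
        # Sample without replacement when possible, otherwise allow repeats.
        if len(pool) >= crops_per_class:
            picks = random.sample(pool, crops_per_class)
        else:
            picks = random.choices(pool, k=crops_per_class)
        batch.extend((crop, cls) for crop in picks)
    return batch
```

Keeping exactly four crops per class guarantees every class in the batch has positives to pull together and negatives to push apart.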

Search Space

The search space serves a single primary function: given an embedding, find the nearest embeddings as quickly as possible—whether via exact KNN (K-Nearest Neighbors) or approximate nearest neighbor (ANN) search. There are many implementations of vector stores that support this, including Faiss, Milvus, Qdrant, and even Redis with its vector database capabilities. We chose Qdrant because it is relatively easy to configure and scale, while providing all the functionality we needed.

Let’s build the pipeline step by step. We take all annotated crops, run them through the embedder to obtain embeddings, and store these embeddings in Qdrant. In addition to vectors, Qdrant allows us to store useful payload data (e.g., product_id, bbox_id, and other metadata) and to efficiently filter results using indexing.

But how do we search for the nearest vectors efficiently? If we have 2 million vectors in the search space, simply computing cosine similarity between the target embedding and all stored embeddings would be extremely slow. First, we can significantly narrow the search space by filtering on payload data (for example, restricting results to products from the current store or the current category). Second, we can use approximate search, also known as approximate nearest neighbors (ANN). This can be implemented in several ways, but Qdrant uses Hierarchical Navigable Small World (HNSW).

In short, HNSW is a data structure for fast nearest-neighbor search, built as a multi-level graph, where the upper levels provide coarse navigation and the lower levels perform precise local search. The search proceeds from top to bottom: first, Qdrant finds an approximate region where the nearest vector is likely to be located, and then refines the result at lower levels. This makes it possible to find the most relevant vectors very quickly without scanning the entire database.
Multi-level graph in HNSW. Source (great article, highly recommend!)

v0 pipeline: bare embedder

The pipeline is fairly simple. For each crop from each camera, we compute an embedding. If the crop is annotated, we store it together with its embedding and payload in Qdrant; if it is not annotated, we query the search space for the nearest vector, extract the product_id from the payload, and use it as the prediction for that crop. This pipeline achieved approximately 85% accuracy across all crops, which clearly leaves room for improvement.
How is the metric computed?

In the production pipeline, we do not have a classic test dataset with “clean” ground-truth labels: the assortment is constantly changing, products appear and disappear, and annotations are updated incrementally.

Therefore, the main end-to-end metric that we can compute automatically is the fraction of predictions that the annotator does not modify. The annotator receives a pre-annotation in the form of the top-1 prediction produced by the current pipeline. If the annotator confirms it, we count this prediction as correct. In this way, the metric reflects how often the system immediately provides the annotator with an acceptable result, without requiring manual correction.
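Computing the metric itself is trivial once confirm/correct events are logged; the sketch below assumes each review is recorded as a (predicted, confirmed) pair of product ids (the function and field names are illustrative):

```python
def preannotation_accuracy(reviews):
    """reviews: (predicted_id, confirmed_id) pairs, where confirmed_id is
    the label the annotator finally saved for the crop. A prediction the
    annotator left untouched counts as correct."""
    kept = sum(1 for predicted, confirmed in reviews if predicted == confirmed)
    return kept / len(reviews)
```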

This metric captures the real-world usefulness of the system in production quite well (in terms of annotator time savings and pipeline stability). However, it has a known limitation: in ambiguous cases, annotators tend to confirm the pipeline’s prediction even if they are not fully confident in its correctness. As a result, the metric is biased toward the current solution and does not represent a “pure” accuracy in the classical ML sense. Nevertheless, when comparing different versions of the pipeline under identical conditions, it correlates well with genuine improvements in quality.

Examples

Let’s look at a few interesting cases of nearest crop matches retrieved from the search space, along with their cosine similarity scores and product_ids.
The mayonnaise is lying down, not standing up, so it's hard to detect.
Two frequently confused items: orange juice and grapefruit juice.
The quest: differentiate the 160g can from the 190g can.
In this example, the crop is poorly lit, so the top-5 results actually correspond to five different products.

Solution v1: v0 + realgram

Issues with the v0 approach

There are several issues with this approach. No matter how well the embedder is trained, it will inevitably confuse visually similar products (for example, 0.5 L water vs. 1 L water; chicken seasoning vs. potato seasoning; oranges vs. mandarins). Sometimes this confusion is caused by camera quality or lighting conditions; in other cases, the products are genuinely indistinguishable even to the human eye (for instance, when the packaging or bottle is turned backward). In short, relying solely on visual features is clearly insufficient to reliably determine the correct class.

Enhancing visual context with spatial context

Each crop does not exist in isolation—it lives on a shelf, and that shelf is periodically annotated (on average, every few days). This gives us a strong additional signal—“What has been annotated on this shelf recently?”—that we have not yet leveraged.

The idea, therefore, is straightforward: incorporate this signal into the pipeline, as it should clearly improve results. The implementation, however, is slightly more involved than it may seem. We do not want to treat visual features (classes and scores from the search space) and spatial features (annotations from the shelf) independently; instead, we need to merge them in a way that leverages the strengths of both.

Ideally, we want an algorithm with the following properties:

  1. Primarily rely on the classes and scores returned by the search space.
  2. Incorporate recent annotations from the shelf, accounting for factors such as:
a. Spatial proximity between the annotation and the current crop. For example, if Coke is consistently on the left side of the shelf and Fanta on the right, then a crop on the left should receive a stronger signal for Coke.
b. Annotation recency. An annotation from two days ago should carry more weight than one from two weeks ago.

This algorithm was implemented under the name realgram (by analogy to a planogram, which represents what is planned to be on a shelf, whereas a realgram reflects what is actually there).

Pipeline v1: Embedder + realgram

The v0 pipeline remains unchanged for now: we compute an embedding for a crop, query Qdrant, and retrieve the nearest crops along with their payloads. Then:
  1. Gather recent shelf annotations. For the current shelf, we compile a combined list of products from annotations over the last 30 days. For example, if the first annotation contains products 1, 2, 2, 3 and the second contains 2, 2, 3, 4, the combined list becomes 1, 2, 2, 3, 4—meaning that for each unique product, we take the maximum occurrence count across recent annotations. If no annotations exist for this shelf, the realgram component is disabled and we fall back to the pure v0 pipeline.
  2. Calculate weighting factors for each product in the list:
a. Temporal decay coefficient. A hyperparameter controls how quickly annotations lose relevance over time, with older annotations receiving less weight.
b. Spatial overlap coefficient. Another hyperparameter determines how strongly to weigh annotations based on their coordinate overlap with the current crop.
  3. Compute a final coefficient for each product. We combine the temporal decay and spatial overlap coefficients, and for each unique product in the list (since duplicates may exist), we take the maximum resulting coefficient.
  4. Adjust the original Qdrant scores. The final coefficient (scaled by a global realgram weight hyperparameter that reflects how much we trust shelf annotations) is added to the original similarity scores returned by Qdrant.
  5. Obtain final scores: The resulting scores now incorporate both Qdrant similarity and realgram shelf-context signals. The top-1 score is used as the final prediction.
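The steps above can be sketched compactly: compute a freshness-and-overlap coefficient per shelf annotation, keep the maximum per product, and add it (scaled by the global realgram weight) to the Qdrant similarity scores. All hyperparameter names and values below are illustrative, not the production ones:

```python
import math

def apply_realgram(search_scores, shelf_annotations,
                   decay_days=7.0, realgram_weight=0.3):
    """search_scores: {product_id: top similarity from the search space};
    shelf_annotations: dicts with product_id, age_days, and overlap
    (coordinate overlap between the annotation and the current crop).
    Returns the product_id with the highest combined score."""
    boost = {}
    for ann in shelf_annotations:
        freshness = math.exp(-ann["age_days"] / decay_days)  # temporal decay
        coeff = freshness * ann["overlap"]                   # combined coefficient
        pid = ann["product_id"]
        boost[pid] = max(boost.get(pid, 0.0), coeff)         # max over duplicates
    final = {pid: score + realgram_weight * boost.get(pid, 0.0)
             for pid, score in search_scores.items()}
    return max(final, key=final.get)
```

With no recent annotations the boost dictionary stays empty and the result degrades gracefully to the pure v0 prediction.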
Visualizing the realgram is challenging, but an attempt to illustrate it looks something like this:
Realgram scores for the selected crop: "Rice & Buckwheat" fits best based on both coordinates and freshness, while "Druzhba" and "Buckwheat Kernels" also match in freshness but have a lower overlap score (S = staleness coefficient, SX = combined staleness and overlap coefficient).

Results and Remaining Issues

We achieved a solid baseline, progressing from a simple embedder with search space (85% accuracy) to the contextual Realgram algorithm (92% accuracy).

However, several issues still remain:

  1. Quality is still far from ideal. On challenging crops with poor lighting or partially obscured products, the system essentially makes near-random predictions, relying mostly on the context of nearby shelf annotations.
  2. Numerous "flickering" predictions. On consecutive frames from the same camera, a product doesn't physically change, but its predicted label "jumps" due to volatility in the search space top results.
  3. Too many hyperparameters. Realgram has about a dozen parameters that must be tuned manually or via Optuna. We wanted a more universal algorithm that could adapt these parameters autonomously.
  4. Many borderline cases. The correct product often ranks 2nd–10th in the search space top, while the top-1 result is wrong but has a high score (sometimes even higher than the correct product's Realgram score). Considering the overall top-K picture, rather than just the top-1 result, could help correct these errors.

In the next two articles, we will explain how we improved this pipeline using tracking and a second-level classifier.
Big thanks to our CV engineers for preparing this material: Oleksandr Korotaievskyi and Artem Smetanin.