How is the metric computed?
In the production pipeline, we do not have a classic test dataset with “clean” ground-truth labels: the assortment is constantly changing, products appear and disappear, and annotations are updated incrementally.
Therefore, the main end-to-end metric that we can compute automatically is the fraction of predictions that the annotator does not modify. The annotator receives a pre-annotation in the form of the top-1 prediction produced by the current pipeline. If the annotator confirms it, we count this prediction as correct. In this way, the metric reflects how often the system immediately provides the annotator with an acceptable result, without requiring manual correction.
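The computation itself is simple: over a window of annotation tasks, count how often the annotator's final label matches the pipeline's top-1 pre-annotation. A minimal sketch, assuming hypothetical record fields `predicted_label` and `final_label` (the real event schema is not shown in this section):

```python
from dataclasses import dataclass

@dataclass
class AnnotationEvent:
    # Hypothetical record of one annotation task: the pipeline's
    # top-1 pre-annotation and the label the annotator finally saved.
    predicted_label: str
    final_label: str

def confirmation_rate(events: list[AnnotationEvent]) -> float:
    """Fraction of pre-annotations the annotator left unchanged."""
    if not events:
        return 0.0
    confirmed = sum(e.predicted_label == e.final_label for e in events)
    return confirmed / len(events)

events = [
    AnnotationEvent("sku-101", "sku-101"),  # confirmed as-is
    AnnotationEvent("sku-202", "sku-202"),  # confirmed as-is
    AnnotationEvent("sku-303", "sku-404"),  # corrected by the annotator
    AnnotationEvent("sku-505", "sku-505"),  # confirmed as-is
]
print(confirmation_rate(events))  # → 0.75
```

Because the metric is derived directly from the annotation log, it can be recomputed continuously as new tasks arrive, without a frozen test set.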
This metric captures the real-world usefulness of the system in production quite well (in terms of annotator time savings and pipeline stability). However, it has a known limitation: in ambiguous cases, annotators tend to confirm the pipeline's prediction even when they are not fully confident in its correctness. As a result, the metric is biased toward the current solution and does not represent "pure" accuracy in the classical ML sense. Nevertheless, when different versions of the pipeline are compared under identical conditions, it correlates well with genuine improvements in quality.