A typical photo with YOLO detections. An example of a camera is visible at the bottom of the photo.
A column represents a single image from the camera, and a row represents one set of coordinates. On this particular camera the products hardly change over time and look almost identical, so tracking works well. However, not all cameras behave this well.
Original DeepSORT: first, matching by embeddings (with a Kalman filter used to filter out unrealistic candidates), and then IoU-based matching for the detections left unmatched by embeddings.
Our version of DeepSORT: first, IoU-based matching, and then embedding-based matching only for the detections already matched by IoU (in our case products hardly move, and visual descriptors matter more to us).
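The reversed matching order can be sketched roughly as follows. This is a simplified greedy illustration, not our production code: the function names, data layout, and thresholds are all assumptions, and a real implementation would use Hungarian assignment over a full cost matrix.

```python
# Sketch: IoU-first matching, with embeddings only confirming IoU candidates.
# (In original DeepSORT the order is reversed: embeddings first, IoU second.)

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def cosine(u, v):
    """Cosine similarity of two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = sum(x * x for x in u) ** 0.5
    nv = sum(x * x for x in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def match(tracks, detections, iou_thr=0.5, emb_thr=0.7):
    """Greedy IoU-first matching; a visual check then confirms each pair."""
    matches, used = [], set()
    for ti, t in enumerate(tracks):
        best, best_iou = None, iou_thr
        for di, d in enumerate(detections):
            if di in used:
                continue
            o = iou(t["box"], d["box"])
            if o >= best_iou:
                best, best_iou = di, o
        # Confirm the spatial match with the visual descriptor.
        if best is not None and cosine(t["emb"], detections[best]["emb"]) >= emb_thr:
            matches.append((ti, best))
            used.add(best)
    return matches
```

Because products barely move between frames, the IoU step already narrows each track to essentially one candidate, and the embedding check only rejects pairs that look visually different.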
About the “most recent unused annotation”
Tracking across frames is a good idea, but if the only source for tracking is the pipeline's own prediction (with, say, 90% accuracy), then in 10% of cases tracking will propagate the error further. To reduce this effect, we use annotations as a more reliable source.
For each frame, we check whether an annotation is available. When a new annotation appears, we run tracking from it once and then continue tracking on the intermediate frames. Yes, with each frame the number of tracks created from annotations (we call them "anchor" tracks) decreases, but depending on the thresholds, between 20% and 50% of anchor tracks survive between annotations.
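The re-seeding logic above can be sketched like this. All names here (`step`, `propagate`, the state keys) are illustrative assumptions, not our production code:

```python
# Sketch: refresh "anchor" tracks from a fresh annotation, otherwise
# propagate existing tracks to the current frame's predictions.

def _iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def propagate(tracks, preds, iou_thr=0.5):
    # A track survives only while some prediction keeps matching it,
    # which is why the share of anchor tracks shrinks frame by frame.
    kept = []
    for t in tracks:
        hit = next((p for p in preds if _iou(t["box"], p["box"]) >= iou_thr), None)
        if hit is not None:
            kept.append({**t, "box": hit["box"]})
    return kept

def step(state, frame_preds, annotation=None):
    """Advance tracking by one frame, re-seeding from a fresh annotation."""
    if annotation is not None and annotation["id"] != state.get("last_ann_id"):
        # Re-seed: every track now originates from the reliable annotation.
        state["tracks"] = [{"box": b, "label": l, "anchor": True}
                           for b, l in annotation["objects"]]
        state["last_ann_id"] = annotation["id"]
    else:
        state["tracks"] = propagate(state["tracks"], frame_preds)
    return state
```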
The v2 pipeline looks roughly as follows: we obtain intermediate predictions from the search space and the realogram, obtain predictions from tracking, and then choose between them based on a score with a threshold.
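The final selection step can be sketched in a few lines. This is a minimal illustration, assuming each source returns a `(label, score)` pair; the function name and threshold value are assumptions:

```python
# Sketch: choose between the tracking prediction and the pipeline's own
# prediction using a score threshold on the tracking side.

def select_prediction(model_pred, track_pred, track_score_thr=0.8):
    """Prefer the tracking prediction when its score clears the threshold,
    otherwise fall back to the pipeline's own prediction."""
    if track_pred is not None and track_pred[1] >= track_score_thr:
        return track_pred
    return model_pred
```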