Multi-object tracking (MOT) is the task of detecting objects in video and following them across frames while preserving their identities. Most existing methods obtain identities by associating only detection boxes whose confidence scores exceed a threshold. This creates a problem: small or occluded objects can be discarded entirely because their scores fall below the cutoff, even though they are real objects worth tracking.
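The problem can be seen in a minimal sketch. The detection names and the 0.5 cutoff below are illustrative, not from the paper: a single confidence threshold throws out an occluded object together with the actual noise.

```python
# Hypothetical detections as (label, confidence); labels are illustrative only.
detections = [
    ("pedestrian_visible", 0.92),
    ("pedestrian_occluded", 0.35),  # partially hidden, so the detector is unsure
    ("background_blur", 0.08),      # genuine noise
]

THRESHOLD = 0.5  # a typical single confidence cutoff

kept = [name for name, score in detections if score >= THRESHOLD]
dropped = [name for name, score in detections if score < THRESHOLD]

print(kept)     # only the clearly visible object survives
print(dropped)  # the occluded pedestrian is discarded along with the noise
```

The occluded pedestrian lands in the same bucket as the background blur, which is exactly the information loss BYTE is designed to avoid.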
Researchers from Huazhong University of Science and Technology, The University of Hong Kong, and ByteDance have found a way to recover the objects in low-score detection boxes by exploiting their similarity to existing tracklets.
The research group developed a simple and effective association method called BYTE that makes full use of detection boxes, from high scores to low ones, in the matching process. The name reflects the core idea: every detection box is treated as a basic unit of a tracklet, just as a byte is a basic unit of information in a program.
The researchers first match high-score detection boxes with existing tracklets based on motion similarity: a Kalman filter predicts each tracklet's location in the new frame, and similarity is computed as the IoU between the predicted box and the detection box. A second matching step then associates the tracklets left unmatched in the first round with the low-score detection boxes. In this second round, genuine objects with low detection scores, such as partially occluded pedestrians, are correctly recovered, while low-score background detections that match no tracklet are removed.
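The two-stage procedure can be sketched as follows. This is a simplified illustration, not the authors' implementation: the paper uses the Hungarian algorithm for assignment and a Kalman filter for prediction, whereas here matching is greedy, the predicted tracklet boxes are taken as given, and the score thresholds (0.6, 0.1) and IoU threshold (0.3) are assumed for illustration.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def greedy_match(tracks, dets, iou_thresh):
    """Greedy stand-in for the Hungarian assignment used in the paper."""
    pairs, used_t, used_d = [], set(), set()
    cands = sorted(((iou(t, d), ti, di)
                    for ti, t in enumerate(tracks)
                    for di, d in enumerate(dets)), reverse=True)
    for score, ti, di in cands:
        if score < iou_thresh:
            break
        if ti not in used_t and di not in used_d:
            pairs.append((ti, di))
            used_t.add(ti)
            used_d.add(di)
    unmatched = [ti for ti in range(len(tracks)) if ti not in used_t]
    return pairs, unmatched

def byte_associate(track_boxes, detections, high=0.6, low=0.1, iou_thresh=0.3):
    """Two-stage BYTE-style association.

    track_boxes: predicted tracklet boxes (x1, y1, x2, y2), e.g. from a
    Kalman filter (prediction itself is omitted here for brevity).
    detections: boxes with a trailing confidence score (x1, y1, x2, y2, s).
    """
    high_dets = [d for d in detections if d[4] >= high]
    low_dets = [d for d in detections if low <= d[4] < high]
    # Stage 1: high-score detections vs. all tracklets.
    m1, rem_tracks = greedy_match(track_boxes, [d[:4] for d in high_dets], iou_thresh)
    # Stage 2: remaining tracklets vs. low-score detections; low-score
    # boxes that match nothing here are treated as background and dropped.
    rem_boxes = [track_boxes[ti] for ti in rem_tracks]
    m2, _ = greedy_match(rem_boxes, [d[:4] for d in low_dets], iou_thresh)
    matches = [(ti, high_dets[di]) for ti, di in m1]
    matches += [(rem_tracks[ti], low_dets[di]) for ti, di in m2]
    return matches

# Demo: track 0 is matched in stage 1; the occluded, low-score detection
# near track 1 is recovered in stage 2 instead of being discarded.
tracks = [(0.0, 0.0, 10.0, 10.0), (20.0, 20.0, 30.0, 30.0)]
dets = [(1.0, 1.0, 11.0, 11.0, 0.9), (21.0, 21.0, 31.0, 31.0, 0.2)]
matches = byte_associate(tracks, dets)
print(matches)
```

A single-threshold tracker would have dropped the 0.2-score box outright; here it is kept because it overlaps a tracklet that stage 1 left unmatched.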
The researchers evaluated the proposed association method by applying it to nine different state-of-the-art trackers, including Re-ID-based, motion-based, chain-based, and attention-based ones. BYTE brought notable improvements on all metrics, including MOTA, IDF1 score, and the number of ID switches.
Building on BYTE, the research group proposed a simple and robust tracker named ByteTrack that sets a new state of the art for MOT. It adopts the high-performance detector YOLOX to obtain detection boxes and associates them with BYTE. The method achieves highly competitive tracking performance with an extremely simple motion model, requiring no Re-ID modules or attention mechanisms.