Researchers from U Texas and Apple Propose a Novel Transformer-Based Architecture for Global Multi-Object Tracking

This research summary article is based on the research paper 'Global Tracking Transformers'.

Multi-object tracking aims to locate and track all objects in a video feed. It’s a fundamental component in domains like mobile robots, where an autonomous system must navigate dynamic surroundings populated by other mobile agents. Thanks to breakthroughs in deep learning and object detection, tracking-by-detection has become the dominant tracking paradigm in recent years.

Tracking-by-detection simplifies the process by reducing it to just two steps: detection and association. First, an object detector searches each video stream frame for probable items. The second phase is an association step, which connects detections over time. Local trackers are greedy when it comes to pairwise relationships. They keep track of each trajectory’s state based on its position and/or identity traits and correlate current-frame detections with it based on its last visible status.

This pairwise association is effective, but it lacks an explicit description of all paths and difficulties with significant occlusion or strong appearance change. Over pairwise associations, global trackers perform offline graph-based combinatorial optimization. They are more resilient and can resolve inconsistently clustered detections, although they are slow and usually separated from the detector.

Apple researchers recently published a paper demonstrating how to express worldwide surveillance as a few layers in a deep network. Because the network generates trajectories directly, it avoids both pairwise association and graph-based optimization. The researchers demonstrate how detectors can be enhanced with transformer layers to become combined detectors and trackers.

The Global TRacking Transformer (GTR) encodes detections from several successive frames and groups them into trajectories using trajectory queries. The queries are non-maximum suppressed detection features from a single frame (e.g., the current frame in an online tracker) that are translated into trajectories by the GTR. Each trajectory query generates a single global trajectory by utilizing a softmax distribution to allocate a detection from each frame.


The model’s outputs are thus detections and their temporal relationships. Using ground-truth trajectories and associated image-level bounding boxes, the team explicitly oversees the output of the global tracking transformer during training. They run GTR in a sliding window mode with a moderate temporal size of 32 frames during inference and online connect trajectories between windows. Within the temporal window, the model is end-to-end differentiable.

The framework was inspired by the recent success of transformer models in computer vision, namely in object detection. The cross-attention structure between queries and encoder features mines object similarities and easily matches the multi-object tracking association purpose. Within a temporal window, researchers undertake cross-attention between trajectory questions and object features and explicitly supervise it to generate a query-to-detections assignment.

Each task has a global trajectory that it corresponds to. Unlike transformer-based detectors that learn queries as fixed parameters, the queries come from existing detection features and evolve with the image content. Furthermore, the converter works with identified objects rather than working with raw pixels. This allows the team to take advantage of well-developed object detectors to their maximum potential.

The goal is to learn a tracking transformer from a set of ground-truth trajectories that estimates PA and, indirectly, the trajectory distribution. The researchers combined tracking and detection training by considering the transformer as a RoI head, similar to two-stage detectors. After non-maximum suppression, they obtain high-confidence objects and their accompanying features at each training iteration.

The approach is evaluated using two tracking benchmarks: TAO and MOT17. TAO is capable of tracking a wide range of objects. The photos were taken from six different video collections, which included indoor, outdoor, and driving settings. In a long-tail environment, the dataset necessitates tracking objects with a broad vocabulary of 488 types. In crowd settings, MOT tracks pedestrians. There are seven training sequences and seven test sequences in total. The sequences encompass 500 to 1500 frames that were captured and annotated at a rate of 25-30 frames per second.

The model achieves 20.1 tracking mAP on the test set on the demanding large-scale TAO dataset, greatly surpassing published work, which achieved 12.4 tracking mAP. The entry obtains a competitive 75.3 MOTA and 59.1 HOTA on the MOT17 benchmark, beating most competing transformer-based trackers and matching state-of-the-art association-based trackers.

The method beats both the official SORT baseline and the previous best result (QDTrack [32]), with a 62 percent improvement in mAP on the test set. While the proposed stronger detector accounts for some of the increase, it also illustrates one of the model’s advantages: it can be jointly trained end-to-end with state-of-the-art detection systems.


Apple researchers demonstrated a system for detecting and tracking many objects at the same time. The key component is a global tracking transformer, which combines objects into trajectories using object features from all frames within a temporal window. On the MOT17 and TAO benchmarks, the model is competitive. The researchers expect that their study will aid in the development of robust and general object tracking in the real world.