Divided We Fall, United We Stand: CoTracker is an AI Approach That Jointly Tracks Multiple Points in a Video

Recent years have been full of advancements in image generation and large language models in the AI domain. They have been under the spotlight for quite some time thanks to their revolutionary capabilities. Both image generation and language models have become so good that it is difficult to differentiate the generated outputs from real ones.

But they are not the only applications that have advanced rapidly in recent years. We have seen impressive advancements in computer vision applications as well. The Segment Anything Model (SAM), for example, has opened new possibilities in object segmentation. SAM can segment any object in an image or, more impressively, in a video without being limited to a fixed set of object categories seen during training.

The video part is especially exciting because video has always been considered challenging data to work with. When working with videos, motion tracking plays a crucial role in whatever task you are trying to achieve. That is the foundation of the problem.

One crucial aspect of motion tracking is establishing point correspondences. Recently, there have been multiple attempts to do motion estimation in videos with dynamic objects and moving cameras. This challenging task involves estimating the location of 2D points across video frames, representing the projection of underlying 3D scene points. 

Two main approaches to motion estimation are optical flow and tracking. Optical flow estimates the instantaneous velocity of every point within a video frame, while tracking estimates the motion of selected points over an extended period, typically treating those points as statistically independent.
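
To make the contrast concrete, here is a minimal Python sketch of one common way to get a long-term track out of optical flow: chaining the per-frame displacements of a single point. This is an illustration only, not how CoTracker works. The names `chain_flow` and `constant_flow` are hypothetical, and each flow field is assumed to be a simple function that returns a displacement for a given position.

```python
def chain_flow(start_xy, flows):
    """Follow one point through a sequence of per-frame flow fields.

    `flows` is a list of callables (x, y) -> (dx, dy), one per frame
    transition. This representation is an assumption for illustration.
    """
    x, y = start_xy
    trajectory = [(x, y)]
    for flow in flows:
        dx, dy = flow(x, y)        # displacement between two consecutive frames
        x, y = x + dx, y + dy      # move the point to its position in the next frame
        trajectory.append((x, y))
    return trajectory


def constant_flow(x, y):
    """Toy flow field: every point moves by the same amount each frame."""
    return (1.0, 0.5)


trajectory = chain_flow((0.0, 0.0), [constant_flow] * 3)
print(trajectory)
```

Because each step builds on the previous position, any error in one flow estimate carries over to every later frame. That accumulation is one reason long-term point tracking is usually treated as a separate problem from optical flow.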

Although modern deep learning techniques have made strides in point tracking, one essential aspect remains overlooked: the correlation between tracked points. Intuitively, points belonging to the same physical object should move in related ways, yet conventional methods treat them independently, which leads to inaccurate estimates. Time to meet CoTracker, which tackles this issue.

CoTracker is a neural network-based tracker that aims to revolutionize point tracking in long video sequences by accounting for the correlation between tracked points. The network takes both the video and a variable number of starting track locations as input and outputs the full tracks for the specified points.

CoTracker supports joint tracking of multiple points and processes longer videos in a windowed fashion. It operates on a 2D grid of tokens, with one dimension representing time and the other representing the tracked points. By employing suitable self-attention operators, the transformer-based network can consider each track as a whole within a window and exchange information between tracks, leveraging their inherent correlations.
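
The token grid can be pictured with a small NumPy sketch. It assumes the attention is applied separately along each axis of the grid, first over time within each track and then across tracks at each time step. It also uses identity query/key/value projections instead of learned weights, so it is a simplified illustration of the idea rather than the actual CoTracker network. The function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens):
    """Plain dot-product self-attention over the second-to-last axis.

    tokens: array of shape (..., L, D). No learned projections are used
    (an assumption that keeps the sketch short).
    """
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)   # (..., L, L)
    return softmax(scores, axis=-1) @ tokens                     # (..., L, D)


def time_then_track_attention(grid):
    """grid: (T, N, D) tokens, T time steps by N tracked points."""
    # Time attention: each track attends over its own T time steps.
    per_track = np.swapaxes(grid, 0, 1)                  # (N, T, D)
    per_track = per_track + self_attention(per_track)    # residual connection
    grid = np.swapaxes(per_track, 0, 1)                  # back to (T, N, D)
    # Track attention: at each time step, the N tracks attend to each other.
    return grid + self_attention(grid)


rng = np.random.default_rng(0)
grid = rng.normal(size=(8, 5, 16))     # 8 frames, 5 points, 16-dim tokens
out = time_then_track_attention(grid)
print(out.shape)
```

The second attention step is what lets one track influence another: changing the tokens of one point changes the output for the other points, which would not happen if each point were processed independently.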

Overview of CoTracker. Source: https://arxiv.org/pdf/2307.07635.pdf

The flexibility of CoTracker allows for tracking arbitrary points at any spatial location and time in the video. It takes an initial, approximate version of the tracks and refines them incrementally to match the video content better. Tracks can be initialized from any point, even in the middle of a video or from the output of the tracker itself, when operated in a sliding-window fashion.
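
The sliding-window operation and the incremental refinement can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the window size, the half-window stride, the number of refinement iterations, and the `refine` callable, which stands in for the network. For simplicity the sketch tracks a single point that starts at frame 0.

```python
def sliding_windows(num_frames, window, stride):
    """Yield overlapping (start, end) frame ranges that cover the video."""
    start = 0
    while True:
        end = min(start + window, num_frames)
        yield start, end
        if end == num_frames:
            break
        start += stride


def track_windowed(num_frames, query_xy, refine, window=8, iterations=4):
    """Track one point with overlapping windows and iterative refinement.

    `refine(start, estimate)` is a hypothetical stand-in for the network:
    it takes the current estimate for a window and returns a better one.
    """
    stride = window // 2
    track = [None] * num_frames
    track[0] = query_xy
    for start, end in sliding_windows(num_frames, window, stride):
        # Initial estimate: reuse positions from the previous window where
        # available, and copy the last known position into new frames.
        estimate, last = [], None
        for t in range(start, end):
            if track[t] is not None:
                last = track[t]
            estimate.append(last)
        for _ in range(iterations):
            estimate = refine(start, estimate)
        track[start:end] = estimate
    return track


def toy_refine(start, estimate):
    """Toy refiner: move each estimate halfway toward a known true path (t, 2t)."""
    refined = []
    for offset, (x, y) in enumerate(estimate):
        t = start + offset
        refined.append(((x + t) / 2, (y + 2 * t) / 2))
    return refined


track = track_windowed(20, (0.0, 0.0), toy_refine)
print(track[-1])
```

The overlap between windows is what carries information forward: the output of one window seeds the initial estimate of the next, so the tracker never has to start from scratch in the middle of a long video.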

Qualitative results of CoTracker. Source: https://arxiv.org/pdf/2307.07635.pdf

CoTracker represents a promising advancement in motion estimation, emphasizing the importance of considering point correlations. It paves the way for enhanced video analysis and opens new possibilities for downstream tasks in computer vision.

Ekrem Çetinkaya received his B.Sc. in 2018, and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis about image denoising using deep convolutional networks. He received his Ph.D. degree in 2023 from the University of Klagenfurt, Austria, with his dissertation titled "Video Coding Enhancements for HTTP Adaptive Streaming Using Machine Learning." His research interests include deep learning, computer vision, video encoding, and multimedia networking.