Meet TAP-Vid: A Dataset of Videos Along With Point Tracks, Either Manually Annotated or Obtained From A Simulator

Imagine if we could study the motion of objects in videos by tracking their position and orientation and how different points on the object move. This information will be useful in making inferences about the 3D properties, physical properties, and interactions of various objects.

So what is the most basic step in achieving the goal mentioned above?

Suppose we can determine the position of a few marked points in every video frame and observe them. In that case, we can use this methodology to study the motion of objects by sticking those points to a particular object and then studying their movement.

👉 Read our latest Newsletter: Microsoft’s FLAME for spreadsheets; Dreamix creates and edit video from image and text prompts......

That’s what the author of this paper is trying to do here. The problem statement goes something like this, “How can we track the motion of a point over a long video?”

Fig.1 The problem of tracking any point (TAP) in a video
Source: https://arxiv.org/pdf/2211.03726.pdf

Although some attention has been paid to this issue, there still needs to be a dataset or benchmark for evaluation. Some pre-existing methods partially address this problem, like the most popular bounding box, and segmentation tracking algorithms provide limited information about the deformation and rotation of objects. Optical flow models can track a point on any surface, but the duration for which they can track the movement is limited to a few frames. Moreover, optical flow models are also limited in their ability to estimate occlusion (see Fig. 2 for other methods). This paper can be seen as an extension of optical flow to longer videos with occlusion estimation. In this paper, researchers from Deepmind also introduced a companion benchmark, TAP( Tracking any point)-Vid, which is composed of real-world videos with accurate human annotations of point tracks, and synthetic videos with perfect ground-truth point tracks. The benchmark is created using a novel semi-automatic crowdsourced pipeline that uses optical flow to estimate the tracks over shorter timeframes and lets the annotator focus on harder sections of video, and the annotators also proofread the estimate of the optical flow. Tracking points in a synthetic environment is quite straightforward. Even though humans are pretty good at tracking points, obtaining ground-truth tracks for real-world videos is a tedious and time-consuming task as both the objects and cameras tend to move in a complex manner. It can be one of the reasons why this problem has received so little attention.

Fig. 2 Various methods for motion understanding in videos.
Source: https://arxiv.org/pdf/2211.03726.pdf

Let’s talk about how the dataset is created…

The benchmark contains real as well as synthetic videos. Around 1189 real YouTube videos from the Kinetics dataset and 30 videos from the DAVIS evaluation dataset with roughly 25 points are annotated using a small pool of annotators with multiple rounds of proofreading and corrections. A total of four datasets are created, namely, TAP-Vid-Kinetics, TAP-Vid-DAVIS, TAP-Vid-Kubric (synthetic), and TAP-Vid-RGB-Stacking (synthetic). 

Fig. 3 The TAP-Vid point tracking datasets.
Source: https://arxiv.org/pdf/2211.03726.pdf

The model is trained only on the synthetic Kubric dataset. The other three datasets are used for evaluation and testing because this transfer from one synthetic to a real dataset is more likely to represent the change from the seen to the unseen environment.

Again, coming back to baselines and the proposed algorithm for point tracking. Some COTR, VFS, and RAFT extensions are used as pre-existing baselines for comparison with the proposed TAP-Net. The above algorithm fails at handling occlusion, deformation of objects, and transfer from a synthetic to a real environment, respectively. 

The approach of TAP-Net is inspired by cost volumes. The given video is first divided into feature grids. Then the features for the query point are compared with the features everywhere else in the video, and a cost volume is calculated (see Fig. 4). After that, a simple neural network is applied, which predicts the occlusion logit and point location at query time (see Fig. 5). Because we are performing two different tasks with the same model, the cost function is simply a weighted combination of two losses: one is a Huber loss for position regression, and the other is a standard cross-entropy loss for occlusion classification.

Fig. 4 Cost Volume computation for a single query.
Source: https://arxiv.org/pdf/2211.03726.pdf
Fig. 5 Prediction of occlusion and position of the point from cost volume.
Source: https://arxiv.org/pdf/2211.03726.pdf

TAP-Net easily outperforms all other baseline methods on all four datasets. You can see the results in Table 1 below. 

Table 1. Comparison of TAP-Net versus several established methods on the TAP-Vid dataset
Source: https://arxiv.org/pdf/2211.03726.pdf

Advancements in TAP will prove to be highly useful in robotic object manipulation and some applications of reinforcement learning. Even though TAP-Net outperforms all the prior baseline methods, it still has some limitations. Neither the benchmark nor the TAP-Net can handle liquids or objects that are transparent, and the annotations for real-world videos can’t be 100% correct.


Check out the Paper and Github. All Credit For This Research Goes To Researchers on This Project. Also, don’t forget to join our Reddit page and discord channel, where we share the latest AI research news, cool AI projects, and more.

Vineet Kumar is a consulting intern at MarktechPost. He is currently pursuing his BS from the Indian Institute of Technology(IIT), Kanpur. He is a Machine Learning enthusiast. He is passionate about research and the latest advancements in Deep Learning, Computer Vision, and related fields.