CMU Researchers Propose Persistent Independent Particles (PIPs): A Computer Vision Method For Multi-frame Point Trajectory Estimation Through Occlusions

The challenge of motion estimation is important to computer vision and has far-reaching implications. Tracking makes it possible to create models of an object’s shape, texture, articulation, dynamics, affordances, and other characteristics. Fine-grained tracking gives robots the precision they need for delicate manipulation, and more generally, the finer the granularity of tracking, the more detailed the understanding of a scene it supports.

While there are numerous approaches for tracking certain objects (at the level of segmentation masks or bounding boxes) or specific types of points (e.g., the joints of a person), there are surprisingly few options for general-purpose fine-grained tracking. Feature matching and optical flow are the two most common methods used in this field.

Feature Matching: Feature matching computes a feature for the target in the first frame, computes features for the pixels in the remaining frames, and then finds “matches” by feature similarity (i.e., nearest neighbors). While effective, this approach treats each frame independently, so it ignores temporal context such as motion smoothness.
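The matching step described above can be sketched in a few lines of NumPy. This is only an illustration, not code from the paper: the function name `match_by_features` and the assumption that per-pixel features are already available as a `(T, H, W, C)` array are hypothetical, and cosine similarity is one common choice of similarity measure.

```python
import numpy as np

def match_by_features(feature_maps, target_xy):
    """Track one point by per-frame nearest-neighbor feature matching.

    feature_maps: (T, H, W, C) array of per-pixel features (assumed given).
    target_xy: integer (x, y) location of the target in frame 0.
    Returns a (T, 2) array of matched (x, y) locations, one per frame.
    """
    T, H, W, C = feature_maps.shape
    x0, y0 = target_xy
    target = feature_maps[0, y0, x0]
    target = target / (np.linalg.norm(target) + 1e-8)
    # Normalize every pixel feature so the dot product is a cosine similarity.
    norms = np.linalg.norm(feature_maps, axis=-1, keepdims=True) + 1e-8
    sims = (feature_maps / norms) @ target        # (T, H, W)
    # Each frame is searched on its own: no motion prior links the frames.
    best = sims.reshape(T, -1).argmax(axis=1)
    ys, xs = np.unravel_index(best, (H, W))
    return np.stack([xs, ys], axis=1)
```

Because every frame is matched independently, nothing in this sketch prevents the match from jumping across the image between two frames, which is the weakness noted above.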

Optical Flow: Optical flow first computes a dense “motion field” that relates each pair of consecutive frames and then uses post-processing to link the fields together. Because each field spans only two frames, however, this approach cannot follow a target that stays hidden for several consecutive frames. When the line of sight to a target is blocked (an “occlusion”), its location has to be inferred from other available information.
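As a rough illustration of the linking step, the sketch below chains a sequence of flow fields into a single trajectory. It assumes that each field maps one frame to the next and that `flows[t, y, x]` stores a `(dx, dy)` displacement; the function name `chain_flows` and the nearest-pixel lookup are simplifications chosen for this example, not details taken from the paper.

```python
import numpy as np

def chain_flows(flows, start_xy):
    """Link per-frame flow fields into one point trajectory.

    flows: (T-1, H, W, 2) array; flows[t, y, x] is the (dx, dy) displacement
           of pixel (x, y) from frame t to frame t+1 (assumed layout).
    start_xy: (x, y) location of the target in frame 0.
    Returns a (T, 2) array of (x, y) positions.
    """
    num_flows, H, W, _ = flows.shape
    traj = [np.asarray(start_xy, dtype=float)]
    for t in range(num_flows):
        x, y = traj[-1]
        # Nearest-pixel lookup for brevity; bilinear sampling is more usual.
        xi = int(np.clip(np.rint(x), 0, W - 1))
        yi = int(np.clip(np.rint(y), 0, H - 1))
        traj.append(traj[-1] + flows[t, yi, xi])
    return np.stack(traj)
```

Note that the lookup always reads the flow of whatever surface is visible at the current position. If the target is covered at some frame, the chain picks up the occluder's motion from then on, and nothing in the procedure can recover the original target.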

A “particle video” is an alternative to traditional flow-based and feature-based approaches: it represents a video with a collection of particles that move across multiple frames. According to the researchers, particle videos laid the groundwork for treating pixels as persistent entities with multi-frame trajectories and long-range temporal priors, although the original particle video method did not address occlusions.

Inspired by this work, researchers from Carnegie Mellon University introduced Persistent Independent Particles (PIPs), a novel approach to creating particle videos. The proposed approach takes a video and the coordinates of a target to follow as input and outputs the trajectory of that target. The model can be queried for any number of particles at any positions.

The approach estimates the trajectory of each target independently, which is a radical design choice. It frees up most of the model’s parameters for a module that simultaneously learns temporal priors and an iterative inference mechanism that searches for the target pixel’s location across all input frames. Most related optical flow work takes the opposite tack, estimating motion jointly for all pixels with a wide spatial context but only a short temporal context (typically two frames).

To ensure that the trajectory follows the target in every frame, the model simultaneously generates updates to the locations and features for several timesteps. This lets the model “catch” a target as it emerges from behind an occluder and “fill in” the previously unknown portion of its path.
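The overall shape of such an iterative, all-frames-at-once update can be sketched as follows. This is not the authors' implementation: `update_fn` is a hypothetical stand-in for the learned model, the number of iterations is arbitrary, and initializing the trajectory by copying the query location to every frame is an assumption of this example. Feature updates are omitted for brevity.

```python
import numpy as np

def refine_trajectory(start_xy, num_frames, update_fn, num_iters=6):
    """Iteratively refine a multi-frame trajectory for one target.

    update_fn: hypothetical callable mapping the current (num_frames, 2)
               trajectory to a (num_frames, 2) array of position updates.
    """
    # Assumed initialization: the target is placed at its query location
    # in every frame, then moved by successive updates.
    traj = np.tile(np.asarray(start_xy, dtype=float), (num_frames, 1))
    for _ in range(num_iters):
        # One update per timestep, all produced in the same pass, so
        # evidence from later frames can correct earlier ones.
        traj = traj + update_fn(traj)
    return traj
```

With a toy `update_fn` that moves each position halfway toward a known path, for example `lambda traj: 0.5 * (true_path - traj)`, the error shrinks by half on every iteration.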

The model’s primary input is a set of measurements of local visual similarity, obtained through multi-scale dot-product (cross-correlation) computations. The second input is the estimated trajectory itself. This allows the model to apply a temporal prior and improve the trajectory in places where the local similarity data is ambiguous.
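One way to picture the multi-scale similarity measurements is the sketch below, which correlates a target feature with a small window of the feature map around the current position estimate at several resolutions. The pyramid construction (2x2 average pooling), the window radius, and the function names are assumptions made for this example rather than details confirmed by the article.

```python
import numpy as np

def avg_pool2(fmap):
    """Halve the resolution of an (H, W, C) map with 2x2 average pooling."""
    H, W, C = fmap.shape
    H2, W2 = H // 2, W // 2
    return fmap[:H2 * 2, :W2 * 2].reshape(H2, 2, W2, 2, C).mean(axis=(1, 3))

def local_correlation(fmap, target_feat, xy, radius=3, num_scales=3):
    """Dot-product similarity in a window around xy at each pyramid level.

    fmap: (H, W, C) feature map of one frame; target_feat: (C,) feature.
    Returns an array of shape (num_scales, 2*radius+1, 2*radius+1).
    """
    x, y = xy
    size = 2 * radius + 1
    windows = []
    for s in range(num_scales):
        corr = fmap @ target_feat                        # (H_s, W_s)
        padded = np.pad(corr, radius, mode="constant")   # zeros off-image
        # Position of the estimate at this pyramid level, kept inside the map.
        cx = min(max(int(round(x / 2 ** s)), 0), corr.shape[1] - 1)
        cy = min(max(int(round(y / 2 ** s)), 0), corr.shape[0] - 1)
        # After padding, corr[cy, cx] sits at padded[cy + radius, cx + radius].
        windows.append(padded[cy:cy + size, cx:cx + size])
        fmap = avg_pool2(fmap)
    return np.stack(windows)
```

The coarser levels cover a larger area of the image with the same window size, so a target that has moved far from the current estimate can still show up in the similarity data.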

Finally, the model is also given the target’s feature vector, in the hope that it can learn distinct strategies for different types of features. For instance, it might change how it uses the multi-scale similarity maps depending on the target’s scale or texture.

The researchers settled on an MLP-Mixer as the model architecture since it struck a reasonable balance between model capacity, training duration, and generalization. They also tested convolutional models and transformers, but the former failed to provide a satisfactory fit to the data, and the latter required too much time to train.
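For readers unfamiliar with the architecture, a single MLP-Mixer block can be sketched in NumPy as below. It alternates an MLP that mixes information across tokens with an MLP that mixes across channels, each wrapped in layer normalization and a residual connection. Treating the timesteps of a trajectory as the tokens is an assumption of this example, the layer norm omits the learned scale and shift, and the weights are random placeholders rather than trained parameters.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # Tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def mlp(x, w1, b1, w2, b2):
    return gelu(x @ w1 + b1) @ w2 + b2

def init_mlp(rng, dim, hidden):
    """Random placeholder weights for a dim -> hidden -> dim MLP."""
    return (rng.normal(0.0, 0.02, (dim, hidden)), np.zeros(hidden),
            rng.normal(0.0, 0.02, (hidden, dim)), np.zeros(dim))

def mixer_block(x, token_params, channel_params):
    """x: (T, C) array with one row per token (here, assumed: per timestep)."""
    # Token mixing: transpose so the MLP acts along the token (time) axis.
    x = x + mlp(layer_norm(x).T, *token_params).T
    # Channel mixing: the same MLP is applied to every token separately.
    x = x + mlp(layer_norm(x), *channel_params)
    return x

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))                 # 8 timesteps, 16 channels
out = mixer_block(x, init_mlp(rng, 8, 32), init_mlp(rng, 16, 64))
```

The token-mixing step is where information flows between timesteps, which is one plausible place for a model of this kind to encode temporal priors.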

To train the model, they built their own dataset, drawing on an existing optical flow dataset, that includes multi-frame ground truth for occluded targets.

Since they lack multi-frame temporal context, baseline approaches frequently get stuck on occluders.

Their experiments on synthetic and real-world video data show that the proposed particle trajectories are more robust to occlusions than flow trajectories. Further, the team uses a visibility estimate, computed alongside each trajectory, to link the model’s moderate-length trajectories into arbitrary-length trajectories.
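A simple version of this linking step might look like the sketch below. It is an illustration under stated assumptions, not the authors' procedure: `track_window` is a hypothetical callable standing in for the model, the window length and visibility threshold are arbitrary, and falling back to the end of the window when no frame is confidently visible is a choice made for this example.

```python
import numpy as np

def chain_windows(track_window, start_xy, num_frames, window=8, vis_thresh=0.5):
    """Link fixed-length trajectories into one long trajectory.

    track_window(t, xy, n): hypothetical tracker that starts at frame t and
        location xy and returns (xys, vis), with xys of shape (n, 2) and
        vis of shape (n,) holding visibility scores in [0, 1].
    """
    traj = np.zeros((num_frames, 2))
    traj[0] = start_xy
    t = 0
    while t < num_frames - 1:
        n = min(window, num_frames - t)
        xys, vis = track_window(t, traj[t], n)
        traj[t:t + n] = xys
        # Restart from the last frame in which the target looked visible,
        # so the next window begins from a trustworthy location.
        visible = np.flatnonzero(vis[1:] > vis_thresh) + 1
        step = int(visible[-1]) if visible.size else n - 1  # assumed fallback
        t += step
    return traj
```

Restarting from a confidently visible frame avoids seeding the next window with a position that was only guessed during an occlusion.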

The independence assumption is a limitation, however: in most cases, particles do not actually move independently of one another. The team is currently working to incorporate cross-particle context so that more confident particles can help less confident ones and finer-grained tracking can be carried out for many points at once.

All of this work, including the model weights, is now available on GitHub. The team believes their work will pave the way for precise long-range tracking of “anything.”

This article is a research summary written by Marktechpost Staff based on the research paper 'Particle Video Revisited: Tracking Through Occlusions Using Point Trajectories'. All credit for this research goes to the researchers on this project.

Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a keen interest in the applications of artificial intelligence in various fields, and she is passionate about exploring new advancements in technology and their real-life applications.