Video object segmentation (VOS) is one of the most fundamental and challenging problems in computer vision, and it underpins many real-world applications, including video analysis and understanding, self-driving cars, augmented reality, and video editing. The goal of VOS is to separate a particular object, such as the dominant object or one that a user has pointed out, from the rest of a video sequence. VOS comes in several variants: semi-supervised VOS, which provides the target object's first-frame mask; unsupervised VOS, which detects the main objects automatically; and interactive VOS, which relies on user interaction to specify the target object. Video object segmentation has been studied in depth using both conventional and deep learning methodologies.
Deep-learning-based algorithms have significantly outperformed conventional methods in video object segmentation. On two of the most popular VOS datasets, DAVIS and YouTube-VOS, state-of-the-art techniques achieve extremely high scores. For instance, XMem scores 92.0% on DAVIS 2016, 87.7% on DAVIS 2017, and 86.1% on YouTube-VOS. Given such performance, video object segmentation might appear to be solved. But do these methods hold up in real-world scenes? To answer this question, the researchers revisit video object segmentation in more complex and realistic settings.
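The percentages quoted above are scores from the standard VOS benchmark metric, which averages region similarity (J, the Jaccard index between predicted and ground-truth masks) with a boundary accuracy term (F). As a minimal illustration of the J component, assuming binary NumPy masks (the function name and toy masks below are illustrative, not from the paper):

```python
import numpy as np

def jaccard(pred, gt):
    """Region similarity J: intersection-over-union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:            # both masks empty: define J as 1
        return 1.0
    return np.logical_and(pred, gt).sum() / union

# Toy example: a 2x2 prediction that covers the object plus one extra pixel.
gt = np.array([[1, 0], [0, 0]])
pred = np.array([[1, 1], [0, 0]])
print(jaccard(pred, gt))  # intersection 1, union 2 -> 0.5
```

A score of 92.0% thus means the predicted masks overlap the ground truth almost perfectly, averaged over all frames and objects.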
In current datasets, the target objects are usually dominant and conspicuous, whereas real-world scenes often contain complex, occluded scenery rather than isolated, prominent objects. To assess state-of-the-art VOS algorithms under such conditions, the researchers collect 2,149 videos of complex scenes to build a new, extremely challenging video object segmentation benchmark called coMplex video Object SEgmentation (MOSE). In total, MOSE contains 5,200 objects from 36 categories and 431,725 high-quality segmentation masks. As illustrated in Figure 1, the most distinctive aspect of MOSE is its complex scenes, with objects that disappear and reappear, are small or inconspicuous, are heavily occluded, appear in crowded settings, and so on.
For instance, in the first row of Figure 1, the bus occludes the white car, and in the third frame the occlusion becomes so severe that the sedan disappears entirely. The target player in the second row of Figure 1 is inconspicuous and is hidden by the crowd in the third frame. Tracking him is extremely challenging because each time he reappears, he has turned around and looks different from the preceding frames. The heavy occlusion and disappearance of objects in such complex sequences make video object segmentation considerably harder. The researchers hope to drive further study of video object segmentation in challenging settings and to make VOS practical.
To study the dataset, the researchers retrain and evaluate a number of existing VOS techniques on the proposed MOSE dataset. Specifically, they retrain six state-of-the-art methods in the semi-supervised setting with a mask as the first-frame reference, two in the semi-supervised setting with a bounding box as the first-frame reference, three in the multi-object unsupervised (zero-shot) setting, and seven in the interactive setting. The experimental results show that videos of complex scenes substantially degrade state-of-the-art VOS approaches, particularly when tracking objects that temporarily vanish due to occlusion.
For instance, XMem drops from 92.0% on DAVIS 2016 to 57.6% on MOSE, and DeAOT drops from 92.9% on DAVIS 2016 to 59.4% on MOSE, repeatedly demonstrating the challenges posed by complex scenarios. Occlusion, crowds, small objects, object disappearance and reappearance, and flickering across the temporal domain all contribute to the low scores on MOSE. While crowds, small objects, and heavy occlusion make objects hard to segment within individual frames, disappearance and reappearance make an occluded object even harder to track, adding to the difficulty of association.
Their primary contributions may be summarized as follows:
• They create MOSE (coMplex video Object SEgmentation), a benchmark dataset for video object segmentation. MOSE focuses on understanding video objects in challenging circumstances.
• Taking a close look at MOSE, they analyze the challenges and potential directions for future video understanding research in complex scenes.
• They conduct a thorough comparison and evaluation of state-of-the-art VOS methods on the MOSE dataset under four settings: mask-initialization semi-supervised, box-initialization semi-supervised, unsupervised, and interactive.
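To make the first of those settings concrete, here is a rough sketch of the mask-initialization semi-supervised protocol: the model receives the ground-truth mask for the first frame only, propagates it through the remaining frames, and is scored by the mean per-frame overlap with the ground truth. The predictor interface and helper below are hypothetical, not from the MOSE paper:

```python
import numpy as np

def eval_semisupervised(predictor, frames, gt_masks):
    """Mask-initialized semi-supervised VOS evaluation (illustrative sketch).

    predictor(frame, prev_mask) -> predicted binary mask for this frame.
    Only gt_masks[0] is revealed to the model; the rest are used for scoring.
    """
    pred = gt_masks[0]                      # first-frame reference mask
    per_frame_iou = []
    for frame, gt in zip(frames[1:], gt_masks[1:]):
        pred = predictor(frame, pred)       # propagate the mask forward
        union = np.logical_or(pred, gt).sum()
        inter = np.logical_and(pred, gt).sum()
        per_frame_iou.append(1.0 if union == 0 else inter / union)
    return float(np.mean(per_frame_iou))

# Sanity check with a trivial "copy the previous mask" predictor on a
# static object: every frame matches, so the mean IoU is 1.0.
masks = [np.array([[1, 0], [0, 0]])] * 4
frames = [np.zeros((2, 2))] * 4
print(eval_semisupervised(lambda f, m: m, frames, masks))  # 1.0
```

The box-initialization setting replaces `gt_masks[0]` with a bounding box, while the unsupervised and interactive settings remove or relax the first-frame reference entirely.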
The dataset is accessible on OneDrive, Google Drive, and Baidu Pan. Additional information for accessing it can be found on their project website.
Check out the Paper, Github, and Project. All Credit For This Research Goes To the Researchers on This Project.
Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence from the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.