Google Research Open-Sources ‘SAVi’: An Object-Centric Architecture That Extends The Slot Attention Mechanism To Videos

Multiple distinct things act as compositional building blocks that can be processed independently and recombined in humans’ understanding of the world. The foundation for high-level cognitive abilities like language, causal reasoning, arithmetic, planning, and so on is a compositional model of the universe. Therefore, it’s essential for generalizing in predictable and systematic ways. Machine learning algorithms with object-centric representations have the potential to dramatically improve sampling efficiency, resilience, generalization to new problems, and interpretability.

Unsupervised multi-object representation learning is widely used in various applications. These algorithms learn to separate and represent objects from the statistical structure of the data alone, without the requirement for supervision, by using object-centric inductive biases. Despite their promising outcomes, these approaches are currently constrained by two major issues: 

  1. They are limited to toy data such as moving 2D sprites or extremely rudimentary 3D scenes, and they struggle with more realistic data with complex textures. 
  2. Both during training and inference, it is not clear how to interact with these models. The concept of an object is imprecise and task-dependent, and these models’ segmentation does not always correspond to the tasks of interest.

To overcome the problem of unsupervised / weakly-supervised multi-object segmentation and tracking in video data, a new Google research introduces a sequential extension of Slot Attention called Slot Attention for Video (SAVi). 

Source: https://arxiv.org/pdf/2111.12594.pdf

Inspired by predictor-corrector approaches for the integration of ordinary differential equations, SAVi performs a prediction and a corrective step for each seen video frame. In order to describe temporal dynamics and object interactions, the prediction step employs self-attention among the slots. The slot-normalized cross-attention with the inputs is used in the correction stage to update (or correct) the set of slot representations. The predictor’s output is then used to initialize the corrector at the following time step, allowing the model to track objects through time in a consistent manner. Both of these processes are permutation equivariant, preserving the slot symmetry.

Recent work in object-centric representation learning has examined the incorporation of inductive biases associated with 3D scene geometry, both for static scenes and for films. This is to bridge the gap to visually richer and more realistic surroundings but is in opposition to the use of conditioning and optical flow. FlowCaps technique proposes to leverage optical flow in a multi-object model similarly. It employs capsules instead of slots and expects that individual capsules are dedicated to objects or parts of things with a specific appearance, making it inappropriate for settings with a wide range of item types. Objects are represented using a slot-based, interchangeable representation.

The researchers study the conditional tasks based on semi-supervised video object segmentation (VOS) computer vision problems, in which segmentation masks are provided for the initial video frame during evaluation. They focus on the problem where models have no access to any supervised information beyond the conditioning information on the first frame, which is solved via supervised learning on fully annotated films or comparable datasets. Even when segmentation labels are missing training and test time, multi-object segmentation and tracking can arise.

Each video was divided into six 6-frame sub-sequences during training, with the first frame receiving the conditioning signal. The researchers train for 100k steps (200k for fully unsupervised video decomposition) with a batch size of 64. In SAVi, they use a total of 11 slots. Two rounds of Slot Attention per frame were used for fully unsupervised video decomposition experiments and a single iteration otherwise.

On the CATER dataset1, the researchers first test SAVi in an unconditional scenario and with a standard RGB reconstruction target. Because the two image-based approaches (Slot Attention and MONet) apply each frame independently, they lack a built-in sense of temporal consistency. The (unconditional) SAVi model outperforms these benchmarks, proving the suitability of our architecture for unsupervised object representation learning, albeit only on simple synthetic data.

The team switched the training goal from RGB image prediction to optical flow prediction to deal with these more realistic videos. In addition, they condition the SAVi model’s latent slots on cues about objects in the first frame of the video. SAVi was trained on six consecutive frames in every scenario but during test time. In video object segmentation, it is usual practice to use precise segmentation information for the first frame, with models like T-VOS or CRW propagating the initial masks throughout the video series.

T-VOS scores 50.4 percent and 46.4 percent mIoU on the MOVi and MOVi++ datasets, respectively, whereas CRW achieves 42.4 percent and 50.9 percent mIoU. When trained to forecast flow and with segmentation masks as the conditioning signal in the first frame, SAVi learns to produce temporally consistent masks that are much better on MOVi (72.0 percent mIoU) and marginally poorer than T-VOS and CRW on MOVi++ (43.0 percent mIoU).

There are still a few challenges to overcome before the system can be applied to the real world’s full visual and dynamic complexity. 

  • Firstly, the employed training method assumes that optical flow information is available at training time, which may not be the case in real-world videos.
  • Secondly, the settings considered in this study remain limited by holding only stiff objects with rudimentary physics; in the case of the MOVi datasets, only moving objects. Furthermore, training with the only optical flow is difficult for static objects.

Nonetheless, the research reveals that the suggested model performs excellently in terms of segmentation and tracking. This shows that model capacity isn’t the primary constraint for object-centric representation learning. This method of using location information to condition the initialization of slot representations could lead to a variety of semi-supervised techniques.

Paper: https://arxiv.org/pdf/2111.12594.pdf

GitHub: https://slot-attention-video.github.io/