Computer vision research encompasses a wide range of visual modalities, including images, videos, and depth perception. In general, the computer vision model treats each of these modalities separately to extract the most useful characteristics from their differences. While these modality-specific models outperform people on their specialized tasks, they lack the flexibility of a human-like visual system—the capacity to work across modalities.
Instead of having an over-optimized model for each modality, the first step toward a true all-purpose vision system is to design models that work fluidly across modalities. In addition to their flexibility, such modality-agnostic models have various advantages over their traditional, modality-specific equivalents, such as:
- It can perform cross-modal generalization, which means applying what it has learned from one modality to other modalities.
- It saves time and money spent on researching and developing models for a specific modality.
Despite their evident benefits, modality-agnostic models have received little research, and their performance has been poor compared to modality-specific models. There are several causes for this circumstance, including the necessity for a flexible architecture with the capacity to learn modality-specific signals from several modalities and enough computing to train it simultaneously on video, pictures, and single-view 3D.
Meta AI team conducted a study to create a modality-agnostic vision model that uses current improvements in vision architectures. The model works on three different visual modalities: images, videos, and single-view 3D and is thus named “omnivorous.”
Here, each visual modality does not have its own architecture. Instead, the model uses the same shared model parameters to recognize all three modalities. It operates by turning each input modality into embeddings of Spatio-temporal patches, which are then processed by the same transformer to provide an output representation.
The researchers used a set of conventional, off-the-shelf classification datasets with various input modalities to train OMNIVORE. Unlike previous research, they do not use explicit correspondences across distinct input modalities in our training.
The team evaluated the OMNIVORE models and found that, surprisingly, even though OMNIVORE was not explicitly taught to model cross-modal correspondences, OMNIVORE representations generalize well across visual modalities. Due to parameter sharing between models for multiple modalities, these capabilities evolve without explicit cross-modal supervision.
OMNIVORE outperforms modality-specific vision models with the same amount of parameters on the conventional image, video, and single view 3D benchmarks. The results show that it achieves an accuracy of 85.6 percent On ImageNet-1K, 83.4 percent on Kinetics-400, and 67.4 percent on SUN RGBD datasets.
Further, its powerful generalization capabilities benefit transfer learning experiments. It attains a new SOTA on action recognition benchmarks such as EPIC-Kitchens-100 and Something Something-v2, as well as on single-view 3D classification and segmentation benchmarks.
The team believes that their findings make a compelling possibility for developing vision models that can function on any visual modality.
The current research focuses on technological advancements in visual recognition training models. From an ethical standpoint, these developments appear to be ethically neutral. On the other hand, OMNIVORE is subject to the same ethical considerations as other visual-recognition models. The team states that any real-world implementation of a model like OMNIVORE should be preceded by a thorough examination of the model for ethical issues.