EPFL and Apple Researchers Open-Source 4M: An Artificial Intelligence Framework for Training Multimodal Foundation Models Across Tens of Modalities and Tasks

Training large language models (LLMs) that can handle a wide range of tasks without extensive task-specific adjustments has become common practice in natural language processing (NLP). Although these models have shown outstanding success in NLP, equally flexible and scalable models for vision are still needed. For vision, scalability and versatility depend on the capacity to handle many input modalities and output tasks.

Vision models must handle various sensory inputs, including images, 3D, and text, and perform a variety of tasks. Training on RGB images with a single objective has not yielded the multitask capabilities that language modeling on raw text has produced in NLP. As a result, training for vision should draw on a variety of modalities and tasks.

Data, architecture, and training objective are three critical scalability factors to consider when building a model with the attributes desired of a vision foundation model. Data scalability refers to the capacity to leverage more training samples to improve performance. Architectural scalability means that performance improves with increasing model size and that training remains stable at large sizes. Finally, a scalable training objective should handle a growing number of modalities efficiently, without computational costs rising sharply.

New research by the Swiss Federal Institute of Technology Lausanne (EPFL) and Apple aims for scalability in all three areas while being compatible with different input types. 

To meet these requirements, the team presents an approach that trains a single unified Transformer encoder-decoder with a multimodal masked modeling objective. The approach is called 4M, which stands for “Massively Multimodal Masked Modeling,” highlighting its capacity to scale to many varied modalities. It combines the best features of masked modeling and multimodal learning:

  1. Strong cross-modal predictive coding abilities and shared scene representations.
  2. Iterative sampling, which allows the models to be used for generative tasks.
  3. A pre-training objective that effectively learns rich representations.

Importantly, 4M integrates these advantages while remaining efficient, thanks to several mechanisms. Modality-specific tokenizers convert modalities with diverse formats into sets or sequences of discrete tokens, which unifies their representational domains and allows a single Transformer to be trained on text, bounding boxes, images, or neural network features, among others. Because this tokenization removes the need for task-specific encoders and heads, the Transformer can be used with any modality while retaining full parameter sharing, which improves compatibility and scalability.
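
To make the tokenization idea concrete, here is a minimal sketch of how differently formatted modalities could all be reduced to discrete tokens. Everything in it is an illustrative assumption rather than 4M's actual code: the function names, the bin and codebook sizes, and the toy quantization rules are hypothetical, and the real framework relies on learned, modality-specific tokenizers.

```python
def tokenize_caption(text, vocab):
    # Toy word-level tokenizer (hypothetical): assign an id the first time a word is seen.
    return [vocab.setdefault(word, len(vocab)) for word in text.lower().split()]


def tokenize_box(box, bins=1000):
    # Quantize normalized (x0, y0, x1, y1) coordinates in [0, 1] into integer bins.
    return [min(int(coord * bins), bins - 1) for coord in box]


def tokenize_patches(patch_means, codebook_size=16):
    # Stand-in for a learned image tokenizer: bucket each patch's
    # mean intensity in [0, 1] into one of `codebook_size` codes.
    return [min(int(p * codebook_size), codebook_size - 1) for p in patch_means]


def to_token_sequence(sample):
    # Tag every token with its modality so one model can consume them all.
    return [(modality, tok) for modality, toks in sample.items() for tok in toks]


vocab = {}
sample = {
    "caption": tokenize_caption("a dog on a beach", vocab),
    "bbox": tokenize_box((0.125, 0.25, 0.5, 1.0)),
    "rgb": tokenize_patches([0.0, 0.5, 0.75]),
}
sequence = to_token_sequence(sample)
print(sample["caption"])  # [0, 1, 2, 0, 3]
print(sample["bbox"])     # [125, 250, 500, 999]
print(sample["rgb"])      # [0, 8, 12]
```

Once every modality is a list of integers tagged with its source, the same model can read and predict any of them, which is why no per-task encoder or head is needed in this scheme.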

Additionally, 4M trains efficiently by using input and target masking, even though it operates on a large collection of modalities. This means randomly picking a small subset of tokens from all modalities to use as model inputs and another small subset as targets. Decoupling the number of input and target tokens from the number of modalities is what makes the training objective scalable: it prevents the computational cost from growing quickly as modalities are added. Starting from CC12M and other available single-modal or text-image pair datasets, the researchers create aligned multimodal training data using powerful pseudo-labeling networks.
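
The input and target masking described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the function name, the budget sizes, and the token counts are hypothetical, and tokens are drawn uniformly from a pooled set, whereas the actual sampling scheme may distribute the budget across modalities differently.

```python
import random


def sample_input_and_target(tokens_by_modality, num_inputs, num_targets, seed=None):
    """Pick disjoint input and target token positions across all modalities."""
    rng = random.Random(seed)
    # Pool every (modality, position) pair, regardless of how many modalities exist.
    pool = [
        (modality, pos)
        for modality, toks in tokens_by_modality.items()
        for pos in range(len(toks))
    ]
    if num_inputs + num_targets > len(pool):
        raise ValueError("token budgets exceed the number of available tokens")
    # One draw without replacement, then split, so inputs and targets never overlap.
    chosen = rng.sample(pool, num_inputs + num_targets)
    return chosen[:num_inputs], chosen[num_inputs:]


# Hypothetical token counts per modality for one training sample.
sample = {
    "rgb": list(range(196)),
    "depth": list(range(196)),
    "caption": list(range(32)),
}
inputs, targets = sample_input_and_target(sample, num_inputs=128, num_targets=128, seed=0)
print(len(inputs), len(targets))  # 128 128

# Adding a modality enlarges the pool but leaves the budgets unchanged.
sample["normals"] = list(range(196))
inputs4, targets4 = sample_input_and_target(sample, num_inputs=128, num_targets=128, seed=0)
print(len(inputs4), len(targets4))  # 128 128
```

Because the model only ever sees `num_inputs` tokens and predicts `num_targets` tokens, the per-step sequence lengths stay fixed no matter how many modalities contribute to the pool.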

Because it does not require datasets to include multimodal or multitask annotations, this pseudo-labeling method allows training on diverse, large-scale datasets. In addition to performing well on several important visual tasks out of the box, 4M models can be fine-tuned to achieve strong results on unseen downstream tasks and input modalities.

Furthermore, the multimodal masked modeling objective makes it possible to train steerable generative models that can be conditioned on any modality. This allows user intent to be expressed in diverse ways and supports various multimodal editing tasks. The factors affecting 4M’s performance are studied in a thorough ablation analysis. This analysis, together with the simplicity and generality of the method, suggests that 4M holds great promise for many vision tasks and future developments.


Check out the Paper and Project. All credit for this research goes to the researchers of this project.


Dhanshree Shenwai is a computer science engineer with experience at FinTech companies covering the financial, cards and payments, and banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements that make everyone's life easier.
