Meta AI Introduces CommerceMM: A New Approach To MultiModal Representation Learning With Omni Retrieval For Online Shopping

When it comes to online shopping, a thorough understanding of the category to which a product belongs is critical to providing the best possible user experience to customers. Do blenders belong in the same category as pots and pans in kitchen supplies or the same category as portable dishwashers in appliances? These grey regions necessitate sophisticated interpretation and a thorough understanding of how customers think. The challenge becomes substantially more difficult in an online marketplace with many independent suppliers and a broader range of items. To meet this demand, a group of Meta researchers created a powerful new technique for pretraining and a varied new model called CommerceMM, which can provide a diverse and granular understanding of commerce subjects linked with a specific piece of material. The researchers felt the need to develop AI capabilities to help sort and label products due to Meta’s large marketplace on Facebook and Instagram.

CommerceMM can examine a post as a whole rather than just individual images and phrases. Because many commerce posts are multimodal, with photographs, captions, and other text working together to offer a plethora of information, this acts as a differentiating element. The traits relevant to a particular customer are often identified in a product’s description, but those attributes are instead bundled into a sequence of hashtags in certain circumstances. The photo rather than the text sometimes reveals the more critical elements. To fully comprehend a product post, one must first understand the nuances of multimodal content. By combining its characterizations of a post’s text and image, CommerceMM better comprehends multimodal data. Previous research relied on transformer-based models to link an image to its associated text description, with medium-scale image-text pairings serving as the default training data. Researchers can utilize this to enable AI systems to identify new links between modalities because online purchasing allows for more diversified text and visual data. The team was able to design a generalized multimodal representation for numerous commerce-related applications thanks to a detailed examination of these linkages. The researchers hoped to employ Meta AI’s resources to further increase the model’s grasp of complex data due to Meta AI’s recent accomplishments in multimodal training. 

An image encoder, a text encoder, and a multimodal fusion encoder make up CommerceMM. The encoders translate data into embeddings, which are sets of mathematical vectors. The text encoder’s embeddings describe the different continuums along which a sentence can be related to or different from another. These embeddings condense a wealth of data, encapsulating the distinct characteristics that distinguish each section in a text or object in a photograph. In addition to discrete text and image representations, the system develops a specialized multimodal embedding for each photo-and-text input that represents the post as a whole. This is what distinguishes CommerceMM. The image encoder examines each image first, while a transformer-based text encoder handles the associated text. Both send the embeddings to a transformer-based multimodal fusion encoder, where the two modalities learn to work together to build a joint representation.


Contrastive learning methods train a model to group the representations of substantially identical inputs in embedding space while pushing them away from different examples. The training aims to teach the model to group similar inputs together and push unrelated data away in the multidimensional embedding space. The three encoders are trained on a set of tasks that combine masking and contrastive learning simultaneously. A portion of the image or text is blacked out in the masking tasks, and the model learns to rebuild the missing region based on its surroundings. Following the development of the three embeddings (image, text, and multimodal), they are calibrated using a new set of tasks known as omni retrieval. The relationships between all of the embedding modalities are fine-tuned in this step. Two text-image pairs are initially fed into two identical models, each of which outputs three embeddings. The objective is to educate the system on linking the two sets of three embeddings together. There are nine pairs of relationships in all, where each of the three embeddings from one model should be significantly associated with each of the replica’s embeddings. The model can learn all of these correlations thanks to contrast learning. It has been demonstrated that omni retrieval learns a more discriminative and broad representation.

CommerceMM has been used in research to achieve state-of-the-art performance on seven tasks, outperforming all other systems dedicated to these specific use cases. Researchers may readily fine-tune the model for various specialized tasks once it has been pretrained to learn these representations. Previously, Meta used an early version of CommerceMM to improve category filters on Instagram Shops, Facebook Shops, and Marketplace listings, resulting in more relevant results and recommendations. It has also been utilized to enhance attribute filters on Instagram and Facebook Shops. Meta hopes that CommerceMM will become a standard approach for informing ranking and product suggestions by assisting shoppers in finding precisely what they want. They plan to use CommerceMM to support additional of Meta’s products, such as Marketplace product search and Instagram visual search.

This Article is written as a summay by Marktechpost Staff based on the Research Paper 'CommerceMM: Large-Scale Commerce MultiModal
Representation Learning with Omni Retrieval'. All Credit For This Research Goes To The Researchers of This Project. Check out the paper, and blog.

Please Don't Forget To Join Our ML Subreddit