Facebook AI’s New Compositional Framework Can Generalize Attribute-Object Pairs To Unseen Combinations

One of the most important ways to make online shopping easier is to improve product recognition. If AI can predict and understand exactly what appears in any given image or video frame, people could one day choose to make any image or video shoppable. Customers would be able to easily find what they are looking for, and vendors would be able to make their products more visible.

Facebook AI is creating the world’s largest shoppable social media platform, allowing users to buy and sell billions of items in one place. As a crucial milestone toward this goal, Facebook AI is expanding GrokNet, its breakthrough product identification system, to new applications on Facebook and Instagram. GrokNet identifies products in pictures and predicts their categories, such as “sofa,” and their attributes, such as color and style. It is a first-of-its-kind, all-in-one model that scales to billions of photographs across widely varied verticals, such as fashion, auto, and home décor. GrokNet began as a fundamental AI research project, with its first few applications on Marketplace. It analyzes search queries and predicts matches against search indexes, providing the most relevant, up-to-date results to billions of people searching for products.

Identifying New Attributes and Objects

Product recognition systems need to be excellent at recognizing product characteristics, or attributes, to help shoppers find exactly what they need. There are many possible attributes, and each one can apply to a range of categories: blue shirts, blue bands, blue skirts, and many more. Supervised classification does not scale to such a near-infinite space of combinations. Even with only 1,000 objects and 1,000 attributes, there would be a million pairwise combinations (1,000 × 1,000) to label manually.
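
The labeling arithmetic can be made concrete with a small sketch. The function names are hypothetical, and the "factored" count (one classifier per attribute plus one per object) is an illustrative assumption about why composing classifiers needs far fewer labeled classes, not a figure from the source.

```python
def pairwise_label_space(num_objects: int, num_attributes: int) -> int:
    """Classes a flat classifier needs: one per (attribute, object) pair."""
    return num_objects * num_attributes


def factored_label_space(num_objects: int, num_attributes: int) -> int:
    """Classes needed if attributes and objects are learned separately
    and only combined afterwards (illustrative assumption)."""
    return num_objects + num_attributes


print(pairwise_label_space(1000, 1000))
print(factored_label_space(1000, 1000))
```

The gap between the two counts widens quickly as either vocabulary grows, which is the motivation for learning to compose classifiers instead of labeling every pair.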

Facebook’s new model learns from a small number of attribute-object pairs and generalizes to new, unseen combinations. It is part of a new compositional framework built on top of the team’s earlier foundational research, which used hashtags as weak supervision to achieve state-of-the-art (SOTA) image recognition. The researchers trained the model on 78M public Instagram images.

The new compositional module in the framework takes attribute classifier weights and object classifier weights and composes them into attribute-object classifiers. This allows the model to predict attribute-object combinations that it never saw during training. The model performs much better than the standard approach of predicting attributes and objects individually. Moreover, each object can be modified by multiple attributes, which increases the fine-grained space of classes by orders of magnitude.
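
A minimal sketch of the idea of composing classifier weights follows. The source does not describe the actual composition function, so everything here is an assumption made for illustration: the names `make_composer`, `score`, and `predict_pair` are hypothetical, the toy weight vectors are invented, and the random projection with `tanh` merely stands in for parameters that would be learned during training.

```python
import math
import random

DIM = 4  # toy size of the image-feature and classifier-weight vectors


def make_composer(dim, seed=0):
    """Build a tiny composition function g(attr_w, obj_w) -> pair_w.
    The random projection is a stand-in for learned parameters."""
    rng = random.Random(seed)
    proj = [[rng.uniform(-1, 1) for _ in range(2 * dim)] for _ in range(dim)]

    def compose(attr_w, obj_w):
        joint = attr_w + obj_w  # concatenate the two weight vectors
        return [math.tanh(sum(p * x for p, x in zip(row, joint))) for row in proj]

    return compose


def score(image_feat, classifier_w):
    """Linear classifier score: dot product of features and weights."""
    return sum(f * w for f, w in zip(image_feat, classifier_w))


def predict_pair(image_feat, attr_weights, obj_weights, compose):
    """Score every attribute-object pair, including pairs with no training data."""
    candidates = [
        (score(image_feat, compose(a_w, o_w)), attr, obj)
        for attr, a_w in attr_weights.items()
        for obj, o_w in obj_weights.items()
    ]
    _, attr, obj = max(candidates)
    return attr, obj


# Invented toy weights for two attributes and two objects.
attr_weights = {"blue": [1.0, 0.0, 0.2, 0.1], "red": [0.0, 1.0, 0.1, 0.2]}
obj_weights = {"shirt": [0.3, 0.1, 1.0, 0.0], "skirt": [0.1, 0.3, 0.0, 1.0]}

compose = make_composer(DIM)
# A classifier for "blue skirt" exists even if that pair was never labeled.
blue_skirt_w = compose(attr_weights["blue"], obj_weights["skirt"])
print(len(blue_skirt_w))
print(predict_pair([0.9, 0.1, 0.1, 0.8], attr_weights, obj_weights, compose))
```

The point of the sketch is structural: a classifier for any pair is produced from its two parts, so the set of predictable pairs is not limited to the pairs that were labeled. Because the projection here is random rather than trained, the predicted pair itself is not meaningful.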

Source: https://ai.facebook.com/blog/advancing-ai-to-make-shopping-easier-for-everyone/

To prepare the training dataset for these models, the researchers sampled objects and attributes from geographies around the world, which helped reduce the potential for bias. They also trained and evaluated the model across subgroups, including 15 countries and four age groups, to improve the fairness of the framework.

Using Multimodal Signals to Enhance Product Platforms

In most Facebook apps, images are accompanied by text, such as metadata or product descriptions. It was therefore vital to apply SOTA multimodal advances to improve content understanding across the platform. Signals from the associated text significantly improve the accuracy of product categorization.

Transformer architectures are widely used in natural language processing (NLP) tasks, and in recent years they have been extended to multimodal frameworks. As a first step, the researchers used a clothing attributes dataset to evaluate a multimodal understanding framework. This dataset includes catalog data with text input, and that text can often be misleading.

The researchers addressed this issue by combining the visual signals from the image and related text descriptions to guide the final model prediction. 

Enhancing Product Matches

For products without any text description, they added a modality dropout trick during training: when both modalities are present, either the text or the image is randomly removed, which keeps performance robust when one modality is missing. Together, these improvements provide remarkable accuracy gains compared with vision-only models.
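
Modality dropout can be sketched in a few lines. This is an illustration under stated assumptions, not the team's implementation: the function name `modality_dropout`, the `drop_prob` parameter and its default value, and the use of `None` to mark a missing modality are all hypothetical choices.

```python
import random


def modality_dropout(image_feat, text_feat, drop_prob=0.3, rng=random):
    """Randomly hide one modality of a training example.

    Only applies when both modalities are present. Returns the pair
    (image_feat, text_feat) with at most one entry replaced by None.
    drop_prob is an assumed hyperparameter, not a value from the source.
    """
    if image_feat is None or text_feat is None:
        return image_feat, text_feat  # only one modality: nothing to drop
    if rng.random() < drop_prob:
        if rng.random() < 0.5:
            return None, text_feat  # hide the image
        return image_feat, None  # hide the text
    return image_feat, text_feat


rng = random.Random(0)
for _ in range(5):
    print(modality_dropout([0.1, 0.2], [0.7, 0.9], drop_prob=0.5, rng=rng))
```

Applying this only at training time exposes the model to image-only and text-only inputs, so a product listing with no description at inference time is not an unfamiliar case.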

In most product-matching applications, embedding distance captures only overall attributes, such as color, shape, and structure, and cannot distinguish products that differ in text-based details. To improve the accuracy of product matching, the team created a flexible framework with solid feature engineering that lets them add new features without changing the existing framework. They added a two-stage ranking component to product recognition that includes:

  • A multilayer perceptron model that takes GrokNet embeddings and outputs rematch scores.
  • A gradient boosting decision tree model that takes multiple features from multiple modalities and outputs rematch scores.

This component combines features from multiple modalities with appropriate ranking models and moves the best match to the top position for each query.
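
The general retrieve-then-rerank pattern behind such a component can be sketched as follows. The source does not say how the two models are wired together, so this is an assumption-laden illustration: `retrieve`, `rerank`, and `toy_rematch_score` are hypothetical names, the catalog is invented, and the hand-written scoring function merely stands in for a trained multilayer perceptron or gradient boosting model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_emb, catalog, k):
    """Stage 1: nearest neighbours by embedding similarity alone."""
    ranked = sorted(catalog, key=lambda item: cosine(query_emb, item["emb"]), reverse=True)
    return ranked[:k]


def rerank(query, candidates, rematch_score):
    """Stage 2: reorder the shortlist using a richer scoring model."""
    return sorted(candidates, key=lambda item: rematch_score(query, item), reverse=True)


def toy_rematch_score(query, item):
    """Stand-in for a learned model: visual similarity plus text-token overlap."""
    q_tokens = set(query["text"].lower().split())
    i_tokens = set(item["text"].lower().split())
    overlap = len(q_tokens & i_tokens) / max(len(q_tokens | i_tokens), 1)
    return 0.5 * cosine(query["emb"], item["emb"]) + 0.5 * overlap


# Invented example data: items "a" and "b" look alike but differ in text.
query = {"emb": [1.0, 0.0], "text": "acme blue sofa"}
catalog = [
    {"id": "a", "emb": [1.0, 0.0], "text": "other brand blue sofa"},
    {"id": "b", "emb": [0.98, 0.2], "text": "acme blue sofa"},
    {"id": "c", "emb": [0.0, 1.0], "text": "red lamp"},
]

candidates = retrieve(query["emb"], catalog, k=2)
final = rerank(query, candidates, toy_rematch_score)
print([c["id"] for c in candidates], [c["id"] for c in final])
```

In this toy run, embedding similarity alone ranks item "a" first, while the reranking step promotes item "b" because its text matches the query, which mirrors the text-based details that embedding distance misses.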

The new model is now live on Marketplace. In the future, the team plans to deploy these models to strengthen AI-assisted tagging and product matching across Facebook apps. They also want to explore other types of signals to improve product matches, including engagement signals, which could complement the current image-text models.

Source: https://ai.facebook.com/blog/advancing-ai-to-make-shopping-easier-for-everyone/