New AI Research Introduces the Recognize Anything Model (RAM): A Robust Base Model for Image Tagging

When it comes to natural language processing (NLP) tasks, large language models (LLMs) trained on massive web-scale datasets perform exceptionally well. In computer vision (CV), the Segment Anything Model (SAM) has shown similarly outstanding zero-shot localization abilities by scaling up data.

Unfortunately, SAM cannot produce semantic labels, a task as fundamental as localization. Recognizing multiple labels for a single image is the goal of multi-label image recognition, also known as image tagging. Since images contain many kinds of labels, including objects, scenes, attributes, and actions, image tagging is an important and practical computer vision problem.


Two main factors hinder progress in image tagging:

  1. The difficulty of collecting high-quality data at scale. The field still lacks a standardized, comprehensive labeling system and an efficient data annotation engine that can semi-automatically or automatically annotate massive numbers of images across diverse categories.
  2. The lack of powerful open-vocabulary models built on an efficient, flexible architecture that exploits large-scale weakly supervised data.

The Recognize Anything Model (RAM), a robust base model for image tagging, has just been introduced by researchers at the OPPO Research Institute, the International Digital Economy Academy (IDEA), and AI2 Robotics. RAM addresses these obstacles: inadequate labeling systems, insufficient datasets, inefficient data engines, and architectural constraints.

The researchers start by creating a standardized, universal labeling system. They enrich it with tags from academic datasets for classification, detection, and segmentation, as well as from commercial taggers (Google, Microsoft, and Apple). Merging all available public tags with common text-derived tags yields a system of 6,449 labels that collectively covers the vast majority of use cases. The researchers note that the remaining, rarer labels can still be recognized through open-set recognition.
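Building one label system out of several sources boils down to normalizing and deduplicating tags across vocabularies. The paper does not publish the exact merging procedure, so the following is a minimal sketch under assumed rules (lowercasing plus a hand-made synonym map); all tag lists and helper names here are illustrative.

```python
# Hypothetical sketch: merging tag vocabularies from several sources into one
# deduplicated label set, loosely mirroring how RAM's 6,449-label system is
# assembled from academic datasets and commercial taggers (details assumed).

def normalize(tag):
    """Canonicalize a tag: lowercase, strip, collapse internal whitespace."""
    return " ".join(tag.lower().strip().split())

def merge_label_systems(*sources, synonyms=None):
    """Union several tag lists, mapping known synonyms to one canonical form."""
    synonyms = synonyms or {}
    seen, merged = set(), []
    for source in sources:
        for tag in source:
            canonical = synonyms.get(normalize(tag), normalize(tag))
            if canonical not in seen:     # keep first occurrence only
                seen.add(canonical)
                merged.append(canonical)
    return merged

# Toy vocabularies standing in for a dataset's classes and a tagger's output.
coco_tags = ["Person", "dog", "traffic light"]
tagger_tags = ["person", "puppy", "street light"]
labels = merge_label_systems(coco_tags, tagger_tags,
                             synonyms={"puppy": "dog",
                                       "street light": "traffic light"})
print(labels)  # ['person', 'dog', 'traffic light']
```

A real pipeline would likely replace the synonym map with WordNet-style synsets or embedding similarity, but the dedup-then-union structure stays the same.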

Automatically annotating large-scale images with this label system is a challenging task. Inspired by prior work that trains robust visual models on large-scale public image-text pairs, the team uses automatic text semantic parsing to extract image tags from the paired captions. This yields a large set of image tags grounded in image-text pairs without relying on manual annotation.
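The parsing step above can be approximated very simply: scan each caption for phrases that appear in the label vocabulary. The paper uses a proper semantic parser, so this n-gram matcher is only a toy stand-in; the vocabulary and caption below are made up.

```python
# Hypothetical sketch of tag extraction from captions: match caption n-grams
# against the label vocabulary. RAM uses an automatic semantic parser; this
# simple matcher just illustrates the idea of mining tags from paired text.

def extract_tags(caption, vocabulary, max_ngram=3):
    """Return every vocabulary tag that occurs as an n-gram in the caption."""
    words = caption.lower().replace(",", " ").replace(".", " ").split()
    tags = set()
    for n in range(1, max_ngram + 1):
        for i in range(len(words) - n + 1):
            phrase = " ".join(words[i:i + n])
            if phrase in vocabulary:
                tags.add(phrase)
    return tags

vocab = {"dog", "beach", "frisbee", "traffic light"}
print(extract_tags("A dog catches a frisbee on the beach.", vocab))
```

Run over millions of image-text pairs, this kind of extraction produces tag annotations for free, at the cost of the noise the data engine must later clean up.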

Image-text pairs sourced from the internet tend to be noisy and imprecise. To improve annotation accuracy, the team builds a data tagging engine. To address missing labels, they use existing models to generate supplementary tags. To handle mislabeled regions, they first localize the image regions corresponding to each label, then apply region clustering to detect and discard outliers within the same category. Labels with inconsistent predictions are likewise removed, yielding more precise annotations.

RAM generalizes to novel classes by incorporating semantic context into its label queries, and this architecture can boost recognition on any visual dataset, demonstrating its versatility. By showing that a general model trained on noisy, annotation-free data can beat heavily supervised models, RAM introduces a new paradigm for image tagging. RAM requires only free, publicly available data with no manual annotations, and its most powerful version trains in just three days on eight A100 GPUs.
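The open-vocabulary mechanism can be sketched in the style of CLIP-like models: labels are scored by comparing an image embedding against text embeddings of the label queries, so a label never seen in training can still fire if its text embedding lands near the image's. The embeddings and threshold below are invented for illustration and stand in for real text/image encoders.

```python
# Hypothetical sketch of open-vocabulary tagging via semantic label queries:
# score each label by cosine similarity between the image embedding and the
# label's text embedding. Toy 2-D vectors replace real encoder outputs.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def tag_image(image_emb, label_embs, threshold=0.5):
    """Return every label whose text embedding is close to the image embedding.
    Because scoring uses embedding similarity rather than a fixed classifier
    head, novel labels can be queried without retraining."""
    return [label for label, emb in label_embs.items()
            if cosine(image_emb, emb) >= threshold]

# Toy embeddings; "wolf" could be a label absent from the training tags.
label_embs = {"dog": [0.9, 0.1], "cat": [0.1, 0.9], "wolf": [0.8, 0.3]}
image_emb = [0.85, 0.15]
print(tag_image(image_emb, label_embs))  # ['dog', 'wolf']
```

This is why adding semantic context to label queries buys generalization: the label set becomes an open query space instead of a closed output layer.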

According to the team, RAM still has room for improvement: running more iterations of the data engine, increasing the backbone parameters to boost model capacity, and expanding the training dataset beyond 14 million images to cover more diverse domains.


Check out the Paper, Project, and GitHub.
