Microsoft Releases Florence-2: A Novel Vision Foundation Model with a Unified, Prompt-based Representation for a Variety of Computer Vision and Vision-Language Tasks

There has been a marked movement in the field of AI systems toward pretrained, adaptable representations, prized for their task-agnostic benefits across applications. Natural language processing (NLP) is a clear example of this tendency: its most capable models adapt to new tasks and domains with only simple instructions. That success inspires a similar strategy in computer vision.

One of the main obstacles to a universal representation for vision tasks is the breadth of perceptual ability required. Unlike NLP, computer vision must handle complex visual concepts such as object locations, masked contours, and attributes, and mastering this variety of challenging tasks is a prerequisite for universal representation. Two hurdles make the endeavor distinctive. First, comprehensive visual annotations are scarce, which hampers building a foundation model that captures the subtleties of spatial hierarchy and semantic granularity. Second, computer vision currently lacks a unified pretraining framework that integrates spatial hierarchy and semantic granularity within a single network architecture.

A team of Microsoft researchers introduces Florence-2, a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks. It addresses both the architectural and data problems above by adopting a single, prompt-based representation for all vision tasks. Multitask learning at this level demands annotated data of high quality and broad scale, so the team built a data engine that produces FLD-5B, a comprehensive visual dataset with 5.4B annotations across 126M images, a significant improvement over labor-intensive manual annotation. The engine consists of two efficient processing modules. Rather than having a single human annotate each image, as in traditional pipelines, the first module employs specialized models that annotate automatically and in collaboration. Having multiple models reach a consensus yields a more reliable and less biased image interpretation, reminiscent of the wisdom-of-crowds idea.

The Florence-2 model stands out for its unique design. It integrates an image encoder and a multi-modality encoder-decoder into a sequence-to-sequence (seq2seq) architecture, following the NLP community's goal of flexible models with a consistent framework. This architecture handles a variety of vision tasks without task-specific architectural alterations. Because all annotations in the FLD-5B dataset are uniformized into textual outputs, the model can be trained with a unified multitask learning procedure under consistent optimization, using the same loss function as the objective throughout. The result is a multi-purpose vision foundation model that can ground, caption, and detect objects with a single model and a single set of parameters, activated by textual prompts.
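The key to uniformizing region-level annotations into text is coordinate quantization: the paper represents box coordinates as discrete location tokens (quantized into 1,000 bins) that the decoder emits alongside ordinary words. A minimal Python sketch of that idea follows; the helper names and the `<loc_N>` token format are illustrative, not the model's exact vocabulary.

```python
def quantize_box(box, image_size, bins=1000):
    """Map pixel box coordinates (x1, y1, x2, y2) to discrete bins,
    so a region can be serialized as text tokens."""
    w, h = image_size
    x1, y1, x2, y2 = box
    # Scale each coordinate to [0, bins - 1], clamping the upper edge.
    q = lambda v, size: min(int(v / size * bins), bins - 1)
    return q(x1, w), q(y1, h), q(x2, w), q(y2, h)

def box_to_tokens(box, image_size, bins=1000):
    """Serialize a box as location tokens, e.g. '<loc_500><loc_500><loc_999><loc_999>'."""
    return "".join(f"<loc_{v}>" for v in quantize_box(box, image_size, bins))

# A detection output then becomes plain text the seq2seq decoder can produce:
# "car" + box_to_tokens((320, 240, 640, 480), (640, 480))
```

With outputs in this form, detection, grounding, and captioning all reduce to text generation, which is what lets one loss function and one decoder serve every task.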

Despite its compact size, Florence-2 competes with much larger specialized models. After fine-tuning on publicly available human-annotated data, it achieves new state-of-the-art results on the RefCOCO/+/g benchmarks. As a pretrained backbone, it outperforms supervised and self-supervised models on downstream tasks, including ADE20K semantic segmentation and COCO object detection and instance segmentation, with reported gains of 6.9, 5.5, and 5.9 points on the COCO and ADE20K benchmarks using Mask R-CNN and DINO, and training efficiency 4x better than ImageNet-pretrained models. This performance is a testament to the effectiveness and reliability of Florence-2.

Florence-2, with its pretrained universal representation, has proven highly effective: the experimental results demonstrate consistent improvements across a multitude of downstream tasks.

Check out the Paper and Model Card. All credit for this research goes to the researchers of this project.
