Researchers at Meta and the University of Texas at Austin Propose ‘Detic’: A Method to Detect Twenty-Thousand Classes using Image-Level Supervision

Object detection consists of two sub-problems: finding the object (localization) and naming it (classification). Traditional methods couple these two sub-problems tightly and therefore rely on box labels for all classes. Despite extensive data-collection efforts, detection datasets remain much smaller than image classification datasets, both in overall size and in the number of object classes (vocabulary).

For example, the recent LVIS detection dataset has 1000+ classes and 120K images, and OpenImages has 1.8M images for 500 classes. Moreover, not all classes have enough annotations to train a reliable detector. In classification, by contrast, even the ten-year-old ImageNet dataset has 21K classes and 14M images.

In a recent paper, researchers from Meta AI and the University of Texas at Austin propose Detector with image classes (Detic), which uses image-level supervision in addition to detection supervision. The researchers observe that the localization and classification sub-problems can be decoupled. Modern region proposal networks trained with existing detection supervision already localize many ‘new’ objects. The researchers therefore focus on the classification sub-problem and use image-level labels to train the classifier and expand the detector’s vocabulary.

They propose a simple classification loss that applies the image-level supervision to the proposal with the largest spatial size and leaves the other outputs for image-labeled data unsupervised. The loss is easy to implement and greatly expands the detector’s vocabulary.
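The idea of supervising only the largest proposal can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the function names, the (x1, y1, x2, y2) box format, and the use of a sigmoid binary cross-entropy averaged over classes are all assumptions made for the example.

```python
import math

def box_area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes count as zero."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def bce_with_logit(z, target):
    """Numerically stable binary cross-entropy for a single logit z."""
    return max(z, 0.0) - z * target + math.log1p(math.exp(-abs(z)))

def max_size_loss(proposals, logits, image_labels):
    """Classification loss on the largest proposal only (hypothetical helper).

    proposals:    list of (x1, y1, x2, y2) boxes from the proposal network
    logits:       one list of per-class scores for each proposal
    image_labels: set of class indices known to be present in the image
    Returns the index of the supervised proposal and its mean BCE loss.
    """
    largest = max(range(len(proposals)), key=lambda i: box_area(proposals[i]))
    per_class = [
        bce_with_logit(z, 1.0 if cls in image_labels else 0.0)
        for cls, z in enumerate(logits[largest])
    ]
    return largest, sum(per_class) / len(per_class)

# Three proposals, two classes; the image is labeled with class 1 only.
proposals = [(0, 0, 10, 10), (0, 0, 100, 80), (5, 5, 20, 20)]
logits = [[5.0, -5.0], [0.0, 0.0], [-5.0, 5.0]]
index, loss = max_size_loss(proposals, logits, {1})
print(index, loss)
```

Note that the scores of the other two proposals never enter the loss, so no step depends on the detector's current predictions when choosing which region to supervise.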

Most existing weakly-supervised detection methods use the weakly labeled data to supervise both the localization and the classification sub-problems of detection. Because image classification data has no box labels, these methods develop various label-to-box assignment techniques to obtain boxes. For example, YOLO9000 and DLWL assign the image label to proposals with high prediction scores.

Unfortunately, this prediction-based assignment requires good initial detections, which creates a chicken-and-egg problem: a good detector is needed for good label assignment, but many boxes are needed to train a good detector. By supervising only the classification sub-problem when using classification data, the proposed method bypasses prediction-based label assignment entirely. This also allows it to learn detectors for novel classes, for which boxes would otherwise be hard to predict and assign.
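To make the chicken-and-egg problem concrete, here is a deliberately simplified sketch of prediction-based assignment. It is not the actual YOLO9000 or DLWL algorithm; the function name and the "pick the top-scoring proposal per label" rule are assumptions chosen only to show that the assignment depends on the detector's own scores.

```python
def assign_by_prediction(scores, image_labels):
    """Map each image label to the proposal scored highest for that class.

    scores:       one list of per-class scores for each proposal
    image_labels: set of class indices known to be present in the image
    (Hypothetical helper for illustration only.)
    """
    assignment = {}
    for cls in sorted(image_labels):
        assignment[cls] = max(range(len(scores)), key=lambda i: scores[i][cls])
    return assignment

# The same image and labels, scored by two different detector states.
early_scores = [[0.4, 0.6], [0.5, 0.5], [0.6, 0.4]]
later_scores = [[0.9, 0.1], [0.2, 0.3], [0.1, 0.8]]
print(assign_by_prediction(early_scores, {0, 1}))
print(assign_by_prediction(later_scores, {0, 1}))
```

The two calls assign the labels to different proposals, so the training target changes with the quality of the detector. An assignment based on proposal size has no such dependency.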

Experiments on the open-vocabulary LVIS and open-vocabulary COCO benchmarks show that the method significantly improves on a strong box-supervised baseline for both novel and base classes. With image-level supervision from ImageNet-21K, the model trained without detection annotations for the novel classes improves on the baseline by 8.3 points and matches the performance of training with annotations for all classes.

Detic uses image-level labels from classification datasets and box labels from detection datasets. During training, the team builds each mini-batch from images of both types of dataset. Images with box labels go through standard two-stage detector training. For images with only image-level labels, only the features of a fixed region proposal are trained, and only for classification. The localization losses (RPN loss and bounding box regression loss) are computed only on images with ground-truth box labels.
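The routing of loss terms in such a mixed mini-batch can be sketched as follows. The `Sample` type, the function names, and the loss-term names are hypothetical and serve only to show which losses apply to which kind of image; a real training loop would compute the loss values themselves.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]

@dataclass
class Sample:
    image_id: str
    labels: List[int]
    boxes: Optional[List[Box]] = None  # None means image-level labels only

def make_mixed_batch(detection_samples, classification_samples, n_det, n_cls):
    """Take n_det box-labeled and n_cls image-labeled samples for one batch."""
    return detection_samples[:n_det] + classification_samples[:n_cls]

def active_losses(sample):
    """Names of the loss terms that apply to one training image."""
    if sample.boxes is not None:
        # Box-labeled image: standard two-stage detector losses.
        return ["rpn", "box_regression", "box_classification"]
    # Image-labeled image: classification on a fixed proposal only.
    return ["image_level_classification"]

detection_data = [Sample("det_0", [3], [(10.0, 20.0, 50.0, 80.0)])]
classification_data = [Sample("cls_0", [7]), Sample("cls_1", [2, 5])]

batch = make_mixed_batch(detection_data, classification_data, n_det=1, n_cls=2)
for sample in batch:
    print(sample.image_id, active_losses(sample))
```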


Detic was evaluated on LVIS, a large-vocabulary object detection dataset. The team mainly uses the open-vocabulary setting but also reports results on the standard LVIS setting, and compares with prior work on the popular open-vocabulary COCO benchmark. The LVIS dataset provides object detection and instance segmentation labels for 1203 classes on 100K images. The classes are divided into three groups (frequent, common, and rare) according to their number of training images.
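The grouping by training-image count can be expressed as a small function. The cutoffs are not given above; the sketch assumes the commonly cited LVIS thresholds (rare: at most 10 training images, common: 11 to 100, frequent: more than 100) and exposes them as parameters in case a different split is wanted. The function name is hypothetical.

```python
def frequency_group(num_train_images, rare_max=10, common_max=100):
    """Assign a class to a frequency group by its training-image count.

    The default thresholds are an assumption based on the usual LVIS
    definition, not a value stated in this article.
    """
    if num_train_images <= rare_max:
        return "rare"
    if num_train_images <= common_max:
        return "common"
    return "frequent"

# Hypothetical per-class image counts, for illustration only.
counts = {"class_a": 4, "class_b": 57, "class_c": 2300}
print({name: frequency_group(n) for name, n in counts.items()})
```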

In all three cases, Detic improves substantially on the baseline and the alternatives. On the novel classes, Detic gains 8.3 points with ImageNet and 3.2 points with CC. Detic with image-level labels therefore delivers strong open-vocabulary detection performance and can improve existing open-vocabulary detectors. Despite having no box labels for the novel classes, Detic with ImageNet outperforms Box-Supervised (all class). This result suggests that bounding box annotations may not be required for new classes. Combining Detic with large image classification datasets is a simple and effective way to expand a detector’s vocabulary.


Detic is a simple way to use image-level supervision in large-vocabulary object detection. Its generalization ability benefits from the large-scale pretraining of CLIP, and it is unknown whether there are other ways to train the classifier. According to the experiments, Detic improves large-vocabulary detection across a variety of weak data sources, classifiers, detector architectures, and training recipes.

Detic is easier to use than previous assignment-based weakly-supervised detection methods, but it supervises all image labels on the same region and does not consider overall dataset statistics; the team leaves incorporating such statistics to future work. The team hopes Detic will make object detection easier to deploy and encourage future research in open-vocabulary detection.