Object detection has been an important task in the computer vision domain in recent decades. The goal is to detect instances of objects, such as humans, cars, etc., in digital images. Hundreds of methods have been developed to answer a single question: What objects are where?
Traditional methods tried to answer this question by extracting hand-crafted features, such as edges and corners, from the image. Most of these approaches used a sliding window: they repeatedly checked small parts of the image at different scales to see whether any of those parts contained the object they were looking for. This was extremely time-consuming, and even a slight change in the object's shape, lighting, etc., could cause the algorithm to miss it.
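The sliding-window idea can be sketched as follows. This is a minimal illustration, not any specific classical detector: the window size, stride, and scales are arbitrary, and a real system would score hand-crafted features (e.g., HOG or Haar) where the comment indicates.

```python
import numpy as np

def sliding_window_detect(image, window=(64, 64), stride=32, scales=(1.0, 0.5)):
    """Enumerate candidate windows at multiple scales, as classical
    detectors did before deep learning."""
    candidates = []
    for scale in scales:
        h = int(image.shape[0] * scale)
        w = int(image.shape[1] * scale)
        # Naive nearest-neighbor resize to keep the sketch dependency-free.
        rows = (np.arange(h) / scale).astype(int)
        cols = (np.arange(w) / scale).astype(int)
        scaled = image[rows][:, cols]
        for y in range(0, h - window[0] + 1, stride):
            for x in range(0, w - window[1] + 1, stride):
                patch = scaled[y:y + window[0], x:x + window[1]]
                # A real detector would score hand-crafted features here;
                # we just record the window location, scale, and a dummy score.
                candidates.append((x, y, scale, float(patch.mean())))
    return candidates

image = np.zeros((128, 128), dtype=np.float32)
windows = sliding_window_detect(image)
print(len(windows))  # 10 windows across the two scales
```

Even this toy version makes the cost visible: the number of windows grows with image size, scale count, and stride density, and every window must be scored independently.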
Then came the deep learning era. With increasingly capable hardware and the introduction of large-scale datasets, it became possible to exploit advances in deep learning to develop reliable and robust object detection algorithms that work in an end-to-end manner.
Deep learning methods can yield extremely successful object detectors. They are robust to changes in the environment and in the objects themselves, and many of them run in real time, even on mobile devices. Sounds really good, right? Does that mean the object detection problem is solved for good? Well, not yet.
The problem is that all of these methods are bound by the dataset they are trained on. If you want your model to detect pandas in images, you need lots of panda images to teach it what a panda looks like. Collecting those images is one aspect, but the bigger problem is labeling them. Going over thousands of images and marking the exact location of every panda is an extremely time-consuming task.
Moreover, you would need to repeat this for every object you want your model to recognize. Imagine you want to develop a generic object detector that recognizes everything it sees. You can use a large-scale dataset like COCO that includes a variety of objects, but you will still be limited to the categories present in that dataset.
What if the model could discover new objects? In that case, we would not need to label every single object in the world. A set of known objects would be given to the model, and when it encounters a new one, it would recognize it as novel and predict a label for it. This is what the authors of the RNCDL paper try to achieve.
This problem is called novel class discovery and localization (NCDL). The goal is to discover and detect objects from raw, unlabeled data. Existing methods tackle this problem in a data-driven manner by injecting prior knowledge, grouping unknown objects into semantic classes with some degree of supervision. However, they do not solve discovery and localization jointly.
Solving the novel class discovery and localization together is a more challenging problem, as each image in the dataset contains labeled and unlabeled object classes together. Therefore, each of these objects needs to be localized and categorized at the same time.
RNCDL is trained on a mixed and modified combination of the COCO and LVIS datasets. Half of the COCO dataset is used as-is, while the labels of the remaining half are removed to evaluate how well the network learns to detect novel classes, using the long-tailed LVIS label set as the evaluation vocabulary.
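The dataset setup above can be sketched as a simple split routine. This is an illustrative sketch only; the actual RNCDL split follows the specific COCO-half / LVIS-vocabulary protocol described in the paper, and the annotation format here is a placeholder.

```python
import random

def make_ncdl_split(annotations, labeled_fraction=0.5, seed=0):
    """Split per-image annotations into a labeled half and an 'unlabeled'
    half whose ground truth is withheld from training (illustrative sketch
    of the NCDL data setup)."""
    image_ids = sorted(annotations)
    random.Random(seed).shuffle(image_ids)
    cut = int(len(image_ids) * labeled_fraction)
    labeled = {i: annotations[i] for i in image_ids[:cut]}
    # Labels removed: the model only sees raw pixels for these images,
    # and the withheld annotations are kept aside for evaluation.
    unlabeled = {i: None for i in image_ids[cut:]}
    return labeled, unlabeled

anns = {i: [f"box_{i}"] for i in range(10)}
labeled, unlabeled = make_ncdl_split(anns)
print(len(labeled), len(unlabeled))  # 5 5
```

The key point is that the "unlabeled" images still have ground truth, it is simply hidden from the model so that discovery can be measured afterwards.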
A two-stage detector is trained end-to-end for this problem. The model has two goals at once: to correctly detect labeled objects and to detect unlabeled objects by learning feature representations. First, supervised training is done on the labeled part of the dataset; then, self-supervised training is done on the unlabeled data. The knowledge learned during the supervised phase is transferred to the later stage by retaining the weights of the class-agnostic modules and segmentation heads.
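The weight transfer between the two phases can be sketched as keeping class-agnostic parameters and discarding class-specific ones. The module names below (`backbone.`, `rpn.`, `mask_head.`, `classifier.`) are illustrative placeholders, not the paper's exact layout.

```python
def transfer_weights(supervised_ckpt, keep_prefixes=("backbone.", "rpn.", "mask_head.")):
    """Carry class-agnostic weights from the supervised phase into the
    self-supervised phase; class-specific parameters are dropped so they
    can be reinitialized. Prefixes are hypothetical module names."""
    kept = {k: v for k, v in supervised_ckpt.items()
            if k.startswith(keep_prefixes)}
    dropped = sorted(set(supervised_ckpt) - set(kept))
    return kept, dropped

ckpt = {
    "backbone.conv1": 0.1,
    "rpn.objectness": 0.2,
    "mask_head.deconv": 0.3,
    "classifier.known_classes": 0.4,  # class-specific: reinitialized instead
}
kept, dropped = transfer_weights(ckpt)
print(sorted(kept), dropped)
```

In a real framework this would operate on a checkpoint's state dict, but the selection logic is the same: everything that does not depend on the known class vocabulary survives into the discovery phase.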
For classification, a new classification head is added alongside the primary one, and the two are trained together with the goal of categorizing every region proposal. A non-uniform classification prior helps the network acquire features that represent the full variety of objects and prevents it from becoming biased toward the labeled or background classes.
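One simple way to see the effect of a non-uniform class prior is to fold it into the classifier's output distribution. The sketch below adds a log-prior to the logits before the softmax; this is a generic prior-adjustment illustration, not the exact constraint mechanism RNCDL uses.

```python
import numpy as np

def prior_adjusted_probs(logits, prior):
    """Bias per-class probabilities toward a long-tailed prior so that
    rare novel classes are not drowned out by frequent labeled or
    background classes (illustrative sketch)."""
    prior = np.asarray(prior, dtype=np.float64)
    adjusted = logits + np.log(prior / prior.sum())  # add log-prior to logits
    exp = np.exp(adjusted - adjusted.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

# With uniform logits, the output distribution simply follows the prior.
logits = np.zeros(4)
prior = np.array([8.0, 4.0, 2.0, 1.0])  # hypothetical long-tailed frequencies
probs = prior_adjusted_probs(logits, prior)
print(probs.round(3))
```

Without such a prior, a classifier trained mostly on labeled and background proposals tends to assign almost all probability mass to those dominant classes, which is exactly the bias the paper's non-uniform prior is meant to counteract.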
RNCDL can discover and detect novel classes, outperforming previous approaches. Moreover, it generalizes beyond the COCO dataset.
Check out the Paper, GitHub, and Project Page. All credit for this research goes to the researchers on this project.
Ekrem Çetinkaya received his B.Sc. in 2018, and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis about image denoising using deep convolutional networks. He received his Ph.D. degree in 2023 from the University of Klagenfurt, Austria, with his dissertation titled "Video Coding Enhancements for HTTP Adaptive Streaming Using Machine Learning." His research interests include deep learning, computer vision, video encoding, and multimedia networking.