COCONut: A High-Quality, Large-Scale Dataset for Next-Gen Segmentation Models

Computer vision has advanced significantly in recent decades, thanks in large part to comprehensive benchmark datasets like COCO. However, nearly a decade after its introduction, COCO’s suitability as a benchmark for modern AI models is being questioned. Its annotations may contain biases and nuances reflecting the early stages of computer vision research. With model performance plateauing on COCO, there are concerns about overfitting to the dataset’s specific characteristics, potentially limiting real-world applicability.

To modernize COCO segmentation, researchers propose COCONut, a novel, large-scale universal segmentation dataset. Unlike previous attempts at creating large datasets, which often compromised label accuracy for scale, COCONut features human-verified mask labels for 383K images (the 358K-image COCONut-L training split plus the 25K-image COCONut-val validation set). Imagine having to manually annotate millions of objects in images – it would take years! COCONut tackles this challenge with an assisted-manual annotation pipeline in which neural networks generate proposals and human annotators inspect and correct them.

The pipeline involves four key stages: machine-generated prediction, human inspection and editing, mask generation/refinement, and expert quality verification. Different neural models handle ‘thing’ (countable objects) and ‘stuff’ (amorphous regions) classes at each stage, ensuring high-quality annotations.

But how does this assisted-manual pipeline actually work? In the first stage, a bounding box detector and a mask segmenter generate initial proposals for ‘thing’ and ‘stuff’ classes, respectively. Human annotators then inspect these proposals, editing or adding new ones as needed. The refined boxes and points are fed into separate modules to generate final segmentation masks. Lastly, expert annotators verify a random sample of these masks, relabeling any that don’t meet stringent quality standards.
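The four stages described above can be sketched as a small orchestration function. This is only an illustrative outline, not the authors' implementation: every name here (`Proposal`, `Mask`, `annotate_image`, `expert_verify`, and the callables passed in) is hypothetical, and the detector, segmenter, human editing step, and mask generator are represented as plain functions supplied by the caller. It also assumes, for simplicity, that 'thing' proposals carry boxes and 'stuff' proposals carry points.

```python
import random
from dataclasses import dataclass


@dataclass
class Proposal:
    kind: str      # "thing" or "stuff"
    label: str     # class name
    prompt: tuple  # assumed: box (x0, y0, x1, y1) for things, point (x, y) for stuff


@dataclass
class Mask:
    kind: str
    label: str
    source_prompt: tuple
    verified: bool = False
    relabeled: bool = False


def annotate_image(image, propose_things, propose_stuff, human_edit, prompt_to_mask):
    """Stages 1-3: machine proposals, human editing, mask generation."""
    # Stage 1: separate models propose 'thing' and 'stuff' candidates.
    proposals = propose_things(image) + propose_stuff(image)
    # Stage 2: a human inspects the proposals, editing or adding as needed.
    proposals = human_edit(image, proposals)
    # Stage 3: each refined box or point is turned into a segmentation mask.
    return [prompt_to_mask(image, p) for p in proposals]


def expert_verify(masks, passes_quality, relabel, sample_rate, rng):
    """Stage 4: experts check a random sample and relabel masks that fail."""
    if not masks:
        return []
    k = min(len(masks), max(1, round(len(masks) * sample_rate)))
    sampled = set(rng.sample(range(len(masks)), k))
    checked = []
    for i, mask in enumerate(masks):
        if i in sampled:
            if not passes_quality(mask):
                mask = relabel(mask)
                mask.relabeled = True
            mask.verified = True
        checked.append(mask)
    return checked
```

Passing the models and the human steps in as callables keeps the sketch independent of any particular detector or annotation tool; the sampling rate for expert verification is likewise a free parameter, since the article does not state one.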

To scale up the dataset size while maintaining quality, the researchers built a data engine. It uses the annotated data to iteratively retrain the neural networks, generating improved proposals for the annotation pipeline. This positive feedback loop, coupled with additional images from other datasets, culminated in the COCONut-L split with 358K images and 4.75M masks.
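The feedback loop of the data engine can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions rather than the researchers' code: `run_data_engine`, `train`, and `annotate` are hypothetical names, `train` stands in for retraining the proposal networks on everything annotated so far, and `annotate` stands in for one pass of the assisted-manual pipeline using the current model.

```python
def run_data_engine(seed_annotations, unlabeled_batches, train, annotate):
    """Alternate between annotating new images and retraining on all labels.

    train(annotations) -> model           (hypothetical retraining step)
    annotate(model, batch) -> annotations (hypothetical assisted-manual pass)
    """
    annotations = list(seed_annotations)
    model = train(annotations)
    for batch in unlabeled_batches:
        # The current model generates proposals that humans then correct.
        annotations.extend(annotate(model, batch))
        # Retrain on the enlarged set so the next batch gets better proposals.
        model = train(annotations)
    return model, annotations
```

Each pass enlarges the labeled pool before the next retraining step, which is what lets proposal quality improve as the dataset grows.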

The researchers conducted a thorough analysis comparing COCONut annotations to purely manual ones. Their expert annotators exhibited high agreement on both ‘thing’ and ‘stuff’ masks. Meanwhile, the assisted-manual pipeline significantly accelerated annotation speed, especially for ‘thing’ classes. COCONut is available in three sizes – COCONut-S (118K images), COCONut-B (242K images), and COCONut-L (358K images with 4.75M masks). Quantitative results showcase consistent improvements across various neural architectures as the training set size increases from COCONut-S to COCONut-L.
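The article does not say how annotator agreement was measured. One common way to compare two masks of the same region is intersection over union (IoU), sketched below as an assumption rather than as the paper's metric; `mask_iou` is a hypothetical helper that takes two equally shaped binary masks as nested lists.

```python
def mask_iou(mask_a, mask_b):
    """Intersection over union of two equally shaped 2D binary masks."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            if pixel_a and pixel_b:
                intersection += 1
            if pixel_a or pixel_b:
                union += 1
    # Two empty masks are treated as being in full agreement.
    return intersection / union if union else 1.0


# Two annotators' masks for the same object: 2 shared pixels, 6 in the union.
annotator_1 = [[1, 1, 0],
               [1, 1, 0]]
annotator_2 = [[0, 1, 1],
               [0, 1, 1]]
agreement = mask_iou(annotator_1, annotator_2)  # 2 / 6
```

An IoU of 1.0 means the two masks coincide exactly, while values near 0 mean they barely overlap.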

Interestingly, while larger pseudo-label datasets provided minimal gains, training on the fully human-annotated COCONut-B yielded the most significant performance boost. This underscores the importance of human-annotated data for training robust segmentation models.

COCONut represents a significant step forward in modernizing the COCO benchmark. With its meticulous human-verified annotations and a rigorously curated 25K image validation set (COCONut-val), it promises to be a more challenging testbed for evaluating contemporary segmentation models. The open-source release of COCONut paves the way for developing more capable and unbiased computer vision systems applicable to real-world scenarios.

Check out the Paper and Project. All credit for this research goes to the researchers of this project.

Vineet Kumar is a consulting intern at MarktechPost. He is currently pursuing his BS from the Indian Institute of Technology(IIT), Kanpur. He is a Machine Learning enthusiast. He is passionate about research and the latest advancements in Deep Learning, Computer Vision, and related fields.
