In computer vision, salient object detection (SOD) is the task of detecting and segmenting the most visually noticeable objects in natural scenes.
Most existing SOD networks share a similar design: they build on deep features extracted by backbone networks such as AlexNet, VGG, ResNet, ResNeXt, and DenseNet. These backbones were originally built for image classification, so they extract features that represent semantic meaning rather than the local details and global contrast information that salient object detection needs. They also tend to require pretraining on ImageNet, which is data-inefficient.
University of Alberta’s newly proposed U2-Net is a simple but powerful deep network architecture built around a novel two-level nested U-structure. Its ReSidual U-block (RSU) mixes receptive fields of different sizes, which helps it capture contextual information at different scales more efficiently. The RSU blocks also use pooling operations, which increase the depth of the overall architecture without significantly increasing the computational cost.
Fig-1: (1) Plain convolution block PLN, (2) Residual-like block RES, (3) Inception-like block INC, (4) Dense-like block DSE and (5) Proposed ReSidual U-block RSU.
Three major components of RSU are:
- Input convolutional layer
- U-Net-like symmetric encoder-decoder structure of ‘L’ height
- Residual connection to join local and multiscale features using summation
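To see why pooling lets an RSU go deeper at little extra cost, here is a minimal sketch that tracks the feature-map size at each encoder level of a block of height L. It assumes the block applies L - 2 pooling steps, each halving the resolution with ceiling rounding, and that the deepest level uses a dilated convolution instead of pooling. The function names and the 320x320 input are hypothetical, chosen only for illustration.

```python
import math


def rsu_level_sizes(height, input_size):
    """Return the (h, w) feature-map size at each encoder level of an RSU.

    Assumption: the block pools (height - 2) times, halving the resolution
    with ceiling rounding each time; the deepest level is a dilated
    convolution that keeps the resolution of the level above it.
    """
    if height < 2:
        raise ValueError("height must be at least 2")
    h, w = input_size
    sizes = [(h, w)]  # level 1 runs at full resolution
    for _ in range(height - 2):  # levels 2 .. height-1 follow a pooling step
        h, w = math.ceil(h / 2), math.ceil(w / 2)
        sizes.append((h, w))
    sizes.append((h, w))  # deepest level: no pooling, same resolution
    return sizes


def relative_conv_cost(sizes):
    """Cost of one same-width convolution per level, relative to level 1.

    Convolution cost scales with the number of spatial positions, so a
    level at half resolution costs roughly a quarter as much.
    """
    base = sizes[0][0] * sizes[0][1]
    return [(h * w) / base for h, w in sizes]


sizes = rsu_level_sizes(height=7, input_size=(320, 320))
costs = relative_conv_cost(sizes)
for level, (size, cost) in enumerate(zip(sizes, costs), start=1):
    print(f"level {level}: {size[0]}x{size[1]}  relative cost {cost:.4f}")
print(f"total for {len(sizes)} levels: {sum(costs):.3f}x one full-resolution layer")
```

Under these assumptions, the levels below the first together cost about a third of one full-resolution convolution, which is the sense in which the extra depth is cheap.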
The RSU replaces the plain single-stream convolution of the original residual block with a U-Net-like structure. It also replaces the original input feature in the residual sum with the local feature produced by a weight layer, as shown below:
Fig-2: difference between original Residual block and proposed ReSidual U-block
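The difference between the two blocks can also be written as a pair of equations. The notation here is an assumption made for illustration: x is the input feature map, F_1 and F_2 are weight (convolution) layers, and U is the U-Net-like encoder-decoder inside the RSU.

```latex
\begin{align}
\mathcal{H}(x) &= \mathcal{F}_2\bigl(\mathcal{F}_1(x)\bigr) + x
  && \text{(original residual block)} \\
\mathcal{H}_{RSU}(x) &= \mathcal{U}\bigl(\mathcal{F}_1(x)\bigr) + \mathcal{F}_1(x)
  && \text{(ReSidual U-block)}
\end{align}
```

In the second line the shortcut carries the local feature F_1(x) rather than the raw input x, and the summation joins it with the multiscale feature U(F_1(x)).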
Based on RSU blocks alone, the researchers developed U2-Net. It consists of a 6-stage encoder, a 5-stage decoder, and a saliency map fusion module attached to the decoder stages and the last encoder stage. The result is a deep architecture with rich multiscale features and low computational and memory costs. Because it does not rely on any backbone pre-trained for image classification, it can be trained from scratch and adapted to different working environments with little performance loss.
The researchers trained U2-Net on DUTS-TR, the largest training dataset for salient object detection. For evaluation, they used six public benchmark datasets for salient object detection: DUT-OMRON, DUTS-TE, ECSSD, HKU-IS, PASCAL-S, and SOD.
Fig-3: Illustration of the proposed U2-Net architecture
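The fusion step can be sketched in plain Python as follows. It assumes the six side outputs (one per decoder stage plus the last encoder stage) have already been upsampled to a common size, and that fusion is a per-pixel weighted sum followed by a sigmoid, which is what a 1x1 convolution over concatenated maps computes. The function names, the uniform weights, and the 2x2 maps are hypothetical; in a trained network the weights would be learned.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fuse_side_outputs(side_logits, weights, bias=0.0):
    """Fuse same-sized side-output logit maps into one probability map.

    side_logits: list of 2D lists (one per stage), all with the same shape.
    weights:     one scalar per map, standing in for a 1x1 convolution.
    """
    if len(side_logits) != len(weights):
        raise ValueError("need exactly one weight per side output")
    rows, cols = len(side_logits[0]), len(side_logits[0][0])
    fused = []
    for r in range(rows):
        fused_row = []
        for c in range(cols):
            # Weighted sum across stages at this pixel, then squash to (0, 1).
            z = bias + sum(w * m[r][c] for w, m in zip(weights, side_logits))
            fused_row.append(sigmoid(z))
        fused.append(fused_row)
    return fused


# Hypothetical toy input: six identical 2x2 logit maps
# (five decoder stages plus the last encoder stage).
sides = [[[2.0, -2.0], [0.0, 2.0]] for _ in range(6)]
weights = [1.0 / 6] * 6  # uniform weights, for illustration only
fused = fuse_side_outputs(sides, weights)
for row in fused:
    print([round(p, 3) for p in row])
```

Pixels whose fused logit is positive end up above 0.5 and are treated as more likely salient. A real implementation would operate on tensors rather than nested lists.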