Depth estimation is one of the fundamental problems in computer vision, and it’s essential for a wide range of applications, such as robotic vision or surgical navigation.
Various deep learning-based approaches have recently been developed to provide end-to-end solutions for depth and disparity estimation. One such method is self-supervised monocular depth estimation. Monocular depth estimation is the process of determining scene depth from a single image. For disparity estimation, the bulk of these models rely on a U-Net-based design.
Although humans perceive relative depth with ease, the same task has proven challenging for machines, in part because no optimal network architecture is known. To compensate, increasingly complex architectures are used to generate a high-resolution photometric output.
The Hamlyn Centre’s research team from Imperial College London introduces a unique randomly connected encoder-decoder architecture for self-supervised monocular depth estimation. The team credits the idea’s success to an architectural design capable of extracting high-order features from a single image, together with a loss function that imposes a strong feature distribution.
The basis of this research is the idea that it may not matter exactly how a network’s connections are wired, and the first step in developing the architecture was to put this idea to the test. The researchers modeled randomly connected neural networks as graphs, with each node acting as a convolution layer and the nodes linked by a random graph generation method. Once a graph has been created, it is converted into a neural network using a deep learning toolkit such as PyTorch.
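The graph-to-network idea can be sketched as follows. This is a minimal illustration, not the paper’s actual generator: the node count, edge probability, and sum-based aggregation are placeholder assumptions, and only the general recipe (random DAG, one convolution per node) comes from the article.

```python
import random
import torch
import torch.nn as nn

class RandomlyWiredBlock(nn.Module):
    """Sketch of a randomly connected block: each graph node is a
    3x3 convolution, and edges are drawn at random to form a DAG.
    Node count and edge probability are illustrative assumptions."""

    def __init__(self, channels, num_nodes=6, edge_prob=0.5, seed=0):
        super().__init__()
        rng = random.Random(seed)
        # Edges only run from lower to higher node index, which
        # guarantees the random graph is acyclic.
        self.parents = {i: [j for j in range(i) if rng.random() < edge_prob]
                        for i in range(1, num_nodes)}
        # Ensure every node has at least one parent so it is reachable.
        for i in range(1, num_nodes):
            if not self.parents[i]:
                self.parents[i] = [rng.randrange(i)]
        self.num_nodes = num_nodes
        self.convs = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                           nn.ReLU())
             for _ in range(num_nodes)])

    def forward(self, x):
        outs = [self.convs[0](x)]  # node 0 consumes the block input
        for i in range(1, self.num_nodes):
            # Aggregate parent outputs by summation, then convolve.
            agg = sum(outs[j] for j in self.parents[i])
            outs.append(self.convs[i](agg))
        return outs[-1]  # the last node is the block output

block = RandomlyWiredBlock(channels=8)
y = block(torch.randn(1, 8, 16, 16))
```

Because edges only point from lower to higher indices, every sampled graph is a valid feed-forward network regardless of the random draw, which is what makes a random search over wirings possible.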
A cascaded random search approach is introduced to generate arbitrary network architectures and to ensure an efficient search of the connection space. In addition, a new variant of the U-Net topology was developed to improve the expressive power of the skip-connection feature maps, both spatially and semantically.
Unlike an ordinary U-Net, this design places convolutions (learnable layers) in the skip connections themselves. As a result, the researchers can make better use of the deep semantic features in the encoder feature maps, which are typically held in the channel space but not explicitly exploited.
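A minimal sketch of a learnable skip connection is shown below. This is not the paper’s architecture: the depth, channel widths, and layer choices are placeholder assumptions, and only the key difference from a plain U-Net, passing the encoder feature map through its own convolution before it reaches the decoder, reflects the design described above.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU())

class UNetLearnableSkip(nn.Module):
    """One-level encoder-decoder illustrating a learnable skip
    connection. Widths and depth are illustrative assumptions."""

    def __init__(self):
        super().__init__()
        self.enc = conv_block(3, 16)
        self.down = nn.MaxPool2d(2)
        self.bottleneck = conv_block(16, 32)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        # A plain U-Net would concatenate `skip` unchanged; here the
        # skip path is refined by a learnable convolution first.
        self.skip_conv = conv_block(16, 16)
        self.dec = conv_block(32 + 16, 16)
        self.head = nn.Conv2d(16, 1, 1)  # single-channel disparity map

    def forward(self, x):
        skip = self.enc(x)
        z = self.bottleneck(self.down(skip))
        z = self.up(z)
        z = torch.cat([z, self.skip_conv(skip)], dim=1)
        return self.head(self.dec(z))

net = UNetLearnableSkip()
disp = net(torch.randn(1, 3, 32, 32))
```

The extra convolution gives the skip path its own parameters, so the network can learn which encoder channels are worth forwarding rather than copying all of them verbatim.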
According to the researchers, multiscale loss functions are critical for improving the image reconstruction process. They combine several such terms into a new loss function that enhances the quality of image reconstruction, efficiently extending deep-feature adversarial and perceptual losses to multiple scales for high-quality view synthesis and error calculation.
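The multiscale principle can be illustrated with a simple reconstruction penalty. The sketch below shows only a multiscale L1 term; the paper’s full loss also includes the perceptual and adversarial components mentioned above, and the number of scales here is an assumption.

```python
import torch
import torch.nn.functional as F

def multiscale_l1_loss(pred, target, num_scales=4):
    """Illustrative multiscale reconstruction penalty: compare the
    synthesized and target images at several resolutions so that both
    fine detail and coarse structure contribute to the error.
    (Sketch only; the paper's loss adds perceptual/adversarial terms.)"""
    loss = 0.0
    for s in range(num_scales):
        if s > 0:
            # Halve the resolution at each successive scale.
            pred = F.avg_pool2d(pred, 2)
            target = F.avg_pool2d(target, 2)
        loss = loss + F.l1_loss(pred, target)
    return loss / num_scales

img = torch.rand(1, 3, 64, 64)
zero_loss = multiscale_l1_loss(img, img)          # identical images
nonzero_loss = multiscale_l1_loss(img, torch.zeros_like(img))
```

Penalizing the error at every scale is what encourages the synthesized view to match the target in fine detail as well as overall structure, which the authors report is key to recovering finer details.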
The researchers compared their method against state-of-the-art self-supervised depth estimation methods on two surgical datasets. The findings show that even a randomly connected network with standard convolution operations, but unusual interconnections, can learn the task effectively. The study also finds that the multiscale penalty in the loss function is critical for recovering finer details.
The broader aim of this research is to lay a foundation for further work on neural network architecture design. The experimental results could be helpful for studies that aim to move away from the traditional U-Net and manual trial-and-error procedures toward more automated design methodologies.