Research From China Propose a Novel Context-Aware Vision Transformer (CA-ViT) For Ghost-Free High Dynamic Range Imaging

By fusing much low dynamic range (LDR) photographs with different exposures, multi-frame high dynamic range (HDR) imaging tries to provide images with a more extensive dynamic range and more realistic features. However, in reality, camera movements and dynamic foreground objects frequently contradict this ideal scenario, resulting in negative ghosting distortions in the reconstructed HDR results. Several techniques known as HDR de-ghosting algorithms have been suggested to generate high-quality, ghost-free HDR photos. Traditionally, numerous approaches include correcting the input LDR pictures or excluding misaligned pixels before the image fusion to reduce ghosting effects.

However, exact alignment is complex, and the overall HDR impact is decreased when relevant information is lost due to poor pixel rejection. As a result, CNN-based learning algorithms that explore in-depth features in data-driven ways have been developed to address the ghosting phenomenon. Current CNN-based de-ghosting techniques may be broadly divided into two groups. In the first category, homography or optical flow is used to pre-align LDR pictures, and a CNN is then used to conduct multi-frame fusion and HDR reconstruction. However, optical flow is inconsistent in the presence of occlusions and saturations, and homography is unable to align moving objects in the foreground. In order to handle ghosting artifacts and achieve state-of-the-art performance, the second category suggests end-to-end networks with implicit alignment modules or unique learning algorithms.

However, the restrictions become apparent in the presence of distant object motions and significant intensity changes. Convolution’s built-in location constraint explains the situation. CNN is inadequate at long-range modeling dependence (such as ghosting effects brought on by significant motion) since it requires stacking deep layers to generate a broad receptive field. Additionally, since the same kernels are used throughout the whole image, convolutions ignore the long-range intensity fluctuations of various image areas. Therefore, more performance enhancement is required by investigating content-dependent algorithms with long-range modeling capacity.

Due to its better long-range modeling capabilities, research interest in Vision Transformer (ViT) has lately increased. However, the experimental findings point to two significant problems that prevent its use in HDR de-ghosting. Generalization does not occur when trained on insufficient data, even though available datasets for HDR de-ghosting are limited due to the extravagant cost of collecting large numbers of realistic labeled samples. On the other hand, transformers lack the inductive biases inherent to CNN.

Contrarily, the neighbor pixel associations of both the intra-frame and the inter-frame are crucial for recovering local features over numerous frames. However, the pure Transformer is unsuccessful in obtaining such local context. In order to do this, they suggest a brand-new Context-Aware Vision Transformer (CAViT), which is designed with a dual-branch architecture to simultaneously capture global and local dependencies.

They use a window-based multi-head Transformer encoder for the global branch in order to capture distant contexts. They create a local context extractor (LCE) for the local branch that extracts the local feature maps through a convolutional block and chooses the most beneficial features over several frames using a channel attention method. Thus, the suggested CA-ViT enables the interaction of local and global settings. They present a novel Transformer-based architecture (dubbed HDR-Transformer) for ghost-free HDR photography by integrating with the CA-ViT. In particular, a feature extraction network and an HDR reconstruction network make up most of the proposed HDR-Transformer. Using a spatial attention module, the feature extraction network extracts shallow features and coarsely fuses them.

The suggested CA-ViT is the fundamental building block for the hierarchically constructed HDR reconstruction network. In order to rebuild ghost-free, high-quality HDR photos, the CA-ViTs describe long-range ghosting artifacts and local pixel interaction. This eliminates the need to stack intense convolution blocks.

Source: https://arxiv.org/pdf/2208.05114v1.pdf

The primary contributions of this study may be summed up as follows: – 

  • They propose a novel vision transformer termed CA-ViT that can fully utilize both global and local picture context dependencies while outperforming its predecessors by a wide margin.
  • They introduce a unique HDR-Transformer that can reduce processing costs, ghosting artifacts, and recreating high-quality HDR photos. This is the first Transformer-based HDR de-ghosting framework to be developed. 
  • They undertake in-depth tests on three sample benchmark HDR datasets to compare HDR-performance Transformers to current state-of-the-art techniques.

The official code implementation of this paper is available on Github.

This Article is written as a research summary article by Marktechpost Staff based on the research paper 'Ghost-free High Dynamic Range Imaging with Context-aware Transformer'. All Credit For This Research Goes To Researchers on This Project. Check out the paper and github link.

Please Don't Forget To Join Our ML Subreddit