Researchers at Heidelberg University have recently proposed a method that efficiently encodes image-specific inductive biases into a model while retaining the full flexibility of transformers. The approach combines the effectiveness of the inductive biases of convolutional neural networks (CNNs) with the expressivity of transformers to model and synthesize high-resolution images.
Transformers have shown promising results in learning long-range interactions on sequential data. They have been employed for language tasks and are increasingly being adapted to reinforcement learning, audio, and computer vision.
The transformer architecture contains no built-in inductive prior on the locality of interactions and is therefore free to learn complex relationships among its inputs. However, this also means that it has to learn all relationships, which makes it computationally infeasible for long sequences such as high-resolution images. Thus, the increased expressivity of transformers comes with rising computational costs, because attention considers pairwise interactions among all inputs and so scales quadratically with sequence length.
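To make the cost of pairwise interactions concrete, the short sketch below counts how many input pairs full self-attention has to score for a single sequence. The image size (256x256) and the downsampling factor of 16 are illustrative assumptions, not figures taken from this article, and the function names are hypothetical.

```python
def attention_pairs(seq_len: int) -> int:
    """Number of input pairs scored by full self-attention over one sequence."""
    return seq_len * seq_len


def pixel_sequence_length(height: int, width: int) -> int:
    """Sequence length when every pixel is treated as one token."""
    return height * width


def code_sequence_length(height: int, width: int, downsample: int) -> int:
    """Sequence length when each downsample x downsample patch becomes one token.

    The downsampling factor is an assumed, illustrative parameter.
    """
    return (height // downsample) * (width // downsample)


# Assumed example: a 256x256 image, modeled per pixel vs. per compressed token.
pixel_len = pixel_sequence_length(256, 256)        # 65536 tokens
code_len = code_sequence_length(256, 256, 16)      # 256 tokens

pixel_pairs = attention_pairs(pixel_len)           # 4294967296 pairs
code_pairs = attention_pairs(code_len)             # 65536 pairs
reduction = pixel_pairs // code_pairs              # 65536 times fewer pairs
```

Under these assumptions, shortening the sequence by a factor of 256 cuts the number of attended pairs by a factor of 65536, which is why a compact image representation matters so much for a transformer.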
The novel method
To address this issue, the researchers use convolutional neural networks (CNNs). CNNs exhibit a strong locality bias and a bias towards spatial invariance through the use of shared weights across all positions. The researchers use a CNN to learn a context-rich vocabulary of image constituents and a transformer to efficiently model their composition within images.
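The effect of shared weights can be illustrated with a toy one-dimensional convolution. This is a minimal sketch in plain Python, not the researchers' code; the signal, the kernel, and the function name conv1d are made up for illustration. Because the same kernel is applied at every position, each output depends only on a small local window, and shifting the input simply shifts the output.

```python
def conv1d(signal, kernel):
    """Valid 1D cross-correlation: the same kernel weights are reused at every position."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


# Hypothetical toy data: the same bump, once in place and once shifted right by one.
signal = [0, 0, 1, 2, 1, 0, 0, 0]
shifted = [0, 0, 0, 1, 2, 1, 0, 0]
kernel = [1, 0, -1]  # each output looks at only three neighboring inputs (locality)

out = conv1d(signal, kernel)
out_shifted = conv1d(shifted, kernel)

# Shifting the input by one position shifts the response by one position.
same_response_shifted = out_shifted[1:] == out[:-1]
```

A transformer has no such constraint built in, so it would have to learn this behavior from data.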
Rather than representing images with pixels, the method represents them as a composition of perceptually rich image constituents drawn from a codebook of context-rich visual parts. This significantly reduces the description length of compositions, which allows the global interrelations within images to be modeled efficiently with a transformer architecture. The generated images are realistic and high-resolution in both unconditional and conditional settings.
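The idea of describing an image by codebook entries instead of raw values can be sketched as a nearest-neighbor lookup. The code below is a simplified illustration in plain Python, assuming that a CNN encoder has already produced one feature vector per image region; the tiny two-dimensional codebook, the feature values, and the function names are all hypothetical, and the article does not specify the exact quantization procedure.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features, codebook):
    """Replace each feature vector with the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]


def lookup(indices, codebook):
    """Recover the codebook vectors that a sequence of indices refers to."""
    return [codebook[i] for i in indices]


# Hypothetical codebook with three entries and four encoder feature vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
features = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.8], [0.9, 0.9]]

indices = quantize(features, codebook)   # one integer per image region
quantized = lookup(indices, codebook)    # what a decoder would receive
```

The resulting sequence of integer indices is the kind of compact description a transformer can then model as a sequence, in place of the far longer sequence of pixels.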
Furthermore, the researchers use an adversarial approach to ensure that the dictionary of local parts captures perceptually important local structure, which relieves the transformer of the need to model low-level statistics. Allowing the transformer to concentrate on modeling long-range relations enables it to generate high-resolution images. The approach also gives direct control over the generated images through conditioning information such as desired object classes or spatial layouts.
While retaining the advantages of transformers, the proposed approach outperforms previous state-of-the-art codebook-based methods built on convolutional architectures. The researchers say that convolutional and transformer architectures together can model the compositional nature of our visual world. The combination of CNNs and transformers taps into the full potential of their complementary strengths and yields the first high-resolution image synthesis results with a transformer-based architecture.