UCSD and NVIDIA AI Researchers Propose ‘CoordGAN’: a Novel Disentangled GAN Model That Produces Dense Correspondence Maps Represented by a Novel Coordinate Space

This research summary is based on the paper 'CoordGAN: Self-Supervised Dense Correspondences Emerge from GANs'

Generative Adversarial Networks (GANs) have been highly successful at synthesizing high-quality images, and recent research shows that they also learn many interpretable directions in the latent space. Moving a latent code along a semantically meaningful direction (e.g., pose) produces instances whose appearance varies smoothly (e.g., continuously changing viewpoints), suggesting that GANs implicitly learn which pixels or regions correspond to each other across synthesized instances.

Dense correspondence, in turn, is established between local regions that are semantically equivalent but differ in appearance (e.g., patches of two different eyes). Because collecting large-scale, pixel-level annotations is exceedingly laborious, learning dense correspondence across images of one category remains difficult. While most existing work relies on supervised or unsupervised image classification networks, only a few studies have investigated how dense correspondence can be learned from GANs.

Researchers from UCSD and NVIDIA recently published a paper that explores learning dense correspondence from GANs. Specifically, they aim to learn an explicit correspondence map, i.e., a pixel-level semantic label map. This task is closely tied to disentangling structure and texture in GANs, since correspondence captures structure (e.g., shapes of facial components) and is independent of texture (e.g., global appearance such as skin tone and texture).

Prior studies show that semantic attributes can be disentangled by searching for latent directions learned by GANs. However, every such factor must be identified by a human. Some recent approaches achieve effective structure-texture disentanglement by enhancing the noise code input to GANs or by adding spatial attention in the intermediate layers. However, they either provide a structure map with a low resolution (e.g., 4×4) or do not produce one at all.

The central aim of this study is to introduce a new coordinate space from which pixel-level correspondence can be extracted explicitly for all synthesized images of a category. Inspired by UV maps of 3D meshes, where shapes of one category are represented as deformations of a single canonical template, the researchers represent the dense correspondence map of a generated image as a warped coordinate frame transformed from a canonical 2D coordinate map.
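
As a rough illustration of this idea, the following Python sketch builds a canonical 2D coordinate map and deforms it into a warped coordinate frame. The function names (canonical_grid, toy_warp) and the hand-written displacement are assumptions made for illustration only; in CoordGAN the transformation is learned, not fixed. Plain Python lists are used to keep the sketch dependency-free.

```python
def canonical_grid(height, width):
    """Canonical 2D coordinate map: pixel (row, col) stores its (x, y) in [-1, 1].

    Assumes height >= 2 and width >= 2.
    """
    def axis(n):
        return [-1.0 + 2.0 * i / (n - 1) for i in range(n)]

    xs, ys = axis(width), axis(height)
    return [[(x, y) for x in xs] for y in ys]


def toy_warp(grid, shift_x, shift_y):
    """Hypothetical stand-in for the learned transformation.

    Displaces every coordinate by a smooth bump that is largest at the
    center and vanishes on the border, so the frame is deformed while
    its outline stays fixed.
    """
    warped = []
    for row in grid:
        new_row = []
        for x, y in row:
            bump = (1.0 - x * x) * (1.0 - y * y)
            new_row.append((x + shift_x * bump, y + shift_y * bump))
        warped.append(new_row)
    return warped


canonical = canonical_grid(3, 3)
warped = toy_warp(canonical, shift_x=0.3, shift_y=-0.2)
```

Under this reading, each sampled structure would yield a different warped frame, and pixels in different images that carry the same coordinate value are treated as corresponding.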

This allows each unique structure to be represented as a transformation between the warped and canonical frames. The team designs a Coordinate GAN (CoordGAN) with two independently sampled noise vectors, one controlling structure and the other controlling texture. In the structure branch, an MLP is trained as the aforementioned transformation: it converts a sampled noise vector to a warped coordinate frame, which is modified further in the generator to control the hierarchical structure of the synthesized image. The texture branch uses Adaptive Instance Normalization (AdaIN) to regulate the global appearance.
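
For readers unfamiliar with AdaIN, here is a minimal sketch of the operation: each feature channel is normalized to zero mean and unit variance, then rescaled and shifted with per-channel style parameters. In this sketch the scales and biases are passed in directly; how CoordGAN derives them from the texture code is not covered in this summary, so that part is an assumption.

```python
import math


def adain(features, scales, biases, eps=1e-5):
    """Adaptive Instance Normalization on a single sample.

    features: list of channels, each a flat list of spatial activations.
    scales, biases: one style parameter per channel (assumed to come
    from the texture code in some way not shown here).
    """
    out = []
    for channel, scale, bias in zip(features, scales, biases):
        mean = sum(channel) / len(channel)
        var = sum((v - mean) ** 2 for v in channel) / len(channel)
        std = math.sqrt(var + eps)
        # Normalize the channel, then apply the style statistics.
        out.append([scale * (v - mean) / std + bias for v in channel])
    return out


features = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 30.0, 30.0]]
styled = adain(features, scales=[2.0, 0.5], biases=[0.0, 1.0])
```

After the call, each output channel has approximately the mean and standard deviation given by its bias and scale, regardless of the statistics of the input channel.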

To ensure that the network learns accurate dense correspondence, the researchers apply two constraints during training: a texture swapping constraint, which enforces the same structure for images that share a structure code but have different texture codes, and a structure swapping constraint, which enforces similar texture for images that share a texture code but have different structure codes.
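
The two swapping constraints can be sketched as follows. Everything here is a toy stand-in: toy_generator, structure_distance and texture_distance are hypothetical helpers chosen so that the example runs without dependencies, and they are not the generator or the loss functions used in the paper.

```python
import math


def toy_generator(structure_code, texture_code):
    """Hypothetical disentangled generator: structure sets the per-pixel
    pattern, texture sets a global (positive) gain and an offset."""
    gain, offset = texture_code
    return [offset + gain * s for s in structure_code]


def _mean_std(image):
    mean = sum(image) / len(image)
    std = math.sqrt(sum((v - mean) ** 2 for v in image) / len(image))
    return mean, std


def structure_distance(img_a, img_b):
    """Toy structure metric: compare images after removing gain and offset."""
    def standardize(image):
        mean, std = _mean_std(image)
        return [(v - mean) / (std + 1e-8) for v in image]

    a, b = standardize(img_a), standardize(img_b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def texture_distance(img_a, img_b):
    """Toy texture metric: compare global statistics only."""
    (mean_a, std_a), (mean_b, std_b) = _mean_std(img_a), _mean_std(img_b)
    return (mean_a - mean_b) ** 2 + (std_a - std_b) ** 2


def texture_swap_loss(generator, structure_code, texture_a, texture_b):
    """Same structure code, different texture codes: structure should match."""
    return structure_distance(generator(structure_code, texture_a),
                              generator(structure_code, texture_b))


def structure_swap_loss(generator, texture_code, structure_a, structure_b):
    """Same texture code, different structure codes: texture should be similar."""
    return texture_distance(generator(structure_a, texture_code),
                            generator(structure_b, texture_code))


structure_1 = [0.0, 1.0, 1.0, 0.0]
structure_2 = [1.0, 0.0, 0.0, 1.0]
loss_t = texture_swap_loss(toy_generator, structure_1, (2.0, 0.5), (0.5, -1.0))
loss_s = structure_swap_loss(toy_generator, (2.0, 0.5), structure_1, structure_2)
```

Because the toy generator is disentangled by construction, both losses come out close to zero; a generator that leaked texture into structure (or the reverse) would be penalized by the corresponding term.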

The team evaluates models trained on the CelebAMask-HQ, Stanford Cars, and AFHQ-Cat datasets both quantitatively and qualitatively. A separate model is trained on each dataset, at a resolution of 512×512 for CelebAMask-HQ and 128×128 for the other two.

On the task of semantic segmentation label propagation, the proposed CoordGAN outperforms all baselines on all three datasets. The most closely related approach is Pix2Style2Pix, which also learns an encoder for a pre-trained StyleGAN2 model. Although the Pix2Style2Pix encoder features contain both structure and texture information, CoordGAN's correspondence maps, which store only structure information, achieve better label propagation performance. These findings indicate that CoordGAN learns more precise correspondence than the other approaches.
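
A minimal sketch of how label propagation can work with correspondence maps, assuming each query pixel copies the label of the reference pixel whose coordinate is nearest in the shared coordinate space. The function name propagate_labels and the brute-force nearest-neighbor search are illustrative choices, not the paper's evaluation code.

```python
def propagate_labels(ref_coords, ref_labels, query_coords):
    """Transfer segmentation labels from a reference image to a query image.

    ref_coords, query_coords: flat lists of (x, y) correspondence
    coordinates, one per pixel.
    ref_labels: one label per reference pixel.
    """
    def sq_dist(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    propagated = []
    for q in query_coords:
        # Brute-force search for the reference pixel with the closest coordinate.
        nearest = min(range(len(ref_coords)),
                      key=lambda i: sq_dist(ref_coords[i], q))
        propagated.append(ref_labels[nearest])
    return propagated


ref_coords = [(-1.0, 0.0), (0.0, 0.0), (1.0, 0.0)]
ref_labels = ["background", "eye", "background"]
query_coords = [(-0.9, 0.0), (-0.1, 0.1), (0.2, 0.0)]
query_labels = propagate_labels(ref_coords, ref_labels, query_coords)
```

The quality of the propagated labels then depends directly on how accurate the coordinates are, which is why this task serves as a proxy for correspondence quality.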

Conclusion

The researchers' study demonstrates that dense correspondence can emerge automatically from GANs. They propose CoordGAN, a new disentangled GAN model that generates dense correspondence maps represented in a novel coordinate space. It is complemented by a GAN inversion encoder, which makes it possible to generate dense correspondence for real images. A possible future extension of this work is to learn a 3D UV coordinate map, instead of a 2D map, to represent the underlying structure.

Paper: https://arxiv.org/pdf/2203.16521.pdf

Project: https://jitengmu.github.io/CoordGAN/

Video: https://www.youtube.com/watch?v=FP27huY0Yu0

Slides: https://jitengmu.github.io/CoordGAN/static/images/slides.pdf