Microsoft Researchers Unlock New Avenues In Image-Generation Research With Manifold Matching Via Metric Learning


By developing fresh images, generative image models provide a distinct value. These photos could be clear super-resolution copies of current images or even manufactured shots that look realistic. The framework of training two networks against each other has shown pioneering success with Generative Adversarial Networks (GANs) and their variants: a generator network learns to generate realistic fake data that can fool a discriminator network, and the discriminator network learns to correctly tell apart the generated counterfeit data from the actual data.

The research community must address two issues in order to use the most recent advances in computer vision for GANs. First, rather than using geometric metrics, GANs model data distributions using statistical measures such as the mean and moments. Second, in classic GANs, the discriminator network loss is solely represented as a 1D scalar value corresponding to the Euclidean distance between the genuine and fake data distributions. The research community has been unable to utilize breakthrough metric learning methods directly or experiment with novel loss functions and training strategies to continue to develop generative models due to these two limitations.

Microsoft researchers offer a novel framework for generative models called Manifold Matching via Metric Learning in a recent paper titled “Manifold Matching via Deep Metric Learning for Generative Modeling” (MvM). Two networks are trained against each other in the MvM framework. For the distribution generator network’s manifold matching objective, the metric generator network learns to define a better metric. The distribution generator network learns to provide more hard negative samples for the metric generator network’s metric learning objective. MvM creates a distribution generator network that can generate fake data distributions that are very close to the accurate data distributions, as well as a metric generator network that can provide an effective metric for capturing the data distribution’s internal geometric structure using adversarial training. In October, this research was accepted at the International Conference on Computer Vision (ICCV 2021).

One exciting feature of MvM is that it represents the performance of its networks using a learned intrinsic metric rather than the conventional Euclidean distance. This opens the door for the research community to use the most recent advances in metric learning to the subject of generative training models.


MvM is also easier to understand than GANs. The objective function of GANs is a single min-max value function. The training loss calculated with this objective function decreases as the discriminator network improves and increases as the generator network improves. As a result, the training loss varies as the training advances, leaving a human interpreter with no discernible indicator of the networks’ behavior or performance.

A human interpreter is now compelled to print created graphics and make qualitative choices to extract this information. MvM, on the other hand, employs two distinct objective functions. The first objective function computes the learning loss metric, which fluctuates up and down to show that the two networks are learning adversarially against each other, as desired. The manifold matching loss is calculated by the other objective function. As the created fake data distribution becomes more similar to the genuine data distribution, this goal function lowers monotonically across training epochs. This means that based on the value of the manifold matching loss, a human interpretation can derive quantitative inferences.

Finally, unlike GANs, MvM generates a multi-dimensional image representation. This allows researchers to test new and existing training frameworks and methodologies, such as unsupervised representation learning and other metric learning techniques. This is likely to hasten the development of generative models and open up hitherto unimaginable research opportunities.

MvM was used to assess the framework’s effectiveness and versatility by applying it to two common picture generating tasks: unsupervised image generation and image super-resolution. A StyleGAN2 architecture was trained for the unsupervised picture production task. Big images (512 pixels by 512 pixels) were generated using the Flickr-Faces-HQ (FFHQ) dataset—a prominent benchmark for proving the efficacy of image generation models. The results show that MvM is capable of producing visuals that closely reflect the dataset’s actual data.

For the image super-resolution task, researchers trained three different generator backbones, ResNet, RDN, and NSRNet, using GAN and MvM. Unlike the results from GANs that have a grid-like result, MvM leads to a line-like result that is closer to ground truth. It was qualitatively observed that the generator network trained with MvM surpasses the one trained with GAN in reconstructing fine details like outlines without inaccurate grid-like artifacts.