In The Latest AI Research, CMU And Adobe Researchers Propose An Elegant Emsembling Mechanism For GAN Training That Improves FID by 1.5x to 2x On The Given Dataset

This Article Is Based On The Research Paper 'Ensembling Off-the-shelf Models for GAN Training'. All Credit For This Research Goes To The Researchers of This Project 👏👏👏

Please Don't Forget To Join Our ML Subreddit

Image generation necessitates the ability to capture and model complicated statistics in real-world visual events. When trained on large-scale data, computer vision models have shown adept at capturing valuable representations, thanks to the effectiveness of supervised and self-supervised learning techniques.

Surprisingly, despite the aforementioned link between synthesis and analysis, state-of-the-art generative adversarial networks (GANs) are trained without the use of such pre-trained networks in an unsupervised way. This is a squandered opportunity to investigate, given the abundance of relevant models readily available in the research ecosystem.

Adobe researchers investigated the usage of a collection of pretrained deep feature extractors to aid in generative model training in a recent publication. GANs are specifically taught with a discriminator and a generator, which are both targeted at continuously learning the relevant statistics that distinguish genuinely and produced data. 

Using such robust, pre-trained networks as a discriminator inadvertently leads to overfitting and overpowering the generator, especially in small data sets. When applied with the original, learned discriminator, the researchers found that freezing the pretrained network provided stable training. Enabling the generator to reflect the true distribution in diverse, complementary feature areas was also promoted by combining multiple pretrained networks.

The team recommended using an autonomous model selection technique based on the linear separability of genuine and fake images in the feature space and gradually adding supervision from a set of available pretrained networks to determine which networks function best. To further stabilize the model training and avoid overfitting, they used label smoothing and differentiable augmentation.

To demonstrate the method’s usefulness, the researchers ran tests on different datasets in both small and large-scale sample settings. On many datasets, researchers enhanced the state-of-the-art. With only 10k samples, the team was able to replicate the results of StyleGAN2 trained on the entire dataset (1.6M photos). The researchers also displayed the learned models’ internal representation as well as training dynamics.


While the method considerably increases the quality of generated photographs, it does have some drawbacks, particularly when working with limited data. The method raises the training memory demand. The researchers believe that using efficient computer vision models to investigate the technology could make it more accessible. Second, in low-shot circumstances where only a dozen samples are available, the model selection technique is ineffective. In the future, researchers hope to use few-shot learning approaches in these scenarios. 

As more self-supervised and supervised computer vision models become accessible, they should be leveraged for generative modeling to their full potential. The researchers expect that by transferring knowledge from large-scale representation learning, their study can help to improve generative modeling.