There has long been a desire to learn representations of visual data that allow for deeper comprehension. Early methods used generative pretraining to initialize deep networks for subsequent recognition tasks, including deep belief networks and denoising autoencoders. Since generative models can synthesize new samples by approximately modeling the data distribution, it is intuitive that, in Feynman's tradition ("What I cannot create, I do not understand"), such modeling should eventually arrive at a semantic grasp of the underlying visual data, which is what recognition tasks require.
Consistent with this view, generative language models such as Generative Pre-trained Transformers (GPTs) thrive as both few-shot learners and pre-trained base models by acquiring a deep comprehension of language and a vast knowledge base. Recent attempts at generative pretraining in vision, however, have lagged behind. For instance, GAN-based BiGAN and auto-regressive iGPT significantly underperform their contemporaneous contrastive counterparts while using roughly ten times more parameters. The difficulty stems in part from a divergence of focus: generative models must allocate capacity to low-level, high-frequency details, whereas recognition models primarily concentrate on the high-level, low-frequency structure of images.
Given this disparity, it remains unclear whether, and how, generative pretraining, despite its intuitive appeal, can compete with other self-supervised algorithms on downstream recognition tasks. Meanwhile, denoising diffusion models have recently come to dominate image generation. These models follow a simple recipe of repeatedly refining noisy data (Figure 1). The resulting images are of astoundingly high quality, and the models can also produce a wide variety of unique samples. In light of this progress, the researchers revisit the potential of generative pretraining in the setting of diffusion models. As a first step, they directly finetune a pre-trained diffusion model on ImageNet classification.
Despite its superior performance on unconditional image generation, the pre-trained diffusion model underperforms concurrent self-supervised pretraining algorithms such as Masked Autoencoders (MAE), and only marginally improves classification compared to training the same architecture from scratch. Drawing inspiration from MAE, researchers from Meta, Johns Hopkins University, and UCSC therefore incorporate masking into diffusion models, recasting diffusion models as masked autoencoders (DiffMAE). They formulate the masked prediction task as a conditional generative objective: estimating the pixel distribution of the masked region conditioned on the visible region. MAE itself achieves strong recognition performance by learning to regress the pixels of masked patches given the other, visible patches.
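The conditional generative objective described above can be sketched as a toy NumPy mock-up: mask most patches, corrupt only the masked ones with diffusion-style noise, and train a denoiser to recover their clean pixels conditioned on the visible patches. The "encoder"/"decoder" here are hypothetical linear stand-ins, not the authors' transformer architecture.

```python
# Toy sketch of a DiffMAE-style masked-prediction objective (illustrative only;
# the real model uses a ViT encoder/decoder and a proper diffusion schedule).
import numpy as np

rng = np.random.default_rng(0)
num_patches, patch_dim, mask_ratio = 16, 8, 0.75

x = rng.standard_normal((num_patches, patch_dim))      # patchified "image"
perm = rng.permutation(num_patches)
n_vis = int(num_patches * (1 - mask_ratio))
vis_idx, mask_idx = perm[:n_vis], perm[n_vis:]

# Hypothetical stand-in "encoder"/"decoder" weights.
W_enc = rng.standard_normal((patch_dim, patch_dim)) * 0.1
W_dec = rng.standard_normal((2 * patch_dim, patch_dim)) * 0.1

# Diffusion-style corruption: noise ONLY the masked patches at level t.
t = 0.5
noise = rng.standard_normal(x[mask_idx].shape)
noisy_masked = np.sqrt(1 - t) * x[mask_idx] + np.sqrt(t) * noise

# Condition the denoiser on (an embedding of) the clean visible patches.
cond = (x[vis_idx] @ W_enc).mean(axis=0)
inp = np.concatenate([noisy_masked, np.tile(cond, (len(mask_idx), 1))], axis=1)
pred = inp @ W_dec

# Conditional generative goal: regress the clean pixels of the masked region.
loss = np.mean((pred - x[mask_idx]) ** 2)
print(round(float(loss), 3))
```

The key difference from plain MAE is that the decoder's input for the masked region is a *noisy version of the target* at a sampled noise level, rather than a learned mask token, which turns reconstruction into denoising.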
Using the MAE framework, they train models with their diffusion technique without incurring any additional training cost. During pretraining, the model learns to denoise the input at various noise levels, acquiring a representation that is powerful for both recognition and generation. They evaluate the pre-trained model by finetuning on downstream recognition tasks, and on image inpainting, where the model generates samples by iteratively unrolling the denoising process from random Gaussian noise. Thanks to its diffusion nature, DiffMAE can generate intricate visual details such as objects, whereas MAE is known to yield blurry reconstructions lacking high-frequency components. Moreover, DiffMAE performs well on tasks requiring both image and video recognition.
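The inpainting inference described above, iteratively unrolling from random Gaussian noise, can be illustrated with a toy loop: initialize the masked region from noise, keep the visible pixels fixed, and repeatedly refine. The "denoiser" here is a hypothetical stand-in that pulls samples toward a simple statistic of the visible region; the real model is the pretrained network.

```python
# Toy sketch of diffusion-style inpainting inference (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
image = rng.standard_normal(64)
mask = np.zeros(64, dtype=bool)
mask[:16] = True                                   # region to inpaint

sample = image.copy()
sample[mask] = rng.standard_normal(mask.sum())     # init masked region from noise

# Hypothetical "denoiser" target: the mean of the visible region.
target = image[~mask].mean()

for _ in range(50):
    # One refinement step: nudge the masked region toward the denoiser's estimate,
    # leaving the visible region untouched.
    sample[mask] += 0.1 * (target - sample[mask])

print(np.allclose(sample[mask], target, atol=0.05))
```

The structure, fixed visible pixels plus an iteratively refined masked region, mirrors the conditional sampling loop; the actual per-step update in a diffusion sampler also re-injects scheduled noise, which is omitted here for brevity.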
In this work, they make the following observations:
(i) DiffMAE achieves performance comparable to leading self-supervised learning algorithms that focus on recognition, making it a strong pretraining method for finetuning on downstream recognition tasks. When paired with features from CLIP, DiffMAE can even outperform recent work that combines MAE and CLIP.
(ii) DiffMAE can generate high-quality images from masked input. In particular, DiffMAE generations look more semantically meaningful and beat leading inpainting techniques in quantitative performance.
(iii) DiffMAE adapts easily to the video domain, offering high-quality inpainting and state-of-the-art recognition accuracy that surpasses recent efforts.
(iv) They demonstrate a connection between MAE and diffusion models: MAE effectively performs the first step of diffusion's inference process. In other words, they believe MAE's strong recognition performance is consistent with this generative interpretation. They also conduct a thorough empirical analysis clarifying the advantages and disadvantages of their design decisions on downstream recognition and inpainting generation tasks.
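The MAE/diffusion connection in point (iv) can be caricatured in a few lines: MAE makes one direct prediction of the clean pixels from fully masked input, which corresponds to the first, highest-noise step of a diffusion sampler, after which diffusion takes further refinement steps. The `denoise` function below is a deliberately idealized stand-in (it has oracle access to the clean signal), used only to show the correspondence of steps.

```python
# Toy view of "MAE == first diffusion inference step" (illustrative only).
import numpy as np

rng = np.random.default_rng(2)
clean = rng.standard_normal(8)

def denoise(noisy, t):
    # Idealized denoiser: blends the input toward the clean signal by strength t.
    return (1 - t) * noisy + t * clean

# MAE-style: one direct prediction from maximally uninformative input (t = 1).
mae_pred = denoise(rng.standard_normal(8), t=1.0)

# Diffusion-style: the FIRST inference step makes the same jump, then later
# steps refine the estimate further.
x = rng.standard_normal(8)
x = denoise(x, t=1.0)          # first step corresponds to MAE's one-shot output
for t in (0.5, 0.25):          # subsequent refinement steps
    x = denoise(x, t)

print(np.allclose(mae_pred, clean))
```

With a real learned denoiser, the first step produces only a coarse estimate (hence MAE's blurry reconstructions), and the later steps are what restore the high-frequency detail that DiffMAE retains.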
Check out the Paper and Project. All credit for this research goes to the researchers on this project.
Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence from the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest lies in image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.