Meta AI Introduces ‘Make-A-Scene’: A Deep Generative Technique Based On An Autoregressive Transformer For Text-To-Image Synthesis With Human Priors

This Article Is Based On The Research Paper 'Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors'. All Credit For This Research Goes To The Researchers 👏👏👏


In recent years, research on text-to-image generation has grown exponentially. Nevertheless, current methods still lack at least three essential characteristics. First, most models accept only text as input. This is a massive limitation, as controllability is restricted to attributes such as style or color and cannot be extended to structure or form, for example. The second limitation is related to human perception: the ultimate aim of these models is to match human perception and attention, but, in reality, the generation process does not include any relevant prior knowledge about it. For example, the losses that control generation are usually applied to the whole image, without a specific focus on parts fundamental to human perception (such as human faces, animals, or salient objects). The last missing characteristic is the ever-present problem of quality and resolution, as most works are limited to an output resolution of 256×256 pixels.

For these reasons, the Meta AI (Facebook AI) team has introduced Make-A-Scene, a novel method that successfully tackles these three gaps while attaining state-of-the-art results for text-to-image generation. The proposed model essentially consists of three encoders that produce discrete tokens, an autoregressive transformer that learns to generate token sequences conditioned on the scene segmentation, and a decoder that generates images from the transformer-generated sequence. It is important to note that the network does not use the segmented scene for computing the loss; thus, the segmentation is not necessary at inference time. The model is summarized in the figure below.


Scene representation

Three maps describe the segmented scene: panoptic, human, and face (each of these groups contains more than one class; e.g., the human group is divided into different body parts). Through these maps, the network learns to condition the final generated image. However, this conditioning is implicit, since the network could decide to discard the scene information and generate the image from the text alone. To tackle this, the authors utilized VQ-SEG, a variation of VQ-VAE for semantic segmentation. Its input and output are the ground-truth and reconstructed segmented scenes with m channels, where m is the sum of all the categories across the three groups, plus one map of the edges that separate the different groups as well as the instances within the same group. During inference, VQ-SEG takes the semantic map (if present) as input and compresses it into the scene tokens.
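To make the m-channel layout concrete, here is a minimal sketch of assembling such a scene tensor: one-hot channels for each of the three groups plus a single edge channel. The category counts, map shapes, and the neighbour-difference edge heuristic are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

def build_scene_tensor(panoptic, human, face, n_pan, n_hum, n_face):
    """Stack one-hot channels for the three segmentation groups
    (panoptic, human, face) plus one edge channel, mimicking the
    m-channel scene input/output described for VQ-SEG (illustrative)."""
    h, w = panoptic.shape

    def one_hot(labels, n):
        oh = np.zeros((n, h, w), dtype=np.float32)
        for c in range(n):
            oh[c] = (labels == c)
        return oh

    # Edge map: 1 where a pixel's panoptic label differs from its
    # right or bottom neighbour (separates groups and instances).
    edges = np.zeros((h, w), dtype=np.float32)
    edges[:, :-1] = np.maximum(
        edges[:, :-1], (panoptic[:, :-1] != panoptic[:, 1:]).astype(np.float32))
    edges[:-1, :] = np.maximum(
        edges[:-1, :], (panoptic[:-1, :] != panoptic[1:, :]).astype(np.float32))

    # Resulting tensor has m = n_pan + n_hum + n_face channels plus the edge map.
    return np.concatenate([one_hot(panoptic, n_pan),
                           one_hot(human, n_hum),
                           one_hot(face, n_face),
                           edges[None]], axis=0)
```

In the actual model this tensor is what VQ-SEG compresses into discrete scene tokens; the sketch only shows how the channels could be laid out.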

Face-aware quantization

In addition, three losses were defined to introduce an explicit emphasis on particular regions in the VQ-IMG encoder. Before training VQ-IMG, faces are located using the semantic information from VQ-SEG. A perceptual loss then compares the ground-truth and reconstructed face crops by summing distances between their internal representations in a pre-trained face-embedding network. The same approach is used for a second loss, which compares the ground-truth crops and the objects detected by a pre-trained VGG network. The third loss is a binary cross-entropy face loss between the original and reconstructed scenes that emphasizes the different face parts, since a frequent shrinking of the semantic segments representing face parts was observed. VQ-IMG is used solely during training to generate the image tokens, while during inference, these tokens are generated by the transformer.
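The face-aware perceptual term can be sketched as follows. The pre-trained face-embedding network is abstracted away (only its per-layer activations appear as inputs), and the use of a mean L1 distance is an assumption for illustration, not the paper's exact formulation.

```python
import numpy as np

def perceptual_face_loss(real_feats, recon_feats):
    """Illustrative face-aware perceptual loss: sum the mean absolute
    differences between corresponding internal representations
    (per-layer activations) of the ground-truth and reconstructed face
    crops. `real_feats`/`recon_feats` are lists of activation arrays
    produced by a (hypothetical) pre-trained face-embedding network."""
    return float(sum(np.abs(r - g).mean()
                     for g, r in zip(real_feats, recon_feats)))
```

A perfect reconstruction yields zero loss, and the per-layer sum is what gives the "sum of internal representations" flavour described above.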

Scene-based transformer

During training, an autoregressive transformer based on GPT-3 is trained to predict the next token in the sequence formed by the tokens from the three encoders. During inference, given the encoded text and (optionally) the encoded segmentation map, the transformer is applied to predict the missing image tokens. The concatenation of the three encodings is then passed to the decoder to generate new samples.
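The inference step above can be sketched as a simple autoregressive loop over the concatenated token streams. Here `next_token_fn` is a hypothetical stand-in for the transformer's next-token prediction; the token values and stream ordering (text, then scene, then image) are illustrative.

```python
def generate_image_tokens(text_tokens, scene_tokens, next_token_fn, n_image_tokens):
    """Sketch of inference with the scene-based transformer: start from
    the concatenated text and (optional) scene tokens, then
    autoregressively append the missing image tokens one at a time.
    `next_token_fn(seq)` stands in for the trained transformer."""
    seq = list(text_tokens) + list(scene_tokens)  # conditioning prefix
    prefix_len = len(seq)
    for _ in range(n_image_tokens):
        seq.append(next_token_fn(seq))
    return seq[prefix_len:]  # image tokens, ready for the image decoder
```

Passing an empty `scene_tokens` list corresponds to the text-only case, since the segmentation is optional at inference time.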

In addition, instead of using a classifier to "guide" the samples toward a cluster belonging to one class, which tends to reduce the diversity of generated samples, some of the text inputs are replaced with blanks during training, representing the unconditioned samples. During inference, two token sequences are generated in parallel (one conditional and one unconditional) and processed together to generate the next token in the sequence.


This model achieved state-of-the-art FID and human-evaluation results (a group of human evaluators was asked to choose between images from two different models in terms of quality, photorealism, and text alignment), unlocking the ability to generate high-fidelity images at a resolution of 512×512 pixels. Through scene controllability, the authors were also able to generate out-of-distribution samples that classic architectures could not produce (figure below).


Finally, an excellent example of the controllability of this model is shown in this video, where the authors illustrate a children's story through the power of Make-A-Scene.