Meta AI Research Releases ‘Make-A-Scene’: A Novel Text-To-Image Method That Enables You To Create Images Using Text Prompts And Freeform Sketches

It is natural and convenient for humans to use linguistic description (text) to describe a visual scene. In the last decade, many Computer Vision(CV) and Natural Language Processing (NLP) tools have been created due to growing AI research. AI communities are increasingly interested in generating images based on text descriptions because it bridges the gap between these two fields. The cross-modal problem of text to image transformation and keeping the generated image semantically consistent with the given text makes text to image generation (T2I) very difficult.

Although many researchers have explored the possibility of converting between text and images, these methods are limited majorly because of less controllability and resolution quality. 

To fully achieve AI’s promise to promote creative expression, people must be able to influence and control the content a system produces. Users should be able to express themselves using any manner they prefer, including speech, text, gestures, and even drawings, and it should be easy to use and intuitive.

Meta AI recently released a new method called Make-A-Scene, which shows how AI can enable anyone to bring their imaginations to life through computer-generated imagery. It allows the users to express their vision through both text and freeform sketches by putting creative control in their hands.

Text descriptions were used as input for previous state-of-the-art AI systems that generated images. The researchers state that it is more difficult to predict the composition of images generated by text prompts. For instance, take the text “a painting of a zebra riding on a bike .”The Zebra could be on the left or right of the image; it could be much larger or smaller than the bicycle, or it could face the camera or face away from it. Therefore lack of information and understanding of human perception can result in the image not reflecting a person’s creative voice.

This is no longer the case, thanks to Make-A-Scene. Various elements, forms, arrangements, depth, compositions, and structures are used to demonstrate how people can convey their vision with greater specificity by using text and simple drawings.

Make-A-Scene trains on millions of examples of images to learn the relationship between visuals and text. The team used publicly available datasets to train Make-A-Scene so that the wider AI community could analyze, study, and understand the system’s existing biases.


To enable detailed sketches as input, Make-A-Scene employs an innovative intermediate representation that captures the scene layout. If the designer like, it can also construct a scene layout using solely text-based cues. The concept focuses on identifying significant elements of the imagery, such as objects or animals, that are more likely to be significant to the creator. The widely used FID score, which rates the quality of images produced by generative models, indicated that this technique positively impacted the generation quality.

Along with controllability, the team demonstrates new capabilities provided by this method, including:

  1. Complex scene generation
  2. Generation outside distribution
  3. Editing of scenes
  4. Text editing in conjunction with anchored scenes.

The team tested their method with the help of human reviewers. Make-A-Scene generated two images for each participant: one created using only a text prompt and the other using both a sketch and a text prompt. The segmentation map of a public dataset was used as the sketch for the latter. The caption of the corresponding image was used as the text input for both applications. 

Their findings show that nearly always (99.54 percent of the time), the image created from text and the original sketch was judged as being better aligned. Additionally, it was more frequently (66.3 percent of the time) in line with the written request. This illustrates that Make-A-Scene generations stick closely to the vision conveyed in the sketch.

Make-A-Scene and other projects like it show the possibility of expanding the boundaries of creative expression — regardless of artistic ability — through scientific research and experimentation. With this work, people can easily bring their visions to life in the real and virtual worlds.

Currently, the team is testing and gathering feedback from the Meta employees who have been granted limited access to Make-A-Scene. The team plans to provide broader access to their demons to the wider community soon. 

This Article is written as a summary article by Marktechpost Staff based on the research paper 'Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors'. All Credit For This Research Goes To Researchers on This Project. Checkout the paper and reference article.

Please Don't Forget To Join Our ML Subreddit
🐝 Join the Fastest Growing AI Research Newsletter Read by Researchers from Google + NVIDIA + Meta + Stanford + MIT + Microsoft and many others...