Here is Another Breakthrough in Text-to-Image Synthesis, Called StoryDALL-E, Which Adapts Pretrained Text-to-Image Transformers for Story Continuation

Text-to-image synthesis models such as DALL-E have demonstrated an extraordinary capacity to turn an input caption into a coherent image. Several recent techniques have also used strong multimodal models to create artistic renderings of input captions, demonstrating their potential to democratize art. However, these models are designed to take only a single, brief caption as input. Many text-to-image synthesis use cases require models to handle long narratives and metaphorical phrases, condition on existing visuals, and generate more than one image in order to capture the full meaning of the input text. Several works have already built specialized Generative Adversarial Network (GAN) models for tasks such as image-to-image translation and style transfer.

Story visualization is a challenging endeavor that combines image generation and story comprehension. However, the recent introduction of large pretrained transformer-based models opens up the possibility of leveraging latent knowledge from large-scale pretraining datasets for such specialized tasks, in a paradigm similar to finetuning pretrained language models for downstream language-understanding tasks. In this study, the researchers therefore investigate approaches for adapting a pretrained text-to-image synthesis model to complex downstream applications, with an emphasis on story visualization. Story visualization methods turn a sequence of captions into a sequence of images that depict the story.

While previous work on story visualization has highlighted potential uses, the task poses specific challenges when applied to real-world scenarios. An agent must generate a corresponding sequence of images that depicts the contents of the set of captions comprising a story. The model is restricted to the fixed set of characters, locations, and events it has been trained on. It does not know how to portray a new character that appears in a caption at test time, and captions alone do not carry enough information to adequately describe the character's appearance. For the model to generalize to new story elements, it must therefore include a mechanism for gathering additional information about how these elements should be visually portrayed. To this end, the researchers make story visualization more suitable for such use cases by introducing a new task called story continuation.

In this work, they present a starting scenario that reflects real-world use cases. They introduce DiDeMoSV, a new story visualization dataset, and adapt two existing story visualization datasets, PororoSV and FlintstonesSV, to the story continuation setting. By providing an initial scene as input, the model can copy and adapt elements from that scene as it generates subsequent images (see figure below). This shifts the focus away from text-to-image generation, an already heavily studied problem, and toward the narrative structure of a sequence of images, such as how an image should evolve to reflect new narrative content in the captions.

Predictions from the mega-StoryDALL-E model on the (A) PororoSV, (B) FlintstonesSV, and (C) DiDeMoSV story continuation datasets. The initial frame supplied as extra input to the model is called the source frame.

To adapt a text-to-image synthesis model to the story continuation task, they finetune a pretrained model (such as DALL-E) on a sequential text-to-image generation task with the extra flexibility to copy from a prior input. To do this, they first retrofit the model with additional cross-attention layers that copy relevant content from the initial scene. Then, during the generation of each frame, they incorporate a self-attention block that builds story embeddings providing a global semantic context of the story. The model is finetuned on the story continuation task, and these extra modules are learned from scratch. They call their technique StoryDALL-E and compare it to a GAN-based model, StoryGANc, for story continuation.
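The two added modules can be illustrated with a minimal PyTorch sketch. This is a simplified illustration, not the authors' implementation: the module names (`RetroCrossAttention`, `StoryEmbedding`), dimensions, and pooling choices are hypothetical, and only the general mechanism (cross-attention over source-frame tokens for copying, self-attention over caption embeddings for global story context) follows the description above.

```python
import torch
import torch.nn as nn


class RetroCrossAttention(nn.Module):
    """Hypothetical sketch of a retrofitted copy module: tokens of the
    frame being generated attend over the encoded source-frame tokens,
    letting the model reuse visual elements from the initial scene."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, frame_tokens, source_tokens):
        # Queries come from the current frame; keys/values from the
        # source frame, so attention "copies" source content.
        copied, _ = self.attn(frame_tokens, source_tokens, source_tokens)
        return self.norm(frame_tokens + copied)  # residual connection


class StoryEmbedding(nn.Module):
    """Hypothetical sketch of the story-embedding block: self-attention
    over all caption embeddings yields a global semantic context vector
    that can condition the generation of each frame."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, caption_embeds):
        # caption_embeds: (batch, num_frames, d_model)
        ctx, _ = self.attn(caption_embeds, caption_embeds, caption_embeds)
        return ctx.mean(dim=1)  # one global story vector per example
```

In this sketch the copy module preserves the frame-token shape, so it can be inserted between existing transformer layers without altering the rest of the pretrained network.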

They also investigate a parameter-efficient prompt-tuning variant, prepending a prompt composed of task-specific embeddings to steer the pretrained model toward generating visuals for the target domain. In this prompt-tuning version of the model, the pretrained weights are frozen during training and only the new parameters are learned from scratch, saving time and memory. The results suggest that the retrofitting strategy in StoryDALL-E effectively exploits DALL-E's pretrained latent knowledge for the story continuation task, outperforming the GAN-based model on various criteria. Furthermore, they find that the copy mechanism enables improved generation in low-resource settings and of unseen characters at inference time.
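The prompt-tuning idea can likewise be sketched in a few lines of PyTorch. This is an illustrative sketch, not the paper's code: the class name `PromptTunedModel`, the prompt length, and the backbone used in the usage note are assumptions; it only demonstrates the core recipe described above, i.e., freezing the pretrained weights and learning a small set of task-specific prompt embeddings prepended to the input.

```python
import torch
import torch.nn as nn


class PromptTunedModel(nn.Module):
    """Hypothetical prompt-tuning wrapper: the pretrained backbone is
    frozen and only a small learnable prompt is trained."""

    def __init__(self, pretrained: nn.Module, d_model=512, prompt_len=8):
        super().__init__()
        self.pretrained = pretrained
        for p in self.pretrained.parameters():
            p.requires_grad = False  # freeze all pretrained weights
        # Task-specific prompt embeddings, learned from scratch.
        self.prompt = nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)

    def forward(self, token_embeds):
        # token_embeds: (batch, seq_len, d_model)
        b = token_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(b, -1, -1)
        # Prepend the prompt, then run the frozen backbone.
        return self.pretrained(torch.cat([prompt, token_embeds], dim=1))
```

With a generic transformer backbone, e.g. `nn.TransformerEncoder(nn.TransformerEncoderLayer(512, 8, batch_first=True), num_layers=1)`, the only trainable parameter of the wrapped model is the prompt, which is what makes the approach memory- and time-efficient.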

In summary, they present a novel story continuation dataset and introduce the task of story continuation, which is more closely aligned with real-world downstream applications of story visualization.

  • They propose StoryDALL-E, a retrofitted adaptation of pretrained transformers for story continuation, and develop StoryGANc as a strong GAN baseline for comparison.
  • They conduct comparative experiments and ablations to demonstrate that finetuned StoryDALL-E outperforms StoryGANc on three story continuation datasets across several metrics.
  • Their analysis demonstrates that the copy mechanism increases the correlation of the generated images with the source frame, resulting in improved visual story continuity and better generation of low-resource and unseen characters.

The code implementation in PyTorch is freely available on GitHub.

This article is written as a research summary by Marktechpost Staff based on the research paper 'StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation'. All credit for this research goes to the researchers on this project. Check out the paper and GitHub link.
