Generative AI has come a long way in recent years. We are all familiar with ChatGPT, diffusion models, and more at this point, and these tools are becoming increasingly integrated into our daily lives. We now use ChatGPT as an assistant for everyday tasks, MidJourney to support the design process, and a growing number of other AI tools to ease our routine work.
The advancement of generative AI models has enabled unique use cases that were difficult to achieve previously. We have seen someone write and illustrate an entire children's book using generative AI models. This was a great example of how generative AI can revolutionize storytelling methods that have remained largely unchanged for ages.
Visual storytelling is a powerful method of conveying narrative content effectively to diverse audiences. Its applications in education and entertainment, such as children's books, are vast. We know that we can generate stories and illustrations separately using generative AI models, but can we actually use them to generate a visual story consistently? The question then becomes: given a story in plain text and the portrait images of a few characters, can we generate a series of images to express the story visually?
To have an accurate visual representation of a narrative, story visualization must meet several vital requirements. Firstly, maintaining identity consistency is crucial to depict characters and environments consistently throughout the frames or scenes. Secondly, the visual content should closely align with the textual narrative, accurately representing the events and interactions described in the story. Lastly, a clear and logical layout of objects and characters within the generated images aids in seamlessly guiding the viewer’s attention through the narrative, facilitating understanding.
Several story visualization methods have been proposed using generative AI. Early work relied on GAN- or VAE-based methods and text encoders to project text into a latent space, generating images conditioned on the textual input. While these approaches demonstrated promise, they struggled to generalize to new actors, scenes, and layout arrangements. Recent attempts at zero-shot story visualization investigated the potential of adapting to new characters and scenes using pre-trained models. However, these methods lacked support for multiple characters and did not consider the importance of layout and local object structures within the generated images.
So, should we just give up on having an AI-based story visualization system? Are these limitations too difficult to tackle? Of course not! Time to meet TaleCrafter.
TaleCrafter is a novel and versatile interactive story visualization system that overcomes the limitations of previous approaches. The system consists of four key components: story-to-prompt generation (S2P), text-to-layout generation (T2L), controllable text-to-image generation (C-T2I), and image-to-video animation (I2V).
These components work together to address the requirements of a story visualization system. The story-to-prompt generation (S2P) component leverages a large language model to generate prompts that depict the visual content of images based on instructions derived from the story. The text-to-layout generation (T2L) component uses the generated prompt to produce an image layout that offers location guidance for the main subjects. Then, the controllable text-to-image generation (C-T2I) module, the core component of the visualization system, renders images conditioned on the layout, local sketch, and prompt. Finally, the image-to-video animation (I2V) component enriches the visualization process by animating the generated images, providing a more vivid and engaging presentation of the story.
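To make the data flow between the four stages concrete, here is a minimal Python sketch of how such a pipeline could be wired together. All function names, signatures, and placeholder heuristics below are illustrative assumptions, not TaleCrafter's actual API: each stage is stubbed out where the paper would invoke an LLM or a diffusion model.

```python
from dataclasses import dataclass

@dataclass
class Layout:
    # Bounding boxes for main subjects: (subject, x, y, w, h) in relative coordinates.
    boxes: list

def story_to_prompts(story: str) -> list:
    """S2P (stub): split a story into per-scene image prompts.
    In the real system, a large language model would generate these."""
    return [sentence.strip() for sentence in story.split(".") if sentence.strip()]

def text_to_layout(prompt: str) -> Layout:
    """T2L (stub): predict a coarse layout for the main subject.
    Here we just center a single box as a placeholder."""
    subject = prompt.split()[0]
    return Layout(boxes=[(subject, 0.25, 0.25, 0.5, 0.5)])

def controllable_t2i(prompt: str, layout: Layout) -> dict:
    """C-T2I (stub): render an image conditioned on prompt and layout.
    A real implementation would call a layout-conditioned diffusion model."""
    return {"prompt": prompt, "boxes": layout.boxes}

def image_to_video(frames: list) -> list:
    """I2V (stub): animate each generated still frame."""
    return [{"animated": True, **frame} for frame in frames]

def visualize_story(story: str) -> list:
    """Run the full S2P -> T2L -> C-T2I -> I2V chain over a plain-text story."""
    prompts = story_to_prompts(story)
    frames = [controllable_t2i(p, text_to_layout(p)) for p in prompts]
    return image_to_video(frames)
```

For example, `visualize_story("A fox finds a lantern. The fox lights the forest.")` yields one animated frame per sentence, each carrying its prompt and layout boxes, mirroring how each component's output feeds the next stage.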
Overview of TaleCrafter. Source: https://arxiv.org/pdf/2305.18247.pdf
TaleCrafter‘s main contributions lie in two key aspects. Firstly, the proposed story visualization system leverages large language and pre-trained text-to-image (T2I) models to generate a video from plain text stories. This versatile system can handle multiple novel characters and scenes, overcoming the limitations of previous approaches that were limited to specific datasets. Secondly, the controllable text-to-image generation module (C-T2I) emphasizes identity preservation for multiple characters and provides control over layout and local object structures, enabling interactive editing and customization.
Check out the Paper and GitHub link.
Ekrem Çetinkaya received his B.Sc. in 2018, and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis about image denoising using deep convolutional networks. He received his Ph.D. degree in 2023 from the University of Klagenfurt, Austria, with his dissertation titled "Video Coding Enhancements for HTTP Adaptive Streaming Using Machine Learning." His research interests include deep learning, computer vision, video encoding, and multimedia networking.