Researchers from Peking University and Microsoft Introduce COLE: An Effective Hierarchical Generation Framework that can Convert a Simple Intention Prompt into a High-Quality Graphic Design

Natural picture production is now on par with professional photography, thanks to a notable recent improvement in quality. This advancement is attributable to creating technologies like DALL路E3, SDXL, and Imagen. Key elements driving these developments are using the potent Large Language Model (LLM) as a text encoder, scaling up training datasets, increasing model complexity, better sampling strategy design, and improving data quality. The research team feels that now is the right time to focus on developing a more professional image, especially in graphic design, given its crucial functions in branding, marketing, and advertising. 

As a professional field, graphic design uses the power of visual communication to communicate clearly defined messages to certain social groups. It’s a field that demands imagination, ingenuity, and quick thinking. In graphic design, text and visuals are typically combined using digital or manual methods to create visually engaging stories. Its main objective is to organize data, provide meaning to concepts, and provide expression and emotion to objects that document human experiences. The creative use of typeface, text arrangement, ornamentation, and images in graphic design frequently allows ideas, feelings, and attitudes that cannot be expressed by words alone. Producing top-notch designs requires high imagination, ingenuity, and lateral thinking. 

According to the current study, the ground-breaking DALL路E3 has remarkable skills in producing high-quality design pictures, distinguished by visually arresting layouts and graphics, as seen in Figure 1. These pictures do not, however, come without shortcomings. Their ongoing struggles include misrendered visual text, which frequently leaves off or adds additional characters (a condition also noted in ). Moreover, because these created pictures are essentially uneditable, modifying them requires intricate procedures like segmentation, erasing, and inpainting. The requirement that users supply comprehensive text prompts is another significant constraint. Creating good prompts for visual design production usually requires a high level of professional skill. 

Figure 1 uses the DESIGNERINTENTION to illustrate the design pictures produced by DALL路E3 (augmented with GPT-4).

As Figure 2 illustrates, unlike DALL路E3, their COLE system can produce excellent quality graphic design graphics with only a basic requirement for user purposes. According to the research team, these three restrictions seriously impair the quality of graphic design pictures. A high-quality, scalable visual design generating system should ideally give a flexible editing area, generate accurate and high-quality typographic information for various uses, and demand low effort from users. Users could use human skills as needed to enhance the result further. This effort aims to establish a stable and effective autonomous text-to-design system that can produce excellent graphic design pictures from user intent prompts. 

Figure 2: A visual representation of the pictures produced by the COLE system is shown above. Interestingly, the only input our system receives is a textual intention description. The rest of the elements text, design graphics, and related typographic properties like font type, size, and position are all independently produced by the intelligent system.聽

The research team from Microsoft Research Asia and  Peking University propose COLE, a hierarchical generating approach to simplify the intricate process of creating graphic design images. Several specialized generation models, each intended to tackle a distinct sub-task, are involved in this process. 

First and foremost, the emphasis is on imaginative design and interpretation, primarily on comprehending intentions. This is achieved by using cutting-edge LLMs, namely the Llama2-13B, and optimizing it using a large dataset of almost 100,000 curated intention-JSON pairings. Important design-related information, including textual descriptions, item captions, and backdrop captions, are included in the JSON file. The research team also offers optional parameters for additional purposes, such as object location. 

Second, they focus on the arrangement and improvement of visuals, which includes two subtasks: the production of visual components and typographic features. Creating various visual features entails fine-tuning specialized cascaded diffusion models such as DeepFloyd/IF. These models are built in a way that guarantees a smooth transition between components, such as the layered object images and the adorned backdrop. The research team then predicts the typography JSON file using a typography Large Multimodal Model (LMM) constructed using LLaVA-1.5-13B. This uses the predicted JSON file from the Design LLM, the projected backdrop picture from a diffusion model, and the expected object image from a cascaded diffusion model. A visual renderer then assembles these components using the layout found in the anticipated JSON file. 

Third, quality assurance and comments are provided at the end of the process to improve the overall quality of the design. A reflection LMM must be painstakingly adjusted, and GPT-4V(ision) must be used for a comprehensive, multifaceted quality examination. This last stage makes tweaking the JSON file easier as needed, including changing the text box’s sizes and positions. Finally, the research team built a DESIGNERINTENTION, comprising approximately 200 professional graphic design intention prompts spanning various categories and about 20 creative ones, to assess the system’s capabilities. They then compared their approach to the state-of-the-art image generation system currently in use, carried out exhaustive ablation experiments for each generation model on various sub-tasks, provided a thorough analysis of the graphic designs produced by their system, and had a conversation about the drawbacks and potential future directions of graphic design image generation.


Check out the聽Paper and聽Project.聽All credit for this research goes to the researchers of this project. Also,聽don鈥檛 forget to join聽our 33k+ ML SubReddit,聽41k+ Facebook Community,Discord Channel,聽and聽Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

If you like our work, you will love our newsletter..

Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence from the Indian Institute of Technology(IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing and is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.

馃殌 LLMWare Launches SLIMs: Small Specialized Function-Calling Models for Multi-Step Automation [Check out all the models]