Generative Adversarial Networks (GANs) have enabled end-to-end trainable, photo-realistic text-to-image generation. Many methods also use intermediate scene graph representations to improve image synthesis. Approaches built on dialogue-based interaction let users issue instructions to refine and adjust generated scenes incrementally, giving them greater control, such as designating the relative positions of objects against the background. However, the language these methods accept is restricted, and the produced images are limited to synthetic 3D visualizations or cartoons.
To move beyond these restrictions, a team of Google researchers has developed a new framework, the Tag-Retrieve-Compose-Synthesize system (TReCS). The proposed method significantly enhances the image generation process by improving both how the language evokes image elements and how mouse traces inform their placement. The system aligns the mouse traces with the text descriptions and creates visual labels for the described phrases.
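The trace-to-text alignment can be pictured with a minimal sketch. The idea (all names and data here are illustrative, not the authors' implementation) is that each trace point carries a timestamp and each spoken word carries a time span, so a point is attributed to the word being uttered when it was drawn:

```python
# Toy sketch of aligning time-stamped mouse-trace points with words.
# Assumption: trace points are (timestamp, x, y) and each word in the
# narration has a (start, end) time span, as in time-synchronized
# narration data. Names are illustrative only.

def align_trace_to_words(trace_points, word_spans):
    """Attribute each trace point to the word spoken at its timestamp.

    trace_points: list of (t, x, y)
    word_spans:   list of (word, start, end)
    Returns a dict mapping each word to its list of (x, y) points.
    """
    aligned = {word: [] for word, _, _ in word_spans}
    for t, x, y in trace_points:
        for word, start, end in word_spans:
            if start <= t < end:
                aligned[word].append((x, y))
                break
    return aligned

# Example: two words narrated over two seconds, one trace point each.
spans = [("dog", 0.0, 1.0), ("grass", 1.0, 2.0)]
points = [(0.5, 1, 2), (1.5, 3, 4)]
# align_trace_to_words(points, spans)
# -> {"dog": [(1, 2)], "grass": [(3, 4)]}
```

This grouping is what lets each phrase inherit a spatial footprint from the trace, which the later retrieval and composition steps rely on.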
- The new framework leverages controllable mouse traces as fine-grained visual grounding to generate high-quality images from user narratives. A sequence tagger predicts an object label for each word in the description.
- A text-to-image dual encoder retrieves images with semantically relevant masks. For each trace sequence, the mask that maximizes spatial overlap with the trace is chosen, which compensates for the lack of ground-truth text-to-object alignment and better grounds the descriptions.
- Selected masks are composed in trace order, with separate canvases for background and foreground objects. The foreground canvas is placed over the background one to create a full scene segmentation.
- Lastly, a realistic image is synthesized by feeding the full scene segmentation to a mask-to-image translation model.
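The selection and composition steps above can be sketched in a few lines. This is a toy illustration under stated assumptions: masks are binary grids, "overlap" is approximated as the fraction of trace points a mask covers, and all function names are invented for the example rather than taken from the paper:

```python
# Toy sketch of the retrieve-then-compose steps: pick, for each phrase,
# the candidate mask that best covers its trace points, then paint
# background items first and foreground items on top, in order.
# Masks are h x w grids of 0/1; labels are integer ids. Illustrative only.

def trace_overlap(mask, trace_points):
    """Fraction of trace points (row, col) that fall inside the mask."""
    hits = sum(1 for r, c in trace_points if mask[r][c])
    return hits / len(trace_points)

def select_mask(candidate_masks, trace_points):
    """Choose the retrieved mask with the greatest trace coverage."""
    return max(candidate_masks, key=lambda m: trace_overlap(m, trace_points))

def compose_scene(h, w, background_items, foreground_items):
    """Paint background labels first, then foreground labels over them.

    Each item is (label_id, candidate_masks, trace_points).
    Returns an h x w grid of label ids (0 = unlabeled).
    """
    canvas = [[0] * w for _ in range(h)]
    for label, masks, trace in background_items + foreground_items:
        mask = select_mask(masks, trace)
        for r in range(h):
            for c in range(w):
                if mask[r][c]:
                    canvas[r][c] = label
    return canvas
```

Painting foreground items after background ones is what makes the foreground masks occlude the background, mirroring the two-canvas composition described above; the final label grid is what a mask-to-image model would consume.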
On evaluation, the new system outperforms SOTA text-to-image generation techniques under both automatic and human judgment. It demonstrates the practicality of generating realistic, controllable images from complicated, noisy narratives transcribed from everyday speech. The TReCS system accounts for the intricacy of generating images from lengthy and complex text descriptions, and shows that mouse traces can be a useful signal for realistic text-to-image generation.
One of the study’s limitations is the lack of suitable evaluation metrics to measure the quality of the generated images quantitatively. The existing metrics do not reasonably reflect the semantic similarity between the ground-truth image and the machine-generated one.
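A toy example makes the metric problem concrete. Pixel-level scores such as mean squared error can rate a semantically equivalent image (the same object, slightly shifted) exactly as badly as an image where the object is missing entirely; the data below is invented purely for illustration:

```python
# Illustrative toy example (not from the paper): pixel-wise MSE fails to
# capture semantic similarity. A shifted copy of the object scores the
# same as an image with no object at all.

def mse(a, b):
    """Mean squared error between two equal-sized 2-D grids."""
    h, w = len(a), len(a[0])
    return sum((a[r][c] - b[r][c]) ** 2 for r in range(h) for c in range(w)) / (h * w)

# 1 = "object" pixel, 0 = background
ref     = [[0, 1, 1, 0]]
shifted = [[1, 1, 0, 0]]   # same object, shifted one pixel left
blank   = [[0, 0, 0, 0]]   # object missing entirely

# mse(ref, shifted) == mse(ref, blank) == 0.5 -- the metric cannot tell
# a semantically faithful image from a semantically empty one.
```

This is the gap the authors point to: a metric that tracks human judgment would need to compare images at the level of objects and layout, not raw pixels.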
In the coming years, the proposed idea could support various applications offering a friendly human-machine interface. It could help artists create prototypes, draw insights from machine-generated images, and produce realistic pictures. Additionally, it could be used to design a human-in-the-loop evaluation system to further optimize the network.