In A Latest Computer Vision Research, Researchers Introduce ‘JoJoGAN’: An AI Method With One-Shot Face Stylization

A style mapper applies a preset style to the photos it receives (for example, taking faces to cartoons). In a recent study, researchers from the University of Illinois at Urbana-Champaign introduce JoJoGAN as a straightforward approach for learning a style mapper from a single sample of the style. The technique allows an inexperienced user, for example, to supply a style sample and then apply that style to their image of choice. The team discusses its method in the context of face photographs because stylizing faces is so appealing to inexperienced users; nonetheless, the concept may be applied to any image.

A procedure for learning a style mapper should be simple to use, produce compelling and high-quality results, require only one style reference but accept and benefit from more, allow users to control how much style is transferred, and allow more sophisticated users to control which aspects of the style are transferred in order to be useful. Researchers show that the technique achieves these objectives using both qualitative and quantitative evidence.

Because the natural way – using paired or unpaired image translation – isn’t truly practical, learning a style mapper is difficult. Collecting a new dataset for each style is inconvenient because many styles may not have many samples. By modifying the discriminator, one can fine-tune a StyleGAN using few-shot learning approaches. These approaches have trouble producing beautiful photos because they lack thorough monitoring from pixel-level losses, and they frequently fail to capture particular style nuances and variations.

JoJoGAN, on the other hand, uses GAN inversion and StyleGAN’s style-mixing property to create a paired dataset from a reference picture (or images — one image is sufficient). StyleGAN is fine-tuned using this paired dataset and a unique direct pixel-level loss. The basics are simple: a mapper (and hence a large number of stylized portraits) may be created in under a minute from a single reference photograph.

JoJoGAN can successfully incorporate radical style references (such as animal faces). Natural procedures determine which elements of the style are employed and how much of the style is used. Qualitative samples reveal that the resulting photographs are far superior to those produced by competing methods. The method is backed up by quantitative evidence.

Source: https://arxiv.org/pdf/2112.11641.pdf

The generator and the pretrained StyleGAN discriminator are both trained at the exact resolution. The discriminator computes features that do not neglect information throughout the training phase (otherwise, the generator could produce low detail images). When averaged over batches, discriminator characteristics are known to stabilize GAN training. For the activations, the researchers chose to use the difference in discriminator activations at specific layers per image.

A style mapper should be able to produce good-looking outputs, properly transfer features from the style reference, and maintain the input’s identity. JoJoGAN has these qualities, according to qualitative examination, and outperforms current approaches significantly.

JoJoGAN excels in capturing small elements that form a style while maintaining the input face’s identity. When there are numerous consistent style references, JoJoGAN results are usually better. A comparison was made between a multi-shot stylization using all and multiple one-shot stylizations of each of a set of samples. When there are several style examples, JoJoGAN is able to mix details to hew closer to the input, whereas one-shot stylization duplicates effects from the style reference strongly (as it must).

In one study, the team compares JoJoGAN to non-DST approaches, and in another, it compares it to DST. Users are presented with a style reference, an input face, and stylizations from the methods in each and asked to select the stylization that best reflects the style reference while maintaining the original identity. The initial study yielded 186 replies from 31 people, with 80.6 percent preferring JoJoGAN to other approaches; the effect is so great that there are no significant difficulties. The second survey garnered 96 replies from 16 people, with 74 percent preferring JoJoGAN to DST.

Conclusion 

It’s highly tempting to be able to stylize faces using reference photographs. The team introduced JoJoGAN in this work, which lets anyone shoot that one shot in a hassle-free method, resulting in incredibly high-quality photographs that nail the stylistic aspects. The team demonstrates how to use StyleGAN as a strong facial before approximating a big paired dataset. It allows them to finetune it with pixel-level loss and capture critical style nuances that other approaches lack.

Paper: https://arxiv.org/pdf/2112.11641.pdf

Github: https://github.com/mchong6/JoJoGAN