Unsupervised Any-to-Many Audiovisual Synthesis via Exemplar Autoencoders

This repository presents an unsupervised approach that converts a person's input speech into the voices of a potentially unlimited set of output speakers. A person can record in front of a microphone and make their favorite celebrity say the same words.

The approach builds on a simple property of autoencoders: they project out-of-sample data onto the distribution of the training set (motivated by PCA/linear autoencoders). An exemplar autoencoder is trained to capture the voice and specific style (emotion and ambiance) of a single target speaker.
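The PCA intuition above can be illustrated with a toy linear autoencoder. This is only a hedged sketch of the projection idea, not the authors' model: a linear autoencoder with tied weights trained on one "speaker's" data reduces to PCA, so reconstructing any out-of-sample input projects it onto the training subspace.

```python
import numpy as np

# Toy sketch of the intuition behind exemplar autoencoders (illustrative
# only, not the authors' architecture): a linear autoencoder trained on
# one speaker's data learns that speaker's subspace, so any out-of-sample
# input gets projected onto the training distribution.

rng = np.random.default_rng(0)

# "Training set": 200 samples lying in a 2-D subspace of R^5
basis = rng.standard_normal((5, 2))
train = rng.standard_normal((200, 2)) @ basis.T

# The optimum of a tied-weight linear autoencoder is PCA:
# keep the top-2 principal directions of the training data.
mean = train.mean(axis=0)
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)
components = vt[:2]  # shared encoder/decoder weights, shape (2, 5)

def autoencode(x):
    """Encode to a 2-D code, decode back: projects x onto the training subspace."""
    code = (x - mean) @ components.T
    return mean + code @ components

# An arbitrary out-of-sample point is pulled onto the training subspace...
x_new = rng.standard_normal(5)
x_proj = autoencode(x_new)

# ...and projecting again changes nothing (idempotence, the hallmark
# of projection onto the training distribution).
assert np.allclose(autoencode(x_proj), x_proj)
```

In the paper this property is what lets a nonlinear exemplar autoencoder, trained only on a target speaker, map any input voice onto that speaker's audio distribution.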

Unlike existing methods, this approach scales easily to an arbitrarily large number of speakers, requiring only two to three minutes of audio per new speaker. The authors also demonstrate its usefulness for generating video from audio signals and vice versa.

Summary Video

Paper: http://www.cs.cmu.edu/~aayushb/AudioCon/AudioCon.pdf

Abstract: https://arxiv.org/abs/2001.04463

Github: https://github.com/dunbar12138/Audiovisual-Synthesis

Project Page: https://dunbar12138.github.io/projectpage/Audiovisual/

Demo: https://scs00197.sp.cs.cmu.edu