This paper/repo presents an unsupervised approach that converts the input speech of a person into the voice of any one of a potentially unlimited set of target speakers. A person can record in front of a mic and make their favorite celebrity say the same words.
The approach used in the paper/repo is built on a simple set of autoencoders that project out-of-sample data onto the distribution of the training set (motivated by PCA/linear autoencoders). An exemplar autoencoder is trained to learn the voice and specific style (emotion and ambiance) of a single target speaker.
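The projection property behind this idea can be illustrated with a toy linear autoencoder, which is equivalent to PCA. This is a minimal illustrative sketch (not the paper's actual model or code): reconstructing an out-of-sample point through the autoencoder pulls it onto the subspace spanned by the training data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "training set": points lying near a 2-D subspace of R^5
basis = rng.normal(size=(2, 5))
train = rng.normal(size=(200, 2)) @ basis + 0.01 * rng.normal(size=(200, 5))

# A linear autoencoder with tied weights reduces to PCA:
# keep the top-k principal directions of the training data.
mean = train.mean(axis=0)
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)
components = vt[:2]                      # shared encoder/decoder weights

def autoencode(x):
    """Project x onto the training distribution's principal subspace."""
    code = (x - mean) @ components.T     # encode
    return mean + code @ components      # decode

# An arbitrary out-of-sample input is mapped onto the training subspace,
# analogous to how an exemplar autoencoder maps any input speech onto
# the target speaker's voice distribution.
out_of_sample = rng.normal(size=5)
recon = autoencode(out_of_sample)

# The centered reconstruction has no component outside the subspace.
residual = (recon - mean) - ((recon - mean) @ components.T) @ components
print(np.allclose(residual, 0))          # True
```

In the paper, the linear projection is replaced by a deep autoencoder trained on a single speaker's audio, so any input speech is "projected" onto that speaker's voice and style.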
Unlike existing methods, the proposed approach can be scaled to an arbitrarily large number of speakers in very little time, using only two to three minutes of audio data per speaker. The authors also demonstrate the usefulness of this approach for generating video from audio signals, and vice versa.
Project Page: https://dunbar12138.github.io/projectpage/Audiovisual/