SpeechSplit, an autoencoder that can decompose speech into content, timbre, rhythm and pitch

Human speech can be broken into four important components: content, timbre, pitch, and rhythm. Content is the primary information in the speech, the part that can be transcribed to text. Timbre carries information about the voice characteristics of a speaker, which helps in matching speaker identity. The last two components, pitch and rhythm, express the emotion of the speaker: variation in pitch conveys aspects of the speaker's tone, and rhythm characterizes how fast the speaker utters each word or syllable.

Obtaining disentangled representations of these four components can be useful in speech analysis and generation applications. Currently available models can only disentangle timbre, while information about pitch, rhythm, and content remains mixed together. Disentangling the remaining three components is an under-determined problem without explicit annotations for each component, and such annotations are expensive to obtain.

The paper proposes SpeechSplit, an autoencoder that blindly decomposes speech into content, timbre, rhythm, and pitch by introducing three carefully designed information bottlenecks. SpeechSplit is among the first algorithms that can perform style transfer on timbre, pitch, and rhythm separately, without text labels.
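To make the idea of an information bottleneck concrete, here is a toy NumPy sketch, not the paper's architecture. Three encoders squeeze their input through a narrow channel dimension and a coarse time resolution, a one-hot speaker identity stands in for timbre, and a decoder rebuilds the spectrogram from the concatenated codes. Every name, dimension, and layer here (`BOTTLENECKS`, `encode`, `forward`, untrained random linear maps, which input feeds which encoder) is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

N_MELS = 80      # spectrogram bins per frame (assumed)
N_FRAMES = 64    # frames in the toy utterance (assumed)
N_SPEAKERS = 4   # one-hot speaker identity stands in for timbre (assumed)

# Hypothetical bottleneck sizes: name -> (channels, temporal downsampling).
BOTTLENECKS = {"content": (8, 8), "rhythm": (2, 8), "pitch": (4, 8)}


def init_params():
    """Random, untrained weights for the three encoders and the decoder."""
    params = {}
    for name, (channels, _stride) in BOTTLENECKS.items():
        in_dim = 1 if name == "pitch" else N_MELS
        params[name] = rng.normal(scale=0.1, size=(in_dim, channels))
    total = sum(c for c, _ in BOTTLENECKS.values()) + N_SPEAKERS
    params["decoder"] = rng.normal(scale=0.1, size=(total, N_MELS))
    return params


def encode(x, stride, weights):
    """Project each frame to a few channels, then keep every stride-th frame."""
    return np.tanh(x @ weights)[::stride]


def upsample(code, stride, n_frames):
    """Repeat each code along time so it lines up with the original frames."""
    return np.repeat(code, stride, axis=0)[:n_frames]


def forward(spectrogram, pitch, speaker_id, params):
    """Encode through the three bottlenecks, add timbre, and decode."""
    n_frames = spectrogram.shape[0]
    # Assumed wiring: content and rhythm read the spectrogram, pitch reads
    # a per-frame pitch contour.
    inputs = {"content": spectrogram, "rhythm": spectrogram, "pitch": pitch}
    parts = []
    for name, (_channels, stride) in BOTTLENECKS.items():
        code = encode(inputs[name], stride, params[name])
        parts.append(upsample(code, stride, n_frames))
    timbre = np.zeros((n_frames, N_SPEAKERS))
    timbre[:, speaker_id] = 1.0
    parts.append(timbre)
    return np.concatenate(parts, axis=1) @ params["decoder"]


params = init_params()
spec = rng.normal(size=(N_FRAMES, N_MELS))
pitch = rng.normal(size=(N_FRAMES, 1))

same_speaker = forward(spec, pitch, 0, params)
other_speaker = forward(spec, pitch, 1, params)  # swap only the timbre input
print(same_speaker.shape)
```

Because each code has only a few channels and one value per several frames, an encoder cannot simply copy its input through. In a trained model of this shape, that pressure is what would push each code to keep only the information the other paths do not supply. Swapping one input while holding the others fixed, as with `speaker_id` above, is the toy analogue of transferring a single style component.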

Paper: https://arxiv.org/pdf/2004.11284.pdf

Audio demo (interactive): https://anonymous0818.github.io/


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform has over 2 million monthly views, illustrating its popularity among audiences.
