SpeechSplit, an autoencoder that can decompose speech into content, timbre, rhythm and pitch

Image Source: https://arxiv.org/abs/2004.11284

Human speech can be broken into four important components: content, timbre, pitch, and rhythm. Content is the primary information in speech, the part that can be transcribed to text. Timbre carries the voice characteristics of a speaker, which helps in matching speaker identity. The speaker's emotion is expressed by the last two components, pitch and rhythm: variation in pitch conveys the speaker's tone, and rhythm characterizes how fast the speaker utters each word or syllable.

Obtaining disentangled representations of these four components can be useful in speech analysis and generation applications. Currently available models can disentangle only timbre, while information about pitch, rhythm, and content remains mixed together. Disentangling the remaining three components is an under-determined problem without explicit annotations for each component, and such annotations are expensive to obtain.

The paper proposes SpeechSplit, an autoencoder that can decompose speech into content, timbre, rhythm, and pitch. The model performs this decomposition blindly, that is, without explicit annotations for each component, by introducing three carefully designed information bottlenecks. SpeechSplit is among the first algorithms that can separately perform style transfer on timbre, pitch, and rhythm without text labels.
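To make the idea of an information bottleneck more concrete, here is a toy sketch in plain Python. It is not the paper's architecture: the `bottleneck`, `encode`, and `transfer` functions, the channel counts, the stride, and the use of a one-hot speaker label for timbre are all assumptions chosen for illustration, and no decoder is shown. The sketch only illustrates how limiting the capacity of each code (fewer channels, fewer time steps) yields separate codes that can then be swapped between two utterances.

```python
from typing import Dict, List

Frames = List[List[float]]  # time-major: one list of channel values per frame


def bottleneck(frames: Frames, keep_dims: int, stride: int) -> Frames:
    """Limit capacity: keep `keep_dims` channels, average every `stride` frames."""
    codes = []
    for start in range(0, len(frames), stride):
        chunk = [frame[:keep_dims] for frame in frames[start:start + stride]]
        # Average each kept channel over the frames in this chunk.
        codes.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return codes


def encode(spectrogram: Frames, pitch: Frames, speaker: int,
           num_speakers: int) -> Dict[str, Frames]:
    """Produce one code per component; all sizes are illustrative guesses."""
    one_hot = [1.0 if i == speaker else 0.0 for i in range(num_speakers)]
    return {
        "content": bottleneck(spectrogram, keep_dims=4, stride=4),
        "rhythm": bottleneck(spectrogram, keep_dims=1, stride=4),
        "pitch": bottleneck(pitch, keep_dims=1, stride=4),
        # Assumption: timbre is supplied as a speaker label, not encoded.
        "timbre": [one_hot],
    }


def transfer(source: Dict[str, Frames], target: Dict[str, Frames],
             component: str) -> Dict[str, Frames]:
    """Copy the source codes, replacing one component with the target's."""
    mixed = dict(source)
    mixed[component] = target[component]
    return mixed
```

For example, `transfer(encode(spec_a, pitch_a, 0, 2), encode(spec_b, pitch_b, 1, 2), "pitch")` keeps utterance A's content, rhythm, and timbre codes while taking the pitch code from utterance B, which is the kind of single-component style transfer described above. In a real model, learned encoders and a decoder would replace the hand-written averaging.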

Image Source: https://anonymous0818.github.io/

Paper: https://arxiv.org/pdf/2004.11284.pdf

Audio demo (interactive): https://anonymous0818.github.io/
