Human speech can be broken into four important components: content, timbre, pitch, and rhythm. The first component, content, carries the primary linguistic information in the speech, which can be transcribed to text. The second component, timbre, carries the voice characteristics of a speaker and is the main cue for speaker identity. The speaker's emotion is expressed mainly by the last two components: variation in pitch conveys the tone of the speaker, and rhythm characterizes how fast the speaker utters each word or syllable.
Obtaining disentangled representations of these four components of speech can be useful in speech analysis and generation applications. Currently, available models can only disentangle timbre, while information about pitch, rhythm, and content remains mixed together. Disentangling the remaining three components is an under-determined problem in the absence of explicit annotations for each component, which are expensive to obtain.
This paper proposes SpeechSplit, an autoencoder that can decompose speech into content, timbre, rhythm, and pitch. By introducing three carefully designed information bottlenecks, the model can blindly decompose speech into its four components without supervision. SpeechSplit is among the first algorithms that can separately perform style transfer on timbre, pitch, and rhythm without text labels.
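The core idea of forcing each component through its own narrow channel can be illustrated with a toy sketch. This is not the authors' implementation: the encoders here are untrained random linear projections, and all dimensions (mel bins, code sizes, speaker-embedding size) are hypothetical. The sketch only shows the data flow: three narrow bottleneck codes plus a speaker embedding are concatenated and fed to a decoder that reconstructs the spectrogram.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D = 100, 80                              # frames x mel bins (assumed sizes)
spectrogram = rng.normal(size=(T, D))       # stand-in for a mel spectrogram
pitch_contour = rng.normal(size=(T, 1))     # stand-in for an F0 contour

def encoder(x, out_dim, seed):
    """Toy linear 'encoder': project each frame to a narrow code.
    A real model would use trained recurrent/convolutional encoders."""
    w = np.random.default_rng(seed).normal(size=(x.shape[1], out_dim))
    return x @ w

# Three narrow bottlenecks; the code widths here are illustrative assumptions.
rhythm_code  = encoder(spectrogram, 2, seed=1)    # narrowest: rhythm
content_code = encoder(spectrogram, 8, seed=2)    # content
pitch_code   = encoder(pitch_contour, 4, seed=3)  # pitch

# Timbre is supplied to the decoder as a per-speaker embedding rather than
# passing through a bottleneck, so the bottlenecks cannot carry it.
speaker_embedding = rng.normal(size=(16,))

decoder_in = np.concatenate(
    [rhythm_code, content_code, pitch_code,
     np.tile(speaker_embedding, (T, 1))], axis=1)

# Toy linear 'decoder' mapping the concatenated codes back to the spectrogram.
w_dec = rng.normal(size=(decoder_in.shape[1], D))
reconstruction = decoder_in @ w_dec
```

Style transfer then amounts to swapping one code (e.g. the pitch code) with that of another utterance before decoding, while leaving the other codes untouched.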
Audio demo (interactive): https://anonymous0818.github.io/