PlayHT Team Introduces an AI Model That Brings the Concept of Emotions to Generative Voice AI: This Will Allow You to Control and Direct the Generation of Speech with a Particular Emotion

Speech technology has advanced rapidly in the NLP domain in recent years. Researchers have also used large language models to build generative text-to-voice (text-to-speech) models. These models showed that AI can approach human-level results in voice quality, expressiveness, and natural delivery. Despite this progress, problems remained: the models supported only a narrow range of languages, and they had trouble with speech recognition and with expressing emotion. Many researchers traced these problems to the small datasets used to train the models.

To address these problems, the PlayHT team introduced PlayHT2.0. Its main advantage is that it was trained on multiple languages and on a much larger dataset, and its model size was increased as well. Transformers played a major role in the model's design. The model processes a given transcript and predicts the corresponding sound. First, the text is split into tokens (tokenization); the model then predicts simplified audio codes, which are converted into sound waves to generate human-like speech.
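To make the text-to-tokens-to-sound flow concrete, here is a toy Python sketch of the three stages: text is tokenized into integer IDs, a model maps those tokens to discrete audio codes, and a decoder turns the codes into a waveform. This is not PlayHT2.0's implementation. The character vocabulary, the `predict_audio_codes` stand-in (a simple modulo in place of a trained transformer), and the sine-tone decoder are all hypothetical placeholders chosen only to show how data moves between stages.

```python
import math

# Hypothetical character-level vocabulary; real systems use learned tokenizers.
VOCAB = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz ,.!?'")}


def tokenize(text: str) -> list[int]:
    """Stage 1: convert text into integer token IDs, skipping unknown characters."""
    return [VOCAB[ch] for ch in text.lower() if ch in VOCAB]


def predict_audio_codes(tokens: list[int], codebook_size: int = 16) -> list[int]:
    """Stage 2 (hypothetical stand-in for the transformer): map each text token
    to one discrete audio code from a small codebook."""
    return [t % codebook_size for t in tokens]


def decode_to_waveform(codes: list[int], sample_rate: int = 16000,
                       frame_ms: int = 20) -> list[float]:
    """Stage 3 (hypothetical stand-in for the audio decoder): render each code
    as a short sine-wave frame whose pitch depends on the code value."""
    samples_per_frame = sample_rate * frame_ms // 1000
    wave: list[float] = []
    for code in codes:
        freq = 200.0 + 50.0 * code  # hypothetical code-to-pitch mapping
        for n in range(samples_per_frame):
            wave.append(math.sin(2 * math.pi * freq * n / sample_rate))
    return wave


if __name__ == "__main__":
    tokens = tokenize("Hello!")
    codes = predict_audio_codes(tokens)
    wave = decode_to_waveform(codes)
    print(tokens, codes, len(wave))
```

For example, `"Hello!"` becomes six tokens, six codes, and six 20 ms frames (1,920 samples at 16 kHz). In a real system, the middle stage would be the trained transformer and the decoder would typically be a learned model rather than fixed sine tones; the sketch only illustrates the order of the stages.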

The model has strong conversational abilities and can hold a conversation much like a person, with some emotion. Many multinational companies already use this kind of AI chatbot technology for online calls and seminars. PlayHT2.0 also improves speech quality through the optimization techniques it uses, and it can closely replicate a specific voice. Because its training dataset is extremely large, the model can also speak another language while preserving the original voice. The model was trained over a large number of epochs with varying hyperparameters, which enabled it to express a variety of emotions in the speech it generates.

The model is still a work in progress and will continue to improve. Researchers are still working on its handling of emotions. Prompt engineers and researchers also expect the model to be updated over the coming weeks, with improvements in speed, accuracy, and F1 score.

Check out the Reference Article. All Credit For This Research Goes To the Researchers on This Project.

Bhoumik Mhatre is a third-year undergraduate student at IIT Kharagpur pursuing a B.Tech + M.Tech program in Mining Engineering with a minor in Economics. He is a data enthusiast. He is currently a research intern at the National University of Singapore. He is also a partner at Digiaxx Company. 'I am fascinated by the recent developments in the field of Data Science and would like to research them.'