DeepMind Develops An Artificial Intelligence (AI) System That Learns By Imitating Human Interactions

In its latest research, DeepMind constructed a 3D virtual environment “made of a randomized set of rooms and a large number of home interactable objects to give a place and context for humans and AI to interact together,” according to the researchers. Humans and AI agents interact by operating virtual robots that move around, manipulate objects, and communicate with one another via text.

DeepMind researchers created an AI that can interact naturally with humans, learning by imitating human behavior in a virtual world. The multimodal interactive agent (MIA) combines visual perception, language understanding and generation, navigation, and manipulation to engage in extended physical and verbal interactions with people. According to the research, MIA succeeds at instructions from non-adversarial human partners about 75% as often as humans themselves do.


To offer a place and context for humans and agents to interact, the researchers designed the Playhouse environment, a 3D virtual world made of a randomized collection of rooms and a wide variety of interactable household objects. In the Playhouse, humans and agents interact by commanding virtual robots that move around, manipulate objects, and converse via text. This virtual world supports a wide range of interactions, from simple instructions to imaginative play.


Human examples of Playhouse interactions were gathered using language games, a collection of prompts that encourage participants to enact particular behaviors. In a language game, one player (the setter) receives a prewritten prompt indicating what type of task to pose to the other player (the solver). For instance, the setter may be given the prompt “Ask the other player a question about the existence of an object.” After some exploration, the setter might then ask, “Please tell me whether there is a blue duck in a room without any furniture.” Free-form instructions were also included so that setters could improvise interactions (for example, “Now pick any object you choose and knock the tennis ball off the stool so it rolls towards the clock, or somewhere close to it.”). In total, 2.94 years of real-time human interactions were recorded in the Playhouse.
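To make the data-collection setup concrete, here is a minimal sketch of how one recorded language-game episode might be represented. The field names and structure are illustrative assumptions for this article, not DeepMind's actual data schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical record of one Playhouse language-game episode.
# Field names are illustrative assumptions, not the paper's schema.
@dataclass
class PlayhouseEpisode:
    setter_prompt: str                # prewritten language-game prompt
    setter_utterances: List[str]      # text typed by the setter
    solver_utterances: List[str]      # text typed by the solver
    observations: List[bytes] = field(default_factory=list)  # per-step visual frames
    actions: List[dict] = field(default_factory=list)        # per-step movement/manipulation/text actions

episode = PlayhouseEpisode(
    setter_prompt="Ask the other player a question about the existence of an object.",
    setter_utterances=["Please tell me whether there is a blue duck in a room without any furniture."],
    solver_utterances=["Yes, there is one in the empty room."],
)
```

A corpus of such episodes, paired with the robots' per-step observations and actions, is the kind of dataset that imitation learning consumes.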

The agent was trained with supervised prediction of human actions (behavioral cloning) combined with self-supervised learning. To evaluate MIA's performance, human participants interacted with the agent and judged whether it successfully followed their instructions. The findings revealed that MIA achieved a success rate of over 70% in human-rated online interactions, equivalent to 75% of the success rate humans achieve when playing as solvers.
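The behavioral-cloning part of this recipe can be sketched in a few lines: the agent predicts an action from an observation and is trained to maximize the likelihood of the action the human actually took. The network, dimensions, and data below are toy assumptions, not MIA's actual architecture.

```python
import torch
import torch.nn as nn

# Toy behavioral-cloning sketch (illustrative, not DeepMind's model):
# map an observation embedding to action logits, then maximize the
# likelihood of the human's recorded action via cross-entropy.
class Agent(nn.Module):
    def __init__(self, obs_dim=64, n_actions=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, n_actions)
        )

    def forward(self, obs):
        return self.net(obs)  # unnormalized action logits

agent = Agent()
optimizer = torch.optim.Adam(agent.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# One gradient step on a synthetic batch of "human demonstrations".
obs = torch.randn(32, 64)                  # stand-in for encoded observations
human_actions = torch.randint(0, 10, (32,))  # stand-in for recorded actions
loss = loss_fn(agent(obs), human_actions)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In the actual system, the self-supervised objectives are trained alongside this cloning loss; the sketch above covers only the imitation term.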

This Article Is Based On The Research Paper 'Creating Multimodal Interactive Agents with Imitation and Self-Supervised Learning'. All Credit For This Research Goes To The Researchers of This Project. Check out the paper, video and post. We have also used information from an outside blog post.


I am a consulting intern at MarktechPost. I am majoring in Mechanical Engineering at IIT Kanpur. My interests lie in machining and robotics. I also have a keen interest in AI, ML, DL, and related areas. I am a tech enthusiast, passionate about new technologies and their real-life applications.