Reinforcement learning (RL) has enabled agents to make decisions and solve complex problems in unknown environments directly from high-dimensional image inputs, with successes in locomotion, robotic manipulation, and game playing. However, these successes are built on close supervision through manually crafted reward functions: agents are rewarded and punished based on their performance and eventually learn a policy that maximizes reward and minimizes punishment. Designing informative reward functions is costly, time-consuming, and error-prone, and these difficulties grow with the complexity of the task at hand.
Unlike RL agents, natural agents learn through intrinsic objectives without externally assigned tasks. For example, children are not assigned to crawl; they naturally crawl and play around to explore their surroundings. This has motivated researchers to identify mathematical objectives that do not depend on a specific task, can be applied in any unknown environment, and can be given to RL agents.
Recently, researchers at the Vector Institute, the University of Toronto, and Google Brain examined three types of intrinsic motivation as task-agnostic objectives for RL agents. They observed that all three intrinsic objectives correlate more strongly with a human behavior similarity metric than task reward does.
The researchers tested the following three common types of intrinsic motivation while evaluating agents without rewards:
• Input entropy rewards agents for encountering rare sensory inputs, as measured by a learned density model.
• Information gain rewards agents for learning the rules of their environment.
• Empowerment rewards agents for maximizing their influence over their sensory inputs or environment.
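To make the first of these concrete, the input entropy objective can be sketched with a toy count-based density model: observations the model has rarely seen get low probability and therefore high reward. The class and function names here are illustrative assumptions, not the paper's implementation (which uses a learned density model over image inputs).

```python
from collections import Counter
import math

class CountDensityModel:
    """Toy count-based density model over discretized observation keys
    (a stand-in for the learned density model used in practice)."""
    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def update(self, obs_key):
        self.counts[obs_key] += 1
        self.total += 1

    def prob(self, obs_key):
        # Add-one smoothing so unseen observations get nonzero probability.
        return (self.counts[obs_key] + 1) / (self.total + 1)

def input_entropy_reward(model, obs_key):
    # Rare inputs have low estimated probability and thus high reward,
    # which pushes the agent toward novel sensory inputs.
    return -math.log(model.prob(obs_key))
```

Under this sketch, an observation seen once among a hundred yields a much larger reward than one seen ninety-nine times, which is exactly the exploration pressure input entropy is meant to create.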
The team collected a diverse dataset of different environments and behaviors and retrospectively computed agent objectives for evaluation. They analyzed the correlations between intrinsic objectives and supervised objectives (such as task reward and human similarity) and related the intrinsic objectives to one another, all without training a new agent for each objective.
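The correlation analysis can be sketched as computing a Pearson correlation between per-agent objective scores. The numbers below are made-up toy values, not the paper's data; the point is only the shape of the computation.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical per-agent averages of each objective (illustrative only).
input_entropy    = [0.9, 0.4, 0.7, 0.2]
human_similarity = [0.8, 0.3, 0.6, 0.1]
task_reward      = [0.2, 0.9, 0.1, 0.5]

print(pearson(input_entropy, human_similarity))  # strong positive
print(pearson(input_entropy, task_reward))
```

Because the objectives are computed retrospectively on logged behavior, the same dataset supports correlating every pair of objectives without retraining any agent.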
The researchers used 100 million frames from three Atari game environments to train seven RL agents with and without a task reward. Because the 3D Minecraft environment simulates more slowly than Atari, they used 12 million frames per agent there. Human behavior was taken as the ground truth for the human similarity objective, and the team estimated the similarity between agents' and humans' actions in the shared environments.
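One simple proxy for action-level similarity, assumed here for illustration rather than taken from the paper's exact metric, is the agreement rate between the agent's discrete actions and the actions humans took in matching situations:

```python
def action_agreement(agent_actions, human_actions):
    """Fraction of timesteps where the agent picks the same discrete action
    a human picked in the matched situation (a toy similarity proxy)."""
    matches = sum(a == h for a, h in zip(agent_actions, human_actions))
    return matches / len(agent_actions)

# Toy example: agent and human disagree on one of four timesteps.
print(action_agreement([0, 1, 2, 1], [0, 1, 1, 1]))  # 0.75
```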
Across all environments, every examined intrinsic objective correlates more strongly with human similarity than the task rewards do. This suggests favoring intrinsic objectives over task rewards when designing general agents meant to behave like humans. The team also noticed that input entropy and information gain behave as similar objectives, while empowerment may offer complementary benefits, and they therefore recommend future work on combining intrinsic objectives.
The human dataset is currently too small to estimate human similarity values reliably, and it is unclear what instructions the human players received. Collecting additional human data and controlling the players' instructions could help future work in this area. The team noted that, to assign agent observations to buckets, they downscaled the images. This is simple but does not account for semantic similarity between images, so they suggest learning representations with deep neural networks in future work.
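The downscale-and-bucket step they describe can be sketched as follows: shrink each frame, coarsely quantize its pixel values, and use the result as a hashable bucket key so that visually near-identical frames fall into the same bucket. The factor and quantization levels below are illustrative assumptions, not the paper's settings.

```python
def downscale(img, factor):
    """Nearest-neighbor downscale of a 2D grayscale image (list of rows)
    by keeping every `factor`-th row and column."""
    return [row[::factor] for row in img[::factor]]

def bucket_key(img, factor=2, levels=8):
    """Downscale and quantize 0-255 pixel values into `levels` bins,
    returning a hashable key; near-identical frames share a bucket."""
    small = downscale(img, factor)
    return tuple(tuple(p * levels // 256 for p in row) for row in small)
```

As the article notes, this discretization ignores semantic similarity: two frames of the same game state under different lighting can land in different buckets, which is why learned representations are proposed as a replacement.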