Learning from rewards is a familiar idea, and it is the guiding insight behind the family of deep reinforcement learning algorithms that mastered much of Atari’s gaming catalog. An Artificial Intelligence “agent” traverses the game, tries out different actions, and registers the actions that lead it to a win.
A team of researchers from Uber AI and OpenAI set out to build on this concept of learning from rewards. While the agent explores the game, a record of each promising state it reaches is maintained. When the agent hits a dead end, it is encouraged to go back to a previously recorded state that still promises a winning solution. That state is reloaded, and new branches are deliberately explored from it in search of the next win. The approach works much like checkpoints in video gaming: you play, die, reload a saved point (checkpoint), try something new, and repeat until you get a perfect run-through.
The new family of algorithms, called “Go-Explore,” cracked challenging Atari games that its predecessors had found unsolvable. The team also found that installing Go-Explore as the “brain” of a robotic arm in computer simulations made it possible to solve a challenging sequence of actions with very sparse rewards. The team believes the approach can be adapted to other real-world problems, such as language learning or drug design.
How do you reward an algorithm?
Rewards are tough to design. Reinforcement learning works well when there is clear feedback that marks one action as bad and another as good. However, when feedback is very sparse, it is hard to design effective rewards, and a poorly designed reward can inadvertently lead the agent into a dead end.
The other difficulty arises from making rewards denser. Frequently rewarding the agent along its journey may look like helpful hand-holding, but it can produce a highly rigid agent that sticks to the rewarded path and ignores better alternatives. The team needed AI agents that can tackle both problems.
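As a toy illustration of this trade-off (not taken from the paper), consider a hypothetical navigation task. A purely sparse reward gives the agent almost no learning signal until it stumbles onto the goal, while a shaped, distance-based reward gives constant signal but tends to lock the agent onto one rewarded path:

```python
import numpy as np

# Hypothetical reward functions for a toy 2-D navigation task.
# These are illustrative sketches, not the rewards used in the Go-Explore paper.

GOAL = np.array([10.0, 10.0])

def sparse_reward(position):
    """Sparse feedback: the agent learns nothing until it happens to reach the goal."""
    return 1.0 if np.allclose(position, GOAL, atol=0.5) else 0.0

def shaped_reward(position):
    """Denser feedback: reward progress toward the goal at every step.
    This gives far more signal, but can make the agent rigidly follow
    the shortest rewarded path and ignore detours that would ultimately
    score higher."""
    return -float(np.linalg.norm(position - GOAL))
```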
The key is to return to the past.
According to Huizinga, curiosity-driven AI agents are rewarded for exploring new and unusual situations. But there are significant downsides to the idea as well:
- The agent might assume it has already found a good solution and stop returning to promising areas.
- The agent may even forget a promising earlier state entirely, because its drive to seek out the next novel situation pulls it away before that state has been fully explored.
On a complex task, such an agent randomly stumbles toward one solution and ignores potentially better ones. Go-Explore solves this problem with a simple principle: first return, then explore. The states the agent reaches are stored in an archive; promising saved states are then reloaded and explored further in search of better results.
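A minimal Python sketch of that return-then-explore loop might look like the following. It assumes a deterministic, gym-style environment whose simulator state can be saved and restored; the helpers env.get_state() and env.set_state(), the cell() downsampling, and the uniform cell selection are simplified placeholders rather than the authors’ actual implementation (the paper uses more careful state representations and selection heuristics):

```python
import random

# Sketch of Go-Explore's "first return, then explore" phase, assuming a
# deterministic environment with a gym-style step() API and hypothetical
# get_state()/set_state() helpers for saving and restoring simulator state.

def cell(observation):
    """Map an observation to a coarse 'cell' so that similar states share one archive entry."""
    return tuple((observation // 16).flatten().tolist())  # e.g. downsampled pixel values

def go_explore(env, iterations=1000, explore_steps=50):
    archive = {}                                  # cell -> (saved state, score, trajectory)
    obs = env.reset()
    archive[cell(obs)] = (env.get_state(), 0.0, [])

    for _ in range(iterations):
        # 1. Select a promising cell from the archive (here: uniformly at random;
        #    the paper weights rarely visited cells more heavily).
        c = random.choice(list(archive))
        state, score, traj = archive[c]

        # 2. First return: restore the simulator to that saved state.
        env.set_state(state)

        # 3. Then explore: take random actions from there.
        for _ in range(explore_steps):
            action = env.action_space.sample()
            obs, reward, done, _ = env.step(action)
            score, traj = score + reward, traj + [action]
            new_c = cell(obs)
            # 4. Archive any cell that is new, or reached with a better score than before.
            if new_c not in archive or score > archive[new_c][1]:
                archive[new_c] = (env.get_state(), score, traj)
            if done:
                break
    return archive
```

Because exploration always restarts from an archived state rather than from scratch, the agent never loses track of promising areas it has already reached, which is exactly the failure mode of the curiosity-driven agents described above.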
Paper: https://www.nature.com/articles/s41586-020-03157-9
Related Paper: https://arxiv.org/pdf/1901.10995.pdf