The success of any machine learning technique depends heavily on its training data. In reinforcement learning (RL), we can either rely on the limited data an agent gathers on its own or simulate a training environment that can supply as much data as needed. The latter method is increasingly popular, but it has a significant drawback: an RL agent can learn whatever is built into the simulator, yet it generalizes poorly to tasks that are even moderately different from the ones simulated.
One way to address this problem is to automatically create more diverse training environments by randomizing all the simulator's parameters, a process called domain randomization (DR). However, domain randomization does not effectively prepare an agent to transfer to previously unseen environments. Alternatively, a minimax adversary can be trained to minimize the RL agent's performance by finding and exploiting weaknesses in its policy, but such an adversary can simply create impossible environments, giving the agent no opportunity to learn.
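Domain randomization amounts to resampling the simulator's parameters for every training episode. A minimal sketch, where the parameter names and ranges are illustrative assumptions rather than values from any specific simulator:

```python
import random

rng = random.Random(0)  # seeded for reproducibility

def randomize_simulator():
    """Sample a fresh environment configuration for one training episode.

    The parameters below (friction, obstacle count, goal distance) are
    hypothetical examples of what a simulator might expose.
    """
    return {
        "friction": rng.uniform(0.1, 1.0),
        "obstacle_count": rng.randint(0, 20),
        "goal_distance": rng.uniform(1.0, 10.0),
    }

# Each episode runs in an independently randomized environment.
episode_configs = [randomize_simulator() for _ in range(1000)]
```

Because the parameters are sampled uniformly, every configuration is equally likely, including ones the agent has already mastered, which is why randomization alone does not produce a curriculum.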
In collaboration with UC Berkeley, Google AI has proposed a new multi-agent approach for training the adversary in the paper “Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design,” presented at NeurIPS 2020. The proposed algorithm, Protagonist Antagonist Induced Regret Environment Design (PAIRED), is based on minimax regret: it prevents the adversary from creating impossible environments while still allowing it to target weaknesses in the agent's policy. Agents trained with PAIRED were found to learn more complex behavior and generalize better to unseen test tasks.
To constrain the adversary flexibly, PAIRED introduces a third RL agent, called the antagonist because it is allied with the adversary, i.e., the agent designing the environment. The original agent, the one navigating the environment, is the protagonist. Once the adversary generates an environment, both the protagonist and the antagonist play through it.
The adversary’s task is to maximize the antagonist’s reward while minimizing the protagonist’s reward. It must therefore create environments that are feasible for the antagonist but challenging for the protagonist. The gap between the two rewards is the regret: the adversary tries to maximize it, while the protagonist attempts to minimize it. Unsupervised environment design (UED) establishes a connection between environment design and decision theory: PAIRED optimizes minimax regret, whereas pure minimax adversarial training follows the maximin principle. This formalism makes it possible to use tools from decision theory to analyze each method’s benefits and drawbacks.
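The regret at the heart of PAIRED is simply the gap between the two agents' returns on the adversary-designed environment. A minimal sketch, using scalar returns as a stand-in for the expectations over episodes used in the paper:

```python
def regret(antagonist_return, protagonist_return):
    """How much better the antagonist does than the protagonist
    on the environment the adversary designed."""
    return antagonist_return - protagonist_return

# The antagonist solves the environment (return 1.0), the protagonist fails:
print(regret(1.0, 0.0))  # 1.0 -> a solvable-but-hard environment, high regret

# An impossible environment yields zero regret, since neither agent can score,
# so the adversary has no incentive to propose it:
print(regret(0.0, 0.0))  # 0.0
```

This is why regret maximization rules out impossible environments: the adversary only profits from environments that at least the antagonist can solve.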
The exciting thing about minimax regret is that it rewards the adversary for generating a curriculum of initially easy, then increasingly difficult environments. In most environments, the reward function gives more points for completing the task in fewer timesteps. By maximizing regret, the adversary searches for easy environments that the protagonist nevertheless fails to complete. Once the protagonist has learned to solve each environment, the adversary must find a harder one that the protagonist cannot yet solve. The adversary thus generates a curriculum of increasingly challenging tasks; in maze navigation, for example, it creates longer but still solvable mazes. This enables PAIRED agents to learn more complex behavior, and the results provide promising evidence that PAIRED can improve generalization for deep RL.
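A toy calculation shows how time-shaped rewards turn regret maximization into a curriculum. The reward function below, which pays more for finishing in fewer timesteps, is an illustrative assumption, not the paper's exact formulation:

```python
def timed_return(solved, steps, max_steps=100):
    """Sparse reward shaped by speed: more points for fewer timesteps."""
    return (max_steps - steps) / max_steps if solved else 0.0

# Early training: the antagonist solves a maze quickly; the protagonist fails.
r_ant = timed_return(True, steps=10)    # 0.9
r_pro = timed_return(False, steps=100)  # 0.0
print(r_ant - r_pro)  # regret 0.9 -> the adversary keeps proposing such mazes

# Later: the protagonist also solves it, almost as fast. Regret collapses,
# so the adversary must design a longer (but still solvable) maze.
r_pro = timed_return(True, steps=12)    # 0.88
print(round(r_ant - r_pro, 2))  # 0.02
```

Each time the protagonist catches up, the only way for the adversary to restore high regret is to propose a harder environment that the antagonist can still solve, which is exactly a curriculum.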
The follow-up paper “Adversarial Environment Generation for Learning to Navigate the Web” examines the algorithm’s performance in a more complex setting: teaching RL agents to navigate web pages. The researchers propose an improved version of PAIRED and show how it can train an adversary to generate a curriculum of increasingly difficult websites. The team achieved a 75% success rate, a 4x improvement over the strongest curriculum learning baseline.
Although deep RL is very good at fitting a simulated training environment, handling real-world complexity requires automatically generating diverse training conditions. Unsupervised Environment Design (UED) is a framework that describes different methods for automatically creating a distribution of training environments, and PAIRED is a practical UED approach because regret maximization yields a curriculum of increasingly challenging tasks.