Deep reinforcement learning (deep RL) has seen promising advances in recent years, producing highly performant artificial agents across a wide range of training domains. Agents now perform exceptionally well in individual challenging simulated environments, mastering the tasks they were trained for. However, they remain restricted to the games they were trained on: any deviation (e.g., changes in layout, initial conditions, or opponents) can cause the agent to fail.
To address this issue, DeepMind researchers set out to create a robust artificial agent whose behavior generalizes beyond its training set of games. To this end, they introduce an open-ended 3D simulated environment space for training and evaluating artificial agents.
They show that a neural network policy exhibiting generally capable behavior can be obtained by training an agent effectively across a massively multi-task continuum, enabling it to earn rewards on every humanly solvable task in the held-out evaluation set. In addition, they observe that the agents become capable of tasks that were explicitly excluded from training and lie outside the training distribution, such as hide-and-seek and capture-the-flag variants.
They developed an environment space called XLand that supports the procedural generation of complex 3D environments and multiplayer games, yielding a broad and diverse continuum of training and evaluation tasks. Visual scene understanding, navigation, physical manipulation, memory, logical reasoning, and theory of mind are among the skills required of players.
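To make the idea of a procedurally generated task space concrete, here is a minimal sketch in which a "task" pairs a randomly generated world with a randomly generated game goal. All names and structures below (the object/color/relation vocabularies, the grid-of-heights world, the predicate-list goal) are illustrative assumptions, not DeepMind's actual XLand API.

```python
import random

# Illustrative vocabularies for goal predicates (hypothetical, not XLand's).
OBJECTS = ["cube", "pyramid", "sphere"]
COLORS = ["black", "purple", "yellow"]
RELATIONS = ["near", "on", "hold"]

def sample_world(rng, size=5):
    """Sample a random floor layout as a grid of tile heights."""
    return [[rng.randint(0, 3) for _ in range(size)] for _ in range(size)]

def sample_goal(rng, n_predicates=2):
    """Sample a goal as a conjunction of (relation, color, object) predicates."""
    return [(rng.choice(RELATIONS), rng.choice(COLORS), rng.choice(OBJECTS))
            for _ in range(n_predicates)]

def sample_task(seed):
    """A task combines a world layout with a game goal, keyed by a seed."""
    rng = random.Random(seed)
    return {"world": sample_world(rng), "goal": sample_goal(rng)}

task = sample_task(seed=0)
```

Because each task is fully determined by a seed, a continuum of distinct training and evaluation tasks can be enumerated cheaply, which is the property the training procedure relies on.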
The researchers first establish a multi-dimensional performance metric based on normalized score percentiles to characterize agent performance and robustness across the evaluation task space. They then use an open-ended training process that iteratively pushes up the spectrum of normalized score percentiles.
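The percentile metric can be sketched as follows: each task's raw return is divided by a per-task reference score (in the paper, roughly the best score attainable on that task), and the agent is then summarized by percentiles of the resulting distribution over tasks. The function name and the choice of percentiles below are illustrative assumptions.

```python
import numpy as np

def normalized_percentiles(raw_scores, reference_scores, qs=(10, 20, 50)):
    """Return the requested percentiles of per-task normalized scores."""
    raw = np.asarray(raw_scores, dtype=float)
    ref = np.asarray(reference_scores, dtype=float)
    normalized = raw / np.maximum(ref, 1e-8)   # guard against division by zero
    return {q: float(np.percentile(normalized, q)) for q in qs}

# Example: an agent evaluated on five held-out tasks.
metrics = normalized_percentiles(
    raw_scores=[4.0, 0.0, 2.0, 5.0, 1.0],
    reference_scores=[5.0, 5.0, 4.0, 5.0, 2.0],
)
```

Tracking low percentiles (e.g., the 10th) rather than only the mean rewards agents for competence on their *worst* tasks, which is what makes the metric a measure of breadth rather than peak performance.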
The training technique is based on deep reinforcement learning, with an attention-based neural network architecture that lets the agent implicitly model the goals of each game. The training tasks the agent consumes are dynamically created in response to the agent's performance, with the generating function changing regularly to keep a population of agents improving across all percentiles of the normalized score.
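The dynamic-task-generation idea can be sketched as a filter: candidate tasks are kept for training only if the agent's current estimated score on them falls in a "learnable" band (neither already solved nor hopeless). The band thresholds and the toy score estimator below are illustrative assumptions, not the authors' actual criteria.

```python
def select_training_tasks(candidates, estimate_score, low=0.1, high=0.9):
    """Keep tasks whose estimated normalized score lies strictly in (low, high)."""
    return [t for t in candidates if low < estimate_score(t) < high]

# Example with a toy estimator: easier tasks yield higher estimated scores.
candidates = [{"id": i, "difficulty": d}
              for i, d in enumerate([0.0, 0.3, 0.5, 0.95, 1.0])]

def estimate(task):
    return 1.0 - task["difficulty"]

selected = select_training_tasks(candidates, estimate)
# Only the mid-difficulty tasks (ids 1 and 2) survive the filter.
```

As the agent improves, previously hard tasks drift into the learnable band and previously learnable tasks drift out, so the effective curriculum shifts automatically with agent performance.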
Agent populations are trained sequentially, with each generation distilled from the best agent of the previous generation. This iteratively improves the frontier of normalized score percentiles while redefining the metric itself, yielding an open-ended learning process.
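Generation-to-generation distillation is commonly implemented by regularizing the new student's policy toward the previous generation's best agent (the teacher) with a KL term alongside the usual RL objective. The toy function below computes only that KL part for a batch of action distributions; the RL loss and the network itself are omitted, and the function name is an illustrative assumption.

```python
import numpy as np

def distillation_kl(teacher_probs, student_probs, eps=1e-8):
    """Mean KL(teacher || student) over a batch of action distributions."""
    t = np.asarray(teacher_probs, dtype=float)
    s = np.asarray(student_probs, dtype=float)
    kl = np.sum(t * (np.log(t + eps) - np.log(s + eps)), axis=-1)
    return float(np.mean(kl))

# Identical policies give zero loss; diverging policies give a positive loss.
same = distillation_kl([[0.5, 0.5]], [[0.5, 0.5]])
diff = distillation_kl([[0.9, 0.1]], [[0.5, 0.5]])
```

Minimizing this term pulls the student toward the teacher's behavior early in training, after which the student can surpass the teacher on its own RL signal.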
When this environment space was combined with such a learning procedure, the agents exhibited broad, "zero-shot" ability across the held-out evaluation space, failing only on the few tasks that are impossible even for humans.
The results demonstrate the possibility of training agents with broad capabilities across a wide range of tasks without relying on human demonstrations. Their findings show that each component of this learning process gives an advantage, with dynamic task generation being especially significant for learning compared to uniform sampling from task space.
The team hopes that their study paves the way for future research into developing more adaptive agents capable of transferring to more complicated tasks.