Google AI, in collaboration with DeepMind and the University of Toronto, has recently introduced DreamerV2, the first world-model-based Reinforcement Learning (RL) agent to attain human-level performance on the Atari benchmark. It is the second generation of the Dreamer agent, which learns behaviors entirely within the latent space of a world model trained from pixels. (World models can be trained in an unsupervised manner to learn a compressed spatial and temporal representation of the environment.)
DreamerV2 accurately predicts future task rewards even when those rewards did not influence its representations, relying mostly on general information learned from the images. Using a single GPU, DreamerV2 outperforms top model-free algorithms.
Recent Advances in Deep Reinforcement Learning
Deep reinforcement learning enables AI agents to improve their decisions over time. Recent advances have allowed model-based methods to learn world models from image inputs and use them for planning. World models can learn from fewer interactions, facilitate generalization from offline data, and enable transfer across multiple tasks.
Despite these advantages, existing world models have not been accurate enough to compete with the top model-free approaches on the most competitive RL benchmarks.
Previous work has concentrated on task-specific planning methods that learn by predicting sums of expected task rewards. Because these approaches are task-specific, it is unclear how well they would generalize to new tasks or learn from unsupervised datasets.
DreamerV2 learns a world model and uses it to train actor-critic behaviors purely from predicted trajectories. The world model automatically learns to compute compact representations of its images, discovering practical concepts such as object positions, and learns how these concepts change in response to different actions. This lets the agent abstract away irrelevant features of its images, enabling massively parallel predictions on a single GPU. DreamerV2 predicts about 468 billion compact states while learning its behavior over 200 million environment steps.
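The idea of training behaviors on predicted trajectories can be sketched as follows. This is a minimal NumPy illustration, not DreamerV2's actual networks: the linear `dynamics` and `reward` maps, the dimensions, and the random policy are all hypothetical stand-ins for learned models, and they show only why latent imagination is cheap to parallelize (no images are ever generated).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only.
STATE_DIM, ACTION_DIM = 8, 3
HORIZON, BATCH = 15, 4          # imagined rollout length, parallel trajectories
GAMMA = 0.99                    # discount factor

# Toy stand-ins for the learned world-model components (random linear maps).
W_dyn = rng.normal(scale=0.1, size=(STATE_DIM + ACTION_DIM, STATE_DIM))
w_reward = rng.normal(scale=0.1, size=STATE_DIM)

def dynamics(state, action):
    """Predict the next compact latent state from state and action."""
    return np.tanh(np.concatenate([state, action], axis=-1) @ W_dyn)

def reward(state):
    """Predict the task reward for a latent state."""
    return state @ w_reward

def imagine(start_states, policy, horizon=HORIZON):
    """Roll out imagined trajectories entirely in latent space;
    no decoding back to images is needed."""
    states, rewards = [start_states], []
    s = start_states
    for _ in range(horizon):
        a = policy(s)
        s = dynamics(s, a)
        states.append(s)
        rewards.append(reward(s))
    return np.stack(states), np.stack(rewards)

def discounted_returns(rewards, gamma=GAMMA):
    """Discounted returns of imagined rewards, the kind of quantity
    actor-critic learning would use as a training target."""
    returns = np.zeros_like(rewards)
    running = np.zeros(rewards.shape[1])
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

random_policy = lambda s: rng.normal(size=(s.shape[0], ACTION_DIM))
start = rng.normal(size=(BATCH, STATE_DIM))
states, rewards = imagine(start, random_policy)
returns = discounted_returns(rewards)
```

Because every step operates on small latent vectors rather than images, thousands of such rollouts can run in parallel on one GPU.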
DreamerV2 builds on the Recurrent State-Space Model (RSSM), a deep planning network for reinforcement learning. An encoder turns each image into a stochastic representation that is incorporated into the recurrent state of the world model. Because the representations are stochastic, they extract only the information necessary for making predictions, which makes the agent robust to unseen images. A decoder reconstructs the corresponding image from each state so that the model learns general representations. A small reward network is trained to rank outcomes during planning. Additionally, a predictor learns to guess the stochastic representations without access to the images from which they were computed, enabling planning without generating images.
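The components described above can be laid out schematically. In this sketch the random projections stand in for learned neural networks, and all sizes are invented for illustration; what matters is the structure: a recurrent update, a posterior that sees the image, a prior that does not, plus decoder and reward heads.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes; every W_* below is a toy stand-in for a trained network.
OBS_DIM, H_DIM, Z_DIM, A_DIM = 16, 12, 8, 2

W_enc = rng.normal(scale=0.1, size=(H_DIM + OBS_DIM, Z_DIM))        # encoder: q(z | h, image)
W_prior = rng.normal(scale=0.1, size=(H_DIM, Z_DIM))                # predictor: p(z | h), no image
W_gru = rng.normal(scale=0.1, size=(H_DIM + Z_DIM + A_DIM, H_DIM))  # recurrent state update
W_dec = rng.normal(scale=0.1, size=(H_DIM + Z_DIM, OBS_DIM))        # decoder: reconstructs the image
w_rew = rng.normal(scale=0.1, size=H_DIM + Z_DIM)                   # small reward head

def rssm_step(h, z, a, obs=None):
    """One RSSM step: update the recurrent state, then produce the stochastic
    representation from the image (posterior) or from h alone (prior)."""
    h = np.tanh(np.concatenate([h, z, a]) @ W_gru)
    if obs is not None:                       # training: the encoder sees the image
        z = np.tanh(np.concatenate([h, obs]) @ W_enc)
    else:                                     # planning: the predictor guesses z
        z = np.tanh(h @ W_prior)
    feat = np.concatenate([h, z])
    recon = feat @ W_dec                      # image reconstruction (training signal)
    r = feat @ w_rew                          # predicted reward for ranking outcomes
    return h, z, recon, r

h, z = np.zeros(H_DIM), np.zeros(Z_DIM)
obs, action = rng.normal(size=OBS_DIM), rng.normal(size=A_DIM)
h, z, recon, r = rssm_step(h, z, action, obs)   # observed step (image available)
h, z, recon, r = rssm_step(h, z, action)        # imagined step (no image needed)
```

The `obs=None` branch is what makes planning without generating images possible: once trained, the prior predictor replaces the encoder entirely.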
DreamerV2 introduces two new techniques to RSSM, leading to a more accurate world model for learning successful policies.
- The first technique represents each image with multiple categorical variables. This enables more accurate predictions of future representations and drives the world model to reason about the world in terms of discrete concepts. The encoder converts each image into 32 distributions over 32 classes each, whose meanings are determined automatically as the world model learns. The sampled one-hot vectors are concatenated into a sparse representation that is passed on to the recurrent state. Representing images with categorical variables lets the predictor accurately learn the distribution over the one-hot vectors of the possible next images.
- The second technique is KL balancing. Earlier world models regularized the amount of information extracted from each image, and thereby facilitated generalization, by encouraging accurate reconstructions while keeping the stochastic representations close to their predictions. Since this objective is optimized end-to-end, the representations and their predictions can be made more similar by moving either one toward the other. However, pulling the representations toward their predictions is problematic when the predictor is not yet accurate. KL balancing lets the predictions move faster toward the representations than vice versa, resulting in more accurate predictions.
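The first technique, the 32×32 categorical representation, can be sketched concretely. The sampling below is a plain NumPy illustration (in practice gradients flow through the sampling step via a straight-through estimator, which is omitted here), and the encoder output is faked with random logits:

```python
import numpy as np

rng = np.random.default_rng(0)

# DreamerV2 represents each image as 32 categorical distributions over
# 32 classes; the sampled one-hot vectors are concatenated.
N_VARS, N_CLASSES = 32, 32

def sample_categorical_latent(logits, rng):
    """Sample a one-hot vector from each per-variable categorical
    distribution and concatenate them into one sparse representation."""
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)           # softmax per variable
    one_hots = np.zeros_like(probs)
    for i, p in enumerate(probs):
        one_hots[i, rng.choice(N_CLASSES, p=p)] = 1.0    # one class per variable
    return one_hots.reshape(-1)                          # 32*32 = 1024 dims, 32 ones

logits = rng.normal(size=(N_VARS, N_CLASSES))            # stand-in for encoder output
z = sample_categorical_latent(logits, rng)
```

The resulting vector is sparse by construction: exactly 32 of its 1024 entries are 1, which is what makes predicting the distribution over next-step one-hot vectors tractable for the predictor.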
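The second technique, KL balancing, amounts to splitting the KL term between the encoder's representation (posterior) and the predictor's guess (prior) into two differently weighted copies, each with a stop-gradient on one side. The sketch below uses an identity `stop_gradient` since plain NumPy has no autodiff, and the weight `ALPHA = 0.8` is an assumption of this example rather than a quote from the source:

```python
import numpy as np

ALPHA = 0.8   # assumed balancing weight; larger means the prior adapts faster

def stop_gradient(x):
    """Identity here; in an autodiff framework this would detach x so
    no gradient flows through it."""
    return x

def kl_categorical(p, q):
    """KL(p || q) for categorical distributions (last axis sums to 1)."""
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

def balanced_kl_loss(posterior, prior):
    """KL balancing: the same KL appears twice, but the stop-gradients
    apportion the learning signal so the prior (predictions) moves toward
    the posterior (representations) faster than vice versa."""
    prior_term = kl_categorical(stop_gradient(posterior), prior)      # trains the prior
    posterior_term = kl_categorical(posterior, stop_gradient(prior))  # regularizes the posterior
    return ALPHA * prior_term + (1 - ALPHA) * posterior_term

posterior = np.array([0.7, 0.2, 0.1])   # example distributions over 3 classes
prior = np.array([0.4, 0.4, 0.2])
loss = balanced_kl_loss(posterior, prior)
```

Numerically the loss equals the ordinary KL (the two weights sum to one); the difference only appears in the gradients, where the posterior term contributes a weaker pull on the representations.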
DreamerV2 outperforms the top world models, achieving 25% of the human record on average across games. The researchers aim to develop methods that achieve human-level performance not just on a few games but on all of them. The result demonstrates that world models are a powerful approach for attaining high performance on reinforcement learning problems. The teams see the success of unsupervised representation learning now beginning to be realized in reinforcement learning through world models.
Atari Benchmark: https://gym.openai.com/envs/#atari