One challenge in reinforcement learning (RL) is that learning control policies directly from high-dimensional observations, such as images, is difficult. The last three years have seen major progress, with many new methods developed for improved sample efficiency and better low-dimensional representations. Approaches such as autoencoders, variational inference, contrastive learning, self-prediction, and data augmentation all offer hope for overcoming this obstacle in RL research.
However, current model-free methods are still limited in three ways. First, they cannot solve the more challenging visual control problems, such as quadruped and humanoid locomotion. Second, they often require significant computational resources, i.e., lengthy training on distributed multi-GPU infrastructure. Third, it is unclear how different design choices affect overall system performance, making outcomes hard to predict.
Facebook AI Research unveiled DrQ-v2, a simple model-free algorithm that builds on the idea of using data augmentation to solve hard visual control problems. DrQ-v2 is the first model-free method to solve complex humanoid locomotion tasks directly from pixels, and it achieves significant improvements in sample efficiency across tasks from the DeepMind Control Suite. It is also computationally efficient: it solves most tasks in the suite within 8 hours on a single GPU.
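The data augmentation that DrQ-style methods rely on is a simple random shift: pad the image observation by a few pixels on each side, then crop it back to the original size at a random offset. Here is a minimal NumPy sketch of that idea (the function name, padding width, and `edge` padding mode are illustrative choices, not the exact implementation from the paper):

```python
import numpy as np

def random_shift(imgs, pad=4):
    """Random-shift augmentation: pad each image by `pad` pixels on every
    side (replicating border pixels), then crop back to the original size
    at a random offset. `imgs` has shape (N, C, H, W)."""
    n, c, h, w = imgs.shape
    padded = np.pad(imgs, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode="edge")
    out = np.empty_like(imgs)
    for i in range(n):
        top = np.random.randint(0, 2 * pad + 1)
        left = np.random.randint(0, 2 * pad + 1)
        out[i] = padded[i, :, top:top + h, left:left + w]
    return out
```

Applied to each minibatch sampled from the replay buffer, this cheap augmentation regularizes the image encoder without any auxiliary loss. (The actual PyTorch implementation applies the shift on the GPU with bilinear interpolation.)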
Recently, a model-based method called DreamerV2 was shown to solve visual continuous control problems, including humanoid locomotion from pixels. DrQ-v2 matches DreamerV2 in sample efficiency and performance while training roughly four times faster in wall-clock time. This makes DrQ-v2 a more accessible approach for research on these tasks, while reopening the question of whether model-free or model-based methods are better suited to solving them.
DrQ-v2 is a new model-free, off-policy algorithm that builds upon DrQ, an actor-critic approach. The main changes over DrQ are:
- Switch the base RL learner from SAC to DDPG.
- Incorporate n-step returns to estimate TD error.
- Introduce a decaying schedule for exploration noise.
- Make the implementation 3.5 times faster.
- Find better hyper-parameters.
More Details in the Paper: https://arxiv.org/pdf/2107.09645.pdf
PyTorch implementation of DrQ-v2 (Github): https://github.com/facebookresearch/drqv2