Exploring Self-Supervised Policy Adaptation To Continue Training After Deployment Without Using Any Rewards

Humans possess a remarkable ability to adapt, generalize their knowledge, and apply their experiences to new situations. At the same time, building an intelligent system with common sense and the ability to quickly adapt to new conditions is a long-standing problem in artificial intelligence. Learning perception and behavioral policies end-to-end with deep Reinforcement Learning (RL) has achieved impressive results, but it is now widely understood that such approaches fail to generalize to even subtle changes in the environment, changes that humans adapt to quickly. As a result, RL has shown limited success beyond the environment in which it was initially trained, which presents a significant challenge to deploying RL policies in our diverse and unstructured real world.

Domain randomization has been widely used to improve applications of Reinforcement Learning. Researchers introduce randomization into the training environment so as to improve the generalization ability of policies: policies that are invariant to certain factors of variation can then be learned more easily.
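As a rough illustration of the idea, domain randomization resamples environment parameters (physical and visual) at the start of every training episode, so the policy never overfits to one fixed setting. The parameter names and ranges below are purely hypothetical:

```python
import random

def sample_env_params(rng):
    """Sample a fresh set of environment parameters for one episode."""
    return {
        "friction": rng.uniform(0.5, 1.5),  # physical variation
        "mass":     rng.uniform(0.8, 1.2),
        "light":    rng.uniform(0.3, 1.0),  # visual variation
    }

rng = random.Random(42)
for episode in range(3):
    # A real setup would rebuild the simulator with these parameters
    # before collecting the episode's experience.
    params = sample_env_params(rng)
    print(f"episode {episode}: {params}")
```

The cost of this approach, as the article notes, is that the policy must be made robust to every variation anticipated at training time, which is exactly what adaptation after deployment tries to avoid.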

Thus the question arises: instead of learning a policy robust to all possible environmental changes, can we adapt a pre-trained policy to the new environment through interaction?

Policy Adaptation

Fine-tuning parameters using a reward signal is a simple way to adapt a policy to new environments. In the real world, however, obtaining a reward signal often requires human feedback or careful engineering, neither of which is a scalable solution.

In recent work from the BAIR lab, the team shows that it is possible to adapt a pre-trained policy to a new environment without any reward signal or human supervision. When a policy trained in simulation is deployed in the real world, there are often differences in dynamics caused by imperfections in the simulation.
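A minimal sketch of how such reward-free adaptation can work: the policy shares a feature encoder with a self-supervised head (the paper uses an inverse-dynamics prediction task), and at deployment only the self-supervised loss is used to keep updating the encoder while the policy head stays frozen. All names, dimensions, and the toy rollout below are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn

obs_dim, act_dim, feat_dim = 8, 2, 32

# Shared encoder, a frozen policy head, and a self-supervised head that
# predicts the action taken between two consecutive observations.
encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
policy_head = nn.Linear(feat_dim, act_dim)
inverse_head = nn.Sequential(
    nn.Linear(2 * feat_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))

# Only the encoder and self-supervised head are updated at deployment;
# the optimizer never touches policy_head.
optim = torch.optim.Adam(
    list(encoder.parameters()) + list(inverse_head.parameters()), lr=1e-4)

def act(obs):
    """Select an action with the (frozen) policy head."""
    with torch.no_grad():
        return policy_head(encoder(obs))

def adapt_step(obs, action, next_obs):
    """One gradient step on the inverse-dynamics loss; no reward needed."""
    z = torch.cat([encoder(obs), encoder(next_obs)], dim=-1)
    loss = nn.functional.mse_loss(inverse_head(z), action)
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()

# Toy deployment loop with random transitions standing in for env.step(a).
obs = torch.randn(1, obs_dim)
for _ in range(10):
    a = act(obs)
    next_obs = torch.randn(1, obs_dim)
    adapt_step(obs, a, next_obs)
    obs = next_obs
```

The key design choice is that the self-supervised signal (predicting one's own actions from observation pairs) is always available from interaction alone, so the encoder can keep tracking the shifted observations even when no reward exists.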


The team demonstrated the effectiveness of self-supervised Policy Adaptation during Deployment (PAD) by training policies for robotic manipulation tasks and adapting them to the real world. Generalization was evaluated on a real robot environment and in the following two challenging settings:

  1. A tablecloth with increased friction, and 
  2. Continuously moving disco lights. 

Simulations provide a good platform for a more comprehensive evaluation of RL algorithms. Alongside PAD, the team released the DMControl Generalization Benchmark, a new benchmark for RL generalization based on the DeepMind Control Suite: agents are trained in a fixed environment and deployed in a new one. Whereas previous work tackled generalization in Reinforcement Learning through randomization, the team finds that adapting policies through a self-supervised objective is a promising alternative to domain randomization when the target environment is genuinely unknown.
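The train-fixed / deploy-shifted protocol the benchmark follows can be illustrated with a toy example. Here a 1-D "environment" stands in for a control task, a `shift` parameter stands in for an unseen visual or dynamics change, and random search stands in for RL training; all of this is hypothetical and only meant to show why a policy tuned to one fixed environment degrades under shift:

```python
import random

def make_env(shift=0.0):
    """Toy 1-D environment: reward peaks at action 0.5 + shift."""
    def step(action):
        return -abs(action - (0.5 + shift))
    return step

def train(env, trials=200):
    """Pick the best constant action by random search (stand-in for RL)."""
    return max((random.uniform(0, 1) for _ in range(trials)), key=env)

random.seed(0)
train_env = make_env(shift=0.0)      # agents train in one fixed environment...
agent_action = train(train_env)

for shift in (0.0, 0.2, 0.4):        # ...and are deployed under unseen shifts
    reward = make_env(shift)(agent_action)
    print(f"shift={shift:.1f}  reward={reward:.3f}")
```

A benchmark built this way measures exactly the gap PAD targets: performance in the fixed training environment versus performance after the environment changes at deployment.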

Looking ahead, the authors envision agents capable of continuously learning, adapting to their surroundings, and learning both from explicit human feedback and from unsupervised interaction.

Source: https://bair.berkeley.edu/blog/2021/02/25/ss-adaptation/

Project: https://nicklashansen.github.io/PAD/

Paper: https://arxiv.org/abs/2007.04309

Code: https://github.com/nicklashansen/policy-adaptation-during-deployment

Consultant Intern: He is currently pursuing his third year of a B.Tech in Mechanical Engineering at the Indian Institute of Technology (IIT), Goa. He is motivated by his vision to bring about remarkable changes in society through his knowledge and experience. A machine learning enthusiast with a keen interest in robotics, he keeps up to date with the latest advancements in artificial intelligence and deep learning.
