AI Researchers Including Yoshua Bengio, Introduce A Consciousness-Inspired Planning Agent for Model-Based Reinforcement Learning

Human consciousness is an exceptional ability that enables us to generalize and adapt well to new situations and to learn skills and concepts efficiently. When we encounter a new environment, conscious attention focuses on a small subset of its elements, aided by an abstract internal representation of the world. This capacity, known as consciousness in the first sense (C1), extracts the information necessary for the task at hand and ignores irrelevant details, allowing rapid adaptation to the new environment.

Inspired by this aspect of human consciousness, the researchers set out to build an architecture that learns a latent space well suited for planning, one in which attention can focus on a small set of variables at any time. Since reinforcement learning (RL) trains agents in new, complex environments, they aimed to develop an end-to-end architecture that encodes some of these ideas into RL agents.

In their recent paper, researchers from McGill University, UniversitĂ© de MontrĂ©al, DeepMind, and Mila propose a new end-to-end, model-based deep reinforcement learning (MBRL) agent. The MBRL agent dynamically attends to only the appropriate parts of its environmental state to improve its out-of-distribution (OOD) and systematic generalization. 

Most existing MBRL agents rely on reconstruction-based losses to obtain state representations, and these struggle with high-dimensional inputs. The reconstruction objective forces the agent to model every detail of the observation, including irrelevant features that are useless for reaching the goal.

The proposed end-to-end latent-space MBRL agent is jointly shaped by relevant RL signals and does not require reconstructing the observations. It uses tree-search-based Model Predictive Control (MPC) as the planning algorithm and updates the value function only on real data. The MPC-based planning effectively reduces the negative impact of model inaccuracies and exhibits improved OOD generalization.
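The idea of tree-search MPC with a learned model can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: the toy dynamics, reward, and value function stand in for learned components, and the value function is assumed to be trained only on real transitions.

```python
# Minimal sketch of tree-search Model Predictive Control (MPC) in a
# learned latent space. The dynamics model, reward, and value function
# below are toy stand-ins for learned components (hypothetical names).
import itertools

ACTIONS = [0, 1]   # toy discrete action space
HORIZON = 3        # planning depth of the tree search

def model_step(state, action):
    """Hypothetical learned dynamics: returns (next_state, predicted_reward)."""
    next_state = state + (1 if action == 1 else -1)
    reward = 1.0 if next_state == 3 else 0.0
    return next_state, reward

def value_fn(state):
    """Hypothetical value function; in the paper's setup it would be
    updated only from real (non-simulated) transitions."""
    return -0.1 * abs(3 - state)

def plan(state):
    """Exhaustive depth-HORIZON tree search; returns the best first action."""
    best_action, best_return = None, float("-inf")
    for seq in itertools.product(ACTIONS, repeat=HORIZON):
        s, ret = state, 0.0
        for a in seq:
            s, r = model_step(s, a)
            ret += r
        ret += value_fn(s)  # bootstrap the leaf with the value estimate
        if ret > best_return:
            best_return, best_action = ret, seq[0]
    return best_action

print(plan(0))  # from state 0, the search picks action 1 (toward state 3)
```

Because only the first action of the best sequence is executed and planning is repeated at every step, errors in the learned model compound less than when the model is rolled out for long unchecked trajectories.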


The team used an end-to-end training process that enables the representation to be learned online, making it better adapted to non-stationarity in the transition distribution and rewards.

Because set-based representations efficiently capture the dynamics, the team used a set representation to discover sparse interactions between objects in the environment. Set-based representations facilitate generalization across different environment dynamics in multi-task or non-stationary settings, driving the agent to find dynamics that are preserved across environments.


They also introduced an inductive bias that mimics a C1-like ability in the planning agent, thereby improving its generalization. Planning attends only to a relevant subset of environmental elements: simulations and predictions are performed on a bottleneck set that contains all the information essential to the transition. Since only the essential objects participate in the transition, the agent generalizes better both in-distribution and OOD.
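A bottleneck of this kind can be sketched as a simple top-k attention step over a set of object representations. The scoring function, feature sizes, and the name `attention_bottleneck` below are hypothetical illustrations, not the paper's actual architecture.

```python
# Minimal sketch of a C1-style attention bottleneck: score each object
# in the set against the current planning context and keep only the
# top-k most relevant ones for the dynamics model.
import numpy as np

rng = np.random.default_rng(0)

def attention_bottleneck(objects, query, k):
    """Select the k objects most relevant to `query` via dot-product scores."""
    scores = objects @ query            # (n,) relevance score per object
    top_k = np.argsort(scores)[-k:]     # indices of the k highest-scoring objects
    return objects[top_k]               # the bottleneck set passed to the model

objects = rng.normal(size=(8, 4))       # 8 objects, 4-dim features each
query = rng.normal(size=4)              # current planning context
bottleneck = attention_bottleneck(objects, query, k=3)
print(bottleneck.shape)                 # (3, 4): only 3 objects reach the model
```

Restricting the learned dynamics model to this small set is what encourages it to capture sparse, reusable interactions rather than modeling every object in the scene.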


The team compared the proposed model with baseline models such as Unconscious Planning, model-free, Dyna, and WM-CP agents. The results suggest that the proposed Conscious-Planning (CP) agent can plan effectively and improves both sample complexity and OOD generalization.