Researchers at ETH Zurich and UC Berkeley Propose Deep Reward Learning by Simulating The Past (Deep RLSP)


In Reinforcement Learning (RL), task specification is usually handled by experts. Learning from demonstrations and preferences requires extensive human interaction, and hand-coded reward functions are notoriously difficult to specify correctly.

In a new research paper, a research team from ETH Zurich and UC Berkeley has proposed ‘Deep Reward Learning by Simulating the Past’ (Deep RLSP). The algorithm represents rewards directly as a linear combination of features learned through self-supervised representation learning. It enables agents to simulate human behavior “backward in time” to infer what people must have done.

Deep Reward Learning by Simulating the Past (Deep RLSP) algorithm

The research team starts from the premise that a given environment state has already been optimized toward a user’s preferences. Instead of manually specifying the agent’s task, they attempt to simulate the past trajectories that led to the observed state. The method starts at that state and simulates backward in time, deriving a gradient that is amenable to estimation. To perform the backward simulation, it learns an inverse policy and an inverse dynamics model using supervised learning.

The RL environment is modeled as a stochastic finite state machine with inputs and outputs, which can be viewed as a finite-horizon Markov Decision Process (MDP) with a set of states S and a set of actions A. Given a state and an action, the transition function T determines the distribution over next states, and the reward function r defines the agent’s objective. A policy π decides how to choose actions given a state. The goal is to find a policy π∗ that maximizes the expected cumulative reward.
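The MDP ingredients above can be made concrete with a small sketch. Everything here is a toy illustration (random transitions and rewards, not from the paper): finite-horizon value iteration recovers a policy π∗ that maximizes expected cumulative reward under T and r.

```python
import numpy as np

# Toy finite-horizon MDP (hypothetical, for illustration only):
# 3 states, 2 actions, horizon-10 planning problem.
n_states, n_actions, horizon = 3, 2, 10
rng = np.random.default_rng(0)

# T[s, a] is a probability distribution over next states s'.
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
# r[s, a] is the reward for taking action a in state s.
r = rng.uniform(size=(n_states, n_actions))

# Finite-horizon value iteration: backward recursion over time steps.
V = np.zeros(n_states)
policy = np.zeros((horizon, n_states), dtype=int)
for t in reversed(range(horizon)):
    Q = r + T @ V            # Q[s, a] = r(s, a) + E_{s'~T(s,a)}[V(s')]
    policy[t] = Q.argmax(axis=1)   # greedy action per state at step t
    V = Q.max(axis=1)
```

The backward recursion mirrors the standard MDP formulation: the optimal policy can be time-dependent in the finite-horizon case, hence one action table per step.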

The researchers propose that if the future can be sampled by rolling out forward in time, they should sample the past by rolling out backward in time. They can then learn the inverse policy and the inverse dynamics using supervised learning and approximate the gradient’s expectation.
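The backward-rollout idea can be sketched in miniature. This is a hedged stand-in, not the paper’s implementation: the chain environment, the counts-based “supervised learning” of the inverse models, and the `sample_past` rollout are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 4, 2

# Hypothetical toy chain environment: action 1 moves right, action 0 left.
def step(s, a):
    return min(max(s + (1 if a == 1 else -1), 0), n_states - 1)

# Collect (s, a, s') transitions from random forward rollouts.
transitions = []
s = 0
for _ in range(5000):
    a = int(rng.integers(n_actions))
    s_next = step(s, a)
    transitions.append((s, a, s_next))
    s = s_next

# "Supervised learning" here is maximum-likelihood counting (with smoothing):
# inverse policy  p(a | s')  and inverse dynamics  p(s | s', a).
inv_policy = np.ones((n_states, n_actions))
inv_dynamics = np.ones((n_states, n_actions, n_states))
for s, a, s_next in transitions:
    inv_policy[s_next, a] += 1
    inv_dynamics[s_next, a, s] += 1
inv_policy /= inv_policy.sum(axis=1, keepdims=True)
inv_dynamics /= inv_dynamics.sum(axis=2, keepdims=True)

# Roll out backward from an observed state: sample what must have happened.
def sample_past(s_obs, length):
    traj = [s_obs]
    s_next = s_obs
    for _ in range(length):
        a = rng.choice(n_actions, p=inv_policy[s_next])
        s_prev = rng.choice(n_states, p=inv_dynamics[s_next, a])
        traj.append(int(s_prev))
        s_next = int(s_prev)
    return traj[::-1]  # chronological order, ending at the observed state
```

Sampled past trajectories like these are what the gradient estimate is averaged over; in Deep RLSP the tabular counts are replaced by learned neural models.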

The gradient, however, depends on a feature function. The team removes this assumption by learning the feature function through self-supervised learning. In fully observable environments, a variational autoencoder learns the feature function, directly encoding states into a latent feature representation. In partially observable environments, the researchers apply recurrent state-space models (SSMs) instead. Together, these components constitute the Deep RLSP algorithm.
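The role of the learned feature function can be sketched with a much simpler stand-in. The paper uses a variational autoencoder; here a linear autoencoder (PCA via SVD) plays the part of the self-supervised feature map φ(s), and the reward is then a linear combination of those features, r(s) = w·φ(s), as in Deep RLSP. The data and the choice of weight vector are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
states = rng.normal(size=(1000, 8))   # hypothetical raw state vectors
states[:, 0] *= 5.0                   # one direction dominates the variance

# "Train" a linear encoder: top-k principal directions of the state data.
k = 2
mean = states.mean(axis=0)
_, _, vt = np.linalg.svd(states - mean, full_matrices=False)
encoder = vt[:k].T                    # columns are the learned feature directions

def phi(s):
    """Feature function: project (centered) states into the latent space."""
    return (s - mean) @ encoder

# Reward as a linear combination of learned features: r(s) = w . phi(s).
w = phi(states[0])                    # illustrative weights from one observed state
rewards = phi(states) @ w
```

A VAE replaces the linear projection with a learned nonlinear encoder, but the downstream use is the same: states go in, a compact feature vector comes out, and the reward stays linear in those features.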


The research team employed the MuJoCo (Multi-Joint dynamics with Contact) physics simulator in their experiments to demonstrate that Deep RLSP scales to high-dimensional, continuous, and complex environments. They selected three environments from the OpenAI Gym and compared Deep RLSP against a GAIL (Generative Adversarial Imitation Learning) baseline.

Figure: Average returns achieved by the policies learned through various methods, for different numbers of input states.

The results show that although GAIL was provided with both states and actions as input, it learned a good policy only in the simplest environments. Deep RLSP, meanwhile, achieved reasonable behavior across all environments with only states as input.


The study demonstrates that learning useful policies with neural networks does not necessarily require manual reward specification. Deep RLSP frees researchers from this burden by extracting the information already present in an environment’s current state.




