Most reinforcement learning algorithms rely on a reward function to train agents in an unknown environment: the agent receives a reward when its action leads to a good outcome. But defining a reward is difficult for tasks that lack clear objectives, for example, deciding whether a room is clean or whether a door is sufficiently shut. In such scenarios, the user cannot describe the task in words or numbers, but can readily provide examples of what the world would look like if the task were solved.
Google AI therefore proposes an alternative, example-based control, which teaches agents to solve new tasks by providing examples of success. The method, termed recursive classification of examples (RCE), does not rely on hand-crafted reward functions, distance functions, or features; it uses only the examples of success. On simulated robotics tasks, RCE outperforms prior approaches based on imitation learning.
Fig-1: To teach a robot to hammer a nail into a wall, most reinforcement learning algorithms require a user-defined reward function.
Fig-2: The example-based control method uses examples of what the world looks like when a task is completed to teach the robot to solve the task, e.g., examples where the nail is already hammered into the wall.
How RCE Works
This might seem similar to supervised learning, where we have labeled input-output pairs as training data. Here, however, the only data we have are success examples. The system has no prior knowledge of which states and actions lead to success, and even the experience it gains by interacting with the environment cannot be directly labeled as leading to success or failure.
First, success examples are required. Second, even though we don't know whether an arbitrary state-action pair will lead to success, we can still estimate the likelihood that the task would be solved if the agent started at the next state. If the next state is likely to lead to future success, then the current state-action pair is also likely to lead to future success. This is a recursive classification: labels are inferred from the classifier's own predictions at the next time step, which in turn depend on the predictions at the step after that.
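The recursive idea above can be illustrated with a minimal tabular sketch (my own toy illustration, not Google's implementation): states sit on a short chain, the agent always moves right, and the final state is the lone success example. Every other state is labeled with a discounted copy of the classifier's prediction at its next state.

```python
import numpy as np

# Toy chain environment: states 0..4, moving right each step.
# State 4 is the user-provided success example. C[s] approximates
# "how likely is success if the agent continues from state s?"
n_states, gamma = 5, 0.9
success_states = {4}
C = np.zeros(n_states)

for _ in range(200):  # iterate until the recursion converges
    for s in range(n_states):
        if s in success_states:
            C[s] = 1.0  # success examples are labeled 1
        else:
            s_next = min(s + 1, n_states - 1)
            # Recursive step: the label for the current state comes from
            # the classifier's own prediction at the next state.
            C[s] = gamma * C[s_next]

print(np.round(C, 3))  # states nearer the success example score higher
```

After convergence, C[s] equals gamma raised to the distance from the success state, so the scores rise monotonically toward the goal, exactly the gradient an agent can follow without any reward function.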
This approach resembles existing temporal-difference methods, such as Q-learning and successor features. The key difference is that, unlike those methods, RCE does not require a reward function.
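The contrast can be made concrete with two hypothetical one-step bootstrap targets (simplified sketches for illustration, not the exact losses from the paper):

```python
gamma = 0.9  # discount factor

def q_learning_target(reward, q_next_max):
    # Q-learning bootstraps from an external reward signal plus the
    # estimated value of the best next action.
    return reward + gamma * q_next_max

def rce_style_target(c_next):
    # An RCE-style update bootstraps only from the classifier's own
    # prediction at the next state -- there is no reward term at all.
    return gamma * c_next

print(q_learning_target(1.0, 0.5))  # reward-driven target
print(rce_style_target(0.5))        # example-driven target
```

Both updates propagate information backward through time; RCE simply replaces the reward term with labels derived from the success examples.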
The RCE method was evaluated on several challenging robotic manipulation tasks, such as having a robotic hand pick up a hammer and drive a nail into a board. Prior methods required a complex reward function for this task, whereas RCE needed only a few success examples.
Fig-3: Comparison of RCE with the prior methods.
Related videos: https://ben-eysenbach.github.io/rce/