Recently, researchers at DeepMind have proposed manipulation-independent representations (MIR) to support successful imitation of behaviors demonstrated by previously unseen manipulator morphologies using only visual observations.
Imitation learning is a powerful tool for robotic learning tasks where specifying a reinforcement learning (RL) reward is not possible or where the exploration problem is challenging.
Imitation derives a policy from collected first-person state-action trajectories. Humans and other animals, however, imitate by observing behavior, understanding its perceived effect on the state of the environment, and working out what actions their own bodies can perform to reach a similar outcome.
Imitation learning implicitly gives a robotic agent prior information about the world by having it mimic human behavior. These methods are typically framed as behavioral cloning or inverse reinforcement learning. Such techniques can make robots more efficient and improve their capacity to interact with humans while decreasing the cost of learning new skills.
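Behavioral cloning, the simpler of the two framings above, reduces imitation to supervised learning on demonstrated state-action pairs. A minimal sketch of that idea (illustrative only; the linear policy and least-squares fit here are assumptions for the example, not the paper's method):

```python
import numpy as np

rng = np.random.default_rng(0)
expert_W = rng.normal(size=(2, 4))      # unknown expert mapping: states -> actions
states = rng.normal(size=(100, 4))      # demonstrated states
actions = states @ expert_W.T           # demonstrated expert actions

# Behavioral cloning: fit a policy by regressing demonstrated actions on states.
W_bc, *_ = np.linalg.lstsq(states, actions, rcond=None)

def policy(s):
    """Cloned policy: predict the action the expert would take in state s."""
    return s @ W_bc

test_state = rng.normal(size=4)
cloned_action = policy(test_state)
```

With noiseless demonstrations and a well-conditioned dataset, the cloned policy recovers the expert mapping exactly; with real data it only matches the expert on states near the demonstrations, which is why the DeepMind work pursues imitation from observation instead.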
The team was inspired by the idea of a robotic manipulator imitating any visually-demonstrated behavior of arbitrary complexity. In this work, the researchers explore the possibility of third-person visual imitation of manipulation trajectories, only from vision and without access to actions.
Figure: the proposed Manipulation-Independent Representations (MIR) approach.
First, the team demonstrates how to imitate unconstrained manipulation trajectories executed by previously unseen manipulators, learning pixel-based representations that provide the information a cross-embodiment trajectory-tracking agent needs.
The introduced imitation method has two primary phases:
- Learning a MIR space.
- Cross-embodiment visual imitation through RL using the pre-trained MIR space.
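At a high level, the second phase turns imitation into trajectory tracking: the agent is rewarded for staying close to the demonstration in the learned MIR space. A minimal sketch of that idea, where the `encode` function is a hypothetical stand-in for the trained pixel encoder (a random linear projection here, purely for illustration):

```python
import numpy as np

def encode(obs, W):
    """Hypothetical pixel encoder: project a flattened image into a unit-norm MIR vector."""
    z = W @ obs.ravel()
    return z / (np.linalg.norm(z) + 1e-8)

def tracking_reward(z_agent, z_demo):
    """Phase 2: reward the RL agent for matching the demonstration frame in MIR space."""
    return -np.linalg.norm(z_agent - z_demo)

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 64 * 64 * 3))   # stands in for the pre-trained MIR encoder

demo_frame = rng.random((64, 64, 3))     # demonstrator observation (e.g., a human hand)
agent_frame = demo_frame + 0.01 * rng.random((64, 64, 3))  # robot in a similar state

r = tracking_reward(encode(agent_frame, W), encode(demo_frame, W))
```

Because the encoder is manipulation-independent, the reward depends on the effect on the environment rather than on the appearance of the particular manipulator, which is what allows tracking across embodiments.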
To support imitation from unseen manipulator trajectories, the team distinguished three main properties of the desired MIR space:
- Cross-domain alignment: The researchers trained a shared embedding space designed to close the significant domain gaps between trajectories from different manipulators.
- Temporal smoothness: The team proposed a Temporally-Smooth Contrastive Network (TSCN) to enforce temporal smoothness. Its loss function drives the learned representations to be temporally smooth and includes negative pairs to improve alignment quality.
- Suitability for RL: The team applied several levels of domain randomization in their simulated environments to make the representation actionable. This helped the representation focus on the actual change in the environment while capturing the rough position and properties of a manipulator.
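The article does not reproduce the TSCN objective itself; a generic sketch of what a temporally-smooth contrastive loss can look like, assuming an InfoNCE-style contrastive term plus a penalty on jumps between consecutive frame embeddings (the temperature `temp` and weight `lam` are made-up hyperparameters, not values from the paper):

```python
import numpy as np

def info_nce(z_anchor, z_pos, z_negs, temp=0.1):
    """Contrastive term: pull the aligned cross-domain pair together, push negatives apart."""
    sims = np.concatenate([[z_anchor @ z_pos], z_negs @ z_anchor]) / temp
    sims -= sims.max()                                  # numerical stability
    return -np.log(np.exp(sims[0]) / np.exp(sims).sum())

def smoothness_penalty(traj_z):
    """Temporal-smoothness term: penalize jumps between consecutive frame embeddings."""
    return np.mean(np.sum(np.diff(traj_z, axis=0) ** 2, axis=1))

def tscn_loss(z_anchor, z_pos, z_negs, traj_z, lam=0.5):
    """Combined objective: contrastive alignment plus temporal smoothness."""
    return info_nce(z_anchor, z_pos, z_negs) + lam * smoothness_penalty(traj_z)

# Toy usage with unit-norm embeddings.
rng = np.random.default_rng(1)
z = rng.normal(size=8)
z /= np.linalg.norm(z)
negs = rng.normal(size=(4, 8))
negs /= np.linalg.norm(negs, axis=1, keepdims=True)
smooth_traj = np.stack([z, z, z])        # a perfectly smooth embedded trajectory
```

The smoothness term rewards trajectories whose embeddings drift gradually rather than jumping between frames, which matches the temporal-smoothness property the team wanted from the MIR space.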
The team conducted experiments on eight environments and scenarios (Canonical Simulation, Invisible Arm, Arm Randomized, Domain Randomized, Jaco Hand, Real Robot, Pick-up Stick, and Human Hand) to evaluate MIR's performance at imitating unconstrained manipulation trajectories from unseen manipulators. They also compared MIR with baseline methods such as naive Goal-Conditioned Policies (GCP) and Temporal Distance.
In the evaluations, MIR achieved the best performance across all test domains. It significantly boosted stacking success with the Jaco Hand and imitated the simulated Jaco Hand and Invisible Arm on lifting with a 100 percent success rate.
This study demonstrates the importance of representation learning in visual imitation and validates manipulation-independent representations (MIR) for successful cross-embodiment visual imitation.