This article summary is based on the research paper 'Continuous Scene Representations for Embodied AI'. All credit for this research goes to the authors of the paper.
In artificial intelligence, an embodied agent, also known as an interface agent, is an intelligent agent that interacts with the world through a physical body within that environment. To act in a scene, an embodied agent needs a thorough representation of its environment. Ideally, the agent’s perceptual understanding should go beyond detecting object identities and include the relationships between objects, as well as between the agent and its surroundings.
Scene graphs are a candidate for such representations because they provide a concise and explicit description of a scene. In a typical scene graph pipeline, a set of relationship labels is defined and then used to manually annotate the connections between objects in frames. Finally, a model is trained to infer the target graphs. However, once the labels are fixed, the model can only express the relationships that were defined at training time.
In addition, scene graphs in the literature are often static. Ideally, as an agent explores an environment, the scene representation should update on the fly as new objects are discovered. When the agent returns to a previously visited position, it should check whether anything has changed.
Inspired by these insights, researchers at the Allen Institute for AI have developed a scene representation that is better suited to embodied AI. Their approach, Continuous Scene Representations (CSR), constructs a scene representation from egocentric RGB snapshots. To overcome the constraints of standard scene graphs, it describes the relations between objects as continuous vectors and updates the representation on the fly as the agent moves.
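To make the idea of continuous relations concrete, here is a minimal sketch of a graph whose nodes and edges both store feature vectors instead of discrete labels. The class name `ContinuousSceneGraph`, its methods, and the three-dimensional toy vectors are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Vector = List[float]


@dataclass
class ContinuousSceneGraph:
    """Hypothetical container: nodes and edges both hold feature vectors."""

    nodes: Dict[int, Vector] = field(default_factory=dict)
    edges: Dict[Tuple[int, int], Vector] = field(default_factory=dict)

    def add_object(self, embedding: Vector) -> int:
        """Store an object embedding and return its node id."""
        node_id = len(self.nodes)
        self.nodes[node_id] = embedding
        return node_id

    def set_relation(self, src: int, dst: int, embedding: Vector) -> None:
        """Store a relation as a vector rather than a label such as 'on'."""
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both objects must already be in the graph")
        self.edges[(src, dst)] = embedding


# Toy usage with made-up vectors (assumption: real embeddings come from a model).
graph = ContinuousSceneGraph()
mug = graph.add_object([0.9, 0.1, 0.0])
table = graph.add_object([0.1, 0.8, 0.2])
graph.set_relation(mug, table, [0.3, 0.7, 0.5])
```

Because an edge holds a vector, the relation between two objects is not restricted to a vocabulary of labels fixed before training.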
Constructing an interactive scene representation raises several challenges. New objects must be added to the graph, and their relationships with existing objects must be discovered. A successful algorithm should also be able to tell when different views show the same object. To address these challenges, the team proposes using a contrastive loss to learn object-relational embeddings that represent the nodes and edges of the scene representation.
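As a rough illustration of what a contrastive loss does, the sketch below implements an InfoNCE-style loss in plain Python. The article does not say which contrastive formulation the authors use, so the loss form, the function names, the temperature value, and the toy vectors are all assumptions. The loss is small when the query embedding is closer to its positive (another view of the same object) than to the negatives (other objects).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: negative log-softmax of the positive pair's similarity."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(value - top) for value in logits))
    return log_sum - logits[0]


# Toy example: two candidate positives for the same query (made-up vectors).
query = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
loss_aligned = info_nce(query, [0.9, 0.1], negatives)
loss_misaligned = info_nce(query, [0.1, 0.9], negatives)
print(loss_aligned < loss_misaligned)
```

Minimizing a loss of this kind pulls embeddings of the same object seen from different views together and pushes embeddings of different objects apart, which is the property needed to recognize an object when it is seen again.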
The agent keeps track of previously encountered embeddings in its memory. When the agent extracts new embeddings from egocentric observations, it compares them to the memory to see which are new and which are old.
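The comparison against memory can be sketched as a nearest-neighbor lookup with a similarity threshold. This is a simplified stand-in, not the authors' matching procedure: the function `update_memory`, the greedy nearest-neighbor rule, and the threshold of 0.8 are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def update_memory(memory, observed, threshold=0.8):
    """Match each observed embedding against memory, extending memory in place.

    Returns one (memory_index, is_new) pair per observed embedding.
    """
    results = []
    for embedding in observed:
        best_index, best_similarity = -1, -1.0
        for index, stored in enumerate(memory):
            similarity = cosine(embedding, stored)
            if similarity > best_similarity:
                best_index, best_similarity = index, similarity
        if best_index >= 0 and best_similarity >= threshold:
            # Close enough to a stored embedding: treat it as a known object.
            results.append((best_index, False))
        else:
            # No sufficiently similar entry: register a new object.
            memory.append(embedding)
            results.append((len(memory) - 1, True))
    return results


# Toy usage: one known object in memory, two observations (made-up vectors).
memory = [[1.0, 0.0, 0.0]]
matches = update_memory(memory, [[0.95, 0.05, 0.0], [0.0, 1.0, 0.0]])
print(matches)
```

A greedy lookup like this can assign two observations from the same frame to the same memory entry. A one-to-one assignment step would avoid that, at the cost of extra complexity.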
The team evaluates CSR in several experiments. CSR outperforms the baselines at matching query images to their positive images, and the results underscore how difficult this task is. The findings also indicate that CSR picks up on some underlying spatial correlations.
In summary, Continuous Scene Representations (CSR) is a technique for modeling objects and their relationships as continuous-valued feature vectors. The researchers provide an algorithm for updating the CSR as the agent moves, as well as a link between the representation and the agent. They hope the work will spur further research into scene representation learning for other embodied tasks, in addition to learning task-specific representations.