In order to achieve the best possible performance accuracy, it is crucial to understand whether an agent is on the right or preferred track during training. This can be in the form of felicitating an agent with a reward in reinforcement learning or using an evaluation metric to identify the best possible policies. As a result, being able to detect such successful behavior becomes a fundamental prerequisite while training advanced intelligent agents. This is where success detectors come into play, as they can be used to classify whether an agent’s behavior is successful or not. Prior research has shown that developing domain-specific success detectors is comparatively easier than more generalized ones. This is because defining what passes as a success for most real-world tasks is quite challenging as it is often subjective. For instance, a piece of AI-generated artwork might leave some mesmerized, but the same cannot be said for the entire audience.
Over the past years, researchers have come up with different approaches for developing success detectors, one of them being reward modeling with preference data. However, these models have certain drawbacks as they give appreciable performance only for the fixed set of tasks and environment conditions observed in the preference-annotated training data. Thus, to ensure generalization, more annotations are needed to cover a wide range of domains, which is a very labor-intensive task. On the other hand, when it comes to training models that use both vision and language as input, generalizable success detection should ensure that it gives accurate measures in both cases: language and visual variations in the task specified at hand. Existing models were typically trained for fixed conditions and tasks and are thus unable to generalize to such variations. Moreover, adapting to new conditions typically requires collecting a new annotated dataset and re-training the model, which is not always feasible.
Working on this problem statement, a team of researchers at the Alphabet subsidiary, DeepMind, has developed an approach to train robust success detectors that can withstand variations in both language specifications and perceptual conditions. They have achieved this by leveraging large pretrained vision language models like Flamingo and human reward annotations. The study is based on the researcher’s observation that pretraining Flamingo on vast amounts of diverse language and visual data will lead to training more robust success detectors. The researchers claim that their most significant contribution is reformulating the task of generalizable success detection as a visual question answering (VQA) problem, denoted as SuccessVQA. This approach specifies the task at hand as a simple yes/no question and uses a unified architecture that only consists of a short clip defining the state environment and some text describing the desired behavior.
The DeepMind team also demonstrated that fine-tuning Flamingo with human annotations leads to generalizable success detection across three major domains. These include interactive natural language-based agents in a household simulation, real-world robotic manipulation, and in-the-wild egocentric human videos. The universal nature of the SuccessVQA task formulation enables the researchers to use the same architecture and training mechanism for a wide range of tasks from different domains. Moreover, using a pretrained vision-language model like Flamingo made it considerably easier to fully enjoy the advantages of pretraining on a large multimodal dataset. The team believes this made generalization possible for both language and visual variations.
In order to evaluate their reformulation of success detection, the researchers conducted several experiments across unseen language and visual variations. These experiments revealed that pretrained vision-language models have comparable performance on most in-distribution tasks and significantly outperform task-specific reward models in out-of-distribution scenarios. Investigations also revealed that these success detectors are capable of zero-shot generalization to unseen variations in language and vision, where existing reward models failed. Although the novel approach, as put forward by DeepMind researchers, has remarkable performance, it still has certain shortcomings, especially in tasks related to the robotics environment. The researchers have stated that their future work will involve making more improvements in this domain. DeepMind hopes that the research community views their initial work as a stepping stone towards achieving more regarding success detection and reward modeling.
Check out the Paper. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our26+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.
Khushboo Gupta is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology(IIT), Goa. She is passionate about the fields of Machine Learning, Natural Language Processing and Web Development. She enjoys learning more about the technical field by participating in several challenges.