Stanford AI Lab Introduces AGQA: A New Benchmark For Compositional, Spatio-Temporal Reasoning

Source: http://ai.stanford.edu/blog/agqa/

Designing machines capable of a compositional understanding of visual events has been a long-standing goal of the computer vision community. The Stanford AI Lab has recently introduced the benchmark ‘Action Genome Question Answering’ (AGQA), which measures temporal, spatial, and compositional reasoning via nearly two hundred million question-answer pairs. The questions are complex, compositional, and annotated to allow fine-grained tests of which types of questions models can and cannot answer.

The researchers designed a synthetic generation process that uses rule-based question templates to generate questions from scene information, which represents what occurs in a video symbolically. Synthetic generation lets the researchers control the content, structure, and compositional reasoning steps needed to answer each generated question.
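As a rough illustration of how rule-based templates can turn symbolic scene annotations into question-answer pairs, consider the minimal sketch below; the template wording, slot names, and example facts are invented for this illustration and are not AGQA’s actual templates.

```python
# Minimal sketch of rule-based question generation from symbolic scene
# annotations. The template text, slot names, and example facts are
# hypothetical; AGQA's actual templates and programs are more involved.

# Symbolic facts extracted from a (hypothetical) video.
scene_facts = [
    {"subject": "person", "relationship": "holding", "object": "phone"},
    {"subject": "person", "relationship": "sitting on", "object": "chair"},
]

# A rule-based template with slots that get filled from the facts.
TEMPLATE = "Was the person {relationship} a {object}?"

def generate_questions(facts):
    """Yield (question, answer) pairs by filling the template slots."""
    for fact in facts:
        question = TEMPLATE.format(**fact)
        # For this verification-style template, the answer is "Yes" whenever
        # the fact is present in the scene annotations.
        yield question, "Yes"

for q, a in generate_questions(scene_facts):
    print(q, "->", a)
```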

Action Genome Question Answering consists of nearly 192 million complex and compositional question-answer pairs. Each question carries comprehensive annotations about its content and structure, including a mapping of items in the question to the relevant parts of the video and a program consisting of the reasoning steps needed to answer the question. AGQA also provides detailed metrics, such as test splits that measure performance on different question types and three new metrics specifically designed to measure compositional reasoning.
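A single annotated record of this kind might look roughly like the following sketch; the field names and values are hypothetical and do not reflect the dataset’s actual schema.

```python
# Hypothetical sketch of what one annotated AGQA-style record could contain.
# Field names and values are illustrative, not the dataset's actual format.
example_record = {
    "video_id": "VIDEO_0001",
    "question": "What did the person hold after sitting on a chair?",
    "answer": "phone",
    # Items in the question mapped to the relevant scene-graph elements.
    "grounding": {"person": "object/person", "chair": "object/chair"},
    # Reasoning steps (a "program") needed to reach the answer.
    "program": ["localize(sitting on chair)", "after", "query(held object)"],
    # Labels used for the test splits and metrics.
    "semantic_class": "object",
    "structure": "query",
    "num_compositional_steps": 3,
}
```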

To generate questions synthetically, each video is first represented through scene graphs: a sample of frames is taken from the video, and every sampled frame is annotated with the objects, actions, and relationships that occur in it. The researchers then built 28 templates, each consisting of a natural-language frame that references types of items within the scene graphs.
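One way to picture such per-frame scene graphs is as small collections of objects, actions, and relationship triples; the classes and example frames in the sketch below are assumptions made for illustration, not AGQA’s storage format.

```python
# Sketch of a per-frame scene-graph representation. The class layout and the
# example content are assumptions for illustration only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Frame:
    timestamp: float                      # seconds into the video
    objects: List[str] = field(default_factory=list)
    actions: List[str] = field(default_factory=list)
    # Relationships as (subject, relationship, object) triples.
    relationships: List[tuple] = field(default_factory=list)

@dataclass
class SceneGraph:
    video_id: str
    frames: List[Frame] = field(default_factory=list)

graph = SceneGraph(
    video_id="VIDEO_0001",
    frames=[
        Frame(1.0, ["person", "chair"], ["sitting down"],
              [("person", "sitting on", "chair")]),
        Frame(4.5, ["person", "phone"], ["holding phone"],
              [("person", "holding", "phone")]),
    ],
)
```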

Next, the scene graphs and the templates are combined to generate natural-language question-answer pairs. Finally, the researchers take the generated pairs and balance the distributions of answers and question types: answer distributions for different categories are smoothed, and questions are then sampled so that the dataset contains a diverse set of question structures.
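The balancing step can be pictured, in a simplified form, as capping how often any single answer appears; the downsampling sketch below is only an illustration of the idea, not the paper’s actual smoothing and sampling procedure.

```python
# Sketch of answer-distribution balancing by downsampling overrepresented
# answers. AGQA's real procedure is more sophisticated; this only shows the
# idea of flattening a skewed answer distribution.
import random
from collections import defaultdict

def balance_answers(qa_pairs, max_per_answer, seed=0):
    """Keep at most `max_per_answer` questions for each distinct answer."""
    rng = random.Random(seed)
    by_answer = defaultdict(list)
    for qa in qa_pairs:
        by_answer[qa["answer"]].append(qa)
    balanced = []
    for answer, group in by_answer.items():
        rng.shuffle(group)
        balanced.extend(group[:max_per_answer])
    rng.shuffle(balanced)
    return balanced

qa_pairs = (
    [{"question": f"q{i}", "answer": "No"} for i in range(80)]
    + [{"question": f"q{i}", "answer": "phone"} for i in range(80, 100)]
)
print(len(balance_answers(qa_pairs, max_per_answer=20)))  # 40
```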

The researchers validated their question-answer pairs with human annotators, who agreed with 86.02% of the generated answers.

The research team then ran three state-of-the-art models, HCRN, HME, and PSAC, on the benchmark and found that they struggle on it and that their performance depends heavily on linguistic biases. A model that only ever chooses the most likely answer (“No”) would achieve 10.35% accuracy. HME, the highest-scoring model, achieved 47.74% accuracy, and though HCRN achieved 47.42% accuracy overall, it still reaches roughly 47% accuracy without seeing the videos at all.
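The “most likely answer” figure is simply the accuracy of a constant predictor; the sketch below, using made-up predictions and answers, shows how such a baseline is compared against a model’s accuracy.

```python
# Sketch of comparing a trained model against the "always predict the most
# common answer" baseline. The data here is made up; the point is only how
# the two accuracies are computed.
from collections import Counter

def accuracy(predictions, answers):
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

answers = ["No", "phone", "No", "Yes", "chair", "No"]
model_predictions = ["No", "phone", "Yes", "Yes", "table", "No"]

# Constant baseline: always predict the most frequent ground-truth answer.
most_common_answer, _ = Counter(answers).most_common(1)[0]
baseline_predictions = [most_common_answer] * len(answers)

print("model   :", accuracy(model_predictions, answers))     # 0.666...
print("baseline:", accuracy(baseline_predictions, answers))  # 0.5
```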

To better understand model performance on different types of questions, the researchers split the test set by the reasoning skills needed to answer each question. Different models achieved the highest accuracy in different categories, and performance varied across these categories, though all three models performed worst on activity recognition.

AGQA also splits questions by their semantic focus, i.e., whether a question centers on objects, relationships, or actions.

Finally, the researchers annotate each question by its structure: Query questions are open-ended, Verify questions check whether a statement is true, Logic questions use a logical operator, and Choose questions offer a choice between two options. Almost every model performed worst on open-ended questions and best on Verify and Logic questions.
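Because every question carries these category annotations, per-category accuracy can be computed by grouping the test set on them; the records and category names in the sketch below are illustrative only.

```python
# Sketch of splitting results by an annotated question category and reporting
# accuracy per group. Records and category names are made up for illustration.
from collections import defaultdict

def accuracy_by_category(records, category_key):
    buckets = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for r in records:
        correct, total = buckets[r[category_key]]
        buckets[r[category_key]] = [correct + (r["prediction"] == r["answer"]),
                                    total + 1]
    return {cat: correct / total for cat, (correct, total) in buckets.items()}

records = [
    {"structure": "verify", "answer": "Yes", "prediction": "Yes"},
    {"structure": "verify", "answer": "No", "prediction": "Yes"},
    {"structure": "query", "answer": "phone", "prediction": "chair"},
]
print(accuracy_by_category(records, "structure"))
# {'verify': 0.5, 'query': 0.0}
```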

The researchers also provide three new metrics that precisely measure compositional reasoning. The first metric measures a model’s ability to generalize to novel compositions, while the second measures generalization to indirect references, which the researchers use to increase the complexity of the questions. The third metric measures generalization to more complex questions. The models struggle here: none of them outperforms 50% on binary questions, which have only two possible answers.
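For the first of these metrics, the key bookkeeping step is identifying test questions whose compositions never appear in the training set; representing a composition as a simple pair of strings, as in the sketch below, is a simplification made for illustration.

```python
# Sketch of measuring generalization to novel compositions: find the test
# questions whose composition never occurs in training and score accuracy
# only on those. Compositions as pairs of strings is a simplification.
def novel_composition_accuracy(train_records, test_records):
    seen = {tuple(r["composition"]) for r in train_records}
    novel = [r for r in test_records if tuple(r["composition"]) not in seen]
    if not novel:
        return None  # no novel compositions to evaluate on
    correct = sum(r["prediction"] == r["answer"] for r in novel)
    return correct / len(novel)

train = [{"composition": ("before", "holding")}]
test = [
    {"composition": ("before", "holding"), "answer": "Yes", "prediction": "Yes"},
    {"composition": ("after", "holding"), "answer": "No", "prediction": "Yes"},
]
print(novel_composition_accuracy(train, test))  # 0.0 (one novel question, answered incorrectly)
```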


Lastly, the researchers annotate the number of compositional steps needed to answer each question. They found that while human accuracy remains consistent as questions become more complex, model accuracy decreases.

https://www.youtube.com/watch?v=6Rw1QF9Hono

AGQA opens avenues for progress in multiple directions; neuro-symbolic and meta-learning modeling approaches, for example, may improve compositional reasoning. It also provides comprehensive metrics and baselines for exploring these new directions. The benchmark points to the weak points of existing models, chiefly their overreliance on linguistic biases and their difficulty generalizing to novel and more complex tasks.

Paper: https://arxiv.org/pdf/2103.16002.pdf

Data: https://cs.stanford.edu/people/ranjaykrishna/agqa/

Stanford Blog: http://ai.stanford.edu/blog/agqa/