Neural networks have achieved success in a variety of perceptual tasks. However, they are often said to be ineffective at solving problems that require higher-level reasoning. Experiments with two recently released video question-answering datasets (CLEVRER and CATER) suggested that neural networks cannot adequately reason about the spatio-temporal and compositional structure of visual scenes.
On the other hand, neuro-symbolic models, which combine learning algorithms with symbolic reasoning techniques to predict, explain, and consider counterfactual possibilities, are assumed to be much better suited to these tasks than neural networks. Such a model leverages several independently learned modules:
- A neural network ‘perceptual’ front-end to detect objects
- A dynamics module to infer objects’ behavior over time
- A symbolic statistical semantic parser that represents the questions
- A hand-coded symbolic executor that interprets inputs and predicts answers
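The four-module pipeline listed above can be sketched in a few lines of Python. This is a toy illustration only, not any published system: the `ObjectTrack` schema, the constant-velocity dynamics, the lookup-table "parser", and the two-operation program language are all assumptions made for the example, and each function is a stand-in for a module that would be learned or hand-engineered in practice.

```python
from dataclasses import dataclass


@dataclass
class ObjectTrack:
    """Symbolic description of one detected object (hypothetical schema)."""
    color: str
    shape: str
    positions: list  # one (x, y) tuple per observed frame


def perceive(video):
    # Stand-in for the neural perceptual front-end. In this toy example the
    # "video" is already a list of annotated objects, so we only wrap them.
    return [ObjectTrack(**obj) for obj in video]


def predict_dynamics(tracks, steps=1):
    # Stand-in for the dynamics module: constant-velocity extrapolation
    # from the last two observed positions (an assumption for illustration).
    extended = []
    for t in tracks:
        (x0, y0), (x1, y1) = t.positions[-2], t.positions[-1]
        future = [(x1 + (x1 - x0) * k, y1 + (y1 - y0) * k)
                  for k in range(1, steps + 1)]
        extended.append(ObjectTrack(t.color, t.shape, t.positions + future))
    return extended


def parse_question(question):
    # Stand-in for the semantic parser: a fixed lookup from question text
    # to a symbolic program (a list of operations).
    programs = {
        "How many cubes are there?": [("filter_shape", "cube"), ("count",)],
    }
    return programs[question]


def execute(program, tracks):
    # Hand-coded symbolic executor: runs each operation in order.
    result = tracks
    for op, *args in program:
        if op == "filter_shape":
            result = [t for t in result if t.shape == args[0]]
        elif op == "count":
            result = len(result)
        else:
            raise ValueError(f"unknown operation: {op}")
    return result


video = [
    {"color": "red", "shape": "cube", "positions": [(0, 0), (1, 0)]},
    {"color": "blue", "shape": "sphere", "positions": [(5, 5), (5, 4)]},
    {"color": "green", "shape": "cube", "positions": [(2, 2), (2, 3)]},
]
tracks = predict_dynamics(perceive(video))
answer = execute(parse_question("How many cubes are there?"), tracks)
print(answer)
```

Each stage here passes explicit symbols to the next, which is the property that separates this kind of pipeline from a network whose internal representations are distributed.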
However, researchers at DeepMind assert that neural networks can outperform neuro-symbolic models under the right testing conditions. For example, in some symbolic domains such as language, neural networks outperform hybrid neuro-symbolic methods at classification and prediction. The researchers therefore set out to reconcile the limitations of neural networks in video domains with their successes in symbolic domains.
This reconciliation is achieved by designing a neural network architecture for spatio-temporal reasoning about videos in which all components are learned and all representations are distributed (rather than symbolic or localist) throughout the network's layers.
The proposed neural network architecture leverages attention to integrate information effectively. An important aspect is self-supervision (meaning that the model infers masked-out objects in videos using the underlying dynamics, which extracts more information from the data). This allows the model to learn better representations and achieve higher data efficiency.
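The masked-object idea described above can be illustrated with a minimal NumPy sketch. It assumes that each video is already encoded as an array of object embeddings with shape `(frames, objects, dim)`, that masked objects are replaced with zeros, and that the training signal is a mean squared error on the masked positions only. These choices, along with the function names `mask_objects` and `masked_reconstruction_loss`, are assumptions for illustration and may differ from what the researchers used.

```python
import numpy as np


def mask_objects(embeddings, mask_prob, rng):
    """Hide a random subset of object embeddings.

    embeddings: array of shape (frames, objects, dim).
    Returns the masked copy and a boolean mask of shape (frames, objects)
    that is True wherever an object was hidden.
    """
    mask = rng.random(embeddings.shape[:2]) < mask_prob
    masked = embeddings.copy()
    masked[mask] = 0.0  # zero out every feature of each hidden object
    return masked, mask


def masked_reconstruction_loss(predicted, target, mask):
    """Mean squared error computed only over the hidden objects."""
    if not mask.any():
        return 0.0
    diff = predicted[mask] - target[mask]
    return float(np.mean(diff ** 2))


rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, 3, 8))  # 4 frames, 3 objects, 8 features
masked, mask = mask_objects(embeddings, mask_prob=0.25, rng=rng)

# A model would take `masked` as input and try to predict `embeddings`.
# Here the "prediction" is simply the masked input, which gives a baseline.
baseline_loss = masked_reconstruction_loss(masked, embeddings, mask)
perfect_loss = masked_reconstruction_loss(embeddings, embeddings, mask)
print(baseline_loss, perfect_loss)
```

Because the loss is restricted to the hidden positions, the only way to reduce it is to infer the missing objects from the visible ones, which is what pushes a model to pick up the underlying dynamics.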
The architecture ensures that the visual elements in the videos that correspond to physical objects are represented, which is essential for higher-level reasoning. Because neural networks are flexible, the same architecture and algorithm can be applied to various tasks without any manual changes to the system’s internal workings.
The results have many implications for the development of machines that can reason about their experiences. Contrary to the conclusions of previous studies, models based exclusively on distributed representations can perform well on visual tasks that measure high-level cognitive functions.
The team states that the resulting model surpasses the performance of neuro-symbolic models on the CLEVRER dataset across all question types, with the most significant advantage on the counterfactual questions. The CLEVRER dataset draws on insights from psychology and consists of 20,000 5-second videos of colliding objects generated by a physics engine, along with over 300,000 questions and answers that focus on four elements of logical reasoning: descriptive, explanatory, predictive, and counterfactual.
The researchers identify the following as the critical aspects of their approach:
- Self-attention to effectively integrate information over time
- Soft-discretization of the input at the right level of abstraction
- Self-supervised learning to extract more information from each sample
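To make the first item concrete, here is a minimal single-head scaled dot-product self-attention sketch in NumPy. It assumes the input has already been discretized into one embedding per object per frame, and it flattens frames and objects into a single token sequence so that every object can attend to every other object at every time step. The projection matrices are random placeholders rather than trained weights, and the researchers' actual model is not reproduced here.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the maximum for numerical stability before exponentiating.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    tokens: array of shape (num_tokens, dim).
    Returns the attended outputs and the (num_tokens, num_tokens)
    attention weights, where each row sums to 1.
    """
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


rng = np.random.default_rng(0)
frames, objects, dim = 4, 3, 8
embeddings = rng.normal(size=(frames, objects, dim))

# One token per (frame, object) pair, so attention spans space and time.
tokens = embeddings.reshape(frames * objects, dim)

# Random placeholder projections; a real model would learn these.
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

output, weights = self_attention(tokens, w_q, w_k, w_v)
print(output.shape, weights.shape)
```

Each row of `weights` says how much one object at one moment draws on every other object at every other moment, which is the sense in which attention integrates information over time.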
Their neural networks matched the performance of some of the best neuro-symbolic models with 40% less training data (without pre-training or labeled data), challenging the belief that neural networks are necessarily more data-hungry than neuro-symbolic models.
The results also suggest that deep networks can replicate many properties of human cognition and reasoning while benefiting from the flexibility and expressivity of distributed representations. The team notes that a large-scale neural language model can acquire arithmetic reasoning and analogy-making without explicit training. This suggests that current neural network limitations are alleviated by scaling to more data and by using larger and more efficient architectures. They hope that new, challenging tasks will be proposed to determine empirically the full extent of what neural networks can achieve.