The purpose of Computer Vision (CV) is to allow machines to extract valuable information from their surroundings by analyzing visual data from sources such as digital images and videos. The nature of this information depends on the final goal of the machine. Think, for example, of self-driving cars: a CV module capable of detecting, in real time, the objects that appear in front of the car is essential to avoid accidents. Similarly, a robot that gives directions to people inside a railway station can adapt the way it speaks depending on whether the listener is a child or an adult. This information can be obtained with CV software that applies image classification methods to the frames captured by the cameras installed on the robot.
CV is one of the most attractive fields of Artificial Intelligence (AI), and the Deep Learning (DL) revolution has significantly changed this research branch as well. Indeed, DL solutions are now widely used for CV tasks such as object detection, face recognition, image classification, object tracking in video, motion estimation, and many others. DL methods have become so popular thanks to their ability to automatically extract meaningful features from the available data. This reduces the effort needed for handcrafted feature extraction, such as detecting the corners in an image.
On the other hand, the main drawbacks of DL solutions are their intrinsic opacity and the scarcity of labeled data. Indeed, the high complexity of deep neural networks makes it extremely hard for humans to understand the rationale behind their predictions. However, in many domains it is essential to understand why a machine made a specific decision, for example to prevent ethical problems and racial bias. At the same time, DL models often require a huge amount of labeled data to be trained. During training, the model must be fed data samples, each paired with the label we want it to predict once the system is deployed (the age of a person based on their face, the kind of animal shown in a picture, …). Producing this association between training data and labels is not always feasible and, in any case, requires substantial human effort in terms of time and cost.
For these reasons, CV researchers will focus their efforts on discovering and taking advantage of solutions that can mitigate the issues described above.
Here are some of the computer vision trends to watch:
Trend 1: Explainable AI Solutions
Deep eXplainable AI (XAI) consists of methods that help humans understand the decisions of DL solutions, making them more transparent and trustworthy. Most XAI methods have been designed so that they can be applied to any existing DL model without changing it. Such methods have recently been criticized by the research community because they don’t provide enough detail about the decision process of the DL model, as you can see in the figure below.
Hence, the trend will be the deployment of DL solutions that are explainable-by-design. This means that the DL model itself is able to produce an explanation associated with each of its predictions. In the figure below, you can see an example of an explainable-by-design DL model recently developed at the University of Twente, in the Netherlands.
Some recent papers about Explainable AI methods applied to CV tasks:
- COIN: Counterfactual Image Generation for VQA Interpretation
- Demystifying Deep Learning Models for Retinal OCT Disease Classification using Explainable AI
- Explainable Artificial Intelligence for Human Decision-Support System in Medical Domain
- A Gradient Mapping Guided Explainable Deep Neural Network for Extracapsular Extension Identification in 3D Head and Neck Cancer Computed Tomography Images
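To make the idea of an XAI method that can be applied to any existing model without changing it more concrete, here is a minimal sketch of occlusion sensitivity, one well-known technique of this kind. The article does not name a specific method, so this choice is only an illustration. The `toy_model` function is a hypothetical stand-in for a trained classifier, and `occlusion_map`, `patch`, and `fill` are names invented for this example.

```python
def toy_model(image):
    """Hypothetical stand-in for a trained classifier.

    It returns the mean brightness of the top-left 2x2 region, so we know
    in advance which pixels the "model" actually relies on.
    """
    return sum(image[r][c] for r in range(2) for c in range(2)) / 4.0


def occlusion_map(model, image, patch=1, fill=0.0):
    """Mask one patch at a time and record how much the model score drops.

    The model is treated as a black box: it is only called, never modified.
    A larger drop means the masked region mattered more for the prediction.
    """
    height, width = len(image), len(image[0])
    baseline = model(image)
    heatmap = [[0.0] * width for _ in range(height)]
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            rows = range(top, min(top + patch, height))
            cols = range(left, min(left + patch, width))
            occluded = [row[:] for row in image]  # copy, keep input intact
            for r in rows:
                for c in cols:
                    occluded[r][c] = fill
            drop = baseline - model(occluded)
            for r in rows:
                for c in cols:
                    heatmap[r][c] = drop
    return heatmap


# A uniformly bright 4x4 "image".
image = [[1.0] * 4 for _ in range(4)]
heatmap = occlusion_map(toy_model, image)
for row in heatmap:
    print(row)
```

Only the four top-left pixels get a non-zero value, which matches what the toy model looks at. The heatmap shows where the model is sensitive, but not how it reasons, which is the kind of limitation that motivates explainable-by-design models.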
Trend 2: Self-Supervised Learning
The purpose of Self-supervised Learning is to take advantage of huge amounts of unlabeled data to learn meaningful features through a Pre-Text Task, and then to fine-tune these features on the few available labeled data by learning a Downstream Task. Consider the following example. Our final goal is to train an Image Captioning deep neural network that will work on images of animals; however, we don’t have enough labeled data to train our model accurately. We can exploit the available unlabeled data to make the model learn the features necessary to distinguish the different types of animals. As you can see in the figure below, the Pre-Text Task consists of a simple classification problem in which our network has to detect the rotation applied to the input image. So, we apply a random rotation to every available unlabeled image and then pseudo-label the image with that rotation. While the model learns to detect which rotation has been applied to an input image, it also learns high-level features about the animals in those images. Indeed, to detect the rotation applied to the image of a cat, it is important, for example, to recognize its muzzle in different positions. These features will also be very useful for distinguishing the various animals, as required by the Downstream Task.
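The pseudo-labeling step described above can be sketched in a few lines of Python. This is only an illustration under some assumptions: images are small nested lists instead of real photos, rotations are restricted to multiples of 90 degrees (a common choice, though the article does not specify one), and the function names `rotate90`, `rotate`, and `make_pretext_dataset` are made up for this example. Training the network on the resulting pairs is not shown.

```python
import random

# Class k means "rotated clockwise by k * 90 degrees" (assumed label scheme).
NUM_ROTATIONS = 4


def rotate90(image):
    """Rotate a 2D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def rotate(image, k):
    """Apply k successive 90-degree clockwise rotations."""
    rotated = [list(row) for row in image]
    for _ in range(k % NUM_ROTATIONS):
        rotated = rotate90(rotated)
    return rotated


def make_pretext_dataset(unlabeled_images, seed=0):
    """Turn unlabeled images into (rotated_image, rotation_class) pairs."""
    rng = random.Random(seed)
    dataset = []
    for image in unlabeled_images:
        k = rng.randrange(NUM_ROTATIONS)  # pseudo-label, no human annotation
        dataset.append((rotate(image, k), k))
    return dataset


# Toy 2x3 "images" standing in for unlabeled animal photos.
unlabeled_images = [
    [[1, 2, 3], [4, 5, 6]],
    [[0, 0, 1], [0, 1, 1]],
]
pretext_dataset = make_pretext_dataset(unlabeled_images)
for rotated_image, label in pretext_dataset:
    print(label, rotated_image)
```

Each pair can then be fed to an ordinary classifier with four output classes. The labels cost nothing to produce because they come from the transformation itself, not from a human annotator.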
Self-supervision is currently a very hot topic among AI researchers. For instance, the popular GANs (Generative Adversarial Networks) can be viewed as a specific type of Self-supervised Learning method called Generative-Contrastive (or Adversarial).
Here, you can find some recent research that focuses on Self-Supervised Learning:
- Self-Supervised Learning For Segmentation
- Self-Supervised Learning via multi-Transformation Classification for Action Recognition
- Self-supervised Learning from 100 Million Medical Images
- Towards High Fidelity Monocular Face Reconstruction with Rich Reflectance using self-supervised Learning and Ray Tracing
- Improving Ultrasound Tongue Image Reconstruction from Lip Images Using Self-supervised Learning and Attention Mechanism
- Detection of Abnormal Behavior with Self-Supervised Gaze Estimation
Trend 3: Neuro-Symbolic AI
Neuro-symbolic AI aims at combining modern deep learning techniques with traditional symbolic AI methods that typically rely on rule-based reasoning about entities and their relations. For example, if we know that Bob and Alice are children of Carl, then we can deduce that Bob and Alice are brother and sister. The main benefits of Neuro-symbolic AI approaches are the ability to learn with less data and to provide inherently interpretable models. The MIT-IBM Watson AI Lab already focuses its efforts on this extremely promising research area. One of the contributions of this lab is CLEVRER: Collision Events for Video Representation and Reasoning, a work developed as a collaboration between MIT CSAIL, IBM Research, Harvard University, and Google DeepMind.
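The rule-based side of the family example above can be written down directly. The sketch below covers only the symbolic part: the triple format and the names `facts`, `child_of`, `sibling_of`, and `infer_siblings` are assumptions made for illustration. In a neuro-symbolic system the facts could, for instance, be produced by a neural network, but that coupling is not shown here.

```python
# Known facts, written as (relation, subject, object) triples.
facts = {
    ("child_of", "Bob", "Carl"),
    ("child_of", "Alice", "Carl"),
}


def infer_siblings(facts):
    """Rule: child_of(X, P) and child_of(Y, P) and X != Y => sibling_of(X, Y)."""
    derived = set()
    for rel_x, x, parent_x in facts:
        for rel_y, y, parent_y in facts:
            same_parent = parent_x == parent_y
            if rel_x == rel_y == "child_of" and same_parent and x != y:
                derived.add(("sibling_of", x, y))
    return derived


siblings = infer_siblings(facts)
print(sorted(siblings))
# [('sibling_of', 'Alice', 'Bob'), ('sibling_of', 'Bob', 'Alice')]
```

Every derived fact can be traced back to an explicit rule and the facts that triggered it, which is why such models are described as inherently interpretable.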
Some recent papers about Neuro-Symbolic AI:
- Neuro-Symbolic Artificial Intelligence: Current Trends
- Neuro-Symbolic AI: An Emerging Class of AI Workloads and their Characterization
- End-to-End Neuro-Symbolic Architecture for Image-to-Image Reasoning Tasks
- Improving the Robustness to Variations of Objects and Instructions with a Neuro-Symbolic Approach for Interactive Instruction Following