DeepMind recently introduced Perceiver, a state-of-the-art deep learning model, in a new paper. It adapts the Transformer so it can consume all types of input, from audio to images, and perform tasks, such as image recognition, for which specialized kinds of neural networks are usually built. The design is inspired by how the human brain perceives multi-modal input.
Perceiver is a neural network that can process and classify input data from various sources. The model is built on the Transformer's attention mechanism, which lets it make predictions regardless of the type of input it receives, such as images or sound waves.
The Perceiver is in the spirit of a multi-tasking approach. It mainly takes in three kinds of input: images, audio and video, and point clouds, i.e., collections of points in space such as what a LiDAR sensor on top of a car sees.

Once trained, the system shows meaningful results on benchmark tests, including the classic ImageNet image-recognition test, AudioSet, and ModelNet40, a test in which a neural net must use nearly 2,000 points in space to correctly identify an object.
The Perceiver manages this mainly through two tricks. The first is reducing the amount of data the Transformer needs to operate on directly. The model attends to its input in what the team calls an asymmetric fashion: a small, learned set of latent units queries the full raw data through cross-attention, producing a compressed summary, and most of the network's capacity is then spent processing that summary rather than the raw data, reducing the overall compute.
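To make the asymmetry concrete, here is a minimal sketch (in PyTorch, not DeepMind's code) of that bottleneck: a learned latent array cross-attends to the large input once, and the expensive self-attention layers run only on the latents. All class names, sizes, and hyperparameters below are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class CrossAttentionBottleneck(nn.Module):
    """Minimal sketch of the Perceiver's asymmetric attention.

    A small latent array queries a much larger input ("byte") array
    once via cross-attention; the deep self-attention stack then runs
    only on the latents. Sizes here are illustrative assumptions.
    """

    def __init__(self, num_latents=256, dim=512):
        super().__init__()
        # Learned latent array: num_latents << number of input elements.
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        # Cross-attention: latents are the queries, inputs are keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Latent Transformer: its cost depends on num_latents, not input size.
        self.latent_transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True),
            num_layers=6,
        )

    def forward(self, inputs):
        # inputs: (batch, M, dim), where M can be tens of thousands of pixels.
        b = inputs.shape[0]
        q = self.latents.unsqueeze(0).expand(b, -1, -1)   # (batch, N, dim)
        summary, _ = self.cross_attn(q, inputs, inputs)   # cost ~ O(M * N)
        return self.latent_transformer(summary)          # cost ~ O(N^2) per layer
```

Because the queries come from the N latents rather than the M input elements, the cross-attention costs roughly O(M·N) instead of the O(M²) of a standard Transformer, which is what lets the model ingest tens of thousands of raw pixels or audio samples directly.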
The second trick is to give the model some clues about the structure of the data. The researchers use Fourier features, which explicitly tag each piece of input with information about where it sits in the underlying structure, for example a pixel's row and column position in an image.
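A rough sketch of what such a Fourier-feature encoding can look like is below. The frequency spacing, band count, and function name are assumptions chosen for illustration, loosely following the idea described in the paper rather than reproducing its implementation.

```python
import math
import torch

def fourier_features(positions, num_bands=16, max_freq=10.0):
    # positions: (M, d) coordinates, each scaled to [-1, 1]
    # (e.g., d = 2 for a pixel's (row, col) position).
    # num_bands and max_freq are illustrative choices.
    freqs = torch.linspace(1.0, max_freq, num_bands)     # (num_bands,)
    scaled = positions.unsqueeze(-1) * freqs * math.pi   # (M, d, num_bands)
    # Sines and cosines at each frequency, plus the raw coordinate.
    feats = torch.cat(
        [scaled.sin(), scaled.cos(), positions.unsqueeze(-1)], dim=-1
    )                                                    # (M, d, 2*num_bands + 1)
    return feats.flatten(1)                              # (M, d*(2*num_bands + 1))
```

These position features would be concatenated onto each input element (for instance, onto each pixel's RGB values) before the cross-attention step, so the otherwise permutation-invariant attention can tell where each element came from.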
The results of the benchmark tests are interesting. On ImageNet, the Perceiver achieves better accuracy than the industry-standard ResNet-50 network and than the Vision Transformer. On AudioSet, it performs better than most state-of-the-art models.

However, there are several issues with the Perceiver. One is that it doesn't always perform as well as programs built for a particular modality; on point clouds, for example, it falls far short of PointNet++, a 2017 neural network designed specifically for point clouds. Another is that almost nothing about the Perceiver appears to bring the benefits of more efficient computing or of learning from less data. What it does offer is the ability to learn different kinds of representations.
The team shows many attention maps, i.e., visualizations that purport to represent what the Perceiver emphasizes in each batch of training data. These maps suggest that the Perceiver adapts where it places its computational focus. Another weakness the researchers specifically highlight concerns the Fourier features: they help in some cases, but it isn't clear how, or even whether, that crutch can be dispensed with. Finally, the Perceiver shows no synergy between the different modalities, so images, sound, and point clouds are still handled apart from one another.
Paper: https://arxiv.org/pdf/2103.03206.pdf
Consultant Intern: Kriti Maloo is currently pursuing her B.Tech from the Indian Institute of Technology (IIT) Bhubaneswar. She is interested in data analytics and its applications in various domains. She is a bibliophile and loves to explore new advancements in the field of technology.