DeepMind recently open-sourced Perceiver IO, a general-purpose deep learning architecture that can handle many different types of inputs and outputs. Described as a “drop-in” replacement for Transformers, it is reported to outperform baseline models without relying on domain-specific architectural assumptions.
A preprint on arXiv describes Perceiver IO as a more general version of the earlier Perceiver architecture, able to produce many kinds of outputs from many kinds of inputs. That makes it applicable to real-world domains such as language and vision, as well as difficult games like StarCraft II. The original Perceiver could only produce very simple outputs; Perceiver IO overcomes this limitation by learning to flexibly query the model's latent space.
Perceiver IO is also more efficient than a standard Transformer. It can process a large input as a single sequence without incurring the high compute and memory costs that normally come with long sequences. It can also produce outputs of any desired data type, which keeps the model flexible while its design stays simple.
Most deep-learning models are designed for a particular type of data: computer vision (CV) models typically use convolutional neural networks, while natural language processing (NLP) models rely on sequence learning. Systems that handle multi-modal input, such as Google’s combined vision-language model, often use domain-specific architectures to process each input type before combining the results with an additional module. The Transformer architecture can also be applied to many computer vision problems. However, the compute and memory a Transformer requires grow with the square of the input sequence length, which makes it impractical for some high-dimensional data types such as video or audio.
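To make the quadratic growth concrete, here is a minimal Python sketch that counts the pairwise attention scores a single full self-attention layer computes. The function name and the example sequence lengths are illustrative assumptions, not figures from DeepMind; the last length simply stands for one element per pixel of a 224 x 224 image.

```python
def attention_score_count(seq_len: int) -> int:
    """Pairwise scores in one full self-attention layer: every element
    attends to every element, so the count is seq_len squared."""
    return seq_len * seq_len


# Illustrative (hypothetical) sequence lengths: a short text, a longer
# text, and one element per pixel of a 224 x 224 image.
for seq_len in (512, 2048, 224 * 224):
    print(f"{seq_len:>6} elements -> {attention_score_count(seq_len):,} scores")
```

Doubling the sequence length quadruples the number of scores, which is why feeding raw pixels or audio samples directly into self-attention becomes expensive so quickly.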
Perceiver IO uses cross-attention to project a high-dimensional input array into a much smaller latent array. The latent array is then processed with a standard Transformer self-attention stack. Because this stack operates on the latent array rather than on the raw input, it is much more efficient than a Transformer that handles large arrays directly. The difference in size between the input and the latent array also means a deeper processing stack becomes practical, which can improve accuracy. Finally, the latent representation is converted to an output by attending to it with a query array that contains one element for each desired output element.
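The encode, process, decode flow can be sketched in a few lines of dependency-free Python. This is a simplified illustration, not DeepMind's implementation: it omits the learned query/key/value projections, residual connections, layer normalization, and MLP blocks of the real model, it assumes all arrays share one feature dimension D, and every name in it (attend, perceiver_io_sketch, M, N, O, D) is hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(a):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in a:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attend(queries, keys_values):
    """Scaled dot-product attention with no learned projections (a
    simplification): the same array serves as both keys and values."""
    dim = len(queries[0])
    keys_t = [list(col) for col in zip(*keys_values)]
    scores = matmul(queries, keys_t)
    scaled = [[s / math.sqrt(dim) for s in row] for row in scores]
    return matmul(softmax_rows(scaled), keys_values)


def perceiver_io_sketch(inputs, latents, output_queries, depth=2):
    # Encode: N latents cross-attend to M inputs, giving an N x D array.
    z = attend(latents, inputs)
    # Process: self-attention runs only on the small latent array.
    for _ in range(depth):
        z = attend(z, z)
    # Decode: O output queries cross-attend to the latents, giving O x D.
    return attend(output_queries, z)


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
M, N, O, D = 64, 8, 5, 4  # hypothetical sizes: inputs, latents, outputs, features
inputs = random_matrix(M, D, rng)
latents = random_matrix(N, D, rng)          # learned in the real model
output_queries = random_matrix(O, D, rng)   # one query per desired output element
outputs = perceiver_io_sketch(inputs, latents, output_queries)
print(len(outputs), len(outputs[0]))  # prints "5 4"
```

In this sketch the encoder computes M x N attention scores, each latent layer computes N x N, and the decoder computes O x N, so the only terms that depend on the input and output sizes grow linearly with them. The output shape is set entirely by the query array, which is how the same latent representation can be decoded into outputs of different sizes.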
To help researchers and the wider machine learning community, DeepMind has open-sourced the code.
Source 1: https://deepmind.com/blog/article/building-architectures-that-can-handle-the-worlds-data
Source 2: https://www.infoq.com/news/2021/08/deepmind-perceiver-io/