Google AI Introduces Performer: A Generalized Attention Framework based on the Transformer architecture

The Transformer model, a deep learning architecture, has achieved state-of-the-art results across diverse domains, including natural language, conversation, images, and even music. The core block of any Transformer architecture is the attention module, which computes similarity scores for all pairs of positions in an input sequence. Because its computation time and memory both grow quadratically with the length of the input sequence, its efficiency drops sharply as sequences get longer.

One of the most common remedies for long-range attention is sparse attention, which reduces the complexity by computing similarity scores only for a selected subset of position pairs. Sparse attention still has limitations: efficient sparse-matrix multiplication operations are not available on all accelerators, it often lacks theoretical guarantees, and it cannot address the full range of problems that dense attention handles.

Introduction to "Performer"

To resolve these issues, Google AI introduces the Performer, a Transformer architecture with attention mechanisms that scale linearly. The framework is implemented with the Fast Attention Via Positive Orthogonal Random Features (FAVOR+) algorithm, which provides a scalable, low-variance, and unbiased estimate of attention mechanisms that can be expressed as random feature map decompositions (in particular, regular softmax attention). This mapping preserves linear space and time complexity.
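To make the random-feature idea concrete, here is a minimal NumPy sketch (not the authors' implementation) of a positive random feature map whose dot products approximate the softmax kernel exp(q · k); the function name positive_random_features and all sizes are illustrative, and the orthogonalization of the projections used by full FAVOR+ is omitted.

```python
import numpy as np

def positive_random_features(x, w):
    # phi(x) = exp(w @ x - ||x||^2 / 2) / sqrt(m); in expectation over
    # w ~ N(0, I_d), phi(q) . phi(k) equals exp(q . k), the softmax kernel.
    m = w.shape[0]
    norm = np.sum(x ** 2, axis=-1, keepdims=True) / 2.0
    return np.exp(x @ w.T - norm) / np.sqrt(m)

# Sanity check: the feature-map dot product should approximate exp(q . k).
rng = np.random.default_rng(0)
d, m = 16, 4096
q = rng.normal(size=(1, d)) / np.sqrt(d)
k = rng.normal(size=(1, d)) / np.sqrt(d)
w = rng.normal(size=(m, d))   # Gaussian projections (not orthogonalized here)

approx = (positive_random_features(q, w) @ positive_random_features(k, w).T).item()
exact = np.exp(q @ k.T).item()
print(approx, exact)          # the two numbers should be close
```

Because the features are positive, the estimate stays non-negative, which is what keeps the variance low when approximating softmax attention.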

In the original attention mechanism, the query matrix and the key matrix are multiplied and passed through a softmax operation to form the attention matrix. The problem is that the query-key product matrix cannot be decomposed back into its original components after passing through the nonlinear softmax operation.
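For reference, a minimal NumPy sketch of regular softmax attention, which makes the quadratic cost explicit; the function name and shapes are illustrative.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Regular attention: the full L x L matrix of similarity scores is built,
    # so both time and memory grow quadratically with sequence length L.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (L, L) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)            # row-wise softmax
    return A @ V                                  # (L, d_v)

L, d = 512, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(L, d)) for _ in range(3))
out = softmax_attention(Q, K, V)                  # out.shape == (512, 64)
```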

Source: https://ai.googleblog.com/2020/10/rethinking-attention-with-performers.html

One can instead obtain a linear-time attention mechanism by decomposing the attention matrix. After the decomposition, the matrix multiplications can be rearranged, using matrix associativity, to approximate the regular attention mechanism's result without explicitly constructing the quadratic-sized attention matrix. This is the key idea behind FAVOR+.
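The following NumPy sketch shows the rearrangement under the same assumptions as the earlier feature-map example; the function name favor_plus_style_attention and all sizes are illustrative, and this is an approximation of softmax attention, not the authors' released code.

```python
import numpy as np

def favor_plus_style_attention(Q, K, V, w):
    # Map queries and keys through the positive random feature map, then use
    # matrix associativity: phi(Q) @ (phi(K).T @ V) costs O(L * m * d) instead
    # of the O(L^2) needed to materialize the attention matrix.
    def phi(x):
        norm = np.sum(x ** 2, axis=-1, keepdims=True) / 2.0
        return np.exp(x @ w.T - norm) / np.sqrt(w.shape[0])

    d = Q.shape[-1]
    Qp = phi(Q / d ** 0.25)        # (L, m); the 1/d**0.25 scaling of Q and K
    Kp = phi(K / d ** 0.25)        # reproduces the 1/sqrt(d) in softmax attention
    kv = Kp.T @ V                  # (m, d_v), computed first thanks to associativity
    numerator = Qp @ kv            # (L, d_v)
    denominator = Qp @ Kp.sum(axis=0, keepdims=True).T   # (L, 1) row normalizer
    return numerator / denominator

L, d, m = 512, 64, 256
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(L, d)) for _ in range(3))
w = rng.normal(size=(m, d))                           # random projection matrix
approx_out = favor_plus_style_attention(Q, K, V, w)   # close to softmax_attention(Q, K, V)
```

Because the (L, L) attention matrix is never formed, memory and compute scale linearly in the sequence length L for a fixed number of random features m.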

Thus, using the algorithm described above, the Performer reduces space and time complexity, and its faster inference in turn lowers energy costs.

Github: https://github.com/google-research/google-research

Paper: https://arxiv.org/pdf/2009.14794.pdf