Transformer model, a deep learning framework, has achieved state-of-the-art results across diverse domains, including natural language, conversation, images, and even music. The core block of any Transformer architecture is the attention module, which computes similarity scores for all pairs of positions in an input sequence. Since it requires quadratic computation time and quadratic memory size of the storing matrix, with the increase in the input sequence’s length, its efficiency decreases.
Thus, for long-range attention, one of the most common methods is sparse attention. It reduces the complexity by computing selective similarity scores from the sequence, based on various methods. There are still certain limitations like unavailability of efficient sparse-matrix multiplication operations on all accelerators, lack of theoretical guarantees, insufficiency to address the full range of problems, etc.
Introduction to “Performer”
To resolve these issues, Google AI introduces the Performer, a Transformer architecture with attention mechanisms that scale linearly. The framework is implemented by Fast Attention Via Positive Orthogonal Random Features (FAVOR+) algorithm, providing scalable low-variance and unbiased estimation of attention mechanisms expressed by random feature maps decompositions (in particular, regular softmax-attention). Mapping helps in preserving linear space and time complexity.
In the original attention mechanism, the query, i.e., the row matrix and the key, i.e., the column matrix, are multiplied and passed through a softmax operation to form the attention matrix. But in this, the query-key product matrix can’t be decomposed back into its original components after passing through the softmax operation.
One can obtain a linear time attention mechanism using the decomposition of the attention matrix. After decomposing the attention matrix, one can rearrange matrix multiplications to approximate the regular attention mechanism’s result without explicitly constructing the quadratic-sized attention matrix, which leads to FAVOR+ (Fast Attention via Matrix Associativity).
Thus, using the algorithm mentioned above, Performer comes up with the idea of a reduction in space and time complexity, along with lower energy costs by improving inference speed.