The Transformer architecture is widely adopted for Natural Language Processing (NLP) tasks. The Transformer’s self-attention mechanism connects each token in the input to every other token through relevance-based weights. However, this mechanism scales quadratically with sequence length, giving it a significant computational and memory overhead.
A new study conducted by Google researchers proposes replacing the self-attention sublayers with simple linear transformations that “mix” input tokens. This approach speeds up Transformer encoder architectures with only limited accuracy costs, and it also reduces the complexity and memory footprint of the architecture.
Investigating the effectiveness of faster, structured linear transformations, the researchers found that the Fourier Transform achieves nearly the same performance as dense linear mixing and scales very efficiently to long inputs, particularly on GPUs.
The proposed model, named FNet, is a layer-normalized ResNet-style architecture with multiple layers, where each layer has a Fourier mixing sublayer followed by a feed-forward sublayer. In the new model, the self-attention sublayer of each Transformer encoder layer is replaced with a Fourier Transform sublayer. This sublayer applies an unparameterized 2D discrete Fourier Transform: one 1D DFT mixes along the sequence dimension, and the other mixes along the hidden dimension.
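The Fourier mixing step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: it applies a 1D DFT along the hidden dimension, another along the sequence dimension, and keeps only the real part of the result.

```python
import numpy as np

def fourier_mixing(x):
    """FNet-style token mixing (sketch): 2D DFT over the hidden
    and sequence axes, keeping only the real part."""
    return np.fft.fft(np.fft.fft(x, axis=-1), axis=-2).real

# toy input: batch of 2 sequences, 8 tokens, hidden size 16
x = np.random.randn(2, 8, 16)
y = fourier_mixing(x)
assert y.shape == x.shape  # mixing preserves the input shape
```

Because the DFT is separable, the two 1D transforms are equivalent to a single 2D FFT over the last two axes; note that the mixing introduces no learnable parameters at all.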
The team compared the FNet model with several other models, including the following models:
- A Linear encoder, where every self-attention sublayer is replaced with two parameterized matrix multiplications – one mixing the sequence dimension and the other mixing the hidden dimension
- A Random encoder, where each self-attention sublayer is replaced with two constant random matrices
- A Feed Forward-only encoder, where the self-attention sublayer is entirely removed from the Transformer layers.
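The three baselines above differ only in how (or whether) tokens are mixed. A rough sketch, with randomly initialized stand-in weights for what would be trained parameters in the Linear encoder:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 8, 16
x = rng.standard_normal((seq_len, d_model))

# Linear encoder: two dense matrix multiplications, one mixing the
# sequence dimension and one mixing the hidden dimension
# (these weights would be learned; random values are stand-ins here)
W_seq = rng.standard_normal((seq_len, seq_len)) * 0.1
W_hidden = rng.standard_normal((d_model, d_model)) * 0.1
linear_mixed = W_seq @ x @ W_hidden

# Random encoder: the same two matmuls, but the matrices are
# frozen random constants rather than learned parameters
R_seq = rng.standard_normal((seq_len, seq_len)) * 0.1
R_hidden = rng.standard_normal((d_model, d_model)) * 0.1
random_mixed = R_seq @ x @ R_hidden

# Feed Forward-only encoder: no token mixing at all; each token
# passes independently through the feed-forward sublayer
ff_only = x
```

All three variants preserve the input shape, so they slot into the encoder layer exactly where self-attention would sit.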
The team remarked that the model trains faster than, and achieves accuracy comparable to, the most accurate efficient Transformer architectures. FNet is faster than all of the evaluated Transformer architectures at both training and inference, across all sequence lengths.
The proposed model offers a strong compromise between speed, memory footprint, and accuracy. It achieves 92% of the accuracy of BERT in a typical classification transfer learning setup on the GLUE benchmark and is capable of training seven times faster on GPUs. Unparameterized mixing via the discrete Fourier Transform, coupled with the simple nonlinearities of the feed-forward sublayers, is sufficient to model the GLUE tasks. The study also reveals that an FNet hybrid model containing only two self-attention sublayers achieves 97% of BERT’s accuracy on the GLUE benchmark.
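Putting the pieces together, one full FNet encoder block combines the unparameterized Fourier mixing with residual connections, layer normalization, and a feed-forward sublayer. The sketch below is illustrative only: the dimensions are toy values, and ReLU stands in for the nonlinearity actually used in the paper.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token's features to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def fnet_encoder_layer(x, W1, b1, W2, b2):
    # Unparameterized token mixing: 2D DFT over the sequence and
    # hidden dimensions, keeping only the real part
    mixed = np.fft.fft2(x).real
    x = layer_norm(x + mixed)  # residual connection + layer norm
    # Feed-forward sublayer (ReLU here is a simple stand-in nonlinearity)
    ff = np.maximum(0.0, x @ W1 + b1) @ W2 + b2
    return layer_norm(x + ff)  # residual connection + layer norm

# toy dimensions, chosen for illustration
rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 8, 16, 32
x = rng.standard_normal((seq_len, d_model))
W1 = rng.standard_normal((d_model, d_ff)) * 0.1
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.1
b2 = np.zeros(d_model)
out = fnet_encoder_layer(x, W1, b1, W2, b2)
assert out.shape == (seq_len, d_model)
```

Notice that the only learnable parameters in the block belong to the feed-forward sublayer; the mixing step is free of parameters, which is where the speed and memory savings come from.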