Early days of research in Natural language processing established long-term dependencies. It also brought the vanishing gradient problem in front of us because nascent models were handling the input sequences one by one, without parallelization. Recently, revolutionary transformer-based architectures and the self-attention mechanisms have enabled token pairs’ interactions across complete sequences resulting in modeling arbitrary dependencies in a constant number of layers. The above helped achieve state-of-the-art performance across many Natural language processing tasks.
However, these advantages came to a high-cost barrier because the transformer-based networks’ memory and computational requirements grow quadratically with sequence length. The above results in significant efficiency bottlenecks when dealing with long sequences. Researchers from the University of Wisconsin-Madison, American Family Insurance, UC Berkeley, and Google Brain propose Nyströmformer. Nyströmformer is an O(n) approximation in both memory and time for self-attention. The above is designed to rescue us from the quadratic cost associated with the long input sequences.
Nyström method is an efficient technique for obtaining a low-rank approximation of a large kernel matrix. The method proposed by the team leverages Nyström approximation tailored for a softmax matrix capable of reducing complexity from O(n^2 ) to O(n) for self-attention computation.
The algorithm’s foundation idea is defining the matrix form of landmarks and then using them to form the three matrices needed for approximation. The team selected the crossroads before the softmax operation so that the approximation can be made to avoid calculating the full softmax matrix S.
The Nystrom approximation thus scales linearly (O(n) complexity) concerning the input sequence length in terms of both time and memory.
The researchers conducted experiments in a transfer learning setting in two stages with the popular transformer model BERT as a baseline to evaluate the model.
- Nyströmformer was trained on English Wikipedia and Bookcorpus data.
- The pre-trained Nyströmformer model was fine-tuned for various Natural language processing tasks on the General Language Understanding Evaluation benchmark datasets and IMDB reviews. GLUE benchmark includes datasets like SST-2, QNLI, MRPC, QQP, and MNLI.
The results achieved by the team show that Nyströmformer offers time efficiency and good memory over standard self-attention and Longformer self-attention. The model also performs competitively with the BERT-base model. Overall, It’s a big step towards running transformer models on very long sequences as Nyströmformer provides self-attention approximation with very high efficiency.