How it started:
As we know, Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), and Gated Recurrent Units (GRUs) have been quite successful in recent times. LSTMs are used for many tasks like sentiment analysis, machine translation, caption generation, etc., and they showed some remarkable results in encoder-decoder models. First, let us understand what encoder-decoder models are. Consider the task of machine translation. The idea is to pass the input, i.e., a text in one language, into a sequential model and get an encoded vector for the sentence. We then pass this encoded vector into another sequential model to produce the output. There was a problem with this model: it was not able to translate long sequences. The reason is simple: when you pass the input through the encoder, the final encoded vector gives more importance to the ending words than to the starting words, because the contribution of early words fades step by step in the recurrence. The reason behind this lies in the math behind sequential models.
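The bottleneck described above can be seen in a toy sketch: no matter how long the input is, a recurrent encoder folds it into one fixed-size vector. The update rule, weight shapes, and function names below are illustrative assumptions, not any real library's API.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_encode(inputs, W, U):
    """Toy RNN encoder: fold the whole input sequence into one hidden vector."""
    h = np.zeros(W.shape[0])
    for x in inputs:                 # one recurrent step per input word
        h = np.tanh(W @ h + U @ x)   # hypothetical update rule
    return h                         # the single "encoded vector"

d, hdim = 4, 8                       # toy word-vector and hidden sizes
W = rng.normal(0, 0.5, (hdim, hdim))
U = rng.normal(0, 0.5, (hdim, d))

short = [rng.normal(size=d) for _ in range(3)]
long = [rng.normal(size=d) for _ in range(50)]

# Both a 3-word and a 50-word sentence are squeezed into the same
# fixed-size vector, which is why long inputs lose early-word information.
print(rnn_encode(short, W, U).shape)
print(rnn_encode(long, W, U).shape)
```

The decoder would then have to reconstruct the entire output from that one vector, which is exactly the pressure that attention was designed to relieve.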
Now, to get rid of this, researchers developed a new encoder-decoder architecture based on the attention mechanism. The idea is simple: we allow the decoder to select the input words it needs from the encoder when predicting the next word. First, we pass the input sequence through a sequential model and get an encoded vector for every word in the input sequence. Then, we add a new layer that allows the model to decide which input words to consider when predicting the next word in the output sequence.
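The "decide which words to consider" step above can be sketched as dot-product attention: the decoder's current state scores every encoder vector, a softmax turns the scores into weights, and the weighted sum becomes the context for the next word. The shapes and variable names here are assumptions for illustration.

```python
import numpy as np

def attention(query, keys, values):
    """Dot-product attention: weight encoder states by relevance to the query."""
    scores = keys @ query                    # one relevance score per input word
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()        # softmax: weights sum to 1
    return weights @ values, weights         # context vector + attention weights

rng = np.random.default_rng(1)
enc_states = rng.normal(size=(5, 8))         # encoded vectors for 5 input words
query = rng.normal(size=8)                   # current decoder state

context, weights = attention(query, enc_states, enc_states)
print(weights.round(3))                      # importance assigned to each input word
```

Because the weights are recomputed at every decoding step, the model can look back at different input words for different output words instead of relying on one fixed encoding.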
This model was very successful, and this idea helped researchers develop a model architecture called the Transformer. The 2017 paper ‘Attention Is All You Need’ introduced architectures based purely on attention mechanisms and proved that attention by itself could produce some great results. This paper is one of the most significant breakthroughs in Deep Learning history, and many pre-trained models were developed from it, such as Google's BERT and OpenAI's GPT-2. These models can do many tasks such as question answering, machine translation, summarization, chatbots, etc.
Recently, a research team from Google and EPFL argued that attention is not all you need: they proved that the output of pure attention loses rank doubly exponentially with respect to depth. Here is what the researchers did:
- Studied the structure of a basic Transformer model by experimenting with the main structural elements of the architecture: MLPs (Multi-Layer Perceptrons), skip connections, and layer normalization.
- Considered each of these structural elements separately and analyzed the model's performance.
- They observed that the model relies heavily on short paths through the network, behaving like an ensemble of shallow single-head self-attention networks. Skip connections are what prevent the rank collapse, MLPs slow it down, and layer normalization plays no role in it. Together, these elements make the model stronger.
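The rank-collapse finding above can be illustrated numerically: stacking softmax self-attention layers with random weights drives the token matrix toward rank 1 (all rows become the same), while adding a skip connection at each layer counteracts this. This is a minimal sketch with assumed layer sizes and weight scales, not the paper's exact experimental setup.

```python
import numpy as np

rng = np.random.default_rng(2)

def self_attention(X, Wq, Wk, Wv):
    """One layer of single-head dot-product self-attention (no MLP, no norm)."""
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(X.shape[1])
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)      # row-wise softmax
    return A @ (X @ Wv)

def rank1_residual(X):
    """Relative distance to the closest rank-1 matrix (0 means full collapse)."""
    s = np.linalg.svd(X, compute_uv=False)
    return float(np.sqrt((s[1:] ** 2).sum() / (s ** 2).sum()))

n, d = 16, 8                                  # toy sequence length and width
X0 = rng.normal(size=(n, d))
X_pure, X_skip = X0.copy(), X0.copy()

for _ in range(20):                           # 20 stacked attention layers
    Wq, Wk, Wv = (rng.normal(0, d ** -0.5, (d, d)) for _ in range(3))
    X_pure = self_attention(X_pure, Wq, Wk, Wv)           # pure attention
    X_skip = X_skip + self_attention(X_skip, Wq, Wk, Wv)  # with skip connection

print(rank1_residual(X_pure))  # very small: pure attention collapses toward rank 1
print(rank1_residual(X_skip))  # much larger: the skip path counteracts the collapse
```

Intuitively, each softmax attention matrix averages rows together, so stacking layers compounds the averaging; the identity path of a skip connection keeps passing the original, full-rank signal forward.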
If you want to know more about this, try reading the paper.