Understanding the Attention Mechanism and Machine Translation Using an Attention-Based LSTM (Long Short-Term Memory) Model

Source: https://towardsdatascience.com/sequence-2-sequence-model-with-attention-mechanism-9e9ca2a613a

First, let us understand why the attention mechanism made machine translation easier. Before attention, machine translation relied on plain encoder-decoder models. Such a model contains two networks: an encoder and a decoder. The encoder encodes the input sequence into a single vector. This vector is meant to contain all the information about the sentence; when it is passed to the decoder, the decoder outputs the sentence in the target language.

Let us understand the problem with this model. It works well only if the encoder manages to capture the whole input sequence in that one vector. But the classic encoder-decoder model cannot capture all the information in a long text. It tends to give more importance to the last words of the sequence, so for a long sequence the encoder forgets the first words. Attention-based models address this problem.

In an attention-based model, we encode each word into its own vector and let the model select the words it needs to predict the next word of the output sequence. We use an LSTM to encode each word into a vector, pass these vectors to the attention layer, and pass the attention output to a decoder model to produce the output sequence.
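To make the difference concrete, here is a minimal sketch in plain Python. It assumes a toy recurrent encoder with scalar states as a stand-in for an LSTM; the function `toy_encoder`, its `decay` parameter, and the input numbers are hypothetical and only illustrate the idea of keeping one encoded state per word instead of a single final state.

```python
import math

def toy_encoder(inputs, decay=0.5):
    """Toy recurrent encoder (a stand-in for an LSTM, not a real one).

    Returns one state per input element, in order.
    """
    state = 0.0
    states = []
    for x in inputs:
        # Each new state mixes the previous state with the current input.
        state = math.tanh(decay * state + x)
        states.append(state)
    return states

# Hypothetical scalar "word" inputs, made up for illustration.
inputs = [1.0, -0.5, 0.25, 0.75]
states = toy_encoder(inputs)

# Classic encoder-decoder: the decoder sees only the final state.
classic_summary = states[-1]

# Attention-based model: the attention layer sees every per-word state.
attention_inputs = states
```

In the classic setup the decoder receives only `classic_summary`, whereas the attention layer receives the full list `attention_inputs` and can weight each position separately.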

In the attention layer, each input word has a weight associated with it. A larger weight means that the word matters more for predicting the next word in the output sequence. Now the question is how we calculate the weights. There are two common methods, Luong's attention and Bahdanau's attention. The weights are calculated as follows.

[Figure: equations for the attention weights and the context vector. Image credit: stackoverflow.com]
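The equations from the figure are not reproduced in this text. As a sketch, the standard formulation can be written as follows, using the article's score(p, q) notation. The indices t (decoder step) and s (input position) and the parameter names v, W_1, W_2, and W are notation assumed here, not taken from the original figure.

```latex
\begin{align}
\alpha_{ts} &= \frac{\exp\big(\mathrm{score}(p_t, q_s)\big)}
                    {\sum_{s'} \exp\big(\mathrm{score}(p_t, q_{s'})\big)}
  && \text{(attention weights)} \\
C_t &= \sum_{s} \alpha_{ts}\, q_s
  && \text{(context vector)} \\
\mathrm{score}(p_t, q_s) &= v^{\top} \tanh\big(W_1 p_t + W_2 q_s\big)
  && \text{(Bahdanau, additive)} \\
\mathrm{score}(p_t, q_s) &= p_t^{\top} W q_s
  && \text{(Luong, multiplicative)}
\end{align}
```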

Here alpha represents the attention weights and C represents the context vector. In score(p, q), p comes from the decoder and q comes from the encoder: p is the query vector and each q is a value vector. For Bahdanau's attention we pass in the query and the values, where the values are the encoded versions of all the input words. The layer first calculates the scores, then the attention weights, and finally the context vector.
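The following plain-Python sketch walks through one decoder step: scores, then weights, then the context vector. It is an illustration under stated assumptions, not the article's model. The helper names (`matvec`, `additive_score`, `dot_score`, `softmax`, `attend`) are hypothetical, the parameters `W1`, `W2`, and `v` are hand-set rather than trained, and the dot product is used as the simplest Luong-style score.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def additive_score(query, value, W1, W2, v):
    """Bahdanau-style score: v . tanh(W1 @ query + W2 @ value)."""
    hidden = [math.tanh(a + b)
              for a, b in zip(matvec(W1, query), matvec(W2, value))]
    return sum(vi * hi for vi, hi in zip(v, hidden))

def dot_score(query, value):
    """Simplest Luong-style score: the dot product of query and value."""
    return sum(q * x for q, x in zip(query, value))

def softmax(scores):
    """Normalize scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, values, score_fn):
    """Return (context_vector, attention_weights) for one decoder step."""
    weights = softmax([score_fn(query, h) for h in values])
    # Context vector: weighted sum of the value vectors.
    context = [sum(w * h[i] for w, h in zip(weights, values))
               for i in range(len(values[0]))]
    return context, weights

# Hand-set illustrative numbers, not trained parameters.
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]
query = [1.0, 0.0]                               # decoder state (p)
values = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # encoder outputs (q)

bahdanau_context, bahdanau_weights = attend(
    query, values, lambda p, q: additive_score(p, q, W1, W2, v))
luong_context, luong_weights = attend(query, values, dot_score)
```

Both variants share the softmax and weighted-sum steps; only the score function differs. Because the parameters here are made up, the resulting weights show the mechanics only and say nothing about which words a trained model would attend to.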

In the code below, the query comes from the decoder model and the values come from the encoder model. The layer returns the context vector and the attention weights.

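import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
  """Additive (Bahdanau) attention layer."""

  def __init__(self, units):
    super().__init__()
    self.W1 = tf.keras.layers.Dense(units)  # projects the decoder query
    self.W2 = tf.keras.layers.Dense(units)  # projects the encoder outputs
    self.V = tf.keras.layers.Dense(1)       # reduces each position to one score

  def call(self, query, values):
    # query: (batch, hidden) -> (batch, 1, hidden), so it broadcasts
    # across the time axis of values: (batch, max_len, hidden).
    query_with_time_axis = tf.expand_dims(query, 1)
    # score: (batch, max_len, 1), one score per input position.
    score = self.V(tf.nn.tanh(
        self.W1(query_with_time_axis) + self.W2(values)))
    # Normalize over the time axis so the weights sum to 1 per example.
    attention_weights = tf.nn.softmax(score, axis=1)
    # Weighted sum of the encoder outputs: (batch, hidden).
    context_vector = attention_weights * values
    context_vector = tf.reduce_sum(context_vector, axis=1)

    return context_vector, attention_weights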

You can visit the official TensorFlow page to get the complete code for this model.