Deep Learning with Keras – Part 7: Recurrent Neural Networks


In this part of the series, we introduce Recurrent Neural Networks (RNNs), which made a major breakthrough in predictive analytics for sequential data. This article covers RNNs on both the conceptual and the practical level. We start with what RNNs are and why and when they are used, then we build an RNN ourselves for text classification.

Why RNNs?

So far we have been working with regular tabular data. This data has no real notion of a sequence: it does not matter whether we shuffle the fields or not, the model will still train correctly. With sequential data everything changes. The order of the inputs is essential, and changing it can drastically alter the meaning of the input. For example, we cannot reorder the frames of a video, the clips of an audio recording, or the words of a sentence without destroying the meaning.

The networks we have covered so far do not take this ordering into account. We will illustrate the idea with an example. Given a dataset of sentences such as "I love orange juice.", "I hate this movie.", "The sky is blue.", build a model that predicts the sentiment behind each sentence. In other words, the model has to decide whether the author expresses a positive, negative, or neutral opinion.

In classic fully connected networks we can solve this situation as follows:

Simple network design

You can guess that this solution is not very effective, for several reasons:

  • The ordering between the words is not well captured in the network
  • The length may vary from one sentence to another
  • Neurons do not really understand the relation between words in the sentence
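To make the first point concrete, here is a minimal sketch. It assumes the sentence reaches the dense network as order-free word counts (a bag-of-words encoding). That encoding is our assumption for illustration and not necessarily the exact design in the figure. The helper name bag_of_words is hypothetical.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count each lowercase word, ignoring its position in the sentence."""
    return Counter(sentence.lower().split())

a = "the movie was good not bad"
b = "the movie was bad not good"

# Both sentences contain exactly the same words, so their counts are
# identical even though the sentiment is reversed.
same_counts = bag_of_words(a) == bag_of_words(b)
print(same_counts)  # True
```

A model that only sees these counts receives the same input for both sentences, so it cannot tell them apart.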

What are RNNs?

RNNs are designed much differently. Have a look at an alternative network for the same problem:

RNN design

The red arrows represent the activation value passed on by each neuron. The first red arrow is the initial activation value (it can be set randomly). Every following activation is calculated from the current word, the weights, and the activation of the previous step. Notice how each neuron is connected to the one before it through the activation value. This is what makes RNNs powerful: the network can take into account the position of each word in the sentence.

The figure represents a very simple RNN architecture, and there are many others. Some enable bidirectional connections between words, others output multiple values. In short, we can classify them according to the number of inputs and the number of outputs.
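The recurrence described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the Keras implementation: the function name rnn_final_state, the weight names W_x, W_h, b, the toy sizes, and the zero initial activation are all assumptions made for the example.

```python
import numpy as np

def rnn_final_state(inputs, W_x, W_h, b, h0):
    """Run a simple RNN over a sequence and return the last activation."""
    h = h0
    for x in inputs:
        # The new activation depends on the current input
        # and on the activation of the previous step.
        h = np.tanh(W_x @ x + W_h @ h + b)
    return h

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3                  # hypothetical toy sizes
W_x = rng.normal(size=(hidden_dim, input_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)
h0 = np.zeros(hidden_dim)                     # initial activation

sentence = rng.normal(size=(5, input_dim))    # 5 made-up word vectors
forward = rnn_final_state(sentence, W_x, W_h, b, h0)
backward = rnn_final_state(sentence[::-1], W_x, W_h, b, h0)

print(forward.shape)  # (3,)
# Feeding the same words in reverse order gives a different final activation.
print(np.allclose(forward, backward))
```

Because each step reuses the previous activation, the final vector depends on the order of the words and not only on which words appear.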

Different RNN designs
  • Many to many RNNs are useful for translation. We give a sentence of many words and expect the network to produce many translated words in a different language.
  • Many to one is the design we sketched for sentiment analysis. We give multiple input tokens and expect a single value as the result.
  • One to many is often used for sequence generation. You give the network one input and it generates a sequence based on what it has learned. Example: give a word and generate news, poetry, music, etc.

These different types are somewhat advanced, and we will cover them in future articles. For now, let us focus on our many to one network.


We are going to test our RNN on the Spam SMS dataset. For more information about the dataset and how to preprocess the data, please refer to the previous post in this series: Deep Learning with Keras – Part 6: Textual Data Preprocessing.

We have reached this far in the previous tutorial:

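import pandas as pd
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Load the SMS dataset
data = pd.read_csv('./data/sms.csv')
# Build the vocabulary (must be fitted before converting texts to sequences)
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(data['SMS'])
data['Tokens'] = tokenizer.texts_to_sequences(data['SMS'])
# Pad or truncate every sequence to exactly 40 tokens
sequences = pad_sequences(data['Tokens'], maxlen=40, padding='pre', truncating='post')
data['Padded'] = sequences.tolist()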

The goal is to extend the above code by adding and training the RNN model. But first, let us review what we did. We loaded the data and used a tokenizer to break the sentences into tokens. We specified num_words=10000 to keep only the most frequent 10000 words in the dataset. Finally, we padded the sequences so that they all have a length of 40.

Next we have to split our data into training and testing sets. We will do it using the sklearn library:
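To see what these preprocessing steps do to the text, here is a simplified pure-Python sketch. It is not the Keras implementation (for instance, it does not strip punctuation), and the helper names fit_vocabulary, to_sequence, and pad, as well as the toy messages, are hypothetical.

```python
from collections import Counter

def fit_vocabulary(texts, num_words):
    """Map the most frequent words to integer ids starting at 1 (0 is padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = counts.most_common(num_words)
    return {word: idx for idx, (word, _) in enumerate(most_common, start=1)}

def to_sequence(text, vocabulary):
    """Replace each known word by its id; unknown words are dropped."""
    return [vocabulary[w] for w in text.lower().split() if w in vocabulary]

def pad(sequence, maxlen):
    """Cut extra tokens from the end, then add zeros at the front."""
    truncated = sequence[:maxlen]
    return [0] * (maxlen - len(truncated)) + truncated

texts = ["free prize call now", "call me now", "see you at lunch"]
vocab = fit_vocabulary(texts, num_words=10)
padded = [pad(to_sequence(t, vocab), maxlen=5) for t in texts]
for row in padded:
    print(row)
```

Every row has the same length, with zeros on the left for short messages. This mirrors padding='pre' and truncating='post' in the call to pad_sequences.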

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data['Padded'], data['Label'], test_size=0.3)

X is data['Padded'] and y is data['Label']. If we look at the shapes of the four sets, we see ((3900,), (1672,), (3900,), (1672,)). Each element of X is still a Python list, so we need to stack these lists into a 2D array of shape (number of samples, sequence length), where the sequence length is 40 in our case. This is how it is done:

import numpy as np
X_train = np.vstack(X_train.values)
X_test = np.vstack(X_test.values)
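The effect of np.vstack can be checked on a small stand-in for the real column. The sketch below uses made-up rows of length 4 instead of 40, stored in a one-dimensional object array similar to what X_train.values holds.

```python
import numpy as np

# A 1-D object array whose elements are Python lists, similar to the
# values of a DataFrame column of padded sequences (toy data, length 4).
rows = [[0, 0, 5, 2], [0, 7, 1, 3], [4, 4, 9, 8]]
column = np.empty(len(rows), dtype=object)
for i, row in enumerate(rows):
    column[i] = row

print(column.shape)   # (3,)

# Stack the lists on top of each other to get one row per sample.
matrix = np.vstack(column)
print(matrix.shape)   # (3, 4)
```

With the real data the result has one row per SMS and 40 columns, which is the input shape the model expects.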

Time to define the model:

from keras import layers, models
model = models.Sequential()
model.add(layers.Embedding(10000, 128, input_length=40))
model.add(layers.LSTM(32, activation='tanh'))
model.add(layers.Dense(1, activation='sigmoid'))

We need to discuss a lot of things here.

First, the Embedding layer is a special layer designed for text. We gave it the following parameters:

  • number of words/tokens in the data: in our case we chose to take 10000 words from the dataset so this is our number
  • the output embedding size: an embedding is a vector that represents the characteristics of each word. Here we chose to extract 128 characteristics from each word. (More about this layer will be discussed in future tutorials)
  • the length of each sentence: which is the padded sequence length (40 in our case)
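Conceptually, an embedding layer is a lookup table with one row per token id. The sketch below is an illustration of that idea in NumPy, not the Keras layer itself: the table holds random numbers (in the real layer they are learned during training), and the token ids are made up.

```python
import numpy as np

vocab_size, embedding_dim, seq_len = 10000, 128, 40

rng = np.random.default_rng(0)
# One row of 128 numbers per token id (random here, learned in practice).
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

# A hypothetical padded sentence: zeros for padding, then three token ids.
token_ids = np.zeros(seq_len, dtype=int)
token_ids[-3:] = [12, 845, 7]

# Row lookup: each position in the sentence is replaced by its vector.
embedded = embedding_table[token_ids]
print(embedded.shape)   # (40, 128)
```

So each padded sentence of 40 token ids becomes a 40 by 128 matrix, which is what the next layer receives.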

After the embedding layer we added an LSTM layer. LSTM (Long Short-Term Memory) is a special type of RNN that has proved to perform very well. (We will learn more about the different RNN implementations in future tutorials.)

Finally, we added a Dense layer with a sigmoid activation that gives the final classification result (0 or 1).

We will compile the model with the Adam optimizer and binary crossentropy as the loss (since we have 0 or 1 labels), and use accuracy as a metric to monitor the performance.
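As a sanity check on the architecture, we can count the trainable parameters by hand. The arithmetic below assumes the default Keras settings for these layers (biases enabled); calling model.summary() should report comparable numbers.

```python
vocab_size, embedding_dim, lstm_units = 10000, 128, 32

# Embedding: one vector of size embedding_dim per token id.
embedding_params = vocab_size * embedding_dim

# LSTM: four weight sets (three gates plus the candidate cell state),
# each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# Dense: one weight per LSTM unit plus a single bias.
dense_params = lstm_units * 1 + 1

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# 1280000 20608 33 1300641
```

Almost all of the parameters sit in the embedding table, which is typical for small text models with a large vocabulary.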

model.compile('adam', 'binary_crossentropy', metrics=['acc'])
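For intuition about the loss, here is a small pure-Python sketch of binary crossentropy applied to sigmoid outputs. It illustrates the formula only and is not the Keras implementation; the function names, the clipping constant eps, and the toy labels are assumptions.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 1]
confident = [sigmoid(4.0), sigmoid(-4.0), sigmoid(4.0)]   # close to the labels
unsure = [sigmoid(0.0)] * 3                                # always 0.5

loss_confident = binary_crossentropy(labels, confident)
loss_unsure = binary_crossentropy(labels, unsure)
print(loss_confident < loss_unsure)  # True
```

The loss is small when the predicted probabilities are close to the true labels and grows as the model becomes less certain or wrong, which is what the optimizer tries to reduce during training.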

Time for training:

model.fit(X_train, y_train, epochs=5)


Epoch 1/5
3900/3900 [==============================] - 6s 2ms/step - loss: 0.2269 - acc: 0.9297
Epoch 2/5
3900/3900 [==============================] - 5s 1ms/step - loss: 0.0386 - acc: 0.9926
Epoch 3/5
3900/3900 [==============================] - 5s 1ms/step - loss: 0.0151 - acc: 0.9974
Epoch 4/5
3900/3900 [==============================] - 5s 1ms/step - loss: 0.0067 - acc: 0.9992
Epoch 5/5
3900/3900 [==============================] - 5s 1ms/step - loss: 0.0045 - acc: 0.9992

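In only 5 epochs, we were able to reach over 99% accuracy on the training data! Is it too good to be true? Let us evaluate the model:

model.evaluate(X_train, y_train)


3900/3900 [==============================] - 1s 224us/step
[0.0025394210962053293, 0.9997435897435898]

The call returns the loss and the accuracy, about 0.0025 and 99.97% here. Note that this call passes X_train and y_train, and the log counts 3900 samples, so this figure is the accuracy on the training data. To measure how the model performs on unseen messages, call model.evaluate(X_test, y_test) instead.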


  1. Try to draw a sketch of one input sentence and the architecture of the defined RNN.
  2. Try to write an SMS yourself and see how the model classifies it. This challenge requires you to build the whole pipeline first before you can pass the SMS to your model. Write your code in the comments section below.


Here you go. You now know how to train RNNs. These models are very powerful for sequential data. They can be used on text, time-series, videos, etc. Stay tuned for more information…

Note: This is a guest post, and the opinions in this article are those of the guest writer.