Generating Your Shakespeare Text Using Sequential Models Such As Long Short-Term Memory Networks (LSTMs), Gated Recurrent Units (GRUs), and Recurrent Neural Networks (RNNs)

In the previous article, we discussed Artificial Neural Networks (ANNs) and Convolutional Neural Networks (CNNs) and applied them to detect fake news. This article explains what RNNs are and how to use them. First, let us look at the problem with ordinary feed-forward networks: they cannot capture the sequential relationship between the words in a sentence, even though the next word clearly depends on the part of the sentence that came before it. To capture this dependence, we use a sequential neural network, the Recurrent Neural Network (RNN).

Sequential Models ( RNNs ) :

Let us understand how a Recurrent Neural Network (RNN) works. Suppose we are given a sentence and have to predict the next word at each position. The RNN takes the first word, passes it through a neural network, and predicts the second word. To predict the third word, it takes the hidden-state activation from the previous step together with the second word as input, and the process continues in the same way. The sequential relationship is captured because the hidden state from the previous step is used to predict the next word; in effect, an encoded version of the sentence so far feeds into every prediction. This is why RNNs are so powerful.
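
The recurrence described above can be sketched in a few lines of NumPy. This is only an illustration of the idea, not the Keras implementation: the weight names (`W_xh`, `W_hh`, `W_hy`), the toy sizes, and the random initialisation are all assumptions made for the example.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, W_hy, b_h, b_y):
    """Run a plain RNN over a sequence of input vectors.

    Returns the hidden state and the output distribution at every step.
    """
    h = np.zeros(W_hh.shape[0])  # initial hidden state
    hidden_states, outputs = [], []
    for x in inputs:
        # The new hidden state mixes the current input with the previous state.
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        logits = W_hy @ h + b_y
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()  # softmax over the vocabulary
        hidden_states.append(h)
        outputs.append(probs)
    return hidden_states, outputs

# Toy sizes (hypothetical): a vocabulary of 5 tokens and a hidden state of size 8.
rng = np.random.default_rng(0)
vocab_size, hidden_size = 5, 8
W_xh = rng.normal(scale=0.1, size=(hidden_size, vocab_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_hy = rng.normal(scale=0.1, size=(vocab_size, hidden_size))
b_h, b_y = np.zeros(hidden_size), np.zeros(vocab_size)

sequence = [np.eye(vocab_size)[i] for i in [0, 3, 1]]  # three one-hot tokens
states, preds = rnn_forward(sequence, W_xh, W_hh, W_hy, b_h, b_y)
```

Each entry of `preds` is a probability distribution over the next token, and each depends on all earlier tokens only through the hidden state `h`.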


Sequential Models ( LSTMs and GRUs ) :

Gated Recurrent Units (GRUs) and Long Short-Term Memory networks (LSTMs) are more effective structures. The practical reason for preferring them over a plain RNN is as follows. A plain RNN passes information from every previous word through its hidden state when predicting the next word, but often only part of the sentence is needed for that prediction. LSTMs and GRUs use this idea: the network is designed with gates that let the model learn which information to keep and which to discard. That is the intuition behind these sequential models. Now let us apply them to create some exciting content. We are going to build a model that writes like Shakespeare. Sounds excellent, right? Let us get into it.
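
To make the selection idea concrete, here is a rough sketch of a single GRU step in NumPy. It follows one common textbook formulation; libraries differ in small details (for example, which side of the update gate multiplies the old state), and the parameter names and toy sizes below are assumptions for illustration only.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, params):
    """One GRU step. `params` is a dict of weight matrices and biases."""
    # Update gate: how much of the new candidate replaces the old state.
    z = sigmoid(params["W_z"] @ x + params["U_z"] @ h_prev + params["b_z"])
    # Reset gate: how much of the old state feeds into the candidate.
    r = sigmoid(params["W_r"] @ x + params["U_r"] @ h_prev + params["b_r"])
    h_cand = np.tanh(params["W_h"] @ x + params["U_h"] @ (r * h_prev) + params["b_h"])
    return (1.0 - z) * h_prev + z * h_cand

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3  # toy sizes
params = {}
for gate in ("z", "r", "h"):
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(hidden_size, input_size))
    params[f"U_{gate}"] = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
    params[f"b_{gate}"] = np.zeros(hidden_size)

h_prev = rng.normal(size=hidden_size)
x = rng.normal(size=input_size)

# Force the update gate shut with a very negative bias:
# the old state is then carried over almost unchanged.
params["b_z"] = np.full(hidden_size, -50.0)
h_kept = gru_step(x, h_prev, params)
```

With the update gate closed, `h_kept` is practically equal to `h_prev`, which is how a gated network can ignore an input it has learned is irrelevant. During training the gate values are learned rather than set by hand.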


Before creating a model, we should preprocess the data so that it matches the input the model expects. Let's do that first. Given the text data, we teach the model to predict the next word from the previous 12 words. Hence the input sequence has length 12 and the output has length 1. In this article, we look at two different implementations:

1. Character-level modelling

2. Using a word embedding
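
Whichever of the two approaches is used, the training pairs are built the same way: a fixed-length window of tokens is the input and the token that follows it is the target. A small sketch of that windowing at the word level is shown below. The helper name `make_windows` is hypothetical, and a window of 3 is used instead of 12 only so the toy sentence is long enough.

```python
def make_windows(tokens, window=12):
    """Pair each run of `window` tokens with the single token that follows it."""
    inputs, targets = [], []
    for i in range(len(tokens) - window):
        inputs.append(tokens[i:i + window])
        targets.append(tokens[i + window])
    return inputs, targets

words = "to be or not to be that is the question".split()
inputs, targets = make_windows(words, window=3)
print(inputs[0], "->", targets[0])  # ['to', 'be', 'or'] -> not
```

The ten-word sentence yields seven input/target pairs; the same function works on a string of characters, since slicing behaves the same way.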

Character level modelling :

Here, the input at each time step is an encoded version of a single character. In this model, we would use the previous 12 characters to predict the next one; since 12 is a very small context, let's use 100 instead. The data here comes from Shakespeare's writing. You can try it out with some other text.

Preprocessing :

In preprocessing, we build the data set so that each input is a sequence of 100 one-hot encoded characters and each output is the one-hot encoding of the character that follows. The implementation is as follows.

from keras.callbacks import LambdaCallback
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.optimizers import RMSprop
import numpy as np
import random
import sys
import io
import tensorflow as tf
import matplotlib.pyplot as plt
import platform
import time
import pathlib
import os

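cache_dir = './tmp'
dataset_file_name = 'shakespeare.txt'
dataset_file_origin = ''  # URL of the Shakespeare text file (left blank here)

# Download the file, or reuse the cached copy, and get its local path.
dataset_file_path = tf.keras.utils.get_file(
    fname=dataset_file_name,
    origin=dataset_file_origin,
    cache_dir=cache_dir)

# Read the whole corpus into a single string.
with open(dataset_file_path, mode='r') as ss:
    text = ss.read()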
chars = sorted(list(set(text)))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

def build_data(text, Tx = 100, stride = 1):  
    X = []
    Y = []
    for i in range(0, len(text) - Tx, stride):
        X.append(text[i: i + Tx])
        Y.append(text[i + Tx]) 
    print('number of training examples:', len(X))
    return X, Y

X,Y = build_data(text[:10000])

def vectorization(X, Y, n_x, char_indices, Tx = 100):
    m = len(X)
    # np.bool has been removed from NumPy; the builtin bool is the replacement.
    x = np.zeros((m, Tx, n_x), dtype=bool)
    y = np.zeros((m, n_x), dtype=bool)
    for i, sentence in enumerate(X):
        for t, char in enumerate(sentence):
            x[i, t, char_indices[char]] = 1
        y[i, char_indices[Y[i]]] = 1    
    return x, y 

x,y = vectorization(X,Y,len(chars),char_indices,Tx=100)
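
As a quick check of what this encoding produces, the same steps can be run on a toy string. This is a self-contained sketch with a window of 4 characters rather than 100; the `toy_` names are made up for the example.

```python
import numpy as np

toy_text = "hello world"
toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}

Tx = 4  # window length for the toy example (the article uses 100)
windows = [toy_text[i:i + Tx] for i in range(len(toy_text) - Tx)]
next_chars = [toy_text[i + Tx] for i in range(len(toy_text) - Tx)]

# x: (examples, time steps, vocabulary size); y: (examples, vocabulary size)
x_toy = np.zeros((len(windows), Tx, len(toy_chars)), dtype=bool)
y_toy = np.zeros((len(windows), len(toy_chars)), dtype=bool)
for i, window in enumerate(windows):
    for t, char in enumerate(window):
        x_toy[i, t, toy_char_indices[char]] = True
    y_toy[i, toy_char_indices[next_chars[i]]] = True

print(x_toy.shape, y_toy.shape)  # (7, 4, 8) (7, 8)
```

The string has 11 characters and 8 distinct ones, so there are 7 windows, and every time step of every window has exactly one entry set.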

def sample(preds, temperature=1.0):
    # Rescale the predicted distribution by the temperature, then draw one index from it.
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # multinomial returns a one-hot draw; argmax recovers the sampled index.
    probas = np.random.multinomial(1, preds, 1)
    out = np.argmax(probas)
    return out
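
The effect of the temperature argument is easier to see on a small distribution. The sketch below isolates the rescaling step of `sample` in a hypothetical helper, `apply_temperature`, and leaves out the random draw so the result is deterministic.

```python
import numpy as np

def apply_temperature(preds, temperature=1.0):
    """Rescale a probability vector: low temperature sharpens it, high flattens it."""
    logits = np.log(np.asarray(preds, dtype="float64")) / temperature
    exp_logits = np.exp(logits - logits.max())
    return exp_logits / exp_logits.sum()

probs = np.array([0.6, 0.3, 0.1])
sharp = apply_temperature(probs, temperature=0.5)
same = apply_temperature(probs, temperature=1.0)
flat = apply_temperature(probs, temperature=2.0)
```

A temperature of 0.5 squares each probability before renormalising, so the most likely character goes from 0.6 to about 0.78; a temperature of 1.0 leaves the distribution unchanged; a temperature of 2.0 pulls the probabilities closer together, which gives more varied but less reliable text.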

Model Development :

Here I used an LSTM, but you can try changing it to a GRU or a simple RNN.

model = Sequential()
# Return only the final hidden state so the output shape matches y: (batch, len(chars)).
model.add(LSTM(256, input_shape=(100, len(chars))))
model.add(Dense(len(chars), activation='softmax'))
maxlen = 100

def on_epoch_end(epoch, _):
    # Every 150 epochs, generate 500 characters from a random seed in the text.
    if epoch > 0 and epoch % 150 == 0:
        print('----- Generating text after Epoch: %d' % epoch)

        start_index = random.randint(0, len(text) - maxlen - 1)
        for diversity in [0.5]:
            print('----- diversity:', diversity)

            generated = ''
            sentence = text[start_index: start_index + maxlen]
            generated += sentence
            print('----- Generating with seed: "' + sentence + '"')

            for i in range(500):
                # One-hot encode the current window of maxlen characters.
                x_pred = np.zeros((1, maxlen, len(chars)))
                for t, char in enumerate(sentence):
                    x_pred[0, t, char_indices[char]] = 1.

                preds = model.predict(x_pred, verbose=0)[0]
                next_index = sample(preds, diversity)
                next_char = indices_char[next_index]

                # Slide the window forward by one character and record the output.
                sentence = sentence[1:] + next_char
                generated += next_char

            print(generated)


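optimizer = RMSprop(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

# Train the model. Pass batch_size and epochs to suit your setup (no values are given here);
# note that on_epoch_end only prints samples at epochs that are multiples of 150.
model.fit(x, y, callbacks=[print_callback])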
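
To try the other layer types mentioned above, one option is to make the recurrent layer a parameter of a small builder function. The sketch below is one possible way to do it, not the article's original code: the function name `build_char_model`, the `RECURRENT_LAYERS` lookup, and the vocabulary size of 60 are assumptions for illustration.

```python
from keras.models import Sequential
from keras.layers import Input, Dense, LSTM, GRU, SimpleRNN

# Map a short name to the Keras recurrent layer class it stands for.
RECURRENT_LAYERS = {"lstm": LSTM, "gru": GRU, "rnn": SimpleRNN}

def build_char_model(kind, seq_len, n_chars, units=256):
    """Build a next-character model whose recurrent layer is chosen by `kind`."""
    recurrent_cls = RECURRENT_LAYERS[kind]
    model = Sequential([
        Input(shape=(seq_len, n_chars)),
        recurrent_cls(units),                  # returns only the last hidden state
        Dense(n_chars, activation="softmax"),  # distribution over the next character
    ])
    model.compile(loss="categorical_crossentropy", optimizer="rmsprop")
    return model

# Hypothetical vocabulary size; in the article this would be len(chars).
gru_model = build_char_model("gru", seq_len=100, n_chars=60)
```

Because only the recurrent layer changes, the same training data and callback can be reused to compare the three variants.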

Results :

The generated text which I got after training the model

We can observe that the results are pretty good. The code as given will probably not produce the best possible results; to get there, you need to do some hyperparameter tuning, which is a skill that will help you a lot in the future. I suggest you play with this model by changing the architecture and hyperparameters such as the sequence length, number of epochs, batch size, optimizer, etc. The main difference between the word-level and character-level models is that the word-level model has an extra embedding layer. Try it out.
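
For readers who want to see what that extra embedding layer does, here is a rough NumPy sketch. It assumes the usual view of an embedding as a trainable lookup table; the sizes and the random matrix are toy values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 6, 3  # toy sizes
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

word_ids = np.array([2, 5, 0, 2])  # a four-word input sequence as integer ids

# What an embedding layer does: pick one row of the matrix per word id.
embedded = embedding_matrix[word_ids]

# The same result via explicit one-hot vectors, as in the character-level model.
one_hot = np.eye(vocab_size)[word_ids]
embedded_via_one_hot = one_hot @ embedding_matrix
```

Both routes give the same `(4, 3)` array, but the lookup avoids building large one-hot vectors, which matters because a word vocabulary is far bigger than a character set.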

Shivesh Kodali is a content writing consultant at MarktechPost. He is currently pursuing his B.Tech in Electronics and Communication Engineering at the Indian Institute of Technology (IIT), Kharagpur. He is a Deep Learning fanatic who loves understanding and implementing its complex algorithms.
