Implementing Batching for Seq2Seq Models in Pytorch

In this tutorial, we will discuss how to implement batching in sequence-to-sequence models using Pytorch. We will implement batching by building a Recurrent Neural Network that classifies the nationality of a name based on character-level embeddings. This is a follow-up to my previous post on Classifying the Name Nationality of a Person using LSTM and Pytorch.

Batching is the process of passing several training instances through the network simultaneously, in both the forward and the backward pass.

Import Libraries

Before we start building the network, we need to import the libraries.

#load the packages

from io import open
import os, string, random, time, math
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split

import torch
import torch.nn as nn
import torch.optim as optim

Data Set

The data set is a text file that contains the name of a person and the nationality of that name, separated by a comma. Here is a look at the data:

In my previous post, we have already discussed how to implement the basic Sequence to Sequence model without batching to classify the name nationality of a person. In this post, we will directly implement batching for representing the names and nationalities of a person and then use that representation to train the model.

Batching in Pytorch

Batching can be broken down into two topics:

1. Vectorisation – Vectorisation is the task of performing an operation on a whole batch in parallel, instead of doing it sequentially. This is known as data parallelism and is most often done on GPUs. Vectorisation is heavily used not only on GPUs but also on CPUs to improve performance, because multiple computations run in parallel.

2. Aggregated Gradient Comparison – Without batching, we would compute the gradients for one image using a CNN, update the parameters, then take another image, compute the gradients and update the parameters again, continuing until we have gone through all the images. The problem with this approach is that when we compute gradients image by image, there will be a lot of variability in the parameter updates.

Instead of computing gradients image by image, we can take a batch of images then compute the gradients using our neural network so that it reduces the variability in parameter updates.
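
To make the aggregation concrete, here is a minimal sketch on a toy linear model (the model, data and sizes are made up for illustration and are not part of this post's code). It suggests that, with a loss that averages over the batch, the gradient from one batched pass equals the mean of the gradients computed one sample at a time.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy model and data (hypothetical, for illustration only).
model = nn.Linear(4, 1)
x = torch.randn(8, 4)   # batch of 8 samples
y = torch.randn(8, 1)
criterion = nn.MSELoss()  # averages the loss over the batch by default

# Gradient from one batched forward/backward pass.
model.zero_grad()
criterion(model(x), y).backward()
batch_grad = model.weight.grad.clone()

# Average of the gradients computed one sample at a time.
per_sample_grads = []
for i in range(len(x)):
    model.zero_grad()
    criterion(model(x[i:i + 1]), y[i:i + 1]).backward()
    per_sample_grads.append(model.weight.grad.clone())
mean_grad = torch.stack(per_sample_grads).mean(dim=0)

grads_match = torch.allclose(batch_grad, mean_grad, atol=1e-5)
print(grads_match)
```

A single update with the averaged gradient is therefore less noisy than eight separate updates, each driven by one sample.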

In sequence-to-sequence models, batching means encoding several inputs simultaneously and processing them together with our neural network, whether it is an RNN, LSTM or GRU. Without batching, we would process the inputs one by one, i.e. we would encode one character after another of a single input and then train the network using those encodings. In this tutorial, we will discuss how to process a batch of names for training the network.

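In batching, we take multiple input names and process the characters present in these inputs simultaneously by merging them at the character level: the first characters of all names are processed together, then the second characters, and so on. This way we are vectorizing across names, but not across the characters of the same name, which still have to be processed in order.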

The entire code discussed in the article is present in this GitHub repository. Feel free to fork it or download it.

Encoding Names Batching

To implement the batching, we need to encode the sequence of names such that we would be able to process them simultaneously instead of sequentially. In effect, we will create an encoding such that, we would get only one vector representation for all the input names in the batch.

#create a batched name rep

def batched_name_rep(names, max_word_size):
    rep = torch.zeros(max_word_size, len(names), n_letters)
    for name_index, name in enumerate(names):
        for letter_index, letter in enumerate(name):
            pos = all_letters.find(letter)
            rep[letter_index][name_index][pos] = 1
    return rep

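#function to print the output
def print_char(name_reps):
    # flatten to (time steps * batch size, n_letters) so each row is one character slot
    name_reps = name_reps.view((-1, name_reps.size()[-1]))
    for t in name_reps:
        if torch.sum(t) == 0:
            # an all-zero row is a padded position
            print('<pad>')
        else:
            index = t.argmax()
            print(all_letters[index])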

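The function batched_name_rep above takes a list of names and creates their one-hot vector representation. First, we declare a tensor of zeros of shape (max_word_size, number of names, n_letters), where max_word_size is the maximum length of the input names. We then iterate through each character of each name and set the matching position to 1, which gives a one-hot representation of all the names in a single tensor. Positions past the end of a shorter name stay all zeros and act as padding.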
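
As a quick check of the layout, the sketch below runs batched_name_rep on three short names. The alphabet used here (ASCII letters plus a few punctuation characters) is an assumption made for this example; all_letters and n_letters are defined elsewhere in the original code.

```python
import string
import torch

# Assumed alphabet for this example; the post defines all_letters elsewhere.
all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

def batched_name_rep(names, max_word_size):
    rep = torch.zeros(max_word_size, len(names), n_letters)
    for name_index, name in enumerate(names):
        for letter_index, letter in enumerate(name):
            pos = all_letters.find(letter)
            rep[letter_index][name_index][pos] = 1
    return rep

names = ["Ann", "Li", "Bob"]
rep = batched_name_rep(names, max_word_size=3)
print(rep.shape)  # (time steps, batch size, alphabet size)

# Time step 0 holds the first letter of every name.
first_letters = [all_letters[i] for i in rep[0].argmax(dim=1).tolist()]
print(first_letters)

# "Li" has only two letters, so its slot at time step 2 stays all zeros.
padding_sum = rep[2, 1].sum().item()
print(padding_sum)
```

Reading the tensor along its first axis gives one time step at a time, which is exactly the order in which the recurrent network consumes it.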

The print_char function is a helper function, to help us visualize how batching works when we are encoding the multiple names simultaneously.

Sample encoding looks like this

Encoding Nationalities

The logic for encoding nationalities is much simpler than for encoding names. To encode a nationality, we find the index of that particular nationality in our list of nationalities and use that index as its encoding.

def batched_lang_rep(langs):
    rep = torch.zeros([len(langs)], dtype=torch.long)
    for index, lang in enumerate(langs):
        rep[index] = languages.index(lang)
    return rep
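
A small usage sketch, assuming a made-up three-entry languages list (the real list is built from the data set):

```python
import torch

# Hypothetical label list; the post builds `languages` from the data set.
languages = ["English", "French", "Hindi"]

def batched_lang_rep(langs):
    rep = torch.zeros([len(langs)], dtype=torch.long)
    for index, lang in enumerate(langs):
        rep[index] = languages.index(lang)
    return rep

targets = batched_lang_rep(["Hindi", "English", "Hindi"])
print(targets)        # one class index per name in the batch
print(targets.dtype)  # integer indices, as NLLLoss expects for its targets
```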

Recurrent Neural Network Model

In my previous article, we discussed how to implement an RNN as a Pytorch nn.Module. We will use the same kind of RNN network to train on the batched inputs.

#create simple rnn network 
class RNN_net(nn.Module):
    #Create a constructor
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN_net, self).__init__()
        self.hidden_size = hidden_size 
        self.rnn_cell = nn.RNN(input_size, hidden_size)
        self.h20 = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim = 1)

    #create a forward pass function
    def forward(self, input_, hidden = None, batch_size = 1):
        out, hidden = self.rnn_cell(input_, hidden)
        output = self.h20(hidden.view(-1, self.hidden_size))
        output = self.softmax(output)
        return output, hidden

    def init_hidden(self, batch_size = 1):
        #function to init the hidden layers
        return torch.zeros(1, batch_size, self.hidden_size)

The __init__ function (constructor function) helps us to initialize the parameters of the network like weights and biases associated with the hidden layers. The __init__ function takes input size (size of the representation of one character), hidden layer size, and output size (which is equal to the number of languages we have).

The nn.RNN and nn.Linear modules automatically define the weights and biases of their layers, so we do not have to declare them manually. Let’s see what’s going on inside the __init__ function.

The rnn_cell layer (nn.RNN) computes the hidden representation at each time step by combining the current time step's input with the hidden representation from the previous time step. The h20 layer (nn.Linear) maps the final hidden representation to one score per language, and the LogSoftmax layer turns those scores into log-probabilities.

Our forward function takes the encoded representation of a batch of names and, optionally, an initial hidden state as input. It runs the whole sequence through rnn_cell, reshapes the final hidden state to (batch size, hidden size) using view, and passes it through h20 and the softmax layer to compute the output.
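
The shapes involved can be checked with a short sketch. The sizes below are arbitrary and chosen only for illustration; the layers mirror the ones in RNN_net but are created standalone here.

```python
import torch
import torch.nn as nn

# Illustrative sizes, not the ones used for training in this post.
seq_len, batch_size, input_size, hidden_size, n_classes = 5, 3, 57, 16, 4

rnn = nn.RNN(input_size, hidden_size)
h2o = nn.Linear(hidden_size, n_classes)

x = torch.zeros(seq_len, batch_size, input_size)  # (time, batch, features)
out, hidden = rnn(x)                               # initial hidden state defaults to zeros

print(out.shape)     # hidden state at every time step
print(hidden.shape)  # hidden state after the last time step

log_probs = torch.log_softmax(h2o(hidden.view(-1, hidden_size)), dim=1)
print(log_probs.shape)  # one row of class log-probabilities per name
```

For an unpacked input like this one, the last slice of out and the returned hidden state hold the same values; only the final hidden state is needed for classification.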

Batched Dataloader

In this section, we will implement the batched dataloader where we pass the set of names and get the padded tensor representations as the output required for training.

#create dataloader
def batched_dataloader(npoints, X_, y_, verbose=False, device = 'cpu'):
    names = []
    langs = []
    X_lengths = []
    for i in range(npoints):
        # sample a random (name, language) pair
        index_ = np.random.randint(len(X_))
        name, lang = X_[index_], y_[index_]
        X_lengths.append(len(name))
        names.append(name)
        langs.append(lang)
    max_length = max(X_lengths)
    names_rep = batched_name_rep(names, max_length).to(device)
    langs_rep = batched_lang_rep(langs).to(device)
    # pack the padded batch so the RNN skips the padded positions
    padded_names_rep = torch.nn.utils.rnn.pack_padded_sequence(names_rep, X_lengths, enforce_sorted = False)
    if verbose:
        print('Lang Rep', langs_rep.data)
        print('Batch sizes', padded_names_rep.batch_sizes)
    return padded_names_rep, langs_rep

The batched_dataloader takes three mandatory parameters as an input,

  • npoints – Number of inputs
  • X_ – X train data
  • y_ – y train data

In this function, we randomly sample npoints data points and then build the padded representation for these data points using batched_name_rep and batched_lang_rep. Since all character sequences in a batch must have the same length, padding is applied where needed.

The way we apply padding is that,

  • Find the maximum input length across all the sequences (say, 10)
  • Add special word <pad> to all shorter sequences so that they become of the same length (10, in this case).

Important points to note about padding are:

  • Padding is only done to ensure that the input sequences are of uniform size.
  • The computations in the RNN are only performed up to the last real character of each sequence, i.e. padding is not considered as an input to the network.
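
The padding steps above can be sketched with Pytorch's pad_sequence utility. This is an alternative illustration rather than what the post's code does: batched_name_rep gets the same effect by starting from a tensor of zeros. The tiny feature size is made up to keep the output readable.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three "names" of different lengths; each character is a 2-dim vector
# (toy feature size, chosen only to keep the example small).
seqs = [torch.ones(4, 2), torch.ones(2, 2), torch.ones(3, 2)]
lengths = [len(s) for s in seqs]

padded = pad_sequence(seqs)  # defaults: batch_first=False, padding_value=0
print(padded.shape)          # (max length, batch size, feature size)

# Positions past a sequence's true length are all zeros.
tail_of_short_seq = padded[2:, 1]
print(tail_of_short_seq)
```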

Once we have our padded representation, we pack the tensor containing the padded, variable-length sequences using the Pytorch pack_padded_sequence function. pack_padded_sequence takes two mandatory inputs,

  • names_rep – Padded representation of the names.
  • X_lengths – List with the sequence length of each batch element.

For unsorted sequences, use enforce_sorted = False. If enforce_sorted is True, the sequences must be sorted by length in decreasing order.
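
The following sketch, with made-up sizes and random data, shows what packing produces and suggests why padding does not affect the result: the final hidden state of a short sequence inside the packed batch matches the one obtained by running that sequence on its own.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)

lengths = [3, 1, 2]               # true lengths, deliberately unsorted
padded = torch.randn(3, 3, 4)     # (max length, batch, features), toy sizes
for b, n in enumerate(lengths):
    padded[n:, b] = 0             # zero out the padded positions

packed = pack_padded_sequence(padded, lengths, enforce_sorted=False)
print(packed.batch_sizes)         # number of sequences still active at each time step

rnn = nn.RNN(4, 5)
_, h_packed = rnn(packed)         # final state per sequence, in the original batch order

# Running the second sequence alone (length 1) gives the same final state.
_, h_single = rnn(padded[:1, 1:2])
same = torch.allclose(h_packed[0, 1], h_single[0, 0], atol=1e-5)
print(same)
```

The batch_sizes attribute is the same one printed by batched_dataloader in verbose mode.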

Training Recurrent Neural Network

In this section, we will create a generic training setup that can also be used for other networks such as LSTM and GRU. To train our network, we need to define the loss function and the optimization algorithm. In this case, we will use NLLLoss to calculate the loss of the network and the SGD optimizer to minimize it.
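
NLLLoss expects log-probabilities, which is why both networks in this post end with a LogSoftmax layer. As a small sanity check on random scores (made up for illustration), LogSoftmax followed by NLLLoss gives the same value as Pytorch's CrossEntropyLoss applied to the raw scores:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
scores = torch.randn(4, 3)             # raw outputs for 4 names, 3 classes
targets = torch.tensor([0, 2, 1, 2])   # class indices

log_probs = nn.LogSoftmax(dim=1)(scores)
nll = nn.NLLLoss()(log_probs, targets)
ce = nn.CrossEntropyLoss()(scores, targets)
print(nll.item(), ce.item())
```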

We will also compare the time taken between the normal training loop and batched training loop

Before we start training our network, let’s define a custom function to calculate the accuracy of our network.

#create an evaluation function 

def eval(net, n_points, topk, X_, y_, device = "cpu"):
    "Evaluation function"

    net = net.eval().to(device)
    data_ = dataloader(n_points, X_, y_)
    correct = 0

    for name, language, name_ohe, lang_rep in data_:

        #get the output
        output = infer(net, name, device)
        val, indices = output.topk(topk) #get the top k values
        indices = indices.to(device) #convert to devices
        if lang_rep in indices:
            correct += 1

    accuracy = correct/n_points
    return accuracy

The evaluation function takes network instance, the number of data points, k, test x, and test y as the input parameters. In this function,

  • We load the data using the data loader.
  • Iterating through all person names present in the data loader.
  • Invoking our model on the inputs and getting the outputs.
  • Computing the predicted class.
  • Calculating the total number of correctly predicted classes and returning the final percentage.
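
The top-k check can also be written for a whole batch at once. The sketch below is an assumption-laden variant, not the eval function above: topk_accuracy is a hypothetical helper and the log-probabilities are invented.

```python
import torch

# Hypothetical log-probabilities for 4 names over 3 classes.
log_probs = torch.log(torch.tensor([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.4, 0.5, 0.1],
    [0.2, 0.3, 0.5],
]))
targets = torch.tensor([0, 1, 0, 0])

def topk_accuracy(log_probs, targets, k):
    _, indices = log_probs.topk(k, dim=1)                 # (batch, k) class indices
    hits = (indices == targets.unsqueeze(1)).any(dim=1)   # is the target among the top k?
    return hits.float().mean().item()

top1 = topk_accuracy(log_probs, targets, 1)
top2 = topk_accuracy(log_probs, targets, 2)
print(top1, top2)
```

Top-2 accuracy can never be lower than top-1 accuracy, since every top-1 hit is also a top-2 hit.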

def train_setup(net, lr = 0.01, n_batches = 100, batch_size = 10, momentum = 0.9, display_freq=5, device = 'cpu'):
    net = net.to(device)  # move the model to the target device
    criterion = nn.NLLLoss()
    opt = optim.SGD(net.parameters(), lr=lr, momentum=momentum)
    loss_arr = np.zeros(n_batches + 1)  # loss_arr[i] is the mean loss over the first i batches
    for i in range(n_batches):
        loss_arr[i+1] = (loss_arr[i]*i + float(train_batch(net, opt, criterion, batch_size, device)))/(i + 1)
        if i%display_freq == display_freq-1:
            print('Iteration', i, 'Loss', loss_arr[i+1])
            # print('Top-1:', eval(net, len(X_test), 1, X_test, y_test), 'Top-2:', eval(net, len(X_test), 2, X_test, y_test))
            plt.plot(loss_arr[1:i+2], '-*')
    print('Top-1 Accuracy:', eval(net, len(X_test), 1, X_test, y_test, device), 'Top-2 Accuracy:', eval(net, len(X_test), 2, X_test, y_test, device))

In our training loop,

  • For each batch, we call the batched data loader.
  • Get the padded and packed representation of inputs.
  • Reset any previous gradient present in the optimizer, before computing the gradient for the next batch.
  • Execute the forward pass and get the output.
  • Compute the loss based on the predicted output and actual output.
  • Backpropagate the gradients and update the parameters with the optimizer.
  • Every display_freq batches, we print a progress message.
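
The steps above can be sketched on a stand-in model. The linear classifier and random batch below are hypothetical placeholders for RNN_net and batched_dataloader; only the order of the optimizer calls is the point.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Stand-in model and batch (hypothetical); the post uses RNN_net and batched_dataloader.
net = nn.Sequential(nn.Linear(6, 3), nn.LogSoftmax(dim=1))
inputs = torch.randn(16, 6)
targets = torch.randint(0, 3, (16,))

criterion = nn.NLLLoss()
opt = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)

losses = []
for step in range(50):
    opt.zero_grad()                     # reset gradients from the previous batch
    output = net(inputs)                # forward pass
    loss = criterion(output, targets)   # compare prediction with ground truth
    loss.backward()                     # backpropagate
    opt.step()                          # update the parameters
    losses.append(loss.item())

print(losses[0], losses[-1])
```

Storing loss.item() rather than the loss tensor keeps only a plain float and lets the computation graph of each step be freed.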

Hyperparameters used in the training process are as follows:

  • Learning rate: 0.15
  • Loss function: Negative Log-Likelihood Loss
  • Optimizer: Stochastic Gradient Descent with Momentum
  • Number of batches = 5000
  • Batch size = 512
  • Number of hidden units = 128

Visualization of Loss Plot

We can plot the loss of the network against each iteration to check the model performance.
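
Note that the values stored in loss_arr by train_setup are running averages of the batch losses, not the raw loss of each batch. A minimal sketch of that update rule, using invented loss values:

```python
import numpy as np

batch_losses = [2.0, 1.0, 0.0, 1.0]   # made-up per-batch losses
loss_arr = np.zeros(len(batch_losses) + 1)

for i, batch_loss in enumerate(batch_losses):
    # Same update rule as in train_setup: fold the new loss into the mean so far.
    loss_arr[i + 1] = (loss_arr[i] * i + batch_loss) / (i + 1)

print(loss_arr[1:])
```

Because each point is an average over all batches seen so far, the plotted curve is smoother than the raw losses and reacts more slowly to recent changes.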

After training the model for 5000 batches, we are able to achieve a top-1 accuracy of 73% and a top-2 accuracy of 85% with the RNN Model.

Comparison of Normal Training and Batched Training

To compare the training speeds of normal mode of training and batched training, we need to define two training setups.

Basic training setup

#basic train function
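def train(net, opt, criterion, n_points):
    opt.zero_grad()  # reset gradients before accumulating over the sampled points
    total_loss = 0
    data_ = dataloader(n_points, X_train, y_train)

    for name, language, name_ohe, lang_rep in data_:
        hidden = net.init_hidden()
        # feed the name one character at a time
        for i in range(name_ohe.size()[0]):
            output, hidden = net(name_ohe[i:i+1], hidden)
        loss = criterion(output, lang_rep)
        loss.backward()  # gradients accumulate across the sampled points
        total_loss += loss.item()
    opt.step()  # single parameter update for all sampled points
    return total_loss/n_points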

Batched training setup

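def train_batch(net, opt, criterion, n_points, device = 'cpu'):
    net.train().to(device)
    opt.zero_grad()  # reset gradients from the previous batch
    batch_input, batch_groundtruth = batched_dataloader(n_points, X_train, y_train, False, device)
    output, hidden = net(batch_input)  # one forward pass for the whole batch
    loss = criterion(output, batch_groundtruth)
    loss.backward()  # backpropagate
    opt.step()  # update the parameters

    return loss.item()  # plain float, safe to store in a NumPy array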

The only difference between the two training setups is how we get the inputs from the data loader. In normal training, the data loader encodes the inputs one after the other (sequentially). In the batched data loader, we encode multiple inputs in parallel so that we get an increase in performance due to vectorization.

Define the same training parameters.

  • Learning rate: 0.01
  • Momentum = 0.9
  • Loss function: Negative Log-Likelihood Loss
  • Optimizer: Stochastic Gradient Descent with Momentum
  • Number of hidden units (the same for both setups)
  • Number of data points in one batch: 256

We will use the IPython magic command %time to time the execution of each training setup.

As you can see from the above comparison, training with the batched setup is around 42 times faster than with the normal training setup. By implementing batching, we are utilizing data parallelism to improve the performance of the network.
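
Outside a notebook, where %time is not available, a similar comparison can be sketched with time.perf_counter. The example below uses a small randomly initialised nn.RNN and random inputs (all sizes are assumptions); the measured times depend on the machine and will not reproduce the figure quoted above.

```python
import time
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(8, 16)
batch = torch.randn(10, 64, 8)   # (time, batch, features), toy sizes

with torch.no_grad():
    # Process the 64 sequences one by one.
    start = time.perf_counter()
    one_by_one = torch.cat(
        [rnn(batch[:, b:b + 1])[1] for b in range(batch.size(1))], dim=1
    )
    loop_seconds = time.perf_counter() - start

    # Process all 64 sequences in a single batched call.
    start = time.perf_counter()
    _, batched = rnn(batch)
    batched_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, batched: {batched_seconds:.4f}s")
same_result = torch.allclose(one_by_one, batched, atol=1e-4)
print(same_result)
```

Both paths compute the same final hidden states; only the number of separate calls into the network differs.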

Long Short Term Memory – LSTM Model with Batching

In this section, we will discuss how to implement and train an LSTM model with batching to classify the nationality of a person’s name. We will subclass Pytorch's nn.Module and use nn.LSTM to create a custom class called LSTM_net.

class LSTM_net(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(LSTM_net, self).__init__()
        self.hidden_size = hidden_size
        self.lstm_cell = nn.LSTM(input_size, hidden_size)
        self.h2o = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)
    def forward(self, input, hidden = None):
        out, hidden = self.lstm_cell(input, hidden)
        output = self.h2o(hidden[0].view(-1, self.hidden_size))
        output = self.softmax(output)
        return output, hidden
    def init_hidden(self, batch_size = 1):
        return (torch.zeros(1, batch_size, self.hidden_size), torch.zeros(1, batch_size, self.hidden_size))

The LSTM network is the same as the one we used in the previous article; the only difference is how we pass the input representation to the network. From the implementation standpoint, the only change in the __init__ function is that we are using nn.LSTM. nn.LSTM handles all the necessary computations, including the computation of the hidden state itself.

init_hidden initializes two tensors of zeros. One tensor represents the hidden state and the other represents the cell state. The forward function takes the encoded input and its hidden representation as parameters, similar to the RNN. Pytorch's LSTM expects its inputs to be 3D tensors (or packed sequences) and returns its states as 3D tensors, which is why we reshape hidden[0] to 2D using the view function before passing it to the linear layer.
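
A short shape check of the two LSTM states, using a packed all-zero batch with made-up lengths and sizes:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

hidden_size = 16                      # illustrative size
lstm = nn.LSTM(57, hidden_size)
padded = torch.zeros(6, 4, 57)        # (max length, batch, alphabet size)
packed = pack_padded_sequence(padded, [6, 3, 5, 2], enforce_sorted=False)

out, (h_n, c_n) = lstm(packed)        # initial states default to zeros
print(h_n.shape, c_n.shape)           # each is (1, batch, hidden_size)

features = h_n.view(-1, hidden_size)  # the 2D tensor that LSTM_net feeds to h2o
print(features.shape)
```

Only the hidden state h_n is used for classification; the cell state c_n is carried along internally by the LSTM.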

Training setup for LSTM

n_hidden = 128
net = LSTM_net(n_letters, n_hidden, n_languages)
train_setup(net, lr=0.15, n_batches=8000, batch_size = 512, display_freq=1000, device = device_gpu)

The loss plot for the LSTM network would look like this,

After training the model for 8000 batches, we are able to achieve a top-1 accuracy of 79% and a top-2 accuracy of 89% with the LSTM Model.

There you have it, we have successfully built our nationality classification model using Pytorch with Batching. The entire code discussed in the article is present in this GitHub repository. Feel free to fork it or download it.


In this post, we discussed the need for batching in Pytorch and its advantages. After that, we discussed how to encode the names and nationalities before training the model. Finally, we looked at the implementations of the RNN and LSTM models used for training on the data. If you have any issues or doubts while implementing the above code, feel free to ask them in the comment section below or send me a message on LinkedIn citing this article.


Note: This is a guest post, and the opinions in this article are those of the guest writer.


Niranjan Kumar is working as a Senior Consultant Data Science at Allstate India. He is passionate about Deep Learning and Artificial Intelligence. He writes about the latest tools and technologies in the field of Deep Learning. He is one of the top writers in Artificial Intelligence at Medium. A Graduate of Praxis Business School, Niranjan Kumar holds a degree in Data Science. Feel free to contact him via LinkedIn for collaboration on projects
