In this tutorial, we will discuss how to implement batching in sequence-to-sequence models using Pytorch. We will implement batching by building a recurrent neural network to classify the nationality of a name based on character-level embeddings. This is a follow-up to my previous post on Classifying the Name Nationality of a Person using LSTM and Pytorch.
Batching is the process of passing (or training on) several training instances simultaneously, either forward or backward through the network.
Import Libraries
Before we start building the network, we need to import the required libraries.
#load the packages
from io import open
import os, string, random, time, math
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
import torch.optim as optim
from IPython.display import clear_output  #used later to refresh the loss plot during training
Data Set
The data set is a text file that contains a person's name and the nationality of that name, separated by a comma. Here is a look at the data:

In my previous post, we already discussed how to implement a basic sequence-to-sequence model without batching to classify the name nationality of a person. In this post, we will directly implement batching for representing the names and nationalities and then use that representation to train the model.
Batching in Pytorch
Batching can be broken down into two topics:
1. Vectorisation – Vectorisation is the task of performing an operation on a batch in parallel instead of doing it sequentially. This is what is known as data parallelism, mostly using GPUs. Vectorisation is heavily used not only on GPUs but also on CPUs to improve performance, because we run multiple operations in parallel.
2. Aggregated Gradient Computation – Generally, we compute the gradients for one image using a CNN, update the parameters, then take another image, compute its gradients, and update the parameters again; the process continues until we have gone through all the images. The problem with this approach is that when we compute gradients image by image, there is a lot of variability in the parameter updates.
Instead of computing gradients image by image, we can take a batch of images and compute the gradients over the whole batch, which reduces the variability in the parameter updates.
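Below is a minimal sketch of this second idea with a toy linear model; the model, data, and sizes here are made up purely for illustration. Per-sample updates produce many noisy gradient estimates, while a single batched pass averages the gradients into one lower-variance update.

import torch
import torch.nn as nn

model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))

# sample by sample: 32 separate gradient computations (and, normally, 32 updates)
for xi, yi in zip(x, y):
    loss = criterion(model(xi.unsqueeze(0)), yi.unsqueeze(0))
    loss.backward()
    # ...optimizer.step() and optimizer.zero_grad() would go here...
    model.zero_grad()

# batched: one forward/backward pass over all 32 samples;
# the gradient is averaged over the batch, reducing the update variability
loss = criterion(model(x), y)
loss.backward()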
In sequence-to-sequence models, batching means simultaneously encoding the inputs and processing them with our neural network, whether it is an RNN, LSTM, or GRU. Without batching, we would process the inputs one by one, i.e. encode one character after another of a single input and then train the network on those encodings. In this tutorial, we will discuss how to process a batch of names for training the network.
In batching, we take multiple input names and process the characters present in these inputs simultaneously by merging them at the character level. This way we vectorize across the inputs in the batch, but not across the characters of the same name.
The entire code discussed in the article is present in this GitHub repository. Feel free to fork it or download it.
Encoding Names in Batches
To implement batching, we need to encode the sequence of names such that we can process them simultaneously instead of sequentially. In effect, we will create an encoding such that we get only one tensor representation for all the input names in the batch.
#create a batched name rep
def batched_name_rep(names, max_word_size):
    rep = torch.zeros(max_word_size, len(names), n_letters)
    for name_index, name in enumerate(names):
        for letter_index, letter in enumerate(name):
            pos = all_letters.find(letter)
            rep[letter_index][name_index][pos] = 1
    return rep

#function to print the output
def print_char(name_reps):
    name_reps = name_reps.view((-1, name_reps.size()[-1]))
    for t in name_reps:
        if torch.sum(t) == 0:
            print('')
        else:
            index = t.argmax()
            print(all_letters[index])
The above function batched_name_rep takes a list of names and creates a one-hot vector representation of them. First, we declare a tensor of zeros whose first dimension equals the maximum length of the input names. We then iterate through each character of each name and set the corresponding position to 1, building the one-hot representation of all the names.
The print_char function is a helper that lets us visualize how batching works when we encode multiple names simultaneously.
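For a quick sanity check of the two helpers, a hypothetical call could look like the snippet below. It assumes all_letters and n_letters are defined as in the previous post (for example, all_letters = string.ascii_letters + " .,;'" and n_letters = len(all_letters)); the names used are arbitrary.

names = ['Kumar', 'Lee']                    # toy batch of two names
reps = batched_name_rep(names, max_word_size=5)
print(reps.shape)                           # torch.Size([5, 2, n_letters])
print_char(reps)                            # prints the characters time step by time step,
                                            # interleaving the two names: K, L, u, e, m, e, a, '', r, ''
                                            # (blank lines appear where the shorter name is padded)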
A sample encoding looks like this:

Encoding Nationalities
The logic for encoding nationalities is much simpler than encoding names. For each nationality, we just find the index of that nationality in our list of nationalities and assign that index as the encoding.
def batched_lang_rep(langs):
    rep = torch.zeros([len(langs)], dtype=torch.long)
    for index, lang in enumerate(langs):
        rep[index] = languages.index(lang)
    return rep
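For instance, if the global languages list built from the data set happened to start with ['English', 'French', ...] (a hypothetical ordering), a call would look like this:

print(batched_lang_rep(['French', 'English', 'English']))
# tensor([1, 0, 0]) -- one class index per name in the batch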
Recurrent Neural Network Model
In my previous article, we discussed how to implement an RNN by subclassing Pytorch's nn.Module. We will be using the same kind of RNN network to train on the batched inputs.
#create simple rnn network
class RNN_net(nn.Module):

    #Create a constructor
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN_net, self).__init__()
        self.hidden_size = hidden_size
        self.rnn_cell = nn.RNN(input_size, hidden_size)
        self.h2o = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim = 1)

    #create a forward pass function
    def forward(self, input_, hidden = None, batch_size = 1):
        out, hidden = self.rnn_cell(input_, hidden)
        output = self.h2o(hidden.view(-1, self.hidden_size))
        output = self.softmax(output)
        return output, hidden

    #function to init the hidden state
    def init_hidden(self, batch_size = 1):
        return torch.zeros(1, batch_size, self.hidden_size)
The __init__ function (the constructor) helps us initialize the parameters of the network, such as the weights and biases associated with the hidden layers. The __init__ function takes the input size (the size of the representation of one character), the hidden layer size, and the output size (which is equal to the number of languages we have).
The nn.Linear() function automatically defines the weights and biases for each layer, so we do not have to define them manually. Let's see what's going on inside the __init__ function.
Unlike the previous article, where we manually combined the input and hidden state using separate i2h and i2o linear layers, here the nn.RNN module takes care of the recurrence: at every time step it computes the new hidden representation from the current time-step input and the hidden representation of the previous time step. The h2o linear layer then maps the final hidden representation to one score per language.
Our forward function takes the batched (optionally packed) representation of the names and an optional initial hidden state. It runs the whole batch through the nn.RNN cell, reshapes the resulting hidden state, and passes it through the h2o and softmax layers to produce the log-probabilities over languages for every name in the batch.
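Since nn.RNN accepts either a plain 3D tensor or a packed sequence, we can sanity-check the shapes with a dummy (all-zero) padded batch; the sizes below are hypothetical, while n_letters and n_languages come from the data set.

net = RNN_net(input_size=n_letters, hidden_size=128, output_size=n_languages)

dummy_batch = torch.zeros(10, 4, n_letters)   # (max_seq_len, batch_size, n_letters)
output, hidden = net(dummy_batch)
print(output.shape)                           # torch.Size([4, n_languages]) -- one row of log-probabilities per name
print(hidden.shape)                           # torch.Size([1, 4, 128])      -- final hidden state for each name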
Batched Dataloader
In this section, we will implement the batched data loader, where we pass in a set of names and get back the padded and packed tensor representations required for training.
#create dataloader
def batched_dataloader(npoints, X_, y_, verbose=False, device = 'cpu'):
    names = []
    langs = []
    X_lengths = []

    #randomly sample npoints (name, nationality) pairs
    for i in range(npoints):
        index_ = np.random.randint(len(X_))
        name, lang = X_[index_], y_[index_]
        X_lengths.append(len(name))
        names.append(name)
        langs.append(lang)

    max_length = max(X_lengths)
    names_rep = batched_name_rep(names, max_length).to(device)
    langs_rep = batched_lang_rep(langs).to(device)
    #pack the padded one-hot representations
    padded_names_rep = torch.nn.utils.rnn.pack_padded_sequence(names_rep, X_lengths, enforce_sorted = False)

    if verbose:
        print(names_rep.shape, padded_names_rep.data.shape)
        print('--')
    if verbose:
        print(names)
        print_char(names_rep)
        print('--')
    if verbose:
        print_char(padded_names_rep.data)
        print('Lang Rep', langs_rep.data)
        print('Batch sizes', padded_names_rep.batch_sizes)

    return padded_names_rep.to(device), langs_rep
The batched_dataloader function takes three mandatory parameters as input:
- npoints – number of data points to sample for the batch
- X_ – training names (X data)
- y_ – training nationalities (y data)
In this function, we randomly sample data points based on the npoints specified by the user and then build the padded representations for these data points using batched_name_rep and batched_lang_rep. Since all character sequences in a batch must have the same length, padding is applied where needed.
The way we apply padding is as follows:
- Find the maximum input length across all the sequences in the batch (say, 10).
- Pad all shorter sequences (here, with all-zero rows in the one-hot tensor, conceptually a special <pad> token) so that they become of the same length (10, in this case).
Important points to note about padding:
- Padding is only done to ensure that the input sequences are of uniform size.
- The computations in the RNN are only performed up to the actual length of each sequence, i.e. padding is not considered as input to the network.
Once we have our padded representation, we pack the tensor containing the padded sequences of variable length using Pytorch's pack_padded_sequence function. The pack_padded_sequence function takes two mandatory inputs:
- names_rep – Padded representation of the names.
- X_lengths – List of sequence lengths of each batch element.
For unsorted sequences, use enforce_sorted=False. If enforce_sorted is True, the sequences must be sorted by length in decreasing order.
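To make the behaviour of pack_padded_sequence concrete, here is a tiny, self-contained illustration with toy sizes, independent of the name data:

import torch
from torch.nn.utils.rnn import pack_padded_sequence

lengths = [3, 5, 2]                    # actual (unsorted) lengths of three sequences
padded = torch.zeros(5, 3, 7)          # (max_len, batch, features) padded batch
packed = pack_padded_sequence(padded, lengths, enforce_sorted=False)

print(packed.data.shape)               # torch.Size([10, 7]) -- only 3 + 5 + 2 real time steps are kept
print(packed.batch_sizes)              # tensor([3, 3, 2, 1, 1]) -- active sequences at each time step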
Training Recurrent Neural Network
In this section, we will create a generic training setup that can also be used for other networks like LSTM and GRU. To train our network, we need to define a loss function and an optimization algorithm. In this case, we will use NLLLoss to calculate the loss of the network and the SGD optimizer with momentum to update the parameters.
We will also compare the time taken by the normal training loop and the batched training loop.
Before we start training our network, let’s define a custom function to calculate the accuracy of our network.
#create an evaluation function
def eval(net, n_points, topk, X_, y_, device = "cpu"):
    "Evaluation function"
    net = net.eval().to(device)
    data_ = dataloader(n_points, X_, y_)
    correct = 0

    #iterate through the sampled (name, language) pairs
    for name, language, name_ohe, lang_rep in data_:
        #get the output
        output = infer(net, name, device)
        #get the top k values
        val, indices = output.topk(topk)
        #move the indices to the right device
        indices = indices.to(device)
        if lang_rep in indices:
            correct += 1

    accuracy = correct/n_points
    return accuracy
The eval function takes the network instance, the number of data points, the value of k (for top-k accuracy), the test X, and the test y as input parameters. In this function,
- We load the data using the (non-batched) data loader.
- We iterate through all the person names present in the data loader.
- We invoke our model on each input and get the output.
- We take the top-k predicted classes and check whether the true class is among them.
- We count the number of correctly predicted classes and return the final accuracy.
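The eval function (and the basic train function later in this post) relies on the non-batched dataloader and infer helpers from the previous post, which are not repeated here. Below is a minimal sketch of what they could look like, assuming the same globals (all_letters, n_letters, languages); the exact versions are in the linked GitHub repository.

def name_rep(name):
    #one-hot representation of a single name: (seq_len, 1, n_letters)
    rep = torch.zeros(len(name), 1, n_letters)
    for index, letter in enumerate(name):
        rep[index][0][all_letters.find(letter)] = 1
    return rep

def dataloader(npoints, X_, y_):
    #randomly sample npoints (name, language, name_rep, lang_rep) tuples
    samples = []
    for _ in range(npoints):
        index_ = np.random.randint(len(X_))
        name, lang = X_[index_], y_[index_]
        lang_rep = torch.tensor([languages.index(lang)], dtype=torch.long)
        samples.append((name, lang, name_rep(name), lang_rep))
    return samples

def infer(net, name, device='cpu'):
    #run one (unbatched) name through the network in a single call
    name_ohe = name_rep(name).to(device)
    output, hidden = net(name_ohe)
    return output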
def train_setup(net, lr = 0.01, n_batches = 100, batch_size = 10, momentum = 0.9, display_freq=5, device = 'cpu'):
    net = net.to(device)
    criterion = nn.NLLLoss()
    opt = optim.SGD(net.parameters(), lr=lr, momentum=momentum)
    loss_arr = np.zeros(n_batches + 1)

    for i in range(n_batches):
        #running average of the batch losses
        loss_arr[i+1] = (loss_arr[i]*i + train_batch(net, opt, criterion, batch_size, device))/(i + 1)

        if i % display_freq == display_freq-1:
            clear_output(wait=True)
            print('Iteration', i, 'Loss', loss_arr[i])
            # print('Top-1:', eval(net, len(X_test), 1, X_test, y_test), 'Top-2:', eval(net, len(X_test), 2, X_test, y_test))
            plt.figure()
            plt.plot(loss_arr[1:i], '-*')
            plt.xlabel('Iteration')
            plt.ylabel('Loss')
            plt.show()
            print('\n\n')

    print('Top-1 Accuracy:', eval(net, len(X_test), 1, X_test, y_test, device),
          'Top-2 Accuracy:', eval(net, len(X_test), 2, X_test, y_test, device))
In our training loop,
- For each batch, we sample inputs through the batched data loader.
- Get the padded and packed representation of the inputs.
- Reset any previous gradients present in the optimizer before computing the gradients for the next batch.
- Execute the forward pass and get the output.
- Compute the loss based on the predicted output and the actual output.
- Backpropagate the gradients and update the parameters with the optimizer.
- Every few iterations (controlled by display_freq), we print the progress and plot the running loss.
Hyperparameters used in the training process are as follows:
- Learning rate: 0.15
- Loss function: Negative Log-Likelihood Loss
- Optimizer: Stochastic Gradient Descent with Momentum
- Number of batches: 5000
- Batch size: 512
- Hidden layer size: 128
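Putting this together, the training call for the RNN model with the hyperparameters above might look like the snippet below; display_freq is illustrative, and device_gpu is assumed to be a CUDA device created earlier, e.g. torch.device('cuda').

n_hidden = 128
net = RNN_net(n_letters, n_hidden, n_languages)

train_setup(net, lr=0.15, n_batches=5000, batch_size=512, display_freq=500, device=device_gpu)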
Visualization of Loss Plot
We can plot the loss of the network against each iteration to check the model performance.

After training the model for 5000 batches, we are able to achieve a top-1 accuracy of 73% and a top-2 accuracy of 85% with the RNN Model.
Comparison of Normal Training and Batched Training
To compare the training speeds of normal mode of training and batched training, we need to define two training setups.
Basic training setup
#basic train function
def train(net, opt, criterion, n_points):
    opt.zero_grad()
    total_loss = 0
    data_ = dataloader(n_points, X_train, y_train)

    for name, language, name_ohe, lang_rep in data_:
        hidden = net.init_hidden()
        #process the name one character at a time
        for i in range(name_ohe.size()[0]):
            output, hidden = net(name_ohe[i:i+1], hidden)
        loss = criterion(output, lang_rep)
        loss.backward(retain_graph=True)
        total_loss += loss

    opt.step()
    return total_loss/n_points
Batched training setup
def train_batch(net, opt, criterion, n_points, device = 'cpu'):
    net.train().to(device)
    opt.zero_grad()
    #encode and pack the whole batch in one go
    batch_input, batch_groundtruth = batched_dataloader(n_points, X_train, y_train, False, device)
    output, hidden = net(batch_input)
    loss = criterion(output, batch_groundtruth)
    loss.backward()
    opt.step()
    return loss
The only difference between the two training setups is in how we obtain the inputs from the data loader. In normal training, the data loader encodes the inputs one after the other (sequentially). In the batched data loader, we encode multiple inputs in parallel, so we get a performance boost from vectorization.
We use the same training parameters for both setups:
- Learning rate: 0.01
- Momentum = 0.9
- Loss function: Negative Log-Likelihood Loss
- Optimizer: Stochastic Gradient Descent with Momentum
- Hidden layer size: 128
- Number of data points in one batch: 256
We will use the Python magic function %time to time the execution of each training setup.
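For example, in a Jupyter notebook cell the comparison could be run as sketched below; exact timings depend on hardware, and both calls use the CPU defaults here.

net = RNN_net(n_letters, n_hidden, n_languages)
criterion = nn.NLLLoss()
opt = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)

%time train(net, opt, criterion, 256)          # sequential: encodes and feeds one character at a time
%time train_batch(net, opt, criterion, 256)    # batched: encodes and processes the whole batch at once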

As you can see from the above comparison, training with the batched setup is around 42 times faster than the normal training setup. By implementing batching, we exploit data parallelism to improve the performance of the network.
Long Short Term Memory – LSTM Model with Batching
In this section, we will discuss how to implement and train an LSTM model with batching for classifying the nationality of a person's name. We will make use of Pytorch's nn.Module and nn.LSTM to create a custom class called LSTM_net.
class LSTM_net(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(LSTM_net, self).__init__()
        self.hidden_size = hidden_size
        self.lstm_cell = nn.LSTM(input_size, hidden_size)
        self.h2o = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden = None):
        out, hidden = self.lstm_cell(input, hidden)
        output = self.h2o(hidden[0].view(-1, self.hidden_size))
        output = self.softmax(output)
        return output, hidden

    def init_hidden(self, batch_size = 1):
        #hidden state and cell state
        return (torch.zeros(1, batch_size, self.hidden_size),
                torch.zeros(1, batch_size, self.hidden_size))
The LSTM network is the same as the one used in the previous article; the only difference is in how we pass the input representation to the network. From an implementation standpoint, the only change in the __init__ function is that we are using nn.LSTM. The nn.LSTM module handles all the necessary computations, including the computation of the hidden state itself.
init_hidden initializes two tensors of zeros: one for the hidden state and one for the cell state. The forward function takes the encoded characters and their hidden representation as parameters, similar to the RNN. Pytorch's LSTM expects its input to be a 3D tensor (or a packed sequence); inside forward, we reshape the final hidden state with the view function before passing it to the h2o layer.
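A quick shape check with a dummy padded batch (hypothetical sizes) shows how the tuple hidden state comes back from the LSTM:

net = LSTM_net(n_letters, 128, n_languages)
dummy_batch = torch.zeros(10, 4, n_letters)   # (max_seq_len, batch_size, n_letters)
output, (h_n, c_n) = net(dummy_batch)
print(output.shape)                           # torch.Size([4, n_languages]) -- log-probabilities per name
print(h_n.shape, c_n.shape)                   # torch.Size([1, 4, 128]) each -- hidden state and cell state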
Training setup for LSTM
n_hidden = 128
net = LSTM_net(n_letters, n_hidden, n_languages)

train_setup(net, lr=0.15, n_batches=8000, batch_size = 512, display_freq=1000, device = device_gpu)
The loss plot for the LSTM network would look like this,

After training the model for 8000 batches, we are able to achieve a top-1 accuracy of 79% and a top-2 accuracy of 89% with the LSTM Model.
There you have it, we have successfully built our nationality classification model using Pytorch with Batching. The entire code discussed in the article is present in this GitHub repository. Feel free to fork it or download it.
Recommended Reading
- Getting Started With Pytorch In Google Collab With Free GPU
- Classifying the Name Nationality of a Person using LSTM and Pytorch
Conclusion
In this post, we discussed the need for batching in Pytorch and the advantages of batching. After that, we discussed how to encode the names and nationalities before training the model. Finally, we saw the implementations of the RNN and LSTM models used for training on the data. If you have any issues or doubts while implementing the above code, feel free to ask them in the comment section below or send me a message on LinkedIn citing this article.
Connect with Me
- LinkedIn – https://www.linkedin.com/in/niranjankumar-c/
- GitHub – https://github.com/Niranjankumar-c
- Twitter – https://twitter.com/Nkumar_n
- Medium – https://medium.com/@niranjankumarc
Note: This is a guest post, and the opinion in this article is of the guest writer. If you have any issues with any of the articles posted at www.marktechpost.com please contact at asif@marktechpost.com
Niranjan Kumar is working as a Senior Consultant Data Science at Allstate India. He is passionate about Deep Learning and Artificial Intelligence. He writes about the latest tools and technologies in the field of Deep Learning. He is one of the top writers in Artificial Intelligence at Medium. A Graduate of Praxis Business School, Niranjan Kumar holds a degree in Data Science. Feel free to contact him via LinkedIn for collaboration on projects