Classifying the Name Nationality of a Person using LSTM and PyTorch

Personal names vary from country to country, and even within a country. Typically, a person’s name can be broken into two parts: the first name is the name given at birth, and the last name (surname) is the name of the family into which the child is born. However, a large majority of people from Tamil Nadu do not have a surname.

In the Chinese name Mao Ze Dong, the family name is Mao, i.e. the first name when reading left to right. The given name is Dong, and the middle character, Ze, is a generational name. Because of these inconsistencies, or rather the lack of naming standards, handling names programmatically is messy, and even sophisticated programs are often not built to cope with this variation.

In this tutorial, we will build a Recurrent Neural Network model that classifies the nationality of a name from its character-level encodings.

Recurrent Neural Network

In Feed-forward Neural Networks (FNN), the output for one data point is completely independent of the previous inputs. For example, when predicting health risk, the prediction for the second person does not depend on the prediction for the first person. Similarly, in Convolutional Neural Networks (CNN) used for image classification, the output of the softmax layer is entirely independent of the previous input image.

Recurrent Neural Networks (RNN) are a type of neural network in which the output from the previous step is fed as input to the current step. Read more about RNN here.

Run this notebook in Colab

All the code discussed in the article is present on my GitHub. You do not need any local setup: you can open my Jupyter Notebook from GitHub directly in Colab, which runs on Google’s virtual machines. It’s recommended that you click here to quickly open the notebook and follow along with this tutorial. To learn more about how to work with PyTorch tensors in Colab, read my blog post.

Import Libraries

Before we start building our network, first we need to import the required libraries.

#import packages
from io import open
import os, string, random, time, math
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.model_selection import train_test_split
import torch 
import torch.nn as nn
import torch.optim as optim

#clearing output
from IPython.display import clear_output 


The dataset is a text file in which each line contains a person’s name and the nationality of that name, separated by a comma.

Since the input to the model, the name of the person, varies in length, we have to use a sequence model instead of a Feed-forward Neural Network. To load the dataset, we iterate through each row of the data and create a list of tuples containing the name and nationality, so that we can easily feed it into our sequence model.

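languages = [] #unique nationalities, in order of first appearance
data = []      #(name, nationality) tuples
X = []         #names
y = []         #nationalities

with open("name2lang.txt", 'r') as f:
    #read the dataset
    for line in f:
        line = line.split(",")
        name = line[0].strip()
        lang = line[1].strip()
        if lang not in languages:
            languages.append(lang)
        X.append(name)
        y.append(lang)
        data.append((name, lang))

n_languages = len(languages)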

The dataset contains more than 20k names and 18 unique nationalities, such as Portuguese, Irish, and Spanish.
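To make the parsing step concrete without the real file, the sketch below runs the same kind of loop over a few rows held in memory. The sample rows and the parse_rows helper are hypothetical and are not taken from name2lang.txt; they only mirror the "name, nationality" format described above.

```python
import io

# Hypothetical rows in the "name, nationality" format (not copied from name2lang.txt)
SAMPLE = """Silva, Portuguese
Murphy, Irish
Garcia, Spanish
Costa, Portuguese
"""

def parse_rows(handle):
    """Return (pairs, labels): (name, nationality) tuples and unique labels in first-seen order."""
    pairs, labels = [], []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, lang = (part.strip() for part in line.split(",", 1))
        if lang not in labels:
            labels.append(lang)
        pairs.append((name, lang))
    return pairs, labels

pairs, labels = parse_rows(io.StringIO(SAMPLE))
print(len(pairs), "names,", len(labels), "nationalities:", labels)
```

Keeping the labels in a list, rather than a set, matters later: the position of each nationality in that list becomes its class index.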

Split Data

Since the dataset is reasonably large, we will split it into training and testing sets in a 70:30 ratio. Because the dataset is imbalanced, we use stratified sampling, which keeps the proportion of each nationality the same in both sets.

#split the data 70 30
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 123, stratify = y)
print("Training Data: ", len(X_train))
print("Testing Data: ", len(X_test))

#Training Data:  14035
#Testing Data:  6015
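To see what stratify = y buys us, here is a small framework-free sketch of a stratified split. It illustrates the idea only and is not scikit-learn’s implementation; the stratified_split helper and the toy labels are assumptions made for this example.

```python
import random
from collections import Counter, defaultdict

def stratified_split(X, y, test_size=0.3, seed=123):
    """Split so that each class sends roughly test_size of its items to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(len(indices) * test_size)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    return ([X[i] for i in train_idx], [X[i] for i in test_idx],
            [y[i] for i in train_idx], [y[i] for i in test_idx])

# Toy imbalanced data: 90 names of class "A" and 10 of class "B"
y = ["A"] * 90 + ["B"] * 10
X = [f"name{i}" for i in range(100)]

X_tr, X_te, y_tr, y_te = stratified_split(X, y)
print(Counter(y_tr), Counter(y_te))
```

With these toy labels the 9:1 class ratio is preserved in both parts (63:7 in training, 27:3 in testing), whereas a purely random split could leave the rare class badly under-represented in the test set.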

Encoding Names and Nationalities

The sequence model we will build takes character encodings as input rather than raw text, so we have to encode each name at the character level. Once we have an encoding for every character, we stack these character-level encodings to get the encoding for the whole name. Each nationality label is encoded separately as a single class index.

#get all the letters
all_letters = string.ascii_letters + ".,;"
n_letters = len(all_letters)

print("Number of letters: ", n_letters)

Encoding Names

To encode names, we first collect all the ASCII letters, plus a few punctuation marks, into a single string. This gives us the list of all characters that we expect to appear in a person’s name. We iterate through each character in the name and find the index of that character in this list. Using that index, we create a one-hot vector for the character, and we repeat this process for all the characters to get the final encoding.

def name_rep(name):
    rep = torch.zeros(len(name), 1, n_letters)
    for index, letter in enumerate(name):
        pos = all_letters.find(letter)
        rep[index][0][pos] = 1
    return rep

The name_rep function above creates a one-hot encoding for a name. First, we declare a tensor of zeros of shape (length of the name, 1, total number of characters in our list). After that, we iterate through each character, find the index of that letter, and set the value at that index position to 1, leaving the remaining values at 0.

Sample encoding would look like this.
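As a concrete sample, the sketch below repeats the name_rep definition from above and encodes a very short name. The name "Ab" is an arbitrary choice made only for illustration.

```python
import string
import torch

all_letters = string.ascii_letters + ".,;"
n_letters = len(all_letters)

def name_rep(name):
    # one row per character; each row is a one-hot vector over all_letters
    rep = torch.zeros(len(name), 1, n_letters)
    for index, letter in enumerate(name):
        pos = all_letters.find(letter)
        rep[index][0][pos] = 1
    return rep

sample = name_rep("Ab")
print(sample.shape)           # (number of characters, 1, n_letters)
print(sample.argmax(dim=2))   # position of the 1 in each one-hot row
```

Since string.ascii_letters lists the lowercase letters first, "A" lands at index 26 and "b" at index 1. One caveat: str.find returns -1 for a character that is not in all_letters, so such a character would silently set the last position of its vector.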

Encoding Nationalities

The logic for encoding nationalities is much simpler than for encoding names. To encode a nationality, we find the index of that nationality in our list of nationalities and use that index as its encoding.

#function to create lang representation

def nat_rep(lang):
    return torch.tensor([languages.index(lang)], dtype = torch.long)

Recurrent Neural Network Model

In this section, we will discuss how to build an RNN model using PyTorch’s nn.Module. We will write a class RNN_net for our model, which will subclass nn.Module.

#define a basic rnn network

class RNN_net(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN_net, self).__init__()
        #declare the hidden size for the network
        self.hidden_size = hidden_size
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size) #input to hidden layer
        self.i2o = nn.Linear(input_size + hidden_size, output_size) #input to output layer
        self.softmax = nn.LogSoftmax(dim = 1) #softmax for classification 
    def forward(self, input_, hidden):
        combined = torch.cat((input_, hidden), 1) #concatenate input and hidden tensors column-wise
        hidden = self.i2h(combined) #generate hidden representation
        output = self.i2o(combined) #generate output representation
        output = self.softmax(output) #get the softmax label
        return output, hidden
    def init_hidden(self):
        return torch.zeros(1, self.hidden_size)

torch.nn.Linear(in_features, out_features) takes two mandatory parameters:

  • in_features — The size of each input sample
  • out_features — The size of each output sample

The __init__ function (the constructor) initializes the parameters of the network, such as the weights and biases associated with the hidden layers. It takes the input size (the size of the representation of one character), the hidden layer size, and the output size (which is equal to the number of languages we have). nn.Linear() automatically defines the weights and biases for each layer, so we do not have to define them manually.

Let’s see what’s going on inside the __init__ function:

The i2h layer computes the hidden representation at the current time step from the combination of the current input and the hidden representation of the previous time step. The i2o layer computes the output at the current time step from the same combination.

The forward function takes the encoded representation of a character and the hidden representation from the previous time step as inputs. It first concatenates the two, then uses the result to compute the new hidden representation with i2h and the output label scores with i2o followed by the softmax layer.
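Written as equations, one time step of the RNN_net cell defined above computes the following, where x_t is the one-hot vector of the current character, h_{t-1} is the previous hidden representation, and [x_t ; h_{t-1}] is the concatenation produced by torch.cat. The symbol names are notation chosen here for illustration; they do not appear in the code.

```latex
\begin{aligned}
c_t &= [\, x_t \,;\, h_{t-1} \,] \\
h_t &= W_{i2h}\, c_t + b_{i2h} \\
o_t &= \log \operatorname{softmax}\!\left( W_{i2o}\, c_t + b_{i2o} \right)
\end{aligned}
```

Note that, as the class is written, the hidden update is purely linear: no tanh or other nonlinearity is applied to h_t, unlike in the textbook Elman RNN.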

Inference on Recurrent Neural Network Model

Before we start training our network, we will use the untrained model to make inferences on the data, so that we can be sure that our network architecture works as expected.

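#function to make inference
def infer(net, name):
    net.eval() #set the network to evaluation mode
    name_ohe = name_rep(name) #one-hot representation of the name
    hidden = net.init_hidden() #initial hidden representation
    #feed one character at a time, passing the hidden state forward
    for i in range(name_ohe.size()[0]):
        output, hidden = net(name_ohe[i], hidden)
    return output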

#declare the size of the hidden layer representation
n_hidden = 128

#create a object of the class
net = RNN_net(n_letters, n_hidden, n_languages)

#before training the network, make a inference to test the network
output = infer(net, "Adam")
index = torch.argmax(output)
print(output, index)

The infer function takes the network instance and person name as the input parameters. In this function:

– Set the network to evaluation mode.
– Compute the one-hot representation of the input name.
– Create the initial hidden representation based on the hidden size.
– Iterate through all the characters, feeding the computed hidden representation back into the network at each step.
– Finally, return the output for the last character, which holds the scores for each nationality.
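The output returned by infer is a row of log-probabilities, one per nationality, so turning it into readable labels only needs topk and a lookup into languages. A minimal sketch, assuming a hypothetical three-label list and a hand-written output row in place of a real network:

```python
import math
import torch

languages = ["Portuguese", "Irish", "Spanish"]   # hypothetical label list

# Stand-in for infer(net, name): log-probabilities with shape (1, n_languages)
output = torch.log(torch.tensor([[0.2, 0.7, 0.1]]))

def top_k_labels(output, k):
    """Return the k most likely (label, probability) pairs, best first."""
    log_probs, indices = output.topk(k)          # each has shape (1, k)
    return [(languages[i], math.exp(lp))
            for lp, i in zip(log_probs[0].tolist(), indices[0].tolist())]

for label, prob in top_k_labels(output, 2):
    print(f"{label}: {prob:.2f}")
```

On an untrained network the probabilities are close to uniform, so the predicted label at this stage only confirms that the shapes line up, not that the model has learned anything.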

Training Recurrent Neural Network

In this section, we will create a generic training setup that can also be used for other networks such as LSTM and GRU. To train our network, we need to define the loss function and the optimization algorithm. In this case, we will use NLLLoss to calculate the loss of the network and the SGD optimizer with momentum to minimize it.

Before we start training our network, let’s define a custom function to calculate the accuracy of our network.

#create a function to evaluate model

def eval(net, n_points, k, X_, y_):
    data_ = dataloader(n_points, X_, y_)
    correct = 0

    for name, language, name_ohe, lang_rep in data_:
        output = infer(net, name) #prediction
        val, indices = output.topk(k) #get the top k predictions
        if lang_rep in indices:
            correct += 1
    accuracy = correct/n_points
    return accuracy

The evaluation function takes the network instance, the number of data points, k, the test names, and the test labels as input parameters. In this function, we:

  • Load the data using the data loader.
  • Iterate through all the names returned by the data loader.
  • Invoke our model on each input and get the output.
  • Compute the top k predicted classes.
  • Count the predictions where the true class is among the top k, and return the resulting fraction as the accuracy.
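The dataloader helper that eval relies on is not shown in this article. Below is a minimal sketch of what such a helper could look like, assuming it samples n_points random examples and returns (name, language, encoded name, encoded label) tuples, which is the shape that eval unpacks. The toy names and the three-label list are hypothetical.

```python
import random
import string
import torch

all_letters = string.ascii_letters + ".,;"
n_letters = len(all_letters)
languages = ["Portuguese", "Irish", "Spanish"]   # hypothetical label list

def name_rep(name):
    rep = torch.zeros(len(name), 1, n_letters)
    for index, letter in enumerate(name):
        rep[index][0][all_letters.find(letter)] = 1
    return rep

def nat_rep(lang):
    return torch.tensor([languages.index(lang)], dtype=torch.long)

def dataloader(n_points, X_, y_):
    """Sample n_points examples (with replacement) and encode them."""
    samples = []
    for _ in range(n_points):
        idx = random.randrange(len(X_))           # pick a random example
        name, lang = X_[idx], y_[idx]
        samples.append((name, lang, name_rep(name), nat_rep(lang)))
    return samples

# Hypothetical toy data
X_toy = ["Silva", "Murphy", "Garcia"]
y_toy = ["Portuguese", "Irish", "Spanish"]

batch = dataloader(4, X_toy, y_toy)
name, lang, name_ohe, lang_rep = batch[0]
print(name, lang, tuple(name_ohe.shape), lang_rep)
```

Because this sketch samples with replacement, calling it with n_points equal to the size of the test set does not visit every test name exactly once, so the accuracy it yields is an estimate rather than an exact figure.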

We will write a simple train_setup function to train our network.

def train_setup(net, lr = 0.01, n_batches = 100, batch_size = 10, momentum = 0.9, display_freq = 5):

    criterion = nn.NLLLoss() #define a loss function
    opt = optim.SGD(net.parameters(), lr = lr, momentum = momentum) #define a optimizer
    loss_arr = np.zeros(n_batches + 1)
    #iterate through all the batches
    for i in range(n_batches):
        loss_arr[i + 1] = (loss_arr[i]*i + train(net, opt, criterion, batch_size))/(i + 1)

        if i%display_freq == display_freq - 1:
            clear_output(wait = True)
            print("Iteration number ", i + 1, "Top-1 Accuracy:", round(eval(net, len(X_test), 1, X_test, y_test), 4), "Top-2 Accuracy:", round(eval(net, len(X_test), 2, X_test, y_test), 4), "Loss:", round(loss_arr[i + 1], 4))
            plt.plot(loss_arr[1:i], "-*")

#declare all the parameters
n_hidden = 128
net = RNN_net(n_letters, n_hidden, n_languages)
train_setup(net, lr = 0.0005, n_batches = 100, batch_size = 256)

In our training loop, we:

  • Sample a batch of data points using the data loader.
  • Get the input data and labels.
  • Reset any previous gradients held by the optimizer before computing the gradients for the next batch.
  • Execute the forward pass and get the output.
  • Compute the loss based on the predicted output and the actual label.
  • Backpropagate the gradients.
  • Print a progress message every display_freq batches.
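The train function that train_setup calls is not shown in this article. The sketch below is one possible version that follows the steps listed above, assuming that it draws random names from X_train and y_train, accumulates gradients over the batch, and performs a single optimizer step. The toy data is hypothetical, and the model and encoding helpers are repeated only to keep the example self-contained.

```python
import random
import string
import torch
import torch.nn as nn
import torch.optim as optim

all_letters = string.ascii_letters + ".,;"
n_letters = len(all_letters)

# Hypothetical toy data standing in for X_train and y_train
X_train = ["Silva", "Murphy", "Garcia", "Costa"]
y_train = ["Portuguese", "Irish", "Spanish", "Portuguese"]
languages = sorted(set(y_train))

def name_rep(name):
    rep = torch.zeros(len(name), 1, n_letters)
    for index, letter in enumerate(name):
        rep[index][0][all_letters.find(letter)] = 1
    return rep

def nat_rep(lang):
    return torch.tensor([languages.index(lang)], dtype=torch.long)

class RNN_net(nn.Module):
    # same cell as the RNN_net class defined earlier in the article
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input_, hidden):
        combined = torch.cat((input_, hidden), 1)
        return self.softmax(self.i2o(combined)), self.i2h(combined)

    def init_hidden(self):
        return torch.zeros(1, self.hidden_size)

def train(net, opt, criterion, n_points):
    """Run one optimizer step on n_points sampled names; return the mean loss."""
    net.train()
    opt.zero_grad()                          # reset gradients from the previous batch
    total_loss = 0.0
    for _ in range(n_points):
        idx = random.randrange(len(X_train))
        name_ohe, lang_rep = name_rep(X_train[idx]), nat_rep(y_train[idx])
        hidden = net.init_hidden()
        for i in range(name_ohe.size(0)):    # forward pass, one character at a time
            output, hidden = net(name_ohe[i], hidden)
        loss = criterion(output, lang_rep) / n_points
        loss.backward()                      # gradients accumulate across the batch
        total_loss += loss.item()
    opt.step()                               # one parameter update per batch
    return total_loss

random.seed(0)
torch.manual_seed(0)
net = RNN_net(n_letters, 16, len(languages))
opt = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)
print(train(net, opt, nn.NLLLoss(), n_points=8))
```

Returning a plain float, rather than a tensor, keeps the running-average update in train_setup free of autograd history.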

Hyperparameters used in the training process are as follows:

  • Learning rate: 0.0005
  • Loss function: Negative Log-Likelihood Loss
  • Optimizer: Stochastic Gradient Descent with Momentum
  • Number of batches: 100
  • Batch size: 256

Visualization of Loss Plot

We can plot the loss of the network against each iteration to check the model performance.

Loss Plot for RNN Model

After training the model for 100 batches, we are able to achieve a top-1 accuracy of 68% and a top-2 accuracy of 79% with the RNN Model.
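A note on reading the loss plot: the values stored in loss_arr by train_setup are running means rather than raw per-batch losses. The update loss_arr[i + 1] = (loss_arr[i]*i + new_loss)/(i + 1) folds each new batch loss into the average of all earlier ones, which is why the curve looks smooth. A small check of that identity with made-up loss values:

```python
from statistics import mean

batch_losses = [2.9, 2.5, 2.7, 2.1, 1.8]   # hypothetical per-batch losses

running = [0.0] * (len(batch_losses) + 1)
for i, new_loss in enumerate(batch_losses):
    # same update rule as in train_setup
    running[i + 1] = (running[i] * i + new_loss) / (i + 1)

# each entry equals the plain mean of all losses seen so far
for i in range(1, len(running)):
    assert abs(running[i] - mean(batch_losses[:i])) < 1e-9

print([round(value, 3) for value in running[1:]])
```

A consequence is that late improvements show up slowly in the plot, because each new batch has less and less weight in the average.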

Long Short Term Memory – LSTM Model

In this section, we will discuss how to implement the LSTM model for classifying the nationality of a person’s name. We will use PyTorch’s nn.Module and nn.LSTM to create a custom class called LSTM_net.

#LSTM class
class LSTM_net(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(LSTM_net, self).__init__()
        self.hidden_size = hidden_size
        self.lstm_cell = nn.LSTM(input_size, hidden_size) #LSTM cell
        self.h2o = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim = 2)

    def forward(self, input_, hidden):
        out, hidden = self.lstm_cell(input_.view(1, 1, -1), hidden)
        output = self.h2o(hidden[0])
        output = self.softmax(output)
        return output.view(1, -1), hidden

    def init_hidden(self):
        return (torch.zeros(1, 1, self.hidden_size), torch.zeros(1, 1, self.hidden_size))

From the implementation standpoint, the main change in the __init__ function is that we are using the nn.LSTM module, followed by a linear layer h2o that maps the hidden state to the output. nn.LSTM handles all the necessary computations, including the computation of the hidden state itself.

init_hidden initializes two tensors of zeros: one represents the hidden state and the other represents the cell state. As with the RNN, the forward function takes an encoded character and the hidden representation from the previous step as parameters. PyTorch’s nn.LSTM expects its inputs to be 3D tensors, which is why we reshape the input using the view function.
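The shapes involved are easy to check in isolation. The sketch below feeds a single encoded character through nn.LSTM using the sizes from this article (55 characters, hidden size 128); the variable names are chosen here for illustration.

```python
import torch
import torch.nn as nn

n_letters, n_hidden = 55, 128            # sizes used in this article
lstm = nn.LSTM(n_letters, n_hidden)

char = torch.zeros(1, n_letters)         # one encoded character, shape (1, n_letters)
char[0][0] = 1
hidden = (torch.zeros(1, 1, n_hidden),   # hidden state
          torch.zeros(1, 1, n_hidden))   # cell state

# nn.LSTM expects (seq_len, batch, input_size) when batch_first is False (the default)
out, (h_n, c_n) = lstm(char.view(1, 1, -1), hidden)
print(out.shape, h_n.shape, c_n.shape)
```

All three tensors have shape (1, 1, 128). For a single time step of a single-layer LSTM, out and h_n hold the same values, which is why LSTM_net can read the hidden state hidden[0] directly when computing the output.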

To train the LSTM network, we will reuse our train_setup function.

#create hyperparameters
n_hidden = 128
net = LSTM_net(n_letters, n_hidden, n_languages)
train_setup(net, lr = 0.0005, n_batches = 100, batch_size = 256)

The loss plot for the LSTM network would look like this,

LSTM Loss Plot

There you have it: we have successfully built our nationality classification model using PyTorch. The entire code discussed in the article is present in this GitHub repository. Feel free to fork it or download it.

Where to go from here?

In this article, we have discussed the RNN Model and LSTM Model but if you want to improve the performance of the network you can try out:

  • Implementing a Gated Recurrent Unit (GRU) model (Bonus: I have already implemented GRU in my Git repo).
  • Playing with the hyper-parameters of the LSTM and GRU models.
  • Increasing the performance by moving the training to GPU.
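As a starting point for the first and last suggestions, here is a hedged sketch of a GRU variant that mirrors LSTM_net, together with the usual way of placing a model on a GPU when one is available. It is not the version from the repository mentioned above; the class name GRU_net and the layer names are assumptions made for this example.

```python
import torch
import torch.nn as nn

class GRU_net(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.gru_cell = nn.GRU(input_size, hidden_size)
        self.h2o = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=2)

    def forward(self, input_, hidden):
        # nn.GRU also expects a 3D input: (seq_len, batch, input_size)
        out, hidden = self.gru_cell(input_.view(1, 1, -1), hidden)
        output = self.softmax(self.h2o(hidden))
        return output.view(1, -1), hidden

    def init_hidden(self):
        # a GRU keeps a single hidden state and has no separate cell state
        return torch.zeros(1, 1, self.hidden_size)

# Use the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
net = GRU_net(55, 128, 18).to(device)

x = torch.zeros(1, 55, device=device)    # one encoded character
x[0][0] = 1
output, hidden = net(x, net.init_hidden().to(device))
print(output.shape, hidden.shape)
```

When training on a GPU, the encoded names, the labels, and the initial hidden state all have to be moved to the same device as the model, so the helper functions in this article would need a matching change.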


In this post, we discussed the need to classify the nationality of a person based on their name. We then saw how to load our custom dataset into a format suitable for training our model. After that, we discussed how to encode the names and nationalities before training the model. Finally, we walked through the implementations of the RNN and LSTM models trained on the data. If you have any issues or doubts while implementing the above code, feel free to ask them in the comment section below or send me a message on LinkedIn citing this article.

Note: This is a guest post, and the opinions in this article are those of the guest writer.

Niranjan Kumar is working as a Senior Consultant Data Science at Allstate India. He is passionate about Deep Learning and Artificial Intelligence. He writes about the latest tools and technologies in the field of Deep Learning. He is one of the top writers in Artificial Intelligence at Medium. A Graduate of Praxis Business School, Niranjan Kumar holds a degree in Data Science. Feel free to contact him via LinkedIn for collaboration on projects