Building a Feedforward Neural Network using PyTorch NN Module


Feedforward neural networks are also known as Multi-layered Networks of Neurons (MLN). These models are called feedforward because information only travels forward through the network: from the input nodes, through the hidden layers (one or many), and finally to the output nodes.

Source: PadhAI

Traditional models such as the McCulloch-Pitts neuron, the perceptron, and the sigmoid neuron are limited in capacity to linear functions. To handle a complex non-linear decision boundary between the input and the output, we use a Multi-layered Network of Neurons.



In this post, we will discuss how to build a feedforward neural network using PyTorch. We will do this incrementally using the PyTorch torch.nn module. First, we will generate non-linearly separable data with four classes. Then we will build our simple feedforward neural network using PyTorch tensor functionality. After that, we will use the abstractions available in torch.nn and related modules, such as Functional, Sequential, Linear and Optim, to make our neural network concise, flexible and efficient. Finally, we will move our network to CUDA and see how fast it performs.

Note: This tutorial assumes you already have PyTorch installed on your local machine or know how to use PyTorch in Google Colab with CUDA support, and are familiar with the basics of tensor operations. If you are not familiar with these concepts, kindly refer to my previous post linked below.

The rest of the article is structured as follows:

  • Import libraries
  • Generate non-linearly separable data
  • Feedforward network using tensors and auto-grad
  • Train our feedforward network
  • NN.Functional
  • NN.Parameter
  • NN.Linear and Optim
  • NN.Sequential
  • Moving the Network to GPU

If you want to skip the theory part and get into the code right away, Click here

Import libraries

Before we start building our network, we need to import the required libraries. We import numpy for matrix multiplication and dot products between vectors, matplotlib to visualize the data, and functions from the sklearn package to generate data and evaluate the network's performance. We import torch for everything related to PyTorch.

#required libraries
import numpy as np
import math
import matplotlib.pyplot as plt
import matplotlib.colors
import time
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error, log_loss
from tqdm import tqdm_notebook 

from IPython.display import HTML
import warnings
from sklearn.preprocessing import OneHotEncoder
from sklearn.datasets import make_blobs

import torch

Generate non-linearly separable data

In this section, we will see how to randomly generate non-linearly separable data using sklearn.

#generate data using make_blobs function from sklearn.
#centers = 4 indicates different types of classes
data, labels = make_blobs(n_samples=1000, centers=4, n_features=2, random_state=0)
print(data.shape, labels.shape)

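#visualize the data
#my_cmap is not defined in the text; this red-yellow-green colormap is an assumed choice
my_cmap = matplotlib.colors.LinearSegmentedColormap.from_list("", ["red", "yellow", "green"])
plt.scatter(data[:,0], data[:,1], c=labels, cmap=my_cmap)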

#splitting the data into train and test
X_train, X_val, Y_train, Y_val = train_test_split(data, labels, stratify=labels, random_state=0)
print(X_train.shape, X_val.shape, labels.shape)

To generate the data we use make_blobs, which generates blobs of points with a Gaussian distribution. I have generated 1000 data points in 2D space around four blob centers (centers=4), giving a multi-class classification problem. Each data point has two inputs and a class label of 0, 1, 2 or 3.

Visualize data using matplotlib

Once the data is ready, I use the train_test_split function to split it into training and validation sets in the ratio 75:25.

Feedforward network using tensors and auto-grad

In this section, we will see how to build and train a simple neural network using PyTorch tensors and autograd. The network has six neurons in total: two in the first hidden layer and four in the output layer. For each of these neurons, pre-activation is represented by ‘a’ and post-activation is represented by ‘h’. In the network, we have a total of 18 parameters: 12 weights and 6 bias terms.

We will use the map function to convert the numpy arrays to PyTorch tensors.
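The parameter count can be checked with a few lines of arithmetic. This is only a minimal sketch; layer_sizes and count_parameters are illustrative names that do not appear in the original code.

```python
# Layer widths: 2 inputs -> 2 hidden neurons -> 4 output neurons
layer_sizes = [2, 2, 4]

def count_parameters(sizes):
    """Return (weights, biases) for a fully connected network."""
    # One weight per (input, output) pair between consecutive layers
    weights = sum(n_in * n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))
    # One bias per neuron in every layer after the input
    biases = sum(sizes[1:])
    return weights, biases

n_weights, n_biases = count_parameters(layer_sizes)
print(n_weights, n_biases, n_weights + n_biases)  # 12 6 18
```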

#converting the numpy array to torch tensors
X_train, Y_train, X_val, Y_val = map(torch.tensor, (X_train, Y_train, X_val, Y_val))
print(X_train.shape, Y_train.shape)

After converting the data to tensors, we need to write a function that helps us to compute the forward pass for the network.

#function for computing forward pass in the network
def model(x):
    A1 = torch.matmul(x, weights1) + bias1 # (N, 2) x (2, 2) -> (N, 2)
    H1 = A1.sigmoid() # (N, 2)
    A2 = torch.matmul(H1, weights2) + bias2 # (N, 2) x (2, 4) -> (N, 4)
    H2 = A2.exp()/A2.exp().sum(-1).unsqueeze(-1) # (N, 4) #applying softmax at output layer.
    return H2

We will define a function model which characterizes the forward pass. For each neuron present in the network, forward pass involves two steps:

  1. Pre-activation represented by ‘a’: It is a weighted sum of inputs plus the bias.
  2. Activation represented by ‘h’: Activation function is Sigmoid function.

Since the network has a multi-class output, we use the softmax activation instead of the sigmoid activation at the output layer (the second layer), implemented by chaining PyTorch tensor operations. The activation output of the final layer is the predicted value of our network. The function returns this value so that we can use it to calculate the loss of the network.
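To make the chained softmax expression concrete, the sketch below compares it with the built-in torch.softmax on random stand-in pre-activations (the tensor A2 here is made up for illustration, not the network's real output).

```python
import torch

torch.manual_seed(0)
A2 = torch.randn(5, 4)  # stand-in pre-activations: 5 samples, 4 classes

# Handwritten softmax, written the same way as in the model function
H2_manual = A2.exp() / A2.exp().sum(-1).unsqueeze(-1)

# Built-in equivalent
H2_builtin = torch.softmax(A2, dim=-1)

print(torch.allclose(H2_manual, H2_builtin))  # True
print(H2_manual.sum(-1))  # every row sums to 1 (up to float rounding)
```

For large pre-activations the handwritten form can overflow in exp(), which is one reason to prefer the built-in version in practice.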

#function to calculate loss of a function.
#y_hat -> predicted & y -> actual
def loss_fn(y_hat, y):
     return -(y_hat[range(y.shape[0]), y].log()).mean()

#function to calculate accuracy of model
def accuracy(y_hat, y):
     pred = torch.argmax(y_hat, dim=1)
     return (pred == y).float().mean()

Next, we have our loss function. In this case, instead of the mean square error, we are using the cross-entropy loss function. By using the cross-entropy loss we can find the difference between the predicted probability distribution and actual probability distribution to compute the loss of the network.
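As a worked example of what loss_fn computes: for each sample, pick the predicted probability of the true class, take its negative log, and average over the samples. The probabilities and labels below are made-up values for illustration only.

```python
import math

# Predicted distributions for 3 samples over 4 classes (made-up numbers)
y_hat = [
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
    [0.05, 0.05, 0.10, 0.80],
]
y = [0, 2, 3]  # true class index of each sample

# Probability the model assigned to the true class of each sample
picked = [row[label] for row, label in zip(y_hat, y)]  # [0.70, 0.25, 0.80]

# Cross-entropy: mean negative log-probability of the true class
loss = -sum(math.log(p) for p in picked) / len(picked)
print(round(loss, 4))  # 0.6554
```

The confident, correct predictions (0.70 and 0.80) contribute little to the loss, while the uniform prediction (0.25) contributes the most.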

Train our feed-forward network

We will now train the feedforward network we created on our data. First, we initialize all the weights in the network using Xavier initialization, which draws the weights from a distribution with zero mean and a specific variance (here, by multiplying standard normal samples by 1/sqrt(n), where n is the number of inputs to the layer).

Since we have only two input features, we divide the weights by sqrt(2). We then call the model function on the training data for 10000 epochs with the learning rate set to 0.2.

#set the seed
torch.manual_seed(0)

#initialize the weights and biases using Xavier Initialization
weights1 = torch.randn(2, 2) / math.sqrt(2)
weights1.requires_grad_() #track gradients for the weights as well
bias1 = torch.zeros(2, requires_grad=True)

weights2 = torch.randn(2, 4) / math.sqrt(2)
weights2.requires_grad_()
bias2 = torch.zeros(4, requires_grad=True)

#set the parameters for training the model
learning_rate = 0.2
epochs = 10000
X_train = X_train.float()
Y_train = Y_train.long()
loss_arr = []
acc_arr = []

#training the network
for epoch in range(epochs):
    y_hat = model(X_train)  #compute the predicted distribution
    loss = loss_fn(y_hat, Y_train) #compute the loss of the network
    loss.backward() #backpropagate the gradients
    loss_arr.append(loss.item())
    acc_arr.append(accuracy(y_hat, Y_train))

    with torch.no_grad(): #update the weights and biases
        weights1 -= weights1.grad * learning_rate
        bias1 -= bias1.grad * learning_rate
        weights2 -= weights2.grad * learning_rate
        bias2 -= bias2.grad * learning_rate
        #set the gradients to zero for the next iteration
        weights1.grad.zero_()
        bias1.grad.zero_()
        weights2.grad.zero_()
        bias2.grad.zero_()

For all the weights and biases, we set requires_grad = True because we want to track all the operations performed on those tensors. After that, I set the parameter values required for training the network and converted X_train to float, because the default tensor type in PyTorch is a float tensor. Because we use Y_train as an index into another tensor while calculating the loss, I converted it into a long tensor.

In each epoch, we pass the entire training data through the model function to compute the forward pass. Once we have the forward pass, we apply the loss function to the output and call loss.backward() to propagate the loss backward through the network. loss.backward() computes the gradients of the loss with respect to the weights and biases and stores them in their .grad attributes. We then use these gradients to update the weights and biases. We do this within the torch.no_grad() context manager because we need to ensure that there is no further expansion of the computation graph.

Finally, we set the gradients to zero so that we are ready for the next loop. Otherwise, our gradients would record a running tally of all the operations that had happened (i.e. loss.backward() adds the gradients to whatever is already stored, rather than replacing them).

That’s it: we’ve created and trained a simple neural network entirely from scratch! Let’s compute the training and validation accuracy of the model to evaluate its performance and check for any scope of improvement by changing the number of epochs or the learning rate.
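The accumulation behaviour is easy to see on a tiny tensor. The following sketch uses a toy two-element parameter w (a hypothetical name, unrelated to the network's weights) and calls backward twice without zeroing in between.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: the gradient of sum(w * w) is 2 * w
(w * w).sum().backward()
first = w.grad.clone()        # [2., 4.]

# Second backward pass without zeroing: the new gradient is added to the old one
(w * w).sum().backward()
accumulated = w.grad.clone()  # [4., 8.]

# Zero the gradient in place, then compute it again
w.grad.zero_()
(w * w).sum().backward()
after_zero = w.grad.clone()   # [2., 4.]

print(first, accumulated, after_zero)
```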

Using NN.Functional

In this section, we will discuss how we can refactor our code by taking advantage of PyTorch’s nn classes to make it more concise and flexible. First, we will import torch.nn.functional into our namespace using the following command.

import torch.nn.functional as F

This module contains a wide range of loss and activation functions. The only change we will do in our code is that instead of using the handwritten loss function we can use the inbuilt cross entropy function present in torch.nn.functional

loss = F.cross_entropy(y_hat, Y_train)

Putting it together

torch.manual_seed(0)
weights1 = torch.randn(2, 2) / math.sqrt(2)
weights1.requires_grad_()
bias1 = torch.zeros(2, requires_grad=True)
weights2 = torch.randn(2, 4) / math.sqrt(2)
weights2.requires_grad_()
bias2 = torch.zeros(4, requires_grad=True)

learning_rate = 0.2
epochs = 10000
loss_arr = []
acc_arr = []

for epoch in range(epochs):
    y_hat = model(X_train) #compute the predicted distribution
    loss = F.cross_entropy(y_hat, Y_train) #just replace the loss function with built in function
    loss.backward() #backpropagate the gradients
    loss_arr.append(loss.item())
    acc_arr.append(accuracy(y_hat, Y_train))

    with torch.no_grad():
        weights1 -= weights1.grad * learning_rate
        bias1 -= bias1.grad * learning_rate
        weights2 -= weights2.grad * learning_rate
        bias2 -= bias2.grad * learning_rate
        #set the gradients to zero for the next iteration
        weights1.grad.zero_()
        bias1.grad.zero_()
        weights2.grad.zero_()
        bias2.grad.zero_()

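Let’s compare the loss with the earlier run by training the network with the same number of epochs and learning rate. Note that F.cross_entropy applies log-softmax to its input internally, so passing it the softmax output of model applies softmax twice; this is one reason the two values below do not match exactly.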

  • Loss of the network using handwritten loss function: 1.54
  • Loss of the network using inbuilt F.cross_entropy: 1.411
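How F.cross_entropy treats its input can be checked directly. It expects raw pre-activations (logits) and applies log-softmax itself. The sketch below uses random stand-in logits and labels (not the article's data): the handwritten loss on softmax outputs matches F.cross_entropy on the logits, while passing the softmax outputs gives a different value.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8, 4)           # stand-in for A2: 8 samples, 4 classes
targets = torch.randint(0, 4, (8,))  # stand-in class labels

probs = torch.softmax(logits, dim=-1)  # what the model function returns (H2)

# Handwritten cross-entropy on probabilities, as in loss_fn
handwritten = -(probs[range(targets.shape[0]), targets].log()).mean()

builtin_on_logits = F.cross_entropy(logits, targets)  # matches the handwritten loss
builtin_on_probs = F.cross_entropy(probs, targets)    # softmax is applied twice here

print(handwritten.item(), builtin_on_logits.item(), builtin_on_probs.item())
```

Under this reading, a model meant to be trained with F.cross_entropy would return the pre-activation A2 and apply softmax only when probabilities are needed; that is a design suggestion, not what the code in this article does.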

Using NN.Parameter

Next up, we’ll use nn.Module and nn.Parameter, for a clearer and more concise training loop. We will write a class FirstNetwork for our model which will subclass nn.Module. In this case, we want to create a class that holds our weights, bias, and method for the forward step.

import torch.nn as nn
class FirstNetwork(nn.Module):
    def __init__(self):
        super().__init__() #initialize nn.Module before registering parameters
        torch.manual_seed(0)
        #wrap all the weights and biases inside nn.Parameter()
        self.weights1 = nn.Parameter(torch.randn(2, 2) / math.sqrt(2))
        self.bias1 = nn.Parameter(torch.zeros(2))
        self.weights2 = nn.Parameter(torch.randn(2, 4) / math.sqrt(2))
        self.bias2 = nn.Parameter(torch.zeros(4))
    def forward(self, X):
        a1 = torch.matmul(X, self.weights1) + self.bias1
        h1 = a1.sigmoid()
        a2 = torch.matmul(h1, self.weights2) + self.bias2
        h2 = a2.exp()/a2.exp().sum(-1).unsqueeze(-1)
        return h2

The __init__ function (the constructor) initializes the parameters of the network. Here we wrap the weights and biases inside nn.Parameter, so they are automatically added to the module's list of parameters.
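The registration can be observed with named_parameters(). The sketch below uses a hypothetical TinyModule (not part of the article's code) that holds two nn.Parameter attributes and one plain tensor; only the wrapped ones show up.

```python
import torch
import torch.nn as nn

class TinyModule(nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()  # must run before parameters are assigned
        self.weight = nn.Parameter(torch.randn(2, 4))  # registered
        self.bias = nn.Parameter(torch.zeros(4))       # registered
        self.scratch = torch.zeros(4)                  # plain tensor: not registered

tiny = TinyModule()
names = [name for name, _ in tiny.named_parameters()]
n_params = sum(p.numel() for p in tiny.parameters())

print(names)     # ['weight', 'bias']
print(n_params)  # 12
```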

Since we’re now using an object instead of just using a function, we first have to instantiate our model:

#we first have to instantiate our model
model = FirstNetwork() 

Next, we will write our training loop inside a function called fit that accepts the number of epochs and learning rate as its arguments. Inside the fit method we will call our model object model to execute the forward pass, but behind the scenes, Pytorch will call our forward method automatically.

def fit(epochs = 10000, learning_rate = 0.2):
    loss_arr = []
    acc_arr = []
    for epoch in range(epochs):
        y_hat = model(X_train) #forward pass
        loss = F.cross_entropy(y_hat, Y_train) #loss calculation
        loss_arr.append(loss.item())
        acc_arr.append(accuracy(y_hat, Y_train))
        loss.backward() #backpropagation
        with torch.no_grad():
            #updating the parameters
            for param in model.parameters():
                param -= learning_rate * param.grad
            model.zero_grad() #setting the gradients to zero

In our training loop, instead of updating each parameter by name and manually zeroing the gradient of each parameter separately, we can now take advantage of model.parameters() and model.zero_grad() (both defined by PyTorch for nn.Module) to update all the parameters of the model in one shot. This makes those steps more concise and less prone to the error of forgetting some of our parameters.

One important point to note from the programming standpoint is that we have now decoupled the model and the fit function. In fact, the fit function knows nothing about the internals of the model. It applies the same logic to whatever model is defined.

Using NN.Linear and Optim

In the previous sections, we manually defined and initialized self.weights and self.bias and computed the forward pass ourselves. This process is abstracted away by the PyTorch class nn.Linear, a linear layer that does all of that for us.

class FirstNetwork_v1(nn.Module):
    def __init__(self):
        super().__init__() #initialize nn.Module before registering layers
        torch.manual_seed(0)
        self.lin1 = nn.Linear(2, 2) #automatically defines weights and biases
        self.lin2 = nn.Linear(2, 4)
    def forward(self, X):
        a1 = self.lin1(X) #computes the dot product and adds bias
        h1 = a1.sigmoid()
        a2 = self.lin2(h1) #computes dot product and adds bias
        h2 = a2.exp()/a2.exp().sum(-1).unsqueeze(-1)
        return h2

torch.nn.Linear(in_features, out_features) takes two mandatory parameters.

  • in_features — size of each input sample
  • out_features — size of each output sample

The way we achieve the abstraction is that in the __init__ function we declare self.lin1 = nn.Linear(2,2), because the first hidden layer has two inputs and two outputs. nn.Linear(2,2) automatically defines weights of size (2,2) and a bias of size 2. Similarly, for the second layer we declare another variable assigned to nn.Linear(2,4), because that layer has two inputs and four outputs.

Now our forward method looks simple: we no longer need to compute the dot product and add the bias manually. We can simply call self.lin1() and self.lin2(). We instantiate our model and calculate the loss in the same way as before:

model = FirstNetwork_v1() #object

We can still use the same fit method as before.
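One detail that matters when inspecting these layers: nn.Linear stores its weight with shape (out_features, in_features) and multiplies by the transpose. The sketch below, on random stand-in inputs, checks this for the second layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lin2 = nn.Linear(2, 4)  # 2 inputs, 4 outputs
x = torch.randn(5, 2)   # 5 stand-in samples with 2 features each

print(lin2.weight.shape)  # torch.Size([4, 2])
print(lin2.bias.shape)    # torch.Size([4])

# The layer output equals the manual pre-activation with the transposed weight
manual = torch.matmul(x, lin2.weight.T) + lin2.bias
same = torch.allclose(lin2(x), manual, atol=1e-6)
print(same)  # True
```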

Using NN.Optim

So far, we have been using Stochastic Gradient Descent in our training and updating parameters manually like this:

#updating the parameters
for param in model.parameters():
    param -= learning_rate * param.grad

PyTorch also has a package, torch.optim, with various optimization algorithms. We can use the step method of our optimizer to take an update step, instead of manually updating each parameter.

from torch import optim
opt = optim.SGD(model.parameters(), lr=learning_rate) #define optimizer

In this problem, we will use optim.SGD(), Stochastic Gradient Descent. The optimizer takes the parameters of the model and the learning rate as its arguments. torch.optim also implements Nesterov accelerated gradient descent and Adam, among other optimization algorithms; see the torch.optim documentation for the full list.

def fit_v1(epochs = 10000, learning_rate = 0.2, title = ""):
    loss_arr = []
    acc_arr = []
    opt = optim.SGD(model.parameters(), lr=learning_rate) #define optimizer
    for epoch in range(epochs):
        y_hat = model(X_train)
        loss = F.cross_entropy(y_hat, Y_train)
        loss_arr.append(loss.item())
        acc_arr.append(accuracy(y_hat, Y_train))

        loss.backward() #compute the gradients
        opt.step() #updating each parameter.
        opt.zero_grad()  #resets the gradient to 0

The only change in our training loop is that after loss.backward(), instead of manually updating each parameter, we simply say:

opt.step()
opt.zero_grad()

We use the step method of our optimizer to take an update step, and opt.zero_grad() resets the gradients to 0; this must happen before the gradients for the next batch are computed.
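For plain SGD without momentum, the optimizer step is the same arithmetic as the manual update. The sketch below checks this on a toy two-element parameter; w_manual and w_optim are hypothetical names, and the sum-of-squares loss is chosen only because its gradient (2 * w) is easy to verify by hand.

```python
import torch
from torch import optim

learning_rate = 0.2

# Two copies of the same parameter: one updated by hand, one by the optimizer
w_manual = torch.tensor([1.0, -2.0], requires_grad=True)
w_optim = torch.tensor([1.0, -2.0], requires_grad=True)
opt = optim.SGD([w_optim], lr=learning_rate)

# Same toy loss for both: sum of squares, whose gradient is 2 * w
(w_manual ** 2).sum().backward()
(w_optim ** 2).sum().backward()

with torch.no_grad():
    w_manual -= learning_rate * w_manual.grad  # manual update
w_manual.grad.zero_()                          # manual reset

opt.step()       # same update, done by the optimizer
opt.zero_grad()  # same reset

print(w_manual.detach(), w_optim.detach())  # both hold the values 0.6 and -1.2
```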

Using NN.Sequential

In this section, we will see another important feature of the torch.nn module that helps simplify our code: nn.Sequential. A Sequential object executes the series of transformations contained within it in sequential order. To use nn.Sequential, we define a custom network in the __init__ function.

class FirstNetwork_v2(nn.Module):
    def __init__(self):
        super().__init__()
        torch.manual_seed(0)
        self.net = nn.Sequential( #sequential operation
            nn.Linear(2, 2),
            nn.Sigmoid(),
            nn.Linear(2, 4),
            nn.Softmax(dim=-1))

    def forward(self, X):
        return self.net(X)

In self.net we specify the series of operations that our data goes through in the network, in sequential order. Now our forward function looks very simple: it just applies self.net to the input X.
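What "sequential" means here can be shown by applying the layers by hand. The sketch below assumes the same 2-2-4 architecture with sigmoid and softmax activations used earlier in the article, and random stand-in inputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(
    nn.Linear(2, 2),
    nn.Sigmoid(),
    nn.Linear(2, 4),
    nn.Softmax(dim=-1),
)
x = torch.randn(5, 2)  # 5 stand-in samples

# Applying the layers one by one...
h = x
for layer in net:
    h = layer(h)

# ...gives the same result as calling the container directly
out = net(x)
print(torch.allclose(h, out))  # True
print(out.sum(-1))             # rows sum to 1 because of the final softmax
```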

We’ll clean up our fit function so we can reuse it in the future.

model = FirstNetwork_v2() #object

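def fit_v2(x, y, model, opt, loss_fn, epochs = 10000):
    """Generic function for training a model """
    for epoch in range(epochs):
        loss = loss_fn(model(x), y)

        loss.backward() #compute the gradients
        opt.step() #update the parameters
        opt.zero_grad() #reset the gradients to 0

    return loss.item()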

#define loss 
loss_fn = F.cross_entropy
#define optimizer 
opt = optim.SGD(model.parameters(), lr=0.2)

#training model 
fit_v2(X_train, Y_train, model, opt, loss_fn)

Now our new fit function fit_v2 is fully independent of the model, optimizer, loss function, epochs, and input data. This gives us the flexibility to change any of these without worrying about our training loop. That is the power of abstraction.

Moving the Network to GPU

In this final section, we will discuss how we can leverage a GPU to train our model. First, check that your GPU is working in PyTorch:

print(torch.cuda.is_available())

Next, create a device object for the GPU so that we can reference it:

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

Moving the inputs and model to GPU

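#moving inputs to GPU
X_train = X_train.to(device)
Y_train = Y_train.to(device)

model = FirstNetwork_v2()
model.to(device) #moving the network to GPU
opt = optim.SGD(model.parameters(), lr=0.2) #optimizer must track the new model's parameters

#calculate time
tic = time.time()
print('Final loss', fit_v2(X_train, Y_train, model, opt, loss_fn))
toc = time.time()
print('Time taken', toc - tic)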

There you have it: we have successfully built our neural network for multi-class classification using the PyTorch torch.nn module. The entire code discussed in the article is present in this GitHub repository. Feel free to fork it or download it.

Photo by Markus Spiske on Unsplash

What’s Next?

If you want to take this a step further and make the problem more complicated, you can use the make_moons function, which generates two interleaving half circles and so gives you non-linearly separable data. You can also add some Gaussian noise to the data to make it harder for the neural network to arrive at a non-linear decision boundary.
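A minimal sketch of generating such data with sklearn follows; the noise level of 0.2 is an arbitrary choice for illustration. Because make_moons produces two classes, the output layer of the network would need two outputs instead of four.

```python
from sklearn.datasets import make_moons

# Two interleaving half circles; noise is the standard deviation of the
# Gaussian noise added to the points
data, labels = make_moons(n_samples=1000, noise=0.2, random_state=0)

print(data.shape, labels.shape)      # (1000, 2) (1000,)
print(sorted(set(labels.tolist())))  # [0, 1]
```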

Even with the current data points, you can try out few scenarios:

  1. Try out a deeper neural network, e.g. 2 hidden layers
  2. Try out different parameters in the optimizer (e.g. try momentum, Nesterov)
  3. Try out other optimization methods (e.g. RMSProp and Adam) which are supported in optim
  4. Try out different initialization methods which are supported in nn.init
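The four suggestions could be started from a sketch like the one below. The hidden width of 8, the momentum of 0.9, and the learning rates are arbitrary illustrative choices, not tuned values.

```python
import torch
import torch.nn as nn
from torch import optim

torch.manual_seed(0)

# 1. A deeper network: two hidden layers (the width of 8 is an arbitrary choice)
deeper = nn.Sequential(
    nn.Linear(2, 8), nn.Sigmoid(),
    nn.Linear(8, 8), nn.Sigmoid(),
    nn.Linear(8, 4), nn.Softmax(dim=-1),
)

# 4. A different initialization from nn.init, applied to every linear layer
for layer in deeper:
    if isinstance(layer, nn.Linear):
        nn.init.xavier_uniform_(layer.weight)
        nn.init.zeros_(layer.bias)

# 2. SGD with Nesterov momentum
opt_nesterov = optim.SGD(deeper.parameters(), lr=0.2, momentum=0.9, nesterov=True)

# 3. Other optimizers from torch.optim
opt_rmsprop = optim.RMSprop(deeper.parameters(), lr=0.01)
opt_adam = optim.Adam(deeper.parameters(), lr=0.01)

out = deeper(torch.randn(5, 2))  # 5 stand-in samples
print(out.shape)  # torch.Size([5, 4])
```

Any one of these optimizers can be passed to fit_v2 in place of the plain SGD optimizer, since the training loop only calls step() and zero_grad().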


In this post, we built a simple neural network from scratch using PyTorch tensors and autograd. After that, we discussed different classes of torch.nn that help us create and train neural networks, making our code shorter, more understandable, and/or more flexible. If you have any issues or doubts while implementing the above code, feel free to ask them in the comment section below or send me a message on LinkedIn citing this article.

