Feedforward neural networks are also known as Multi-layered Networks of Neurons (MLN). These networks are called feedforward because information travels only forward through the network: through the input nodes, then through the hidden layer(s), and finally through the output nodes.

Traditional models such as the McCulloch-Pitts neuron, the Perceptron, and the sigmoid neuron are limited in capacity to linear functions. To handle a complex non-linear decision boundary between input and output, we use a Multi-layered Network of Neurons.
Outline
In this post, we will discuss how to build a feedforward neural network using PyTorch. We will do this incrementally using the PyTorch torch.nn module. First, we will generate non-linearly separable data with four classes. Then we will build a simple feedforward neural network using PyTorch tensor functionality. After that, we will use the abstractions available in torch.nn, such as Functional, Parameter, Linear, Sequential, and Optim, to make our neural network more concise, flexible, and efficient. Finally, we will move the network to CUDA and see how fast it performs.
Note: This tutorial assumes you already have PyTorch installed on your local machine or know how to use PyTorch in Google Colab with CUDA support, and that you are familiar with the basics of tensor operations. If you are not familiar with these concepts, kindly refer to my previous post linked below.
The rest of the article is structured as follows:
- Import libraries
- Generate non-linearly separable data
- Feedforward network using tensors and auto-grad
- Train our feedforward network
- NN.Functional
- NN.Parameter
- NN.Linear and Optim
- NN.Sequential
- Moving the Network to GPU
If you want to skip the theory part and get into the code right away, Click here
Import libraries
Before we start building our network, we first need to import the required libraries. We import numpy for matrix multiplications and dot products between vectors, matplotlib to visualize the data, and from the sklearn package we import functions to generate the data and evaluate the network's performance. We import torch for everything related to PyTorch.
#required libraries
import numpy as np
import math
import matplotlib.pyplot as plt
import matplotlib.colors
import time
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error, log_loss
from tqdm import tqdm_notebook
from IPython.display import HTML
import warnings
from sklearn.preprocessing import OneHotEncoder
from sklearn.datasets import make_blobs
import torch

warnings.filterwarnings('ignore')
Generate non-linearly separable data
In this section, we will see how to randomly generate non-linearly separable data using sklearn.
#custom colormap used to color the four classes in the plots
my_cmap = matplotlib.colors.LinearSegmentedColormap.from_list("", ["red", "yellow", "green", "blue"])

#generate data using the make_blobs function from sklearn.
#centers = 4 indicates four different classes
data, labels = make_blobs(n_samples=1000, centers=4, n_features=2, random_state=0)
print(data.shape, labels.shape)

#visualize the data
plt.scatter(data[:,0], data[:,1], c=labels, cmap=my_cmap)
plt.show()

#splitting the data into train and validation sets
X_train, X_val, Y_train, Y_val = train_test_split(data, labels, stratify=labels, random_state=0)
print(X_train.shape, X_val.shape, labels.shape)
To generate the data randomly, we use make_blobs, which generates blobs of points with a Gaussian distribution. I have generated 1000 data points in 2D space with four blobs (centers=4), giving us a multi-class classification problem. Each data point has two input features and a class label of 0, 1, 2, or 3.

Once we have our data ready, I have used the train_test_split function to split the data into training and validation sets in the ratio of 75:25.
Feedforward network using tensors and auto-grad
In this section, we will see how to build and train a simple neural network using Pytorch tensors and auto-grad. The network has six neurons in total — two in the first hidden layer and four in the output layer. For each of these neurons, pre-activation is represented by ‘a’ and post-activation is represented by ‘h’. In the network, we have a total of 18 parameters — 12 weight parameters and 6 bias terms.

We will use the map function to efficiently convert the numpy arrays into PyTorch tensors.
#converting the numpy arrays to torch tensors
X_train, Y_train, X_val, Y_val = map(torch.tensor, (X_train, Y_train, X_val, Y_val))
print(X_train.shape, Y_train.shape)
After converting the data to tensors, we need to write a function that helps us to compute the forward pass for the network.
#function for computing the forward pass of the network
def model(x):
    A1 = torch.matmul(x, weights1) + bias1        # (N, 2) x (2, 2) -> (N, 2)
    H1 = A1.sigmoid()                             # (N, 2)
    A2 = torch.matmul(H1, weights2) + bias2       # (N, 2) x (2, 4) -> (N, 4)
    H2 = A2.exp()/A2.exp().sum(-1).unsqueeze(-1)  # (N, 4) applying softmax at the output layer
    return H2
We define a function model, which characterizes the forward pass. For each neuron in the network, the forward pass involves two steps:
- Pre-activation represented by ‘a’: It is a weighted sum of inputs plus the bias.
- Activation represented by ‘h’: the activation function here is the sigmoid function.
Since we have a multi-class output, we use the Softmax activation instead of the Sigmoid activation at the output layer (the second layer), implemented here with PyTorch's chaining mechanism. The activation output of the final layer is the predicted value of our network. The function returns this value so that we can use it to calculate the loss of the network.
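As a side note, the chained expression used at the output layer computes exactly the softmax function. A quick sanity check (using a hypothetical pre-activation tensor, not part of the tutorial code) confirms it matches PyTorch's built-in softmax:

#quick sanity check (illustrative only): manual softmax vs the built-in softmax
pre_act = torch.randn(5, 4)                                    #hypothetical pre-activations for 5 samples
manual = pre_act.exp() / pre_act.exp().sum(-1).unsqueeze(-1)   #softmax via chaining, as in model()
print(torch.allclose(manual, torch.softmax(pre_act, dim=-1)))  #True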
#function to calculate the loss of the network
#y_hat -> predicted & y -> actual
def loss_fn(y_hat, y):
    return -(y_hat[range(y.shape[0]), y].log()).mean()

#function to calculate the accuracy of the model
def accuracy(y_hat, y):
    pred = torch.argmax(y_hat, dim=1)
    return (pred == y).float().mean()
Next, we have our loss function. In this case, instead of the mean squared error, we use the cross-entropy loss. Cross-entropy measures the difference between the predicted probability distribution and the actual distribution, and we use it to compute the loss of the network.
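To make the loss concrete, here is a toy example (not part of the training code) using the handwritten loss_fn. If a sample's predicted distribution over the four classes is [0.1, 0.2, 0.6, 0.1] and the true class is 2, the loss for that sample is -log(0.6) ≈ 0.51:

#toy example of the handwritten cross-entropy loss
y_hat_toy = torch.tensor([[0.1, 0.2, 0.6, 0.1]])  #predicted probabilities for one sample
y_toy = torch.tensor([2])                         #true class label
print(loss_fn(y_hat_toy, y_toy))                  #tensor(0.5108) = -log(0.6)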
Train our feed-forward network
We will now train the feed-forward network which we created on our data. First, we initialize all the weights in the network using Xavier initialization. Xavier initialization draws the weights from a distribution with zero mean and a specific variance (here obtained by multiplying with 1/sqrt(n), where n is the number of inputs to the layer).

Since each layer receives only two inputs, we divide the weights by sqrt(2). We then call the model function on the training data for 10,000 epochs with the learning rate set to 0.2.
#set the seed
torch.manual_seed(0)

#initialize the weights and biases using Xavier Initialization
weights1 = torch.randn(2, 2) / math.sqrt(2)
weights1.requires_grad_()
bias1 = torch.zeros(2, requires_grad=True)

weights2 = torch.randn(2, 4) / math.sqrt(2)
weights2.requires_grad_()
bias2 = torch.zeros(4, requires_grad=True)

#set the parameters for training the model
learning_rate = 0.2
epochs = 10000
X_train = X_train.float()
Y_train = Y_train.long()
loss_arr = []
acc_arr = []

#training the network
for epoch in range(epochs):
    y_hat = model(X_train)          #compute the predicted distribution
    loss = loss_fn(y_hat, Y_train)  #compute the loss of the network
    loss.backward()                 #backpropagate the gradients
    loss_arr.append(loss.item())
    acc_arr.append(accuracy(y_hat, Y_train))
    with torch.no_grad():           #update the weights and biases
        weights1 -= weights1.grad * learning_rate
        bias1 -= bias1.grad * learning_rate
        weights2 -= weights2.grad * learning_rate
        bias2 -= bias2.grad * learning_rate
        weights1.grad.zero_()
        bias1.grad.zero_()
        weights2.grad.zero_()
        bias2.grad.zero_()
For all the weights and biases, we set requires_grad = True because we want to track all the operations performed on those tensors. After that, I set the parameter values required for training the network and converted X_train to float, because the default tensor type in PyTorch is a float tensor. Because we use Y_train as an index into another tensor while calculating the loss, I converted it into a long tensor.
For each epoch, we loop through the entire training data and call the model function to compute the forward pass. Once we have the forward pass, we apply the loss function to the output and call loss.backward() to propagate the loss backward through the network. loss.backward() computes the gradients of the loss with respect to the model parameters, in this case weights1, bias1, weights2, and bias2. We then use these gradients to update the weights and biases. We do this inside the torch.no_grad() context manager because we need to ensure that these update operations are not added to the computation graph.
Finally, we set the gradients to zero so that we are ready for the next iteration. Otherwise, the gradients would keep a running tally of all the operations that had happened (i.e. loss.backward() adds the gradients to whatever is already stored, rather than replacing them).

That’s it: we’ve created and trained a simple neural network entirely from scratch! Let’s compute the training and validation accuracy of the model to evaluate its performance and check whether there is scope for improvement by changing the number of epochs or the learning rate.
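Here is a minimal sketch of that evaluation (assuming the validation tensors are cast to the same dtypes as the training tensors):

#evaluate the model on the training and validation data
X_val = X_val.float()  #match the dtypes used for training
Y_val = Y_val.long()
with torch.no_grad():  #no gradients needed for evaluation
    print('Training accuracy', accuracy(model(X_train), Y_train).item())
    print('Validation accuracy', accuracy(model(X_val), Y_val).item())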
Using NN.Functional
In this section, we will discuss how we can refactor our code by taking advantage of PyTorch's nn classes to make it more concise and flexible. First, we will import torch.nn.functional into our namespace using the following command.
import torch.nn.functional as F
This module contains a wide range of loss and activation functions. The only change we make in our code is that instead of the handwritten loss function, we use the built-in cross-entropy function from torch.nn.functional:
loss = F.cross_entropy(y_hat, Y_train)
Putting it together
torch.manual_seed(0)

weights1 = torch.randn(2, 2) / math.sqrt(2)
weights1.requires_grad_()
bias1 = torch.zeros(2, requires_grad=True)

weights2 = torch.randn(2, 4) / math.sqrt(2)
weights2.requires_grad_()
bias2 = torch.zeros(4, requires_grad=True)

learning_rate = 0.2
epochs = 10000
loss_arr = []
acc_arr = []

for epoch in range(epochs):
    y_hat = model(X_train)                  #compute the predicted distribution
    loss = F.cross_entropy(y_hat, Y_train)  #just replace the loss function with the built-in function
    loss.backward()
    loss_arr.append(loss.item())
    acc_arr.append(accuracy(y_hat, Y_train))
    with torch.no_grad():
        weights1 -= weights1.grad * learning_rate
        bias1 -= bias1.grad * learning_rate
        weights2 -= weights2.grad * learning_rate
        bias2 -= bias2.grad * learning_rate
        weights1.grad.zero_()
        bias1.grad.zero_()
        weights2.grad.zero_()
        bias2.grad.zero_()
Let’s compare the loss and accuracy against the earlier run by training the network with the same number of epochs and learning rate.
- Loss of the network using handwritten loss function: 1.54
- Loss of the network using inbuilt F.cross_entropy: 1.411

Using NN.Parameter
Next up, we’ll use nn.Module and nn.Parameter for a clearer and more concise training loop. We will write a class FirstNetwork for our model, which will subclass nn.Module. In this case, we want to create a class that holds our weights, biases, and a method for the forward step.
import torch.nn as nn
class FirstNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        torch.manual_seed(0)
        #wrap all the weights and biases inside nn.Parameter()
        self.weights1 = nn.Parameter(torch.randn(2, 2) / math.sqrt(2))
        self.bias1 = nn.Parameter(torch.zeros(2))
        self.weights2 = nn.Parameter(torch.randn(2, 4) / math.sqrt(2))
        self.bias2 = nn.Parameter(torch.zeros(4))

    def forward(self, X):
        a1 = torch.matmul(X, self.weights1) + self.bias1
        h1 = a1.sigmoid()
        a2 = torch.matmul(h1, self.weights2) + self.bias2
        h2 = a2.exp()/a2.exp().sum(-1).unsqueeze(-1)  #softmax at the output layer
        return h2
The __init__ function (the constructor) initializes the parameters of the network; in this case, we wrap the weights and biases inside nn.Parameter. Because they are wrapped inside nn.Parameter, they are automatically registered in the module's list of parameters.
Since we’re now using an object instead of just using a function, we first have to instantiate our model:
#we first have to instantiate our model
model = FirstNetwork()
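As a quick check (purely illustrative), we can list the parameters that nn.Module now tracks for us:

#nn.Module automatically registers every nn.Parameter we assigned in __init__
for name, param in model.named_parameters():
    print(name, param.shape, param.requires_grad)
#weights1 torch.Size([2, 2]) True
#bias1 torch.Size([2]) True
#weights2 torch.Size([2, 4]) True
#bias2 torch.Size([4]) True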
Next, we will write our training loop inside a function called fit that accepts the number of epochs and the learning rate as its arguments. Inside the fit method, we call our model object to execute the forward pass; behind the scenes, PyTorch calls our model's forward method automatically.
def fit(epochs = 10000, learning_rate = 0.2):
    loss_arr = []
    acc_arr = []
    for epoch in range(epochs):
        y_hat = model(X_train)                  #forward pass
        loss = F.cross_entropy(y_hat, Y_train)  #loss calculation
        loss_arr.append(loss.item())
        acc_arr.append(accuracy(y_hat, Y_train))
        loss.backward()                         #backpropagation
        with torch.no_grad():                   #updating the parameters
            for param in model.parameters():
                param -= learning_rate * param.grad
            model.zero_grad()                   #setting the gradients to zero
In our training loop, instead of updating the value of each parameter by name and manually zeroing out the gradient of each parameter separately, we can now take advantage of model.parameters() and model.zero_grad() (both defined by PyTorch for nn.Module) to update all the parameters of the model in one shot. This makes those steps more concise and less prone to the error of forgetting some of our parameters.
One important point to note from a programming standpoint is that we have now successfully decoupled the model from the fit function. In fact, the fit function knows nothing about the model; it applies the same logic to whatever model is defined.
Using NN.Linear and Optim
In the previous sections, we manually defined and initialized self.weights and self.bias and computed the forward pass. This whole process is abstracted away by the PyTorch class nn.Linear, which defines a linear layer and does all of that for us.
class FirstNetwork_v1(nn.Module):
    def __init__(self):
        super().__init__()
        torch.manual_seed(0)
        self.lin1 = nn.Linear(2, 2)  #automatically defines weights and biases
        self.lin2 = nn.Linear(2, 4)

    def forward(self, X):
        a1 = self.lin1(X)   #computes the dot product and adds the bias
        h1 = a1.sigmoid()
        a2 = self.lin2(h1)  #computes the dot product and adds the bias
        h2 = a2.exp()/a2.exp().sum(-1).unsqueeze(-1)
        return h2
torch.nn.Linear(in_features, out_features) takes two mandatory parameters:
- in_features — size of each input sample
- out_features — size of each output sample
The way we achieve the abstraction is that in the __init__ function we declare self.lin1 = nn.Linear(2, 2), because the first hidden layer has two inputs and two outputs. nn.Linear(2, 2) automatically defines weights of size (2, 2) and a bias of size 2. Similarly, for the second layer we declare self.lin2 = nn.Linear(2, 4), because that layer takes two inputs and produces four outputs.
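As a quick check (using a throwaway instance, purely for illustration), we can inspect the parameters that nn.Linear defines; note that nn.Linear stores its weight with shape (out_features, in_features) and applies the transpose internally during the forward pass:

#inspect the parameters that nn.Linear creates for us
tmp_net = FirstNetwork_v1()
print(tmp_net.lin1.weight.shape, tmp_net.lin1.bias.shape)  #torch.Size([2, 2]) torch.Size([2])
print(tmp_net.lin2.weight.shape, tmp_net.lin2.bias.shape)  #torch.Size([4, 2]) torch.Size([4])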
Now our forward method looks simple: we no longer need to compute the dot product and add the bias manually. We can simply call self.lin1() and self.lin2(). We instantiate our model and calculate the loss in the same way as before:
model = FirstNetwork_v1()  #instantiate the new model (fit uses the global variable model)
We are still able to use the same fit method as before.

Using NN.Optim
So far, we have been using Stochastic Gradient Descent in our training and updating parameters manually like this:
#updating the parameters
for param in model.parameters():
    param -= learning_rate * param.grad
PyTorch also has a package, torch.optim, with various optimization algorithms. We can use the step method of an optimizer to perform the parameter update, instead of manually updating each parameter.
from torch import optim

opt = optim.SGD(model.parameters(), lr=learning_rate)  #define the optimizer
In this problem, we will be using optim.SGD(), i.e. stochastic gradient descent. The optimizer takes the parameters of the model and the learning rate as its arguments. In fact, we can also use optim to implement Nesterov accelerated gradient descent, Adam, and various other optimization algorithms. Read the documentation.
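For example (the hyperparameter values below are arbitrary, just to illustrate the API), switching to Nesterov accelerated gradient descent or Adam only changes the optimizer definition:

#other optimizers from torch.optim; hyperparameter values here are arbitrary examples
opt = optim.SGD(model.parameters(), lr=0.2, momentum=0.9, nesterov=True)  #Nesterov accelerated SGD
opt = optim.Adam(model.parameters(), lr=0.01)                             #Adam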
def fit_v1(epochs = 10000, learning_rate = 0.2, title = ""):
    loss_arr = []
    acc_arr = []
    opt = optim.SGD(model.parameters(), lr=learning_rate)  #define the optimizer
    for epoch in range(epochs):
        y_hat = model(X_train)
        loss = F.cross_entropy(y_hat, Y_train)
        loss_arr.append(loss.item())
        acc_arr.append(accuracy(y_hat, Y_train))
        loss.backward()
        opt.step()       #updating each parameter
        opt.zero_grad()  #resets the gradients to 0
The only change in our training loop is that after loss.backward(), instead of manually updating each parameter, we simply say:
opt.step()
opt.zero_grad()
We use the step method of the optimizer to perform the parameter update, and opt.zero_grad() resets the gradients to zero; we need to call it before computing the gradients for the next iteration.

Using NN.Sequential
In this section, we will see another important feature of the torch.nn module that helps simplify our code: nn.Sequential. A Sequential object executes the series of transformations contained within it, in sequential order. To use nn.Sequential, we define a custom network self.net in the __init__ function.
class FirstNetwork_v2(nn.Module):
    def __init__(self):
        super().__init__()
        torch.manual_seed(0)
        self.net = nn.Sequential(  #sequential operations
            nn.Linear(2, 2),
            nn.Sigmoid(),
            nn.Linear(2, 4),
            nn.Softmax(dim=1))     #dim=1 applies softmax across the classes

    def forward(self, X):
        return self.net(X)
In self.net, we specify the series of operations that our data goes through in the network, in sequential order. Now our forward function looks very simple: it just applies self.net to the input X.
We’ll also clean up our fit function so that we can reuse it in the future.
model = FirstNetwork_v2()  #object

def fit_v2(x, y, model, opt, loss_fn, epochs = 10000):
    """Generic function for training a model"""
    for epoch in range(epochs):
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
        opt.zero_grad()
    return loss.item()

#define the loss
loss_fn = F.cross_entropy

#define the optimizer
opt = optim.SGD(model.parameters(), lr=0.2)

#training the model
fit_v2(X_train, Y_train, model, opt, loss_fn)
Our new fit function, fit_v2, is fully independent of the model, optimizer, loss function, number of epochs, and input data. This gives us the flexibility to change any of these components without worrying about the training loop; that is the power of abstraction.
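For example (a hedged sketch reusing the classes defined earlier), we can train a different model with a different optimizer without touching fit_v2:

#the same generic training loop works with a different model and optimizer
model = FirstNetwork_v1()                      #the nn.Linear based model from earlier
opt = optim.Adam(model.parameters(), lr=0.01)  #hyperparameter value is just an example
print('Final loss', fit_v2(X_train, Y_train, model, opt, F.cross_entropy))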
Moving the Network to GPU
In this final section, we will discuss how we can leverage the GPU to train our model. First, check that your GPU is visible to PyTorch:
print(torch.cuda.is_available())
Then create a device object for the GPU so that we can reference it:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
Moving the inputs and the model to the GPU:
#moving the inputs to the GPU
X_train = X_train.to(device)
Y_train = Y_train.to(device)

model = FirstNetwork_v2()
model.to(device)  #moving the network to the GPU

#re-create the optimizer so it points at the parameters of this model
opt = optim.SGD(model.parameters(), lr=0.2)

#calculate the time taken
tic = time.time()
print('Final loss', fit_v2(X_train, Y_train, model, opt, loss_fn))
toc = time.time()
print('Time taken', toc - tic)
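To appreciate the speed-up, you can optionally time the same training run on the CPU as a baseline (a hedged sketch; the numbers depend entirely on your hardware):

#optional CPU baseline for comparison
device_cpu = torch.device("cpu")
X_train_cpu, Y_train_cpu = X_train.to(device_cpu), Y_train.to(device_cpu)
model_cpu = FirstNetwork_v2()
opt_cpu = optim.SGD(model_cpu.parameters(), lr=0.2)
tic = time.time()
print('Final loss', fit_v2(X_train_cpu, Y_train_cpu, model_cpu, opt_cpu, loss_fn))
toc = time.time()
print('Time taken on CPU', toc - tic)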
There you have it: we have successfully built a neural network for multi-class classification using the PyTorch torch.nn module. The entire code discussed in the article is available in this GitHub repository. Feel free to fork it or download it.
What’s Next?
If you want to step up the game and make it more challenging, you can use the make_moons function, which generates two interleaving half circles of data, essentially giving you non-linearly separable data. You can also add some Gaussian noise to the data to make it harder for the neural network to arrive at a non-linear decision boundary.
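Here is a minimal sketch of generating such data (the noise level is just an example value):

from sklearn.datasets import make_moons

#two interleaving half circles with some Gaussian noise added
data, labels = make_moons(n_samples=1000, noise=0.1, random_state=0)
print(data.shape, labels.shape)  #(1000, 2) (1000,)
plt.scatter(data[:, 0], data[:, 1], c=labels, cmap=my_cmap)
plt.show()

Note that make_moons produces only two classes, so the output layer of the network would need two units instead of four.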
Even with the current data points, you can try out a few scenarios (a combined sketch follows the list):
- Try out a deeper neural network, e.g. 2 hidden layers
- Try out different parameters in the optimizer (e.g. momentum, Nesterov)
- Try out other optimization methods (e.g. RMSProp and Adam) which are supported in optim
- Try out different initialization methods which are supported in nn.init
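Here is a hedged sketch that combines several of these ideas: a deeper Sequential network (the class name DeeperNetwork and all hyperparameter values are purely illustrative), weights re-initialized with one of the methods in nn.init, and RMSprop from optim, reusing the generic fit_v2 function:

#illustrative sketch: deeper network + nn.init + a different optimizer
class DeeperNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        torch.manual_seed(0)
        self.net = nn.Sequential(
            nn.Linear(2, 8),
            nn.Sigmoid(),
            nn.Linear(8, 8),   #second hidden layer
            nn.Sigmoid(),
            nn.Linear(8, 4),
            nn.Softmax(dim=1))
        for layer in self.net:
            if isinstance(layer, nn.Linear):
                nn.init.xavier_uniform_(layer.weight)  #one of the initializers in nn.init
                nn.init.zeros_(layer.bias)

    def forward(self, X):
        return self.net(X)

model = DeeperNetwork()
model.to(device)  #keep the model on the same device as the data
opt = optim.RMSprop(model.parameters(), lr=0.01)  #RMSProp from torch.optim
print('Final loss', fit_v2(X_train, Y_train, model, opt, F.cross_entropy))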
Conclusion
In this post, we built a simple neural network from scratch using PyTorch tensors and autograd. After that, we discussed the different classes of torch.nn that help us create and train neural networks while making our code shorter, more understandable, and more flexible. If you face any issues or have doubts while implementing the above code, feel free to ask in the comment section below or send me a message on LinkedIn citing this article.
Niranjan Kumar is working as a Senior Consultant Data Science at Allstate India. He is passionate about Deep Learning and Artificial Intelligence. He writes about the latest tools and technologies in the field of Deep Learning. He is one of the top writers in Artificial Intelligence at Medium. A Graduate of Praxis Business School, Niranjan Kumar holds a degree in Data Science. Feel free to contact him via LinkedIn for collaboration on projects