Feedforward neural networks are also known as **Multi-layered Network of Neurons** (MLN). These models are called feedforward because information travels only forward through the network: from the input nodes, through the hidden layers (one or many), and finally to the output nodes.

Traditional models such as the McCulloch-Pitts neuron, the Perceptron, and the Sigmoid neuron are limited to linear decision boundaries. To handle a complex non-linear decision boundary between the input and the output, we use a Multi-layered Network of Neurons.

**Outline**

In this post, we will discuss how to build a feedforward neural network using PyTorch. We will do this incrementally using the PyTorch `torch.nn` module. First, we will generate non-linearly separable data with four classes. Then we will build our simple feedforward neural network using PyTorch tensor functionality. After that, we will use the abstraction features available in the `torch.nn` module, such as Functional, Sequential, Linear and Optim, to make our neural network concise, flexible and efficient. Finally, we will move our network to CUDA and see how fast it performs.

**Note: This tutorial assumes you already have PyTorch installed on your local machine or know how to use PyTorch in Google Colab with CUDA support, and are familiar with the basics of tensor operations.** If you are not familiar with these concepts, kindly refer to my previous post linked below.

**Rest of the article is structured as follows:**

- Import libraries
- Generate non-linearly separable data
- Feedforward network using tensors and auto-grad
- Train our feedforward network
- NN.Functional
- NN.Parameter
- NN.Linear and Optim
- NN.Sequential
- Moving the Network to GPU


**Import libraries**

Before we start building our network, we need to import the required libraries. We import `numpy` for matrix multiplication and dot products between vectors, `matplotlib` to visualize the data, and functions from the `sklearn` package to generate data and evaluate the network's performance. We import `torch` for everything related to PyTorch.

```python
# required libraries
import numpy as np
import math
import matplotlib.pyplot as plt
import matplotlib.colors
import time
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error, log_loss
from tqdm import tqdm_notebook
from IPython.display import HTML
import warnings
from sklearn.preprocessing import OneHotEncoder
from sklearn.datasets import make_blobs

import torch

warnings.filterwarnings('ignore')
```

**Generate non-linearly separable data**

In this section, we will see how to randomly generate non-linearly separable data using `sklearn`.

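```python
# my_cmap is used below but is not defined in the text; this colormap is an
# assumption, and any matplotlib colormap would work here.
my_cmap = matplotlib.colors.LinearSegmentedColormap.from_list("", ["red", "yellow", "green"])

# generate data using the make_blobs function from sklearn
# centers=4 means four blobs, one per class
data, labels = make_blobs(n_samples=1000, centers=4, n_features=2, random_state=0)
print(data.shape, labels.shape)

# visualize the data
plt.scatter(data[:, 0], data[:, 1], c=labels, cmap=my_cmap)
plt.show()

# split the data into train and validation sets
X_train, X_val, Y_train, Y_val = train_test_split(data, labels, stratify=labels, random_state=0)
print(X_train.shape, X_val.shape, labels.shape)
```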

To generate the data we use `make_blobs`, which produces blobs of points drawn from Gaussian distributions. I have generated 1000 data points in 2D space in four blobs (`centers=4`), which gives a multi-class classification problem. Each data point has two inputs and a class label of 0, 1, 2 or 3.

Once the data is ready, I use the `train_test_split` function to split it into `training` and `validation` sets in the ratio 75:25.

**Feedforward network using tensors and auto-grad**

In this section, we will see how to build and train a simple neural network using Pytorch tensors and auto-grad. The network has six neurons in total — two in the first hidden layer and four in the output layer. For each of these neurons, pre-activation is represented by ‘**a**’ and post-activation is represented by ‘**h**’. In the network, we have a total of 18 parameters — 12 weight parameters and 6 bias terms.
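The parameter count above can be checked with a few lines of arithmetic. This is only a small sketch: the `layer_sizes` list and the `count_parameters` helper are illustrative names, not part of the original code.

```python
# Layer sizes for the network described above: 2 inputs -> 2 hidden -> 4 outputs.
layer_sizes = [2, 2, 4]

def count_parameters(sizes):
    """Return (weights, biases) for a fully connected network with these layer sizes."""
    # Each pair of consecutive layers contributes n_in * n_out weights.
    weights = sum(n_in * n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))
    # Every neuron after the input layer has one bias term.
    biases = sum(sizes[1:])
    return weights, biases

n_weights, n_biases = count_parameters(layer_sizes)
print(n_weights, n_biases, n_weights + n_biases)  # 12 6 18
```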

We will use the `map` function to convert the numpy arrays to PyTorch `tensors` in one line.

```python
# converting the numpy arrays to torch tensors
X_train, Y_train, X_val, Y_val = map(torch.tensor, (X_train, Y_train, X_val, Y_val))
print(X_train.shape, Y_train.shape)
```

After converting the data to tensors, we need to write a function that helps us to compute the forward pass for the network.

```python
# function for computing the forward pass in the network
def model(x):
    A1 = torch.matmul(x, weights1) + bias1   # (N, 2) x (2, 2) -> (N, 2)
    H1 = A1.sigmoid()                        # (N, 2)
    A2 = torch.matmul(H1, weights2) + bias2  # (N, 2) x (2, 4) -> (N, 4)
    # applying softmax at the output layer
    H2 = A2.exp() / A2.exp().sum(-1).unsqueeze(-1)  # (N, 4)
    return H2
```

We define a function `model` that characterizes the forward pass. For each neuron in the network, the forward pass involves two steps:

- Pre-activation represented by ‘a’: It is a weighted sum of inputs plus the bias.
- Activation represented by ‘h’: Activation function is Sigmoid function.

Since the network has a multi-class output, we use the Softmax activation instead of the Sigmoid activation at the output layer (the second layer), written with PyTorch's method chaining. The activation output of the final layer is the predicted value of our network. The function returns this value so that we can use it to calculate the loss of the network.
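As a sanity check, the hand-written softmax can be compared with PyTorch's built-in `torch.softmax`. The sketch below uses made-up pre-activation values in `a2`; they are illustrative and do not come from the trained network.

```python
import torch

# Made-up pre-activations for 3 samples and 4 classes (illustrative values only).
a2 = torch.tensor([[1.0, 2.0, 3.0, 4.0],
                   [0.0, 0.0, 0.0, 0.0],
                   [-1.0, 0.5, 2.0, 0.0]])

# Hand-written softmax, written the same way as in the forward pass.
h2_manual = a2.exp() / a2.exp().sum(-1).unsqueeze(-1)

# Built-in softmax over the class dimension.
h2_builtin = torch.softmax(a2, dim=-1)

print(torch.allclose(h2_manual, h2_builtin))  # the two agree up to rounding
print(h2_manual.sum(-1))                      # each row sums to 1
```

For very large scores the explicit `exp()` can overflow, so the built-in version is usually the safer choice outside of a from-scratch exercise like this one.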

```python
# function to calculate the loss
# y_hat -> predicted & y -> actual
def loss_fn(y_hat, y):
    return -(y_hat[range(y.shape[0]), y].log()).mean()

# function to calculate the accuracy of the model
def accuracy(y_hat, y):
    pred = torch.argmax(y_hat, dim=1)
    return (pred == y).float().mean()
```

Next, we have our loss function. In this case, instead of the mean square error, we are using the cross-entropy loss function. By using the cross-entropy loss we can find the difference between the predicted probability distribution and actual probability distribution to compute the loss of the network.
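To see what the cross-entropy computation does, it helps to run it on a tiny example. The sketch below uses made-up predicted distributions and labels; the values are illustrative only.

```python
import torch

# Illustrative predicted distributions for 3 samples over 4 classes.
y_hat = torch.tensor([[0.70, 0.10, 0.10, 0.10],
                      [0.25, 0.25, 0.25, 0.25],
                      [0.05, 0.05, 0.10, 0.80]])
y = torch.tensor([0, 2, 3])  # true class index for each sample

# Indexing with (row, label) pairs picks the probability assigned to the true class.
p_true = y_hat[range(y.shape[0]), y]

# Cross-entropy is the mean negative log of those probabilities.
loss = -p_true.log().mean()

print(p_true)       # probabilities 0.70, 0.25 and 0.80
print(loss.item())
```

A confident correct prediction (0.80) contributes little to the loss, while an uninformative one (0.25) contributes much more.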

**Train our feed-forward network**

We will now train the feedforward network we created on our data. First, we initialize all the weights in the network using Xavier initialization, which draws the weights from a distribution with zero mean and a specific variance (by multiplying with 1/sqrt(n), where n is the number of inputs to the layer). Since each layer here has only two inputs, we divide the weights by sqrt(2). We then call the `model` function on the training data for 10000 epochs with the learning rate set to 0.2.

```python
# set the seed
torch.manual_seed(0)

# initialize the weights and biases using Xavier initialization
weights1 = torch.randn(2, 2) / math.sqrt(2)
weights1.requires_grad_()
bias1 = torch.zeros(2, requires_grad=True)

weights2 = torch.randn(2, 4) / math.sqrt(2)
weights2.requires_grad_()
bias2 = torch.zeros(4, requires_grad=True)

# set the parameters for training the model
learning_rate = 0.2
epochs = 10000

X_train = X_train.float()
Y_train = Y_train.long()

loss_arr = []
acc_arr = []

# training the network
for epoch in range(epochs):
    y_hat = model(X_train)          # compute the predicted distribution
    loss = loss_fn(y_hat, Y_train)  # compute the loss of the network
    loss.backward()                 # backpropagate the gradients
    loss_arr.append(loss.item())
    acc_arr.append(accuracy(y_hat, Y_train))

    with torch.no_grad():  # update the weights and biases
        weights1 -= weights1.grad * learning_rate
        bias1 -= bias1.grad * learning_rate
        weights2 -= weights2.grad * learning_rate
        bias2 -= bias2.grad * learning_rate
        weights1.grad.zero_()
        bias1.grad.zero_()
        weights2.grad.zero_()
        bias2.grad.zero_()
```

For all the weights and biases, we set `requires_grad = True` because we want to track all the operations performed on those tensors. After that, I set the parameter values required for training the network and convert `X_train` to float, because the numpy data arrives as 64-bit floats while PyTorch's default tensor type (and therefore our weights) is 32-bit float. Because we use `Y_train` as an index into another tensor while calculating the loss, I convert it into a `long` tensor.

In each epoch, we pass the entire training data through the `model` function to compute the forward pass. We then apply the loss function to the output and call `loss.backward()` to propagate the loss backward through the network. `loss.backward()` computes the gradients of the loss with respect to the model parameters, in this case the `weights` and `bias` tensors. We then use these gradients to update the weights and biases. We do this within the `torch.no_grad()` context manager because the update itself should not be recorded in the computation graph.

Finally, we set the gradients to zero so that we are ready for the next loop. Otherwise, our gradients would keep a running tally of all the operations that had happened (i.e. `loss.backward()` adds the gradients to whatever is already stored, rather than replacing them).
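The accumulation behaviour is easy to see on a single scalar. The following is a minimal sketch with a made-up variable `w`, unrelated to the network's weights:

```python
import torch

w = torch.tensor(3.0, requires_grad=True)

# First backward pass: d(w^2)/dw = 2w = 6.
(w ** 2).backward()
first = w.grad.item()

# Second backward pass without zeroing: the new gradient is added to the old one.
(w ** 2).backward()
accumulated = w.grad.item()

# Reset the gradient, then run backward again: it is 6 once more.
w.grad.zero_()
(w ** 2).backward()
after_reset = w.grad.item()

print(first, accumulated, after_reset)  # 6.0 12.0 6.0
```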

That’s it: we’ve created and trained a simple neural network entirely from scratch! Let’s compute the training and validation accuracy of the model to evaluate its performance and check whether it can be improved by changing the number of epochs or the learning rate.

**Using NN.Functional**

In this section, we will discuss how we can refactor our code by taking advantage of PyTorch’s `nn` classes to make it more concise and flexible. First, we import `torch.nn.functional` into our namespace with the following command.

```python
import torch.nn.functional as F
```

This module contains a wide range of loss and activation functions. The only change in our code is that, instead of the handwritten loss function, we use the built-in cross-entropy function from `torch.nn.functional`:

```python
loss = F.cross_entropy(y_hat, Y_train)
```

**Putting it together**

```python
torch.manual_seed(0)

weights1 = torch.randn(2, 2) / math.sqrt(2)
weights1.requires_grad_()
bias1 = torch.zeros(2, requires_grad=True)

weights2 = torch.randn(2, 4) / math.sqrt(2)
weights2.requires_grad_()
bias2 = torch.zeros(4, requires_grad=True)

learning_rate = 0.2
epochs = 10000

loss_arr = []
acc_arr = []

for epoch in range(epochs):
    y_hat = model(X_train)                  # compute the predicted distribution
    loss = F.cross_entropy(y_hat, Y_train)  # just replace the loss function with the built-in function
    loss.backward()
    loss_arr.append(loss.item())
    acc_arr.append(accuracy(y_hat, Y_train))

    with torch.no_grad():
        weights1 -= weights1.grad * learning_rate
        bias1 -= bias1.grad * learning_rate
        weights2 -= weights2.grad * learning_rate
        bias2 -= bias2.grad * learning_rate
        weights1.grad.zero_()
        bias1.grad.zero_()
        weights2.grad.zero_()
        bias2.grad.zero_()
```

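Let’s check whether the loss matches the earlier run by training the network with the same number of epochs and learning rate.

- Loss of the network using the handwritten loss function: 1.54
- Loss of the network using the built-in `F.cross_entropy`: 1.411

The two values are not identical. One likely reason is that `F.cross_entropy` expects raw scores (logits) and applies log-softmax internally, whereas our `model` already returns softmax probabilities, so softmax is effectively applied twice.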
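`F.cross_entropy` applies log-softmax internally, so it expects raw scores rather than probabilities. The sketch below uses made-up logits (not values from the trained network) to compare the handwritten loss with the built-in one in both situations:

```python
import torch
import torch.nn.functional as F

# Illustrative raw scores (logits) for 3 samples and 4 classes.
logits = torch.tensor([[2.0, 0.5, 0.1, -1.0],
                       [0.0, 1.0, 0.0, 1.0],
                       [-0.5, 0.2, 3.0, 0.3]])
y = torch.tensor([0, 3, 2])

probs = torch.softmax(logits, dim=-1)

# Handwritten loss on probabilities, written like loss_fn in this article.
manual = -(probs[range(y.shape[0]), y].log()).mean()

# Built-in loss on logits: matches the handwritten value.
builtin_on_logits = F.cross_entropy(logits, y)

# Built-in loss on probabilities: softmax is effectively applied twice.
builtin_on_probs = F.cross_entropy(probs, y)

print(manual.item(), builtin_on_logits.item(), builtin_on_probs.item())
```

If matching values are wanted, one option is to have the forward pass return the pre-softmax scores and let `F.cross_entropy` handle the softmax.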

**Using NN.Parameter**

Next up, we’ll use `nn.Module` and `nn.Parameter` for a clearer and more concise training loop. We will write a class `FirstNetwork` for our model that subclasses `nn.Module`. In this case, we want a class that holds our weights, biases, and a method for the forward step.

```python
import torch.nn as nn

class FirstNetwork(nn.Module):

    def __init__(self):
        super().__init__()
        torch.manual_seed(0)
        # wrap all the weights and biases inside nn.Parameter()
        self.weights1 = nn.Parameter(torch.randn(2, 2) / math.sqrt(2))
        self.bias1 = nn.Parameter(torch.zeros(2))
        self.weights2 = nn.Parameter(torch.randn(2, 4) / math.sqrt(2))
        self.bias2 = nn.Parameter(torch.zeros(4))

    def forward(self, X):
        a1 = torch.matmul(X, self.weights1) + self.bias1
        h1 = a1.sigmoid()
        a2 = torch.matmul(h1, self.weights2) + self.bias2
        h2 = a2.exp() / a2.exp().sum(-1).unsqueeze(-1)
        return h2
```

The `__init__` function (the constructor) initializes the parameters of the network, and in this case we wrap the weights and biases inside `nn.Parameter`. Because they are wrapped in `nn.Parameter`, they are automatically added to the module's list of parameters.
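The automatic registration can be observed directly. In this minimal sketch, `TinyModule` is a hypothetical class made up for illustration; it is not part of the article's network.

```python
import torch
import torch.nn as nn

class TinyModule(nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(2, 4))  # registered as a parameter
        self.bias = nn.Parameter(torch.zeros(4))        # registered as a parameter
        self.scratch = torch.zeros(4)                   # plain tensor: not registered

tiny = TinyModule()
names = [name for name, _ in tiny.named_parameters()]
n_values = sum(p.numel() for p in tiny.parameters())

print(names)     # ['weights', 'bias']
print(n_values)  # 12
```

Only the attributes wrapped in `nn.Parameter` show up, which is why `model.parameters()` can later be handed to a training loop or an optimizer without listing the tensors by hand.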

Since we’re now using an object instead of just using a function, we first have to instantiate our model:

```python
# we first have to instantiate our model
model = FirstNetwork()
```

Next, we write our training loop inside a function called `fit` that accepts the number of epochs and the learning rate as its arguments. Inside `fit` we call our `model` object to execute the forward pass; behind the scenes, PyTorch calls the model's `forward` method automatically.

```python
def fit(epochs=10000, learning_rate=0.2):
    loss_arr = []
    acc_arr = []
    for epoch in range(epochs):
        y_hat = model(X_train)                  # forward pass
        loss = F.cross_entropy(y_hat, Y_train)  # loss calculation
        loss_arr.append(loss.item())
        acc_arr.append(accuracy(y_hat, Y_train))
        loss.backward()                         # backpropagation
        with torch.no_grad():                   # updating the parameters
            for param in model.parameters():
                param -= learning_rate * param.grad
            model.zero_grad()                   # setting the gradients to zero
```

In our training loop, we no longer update each parameter by name and zero out its gradient separately. Instead, we take advantage of `model.parameters()` and `model.zero_grad()` (both defined by PyTorch for `nn.Module`) to update all the parameters of the model in one shot, which makes those steps more concise and less prone to the error of forgetting some of our parameters.

**One important point to note from the programming standpoint is that we have now decoupled the model from the fit function. The fit function knows nothing about the internals of the model; it applies the same logic to whatever model is defined.**

**Using NN.Linear and Optim**

In the previous sections, we manually defined and initialized `self.weights` and `self.bias` and computed the forward pass ourselves. This process can be abstracted away with the PyTorch class `nn.Linear`, a linear layer that does all of that for us.

```python
class FirstNetwork_v1(nn.Module):

    def __init__(self):
        super().__init__()
        torch.manual_seed(0)
        self.lin1 = nn.Linear(2, 2)  # automatically defines weights and biases
        self.lin2 = nn.Linear(2, 4)

    def forward(self, X):
        a1 = self.lin1(X)   # computes the dot product and adds the bias
        h1 = a1.sigmoid()
        a2 = self.lin2(h1)  # computes the dot product and adds the bias
        h2 = a2.exp() / a2.exp().sum(-1).unsqueeze(-1)
        return h2
```

`torch.nn.Linear(in_features, out_features)` takes two mandatory parameters.

- **in_features**: size of each input sample
- **out_features**: size of each output sample

To achieve this abstraction, in the `__init__` function we declare `self.lin1 = nn.Linear(2,2)`, because the input and output sizes of the first hidden layer are both 2. `nn.Linear(2,2)` automatically defines weights of size (2,2) and a bias of size 2. Similarly, for the second layer, we declare another variable assigned to `nn.Linear(2,4)`, because two inputs go into that layer and four outputs come out.
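One detail worth knowing when inspecting these layers is that `nn.Linear` stores its weight as `(out_features, in_features)` and multiplies by the transpose. The sketch below checks this on a stand-alone layer; the random input `x` is made up for illustration.

```python
import torch
import torch.nn as nn

lin2 = nn.Linear(2, 4)  # 2 inputs, 4 outputs, like the second layer

# PyTorch stores the weight as (out_features, in_features).
print(lin2.weight.shape)  # torch.Size([4, 2])
print(lin2.bias.shape)    # torch.Size([4])

# The layer computes x @ weight.T + bias.
x = torch.randn(5, 2)
manual = torch.matmul(x, lin2.weight.T) + lin2.bias
same = torch.allclose(lin2(x), manual, atol=1e-6)
print(same)
```

So the layer's weight is the transpose of the `(2, 4)` matrix used in the hand-written version, while the result of the forward pass has the same `(N, 4)` shape.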

Now our `forward` method looks simple: we no longer need to compute the dot product and add the bias manually. We can simply call `self.lin1()` and `self.lin2()`. We instantiate our model and calculate the loss in the same way as before:

```python
# fit() refers to the global name `model`, so the new network is bound to it
model = FirstNetwork_v1()  # object
```

We can still use the same `fit` method as before.

**Using torch.optim**

So far, we have been using Stochastic Gradient Descent in our training and updating the parameters manually like this:

```python
# updating the parameters
for param in model.parameters():
    param -= learning_rate * param.grad
```

PyTorch also has a package, `torch.optim`, with various optimization algorithms. We can use the `step` method of our optimizer to take an update step, instead of manually updating each parameter.

```python
from torch import optim

opt = optim.SGD(model.parameters(), lr=learning_rate)  # define optimizer
```

In this problem, we will use `optim.SGD()`, Stochastic Gradient Descent. The optimizer takes the parameters of the model and the learning rate as its arguments. We can also use `optim` to implement Nesterov accelerated gradient descent, Adam, and various other optimization algorithms; see the documentation for details.
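Because all optimizers in `torch.optim` share the same `step` / `zero_grad` interface, swapping one for another only changes the line that constructs it. The sketch below illustrates this on a stand-in model with random data; the hyperparameter values are illustrative, not tuned.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim

torch.manual_seed(0)
net = nn.Linear(2, 4)  # stand-in model, for illustration only

# Plain SGD, SGD with Nesterov momentum, and Adam share the same interface.
optimizers = {
    "sgd": optim.SGD(net.parameters(), lr=0.2),
    "nesterov": optim.SGD(net.parameters(), lr=0.2, momentum=0.9, nesterov=True),
    "adam": optim.Adam(net.parameters(), lr=0.01),
}

x = torch.randn(8, 2)          # random inputs
y = torch.randint(0, 4, (8,))  # random class labels in {0, 1, 2, 3}

losses = {}
for name, opt in optimizers.items():
    loss = F.cross_entropy(net(x), y)
    loss.backward()
    opt.step()       # one parameter update
    opt.zero_grad()  # clear gradients before the next step
    losses[name] = loss.item()

print(losses)
```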

```python
def fit_v1(epochs=10000, learning_rate=0.2, title=""):
    loss_arr = []
    acc_arr = []
    opt = optim.SGD(model.parameters(), lr=learning_rate)  # define optimizer
    for epoch in range(epochs):
        y_hat = model(X_train)
        loss = F.cross_entropy(y_hat, Y_train)
        loss_arr.append(loss.item())
        acc_arr.append(accuracy(y_hat, Y_train))
        loss.backward()
        opt.step()       # updating each parameter
        opt.zero_grad()  # resets the gradients to 0
```

The only change in our training loop is that after `loss.backward()`, instead of manually updating each parameter, we simply call:

```python
opt.step()
opt.zero_grad()
```

We use the `step` method of our optimizer to take an update step, and then `opt.zero_grad()` resets the gradients to 0. We need to call it before computing the gradients for the next batch.

**Using NN.Sequential**

In this section, we will see another important feature of the `torch.nn` module that helps simplify our code: `nn.Sequential`. A `Sequential` object runs the series of transformations contained within it in order. To use `nn.Sequential`, we define a custom network `self.net` in the `__init__` function.

```python
class FirstNetwork_v2(nn.Module):

    def __init__(self):
        super().__init__()
        torch.manual_seed(0)
        self.net = nn.Sequential(  # sequential operation
            nn.Linear(2, 2),
            nn.Sigmoid(),
            nn.Linear(2, 4),
            nn.Softmax(dim=-1))    # explicit dim avoids the implicit-dimension warning

    def forward(self, X):
        return self.net(X)
```

In `self.net` we specify the series of operations that our data goes through in the network, in order. Now our `forward` function looks very simple: it just applies `self.net` to the input X.

We’ll clean up our `fit` function so we can reuse it in the future.

```python
model = FirstNetwork_v2()  # object

def fit_v2(x, y, model, opt, loss_fn, epochs=10000):
    """Generic function for training a model."""
    for epoch in range(epochs):
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
        opt.zero_grad()
    return loss.item()

# define loss
loss_fn = F.cross_entropy
# define optimizer
opt = optim.SGD(model.parameters(), lr=0.2)
# training model
fit_v2(X_train, Y_train, model, opt, loss_fn)
```

Our new fit function `fit_v2` is now fully independent of the model, optimizer, loss function, number of epochs, and input data. This gives us the flexibility to change any of them without worrying about our training loop. That is the power of abstraction.
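To illustrate that flexibility, the sketch below reuses the same generic loop with a different architecture and optimizer. The data is a synthetic stand-in generated with `torch.randn` (not the blobs used in this article), and the `wider` model and its hyperparameters are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim

def fit_v2(x, y, model, opt, loss_fn, epochs=10000):
    """Same generic training loop as in the article."""
    for epoch in range(epochs):
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
        opt.zero_grad()
    return loss.item()

torch.manual_seed(0)

# Synthetic stand-in data: 4 classes, one per quadrant of the plane.
x = torch.randn(64, 2)
y = (x[:, 0] > 0).long() + 2 * (x[:, 1] > 0).long()

# A different architecture and optimizer, with no change to fit_v2.
# This model returns raw scores, which is what F.cross_entropy expects.
wider = nn.Sequential(nn.Linear(2, 8), nn.Sigmoid(), nn.Linear(8, 4))
opt = optim.Adam(wider.parameters(), lr=0.05)

initial_loss = F.cross_entropy(wider(x), y).item()
final_loss = fit_v2(x, y, wider, opt, F.cross_entropy, epochs=200)
print(initial_loss, final_loss)
```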

**Moving the Network to GPU**

In this final section, we will discuss how we can leverage GPU to train our model. First check that your GPU is working in Pytorch:

```python
print(torch.cuda.is_available())
```

Create a device object for the GPU so that we can reference it:

```python
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
```

Move the inputs and the model to the GPU:

```python
# moving inputs to GPU
X_train = X_train.to(device)
Y_train = Y_train.to(device)

model = FirstNetwork_v2()
model.to(device)  # moving the network to GPU

# rebuild the optimizer so that it tracks the new model's parameters
opt = optim.SGD(model.parameters(), lr=0.2)

# calculate time
tic = time.time()
print('Final loss', fit_v2(X_train, Y_train, model, opt, loss_fn))
toc = time.time()
print('Time taken', toc - tic)
```
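When timing GPU code, keep in mind that CUDA operations are launched asynchronously, so the clock should only be read once the GPU has finished its work. The sketch below shows one way to do that; the `timed` helper is a hypothetical name introduced here, and the matrix product simply stands in for a training run. It falls back to the CPU when no GPU is available.

```python
import time
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

def timed(fn):
    """Run fn() and return (result, seconds), waiting for pending GPU work."""
    if device.type == "cuda":
        torch.cuda.synchronize()  # finish earlier GPU work before starting the clock
    tic = time.time()
    result = fn()
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for the timed GPU work to finish
    return result, time.time() - tic

a = torch.randn(512, 512, device=device)
product, seconds = timed(lambda: a @ a)
print(device, seconds)
```

For a network as small as the one in this article, the GPU run is not guaranteed to be faster than the CPU run, since the cost of launching kernels can outweigh the computation itself.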

There you have it: we have successfully built our neural network for multi-class classification using the PyTorch `torch.nn` module. The entire code discussed in the article is present in this GitHub repository. Feel free to fork it or download it.

**What’s Next?**

If you want to take this a step further and make the problem harder, you can use the `make_moons` function, which generates two interleaving half circles and so gives you non-linearly separable data. You can also add some Gaussian noise to the data to make it harder for the neural network to arrive at a non-linear decision boundary.

Even with the current data points, you can try out a few scenarios:

- Try out a deeper neural network, e.g. 2 hidden layers
- Try out different parameters in the optimizer (e.g. momentum, Nesterov)
- Try out other optimization methods (e.g. RMSProp and Adam), which are supported in `optim`
- Try out different initialization methods, which are supported in `nn.init`
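As a starting point for the first and last suggestions, here is a rough sketch of a network with two hidden layers whose weights are initialized through `nn.init`. The hidden sizes and the `init_xavier` helper are arbitrary choices made for illustration, not part of the original code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two hidden layers instead of one; the hidden sizes are arbitrary choices.
deeper = nn.Sequential(
    nn.Linear(2, 8), nn.Sigmoid(),
    nn.Linear(8, 8), nn.Sigmoid(),
    nn.Linear(8, 4),
)

def init_xavier(module):
    """Apply Xavier (Glorot) uniform initialization to every linear layer."""
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        nn.init.zeros_(module.bias)

deeper.apply(init_xavier)  # apply() visits every submodule

scores = deeper(torch.randn(5, 2))
print(scores.shape)  # torch.Size([5, 4])
```

This network returns raw scores without a final softmax, which is the form `F.cross_entropy` expects.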

**Conclusion**

In this post, we built a simple neural network from scratch using PyTorch tensors and autograd. After that, we discussed different classes of `torch.nn` that help us create and train neural networks, making our code shorter, more understandable, and more flexible. If you have any issues or doubts while implementing the above code, feel free to ask them in the comment section below or send me a message on LinkedIn citing this article.

Niranjan Kumar is working as a Senior Consultant Data Science at Allstate India. He is passionate about Deep Learning and Artificial Intelligence. He writes about the latest tools and technologies in the field of Deep Learning. He is one of the top writers in Artificial Intelligence at Medium. A Graduate of Praxis Business School, Niranjan Kumar holds a degree in Data Science. Feel free to contact him via LinkedIn for collaboration on projects