Optimization in machine learning is the process of updating the weights and biases of a model to minimize the model’s overall loss. In a neural network, these updates happen during backpropagation.
In this article, we will learn why such an optimization technique is needed and what gradient descent optimization is, and at the end, we will see how it works with a regression model.
Why do we need such optimizations?
Let’s understand it through a linear regression model. The hypothesis of a regression model for data with n features and m data points is:
h(x) = m_1 x_1 + m_2 x_2 + m_3 x_3 + ... + m_n x_n + m_{n+1} x_{n+1}
where m_i is the weight of the independent variable x_i. To fold the bias into the same form, we add an extra feature x_{n+1} whose value is 1 for every data point, so m_{n+1} is the bias term.
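The bias trick above can be sketched in a few lines of Python. This is only a minimal illustration; the function name `hypothesis` and the sample numbers are assumptions made for the example.

```python
def hypothesis(weights, features):
    """Compute h(x) = m_1*x_1 + ... + m_n*x_n + m_{n+1}*1.

    `weights` holds n + 1 values (the last one is the bias);
    `features` holds the n original feature values of one data point.
    """
    if len(weights) != len(features) + 1:
        raise ValueError("expected one weight per feature plus a bias weight")
    # Append the constant feature x_{n+1} = 1 so the bias is just another weight.
    extended = list(features) + [1.0]
    return sum(w * x for w, x in zip(weights, extended))


# Two features with weights 2 and 3, and a bias of 0.5 (illustrative values).
prediction = hypothesis([2.0, 3.0, 0.5], [1.0, 4.0])
print(prediction)  # 2*1 + 3*4 + 0.5*1 = 14.5
```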
To find the weight vector w, we can use the closed-form solution (the normal equation):
w = (x^T x)^{-1} x^T y
where x is the independent variable matrix of dimension m × (n+1)
and y is the dependent variable matrix of dimension m × 1
The computational cost of solving the above equation for the optimal weights is too high when the data is huge. The matrix inversion alone has complexity O(n^3), where n is the number of features. That’s why we need an iterative optimization technique.
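For a single feature plus the bias column, the closed-form solution can be written out by hand, which makes the matrix inverse explicit. The sketch below works only under that assumption; the function name `normal_equation_1d` and the sample data are hypothetical. With n features, the matrix to invert is (n+1) × (n+1), which is where the cubic cost comes from.

```python
def normal_equation_1d(xs, ys):
    """Solve w = (X^T X)^{-1} X^T y when each row of X is [x_i, 1]."""
    n_points = len(xs)
    # Entries of the 2x2 matrix X^T X = [[sxx, sx], [sx, n_points]].
    sxx = sum(x * x for x in xs)
    sx = sum(xs)
    # Entries of the 2x1 vector X^T y = [sxy, sy].
    sxy = sum(x * y for x, y in zip(xs, ys))
    sy = sum(ys)
    # A 2x2 matrix is inverted through its determinant.
    det = sxx * n_points - sx * sx
    if det == 0:
        raise ValueError("X^T X is singular; the x values must not all be equal")
    slope = (n_points * sxy - sx * sy) / det
    bias = (sxx * sy - sx * sxy) / det
    return slope, bias


# Illustrative data generated from y = 2x + 1.
slope, bias = normal_equation_1d([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(slope, bias)  # 2.0 1.0
```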
What is Gradient Descent Optimization?
Gradient descent is an iterative optimization algorithm used to find the parameter values that minimize a cost function. Instead of solving for the parameters directly, it moves them step by step toward their optimal values.
Let’s consider the hypothesis for a single-feature linear regression:
h(x) = mx + c
Let the loss function be the squared error, which becomes the mean squared error when averaged over all data points:
Cost = (y – h(x))^2
where y is the true output.
We start with a random initial value of the parameter m. The goal is to move this initial value to the optimal value of m, as shown above. How can we do this?
This can be done with the help of the slope of the cost curve at a particular point. If the slope at the initial point is negative, subtracting it from m increases m and moves it toward the optimal value (and if the slope is positive, subtracting it decreases m), i.e.
m = m – d(cost)/d(m)
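The slope in this update follows from the cost by the chain rule: d(cost)/d(m) = -2 * x * (y – h(x)). Here is a small sketch of one update step; the function names and the single data point (x = 0.5, y = 1, with c fixed at 0, so the optimal m is 2) are assumptions chosen for illustration.

```python
def cost(m, c, x, y):
    """Squared error for one data point: (y - h(x))^2 with h(x) = m*x + c."""
    return (y - (m * x + c)) ** 2


def slope_wrt_m(m, c, x, y):
    """d(cost)/d(m) = -2 * x * (y - (m*x + c)), obtained with the chain rule."""
    return -2.0 * x * (y - (m * x + c))


# One illustrative data point; with c = 0 the cost is zero at m = y / x = 2.
x, y, c = 0.5, 1.0, 0.0
m = 0.5                           # initial guess, left of the optimum
slope = slope_wrt_m(m, c, x, y)   # negative, because m is below the optimum
m_next = m - slope                # subtracting a negative slope increases m

print(slope, m_next)                          # -0.75 1.25
print(cost(m, c, x, y), cost(m_next, c, x, y))  # the cost drops after the step
```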
Directly subtracting the slope can make m overshoot, so that it just oscillates around its optimal value and never reaches it, as shown below:
To solve this problem, we include a learning rate that sets the step size used to update the weights in each iteration:
m = m – (alpha * d(cost)/d(m))
where alpha is the learning rate, which lets the value of m converge to the optimal one. Alpha is a small positive number, typically chosen between 0 and 1. If it is too high, the updates overshoot and may never converge; if it is too low, convergence becomes very slow. A commonly used starting value is 0.001.
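The effect of the learning rate can be seen in a tiny experiment. The sketch below is an illustration, not a reference implementation: it assumes a single data point (x = 1, y = 2) and no bias, so the optimal m is 2, and the helper name `run_updates` is made up for this example.

```python
def run_updates(alpha, steps, m=0.0, x=1.0, y=2.0):
    """Apply m = m - alpha * d(cost)/d(m) repeatedly and return every m visited.

    Assumes one data point and no bias, so cost = (y - m*x)^2.
    """
    history = [m]
    for _ in range(steps):
        slope = -2.0 * x * (y - m * x)  # d(cost)/d(m)
        m = m - alpha * slope
        history.append(m)
    return history


# alpha = 1 is the same as subtracting the raw slope: m jumps around the optimum.
raw = run_updates(alpha=1.0, steps=4)
print(raw)  # [0.0, 4.0, 0.0, 4.0, 0.0]

# A smaller alpha shrinks every step, so m settles close to 2.
damped = run_updates(alpha=0.1, steps=50)
print(damped[-1])
```

With alpha = 1 the value of m bounces between 0 and 4 forever, while with alpha = 0.1 the distance to the optimum shrinks by a constant factor in every iteration.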
See the convergence of the parameter m below:
The same process is used for the bias term c. Its update equation is:
c = c – (alpha * d(cost)/d(c))
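Putting the two update equations together gives a complete training loop. The following is a minimal sketch under a few assumptions: the cost is averaged over all data points (mean squared error), the parameters start at 0 instead of a random value so the run is reproducible, and the function name, the default alpha, and the iteration count are illustrative choices.

```python
def gradient_descent(xs, ys, alpha=0.05, iterations=2000):
    """Fit h(x) = m*x + c by gradient descent on the mean squared error."""
    m, c = 0.0, 0.0  # fixed start for reproducibility (the text uses a random one)
    n_points = len(xs)
    for _ in range(iterations):
        # Error y - h(x) of the current line on every data point.
        errors = [y - (m * x + c) for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        d_cost_d_m = -2.0 * sum(e * x for e, x in zip(errors, xs)) / n_points
        d_cost_d_c = -2.0 * sum(errors) / n_points
        # Update both parameters using the slopes computed above.
        m = m - alpha * d_cost_d_m
        c = c - alpha * d_cost_d_c
    return m, c


# Illustrative data generated from y = 2x + 1.
m, c = gradient_descent([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(round(m, 4), round(c, 4))
```

Both slopes are computed before either parameter changes, so m and c are updated from the same point on the cost surface. For this data, a noticeably larger alpha (above roughly 0.12) makes the updates diverge, which is the overshooting problem described earlier.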
In practice, you rarely have to implement this optimization technique yourself. It is already available in Python libraries like scikit-learn, where models such as SGDRegressor use stochastic gradient descent to optimize their weights.
Thank you for the read! I hope it was helpful.