You may have heard of an optimization technique called Gradient Descent. Now suppose you are dealing with a massive dataset of one million data points. Basic (batch) gradient descent will converge too slowly to the global minimum here, because to update the model's weights it does a forward pass over the whole dataset and then propagates backward. So in every epoch, the weights are updated only once.
This article will discuss different variants of Gradient Descent that work far better in practice, and how to decide which one to use. If you are not familiar with gradient descent, go through the article linked below:
Let’s discuss the optimizers available in Keras:
1. SGD (Stochastic Gradient Descent):
Stochastic Gradient Descent is the same as plain GD, with one difference: in each epoch, it does a forward pass for a single data point and then backpropagates to update the weights. In other words, in each epoch, the number of forward-and-backward cycles equals the number of entries in the dataset.
It is also slow, but it takes less RAM than plain GD and each update is computationally cheaper.
Mini-Batch SGD: The difference here is that each cycle of forward and backward propagation uses a batch of data of a fixed size to update the weights. It is computationally more efficient than SGD.
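To make the difference concrete, the snippet below counts how many weight updates each scheme performs on the one-million-point dataset from the introduction. The batch size of 32 is an arbitrary assumption for illustration:

```python
# Count weight updates per training run for the three schemes:
# batch GD updates once per epoch, SGD once per data point,
# and mini-batch SGD once per batch.
n_points, batch_size, epochs = 1_000_000, 32, 10

updates_batch_gd = epochs * 1                            # one update per epoch
updates_sgd = epochs * n_points                          # one update per point
updates_minibatch = epochs * -(-n_points // batch_size)  # one per batch (ceil division)

print(updates_batch_gd, updates_sgd, updates_minibatch)
# → 10 10000000 312500
```

Mini-batch SGD sits between the two extremes: far more updates than batch GD, but each one cheap and vectorizable, unlike pure SGD's per-point updates.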
SGD with momentum: The momentum term helps SGD move in the relevant direction while damping the oscillations (smoothing out the noise), so it reaches the global minimum faster. It uses the concept of a moving average, which takes the previous steps into account, and updates the weights with the formula below.
Here L is the loss function, the beta value lies in the range [0, 1] and decides how much importance the previous step gets, and alpha is the learning rate.
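As a sketch, here is one common moving-average formulation of the update on a toy 1-D loss L(w) = (w − 3)², using the article's beta and alpha names. (Keras's own `SGD` uses a slightly different but equivalent form, `v = momentum * v - lr * grad`.)

```python
# SGD with momentum on a 1-D quadratic loss L(w) = (w - 3)^2,
# whose gradient is dL/dw = 2 * (w - 3); the minimum is at w = 3.
def grad(w):
    return 2.0 * (w - 3.0)

w, v = 0.0, 0.0          # weight and velocity (moving average of gradients)
beta, alpha = 0.9, 0.1   # momentum factor and learning rate

for _ in range(200):
    v = beta * v + (1 - beta) * grad(w)  # moving average: previous step + new gradient
    w = w - alpha * v                    # weight update

print(round(w, 3))
# → 3.0
```

The velocity `v` accumulates a smoothed gradient direction, so consecutive noisy gradients partially cancel instead of making the weight zigzag.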
Nesterov’s Accelerated gradient: Now, suppose you have a convex-shaped bucket, and you want to throw a ball down the slope of the bucket so that it reaches the bottom in minimum time. If the ball is not that smart, it will overshoot and fail to reach the bottom in minimum time. The same happens in SGD with momentum.
So what Nesterov did was subtract the previous step factor from the current position before computing the slope, so that when the slope changes, the momentum adjusts and the update doesn’t overshoot.
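A minimal sketch of that "look-ahead" idea on the same toy quadratic loss, assuming one standard formulation of NAG (gradient evaluated at the position the momentum is about to carry us to):

```python
# Nesterov's accelerated gradient on L(w) = (w - 3)^2.
# Unlike plain momentum, the gradient is evaluated at the
# look-ahead point (w - alpha * beta * v), i.e. with the previous
# step factor subtracted first, so the momentum self-corrects
# before it overshoots the minimum.
def grad(w):
    return 2.0 * (w - 3.0)

w, v = 0.0, 0.0
beta, alpha = 0.9, 0.05

for _ in range(300):
    lookahead = w - alpha * beta * v  # peek where momentum is taking us
    v = beta * v + grad(lookahead)    # accumulate with the corrected gradient
    w = w - alpha * v                 # weight update
```

Because the correction happens before the slope is measured, NAG "brakes" earlier than plain momentum when the slope is about to reverse.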
Visualization of SGD with momentum vs SGD with Nesterov’s accelerated momentum:
tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.0, nesterov=False)
2. Adagrad (Adaptive Gradient Descent):
We saw that in SGD the learning rate is fixed; it doesn’t change from step to step. What if we change it dynamically? First, it speeds up convergence, and it works especially well for sparse data: for frequently occurring features, Adagrad decreases the learning rate, and for rare ones it increases it.
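A sketch of the Adagrad idea on the same 1-D toy loss: keep a running sum of squared gradients and divide the learning rate by its square root, so weights that keep receiving large gradients take smaller and smaller steps. The starting values mirror Keras's `initial_accumulator_value` and `epsilon` arguments:

```python
# Adagrad sketch on L(w) = (w - 3)^2.
def grad(w):
    return 2.0 * (w - 3.0)

w = 0.0
lr, eps = 1.0, 1e-7
accum = 0.1  # running sum of squared gradients (Keras: initial_accumulator_value)

for _ in range(500):
    g = grad(w)
    accum += g * g                         # accumulate squared gradients
    w = w - lr * g / (accum ** 0.5 + eps)  # effective learning rate shrinks over time
```

In a real model, `accum` is kept per weight, which is what makes the learning rate adaptive feature by feature.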
tf.keras.optimizers.Adagrad(learning_rate=0.001, initial_accumulator_value=0.1, epsilon=1e-07, name="Adagrad")
In the next part, we will cover the remaining optimizers: Adadelta, Adam, and RMSprop.
Thank you for the read! I hope it was helpful.