Optimizers in Keras Part – 1

Source: http://web.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture4.pdf

You have probably heard of an optimization technique called Gradient Descent. Now, suppose you are dealing with a massive dataset of one million data points. Basic (full-batch) gradient descent will be too slow to converge in this case, because every weight update requires a forward pass over the whole dataset followed by a backward pass. So, in every epoch, the weights are updated only once.
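To make the "one update per epoch" point concrete, here is a minimal sketch of full-batch gradient descent in plain Python. The toy data, the single weight `w`, and the helper names `mse_gradient` and `train_full_batch` are assumptions made for illustration; they are not part of Keras.

```python
# Toy data generated by y = 2 * x (illustrative only).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def mse_gradient(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def train_full_batch(xs, ys, learning_rate=0.05, epochs=50):
    w = 0.0
    updates = 0
    for _ in range(epochs):
        # The gradient is computed over the whole dataset ...
        grad = mse_gradient(w, xs, ys)
        # ... and the weight is updated exactly once per epoch.
        w -= learning_rate * grad
        updates += 1
    return w, updates


w_full, updates_full = train_full_batch(xs, ys)
print(w_full, updates_full)
```

With four data points this is cheap, but with one million points every single update would need a pass over all of them.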

This article discusses different versions of Gradient Descent that update the weights more often or adapt the learning rate, and helps you decide which one to use. If you are not familiar with gradient descent, go through the article linked below:

Let’s discuss the optimizers available in Keras.

1. SGD (Stochastic Gradient Descent):

Stochastic Gradient Descent works like normal GD; the only difference is that each update uses a single data point. It does the forward propagation for one data point and backpropagates to update the weights, i.e., in each epoch, the number of cycles of forward and backward propagation (and of weight updates) is equal to the number of entries in the data.

It is also slow, but it takes less RAM than normal GD because only one data point is processed at a time, and each individual update is computationally cheaper.
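The per-sample update described above can be sketched as follows. This is an illustrative plain-Python version on made-up data, not the Keras implementation; the function name `train_sgd` and the fixed `seed` are assumptions.

```python
import random

# Toy data generated by y = 2 * x (illustrative only).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def train_sgd(xs, ys, learning_rate=0.01, epochs=50, seed=0):
    rng = random.Random(seed)
    w = 0.0
    updates = 0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the data points in random order
        for i in indices:
            # Forward and backward pass for ONE data point ...
            grad = 2.0 * (w * xs[i] - ys[i]) * xs[i]
            # ... followed immediately by a weight update.
            w -= learning_rate * grad
            updates += 1
    return w, updates


w_sgd, updates_sgd = train_sgd(xs, ys)
print(w_sgd, updates_sgd)
```

Here the number of updates is `epochs * len(xs)`, whereas full-batch gradient descent would make only `epochs` updates.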

Mini-Batch SGD: The difference here is that each cycle of forward and backward propagation uses a batch of data of a fixed size (the batch size) to update the weights. It is computationally more efficient than SGD because a whole batch can be processed at once with vectorized operations.
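As a rough illustration of how the batch size changes the number of weight updates per epoch, the helper below (the name `updates_per_epoch` is hypothetical) counts the batches needed to cover the dataset once, assuming the last batch may be smaller than the others.

```python
import math


def updates_per_epoch(num_samples, batch_size):
    """Number of weight updates needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


n = 1_000_000  # the one-million-point dataset from the introduction
print(updates_per_epoch(n, n))   # normal GD: the whole dataset is one batch
print(updates_per_epoch(n, 1))   # SGD: one data point per update
print(updates_per_epoch(n, 32))  # mini-batch SGD with a batch size of 32
```

In Keras, the batch size is usually chosen through the `batch_size` argument of `model.fit`, not on the optimizer itself.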

source: http://web.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture4.pdf

SGD with momentum: Momentum helps SGD keep moving in the relevant direction and damps the oscillations (smoothing out the noise in the updates), so it reaches the minimum faster. It relies on the concept of a moving average of the gradients, which takes the previous steps into account. It uses the formula below to update the weights.

source: https://towardsdatascience.com/stochastic-gradient-descent-with-momentum-a84097641a5d

L is the loss function, the beta value lies in the range [0,1], deciding the importance of the previous step, and alpha is the learning rate.
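A minimal sketch of one common way to write this update, using the symbols described above (beta, alpha, and the gradient of L): the velocity `v` is a moving average of the gradients, and the weights move along `v`. The exact form shown here is an assumption, since the same idea is written in several equivalent ways, and the toy loss and the name `momentum_step` are made up for illustration.

```python
def grad_loss(w):
    """Gradient of the toy loss L(w) = (w - 3) ** 2, minimized at w = 3."""
    return 2.0 * (w - 3.0)


def momentum_step(w, v, grad, alpha=0.1, beta=0.9):
    # Moving average of gradients: beta weights the previous step,
    # (1 - beta) weights the current gradient.
    v = beta * v + (1.0 - beta) * grad
    # Move the weight along the averaged direction, scaled by the learning rate.
    w = w - alpha * v
    return w, v


w, v = 0.0, 0.0
for _ in range(500):
    w, v = momentum_step(w, v, grad_loss(w))
print(w)
```

With beta = 0 this reduces to plain gradient descent. Keras documents its own SGD update in a slightly different but closely related form: `velocity = momentum * velocity - learning_rate * g`, then `w = w + velocity`.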

source: https://cedar.buffalo.edu/~srihari/CSE676/8.3%20BasicOptimizn.pdf

Nesterov’s Accelerated Gradient: Now, suppose you have a convex, bowl-shaped bucket, and you want to roll a ball down its slope so that the ball reaches the bottom in minimum time. If the ball is not that smart, it will overshoot the bottom and will not reach it in minimum time. The same happens in SGD with momentum.

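So, what Nesterov did was evaluate the slope at the look-ahead position, i.e., the point where the previous step (the momentum) would carry the weights, instead of at the current position. If the slope changes there, the momentum is adjusted so that the update overshoots less.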
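The look-ahead idea can be sketched in a few lines. The example below compares classical momentum with a Nesterov-style update on a made-up one-dimensional loss and measures how far each one travels past the minimum. The update form, the toy loss, and the names `run` and `max_overshoot` are assumptions for illustration, not the Keras internals.

```python
def grad_loss(w):
    """Gradient of the toy loss L(w) = (w - 3) ** 2, minimized at w = 3."""
    return 2.0 * (w - 3.0)


def run(nesterov, alpha=0.1, beta=0.9, steps=200):
    w, v = 0.0, 0.0
    path = []
    for _ in range(steps):
        # Nesterov evaluates the slope at the look-ahead point w - beta * v;
        # classical momentum evaluates it at the current point w.
        point = w - beta * v if nesterov else w
        v = beta * v + alpha * grad_loss(point)
        w = w - v
        path.append(w)
    return path


def max_overshoot(path, target=3.0):
    """Largest distance the path travels past the minimum (it starts below it)."""
    return max(max(w - target for w in path), 0.0)


momentum_path = run(nesterov=False)
nesterov_path = run(nesterov=True)
print(max_overshoot(momentum_path), max_overshoot(nesterov_path))
```

On this toy problem both variants pass the minimum at first, but the Nesterov-style path overshoots by a smaller amount and settles sooner. Other losses and hyperparameters may behave differently.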

source: https://www.youtube.com/watch?v=uHOTRHqnakQ

Visualization of SGD with momentum vs SGD with Nesterov’s accelerated momentum:

Source: https://www.youtube.com/watch?v=uHOTRHqnakQ

Keras Implementation:

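import tensorflow as tf

# Defaults shown: plain SGD. Set momentum > 0 for SGD with momentum, and nesterov=True for Nesterov's variant.
optimizer = tf.keras.optimizers.SGD(
    learning_rate=0.01, momentum=0.0, nesterov=False
)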

2. Adagrad (Adaptive Gradient Descent):

We saw that in SGD, the learning rate is fixed; it doesn’t change from step to step. What if we change the learning rate dynamically? Adagrad adapts the learning rate for each parameter: for parameters tied to frequently occurring features, it decreases the learning rate, and for rarely occurring features, it keeps the learning rate comparatively high. This can increase the speed of convergence, and it works well for sparse data.
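To illustrate the per-parameter learning rate, here is a sketch of the textbook Adagrad update in plain Python. Each parameter accumulates its own squared gradients, and its step is divided by the square root of that sum. The constant gradients, the starting accumulator value of 0.1, and the helper names are assumptions for illustration; the exact placement of epsilon inside Keras may differ.

```python
import math


def adagrad_step(weights, grads, accumulators, learning_rate=0.1, epsilon=1e-7):
    for i, g in enumerate(grads):
        # Each parameter keeps a running sum of its own squared gradients.
        accumulators[i] += g * g
        # The larger the sum, the smaller the step for that parameter.
        weights[i] -= learning_rate * g / (math.sqrt(accumulators[i]) + epsilon)


def effective_learning_rate(accumulator, learning_rate=0.1, epsilon=1e-7):
    return learning_rate / (math.sqrt(accumulator) + epsilon)


weights = [0.0, 0.0]
accumulators = [0.1, 0.1]  # assumed starting value for both parameters
for step in range(100):
    frequent_grad = 1.0                            # feature seen at every step
    rare_grad = 1.0 if step % 10 == 0 else 0.0     # feature seen every 10th step
    adagrad_step(weights, [frequent_grad, rare_grad], accumulators)

lr_frequent = effective_learning_rate(accumulators[0])
lr_rare = effective_learning_rate(accumulators[1])
print(lr_frequent, lr_rare)
```

After the loop, the parameter for the frequent feature has a smaller effective learning rate than the one for the rare feature, which is the behavior described above.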

source: https://towardsdatascience.com/deep-learning-optimizers-436171c9e23f

Keras implementation:

import tensorflow as tf

optimizer = tf.keras.optimizers.Adagrad(
    learning_rate=0.001,
    initial_accumulator_value=0.1,
    epsilon=1e-07,
    name="Adagrad",
)

References:

Nesterov’s accelerated gradient:

https://paperswithcode.com/method/nesterov-accelerated-gradient

SGD with momentum:

https://paperswithcode.com/method/sgd-with-momentum/

In the next part, we will see the rest of the optimizers, such as Adadelta, Adam, and RMSprop.

Thank you for the read! I hope it was helpful.