11 January 2022

Introduction to the Gradient Descent Algorithm

In this series, we are going to learn one of the most well-known optimization algorithms: the Gradient Descent algorithm.

What is an Optimization Algorithm?

In the context of Deep Learning and Machine Learning, an optimization algorithm is a method for minimizing the cost function. The cost function measures how well the model is performing. In most cases, it aggregates the differences between the predicted values and the actual values over the training data, e.g. Mean Squared Error (MSE) and Root Mean Squared Error (RMSE). The goal of the optimization algorithm is to find the optimal parameters that minimize the cost function.
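
To make this concrete, here is a minimal sketch (using NumPy and made-up numbers, both of which are my own illustrative assumptions) of how MSE and RMSE summarize the gap between predictions and actual values:

```python
import numpy as np

# Made-up predictions and targets, purely for illustration.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)  # Mean Squared Error
rmse = np.sqrt(mse)                    # Root Mean Squared Error

print(f"MSE:  {mse:.4f}")   # 0.3750
print(f"RMSE: {rmse:.4f}")  # 0.6124
```

The optimizer's job is then to adjust the model parameters so that a number like this becomes as small as possible.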

Remember not to confuse the cost function with the loss function. According to a post on Baeldung1, the loss function is the difference between the predicted value and the actual value for a single training example, while the cost function is the average of the loss function over all the training examples.
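
Using the same made-up numbers as above, the distinction can be sketched in code: the loss is computed for each individual example, and the cost is the average of those losses:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

# Loss: squared error of each individual training example.
losses = (y_true - y_pred) ** 2   # [0.25, 0.25, 0.0, 1.0]

# Cost: average of the per-example losses over the whole training set.
cost = losses.mean()              # 0.375
```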

Most importantly, we will go through the series gradually, so that readers do not get overwhelmed with too many details, especially the mathematical notation and formulas. We are going to start with Batch Gradient Descent, then customize the algorithm bit by bit until we reach one of the most advanced variants of Gradient Descent, Adam.
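
As a small preview of where the series starts, here is a rough sketch of a single Batch Gradient Descent step for a linear model with an MSE cost. The function name, variable names, and learning rate are illustrative assumptions, not code from the later posts:

```python
import numpy as np

def batch_gradient_descent_step(w, b, X, y, lr=0.01):
    """One illustrative Batch Gradient Descent update for a linear model
    y_hat = X @ w + b trained with the MSE cost."""
    n = len(y)
    y_pred = X @ w + b
    error = y_pred - y
    # Gradients of the MSE cost with respect to the parameters.
    grad_w = (2.0 / n) * (X.T @ error)
    grad_b = (2.0 / n) * error.sum()
    # Step against the gradient to reduce the cost.
    w = w - lr * grad_w
    b = b - lr * grad_b
    return w, b
```

Each later post in the list below tweaks this basic update rule, for example by using fewer examples per step or by adapting the learning rate.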

Here is the list of the posts in this series:

  1. Batch Gradient Descent ✅
  2. Mini Batch Gradient Descent ✅
  3. Stochastic Gradient Descent ✅
  4. SGD with Momentum ✅
  5. SGD with Nesterov ✅
  6. AdaGrad ✅
  7. AdaDelta ✅
  8. RMSprop ✅
  9. Adam ✅
  10. Adamax (Coming soon)
  11. Nadam (Coming soon)

Footnotes

  1. https://www.baeldung.com/cs/cost-vs-loss-vs-objective-function ↩