Gradient Descent Algorithm: 11 Parts

In this series, we are going to learn one of the best-known optimization algorithms: the Gradient Descent algorithm.

In the context of Deep Learning and Machine Learning, an optimization algorithm is a method for minimizing the cost function, which measures how well the model is performing. In most cases, the cost function is an average of the differences between the predicted and actual values, e.g. Mean Squared Error (MSE) or Root Mean Squared Error (RMSE). The goal of the optimization algorithm is to find the parameters that minimize the cost function.

Remember not to confuse the cost function with the loss function.
According to a post on Baeldung^{1}, the loss function is the difference between the predicted value and the actual value for a single training example,
while the cost function is the average of the loss function over all the training examples.
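The distinction can be made concrete with a small sketch. Here, the per-example squared error plays the role of the loss, and its average over the dataset is the MSE cost; the values are illustrative, not from the series:

```python
import numpy as np

# Illustrative data: actual targets and a model's predictions
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 6.0])

# Loss: the error of a single training example (squared error here)
loss_per_example = (y_pred - y_true) ** 2

# Cost: the average of the per-example losses, i.e. Mean Squared Error (MSE)
mse = loss_per_example.mean()

print(loss_per_example)  # [0.25 0.   1.   1.  ]
print(mse)               # 0.5625
```

Each entry of `loss_per_example` is a loss; the single scalar `mse` is the cost the optimizer tries to minimize.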

Most importantly, we will move through the series gradually, so that readers do not get overwhelmed by the details, especially the mathematical notation and formulas. We will start with Batch Gradient Descent, then modify the algorithm bit by bit until we reach the more advanced variants of Gradient Descent, such as Adam.
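As a preview of where the series starts, here is a minimal sketch of Batch Gradient Descent fitting a 1-D linear model by minimizing the MSE cost. The data, learning rate, and iteration count are all assumptions for illustration, not values from the posts:

```python
import numpy as np

# Synthetic data from the line y = 2x + 1 (no noise, for clarity)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 50)
y = 2.0 * x + 1.0

w, b = 0.0, 0.0   # parameters to learn
lr = 0.1          # learning rate (assumed value)

for _ in range(500):
    y_pred = w * x + b
    error = y_pred - y
    # Gradients of the MSE cost with respect to w and b,
    # computed over the full batch (hence "Batch" Gradient Descent)
    grad_w = 2 * (error * x).mean()
    grad_b = 2 * error.mean()
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # converges toward w ≈ 2, b ≈ 1
```

Every update uses the entire dataset to compute the gradient; the later posts in the series change exactly this part (mini-batches, single examples, momentum, adaptive learning rates, and so on).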

Here is the list of the links to those posts:

- Batch Gradient Descent ✅
- Mini Batch Gradient Descent ✅
- Stochastic Gradient Descent ✅
- SGD with Momentum ✅
- SGD with Nesterov ✅
- AdaGrad ✅
- AdaDelta ✅
- RMSprop ✅
- Adam ✅
- Adamax (Coming soon)
- Nadam (Coming soon)