13 February 2022

Mathematics of Gradient Descent

Introduction

Throughout this post, we are going to find out how the Gradient Descent algorithm can help us find the best parameters iteratively by minimizing the cost function of a model.

Note that the terms loss function and cost function are often used interchangeably. Strictly speaking, the loss function is the error value of a single data point, while the cost function is the average of the losses over all data points.
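To make the distinction concrete, here is a minimal sketch in Python; the numbers are made up purely for illustration.

```python
import numpy as np

y_pred = np.array([2.5, 0.0, 2.1])   # hypothetical predicted values
y_true = np.array([3.0, -0.5, 2.0])  # hypothetical actual values

loss = (y_pred - y_true) ** 2  # loss: the squared error of each single data point
cost = loss.mean()             # cost: the average of all the losses (the MSE)

print(loss)  # [0.25 0.25 0.01]
print(cost)  # ~0.17
```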

Example

Assume we have random data points like in the following graph.

We then make a regression line based on the following equation.

$$y = \theta_0 + \theta_1 x$$

At this point, we want to find the best values for $\theta_0$ and $\theta_1$. Assuming we can't come up with the best answer right away, we can just assign some random numbers to them. Let's set $\theta_0 = 0$ and $\theta_1 = -5$, which gives us the regression line in the following graph.

The problem is that we can't simply guess random numbers and plug them into $\theta_0$ and $\theta_1$ over and over again. Clearly, we need a way to automate this.

Let's make a 2D graph showing the MSE where $\theta_0 = 0$ and $-10 \leq \theta_1 \leq 13$.

In the graph above, when $\theta_0 = 0$ and $\theta_1 = -5$, the MSE value is $217.42$, shown as a red dot.

I intentionally made the range wide to show everyone that the cost function (MSE) is shaped like a parabola, since it is quadratic in $\theta_1$. It's not easy to tell where the minimum point is at this moment, but it could be around $-3 \leq \theta_1 \leq 3$. Either way, we are not going to guess suitable values of $\theta_1$ one by one.
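Since we don't have the post's original data points, here is a rough sketch of how such a curve can be traced by scanning a grid of $\theta_1$ values; the synthetic dataset is an assumption, so its MSE values won't match the $217.42$ above.

```python
import numpy as np

# Synthetic stand-in for the scatter plot; the true line and noise are assumptions.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 + 2.0 * x + rng.normal(0, 1, size=50)

theta0 = 0.0
theta1_grid = np.linspace(-10, 13, 231)  # the same range as the graph above

# MSE of the line y = theta0 + theta1 * x for every candidate theta1.
mse = np.array([np.mean((theta0 + t * x - y) ** 2) for t in theta1_grid])

print(theta1_grid[np.argmin(mse)])  # the grid point with the lowest MSE
```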

The ideal scenario is that we can have the red dot going down the valley slowly by itself.

However, we don't want the red dot to bounce here and there. If that happens, it means the learning rate is too high. Don't worry if you are not sure what a learning rate is, since I have a separate post explaining it.

In the next section, we are going to discuss how we can help our little friend, The Little Red Dot, go down the valley. Brace yourself because it is about to go down. I swear, no pun intended.

Mathematics of Gradient Descent

Since we want to minimize the MSE value of the Linear Regression model, we want to find the best values for the intercept $\theta_0$ and the coefficient $\theta_1$ so that the regression line lies as close to as many data points as possible.

The following is the cost function, or the objective function, that we want to minimize.

$$\min_{\theta} J(\theta) = \frac{1}{N} \sum_{i=1}^N \bigl(f(x_i) - y_i\bigr)^2$$

where $N$ is the number of data points, $f(x_i)$ is the predicted value for the $i$-th data point, and $y_i$ is the actual value.

To achieve this, we have to update the parameters using the following equation.

$$\theta_j = \theta_j - \alpha \cdot \nabla_{\theta_j} J(\theta)$$

where $\theta_j$ is the $j$-th parameter, that is, the parameter we want to update, and $\alpha$ is the learning rate. (I use $j$ for the parameter index so that $i$ stays reserved for the data points.)

Some textbooks would write the equation above as

$$\theta = \theta - \alpha \frac{\partial}{\partial \theta} J(\theta)$$

Both $\nabla_{\theta}$ and $\frac{\partial}{\partial \theta}$ denote the gradient of the cost function w.r.t. the parameters.

Note that the gradient of the cost function $\nabla_{\theta_j} J(\theta)$ points in the direction of the steepest ascent of the cost function. To move in the opposite direction and minimize the cost function, we subtract the gradient from $\theta_j$. Thus, the minimum of the cost function can be reached.
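Here is a quick numerical sanity check of that claim, again on assumed data: approximate the gradient with finite differences (a generic technique, not something this post relies on) and verify that one step against it lowers the cost.

```python
import numpy as np

# Assumed dataset; any small dataset shows the same effect.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 3.0 + 2.0 * x + rng.normal(0, 1, size=50)

def cost(theta0, theta1):
    """MSE of the line f(x) = theta0 + theta1 * x."""
    return np.mean((theta0 + theta1 * x - y) ** 2)

theta0, theta1 = 0.0, -5.0  # the random guess from earlier
eps = 1e-6

# Central finite differences approximate the two partial derivatives.
g0 = (cost(theta0 + eps, theta1) - cost(theta0 - eps, theta1)) / (2 * eps)
g1 = (cost(theta0, theta1 + eps) - cost(theta0, theta1 - eps)) / (2 * eps)

alpha = 0.01
print(cost(theta0, theta1))                            # cost before the step
print(cost(theta0 - alpha * g0, theta1 - alpha * g1))  # lower: we moved downhill
```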

Let's take the partial derivative of the cost function that we want to minimize.

$$\begin{aligned} \nabla_{\theta_j} J(\theta) &= \frac{\partial}{\partial \theta_j} \left( \frac{1}{N} \sum_{i=1}^N \bigl(f(x_i) - y_i\bigr)^2 \right) \\ &= \frac{1}{N} \frac{\partial}{\partial \theta_j} \sum_{i=1}^N \bigl(f(x_i) - y_i\bigr)^2 \end{aligned}$$

Applying the power rule together with the chain rule to the equation above, we then have

$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{2}{N} \sum_{i=1}^N \bigl(f(x_i) - y_i\bigr) \frac{\partial}{\partial \theta_j} \bigl(f(x_i) - y_i\bigr)$$

Since we want to update the $\theta_0$ and $\theta_1$ coefficients, we need the partial derivative of the cost function with respect to each of them.

$$\begin{aligned} \frac{\partial}{\partial \theta_0} J(\theta) &= \frac{2}{N} \sum_{i=1}^N \bigl(f(x_i) - y_i\bigr) \frac{\partial}{\partial \theta_0} (\theta_0 + \theta_1 x_i - y_i) \\ &= \frac{2}{N} \sum_{i=1}^N \bigl(f(x_i) - y_i\bigr) \end{aligned}$$

$$\frac{\partial}{\partial \theta_1} J(\theta) = \frac{2}{N} \sum_{i=1}^N \bigl(f(x_i) - y_i\bigr) x_i$$
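If you'd like to double-check the calculus, here is a small SymPy sketch that differentiates $J$ symbolically on a tiny made-up dataset and compares the result against the two formulas above; the data values are arbitrary.

```python
import sympy as sp

theta0, theta1 = sp.symbols("theta0 theta1")
xs, ys = [1, 2, 3], [2, 4, 6]  # arbitrary toy data
N = len(xs)

# J(theta) = (1/N) * sum of squared errors, with f(x) = theta0 + theta1 * x.
J = sp.Rational(1, N) * sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))

# The formulas derived above.
d0 = sp.Rational(2, N) * sum((theta0 + theta1 * x - y) for x, y in zip(xs, ys))
d1 = sp.Rational(2, N) * sum((theta0 + theta1 * x - y) * x for x, y in zip(xs, ys))

print(sp.simplify(sp.diff(J, theta0) - d0))  # 0: matches the theta0 derivative
print(sp.simplify(sp.diff(J, theta1) - d1))  # 0: matches the theta1 derivative
```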

We can remove the scalar $2$ from the two equations above by dividing the cost function, in this case the MSE equation, by $2$ [1]. Multiplying the cost function by a positive scalar will not change where its minimum is, so gradient descent still reaches the same parameters.
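To spell that step out: for any constant $c > 0$,

$$\nabla_{\theta} \bigl( c\, J(\theta) \bigr) = c\, \nabla_{\theta} J(\theta), \qquad \arg\min_{\theta}\, c\, J(\theta) = \arg\min_{\theta}\, J(\theta),$$

so scaling $J$ rescales the length of the gradient but never its direction or the location of the minimum.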

This new modified cost function is called One Half Mean Squared Error.

$$J(\theta) = \frac{1}{2N} \sum_{i=1}^N (\hat{y}_i - y_i)^2$$

where $\hat{y}_i = f(x_i)$ is just another way of writing the predicted value.

Differentiating this new cost function, the scalar $2$ cancels out of the partial derivatives with respect to $\theta_0$ and $\theta_1$,

$$\begin{aligned} \nabla_{\theta_0} J(\theta) &= \frac{1}{N} \sum_{i=1}^N \bigl(f(x_i) - y_i\bigr) \\ \nabla_{\theta_1} J(\theta) &= \frac{1}{N} \sum_{i=1}^N \bigl(f(x_i) - y_i\bigr) x_i \end{aligned}$$

Plugging each of the equations above into the update rule for the corresponding coefficient, we get

$$\begin{aligned} \theta_0 &= \theta_0 - \alpha \cdot \frac{1}{N} \sum_{i=1}^N \bigl(f(x_i) - y_i\bigr) \\ \theta_1 &= \theta_1 - \alpha \cdot \frac{1}{N} \sum_{i=1}^N \bigl(f(x_i) - y_i\bigr) x_i \end{aligned}$$

The two equations above will help us approximate the minimum value of the cost function by updating $\theta_0$ and $\theta_1$ over time.
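To preview where the next part is headed, here is a minimal Batch Gradient Descent sketch of these two update rules in Python; the dataset, learning rate, and iteration count are all assumptions for illustration.

```python
import numpy as np

# Assumed dataset standing in for the scatter plot above.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 3.0 + 2.0 * x + rng.normal(0, 1, size=100)

theta0, theta1 = 0.0, -5.0  # the random starting guess from earlier
alpha = 0.02                # learning rate
N = len(x)

for _ in range(5000):
    error = (theta0 + theta1 * x) - y    # f(x_i) - y_i for every data point
    grad0 = (1 / N) * np.sum(error)      # gradient of one-half MSE w.r.t. theta0
    grad1 = (1 / N) * np.sum(error * x)  # gradient of one-half MSE w.r.t. theta1
    theta0 -= alpha * grad0              # the two update rules above
    theta1 -= alpha * grad1

print(theta0, theta1)  # should settle near the intercept 3 and slope 2 we generated
```

Try raising $\alpha$ a few notches and watch the parameters oscillate or blow up, which is exactly the bouncing red dot from earlier.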

Conclusion

Here are the key takeaways from this entire post.

  1. Gradient Descent is an iterative algorithm that minimizes a model's cost function, which in turn makes the model better at future predictions.
  2. The cost function: $\min_{\theta} J(\theta) = \frac{1}{N} \sum_{i=1}^N (\hat{y}_i - y_i)^2$
  3. The update rule for $\theta_0$: $\theta_0 = \theta_0 - \alpha \cdot \frac{1}{N} \sum_{i=1}^N (\hat{y}_i - y_i)$
  4. The update rule for $\theta_1$: $\theta_1 = \theta_1 - \alpha \cdot \frac{1}{N} \sum_{i=1}^N (\hat{y}_i - y_i) x_i$
  5. The weights $\theta_0$ and $\theta_1$ are updated iteratively.
  6. The learning rate $\alpha$ is a hyperparameter that controls the size of each gradient descent step.
  7. A high learning rate may cause the model's MSE to bounce here and there or to converge at a suboptimal minimum.
  8. A low learning rate may cause the model to get stuck in a suboptimal minimum or to converge very slowly.

In the next part, we are going to implement Batch Gradient Descent using the mathematical equations that we have derived. See you in the next one.

References

  1. Chris McCormick. Gradient Descent Derivation. https://mccormickml.com/2014/03/04/gradient-descent-derivation.
  2. G. Sanchez & E. Marzban. All Models are Wrong. https://allmodelsarewrong.github.io/gradient.html.