## Optimized Gradient Descent

Putting some turbo boost into our gradient descent code

Series

Series

Linear Regression + another version of Gradient Descent in Python

BR

Bijon Setyawan Raya

April 04, 2022

8 mins

Introduction

Linear Regression

Mathematics of Gradient Descent

Batch Gradient Descent

Mini Batch Gradient Descent

Stochastic Gradient Descent

Previously, we have learned that BGD updates $\beta_0$ and $\beta_1$ only after it has seen the entire dataset. As for MBGD, it only updates them after it has seen a fraction of of the entire dataset.

In this post, we are going to implement another variation of Gradient Descent called Stochastic Gradient Descent, and now I am going to call it SGD throughout this post.

SGD is an optimization algorithm that is derived from BGD and MBGD. Since we are dealing with Linear Regresison, this algorithm helps us to find the best fit line for our data. In other words, we want to find the best value for $\beta_0$ and $\beta_1$ so that the line sits right in a position where it's close to most data points. As the regression line moves toward where most data points sit, the lost function, in this case the Mean Square Error value, will decrease.

Before we start, we should get ourselves familiar with the mathematical part of this algorithm.

The whole idea of SGD is that it updates the parameters each time we see a random data points.

So, we are going to do a litle review to refresh our memory from the first post in this series.

Remember that the cost function that we are going to minimized in BGD is

$J(\beta_0, \beta_1) = \frac{1}{N} \sum_{i=1}^N (f(x) - y_1)^2$

However, since SGD updates $\beta_0$ and $\beta_1$ after seeing a random data points, we do not have to divide the summation by $N$.

Thus, the cost function will be

$J(\beta_0, \beta_1) = (f(x) - y_1)^2$

The update rules stay the same

$\Theta_{i} = \beta_i - \alpha \cdot \frac{\partial}{\partial \beta_i} J(\beta_0, \beta_1)$

where $\beta_i$ is the coefficients we want to update and $\alpha$ is the learning rate.

Applying Power Rule to the cost function, we have

$\frac{\partial}{\partial \beta_i} J(\beta_0, \beta_1) = 2(f(x) - y_i) \frac{\partial}{\partial \beta_i} (f(x) - y_i)$

Let's partially derive the cost function with respect to $\beta_0$ and $\beta_1$.

$\begin{aligned}
\frac{\partial}{\partial \beta_0} J(\beta_0, \beta_1)
&= 2 (f(x) - y_i) \frac{\partial}{\partial \beta_0} (\beta_0 + \beta_1 x - y_i) \\
&= 2 (f(x) - y_i)
\end{aligned}$

$\frac{\partial}{\partial \beta_1} J(\beta_0, \beta_1) = 2(f(x) - y_i)x$

Remember that the scalar 2 in the partial derivative equations above can be cancelled out by dividing the SGD cost function by 2.
We called the new customized cost function **One Half Mean Squared Error**.

$J(\beta_0, \beta_1) = \frac{1}{2}(f(x) - y_1)^2$

Thus the partial derivatives with respect to $\beta_0$ and $\beta_1$ will be

$\begin{aligned}
\frac{\partial}{\partial \beta_0} J(\beta_0, \beta_1) = (f(x) - y_i) \\
\frac{\partial}{\partial \beta_1} J(\beta_0, \beta_1) = (f(x) - y_i) x
\end{aligned}$

Plugging each of the equation above into the update rules with respect to those coefficients, we get

$\begin{aligned}
\beta_0 = \beta_0 - \alpha \cdot (f(x) - y_i) \\
\beta_1 = \beta_1 - \alpha \cdot (f(x) - y_i)x
\end{aligned}$

The equations above is a variation of Gradient Descent algorithm that helps us to approximate the minimum value of the cost function by updating $\beta_0$ and $\beta_1$ iteratively.

We are going to need two functions to implement SGD.

`predict(...)`

which received a single data point and returns the predicted value of the data point.`sgd(...)`

which received a list of data points and returns the updated coefficients presented in a dataframe.

```
def predict(intercept, coefficient, x):
return intercept + coefficient * x
def sgd(x, y, df, epochs, alpha = 0.01):
intercept = 2.0
coefficient = -7.5
sum_error = 0
rand = np.random.randint(0, len(x))
prediction = predict(intercept, coefficient, x[rand])
sum_error = sum_error + (((prediction - y[rand]) ** 2) / 2)
df.loc[0] = [intercept, coefficient, sum_error]
for i in range(1,epochs):
# get random index
rand = np.random.randint(0, len(x))
# get random x and y
x_i, y_i = x[rand], y[rand]
# get prediction
prediction = predict(intercept, coefficient, x_i)
b0_error = prediction - y_i
b1_error = (prediction - y_i) * x_i
intercept = intercept - alpha * b0_error
coefficient = coefficient - alpha * b1_error
sum_error = (((prediction - y_i) ** 2) / 2)
df.loc[i] = [intercept, coefficient, sum_error]
return df
```

From the name of the optimization itself, it behaves in a stochastic ways, meaning we pick a random data point and update the intercept and the coefficient iteratively. That's exactly what the 16th and 18th lines of the code above are doing.

Apart from that, the rest of the code looks exactly the same like the ones in BGD and MBGD.

- O. Artem.
*Stochastic, Batch, and Mini-Batch Gradient Descent*. Source https://towardsdatascience.com/stochastic-batch-and-mini-batch-gradient-descent-demystified-8b28978f7f5 - P. Sushant.
*Batch, Mini Batch, and Stochastic Gradient Descent*. Source https://towardsdatascience.com/batch-mini-batch-stochastic-gradient-descent-7a62ecba642a - Geeksforgeeks.
*Difference between Batch Gradient Descent and Stochastic Gradient Descent*. Source https://www.geeksforgeeks.org/difference-between-batch-gradient-descent-and-stochastic-gradient-descent/ - Sweta.
*Batch, Mini Batch, and Stochastic Gradient Descent*. Source https://sweta-nit.medium.com/batch-mini-batch-and-stochastic-gradient-descent-e9bc4cacd461 - R. Sebastian.
*Gradient Descent and Stochastic Gradient Descent*. Source https://rasbt.github.io/mlxtend/user_guide/general_concepts/gradient-optimization/ - Geeksforgeeks.
*ML | Mini-Batch Gradient Descent with Python*. Source https://www.geeksforgeeks.org/ml-mini-batch-gradient-descent-with-python/