Stochastic Gradient Descent

"Linear Regression + another version of Gradient Descent in Python"
04 April 2022

Previously, we have learned that BGD updates β0\beta_0 and β1\beta_1 only after it has seen the entire dataset. As for MBGD, it only updates them after it has seen a fraction of of the entire dataset.

In this post, we are going to implement another variation of Gradient Descent called Stochastic Gradient Descent, and now I am going to call it SGD throughout this post.

Introduction

SGD is an optimization algorithm that is derived from BGD and MBGD. Since we are dealing with Linear Regresison, this algorithm helps us to find the best fit line for our data. In other words, we want to find the best value for β0\beta_0 and β1\beta_1 so that the line sits right in a position where it's close to most data points. As the regression line moves toward where most data points sit, the lost function, in this case the Mean Square Error value, will decrease.

Before we start, we should get ourselves familiar with the mathematical part of this algorithm.

Mathematics of SGD

The whole idea of SGD is that it updates the parameters each time we see a random data points.

So, we are going to do a litle review to refresh our memory from the first post in this series.

Remember that the cost function that we are going to minimized in BGD is

J(β0,β1)=1Ni=1N(f(x)y1)2 J(\beta_0, \beta_1) = \frac{1}{N} \sum_{i=1}^N (f(x) - y_1)^2

However, since SGD updates β0\beta_0 and β1\beta_1 after seeing a random data points, we do not have to divide the summation by NN.

Thus, the cost function will be

J(β0,β1)=(f(x)y1)2 J(\beta_0, \beta_1) = (f(x) - y_1)^2

The update rules stay the same

Θi=βiαβiJ(β0,β1) \Theta_{i} = \beta_i - \alpha \cdot \frac{\partial}{\partial \beta_i} J(\beta_0, \beta_1)

where βi\beta_i is the coefficients we want to update and α\alpha is the learning rate.

Applying Power Rule to the cost function, we have

βiJ(β0,β1)=2(f(x)yi)βi(f(x)yi) \frac{\partial}{\partial \beta_i} J(\beta_0, \beta_1) = 2(f(x) - y_i) \frac{\partial}{\partial \beta_i} (f(x) - y_i)

Let's partially derive the cost function with respect to β0\beta_0 and β1\beta_1.

β0J(β0,β1)=2(f(x)yi)β0(β0+β1xyi)=2(f(x)yi) \begin{aligned} \frac{\partial}{\partial \beta_0} J(\beta_0, \beta_1) &= 2 (f(x) - y_i) \frac{\partial}{\partial \beta_0} (\beta_0 + \beta_1 x - y_i) \\ &= 2 (f(x) - y_i) \end{aligned}
β1J(β0,β1)=2(f(x)yi)x \frac{\partial}{\partial \beta_1} J(\beta_0, \beta_1) = 2(f(x) - y_i)x

Remember that the scalar 2 in the partial derivative equations above can be cancelled out by dividing the SGD cost function by 2. We called the new customized cost function One Half Mean Squared Error.

J(β0,β1)=12(f(x)y1)2 J(\beta_0, \beta_1) = \frac{1}{2}(f(x) - y_1)^2

Thus the partial derivatives with respect to β0\beta_0 and β1\beta_1 will be

β0J(β0,β1)=(f(x)yi)β1J(β0,β1)=(f(x)yi)x \begin{aligned} \frac{\partial}{\partial \beta_0} J(\beta_0, \beta_1) = (f(x) - y_i) \\ \frac{\partial}{\partial \beta_1} J(\beta_0, \beta_1) = (f(x) - y_i) x \end{aligned}

Plugging each of the equation above into the update rules with respect to those coefficients, we get

β0=β0α(f(x)yi)β1=β1α(f(x)yi)x \begin{aligned} \beta_0 = \beta_0 - \alpha \cdot (f(x) - y_i) \\ \beta_1 = \beta_1 - \alpha \cdot (f(x) - y_i)x \end{aligned}

The equations above is a variation of Gradient Descent algorithm that helps us to approximate the minimum value of the cost function by updating β0\beta_0 and β1\beta_1 iteratively.

Coding

We are going to need two functions to implement SGD.

  1. predict(...) which received a single data point and returns the predicted value of the data point.
  2. sgd(...) which received a list of data points and returns the updated coefficients presented in a dataframe.
def predict(intercept, coefficient, x):
    return intercept + coefficient * x
 
def sgd(x, y, df, epochs, alpha = 0.01):
    intercept = 2.0
    coefficient = -7.5
    sum_error = 0
 
    rand = np.random.randint(0, len(x))
    prediction = predict(intercept, coefficient, x[rand])
    sum_error = sum_error + (((prediction - y[rand]) ** 2) / 2)
    df.loc[0] = [intercept, coefficient, sum_error]
 
    for i in range(1,epochs):
        # get random index
        rand = np.random.randint(0, len(x))
        # get random x and y
        x_i, y_i = x[rand], y[rand]
        # get prediction
        prediction = predict(intercept, coefficient, x_i)
        b0_error = prediction - y_i
        b1_error = (prediction - y_i) * x_i
 
        intercept = intercept - alpha * b0_error
        coefficient = coefficient - alpha * b1_error
 
        sum_error = (((prediction - y_i) ** 2) / 2)
        df.loc[i] = [intercept, coefficient, sum_error]
    return df

From the name of the optimization itself, it behaves in a stochastic ways, meaning we pick a random data point and update the intercept and the coefficient iteratively. That's exactly what the 16th and 18th lines of the code above are doing.

Apart from that, the rest of the code looks exactly the same like the ones in BGD and MBGD.

References

  1. O. Artem. Stochastic, Batch, and Mini-Batch Gradient Descent. Source https://towardsdatascience.com/stochastic-batch-and-mini-batch-gradient-descent-demystified-8b28978f7f5
  2. P. Sushant. Batch, Mini Batch, and Stochastic Gradient Descent. Source https://towardsdatascience.com/batch-mini-batch-stochastic-gradient-descent-7a62ecba642a
  3. Geeksforgeeks. Difference between Batch Gradient Descent and Stochastic Gradient Descent. Source https://www.geeksforgeeks.org/difference-between-batch-gradient-descent-and-stochastic-gradient-descent/
  4. Sweta. Batch, Mini Batch, and Stochastic Gradient Descent. Source https://sweta-nit.medium.com/batch-mini-batch-and-stochastic-gradient-descent-e9bc4cacd461
  5. R. Sebastian. Gradient Descent and Stochastic Gradient Descent. Source https://rasbt.github.io/mlxtend/user_guide/general_concepts/gradient-optimization/
  6. Geeksforgeeks. ML | Mini-Batch Gradient Descent with Python. Source https://www.geeksforgeeks.org/ml-mini-batch-gradient-descent-with-python/