Optimized Gradient Descent
Putting some turbo boost into our gradient descent code
Series
Linear Regression + another version of Gradient Descent in Python
Bijon Setyawan Raya
April 04, 2022
8 mins
Introduction
Linear Regression
Mathematics of Gradient Descent
Batch Gradient Descent
Mini Batch Gradient Descent
Stochastic Gradient Descent
Previously, we learned that BGD updates $\beta_0$ and $\beta_1$ only after it has seen the entire dataset, while MBGD updates them after it has seen a fraction of the dataset.
In this post, we are going to implement another variation of Gradient Descent called Stochastic Gradient Descent, which I will refer to as SGD throughout this post.
SGD is an optimization algorithm that is derived from BGD and MBGD. Since we are dealing with Linear Regression, this algorithm helps us find the best-fit line for our data. In other words, we want to find the best values for $\beta_0$ and $\beta_1$ so that the line sits in a position where it is close to most data points. As the regression line moves toward where most data points sit, the loss function, in this case the Mean Squared Error, will decrease.
Before we start, we should get ourselves familiar with the mathematical part of this algorithm.
The whole idea of SGD is that it updates the parameters each time it sees a single, randomly chosen data point.
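Concretely, a single SGD iteration looks roughly like the short sketch below. The data and variable names are made up purely for illustration; the full implementation comes later in this post.

```python
import numpy as np

# A minimal sketch of one SGD step on toy data (made-up numbers):
# pick one random data point, compute its prediction error, and update
# both parameters immediately instead of waiting for the whole dataset.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])
b0, b1, alpha = 2.0, -7.5, 0.01

i = np.random.randint(0, len(x))        # one random index
error = (b0 + b1 * x[i]) - y[i]         # prediction error for that single point
b0 = b0 - alpha * error                 # update the intercept
b1 = b1 - alpha * error * x[i]          # update the coefficient
```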
So, let's do a little review to refresh our memory from the first post in this series.
Remember that the cost function we want to minimize in BGD is the Mean Squared Error:

$$J(\beta_0, \beta_1) = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$

where $\hat{y}_i = \beta_0 + \beta_1 x_i$ is the prediction for the $i$-th data point and $N$ is the number of data points.
However, since SGD updates $\beta_0$ and $\beta_1$ after seeing a single random data point, we do not have to divide the summation by $N$.
Thus, the cost function becomes

$$J(\beta_0, \beta_1) = (\hat{y}_i - y_i)^2$$

for a single randomly chosen index $i$.
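To make the difference concrete, here is a small sketch with made-up numbers contrasting the BGD cost, which averages over all $N$ points, with the SGD cost, which looks at just one random point and has no $1/N$ term.

```python
import numpy as np

# Made-up data and parameters purely for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])
b0, b1 = 2.0, -7.5

predictions = b0 + b1 * x

# BGD cost: Mean Squared Error averaged over all N points.
bgd_cost = np.mean((predictions - y) ** 2)

# SGD cost: squared error of one randomly chosen point, no division by N.
i = np.random.randint(0, len(x))
sgd_cost = (predictions[i] - y[i]) ** 2
```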
The update rules stay the same:

$$\beta_j := \beta_j - \alpha \frac{\partial J}{\partial \beta_j}$$

where $\beta_j$ is the coefficient we want to update and $\alpha$ is the learning rate.
Let's partially derive the cost function with respect to $\beta_0$ and $\beta_1$. Applying the Power Rule (together with the Chain Rule), we have

$$\frac{\partial J}{\partial \beta_0} = 2(\hat{y}_i - y_i)$$

$$\frac{\partial J}{\partial \beta_1} = 2(\hat{y}_i - y_i) x_i$$
Remember that the scalar 2 in the partial derivatives above can be cancelled out by dividing the SGD cost function by 2. We call this new, customized cost function the One Half Mean Squared Error:

$$J(\beta_0, \beta_1) = \frac{1}{2} (\hat{y}_i - y_i)^2$$
Thus, the partial derivatives with respect to $\beta_0$ and $\beta_1$ become

$$\frac{\partial J}{\partial \beta_0} = \hat{y}_i - y_i$$

$$\frac{\partial J}{\partial \beta_1} = (\hat{y}_i - y_i) x_i$$
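As a quick sanity check on these derivatives, the few lines below, with made-up numbers, compare the analytic gradient of the one-half squared error with respect to $\beta_1$ against a central finite-difference approximation; the two should agree.

```python
# Numerical sanity check with made-up numbers: the analytic gradient of the
# one-half squared error should match a central finite-difference approximation.
x_i, y_i = 3.0, 9.0
b0, b1, eps = 2.0, -7.5, 1e-6

def half_loss(b0, b1):
    return 0.5 * ((b0 + b1 * x_i) - y_i) ** 2

analytic = ((b0 + b1 * x_i) - y_i) * x_i                                  # dJ / db1
numeric = (half_loss(b0, b1 + eps) - half_loss(b0, b1 - eps)) / (2 * eps)
print(analytic, numeric)   # both are approximately -88.5
```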
Plugging these partial derivatives into the update rules for the corresponding coefficients, we get

$$\beta_0 := \beta_0 - \alpha (\hat{y}_i - y_i)$$

$$\beta_1 := \beta_1 - \alpha (\hat{y}_i - y_i) x_i$$
The equations above are a variation of the Gradient Descent algorithm that helps us approximate the minimum value of the cost function by updating $\beta_0$ and $\beta_1$ iteratively, one random data point at a time.
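To see the update rules in action, here is a single SGD step worked by hand with made-up numbers, using the same starting values for the intercept and coefficient that the code below uses.

```python
# One SGD step by hand, with made-up numbers.
b0, b1, alpha = 2.0, -7.5, 0.01   # starting intercept, coefficient, learning rate
x_i, y_i = 1.0, 3.0               # one randomly picked data point (hypothetical)

prediction = b0 + b1 * x_i        # 2.0 + (-7.5) * 1.0 = -5.5
error = prediction - y_i          # -5.5 - 3.0 = -8.5

b0 = b0 - alpha * error           #  2.0 - 0.01 * (-8.5) =  2.085
b1 = b1 - alpha * error * x_i     # -7.5 - 0.01 * (-8.5) = -7.415

print(b0, b1)                     # 2.085 -7.415
```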
We are going to need two functions to implement SGD, both shown in the code below.

predict(...), which receives a single data point and returns the predicted value for that data point.

sgd(...), which receives a list of data points and returns the updated coefficients, presented in a dataframe.

```python
import numpy as np

def predict(intercept, coefficient, x):
    # predicted value for a single data point
    return intercept + coefficient * x

def sgd(x, y, df, epochs, alpha = 0.01):
    # starting values for the intercept and the coefficient
    intercept = 2.0
    coefficient = -7.5

    # record the error of one random point before any update
    rand = np.random.randint(0, len(x))
    prediction = predict(intercept, coefficient, x[rand])
    sum_error = ((prediction - y[rand]) ** 2) / 2
    df.loc[0] = [intercept, coefficient, sum_error]

    for i in range(1, epochs):
        # get a random index
        rand = np.random.randint(0, len(x))
        # get the random x and y
        x_i, y_i = x[rand], y[rand]
        # get the prediction for that single point
        prediction = predict(intercept, coefficient, x_i)
        # gradients of the one-half squared error
        b0_error = prediction - y_i
        b1_error = (prediction - y_i) * x_i
        # update the intercept and the coefficient
        intercept = intercept - alpha * b0_error
        coefficient = coefficient - alpha * b1_error
        # one-half squared error of the current random point (not accumulated)
        sum_error = ((prediction - y_i) ** 2) / 2
        df.loc[i] = [intercept, coefficient, sum_error]

    return df
```
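Here is one possible way to put the two functions to work end to end. The synthetic data, column names, and hyperparameters below are assumptions chosen for illustration; the only requirement from sgd(...) is that the dataframe has three columns to hold the intercept, the coefficient, and the error at each epoch.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration: y = 4 + 3x plus a little noise.
rng = np.random.default_rng(42)
x = rng.uniform(0, 5, 100)
y = 4 + 3 * x + rng.normal(0, 1, 100)

# The dataframe columns mirror what sgd(...) stores at every epoch.
history = pd.DataFrame(columns=["intercept", "coefficient", "error"])
history = sgd(x, y, history, epochs=5000, alpha=0.01)

# The intercept and coefficient should drift toward roughly 4 and 3.
print(history.tail())
```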
As the name of the optimization suggests, it behaves in a stochastic way: at every iteration we pick a random data point with np.random.randint and then update the intercept and the coefficient using only that point. That is exactly what the body of the for loop in the code above is doing.
Apart from that, the rest of the code looks exactly the same as the code in BGD and MBGD.
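One way to see that stochastic behaviour is to plot the error column recorded at every epoch: the curve bounces around, because each epoch looks at a different random point, but it trends downward overall. The sketch below assumes the history dataframe from the usage example above and matplotlib.

```python
import matplotlib.pyplot as plt

# Plot the one-half squared error recorded at every epoch
# (assumes `history` from the usage sketch above).
plt.plot(history.index, history["error"])
plt.xlabel("epoch")
plt.ylabel("one-half squared error")
plt.title("SGD error per epoch")
plt.show()
```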