This post is part of an 11-part series on the Gradient Descent Algorithm.
In the Batch Gradient Descent post, we discussed that the intercept and the coefficient are updated only after the algorithm has seen the entire dataset.
In this post, we will discuss the Mini-Batch Gradient Descent (MBGD) algorithm. MBGD is quite similar to BGD; the only difference is that the parameters are updated after the algorithm has seen a subset (a mini batch) of the dataset.
The parameter update rule is expressed as

$$\theta_0 := \theta_0 - \alpha \frac{\partial J}{\partial \theta_0}, \qquad \theta_1 := \theta_1 - \alpha \frac{\partial J}{\partial \theta_1}$$

where $\theta_0$ is the intercept, $\theta_1$ is the coefficient, $\alpha$ is the learning rate, and $J$ is the cost function

$$J = \frac{1}{2b} \sum_{i=1}^{b} (\hat{y}_i - y_i)^2.$$

The gradients of the cost function w.r.t. the intercept and the coefficient are expressed as follows:

$$\frac{\partial J}{\partial \theta_0} = \frac{1}{b} \sum_{i=1}^{b} (\hat{y}_i - y_i), \qquad \frac{\partial J}{\partial \theta_1} = \frac{1}{b} \sum_{i=1}^{b} (\hat{y}_i - y_i)\, x_i$$

where $b$ is the batch size, $\hat{y}_i$ is the prediction for $x_i$, and $y_i$ is the true value.
Notice that the gradient of the cost function w.r.t. the intercept is simply the average prediction error over the mini batch.
For more details, please refer to the Mathematics of Gradient Descent post.
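As a quick sanity check, here is a minimal sketch of the two gradients; the three-point mini batch and the starting parameter values are assumptions chosen only for illustration.

import numpy as np

# Assumed toy mini batch and parameters (illustration only)
batch_x = np.array([1.0, 2.0, 3.0])
batch_y = np.array([2.0, 4.0, 6.0])
intercept, coefficient = 0.0, 1.0
batch_size = len(batch_x)

# Prediction error of the current parameters on this mini batch
error = (intercept + coefficient * batch_x) - batch_y         # [-1., -2., -3.]

# Gradient w.r.t. the intercept is the average prediction error;
# the gradient w.r.t. the coefficient weights the error by the inputs
intercept_gradient = np.sum(error) / batch_size               # -2.0
coefficient_gradient = np.sum(error * batch_x) / batch_size   # -4.666...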
First, define the predict and create_batches functions.
import numpy as np

def predict(intercept, coefficient, dataset):
    # Return the model's prediction for every point in the dataset
    return np.array([intercept + coefficient * x for x in dataset])

def create_batches(x, y, batch_size):
    # Split the features and the targets into mini batches
    x_batches = np.array_split(x, len(x) // batch_size)
    y_batches = np.array_split(y, len(y) // batch_size)
    return x_batches, y_batches
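For example, assuming a toy dataset of ten points and a batch size of three (values chosen only for illustration), create_batches produces batches of roughly equal size, because np.array_split does not require the split to be even:

x = np.arange(10)        # assumed toy inputs: 0, 1, ..., 9
y = 2 * x                # assumed toy targets
x_batches, y_batches = create_batches(x, y, batch_size=3)
print([batch.tolist() for batch in x_batches])
# [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]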
Second, split the dataset into mini batches.
x_batches, y_batches = create_batches(x, y, batch_size)
Third, determine the prediction error of each mini batch and the gradients of the cost function w.r.t. the intercept $\theta_0$ and the coefficient $\theta_1$.
predictions = predict(intercept, coefficient, batch_x)
error = predictions - batch_y
intercept_gradient = np.sum(error) / batch_size
coefficient_gradient = np.sum(error * batch_x) / batch_size
Lastly, update the intercept $\theta_0$ and the coefficient $\theta_1$.
intercept = intercept - alpha * intercept_gradient
coefficient = coefficient - alpha * coefficient_gradient
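Continuing the toy example from the sketch above, with an assumed learning rate of 0.1, a single update step nudges the coefficient toward the toy data's slope of 2:

alpha = 0.1
intercept = intercept - alpha * intercept_gradient          # 0.0 - 0.1 * (-2.0)   = 0.2
coefficient = coefficient - alpha * coefficient_gradient    # 1.0 - 0.1 * (-4.667) ≈ 1.467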
Figure: the change of the regression line over time with a batch size of 64.
Figure: the effect of batch size on the cost function.
From the second graph above, we can see that the cost curve is less noisy, or smoother, when the batch size is larger. A batch size in the range of 50 to 256 is a good starting point, but the best value ultimately depends on the hardware of the machine and the size of the dataset.
Figure: the pathway of the cost function over the 2D MSE contour.
Unlike BGD, the MBGD pathway follows a zig-zag pattern while traversing the valley of the MSE contour, because each update is computed from a different mini batch rather than from the full dataset.
Putting everything together, the complete mbgd function is shown below.

def mbgd(x, y, epochs, df, batch_size, alpha=0.01):
    # Initial guesses for the intercept and the coefficient
    intercept, coefficient = 2.0, -7.5
    # Split the dataset into mini batches
    x_batches, y_batches = create_batches(x, y, batch_size)
    # Record the starting parameters and the MSE on the first mini batch
    predictions = predict(intercept, coefficient, x_batches[0])
    error = predictions - y_batches[0]
    mse = np.sum(error ** 2) / (2 * batch_size)
    df.loc[0] = [intercept, coefficient, mse]
    index = 1
    for _ in range(epochs):
        for batch_x, batch_y in zip(x_batches, y_batches):
            # Prediction error on the current mini batch
            predictions = predict(intercept, coefficient, batch_x)
            error = predictions - batch_y
            # Gradients of the cost function w.r.t. the intercept and the coefficient
            intercept_gradient = np.sum(error) / batch_size
            coefficient_gradient = np.sum(error * batch_x) / batch_size
            # Parameter update step
            intercept = intercept - alpha * intercept_gradient
            coefficient = coefficient - alpha * coefficient_gradient
            # Log the updated parameters and the MSE of this mini batch
            mse = np.sum(error ** 2) / (2 * batch_size)
            df.loc[index] = [intercept, coefficient, mse]
            index += 1
    return df
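A minimal usage sketch follows; the synthetic dataset, the hyperparameters, and the DataFrame column names are assumptions for illustration, not the exact setup behind the figures above.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 512)
y = 5.0 + 3.0 * x + rng.normal(0.0, 1.0, size=512)   # assumed true line: y = 5 + 3x

df = pd.DataFrame(columns=['intercept', 'coefficient', 'mse'])
history = mbgd(x, y, epochs=30, df=df, batch_size=64, alpha=0.05)

# The fitted parameters should approach the assumed true values of 5 and 3
print(history[['intercept', 'coefficient']].iloc[-1])

Plotting history['mse'] against its index reproduces the cost curve, and re-running the sketch with different batch sizes shows the smoothing effect discussed earlier.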