Mini Batch Gradient Descent

"Linear Regression + Mini-Batch Gradient Descent in Python"
03 March 2022

Previously, we have learned that Batch Gradient Descent only updates the parameters only after it has seen the entire dataset. However, in this post, we are going to implement Mini-Batch Gradient Descent which updates the parameters after it has seen a fraction of a dataset. Not only that, but we are also going to find out the best batch size to make the algorithm converge faster.

From now on, I will use two abbreviations, namely BGD for Batch Gradient Descent and Mini-Batch Gradient Descent, throughout this post.


In this implementation, we are going to use some of the codes from the previous posts. However, we are going to customize BGD function so the parameters are updated after it has seen a fraction of the dataset.

First, we need a function that splits the dataset into mini-batches.

def create_batches(x, y, batch_size):
  x_batches = np.array_split(x, len(x) // batch_size)
  y_batches = np.array_split(y, len(y) // batch_size)
  return x_batches, y_batches

Second, we need a customized version of BGD function that updates the required parameters after seeing a small number of data points. Since there are 150 data points in the Iris dataset, we will use a batch size of 64 and there will be two batches in total.

The following code is the customized bgd() function that updates the parameters after it has seen a fraction of the dataset.

def mbgd(x, y, epochs, df, batch_size = 64, alpha = 0.01):
  intercept, coefficient = 2.0, -7.5
  # first sum_error
  predictions = predict(intercept, coefficient, x)
  sum_error = np.sum((predictions - y) ** 2) / (2 * batch_size)
  df.loc[0] = [-1, intercept, coefficient, sum_error]
  index = 1
  x_batches, y_batches = create_batches(x, y, batch_size)
  for epoch in range(epochs):
    for x, y in zip(x_batches, y_batches):
      sum_error = 0.0
      predictions = predict(intercept, coefficient, x)
      b0_error = (1/batch_size) * np.sum(predictions - y)
      b1_error = (1/batch_size) * np.sum((predictions - y) * x)
      intercept = intercept - alpha * b0_error
      coefficient = coefficient - alpha * b1_error
      sum_error = sum_error + np.sum((predictions - y) ** 2) / (2 * batch_size)
      df.loc[index] = [int(epoch), intercept, coefficient, sum_error]
      index += 1
  return df

First, the initial point is set to be (2.0, -7.5) because I want the initial guess to be far from the convergence point.

Once we have the initial guess, the batches will be created just before the for-loop, and the parameter updates will occur inside the for-loop.

Mean Squared Error

In the previous section, I just explained on how to implement MBGD in Python. Now, let's find out what we can get from this experimentation.

epochs = 5000
mbgd_loss = pd.DataFrame(
  columns=['epoch', 'intercept', 'coefficient', 'sum_error']
mbgd_loss = mbgd(
  epochs = epochs,
  df = mbgd_loss
mbgd_loss['epoch'] = mbgd_loss['epoch'].astype(int)

Althought the size of the batches matters in MBGD, I will explain it later in the next part and we will stick to size 64 this time.

MSE values produced by MBGD with different batch sizes
MSE values produced by MBGD with different batch sizes

From the picture, we can see that the MSE graphs are thick because the graph goes up and down, following a zig-zag pattern, unlike the one produced by BGD which goes down in a straight line.

In addition to the zig-zag pattern, the time complexity increases by half as the number of batch sizes decreases, but the MSE graphs gets closer and closer to zero. The difference is more obvious shown in the first graph and the fifth graph.

Note that the number of updates increases when the number of batch size decreases. Take example of batch size of 88. Since the number of data points is 150, the number of updates is 150/8=18\lfloor 150 / 8 \rfloor = 18 per epoch. In total, there will be 1818 batches ×\times 500500 epochs =9000= 9000 updates.

Best Batch Size

In the previous part, showing the MSE values from different batch sizes doesn't tell us much if our model is getting better. So, we are going to use the Total of Sum Squares to find the best batch size.

TSS=i=1N(y^iyi)2 \text{TSS} = \sum_{i=1}^{N} (\hat{y}_i - y_i)^2

where yiy_i represents the predicted value and y^i\hat{y}_i represents the actual value.

There are 4 batch sizes that I want to observe, namely 8, 16, 32, and 64.

TSS values produced by MBGD with different batch sizes
TSS values produced by MBGD with different batch sizes

The TSS graphs provide us a clearer picture on how the regression model react to the batch size. Similar to the MSE graphs with different batch sizes, the TSS graphs get closer and closer to zero as the number of batch sizes decreases

On the contrary, the larger batch sizes will produce more accurate model without noise. Yet it comes with the expense of reaching the convergence point slower.

To answer what is the best batch size, it depends a lot on the hardware you have. Let's say you have multiple latest and advanced GPUs, then you can try to use 32 or 64. These two sizes are recommended since those are how much the memory in CPUs and GPUs can store at a time.

Not only it depends only the hardware you have, you should consider how your model reacts to the number of batch sizes. So, there is no clear answer to this question. I personally think it's all down to trial and error.

Regression Lines

There will be two graphs, one contains how the regression line movement pattern over nn number of updates, and the other is the animation of the regression line movements.

Regression Lines
Regression Lines

I am sure since the cost function follows a zig-zag pattern, the regression line moves in a zig-zag pattern as well like a graph shown above.

Regression Line Movements
Regression Line Movements

However, it's not noticeable the pattern movement in the animation. If somebody can explain me in details on why this is happening, do not hesitate to send me an email.

Now, let's see how the movement of the intercept and the coefficient variabels on a contour map.

The intercept and the coefficient variable on a contour map
The intercept and the coefficient variable on a contour map

If you compare the contour map in the BGD page, you will notice that the contour map made by BGD is much more smooth. Unlike the one in BGD, the size of the line above is thicker, and it looks like is walking up the stairs.


There are several key points to be noted in this post.

  1. MBGD requires less iterations to reach the convergence point compared to BGD.
  2. The number of parameter updates in MBGD is the same with BGD, except the size of mini batches.
  3. The size of mini batches might affect MBGD's efficiency.
  4. The closer the batch size is to the number of data points, the thinner the cost function line gets.
  5. The MSE line produced by MBGD follows a zig-zag pattern, and that's why it looks thicker.
  6. The ideal batch size is around 32 or 64.

If you want to know how to implement MBGD in more details, please click here to check the Python notebook on Kaggle. I also include some code on how to generate those graphs.

The code might look unorganized since I have been busy with my thesis, so the code might change from time to time. See ya at the next post!


  1. O. Artem. Stochastic, Batch, and Mini-Batch Gradient Descent. Source
  2. P. Sushant. Batch, Mini Batch, and Stochastic Gradient Descent. Source
  3. Geeksforgeeks. Difference between Batch Gradient Descent and Stochastic Gradient Descent. Source
  4. Sweta. Batch, Mini Batch, and Stochastic Gradient Descent. Source
  5. B. Jason. A Gentle Introduction to Mini-Batch Gradient Descent and How to Configure Batch Size. Source
  6. Geeksforgeeks. ML | Mini-Batch Gradient Descent with Python. Source

Other Resources