
Linear Regression + Batch Gradient Descent in Python


  • Bijon Setyawan Raya

  • February 16, 2022

    15 mins




    Setting Up The Dataset

    In this example, we are going to use the Iris dataset from the UCI Machine Learning Repository, imported through scikit-learn. There are two features in the dataset that we are going to analyse, namely sepal_length and petal_width, which are extracted in the code below.

    import numpy as np
    from sklearn.datasets import load_iris
    
    # Load the Iris dataset bundled with scikit-learn
    iris = load_iris()
    features = iris.data
    target = iris.target
    
    # The two features we are going to analyse
    sepal_length = np.array(features[:, 0])
    petal_width = np.array(features[:, 3])
    
    # Map the numeric class labels to the species names
    species_names = list()
    
    for i in target:
        if i == 0:
            species_names.append('setosa')
        elif i == 1:
            species_names.append('versicolor')
        else:
            species_names.append('virginica')
    

    Before we implement Batch Gradient Descent in Python, we need to set a baseline to compare our own implementation against. To do that, we are going to fit scikit-learn's built-in Linear Regression model to our dataset.

    First, let's fit the LinearRegression() model imported from sklearn.linear_model to our dataset.

    from sklearn.linear_model import LinearRegression
    
    linreg = LinearRegression()
    
    # Fit petal width as a function of sepal length
    linreg.fit(
        X = sepal_length.reshape(-1,1),
        y = petal_width.reshape(-1,1)
    )
    
    print("Intercept: ", linreg.intercept_[0])
    # Intercept: -3.200215
    print("First coefficient:", linreg.coef_[0][0])
    # First coefficient: 0.75291757
    

    Once we have the intercept and the coefficient values, let's draw the regression line to see whether it lies close to most of the data points.

    import seaborn as sns
    import matplotlib.pyplot as plt
    
    # Scatter plot of the raw data, coloured by species
    sns.scatterplot(
        x = sepal_length,
        y = petal_width,
        hue = species_names
    )
    
    # Regression line from the fitted intercept and coefficient
    plt.plot(
        sepal_length,
        linreg.intercept_[0] +
        linreg.coef_[0][0] * sepal_length,
        color='red'
    )
    
    The Iris dataset regression line fitted with scikit-learn

    Clearly, the line lies very close to most of the data points. Now we want to see the MSE of this regression line.

    from sklearn.metrics import mean_squared_error
    
    linreg_predictions = linreg.predict(sepal_length.reshape(-1,1))
    linreg_mse = mean_squared_error(petal_width, linreg_predictions)
    print(f"The MSE is {linreg_mse}")
    # The MSE is 0.19101500769427357
    

    From the result we got from sklearn, the best regression line is

    y = -3.200215 + 0.75291757 \cdot x

    with an MSE value of around 0.191. The equation above is going to be the baseline for this experiment, used to determine how good our own Gradient Descent implementation is.
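    As a quick sanity check, we can plug these two numbers straight into the regression equation and confirm that they reproduce the same MSE. A minimal sketch; the hard-coded values are simply the intercept and coefficient printed above:

    # Sanity check: recompute the baseline MSE from the fitted parameters
    baseline_predictions = -3.200215 + 0.75291757 * sepal_length
    baseline_mse = np.mean((baseline_predictions - petal_width) ** 2)
    print(f"Baseline MSE recomputed by hand: {baseline_mse}")
    # Roughly 0.191, matching mean_squared_error above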

    Remember that in the first part of this series, we customised the cost function, the MSE, by multiplying it by one half, and named it the One Half Mean Squared Error.

    J(\Theta_0, \Theta_1) = \frac{1}{2N} \sum_{i=1}^N (\hat{y}_i - y_i)^2
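    In code, this cost function is just the ordinary MSE divided by two. A minimal sketch (the name one_half_mse is my own; in the bgd function below the same expression is computed inline):

    def one_half_mse(predictions, y):
        # One Half Mean Squared Error: J = (1 / 2N) * sum((y_hat - y)^2)
        return np.sum((predictions - y) ** 2) / (2 * len(y))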

    We have also derived the two equations that are responsible for updating \beta_0 and \beta_1, namely

    \begin{aligned} \beta_0 &= \beta_0 - \frac{\alpha}{N} \sum_{i=1}^N (f(x_i) - y_i) \\ \beta_1 &= \beta_1 - \frac{\alpha}{N} \sum_{i=1}^N (f(x_i) - y_i) \, x_i \end{aligned}

    Now, let's translate these three equations into Python code.
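    The bgd function below relies on a small predict helper that is not shown in the snippet; a minimal sketch, assuming it simply evaluates the current regression line:

    def predict(intercept, coefficient, x):
        # Evaluate the current regression line f(x) = b0 + b1 * x
        return intercept + coefficient * x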

    def bgd(x, y, epochs, df, alpha = 0.01):
        # Start from deliberately bad initial guesses
        intercept, coefficient = 2.0, -7.5
    
        # Cost of the initial parameters (One Half MSE)
        predictions = predict(intercept, coefficient, x)
        sum_error = np.sum((predictions - y) ** 2) / (2 * len(x))
        df.loc[0] = [intercept, coefficient, sum_error]
    
        for epoch in range(1, epochs):
            predictions = predict(intercept, coefficient, x)
            # Gradients of the cost with respect to b0 and b1
            b0_error = (1/len(x)) * np.sum(predictions - y)
            b1_error = (1/len(x)) * np.sum((predictions - y) * x)
            # Parameter updates
            intercept = intercept - alpha * b0_error
            coefficient = coefficient - alpha * b1_error
            # Cost for this epoch, stored alongside the updated parameters
            sum_error = np.sum((predictions - y) ** 2) / (2 * len(x))
            df.loc[epoch] = [intercept, coefficient, sum_error]
        return df
    

    The lines that subtract alpha times the gradients from intercept and coefficient are where the parameter updates happen.

    Once the parameters are updated, we calculate the cost function for each iteration and save it into the DataFrame we created.

    import pandas as pd
    
    bgd_loss = pd.DataFrame(columns=['intercept', 'coefficient', 'sum_error'])
    bgd_loss = bgd(sepal_length, petal_width, epochs = 10_000, df = bgd_loss)
    

    Below is a figure of what the regression lines look like at the 1,000th, the 5,000th, and the 10,000th iterations.

    BGD Loss Function Graph

    After 10,000 iterations, the MSE value of our own Gradient Descent implementation is 0.195, which is quite close to our baseline of 0.191.
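    To reproduce that number, we can take the intercept and coefficient from the last row of bgd_loss and compute the plain MSE of the corresponding predictions. A minimal sketch; the exact figure may differ slightly depending on the initial values and learning rate:

    # MSE of the final parameters found by our Batch Gradient Descent
    final_intercept = bgd_loss['intercept'].iloc[-1]
    final_coefficient = bgd_loss['coefficient'].iloc[-1]
    bgd_predictions = predict(final_intercept, final_coefficient, sepal_length)
    bgd_mse = np.mean((bgd_predictions - petal_width) ** 2)
    print(f"The MSE of our BGD fit is {bgd_mse}")
    # Roughly 0.195 after 10,000 epochs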

    Combining everything, here is how the regression line changes over time.

    Regression line changes over time

    Let's animate the movement of the regression lines.

    Regression line animation
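    One possible way to build such an animation is with matplotlib's FuncAnimation, redrawing the regression line from the parameters stored in bgd_loss. This is a sketch under that assumption, not the exact code used for the figure above:

    from matplotlib.animation import FuncAnimation
    
    fig, ax = plt.subplots()
    ax.scatter(sepal_length, petal_width, s=10)
    line, = ax.plot([], [], color='red')
    
    def update(frame):
        # Redraw the regression line using the parameters at this epoch
        b0 = bgd_loss['intercept'].iloc[frame]
        b1 = bgd_loss['coefficient'].iloc[frame]
        line.set_data(sepal_length, b0 + b1 * sepal_length)
        return line,
    
    # Sample every 100th epoch to keep the animation short
    anim = FuncAnimation(fig, update, frames=range(0, len(bgd_loss), 100))
    anim.save('bgd_regression_line.gif', writer='pillow')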

    Now, let's see how the intercept and the coefficient variables move on a contour map.

    The movement of the intercept and the coefficient variables on a contour map
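    A contour map like this can be produced by evaluating the One Half MSE on a grid of intercept and coefficient values and overlaying the path stored in bgd_loss. A sketch under those assumptions, not the exact plotting code:

    # Evaluate the cost surface on a grid of (intercept, coefficient) pairs
    b0_grid = np.linspace(-8, 4, 100)
    b1_grid = np.linspace(-8, 4, 100)
    B0, B1 = np.meshgrid(b0_grid, b1_grid)
    
    cost = np.zeros_like(B0)
    for i in range(B0.shape[0]):
        for j in range(B0.shape[1]):
            preds = B0[i, j] + B1[i, j] * sepal_length
            cost[i, j] = np.sum((preds - petal_width) ** 2) / (2 * len(sepal_length))
    
    plt.contour(B0, B1, cost, levels=30)
    # Overlay the path that Batch Gradient Descent took
    plt.plot(bgd_loss['intercept'], bgd_loss['coefficient'], color='red')
    plt.xlabel('intercept')
    plt.ylabel('coefficient')
    plt.show()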

    Here are some key points about Batch Gradient Descent:

    1. Batch Gradient Descent updates the parameters only once per epoch, after considering all the data points.
    2. Since the parameters are updated only once per pass over all the data points, the algorithm takes longer to converge.
    3. Not only does Batch Gradient Descent take a significant amount of time to converge when there are many data points, it also takes up a lot of computational resources.
    4. Batch Gradient Descent is therefore not the best algorithm for large datasets.
    5. Since the gradients we get over time are pretty much the same, the algorithm may get stuck in a local minimum. Noisier gradients would allow it to escape local minima.

    Since I wanted to present an easy and succinct explanation of Batch Gradient Descent, I decided not to include all the code for the sake of simplicity. If you want to see the implementation in more detail, please check the Python notebook on Kaggle linked here.

