February 1, 2022

Linear Regression

Continuous prediction made from scratch

Regression Algorithms (Series)

What is Linear Regression?

Linear Regression is one of many supervised machine learning algorithms. It is mostly used to predict the value of a continuous variable and to do forecasting. In other words, it can be used:

to see if one variable can be used to predict another variable.
to see if one variable is correlated with or dependent on another variable.

However, Linear Regression also comes with some limitations, such as:

It assumes that the relationship between the independent variable and the dependent variable is linear. However, in reality, the relationship between the independent variable and the dependent variable is not always linear.
It assumes that the independent variables are not correlated with each other. However, in reality, the independent variables are often correlated.
It’s sensitive to outliers, meaning that the presence of outliers can shift the regression line.
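To make the outlier point concrete, here is a minimal sketch on hypothetical data (not the Iris dataset). It fits a least-squares line with NumPy's `np.polyfit`, then replaces one point with an outlier and fits again. The data values and variable names are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical data lying exactly on the line y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# np.polyfit with deg=1 returns [slope, intercept] of the least-squares line.
slope_clean, intercept_clean = np.polyfit(x, y, deg=1)

# Replace the last point with an outlier far above the line.
y_outlier = y.copy()
y_outlier[-1] = 30.0
slope_out, intercept_out = np.polyfit(x, y_outlier, deg=1)

print(f"without outlier: slope={slope_clean:.2f}, intercept={intercept_clean:.2f}")
print(f"with outlier:    slope={slope_out:.2f}, intercept={intercept_out:.2f}")
```

On this toy data, a single outlier moves the slope from 2 to roughly 6.2, even though the other four points did not change.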

From the graph above, we can see that the relationship between $x$ and $y$ is linear and positive, since the blue line runs from the bottom left to the top right. That line is called a regression line, and it can be expressed using the following equation.

\hat{y} = \theta_0 + \theta_1x

where $\theta_0$ is the intercept and $\theta_1$ is the first coefficient. In high school or college, we are used to seeing the same equation in slope-intercept form, where $b$ is the intercept and $a$ is the slope.

\hat{y} = b + ax

In this post, we are going to use two features in the Iris dataset from scikit-learn, petal width and sepal length. Plotting them will give us the following visualization.

You should know that the intercept, $\theta_0$, is the value of $\hat{y}$ when $x = 0$, that is, the point where the regression line crosses the $y$-axis. Whether the line goes up or down depends on the sign of $\theta_1$, which is estimated from the data. If $\theta_0 = 0$, the regression line passes through the origin.
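The roles of the intercept and the coefficient can be checked with a few lines of code. This is a small sketch; the function name `predict_line` and the numbers are hypothetical and only serve as an illustration.

```python
def predict_line(theta_0, theta_1, x):
    """Return the prediction y_hat = theta_0 + theta_1 * x."""
    return theta_0 + theta_1 * x

# At x = 0 the prediction is exactly the intercept.
at_zero = predict_line(1.5, 0.5, 0)

# A positive coefficient makes the prediction grow with x ...
rising = predict_line(1.5, 0.5, 2)

# ... while a negative coefficient makes it shrink.
falling = predict_line(1.5, -0.5, 2)

print(at_zero, rising, falling)
```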

Expressing the equation like we did above is quite cryptic for people who don’t have a strong mathematical background. Since we are using the Iris dataset, we can translate the equation into a more readable form.

\text{petalwidth} = \theta_0 + \theta_1 \times \text{sepallength}

The translation above tells us how the two variables are related. Now you may be wondering whether sepal_length and petal_width are correlated or inversely correlated. First, let’s translate what the two graphs below are trying to tell us.

sepal_length and petal_width are said to be correlated when petal_width increases as sepal_length increases. Conversely, they are said to be inversely correlated when petal_width decreases as sepal_length increases.
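One way to put a number on this is the Pearson correlation coefficient, which is positive for correlated variables and negative for inversely correlated ones. The sketch below uses `np.corrcoef` on made-up values; the arrays are hypothetical and are not taken from the Iris dataset.

```python
import numpy as np

# Hypothetical measurements, for illustration only.
sepal_length = np.array([4.9, 5.4, 6.0, 6.5, 7.1])
petal_width_up = np.array([0.2, 0.4, 1.3, 1.8, 2.3])   # grows with sepal_length
petal_width_down = petal_width_up[::-1]                 # shrinks as sepal_length grows

# np.corrcoef returns a 2x2 correlation matrix; entry [0, 1] is the
# correlation between the two inputs.
r_up = np.corrcoef(sepal_length, petal_width_up)[0, 1]
r_down = np.corrcoef(sepal_length, petal_width_down)[0, 1]

print(f"correlated:           r = {r_up:.2f}")
print(f"inversely correlated: r = {r_down:.2f}")
```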

A regression line helps us predict the $y$ value for a given $x$ value. However, its predictions are not always accurate, since they depend heavily on $\theta_0$ and $\theta_1$. If the values of $\theta_0$ and $\theta_1$ are not tuned correctly, the regression line will sit far from most data points.

Estimating the Intercept and Coefficient

Previously, I mentioned that $\hat{y} = \theta_0 + \theta_1x$ is used to express the regression line. To be fair, we can’t just look at the data and say, “Ah ha! I can tell that $\theta_0$ is $0$ and $\theta_1$ is $0$.” Most of the time, it’s not feasible to keep guessing those values. Thus, it’s better to use an iterative method such as the Gradient Descent algorithm. The Gradient Descent algorithm repeatedly updates the $\theta_0$ and $\theta_1$ values based on the gradient of the cost function and the learning rate.

Since this example is just a simple linear model, we are going to use the following equations to update the intercept and coefficient:

\begin{gather*} \theta_0 = \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1, \dots, \theta_p) \\ \theta_1 = \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1, \dots, \theta_p) \\ \cdots \\ \theta_p = \theta_p - \alpha \frac{\partial}{\partial \theta_p} J(\theta_0, \theta_1, \dots, \theta_p) \end{gather*}

where $\alpha$ is the learning rate, $\theta_p$ is the $p$-th parameter, and $J$ is the cost function. The $p$-th feature $x_p$ does not appear as a separate factor; it enters through the partial derivative of $J$ with respect to $\theta_p$.

Since we only have $\theta_0$ and $\theta_1$, we can simplify the equations above to:

\begin{gather*} \theta_0 = \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) \\ \theta_1 = \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) \end{gather*}
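To see what one update looks like in practice, here is a minimal sketch of a single Gradient Descent step. It assumes the cost is half the mean squared error, $J = \frac{1}{2n}\sum_i (\hat{y}_i - y_i)^2$, so the factor of 2 from differentiation cancels; this is one common convention and matches the update used in the implementation later in this post. The function name `gradient_step` and the toy data are hypothetical.

```python
import numpy as np

def gradient_step(theta_0, theta_1, x, y, alpha):
    """Perform one gradient descent update for y_hat = theta_0 + theta_1 * x.

    Assumes J = (1 / (2n)) * sum((y_hat - y)^2), so that
    dJ/dtheta_0 = mean(errors) and dJ/dtheta_1 = mean(errors * x).
    """
    n = len(x)
    errors = (theta_0 + theta_1 * x) - y
    grad_0 = np.sum(errors) / n
    grad_1 = np.sum(errors * x) / n
    # Both parameters are updated from the same errors (simultaneous update).
    return theta_0 - alpha * grad_0, theta_1 - alpha * grad_1

# Hypothetical data on the line y = 2x.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

theta_0, theta_1 = gradient_step(0.0, 0.0, x, y, alpha=0.1)
print(theta_0, theta_1)
```

Starting from zero, both parameters move in the direction that reduces the error; repeating the step many times is what the training loop does.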

If you want to know more about the Gradient Descent algorithm, you can read the Gradient Descent series here. The series covers the intuition behind Gradient Descent, the math behind it, and its implementation in Python from scratch.

Example

In this section, there will be two examples of regression lines: one where the regression line is far from most data points, and one where it is close to most data points.

Figure: Most data points are far from the regression line

When the regression line, indicated by the green line, is given $x = 0$, it predicts $\hat{y} = -3$. In reality, $y$ should be $0.5$ when $x = 0$, meaning that the predicted value is far from the actual value.

There are many ways to measure the quality of regression lines, such as:

R Squared
Mean Squared Error (MSE)
Root Mean Squared Error (RMSE)
Mean Absolute Error (MAE)

However, we are going to use Mean Squared Error (MSE) this time.

MSE = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2

where $n$ is the number of data points, $\hat{y}_i$ is the predicted value, and $y_i$ is the actual value.
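The metrics listed above can all be computed in a few lines of NumPy. The sketch below is only an illustration: the function name `regression_metrics` and the sample values are hypothetical, and R Squared is computed as $1 - SS_{res} / SS_{tot}$.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE, and R squared for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_pred - y_true

    mse = np.mean(errors ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(errors))

    # R squared compares the residual error with the spread of y around its mean.
    ss_res = np.sum(errors ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot

    return {"mse": mse, "rmse": rmse, "mae": mae, "r2": r2}

# Hypothetical actual and predicted values.
metrics = regression_metrics([1, 2, 3, 4], [1.5, 2, 2.5, 4])
print(metrics)
```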

Let’s calculate the MSE for the graph above with the data points above.

\begin{aligned} MSE &= \frac{47.59}{5} \\ &= 9.518 \end{aligned}

Let’s see another example where the data points are close to the regression line.

Figure: Most data points are close to the regression line

Let’s calculate the MSE for this example to see if the MSE is small when the regression line is close to most data points.

\begin{aligned} MSE &= \frac{1.74}{5} \\ &= 0.348 \end{aligned}

From these two examples, we can see that when the regression line is close to most data points, the MSE is small. Conversely, when the regression line is far from most data points, the MSE is large.
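The same comparison can be reproduced on a small made-up dataset. In this sketch, the data points and the two candidate lines are hypothetical (they are not the points from the figures); the only thing it is meant to show is that the line closer to the data gets the smaller MSE.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between actual and predicted values."""
    return np.mean((y_pred - y_true) ** 2)

# Hypothetical data points lying on y = 0.5 + 0.5x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 0.5 + 0.5 * x

# Two candidate regression lines with the same slope but different intercepts.
far_line = -3.0 + 0.5 * x     # sits well below the data
close_line = 0.4 + 0.5 * x    # sits just below the data

mse_far = mse(y, far_line)
mse_close = mse(y, close_line)
print(f"far line MSE:   {mse_far:.4f}")
print(f"close line MSE: {mse_close:.4f}")
```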

Is having a small MSE enough to say that the regression line is good? Further investigation is needed to answer this question. However, I am not going to cover it in this post.

Python Implementation

First, prepare the dataset. We are going to use the Iris dataset from scikit-learn with two features, petal_width and sepal_length.

from sklearn.datasets import load_iris

iris = load_iris()
sepal_length = iris.data[:, 0]  # column 0: sepal length (cm)
petal_width = iris.data[:, 3]   # column 3: petal width (cm)
target = iris.target

# Map the numeric class labels to species names for the plot legend
species_dict = {0: 'setosa', 1: 'versicolor', 2: 'virginica'}
species_name = [species_dict[i] for i in target]

Note that the regression line is calculated using the following equation.

\hat{y} = \theta_0 + \theta_1x

b, x = 0, 0

regression_line = [
  b + sepal_length[i] * x for i in range(len(sepal_length))
]

where $\theta_0$ is the intercept and $\theta_1$ is the first coefficient. In this case, we are going to replace $\theta_0$ with b and $\theta_1$ with x. Without any assumption, we are going to start with b=0 and x=0.
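Because `iris.data` is a NumPy array, the same line can also be computed without an explicit loop. The sketch below compares both forms on a short hypothetical array standing in for sepal_length, with non-zero values of b and x (also hypothetical) so that the result is visible.

```python
import numpy as np

# Hypothetical stand-in for the sepal_length column.
sepal_length = np.array([5.1, 4.9, 6.3])
b, x = 0.5, 2.0

# Loop form, as in the snippet above.
regression_line_loop = [
    b + sepal_length[i] * x for i in range(len(sepal_length))
]

# Vectorized form: NumPy broadcasts the scalars over the whole array.
regression_line_vec = b + x * sepal_length

print(np.allclose(regression_line_loop, regression_line_vec))
```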

import matplotlib.pyplot as plt
import seaborn as sns

sns.scatterplot(x=sepal_length, y=petal_width, hue=species_name)
# Draw the current regression line in red
plt.plot(sepal_length, regression_line, c='r')
plt.title('Iris Dataset: Sepal Length vs Petal Width')
plt.xlabel('Sepal Length')
plt.ylabel('Petal Width')
plt.legend()
plt.show()

Figure: Regression line with b=0 and x=0

Looking at the plot above, we can tell that the regression line is far from most data points. Thus, we need to automatically tweak b and x to get a better regression line using the Gradient Descent algorithm.

# predict(), mean_squared_error(), and the *_history lists
# are defined in the Code section at the end of this post.
def linear_regression(x, y, epochs, alpha = 0.01):
    intercept, coefficient = 0, 0
    length = len(x)

    # note: range(1, epochs) performs epochs - 1 updates
    for _ in range(1, epochs):
        predictions = predict(intercept, coefficient, x)
        errors = predictions - y
        # gradient descent algorithm happens here
        intercept = intercept - alpha * np.sum(errors) / length
        coefficient = coefficient - alpha * np.sum(errors * x) / length
        # ends here
        mse_history.append(mean_squared_error(errors))
        intercept_history.append(intercept)
        coefficient_history.append(coefficient)

    return intercept, coefficient

After running the Gradient Descent algorithm above with epochs set to $10,000$, we get the following regression line where $b=-2.71$ and $x=0.67$.

Figure: Regression line with b=-2.71 and x=0.67
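One way to sanity-check estimates like these is to compare them with the closed-form least-squares solution, which exists for simple linear regression: the slope is the covariance of $x$ and $y$ divided by the variance of $x$, and the intercept is $\bar{y} - \theta_1\bar{x}$. Note that this is a different technique from the Gradient Descent approach used in this post. The sketch below runs both on a small hypothetical dataset (not the Iris data); all function names are assumptions made for the example.

```python
import numpy as np

def closed_form_fit(x, y):
    """Least-squares intercept and slope for y_hat = intercept + slope * x."""
    x_mean, y_mean = x.mean(), y.mean()
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return intercept, slope

def gradient_descent_fit(x, y, epochs=10_000, alpha=0.01):
    """Same update rule as in this post, without the history bookkeeping."""
    intercept, slope = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        errors = (intercept + slope * x) - y
        intercept -= alpha * np.sum(errors) / n
        slope -= alpha * np.sum(errors * x) / n
    return intercept, slope

# Hypothetical data, roughly on the line y = x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

intercept_cf, slope_cf = closed_form_fit(x, y)
intercept_gd, slope_gd = gradient_descent_fit(x, y)

print(f"closed form:      intercept={intercept_cf:.4f}, slope={slope_cf:.4f}")
print(f"gradient descent: intercept={intercept_gd:.4f}, slope={slope_gd:.4f}")
```

If the two results agree to several decimal places, Gradient Descent has most likely converged; a large gap usually means it needs more epochs or a different learning rate.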

You may have noticed that there are three lists: mse_history, intercept_history, and coefficient_history. I made them just to see how the MSE, intercept, and coefficient change over time.

Figure: MSE, Intercept, and Coefficient over time

I have also plotted the MSE, intercept, and coefficient in 3D space to see how they change over time.

Figure: MSE, Intercept, and Coefficient in 3D space

Note that the blue x marks the lowest MSE value calculated with the estimated intercept and coefficient we got from the Gradient Descent algorithm.

To see how I made this graph, you could check out the source code here.

Conclusion

Linear Regression is a supervised learning algorithm that is used to predict the value of a continuous variable.
The equation of the regression line is $\hat{y} = \theta_0 + \theta_1 \times x$.
The regression line is said to be good when it is close to most data points.
The quality of the regression line can be measured using evaluation metrics, one of which is Mean Squared Error (MSE).
The smaller the MSE, the closer the regression line is to most data points. Conversely, the larger the MSE, the farther the regression line is from most data points.
$b$ and $x$ (the intercept and the coefficient) can be estimated using the Gradient Descent algorithm.

Code

Simple Implementation

import numpy as np

# Histories recorded at every epoch, used for the plots above
mse_history = list()
intercept_history = list()
coefficient_history = list()

def predict(intercept, coefficient, data):
    return intercept + np.dot(coefficient, data)

def mean_squared_error(errors):
    return np.mean(np.square(errors))

def linear_regression(x, y, epochs, alpha = 0.01):
    intercept, coefficient = 0, 0
    length = len(x)

    # note: range(1, epochs) performs epochs - 1 updates
    for _ in range(1, epochs):
        predictions = predict(intercept, coefficient, x)
        errors = predictions - y
        # gradient descent algorithm happens here
        intercept = intercept - alpha * np.sum(errors) / length
        coefficient = coefficient - alpha * np.sum(errors * x) / length
        # ends here
        mse_history.append(mean_squared_error(errors))
        intercept_history.append(intercept)
        coefficient_history.append(coefficient)

    return intercept, coefficient

b, x = linear_regression(sepal_length, petal_width, 10000)

Printing b and x will give us -2.717366489030271 and 0.6718570469763597.

You could find the source code here.

Scikit-Learn-Style Implementation

import numpy as np

class LinearRegression:
    def __init__(self):
        self.iterations = 10_000
        self.learning_rate = 0.01
        self.intercept = 0
        self.coefficients = None
        self.X = None
        self.y = None
        self.length = 0
        self.loss_history = list()

    def _intercept(self):
        return self.intercept

    def _coefficients(self):
        return self.coefficients

    def _loss_history(self):
        return self.loss_history

    def mean_squared_error(self, predictions):
        return np.sum(np.square(predictions - self.y)) / self.length

    def predict(self, X):
        return self.intercept + np.dot(X, self.coefficients)

    def update_params(self, predictions):
        error = predictions - self.y
        self.intercept -= self.learning_rate * np.sum(error) / self.length
        self.coefficients -= self.learning_rate * (np.dot(self.X.T, error) / self.length)

    def fit(self, X, y):
        self.X = np.array(X)
        if len(self.X.shape) == 1:
            # To support 1D data, otherwise
            # self.coefficients = np.zeros(self.X.shape[1]) will fail
            self.X = self.X.reshape(-1, 1)
        self.y = np.array(y)
        self.length = len(self.y)
        self.coefficients = np.zeros(self.X.shape[1])

        for _ in range(self.iterations):
            predictions = self.predict(self.X)
            self.update_params(predictions)
            self.loss_history.append(self.mean_squared_error(predictions))

lin_reg = LinearRegression()
lin_reg.fit(sepal_length, petal_width)
print(f"intercept: {lin_reg._intercept()}, coefficients: {lin_reg._coefficients()}")

Printing the intercept and the coefficient values will give us: intercept: -2.7174583375849535 and coefficients: [0.67187247].

You could find the source code here.
