
01 May 2024

Adagrad

Introduction

Recall from the SGD with Nesterov post that we could minimize the cost function efficiently, with fewer oscillations, by taking the future gradient into consideration. The pathway to the local minimum resembles a ball rolling down a hill, much more so than with SGD with Momentum. However, the learning rate is still fixed for all parameters.

In this post, we will discuss the Adagrad optimization algorithm, which adapts the learning rate for each parameter; in other words, each parameter gets its own learning rate. Parameters that are updated frequently experience smaller updates, while parameters that are updated infrequently experience larger updates.

Mathematics of Adagrad

The parameter update rule is expressed as

\theta_{t+1, i} = \theta_{t, i} - \frac{\alpha}{\sqrt{G_{t,ii} + \epsilon}} \cdot g_{t,i}
g_{t,i} = \nabla_{\theta_t} J(\theta_{t, i})

where α is the learning rate, g_{t,i} is the gradient of the cost function with respect to parameter θ_i at time step t, G_{t,ii} is the sum of the squares of the gradients with respect to θ_i up to time step t, and ε is a small smoothing term that prevents division by zero.

Since we only have two parameters, we are going to need g_{t,0} to represent the gradient of the cost function with respect to the intercept, and g_{t,1} to represent the gradient of the cost function with respect to the coefficient. These two can be expressed as follows:

\begin{aligned} g_{t,0} &= \nabla_{\theta_0} J(\theta) \\ &= \frac{\partial}{\partial \theta_0} J(\theta) \\ &= \frac{1}{2} \frac{\partial}{\partial \theta_0} (\hat{y}_i - y_i)^2 \\ &= \hat{y}_i - y_i \end{aligned}
\begin{aligned} g_{t,1} &= \nabla_{\theta_1} J(\theta) \\ &= \frac{\partial}{\partial \theta_1} J(\theta) \\ &= \frac{1}{2} \frac{\partial}{\partial \theta_1} (\hat{y}_i - y_i)^2 \\ &= (\hat{y}_i - y_i)x_i \end{aligned}
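The last step in each derivation follows from the chain rule together with the linear model assumed in this series (an assumption consistent with the gradients above), ŷ_i = θ_0 + θ_1 x_i. Spelled out for the coefficient:

\frac{\partial}{\partial \theta_1} \frac{1}{2} (\hat{y}_i - y_i)^2 = (\hat{y}_i - y_i) \frac{\partial \hat{y}_i}{\partial \theta_1} = (\hat{y}_i - y_i) x_i

since ∂ŷ_i/∂θ_1 = x_i and, likewise, ∂ŷ_i/∂θ_0 = 1, which gives g_{t,0} = ŷ_i - y_i.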

Notice that in the parameter update rule, Adagrad eliminates the need to manually tune the learning rate for each parameter by dividing the learning rate by the square root of the sum of the squares of the gradients up to time t. In most cases, the learning rate α can be set to 0.01.

From the equation above, we can see that the learning rate is divided by √(G_{t,ii} + ε). This means the effective learning rate decreases rapidly as G_{t,ii} grows, and G_{t,ii} only grows as the number of iterations increases. Thus, parameters that are updated infrequently experience larger updates, and vice versa.
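To make the decay concrete, here is a small standalone sketch (not part of the post's training loop; the gradient values are made up) showing how the effective step size α / √(G + ε) shrinks as squared gradients accumulate:

import numpy as np

learning_rate = 0.01
eps = 1e-8
accumulated_squared_gradient = 0.0

# Pretend each step sees a gradient of roughly the same magnitude.
for step, gradient in enumerate([0.5, 0.5, 0.5, 0.5, 0.5], start=1):
    accumulated_squared_gradient += gradient ** 2
    effective_step = learning_rate / np.sqrt(accumulated_squared_gradient + eps)
    print(step, effective_step)

# The effective step size drops from 0.02 at step 1 to roughly 0.0089
# by step 5, even though the learning rate alpha itself stays fixed.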

Implementation of Adagrad

First, calculate the intercept and the coefficient gradients. Notice that g_{t,0} is just the prediction error at time t.

error = prediction - y[random_index]
 
intercept_gradient = error
coefficient_gradient = error * x[random_index]
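As a quick sanity check, here is a tiny example with made-up numbers (the prediction, target, and feature values are hypothetical, not from the post's dataset):

# Hypothetical values purely for illustration:
prediction, target, feature = 2.0, 1.5, 3.0
error = prediction - target             # 0.5
intercept_gradient = error              # 0.5
coefficient_gradient = error * feature  # 1.5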

Second, accumulate the squared gradients over time to form G_{t,ii}, the sum of the squares of the gradients up to time t.

accumulated_squared_intercept += intercept_gradient ** 2
accumulated_squared_coefficient += coefficient_gradient ** 2

Finally, update the intercept and the coefficient.

intercept -= (learning_rate / np.sqrt(accumulated_squared_intercept + eps)) * intercept_gradient
coefficient -= (learning_rate / np.sqrt(accumulated_squared_coefficient + eps)) * coefficient_gradient

Conclusion

Pathways of SGD and Adagrad along the 2D MSE contour.

You would notice that SGD has reached the bottom of the valley faster than Adagrad, in fewer than 100 iterations.

Unlike SGD, Adagrad requires more iterations to reach the bottom of the valley. The reason is Adagrad's aggressive learning rate decay: the effective learning rate keeps shrinking as the number of iterations increases.

Let's look at the following part in the Adagrad's parameter update equation:

\frac{\alpha}{\sqrt{G_{t,ii} + \epsilon}}

Remember that G_{t,ii} corresponds to accumulated_squared_intercept and accumulated_squared_coefficient in the code, accumulated up to time t. As the number of iterations increases, the accumulated sum grows, and the effective learning rate decreases significantly over time.
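As a rough worked example (the gradient magnitude here is made up, not measured from the experiment): if the squared gradients average about 0.25 per step, then after 1,000 steps G_{t,ii} ≈ 250, and with α = 0.01 the effective step size has shrunk to

\frac{\alpha}{\sqrt{G_{t,ii} + \epsilon}} \approx \frac{0.01}{\sqrt{250}} \approx 0.00063

which is far smaller than the initial learning rate.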

For Adagrad to reach the bottom of the valley, epochs should be set to 10,000.

Code

def adagrad(x, y, df, epochs = 100, learning_rate = 0.01, eps = 1e-8):
  # predict() comes from the earlier posts in this series.
  intercept, coefficient = -0.5, -0.75
  accumulated_squared_intercept = 0.0
  accumulated_squared_coefficient = 0.0

  # Log the initial state before any update.
  random_index = np.random.randint(len(x))
  prediction = predict(intercept, coefficient, x[random_index])
  mse = ((prediction - y[random_index]) ** 2) / 2
  df.loc[0] = [intercept, coefficient, mse]

  for epoch in range(1, epochs + 1):
    # Pick a single random sample (stochastic update).
    random_index = np.random.randint(len(x))
    prediction = predict(intercept, coefficient, x[random_index])
    error = prediction - y[random_index]

    intercept_gradient = error
    coefficient_gradient = error * x[random_index]

    # Accumulate the squared gradients (G_{t,ii}).
    accumulated_squared_intercept += intercept_gradient ** 2
    accumulated_squared_coefficient += coefficient_gradient ** 2

    # Per-parameter update with the adaptive step size.
    intercept -= (learning_rate / np.sqrt(accumulated_squared_intercept + eps)) * intercept_gradient
    coefficient -= (learning_rate / np.sqrt(accumulated_squared_coefficient + eps)) * coefficient_gradient

    mse = (error ** 2) / 2
    df.loc[epoch] = [intercept, coefficient, mse]

  return df
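Below is a minimal, hypothetical usage sketch. The synthetic data, the predict helper, and the DataFrame columns are assumptions for illustration, not the post's actual setup:

import numpy as np
import pandas as pd

# Assumed linear prediction helper, consistent with the gradients above.
def predict(intercept, coefficient, x_i):
  return intercept + coefficient * x_i

# Synthetic data for illustration: y roughly follows 0.4 + 0.8 * x.
rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, 200)
y = 0.4 + 0.8 * x + rng.normal(0, 0.05, 200)

df = pd.DataFrame(columns = ["intercept", "coefficient", "mse"])
df = adagrad(x, y, df, epochs = 10000, learning_rate = 0.01)
print(df.tail())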

References

  1. Sebastian Ruder. "An overview of gradient descent optimization algorithms." arXiv:1609.04747 (2016).
  2. Rachel Ward, Xiaoxia Wu, and Leon Bottou. "Adagrad with SGD: Efficient Learning of Descent Directions." arXiv:1802.09568 (2018).
  3. John Duchi, Elad Hazan, and Yoram Singer. "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization." Journal of Machine Learning Research (2011).