
01 May 2024

Adagrad

Introduction

Recall from the SGD with Nesterov post that we could minimize the cost function efficiently, with fewer oscillations, by taking the future gradient into consideration. The pathway to the local minimum resembles a ball rolling down a hill, much more so than with SGD with Momentum. However, the learning rate is still fixed for all parameters.

In this post, we will discuss the Adagrad optimization algorithm, which adapts the learning rate for each parameter; in other words, each parameter gets its own learning rate. Parameters that are updated frequently experience smaller updates, while parameters that are updated infrequently experience larger updates.

Mathematics of Adagrad

The parameter update rule is expressed as

\theta_{t+1, i} = \theta_{t, i} - \frac{\alpha}{\sqrt{G_{t,ii} + \epsilon}} \cdot g_{t,i}
g_{t,i} = \nabla_{\theta_t} J(\theta_{t, i})

where α is the learning rate, g_{t,i} is the gradient of the cost function with respect to parameter θ_i at time step t, G_{t,ii} is the sum of the squares of the gradients with respect to θ_i up to time step t, and ε is a small smoothing term that prevents division by zero.

Since we only have two parameters, we are going to need g_{t,0} to represent the gradient of the cost function with respect to the intercept, and g_{t,1} to represent the gradient of the cost function with respect to the coefficient. These two can be expressed as follows:

\begin{aligned} g_{t,0} &= \nabla_{\theta_0} J(\theta) \\ &= \frac{\partial}{\partial \theta_0} J(\theta) \\ &= \frac{1}{2} \frac{\partial}{\partial \theta_0} (\hat{y}_i - y_i)^2 \\ &= \hat{y}_i - y_i \end{aligned}
\begin{aligned} g_{t,1} &= \nabla_{\theta_1} J(\theta) \\ &= \frac{\partial}{\partial \theta_1} J(\theta) \\ &= \frac{1}{2} \frac{\partial}{\partial \theta_1} (\hat{y}_i - y_i)^2 \\ &= (\hat{y}_i - y_i)x_i \end{aligned}
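The last step in each derivation follows from the chain rule together with the linear model assumed in this series (an assumption consistent with the gradients above), ŷ_i = θ_0 + θ_1 x_i. Spelled out for the coefficient:

\frac{\partial}{\partial \theta_1} \frac{1}{2} (\hat{y}_i - y_i)^2 = (\hat{y}_i - y_i) \frac{\partial \hat{y}_i}{\partial \theta_1} = (\hat{y}_i - y_i) x_i

since ∂ŷ_i/∂θ_1 = x_i and, likewise, ∂ŷ_i/∂θ_0 = 1, which gives g_{t,0} = ŷ_i - y_i.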

Notice that in the parameter update rule, Adagrad eliminates the need to manually tune the learning rate for each parameter by dividing the learning rate by the square root of the sum of the squares of the gradients up to time t. In most cases, the learning rate α can be set to 0.01.

From the equation above, we can see that the learning rate is divided by √(G_{t,ii} + ε). This means the effective learning rate decreases rapidly as G_{t,ii} grows, and G_{t,ii} only grows as the number of iterations increases. Thus, parameters that are updated infrequently experience larger updates, and vice versa.
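To make the decay concrete, here is a small standalone sketch (not part of the post's training loop; the gradient values are made up) showing how the effective step size α / √(G + ε) shrinks as squared gradients accumulate:

import numpy as np

learning_rate = 0.01
eps = 1e-8
accumulated_squared_gradient = 0.0

# Pretend each step sees a gradient of roughly the same magnitude.
for step, gradient in enumerate([0.5, 0.5, 0.5, 0.5, 0.5], start=1):
    accumulated_squared_gradient += gradient ** 2
    effective_step = learning_rate / np.sqrt(accumulated_squared_gradient + eps)
    print(step, effective_step)

# The effective step size drops from 0.02 at step 1 to roughly 0.0089
# by step 5, even though the learning rate alpha itself stays fixed.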

Implementation of Adagrad

First, calculate the intercept and the coefficient gradients. Notice that g_{t,0} is just the prediction error at time t.

error = prediction - y[random_index]
 
intercept_gradient = error
coefficient_gradient = error * x[random_index]
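As a quick sanity check, here is a tiny example with made-up numbers (the prediction, target, and feature values are hypothetical, not from the post's dataset):

# Hypothetical values purely for illustration:
prediction, target, feature = 2.0, 1.5, 3.0
error = prediction - target             # 0.5
intercept_gradient = error              # 0.5
coefficient_gradient = error * feature  # 1.5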

Second, accumulate the squared gradients over time to form G_{t,ii}, the sum of the squares of the gradients up to time t.

accumulated_squared_intercept += intercept_gradient ** 2
accumulated_squared_coefficient += coefficient_gradient ** 2

Finally, update the intercept and the coefficient.

intercept -= (learning_rate / np.sqrt(accumulated_squared_intercept + eps)) * intercept_gradient
coefficient -= (learning_rate / np.sqrt(accumulated_squared_coefficient + eps)) * coefficient_gradient

Conclusion

Pathways of SGD and Adagrad along the 2D MSE contour.

You would notice that SGD has reached the bottom of the valley faster than Adagrad, in fewer than 100 iterations.

Unlike SGD, Adagrad requires more iterations to reach the bottom of the valley. The reason is Adagrad's aggressive learning rate decay: the effective learning rate keeps shrinking as the number of iterations increases.

Let's look at the following part in the Adagrad's parameter update equation:

\frac{\alpha}{\sqrt{G_{t,ii} + \epsilon}}

Remember that G_{t,ii} corresponds to accumulated_squared_intercept and accumulated_squared_coefficient in the code, accumulated up to time t. As the number of iterations increases, the accumulated sum grows, and the effective learning rate decreases significantly over time.
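As a rough worked example (the gradient magnitude here is made up, not measured from the experiment): if the squared gradients average about 0.25 per step, then after 1,000 steps G_{t,ii} ≈ 250, and with α = 0.01 the effective step size has shrunk to

\frac{\alpha}{\sqrt{G_{t,ii} + \epsilon}} \approx \frac{0.01}{\sqrt{250}} \approx 0.00063

which is far smaller than the initial learning rate.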

For Adagrad to reach the bottom of the valley, epochs should be set to 10,000.

Code

def adagrad(x, y, df, epochs = 100, learning_rate = 0.01, eps = 1e-8):
  # predict() comes from the earlier posts in this series.
  intercept, coefficient = -0.5, -0.75
  accumulated_squared_intercept = 0.0
  accumulated_squared_coefficient = 0.0

  # Log the initial state before any update.
  random_index = np.random.randint(len(x))
  prediction = predict(intercept, coefficient, x[random_index])
  mse = ((prediction - y[random_index]) ** 2) / 2
  df.loc[0] = [intercept, coefficient, mse]

  for epoch in range(1, epochs + 1):
    # Pick a single random sample (stochastic update).
    random_index = np.random.randint(len(x))
    prediction = predict(intercept, coefficient, x[random_index])
    error = prediction - y[random_index]

    intercept_gradient = error
    coefficient_gradient = error * x[random_index]

    # Accumulate the squared gradients (G_{t,ii}).
    accumulated_squared_intercept += intercept_gradient ** 2
    accumulated_squared_coefficient += coefficient_gradient ** 2

    # Per-parameter update with the adaptive step size.
    intercept -= (learning_rate / np.sqrt(accumulated_squared_intercept + eps)) * intercept_gradient
    coefficient -= (learning_rate / np.sqrt(accumulated_squared_coefficient + eps)) * coefficient_gradient

    mse = (error ** 2) / 2
    df.loc[epoch] = [intercept, coefficient, mse]

  return df
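Below is a minimal, hypothetical usage sketch. The synthetic data, the predict helper, and the DataFrame columns are assumptions for illustration, not the post's actual setup:

import numpy as np
import pandas as pd

# Assumed linear prediction helper, consistent with the gradients above.
def predict(intercept, coefficient, x_i):
  return intercept + coefficient * x_i

# Synthetic data for illustration: y roughly follows 0.4 + 0.8 * x.
rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, 200)
y = 0.4 + 0.8 * x + rng.normal(0, 0.05, 200)

df = pd.DataFrame(columns = ["intercept", "coefficient", "mse"])
df = adagrad(x, y, df, epochs = 10000, learning_rate = 0.01)
print(df.tail())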

References

  1. Sebastian Ruder. "An overview of gradient descent optimization algorithms." arXiv:1609.04747 (2016).
  2. Rachel Ward, Xiaoxia Wu, and Leon Bottou. "Adagrad with SGD: Efficient Learning of Descent Directions." arXiv:1802.09568 (2018).
  3. John Duchi, Elad Hazan, and Yoram Singer. "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization." Journal of Machine Learning Research (2011).