
Optimization

05 May 2024

Adam

Introduction

Adaptive Moment Estimation (Adam) is an optimization algorithm inspired by the Adagrad and RMSprop optimization algorithms. Recall that Adagrad and RMSprop have their own limitations.

In the case of Adagrad, the effective learning rate diminishes over time and eventually becomes too small, because the algorithm accumulates all of the previous squared gradients in its denominator. As a result, the model stops learning.

Even though RMSprop mitigates this problem by keeping only an exponentially decaying average of the previous squared gradients, it still adapts only the step size and does not use the direction of previous gradients to speed up convergence.
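To make the contrast concrete, the two update rules can be written side by side, using the standard notation from the literature, where \gamma denotes RMSprop's decay rate. Adagrad divides by the square root of the sum of all past squared gradients, so its denominator only grows, while RMSprop divides by an exponentially decaying average of them.

\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\sum_{\tau=1}^{t} g_\tau^2} + \epsilon} g_t

E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2

\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{E[g^2]_t} + \epsilon} g_t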

To address these limitations, Adam combines RMSprop's adaptive learning rate with the concept of momentum. Let's see how Adam works.

Mathematics of Adam

The parameter update rule is expressed as

\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t

where

- \theta_t is the parameter value at time step t
- \alpha is the learning rate
- \hat{m}_t is the bias-corrected first moment estimate
- \hat{v}_t is the bias-corrected second moment estimate
- \epsilon is a small constant that prevents division by zero

The corrected first moment estimate is expressed as

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t

where

- m_t is the first moment estimate, the exponentially decaying average of past gradients
- \beta_1 is the decay rate for the first moment estimate
- g_t is the gradient of the cost function at time step t

The corrected second moment estimate is expressed as

\hat{v}_t = \frac{v_t}{1 - \beta_2^t}

v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2

where

- v_t is the second moment estimate, the exponentially decaying average of past squared gradients
- \beta_2 is the decay rate for the second moment estimate
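Put together, these equations translate almost line for line into code. Below is a minimal sketch of a single Adam step for a generic parameter; the function name adam_step and its arguments are illustrative and not part of the implementation discussed later.

import numpy as np

def adam_step(theta, gradient, m, v, t, learning_rate=0.001, beta_1=0.9, beta_2=0.999, eps=1e-8):
  # Update the biased first and second moment estimates.
  m = beta_1 * m + (1 - beta_1) * gradient
  v = beta_2 * v + (1 - beta_2) * gradient ** 2
  # Correct the bias toward zero (t starts at 1, so the denominators are never zero).
  m_hat = m / (1 - beta_1 ** t)
  v_hat = v / (1 - beta_2 ** t)
  # Apply the parameter update.
  theta = theta - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
  return theta, m, v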

It's worth mentioning that the first moment estimate m_t and the second moment estimate v_t work just like momentum to maintain directionality, since they both take previous gradients into account. By accumulating previous gradients, Adam accelerates convergence, especially in regions with small gradients.

The reason the first moment estimate m_t has to be corrected is that it is initialized at zero and therefore biased toward zero at the beginning of training. Without correction, the biased estimates could lead to overly small or overly aggressive parameter updates, instability, or slow convergence. The same applies to the second moment estimate v_t.
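To see why the correction matters, consider the very first step with the common choice \beta_1 = 0.9 and m_0 = 0. The uncorrected estimate is only a tenth of the actual gradient, and the bias correction restores its scale:

m_1 = \beta_1 \cdot 0 + (1 - \beta_1) g_1 = 0.1 g_1

\hat{m}_1 = \frac{m_1}{1 - \beta_1^1} = \frac{0.1 g_1}{0.1} = g_1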

Since our model is a simple linear regression with only two parameters, whose prediction is \hat{y}_i = \theta_0 + \theta_1 x_i, we need g_{t,0} to represent the gradient of the cost function with respect to the intercept and g_{t,1} to represent the gradient of the cost function with respect to the coefficient. These two can be expressed as follows:

\begin{aligned} g_{t,0} &= \nabla_{\theta_0} J(\theta) \\ &= \frac{\partial}{\partial \theta_0} J(\theta) \\ &= \frac{1}{2} \frac{\partial}{\partial \theta_0} (\hat{y}_i - y_i)^2 \\ &= \hat{y}_i - y_i \end{aligned}

\begin{aligned} g_{t,1} &= \nabla_{\theta_1} J(\theta) \\ &= \frac{\partial}{\partial \theta_1} J(\theta) \\ &= \frac{1}{2} \frac{\partial}{\partial \theta_1} (\hat{y}_i - y_i)^2 \\ &= (\hat{y}_i - y_i) x_i \end{aligned}

Implementation of Adam

First, calculate the intercept gradient and the coefficient gradient. Notice that the intercept gradient g_{t,0} is simply the prediction error.

error = prediction - y[random_index]
intercept_gradient = error
coefficient_gradient = error * x[random_index]

Second, calculate the first moment estimate m_t and the second moment estimate v_t. Each parameter keeps its own pair of estimates, since the intercept and the coefficient have different gradients.

intercept_momentum = momentum_decay_rate * intercept_momentum + (1 - momentum_decay_rate) * intercept_gradient
coefficient_momentum = momentum_decay_rate * coefficient_momentum + (1 - momentum_decay_rate) * coefficient_gradient
intercept_variance = variance_decay_rate * intercept_variance + (1 - variance_decay_rate) * (intercept_gradient ** 2)
coefficient_variance = variance_decay_rate * coefficient_variance + (1 - variance_decay_rate) * (coefficient_gradient ** 2)

Third, correct the bias in the first moment estimate m_t and the second moment estimate v_t of each parameter.

corrected_intercept_momentum = intercept_momentum / (1 - momentum_decay_rate ** epoch)
corrected_coefficient_momentum = coefficient_momentum / (1 - momentum_decay_rate ** epoch)
corrected_intercept_variance = intercept_variance / (1 - variance_decay_rate ** epoch)
corrected_coefficient_variance = coefficient_variance / (1 - variance_decay_rate ** epoch)

Finally, update the intercept and the coefficient, each using its own corrected estimates.

intercept = intercept - learning_rate * corrected_intercept_momentum / (np.sqrt(corrected_intercept_variance) + eps)
coefficient = coefficient - learning_rate * corrected_coefficient_momentum / (np.sqrt(corrected_coefficient_variance) + eps)

Conclusion

Pathways of Adadelta, RMSprop, and Adam along the 2D MSE contour.

From the figure above, we can see that Adam took a more direct path down the hill than the other two. This illustrates how Adam accelerates convergence, especially in regions with small gradients, and how momentum helps it dampen oscillating updates.

Code

import numpy as np

def adam(x, y, df, epochs=1000, learning_rate=0.01, eps=1e-8):
  intercept, coefficient = -0.5, -0.75
  momentum_decay_rate, variance_decay_rate = 0.9, 0.999
  # Each parameter keeps its own first and second moment estimates.
  intercept_momentum, coefficient_momentum = 0.0, 0.0
  intercept_variance, coefficient_variance = 0.0, 0.0

  # Log the initial state before any update.
  random_index = np.random.randint(len(x))
  prediction = predict(intercept, coefficient, x[random_index])
  error = prediction - y[random_index]
  df.loc[0] = [intercept, coefficient, intercept_momentum, intercept_variance, (error ** 2) / 2]

  for epoch in range(1, epochs + 1):
    # Pick one random sample (stochastic update).
    random_index = np.random.randint(len(x))
    prediction = predict(intercept, coefficient, x[random_index])
    error = prediction - y[random_index]

    intercept_gradient = error
    coefficient_gradient = error * x[random_index]

    # First and second moment estimates for each parameter.
    intercept_momentum = momentum_decay_rate * intercept_momentum + (1 - momentum_decay_rate) * intercept_gradient
    coefficient_momentum = momentum_decay_rate * coefficient_momentum + (1 - momentum_decay_rate) * coefficient_gradient
    intercept_variance = variance_decay_rate * intercept_variance + (1 - variance_decay_rate) * (intercept_gradient ** 2)
    coefficient_variance = variance_decay_rate * coefficient_variance + (1 - variance_decay_rate) * (coefficient_gradient ** 2)

    # Bias correction (epoch starts at 1, so the denominators are never zero).
    corrected_intercept_momentum = intercept_momentum / (1 - momentum_decay_rate ** epoch)
    corrected_coefficient_momentum = coefficient_momentum / (1 - momentum_decay_rate ** epoch)
    corrected_intercept_variance = intercept_variance / (1 - variance_decay_rate ** epoch)
    corrected_coefficient_variance = coefficient_variance / (1 - variance_decay_rate ** epoch)

    # Parameter updates.
    intercept = intercept - learning_rate * corrected_intercept_momentum / (np.sqrt(corrected_intercept_variance) + eps)
    coefficient = coefficient - learning_rate * corrected_coefficient_momentum / (np.sqrt(corrected_coefficient_variance) + eps)

    # Log the intercept's corrected estimates alongside the loss of the sampled point.
    df.loc[epoch] = [intercept, coefficient, corrected_intercept_momentum, corrected_intercept_variance, (error ** 2) / 2]

  return df
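For completeness, here is a minimal usage sketch. The predict helper, the DataFrame column names, and the synthetic data are assumptions made for illustration; in the original setup they are defined elsewhere.

import numpy as np
import pandas as pd

# Assumed helper: a simple linear prediction (intercept + coefficient * x).
def predict(intercept, coefficient, x):
  return intercept + coefficient * x

# Illustrative synthetic data drawn from y = 0.5 + 2x plus a little noise.
rng = np.random.default_rng(42)
x = np.linspace(0, 1, 100)
y = 0.5 + 2.0 * x + rng.normal(0, 0.1, size=len(x))

# Column names are an assumption; adam() only requires five columns per row.
df = pd.DataFrame(columns=['intercept', 'coefficient', 'momentum', 'variance', 'loss'])
df = adam(x, y, df, epochs=1000, learning_rate=0.01)
print(df.tail())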
