Adam can be understood as scaling weight updates inversely proportionally to a (scaled) L2 norm of the past gradients. AdaMax extends this idea to the so-called infinity norm (max) of past gradients.
The calculation of the infinity norm exhibits stable behavior, unlike norms for other large values of $p$. Mathematically, the exponentially weighted infinity norm $u_t$ can be written as,
$u_t=\beta_2^\infty \cdot v_{t-1} + (1-\beta_2^\infty)\,|g_t|^\infty=\max(\beta_2 \cdot v_{t-1},\,|g_t|)$
This shows that the infinity norm is mathematically equivalent to taking the maximum of the (decayed) past gradient magnitudes up to step $t$. The update is then similar to Adam,
$\theta_{t+1}=\theta_t- \eta \cdot \frac{m_t}{u_t}$
Again, here $\eta$ is the base learning rate and $m_t$ is the first-moment (momentum) estimate, as discussed for Adam.
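To make the update rule concrete, here is a minimal NumPy sketch of a single AdaMax step. The function name `adamax_step`, the variable names, and the default hyperparameter values are illustrative assumptions, not part of any library API; the sketch also includes the bias correction of $m_t$ from the original paper and a small `eps` for numerical safety, which the simplified formula above omits.

import numpy as np

def adamax_step(theta, grad, m, u, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment: exponential moving average of gradients (as in Adam).
    m = beta1 * m + (1 - beta1) * grad
    # Infinity norm: element-wise max of the decayed previous norm and |g_t|.
    u = np.maximum(beta2 * u, np.abs(grad))
    # Bias-correct the first moment, then update theta <- theta - lr * m_hat / u.
    m_hat = m / (1 - beta1 ** t)
    theta = theta - lr * m_hat / (u + eps)
    return theta, m, u

# Example: one update for a small parameter vector (toy values).
theta, m, u = np.zeros(3), np.zeros(3), np.zeros(3)
theta, m, u = adamax_step(theta, np.array([0.1, -0.2, 0.3]), m, u, t=1)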

## Betas

AdaMax maintains an exponential moving average of the gradient (the first moment) and the exponentially weighted infinity norm. Mathematically, these are given by,
$m_t=\beta_1 \cdot m_{t-1}+(1-\beta_1)\cdot g_t\\u_t=\max(\beta_2 \cdot u_{t-1},\,|g_t|)$
Here $\beta_1$ and $\beta_2$ are the betas: the exponential decay rates of the first moment estimate and of the exponentially weighted infinity norm, respectively.
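As a rough illustration of what the betas control, the short sketch below traces $m_t$ and $u_t$ over a toy gradient sequence; the gradient values are made up purely for illustration.

beta1, beta2 = 0.9, 0.999           # typical decay rates for m_t and u_t
grads = [0.5, -1.0, 0.2, 0.0, 0.3]  # toy gradient sequence (illustrative only)

m, u = 0.0, 0.0
for t, g in enumerate(grads, start=1):
    m = beta1 * m + (1 - beta1) * g   # first moment decays at rate beta1
    u = max(beta2 * u, abs(g))        # infinity norm decays at rate beta2
    print(f"t={t}: m_t={m:.4f}, u_t={u:.4f}")

A larger $\beta_1$ makes $m_t$ average over more past gradients, while a $\beta_2$ close to 1 makes the infinity norm $u_t$ forget large past gradients only slowly.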

## Code Implementation

# Importing the library.
import torch
import torch.nn as nn

x = torch.randn(10, 3)
y = torch.randn(10, 2)

# Build a fully connected layer.
linear = nn.Linear(3, 2)

# Build the MSE loss function and the Adamax optimizer
# (the values below are PyTorch's defaults for Adamax).
criterion = nn.MSELoss()
optimizer = torch.optim.Adamax(linear.parameters(), lr=0.002,
                               betas=(0.9, 0.999), eps=1e-08, weight_decay=0)

# Forward pass.
pred = linear(x)

# Compute loss.
loss = criterion(pred, y)
print('loss:', loss.item())

# Backward pass and a single optimization step.
loss.backward()
optimizer.step()
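In practice, the step above is repeated inside a training loop. A minimal sketch follows; the number of epochs and the reuse of the same `x`, `y` batch are arbitrary choices for illustration.

# Minimal training loop sketch: repeat forward, backward, and update steps.
for epoch in range(100):
    optimizer.zero_grad()          # clear gradients from the previous step
    pred = linear(x)               # forward pass
    loss = criterion(pred, y)      # compute MSE loss
    loss.backward()                # backpropagate
    optimizer.step()               # AdaMax parameter update

print('final loss:', criterion(linear(x), y).item())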