AdaMax
Adam with infinity norm.
Adam can be understood as updating weights inversely proportionally to the scaled L2 norm (squared) of past gradients. AdaMax extends this to the so-called infinity norm (max norm) of past gradients.
The calculation of the infinity norm exhibits stable numerical behavior. Mathematically, it can be written as
u_t=\beta_2^\infty v_{t-1} + (1-\beta_2^\infty)|g_t|^\infty=\max(\beta_2 \cdot v_{t-1},|g_t|)
This shows that the infinity norm is equivalent to taking the maximum of the decayed previous value and the magnitude of the current gradient, i.e., the (decayed) maximum gradient magnitude seen up to step t. The parameter update is then similar to Adam's:
\theta_{t+1}=\theta_t-\eta \cdot \frac{m_t}{u_t}
Again, here \eta is the base learning rate and m_t is the momentum (first-moment) estimate, as discussed for Adam.
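To make the update concrete, here is a minimal NumPy sketch of a single AdaMax step following the formulas above. The function name adamax_step and the argument defaults are illustrative only, and the bias correction of m_t used in the original paper is omitted to mirror the equations as written.

import numpy as np

def adamax_step(theta, grad, m, u, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    # Exponential moving average of the gradient (first moment).
    m = beta1 * m + (1 - beta1) * grad
    # Exponentially weighted infinity norm: decayed history vs. current |g_t|.
    u = np.maximum(beta2 * u, np.abs(grad))
    # Parameter update; eps guards against division by zero early on.
    theta = theta - lr * m / (u + eps)
    return theta, m, u

# Toy usage: minimize f(theta) = theta^2.
theta = np.array([1.0])
m, u = np.zeros_like(theta), np.zeros_like(theta)
for _ in range(1000):
    theta, m, u = adamax_step(theta, 2 * theta, m, u)
print(theta)  # theta has moved toward the minimum at 0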

Major Parameters

Betas

AdaMax maintains an exponential moving average of the gradient and an exponentially weighted infinity norm, given by
V_{dw}=\beta_1 \cdot V_{dw}+(1-\beta_1)\cdot \partial w\\u_t=\beta_2^\infty v_{t-1} + (1-\beta_2^\infty)|g_t|^\infty
Here \beta_1 and \beta_2 are the betas: the exponential decay rates of the first moment estimate and of the exponentially weighted infinity norm, respectively.
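The short, purely illustrative sketch below shows what the two decay rates control: \beta_1 smooths the gradient into the first moment, while \beta_2 determines how quickly u_t forgets a past gradient spike (the numbers are made up).

beta1, beta2 = 0.9, 0.999
m, u = 0.0, 0.0
grads = [5.0, 0.1, 0.1, 0.1, 0.1]     # a spike followed by small gradients
for t, g in enumerate(grads, start=1):
    m = beta1 * m + (1 - beta1) * g   # first moment decays at rate beta1
    u = max(beta2 * u, abs(g))        # infinity norm: old peak decays at rate beta2
    print(f"t={t}  m={m:.3f}  u={u:.3f}")

With \beta_2 close to 1, u_t remembers the spike for many steps, keeping the effective step size small; a smaller \beta_2 forgets it faster.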

Code Implementation

# Importing the libraries
import torch
import torch.nn as nn

x = torch.randn(10, 3)
y = torch.randn(10, 2)

# Build a fully connected layer.
linear = nn.Linear(3, 2)

# Build the MSE loss function.
criterion = nn.MSELoss()

# Optimization method using Adamax.
optimizer = torch.optim.Adamax(linear.parameters(), lr=0.002,
                               betas=(0.9, 0.999), eps=1e-08, weight_decay=0)

# Forward pass.
pred = linear(x)

# Compute loss.
loss = criterion(pred, y)
print('loss:', loss.item())

# Backward pass: compute gradients before the optimizer step.
loss.backward()

# Single AdaMax parameter update.
optimizer.step()
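In a real script the forward, backward, and step calls are repeated. A minimal training loop reusing the linear, criterion, and optimizer objects from the snippet above could look like this; note that optimizer.zero_grad() clears the gradients accumulated in the previous iteration.

for epoch in range(100):
    optimizer.zero_grad()        # clear gradients from the previous step
    pred = linear(x)             # forward pass
    loss = criterion(pred, y)    # compute the MSE loss
    loss.backward()              # backpropagate to populate .grad
    optimizer.step()             # one AdaMax parameter update

print('final loss:', criterion(linear(x), y).item())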