AdamW
A theoretically improved Adam optimizer
AdamW is very similar to Adam; it differs only in how weight decay is implemented. Adam inherited its weight decay from vanilla SGD, where the decay term is simply added to the gradient as L2 regularization. With Adam's adaptive per-parameter step sizes, however, L2 regularization is no longer equivalent to true weight decay, so the decay a weight receives depends on its gradient history. AdamW fixes this by decoupling the decay term from the gradient-based update and applying it directly to the weights.
The authors of the original AdamW paper (Loshchilov & Hutter, "Decoupled Weight Decay Regularization") claimed that this modification closes much of Adam's generalization gap. Empirically, well-tuned hyperparameters tend to matter more than the choice between Adam and AdamW, but AdamW usually generalizes slightly better.
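
To make the difference concrete, here is a minimal sketch in plain Python of a single update to one scalar parameter. The function names (adam_style_step, adamw_style_step) and the hyperparameter defaults are illustrative, not taken from any library; the point is only where the weight_decay term enters the update.

import math

def adam_style_step(w, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999,
                    eps=1e-8, weight_decay=0.01):
    # Classic Adam with "L2 regularization": decay is folded into the gradient,
    # so it also flows through the moment estimates below.
    grad = grad + weight_decay * w
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

def adamw_style_step(w, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999,
                     eps=1e-8, weight_decay=0.01):
    # AdamW: the moment estimates see only the raw gradient ...
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # ... and the decay is applied directly to the weight, outside the adaptive step.
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

# One step from the same starting point, for comparison (t starts at 1).
print(adam_style_step(w=1.0, grad=0.5, m=0.0, v=0.0, t=1))
print(adamw_style_step(w=1.0, grad=0.5, m=0.0, v=0.0, t=1))

Because the decay term in the second variant never passes through the second-moment normalization, every weight is decayed at the same relative rate regardless of its gradient history, which is what weight decay is supposed to do.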

Code implementation

PyTorch

import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs.
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use AdamW; the optim package contains many other
# optimization algorithms. The first argument to the AdamW constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=0.01, amsgrad=False)

for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the Tensors it will update (which are the learnable weights
    # of the model).
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model parameters.
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its parameters.
    optimizer.step()
TensorFlow

# TensorFlow Addons is a repository of contributions that conform to well-established
# API patterns but implement new functionality not available in core TensorFlow.
!pip install tensorflow-addons

# importing the libraries
import tensorflow as tf
import tensorflow_addons as tfa

opt = tfa.optimizers.AdamW(learning_rate=0.1, weight_decay=0.01, amsgrad=False)
var1 = tf.Variable(10.0)
loss = lambda: (var1 ** 2) / 2.0  # d(loss)/d(var1) == var1
step_count = opt.minimize(loss, [var1]).numpy()

var1.numpy()

Further resources
