ASGD
Stochastic gradient descent with averaged parameters
Averaged Stochastic Gradient Descent, abbreviated as ASGD, averages the weights computed at every iteration and uses that average as the model's weights.
Recall the standard SGD update rule:

w_{t+1} = w_t - \eta \nabla Q(w_t)

where w_t is the weight tensor, \eta is the base learning rate, and \nabla Q(w_t) is the gradient of the objective function evaluated at w_t.
With this update rule, SGD assigns the most recently computed weight to the model. ASGD instead assigns the averaged weight \overline{w},

\overline{w} = \frac{1}{N} \sum_{t=1}^{N} w_t

where w_t is the weight tensor calculated in iteration t and N is the number of iterations.
Such averaging smooths the trajectory of the iterates, which is helpful when the data, and therefore the gradients, are noisy.
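The averaging itself can be illustrated with a short, self-contained sketch. The snippet below is only a toy illustration of the formula above, with random tensors standing in for real gradients; it is not PyTorch's internal ASGD code.

import torch

lr = 0.01
w = torch.randn(3, 2)        # current weight w_t
w_avg = w.clone()            # running average, starts at w_1

for t in range(1, 101):
    grad = torch.randn_like(w)    # stand-in for the gradient of Q(w_t)
    w = w - lr * grad             # SGD step: w_{t+1} = w_t - eta * grad
    # incremental form of the average: after this line w_avg equals
    # (1 / (t + 1)) * (w_1 + w_2 + ... + w_{t+1})
    w_avg = w_avg + (w - w_avg) / (t + 1)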

Major Parameters

Lambda

The decay term applied to the weights at every step (passed as the lambd argument in PyTorch; default 1e-4).

Alpha

The power used when updating the per-step learning rate (passed as the alpha argument in PyTorch; default 0.75).

t0

The optimization step at which averaging starts (passed as the t0 argument in PyTorch; default 1e6). If training runs for fewer iterations than t0, the averaging never takes effect. The sketch after this section shows how the three parameters interact.
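To see how these parameters fit together, the sketch below approximates the schedule used inside torch.optim.ASGD: lambd and alpha shrink the per-step learning rate, while t0 decides when the averaging coefficient drops below 1. Treat the exact formulas as an assumption drawn from the PyTorch implementation, not as documented API behavior.

# Approximation of ASGD's internal schedule (an assumption, not the library code):
def asgd_schedule(lr, lambd, alpha, t0, step):
    # per-step learning rate: decays polynomially with the step count,
    # governed by lambd and alpha
    eta = lr / (1 + lambd * lr * step) ** alpha
    # averaging coefficient: stays at 1 (the average simply tracks the current
    # weights) until step exceeds t0, then becomes 1 / (step - t0)
    mu = 1.0 / max(1, step - t0)
    return eta, mu

# With the defaults (lambd=1e-4, alpha=0.75, t0=1e6), averaging only begins
# after one million steps.
print(asgd_schedule(lr=0.01, lambd=1e-4, alpha=0.75, t0=1e6, step=10))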

Code Implementation

# importing the library
import torch
import torch.nn as nn

# Toy input and target tensors.
x = torch.randn(10, 3)
y = torch.randn(10, 2)

# Build a fully connected layer.
linear = nn.Linear(3, 2)

# Build MSE loss function and optimizer.
criterion = nn.MSELoss()

# Optimization method using ASGD
optimizer = torch.optim.ASGD(linear.parameters(), lr=0.01, lambd=0.0001,
                             alpha=0.75, t0=1000000.0, weight_decay=0)

# Forward pass.
pred = linear(x)

# Compute loss.
loss = criterion(pred, y)
print('loss:', loss.item())

# Backward pass to compute gradients, then one ASGD update step.
loss.backward()
optimizer.step()
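The snippet above performs a single forward/backward pass and one optimizer step. In practice ASGD runs inside a training loop for many steps; the sketch below is one possible way to do that, with t0 lowered from its default so the averaging phase is actually reached in a short toy run.

import torch
import torch.nn as nn

x = torch.randn(10, 3)
y = torch.randn(10, 2)

model = nn.Linear(3, 2)
criterion = nn.MSELoss()
optimizer = torch.optim.ASGD(model.parameters(), lr=0.01, lambd=1e-4,
                             alpha=0.75, t0=100, weight_decay=0)

for step in range(500):
    optimizer.zero_grad()           # clear gradients from the previous step
    loss = criterion(model(x), y)   # forward pass and loss
    loss.backward()                 # backward pass to compute gradients
    optimizer.step()                # ASGD update; averaging begins after t0 steps

print('final loss:', loss.item())

Note that PyTorch keeps the averaged parameters in the optimizer's internal state rather than writing them back into the model automatically, so evaluating with the averaged weights requires copying them out of the optimizer state yourself.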