ASGD
Stochastic gradient descent with averaged parameters
Averaged Stochastic Gradient Descent (ASGD) keeps an average of the weights computed at every iteration. The underlying SGD update rule is
$w_{t+1}=w_t-\eta \nabla Q(w_t)$
where $w_t$ is the weight tensor, $\eta$ is the base learning rate, and $\nabla Q(w_t)$ is the gradient of the objective function evaluated at $w_t$.
With this update rule, SGD assigns the most recently computed weight to the model. ASGD instead assigns the averaged weight $\overline{w}$,
$\overline{w}=\frac{1}{N} \sum_{t=1}^N w_t$
where $w_t$ is the weight tensor computed in iteration $t$.
Such averaging is used when the data, and therefore each gradient estimate, is noisy: the average fluctuates less than the individual iterates.
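The averaging formula can be illustrated with a minimal sketch in plain Python. Everything here is an assumption made for illustration: the one-dimensional objective $Q(w)=\frac{1}{2}(w-3)^2$, the Gaussian gradient noise, and the function name `sgd_with_averaging` are hypothetical and not part of any library. The running mean is updated incrementally, which gives the same result as summing all $w_t$ and dividing by $N$.

```python
import random


def sgd_with_averaging(steps=2000, lr=0.05, seed=0):
    """Run noisy SGD on Q(w) = 0.5 * (w - 3)^2 and return (last w, averaged w)."""
    rng = random.Random(seed)
    w = 0.0      # current iterate w_t
    w_bar = 0.0  # running mean of w_1 .. w_t
    for t in range(1, steps + 1):
        # Exact gradient (w - 3) plus zero-mean noise (illustrative assumption).
        grad = (w - 3.0) + rng.gauss(0.0, 1.0)
        w = w - lr * grad           # plain SGD update
        w_bar += (w - w_bar) / t    # incremental form of (1/t) * sum of w_1..w_t
    return w, w_bar


w_last, w_avg = sgd_with_averaging()
print("last iterate:", w_last)
print("averaged iterate:", w_avg)
```

In this toy setting the last iterate keeps jittering around the minimum at 3, while the averaged iterate settles close to it.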

• Lambda
• Alpha
• T0

## Lambda

Lambda is the decay term for the past weights used in the average. In PyTorch it is passed as the `lambd` argument, since `lambda` is a reserved word in Python.

## Alpha

Alpha is the exponent (power) used when updating the learning rate. In PyTorch it is passed as the `alpha` argument.
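To make the role of the exponent concrete, here is a small sketch of a learning-rate schedule of the form $\eta_t = \eta / (1 + \lambda \eta t)^{\alpha}$. This formula is an assumption: it is believed to match what PyTorch's ASGD does internally, but it is not stated in the text above, and the helper name `asgd_learning_rate` is hypothetical.

```python
def asgd_learning_rate(step, lr=0.01, lambd=1e-4, alpha=0.75):
    """Assumed ASGD schedule: lr / (1 + lambd * lr * step) ** alpha."""
    return lr / (1.0 + lambd * lr * step) ** alpha


# The learning rate starts at lr and shrinks slowly as the step count grows.
for step in (0, 1_000, 100_000, 1_000_000):
    print(step, asgd_learning_rate(step))
```

Under this assumption a larger alpha makes the learning rate fall faster, and alpha set to 0 keeps it constant.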

## T0

T0 is the optimization step at which averaging starts (the `t0` argument in PyTorch). If the total number of iterations is lower than T0, no averaging takes place.
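The effect of the starting step can be sketched in plain Python. The rule used here, an averaging weight of `1 / max(1, step - t0)`, is an assumption about how an implementation such as PyTorch's applies `t0`; it is not given in the text above, and the function name `averaged_iterates` is hypothetical.

```python
def averaged_iterates(weights, t0):
    """Return the averaged value after each step, averaging only once step > t0."""
    ax = 0.0
    history = []
    for step, w in enumerate(weights, start=1):
        # Assumed rule: the weight is 1 up to t0 (the average just copies w),
        # then 1/1, 1/2, 1/3, ... so that ax becomes a running mean.
        mu = 1.0 / max(1, step - t0)
        ax = ax + (w - ax) * mu
        history.append(ax)
    return history


weights = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(averaged_iterates(weights, t0=3))    # averaging begins after step 3
print(averaged_iterates(weights, t0=100))  # t0 never reached: no averaging
```

With `t0` larger than the number of steps, the averaged value simply tracks the latest weight, which is the "no averaging" case described above.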

## Code Implementation

```python
# Import the libraries.
import torch
import torch.nn as nn

# Random inputs and targets.
x = torch.randn(10, 3)
y = torch.randn(10, 2)

# Build a fully connected layer.
linear = nn.Linear(3, 2)

# Build the MSE loss function.
criterion = nn.MSELoss()

# Optimization method using ASGD.
optimizer = torch.optim.ASGD(linear.parameters(), lr=0.01, lambd=0.0001,
                             alpha=0.75, t0=1000000.0, weight_decay=0)

# Forward pass.
pred = linear(x)

# Compute loss.
loss = criterion(pred, y)
print('loss:', loss.item())

# Backward pass: clear old gradients, then compute new ones.
optimizer.zero_grad()
loss.backward()

# Update the weights.
optimizer.step()
```