Adagrad, short for adaptive gradient, is a gradient-based optimizer that automatically tunes its learning rate during training. The learning rate is adapted per parameter, i.e. each parameter has its own learning rate.
Parameters associated with frequently occurring features receive smaller updates (low learning rate), and parameters associated with rarely occurring features receive larger updates (high learning rate).
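Mathematically, Adagrad can be formulated as
$$g_{t,i} = \nabla_{\theta} J(\theta_{t,i})$$
where $g_{t,i}$ is the gradient of the objective function $J$ with respect to the parameter $\theta_i$ at time step $t$.
The parameter is updated as follows:
$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,i} + \epsilon}} \, g_{t,i}$$
Here $\theta_i$ is the parameter to be updated and $G_{t,i} = \sum_{\tau=1}^{t} g_{\tau,i}^2$ is the sum of the squares of all gradients up to time $t$. We can see that the learning rate is adjusted according to the previously encountered gradients. $\eta$ is the base learning rate, and $\epsilon$ is a small constant that prevents division by zero.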
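The per-parameter update can be sketched in plain Python to show how the accumulated squared gradients shrink the effective step size. This is a minimal illustration, not a library implementation: the function name `adagrad_step`, the `accum` list, and the toy gradients (one parameter updated every step, the other every fifth step) are all assumptions made for the example.

```python
import math

def adagrad_step(params, grads, accum, lr=0.01, eps=1e-10):
    """Apply one Adagrad update in place; return the per-parameter step sizes."""
    step_sizes = []
    for i, g in enumerate(grads):
        accum[i] += g * g                            # running sum of squared gradients
        step_size = lr / math.sqrt(accum[i] + eps)   # base rate scaled per parameter
        params[i] -= step_size * g
        step_sizes.append(step_size)
    return step_sizes

# Toy setup (hypothetical): parameter 0 gets a gradient at every step
# ("frequent feature"), parameter 1 only at every fifth step ("rare feature").
params = [0.0, 0.0]
accum = [0.0, 0.0]
for t in range(1, 11):
    grads = [1.0, 1.0 if t % 5 == 0 else 0.0]
    lrs = adagrad_step(params, grads, accum, lr=0.1)

print("accumulated squared gradients:", accum)
print("effective step sizes:", lrs)
```

After the loop, the rarely updated parameter ends with the larger effective step size, because its accumulator has summed fewer squared gradients.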
Learning rate decay
Learning rate decay is a technique in which a large learning rate is used at the beginning of training and is then reduced by a certain factor after a predefined number of epochs. A higher decay setting means the learning rate shrinks faster over the course of training.
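As a rough illustration of the schedule just described, the sketch below multiplies the learning rate by a fixed factor every few epochs. The function `step_decay` and its parameter names are hypothetical, chosen for this example only; they are not part of any library, and the specific values are made up.

```python
def step_decay(initial_lr, decay_factor, epochs_per_drop, epoch):
    """Learning rate at a given epoch under a step-decay schedule."""
    num_drops = epoch // epochs_per_drop          # how many times the rate was cut so far
    return initial_lr * (decay_factor ** num_drops)

# Assumed values: start at 0.1 and halve the rate every 10 epochs.
schedule = [step_decay(0.1, 0.5, 10, e) for e in (0, 9, 10, 25)]
print(schedule)
```

A smaller `decay_factor` or a smaller `epochs_per_drop` makes the rate fall faster, which is what a stronger decay means in practice.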
# Import the libraries.
import torch
import torch.nn as nn

# Random inputs and targets.
x = torch.randn(10, 3)
y = torch.randn(10, 2)

# Build a fully connected layer.
linear = nn.Linear(3, 2)

# Build MSE loss function and optimizer.
criterion = nn.MSELoss()

# Optimization method using Adagrad.
optimizer = torch.optim.Adagrad(linear.parameters(), lr=0.01, lr_decay=0,
                                weight_decay=0, eps=1e-10)

# Forward pass.
pred = linear(x)

# Compute loss.
loss = criterion(pred, y)
print('loss:', loss.item())

# Backward pass: clear old gradients, then compute new ones.
optimizer.zero_grad()
loss.backward()

# Update the parameters with one Adagrad step.
optimizer.step()