RMSprop
Gradient based optimization technique with adaptive learning rate
RMSprop is another optimization technique that maintains a separate effective learning rate for each parameter. The learning rate is adapted by keeping an exponential moving average of the squared gradient and using it to scale the parameter update.
Mathematically, the exponential moving average of the gradient squared is given as follows,
S_{dw} = \alpha \cdot S_{dw} + (1 - \alpha) \cdot (\partial w)^2
Here w is one of the parameters, \partial w is the gradient of the objective function with respect to w, and \alpha is the smoothing constant. The value of \alpha is usually set to 0.99.
Then, using S_{dw} to update the parameter,
w = w - \eta \cdot \frac{\partial w}{\sqrt{S_{dw}}}
Here \eta is the base learning rate.
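Putting the two formulas together, a minimal plain-Python sketch of the update for a single scalar parameter might look as follows (the names s_dw, eta, and eps are illustrative; eps is the small stability constant that implementations such as the PyTorch code below add to avoid division by zero):

# Minimal sketch of the RMSprop update for one scalar parameter w.
alpha = 0.99      # smoothing constant
eta = 0.01        # base learning rate
eps = 1e-8        # small stability constant (not shown in the formula above)

w = 0.5           # the parameter being optimized
s_dw = 0.0        # exponential moving average of the squared gradient

def grad(w):
    # Illustrative objective f(w) = w ** 2, so the gradient is 2 * w.
    return 2 * w

for _ in range(100):
    g = grad(w)
    s_dw = alpha * s_dw + (1 - alpha) * g ** 2    # S_dw update
    w = w - eta * g / (s_dw ** 0.5 + eps)         # parameter update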
Notice some implications of such an update. If the gradient of the objective function with respect to w is consistently large, then S_{dw} is large and the update shrinks, since we divide by a large value. Similarly, if the gradient with respect to w is small, the update becomes larger.
Let us clarify with the help of contour lines:
Comparison of stochastic gradient descent and RMSprop
We can see that a high gradient in the vertical direction and a low gradient in the horizontal direction slow down the overall search for the optimal solution. RMSprop addresses this problem by finding a better search "trajectory".
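To make the trajectory argument concrete, the hypothetical snippet below (not part of the original example) takes a single step of SGD and of RMSprop on an ill-conditioned quadratic and prints how far each optimizer moves along each axis:

import torch

def f(p):
    # Ill-conditioned objective: shallow along the first axis, steep along the second.
    return p[0] ** 2 + 100 * p[1] ** 2

for name in ("SGD", "RMSprop"):
    p = torch.tensor([5.0, 5.0], requires_grad=True)
    if name == "SGD":
        optimizer = torch.optim.SGD([p], lr=1e-3)
    else:
        optimizer = torch.optim.RMSprop([p], lr=1e-3, alpha=0.99)
    before = p.detach().clone()
    f(p).backward()
    optimizer.step()
    print(name, "first step:", (p.detach() - before).tolist())

SGD's first step is roughly a hundred times larger along the steep axis than along the shallow one, mirroring the gradient magnitudes, whereas RMSprop takes steps of roughly equal size in both directions because each gradient is divided by the root of its own squared-gradient average.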

Major Parameters

Alpha

Alpha is the same value as the \alpha in the aforementioned formula. The lower the value of alpha, the fewer previous squared gradients are effectively taken into account when calculating the exponential moving average.
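As a rough, purely illustrative check of this, the snippet below feeds a gradient stream that suddenly changes magnitude into the moving-average formula for two values of alpha; the smaller alpha reacts faster because fewer past values effectively contribute:

# Hypothetical gradient stream: small gradients, then suddenly large ones.
grads = [0.1] * 50 + [1.0] * 50

for alpha in (0.9, 0.99):
    s = 0.0
    history = []
    for g in grads:
        s = alpha * s + (1 - alpha) * g ** 2
        history.append(s)
    # Compare S just after the change with S ten steps later.
    print(f"alpha={alpha}: S at step 51 = {history[50]:.4f}, at step 60 = {history[59]:.4f}")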

Centered

If "Centered" is activated, then the RMS prop is calculated by normalizing the gradients with the variance. If not, then the uncentered second moment, as in the aforementioned formula, is used.
Centering might help in the training process but will be slightly more computationally expensive.
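A rough sketch of what centering changes, assuming the usual formulation (track a moving average of the gradient itself and normalize by the variance estimate, i.e. S_dw minus the square of that average); all names and the noisy toy gradient are illustrative:

import random

alpha, eta, eps = 0.99, 0.01, 1e-8
w, s_dw, g_dw = 0.5, 0.0, 0.0
random.seed(0)

for _ in range(100):
    g = 2 * w + random.gauss(0.0, 0.1)            # noisy gradient of f(w) = w ** 2
    s_dw = alpha * s_dw + (1 - alpha) * g ** 2    # uncentered second moment (as before)
    g_dw = alpha * g_dw + (1 - alpha) * g         # moving average of the gradient itself
    variance = max(s_dw - g_dw ** 2, 0.0)         # centered estimate used in place of S_dw
    w = w - eta * g / (variance ** 0.5 + eps)

In the PyTorch optimizer shown below, this behaviour is switched on with centered=True.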

Code Implementation

# importing the library
import torch
import torch.nn as nn

x = torch.randn(10, 3)
y = torch.randn(10, 2)

# Build a fully connected layer.
linear = nn.Linear(3, 2)

# Build MSE loss function and optimizer.
criterion = nn.MSELoss()

# Optimization method using RMSprop
optimizer = torch.optim.RMSprop(linear.parameters(), lr=0.01, alpha=0.99,
                                eps=1e-08, weight_decay=0, momentum=0, centered=False)

# Forward pass.
pred = linear(x)

# Compute loss.
loss = criterion(pred, y)
print('loss:', loss.item())

# Backward pass: compute gradients before the update step.
loss.backward()

# Single RMSprop update of the layer's parameters.
optimizer.step()
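The snippet above performs only a single parameter update; in practice the forward pass, loss computation, backward pass, and step are repeated. A minimal, hypothetical continuation reusing the objects defined above could look like this:

for epoch in range(100):
    optimizer.zero_grad()          # clear gradients from the previous iteration
    pred = linear(x)               # forward pass
    loss = criterion(pred, y)      # compute MSE loss
    loss.backward()                # backward pass: populate the gradients
    optimizer.step()               # RMSprop parameter update

print('final loss:', loss.item())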