Epsilon Coefficient
A small constant used to improve numerical stability
We use the Adam optimizer to briefly explain the epsilon coefficient. In Adam, the first and second moment estimates are computed as:
$V_{dw}=\beta_1 \cdot V_{dw}+(1-\beta_1)\cdot \partial w\\S_{dw}=\beta_2 \cdot S_{dw}+(1-\beta_2)\cdot (\partial w)^2$
Here, $\partial w$ is the derivative of the loss function with respect to a parameter, $V_{dw}$ is the decaying running average of the gradients (the momentum term), and $S_{dw}$ is the decaying running average of the squared gradients.
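The two running averages above can be sketched in a few lines of plain Python. This is a minimal illustration for a single scalar parameter, not a library implementation; the function name `update_moments` is hypothetical, and the values 0.9 and 0.999 for $\beta_1$ and $\beta_2$ are assumed here because they are commonly used defaults.

```python
def update_moments(v, s, grad, beta1=0.9, beta2=0.999):
    """Return updated (v, s) for one scalar parameter.

    beta1 and beta2 are assumed values, not taken from the text above.
    """
    # First moment: decaying average of the gradients (momentum term).
    v = beta1 * v + (1 - beta1) * grad
    # Second moment: decaying average of the squared gradients.
    s = beta2 * s + (1 - beta2) * grad ** 2
    return v, s


# Feed a short, made-up sequence of gradients through the averages.
v, s = 0.0, 0.0
for grad in [0.5, -0.2, 0.1]:
    v, s = update_moments(v, s, grad)
    print(f"grad={grad:+.2f}  v={v:.6f}  s={s:.8f}")
```

Because both averages start at zero, their early values are biased toward zero, which is why the update rule uses bias-corrected versions of them.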
The parameters are then updated as:
$\theta_{k+1}=\theta_k-\eta \cdot \frac{V_{dw}^{corrected}}{\sqrt{S_{dw}^{corrected}}+\epsilon}$
Here $\eta$ is the learning rate, the superscript "corrected" denotes the bias-corrected moment estimates, and $\epsilon$ is the epsilon coefficient.
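As a rough sketch, one full update for a single scalar parameter could look like the following. The function name `adam_step` and the default hyperparameter values are assumptions made for illustration. The bias correction shown (dividing each moment by $1-\beta^t$, with $t$ the step count starting at 1) follows the usual formulation of Adam, which the text above does not spell out.

```python
import math


def adam_step(theta, grad, v, s, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter (illustrative sketch).

    t is the 1-based step count; all default values are assumptions.
    """
    # Update the decaying averages of the gradient and the squared gradient.
    v = beta1 * v + (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * grad ** 2
    # Bias-correct both estimates, since they are initialized at zero.
    v_hat = v / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    # Epsilon is added to the denominator, outside the square root.
    theta = theta - lr * v_hat / (math.sqrt(s_hat) + eps)
    return theta, v, s


theta, v, s = 1.0, 0.0, 0.0
theta, v, s = adam_step(theta, grad=0.5, v=v, s=s, t=1)
print(theta)  # slightly below 1.0: the first step has magnitude close to lr
```

On the first step the corrected ratio is close to the sign of the gradient, so the parameter moves by roughly the learning rate regardless of the gradient's scale.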
Note that when the bias-corrected $S_{dw}$ gets close to zero, the denominator without epsilon also approaches zero, so the update becomes extremely large or, at exactly zero, undefined. Adding a small epsilon to the denominator keeps it bounded away from zero and stabilizes the computation.
A common default value for epsilon is 1e-08; this is, for example, the default in PyTorch's torch.optim.Adam.
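The effect of epsilon on the denominator can be illustrated with a small plain-Python sketch. The helper `scaled_step` is hypothetical and takes already bias-corrected moments as inputs. Note that plain Python floats raise an exception on division by zero, whereas array libraries typically produce `inf` or `nan` instead.

```python
import math


def scaled_step(v_hat, s_hat, lr=1e-3, eps=1e-8):
    """Size of the parameter update for given bias-corrected moments."""
    return lr * v_hat / (math.sqrt(s_hat) + eps)


# Case 1: the gradient has been exactly zero, so both moments are zero.
try:
    scaled_step(0.0, 0.0, eps=0.0)
    failed_without_eps = False
except ZeroDivisionError:
    failed_without_eps = True      # 0 / 0 is undefined without epsilon
safe_step = scaled_step(0.0, 0.0)  # with epsilon the step is simply 0.0

# Case 2: a tiny but nonzero gradient (made-up values for illustration).
undamped = scaled_step(1e-12, 1e-24, eps=0.0)  # roughly lr: a full-size step
damped = scaled_step(1e-12, 1e-24)             # far smaller than lr

print(failed_without_eps, safe_step, undamped, damped)
```

The second case shows a side effect worth keeping in mind: epsilon not only prevents division by zero but also shrinks updates whose second-moment estimate is much smaller than epsilon squared.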

Epsilon in code example

import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs.
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4

# Set amsgrad to True and epsilon to 1e-08.
# Note that we are using Adam in our example.
optimizer = torch.optim.Adam(
    model.parameters(), lr=learning_rate, eps=1e-08, amsgrad=True
)

for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the Tensors it will update (which are the learnable weights
    # of the model).
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model parameters.
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its parameters.
    optimizer.step()