Momentum (SGD)

Increasing the effective step size when the SGD gradients keep pointing in the same direction

Momentum speeds up the SGD optimizer so it reaches a local minimum more quickly. If consecutive updates move in the same direction in the loss landscape, the optimizer takes larger and larger steps.

A nice side effect of momentum is that it smooths the path SGD takes when the gradients of successive iterations point in different directions.
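
A quick toy simulation (my own sketch, not from the original figure) shows both effects at once: a coordinate whose gradient is consistently +1 builds up velocity, while a coordinate whose gradient alternates between +1 and -1 largely cancels itself out. The values of `gamma` and `lr` are illustrative choices.

```python
import numpy as np

gamma, lr = 0.9, 0.1   # momentum parameter and learning rate (illustrative)
v = np.zeros(2)        # velocity for two independent coordinates

for t in range(50):
    # coordinate 0: consistent gradient; coordinate 1: oscillating gradient
    grad = np.array([1.0, (-1.0) ** t])
    v = gamma * v + lr * grad

# coordinate 0 approaches lr / (1 - gamma) = 1.0 (a 10x larger step),
# coordinate 1 hovers near 0: the oscillations are smoothed away
print(v)
```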

Momentum is implemented by adding a velocity term to the weight update rule of SGD, controlled by a parameter γ between 0 and 1. Never set it exactly to 1, though: the velocity would then never decay and the steps could grow without bound.
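
Written out, a common formulation of the update looks like this (notation varies between sources; here η is the learning rate, θ the weights, and v_t the velocity):

```latex
v_t      = \gamma\, v_{t-1} + \eta\, \nabla_\theta L(\theta_{t-1})
\theta_t = \theta_{t-1} - v_t
```

For a constant gradient g, the velocity converges to ηg / (1 − γ), i.e. momentum amplifies the step by up to a factor of 1 / (1 − γ); as γ approaches 1 that factor blows up, which is why γ = 1 is a bad idea.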

A typical value for the momentum parameter that has proven robust in practice is 0.9.
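
As a concrete sketch (a minimal implementation of the formula above; the names `sgd_momentum_step`, `theta` and `velocity` are my own):

```python
import numpy as np

def sgd_momentum_step(theta, velocity, grad, lr=0.01, gamma=0.9):
    """One SGD-with-momentum update; gamma is the momentum parameter."""
    velocity = gamma * velocity + lr * grad  # decay old velocity, add new gradient
    theta = theta - velocity                 # step in the accumulated direction
    return theta, velocity
```

For reference, PyTorch exposes the same parameter via `torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)`, although its internal formulation differs slightly from the one above.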

Without dampening, momentum can cause the optimizer to overshoot: the accumulated velocity makes the steps so large that it 'jumps over' the minimum.
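
Dampening counters this by scaling down the gradient's contribution to the velocity. As a sketch of the variant that PyTorch's SGD implements (τ is its `dampening` parameter, a symbol not used above, and g_t the current gradient):

```latex
v_t = \gamma\, v_{t-1} + (1 - \tau)\, g_t
```

With τ = 0 this reduces to ordinary momentum; larger values keep the velocity from growing as aggressively, which reduces the overshoot.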

Further Resources