
Warm-Up

Starting the optimizer slowly to increase stability

Intuition

Using too large a learning rate may result in numerical instability, especially at the very beginning of training, when parameters are randomly initialized. The warmup strategy increases the learning rate linearly from 0 to the initial learning rate during the first **N** epochs or **m** batches.
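As a minimal sketch, the linear ramp described above can be written as a small helper (the function name and constants here are illustrative, not from any particular library):

```python
def warmup_lr(step, base_lr, warmup_steps):
    """Linearly ramp the learning rate from 0 to base_lr over warmup_steps updates."""
    if step >= warmup_steps:
        return base_lr
    return base_lr * step / warmup_steps

# The rate grows linearly during warmup, then stays at the base value
rates = [warmup_lr(s, base_lr=0.1, warmup_steps=5) for s in range(8)]
```

After `warmup_steps` updates, the helper simply returns the base rate, which is where a regular schedule would take over.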

In some cases, initializing the parameters alone is not sufficient to guarantee a good solution. This is particularly a problem for some advanced network designs that lead to unstable optimization problems. We could address this by choosing a learning rate small enough to prevent divergence at the beginning; unfortunately, that makes progress slow. Conversely, a learning rate that is large from the start leads to divergence.

A rather simple fix for this dilemma is a warmup period during which the learning rate *increases* to its initial maximum, followed by a schedule that cools the rate down until the end of the optimization process. Warmup steps are just a few updates with a low learning rate at the beginning of training. After this *warmup*, you use the regular learning rate (schedule) to train your model to convergence.
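The warmup-then-schedule pattern can be sketched without any framework. Here a linear warmup hands off to a simple step decay; all names and constants are illustrative assumptions, not values from the text:

```python
def scheduled_lr(epoch, base_lr=0.1, warmup_epochs=5, decay_every=10, gamma=0.1):
    """Linear warmup to base_lr, then decay by gamma every decay_every epochs."""
    if epoch < warmup_epochs:
        # warmup phase: ramp from base_lr / warmup_epochs up to base_lr
        return base_lr * (epoch + 1) / warmup_epochs
    # regular schedule takes over after warmup
    steps = (epoch - warmup_epochs) // decay_every
    return base_lr * gamma ** steps

for epoch in range(0, 30, 5):
    print(epoch, scheduled_lr(epoch))
```

The same behavior is what chained schedulers (like the PyTorch example below) implement for you.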

In Hasty's Model Playground, if you set Last Epoch to 1,000 for a run of 10,000 iterations, the model trains for the first 1,000 iterations with a learning rate scaled down by the Warmup factor, i.e., lower than the base learning rate you specified. From the 1,001st iteration onwards, the model uses the base learning rate.

Code implementation

PyTorch

```python
import torch
from torch.optim.lr_scheduler import StepLR
from torch.optim.sgd import SGD

# GradualWarmupScheduler comes from the third-party
# pytorch-gradual-warmup-lr package (pip package: warmup_scheduler)
from warmup_scheduler import GradualWarmupScheduler


if __name__ == '__main__':
    model = [torch.nn.Parameter(torch.randn(2, 2, requires_grad=True))]
    optim = SGD(model, 0.1)

    # scheduler_warmup is chained with scheduler_steplr
    scheduler_steplr = StepLR(optim, step_size=10, gamma=0.1)
    scheduler_warmup = GradualWarmupScheduler(optim, multiplier=1, total_epoch=5, after_scheduler=scheduler_steplr)

    # this zero gradient update is needed to avoid a warning message, issue #8.
    optim.zero_grad()
    optim.step()

    for epoch in range(1, 20):
        scheduler_warmup.step(epoch)
        print(epoch, optim.param_groups[0]['lr'])

        optim.step()  # backward pass (update network)
```

Further resources
