# 11.11. Learning Rate Scheduling

So far we primarily focused on optimization algorithms for how to update the weight vectors rather than on the rate at which they are being updated. Nonetheless, adjusting the learning rate is often just as important as the actual algorithm. There are a number of aspects to consider:

• Most obviously the magnitude of the learning rate matters. If it is too large, optimization diverges; if it is too small, training takes too long or we end up with a suboptimal result. We saw previously that the condition number of the problem matters (see e.g., Section 11.6 for details). Intuitively it is the ratio of the amount of change in the least sensitive direction vs. the most sensitive one.

• Secondly, the rate of decay is just as important. If the learning rate remains large we may simply end up bouncing around the minimum and thus not reach optimality. Section 11.5 discussed this in some detail and we analyzed performance guarantees in Section 11.4. In short, we want the rate to decay, but probably more slowly than $$\mathcal{O}(t^{-\frac{1}{2}})$$ which would be a good choice for convex problems.

• Another aspect that is equally important is initialization. This pertains both to how the parameters are set initially (review Section 4.8 for details) and also how they evolve initially. This goes under the moniker of warmup, i.e., how rapidly we start moving towards the solution initially. Large steps in the beginning might not be beneficial, in particular since the initial set of parameters is random. The initial update directions might be quite meaningless, too.

• Lastly, there are a number of optimization variants that perform cyclical learning rate adjustment. This is beyond the scope of the current chapter. We recommend that the reader review the details in [Izmailov et al., 2018], e.g., how to obtain better solutions by averaging over an entire path of parameters.
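To make the first point concrete, here is a minimal sketch (not part of the DJL code in this section) that runs plain gradient descent on the one-dimensional function f(x) = x^2. The class name StepSizeDemo and the three learning rates are illustrative assumptions. Each update multiplies x by (1 - 2 * lr), so a tiny rate barely moves, a moderate rate contracts quickly, and a rate above 1 makes the iterate grow in magnitude.

```java
public class StepSizeDemo {
    // Runs `steps` iterations of gradient descent on f(x) = x^2 (gradient 2x),
    // starting from x0, and returns the final iterate.
    static double descend(double lr, double x0, int steps) {
        double x = x0;
        for (int t = 0; t < steps; t++) {
            x -= lr * 2 * x; // equivalent to x = (1 - 2 * lr) * x
        }
        return x;
    }

    public static void main(String[] args) {
        // Illustrative rates: too small, reasonable, too large
        for (double lr : new double[] {0.001, 0.4, 1.1}) {
            System.out.printf("lr=%.3f -> x after 20 steps: %.4g%n",
                    lr, descend(lr, 1.0, 20));
        }
    }
}
```

With lr = 0.4 the iterate shrinks by a factor of 0.2 per step, whereas with lr = 1.1 it is multiplied by -1.2 per step and moves away from the minimum.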

Given the fact that there is a lot of detail needed to manage learning rates, most deep learning frameworks have tools to deal with this automatically. In the current chapter we will review the effects that different schedules have on accuracy and also show how this can be managed efficiently via a learning rate scheduler.

In DJL we will be referring to these as learning rate trackers.

## 11.11.1. Toy Problem

We begin with a toy problem that is cheap enough to compute easily, yet sufficiently nontrivial to illustrate some of the key aspects. For that we pick a slightly modernized version of LeNet (relu instead of sigmoid activation, MaxPooling rather than AveragePooling), as applied to Fashion-MNIST. Moreover, we hybridize the network for performance. Since most of the code is standard we just introduce the basics without further detailed discussion. See Section 6 for a refresher as needed.

%load ../utils/djl-imports


import ai.djl.basicdataset.cv.classification.*;
import org.apache.commons.lang3.ArrayUtils;

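SequentialBlock net = new SequentialBlock();

net.add(Conv2d.builder().setKernelShape(new Shape(5, 5)).setFilters(1).build());
net.add(Activation.reluBlock());
net.add(Pool.maxPool2dBlock(new Shape(2, 2), new Shape(2, 2)));
net.add(Conv2d.builder().setKernelShape(new Shape(5, 5)).setFilters(1).build());
net.add(Blocks.batchFlattenBlock());
net.add(Activation.reluBlock());
// Assumption: 120 and 84 hidden units are the classic LeNet sizes; 10 outputs match the Fashion-MNIST classes
net.add(Linear.builder().setUnits(120).build());
net.add(Activation.reluBlock());
net.add(Linear.builder().setUnits(84).build());
net.add(Activation.reluBlock());
net.add(Linear.builder().setUnits(10).build());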

SequentialBlock {
Conv2d
ReLU
maxPool2d
Conv2d
batchFlatten
ReLU
Linear
ReLU
Linear
ReLU
Linear
}

int batchSize = 256;
RandomAccessDataset trainDataset = FashionMnist.builder()
.optUsage(Dataset.Usage.TRAIN)
.setSampling(batchSize, false)
.optLimit(Long.getLong("DATASET_LIMIT", Long.MAX_VALUE))
.build();

RandomAccessDataset testDataset = FashionMnist.builder()
.optUsage(Dataset.Usage.TEST)
.setSampling(batchSize, false)
.optLimit(Long.getLong("DATASET_LIMIT", Long.MAX_VALUE))
.build();

double[] trainLoss;
double[] testAccuracy;
double[] epochCount;
double[] trainAccuracy;

public static void train(RandomAccessDataset trainIter, RandomAccessDataset testIter,
int numEpochs, Trainer trainer) throws IOException, TranslateException {
epochCount = new double[numEpochs];

for (int i = 0; i < epochCount.length; i++) {
epochCount[i] = (i + 1);
}

double avgTrainTimePerEpoch = 0;
Map<String, double[]> evaluatorMetrics = new HashMap<>();

trainer.setMetrics(new Metrics());

EasyTrain.fit(trainer, numEpochs, trainIter, testIter);

Metrics metrics = trainer.getMetrics();

trainer.getEvaluators().stream()
.forEach(evaluator -> {
evaluatorMetrics.put("train_epoch_" + evaluator.getName(), metrics.getMetric("train_epoch_" + evaluator.getName()).stream()
.mapToDouble(x -> x.getValue().doubleValue()).toArray());
evaluatorMetrics.put("validate_epoch_" + evaluator.getName(), metrics.getMetric("validate_epoch_" + evaluator.getName()).stream()
.mapToDouble(x -> x.getValue().doubleValue()).toArray());
});

avgTrainTimePerEpoch = metrics.mean("epoch");

trainLoss = evaluatorMetrics.get("train_epoch_SoftmaxCrossEntropyLoss");
trainAccuracy = evaluatorMetrics.get("train_epoch_Accuracy");
testAccuracy = evaluatorMetrics.get("validate_epoch_Accuracy");

System.out.printf("loss %.3f," , trainLoss[numEpochs-1]);
System.out.printf(" train acc %.3f," , trainAccuracy[numEpochs-1]);
System.out.printf(" test acc %.3f\n" , testAccuracy[numEpochs-1]);
System.out.printf("%.1f examples/sec \n", trainIter.size() / (avgTrainTimePerEpoch / Math.pow(10, 9)));
}


Let us have a look at what happens if we invoke this algorithm with default settings, such as a learning rate of $$0.3$$, and train for $$10$$ epochs (the default value of MAX_EPOCH in the code below). Typically the training accuracy keeps on increasing while progress in terms of test accuracy stalls beyond a point. The gap between both curves indicates overfitting.

float lr = 0.3f;
int numEpochs = Integer.getInteger("MAX_EPOCH", 10);

Model model = Model.newInstance("Modern LeNet");
model.setBlock(net);

Loss loss = Loss.softmaxCrossEntropyLoss();
Tracker lrt = Tracker.fixed(lr);
Optimizer sgd = Optimizer.sgd().setLearningRateTracker(lrt).build();

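DefaultTrainingConfig config = new DefaultTrainingConfig(loss)
.optOptimizer(sgd) // Optimizer
.addEvaluator(new Accuracy()) // Model accuracy, read back by train()
.addTrainingListeners(TrainingListener.Defaults.logging()); // Logging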

Trainer trainer = model.newTrainer(config);
trainer.initialize(new Shape(1, 1, 28, 28));

train(trainDataset, testDataset, numEpochs, trainer);

INFO Training on: 4 GPUs.
INFO Load MXNet Engine Version 1.9.0 in 0.053 ms.

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 1 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 2 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 3 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 4 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 5 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 6 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 7 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 8 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 9 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 10 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

loss 2.304, train acc 0.100, test acc 0.100
8581.8 examples/sec

public void plotMetrics() {
String[] lossLabel = new String[trainLoss.length + testAccuracy.length + trainAccuracy.length];

Arrays.fill(lossLabel, 0, trainLoss.length, "train loss");
Arrays.fill(lossLabel, trainAccuracy.length, trainLoss.length + trainAccuracy.length, "train acc");
Arrays.fill(lossLabel, trainLoss.length + trainAccuracy.length,
trainLoss.length + testAccuracy.length + trainAccuracy.length, "test acc");

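// Stack the three curves into one long table: epoch, value, and curve label
Table data = Table.create("Data").addColumns(
DoubleColumn.create("epoch", ArrayUtils.addAll(epochCount, ArrayUtils.addAll(epochCount, epochCount))),
DoubleColumn.create("metrics", ArrayUtils.addAll(trainLoss, ArrayUtils.addAll(trainAccuracy, testAccuracy))),
StringColumn.create("lossLabel", lossLabel)
);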

display(LinePlot.create("", data, "epoch", "metrics", "lossLabel"));
}

plotMetrics();


## 11.11.2. Trackers

One way of adjusting the learning rate is to set it explicitly at each step. We could adjust it downward after every epoch (or even after every minibatch), e.g., in a dynamic manner in response to how optimization is progressing.
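As a rough illustration of adjusting the rate in response to progress, the following standalone sketch shrinks the learning rate whenever the epoch loss has failed to improve for a given number of epochs. The class PlateauSchedule, its parameters (factor, patience), and the loss values are hypothetical and are not part of DJL; the sketch only shows the bookkeeping, not how to wire it into a trainer.

```java
public class PlateauDemo {
    /** Hypothetical schedule: shrink the rate when the loss stops improving. */
    static final class PlateauSchedule {
        private double lr;
        private final double factor;
        private final int patience;
        private double bestLoss = Double.POSITIVE_INFINITY;
        private int badEpochs = 0;

        PlateauSchedule(double lr, double factor, int patience) {
            this.lr = lr;
            this.factor = factor;
            this.patience = patience;
        }

        /** Takes the loss of the epoch that just finished; returns the rate for the next one. */
        double update(double epochLoss) {
            if (epochLoss < bestLoss) {
                // Progress: remember the new best loss and reset the counter
                bestLoss = epochLoss;
                badEpochs = 0;
            } else if (++badEpochs >= patience) {
                // No progress for `patience` epochs: decay and start counting again
                lr *= factor;
                badEpochs = 0;
            }
            return lr;
        }
    }

    public static void main(String[] args) {
        PlateauSchedule schedule = new PlateauSchedule(0.5, 0.5, 2);
        double[] losses = {1.0, 0.8, 0.8, 0.81, 0.7}; // made-up epoch losses
        for (double loss : losses) {
            System.out.printf("loss %.2f -> next lr %.3f%n", loss, schedule.update(loss));
        }
    }
}
```

Keeping the decision inside a small stateful object means the training loop only has to report one number per epoch.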

We, however, can’t directly change the learning rate with the trainer after it has already been created. What we can do instead is create a tracker to do this for us.

When invoked with the number of updates it returns the appropriate value of the learning rate. Let us define a simple one that sets the learning rate to $$\eta = \eta_0 (t + 1)^{-\frac{1}{2}}$$.

public class SquareRootTracker {
float lr;
public SquareRootTracker() {
this(0.1f);
}
public SquareRootTracker(float learningRate) {
this.lr = learningRate;
}
public float getNewLearningRate(int numUpdate) {
return lr * (float) Math.pow(numUpdate + 1, -0.5);
}
}


Note: This is not a drop-in replacement for a standard learning rate tracker (LRT). It is just a simple example to give a better understanding of how trackers work.

Let us plot its behavior over a range of values.

public Figure plotLearningRate(int[] epochs, float[] learningRates) {
    // One row per step: the step index and the learning rate at that step
    Table data = Table.create("Data").addColumns(
        IntColumn.create("epoch", epochs),
        DoubleColumn.create("learning rate", learningRates)
    );

    return LinePlot.create("Learning Rate vs. Epoch", data, "epoch", "learning rate");
}

SquareRootTracker tracker = new SquareRootTracker();

int[] epochs = new int[numEpochs];
float[] learningRates = new float[numEpochs];
for (int i = 0; i < numEpochs; i++) {
epochs[i] = i;
learningRates[i] = tracker.getNewLearningRate(i);
}

plotLearningRate(epochs, learningRates);
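To get a feel for how slowly this schedule decays, a few values can be worked out by hand, assuming the default $$\eta_0 = 0.1$$ of the SquareRootTracker defined above:

```latex
\begin{aligned}
\eta(t)  &= \eta_0 (t + 1)^{-\frac{1}{2}}, \qquad \eta_0 = 0.1 \\
\eta(0)  &= 0.1 \cdot 1^{-\frac{1}{2}}   = 0.1 \\
\eta(3)  &= 0.1 \cdot 4^{-\frac{1}{2}}   = 0.05 \\
\eta(99) &= 0.1 \cdot 100^{-\frac{1}{2}} = 0.01
\end{aligned}
```

The first halving takes only 3 updates, but a tenfold reduction takes 99, which is the characteristic long tail of polynomial decay.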


Now let us see how this plays out for training on Fashion-MNIST. We can’t actually do it directly, but we can see how the curve would look theoretically.

This looks like it works quite a bit better than before. Two things stand out. First, the curve is rather smoother than previously. Second, there is less overfitting. Unfortunately, it is not a well-resolved question why certain strategies lead to less overfitting in theory. There is some argument that a smaller step size will lead to parameters that are closer to zero and thus simpler. However, this does not explain the phenomenon entirely since we do not really stop early but simply reduce the learning rate gently.

## 11.11.3. Policies

While we cannot possibly cover the entire variety of learning rate trackers, we attempt to give a brief overview of popular policies below. Common choices are polynomial decay and piecewise constant schedules. Beyond that, cosine learning rate schedules have been found to work well empirically on some problems. Lastly, on some problems it is beneficial to warm up the optimizer prior to using large learning rates.

### 11.11.3.1. Factor Tracker

One alternative to a polynomial decay would be a multiplicative one, that is $$\eta_{t+1} \leftarrow \eta_t \cdot \alpha$$ for $$\alpha \in (0, 1)$$. To prevent the learning rate from decaying beyond a reasonable lower bound the update equation is often modified to $$\eta_{t+1} \leftarrow \mathop{\mathrm{max}}(\eta_{\mathrm{min}}, \eta_t \cdot \alpha)$$.

public class DemoFactorTracker {
float baseLr;
float stopFactorLr;
float factor;
public DemoFactorTracker(float factor, float stopFactorLr, float baseLr) {
this.factor = factor;
this.stopFactorLr = stopFactorLr;
this.baseLr = baseLr;
}
public DemoFactorTracker() {
this(1f, (float) 1e-7, 0.1f);
}
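public float getNewLearningRate(int numUpdate) {
// eta_t = max(eta_min, eta_0 * alpha^t), the closed form of the clamped update
return Math.max(stopFactorLr, baseLr * (float) Math.pow(factor, numUpdate));
}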
}

DemoFactorTracker tracker = new DemoFactorTracker(0.9f, (float) 1e-2, 2);

numEpochs = 50;
int[] epochs = new int[numEpochs];
float[] learningRates = new float[numEpochs];
for (int i = 0; i < numEpochs; i++) {
epochs[i] = i;
learningRates[i] = tracker.getNewLearningRate(i);
}

plotLearningRate(epochs, learningRates);
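A natural question for the multiplicative schedule is how many updates pass before the lower bound takes over. The sketch below is a standalone illustration (class and method names are assumptions, not DJL API) that evaluates the clamped schedule in closed form and solves $$\eta_0 \alpha^t \le \eta_{\mathrm{min}}$$ for the smallest integer $$t$$.

```java
public class FactorFloorDemo {
    // Closed form of the clamped recurrence eta_{t+1} = max(etaMin, eta_t * alpha)
    static double rate(double eta0, double alpha, double etaMin, int t) {
        return Math.max(etaMin, eta0 * Math.pow(alpha, t));
    }

    // Smallest t for which eta0 * alpha^t has dropped to etaMin or below,
    // obtained by taking logarithms: t >= log(etaMin / eta0) / log(alpha)
    static int stepsToFloor(double eta0, double alpha, double etaMin) {
        return (int) Math.ceil(Math.log(etaMin / eta0) / Math.log(alpha));
    }

    public static void main(String[] args) {
        // Same values as the DemoFactorTracker call: alpha = 0.9, floor = 0.01, eta0 = 2
        int t = stepsToFloor(2.0, 0.9, 0.01);
        System.out.println("floor reached at step " + t);
        System.out.println("rate one step earlier: " + rate(2.0, 0.9, 0.01, t - 1));
    }
}
```

For a factor close to 1 the floor is reached only after many updates, so in short runs the lower bound may never become active.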


This can also be accomplished by a built-in tracker in DJL via the Tracker.factor() builder. It takes a few more parameters, such as a lower bound on the value and the maximum number of desired updates. Going forward we will use the built-in trackers as appropriate and only explain their functionality here.

### 11.11.3.2. Multi Factor Scheduler

A common strategy for training deep networks is to keep the learning rate piecewise constant and to decrease it by a given amount every so often. That is, given a set of times when to decrease the rate, such as $$s = \{5, 10, 20\}$$, decrease $$\eta_{t+1} \leftarrow \eta_t \cdot \alpha$$ whenever $$t \in s$$. Assuming that the values are halved at each step we can implement this as follows.

MultiFactorTracker tracker = Tracker.multiFactor()
.setSteps(new int[]{5, 30})
.optFactor(0.5f)
.setBaseValue(0.5f)
.build();

numEpochs = 10;
int[] epochs = new int[numEpochs];
float[] learningRates = new float[numEpochs];
for (int i = 0; i < numEpochs; i++) {
epochs[i] = i;
learningRates[i] = tracker.getNewValue(i);
}

plotLearningRate(epochs, learningRates);
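The logic behind such a piecewise constant schedule fits in a few lines. The following standalone sketch is an illustration only, not the built-in implementation: the class name is an assumption, and so is the convention that the rate drops exactly at step t = s, which may differ from what the built-in tracker does at the boundary.

```java
public class PiecewiseDemo {
    // Multiplies the base rate by `factor` once for every milestone in `steps`
    // that has already been reached (assumed convention: the drop applies at t == s).
    static double rate(double base, double factor, int[] steps, int t) {
        double lr = base;
        for (int s : steps) {
            if (t >= s) {
                lr *= factor;
            }
        }
        return lr;
    }

    public static void main(String[] args) {
        int[] steps = {5, 30}; // same milestones as the tracker configured above
        for (int t : new int[] {0, 4, 5, 29, 30}) {
            System.out.println("t=" + t + " lr=" + rate(0.5, 0.5, steps, t));
        }
    }
}
```

Because the rate depends only on how many milestones have passed, the function is stateless and can be evaluated for any step in any order.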


The intuition behind this piecewise constant learning rate schedule is that one lets optimization proceed until a stationary point has been reached in terms of the distribution of weight vectors. Then (and only then) do we decrease the rate so as to obtain a higher-quality proxy to a good local minimum. The example below shows how this can produce ever so slightly better solutions.

int numEpochs = Integer.getInteger("MAX_EPOCH", 10);

Model model = Model.newInstance("Modern LeNet");
model.setBlock(net);

Loss loss = Loss.softmaxCrossEntropyLoss();
Optimizer sgd = Optimizer.sgd().setLearningRateTracker(tracker).build();

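DefaultTrainingConfig config = new DefaultTrainingConfig(loss)
.optOptimizer(sgd) // Optimizer
.addEvaluator(new Accuracy()) // Model accuracy, read back by train()
.addTrainingListeners(TrainingListener.Defaults.logging()); // Logging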

Trainer trainer = model.newTrainer(config);
trainer.initialize(new Shape(1, 1, 28, 28));

train(trainDataset, testDataset, numEpochs, trainer);
plotMetrics();

INFO Training on: 4 GPUs.
INFO Load MXNet Engine Version 1.9.0 in 0.021 ms.

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 1 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 2 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 3 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 4 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 5 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 6 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 7 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 8 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 9 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 10 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

loss 2.303, train acc 0.100, test acc 0.100
10711.8 examples/sec

String[] lossLabel = new String[trainLoss.length + testAccuracy.length + trainAccuracy.length];

Arrays.fill(lossLabel, 0, trainLoss.length, "train loss");
Arrays.fill(lossLabel, trainAccuracy.length, trainLoss.length + trainAccuracy.length, "train acc");
Arrays.fill(lossLabel, trainLoss.length + trainAccuracy.length,
trainLoss.length + testAccuracy.length + trainAccuracy.length, "test acc");

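// Stack the three curves into one long table: epoch, value, and curve label
Table data = Table.create("Data").addColumns(
DoubleColumn.create("epoch", ArrayUtils.addAll(epochCount, ArrayUtils.addAll(epochCount, epochCount))),
DoubleColumn.create("metrics", ArrayUtils.addAll(trainLoss, ArrayUtils.addAll(trainAccuracy, testAccuracy))),
StringColumn.create("lossLabel", lossLabel)
);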

LinePlot.create("", data, "epoch", "metrics", "lossLabel");


### 11.11.3.3. Cosine Tracker

A rather perplexing heuristic was proposed by [Loshchilov & Hutter, 2016]. It relies on the observation that we might not want to decrease the learning rate too drastically in the beginning and moreover, that we might want to “refine” the solution in the end using a very small learning rate. This results in a cosine-like tracker with the following functional form for learning rates in the range $$t \in [0, T]$$.

$$\eta_t = \eta_T + \frac{\eta_0 - \eta_T}{2} \left(1 + \cos(\pi t/T)\right) \tag{11.11.1}$$

Here $$\eta_0$$ is the initial learning rate, $$\eta_T$$ is the target rate at time $$T$$. Furthermore, for $$t > T$$ we simply pin the value to $$\eta_T$$ without increasing it again. In the following example, we set the max update step $$T = 20$$.

public class DemoCosineTracker {
float baseLr;
float finalLr;
int maxUpdate;
public DemoCosineTracker() {
this(0.5f, 0.01f, 20);
}
public DemoCosineTracker(float baseLr, float finalLr, int maxUpdate) {
this.baseLr = baseLr;
this.finalLr = finalLr;
this.maxUpdate = maxUpdate;
}
public float getNewLearningRate(int numUpdate) {
if (numUpdate > maxUpdate) {
return finalLr;
}
// Scale the curve to smoothly transition
float step = (baseLr - finalLr) / 2 * (1 + (float) Math.cos(Math.PI * numUpdate / maxUpdate));
return finalLr + step;
}
}

DemoCosineTracker tracker = new DemoCosineTracker(0.5f, 0.01f, 20);

int[] epochs = new int[numEpochs];
float[] learningRates = new float[numEpochs];
for (int i = 0; i < numEpochs; i++) {
epochs[i] = i;
learningRates[i] = tracker.getNewLearningRate(i);
}

plotLearningRate(epochs, learningRates);
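As a sanity check on the cosine formula, it helps to evaluate it at the start, the midpoint, and the end of the schedule, using the values from the demo tracker ($$\eta_0 = 0.5$$, $$\eta_T = 0.01$$, $$T = 20$$):

```latex
\begin{aligned}
\eta_0    &= 0.01 + \tfrac{0.5 - 0.01}{2}\left(1 + \cos 0\right)               = 0.01 + 0.245 \cdot 2 = 0.5 \\
\eta_{10} &= 0.01 + \tfrac{0.5 - 0.01}{2}\left(1 + \cos \tfrac{\pi}{2}\right)  = 0.01 + 0.245 \cdot 1 = 0.255 \\
\eta_{20} &= 0.01 + \tfrac{0.5 - 0.01}{2}\left(1 + \cos \pi\right)             = 0.01 + 0.245 \cdot 0 = 0.01
\end{aligned}
```

The schedule therefore starts at the base rate, passes through the arithmetic mean of the two rates halfway through, and lands exactly on the final rate at $$t = T$$. Since the slope of the cosine vanishes at both ends, the rate changes slowly at the beginning and at the end and fastest in the middle.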


In the context of computer vision this schedule can lead to improved results. Note, though, that such improvements are not guaranteed (as can be seen below).

CosineTracker cosineTracker = Tracker.cosine()
.setBaseValue(0.5f)
.optFinalValue(0.01f)
.setMaxUpdates(20) // T = 20, as in the demo tracker above
.build();

Loss loss = Loss.softmaxCrossEntropyLoss();
Optimizer sgd = Optimizer.sgd().setLearningRateTracker(cosineTracker).build();

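DefaultTrainingConfig config = new DefaultTrainingConfig(loss)
.optOptimizer(sgd) // Optimizer
.addEvaluator(new Accuracy()) // Model accuracy, read back by train()
.addTrainingListeners(TrainingListener.Defaults.logging()); // Logging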

Trainer trainer = model.newTrainer(config);
trainer.initialize(new Shape(1, 1, 28, 28));

train(trainDataset, testDataset, numEpochs, trainer);

INFO Training on: 4 GPUs.
INFO Load MXNet Engine Version 1.9.0 in 0.020 ms.

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 1 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 2 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 3 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 4 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 5 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 6 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 7 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 8 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 9 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 10 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

loss 2.303, train acc 0.096, test acc 0.100
10714.5 examples/sec


### 11.11.3.4. Warmup

In some cases initializing the parameters is not sufficient to guarantee a good solution. This is particularly a problem for some advanced network designs that may lead to unstable optimization problems. We could address this by choosing a sufficiently small learning rate to prevent divergence in the beginning. Unfortunately this means that progress is slow. Conversely, a large learning rate initially leads to divergence.

A rather simple fix for this dilemma is to use a warmup period during which the learning rate increases to its initial maximum and to cool down the rate until the end of the optimization process. For simplicity one typically uses a linear increase for this purpose. This leads to a schedule of the form indicated below.

public class CosineWarmupTracker {
float baseLr;
float finalLr;
int maxUpdate;
int warmUpSteps;
float warmUpBeginValue;
float warmUpFinalValue;

public CosineWarmupTracker() {
this(0.5f, 0.01f, 20, 5);
}

public CosineWarmupTracker(float baseLr, float finalLr, int maxUpdate, int warmUpSteps) {
this.baseLr = baseLr;
this.finalLr = finalLr;
this.maxUpdate = maxUpdate;
this.warmUpSteps = warmUpSteps; // use the constructor argument rather than a hard-coded 5
this.warmUpBeginValue = 0f;
}

public float getNewLearningRate(int numUpdate) {
if (numUpdate <= warmUpSteps) {
return getWarmUpValue(numUpdate);
}
if (numUpdate > maxUpdate) {
return finalLr;
}
// Scale the cosine curve to fit smoothly with the warmup steps
float step = (baseLr - finalLr) / 2 * (1 +
(float) Math.cos(Math.PI * (numUpdate - warmUpSteps) / (maxUpdate - warmUpSteps)));
return finalLr + step;
}

public float getWarmUpValue(int numUpdate) {
// Linear warmup
return warmUpBeginValue + (baseLr - warmUpBeginValue) * numUpdate / warmUpSteps;
}
}

CosineWarmupTracker tracker = new CosineWarmupTracker(0.5f, 0.01f, 20, 5);

int[] epochs = new int[numEpochs];
float[] learningRates = new float[numEpochs];
for (int i = 0; i < numEpochs; i++) {
epochs[i] = i;
learningRates[i] = tracker.getNewLearningRate(i);
}

plotLearningRate(epochs, learningRates);


Note that the network converges better initially (in particular observe the performance during the first 5 epochs).

Additionally, we still use a total of 20 max updates, but the first 5 are dedicated to the warmup steps. The cosine curve is then squeezed into the remaining 15 steps rather than the earlier 20.
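Written out, the schedule computed by the CosineWarmupTracker demo class (with 5 warmup steps, $$T = 20$$, and a warmup start value of 0) has three pieces. This is a transcription of that demo class and makes no claim about how the built-in trackers divide the steps:

```latex
\eta_t =
\begin{cases}
\eta_0 \cdot \dfrac{t}{5}, & 0 \le t \le 5, \\[2ex]
\eta_T + \dfrac{\eta_0 - \eta_T}{2}\left(1 + \cos\left(\pi \, \dfrac{t - 5}{15}\right)\right), & 5 < t \le 20, \\[2ex]
\eta_T, & t > 20.
\end{cases}
```

At $$t = 5$$ both the linear ramp and the shifted cosine evaluate to $$\eta_0$$, so the two pieces join without a jump.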

CosineTracker cosineTracker = Tracker.cosine()
.setBaseValue(0.5f)
.optFinalValue(0.01f)
.setMaxUpdates(20) // T = 20 max updates
.build();

WarmUpTracker warmupCosine = Tracker.warmUp()
.optWarmUpSteps(5)
.setMainTracker(cosineTracker)
.build();

Loss loss = Loss.softmaxCrossEntropyLoss();
Optimizer sgd = Optimizer.sgd().setLearningRateTracker(warmupCosine).build();

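DefaultTrainingConfig config = new DefaultTrainingConfig(loss)
.optOptimizer(sgd) // Optimizer
.addEvaluator(new Accuracy()) // Model accuracy, read back by train()
.addTrainingListeners(TrainingListener.Defaults.logging()); // Logging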

Trainer trainer = model.newTrainer(config);
trainer.initialize(new Shape(1, 1, 28, 28));

train(trainDataset, testDataset, numEpochs, trainer);
plotMetrics();

INFO Training on: 4 GPUs.
INFO Load MXNet Engine Version 1.9.0 in 0.029 ms.

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 1 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 2 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 3 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 4 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 5 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 6 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 7 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 8 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 9 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
Validating:  100% |████████████████████████████████████████|

INFO Epoch 10 finished.
INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30

loss 2.303, train acc 0.096, test acc 0.100
10677.1 examples/sec


Warmup can be applied to any scheduler (not just cosine). For a more detailed discussion of learning rate schedules and many more experiments see also [Gotmare et al., 2018]. In particular they find that a warmup phase limits the amount of divergence of parameters in very deep networks. This makes intuitive sense since we would expect significant divergence due to random initialization in those parts of the network that take the most time to make progress in the beginning.
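The claim that warmup composes with any schedule can be sketched in plain Java by treating a schedule as a function from the update count to a rate. The wrapper below is a hypothetical helper, not DJL's WarmUpTracker; it assumes a linear ramp from 0 up to the main schedule's starting value and that the main schedule's clock starts once warmup ends.

```java
import java.util.function.IntToDoubleFunction;

public class WarmupWrapperDemo {
    // Wraps any schedule with a linear ramp from 0 to the schedule's value at step 0.
    // After `warmupSteps` updates the main schedule runs with a shifted step count.
    static IntToDoubleFunction withWarmup(IntToDoubleFunction main, int warmupSteps) {
        double target = main.applyAsDouble(0);
        return t -> t < warmupSteps
                ? target * t / warmupSteps
                : main.applyAsDouble(t - warmupSteps);
    }

    public static void main(String[] args) {
        // Example main schedule: the square root decay from earlier in this section
        IntToDoubleFunction sqrtDecay = t -> 0.1 * Math.pow(t + 1, -0.5);
        IntToDoubleFunction warmed = withWarmup(sqrtDecay, 5);
        for (int t = 0; t < 10; t++) {
            System.out.printf("t=%d lr=%.4f%n", t, warmed.applyAsDouble(t));
        }
    }
}
```

Because the wrapper only needs the main schedule's value at step 0, the same helper works unchanged for factor, piecewise constant, or cosine schedules.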

## 11.11.4. Summary

• Decreasing the learning rate during training can lead to improved accuracy and (most perplexingly) reduced overfitting of the model.

• A piecewise decrease of the learning rate whenever progress has plateaued is effective in practice. Essentially this ensures that we converge efficiently to a suitable solution and only then reduce the inherent variance of the parameters by reducing the learning rate.

• Cosine schedulers are popular for some computer vision problems.

• A warmup period before optimization can prevent divergence.

• Optimization serves multiple purposes in deep learning. Besides minimizing the training objective, different choices of optimization algorithms and learning rate scheduling can lead to rather different amounts of generalization and overfitting on the test set (for the same amount of training error).

## 11.11.5. Exercises

1. Experiment with the optimization behavior for a given fixed learning rate. What is the best model you can obtain this way?

2. How does convergence change if you change the exponent of the decrease in the learning rate?

3. Apply the cosine scheduler to large computer vision problems, e.g., training ImageNet. How does it affect performance relative to other schedulers?

4. How long should warmup last?

5. Can you connect optimization and sampling? Start by using results from [Welling & Teh, 2011] on Stochastic Gradient Langevin Dynamics.