Run this notebook online:\ |Binder| or Colab: |Colab|

.. |Binder| image:: https://mybinder.org/badge_logo.svg
   :target: https://mybinder.org/v2/gh/deepjavalibrary/d2l-java/master?filepath=chapter_optimization/lr-scheduler.ipynb
.. |Colab| image:: https://colab.research.google.com/assets/colab-badge.svg
   :target: https://colab.research.google.com/github/deepjavalibrary/d2l-java/blob/colab/chapter_optimization/lr-scheduler.ipynb

.. _sec_scheduler:

Learning Rate Scheduling
========================


So far we primarily focused on optimization *algorithms* for how to
update the weight vectors rather than on the *rate* at which they are
being updated. Nonetheless, adjusting the learning rate is often just as
important as the actual algorithm. There are a number of aspects to
consider:

-  Most obviously the *magnitude* of the learning rate matters. If it is
   too large, optimization diverges, if it is too small, it takes too
   long to train or we end up with a suboptimal result. We saw
   previously that the condition number of the problem matters (see
   e.g., :numref:`sec_momentum` for details). Intuitively it is the
   ratio of the amount of change in the least sensitive direction vs.
   the most sensitive one.
-  Secondly, the rate of decay is just as important. If the learning
   rate remains large we may simply end up bouncing around the minimum
   and thus not reach optimality. :numref:`sec_minibatch_sgd`
   discussed this in some detail and we analyzed performance guarantees
   in :numref:`sec_sgd`. In short, we want the rate to decay, but
   probably more slowly than :math:`\mathcal{O}(t^{-\frac{1}{2}})` which
   would be a good choice for convex problems.
-  Another aspect that is equally important is *initialization*. This
   pertains both to how the parameters are set initially (review
   :numref:`sec_numerical_stability` for details) and also how they
   evolve initially. This goes under the moniker of *warmup*, i.e., how
   rapidly we start moving towards the solution initially. Large steps
   in the beginning might not be beneficial, in particular since the
   initial set of parameters is random. The initial update directions
   might be quite meaningless, too.
-  Lastly, there are a number of optimization variants that perform
   cyclical learning rate adjustment. This is beyond the scope of the
   current chapter. We recommend the reader to review details in
   :cite:`Izmailov.Podoprikhin.Garipov.ea.2018`, e.g., how to obtain
   better solutions by averaging over an entire *path* of parameters.

Given the fact that there is a lot of detail needed to manage learning
rates, most deep learning frameworks have tools to deal with this
automatically. In the current chapter we will review the effects that
different schedules have on accuracy and also show how this can be
managed efficiently via a *learning rate scheduler*.

In DJL we will be referring to these as learning rate trackers.

Toy Problem
-----------

We begin with a toy problem that is cheap enough to compute easily, yet
sufficiently nontrivial to illustrate some of the key aspects. For that
we pick a slightly modernized version of LeNet (``relu`` instead of
``sigmoid`` activation, MaxPooling rather than AveragePooling), as
applied to Fashion-MNIST. Moreover, we hybridize the network for
performance. Since most of the code is standard we just introduce the
basics without further detailed discussion. See :numref:`chap_cnn` for
a refresher as needed.

.. code:: java

    %load ../utils/djl-imports
    %load ../utils/plot-utils
    %load ../utils/Functions.java
    %load ../utils/GradDescUtils.java
    %load ../utils/Accumulator.java
    %load ../utils/StopWatch.java
    
    %load ../utils/Training.java
    %load ../utils/TrainingChapter11.java

.. code:: java

    import ai.djl.basicdataset.cv.classification.*;
    import org.apache.commons.lang3.ArrayUtils;

.. code:: java

    SequentialBlock net = new SequentialBlock();
    
    net.add(Conv2d.builder()
            .setKernelShape(new Shape(5, 5))
            .optPadding(new Shape(2, 2))
            .setFilters(1)
            .build());
    net.add(Activation.reluBlock());
    net.add(Pool.maxPool2dBlock(new Shape(2, 2), new Shape(2, 2)));
    net.add(Conv2d.builder()
            .setKernelShape(new Shape(5, 5))
            .setFilters(1)
            .build());
    net.add(Blocks.batchFlattenBlock());
    net.add(Activation.reluBlock());
    net.add(Linear.builder().setUnits(120).build());
    net.add(Activation.reluBlock());
    net.add(Linear.builder().setUnits(84).build());
    net.add(Activation.reluBlock());
    net.add(Linear.builder().setUnits(10).build());


.. parsed-literal::
    :class: output

    SequentialBlock {
    	Conv2d
    	ReLU
    	maxPool2d
    	Conv2d
    	batchFlatten
    	ReLU
    	Linear
    	ReLU
    	Linear
    	ReLU
    	Linear
    }


.. code:: java

    int batchSize = 256;
    RandomAccessDataset trainDataset = FashionMnist.builder()
            .optUsage(Dataset.Usage.TRAIN)
            .setSampling(batchSize, false)
            .optLimit(Long.getLong("DATASET_LIMIT", Long.MAX_VALUE))
            .build();
    
    RandomAccessDataset testDataset = FashionMnist.builder()
            .optUsage(Dataset.Usage.TEST)
            .setSampling(batchSize, false)
            .optLimit(Long.getLong("DATASET_LIMIT", Long.MAX_VALUE))
            .build();

.. code:: java

    double[] trainLoss;
    double[] testAccuracy;
    double[] epochCount;
    double[] trainAccuracy;
    
    public static void train(RandomAccessDataset trainIter, RandomAccessDataset testIter,
                                 int numEpochs, Trainer trainer) throws IOException, TranslateException {
        epochCount = new double[numEpochs];
    
        for (int i = 0; i < epochCount.length; i++) {
            epochCount[i] = (i + 1);
        }
    
        double avgTrainTimePerEpoch = 0;
        Map<String, double[]> evaluatorMetrics = new HashMap<>();
    
        trainer.setMetrics(new Metrics());
    
        EasyTrain.fit(trainer, numEpochs, trainIter, testIter);
    
        Metrics metrics = trainer.getMetrics();
    
        trainer.getEvaluators().stream()
                .forEach(evaluator -> {
                    evaluatorMetrics.put("train_epoch_" + evaluator.getName(), metrics.getMetric("train_epoch_" + evaluator.getName()).stream()
                            .mapToDouble(x -> x.getValue().doubleValue()).toArray());
                    evaluatorMetrics.put("validate_epoch_" + evaluator.getName(), metrics.getMetric("validate_epoch_" + evaluator.getName()).stream()
                            .mapToDouble(x -> x.getValue().doubleValue()).toArray());
                });
    
        avgTrainTimePerEpoch = metrics.mean("epoch");
    
        trainLoss = evaluatorMetrics.get("train_epoch_SoftmaxCrossEntropyLoss");
        trainAccuracy = evaluatorMetrics.get("train_epoch_Accuracy");
        testAccuracy = evaluatorMetrics.get("validate_epoch_Accuracy");
    
        System.out.printf("loss %.3f," , trainLoss[numEpochs-1]);
        System.out.printf(" train acc %.3f," , trainAccuracy[numEpochs-1]);
        System.out.printf(" test acc %.3f\n" , testAccuracy[numEpochs-1]);
        System.out.printf("%.1f examples/sec \n", trainIter.size() / (avgTrainTimePerEpoch / Math.pow(10, 9)));
    }

Let us have a look at what happens if we invoke this algorithm with
default settings, such as a learning rate of :math:`0.3` and train for
:math:`30` iterations. Note how the training accuracy keeps on
increasing while progress in terms of test accuracy stalls beyond a
point. The gap between both curves indicates overfitting.

.. code:: java

    float lr = 0.3f;
    int numEpochs = Integer.getInteger("MAX_EPOCH", 10);
    
    Model model = Model.newInstance("Modern LeNet");
    model.setBlock(net);
    
    Loss loss = Loss.softmaxCrossEntropyLoss();
    Tracker lrt = Tracker.fixed(lr);
    Optimizer sgd = Optimizer.sgd().setLearningRateTracker(lrt).build();
    
    DefaultTrainingConfig config = new DefaultTrainingConfig(loss)
            .optOptimizer(sgd) // Optimizer
            .addEvaluator(new Accuracy()) // Model Accuracy
            .addTrainingListeners(TrainingListener.Defaults.logging()); // Logging
    
    Trainer trainer = model.newTrainer(config);
    trainer.initialize(new Shape(1, 1, 28, 28));
    
    train(trainDataset, testDataset, numEpochs, trainer);


.. parsed-literal::
    :class: output

    INFO Training on: 4 GPUs.
    INFO Load MXNet Engine Version 1.9.0 in 0.053 ms.


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 1 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 2 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 3 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 4 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 5 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 6 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 7 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 8 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 9 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 10 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    loss 2.304, train acc 0.100, test acc 0.100
    8581.8 examples/sec 


.. code:: java

    public void plotMetrics() {
        String[] lossLabel = new String[trainLoss.length + testAccuracy.length + trainAccuracy.length];
    
        Arrays.fill(lossLabel, 0, trainLoss.length, "train loss");
        Arrays.fill(lossLabel, trainAccuracy.length, trainLoss.length + trainAccuracy.length, "train acc");
        Arrays.fill(lossLabel, trainLoss.length + trainAccuracy.length,
                        trainLoss.length + testAccuracy.length + trainAccuracy.length, "test acc");
    
        Table data = Table.create("Data").addColumns(
            DoubleColumn.create("epoch", ArrayUtils.addAll(epochCount, ArrayUtils.addAll(epochCount, epochCount))),
            DoubleColumn.create("metrics", ArrayUtils.addAll(trainLoss, ArrayUtils.addAll(trainAccuracy, testAccuracy))),
            StringColumn.create("lossLabel", lossLabel)
        );
    
        display(LinePlot.create("", data, "epoch", "metrics", "lossLabel"));
    }
    
    plotMetrics();


.. raw:: html

    <img id="7dc215d21ef843f38a0156ad5700ed8b_img"></img>
    <div id="7dc215d21ef843f38a0156ad5700ed8b"></div>
    <script>require(['https://cdn.plot.ly/plotly-1.57.0.min.js'], Plotly => {
    var target_7dc215d21ef843f38a0156ad5700ed8b = document.getElementById('7dc215d21ef843f38a0156ad5700ed8b');
    var layout = {
        height: 600,
        width: 800,
        showlegend: true,
        xaxis: {
        title: 'epoch',
        },
    
        yaxis: {
        title: 'metrics',
        },
    
    };
    
    var trace0 =
    {
    x: ["1.0","2.0","3.0","4.0","5.0","6.0","7.0","8.0","9.0","10.0"],
    y: ["2.3041308","2.3038523","2.3038523","2.3038523","2.3038523","2.3038523","2.3038523","2.3038523","2.3038523","2.3038523"],
    showlegend: true,
    mode: 'lines',
    xaxis: 'x',
    yaxis: 'y',
    type: 'scatter',
    name: 'train loss',
    };
    var trace1 =
    {
    x: ["1.0","2.0","3.0","4.0","5.0","6.0","7.0","8.0","9.0","10.0"],
    y: ["0.10036667","0.0998","0.0998","0.0998","0.0998","0.0998","0.0998","0.0998","0.0998","0.0998"],
    showlegend: true,
    mode: 'lines',
    xaxis: 'x',
    yaxis: 'y',
    type: 'scatter',
    name: 'train acc',
    };
    var trace2 =
    {
    x: ["1.0","2.0","3.0","4.0","5.0","6.0","7.0","8.0","9.0","10.0"],
    y: ["0.1","0.1","0.1","0.1","0.1","0.1","0.1","0.1","0.1","0.1"],
    showlegend: true,
    mode: 'lines',
    xaxis: 'x',
    yaxis: 'y',
    type: 'scatter',
    name: 'test acc',
    };
    
    
    var data = [ trace0, trace1, trace2];
    Plotly.newPlot(target_7dc215d21ef843f38a0156ad5700ed8b, data, layout);
    })</script>


Trackers
--------

One way of adjusting the learning rate is to set it explicitly at each
step. We could adjust it downward after every epoch (or even after every
minibatch), e.g., in a dynamic manner in response to how optimization is
progressing.

We, however, can't directly change the learning rate with the trainer
after it has already been created. What we can do instead is create a
tracker to do this for us.

When invoked with the number of updates it returns the appropriate value
of the learning rate. Let us define a simple one that sets the learning
rate to :math:`\eta = \eta_0 (t + 1)^{-\frac{1}{2}}`.

.. code:: java

    public class SquareRootTracker {
        float lr;
        public SquareRootTracker() {
            this(0.1f);
        }
        public SquareRootTracker(float learningRate) {
            this.lr = learningRate;
        }
        public float getNewLearningRate(int numUpdate) {
            return lr * (float) Math.pow(numUpdate + 1, -0.5);
        }
    }

Note: This is not a drop in replacement for a standard Learning Rate
Tracker (LRT). This is just a simple example to give a better
understanding of how they work.

Let us plot its behavior over a range of values.

.. code:: java

    public Figure plotLearningRate(int[] epochs, float[] learningRates) {
        
        String[] lossLabel = new String[trainLoss.length + testAccuracy.length + trainAccuracy.length];
    
        Arrays.fill(lossLabel, 0, trainLoss.length, "train loss");
        Arrays.fill(lossLabel, trainAccuracy.length, trainLoss.length + trainAccuracy.length, "train acc");
        Arrays.fill(lossLabel, trainLoss.length + trainAccuracy.length,
                        trainLoss.length + testAccuracy.length + trainAccuracy.length, "test acc");
    
        Table data = Table.create("Data").addColumns(
                    IntColumn.create("epoch", epochs),
                    DoubleColumn.create("learning rate", learningRates)
        );
    
        return LinePlot.create("Learning Rate vs. Epoch", data, "epoch", "learning rate");
    }

.. code:: java

    SquareRootTracker tracker = new SquareRootTracker();
    
    int[] epochs = new int[numEpochs];
    float[] learningRates = new float[numEpochs];
    for (int i = 0; i < numEpochs; i++) {
        epochs[i] = i;
        learningRates[i] = tracker.getNewLearningRate(i);
    }
    
    plotLearningRate(epochs, learningRates);


.. raw:: html

    <img id="5f3b9db907b84baebe30220e1e8bfa9d_img"></img>
    <div id="5f3b9db907b84baebe30220e1e8bfa9d"></div>
    <script>require(['https://cdn.plot.ly/plotly-1.57.0.min.js'], Plotly => {
    var target_5f3b9db907b84baebe30220e1e8bfa9d = document.getElementById('5f3b9db907b84baebe30220e1e8bfa9d');
    var layout = {
        title: 'Learning Rate vs. Epoch',
        height: 600,
        width: 800,
        xaxis: {
        title: 'epoch',
        },
    
        yaxis: {
        title: 'learning rate',
        },
    
    };
    
    var trace0 =
    {
    x: ["0","1","2","3","4","5","6","7","8","9"],
    y: ["0.10000000149011612","0.0707106813788414","0.05773502588272095","0.05000000074505806","0.04472136124968529","0.040824830532073975","0.03779644891619682","0.0353553406894207","0.03333333507180214","0.03162277862429619"],
    mode: 'lines',
    xaxis: 'x',
    yaxis: 'y',
    type: 'scatter',
    name: '',
    };
    
    
    var data = [ trace0];
    Plotly.newPlot(target_5f3b9db907b84baebe30220e1e8bfa9d, data, layout);
    })</script>


Now let us see how this plays out for training on Fashion-MNIST. We
can't actually do it directly, but we can see how the curve would look
theoretically.

This looks like it works quite a bit better than before. Two things
stand out: the curve was rather more smooth than previously. Secondly,
there was less overfitting. Unfortunately it is not a well-resolved
question as to why certain strategies lead to less overfitting in
*theory*. There is some argument that a smaller stepsize will lead to
parameters that are closer to zero and thus simpler. However, this does
not explain the phenomenon entirely since we do not really stop early
but simply reduce the learning rate gently.

Policies
--------

While we cannot possibly cover the entire variety of learning rate
trackers, we attempt to give a brief overview of popular policies below.
Common choices are polynomial decay and piecewise constant schedules.
Beyond that, cosine learning rate schedules have been found to work well
empirically on some problems. Lastly, on some problems it is beneficial
to warm up the optimizer prior to using large learning rates.

Factor Tracker
~~~~~~~~~~~~~~

One alternative to a polynomial decay would be a multiplicative one,
that is :math:`\eta_{t+1} \leftarrow \eta_t \cdot \alpha` for
:math:`\alpha \in (0, 1)`. To prevent the learning rate from decaying
beyond a reasonable lower bound the update equation is often modified to
:math:`\eta_{t+1} \leftarrow \mathop{\mathrm{max}}(\eta_{\mathrm{min}}, \eta_t \cdot \alpha)`.

.. code:: java

    public class DemoFactorTracker {
        float baseLr;
        float stopFactorLr;
        float factor;
        public DemoFactorTracker(float factor, float stopFactorLr, float baseLr) {
            this.factor = factor;
            this.stopFactorLr = stopFactorLr;
            this.baseLr = baseLr;
        }
        public DemoFactorTracker() {
            this(1f, (float) 1e-7, 0.1f);
        }
        public float getNewLearningRate(int numUpdate) {
            return lr * (float) Math.pow(numUpdate + 1, -0.5);
        }
    }

.. code:: java

    DemoFactorTracker tracker = new DemoFactorTracker(0.9f, (float) 1e-2, 2);
    
    numEpochs = 50;
    int[] epochs = new int[numEpochs];
    float[] learningRates = new float[numEpochs];
    for (int i = 0; i < numEpochs; i++) {
        epochs[i] = i;
        learningRates[i] = tracker.getNewLearningRate(i);
    }
    
    plotLearningRate(epochs, learningRates);


.. raw:: html

    <img id="c6b28709d2314fba82e3062250a27650_img"></img>
    <div id="c6b28709d2314fba82e3062250a27650"></div>
    <script>require(['https://cdn.plot.ly/plotly-1.57.0.min.js'], Plotly => {
    var target_c6b28709d2314fba82e3062250a27650 = document.getElementById('c6b28709d2314fba82e3062250a27650');
    var layout = {
        title: 'Learning Rate vs. Epoch',
        height: 600,
        width: 800,
        xaxis: {
        title: 'epoch',
        },
    
        yaxis: {
        title: 'learning rate',
        },
    
    };
    
    var trace0 =
    {
    x: ["0","1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24","25","26","27","28","29","30","31","32","33","34","35","36","37","38","39","40","41","42","43","44","45","46","47","48","49"],
    y: ["0.30000001192092896","0.2121320366859436","0.17320507764816284","0.15000000596046448","0.13416408002376556","0.12247449904680252","0.11338934302330017","0.1060660183429718","0.10000000894069672","0.09486833214759827","0.09045340865850449","0.08660253882408142","0.08320502936840057","0.08017837256193161","0.0774596706032753","0.07500000298023224","0.0727606862783432","0.0707106813788414","0.06882472336292267","0.06708204001188278","0.06546536833047867","0.06396021693944931","0.06255432218313217","0.06123724952340126","0.06000000238418579","0.05883484333753586","0.057735029608011246","0.056694671511650085","0.05570860207080841","0.054772257804870605","0.05388159304857254","0.0530330091714859","0.05222329869866371","0.05144957825541496","0.050709255039691925","0.05000000447034836","0.04931969568133354","0.04866642877459526","0.04803844541311264","0.04743416607379913","0.04685213044285774","0.046291008591651917","0.04574957489967346","0.04522670432925224","0.04472136124968529","0.04423258826136589","0.04375949874520302","0.04330126941204071","0.04285714775323868","0.04242641106247902"],
    mode: 'lines',
    xaxis: 'x',
    yaxis: 'y',
    type: 'scatter',
    name: '',
    };
    
    
    var data = [ trace0];
    Plotly.newPlot(target_c6b28709d2314fba82e3062250a27650, data, layout);
    })</script>


This can also be accomplished by a built-in scheduler in DJL via the
``LearningRateTracker.factorTracker()`` builder. It takes a few more
parameters, such as warmup period, warmup mode (linear or constant), the
maximum number of desired updates, etc.; Going forward we will use the
built-in schedulers as appropriate and only explain their functionality
here.

Multi Factor Scheduler
~~~~~~~~~~~~~~~~~~~~~~

A common strategy for training deep networks is to keep the learning
rate piecewise constant and to decrease it by a given amount every so
often. That is, given a set of times when to decrease the rate, such as
:math:`s = \{5, 10, 20\}` decrease
:math:`\eta_{t+1} \leftarrow \eta_t \cdot \alpha` whenever
:math:`t \in s`. Assuming that the values are halved at each step we can
implement this as follows.

.. code:: java

    MultiFactorTracker tracker = Tracker.multiFactor()
            .setSteps(new int[]{5, 30})
            .optFactor(0.5f)
            .setBaseValue(0.5f)
            .build();
    
    numEpochs = 10;
    int[] epochs = new int[numEpochs];
    float[] learningRates = new float[numEpochs];
    for (int i = 0; i < numEpochs; i++) {
        epochs[i] = i;
        learningRates[i] = tracker.getNewValue(i);
    }
    
    plotLearningRate(epochs, learningRates);


.. raw:: html

    <img id="667440076dd14b1e988fa3189e979253_img"></img>
    <div id="667440076dd14b1e988fa3189e979253"></div>
    <script>require(['https://cdn.plot.ly/plotly-1.57.0.min.js'], Plotly => {
    var target_667440076dd14b1e988fa3189e979253 = document.getElementById('667440076dd14b1e988fa3189e979253');
    var layout = {
        title: 'Learning Rate vs. Epoch',
        height: 600,
        width: 800,
        xaxis: {
        title: 'epoch',
        },
    
        yaxis: {
        title: 'learning rate',
        },
    
    };
    
    var trace0 =
    {
    x: ["0","1","2","3","4","5","6","7","8","9"],
    y: ["0.5","0.5","0.5","0.5","0.5","0.5","0.25","0.25","0.25","0.25"],
    mode: 'lines',
    xaxis: 'x',
    yaxis: 'y',
    type: 'scatter',
    name: '',
    };
    
    
    var data = [ trace0];
    Plotly.newPlot(target_667440076dd14b1e988fa3189e979253, data, layout);
    })</script>


The intuition behind this piecewise constant learning rate schedule is
that one lets optimization proceed until a stationary point has been
reached in terms of the distribution of weight vectors. Then (and only
then) do we decrease the rate such as to obtain a higher quality proxy
to a good local minimum. The example below shows how this can produce
ever slightly better solutions.

.. code:: java

    int numEpochs = Integer.getInteger("MAX_EPOCH", 10);
    
    Model model = Model.newInstance("Modern LeNet");
    model.setBlock(net);
    
    Loss loss = Loss.softmaxCrossEntropyLoss();
    Optimizer sgd = Optimizer.sgd().setLearningRateTracker(tracker).build();
    
    DefaultTrainingConfig config = new DefaultTrainingConfig(loss)
            .optOptimizer(sgd) // Optimizer
            .addEvaluator(new Accuracy()) // Model Accuracy
            .addTrainingListeners(TrainingListener.Defaults.logging()); // Logging
    
    Trainer trainer = model.newTrainer(config);
    trainer.initialize(new Shape(1, 1, 28, 28));
    
    train(trainDataset, testDataset, numEpochs, trainer);
    plotMetrics();


.. parsed-literal::
    :class: output

    INFO Training on: 4 GPUs.
    INFO Load MXNet Engine Version 1.9.0 in 0.021 ms.


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 1 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 2 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 3 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 4 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 5 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 6 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 7 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 8 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 9 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 10 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    loss 2.303, train acc 0.100, test acc 0.100
    10711.8 examples/sec 


.. raw:: html

    <img id="d2a3fea69bd64834a53142d12e9dd736_img"></img>
    <div id="d2a3fea69bd64834a53142d12e9dd736"></div>
    <script>require(['https://cdn.plot.ly/plotly-1.57.0.min.js'], Plotly => {
    var target_d2a3fea69bd64834a53142d12e9dd736 = document.getElementById('d2a3fea69bd64834a53142d12e9dd736');
    var layout = {
        height: 600,
        width: 800,
        showlegend: true,
        xaxis: {
        title: 'epoch',
        },
    
        yaxis: {
        title: 'metrics',
        },
    
    };
    
    var trace0 =
    {
    x: ["1.0","2.0","3.0","4.0","5.0","6.0","7.0","8.0","9.0","10.0"],
    y: ["2.3032598","2.3031666","2.3031666","2.3031666","2.3031666","2.3031666","2.3031666","2.3031666","2.3031666","2.3031666"],
    showlegend: true,
    mode: 'lines',
    xaxis: 'x',
    yaxis: 'y',
    type: 'scatter',
    name: 'train loss',
    };
    var trace1 =
    {
    x: ["1.0","2.0","3.0","4.0","5.0","6.0","7.0","8.0","9.0","10.0"],
    y: ["0.09975","0.0995","0.0995","0.0995","0.0995","0.0995","0.0995","0.0995","0.0995","0.0995"],
    showlegend: true,
    mode: 'lines',
    xaxis: 'x',
    yaxis: 'y',
    type: 'scatter',
    name: 'train acc',
    };
    var trace2 =
    {
    x: ["1.0","2.0","3.0","4.0","5.0","6.0","7.0","8.0","9.0","10.0"],
    y: ["0.1","0.1","0.1","0.1","0.1","0.1","0.1","0.1","0.1","0.1"],
    showlegend: true,
    mode: 'lines',
    xaxis: 'x',
    yaxis: 'y',
    type: 'scatter',
    name: 'test acc',
    };
    
    
    var data = [ trace0, trace1, trace2];
    Plotly.newPlot(target_d2a3fea69bd64834a53142d12e9dd736, data, layout);
    })</script>


.. code:: java

    String[] lossLabel = new String[trainLoss.length + testAccuracy.length + trainAccuracy.length];
    
    Arrays.fill(lossLabel, 0, trainLoss.length, "train loss");
    Arrays.fill(lossLabel, trainAccuracy.length, trainLoss.length + trainAccuracy.length, "train acc");
    Arrays.fill(lossLabel, trainLoss.length + trainAccuracy.length,
                    trainLoss.length + testAccuracy.length + trainAccuracy.length, "test acc");
    
    Table data = Table.create("Data").addColumns(
                DoubleColumn.create("epoch", ArrayUtils.addAll(epochCount, ArrayUtils.addAll(epochCount, epochCount))),
                DoubleColumn.create("metrics", ArrayUtils.addAll(trainLoss, ArrayUtils.addAll(trainAccuracy, testAccuracy))),
                StringColumn.create("lossLabel", lossLabel)
    );
    
    LinePlot.create("", data, "epoch", "metrics", "lossLabel");


.. raw:: html

    <img id="5819c57dbe6e461ca1282695e6eb1662_img"></img>
    <div id="5819c57dbe6e461ca1282695e6eb1662"></div>
    <script>require(['https://cdn.plot.ly/plotly-1.57.0.min.js'], Plotly => {
    var target_5819c57dbe6e461ca1282695e6eb1662 = document.getElementById('5819c57dbe6e461ca1282695e6eb1662');
    var layout = {
        height: 600,
        width: 800,
        showlegend: true,
        xaxis: {
        title: 'epoch',
        },
    
        yaxis: {
        title: 'metrics',
        },
    
    };
    
    var trace0 =
    {
    x: ["1.0","2.0","3.0","4.0","5.0","6.0","7.0","8.0","9.0","10.0"],
    y: ["2.3032598","2.3031666","2.3031666","2.3031666","2.3031666","2.3031666","2.3031666","2.3031666","2.3031666","2.3031666"],
    showlegend: true,
    mode: 'lines',
    xaxis: 'x',
    yaxis: 'y',
    type: 'scatter',
    name: 'train loss',
    };
    var trace1 =
    {
    x: ["1.0","2.0","3.0","4.0","5.0","6.0","7.0","8.0","9.0","10.0"],
    y: ["0.09975","0.0995","0.0995","0.0995","0.0995","0.0995","0.0995","0.0995","0.0995","0.0995"],
    showlegend: true,
    mode: 'lines',
    xaxis: 'x',
    yaxis: 'y',
    type: 'scatter',
    name: 'train acc',
    };
    var trace2 =
    {
    x: ["1.0","2.0","3.0","4.0","5.0","6.0","7.0","8.0","9.0","10.0"],
    y: ["0.1","0.1","0.1","0.1","0.1","0.1","0.1","0.1","0.1","0.1"],
    showlegend: true,
    mode: 'lines',
    xaxis: 'x',
    yaxis: 'y',
    type: 'scatter',
    name: 'test acc',
    };
    
    
    var data = [ trace0, trace1, trace2];
    Plotly.newPlot(target_5819c57dbe6e461ca1282695e6eb1662, data, layout);
    })</script>


Cosine Tracker
~~~~~~~~~~~~~~

A rather perplexing heuristic was proposed by
:cite:`Loshchilov.Hutter.2016`. It relies on the observation that we
might not want to decrease the learning rate too drastically in the
beginning and moreover, that we might want to "refine" the solution in
the end using a very small learning rate. This results in a cosine-like
tracker with the following functional form for learning rates in the
range :math:`t \in [0, T]`.

.. math:: \eta_t = \eta_T + \frac{\eta_0 - \eta_T}{2} \left(1 + \cos(\pi t/T)\right)

Here :math:`\eta_0` is the initial learning rate, :math:`\eta_T` is the
target rate at time :math:`T`. Furthermore, for :math:`t > T` we simply
pin the value to :math:`\eta_T` without increasing it again. In the
following example, we set the max update step :math:`T = 20`.

.. code:: java

    public class DemoCosineTracker {
        float baseLr;
        float finalLr;
        int maxUpdate;
        public DemoCosineTracker() {
            this(0.5f, 0.01f, 20);
        }
        public DemoCosineTracker(float baseLr, float finalLr, int maxUpdate) {
            this.baseLr = baseLr;
            this.finalLr = finalLr;
            this.maxUpdate = maxUpdate;
        }
        public float getNewLearningRate(int numUpdate) {
            if (numUpdate > maxUpdate) {
                return finalLr;
            }
            // Scale the curve to smoothly transition
            float step = (baseLr - finalLr) / 2 * (1 + (float) Math.cos(Math.PI * numUpdate / maxUpdate));
            return finalLr + step;
        }
    }

.. code:: java

    DemoCosineTracker tracker = new DemoCosineTracker(0.5f, 0.01f, 20);
    
    int[] epochs = new int[numEpochs];
    float[] learningRates = new float[numEpochs];
    for (int i = 0; i < numEpochs; i++) {
        epochs[i] = i;
        learningRates[i] = tracker.getNewLearningRate(i);
    }
    
    plotLearningRate(epochs, learningRates);


.. raw:: html

    <img id="2ae74eaa5871425bb5d318ae79ff75bf_img"></img>
    <div id="2ae74eaa5871425bb5d318ae79ff75bf"></div>
    <script>require(['https://cdn.plot.ly/plotly-1.57.0.min.js'], Plotly => {
    var target_2ae74eaa5871425bb5d318ae79ff75bf = document.getElementById('2ae74eaa5871425bb5d318ae79ff75bf');
    var layout = {
        title: 'Learning Rate vs. Epoch',
        height: 600,
        width: 800,
        xaxis: {
        title: 'epoch',
        },
    
        yaxis: {
        title: 'learning rate',
        },
    
    };
    
    var trace0 =
    {
    x: ["0","1","2","3","4","5","6","7","8","9"],
    y: ["0.5","0.4969836473464966","0.4880088269710541","0.473296582698822","0.45320916175842285","0.42824116349220276","0.3990073800086975","0.3662276566028595","0.33070915937423706","0.2933264374732971"],
    mode: 'lines',
    xaxis: 'x',
    yaxis: 'y',
    type: 'scatter',
    name: '',
    };
    
    
    var data = [ trace0];
    Plotly.newPlot(target_2ae74eaa5871425bb5d318ae79ff75bf, data, layout);
    })</script>


In the context of computer vision this schedule *can* lead to improved
results. Note, though, that such improvements are not guaranteed (as can
be seen below).

.. code:: java

    CosineTracker cosineTracker = Tracker.cosine()
                                .setBaseValue(0.5f)
                                .optFinalValue(0.01f)
                                .setMaxUpdates(20)
                                .build();
    
    Loss loss = Loss.softmaxCrossEntropyLoss();
    Optimizer sgd = Optimizer.sgd().setLearningRateTracker(cosineTracker).build();
    
    DefaultTrainingConfig config = new DefaultTrainingConfig(loss)
            .optOptimizer(sgd) // Optimizer
            .addEvaluator(new Accuracy()) // Model Accuracy
            .addTrainingListeners(TrainingListener.Defaults.logging()); // Logging
    
    Trainer trainer = model.newTrainer(config);
    trainer.initialize(new Shape(1, 1, 28, 28));
    
    train(trainDataset, testDataset, numEpochs, trainer);


.. parsed-literal::
    :class: output

    INFO Training on: 4 GPUs.
    INFO Load MXNet Engine Version 1.9.0 in 0.020 ms.


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 1 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 2 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 3 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 4 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 5 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 6 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 7 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 8 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 9 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 10 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    loss 2.303, train acc 0.096, test acc 0.100
    10714.5 examples/sec 


Warmup
~~~~~~

In some cases initializing the parameters is not sufficient to guarantee
a good solution. This particularly a problem for some advanced network
designs that may lead to unstable optimization problems. We could
address this by choosing a sufficiently small learning rate to prevent
divergence in the beginning. Unfortunately this means that progress is
slow. Conversely, a large learning rate initially leads to divergence.

A rather simple fix for this dilemma is to use a warmup period during
which the learning rate *increases* to its initial maximum and to cool
down the rate until the end of the optimization process. For simplicity
one typically uses a linear increase for this purpose. This leads to a
schedule of the form indicated below.

.. code:: java

    public class CosineWarmupTracker {
        float baseLr;
        float finalLr;
        int maxUpdate;
        int warmUpSteps;
        float warmUpBeginValue;
        float warmUpFinalValue;
        
        public CosineWarmupTracker() {
            this(0.5f, 0.01f, 20, 5);
        }
        
        public CosineWarmupTracker(float baseLr, float finalLr, int maxUpdate, int warmUpSteps) {
            this.baseLr = baseLr;
            this.finalLr = finalLr;
            this.maxUpdate = maxUpdate;
            this.warmUpSteps = 5;
            this.warmUpBeginValue = 0f;
        }
        
        public float getNewLearningRate(int numUpdate) {
            if (numUpdate <= warmUpSteps) {
                return getWarmUpValue(numUpdate);
            }
            if (numUpdate > maxUpdate) {
                return finalLr;
            }
            // Scale the cosine curve to fit smoothly with the warmup steps
            float step = (baseLr - finalLr) / 2 * (1 + 
                (float) Math.cos(Math.PI * (numUpdate - warmUpSteps) / (maxUpdate - warmUpSteps)));
            return finalLr + step;
        }
        
        public float getWarmUpValue(int numUpdate) {
            // Linear warmup
            return warmUpBeginValue + (baseLr - warmUpBeginValue) * numUpdate / warmUpSteps;
        }
    }

.. code:: java

    CosineWarmupTracker tracker = new CosineWarmupTracker(0.5f, 0.01f, 20, 5);
    
    int[] epochs = new int[numEpochs];
    float[] learningRates = new float[numEpochs];
    for (int i = 0; i < numEpochs; i++) {
        epochs[i] = i;
        learningRates[i] = tracker.getNewLearningRate(i);
    }
    
    plotLearningRate(epochs, learningRates);


.. raw:: html

    <img id="9dce7cfbfc5f432c9dbcd547e0db35f1_img"></img>
    <div id="9dce7cfbfc5f432c9dbcd547e0db35f1"></div>
    <script>require(['https://cdn.plot.ly/plotly-1.57.0.min.js'], Plotly => {
    var target_9dce7cfbfc5f432c9dbcd547e0db35f1 = document.getElementById('9dce7cfbfc5f432c9dbcd547e0db35f1');
    var layout = {
        title: 'Learning Rate vs. Epoch',
        height: 600,
        width: 800,
        xaxis: {
        title: 'epoch',
        },
    
        yaxis: {
        title: 'learning rate',
        },
    
    };
    
    var trace0 =
    {
    x: ["0","1","2","3","4","5","6","7","8","9"],
    y: ["0.0","0.10000000149011612","0.20000000298023224","0.30000001192092896","0.4000000059604645","0.5","0.4946461617946625","0.4788186252117157","0.45320916175842285","0.41893699765205383"],
    mode: 'lines',
    xaxis: 'x',
    yaxis: 'y',
    type: 'scatter',
    name: '',
    };
    
    
    var data = [ trace0];
    Plotly.newPlot(target_9dce7cfbfc5f432c9dbcd547e0db35f1, data, layout);
    })</script>


Note that the network converges better initially (in particular observe
the performance during the first 5 epochs).

Additionally, we still use a total of 20 max updates, but the 1st 5 are
dedicated to the warmup steps. The cosine curve will then be squeezed
into the 15 steps relative to the earlier 20 steps.

.. code:: java

    CosineTracker cosineTracker = Tracker.cosine()
            .setBaseValue(0.5f)
            .optFinalValue(0.01f)
            .setMaxUpdates(15)
            .build();
    
    WarmUpTracker warmupCosine = Tracker.warmUp()
            .optWarmUpSteps(5)
            .setMainTracker(cosineTracker)
            .build();
    
    Loss loss = Loss.softmaxCrossEntropyLoss();
    Optimizer sgd = Optimizer.sgd().setLearningRateTracker(warmupCosine).build();
    
    DefaultTrainingConfig config = new DefaultTrainingConfig(loss)
            .optOptimizer(sgd) // Optimizer
            .addEvaluator(new Accuracy()) // Model Accuracy
            .addTrainingListeners(TrainingListener.Defaults.logging()); // Logging
    
    Trainer trainer = model.newTrainer(config);
    trainer.initialize(new Shape(1, 1, 28, 28));
    
    train(trainDataset, testDataset, numEpochs, trainer);
    plotMetrics();


.. parsed-literal::
    :class: output

    INFO Training on: 4 GPUs.
    INFO Load MXNet Engine Version 1.9.0 in 0.029 ms.


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 1 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 2 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 3 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 4 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 5 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 6 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 7 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 8 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 9 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 10 finished.
    INFO Train: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30
    INFO Validate: Accuracy: 0.10, SoftmaxCrossEntropyLoss: 2.30


.. parsed-literal::
    :class: output

    loss 2.303, train acc 0.096, test acc 0.100
    10677.1 examples/sec 


.. raw:: html

    <img id="b4755afe6d724ce0809ddf0d22379706_img"></img>
    <div id="b4755afe6d724ce0809ddf0d22379706"></div>
    <script>require(['https://cdn.plot.ly/plotly-1.57.0.min.js'], Plotly => {
    var target_b4755afe6d724ce0809ddf0d22379706 = document.getElementById('b4755afe6d724ce0809ddf0d22379706');
    var layout = {
        height: 600,
        width: 800,
        showlegend: true,
        xaxis: {
        title: 'epoch',
        },
    
        yaxis: {
        title: 'metrics',
        },
    
    };
    
    var trace0 =
    {
    x: ["1.0","2.0","3.0","4.0","5.0","6.0","7.0","8.0","9.0","10.0"],
    y: ["2.3032575","2.3027117","2.302654","2.3026478","2.3026521","2.3026528","2.302652","2.3026526","2.3026528","2.3026528"],
    showlegend: true,
    mode: 'lines',
    xaxis: 'x',
    yaxis: 'y',
    type: 'scatter',
    name: 'train loss',
    };
    var trace1 =
    {
    x: ["1.0","2.0","3.0","4.0","5.0","6.0","7.0","8.0","9.0","10.0"],
    y: ["0.0997","0.1","0.09865","0.09745","0.09616667","0.097","0.09661666","0.09621666","0.0964","0.096366666"],
    showlegend: true,
    mode: 'lines',
    xaxis: 'x',
    yaxis: 'y',
    type: 'scatter',
    name: 'train acc',
    };
    var trace2 =
    {
    x: ["1.0","2.0","3.0","4.0","5.0","6.0","7.0","8.0","9.0","10.0"],
    y: ["0.1","0.1","0.1","0.1","0.1","0.1","0.1","0.1","0.1","0.1"],
    showlegend: true,
    mode: 'lines',
    xaxis: 'x',
    yaxis: 'y',
    type: 'scatter',
    name: 'test acc',
    };
    
    
    var data = [ trace0, trace1, trace2];
    Plotly.newPlot(target_b4755afe6d724ce0809ddf0d22379706, data, layout);
    })</script>


Warmup can be applied to any scheduler (not just cosine). For a more
detailed discussion of learning rate schedules and many more experiments
see also :cite:`Gotmare.Keskar.Xiong.ea.2018`. In particular they find
that a warmup phase limits the amount of divergence of parameters in
very deep networks. This makes intuitively sense since we would expect
significant divergence due to random initialization in those parts of
the network that take the most time to make progress in the beginning.

Summary
-------

-  Decreasing the learning rate during training can lead to improved
   accuracy and (most perplexingly) reduced overfitting of the model.
-  A piecewise decrease of the learning rate whenever progress has
   plateaued is effective in practice. Essentially this ensures that we
   converge efficiently to a suitable solution and only then reduce the
   inherent variance of the parameters by reducing the learning rate.
-  Cosine schedulers are popular for some computer vision problems.
-  A warmup period before optimization can prevent divergence.
-  Optimization serves multiple purposes in deep learning. Besides
   minimizing the training objective, different choices of optimization
   algorithms and learning rate scheduling can lead to rather different
   amounts of generalization and overfitting on the test set (for the
   same amount of training error).

Exercises
---------

1. Experiment with the optimization behavior for a given fixed learning
   rate. What is the best model you can obtain this way?
2. How does convergence change if you change the exponent of the
   decrease in the learning rate?
3. Apply the cosine scheduler to large computer vision problems, e.g.,
   training ImageNet. How does it affect performance relative to other
   schedulers?
4. How long should warmup last?
5. Can you connect optimization and sampling? Start by using results
   from :cite:`Welling.Teh.2011` on Stochastic Gradient Langevin
   Dynamics.