Run this notebook online:\ |Binder| or Colab: |Colab|

.. |Binder| image:: https://mybinder.org/badge_logo.svg
   :target: https://mybinder.org/v2/gh/deepjavalibrary/d2l-java/master?filepath=chapter_optimization/momentum.ipynb
.. |Colab| image:: https://colab.research.google.com/assets/colab-badge.svg
   :target: https://colab.research.google.com/github/deepjavalibrary/d2l-java/blob/colab/chapter_optimization/momentum.ipynb

.. _sec_momentum:

Momentum
========


In :numref:`sec_sgd` we reviewed what happens when performing
stochastic gradient descent, i.e., when performing optimization where
only a noisy variant of the gradient is available. In particular, we
noticed that for noisy gradients we need to be extra cautious when
choosing the learning rate. If we decrease it too rapidly, convergence
stalls. If we are too lenient, we fail to converge to a good enough
solution since noise keeps on driving us away from optimality.

Basics
------

In this section, we will explore more effective optimization algorithms,
especially for certain types of optimization problems that are common in
practice.

Leaky Averages
~~~~~~~~~~~~~~

The previous section saw us discussing minibatch SGD as a means for
accelerating computation. It also had the nice side-effect that
averaging gradients reduced the amount of variance. The minibatch
gradient is given by

.. math::

   \mathbf{g}_{t, t-1} = \partial_{\mathbf{w}} \frac{1}{|\mathcal{B}_t|} \sum_{i \in \mathcal{B}_t} f(\mathbf{x}_{i}, \mathbf{w}_{t-1}) = \frac{1}{|\mathcal{B}_t|} \sum_{i \in \mathcal{B}_t} \mathbf{g}_{i, t-1}.

Here we used
:math:`\mathbf{g}_{i, t-1} = \partial_{\mathbf{w}} f(\mathbf{x}_i, \mathbf{w}_{t-1})`
to keep the notation simple. It would be nice if we could benefit from
the effect of variance reduction even beyond averaging gradients on a
minibatch. One option to accomplish this task is to replace the
gradient computation by a "leaky average":

.. math:: \mathbf{v}_t = \beta \mathbf{v}_{t-1} + \mathbf{g}_{t, t-1}

for some :math:`\beta \in (0, 1)`. This effectively replaces the
instantaneous gradient by one that's been averaged over multiple *past*
gradients. :math:`\mathbf{v}` is called *momentum*. It accumulates past
gradients similar to how a heavy ball rolling down the objective
function landscape integrates over past forces. To see what is happening
in more detail let us expand :math:`\mathbf{v}_t` recursively into

.. math::

   \begin{aligned}
   \mathbf{v}_t = \beta^2 \mathbf{v}_{t-2} + \beta \mathbf{g}_{t-1, t-2} + \mathbf{g}_{t, t-1}
   = \ldots = \sum_{\tau = 0}^{t-1} \beta^{\tau} \mathbf{g}_{t-\tau, t-\tau-1}.
   \end{aligned}

Large :math:`\beta` amounts to a long-range average, whereas small
:math:`\beta` amounts to only a slight correction relative to a gradient
method. The new gradient replacement no longer points in the direction
of steepest descent on a particular instance but rather in
the direction of a weighted average of past gradients. This allows us to
realize most of the benefits of averaging over a batch without the cost
of actually computing the gradients on it. We will revisit this
averaging procedure in more detail later.
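
The equivalence between the recursive update and the expanded sum can be
checked numerically. The following is a minimal sketch, assuming scalar
gradients and :math:`\mathbf{v}_0 = 0`; the class ``LeakyAverageDemo``
and the gradient values are made up for illustration and are not part of
the book's utilities.

.. code:: java

    class LeakyAverageDemo {
        // Recursive form: v_t = beta * v_{t-1} + g_t, starting from v_0 = 0.
        static double recursive(double[] g, double beta) {
            double v = 0.0;
            for (double gt : g) {
                v = beta * v + gt;
            }
            return v;
        }

        // Expanded form: v_t = sum_{tau=0}^{t-1} beta^tau * g_{t-tau}.
        // g[0] holds g_1, so g_{t-tau} lives at index t - 1 - tau.
        static double expanded(double[] g, double beta) {
            int t = g.length;
            double sum = 0.0;
            for (int tau = 0; tau < t; tau++) {
                sum += Math.pow(beta, tau) * g[t - 1 - tau];
            }
            return sum;
        }

        public static void main(String[] args) {
            double[] g = {1.0, -2.0, 0.5, 3.0}; // hypothetical gradients g_1..g_4
            double beta = 0.5;
            System.out.println("recursive: " + recursive(g, beta));
            System.out.println("expanded:  " + expanded(g, beta));
        }
    }

Both methods return the same number, and the most recent gradient always
enters with weight 1 while older ones are damped by powers of
:math:`\beta`.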

The above reasoning formed the basis for what is now known as
*accelerated* gradient methods, such as gradients with momentum. They
enjoy the additional benefit of being much more effective in cases where
the optimization problem is ill-conditioned (i.e., where there are some
directions where progress is much slower than in others, resembling a
narrow canyon). Furthermore, they allow us to average over subsequent
gradients to obtain more stable directions of descent. Indeed, the
aspect of acceleration even for noise-free convex problems is one of the
key reasons why momentum works and why it works so well.

As one would expect, due to its efficacy momentum is a well-studied
subject in optimization for deep learning and beyond. See, e.g., the
beautiful `expository article <https://distill.pub/2017/momentum/>`__ by
:cite:`Goh.2017` for an in-depth analysis and interactive animation.
Momentum was proposed by :cite:`Polyak.1964`. :cite:`Nesterov.2018` has
a detailed theoretical discussion in the context of convex optimization.
Momentum in deep learning has been known to be beneficial for a long
time. See, e.g., the discussion by
:cite:`Sutskever.Martens.Dahl.ea.2013` for details.

An Ill-conditioned Problem
~~~~~~~~~~~~~~~~~~~~~~~~~~

To get a better understanding of the geometric properties of the
momentum method we revisit gradient descent, albeit with a significantly
less pleasant objective function. Recall that in :numref:`sec_gd` we
used :math:`f(\mathbf{x}) = x_1^2 + 2 x_2^2`, i.e., a moderately
distorted ellipsoid objective. We distort this function further by
stretching it out in the :math:`x_1` direction via

.. math:: f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2.

As before :math:`f` has its minimum at :math:`(0, 0)`. This function is
*very* flat in the direction of :math:`x_1`. Let us see what happens
when we perform gradient descent as before on this new function. We pick
a learning rate of :math:`0.4`.

.. code:: java

    %load ../utils/djl-imports
    %load ../utils/plot-utils
    %load ../utils/Functions.java
    %load ../utils/GradDescUtils.java
    %load ../utils/Accumulator.java
    %load ../utils/StopWatch.java
    %load ../utils/Training.java
    %load ../utils/TrainingChapter11.java

.. code:: java

    import org.apache.commons.lang3.ArrayUtils;

.. code:: java

    float eta = 0.4f;
    BiFunction<Float, Float, Float> f2d = (x1, x2) -> 0.1f * x1 * x1 + 2 * x2 * x2;
    
    Function<Float[], Float[]> gd2d = (state) -> {
        Float x1 = state[0], x2 = state[1], s1 = state[2], s2 = state[3];
        return new Float[]{x1 - eta * 0.2f * x1, x2 - eta * 4 * x2, 0f, 0f};
    };
    
    GradDescUtils.showTrace2d(f2d, GradDescUtils.train2d(gd2d, 20));


.. parsed-literal::
    :class: output

    Tablesaw not supporting for contour and meshgrids, will update soon


.. figure:: https://d2l-java-resources.s3.amazonaws.com/img/gd_flatx1.svg

   Ellipsoid with Flat x1.

By construction, the gradient in the :math:`x_2` direction is *much*
higher and changes much more rapidly than in the horizontal :math:`x_1`
direction. Thus we are stuck between two undesirable choices: if we pick
a small learning rate we ensure that the solution does not diverge in
the :math:`x_2` direction but we are saddled with slow convergence in
the :math:`x_1` direction. Conversely, with a large learning rate we
progress rapidly in the :math:`x_1` direction but diverge in
:math:`x_2`. The example below illustrates what happens even after a
slight increase in learning rate from :math:`0.4` to :math:`0.6`.
Convergence in the :math:`x_1` direction improves but the overall
solution quality is much worse.

.. code:: java

    float eta = 0.6f;
    GradDescUtils.showTrace2d(f2d, GradDescUtils.train2d(gd2d, 20));


.. parsed-literal::
    :class: output

    Tablesaw not supporting for contour and meshgrids, will update soon


.. figure:: https://d2l-java-resources.s3.amazonaws.com/img/gd_flatx1_large_lr.svg

   Ellipsoid with Flat x1 with Large Learning Rate.
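
The divergence can also be read off without plotting. Since the gradient
of :math:`f` is :math:`(0.2 x_1, 4 x_2)`, each gradient descent step
multiplies :math:`x_1` by :math:`1 - 0.2 \eta` and :math:`x_2` by
:math:`1 - 4 \eta`. The sketch below simply evaluates these two factors;
the class name ``ContractionFactors`` is made up for this illustration.

.. code:: java

    class ContractionFactors {
        // Per-step multiplier for x1 under gradient descent on 0.1 * x1^2.
        static double factorX1(double eta) {
            return 1 - 0.2 * eta;
        }

        // Per-step multiplier for x2 under gradient descent on 2 * x2^2.
        static double factorX2(double eta) {
            return 1 - 4.0 * eta;
        }

        public static void main(String[] args) {
            for (double eta : new double[]{0.4, 0.6}) {
                System.out.printf("eta=%.1f  x1 factor=%.2f  x2 factor=%.2f%n",
                        eta, factorX1(eta), factorX2(eta));
            }
        }
    }

A coordinate shrinks only if its factor has absolute value below 1. At
:math:`\eta = 0.6` the :math:`x_2` factor is :math:`-1.4`, so that
coordinate flips sign and grows at every step, while the :math:`x_1`
factor only improves from :math:`0.92` to :math:`0.88`.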

The Momentum Method
~~~~~~~~~~~~~~~~~~~

The momentum method allows us to solve the gradient descent problem
described above. Looking at the optimization trace above we might intuit
that averaging gradients over the past would work well. After all, in
the :math:`x_1` direction this will aggregate well-aligned gradients,
thus increasing the distance we cover with every step. Conversely, in
the :math:`x_2` direction where gradients oscillate, an aggregate
gradient will reduce step size due to oscillations that cancel each
other out. Using :math:`\mathbf{v}_t` instead of the gradient
:math:`\mathbf{g}_t` yields the following update equations:

.. math::


   \begin{aligned}
   \mathbf{v}_t &\leftarrow \beta \mathbf{v}_{t-1} + \mathbf{g}_{t, t-1}, \\
   \mathbf{x}_t &\leftarrow \mathbf{x}_{t-1} - \eta_t \mathbf{v}_t.
   \end{aligned}

Note that for :math:`\beta = 0` we recover regular gradient descent.
Before delving deeper into the mathematical properties let us have a
quick look at how the algorithm behaves in practice.

.. code:: java

    float eta = 0.6f;
    float beta = 0.5f;
    
    Function<Float[], Float[]> momentum2d = (state) -> {
        Float x1 = state[0], x2 = state[1], v1 = state[2], v2 = state[3];
        v1 = beta * v1 + 0.2f * x1;
        v2 = beta * v2 + 4 * x2;
        return new Float[]{x1 - eta * v1, x2 - eta * v2, v1, v2};
    };
    
    GradDescUtils.showTrace2d(f2d, GradDescUtils.train2d(momentum2d, 20));


.. parsed-literal::
    :class: output

    Tablesaw not supporting for contour and meshgrids, will update soon


.. figure:: https://d2l-java-resources.s3.amazonaws.com/img/contour_gd_mom.svg

   Contour Momentum.

As we can see, even with the same learning rate that we used before,
momentum still converges well. Let us see what happens when we decrease
the momentum parameter. Halving it to :math:`\beta = 0.25` leads to a
trajectory that barely converges at all. Nonetheless, it is a lot better
than without momentum (when the solution diverges).

.. code:: java

    eta = 0.6f;
    beta = 0.25f;
    GradDescUtils.showTrace2d(f2d, GradDescUtils.train2d(momentum2d, 20));


.. parsed-literal::
    :class: output

    Tablesaw not supporting for contour and meshgrids, will update soon


.. figure:: https://d2l-java-resources.s3.amazonaws.com/img/contour_gd_mom_less.svg

   Contour Momentum Less.
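
For a numerical view of the same comparison, the following sketch runs
the momentum update in plain Java and reports the final iterate. It
assumes the starting point :math:`(-5, -2)`, which may differ from the
one used inside ``GradDescUtils.train2d``; the class ``MomentumSweep``
is hypothetical and only meant as an illustration.

.. code:: java

    class MomentumSweep {
        // Runs `steps` momentum updates on f(x) = 0.1 * x1^2 + 2 * x2^2
        // and returns the final {x1, x2}.
        static double[] run(double eta, double beta, int steps) {
            double x1 = -5.0, x2 = -2.0; // assumed starting point
            double v1 = 0.0, v2 = 0.0;   // v_0 = 0
            for (int t = 0; t < steps; t++) {
                v1 = beta * v1 + 0.2 * x1; // leaky average of d f / d x1
                v2 = beta * v2 + 4.0 * x2; // leaky average of d f / d x2
                x1 -= eta * v1;
                x2 -= eta * v2;
            }
            return new double[]{x1, x2};
        }

        public static void main(String[] args) {
            for (double beta : new double[]{0.0, 0.25, 0.5}) {
                double[] x = run(0.6, beta, 20);
                System.out.printf("beta=%.2f  x1=%.4f  x2=%.4f%n", beta, x[0], x[1]);
            }
        }
    }

With :math:`\eta = 0.6` the :math:`x_2` coordinate blows up for
:math:`\beta = 0`, shrinks slowly for :math:`\beta = 0.25`, and shrinks
much faster for :math:`\beta = 0.5`, in line with the traces above.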

Note that we can combine momentum with SGD and in particular,
minibatch-SGD. The only change is that in that case we replace the
gradients :math:`\mathbf{g}_{t, t-1}` with :math:`\mathbf{g}_t`. Last,
for convenience we initialize :math:`\mathbf{v}_0 = 0` at time
:math:`t=0`. Let us look at what leaky averaging actually does to the
updates.

Effective Sample Weight
~~~~~~~~~~~~~~~~~~~~~~~

Recall that
:math:`\mathbf{v}_t = \sum_{\tau = 0}^{t-1} \beta^{\tau} \mathbf{g}_{t-\tau, t-\tau-1}`.
In the limit the weights add up to
:math:`\sum_{\tau=0}^\infty \beta^\tau = \frac{1}{1-\beta}`. In other
words, rather than taking a step of size :math:`\eta` in GD or SGD we
take a step of size :math:`\frac{\eta}{1-\beta}` while at the same time
dealing with a potentially much better behaved descent direction. These
are two benefits in one. To illustrate how weighting behaves for
different choices of :math:`\beta` consider the diagram below. Note that
the plotting code labels the momentum parameter ``gamma`` rather than
:math:`\beta`.

.. code:: java

    /* Saved in GradDescUtils.java */
    public static Figure plotGammas(float[] time, float[] gammas,
                                  int width, int height) {
        double[] gamma1 = new double[time.length];
        double[] gamma2 = new double[time.length];
        double[] gamma3 = new double[time.length];
        double[] gamma4 = new double[time.length];
    
        // Calculate all gammas over time
        for (int i = 0; i < time.length; i++) {
            gamma1[i] = Math.pow(gammas[0], i);
            gamma2[i] = Math.pow(gammas[1], i);
            gamma3[i] = Math.pow(gammas[2], i);
            gamma4[i] = Math.pow(gammas[3], i);
        }
    
        // Gamma 1 Line
        ScatterTrace gamma1trace = ScatterTrace.builder(Functions.floatToDoubleArray(time),
                gamma1)
                .mode(ScatterTrace.Mode.LINE)
                .name(String.format("gamma = %.2f", gammas[0]))
                .build();
    
        // Gamma 2 Line
        ScatterTrace gamma2trace = ScatterTrace.builder(Functions.floatToDoubleArray(time),
                gamma2)
                .mode(ScatterTrace.Mode.LINE)
                .name(String.format("gamma = %.2f", gammas[1]))
                .build();
    
        // Gamma 3 Line
        ScatterTrace gamma3trace = ScatterTrace.builder(Functions.floatToDoubleArray(time),
                gamma3)
                .mode(ScatterTrace.Mode.LINE)
                .name(String.format("gamma = %.2f", gammas[2]))
                .build();
    
        // Gamma 4 Line
        ScatterTrace gamma4trace = ScatterTrace.builder(Functions.floatToDoubleArray(time),
                gamma4)
                .mode(ScatterTrace.Mode.LINE)
                .name(String.format("gamma = %.2f", gammas[3]))
                .build();
    
        Axis xAxis = Axis.builder()
                .title("time")
                .build();
    
        Layout layout = Layout.builder()
                .height(height)
                .width(width)
                .xAxis(xAxis)
                .build();
    
        return new Figure(layout, gamma1trace, gamma2trace, gamma3trace, gamma4trace);
    }

.. code:: java

    NDManager manager = NDManager.newBaseManager();
    
    float[] gammas = new float[]{0.95f, 0.9f, 0.6f, 0f};
    
    NDArray timesND = manager.arange(40f);
    float[] times = timesND.toFloatArray();
    
    plotGammas(times, gammas, 600, 400)


.. raw:: html

    <img id="844123c0aed84dbfb2aa0c19f59c2c7a_img"></img>
    <div id="844123c0aed84dbfb2aa0c19f59c2c7a"></div>
    <script>require(['https://cdn.plot.ly/plotly-1.57.0.min.js'], Plotly => {
    var target_844123c0aed84dbfb2aa0c19f59c2c7a = document.getElementById('844123c0aed84dbfb2aa0c19f59c2c7a');
    var layout = {
        height: 400,
        width: 600,
        xaxis: {
        title: 'time',
        },
    
    
    };
    
    var trace0 =
    {
    x: ["0.0","1.0","2.0","3.0","4.0","5.0","6.0","7.0","8.0","9.0","10.0","11.0","12.0","13.0","14.0","15.0","16.0","17.0","18.0","19.0","20.0","21.0","22.0","23.0","24.0","25.0","26.0","27.0","28.0","29.0","30.0","31.0","32.0","33.0","34.0","35.0","36.0","37.0","38.0","39.0"],
    y: ["1.0","0.949999988079071","0.9024999773502351","0.8573749677240853","0.814506209117175","0.7737808889516455","0.7350918352798762","0.6983372347529049","0.6634203646904311","0.6302493385473225","0.5987368641067988","0.5688000137639592","0.5403600062951367","0.5133419995387867","0.4876748934423338","0.46329114295667934","0.44012658028598456","0.41812024602496767","0.39721422873933754","0.37735351256720806","0.35848583244044324","0.3405615365449369","0.3235334556578802","0.3073567790181668","0.29198893640328016","0.2773894861023368","0.26352000849047963","0.25034400492455233","0.23782680169399162","0.22593545877417562","0.2146386831421063","0.2039067464263085","0.19371140667423523","0.18402583403130354","0.17482454013597948","0.16608331104510957","0.15777914351298675","0.14989018445646346","0.14239567344681003","0.13527588807698082"],
    mode: 'lines',
    xaxis: 'x',
    yaxis: 'y',
    type: 'scatter',
    name: 'gamma = 0.95',
    };
    var trace1 =
    {
    x: ["0.0","1.0","2.0","3.0","4.0","5.0","6.0","7.0","8.0","9.0","10.0","11.0","12.0","13.0","14.0","15.0","16.0","17.0","18.0","19.0","20.0","21.0","22.0","23.0","24.0","25.0","26.0","27.0","28.0","29.0","30.0","31.0","32.0","33.0","34.0","35.0","36.0","37.0","38.0","39.0"],
    y: ["1.0","0.8999999761581421","0.8099999570846563","0.7289999420642869","0.6560999304771451","0.5904899217867893","0.5314409155297335","0.47829681130622137","0.43046711877211463","0.3874203966317673","0.34867834773176853","0.313810504645452","0.2824294466990814","0.2541864952955305","0.22876783970569914","0.2058910502808789","0.18530194034396585","0.16677174189162672","0.15009456372631588","0.13508510377515104","0.12157659017695607","0.10941892826064868","0.09847703282583327","0.08862932719537452","0.07976639236274924","0.07178975122469533","0.06461077439062475","0.05814969541112137","0.052334724483612455","0.047101250787494144","0.042391124585763405","0.038152011116503896","0.034336809095238674","0.030903127367061484","0.027812813893567365","0.025031531841101472","0.0225283780601931","0.0202755397170554","0.018247985261943322","0.01642318630068312"],
    mode: 'lines',
    xaxis: 'x',
    yaxis: 'y',
    type: 'scatter',
    name: 'gamma = 0.90',
    };
    var trace2 =
    {
    x: ["0.0","1.0","2.0","3.0","4.0","5.0","6.0","7.0","8.0","9.0","10.0","11.0","12.0","13.0","14.0","15.0","16.0","17.0","18.0","19.0","20.0","21.0","22.0","23.0","24.0","25.0","26.0","27.0","28.0","29.0","30.0","31.0","32.0","33.0","34.0","35.0","36.0","37.0","38.0","39.0"],
    y: ["1.0","0.6000000238418579","0.36000002861023006","0.21600002574920757","0.12960002059936646","0.07776001544952515","0.04665601112365833","0.027993607786560987","0.01679616533935621","0.010077699604065514","0.006046620002710391","0.0036279721457888893","0.00217678337397093","0.0013060700762811178","7.836420769078079E-4","4.701852648281678E-4","2.82111170106991E-4","1.6926670879024902E-4","1.0156002930978222E-4","6.093602000724912E-5","3.65616134571774E-5","2.1936968946003234E-5","1.3162181890620038E-5","7.897309448182892E-6","4.738385857196265E-6","2.8430316272896817E-6","1.705819044156965E-6","1.0234914671640744E-6","6.140949047003828E-7","3.684569574613931E-7","2.2107418326153427E-7","1.3264451522773984E-7","7.958671229913559E-8","4.775202927697644E-8","2.865121870468296E-8","1.7190731905908062E-8","1.0314439553403824E-8","6.188663977957697E-9","3.713198534323865E-9","2.227919209123871E-9"],
    mode: 'lines',
    xaxis: 'x',
    yaxis: 'y',
    type: 'scatter',
    name: 'gamma = 0.60',
    };
    var trace3 =
    {
    x: ["0.0","1.0","2.0","3.0","4.0","5.0","6.0","7.0","8.0","9.0","10.0","11.0","12.0","13.0","14.0","15.0","16.0","17.0","18.0","19.0","20.0","21.0","22.0","23.0","24.0","25.0","26.0","27.0","28.0","29.0","30.0","31.0","32.0","33.0","34.0","35.0","36.0","37.0","38.0","39.0"],
    y: ["1.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0"],
    mode: 'lines',
    xaxis: 'x',
    yaxis: 'y',
    type: 'scatter',
    name: 'gamma = 0.00',
    };
    
    
    var data = [ trace0, trace1, trace2, trace3];
    Plotly.newPlot(target_844123c0aed84dbfb2aa0c19f59c2c7a, data, layout);
    })</script>
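
The limit :math:`\frac{1}{1-\beta}` can be compared against the finite
sums that the plot covers. The following is a small sketch using the
same four values and 40 time steps as above; ``EffectiveWeight`` is a
made-up helper class, not part of ``GradDescUtils``.

.. code:: java

    class EffectiveWeight {
        // Sum of beta^tau for tau = 0, ..., terms - 1.
        static double partialSum(double beta, int terms) {
            double sum = 0.0;
            double weight = 1.0; // beta^0
            for (int tau = 0; tau < terms; tau++) {
                sum += weight;
                weight *= beta;
            }
            return sum;
        }

        // Limit of the geometric series for 0 <= beta < 1.
        static double limit(double beta) {
            return 1.0 / (1.0 - beta);
        }

        public static void main(String[] args) {
            for (double beta : new double[]{0.95, 0.9, 0.6, 0.0}) {
                System.out.printf("beta=%.2f  sum over 40 steps=%.4f  limit=%.4f%n",
                        beta, partialSum(beta, 40), limit(beta));
            }
        }
    }

For small :math:`\beta` the sum reaches its limit after a handful of
steps, whereas for :math:`\beta = 0.95` the first 40 weights still fall
noticeably short of the limit of 20.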


Practical Experiments
---------------------

Let us see how momentum works in practice, i.e., when used within the
context of a proper optimizer. For this we need a somewhat more scalable
implementation.

Implementation from Scratch
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Compared with (minibatch) SGD the momentum method needs to maintain a
set of auxiliary variables, i.e., the velocity. These have the same
shape as the gradients (and as the variables of the optimization
problem). In the implementation below we call these variables
``states``.

.. code:: java

    NDList initMomentumStates(int featureDim) {
        NDManager manager = NDManager.newBaseManager();
        NDArray vW = manager.zeros(new Shape(featureDim, 1));
        NDArray vB = manager.zeros(new Shape(1));
        return new NDList(vW, vB);
    }
    
    public class Optimization {
        public static void sgdMomentum(NDList params, NDList states, Map<String, Float> hyperparams) {
            for (int i = 0; i < params.size(); i++) {
                NDArray param = params.get(i);
                NDArray velocity = states.get(i);
                // v <- momentum * v + gradient (updated in place)
                velocity.muli(hyperparams.get("momentum")).addi(param.getGradient());
                // param <- param - lr * v (updated in place)
                param.subi(velocity.mul(hyperparams.get("lr")));
            }
        }
    }

Let us see how this works in practice.

.. code:: java

    AirfoilRandomAccess airfoil = TrainingChapter11.getDataCh11(10, 1500);
    
    public TrainingChapter11.LossTime trainMomentum(float lr, float momentum, int numEpochs) 
        throws IOException, TranslateException {
        int featureDim = airfoil.getColumnNames().size();
        Map<String, Float> hyperparams = new HashMap<>();
        hyperparams.put("lr", lr);
        hyperparams.put("momentum", momentum);
        return TrainingChapter11.trainCh11(Optimization::sgdMomentum, initMomentumStates(featureDim), hyperparams, airfoil, featureDim, numEpochs);
    }
    
    trainMomentum(0.02f, 0.5f, 2);


.. parsed-literal::
    :class: output

    loss: 0.245, 0.077 sec/epoch



When we increase the momentum hyperparameter ``momentum`` to 0.9, it
amounts to a significantly larger effective sample size of
:math:`\frac{1}{1 - 0.9} = 10`. We reduce the learning rate slightly to
:math:`0.01` to keep matters under control.

.. code:: java

    trainMomentum(0.01f, 0.9f, 2);


.. parsed-literal::
    :class: output

    loss: 0.246, 0.069 sec/epoch



Reducing the learning rate further addresses any issue of non-smooth
optimization problems. Setting it to :math:`0.005` yields good
convergence properties.

.. code:: java

    trainMomentum(0.005f, 0.9f, 2);


.. parsed-literal::
    :class: output

    loss: 0.242, 0.070 sec/epoch
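
As a rough back-of-the-envelope comparison of the three runs, and
assuming the steady-state step size :math:`\frac{\eta}{1-\beta}` from
the discussion of effective sample weight is a fair proxy, the effective
step sizes for :math:`(\eta, \beta) = (0.02, 0.5)`, :math:`(0.01, 0.9)`
and :math:`(0.005, 0.9)` are

.. math::

   \frac{0.02}{1 - 0.5} = 0.04, \qquad \frac{0.01}{1 - 0.9} = 0.1, \qquad \frac{0.005}{1 - 0.9} = 0.05.

By this measure the second setting takes the largest steps even though
its nominal learning rate is smaller than in the first run, and the
third setting brings the effective step size back close to that of the
first.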



Concise Implementation
~~~~~~~~~~~~~~~~~~~~~~

There is very little to do in DJL since the standard ``Sgd`` optimizer
already has momentum built in. Setting matching parameters yields a very
similar trajectory.

.. code:: java

    Tracker lrt = Tracker.fixed(0.005f);
    Optimizer sgd = Optimizer.sgd().setLearningRateTracker(lrt).optMomentum(0.9f).build();
    
    TrainingChapter11.trainConciseCh11(sgd, airfoil, 2);


.. parsed-literal::
    :class: output

    INFO Training on: 1 GPUs.
    INFO Load MXNet Engine Version 1.9.0 in 0.061 ms.


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 1.00, L2Loss: 0.29
    loss: 0.245, 0.142 sec/epoch


Theoretical Analysis
--------------------

So far the 2D example of :math:`f(x) = 0.1 x_1^2 + 2 x_2^2` seemed
rather contrived. We will now see that this is actually quite
representative of the types of problem one might encounter, at least in
the case of minimizing convex quadratic objective functions.

Quadratic Convex Functions
~~~~~~~~~~~~~~~~~~~~~~~~~~

Consider the function

.. math:: h(\mathbf{x}) = \frac{1}{2} \mathbf{x}^\top \mathbf{Q} \mathbf{x} + \mathbf{x}^\top \mathbf{c} + b.

This is a general quadratic function. For positive definite matrices
:math:`\mathbf{Q} \succ 0`, i.e., for matrices with positive eigenvalues
this has a minimizer at
:math:`\mathbf{x}^* = -\mathbf{Q}^{-1} \mathbf{c}` with minimum value
:math:`b - \frac{1}{2} \mathbf{c}^\top \mathbf{Q}^{-1} \mathbf{c}`.
Hence we can rewrite :math:`h` as

.. math:: h(\mathbf{x}) = \frac{1}{2} (\mathbf{x} + \mathbf{Q}^{-1} \mathbf{c})^\top \mathbf{Q} (\mathbf{x} + \mathbf{Q}^{-1} \mathbf{c}) + b - \frac{1}{2} \mathbf{c}^\top \mathbf{Q}^{-1} \mathbf{c}.

The gradient is given by
:math:`\partial_{\mathbf{x}} h(\mathbf{x}) = \mathbf{Q} (\mathbf{x} + \mathbf{Q}^{-1} \mathbf{c})`.
That is, it is given by the displacement of :math:`\mathbf{x}` from the
minimizer :math:`\mathbf{x}^* = -\mathbf{Q}^{-1} \mathbf{c}`, multiplied
by :math:`\mathbf{Q}`. Consequently the momentum is also a linear
combination of terms
:math:`\mathbf{Q} (\mathbf{x}_t + \mathbf{Q}^{-1} \mathbf{c})`.

Since :math:`\mathbf{Q}` is positive definite it can be decomposed into
its eigensystem via
:math:`\mathbf{Q} = \mathbf{O}^\top \boldsymbol{\Lambda} \mathbf{O}` for
an orthogonal (rotation) matrix :math:`\mathbf{O}` and a diagonal matrix
:math:`\boldsymbol{\Lambda}` of positive eigenvalues. This allows us to
perform a change of variables from :math:`\mathbf{x}` to
:math:`\mathbf{z} := \mathbf{O} (\mathbf{x} + \mathbf{Q}^{-1} \mathbf{c})`
to obtain a much simplified expression:

.. math:: h(\mathbf{z}) = \frac{1}{2} \mathbf{z}^\top \boldsymbol{\Lambda} \mathbf{z} + b'.

Here
:math:`b' = b - \frac{1}{2} \mathbf{c}^\top \mathbf{Q}^{-1} \mathbf{c}`.
Since :math:`\mathbf{O}` is only an orthogonal matrix this does not
perturb the gradients in a meaningful way. Expressed in terms of
:math:`\mathbf{z}` gradient descent becomes

.. math:: \mathbf{z}_t = \mathbf{z}_{t-1} - \eta \boldsymbol{\Lambda} \mathbf{z}_{t-1} = (\mathbf{I} - \eta \boldsymbol{\Lambda}) \mathbf{z}_{t-1}.

The important fact in this expression is that gradient descent *does not
mix* between different eigenspaces. That is, when expressed in terms of
the eigensystem of :math:`\mathbf{Q}` the optimization problem proceeds
in a coordinate-wise manner. This also holds for momentum.

.. math::

   \begin{aligned}
   \mathbf{v}_t & = \beta \mathbf{v}_{t-1} + \boldsymbol{\Lambda} \mathbf{z}_{t-1} \\
   \mathbf{z}_t & = \mathbf{z}_{t-1} - \eta \left(\beta \mathbf{v}_{t-1} + \boldsymbol{\Lambda} \mathbf{z}_{t-1}\right) \\
       & = (\mathbf{I} - \eta \boldsymbol{\Lambda}) \mathbf{z}_{t-1} - \eta \beta \mathbf{v}_{t-1}.
   \end{aligned}

In doing this we just proved the following theorem: Gradient Descent
with and without momentum for a convex quadratic function decomposes
into coordinate-wise optimization in the direction of the eigenvectors
of the quadratic matrix.
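
The decomposition can be checked numerically on a small example. The
sketch below assumes a :math:`2 \times 2` problem with
:math:`\mathbf{c} = 0` and :math:`b = 0`, builds
:math:`\mathbf{Q} = \mathbf{O}^\top \boldsymbol{\Lambda} \mathbf{O}`
from a rotation by an arbitrary angle, and runs momentum once in the
original coordinates and once coordinate-wise in :math:`\mathbf{z}`. The
class ``EigenspaceDemo``, the angle and the starting point are
hypothetical choices made for illustration.

.. code:: java

    class EigenspaceDemo {
        // Product of a 2x2 matrix with a length-2 vector.
        static double[] matVec(double[][] a, double[] x) {
            return new double[]{
                a[0][0] * x[0] + a[0][1] * x[1],
                a[1][0] * x[0] + a[1][1] * x[1]};
        }

        // Runs momentum on h(x) = 0.5 * x^T Q x in the original coordinates
        // and coordinate-wise on z = O x. Returns the largest absolute gap
        // between O x_t and z_t observed over all steps.
        static double maxDeviation(double theta, double lam1, double lam2,
                                   double eta, double beta, int steps) {
            double cosT = Math.cos(theta), sinT = Math.sin(theta);
            double[][] o = {{cosT, sinT}, {-sinT, cosT}}; // rotation matrix O
            double[] lam = {lam1, lam2};                   // diagonal of Lambda
            double[][] q = new double[2][2];               // Q = O^T Lambda O
            for (int i = 0; i < 2; i++) {
                for (int j = 0; j < 2; j++) {
                    for (int k = 0; k < 2; k++) {
                        q[i][j] += o[k][i] * lam[k] * o[k][j];
                    }
                }
            }
            double[] x = {1.0, -2.0};  // arbitrary starting point
            double[] z = matVec(o, x); // the same point in the eigenbasis
            double[] vx = new double[2], vz = new double[2];
            double maxDev = 0.0;
            for (int t = 0; t < steps; t++) {
                double[] gx = matVec(q, x); // gradient Q x in original coordinates
                for (int i = 0; i < 2; i++) {
                    vx[i] = beta * vx[i] + gx[i];
                    x[i] -= eta * vx[i];
                    vz[i] = beta * vz[i] + lam[i] * z[i]; // z_i only sees lambda_i
                    z[i] -= eta * vz[i];
                }
                double[] ox = matVec(o, x);
                for (int i = 0; i < 2; i++) {
                    maxDev = Math.max(maxDev, Math.abs(ox[i] - z[i]));
                }
            }
            return maxDev;
        }

        public static void main(String[] args) {
            System.out.println(maxDeviation(0.7, 0.2, 4.0, 0.6, 0.5, 20)); // momentum
            System.out.println(maxDeviation(0.7, 0.2, 4.0, 0.4, 0.0, 20)); // plain GD
        }
    }

Both printed gaps should be at the level of floating-point rounding,
which is what the theorem predicts: rotating the iterates of the full
problem gives exactly the iterates of two independent scalar problems.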

Scalar Functions
~~~~~~~~~~~~~~~~

Given the above result let us see what happens when we minimize the
function :math:`f(x) = \frac{\lambda}{2} x^2`. For gradient descent we
have

.. math:: x_{t+1} = x_t - \eta \lambda x_t = (1 - \eta \lambda) x_t.

Whenever :math:`|1 - \eta \lambda| < 1` this optimization converges at
an exponential rate since after :math:`t` steps we have
:math:`x_t = (1 - \eta \lambda)^t x_0`. This shows how the rate of
convergence improves initially as we increase the learning rate
:math:`\eta` until :math:`\eta \lambda = 1`. Beyond that the rate of
convergence deteriorates again, and for :math:`\eta \lambda > 2` the
optimization problem diverges.

.. code:: java

    float[] lambdas = new float[]{0.1f, 1f, 10f, 19f};
    float eta = 0.1f;
    
    float[] time = new float[0];
    float[] convergence = new float[0];
    String[] lambda = new String[0]; 
    for (float lam : lambdas) {
        float[] timeTemp = new float[20];
        float[] convergenceTemp = new float[20];
        String[] lambdaTemp = new String[20];
        for (int i = 0; i < timeTemp.length; i++) {
            timeTemp[i] = i;
            convergenceTemp[i] = (float) Math.pow(1 - eta * lam, i);
            lambdaTemp[i] = String.format("lambda = %.2f", lam);
        }
        time = ArrayUtils.addAll(time, timeTemp);
        convergence = ArrayUtils.addAll(convergence, convergenceTemp);
        lambda = ArrayUtils.addAll(lambda, lambdaTemp);
    }
    
    Table data = Table.create("data")
        .addColumns(
            DoubleColumn.create("time", Functions.floatToDoubleArray(time)),
            DoubleColumn.create("convergence", Functions.floatToDoubleArray(convergence)),
            StringColumn.create("lambda", lambda)
    );
    
    LinePlot.create("convergence vs. time", data, "time", "convergence", "lambda");


.. raw:: html

    <img id="70439a116e584cc489187d3a2a969de1_img"></img>
    <div id="70439a116e584cc489187d3a2a969de1"></div>
    <script>require(['https://cdn.plot.ly/plotly-1.57.0.min.js'], Plotly => {
    var target_70439a116e584cc489187d3a2a969de1 = document.getElementById('70439a116e584cc489187d3a2a969de1');
    var layout = {
        title: 'convergence vs. time',
        height: 600,
        width: 800,
        showlegend: true,
        xaxis: {
        title: 'time',
        },
    
        yaxis: {
        title: 'convergence',
        },
    
    };
    
    var trace0 =
    {
    x: ["0.0","1.0","2.0","3.0","4.0","5.0","6.0","7.0","8.0","9.0","10.0","11.0","12.0","13.0","14.0","15.0","16.0","17.0","18.0","19.0"],
    y: ["1.0","0.9900000095367432","0.9801000356674194","0.9702990055084229","0.9605960249900818","0.9509900808334351","0.9414802193641663","0.9320654273033142","0.9227447509765625","0.9135173559188843","0.9043821692466736","0.8953383564949036","0.8863849639892578","0.8775211572647095","0.8687459230422974","0.8600584864616394","0.8514578938484192","0.8429433107376099","0.8345139026641846","0.8261687755584717"],
    showlegend: true,
    mode: 'lines',
    xaxis: 'x',
    yaxis: 'y',
    type: 'scatter',
    name: 'lambda = 0.10',
    };
    var trace1 =
    {
    x: ["0.0","1.0","2.0","3.0","4.0","5.0","6.0","7.0","8.0","9.0","10.0","11.0","12.0","13.0","14.0","15.0","16.0","17.0","18.0","19.0"],
    y: ["1.0","0.8999999761581421","0.809999942779541","0.7289999127388","0.6560999155044556","0.59048992395401","0.5314409136772156","0.47829681634902954","0.4304671287536621","0.3874203860759735","0.3486783504486084","0.3138104975223541","0.28242945671081543","0.2541864812374115","0.22876784205436707","0.20589104294776917","0.18530194461345673","0.16677173972129822","0.15009456872940063","0.1350851058959961"],
    showlegend: true,
    mode: 'lines',
    xaxis: 'x',
    yaxis: 'y',
    type: 'scatter',
    name: 'lambda = 1.00',
    };
    var trace2 =
    {
    x: ["0.0","1.0","2.0","3.0","4.0","5.0","6.0","7.0","8.0","9.0","10.0","11.0","12.0","13.0","14.0","15.0","16.0","17.0","18.0","19.0"],
    y: ["1.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0"],
    showlegend: true,
    mode: 'lines',
    xaxis: 'x',
    yaxis: 'y',
    type: 'scatter',
    name: 'lambda = 10.00',
    };
    var trace3 =
    {
    x: ["0.0","1.0","2.0","3.0","4.0","5.0","6.0","7.0","8.0","9.0","10.0","11.0","12.0","13.0","14.0","15.0","16.0","17.0","18.0","19.0"],
    y: ["1.0","-0.8999999761581421","0.809999942779541","-0.7289999127388","0.6560999155044556","-0.59048992395401","0.5314409136772156","-0.47829681634902954","0.4304671287536621","-0.3874203860759735","0.3486783504486084","-0.3138104975223541","0.28242945671081543","-0.2541864812374115","0.22876784205436707","-0.20589104294776917","0.18530194461345673","-0.16677173972129822","0.15009456872940063","-0.1350851058959961"],
    showlegend: true,
    mode: 'lines',
    xaxis: 'x',
    yaxis: 'y',
    type: 'scatter',
    name: 'lambda = 19.00',
    };
    
    
    var data = [ trace0, trace1, trace2, trace3];
    Plotly.newPlot(target_70439a116e584cc489187d3a2a969de1, data, layout);
    })</script>


To analyze convergence in the case of momentum we begin by rewriting the
update equations in terms of two scalars: one for :math:`x` and one for
the momentum :math:`v`. This yields:

.. math::


   \begin{bmatrix} v_{t+1} \\ x_{t+1} \end{bmatrix} =
   \begin{bmatrix} \beta & \lambda \\ -\eta \beta & (1 - \eta \lambda) \end{bmatrix}
   \begin{bmatrix} v_{t} \\ x_{t} \end{bmatrix} = \mathbf{R}(\beta, \eta, \lambda) \begin{bmatrix} v_{t} \\ x_{t} \end{bmatrix}.

We used :math:`\mathbf{R}` to denote the :math:`2 \times 2` matrix
governing convergence behavior. After :math:`t` steps the initial choice
:math:`[v_0, x_0]` becomes
:math:`\mathbf{R}(\beta, \eta, \lambda)^t [v_0, x_0]`. Hence, it is up
to the eigenvalues of :math:`\mathbf{R}` to determine the speed of
convergence. See the `Distill
post <https://distill.pub/2017/momentum/>`__ of :cite:`Goh.2017` for a
great animation and :cite:`Flammarion.Bach.2015` for a detailed
analysis. One can show that momentum converges for
:math:`0 < \eta \lambda < 2 + 2 \beta`. This is a larger range of
feasible parameters when compared to :math:`0 < \eta \lambda < 2` for
gradient descent. It also suggests that in general large values of
:math:`\beta` are desirable.
Further details require a fair amount of technical detail and we suggest
that the interested reader consult the original publications.
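
The convergence range can be probed numerically through the spectral
radius of :math:`\mathbf{R}`, i.e., the largest absolute value among its
eigenvalues. The sketch below uses the closed-form eigenvalues of a
:math:`2 \times 2` matrix in terms of its trace and determinant; the
class ``MomentumSpectrum`` is a made-up helper, and the parameter values
are chosen only to sit on either side of the stated bounds.

.. code:: java

    class MomentumSpectrum {
        // Spectral radius of R = [[beta, lambda], [-eta * beta, 1 - eta * lambda]].
        static double spectralRadius(double beta, double eta, double lambda) {
            double trace = beta + 1 - eta * lambda;
            // det(R) = beta * (1 - eta * lambda) + eta * beta * lambda, which equals beta.
            double det = beta * (1 - eta * lambda) + eta * beta * lambda;
            double disc = trace * trace - 4 * det;
            if (disc < 0) {
                // Complex conjugate pair: both eigenvalues have modulus sqrt(det).
                return Math.sqrt(det);
            }
            double root = Math.sqrt(disc);
            return Math.max(Math.abs((trace + root) / 2), Math.abs((trace - root) / 2));
        }

        public static void main(String[] args) {
            // eta * lambda = 2.4 as for the x2 coordinate with eta = 0.6.
            System.out.println("beta=0.0: " + spectralRadius(0.0, 0.6, 4.0));
            System.out.println("beta=0.5: " + spectralRadius(0.5, 0.6, 4.0));
            // Just inside and just outside eta * lambda < 2 + 2 * beta = 3.
            System.out.println("eta*lambda=2.9: " + spectralRadius(0.5, 1.0, 2.9));
            System.out.println("eta*lambda=3.1: " + spectralRadius(0.5, 1.0, 3.1));
        }
    }

A spectral radius below 1 means the iterates contract. For
:math:`\beta = 0` the matrix reduces to plain gradient descent, so
:math:`\eta \lambda = 2.4` gives a radius above 1, while the same
:math:`\eta \lambda` with :math:`\beta = 0.5` gives a radius below 1.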

Summary
-------

-  Momentum replaces gradients with a leaky average over past gradients.
   This accelerates convergence significantly.
-  It is desirable for both noise-free gradient descent and (noisy)
   stochastic gradient descent.
-  Momentum prevents stalling of the optimization process that is much
   more likely to occur for stochastic gradient descent.
-  The effective number of gradients is given by
   :math:`\frac{1}{1-\beta}` due to exponentiated downweighting of past
   data.
-  In the case of convex quadratic problems this can be analyzed
   explicitly in detail.
-  Implementation is quite straightforward but it requires us to store
   an additional state vector (momentum :math:`\mathbf{v}`).

Exercises
---------

1. Use other combinations of momentum hyperparameters and learning rates
   and observe and analyze the different experimental results.
2. Try out GD and momentum for a quadratic problem where you have
   multiple eigenvalues, i.e.,
   :math:`f(x) = \frac{1}{2} \sum_i \lambda_i x_i^2`, e.g.,
   :math:`\lambda_i = 2^{-i}`. Plot how the values of :math:`x` decrease
   for the initialization :math:`x_i = 1`.
3. Derive minimum value and minimizer for
   :math:`h(\mathbf{x}) = \frac{1}{2} \mathbf{x}^\top \mathbf{Q} \mathbf{x} + \mathbf{x}^\top \mathbf{c} + b`.
4. What changes when we perform SGD with momentum? What happens when we
   use minibatch SGD with momentum? Experiment with the parameters.