Run this notebook online:\ |Binder| or Colab: |Colab|

.. |Binder| image:: https://mybinder.org/badge_logo.svg
   :target: https://mybinder.org/v2/gh/deepjavalibrary/d2l-java/master?filepath=chapter_optimization/adadelta.ipynb
.. |Colab| image:: https://colab.research.google.com/assets/colab-badge.svg
   :target: https://colab.research.google.com/github/deepjavalibrary/d2l-java/blob/colab/chapter_optimization/adadelta.ipynb

.. _sec_adadelta:

Adadelta
========


Adadelta is yet another variant of AdaGrad. The main difference lies in
the fact that it decreases the amount by which the learning rate is
adaptive to coordinates. Moreover, it is traditionally referred to as not
having a learning rate, since it uses the amount of change itself as
calibration for future change. The algorithm was proposed in
:cite:`Zeiler.2012`. It is fairly straightforward, given the
discussion of previous algorithms so far.

The Algorithm
-------------

In a nutshell, Adadelta uses two state variables: :math:`\mathbf{s}_t` to
store a leaky average of the second moment of the gradient and
:math:`\Delta\mathbf{x}_t` to store a leaky average of the second moment
of the change of parameters in the model itself. Note that we use the
original notation and naming of the authors for compatibility with other
publications and implementations (there is no other real reason why one
should use different Greek variables to indicate a parameter serving the
same purpose in momentum, AdaGrad, RMSProp, and Adadelta). The parameter
du jour is :math:`\rho`. We obtain the following leaky updates:

.. math::

   \begin{aligned}
       \mathbf{s}_t & = \rho \mathbf{s}_{t-1} + (1 - \rho) \mathbf{g}_t^2, \\
       \mathbf{g}_t' & = \sqrt{\frac{\Delta\mathbf{x}_{t-1} + \epsilon}{\mathbf{s}_t + \epsilon}} \odot \mathbf{g}_t, \\
       \mathbf{x}_t  & = \mathbf{x}_{t-1} - \mathbf{g}_t', \\
       \Delta \mathbf{x}_t & = \rho \Delta\mathbf{x}_{t-1} + (1 - \rho) {\mathbf{g}_t'}^2.
   \end{aligned}

The difference to before is that we perform updates with the rescaled
gradient :math:`\mathbf{g}_t'`, which is computed by multiplying the
gradient by the square root of the ratio between the leaky average of the
squared parameter changes and the leaky average of the second moment of
the gradient. The use of :math:`\mathbf{g}_t'` is purely for
notational convenience. In practice we can implement this algorithm
without the need to use additional temporary space for
:math:`\mathbf{g}_t'`. As before, :math:`\epsilon` is a parameter ensuring
nontrivial numerical results, i.e., avoiding zero step size or infinite
variance. Typically we set this to :math:`\epsilon = 10^{-5}`.
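
To make the update rule concrete, the following is a minimal sketch of
the four equations for a single scalar parameter, written in plain Java
without DJL. The class ``AdadeltaScalarSketch``, its ``State`` holder,
and the methods ``step`` and ``minimizeQuadratic`` are hypothetical names
introduced only for this illustration, and the objective
:math:`f(x) = x^2 / 2` (whose gradient is :math:`x`) is an assumed toy
problem rather than anything used elsewhere in this chapter.

.. code:: java

    public class AdadeltaScalarSketch {
        /** Per-parameter state (hypothetical helper): both averages start at zero. */
        public static final class State {
            public double s = 0.0;      // leaky average of the squared gradient
            public double deltaX = 0.0; // leaky average of the squared rescaled gradient
        }

        /** Performs one Adadelta step on a scalar parameter and returns its new value. */
        public static double step(double x, double grad, State st, double rho, double eps) {
            // s_t = rho * s_{t-1} + (1 - rho) * g_t^2
            st.s = rho * st.s + (1 - rho) * grad * grad;
            // g_t' = sqrt((deltaX_{t-1} + eps) / (s_t + eps)) * g_t
            double rescaled = Math.sqrt((st.deltaX + eps) / (st.s + eps)) * grad;
            // deltaX_t = rho * deltaX_{t-1} + (1 - rho) * g_t'^2
            st.deltaX = rho * st.deltaX + (1 - rho) * rescaled * rescaled;
            // x_t = x_{t-1} - g_t'
            return x - rescaled;
        }

        /** Runs Adadelta on the assumed toy objective f(x) = x^2 / 2, whose gradient is x. */
        public static double minimizeQuadratic(double x0, int steps, double rho, double eps) {
            State st = new State();
            double x = x0;
            for (int t = 0; t < steps; t++) {
                x = step(x, x, st, rho, eps);
            }
            return x;
        }

        public static void main(String[] args) {
            double x = minimizeQuadratic(1.0, 500, 0.9, 1e-5);
            System.out.println("x after 500 steps: " + x);
        }
    }

Since both averages start at zero, the very first step has magnitude
:math:`\sqrt{\epsilon / ((1 - \rho) g_1^2 + \epsilon)} \, |g_1|`, which is
roughly :math:`\sqrt{\epsilon / (1 - \rho)} = 0.01` for
:math:`\rho = 0.9`, :math:`\epsilon = 10^{-5}`, and a gradient that is
large compared to :math:`\sqrt{\epsilon}`.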

Implementation
--------------

Adadelta needs to maintain two state variables for each parameter,
:math:`\mathbf{s}_t` and :math:`\Delta\mathbf{x}_t`. This yields the
following implementation.

.. code:: java

    %load ../utils/djl-imports
    %load ../utils/plot-utils
    %load ../utils/Functions.java
    %load ../utils/GradDescUtils.java
    %load ../utils/Accumulator.java
    %load ../utils/StopWatch.java
    %load ../utils/Training.java
    %load ../utils/TrainingChapter11.java

.. code:: java

    NDList initAdadeltaStates(int featureDimension) {
        NDManager manager = NDManager.newBaseManager();
        NDArray sW = manager.zeros(new Shape(featureDimension, 1));
        NDArray sB = manager.zeros(new Shape(1));
        NDArray deltaW = manager.zeros(new Shape(featureDimension, 1));
        NDArray deltaB = manager.zeros(new Shape(1));
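        // Interleave the states as (s, delta) pairs, one pair per parameter:
        // adadelta() reads states.get(2 * i) as s and states.get(2 * i + 1) as delta
        return new NDList(sW, deltaW, sB, deltaB);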
    }
    
    public class Optimization {
        public static void adadelta(NDList params, NDList states, Map<String, Float> hyperparams) {
            float rho = hyperparams.get("rho");
            float eps = (float) 1e-5;
            for (int i = 0; i < params.size(); i++) {
                NDArray param = params.get(i);
                NDArray state = states.get(2 * i);
                NDArray delta = states.get(2 * i + 1);
                // Update parameter, state, and delta
                // In-place updates use the methods ending in 'i' (e.g., muli, addi, subi)
                // state = rho * state + (1 - rho) * param.gradient^2
                state.muli(rho).addi(param.getGradient().square().mul(1 - rho));
                // rescaledGradient = ((delta + eps)^(1/2) / (state + eps)^(1/2)) * param.gradient
                NDArray rescaledGradient = delta.add(eps).sqrt()
                    .div(state.add(eps).sqrt()).mul(param.getGradient());
                // param -= rescaledGradient
                param.subi(rescaledGradient);
                // delta = rho * delta + (1 - rho) * rescaledGradient^2
                delta.muli(rho).addi(rescaledGradient.square().mul(1 - rho));
            }
        }
    }

Choosing :math:`\rho = 0.9` amounts to averaging over roughly
:math:`1 / (1 - \rho) = 10` past updates for each parameter. This tends
to work quite well. We get the following behavior.

.. code:: java

    AirfoilRandomAccess airfoil = TrainingChapter11.getDataCh11(10, 1500);
    
    public TrainingChapter11.LossTime trainAdadelta(float rho, int numEpochs) throws IOException, TranslateException {
        int featureDimension = airfoil.getColumnNames().size();
        Map<String, Float> hyperparams = new HashMap<>();
        hyperparams.put("rho", rho);
        return TrainingChapter11.trainCh11(Optimization::adadelta, 
                                           initAdadeltaStates(featureDimension), 
                                           hyperparams, airfoil, 
                                           featureDimension, numEpochs);
    }
    
    TrainingChapter11.LossTime lossTime = trainAdadelta(0.9f, 2);


.. parsed-literal::
    :class: output

    loss: 0.246, 0.101 sec/epoch
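
The time scale that :math:`\rho` imposes on the leaky averages can be
checked numerically. The sketch below is plain Java without DJL; the
class ``LeakyAverageWindow`` and its methods are hypothetical helpers
introduced only for this illustration. It relies on the fact that a
leaky average with decay :math:`\rho` assigns weight
:math:`(1 - \rho) \rho^k` to the value observed :math:`k` steps ago.

.. code:: java

    public class LeakyAverageWindow {
        /** Weight that a leaky average with decay rho assigns to the value from `age` steps ago. */
        public static double weight(double rho, int age) {
            return (1 - rho) * Math.pow(rho, age);
        }

        /** Effective window length 1 / (1 - rho). */
        public static double effectiveWindow(double rho) {
            return 1.0 / (1.0 - rho);
        }

        /** Number of steps after which the weight of a past value has halved. */
        public static double halfLife(double rho) {
            return Math.log(0.5) / Math.log(rho);
        }

        /** Total weight carried by the n most recent values (ages 0 to n - 1). */
        public static double massOfLast(double rho, int n) {
            double total = 0.0;
            for (int age = 0; age < n; age++) {
                total += weight(rho, age);
            }
            return total;
        }

        public static void main(String[] args) {
            double rho = 0.9;
            System.out.println("effective window: " + effectiveWindow(rho));
            System.out.println("half-life:        " + halfLife(rho));
            System.out.println("mass of last 10:  " + massOfLast(rho, 10));
        }
    }

For :math:`\rho = 0.9` the effective window is 10 steps, the weight of an
individual past value halves after about 6.6 steps, and the 10 most
recent values carry about 65% of the total weight.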


As usual, for a concise implementation, we simply build an Adadelta
optimizer with the ``Optimizer.adadelta()`` builder.

.. code:: java

    Optimizer adadelta = Optimizer.adadelta().optRho(0.9f).build();
    
    TrainingChapter11.trainConciseCh11(adadelta, airfoil, 2);


.. parsed-literal::
    :class: output

    INFO Training on: 1 GPUs.
    INFO Load MXNet Engine Version 1.9.0 in 0.087 ms.


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 1.00, L2Loss: 0.48
    loss: 0.472, 0.175 sec/epoch


Summary
-------

-  Adadelta has no learning rate parameter. Instead, it uses the rate of
   change in the parameters itself to adapt the learning rate.
-  Adadelta requires two state variables to store the second moments of
   gradient and the change in parameters.
-  Adadelta uses leaky averages to keep a running estimate of the
   appropriate statistics.

Exercises
---------

1. Adjust the value of :math:`\rho`. What happens?
2. Show how to implement the algorithm without the use of
   :math:`\mathbf{g}_t'`. Why might this be a good idea?
3. Is Adadelta really learning rate free? Could you find optimization
   problems that break Adadelta?
4. Compare Adadelta to AdaGrad and RMSProp to discuss their convergence
   behavior.