Run this notebook online:\ |Binder| or Colab: |Colab|

.. |Binder| image:: https://mybinder.org/badge_logo.svg
   :target: https://mybinder.org/v2/gh/deepjavalibrary/d2l-java/master?filepath=chapter_multilayer-perceptrons/dropout.ipynb
.. |Colab| image:: https://colab.research.google.com/assets/colab-badge.svg
   :target: https://colab.research.google.com/github/deepjavalibrary/d2l-java/blob/colab/chapter_multilayer-perceptrons/dropout.ipynb

.. _sec_dropout:

Dropout
=======


In :numref:`sec_weight_decay`, we introduced the classical approach to
regularizing statistical models by penalizing the :math:`\ell_2` norm of
the weights. In probabilistic terms, we could justify this technique by
arguing that we have assumed a prior belief that weights take values
from a Gaussian distribution with mean :math:`0`. More intuitively, we
might argue that we encouraged the model to spread out its weights among
many features rather than depending too much on a small number of
potentially spurious associations.

Overfitting Revisited
---------------------

Faced with more features than examples, linear models tend to overfit.
But given more examples than features, we can generally count on linear
models not to overfit. Unfortunately, the reliability with which linear
models generalize comes at a cost: Naively applied, linear models do not
take into account interactions among features. For every feature, a
linear model must assign either a positive or a negative weight,
ignoring context.

In traditional texts, this fundamental tension between generalizability
and flexibility is described as the *bias-variance tradeoff*. Linear
models have high bias (they can only represent a small class of
functions), but low variance (they give similar results across different
random samples of the data).

Deep neural networks inhabit the opposite end of the bias-variance
spectrum. Unlike linear models, neural networks are not confined to
looking at each feature individually. They can learn interactions among
groups of features. For example, they might infer that “Nigeria” and
“Western Union” appearing together in an email indicates spam but that
separately they do not.

Even when we have far more examples than features, deep neural networks
are capable of overfitting. In 2017, a group of researchers demonstrated
the extreme flexibility of neural networks by training deep nets on
randomly-labeled images. Despite the absence of any true pattern linking
the inputs to the outputs, they found that the neural network optimized
by SGD could label every image in the training set perfectly.

Consider what this means. If the labels are assigned uniformly at random
and there are 10 classes, then no classifier can do better than 10%
accuracy on holdout data. The generalization gap here is a whopping 90%.
If our models are so expressive that they can overfit this badly, then
when should we expect them not to overfit? The mathematical foundations
for the puzzling generalization properties of deep networks remain open
research questions, and we encourage the theoretically-oriented reader
to dig deeper into the topic. For now, we turn to the more terrestrial
investigation of practical tools that tend (empirically) to improve the
generalization of deep nets.

Robustness through Perturbations
--------------------------------

Let us think briefly about what we expect from a good predictive model.
We want it to peform well on unseen data. Classical generalization
theory suggests that to close the gap between train and test
performance, we should aim for a *simple* model. Simplicity can come in
the form of a small number of dimensions. We explored this when
discussing the monomial basis functions of linear models
:numref:`sec_model_selection`. Additionally, as we saw when discussing
weight decay (:math:`\ell_2` regularization)
:numref:`sec_weight_decay`, the (inverse) norm of the parameters
represents another useful measure of simplicity. Another useful notion
of simplicity is smoothness, i.e., that the function should not be
sensitive to small changes to its inputs. For instance, when we classify
images, we would expect that adding some random noise to the pixels
should be mostly harmless.

In 1995, Christopher Bishop formalized this idea when he proved that
training with input noise is equivalent to Tikhonov regularization
:cite:`Bishop.1995`. This work drew a clear mathematical connection
between the requirement that a function be smooth (and thus simple), and
the requirement that it be resilient to perturbations in the input.

Then, in 2014, Srivastava et al.
:cite:`Srivastava.Hinton.Krizhevsky.ea.2014` developed a clever idea
for how to apply Bishop's idea to the *internal* layers of the network,
too. Namely, they proposed to inject noise into each layer of the
network before calculating the subsequent layer during training. They
realized that when training a deep network with many layers, injecting
noise enforces smoothness just on the input-output mapping.

Their idea, called *dropout*, involves injecting noise while computing
each internal layer during forward propagation, and it has become a
standard technique for training neural networks. The method is called
*dropout* because we literally *drop out* some neurons during training.
Throughout training, on each iteration, standard dropout consists of
zeroing out some fraction (typically 50%) of the nodes in each layer
before calculating the subsequent layer.

To be clear, we are imposing our own narrative with the link to Bishop.
The original paper on dropout offers intuition through a surprising
analogy to sexual reproduction. The authors argue that neural network
overfitting is characterized by a state in which each layer relies on a
specifc pattern of activations in the previous layer, calling this
condition *co-adaptation*. Dropout, they claim, breaks up co-adaptation
just as sexual reproduction is argued to break up co-adapted genes.

The key challenge then is *how* to inject this noise. One idea is to
inject the noise in an *unbiased* manner so that the expected value of
each layer---while fixing the others---equals to the value it would have
taken absent noise.

In Bishop's work, he added Gaussian noise to the inputs to a linear
model: At each training iteration, he added noise sampled from a
distribution with mean zero
:math:`\epsilon \sim \mathcal{N}(0,\sigma^2)` to the input
:math:`\mathbf{x}`, yielding a perturbed point
:math:`\mathbf{x}' = \mathbf{x} + \epsilon`. In expectation,
:math:`E[\mathbf{x}'] = \mathbf{x}`.

In standard dropout regularization, one debiases each layer by
normalizing by the fraction of nodes that were retained (not dropped
out). In other words, dropout with *dropout probability* :math:`p` is
applied as follows:

.. math::


   \begin{aligned}
   h' =
   \begin{cases}
       0 & \text{ with probability } p \\
       \frac{h}{1-p} & \text{ otherwise}
   \end{cases}
   \end{aligned}

By design, the expectation remains unchanged, i.e., :math:`E[h'] = h`.
Intermediate activations :math:`h` are replaced by a random variable
:math:`h'` with matching expectation.

Dropout in Practice
-------------------

Recall the multilayer perceptron (:numref:`sec_mlp`) with a hidden
layer and 5 hidden units. Its architecture is given by

.. math::


   \begin{aligned}
       \mathbf{h} & = \sigma(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1), \\
       \mathbf{o} & = \mathbf{W}_2 \mathbf{h} + \mathbf{b}_2, \\
       \hat{\mathbf{y}} & = \mathrm{softmax}(\mathbf{o}).
   \end{aligned}

When we apply dropout to a hidden layer, zeroing out each hidden unit
with probability :math:`p`, the result can be viewed as a network
containing only a subset of the original neurons. In
:numref:`fig_dropout2`, :math:`h_2` and :math:`h_5` are removed.
Consequently, the calculation of :math:`y` no longer depends on
:math:`h_2` and :math:`h_5` and their respective gradient also vanishes
when performing backprop. In this way, the calculation of the output
layer cannot be overly dependent on any one element of
:math:`h_1, \ldots, h_5`.

.. _fig_dropout2:

.. figure:: https://raw.githubusercontent.com/d2l-ai/d2l-en/master/img/dropout2.svg

   MLP before and after dropout


Typically, ***we disable dropout at test time***. Given a trained model
and a new example, we do not drop out any nodes (and thus do not need to
normalize). However, there are some exceptions: some researchers use
dropout at test time as a heuristic for estimating the *uncertainty* of
neural network predictions: if the predictions agree across many
different dropout masks, then we might say that the network is more
confident. For now we will put off uncertainty estimation for subsequent
chapters and volumes.

Implementation from Scratch
---------------------------

To implement the dropout function for a single layer, we must draw as
many samples from a Bernoulli (binary) random variable as our layer has
dimensions, where the random variable takes value :math:`1` (keep) with
probability :math:`1-p` and :math:`0` (drop) with probability :math:`p`.
One easy way to implement this is to first draw samples from the uniform
distribution :math:`U[0, 1]`. Then we can keep those nodes for which the
corresponding sample is greater than :math:`p`, dropping the rest.

In the following code, we implement a ``dropoutLayer()`` function that
drops out the elements in the ``NDArray`` input ``X`` with probability
``dropout``, rescaling the remainder as described above (dividing the
survivors by ``1.0-dropout``).

.. code:: java

    %load ../utils/djl-imports
    %load ../utils/plot-utils.ipynb
    %load ../utils/DataPoints.java
    %load ../utils/Training.java
    %load ../utils/Accumulator.java

.. code:: java

    import ai.djl.basicdataset.cv.classification.*;
    import org.apache.commons.lang3.ArrayUtils;

We can test out the ``dropoutLayer()`` function on a few examples. In
the following lines of code, we pass our input ``X`` through the dropout
operation, with probabilities 0, 0.5, and 1, respectively.

.. code:: java

    NDManager manager = NDManager.newBaseManager();
    
    public NDArray dropoutLayer(NDArray X, float dropout) {
        // In this case, all elements are dropped out
        if (dropout == 1.0f) {
            return manager.zeros(X.getShape());
        }
        // In this case, all elements are kept
        if (dropout == 0f) {
            return X;
        }
    
        NDArray mask = manager.randomUniform(0f, 1.0f, X.getShape()).gt(dropout);
        return mask.toType(DataType.FLOAT32, false).mul(X).div(1.0f - dropout);
    }

.. code:: java

    NDArray X = manager.arange(16f).reshape(2, 8);
    System.out.println(dropoutLayer(X, 0));
    System.out.println(dropoutLayer(X, 0.5f));
    System.out.println(dropoutLayer(X, 1.0f));


.. parsed-literal::
    :class: output

    ND: (2, 8) gpu(0) float32
    [[ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.],
     [ 8.,  9., 10., 11., 12., 13., 14., 15.],
    ]
    
    ND: (2, 8) gpu(0) float32
    [[ 0.,  2.,  0.,  6.,  8.,  0., 12.,  0.],
     [ 0., 18., 20.,  0., 24.,  0., 28.,  0.],
    ]
    
    ND: (2, 8) gpu(0) float32
    [[0., 0., 0., 0., 0., 0., 0., 0.],
     [0., 0., 0., 0., 0., 0., 0., 0.],
    ]
    

Defining Model Parameters
~~~~~~~~~~~~~~~~~~~~~~~~~

Again, we work with the Fashion-MNIST dataset introduced in
:numref:`sec_softmax_scratch`. We define a multilayer perceptron with
two hidden layers containing 256 outputs each.

.. code:: java

    int numInputs = 784;
    int numOutputs = 10;
    int numHiddens1 = 256;
    int numHiddens2 = 256;
    
    NDArray W1 = manager.randomNormal(0, 0.01f, new Shape(numInputs, numHiddens1), DataType.FLOAT32);
    NDArray b1 = manager.zeros(new Shape(numHiddens1));
    NDArray W2 = manager.randomNormal(0, 0.01f, new Shape(numHiddens1, numHiddens2), DataType.FLOAT32);
    NDArray b2 = manager.zeros(new Shape(numHiddens2));
    NDArray W3 = manager.randomNormal(0, 0.01f, new Shape(numHiddens2, numOutputs), DataType.FLOAT32);
    NDArray b3 = manager.zeros(new Shape(numOutputs));
    
    
    NDList params = new NDList(W1, b1, W2, b2, W3, b3);
    
    for (NDArray param : params) {
        param.setRequiresGradient(true);
    }

Defining the Model
~~~~~~~~~~~~~~~~~~

The model below applies dropout to the output of each hidden layer
(following the activation function). We can set dropout probabilities
for each layer separately. A common trend is to set a lower dropout
probability closer to the input layer. Below we set it to 0.2 and 0.5
for the first and second hidden layer respectively. By using the
``isTraining`` boolean variable described in :numref:`sec_autograd`,
we can ensure that dropout is only active during training.

.. code:: java

    float dropout1 = 0.2f;
    float dropout2 = 0.5f;
    
    public NDArray net(NDArray X, boolean isTraining) {
    
        X = X.reshape(-1, numInputs);
        NDArray H1 = Activation.relu(X.dot(W1).add(b1));
    
        if (isTraining) {
            H1 = dropoutLayer(H1, dropout1);
        }
    
        NDArray H2 = Activation.relu(H1.dot(W2).add(b2));
        if (isTraining) {
            H2 = dropoutLayer(H2, dropout2);
        }
    
        return H2.dot(W3).add(b3);
    }

Training and Testing
~~~~~~~~~~~~~~~~~~~~

This is similar to the training and testing of multilayer perceptrons
described previously.

.. code:: java

    int numEpochs = Integer.getInteger("MAX_EPOCH", 10);
    float lr = 0.5f;
    int batchSize = 256;
    
    double[] trainLoss;
    double[] testAccuracy;
    double[] epochCount;
    double[] trainAccuracy;
    
    trainLoss = new double[numEpochs];
    trainAccuracy = new double[numEpochs];
    testAccuracy = new double[numEpochs];
    epochCount = new double[numEpochs];
    
    Loss loss = new SoftmaxCrossEntropyLoss();
    
    FashionMnist trainIter = FashionMnist.builder()
            .optUsage(Dataset.Usage.TRAIN)
            .setSampling(batchSize, true)
            .optLimit(Long.getLong("DATASET_LIMIT", Long.MAX_VALUE))
            .build();
    
    
    FashionMnist testIter = FashionMnist.builder()
            .optUsage(Dataset.Usage.TEST)
            .setSampling(batchSize, true)
            .optLimit(Long.getLong("DATASET_LIMIT", Long.MAX_VALUE))
            .build();
                                
    trainIter.prepare();
    testIter.prepare();

.. code:: java

    float epochLoss = 0f;
    float accuracyVal = 0f;
    
    for (int epoch = 1; epoch <= numEpochs; epoch++) {
        // Iterate over dataset
        System.out.print("Running epoch " + epoch + "...... ");
        for (Batch batch : trainIter.getData(manager)) {
            NDArray X = batch.getData().head();
            NDArray y = batch.getLabels().head();
    
            try (GradientCollector gc = Engine.getInstance().newGradientCollector()) {
                NDArray yHat = net(X, true); // net function call
    
                NDArray lossValue = loss.evaluate(new NDList(y), new NDList(yHat));
                NDArray l = lossValue.mul(batchSize);
    
                epochLoss += l.sum().getFloat();
    
                accuracyVal += Training.accuracy(yHat, y);
                gc.backward(l); // gradient calculation
            }
    
            batch.close();
            Training.sgd(params, lr, batchSize); // updater
        }
    
        trainLoss[epoch-1] = epochLoss/trainIter.size();
        trainAccuracy[epoch-1] = accuracyVal/trainIter.size();
    
        epochLoss = 0f;
        accuracyVal = 0f;
    
        for (Batch batch : testIter.getData(manager)) {
            NDArray X = batch.getData().head();
            NDArray y = batch.getLabels().head();
    
            NDArray yHat = net(X, false); // net function call
            accuracyVal += Training.accuracy(yHat, y);
        }
    
        testAccuracy[epoch-1] = accuracyVal/testIter.size();
        epochCount[epoch-1] = epoch;
        accuracyVal = 0f;
        System.out.println("Finished epoch " + epoch);
    }
    
    System.out.println("Finished training!");


.. parsed-literal::
    :class: output

    Running epoch 1...... Finished epoch 1
    Running epoch 2...... Finished epoch 2
    Running epoch 3...... Finished epoch 3
    Running epoch 4...... Finished epoch 4
    Running epoch 5...... Finished epoch 5
    Running epoch 6...... Finished epoch 6
    Running epoch 7...... Finished epoch 7
    Running epoch 8...... Finished epoch 8
    Running epoch 9...... Finished epoch 9
    Running epoch 10...... Finished epoch 10
    Finished training!


.. code:: java

    String[] lossLabel = new String[trainLoss.length + testAccuracy.length + trainAccuracy.length];
    
    Arrays.fill(lossLabel, 0, trainLoss.length, "train loss");
    Arrays.fill(lossLabel, trainAccuracy.length, trainLoss.length + trainAccuracy.length, "train acc");
    Arrays.fill(lossLabel, trainLoss.length + trainAccuracy.length,
                    trainLoss.length + testAccuracy.length + trainAccuracy.length, "test acc");
    
    Table data = Table.create("Data").addColumns(
        DoubleColumn.create("epochCount", ArrayUtils.addAll(epochCount, ArrayUtils.addAll(epochCount, epochCount))),
        DoubleColumn.create("loss", ArrayUtils.addAll(trainLoss, ArrayUtils.addAll(trainAccuracy, testAccuracy))),
        StringColumn.create("lossLabel", lossLabel)
    );
    
    render(LinePlot.create("", data, "epochCount", "loss", "lossLabel"),"text/html");


.. raw:: html

    <img id="eee66a5bdf024408a597650996ec6c3e_img"></img>
    <div id="eee66a5bdf024408a597650996ec6c3e"></div>
    <script>require(['https://cdn.plot.ly/plotly-1.57.0.min.js'], Plotly => {
    var target_eee66a5bdf024408a597650996ec6c3e = document.getElementById('eee66a5bdf024408a597650996ec6c3e');
    var layout = {
        height: 600,
        width: 800,
        showlegend: true,
        xaxis: {
        title: 'epochCount',
        },
    
        yaxis: {
        title: 'loss',
        },
    
    };
    
    var trace0 =
    {
    x: ["1.0","2.0","3.0","4.0","5.0","6.0","7.0","8.0","9.0","10.0"],
    y: ["1.1958273649215698","0.5867002010345459","0.4948180913925171","0.4485013484954834","0.41622194647789","0.3974320888519287","0.3785196840763092","0.36579790711402893","0.3555448353290558","0.3447328209877014"],
    showlegend: true,
    mode: 'lines',
    xaxis: 'x',
    yaxis: 'y',
    type: 'scatter',
    name: 'train loss',
    };
    var trace1 =
    {
    x: ["1.0","2.0","3.0","4.0","5.0","6.0","7.0","8.0","9.0","10.0"],
    y: ["0.5373833179473877","0.7836999893188477","0.8201833367347717","0.838450014591217","0.8472166657447815","0.8538833260536194","0.8633000254631042","0.8658333420753479","0.8704500198364258","0.8739833235740662"],
    showlegend: true,
    mode: 'lines',
    xaxis: 'x',
    yaxis: 'y',
    type: 'scatter',
    name: 'train acc',
    };
    var trace2 =
    {
    x: ["1.0","2.0","3.0","4.0","5.0","6.0","7.0","8.0","9.0","10.0"],
    y: ["0.7235000133514404","0.8127999901771545","0.802299976348877","0.8181999921798706","0.8095999956130981","0.8483999967575073","0.8547999858856201","0.8360000252723694","0.8489999771118164","0.8657000064849854"],
    showlegend: true,
    mode: 'lines',
    xaxis: 'x',
    yaxis: 'y',
    type: 'scatter',
    name: 'test acc',
    };
    
    
    var data = [ trace0, trace1, trace2];
    Plotly.newPlot(target_eee66a5bdf024408a597650996ec6c3e, data, layout);
    })</script>


Concise Implementation
----------------------

Using DJL, all we need to do is add a ``Dropout`` layer after each
fully-connected layer, passing in the dropout probability as the only
argument to its constructor. During training, the ``Dropout`` layer will
randomly drop out outputs of the previous layer (or equivalently, the
inputs to the subsequent layer) according to the specified dropout
probability. When the model is not in training mode, the ``Dropout``
layer simply passes the data through during testing.

.. code:: java

    SequentialBlock net = new SequentialBlock();
    net.add(Blocks.batchFlattenBlock(784));
    net.add(Linear.builder().setUnits(256).build());
    net.add(Activation::relu);
    net.add(Dropout.builder().optRate(dropout1).build());
    net.add(Linear.builder().setUnits(256).build());
    net.add(Activation::relu);
    net.add(Dropout.builder().optRate(dropout2).build());
    net.add(Linear.builder().setUnits(10).build());
    net.setInitializer(new NormalInitializer(0.01f), Parameter.Type.WEIGHT);

Next, we train and test the model.

.. code:: java

    Map<String, double[]> evaluatorMetrics = new HashMap<>();
    
    Tracker lrt = Tracker.fixed(0.5f);
    Optimizer sgd = Optimizer.sgd().setLearningRateTracker(lrt).build();
    
    Loss loss = Loss.softmaxCrossEntropyLoss();
    
    DefaultTrainingConfig config = new DefaultTrainingConfig(loss)
                    .optOptimizer(sgd) // Optimizer (loss function)
                    .optDevices(Engine.getInstance().getDevices(1)) // single GPU
                    .addEvaluator(new Accuracy()) // Model Accuracy
                    .addTrainingListeners(TrainingListener.Defaults.logging()); // Logging
    
        try (Model model = Model.newInstance("mlp")) {
            model.setBlock(net);
    
            try (Trainer trainer = model.newTrainer(config)) {
    
                trainer.initialize(new Shape(1, 784));
                trainer.setMetrics(new Metrics());
    
                EasyTrain.fit(trainer, numEpochs, trainIter, testIter);
    
                Metrics metrics = trainer.getMetrics();
    
                trainer.getEvaluators().stream()
                        .forEach(evaluator -> {
                            evaluatorMetrics.put("train_epoch_" + evaluator.getName(), metrics.getMetric("train_epoch_" + evaluator.getName()).stream()
                                                .mapToDouble(x -> x.getValue().doubleValue()).toArray());
                            evaluatorMetrics.put("validate_epoch_" + evaluator.getName(), metrics.getMetric("validate_epoch_" + evaluator.getName()).stream()
                                                .mapToDouble(x -> x.getValue().doubleValue()).toArray());
                });
        }
    }


.. parsed-literal::
    :class: output

    INFO Training on: 1 GPUs.
    INFO Load MXNet Engine Version 1.9.0 in 0.087 ms.


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.54, SoftmaxCrossEntropyLoss: 1.19
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 1 finished.
    INFO Train: Accuracy: 0.54, SoftmaxCrossEntropyLoss: 1.19
    INFO Validate: Accuracy: 0.77, SoftmaxCrossEntropyLoss: 0.61


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.79, SoftmaxCrossEntropyLoss: 0.58
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 2 finished.
    INFO Train: Accuracy: 0.79, SoftmaxCrossEntropyLoss: 0.58
    INFO Validate: Accuracy: 0.72, SoftmaxCrossEntropyLoss: 0.75


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.82, SoftmaxCrossEntropyLoss: 0.49
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 3 finished.
    INFO Train: Accuracy: 0.82, SoftmaxCrossEntropyLoss: 0.49
    INFO Validate: Accuracy: 0.76, SoftmaxCrossEntropyLoss: 0.59


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.84, SoftmaxCrossEntropyLoss: 0.45
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 4 finished.
    INFO Train: Accuracy: 0.84, SoftmaxCrossEntropyLoss: 0.45
    INFO Validate: Accuracy: 0.83, SoftmaxCrossEntropyLoss: 0.45


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.84, SoftmaxCrossEntropyLoss: 0.43
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 5 finished.
    INFO Train: Accuracy: 0.84, SoftmaxCrossEntropyLoss: 0.43
    INFO Validate: Accuracy: 0.83, SoftmaxCrossEntropyLoss: 0.46


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.85, SoftmaxCrossEntropyLoss: 0.40
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 6 finished.
    INFO Train: Accuracy: 0.85, SoftmaxCrossEntropyLoss: 0.40
    INFO Validate: Accuracy: 0.85, SoftmaxCrossEntropyLoss: 0.42


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.86, SoftmaxCrossEntropyLoss: 0.38
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 7 finished.
    INFO Train: Accuracy: 0.86, SoftmaxCrossEntropyLoss: 0.38
    INFO Validate: Accuracy: 0.85, SoftmaxCrossEntropyLoss: 0.40


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.86, SoftmaxCrossEntropyLoss: 0.37
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 8 finished.
    INFO Train: Accuracy: 0.86, SoftmaxCrossEntropyLoss: 0.37
    INFO Validate: Accuracy: 0.85, SoftmaxCrossEntropyLoss: 0.44


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.87, SoftmaxCrossEntropyLoss: 0.36
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 9 finished.
    INFO Train: Accuracy: 0.87, SoftmaxCrossEntropyLoss: 0.36
    INFO Validate: Accuracy: 0.86, SoftmaxCrossEntropyLoss: 0.39


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.87, SoftmaxCrossEntropyLoss: 0.35
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 10 finished.
    INFO Train: Accuracy: 0.87, SoftmaxCrossEntropyLoss: 0.35
    INFO Validate: Accuracy: 0.84, SoftmaxCrossEntropyLoss: 0.45
    INFO forward P50: 0.625 ms, P90: 0.703 ms
    INFO training-metrics P50: 0.018 ms, P90: 0.020 ms
    INFO backward P50: 0.714 ms, P90: 0.745 ms
    INFO step P50: 1.247 ms, P90: 1.357 ms
    INFO epoch P50: 1.460 s, P90: 1.879 s


.. code:: java

    trainLoss = evaluatorMetrics.get("train_epoch_SoftmaxCrossEntropyLoss");
    trainAccuracy = evaluatorMetrics.get("train_epoch_Accuracy");
    testAccuracy = evaluatorMetrics.get("validate_epoch_Accuracy");
    
    String[] lossLabel = new String[trainLoss.length + testAccuracy.length + trainAccuracy.length];
    
    Arrays.fill(lossLabel, 0, trainLoss.length, "test acc");
    Arrays.fill(lossLabel, trainAccuracy.length, trainLoss.length + trainAccuracy.length, "train acc");
    Arrays.fill(lossLabel, trainLoss.length + trainAccuracy.length,
                    trainLoss.length + testAccuracy.length + trainAccuracy.length, "train loss");
    
    Table data = Table.create("Data").addColumns(
        DoubleColumn.create("epochCount", ArrayUtils.addAll(epochCount, ArrayUtils.addAll(epochCount, epochCount))),
        DoubleColumn.create("loss", ArrayUtils.addAll(testAccuracy , ArrayUtils.addAll(trainAccuracy, trainLoss))),
        StringColumn.create("lossLabel", lossLabel)
    );
    
    render(LinePlot.create("", data, "epochCount", "loss", "lossLabel"),"text/html");


.. raw:: html

    <img id="26aab7eb958d4732b10e41c132eaaf06_img"></img>
    <div id="26aab7eb958d4732b10e41c132eaaf06"></div>
    <script>require(['https://cdn.plot.ly/plotly-1.57.0.min.js'], Plotly => {
    var target_26aab7eb958d4732b10e41c132eaaf06 = document.getElementById('26aab7eb958d4732b10e41c132eaaf06');
    var layout = {
        height: 600,
        width: 800,
        showlegend: true,
        xaxis: {
        title: 'epochCount',
        },
    
        yaxis: {
        title: 'loss',
        },
    
    };
    
    var trace0 =
    {
    x: ["1.0","2.0","3.0","4.0","5.0","6.0","7.0","8.0","9.0","10.0"],
    y: ["0.7686","0.716","0.7574","0.8275","0.8282","0.848","0.8549","0.8488","0.8589","0.8359"],
    showlegend: true,
    mode: 'lines',
    xaxis: 'x',
    yaxis: 'y',
    type: 'scatter',
    name: 'test acc',
    };
    var trace1 =
    {
    x: ["1.0","2.0","3.0","4.0","5.0","6.0","7.0","8.0","9.0","10.0"],
    y: ["0.54251665","0.78583336","0.8204833","0.83538336","0.84455","0.8526","0.85866666","0.86331666","0.8693","0.87156665"],
    showlegend: true,
    mode: 'lines',
    xaxis: 'x',
    yaxis: 'y',
    type: 'scatter',
    name: 'train acc',
    };
    var trace2 =
    {
    x: ["1.0","2.0","3.0","4.0","5.0","6.0","7.0","8.0","9.0","10.0"],
    y: ["1.1882628","0.58020943","0.4928934","0.45122826","0.4270344","0.4005271","0.3834068","0.37195218","0.35558218","0.34666702"],
    showlegend: true,
    mode: 'lines',
    xaxis: 'x',
    yaxis: 'y',
    type: 'scatter',
    name: 'train loss',
    };
    
    
    var data = [ trace0, trace1, trace2];
    Plotly.newPlot(target_26aab7eb958d4732b10e41c132eaaf06, data, layout);
    })</script>


Summary
-------

-  Beyond controlling the number of dimensions and the size of the
   weight vector, dropout is yet another tool to avoid overfitting.
   Often all three are used jointly.
-  Dropout replaces an activation :math:`h` with a random variable
   :math:`h'` with expected value :math:`h` and with variance given by
   the dropout probability :math:`p`.
-  Dropout is only used during training.

Exercises
---------

1. What happens if you change the dropout probabilities for layers 1 and
   2? In particular, what happens if you switch the ones for both
   layers? Design an experiment to answer these questions, describe your
   results quantitatively, and summarize the qualitative takeaways.
2. Increase the number of epochs and compare the results obtained when
   using dropout with those when not using it.
3. What is the variance of the activations in each hidden layer when
   dropout is and is not applied? Draw a plot to show how this quantity
   evolves over time for both models.
4. Why is dropout not typically used at test time?
5. Using the model in this section as an example, compare the effects of
   using dropout and weight decay. What happens when dropout and weight
   decay are used at the same time? Are the results additive, are there
   diminished returns or (worse), do they cancel each other out?
6. What happens if we apply dropout to the individual weights of the
   weight matrix rather than the activations?
7. Invent another technique for injecting random noise at each layer
   that is different from the standard dropout technique. Can you
   develop a method that outperforms dropout on the Fashion-MNIST
   dataset (for a fixed architecture)?