Run this notebook online:\ |Binder| or Colab: |Colab|

.. |Binder| image:: https://mybinder.org/badge_logo.svg
   :target: https://mybinder.org/v2/gh/deepjavalibrary/d2l-java/master?filepath=chapter_linear-networks/softmax-regression-djl.ipynb
.. |Colab| image:: https://colab.research.google.com/assets/colab-badge.svg
   :target: https://colab.research.google.com/github/deepjavalibrary/d2l-java/blob/colab/chapter_linear-networks/softmax-regression-djl.ipynb

.. _sec_softmax_djl:

Concise Implementation of Softmax Regression
============================================


Just as DJL made it much easier to implement linear regression in
:numref:`sec_linear_djl`, we will find it similarly (or possibly more)
convenient for implementing classification models. Again, we begin with
our import ritual.

.. code:: java

    %load ../utils/djl-imports
    
    import ai.djl.basicdataset.cv.classification.*;
    import ai.djl.metric.*;

Let us stick with the Fashion-MNIST dataset and keep the batch size at
:math:`256` as in the last section.

.. code:: java

    int batchSize = 256;
    boolean randomShuffle = true;
    
    // Get Training and Validation Datasets
    FashionMnist trainingSet = FashionMnist.builder()
            .optUsage(Dataset.Usage.TRAIN)
            .setSampling(batchSize, randomShuffle)
            .optLimit(Long.getLong("DATASET_LIMIT", Long.MAX_VALUE))
            .build();
    
    
    FashionMnist validationSet = FashionMnist.builder()
            .optUsage(Dataset.Usage.TEST)
            .setSampling(batchSize, false)
            .optLimit(Long.getLong("DATASET_LIMIT", Long.MAX_VALUE))
            .build();

Initializing Model Parameters
-----------------------------

As mentioned in :numref:`sec_softmax`, the output layer of softmax
regression is a fully-connected layer, which DJL provides as the
``Linear`` block. Therefore, to implement our model, we just need to add
one ``Linear`` block with 10 outputs to our ``SequentialBlock``. Again,
here, the ``SequentialBlock`` is not really necessary, but we might as
well form the habit since it will be ubiquitous when implementing deep
models.

.. code:: java

    public class ActivationFunction {
        // Softmax over axis 1 (the class axis), useful for inspecting probabilities
        public static NDList softmax(NDList arrays) {
            return new NDList(arrays.singletonOrThrow().softmax(1));
        }
    }

.. code:: java

    NDManager manager = NDManager.newBaseManager();
    
    Model model = Model.newInstance("softmax-regression");
    
    SequentialBlock net = new SequentialBlock();
    net.add(Blocks.batchFlattenBlock(28 * 28)); // flatten input
    net.add(Linear.builder().setUnits(10).build()); // set 10 output channels
    
    model.setBlock(net);
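
To make the shapes concrete, the following plain-Java sketch (no DJL)
shows what a flatten step followed by a fully-connected layer computes:
:math:`\mathbf{o} = \mathbf{W}\mathbf{x} + \mathbf{b}`. The class and
method names are hypothetical, and the layout of ``w`` (one row per
output class) is an assumption made for illustration, not a statement
about how DJL stores its parameters.

.. code:: java

    // Illustrative sketch only; LinearLayerSketch is a hypothetical name.
    class LinearLayerSketch {
        // Flatten a 2-D image into a 1-D vector, row by row.
        static float[] flatten(float[][] image) {
            int rows = image.length;
            int cols = image[0].length;
            float[] x = new float[rows * cols];
            for (int r = 0; r < rows; r++) {
                System.arraycopy(image[r], 0, x, r * cols, cols);
            }
            return x;
        }

        // Compute o = W x + b, with one row of W per output class.
        static float[] linear(float[] x, float[][] w, float[] b) {
            float[] o = new float[b.length];
            for (int k = 0; k < b.length; k++) {
                float sum = b[k];
                for (int i = 0; i < x.length; i++) {
                    sum += w[k][i] * x[i];
                }
                o[k] = sum;
            }
            return o;
        }

        public static void main(String[] args) {
            float[][] image = new float[28][28];   // placeholder all-zero image
            float[][] w = new float[10][28 * 28];  // 10 classes, 784 inputs
            float[] b = new float[10];
            float[] logits = linear(flatten(image), w, b);
            System.out.println(logits.length);     // one logit per class
        }
    }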

The Softmax
-----------

In the previous example, we calculated our model's output and then ran
this output through the cross-entropy loss. Mathematically, that is a
perfectly reasonable thing to do. However, from a computational
perspective, exponentiation can be a source of numerical stability
issues (as discussed in :numref:`sec_naive_bayes`). Recall that the
softmax function calculates
:math:`\hat y_j = \frac{e^{z_j}}{\sum_{i=1}^{n} e^{z_i}}`, where
:math:`\hat y_j` is the :math:`j^\mathrm{th}` element of the softmax
output ``yHat`` and :math:`z_j` is the :math:`j^\mathrm{th}` element of
its input ``yLinear``, the output of the linear layer.

If some of the :math:`z_i` are very large (i.e., very positive), then
:math:`e^{z_i}` might be larger than the largest number we can have for
certain types of ``float`` (i.e., overflow). This would make the
denominator (and/or numerator) ``inf`` and we wind up encountering
either :math:`0`, ``inf``, or ``nan`` for :math:`\hat y_j`. In these
situations we do not get a well-defined return value for
``crossEntropy()``. One trick to get around this is to first subtract
:math:`\max_i z_i` from all :math:`z_i` before proceeding with the
``softmax`` calculation. You can verify that shifting every :math:`z_i`
by the same constant :math:`c` does not change the return value of
``softmax()``, because the common factor :math:`e^{-c}` cancels between
the numerator and the denominator.
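
The overflow and the effect of the max-subtraction trick can be checked
with a small plain-Java sketch (no DJL). The class and method names are
hypothetical; the code works in 32-bit floats so that the overflow
already shows up for logits around 100.

.. code:: java

    // Illustrative sketch only; StableSoftmaxDemo is a hypothetical name.
    class StableSoftmaxDemo {
        // Naive softmax: exp() overflows to Infinity for large logits in float.
        static float[] naiveSoftmax(float[] z) {
            float[] e = new float[z.length];
            float sum = 0f;
            for (int i = 0; i < z.length; i++) {
                e[i] = (float) Math.exp(z[i]);
                sum += e[i];
            }
            for (int i = 0; i < z.length; i++) {
                e[i] /= sum;  // Infinity / Infinity yields NaN
            }
            return e;
        }

        // Subtract max(z) first so the largest exponent is exp(0) = 1.
        static float[] shiftedSoftmax(float[] z) {
            float max = Float.NEGATIVE_INFINITY;
            for (float v : z) {
                max = Math.max(max, v);
            }
            float[] e = new float[z.length];
            float sum = 0f;
            for (int i = 0; i < z.length; i++) {
                e[i] = (float) Math.exp(z[i] - max);
                sum += e[i];
            }
            for (int i = 0; i < z.length; i++) {
                e[i] /= sum;
            }
            return e;
        }

        public static void main(String[] args) {
            float[] z = {100f, 101f, 102f};
            System.out.println(java.util.Arrays.toString(naiveSoftmax(z)));
            System.out.println(java.util.Arrays.toString(shiftedSoftmax(z)));
        }
    }

With these inputs the naive version returns only ``NaN`` entries, while
the shifted version returns the same probabilities as it would for the
logits :math:`(0, 1, 2)`.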

After the subtraction and normalization step, it might be possible that
some :math:`z_j` have large negative values and thus that the
corresponding :math:`e^{z_j}` will take values close to zero. These
might be rounded to zero due to finite precision (i.e., underflow), making
:math:`\hat y_j` zero and giving us ``-inf`` for
:math:`\text{log}(\hat y_j)`. A few steps down the road in
backpropagation, we might find ourselves faced with a screenful of the
dreaded not-a-number (``nan``) results.

Fortunately, we are saved by the fact that even though we are computing
exponential functions, we ultimately intend to take their log (when
calculating the cross-entropy loss). By combining these two operators
(``softmax`` and ``crossEntropy``) together, we can escape the numerical
stability issues that might otherwise plague us during backpropagation.
As shown in the equation below, we avoid calculating :math:`e^{z_j}`
and can instead use :math:`z_j` directly, thanks to the cancellation in
:math:`\log(\exp(\cdot))`.

.. math::


   \begin{aligned}
   \log{(\hat y_j)} & = \log\left( \frac{e^{z_j}}{\sum_{i=1}^{n} e^{z_i}}\right) \\
   & = \log{(e^{z_j})}-\log{\left( \sum_{i=1}^{n} e^{z_i} \right)} \\
   & = z_j -\log{\left( \sum_{i=1}^{n} e^{z_i} \right)}.
   \end{aligned}
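
The identity above can be turned into a short plain-Java sketch (no
DJL). The names are hypothetical, and the remaining log-sum term is
evaluated with the max-subtraction trick from earlier; this is one
common way to do it, not necessarily what any particular library does
internally.

.. code:: java

    // Illustrative sketch only; LogSoftmaxDemo is a hypothetical name.
    class LogSoftmaxDemo {
        // log(sum_i exp(z_i)), computed as max + log(sum_i exp(z_i - max)).
        static double logSumExp(double[] z) {
            double max = Double.NEGATIVE_INFINITY;
            for (double v : z) {
                max = Math.max(max, v);
            }
            double sum = 0.0;
            for (double v : z) {
                sum += Math.exp(v - max);
            }
            return max + Math.log(sum);
        }

        // log(softmax(z))_j = z_j - logSumExp(z); exp(z_j) is never formed on its own.
        static double[] logSoftmax(double[] z) {
            double lse = logSumExp(z);
            double[] out = new double[z.length];
            for (int j = 0; j < z.length; j++) {
                out[j] = z[j] - lse;
            }
            return out;
        }

        public static void main(String[] args) {
            double[] z = {1000.0, 0.0, -1000.0};
            System.out.println(java.util.Arrays.toString(logSoftmax(z)));
        }
    }

For these extreme logits every entry of the result stays finite, whereas
taking the log of an underflowed softmax probability would give ``-inf``.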

We will want to keep the conventional softmax function handy in case we
ever want to evaluate the probabilities output by our model. But instead
of passing softmax probabilities into our new loss function, we will
just pass the logits and compute the softmax and its log all at once
inside the ``softmaxCrossEntropyLoss`` loss function, which does smart
things like the log-sum-exp trick (see `LogSumExp on
Wikipedia <https://en.wikipedia.org/wiki/LogSumExp>`__).

.. code:: java

    Loss loss = Loss.softmaxCrossEntropyLoss();
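
As a rough illustration of what a loss computed directly from logits
looks like, here is a plain-Java sketch (no DJL) for integer class
labels. The class and method names are hypothetical, and the code is a
sketch of the math rather than DJL's actual implementation.

.. code:: java

    // Illustrative sketch only; SoftmaxCrossEntropySketch is a hypothetical name.
    class SoftmaxCrossEntropySketch {
        // Loss for one example: log(sum_i exp(z_i)) - z_label, computed stably.
        static double lossFromLogits(double[] logits, int label) {
            double max = Double.NEGATIVE_INFINITY;
            for (double z : logits) {
                max = Math.max(max, z);
            }
            double sum = 0.0;
            for (double z : logits) {
                sum += Math.exp(z - max);
            }
            return max + Math.log(sum) - logits[label];
        }

        // Average the per-example losses over a minibatch.
        static double meanLoss(double[][] batchLogits, int[] labels) {
            double total = 0.0;
            for (int n = 0; n < labels.length; n++) {
                total += lossFromLogits(batchLogits[n], labels[n]);
            }
            return total / labels.length;
        }

        public static void main(String[] args) {
            // All-zero logits over 10 classes: uniform prediction, loss = ln(10).
            System.out.println(lossFromLogits(new double[10], 3));
        }
    }

A useful sanity check is that a model predicting all 10 classes with
equal probability has a loss of :math:`\ln 10 \approx 2.30`, so training
losses well below that value indicate that the model has learned
something.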

Optimization Algorithm
----------------------

Here, we use minibatch stochastic gradient descent with a learning rate
of :math:`0.1` as the optimization algorithm. Note that this is the same
as we applied in the linear regression example and it illustrates the
general applicability of the optimizers.

.. code:: java

    Tracker lrt = Tracker.fixed(0.1f);
    Optimizer sgd = Optimizer.sgd().setLearningRateTracker(lrt).build();
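
For intuition, a single SGD update amounts to
:math:`\mathbf{w} \leftarrow \mathbf{w} - \eta \mathbf{g}`, where
:math:`\eta` is the learning rate and :math:`\mathbf{g}` is the gradient
averaged over the minibatch. The plain-Java sketch below (no DJL, with
hypothetical names) shows only this update rule; the optimizer built
above also handles details that are not modeled here.

.. code:: java

    // Illustrative sketch only; SgdSketch is a hypothetical name.
    class SgdSketch {
        // In-place update: w <- w - lr * g.
        static void step(double[] w, double[] g, double lr) {
            for (int i = 0; i < w.length; i++) {
                w[i] -= lr * g[i];
            }
        }

        public static void main(String[] args) {
            double[] w = {1.0, -2.0};
            double[] g = {0.5, -1.0};  // made-up gradient values
            step(w, g, 0.1);
            System.out.println(java.util.Arrays.toString(w));
        }
    }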

Instantiate Configuration
-------------------------

Now we'll create a training configuration that describes how we want to
train our model. We will then create a ``trainer`` to do the training
for us.

.. code:: java

    DefaultTrainingConfig config = new DefaultTrainingConfig(loss)
        .optOptimizer(sgd) // Optimizer
        .optDevices(manager.getEngine().getDevices(1)) // use at most one GPU (CPU if none is available)
        .addEvaluator(new Accuracy()) // Model Accuracy
        .addTrainingListeners(TrainingListener.Defaults.logging()); // Logging
    
    Trainer trainer = model.newTrainer(config);


.. parsed-literal::
    :class: output

    INFO Training on: 1 GPUs.
    INFO Load MXNet Engine Version 1.9.0 in 0.053 ms.
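
The ``Accuracy`` evaluator added to the configuration reports the
fraction of examples whose predicted class matches the label. A
plain-Java sketch of that conventional definition follows (no DJL,
hypothetical names); it assumes the predicted class is the index of the
largest logit and does not reproduce DJL's internals.

.. code:: java

    // Illustrative sketch only; AccuracySketch is a hypothetical name.
    class AccuracySketch {
        // Index of the largest entry (first one wins on ties).
        static int argmax(double[] v) {
            int best = 0;
            for (int i = 1; i < v.length; i++) {
                if (v[i] > v[best]) {
                    best = i;
                }
            }
            return best;
        }

        // Fraction of rows whose argmax equals the corresponding label.
        static double accuracy(double[][] logits, int[] labels) {
            int correct = 0;
            for (int n = 0; n < labels.length; n++) {
                if (argmax(logits[n]) == labels[n]) {
                    correct++;
                }
            }
            return (double) correct / labels.length;
        }

        public static void main(String[] args) {
            double[][] logits = {{0.1, 2.0, -1.0}, {3.0, 0.0, 0.0}, {0.0, 0.0, 5.0}};
            int[] labels = {1, 0, 1};  // made-up labels; the third is misclassified
            System.out.println(accuracy(logits, labels));
        }
    }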


Initializing Trainer
--------------------

We initialize the trainer with input shape (:math:`1`, :math:`784`),
since each :math:`28 \times 28` image is flattened into a vector of
length :math:`784`.

.. code:: java

    trainer.initialize(new Shape(1, 28 * 28)); // Input Images are 28 x 28

Metrics
-------

Now we tell DJL to record metrics! (Remember, DJL doesn't record metrics
unless you tell it to!)

.. code:: java

    Metrics metrics = new Metrics();
    trainer.setMetrics(metrics);

Training
--------

In :numref:`sec_linear_djl`, we trained the model by explicitly calling
``EasyTrain`` to train each batch and then updating the parameters.
Instead, we can call ``EasyTrain``'s ``fit()`` function to do this for
us in one line. It takes the trainer, a number of epochs, a training
set, and a validation set, and it runs validation after each epoch of
training as well.

.. code:: java

    int numEpochs = 3;
    
    EasyTrain.fit(trainer, numEpochs, trainingSet, validationSet);
    var result = trainer.getTrainingResult();


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.74, SoftmaxCrossEntropyLoss: 0.79
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 1 finished.
    INFO Train: Accuracy: 0.74, SoftmaxCrossEntropyLoss: 0.79
    INFO Validate: Accuracy: 0.79, SoftmaxCrossEntropyLoss: 0.63


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.81, SoftmaxCrossEntropyLoss: 0.58
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 2 finished.
    INFO Train: Accuracy: 0.81, SoftmaxCrossEntropyLoss: 0.58
    INFO Validate: Accuracy: 0.81, SoftmaxCrossEntropyLoss: 0.57


.. parsed-literal::
    :class: output

    Training:    100% |████████████████████████████████████████| Accuracy: 0.82, SoftmaxCrossEntropyLoss: 0.53
    Validating:  100% |████████████████████████████████████████|


.. parsed-literal::
    :class: output

    INFO Epoch 3 finished.
    INFO Train: Accuracy: 0.82, SoftmaxCrossEntropyLoss: 0.53
    INFO Validate: Accuracy: 0.81, SoftmaxCrossEntropyLoss: 0.54


As before, this algorithm converges to a reasonable solution, here
reaching a validation accuracy of about 81% after three epochs, albeit
this time with fewer lines of code than before. Note that in many cases,
DJL takes additional precautions beyond these most well-known tricks to
ensure numerical stability, saving us from even more pitfalls that we
would encounter if we tried to code all of our models from scratch in
practice.

Exercises
---------

1. Try adjusting the hyperparameters, such as the batch size, the number
   of epochs, and the learning rate, to see what the results are.
2. Why might the test accuracy decrease again after a while? How could
   we fix this?