Run this notebook online:\ |Binder| or Colab: |Colab|

.. |Binder| image:: https://mybinder.org/badge_logo.svg
   :target: https://mybinder.org/v2/gh/deepjavalibrary/d2l-java/master?filepath=chapter_recurrent-modern/lstm.ipynb
.. |Colab| image:: https://colab.research.google.com/assets/colab-badge.svg
   :target: https://colab.research.google.com/github/deepjavalibrary/d2l-java/blob/colab/chapter_recurrent-modern/lstm.ipynb

.. _sec_lstm:

Long Short-Term Memory (LSTM)
=============================


The challenge of preserving long-term information and skipping
short-term input in latent variable models has existed for a long time.
One of the earliest approaches to address this was the long short-term
memory (LSTM) :cite:`Hochreiter.Schmidhuber.1997`. It shares many of
the properties of the GRU. Interestingly, LSTMs have a slightly more
complex design than GRUs but predate GRUs by almost two decades.

Gated Memory Cell
-----------------

Arguably, the LSTM's design is inspired by the logic gates of a
computer. The LSTM introduces a *memory cell* (or *cell* for short) that
has the same shape as the hidden state (some of the literature considers
the memory cell a special type of hidden state), engineered to record
additional information. To control the memory cell we need a number of
gates. One gate is needed to read out the entries from the cell. We will
refer to this as the *output gate*. A second gate is needed to decide
when to read data into the cell. We refer to this as the *input gate*.
Last, we need a mechanism to reset the content of the cell, governed by
a *forget gate*. The motivation for such a design is the same as that of
GRUs, namely to be able to decide when to remember and when to ignore
inputs in the hidden state via a dedicated mechanism. Let us see how
this works in practice.

Input Gate, Forget Gate, and Output Gate
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Just like in GRUs, the data feeding into the LSTM gates are the input at
the current time step and the hidden state of the previous time step, as
illustrated in :numref:`lstm_0`. They are processed by three
fully-connected layers with a sigmoid activation function to compute the
values of the input, forget, and output gates. As a result, the values
of the three gates are in the range of :math:`(0, 1)`.

.. _lstm_0:

.. figure:: https://raw.githubusercontent.com/d2l-ai/d2l-en/master/img/lstm-0.svg

   Computing the input gate, the forget gate, and the output gate in an LSTM model.

Mathematically, suppose that there are :math:`h` hidden units, the batch
size is :math:`n`, and the number of inputs is :math:`d`. Thus, the
input is :math:`\mathbf{X}_t \in \mathbb{R}^{n \times d}` and the hidden
state of the previous time step is
:math:`\mathbf{H}_{t-1} \in \mathbb{R}^{n \times h}`. Correspondingly,
the gates at time step :math:`t` are defined as follows: the input gate
is :math:`\mathbf{I}_t \in \mathbb{R}^{n \times h}`, the forget gate is
:math:`\mathbf{F}_t \in \mathbb{R}^{n \times h}`, and the output gate is
:math:`\mathbf{O}_t \in \mathbb{R}^{n \times h}`. They are calculated as
follows:

.. math::


   \begin{aligned}
   \mathbf{I}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xi} + \mathbf{H}_{t-1} \mathbf{W}_{hi} + \mathbf{b}_i),\\
   \mathbf{F}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xf} + \mathbf{H}_{t-1} \mathbf{W}_{hf} + \mathbf{b}_f),\\
   \mathbf{O}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xo} + \mathbf{H}_{t-1} \mathbf{W}_{ho} + \mathbf{b}_o),
   \end{aligned}

where
:math:`\mathbf{W}_{xi}, \mathbf{W}_{xf}, \mathbf{W}_{xo} \in \mathbb{R}^{d \times h}`
and
:math:`\mathbf{W}_{hi}, \mathbf{W}_{hf}, \mathbf{W}_{ho} \in \mathbb{R}^{h \times h}`
are weight parameters and
:math:`\mathbf{b}_i, \mathbf{b}_f, \mathbf{b}_o \in \mathbb{R}^{1 \times h}`
are bias parameters.
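
As an illustration of the shapes involved, the following is a minimal
plain-Java sketch of a single gate computation. It uses ``double``
arrays instead of DJL ``NDArray`` objects, and the class ``GateSketch``
with its helpers ``sigmoid``, ``matmul``, and ``gate`` are hypothetical
names introduced only for this example.

.. code:: java

    class GateSketch {
        // Logistic sigmoid; maps a real number towards the range (0, 1).
        static double sigmoid(double z) {
            return 1.0 / (1.0 + Math.exp(-z));
        }

        // Plain matrix product of an (n x k) array and a (k x m) array.
        static double[][] matmul(double[][] a, double[][] b) {
            int n = a.length, k = b.length, m = b[0].length;
            double[][] out = new double[n][m];
            for (int r = 0; r < n; r++) {
                for (int c = 0; c < m; c++) {
                    for (int j = 0; j < k; j++) {
                        out[r][c] += a[r][j] * b[j][c];
                    }
                }
            }
            return out;
        }

        // gate = sigmoid(X W_x + H W_h + b); the bias is broadcast over the batch.
        static double[][] gate(
                double[][] x, double[][] h, double[][] wx, double[][] wh, double[] b) {
            double[][] xw = matmul(x, wx); // shape (n, h)
            double[][] hw = matmul(h, wh); // shape (n, h)
            double[][] out = new double[x.length][b.length];
            for (int r = 0; r < out.length; r++) {
                for (int c = 0; c < b.length; c++) {
                    out[r][c] = sigmoid(xw[r][c] + hw[r][c] + b[c]);
                }
            }
            return out;
        }
    }

With :math:`n = 2`, :math:`d = 3`, :math:`h = 2`, and all weights and
biases set to zero, every entry of the returned gate is
:math:`\sigma(0) = 0.5`.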

Candidate Memory Cell
~~~~~~~~~~~~~~~~~~~~~

Next we design the memory cell. Since we have not specified the action
of the various gates yet, we first introduce the *candidate* memory cell
:math:`\tilde{\mathbf{C}}_t \in \mathbb{R}^{n \times h}`. Its
computation is similar to that of the three gates described above, but
uses a :math:`\tanh` function with a value range of :math:`(-1, 1)` as
the activation function. This leads to the following equation at time
step :math:`t`:

.. math:: \tilde{\mathbf{C}}_t = \tanh(\mathbf{X}_t \mathbf{W}_{xc} + \mathbf{H}_{t-1} \mathbf{W}_{hc} + \mathbf{b}_c),

where :math:`\mathbf{W}_{xc} \in \mathbb{R}^{d \times h}` and
:math:`\mathbf{W}_{hc} \in \mathbb{R}^{h \times h}` are weight
parameters and :math:`\mathbf{b}_c \in \mathbb{R}^{1 \times h}` is a
bias parameter.

A quick illustration of the candidate memory cell is shown in
:numref:`lstm_1`.

.. _lstm_1:

.. figure:: https://raw.githubusercontent.com/d2l-ai/d2l-en/master/img/lstm-1.svg

   Computing the candidate memory cell in an LSTM model.

Memory Cell
~~~~~~~~~~~

In GRUs, we have a mechanism to govern input and forgetting (or
skipping). Similarly, in LSTMs we have two dedicated gates for such
purposes: the input gate :math:`\mathbf{I}_t` governs how much we take
new data into account via :math:`\tilde{\mathbf{C}}_t` and the forget
gate :math:`\mathbf{F}_t` addresses how much of the old memory cell
content :math:`\mathbf{C}_{t-1} \in \mathbb{R}^{n \times h}` we retain.
Using the same pointwise multiplication trick as before, we arrive at
the following update equation:

.. math:: \mathbf{C}_t = \mathbf{F}_t \odot \mathbf{C}_{t-1} + \mathbf{I}_t \odot \tilde{\mathbf{C}}_t.

If the forget gate is always approximately 1 and the input gate is
always approximately 0, the past memory cells :math:`\mathbf{C}_{t-1}`
will be saved over time and passed to the current time step. This design
is introduced to alleviate the vanishing gradient problem and to better
capture long range dependencies within sequences.
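
The effect of the two gates can be checked numerically. The sketch
below is plain Java on ``double`` arrays, with the hypothetical class
``CellUpdateSketch``. It holds the gate values fixed across time steps,
which is a simplifying assumption, since in a real LSTM the gates are
recomputed at every step.

.. code:: java

    class CellUpdateSketch {
        // One elementwise update C_t = F_t * C_{t-1} + I_t * C~_t.
        static double[] update(double[] f, double[] cPrev, double[] i, double[] cTilde) {
            double[] c = new double[cPrev.length];
            for (int k = 0; k < c.length; k++) {
                c[k] = f[k] * cPrev[k] + i[k] * cTilde[k];
            }
            return c;
        }

        // Apply the update repeatedly with the same (assumed constant) gate values.
        static double[] repeat(
                double[] f, double[] c0, double[] i, double[] cTilde, int steps) {
            double[] c = c0.clone();
            for (int t = 0; t < steps; t++) {
                c = update(f, c, i, cTilde);
            }
            return c;
        }
    }

For instance, starting from a cell of :math:`(0.8, 0.8)` with input
gates of 0 and forget gates of :math:`(1, 0.5)`, ten updates leave the
first entry at 0.8 while the second shrinks to
:math:`0.8 \cdot 0.5^{10} \approx 0.00078`.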

We thus arrive at the flow diagram in :numref:`lstm_2`.

.. _lstm_2:

.. figure:: https://raw.githubusercontent.com/d2l-ai/d2l-en/master/img/lstm-2.svg

   Computing the memory cell in an LSTM model.


Hidden State
~~~~~~~~~~~~

Last, we need to define how to compute the hidden state
:math:`\mathbf{H}_t \in \mathbb{R}^{n \times h}`. This is where the
output gate comes into play. In LSTM it is simply a gated version of the
:math:`\tanh` of the memory cell. This ensures that the values of
:math:`\mathbf{H}_t` are always in the interval :math:`(-1, 1)`.

.. math:: \mathbf{H}_t = \mathbf{O}_t \odot \tanh(\mathbf{C}_t).

Whenever the output gate is close to 1 we effectively pass all memory
information through to the predictor, whereas when the output gate is
close to 0 we retain all the information only within the memory cell and
perform no further processing.
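
A small numerical sketch of this readout follows, again in plain Java
with a hypothetical class name ``HiddenStateSketch`` and with made-up
cell and gate values chosen only for illustration.

.. code:: java

    class HiddenStateSketch {
        // H_t = O_t * tanh(C_t), computed elementwise.
        static double[] hidden(double[] o, double[] c) {
            double[] h = new double[c.length];
            for (int k = 0; k < h.length; k++) {
                h[k] = o[k] * Math.tanh(c[k]);
            }
            return h;
        }
    }

For a cell value of 3, the hidden state is
:math:`\tanh(3) \approx 0.995` when the output gate is 1 and exactly 0
when the output gate is 0.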

:numref:`lstm_3` has a graphical illustration of the data flow.

.. _lstm_3:

.. figure:: https://raw.githubusercontent.com/d2l-ai/d2l-en/master/img/lstm-3.svg

   Computing the hidden state in an LSTM model.

Implementation from Scratch
---------------------------

Now let us implement an LSTM from scratch. As in the experiments in
:numref:`sec_rnn_scratch`, we first load the time machine dataset.

.. |Computing the input gate, the forget gate, and the output gate in an LSTM model.| image:: https://raw.githubusercontent.com/d2l-ai/d2l-en/master/img/lstm-0.svg
.. |Computing the candidate memory cell in an LSTM model.| image:: https://raw.githubusercontent.com/d2l-ai/d2l-en/master/img/lstm-1.svg
.. |Computing the hidden state in an LSTM model.| image:: https://raw.githubusercontent.com/d2l-ai/d2l-en/master/img/lstm-3.svg

.. code:: java

    %load ../utils/djl-imports
    %load ../utils/plot-utils
    %load ../utils/Functions.java
    %load ../utils/PlotUtils.java
    
    %load ../utils/StopWatch.java
    %load ../utils/Accumulator.java
    %load ../utils/Animator.java
    %load ../utils/Training.java
    %load ../utils/timemachine/Vocab.java
    %load ../utils/timemachine/RNNModel.java
    %load ../utils/timemachine/RNNModelScratch.java
    %load ../utils/timemachine/TimeMachine.java
    %load ../utils/timemachine/TimeMachineDataset.java

.. code:: java

    NDManager manager = NDManager.newBaseManager();

.. code:: java

    int batchSize = 32;
    int numSteps = 35;
    
    TimeMachineDataset dataset =
            new TimeMachineDataset.Builder()
                    .setManager(manager)
                    .setMaxTokens(10000)
                    .setSampling(batchSize, false)
                    .setSteps(numSteps)
                    .build();
    dataset.prepare();
    Vocab vocab = dataset.getVocab();

Initializing Model Parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Next we need to define and initialize the model parameters. As
previously, the hyperparameter ``numHiddens`` defines the number of
hidden units. We initialize weights from a Gaussian distribution with a
standard deviation of 0.01, and we set the biases to 0.

.. code:: java

    public static NDList getLSTMParams(int vocabSize, int numHiddens, Device device) {
        int numInputs = vocabSize;
        int numOutputs = vocabSize;
    
        // Input gate parameters
        NDList temp = three(numInputs, numHiddens, device);
        NDArray W_xi = temp.get(0);
        NDArray W_hi = temp.get(1);
        NDArray b_i = temp.get(2);
    
        // Forget gate parameters
        temp = three(numInputs, numHiddens, device);
        NDArray W_xf = temp.get(0);
        NDArray W_hf = temp.get(1);
        NDArray b_f = temp.get(2);
    
        // Output gate parameters
        temp = three(numInputs, numHiddens, device);
        NDArray W_xo = temp.get(0);
        NDArray W_ho = temp.get(1);
        NDArray b_o = temp.get(2);
    
        // Candidate memory cell parameters
        temp = three(numInputs, numHiddens, device);
        NDArray W_xc = temp.get(0);
        NDArray W_hc = temp.get(1);
        NDArray b_c = temp.get(2);
    
        // Output layer parameters
        NDArray W_hq = normal(new Shape(numHiddens, numOutputs), device);
        NDArray b_q = manager.zeros(new Shape(numOutputs), DataType.FLOAT32, device);
    
        // Attach gradients
        NDList params =
                new NDList(
                        W_xi, W_hi, b_i, W_xf, W_hf, b_f, W_xo, W_ho, b_o, W_xc, W_hc, b_c, W_hq,
                        b_q);
        for (NDArray param : params) {
            param.setRequiresGradient(true);
        }
        return params;
    }
    
    public static NDArray normal(Shape shape, Device device) {
        return manager.randomNormal(0, 0.01f, shape, DataType.FLOAT32, device);
    }
    
    public static NDList three(int numInputs, int numHiddens, Device device) {
        return new NDList(
                normal(new Shape(numInputs, numHiddens), device),
                normal(new Shape(numHiddens, numHiddens), device),
                manager.zeros(new Shape(numHiddens), DataType.FLOAT32, device));
    }
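
To get a feel for the model size, the parameter shapes above can be
tallied with a few lines of plain Java. The class ``LstmParamCount`` is
a hypothetical helper, and the vocabulary size of 28 used in the note
below is an assumption for illustration; the actual value is whatever
``vocab.length()`` returns.

.. code:: java

    class LstmParamCount {
        // Number of scalars in one (W_x, W_h, b) triple, as built by three().
        static long triple(int numInputs, int numHiddens) {
            return (long) numInputs * numHiddens
                    + (long) numHiddens * numHiddens
                    + numHiddens;
        }

        // Four triples (input gate, forget gate, output gate, candidate cell)
        // plus the output layer W_hq and b_q.
        static long total(int vocabSize, int numHiddens) {
            long outputLayer = (long) numHiddens * vocabSize + vocabSize;
            return 4 * triple(vocabSize, numHiddens) + outputLayer;
        }
    }

Under that assumption and with 256 hidden units,
``LstmParamCount.total(28, 256)`` evaluates to 299036, of which the four
gate and candidate-cell triples account for 291840.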

Defining the Model
~~~~~~~~~~~~~~~~~~

In addition to the hidden state, the state initialization function of
the LSTM needs to return a memory cell with a value of 0 and a shape of
(batch size, number of hidden units). Hence we get the following state
initialization.

.. code:: java

    public static NDList initLSTMState(int batchSize, int numHiddens, Device device) {
        return new NDList(
                manager.zeros(new Shape(batchSize, numHiddens), DataType.FLOAT32, device),
                manager.zeros(new Shape(batchSize, numHiddens), DataType.FLOAT32, device));
    }

The actual model is defined just as discussed before: it provides three
gates and an auxiliary memory cell. Note that only the hidden state is
passed to the output layer. The memory cell :math:`\mathbf{C}_t` does
not directly participate in the output computation.

.. code:: java

    public static Pair<NDArray, NDList> lstm(NDArray inputs, NDList state, NDList params) {
        NDArray W_xi = params.get(0);
        NDArray W_hi = params.get(1);
        NDArray b_i = params.get(2);
    
        NDArray W_xf = params.get(3);
        NDArray W_hf = params.get(4);
        NDArray b_f = params.get(5);
    
        NDArray W_xo = params.get(6);
        NDArray W_ho = params.get(7);
        NDArray b_o = params.get(8);
    
        NDArray W_xc = params.get(9);
        NDArray W_hc = params.get(10);
        NDArray b_c = params.get(11);
    
        NDArray W_hq = params.get(12);
        NDArray b_q = params.get(13);
    
        NDArray H = state.get(0);
        NDArray C = state.get(1);
        NDList outputs = new NDList();
        NDArray X, Y, I, F, O, C_tilda;
        for (int i = 0; i < inputs.size(0); i++) {
            // Input at time step i
            X = inputs.get(i);
            // Input, forget, and output gates
            I = Activation.sigmoid(X.dot(W_xi).add(H.dot(W_hi).add(b_i)));
            F = Activation.sigmoid(X.dot(W_xf).add(H.dot(W_hf).add(b_f)));
            O = Activation.sigmoid(X.dot(W_xo).add(H.dot(W_ho).add(b_o)));
            // Candidate memory cell
            C_tilda = Activation.tanh(X.dot(W_xc).add(H.dot(W_hc).add(b_c)));
            // Update the memory cell, then read it out into the hidden state
            C = F.mul(C).add(I.mul(C_tilda));
            H = O.mul(Activation.tanh(C));
            // Only the hidden state feeds the output layer
            Y = H.dot(W_hq).add(b_q);
            outputs.add(Y);
        }
        return new Pair<>(
                outputs.size() > 1 ? NDArrays.concat(outputs) : outputs.get(0), new NDList(H, C));
    }
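
The loop body can be traced by hand for a single hidden unit and a
scalar input. The following plain-Java sketch mirrors the update order
of ``lstm`` above but drops the output layer and replaces matrices by
scalars. ``ScalarLstmSketch`` and its flat parameter layout are
assumptions made for this illustration only.

.. code:: java

    class ScalarLstmSketch {
        static double sigmoid(double z) {
            return 1.0 / (1.0 + Math.exp(-z));
        }

        // p = {w_xi, w_hi, b_i, w_xf, w_hf, b_f, w_xo, w_ho, b_o, w_xc, w_hc, b_c}
        // state = {h, c}; returns the new {h, c} after consuming all inputs.
        static double[] run(double[] inputs, double[] state, double[] p) {
            double h = state[0];
            double c = state[1];
            for (double x : inputs) {
                double i = sigmoid(x * p[0] + h * p[1] + p[2]); // input gate
                double f = sigmoid(x * p[3] + h * p[4] + p[5]); // forget gate
                double o = sigmoid(x * p[6] + h * p[7] + p[8]); // output gate
                double cTilde = Math.tanh(x * p[9] + h * p[10] + p[11]); // candidate
                c = f * c + i * cTilde; // memory cell update
                h = o * Math.tanh(c); // hidden state readout
            }
            return new double[] {h, c};
        }
    }

With all twelve parameters set to zero, every gate equals 0.5 and the
candidate cell is 0, so the memory cell simply halves at each step
regardless of the input.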

Training and Prediction
~~~~~~~~~~~~~~~~~~~~~~~

Let us train an LSTM in the same way as in :numref:`sec_gru`, by
instantiating the ``RNNModelScratch`` class as introduced in
:numref:`sec_rnn_scratch`.

.. code:: java

    int vocabSize = vocab.length();
    int numHiddens = 256;
    Device device = manager.getDevice();
    int numEpochs = Integer.getInteger("MAX_EPOCH", 500);
    
    int lr = 1;
    
    Functions.TriFunction<Integer, Integer, Device, NDList> getParamsFn =
            (a, b, c) -> getLSTMParams(a, b, c);
    Functions.TriFunction<Integer, Integer, Device, NDList> initLSTMStateFn =
            (a, b, c) -> initLSTMState(a, b, c);
    Functions.TriFunction<NDArray, NDList, NDList, Pair<NDArray, NDList>> lstmFn = (a, b, c) -> lstm(a, b, c);
    
    RNNModelScratch model =
            new RNNModelScratch(
                    vocabSize, numHiddens, device, getParamsFn, initLSTMStateFn, lstmFn);
    TimeMachine.trainCh8(model, dataset, vocab, lr, numEpochs, device, false, manager);


.. raw:: html

    <img id="c24d2e4eca474f1bab49cf4016cf58ee_img"></img>
    <div id="c24d2e4eca474f1bab49cf4016cf58ee"></div>
    <script>require(['https://cdn.plot.ly/plotly-1.57.0.min.js'], Plotly => {
    var target_c24d2e4eca474f1bab49cf4016cf58ee = document.getElementById('c24d2e4eca474f1bab49cf4016cf58ee');
    var layout = {
        height: 600,
        width: 800,
        showlegend: true,
        xaxis: {
        title: 'epoch',
        },
    
        yaxis: {
        title: 'value',
        },
    
    };
    
    var trace0 =
    {
    x: ["10.0","20.0","30.0","40.0","50.0","60.0","70.0","80.0","90.0","100.0","110.0","120.0","130.0","140.0","150.0","160.0","170.0","180.0","190.0","200.0","210.0","220.0","230.0","240.0","250.0","260.0","270.0","280.0","290.0","300.0","310.0","320.0","330.0","340.0","350.0","360.0","370.0","380.0","390.0","400.0","410.0","420.0","430.0","440.0","450.0","460.0","470.0","480.0","490.0","500.0"],
    y: ["17.98337","17.43854","16.684038","15.651637","14.532355","13.004776","11.989496","11.352077","10.783813","10.449796","10.039866","9.6802635","9.388616","8.917858","8.555821","8.26651","7.976743","7.59439","7.30471","7.0664024","6.7312255","6.443585","6.2216496","5.876426","5.500538","5.1809087","4.998748","4.5595794","4.316013","3.9253535","3.6613238","3.2968278","3.053462","2.7909803","2.545046","2.3022468","2.1666284","1.8950771","1.7296609","1.6394353","1.5197394","1.4130836","1.4619044","1.3254068","1.3797011","1.0965222","1.3760917","1.0693219","1.0548135","1.1333622"],
    showlegend: true,
    mode: 'lines',
    xaxis: 'x',
    yaxis: 'y',
    type: 'scatter',
    name: 'ppl',
    };
    
    
    var data = [ trace0];
    Plotly.newPlot(target_c24d2e4eca474f1bab49cf4016cf58ee, data, layout);
    })</script>


.. parsed-literal::
    :class: output

    perplexity: 1.1, 11571.5 tokens/sec on gpu(0)
    time traveller a deald thas is all the earthat that ir mist a ve
    traveller after the pauserequired for the proper assimilati


Concise Implementation
----------------------

Using high-level APIs, we can directly instantiate an ``LSTM`` model.
This encapsulates all the configuration details that we made explicit
above. The code is significantly faster as it uses compiled operators
rather than Java for many of the steps that we spelled out above.

.. code:: java

    LSTM lstmLayer =
            LSTM.builder()
                    .setNumLayers(1)
                    .setStateSize(numHiddens)
                    .optReturnState(true)
                    .optBatchFirst(false)
                    .build();
    RNNModel modelConcise = new RNNModel(lstmLayer, vocab.length());
    TimeMachine.trainCh8(modelConcise, dataset, vocab, lr, numEpochs, device, false, manager);


.. parsed-literal::
    :class: output

    INFO Training on: 1 GPUs.
    INFO Load MXNet Engine Version 1.9.0 in 0.055 ms.


.. raw:: html

    <img id="3e5706daaa0045dda4731eded2e6f92a_img"></img>
    <div id="3e5706daaa0045dda4731eded2e6f92a"></div>
    <script>require(['https://cdn.plot.ly/plotly-1.57.0.min.js'], Plotly => {
    var target_3e5706daaa0045dda4731eded2e6f92a = document.getElementById('3e5706daaa0045dda4731eded2e6f92a');
    var layout = {
        height: 600,
        width: 800,
        showlegend: true,
        xaxis: {
        title: 'epoch',
        },
    
        yaxis: {
        title: 'value',
        },
    
    };
    
    var trace0 =
    {
    x: ["10.0","20.0","30.0","40.0","50.0","60.0","70.0","80.0","90.0","100.0","110.0","120.0","130.0","140.0","150.0","160.0","170.0","180.0","190.0","200.0","210.0","220.0","230.0","240.0","250.0","260.0","270.0","280.0","290.0","300.0","310.0","320.0","330.0","340.0","350.0","360.0","370.0","380.0","390.0","400.0","410.0","420.0","430.0","440.0","450.0","460.0","470.0","480.0","490.0","500.0"],
    y: ["17.693495","17.316912","16.638601","15.7532015","14.271586","12.443709","11.577166","10.929921","10.596406","10.179512","9.906491","9.466758","9.044335","8.631279","8.285357","7.8996387","7.6350403","7.390467","7.060318","6.8119183","6.5098047","6.202165","5.9748254","5.749986","5.4543824","5.1613727","4.8719683","4.6259623","4.374988","4.082355","3.8955593","3.577078","3.2253885","3.0943046","2.9374752","2.5540292","2.3856726","2.3237522","2.0008998","1.8267057","1.7758105","1.5787628","1.4690225","1.2388519","1.201313","1.2183578","1.4385287","1.0924603","1.1575444","1.0855707"],
    showlegend: true,
    mode: 'lines',
    xaxis: 'x',
    yaxis: 'y',
    type: 'scatter',
    name: 'ppl',
    };
    
    
    var data = [ trace0];
    Plotly.newPlot(target_3e5706daaa0045dda4731eded2e6f92a, data, layout);
    })</script>


.. parsed-literal::
    :class: output

    perplexity: 1.1, 80819.7 tokens/sec on gpu(0)
    time traveller for so it will be convenient to speak of himwas e
    traveller frece thef and some there there wermat of lyon ab
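
The speed difference between the two runs can be read off from the
reported throughputs. The snippet below is a plain-Java
back-of-the-envelope calculation using the two ``tokens/sec`` figures
printed above. The class name ``SpeedupSketch`` is hypothetical, and the
ratio will differ on other hardware and library versions.

.. code:: java

    class SpeedupSketch {
        // Ratio of two throughputs, both measured in tokens per second.
        static double speedup(double fastTokensPerSec, double slowTokensPerSec) {
            return fastTokensPerSec / slowTokensPerSec;
        }
    }

``SpeedupSketch.speedup(80819.7, 11571.5)`` is roughly 7, i.e., the
concise implementation processed about seven times as many tokens per
second in this particular run.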


LSTMs are the prototypical latent variable autoregressive model with
nontrivial state control. Many variants thereof have been proposed over
the years, e.g., multiple layers, residual connections, different types
of regularization. However, training LSTMs and other sequence models
(such as GRUs) is quite costly due to the long range dependency of the
sequence. Later we will encounter alternative models such as
Transformers that can be used in some cases.

Summary
-------

-  LSTMs have three types of gates: input gates, forget gates, and
   output gates that control the flow of information.
-  The hidden layer output of LSTM includes the hidden state and the
   memory cell. Only the hidden state is passed into the output layer.
   The memory cell is entirely internal.
-  LSTMs can alleviate vanishing and exploding gradients.

Exercises
---------

1. Adjust the hyperparameters and analyze their influence on running
   time, perplexity, and the output sequence.
2. How would you need to change the model to generate proper words as
   opposed to sequences of characters?
3. Compare the computational cost for GRUs, LSTMs, and regular RNNs for
   a given hidden dimension. Pay special attention to the training and
   inference cost.
4. Since the candidate memory cell ensures that the value range is
   between :math:`-1` and :math:`1` by using the :math:`\tanh` function,
   why does the hidden state need to use the :math:`\tanh` function
   again to ensure that the output value range is between :math:`-1` and
   :math:`1`?
5. Implement an LSTM model for time series prediction rather than
   character sequence prediction.