Run this notebook online:\ |Binder| or Colab: |Colab|

.. |Binder| image:: https://mybinder.org/badge_logo.svg
   :target: https://mybinder.org/v2/gh/deepjavalibrary/d2l-java/master?filepath=chapter_recurrent-modern/deep-rnn.ipynb
.. |Colab| image:: https://colab.research.google.com/assets/colab-badge.svg
   :target: https://colab.research.google.com/github/deepjavalibrary/d2l-java/blob/colab/chapter_recurrent-modern/deep-rnn.ipynb

.. _sec_deep_rnn:

Deep Recurrent Neural Networks
==============================

Up to now, we have only discussed RNNs with a single unidirectional
hidden layer. In such an RNN, the specific functional form of how
latent variables and observations interact is rather arbitrary. This is
not a big problem as long as we have enough flexibility to model
different types of interactions. With a single layer, however, this can
be quite challenging. In the case of linear models, we fixed this
problem by adding more layers. Within RNNs this is a bit trickier,
since we first need to decide how and where to add extra nonlinearity.

In fact, we could stack multiple layers of RNNs on top of each other.
This results in a flexible mechanism, due to the combination of several
simple layers. In particular, data might be relevant at different
levels of the stack. For instance, we might want to keep high-level
data about financial market conditions (bear or bull market) available,
whereas at a lower level we only record shorter-term temporal dynamics.

Beyond all the above abstract discussion, it is probably easiest to
understand the family of models we are interested in by reviewing
:numref:`fig_deep_rnn`. It describes a deep RNN with :math:`L` hidden
layers. Each hidden state is continuously passed to both the next time
step of the current layer and the current time step of the next layer.

|Architecture of a deep RNN.|

.. _fig_deep_rnn:

Functional Dependencies
-----------------------

We can formalize the functional dependencies within the deep
architecture of :math:`L` hidden layers depicted in
:numref:`fig_deep_rnn`. Our following discussion focuses primarily on
the vanilla RNN model, but it applies to other sequence models, too.

Suppose that we have a minibatch input
:math:`\mathbf{X}_t \in \mathbb{R}^{n \times d}` (number of examples:
:math:`n`, number of inputs in each example: :math:`d`) at time step
:math:`t`. At the same time step, let the hidden state of the
:math:`l^\mathrm{th}` hidden layer (:math:`l=1,\ldots,L`) be
:math:`\mathbf{H}_t^{(l)} \in \mathbb{R}^{n \times h}` (number of
hidden units: :math:`h`) and the output layer variable be
:math:`\mathbf{O}_t \in \mathbb{R}^{n \times q}` (number of outputs:
:math:`q`). Setting :math:`\mathbf{H}_t^{(0)} = \mathbf{X}_t`, the
hidden state of the :math:`l^\mathrm{th}` hidden layer that uses the
activation function :math:`\phi_l` is expressed as follows:

.. math:: \mathbf{H}_t^{(l)} = \phi_l(\mathbf{H}_t^{(l-1)} \mathbf{W}_{xh}^{(l)} + \mathbf{H}_{t-1}^{(l)} \mathbf{W}_{hh}^{(l)} + \mathbf{b}_h^{(l)}),
   :label: eq_deep_rnn_H

where the weights :math:`\mathbf{W}_{xh}^{(l)} \in \mathbb{R}^{h \times h}`
(with :math:`\mathbf{W}_{xh}^{(1)} \in \mathbb{R}^{d \times h}`, since
the first layer consumes the :math:`d`-dimensional input) and
:math:`\mathbf{W}_{hh}^{(l)} \in \mathbb{R}^{h \times h}`, together
with the bias :math:`\mathbf{b}_h^{(l)} \in \mathbb{R}^{1 \times h}`,
are the model parameters of the :math:`l^\mathrm{th}` hidden layer.
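To make :eq:`eq_deep_rnn_H` concrete, the short sketch below evaluates a
single time step of a two-layer vanilla RNN directly with ``NDArray``
operations, so that the state of layer :math:`l-1` becomes the input of
layer :math:`l`. This is not part of the book's implementation: the
sizes ``n``, ``d``, ``h``, the two-layer depth, the random weights, and
:math:`\tanh` as :math:`\phi_l` are all illustrative assumptions.

.. code:: java

    import ai.djl.ndarray.NDArray;
    import ai.djl.ndarray.NDManager;
    import ai.djl.ndarray.types.DataType;
    import ai.djl.ndarray.types.Shape;

    NDManager sketchManager = NDManager.newBaseManager();
    int n = 4, d = 8, h = 16, L = 2;   // minibatch size, input features, hidden units, layers

    NDArray X = sketchManager.randomNormal(new Shape(n, d));   // X_t = H_t^{(0)}
    NDArray[] Hprev = new NDArray[L];                          // H_{t-1}^{(l)}, here all zeros
    NDArray[] Wxh = new NDArray[L];
    NDArray[] Whh = new NDArray[L];
    NDArray[] bh = new NDArray[L];
    for (int l = 0; l < L; l++) {
        int in = (l == 0) ? d : h;                             // layer 1 consumes the d-dim input
        Hprev[l] = sketchManager.zeros(new Shape(n, h));
        Wxh[l] = sketchManager.randomNormal(0f, 0.01f, new Shape(in, h), DataType.FLOAT32);
        Whh[l] = sketchManager.randomNormal(0f, 0.01f, new Shape(h, h), DataType.FLOAT32);
        bh[l] = sketchManager.zeros(new Shape(1, h));
    }

    // One time step of the stack:
    // H_t^{(l)} = tanh(H_t^{(l-1)} Wxh^{(l)} + H_{t-1}^{(l)} Whh^{(l)} + b_h^{(l)})
    NDArray layerInput = X;
    for (int l = 0; l < L; l++) {
        NDArray H = layerInput.dot(Wxh[l]).add(Hprev[l].dot(Whh[l])).add(bh[l]).tanh();
        layerInput = H;                                        // the state feeds the layer above
    }
    System.out.println(layerInput.getShape());                 // (n, h): state of the top layer

Stacking in this way only changes where the incoming activations of
each layer come from; the recurrence over time within each layer is
unchanged.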
In the end, the calculation of the output layer is based only on the
hidden state of the final :math:`L^\mathrm{th}` hidden layer:

.. math:: \mathbf{O}_t = \mathbf{H}_t^{(L)} \mathbf{W}_{hq} + \mathbf{b}_q,

where the weight :math:`\mathbf{W}_{hq} \in \mathbb{R}^{h \times q}`
and the bias :math:`\mathbf{b}_q \in \mathbb{R}^{1 \times q}` are the
model parameters of the output layer.

Just as with MLPs, the number of hidden layers :math:`L` and the number
of hidden units :math:`h` are hyperparameters. In other words, they can
be tuned or specified by us. In addition, we can easily get a deep
gated RNN by replacing the hidden state computation in
:eq:`eq_deep_rnn_H` with that from a GRU or an LSTM.

Concise Implementation
----------------------

Fortunately many of the logistical details required to implement
multiple layers of an RNN are readily available in high-level APIs. To
keep things simple we only illustrate the implementation using such
built-in functionalities. Let us take an LSTM model as an example. The
code is very similar to the one we used previously in
:numref:`sec_lstm`. In fact, the only difference is that we specify the
number of layers explicitly rather than picking the default of a single
layer. As usual, we begin by loading the dataset.

.. |Architecture of a deep RNN.| image:: https://raw.githubusercontent.com/d2l-ai/d2l-en/master/img/deep-rnn.svg

.. code:: java

    %load ../utils/djl-imports
    %load ../utils/plot-utils
    %load ../utils/Functions.java
    %load ../utils/PlotUtils.java

    %load ../utils/StopWatch.java
    %load ../utils/Accumulator.java
    %load ../utils/Animator.java
    %load ../utils/Training.java

    %load ../utils/timemachine/Vocab.java
    %load ../utils/timemachine/RNNModel.java
    %load ../utils/timemachine/RNNModelScratch.java
    %load ../utils/timemachine/TimeMachine.java
    %load ../utils/timemachine/TimeMachineDataset.java

.. code:: java

    NDManager manager = NDManager.newBaseManager();

.. code:: java

    int batchSize = 32;
    int numSteps = 35;

    TimeMachineDataset dataset = new TimeMachineDataset.Builder()
            .setManager(manager)
            .setMaxTokens(10000)
            .setSampling(batchSize, false)
            .setSteps(numSteps)
            .build();
    dataset.prepare();
    Vocab vocab = dataset.getVocab();

The architectural decisions such as choosing hyperparameters are very
similar to those of :numref:`sec_lstm`. We pick the same number of
inputs and outputs as we have distinct tokens, i.e., ``vocabSize``. The
number of hidden units is still 256. The only difference is that we now
select a nontrivial number of hidden layers by specifying the value of
``numLayers``.

.. code:: java

    int vocabSize = vocab.length();
    int numHiddens = 256;
    int numLayers = 2;
    Device device = manager.getDevice();

    LSTM lstmLayer =
            LSTM.builder()
                    .setNumLayers(numLayers)
                    .setStateSize(numHiddens)
                    .optReturnState(true)
                    .optBatchFirst(false)
                    .build();

    RNNModel model = new RNNModel(lstmLayer, vocabSize);
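Before training, it can be instructive to see what shapes a stacked
LSTM produces. The sketch below is illustrative only and not part of
the book's pipeline: the throwaway block ``probe``, the random input,
and the explicit zero starting states are assumptions made here purely
to print shapes; it mirrors the configuration of ``lstmLayer`` above.

.. code:: java

    import ai.djl.training.ParameterStore;

    // A throwaway block configured like lstmLayer above (illustrative only).
    LSTM probe = LSTM.builder()
            .setNumLayers(numLayers)
            .setStateSize(numHiddens)
            .optReturnState(true)
            .optBatchFirst(false)
            .build();

    // RNNModel one-hot encodes tokens, so the recurrent layer sees vocabSize features
    // per step; with optBatchFirst(false) the layout is (numSteps, batchSize, features).
    Shape inputShape = new Shape(numSteps, batchSize, vocabSize);
    probe.initialize(manager, DataType.FLOAT32, inputShape);

    NDArray X = manager.randomUniform(0f, 1f, inputShape);
    // Explicit zero starting states, one slice per stacked layer.
    NDArray h0 = manager.zeros(new Shape(numLayers, batchSize, numHiddens));
    NDArray c0 = manager.zeros(new Shape(numLayers, batchSize, numHiddens));

    NDList result = probe.forward(new ParameterStore(manager, false), new NDList(X, h0, c0), false);
    for (NDArray r : result) {
        System.out.println(r.getShape());
    }

The first entry of the result is the top-layer output for every time
step, which is what the dense output layer of ``RNNModel`` consumes;
you should see roughly ``(numSteps, batchSize, numHiddens)`` for it and
``(numLayers, batchSize, numHiddens)`` for each of the returned hidden
and memory cell states, one slice per layer.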
Training and Prediction
-----------------------

Since we now instantiate two layers with the LSTM model, this rather
more complex architecture slows down training considerably.

.. code:: java

    int numEpochs = Integer.getInteger("MAX_EPOCH", 500);
    int lr = 2;
    TimeMachine.trainCh8(model, dataset, vocab, lr, numEpochs, device, false, manager);

.. parsed-literal::
    :class: output

    INFO Training on: 1 GPUs.
    INFO Load MXNet Engine Version 1.9.0 in 0.085 ms.

.. parsed-literal::
    :class: output

    perplexity: 1.0, 61496.0 tokens/sec on gpu(0)
    time traveller wolld he rour at we canting as wore arother direc
    travellereathe had ag a mome that beeal of the fourth dimen

Summary
-------

-  In deep RNNs, the hidden state information is passed to the next
   time step of the current layer and the current time step of the next
   layer.
-  There exist many different flavors of deep RNNs, such as LSTMs,
   GRUs, or vanilla RNNs. Conveniently, these models are all available
   as parts of the high-level APIs of deep learning frameworks.
-  Initialization of models requires care. Overall, deep RNNs require a
   considerable amount of work (such as tuning the learning rate and
   clipping gradients) to ensure proper convergence.

Exercises
---------

1. Try to implement a two-layer RNN from scratch using the single layer
   implementation we discussed in :numref:`sec_rnn_scratch`.
2. Replace the LSTM by a GRU and compare the accuracy and training
   speed.
3. Increase the training data to include multiple books. How low can
   you go on the perplexity scale?
4. Would you want to combine sources of different authors when modeling
   text? Why is this a good idea? What could go wrong?