Run this notebook online:\ |Binder| or Colab: |Colab| .. |Binder| image:: https://mybinder.org/badge_logo.svg :target: https://mybinder.org/v2/gh/deepjavalibrary/d2l-java/master?filepath=chapter_multilayer-perceptrons/mlp.ipynb .. |Colab| image:: https://colab.research.google.com/assets/colab-badge.svg :target: https://colab.research.google.com/github/deepjavalibrary/d2l-java/blob/colab/chapter_multilayer-perceptrons/mlp.ipynb .. _sec_mlp: Multilayer Perceptrons ====================== In the previous chapter, we introduced softmax regression (:numref:`sec_softmax`), implementing the algorithm from scratch (:numref:`sec_softmax_scratch`) and in DJL (:numref:`sec_softmax_djl`) and training classifiers to recognize 10 categories of clothing from low-resolution images. Along the way, we learned how to wrangle data, coerce our outputs into a valid probability distribution (via ``softmax``), apply an appropriate loss function, and to minimize it with respect to our model's parameters. Now that we have mastered these mechanics in the context of simple linear models, we can launch our exploration of deep neural networks, the comparatively rich class of models with which this book is primarily concerned. Hidden Layers ------------- To begin, recall the model architecture corresponding to our softmax regression example, illustrated in :numref:`fig_singlelayer` below. This model mapped our inputs directly to our outputs via a single linear transformation: .. math:: \hat{\mathbf{o}} = \mathrm{softmax}(\mathbf{W} \mathbf{x} + \mathbf{b}). If our labels truly were related to our input data by a linear function, then this approach would be sufficient. But linearity is a *strong assumption*. For example, linearity implies the *weaker* assumption of *monotonicity*: that any increase in our feature must either always cause an increase in our model's output (if the corresponding weight is positive), or always cause a decrease in our model's output (if the corresponding weight is negative). Sometimes that makes sense. For example, if we were trying to predict whether an individual will repay a loan, we might reasonably imagine that holding all else equal, an applicant with a higher income would always be more likely to repay than one with a lower income. While monotonic, this relationship likely is not linearly associated with the probability of repayment. An increase in income from $0 to $50k likely corresponds to a bigger increase in likelihood of repayment than an increase from $1M to $1.05M. One way to handle this might be to pre-process our data such that linearity becomes more plausible, say, by using the logarithm of income as our feature. Note that we can easily come up with examples that violate *monotonicity*. Say for example that we want to predict probability of death based on body temperature. For individuals with a body temperature above 37°C (98.6°F), higher termperatures indicate greater risk. However, for individuals with body termperatures below 37° C, higher temperatures indicate *lower* risk! In this case too, we might resolve the problem with some clever preprocessing. Namely, we might use the *distance* from 37°C as our feature. But what about classifying images of cats and dogs? Should increasing the intensity of the pixel at location (13, 17) always increase (or always decrease) the likelihood that the image depicts a dog? Reliance on a linear model corrsponds to the (implicit) assumption that the only requirement for differentiating cats vs. dogs is to assess the brightness of individual pixels. This approach is doomed to fail in a world where inverting an image preserves the category. And yet despite the apparent absurdity of linearity here, as compared to our previous examples, it is less obvious that we could address the problem with a simple preprocessing fix. That is because the significance of any pixel depends in complex ways on its context (the values of the surrounding pixels). While there might exist a representation of our data that would take into account the relevant interactions among our features (and on top of which a linear model would be suitable), we simply do not know how to calculate it by hand. With deep neural networks, we used observational data to jointly learn both a representation (via hidden layers) and a linear predictor that acts upon that representation. Incorporating Hidden Layers ~~~~~~~~~~~~~~~~~~~~~~~~~~~ We can overcome these limitations of linear models and handle a more general class of functions by incorporating one or more hidden layers. The easiest way to do this is to stack many fully-connected layers on top of each other. Each layer feeds into the layer above it, until we generate an output. We can think of the first :math:`L-1` layers as our representation and the final layer as our linear predictor. This architecture is commonly called a *multilayer perceptron*, often abbreviated as *MLP*. Below, we depict an MLP diagramtically (:numref:`fig_mlp`). .. _fig_mlp: .. figure:: https://raw.githubusercontent.com/d2l-ai/d2l-en/master/img/mlp.svg Multilayer perceptron with hidden layers. This example contains a hidden layer with 5 hidden units in it. This multilayer perceptron has 4 inputs, 3 outputs, and its hidden layer contains 5 hidden units. Since the input layer does not involve any calculations, producing outputs with this network requires implementing the computations for each of the 2 layers (hidden and output). Note that these layers are both fully connected. Every input influences every neuron in the hidden layer, and each of these in turn influences every neuron in the output layer. From Linear to Nonlinear ~~~~~~~~~~~~~~~~~~~~~~~~ Formally, we calculate each layer in this one-hidden-layer MLP as follows: .. math:: \begin{aligned} \mathbf{h} & = \mathbf{W}_1 \mathbf{x} + \mathbf{b}_1, \\ \mathbf{o} & = \mathbf{W}_2 \mathbf{h} + \mathbf{b}_2, \\ \hat{\mathbf{y}} & = \mathrm{softmax}(\mathbf{o}). \end{aligned} Note that after adding this layer, our model now requires us to track and update two additional sets of parameters. So what have we gained in exchange? You might be surprised to find out that---in the model defined above---\ *we gain nothing for our troubles!* The reason is plain. The hidden units above are given by a linear function of the inputs, and the outputs (pre-softmax) are just a linear function of the hidden units. A linear function of a linear function is itself a linear function. Moreover, our linear model was already capable of representing any linear function. We can view the equivalence formally by proving that for any values of the weights, we can just collapse out the hidden layer, yielding an equivalent single-layer model with paramters :math:`\mathbf{W} = \mathbf{W}_2 \mathbf{W}_1` and :math:`\mathbf{b} = \mathbf{W}_2 \mathbf{b}_1 + \mathbf{b}_2`. .. math:: \mathbf{o} = \mathbf{W}_2 \mathbf{h} + \mathbf{b}_2 = \mathbf{W}_2 (\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2 = (\mathbf{W}_2 \mathbf{W}_1) \mathbf{x} + (\mathbf{W}_2 \mathbf{b}_1 + \mathbf{b}_2) = \mathbf{W} \mathbf{x} + \mathbf{b}. In order to realize the potential of multilayer architectures, we need one more key ingredient---an elementwise *nonlinear activation function* :math:`\sigma` to be applied to each hidden unit (following the linear transformation). The most popular choice for the nonlinearity these days is the rectified linear unit (ReLU) :math:`\mathrm{max}(x, 0)`. In general, with these activation functions in place, it is no longer possible to collapse our MLP into a linear model. .. math:: \begin{aligned} \mathbf{h} & = \sigma(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1), \\ \mathbf{o} & = \mathbf{W}_2 \mathbf{h} + \mathbf{b}_2, \\ \hat{\mathbf{y}} & = \mathrm{softmax}(\mathbf{o}). \end{aligned} To build more general MLPs, we can continue stacking such hidden layers, e.g., :math:`\mathbf{h}_1 = \sigma(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1)` and :math:`\mathbf{h}_2 = \sigma(\mathbf{W}_2 \mathbf{h}_1 + \mathbf{b}_2)`, one atop another, yielding ever more expressive models (assuming fixed width). MLPs can capture complex interactions among our inputs via their hidden neurons, which depend on the values of each of the inputs. We can easily design hidden nodes to perform arbitrary computation, for instance, basic logic operations on a pair of inputs. Moreover, for certain choices of the activation function, it is widely known that MLPs are universal approximators. Even with a single-hidden-layer network, given enough nodes (possibly absurdly many), and the right set of weights, we can model any function. *Actually learning that function is the hard part.* You might think of your neural network as being a bit like the C programming language. The language, like any other modern language, is capable of expressing any computable program. But actually coming up with a program that meets your specifications is the hard part. Moreover, just because a single-layer network *can* learn any function does not mean that you should try to solve all of your problems with single-layer networks. In fact, we can approximate many functions much more compactly by using deeper (vs wider) networks. We will touch upon more rigorous arguments in subsequent chapters, but first let us actually build an MLP in code. In this example, we’ll implement an MLP with two hidden layers and one output layer. Vectorization and Minibatch ~~~~~~~~~~~~~~~~~~~~~~~~~~~ As before, by the matrix :math:`\mathbf{X}`, we denote a minibatch of inputs. The calculations to produce outputs from an MLP with two hidden layers can thus be expressed: .. math:: \begin{aligned} \mathbf{H}_1 & = \sigma(\mathbf{W}_1 \mathbf{X} + \mathbf{b}_1), \\ \mathbf{H}_2 & = \sigma(\mathbf{W}_2 \mathbf{H}_1 + \mathbf{b}_2), \\ \mathbf{O} & = \mathrm{softmax}(\mathbf{W}_3 \mathbf{H}_2 + \mathbf{b}_3). \end{aligned} With some abuse of notation, we define the nonlinearity :math:`\sigma` to apply to its inputs in a row-wise fashion, i.e., one observation at a time. Note that we are also using the notation for *softmax* in the same way to denote a row-wise operation. Often, as in this section, the activation functions that we apply to hidden layers are not merely row-wise, but component wise. That means that after computing the linear portion of the layer, we can calculate each nodes activation without looking at the values taken by the other hidden units. This is true for most activation functions (the batch normalization operation to be introduced in :numref:`sec_batch_norm` is a notable exception to that rule). Now we will load the DJL and other relevant utility libraries to setup our environment which would enable us to run our code in the subsequent sections. .. code:: java %load ../utils/djl-imports %load ../utils/plot-utils Activation Functions -------------------- Activation functions decide whether a neuron should be activated or not by calculating the weighted sum and further adding bias with it. They are differentiable operators to transform input signals to outputs, while most of them add non-linearity. Because activation functions are fundamental to deep learning, let us briefly survey some common activation functions. ReLU Function ~~~~~~~~~~~~~ As stated above, the most popular choice, due to both simplicity of implementation its performance on a variety of predictive tasks is the rectified linear unit (ReLU). ReLU provides a very simple nonlinear transformation. Given the element :math:`z`, the function is defined as the maximum of that element and 0. .. math:: \mathrm{ReLU}(z) = \max(z, 0). Informally, the ReLU function retains only positive elements and discards all negative elements (setting the corresponding activations to 0). To gain some intuition, we can plot the function. Because it is used so commonly, ``DJL`` supports the ``relu`` function as a native operator. As you can see, the activation function is piecewise linear. .. code:: java NDManager manager = NDManager.newBaseManager(); NDArray x = manager.arange(-8.0f, 8.0f, 0.1f); x.setRequiresGradient(true); NDArray y = Activation.relu(x); // Converting the data into float arrays to render them in a plot. float[] X = x.toFloatArray(); float[] Y = y.toFloatArray(); Table data = Table.create("Data").addColumns( FloatColumn.create("X", X), FloatColumn.create("relu(x)", Y) ); render(LinePlot.create("", data, "x", "relu(X)"), "text/html"); .. raw:: html
When the input is negative, the derivative of the ReLU function is 0, and when the input is positive, the derivative of the ReLU function is 1. Note that the ReLU function is not differentiable when the input takes value precisely equal to 0. In these cases, we default to the left-hand-side (LHS) derivative and say that the derivative is 0 when the input is 0. We can get away with this because the input may never actually be zero. There is an old adage that if subtle boundary conditions matter, we are probably doing (*real*) mathematics, not engineering. That conventional wisdom may apply here. We plot the derivative of the ReLU function plotted below. .. code:: java try(GradientCollector collector = Engine.getInstance().newGradientCollector()){ y = Activation.relu(x); collector.backward(y); } NDArray res = x.getGradient(); float[] X = x.toFloatArray(); float[] Y = res.toFloatArray(); Table data = Table.create("Data").addColumns( FloatColumn.create("X", X), FloatColumn.create("grad of relu", Y) ); render(LinePlot.create("", data, "x", "grad of relu"), "text/html"); .. raw:: html Note that there are many variants to the ReLU function, including the parameterized ReLU (pReLU) of `He et al., 2015