Run this notebook online:\ |Binder| or Colab: |Colab|

.. |Binder| image:: https://mybinder.org/badge_logo.svg
   :target: https://mybinder.org/v2/gh/deepjavalibrary/d2l-java/master?filepath=chapter_deep-learning-computation/use-gpu.ipynb

.. |Colab| image:: https://colab.research.google.com/assets/colab-badge.svg
   :target: https://colab.research.google.com/github/deepjavalibrary/d2l-java/blob/colab/chapter_deep-learning-computation/use-gpu.ipynb

.. _sec_use_gpu:

GPUs
====

In the introduction, we discussed the rapid growth of computation over the
past two decades. In a nutshell, GPU performance has increased by a factor of
1000 every decade since 2000. This offers great opportunities, but it also
suggests a significant need to provide such performance.

+----------+----------------------------------------+----------+------------------------------------------+
| Decade   | Dataset                                | Memory   | Floating Point Calculations per Second   |
+==========+========================================+==========+==========================================+
| 1970     | 100 (Iris)                             | 1 KB     | 100 KF (Intel 8080)                      |
+----------+----------------------------------------+----------+------------------------------------------+
| 1980     | 1 K (House prices in Boston)           | 100 KB   | 1 MF (Intel 80186)                       |
+----------+----------------------------------------+----------+------------------------------------------+
| 1990     | 10 K (optical character recognition)   | 10 MB    | 10 MF (Intel 80486)                      |
+----------+----------------------------------------+----------+------------------------------------------+
| 2000     | 10 M (web pages)                       | 100 MB   | 1 GF (Intel Core)                        |
+----------+----------------------------------------+----------+------------------------------------------+
| 2010     | 10 G (advertising)                     | 1 GB     | 1 TF (NVIDIA C2050)                      |
+----------+----------------------------------------+----------+------------------------------------------+
| 2020     | 1 T (social network)                   | 100 GB   | 1 PF (NVIDIA DGX-2)                      |
+----------+----------------------------------------+----------+------------------------------------------+

In this section, we begin to discuss how to harness this computational
performance for your research: first by using single GPUs and, at a later
point, multiple GPUs and multiple servers (with multiple GPUs).

In this section, we will discuss how to use a single NVIDIA GPU for
calculations. First, make sure you have at least one NVIDIA GPU installed.
Then, `download CUDA <https://developer.nvidia.com/cuda-downloads>`__ and
follow the prompts to set the appropriate path. Once these preparations are
complete, the ``nvidia-smi`` command can be used to view the graphics card
information.

You can call external terminal commands from inside the Java kernel by
prefixing your command with ``%system``. We do this below to call
``nvidia-smi`` from inside our notebook.

.. code:: java

    %system nvidia-smi

.. parsed-literal::
    :class: output

    Fri Feb  3 23:08:41 2023
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla V100-SXM2...  Off  | 00000000:00:1B.0 Off |                    0 |
    | N/A   40C    P0    51W / 300W |      0MiB / 16160MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla V100-SXM2...  Off  | 00000000:00:1C.0 Off |                    0 |
    | N/A   33C    P0    50W / 300W |      0MiB / 16160MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   2  Tesla V100-SXM2...  Off  | 00000000:00:1D.0 Off |                    0 |
    | N/A   33C    P0    53W / 300W |      0MiB / 16160MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   3  Tesla V100-SXM2...  Off  | 00000000:00:1E.0 Off |                    0 |
    | N/A   33C    P0    52W / 300W |      0MiB / 16160MiB |      4%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+

You might have noticed that a DJL ``NDArray`` looks almost identical to a
NumPy ``ndarray``. But there are a few crucial differences. One of the key
features that distinguishes DJL from NumPy is its support for diverse
hardware devices.

In DJL, every array has an associated ``Device``. So far, by default, all
variables and associated computation have been assigned to the CPU.
Typically, other contexts might be various GPUs. Things can get even hairier
when we deploy jobs across multiple servers. By assigning arrays to
``Device``\ s intelligently, we can minimize the time spent transferring data
between devices. For example, when training neural networks on a server with
a GPU, we typically prefer for the model's parameters to live on the GPU.

To run the programs in this section, you need at least two GPUs. Note that
this might be extravagant for most desktop computers but it is easily
available in the cloud, e.g., by using the AWS EC2 multi-GPU instances.
Almost all other sections do *not* require multiple GPUs. Instead, this is
simply to illustrate how data flows between different devices.

Computing Devices
-----------------

We can specify devices, such as CPUs and GPUs, for storage and calculation.
By default, tensors are created in main memory and the CPU is used for
calculations.

In DJL, the CPU and GPU can be indicated by ``cpu()`` and ``gpu()``. It
should be noted that ``cpu()`` means all physical CPUs and memory. This means
that DJL's calculations will try to use all CPU cores. However, ``gpu()``
only represents one card and the corresponding memory. If there are multiple
GPUs, we use ``gpu(i)`` to represent the :math:`i^\mathrm{th}` GPU
(:math:`i` starts from 0). Also, ``gpu(0)`` and ``gpu()`` are equivalent.

.. code:: java

    %load ../utils/djl-imports

.. code:: java

    System.out.println(Device.cpu());
    System.out.println(Device.gpu());
    System.out.println(Device.gpu(1));

.. parsed-literal::
    :class: output

    cpu()
    gpu(0)
    gpu(1)

We can query the number of available GPUs.

.. code:: java

    System.out.println("GPU count: " + Engine.getInstance().getGpuCount());
    Device d = Device.gpu(1);

.. parsed-literal::
    :class: output

    GPU count: 4

Now we define two convenient functions that allow us to run code even if the
requested GPUs do not exist.

.. code:: java

    /* Return the i'th GPU if it exists, otherwise return the CPU */
    public Device tryGpu(int i) {
        return Engine.getInstance().getGpuCount() > i ? Device.gpu(i) : Device.cpu();
    }

    /* Return all available GPUs or the [CPU] if no GPU exists */
    public Device[] tryAllGpus() {
        int gpuCount = Engine.getInstance().getGpuCount();
        if (gpuCount > 0) {
            Device[] devices = new Device[gpuCount];
            for (int i = 0; i < gpuCount; i++) {
                devices[i] = Device.gpu(i);
            }
            return devices;
        }
        return new Device[]{Device.cpu()};
    }

    System.out.println(tryGpu(0));
    System.out.println(tryGpu(3));
    Arrays.toString(tryAllGpus())

.. parsed-literal::
    :class: output

    gpu(0)
    gpu(3)

.. parsed-literal::
    :class: output

    [gpu(0), gpu(1), gpu(2), gpu(3)]

Tensors and GPUs
----------------

Tensors are created on the device of the ``NDManager`` that creates them.
``NDManager.newBaseManager()`` uses the engine's default device, which is the
first GPU when one is available (as in the output below) and the CPU
otherwise. We can query the device where a tensor is located.

.. code:: java

    NDManager manager = NDManager.newBaseManager();
    NDArray x = manager.create(new int[]{1, 2, 3});
    x.getDevice();

.. parsed-literal::
    :class: output

    gpu(0)

It is important to note that whenever we want to operate on multiple terms,
they need to be in the same context. For instance, if we sum two tensors, we
need to make sure that both arguments live on the same device---otherwise the
framework would not know where to store the result or even how to decide
where to perform the computation.

Storage on the GPU
~~~~~~~~~~~~~~~~~~

There are several ways to store a tensor on the GPU. For example, we can
specify a storage device when creating a tensor. Next, we create the tensor
variable ``x`` on the first ``gpu``. Notice that when printing ``x``, the
device information changed. The tensor created on a GPU only consumes the
memory of this GPU. We can use the ``nvidia-smi`` command to view GPU memory
usage. In general, we need to make sure we do not create data that exceeds
the GPU memory limit.

.. code:: java

    NDArray x = manager.ones(new Shape(2, 3), DataType.FLOAT32, tryGpu(0));
    x

.. parsed-literal::
    :class: output

    ND: (2, 3) gpu(0) float32
    [[1., 1., 1.],
     [1., 1., 1.],
    ]

Assuming you have at least two GPUs, the following code will create a random
array on the second GPU.

.. code:: java

    NDArray y = manager.randomUniform(-1, 1, new Shape(2, 3), DataType.FLOAT32, tryGpu(1));
    y

.. parsed-literal::
    :class: output

    ND: (2, 3) gpu(1) float32
    [[ 0.3496, -0.8492,  0.9914],
     [-0.8102, -0.1691, -0.7754],
    ]

Copying
~~~~~~~

If we want to compute :math:`\mathbf{x} + \mathbf{y}`, we need to decide
where to perform this operation. For instance, as shown in
:numref:`fig_copyto`, we can transfer :math:`\mathbf{x}` to the second GPU
and perform the operation there. *Do not* simply call ``x.add(y)``, since
this will result in an exception: the runtime engine cannot find the data on
the same device, so it fails.

|Copyto copies arrays to the target device|

.. _fig_copyto:

``copyto`` copies the data to another device such that we can add them.
Since :math:`\mathbf{y}` lives on the second GPU, we need to move
:math:`\mathbf{x}` there before we can add the two.

.. |Copyto copies arrays to the target device| image:: https://raw.githubusercontent.com/d2l-ai/d2l-en/master/img/copyto.svg

.. code:: java

    NDArray z = x.toDevice(tryGpu(1), true);
    System.out.println(x);
    System.out.println(z);

.. parsed-literal::
    :class: output

    ND: (2, 3) gpu(0) float32
    [[1., 1., 1.],
     [1., 1., 1.],
    ]

    ND: (2, 3) gpu(1) float32
    [[1., 1., 1.],
     [1., 1., 1.],
    ]

Now that the data is on the same GPU (both :math:`\mathbf{z}` and
:math:`\mathbf{y}` are), we can add them up.

.. code:: java

    y.add(z)

.. parsed-literal::
    :class: output

    ND: (2, 3) gpu(1) float32
    [[1.3496, 0.1508, 1.9914],
     [0.1898, 0.8309, 0.2246],
    ]

Imagine that your variable ``z`` already lives on your second GPU. What
happens if we still call ``z.toDevice(tryGpu(1), true)``? It will make a copy
and allocate new memory, even though that variable already lives on the
desired device! Just something to remember when you're manually moving data
across GPUs.
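If you do want to guard against such an unnecessary copy, a minimal sketch is
shown below (it is not part of the original notebook and reuses ``z`` and
``tryGpu`` from above): with the second argument of ``toDevice`` set to
``false``, the array should be returned unchanged when it already lives on
the target device, whereas ``true`` always allocates a fresh copy.

.. code:: java

    // Sketch: avoiding an extra allocation when the data may already be on the target device.
    Device target = tryGpu(1);

    // copy=false: typically returns the original NDArray untouched if it already lives on `target`.
    NDArray zSame = z.toDevice(target, false);

    // copy=true: allocates a new array on `target` even if z is already there.
    NDArray zCopy = z.toDevice(target, true);

    System.out.println(zSame == z);   // usually true: no new memory was allocated
    System.out.println(zCopy == z);   // false: a new array was created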
Side Notes
~~~~~~~~~~

People use GPUs to do machine learning because they expect them to be fast.
But transferring variables between devices is slow. So we want you to be 100%
certain that you want to do something slow before we let you do it. If the
framework just did the copy automatically without crashing, you might not
realize that you had written some slow code.

Also, transferring data between devices (CPU, GPUs, other machines) is
something that is *much slower* than computation. It also makes
parallelization a lot more difficult, since we have to wait for data to be
sent (or rather to be received) before we can proceed with more operations.
This is why copy operations should be taken with great care. As a rule of
thumb, many small operations are much worse than one big operation. Moreover,
several operations at a time are much better than many single operations
interspersed in the code (unless you know what you are doing). This is the
case since such operations can block if one device has to wait for the other
before it can do something else. It is a bit like ordering your coffee in a
queue rather than pre-ordering it by phone and finding out that it is ready
when you are.
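To make the rule of thumb concrete, here is a rough sketch (not from the
original notebook) that moves the same amount of data first as 1000 tiny
transfers and then as a single large transfer. It reuses ``manager`` and
``tryGpu`` from above; the sizes are arbitrary, and the timings are only
indicative because the engine may execute operations asynchronously.

.. code:: java

    // Rough illustration only: same total data, many small transfers vs. one big one.
    Device source = Device.cpu();
    Device target = tryGpu(0);

    // 1000 separate transfers of 10 floats each
    long start = System.nanoTime();
    for (int i = 0; i < 1000; i++) {
        manager.ones(new Shape(10), DataType.FLOAT32, source).toDevice(target, true);
    }
    long manySmall = System.nanoTime() - start;

    // a single transfer of the same 10,000 floats
    start = System.nanoTime();
    manager.ones(new Shape(1000, 10), DataType.FLOAT32, source).toDevice(target, true);
    long oneBig = System.nanoTime() - start;

    System.out.println("1000 small transfers: " + manySmall / 1e6 + " ms");
    System.out.println("1 large transfer:     " + oneBig / 1e6 + " ms");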
Last, when we print tensors or convert them to Java arrays (for example with
``toFloatArray()``), if the data is not in main memory, the framework will
copy it to main memory first, resulting in additional transmission overhead.
Even worse, the framework has to wait until the values have actually been
computed before it can transfer and print them, which stalls otherwise
asynchronous execution.

Neural Networks and GPUs
------------------------

Now you may be thinking: since we have to declare which device we want to
create ``NDArray``\ s on, we probably have to declare which device to create
our neural network on as well, correct? If so, good thinking! DJL, however,
actually handles that all for you with ``ParameterStore``. So you can train
on multiple GPUs and not have to worry about moving data around. Just declare
and initialize your ``Block``\ s as shown in previous sections and you're
good to go!

You, however, always have the option of moving data around manually if you
like. In short, as long as all data and parameters are on the same device, we
can learn models efficiently. In the following we will see several such
examples.
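As a quick, hypothetical sketch of where devices enter the picture when
setting up training (none of this is from the original notebook: the block,
loss, and input shape are placeholders, and it assumes the imports loaded by
``%load ../utils/djl-imports``), you can pass the devices returned by
``tryAllGpus()`` to the training configuration; DJL's ``ParameterStore`` then
keeps the parameters on those devices for you.

.. code:: java

    // Hypothetical toy model; only the optDevices(...) call is the point here.
    Model model = Model.newInstance("toy-mlp");
    model.setBlock(new SequentialBlock()
            .add(Linear.builder().setUnits(256).build())
            .add(Activation::relu)
            .add(Linear.builder().setUnits(10).build()));

    DefaultTrainingConfig config = new DefaultTrainingConfig(Loss.softmaxCrossEntropyLoss())
            .optDevices(tryAllGpus());  // train on all available GPUs (or the CPU)

    Trainer trainer = model.newTrainer(config);
    trainer.initialize(new Shape(1, 28 * 28));  // parameters now live on the configured devices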
Summary
-------

- We can specify devices for storage and calculation, such as the CPU or a
  GPU. By default, data is created in main memory and the CPU is used for
  calculations.
- The framework requires all input data for a calculation to be *on the same
  device*, be it the CPU or the same GPU.
- You can lose significant performance by moving data without care.

Exercises
---------

1. Try a larger computation task, such as the multiplication of large
   matrices, and see the difference in speed between the CPU and GPU. What
   about a task with a small amount of calculations?
2. How should we read and write model parameters on the GPU?
3. Measure the time it takes to compute 1000 matrix-matrix multiplications of
   :math:`100 \times 100` matrices and log the matrix norm
   :math:`\mathrm{tr}(M M^\top)` one result at a time vs. keeping a log on
   the GPU and transferring only the final result.
4. Measure how much time it takes to perform two matrix-matrix
   multiplications on two GPUs at the same time vs. in sequence on one GPU
   (hint: you should see almost linear scaling).