Run this notebook online:\ |Binder| or Colab: |Colab|

.. |Binder| image:: https://mybinder.org/badge_logo.svg
   :target: https://mybinder.org/v2/gh/deepjavalibrary/d2l-java/master?filepath=chapter_computational-performance/auto-parallelism.ipynb
.. |Colab| image:: https://colab.research.google.com/assets/colab-badge.svg
   :target: https://colab.research.google.com/github/deepjavalibrary/d2l-java/blob/colab/chapter_computational-performance/auto-parallelism.ipynb

.. _sec_auto_para:

Automatic Parallelism
=====================


MXNet automatically constructs computational graphs at the backend.
Using a computational graph, the system is aware of all the
dependencies, and can selectively execute multiple non-interdependent
tasks in parallel to improve speed. For instance,
:numref:`fig_asyncgraph` in :numref:`sec_async` initializes two
variables independently. Consequently the system can choose to execute
them in parallel.
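
The idea can be illustrated outside of MXNet with plain Java. The
following is only an analogy, not MXNet's actual scheduler: the
hypothetical task ``initAndSum`` is invoked twice on inputs that share
nothing, so a ``CompletableFuture`` pipeline is free to run both calls
concurrently, whereas the final sum depends on both results and has to
wait for them.

.. code:: java

    import java.util.Arrays;
    import java.util.concurrent.CompletableFuture;

    class IndependentTasks {
        // Hypothetical task: initialize an array with a constant and sum it.
        static double initAndSum(int n, double value) {
            double[] data = new double[n];
            Arrays.fill(data, value);
            double sum = 0;
            for (double d : data) {
                sum += d;
            }
            return sum;
        }

        static double runGraph() {
            // The two tasks have no dependency on each other, so they may overlap.
            CompletableFuture<Double> a =
                    CompletableFuture.supplyAsync(() -> initAndSum(1000, 1.0));
            CompletableFuture<Double> b =
                    CompletableFuture.supplyAsync(() -> initAndSum(1000, 2.0));
            // The combination depends on both and runs only once both are done.
            return a.thenCombine(b, Double::sum).join();
        }

        public static void main(String[] args) {
            System.out.println(runGraph()); // prints 3000.0
        }
    }

Here the dependencies are spelled out by hand through ``thenCombine``;
a graph-based backend derives the same information from the operations
themselves.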

Typically, a single operator will use all the computational resources on
all CPUs or on a single GPU. For example, the ``dot`` operator will use
all cores (and threads) on all CPUs, even if there are multiple CPU
processors on a single machine. The same applies to a single GPU. Hence
parallelization is not quite so useful on single-device computers. With
multiple devices it matters more. While parallelization is typically
most relevant between multiple GPUs, adding the local CPU will increase
performance slightly. See, e.g.,
:cite:`Hadjis.Zhang.Mitliagkas.ea.2016` for a paper that focuses on
training computer vision models combining a GPU and a CPU. With the
convenience of an automatically parallelizing framework we can
accomplish the same goal in a few lines of Java code. More broadly,
our discussion of automatic parallel computation focuses on parallel
computation using both CPUs and GPUs, as well as the parallelization of
computation and communication. We begin by importing the required
packages and modules. Note that we need at least one GPU to run the
experiments in this section.

.. code:: java

    %load ../utils/djl-imports
    %load ../utils/StopWatch.java

Parallel Computation on CPUs and GPUs
-------------------------------------

Let us start by defining a reference workload to test: the ``run``
function below performs 10 matrix-matrix multiplications on the device
of our choosing. We allocate the input data in two variables, ``x_cpu``
and ``x_gpu``, one per device.

.. code:: java

    // Multiply X by itself 10 times; each product feeds the next one.
    public NDArray run(NDArray X) {
        for (int i = 0; i < 10; i++) {
            X = X.dot(X);
        }
        return X;
    }
    
    NDManager manager = NDManager.newBaseManager();
    NDArray x_cpu = manager.randomUniform(0f, 1f, new Shape(2000, 2000), DataType.FLOAT32, Device.cpu());
    NDArray x_gpu = manager.randomUniform(0f, 1f, new Shape(6000, 6000), DataType.FLOAT32, Device.gpu());

Now we apply the function to the data. To ensure that caching does not
play a role in the results we warm up the devices by performing a single
pass on each of them prior to measuring.

.. code:: java

    // initial warm-up of devices
    run(x_cpu);
    run(x_gpu);

    // measure CPU computation time
    StopWatch stopWatch0 = new StopWatch();
    stopWatch0.start();

    run(x_cpu);

    stopWatch0.stop();
    ArrayList<Double> times = stopWatch0.getTimes();
    System.out.println("CPU time: " + times.get(times.size() - 1) + " seconds");

    // measure GPU computation time
    StopWatch stopWatch1 = new StopWatch();
    stopWatch1.start();

    run(x_gpu);

    stopWatch1.stop();
    times = stopWatch1.getTimes();
    System.out.println("GPU time: " + times.get(times.size() - 1) + " seconds");


.. parsed-literal::
    :class: output

    CPU time: 0.034554116 seconds
    GPU time: 0.038301803 seconds


.. code:: java

    // measure the combined CPU and GPU computation time
    StopWatch stopWatch = new StopWatch();
    stopWatch.start();

    run(x_cpu);
    run(x_gpu);

    stopWatch.stop();
    times = stopWatch.getTimes();
    System.out.println("CPU & GPU: " + times.get(times.size() - 1) + " seconds");


.. parsed-literal::
    :class: output

    CPU & GPU: 0.065659662 seconds


In the above case the total execution time is less than the sum of its
parts, since MXNet automatically schedules computation on both the CPU
and the GPU without the need for sophisticated code on the part of the
user.
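
To make the comparison concrete, the arithmetic can be checked with a
few lines of Java. This is a minimal sketch that simply plugs in the
three timings printed above; the helper ``savedFraction`` is a
hypothetical name introduced here for illustration. With these numbers
the combined run takes roughly 10% less time than running the two
workloads strictly one after the other.

.. code:: java

    class OverlapSavings {
        // Fraction of the sequential time (a + b) saved by the combined run.
        static double savedFraction(double a, double b, double combined) {
            return 1.0 - combined / (a + b);
        }

        public static void main(String[] args) {
            double cpu = 0.034554116;   // "CPU time" printed above
            double gpu = 0.038301803;   // "GPU time" printed above
            double both = 0.065659662;  // "CPU & GPU" printed above

            System.out.println("sum of parts: " + (cpu + gpu));
            System.out.println("combined:     " + both);
            System.out.println("saved:        " + savedFraction(cpu, gpu, both));
        }
    }

The timings come from a single run, so the exact percentage will vary
between machines and between repetitions.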

Parallel Computation and Communication
--------------------------------------

In many cases we need to move data between different devices, say
between CPU and GPU, or between different GPUs. This occurs e.g., when
we want to perform distributed optimization where we need to aggregate
the gradients over multiple accelerator cards. Let us simulate this by
computing on the GPU and then copying the results back to the CPU.

.. code:: java

    public NDArray copyToCPU(NDArray X) {
        // copy = true requests a copy of X on the target device
        return X.toDevice(Device.cpu(), true);
    }

    // measure GPU computation time
    StopWatch stopWatch = new StopWatch();
    stopWatch.start();

    NDArray Y = run(x_gpu);

    stopWatch.stop();
    times = stopWatch.getTimes();
    System.out.println("Run on GPU: " + times.get(times.size() - 1) + " seconds");

    // measure the time to copy the result to the CPU
    StopWatch stopWatch1 = new StopWatch();
    stopWatch1.start();

    NDArray y_cpu = copyToCPU(Y);

    stopWatch1.stop();
    times = stopWatch1.getTimes();
    System.out.println("Copy to CPU: " + times.get(times.size() - 1) + " seconds");


.. parsed-literal::
    :class: output

    Run on GPU: 0.074793046 seconds
    Copy to CPU: 0.056765233 seconds


This is somewhat inefficient. Note that we could already start copying
parts of ``Y`` to the CPU while the remainder is still being computed.
This situation occurs, e.g., when we compute the (backprop) gradient on
a minibatch. The gradients of some of the parameters will be available
earlier than those of others. Hence it works to our advantage to start
using PCI-Express bus bandwidth while the GPU is still running.

.. code:: java

    // measure the combined GPU computation and copy-to-CPU time
    StopWatch stopWatch = new StopWatch();
    stopWatch.start();

    NDArray Y = run(x_gpu);
    NDArray y_cpu = copyToCPU(Y);

    stopWatch.stop();
    times = stopWatch.getTimes();
    System.out.println("Run on GPU and copy to CPU: " + times.get(times.size() - 1) + " seconds");


.. parsed-literal::
    :class: output

    Run on GPU and copy to CPU: 0.079215031 seconds


The total time required for both operations is (as expected)
significantly less than the sum of their parts. Note that this task is
different from parallel computation as it uses a different resource: the
bus between CPU and GPUs. In fact, we could compute on both devices and
communicate, all at the same time. As noted above, there is a dependency
between computation and communication: ``Y[i]`` must be computed before
it can be copied to the CPU. Fortunately, the system can copy ``Y[i-1]``
while computing ``Y[i]`` to reduce the total running time.
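
The overlap of computation and communication can be mimicked in plain
Java. The sketch below is an analogy under stated assumptions, not how
MXNet or DJL implement it: ``compute`` is a hypothetical stand-in for
GPU work that produces one chunk of the result, and a single background
thread plays the role of the bus that copies data to the CPU. While the
main thread computes chunk ``i``, the copier thread is free to copy
chunk ``i-1`` into the ``host`` array.

.. code:: java

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    class OverlapPipeline {
        // Hypothetical stand-in for device computation: produce chunk i.
        static double[] compute(int i, int n) {
            double[] y = new double[n];
            for (int k = 0; k < n; k++) {
                y[k] = i * n + k;
            }
            return y;
        }

        static double[] computeAndCopy(int chunks, int n) {
            double[] host = new double[chunks * n]; // stand-in for CPU memory
            ExecutorService copier = Executors.newSingleThreadExecutor();
            List<CompletableFuture<Void>> pending = new ArrayList<>();
            try {
                for (int i = 0; i < chunks; i++) {
                    // A chunk must be computed before it can be copied ...
                    final double[] y = compute(i, n);
                    final int offset = i * n;
                    // ... but its copy runs in the background while the
                    // loop moves on to computing the next chunk.
                    pending.add(CompletableFuture.runAsync(
                            () -> System.arraycopy(y, 0, host, offset, n), copier));
                }
                // Wait until every copy has finished before reading host.
                for (CompletableFuture<Void> f : pending) {
                    f.join();
                }
            } finally {
                copier.shutdown();
            }
            return host;
        }

        public static void main(String[] args) {
            double[] host = computeAndCopy(4, 1000);
            System.out.println(host[host.length - 1]); // prints 3999.0
        }
    }

The only ordering enforced is the one the data requires: each copy is
submitted after its chunk exists, and ``host`` is read only after all
copies have completed.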

We conclude with an illustration of the computational graph and its
dependencies for a simple two-layer MLP when training on a CPU and two
GPUs, as depicted in :numref:`fig_twogpu`. It would be quite painful
to schedule the parallel program resulting from this manually. This is
where it is advantageous to have a graph based compute backend for
optimization.

.. _fig_twogpu:

|Two layer MLP on a CPU and 2 GPUs.|
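
To give a feel for what such a backend derives from the graph, the
sketch below groups tasks into "waves" such that every task in a wave
depends only on tasks in earlier waves. Tasks within one wave could
therefore run in parallel. The graph in ``exampleGraph`` is a
hypothetical, heavily simplified data-parallel training step, not a
transcription of the figure, and all task names are made up for
illustration.

.. code:: java

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    class DependencyWaves {
        // deps maps each task to the tasks it depends on.
        static List<List<String>> waves(Map<String, List<String>> deps) {
            // Number of not-yet-finished dependencies per task.
            Map<String, Integer> remaining = new LinkedHashMap<>();
            List<String> ready = new ArrayList<>();
            for (Map.Entry<String, List<String>> e : deps.entrySet()) {
                remaining.put(e.getKey(), e.getValue().size());
                if (e.getValue().isEmpty()) {
                    ready.add(e.getKey());
                }
            }
            List<List<String>> result = new ArrayList<>();
            while (!ready.isEmpty()) {
                result.add(ready);
                List<String> next = new ArrayList<>();
                for (String done : ready) {
                    for (Map.Entry<String, List<String>> e : deps.entrySet()) {
                        if (e.getValue().contains(done)) {
                            int left = remaining.merge(e.getKey(), -1, Integer::sum);
                            if (left == 0) {
                                next.add(e.getKey());
                            }
                        }
                    }
                }
                ready = next;
            }
            return result;
        }

        // Hypothetical graph: one CPU task feeding two GPUs.
        static Map<String, List<String>> exampleGraph() {
            Map<String, List<String>> deps = new LinkedHashMap<>();
            deps.put("load_batch", List.of());
            deps.put("forward_gpu0", List.of("load_batch"));
            deps.put("forward_gpu1", List.of("load_batch"));
            deps.put("backward_gpu0", List.of("forward_gpu0"));
            deps.put("backward_gpu1", List.of("forward_gpu1"));
            deps.put("sum_gradients", List.of("backward_gpu0", "backward_gpu1"));
            deps.put("update_params", List.of("sum_gradients"));
            return deps;
        }

        public static void main(String[] args) {
            for (List<String> wave : waves(exampleGraph())) {
                System.out.println(wave);
            }
        }
    }

For this graph the forward passes on the two GPUs land in the same
wave, as do the two backward passes, while gradient aggregation and the
parameter update each have to wait for everything before them.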

Summary
-------

-  Modern systems have a variety of devices, such as multiple GPUs and
   CPUs. They can be used in parallel, asynchronously.
-  Modern systems also have a variety of resources for communication,
   such as PCI Express, storage (typically SSD or via network), and
   network bandwidth. They can be used in parallel for peak efficiency.
-  The backend can improve performance through automatic parallel
   computation and communication.

Exercises
---------

1. The ``run`` function defined in this section performs 10
   multiplications, each of which depends on the previous result. Write
   a variant in which the 10 multiplications are independent of each
   other and design an experiment to see if MXNet will automatically
   execute them in parallel.
2. When the workload of an individual operator is sufficiently small,
   parallelization can help even on a single CPU or GPU. Design an
   experiment to verify this.
3. Design an experiment that uses parallel computation on CPU, GPU and
   communication between both devices.
4. Use a debugger such as NVIDIA's Nsight to verify that your code is
   efficient.
5. Design computation tasks that include more complex data
   dependencies, and run experiments to see if you can obtain the
   correct results while improving performance.

.. |Two layer MLP on a CPU and 2 GPUs.| image:: https://raw.githubusercontent.com/d2l-ai/d2l-en/8884afa3d1d3f6acd40fcc3aea0a3cf288461989/img/twogpu.svg