Run this notebook online:\ |Binder| or Colab: |Colab|

.. |Binder| image:: https://mybinder.org/badge_logo.svg
   :target: https://mybinder.org/v2/gh/deepjavalibrary/d2l-java/master?filepath=chapter_preliminaries/tablesaw.ipynb
.. |Colab| image:: https://colab.research.google.com/assets/colab-badge.svg
   :target: https://colab.research.google.com/github/deepjavalibrary/d2l-java/blob/colab/chapter_preliminaries/tablesaw.ipynb

.. _sec_tablesaw:

Data Preprocessing
==================

So far we have introduced a variety of techniques for manipulating data
that are already stored in ``NDArray``\ s. To apply deep learning to
solving real-world problems, we often begin by preprocessing raw data
rather than working with nicely prepared data in the ``NDArray``
format. Among the popular data analysis tools in Java, the ``tablesaw``
package is commonly used. If you have used the pandas package for
Python, you will find this familiar. In this section, we briefly walk
through the steps for preprocessing raw data with ``tablesaw`` and
converting it into the ``NDArray`` format. We will cover more data
preprocessing techniques in later chapters.

Adding tablesaw dependencies to the Jupyter notebook
----------------------------------------------------

You can add the tablesaw dependencies by running a Java cell containing:

::

   %%loadFromPOM
   <dependency>
       <groupId>tech.tablesaw</groupId>
       <artifactId>tablesaw-jsplot</artifactId>
       <version>0.38.1</version>
   </dependency>

To make it easy to include tablesaw in a Jupyter notebook, we provide a
utility notebook that can be loaded with:

.. code:: java

    %load ../utils/plot-utils.ipynb

Reading the Dataset
-------------------

As an example, we begin by creating an artificial dataset that is
stored in a CSV (comma-separated values) file
``../data/house_tiny.csv``. Data stored in other formats may be
processed in similar ways. Below we write the dataset row by row into a
CSV file.

.. code:: java

    %load ../utils/djl-imports

.. code:: java

    File file = new File("../data/");
    file.mkdir();

    String dataFile = "../data/house_tiny.csv";

    // Create the file
    File f = new File(dataFile);
    f.createNewFile();

    // Write the dataset to the file
    try (FileWriter writer = new FileWriter(dataFile)) {
        writer.write("NumRooms,Alley,Price\n"); // Column names
        writer.write("NA,Pave,127500\n"); // Each row represents a data example
        writer.write("2,NA,106000\n");
        writer.write("4,NA,178100\n");
        writer.write("NA,NA,140000\n");
    }

To load the raw dataset, we use the ``tablesaw`` package and invoke its
``read`` function to read directly from the CSV file we created. This
dataset has four rows and three columns, where each row describes the
number of rooms ("NumRooms"), the alley type ("Alley"), and the price
("Price") of a house.

.. code:: java

    Table data = Table.read().file("../data/house_tiny.csv");
    data

.. parsed-literal::
    :class: output

     NumRooms  |  Alley  |  Price   |
    ----------------------------------
               |   Pave  |  127500  |
            2  |         |  106000  |
            4  |         |  178100  |
               |         |  140000  |

Handling Missing Data
---------------------

Note that the blank entries in the table are missing values. To handle
missing data, typical methods include *imputation* and *deletion*:
imputation replaces missing values with substituted ones, while
deletion discards the rows or columns that contain missing values. Here
we will use imputation.
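Although the rest of this section uses imputation, deletion takes only
a single call in ``tablesaw``. The snippet below is a minimal sketch
rather than part of the original workflow; it assumes that
``Table.dropRowsWithMissingValues()`` is available in the ``tablesaw``
version used here and that it returns a new table instead of modifying
``data`` in place.

.. code:: java

    // Deletion (sketch): keep only the rows without any missing values.
    // Assumes dropRowsWithMissingValues() is available in this tablesaw
    // version; it returns a new table and leaves `data` untouched.
    Table complete = data.dropRowsWithMissingValues();
    complete

On this tiny dataset every row has at least one missing entry, so
deletion would leave no rows at all, which is one reason to prefer
imputation here.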
We split ``data`` into ``inputs`` and ``outputs`` by creating new
tables with the desired columns, where the former takes the first two
columns while the latter only keeps the last column. For numerical
values in ``inputs`` that are missing, we replace the missing entries
with the mean value of the same column.

.. code:: java

    Table inputs = data.create(data.columns());
    inputs.removeColumns("Price");
    Table outputs = data.select("Price");

    Column col = inputs.column("NumRooms");
    col.set(col.isMissing(), (int) inputs.nCol("NumRooms").mean());
    inputs

.. parsed-literal::
    :class: output

     NumRooms  |  Alley  |
    ----------------------
            3  |   Pave  |
            2  |         |
            4  |         |
            3  |         |

For categorical or discrete values in ``inputs``, we treat missing data
(null) as a category of its own. Since the "Alley" column only takes
two categorical values, "Pave" and an empty string representing missing
data, ``tablesaw`` can automatically convert this column into two dummy
columns, which we name "Alley_Pave" and "Alley_nan". A row whose alley
type is "Pave" sets "Alley_Pave" to true and "Alley_nan" to false; a
row with a missing alley type sets them to false and true. We then add
these columns to the table as double columns, so that true and false
become 1 and 0, respectively, and finally remove the original "Alley"
column.

.. code:: java

    StringColumn col = (StringColumn) inputs.column("Alley");
    List<BooleanColumn> dummies = col.getDummies();
    inputs.removeColumns(col);
    inputs.addColumns(
        DoubleColumn.create("Alley_Pave", dummies.get(0).asDoubleArray()),
        DoubleColumn.create("Alley_nan", dummies.get(1).asDoubleArray())
    );
    inputs

.. parsed-literal::
    :class: output

     NumRooms  |  Alley_Pave  |  Alley_nan  |
    -----------------------------------------
            3  |           1  |          0  |
            2  |           0  |          1  |
            4  |           0  |          1  |
            3  |           0  |          1  |

Conversion to the NDArray Format
--------------------------------

Now that all the entries in ``inputs`` and ``outputs`` are numerical,
they can be converted to the ``NDArray`` format. Once data are in this
format, they can be further manipulated with the ``NDArray``
functionalities that we introduced in :numref:`sec_NDArray`.

.. code:: java

    NDManager nd = NDManager.newBaseManager();
    NDArray x = nd.create(inputs.as().doubleMatrix());
    NDArray y = nd.create(outputs.as().intMatrix());
    x

.. parsed-literal::
    :class: output

    ND: (4, 3) gpu(0) float64
    [[3., 1., 0.],
     [2., 0., 1.],
     [4., 0., 1.],
     [3., 0., 1.],
    ]

.. code:: java

    y

.. parsed-literal::
    :class: output

    ND: (4, 1) gpu(0) int32
    [[127500],
     [106000],
     [178100],
     [140000],
    ]

Summary
-------

-  Like many other extension packages in the vast ecosystem of Java,
   ``tablesaw`` can work together with ``NDArray``.
-  Imputation and deletion can be used to handle missing data.

Exercises
---------

Create a raw dataset with more rows and columns.

1. Delete the column with the most missing values.
2. Convert the preprocessed dataset to the ``NDArray`` format.
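As one possible starting point for the first exercise (a sketch, not
part of the original text), the code below counts the missing values in
each column and removes the column with the most, applied here to
``data`` for illustration. It assumes that ``Column.countMissing()``
and ``Table.copy()`` are available in the ``tablesaw`` version used
here.

.. code:: java

    // Sketch for Exercise 1: find the column with the most missing values
    // and remove it from a copy of the table. Assumes countMissing() and
    // copy() exist in this tablesaw version.
    String mostMissingName = null;
    int mostMissing = -1;
    for (Column<?> c : data.columns()) {
        if (c.countMissing() > mostMissing) {
            mostMissing = c.countMissing();
            mostMissingName = c.name();
        }
    }
    Table cleaned = data.copy();
    cleaned.removeColumns(mostMissingName);
    cleaned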