
2.2. Data Preprocessing

So far we have introduced a variety of techniques for manipulating data that are already stored in NDArrays. To apply deep learning to real-world problems, however, we usually begin by preprocessing raw data rather than data that are already prepared in the NDArray format. Among popular data analysis tools in Java, the tablesaw package is commonly used. If you have used the pandas package for Python, you will find it familiar. We therefore briefly walk through the steps for preprocessing raw data with tablesaw and converting them into the NDArray format. We will cover more data preprocessing techniques in later chapters.

2.2.1. Adding tablesaw dependencies to Jupyter notebook

You can add tablesaw dependencies by adding a Java cell including:

%%loadFromPOM
<dependency>
    <groupId>tech.tablesaw</groupId>
    <artifactId>tablesaw-jsplot</artifactId>
    <version>0.38.1</version>
</dependency>

To make it easy to include tablesaw in a Jupyter notebook, we created a utility notebook that can be loaded with:

%load ../utils/plot-utils.ipynb

2.2.2. Reading the Dataset

As an example, we begin by creating an artificial dataset that is stored in a csv (comma-separated values) file ../data/house_tiny.csv. Data stored in other formats may be processed in similar ways.

Below we write the dataset row by row into a csv file.

%load ../utils/djl-imports
File file = new File("../data/");
file.mkdir();

String dataFile = "../data/house_tiny.csv";

// Create file
File f = new File(dataFile);
f.createNewFile();

// Write to file
try (FileWriter writer = new FileWriter(dataFile)) {
    writer.write("NumRooms,Alley,Price\n"); // Column names
    writer.write("NA,Pave,127500\n");  // Each row represents a data example
    writer.write("2,NA,106000\n");
    writer.write("4,NA,178100\n");
    writer.write("NA,NA,140000\n");
}

To load the raw dataset from the csv file we just created, we import the tablesaw package and invoke its read function. This dataset has four rows and three columns, where each row describes the number of rooms (“NumRooms”), the alley type (“Alley”), and the price (“Price”) of a house.

Table data = Table.read().file("../data/house_tiny.csv");
data
 NumRooms  |  Alley  |  Price   |
---------------------------------
           |   Pave  |  127500  |
        2  |         |  106000  |
        4  |         |  178100  |
           |         |  140000  |

2.2.3. Handling Missing Data

Note that the blank entries in the table are missing values; they correspond to the NA entries that we wrote to the csv file. To handle missing data, typical methods include imputation and deletion: imputation replaces missing values with substituted ones, while deletion discards the rows or columns that contain them. Here we will consider imputation.
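To see why we prefer imputation here, consider what deletion would do to this dataset. The following is a minimal plain-Java sketch that does not use tablesaw. The class `DeletionSketch` and its methods are hypothetical names introduced for illustration, and treating "NA" and the empty string as missing is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

class DeletionSketch {
    // Assumption of this sketch: null, "" and "NA" all denote a missing cell.
    static boolean isMissing(String cell) {
        return cell == null || cell.isEmpty() || cell.equals("NA");
    }

    // Keep only the rows in which no cell is missing.
    static List<String[]> dropIncompleteRows(List<String[]> rows) {
        List<String[]> kept = new ArrayList<>();
        for (String[] row : rows) {
            boolean complete = true;
            for (String cell : row) {
                if (isMissing(cell)) {
                    complete = false;
                    break;
                }
            }
            if (complete) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The four data rows of house_tiny.csv (NumRooms, Alley, Price).
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"NA", "Pave", "127500"});
        rows.add(new String[] {"2", "NA", "106000"});
        rows.add(new String[] {"4", "NA", "178100"});
        rows.add(new String[] {"NA", "NA", "140000"});
        System.out.println(dropIncompleteRows(rows).size()); // prints 0
    }
}
```

Every row of our tiny dataset has at least one missing entry, so row deletion would leave no examples at all. Imputation keeps all four rows.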

We split the data into inputs and outputs by creating new tables and specifying the columns desired, where the former takes the first two columns while the latter only keeps the last column. For numerical values in inputs that are missing, we replace the missing data entries with the mean value of the same column.

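// Build a new table from all columns of data, then drop the target column
Table inputs = data.create(data.columns());
inputs.removeColumns("Price");
// Keep only the target column as the outputs
Table outputs = data.select("Price");

// Replace missing NumRooms entries with the column mean (truncated to an int)
Column col = inputs.column("NumRooms");
col.set(col.isMissing(), (int) inputs.nCol("NumRooms").mean());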
inputs
 NumRooms  |  Alley  |
----------------------
        3  |   Pave  |
        2  |         |
        4  |         |
        3  |         |
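The imputation step itself is independent of tablesaw. As a minimal sketch in plain Java, assuming missing entries are represented as `Double.NaN`, the mean is computed over the observed entries only and then written into the gaps. The class `MeanImputationSketch` and the method `imputeMean` are hypothetical names used only for this illustration.

```java
import java.util.Arrays;

class MeanImputationSketch {
    // Return a copy of values in which every NaN is replaced by the mean
    // of the non-missing entries. The input array is left unchanged.
    static double[] imputeMean(double[] values) {
        double sum = 0.0;
        int count = 0;
        for (double v : values) {
            if (!Double.isNaN(v)) {
                sum += v;
                count++;
            }
        }
        double[] result = values.clone();
        if (count == 0) {
            return result; // no observed values, so there is no mean to use
        }
        double mean = sum / count;
        for (int i = 0; i < result.length; i++) {
            if (Double.isNaN(result[i])) {
                result[i] = mean;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The NumRooms column: two observed values and two missing ones.
        double[] numRooms = {Double.NaN, 2, 4, Double.NaN};
        System.out.println(Arrays.toString(imputeMean(numRooms)));
        // prints [3.0, 2.0, 4.0, 3.0]
    }
}
```

Unlike the cell above, which casts the mean to an int, this sketch keeps the mean as a double. For this column the mean of 2 and 4 is exactly 3, so both approaches give the same values.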

For categorical or discrete values in inputs, we treat a missing value (null) as a category of its own. Since the “Alley” column takes only two kinds of values, “Pave” and an empty string that represents a missing value, tablesaw can automatically convert this column into two boolean columns, which we name “Alley_Pave” and “Alley_nan”. A row whose alley type is “Pave” sets “Alley_Pave” to true and “Alley_nan” to false; a row with a missing alley type sets them to false and true, respectively. We then remove the original “Alley” column from the table and add the two new columns in its place, converting them to double so that true and false become 1 and 0, respectively.

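StringColumn col = (StringColumn) inputs.column("Alley");
// One boolean indicator column per distinct value; here index 0 corresponds
// to "Pave" and index 1 to the missing value, as the output below shows
List<BooleanColumn> dummies = col.getDummies();
inputs.removeColumns(col);
inputs.addColumns(
    DoubleColumn.create("Alley_Pave", dummies.get(0).asDoubleArray()),
    DoubleColumn.create("Alley_nan", dummies.get(1).asDoubleArray()));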
inputs
 NumRooms  |  Alley_Pave  |  Alley_nan  |
-----------------------------------------
        3  |           1  |          0  |
        2  |           0  |          1  |
        4  |           0  |          1  |
        3  |           0  |          1  |
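To make the encoding explicit, here is a minimal plain-Java sketch of the same idea that does not rely on tablesaw: one indicator column for the category and one for missing values. The class `DummySketch` and the method `encode` are hypothetical names, and representing a missing alley type as null or an empty string is an assumption of the sketch.

```java
import java.util.Arrays;

class DummySketch {
    // For each entry, produce two indicators:
    //   column 0 is 1.0 when the entry equals the given category,
    //   column 1 is 1.0 when the entry is missing (null or empty).
    static double[][] encode(String[] values, String category) {
        double[][] out = new double[values.length][2];
        for (int i = 0; i < values.length; i++) {
            boolean missing = values[i] == null || values[i].isEmpty();
            out[i][0] = (!missing && values[i].equals(category)) ? 1.0 : 0.0;
            out[i][1] = missing ? 1.0 : 0.0;
        }
        return out;
    }

    public static void main(String[] args) {
        // The Alley column after reading the csv file: one "Pave", three missing.
        String[] alley = {"Pave", "", "", ""};
        System.out.println(Arrays.deepToString(encode(alley, "Pave")));
        // prints [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
    }
}
```

The two columns of the result match “Alley_Pave” and “Alley_nan” in the table above. With more than one category, the same pattern would need one indicator column per distinct value.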

2.2.4. Conversion to the NDArray Format

Now that all the entries in inputs and outputs are numerical, they can be converted to the NDArray format. Once data are in this format, they can be further manipulated with the NDArray functions that we introduced in Section 2.1.

NDManager nd = NDManager.newBaseManager();
NDArray x = nd.create(inputs.as().doubleMatrix());
NDArray y = nd.create(outputs.as().intMatrix());
x
ND: (4, 3) gpu(0) float64
[[3., 1., 0.],
 [2., 0., 1.],
 [4., 0., 1.],
 [3., 0., 1.],
]
y
ND: (4, 1) gpu(0) int32
[[127500],
 [106000],
 [178100],
 [140000],
]
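The shape (4, 3) of x reflects a row-major layout: one row per example and one column per feature. As a minimal sketch in plain Java, assuming each feature is held as a separate double array, the following assembles such a matrix from its columns. The class `MatrixSketch` and the method `toRowMajor` are hypothetical names for illustration and are not part of tablesaw or DJL.

```java
import java.util.Arrays;

class MatrixSketch {
    // Arrange equally long column arrays into a matrix with one row per
    // example and one column per feature.
    static double[][] toRowMajor(double[]... columns) {
        int rows = columns.length == 0 ? 0 : columns[0].length;
        double[][] matrix = new double[rows][columns.length];
        for (int c = 0; c < columns.length; c++) {
            if (columns[c].length != rows) {
                throw new IllegalArgumentException("columns differ in length");
            }
            for (int r = 0; r < rows; r++) {
                matrix[r][c] = columns[c][r];
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        // The three preprocessed input columns from the table above.
        double[] numRooms = {3, 2, 4, 3};
        double[] alleyPave = {1, 0, 0, 0};
        double[] alleyNan = {0, 1, 1, 1};
        double[][] x = toRowMajor(numRooms, alleyPave, alleyNan);
        System.out.println(x.length + " x " + x[0].length); // prints 4 x 3
        System.out.println(Arrays.deepToString(x));
        // prints [[3.0, 1.0, 0.0], [2.0, 0.0, 1.0], [4.0, 0.0, 1.0], [3.0, 0.0, 1.0]]
    }
}
```

The printed rows agree with the contents of x shown above.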

2.2.5. Summary

  • Like many other extension packages in the vast ecosystem of Java, tablesaw can work together with NDArray.

  • Imputation and deletion can be used to handle missing data.

2.2.6. Exercises

Create a raw dataset with more rows and columns.

  1. Delete the column with the most missing values.

  2. Convert the preprocessed dataset to the NDArray format.