2.2. Data Preprocessing
So far we have introduced a variety of techniques for manipulating data
that are already stored in NDArrays. To apply deep learning to
real-world problems, we often begin by preprocessing raw data, rather
than with data already nicely prepared in the NDArray format. Among
popular data analysis tools in Java, the tablesaw package is commonly
used. If you have used the pandas package for Python, you will find it
familiar. So, we will briefly walk through the steps for preprocessing
raw data with tablesaw and converting them into the NDArray format. We
will cover more data preprocessing techniques in later chapters.
2.2.1. Adding tablesaw dependencies to Jupyter notebook
You can add tablesaw dependencies by adding a Java cell including:
%%loadFromPOM
<dependency>
    <groupId>tech.tablesaw</groupId>
    <artifactId>tablesaw-jsplot</artifactId>
    <version>0.38.1</version>
</dependency>
To make it easy to include tablesaw in a Jupyter notebook, we created a utility notebook that can be loaded with:
%load ../utils/plot-utils.ipynb
2.2.2. Reading the Dataset
As an example, we begin by creating an artificial dataset that is stored
in a csv (comma-separated values) file, ../data/house_tiny.csv. Data
stored in other formats may be processed in similar ways.
Below we write the dataset row by row into a csv file.
%load ../utils/djl-imports
File file = new File("../data/");
file.mkdir();
String dataFile = "../data/house_tiny.csv";
// Create file
File f = new File(dataFile);
f.createNewFile();
// Write to file
try (FileWriter writer = new FileWriter(dataFile)) {
    writer.write("NumRooms,Alley,Price\n"); // Column names
    writer.write("NA,Pave,127500\n"); // Each row represents a data example
    writer.write("2,NA,106000\n");
    writer.write("4,NA,178100\n");
    writer.write("NA,NA,140000\n");
}
To load the raw dataset from the created csv file, we import the
tablesaw package and invoke the read function to read directly from the
csv we created. This dataset has four rows and three columns, where each
row describes the number of rooms (“NumRooms”), the alley type
(“Alley”), and the price (“Price”) of a house.
Table data = Table.read().file("../data/house_tiny.csv");
data
 NumRooms  |  Alley  |  Price   |
---------------------------------
           |   Pave  |  127500  |
        2  |         |  106000  |
        4  |         |  178100  |
           |         |  140000  |
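The blank cells in the output above are the NA tokens that were written to the file. As a rough, library-free illustration of that mapping (this is not how tablesaw is implemented; the class `NaAwareCsvSketch` and the method `parseRow` are hypothetical names introduced only for this sketch), one could split each line on commas and treat NA or an empty field as missing:

```java
import java.util.Arrays;
import java.util.List;

class NaAwareCsvSketch {
    /** Splits one csv line on commas and maps "NA" or an empty field to null (missing). */
    static String[] parseRow(String line) {
        String[] fields = line.split(",", -1); // limit -1 keeps trailing empty fields
        for (int i = 0; i < fields.length; i++) {
            String value = fields[i].trim();
            fields[i] = (value.isEmpty() || value.equals("NA")) ? null : value;
        }
        return fields;
    }

    public static void main(String[] args) {
        // The same four data rows that were written to house_tiny.csv
        List<String> lines = Arrays.asList(
                "NA,Pave,127500", "2,NA,106000", "4,NA,178100", "NA,NA,140000");
        for (String line : lines) {
            System.out.println(Arrays.toString(parseRow(line)));
        }
    }
}
```

This sketch assumes simple fields without quotes or embedded commas; a real csv reader handles those cases as well.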
2.2.3. Handling Missing Data
Note that the blank entries in the table are missing values. To handle missing data, typical methods include imputation and deletion: imputation replaces missing values with substituted ones, while deletion discards the rows or columns that contain missing values. Here we will consider imputation.
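For contrast with imputation, here is a minimal plain-Java sketch of row deletion. It does not use tablesaw, and the names `RowDeletionSketch` and `dropIncompleteRows` are hypothetical; missing entries are represented by null.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RowDeletionSketch {
    /** Returns only the rows that contain no null (missing) entry. */
    static List<String[]> dropIncompleteRows(List<String[]> rows) {
        List<String[]> complete = new ArrayList<>();
        for (String[] row : rows) {
            boolean hasMissing = false;
            for (String value : row) {
                if (value == null) {
                    hasMissing = true;
                    break;
                }
            }
            if (!hasMissing) {
                complete.add(row);
            }
        }
        return complete;
    }

    public static void main(String[] args) {
        // The house_tiny rows, with null standing in for a missing entry
        List<String[]> rows = Arrays.asList(
                new String[] {null, "Pave", "127500"},
                new String[] {"2", null, "106000"},
                new String[] {"4", null, "178100"},
                new String[] {null, null, "140000"});
        System.out.println(dropIncompleteRows(rows).size()); // prints 0
    }
}
```

Every row of this tiny dataset has at least one missing entry, so row deletion would leave nothing to train on. That is one practical reason to prefer imputation here.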
We split the data into inputs and outputs by creating new tables and
specifying the columns desired, where the former takes the first two
columns while the latter only keeps the last column. For numerical
values in inputs that are missing, we replace the missing data entries
with the mean value of the same column.
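// Build the inputs table from the columns of data, then drop the target column
Table inputs = data.create(data.columns());
inputs.removeColumns("Price");
// The outputs table keeps only the target column
Table outputs = data.select("Price");
// Replace missing NumRooms entries with the column mean, truncated to an int
Column col = inputs.column("NumRooms");
col.set(col.isMissing(), (int) inputs.nCol("NumRooms").mean());
inputs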
 NumRooms  |  Alley  |
----------------------
        3  |   Pave  |
        2  |         |
        4  |         |
        3  |         |
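To make the arithmetic behind this step explicit, the following plain-Java sketch performs the same mean imputation on the “NumRooms” values without tablesaw. The class `MeanImputationSketch` and the method `imputeWithMean` are hypothetical names, and null stands in for a missing entry.

```java
import java.util.Arrays;

class MeanImputationSketch {
    /** Replaces null entries with the mean of the observed entries, truncated to an int. */
    static int[] imputeWithMean(Integer[] column) {
        long sum = 0;
        int count = 0;
        for (Integer value : column) {
            if (value != null) {
                sum += value;
                count++;
            }
        }
        if (count == 0) {
            throw new IllegalArgumentException("column has no observed values");
        }
        // Truncation mirrors the (int) cast applied to the mean in the cell above
        int mean = (int) ((double) sum / count);
        int[] filled = new int[column.length];
        for (int i = 0; i < column.length; i++) {
            filled[i] = column[i] != null ? column[i] : mean;
        }
        return filled;
    }

    public static void main(String[] args) {
        Integer[] numRooms = {null, 2, 4, null};
        System.out.println(Arrays.toString(imputeWithMean(numRooms))); // prints [3, 2, 4, 3]
    }
}
```

The observed values 2 and 4 average to 3, which is the value that fills both missing entries in the table above.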
For categorical or discrete values in inputs, we consider missing data
or null as a category. Since the “Alley” column only takes two types of
categorical values, “Pave” and an empty string which represents missing
data/null, tablesaw can automatically convert this column into two
boolean columns. We name these two columns “Alley_Pave” and “Alley_nan”.
A row whose alley type is “Pave” sets the values of “Alley_Pave” and
“Alley_nan” to true and false, respectively. A row with a missing alley
type sets them to false and true. We remove the original “Alley” column
and add the two new columns to the table, converting them to double so
that true and false become 1 and 0, respectively.
StringColumn col = (StringColumn) inputs.column("Alley");
List<BooleanColumn> dummies = col.getDummies();
inputs.removeColumns(col);
inputs.addColumns(
    DoubleColumn.create("Alley_Pave", dummies.get(0).asDoubleArray()),
    DoubleColumn.create("Alley_nan", dummies.get(1).asDoubleArray())
);
inputs
 NumRooms  |  Alley_Pave  |  Alley_nan  |
-----------------------------------------
        3  |           1  |          0  |
        2  |           0  |          1  |
        4  |           0  |          1  |
        3  |           0  |          1  |
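The encoding just described can also be sketched in plain Java, which may help clarify what the two indicator columns contain. This is an illustration under simplifying assumptions (a single known category plus a missing-value indicator), not the tablesaw implementation; `DummyEncodingSketch` and `encode` are hypothetical names.

```java
import java.util.Arrays;

class DummyEncodingSketch {
    /**
     * Returns a rows-by-2 matrix: column 0 is 1.0 where the entry equals
     * category, column 1 is 1.0 where the entry is missing (null or empty).
     */
    static double[][] encode(String[] column, String category) {
        double[][] encoded = new double[column.length][2];
        for (int i = 0; i < column.length; i++) {
            boolean missing = column[i] == null || column[i].isEmpty();
            encoded[i][0] = category.equals(column[i]) ? 1.0 : 0.0;
            encoded[i][1] = missing ? 1.0 : 0.0;
        }
        return encoded;
    }

    public static void main(String[] args) {
        // The Alley column of house_tiny, with null standing in for a missing entry
        String[] alley = {"Pave", null, null, null};
        System.out.println(Arrays.deepToString(encode(alley, "Pave")));
        // prints [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
    }
}
```

Each row has exactly one indicator set, matching the “Alley_Pave” and “Alley_nan” columns in the table above.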
2.2.4. Conversion to the NDArray Format
Now that all the entries in inputs and outputs are numerical, they can
be converted to the NDArray format. Once data are in this format, they
can be further manipulated with the NDArray functionalities that we
introduced in Section 2.1.
NDManager nd = NDManager.newBaseManager();
NDArray x = nd.create(inputs.as().doubleMatrix());
NDArray y = nd.create(outputs.as().intMatrix());
x
ND: (4, 3) gpu(0) float64
[[3., 1., 0.],
 [2., 0., 1.],
 [4., 0., 1.],
 [3., 0., 1.],
]
y
ND: (4, 1) gpu(0) int32
[[127500],
 [106000],
 [178100],
 [140000],
]
2.2.5. Summary
Like many other extension packages in the vast ecosystem of Java,
tablesaw can work together with NDArray.
Imputation and deletion can be used to handle missing data.
2.2.6. Exercises
Create a raw dataset with more rows and columns.
Delete the column with the most missing values.
Convert the preprocessed dataset to the NDArray format.