2.2. Data Preprocessing
So far we have introduced a variety of techniques for manipulating data
that are already stored in NDArrays. To apply deep learning to solving
real-world problems, we often begin with preprocessing raw data, rather
than those nicely prepared data in the NDArray format. Among popular
data analytic tools in Java, the tablesaw package is commonly used. If
you have used the pandas package for Python, you will find this
familiar. So, we will briefly walk through steps for preprocessing raw
data with tablesaw and converting them into the NDArray format. We will
cover more data preprocessing techniques in later chapters.
2.2.1. Reading the Dataset
As an example, we begin by creating an artificial dataset that is stored
in a csv (comma-separated values) file ../data/house_tiny.csv. Data
stored in other formats may be processed in similar ways.
Below we write the dataset row by row into a csv file.
import java.io.File;
import java.io.FileWriter;
File file = new File("../data/");
file.mkdir();
String dataFile = "../data/house_tiny.csv";
// Create file
File f = new File(dataFile);
f.createNewFile();
// Write to file
try (FileWriter writer = new FileWriter(dataFile)) {
    writer.write("NumRooms,Alley,Price\n"); // Column names
    writer.write("NA,Pave,127500\n"); // Each row represents a data example
    writer.write("2,NA,106000\n");
    writer.write("4,NA,178100\n");
    writer.write("NA,NA,140000\n");
}
To load the raw dataset from the created csv file, we import the
tablesaw package and invoke the read function to read directly from the
csv we created. This dataset has four rows and three columns, where each
row describes the number of rooms (“NumRooms”), the alley type
(“Alley”), and the price (“Price”) of a house.
%mavenRepo snapshots https://oss.sonatype.org/content/repositories/snapshots/
%maven org.slf4j:slf4j-api:1.7.26
%maven org.slf4j:slf4j-simple:1.7.26
%%loadFromPOM
<dependency>
<groupId>tech.tablesaw</groupId>
<artifactId>tablesaw-jsplot</artifactId>
<version>0.38.1</version>
</dependency>
import tech.tablesaw.api.*;
Table data = Table.read().file("../data/house_tiny.csv");
data
 NumRooms  |  Alley  |  Price   |
---------------------------------
           |   Pave  |  127500  |
        2  |         |  106000  |
        4  |         |  178100  |
           |         |  140000  |
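The table printed above shows every NA cell of the csv file as a blank, which suggests that tablesaw treats NA as a missing-value marker when reading. As a rough plain-Java sketch of that idea, the snippet below splits each line on commas and maps NA to null. The class TinyCsv and its methods are hypothetical helpers written for this illustration, not part of tablesaw, and the parser assumes there are no quoted fields or embedded commas.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TinyCsv {
    // Split one csv line on commas and map the token "NA" (or an empty cell) to null.
    // Assumption: no quoted fields or embedded commas, as in house_tiny.csv.
    static String[] parseLine(String line) {
        String[] cells = line.split(",", -1);
        for (int i = 0; i < cells.length; i++) {
            String cell = cells[i].trim();
            cells[i] = (cell.equals("NA") || cell.isEmpty()) ? null : cell;
        }
        return cells;
    }

    // Parse the data rows, i.e. every non-empty line after the header line.
    static List<String[]> parseRows(String csv) {
        List<String[]> rows = new ArrayList<>();
        String[] lines = csv.split("\n");
        for (int i = 1; i < lines.length; i++) {
            if (!lines[i].isEmpty()) {
                rows.add(parseLine(lines[i]));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String csv = "NumRooms,Alley,Price\n"
                + "NA,Pave,127500\n"
                + "2,NA,106000\n"
                + "4,NA,178100\n"
                + "NA,NA,140000\n";
        for (String[] row : parseRows(csv)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Representing a missing cell as null keeps it distinct from any real value, which is what the imputation step in the next section relies on.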
2.2.2. Handling Missing Data
Note that the blank cells in the table above are missing values (written as NA in the csv file). To handle missing data, typical methods include imputation and deletion, where imputation replaces missing values with substituted ones, while deletion ignores missing values. Here we will consider imputation.
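To make the deletion alternative concrete, here is a minimal plain-Java sketch that drops every row containing a missing cell. The class RowDeletion and its method are hypothetical names for this illustration, and null is assumed to mark a missing value.

```java
import java.util.ArrayList;
import java.util.List;

class RowDeletion {
    // Keep only the rows in which no cell is null (null marks a missing value).
    static List<Object[]> dropRowsWithMissing(List<Object[]> rows) {
        List<Object[]> kept = new ArrayList<>();
        for (Object[] row : rows) {
            boolean complete = true;
            for (Object cell : row) {
                if (cell == null) {
                    complete = false;
                    break;
                }
            }
            if (complete) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The four rows of house_tiny.csv, with NA represented as null.
        List<Object[]> rows = new ArrayList<>();
        rows.add(new Object[] {null, "Pave", 127500});
        rows.add(new Object[] {2, null, 106000});
        rows.add(new Object[] {4, null, 178100});
        rows.add(new Object[] {null, null, 140000});
        System.out.println(dropRowsWithMissing(rows).size() + " complete rows remain");
    }
}
```

On this tiny dataset every row has at least one missing cell, so row-wise deletion would leave nothing to train on. That is one reason to prefer imputation here.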
We split the data into inputs and outputs by creating new tables and
specifying the columns desired, where the former takes the first two
columns while the latter only keeps the last column. For numerical
values in inputs that are missing, we replace the missing entries with
the mean value of the same column, cast to an integer.
import tech.tablesaw.columns.Column;
Table inputs = data.create(data.columns());
inputs.removeColumns("Price");
Table outputs = data.select("Price");
Column col = inputs.column("NumRooms");
col.set(col.isMissing(), (int) inputs.nCol("NumRooms").mean());
inputs
 NumRooms  |  Alley  |
----------------------
        3  |   Pave  |
        2  |         |
        4  |         |
        3  |         |
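The arithmetic behind the filled-in values can be sketched without tablesaw. The class MeanImputation below is a hypothetical helper for this illustration. It assumes null marks a missing entry and mirrors the (int) cast used in the snippet above.

```java
import java.util.Arrays;

class MeanImputation {
    // Mean of the observed entries only (null marks a missing value).
    static double meanOfPresent(Integer[] values) {
        double sum = 0;
        int count = 0;
        for (Integer v : values) {
            if (v != null) {
                sum += v;
                count++;
            }
        }
        if (count == 0) {
            throw new IllegalArgumentException("column has no observed values");
        }
        return sum / count;
    }

    // Return a copy in which each missing entry is replaced by the mean,
    // truncated to an int in the same way as the (int) cast above.
    static int[] imputeWithMean(Integer[] values) {
        int fill = (int) meanOfPresent(values);
        int[] out = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = (values[i] == null) ? fill : values[i];
        }
        return out;
    }

    public static void main(String[] args) {
        Integer[] numRooms = {null, 2, 4, null};
        System.out.println(Arrays.toString(imputeWithMean(numRooms))); // [3, 2, 4, 3]
    }
}
```

The observed values 2 and 4 average to 3, which matches the table. Note that the cast truncates rather than rounds, so a mean of 1.5 would be stored as 1.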
For categorical or discrete values in inputs, we consider NaN or null as
a category. Since the “Alley” column only takes two types of categorical
values, “Pave” and an empty string which represents NaN/null, tablesaw
can automatically convert this column to two columns. We name these two
columns “Alley_Pave” and “Alley_nan”. A row whose alley type is “Pave”
sets “Alley_Pave” to true and “Alley_nan” to false; a row with a missing
alley type sets them to false and true. We then remove the original
“Alley” column and add the two new columns to the table, converting them
to double so that true and false become 1 and 0, respectively.
StringColumn col = (StringColumn) inputs.column("Alley");
List<BooleanColumn> dummies = col.getDummies();
inputs.removeColumns(col);
inputs.addColumns(
    DoubleColumn.create("Alley_Pave", dummies.get(0).asDoubleArray()),
    DoubleColumn.create("Alley_nan", dummies.get(1).asDoubleArray())
);
inputs
 NumRooms  |  Alley_Pave  |  Alley_nan  |
-----------------------------------------
        3  |           1  |          0  |
        2  |           0  |          1  |
        4  |           0  |          1  |
        3  |           0  |          1  |
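The indicator columns above follow a simple rule that can be written out by hand. The sketch below is plain Java and does not show how tablesaw implements getDummies. The class AlleyDummies is a hypothetical helper, and null is assumed to stand for a missing alley type.

```java
import java.util.Arrays;

class AlleyDummies {
    // Encode a column whose cells hold either a category label or null (missing).
    // Output column 0 is 1.0 when the cell equals `category`;
    // output column 1 is 1.0 when the cell is missing.
    static double[][] encode(String[] values, String category) {
        double[][] out = new double[values.length][2];
        for (int i = 0; i < values.length; i++) {
            out[i][0] = category.equals(values[i]) ? 1.0 : 0.0;
            out[i][1] = (values[i] == null) ? 1.0 : 0.0;
        }
        return out;
    }

    public static void main(String[] args) {
        String[] alley = {"Pave", null, null, null};
        // Each printed pair is (Alley_Pave, Alley_nan) for one row.
        System.out.println(Arrays.deepToString(encode(alley, "Pave")));
    }
}
```

Each row gets exactly one 1 across the two columns, so the missing category is carried forward as information instead of being discarded.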
2.2.3. Conversion to the NDArray Format
Now that all the entries in inputs and outputs are numerical, they can
be converted to the NDArray format. Once data are in this format, they
can be further manipulated with those NDArray functionalities that we
have introduced in Section 2.1.
%maven ai.djl:api:0.9.0
%maven ai.djl:basicdataset:0.9.0
// See https://github.com/awslabs/djl/blob/master/mxnet/mxnet-engine/README.md
// MXNet
%maven ai.djl.mxnet:mxnet-engine:0.9.0
%maven ai.djl.mxnet:mxnet-native-auto:1.7.0-backport
import ai.djl.ndarray.*;
NDManager nd = NDManager.newBaseManager();
NDArray x = nd.create(inputs.as().doubleMatrix());
NDArray y = nd.create(outputs.as().intMatrix());
x
ND: (4, 3) gpu(0) float64
[[3., 1., 0.],
[2., 0., 1.],
[4., 0., 1.],
[3., 0., 1.],
]
y
ND: (4, 1) gpu(0) int32
[[127500],
[106000],
[178100],
[140000],
]
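The shape (4, 3) printed above indicates one inner array per table row. Assuming that layout, the step from column-wise storage to a row-major matrix can be sketched in plain Java. ColumnsToMatrix and toRowMajor are hypothetical names for this illustration, not part of tablesaw or DJL.

```java
import java.util.Arrays;

class ColumnsToMatrix {
    // Stack equally long columns side by side into a row-major matrix,
    // so that matrix[row][col] == columns[col][row].
    static double[][] toRowMajor(double[]... columns) {
        int rows = (columns.length == 0) ? 0 : columns[0].length;
        double[][] matrix = new double[rows][columns.length];
        for (int c = 0; c < columns.length; c++) {
            if (columns[c].length != rows) {
                throw new IllegalArgumentException("columns differ in length");
            }
            for (int r = 0; r < rows; r++) {
                matrix[r][c] = columns[c][r];
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        double[] numRooms = {3, 2, 4, 3};
        double[] alleyPave = {1, 0, 0, 0};
        double[] alleyNan = {0, 1, 1, 1};
        System.out.println(Arrays.deepToString(toRowMajor(numRooms, alleyPave, alleyNan)));
    }
}
```

The result has four inner arrays of three values each, the same layout as the x printed above.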
2.2.4. Summary
Like many other extension packages in the vast ecosystem of Java,
tablesaw can work together with NDArray.
Imputation and deletion can be used to handle missing data.
2.2.5. Exercises
Create a raw dataset with more rows and columns.
Delete the column with the most missing values.
Convert the preprocessed dataset to the NDArray format.