Run this notebook online:\ |Binder| or Colab: |Colab|

.. |Binder| image:: https://mybinder.org/badge_logo.svg
   :target: https://mybinder.org/v2/gh/deepjavalibrary/d2l-java/master?filepath=chapter_preliminaries/tablesaw.ipynb
.. |Colab| image:: https://colab.research.google.com/assets/colab-badge.svg
   :target: https://colab.research.google.com/github/deepjavalibrary/d2l-java/blob/colab/chapter_preliminaries/tablesaw.ipynb

.. _sec_tablesaw:

Data Preprocessing
==================

So far we have introduced a variety of techniques for manipulating data
that are already stored in ``NDArray``\ s. To apply deep learning to
solving real-world problems, we often begin by preprocessing raw data
rather than working with nicely prepared data in the ``NDArray``
format. Among the popular data analysis tools in Java, the ``tablesaw``
package is commonly used. If you have used the pandas package for
Python, you will find this familiar. In this section, we briefly walk
through the steps for preprocessing raw data with ``tablesaw`` and
converting it into the ``NDArray`` format. We will cover more data
preprocessing techniques in later chapters.

Adding tablesaw dependencies to the Jupyter notebook
----------------------------------------------------

You can add the tablesaw dependencies by running a Java cell containing:

::

   %%loadFromPOM
   <dependency>
       <groupId>tech.tablesaw</groupId>
       <artifactId>tablesaw-jsplot</artifactId>
       <version>0.38.1</version>
   </dependency>

To make it easy to include tablesaw in a Jupyter notebook, we provide a
utility notebook that can be loaded with:

.. code:: java

    %load ../utils/plot-utils.ipynb

Reading the Dataset
-------------------

As an example, we begin by creating an artificial dataset that is
stored in a CSV (comma-separated values) file
``../data/house_tiny.csv``. Data stored in other formats may be
processed in similar ways. Below we write the dataset row by row into a
CSV file.

.. code:: java

    %load ../utils/djl-imports

.. code:: java

    File file = new File("../data/");
    file.mkdir();

    String dataFile = "../data/house_tiny.csv";

    // Create the file
    File f = new File(dataFile);
    f.createNewFile();

    // Write the dataset to the file
    try (FileWriter writer = new FileWriter(dataFile)) {
        writer.write("NumRooms,Alley,Price\n"); // Column names
        writer.write("NA,Pave,127500\n"); // Each row represents a data example
        writer.write("2,NA,106000\n");
        writer.write("4,NA,178100\n");
        writer.write("NA,NA,140000\n");
    }

To load the raw dataset, we use the ``tablesaw`` package and invoke its
``read`` function to read directly from the CSV file we created. This
dataset has four rows and three columns, where each row describes the
number of rooms ("NumRooms"), the alley type ("Alley"), and the price
("Price") of a house.

.. code:: java

    Table data = Table.read().file("../data/house_tiny.csv");
    data

.. parsed-literal::
    :class: output

     NumRooms  |  Alley  |  Price   |
    ----------------------------------
               |   Pave  |  127500  |
            2  |         |  106000  |
            4  |         |  178100  |
               |         |  140000  |

Handling Missing Data
---------------------

Note that the blank entries in the table are missing values. To handle
missing data, typical methods include *imputation* and *deletion*:
imputation replaces missing values with substituted ones, while
deletion discards the rows or columns that contain missing values. Here
we will use imputation.
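Although the rest of this section uses imputation, deletion takes only
a single call in ``tablesaw``. The snippet below is a minimal sketch
rather than part of the original workflow; it assumes that
``Table.dropRowsWithMissingValues()`` is available in the ``tablesaw``
version used here and that it returns a new table instead of modifying
``data`` in place.

.. code:: java

    // Deletion (sketch): keep only the rows without any missing values.
    // Assumes dropRowsWithMissingValues() is available in this tablesaw
    // version; it returns a new table and leaves `data` untouched.
    Table complete = data.dropRowsWithMissingValues();
    complete

On this tiny dataset every row has at least one missing entry, so
deletion would leave no rows at all, which is one reason to prefer
imputation here.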
We split ``data`` into ``inputs`` and ``outputs`` by creating new
tables with the desired columns, where the former takes the first two
columns while the latter only keeps the last column. For numerical
values in ``inputs`` that are missing, we replace the missing entries
with the mean value of the same column.

.. code:: java

    Table inputs = data.create(data.columns());
    inputs.removeColumns("Price");
    Table outputs = data.select("Price");

    Column col = inputs.column("NumRooms");
    col.set(col.isMissing(), (int) inputs.nCol("NumRooms").mean());
    inputs

.. parsed-literal::
    :class: output

     NumRooms  |  Alley  |
    ----------------------
            3  |   Pave  |
            2  |         |
            4  |         |
            3  |         |

For categorical or discrete values in ``inputs``, we treat missing data
(null) as a category of its own. Since the "Alley" column only takes
two categorical values, "Pave" and an empty string representing missing
data, ``tablesaw`` can automatically convert this column into two dummy
columns, which we name "Alley_Pave" and "Alley_nan". A row whose alley
type is "Pave" sets "Alley_Pave" to true and "Alley_nan" to false; a
row with a missing alley type sets them to false and true. We then add
these columns to the table as double columns, so that true and false
become 1 and 0, respectively, and finally remove the original "Alley"
column.

.. code:: java

    StringColumn col = (StringColumn) inputs.column("Alley");
    List<BooleanColumn> dummies = col.getDummies();
    inputs.removeColumns(col);
    inputs.addColumns(
        DoubleColumn.create("Alley_Pave", dummies.get(0).asDoubleArray()),
        DoubleColumn.create("Alley_nan", dummies.get(1).asDoubleArray())
    );
    inputs

.. parsed-literal::
    :class: output

     NumRooms  |  Alley_Pave  |  Alley_nan  |
    -----------------------------------------
            3  |           1  |          0  |
            2  |           0  |          1  |
            4  |           0  |          1  |
            3  |           0  |          1  |

Conversion to the NDArray Format
--------------------------------

Now that all the entries in ``inputs`` and ``outputs`` are numerical,
they can be converted to the ``NDArray`` format. Once data are in this
format, they can be further manipulated with the ``NDArray``
functionalities that we introduced in :numref:`sec_NDArray`.

.. code:: java

    NDManager nd = NDManager.newBaseManager();
    NDArray x = nd.create(inputs.as().doubleMatrix());
    NDArray y = nd.create(outputs.as().intMatrix());
    x

.. parsed-literal::
    :class: output

    ND: (4, 3) gpu(0) float64
    [[3., 1., 0.],
     [2., 0., 1.],
     [4., 0., 1.],
     [3., 0., 1.],
    ]

.. code:: java

    y

.. parsed-literal::
    :class: output

    ND: (4, 1) gpu(0) int32
    [[127500],
     [106000],
     [178100],
     [140000],
    ]

Summary
-------

-  Like many other extension packages in the vast ecosystem of Java,
   ``tablesaw`` can work together with ``NDArray``.
-  Imputation and deletion can be used to handle missing data.

Exercises
---------

Create a raw dataset with more rows and columns.

1. Delete the column with the most missing values.
2. Convert the preprocessed dataset to the ``NDArray`` format.
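As one possible starting point for the first exercise (a sketch, not
part of the original text), the code below counts the missing values in
each column and removes the column with the most, applied here to
``data`` for illustration. It assumes that ``Column.countMissing()``
and ``Table.copy()`` are available in the ``tablesaw`` version used
here.

.. code:: java

    // Sketch for Exercise 1: find the column with the most missing values
    // and remove it from a copy of the table. Assumes countMissing() and
    // copy() exist in this tablesaw version.
    String mostMissingName = null;
    int mostMissing = -1;
    for (Column<?> c : data.columns()) {
        if (c.countMissing() > mostMissing) {
            mostMissing = c.countMissing();
            mostMissingName = c.name();
        }
    }
    Table cleaned = data.copy();
    cleaned.removeColumns(mostMissingName);
    cleaned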