Run this notebook online:\ |Binder| or Colab: |Colab|

.. |Binder| image:: https://mybinder.org/badge_logo.svg
   :target: https://mybinder.org/v2/gh/deepjavalibrary/d2l-java/master?filepath=chapter_recurrent-neural-networks/text-preprocessing.ipynb
.. |Colab| image:: https://colab.research.google.com/assets/colab-badge.svg
   :target: https://colab.research.google.com/github/deepjavalibrary/d2l-java/blob/colab/chapter_recurrent-neural-networks/text-preprocessing.ipynb

.. _sec_text_preprocessing:

Text Preprocessing
==================


We have reviewed statistical tools and prediction challenges for
sequence data. Such data can take many forms. Text, which we will focus
on in many chapters of the book, is one of the most common examples.
For instance, an article can be viewed simply as a sequence of words,
or even as a sequence of characters. To facilitate our future
experiments with sequence data, we dedicate this section to explaining
common preprocessing steps for text. Usually, these steps are:

1. Load text as strings into memory.
2. Split strings into tokens (e.g., words and characters).
3. Build a vocabulary table that maps the tokens to numerical
   indices.
4. Convert text into sequences of numerical indices so they can be
   manipulated by models easily.
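
Steps 2 to 4 can be sketched on a string that is already in memory. The
class name ``PipelineSketch`` and its ``encode`` method are hypothetical
and only serve as an illustration. As a simplifying assumption, indices
are assigned in order of first appearance rather than by frequency, which
differs from the vocabulary built later in this section.

.. code:: java

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class PipelineSketch {
        // Split on whitespace, give each new token the next free index
        // (first-appearance order, an assumption of this sketch), and
        // return the text as a list of indices.
        public static List<Integer> encode(String text, Map<String, Integer> vocab) {
            List<Integer> indices = new ArrayList<>();
            for (String token : text.trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue; // an empty input yields one empty token; skip it
                }
                vocab.putIfAbsent(token, vocab.size());
                indices.add(vocab.get(token));
            }
            return indices;
        }

        public static void main(String[] args) {
            Map<String, Integer> vocab = new LinkedHashMap<>();
            List<Integer> indices = encode("the time machine by the time traveller", vocab);
            System.out.println(vocab);
            System.out.println(indices);
        }
    }

Repeated words such as ``the`` and ``time`` receive the same index each
time they occur, so the seven input words map to only five distinct
indices.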

.. code:: java

    %load ../utils/djl-imports

Reading the Dataset
-------------------

To get started we load text from H. G. Wells' `The Time
Machine <http://www.gutenberg.org/ebooks/35>`__. This is a fairly small
corpus of just over 30000 words, but it is sufficient for what we want
to illustrate. More realistic document collections contain many billions
of words. The following function reads the dataset into an array of text
lines, where each line is a string. For simplicity, we ignore
punctuation and capitalization: every run of non-letter characters is
replaced by a single space and the text is converted to lowercase.

.. code:: java

    public String[] readTimeMachine() throws IOException {
        URL url = new URL("http://d2l-data.s3-accelerate.amazonaws.com/timemachine.txt");
        String[] lines;
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
            lines = in.lines().toArray(String[]::new);
        }
    
        for (int i = 0; i < lines.length; i++) {
            lines[i] = lines[i].replaceAll("[^A-Za-z]+", " ").strip().toLowerCase();
        }
        return lines;
    }
    
    String[] lines = readTimeMachine();
    System.out.println("# text lines: " + lines.length);
    System.out.println(lines[0]);
    System.out.println(lines[10]);


.. parsed-literal::
    :class: output

    # text lines: 3221
    the time machine by h g wells
    twinkled and his usually pale face was flushed and animated the
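
To see what the cleanup in ``readTimeMachine`` does to a single line, the
following sketch applies the same regular expression to two made-up
strings. The class name ``NormalizeSketch`` is hypothetical, and
``trim()`` is used in place of ``strip()`` so that the sketch also
compiles on Java 8; for ordinary spaces the two behave the same.

.. code:: java

    public class NormalizeSketch {
        // Replace every run of non-letters with one space, trim the ends,
        // and lower-case the result.
        public static String normalize(String line) {
            return line.replaceAll("[^A-Za-z]+", " ").trim().toLowerCase();
        }

        public static void main(String[] args) {
            System.out.println(normalize("The Time Machine, by H. G. Wells (1895)"));
            System.out.println(normalize("  'It's 8 o'clock!'  "));
        }
    }

Note that digits are dropped entirely and that apostrophes split
contractions: the second string becomes ``it s o clock``. This is a
deliberate simplification, not a general-purpose normalizer.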


Tokenization
------------

The following ``tokenize`` function takes an array (``lines``) as the
input, where each element is a text sequence (e.g., a text line). The
``token`` argument selects word-level (``"word"``) or character-level
(``"char"``) splitting. Each text sequence is split into an array of
tokens. A *token* is the basic unit of text. In the end, an array of
token arrays is returned, where each token is a string.

.. code:: java

    public String[][] tokenize(String[] lines, String token) throws Exception {
        // Split text lines into word or character tokens.
        String[][] output = new String[lines.length][];
        // Compare string contents with equals(); == only compares references.
        if ("word".equals(token)) {
            for (int i = 0; i < output.length; i++) {
                output[i] = lines[i].split(" ");
            }
        } else if ("char".equals(token)) {
            for (int i = 0; i < output.length; i++) {
                output[i] = lines[i].split("");
            }
        } else {
            throw new Exception("ERROR: unknown token type: " + token);
        }
        return output;
    }
    String[][] tokens = tokenize(lines, "word");
    for (int i = 0; i < 11; i++) {
        System.out.println(Arrays.toString(tokens[i]));
    }


.. parsed-literal::
    :class: output

    [the, time, machine, by, h, g, wells]
    []
    []
    []
    []
    [i]
    []
    []
    [the, time, traveller, for, so, it, will, be, convenient, to, speak, of, him]
    [was, expounding, a, recondite, matter, to, us, his, grey, eyes, shone, and]
    [twinkled, and, his, usually, pale, face, was, flushed, and, animated, the]
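
Two details of ``String.split`` explain the output above and matter
later. The sketch below, with the hypothetical class name
``SplitSketch``, assumes Java 8 or later, where splitting on the empty
pattern does not produce a leading empty string.

.. code:: java

    import java.util.Arrays;

    public class SplitSketch {
        // Word-level split, as in tokenize(lines, "word").
        public static String[] words(String line) {
            return line.split(" ");
        }

        // Character-level split, as in tokenize(lines, "char").
        public static String[] chars(String line) {
            return line.split("");
        }

        public static void main(String[] args) {
            System.out.println(Arrays.toString(words("the time")));
            System.out.println(Arrays.toString(chars("the time")));
            // A blank line is not split into zero tokens: it yields one empty token.
            String[] blank = words("");
            System.out.println(blank.length + " token, empty: " + blank[0].isEmpty());
        }
    }

First, a blank line produces an array holding a single empty string,
which ``Arrays.toString`` prints as ``[]``; such empty tokens have to be
filtered out before counting. Second, character-level splitting keeps the
space as a token of its own.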


Vocabulary
----------

Strings are inconvenient inputs for models, which take numerical inputs.
Now let us build a dictionary (a ``HashMap``), often called a
*vocabulary*, to map string tokens to numerical indices starting from 0.
To do so, we first count the unique tokens in all the documents of the
training set, namely the *corpus*, and then assign a numerical index to
each unique token according to its frequency. Rarely occurring tokens
are often removed to reduce complexity. Any token that does not exist in
the corpus or has been removed is mapped to a special unknown token
“<unk>”. We optionally add a list of reserved tokens, such as “<pad>”
for padding, “<bos>” to mark the beginning of a sequence, and “<eos>” to
mark its end.

.. code:: java

    public class Vocab {
        public int unk;
        public List<Map.Entry<String, Integer>> tokenFreqs;
        public List<String> idxToToken;
        public HashMap<String, Integer> tokenToIdx;
    
        public Vocab(String[][] tokens, int minFreq, String[] reservedTokens) {
            // Sort according to frequencies
            LinkedHashMap<String, Integer> counter = countCorpus2D(tokens);
            this.tokenFreqs = new ArrayList<Map.Entry<String, Integer>>(counter.entrySet()); 
            Collections.sort(tokenFreqs, 
                new Comparator<Map.Entry<String, Integer>>() { 
                    public int compare(Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2) { 
                        return (o2.getValue()).compareTo(o1.getValue()); 
                    }
                });
            
            // The index for the unknown token is 0
            this.unk = 0;
            List<String> uniqTokens = new ArrayList<>();
            uniqTokens.add("<unk>");
            Collections.addAll(uniqTokens, reservedTokens);
            for (Map.Entry<String, Integer> entry : tokenFreqs) {
                if (entry.getValue() >= minFreq && !uniqTokens.contains(entry.getKey())) {
                    uniqTokens.add(entry.getKey());
                }
            }
            
            this.idxToToken = new ArrayList<>();
            this.tokenToIdx = new HashMap<>();
            for (String token : uniqTokens) {
                this.idxToToken.add(token);
                this.tokenToIdx.put(token, this.idxToToken.size()-1);
            }
        }
        
        public int length() {
            return this.idxToToken.size();
        }
        
        public Integer[] getIdxs(String[] tokens) {
            List<Integer> idxs = new ArrayList<>();
            for (String token : tokens) {
                idxs.add(getIdx(token));
            }
            return idxs.toArray(new Integer[0]);
            
        }
        
        public Integer getIdx(String token) {
            return this.tokenToIdx.getOrDefault(token, this.unk);
        }
        
        
    }
    
    public LinkedHashMap<String, Integer> countCorpus(String[] tokens) {
        /* Count token frequencies. */
        LinkedHashMap<String, Integer> counter = new LinkedHashMap<>();
        if (tokens.length != 0) {
            for (String token : tokens) {
                counter.put(token, counter.getOrDefault(token, 0)+1);
            }
        }
        return counter;
    }
    
    public LinkedHashMap<String, Integer> countCorpus2D(String[][] tokens) {
        /* Flatten a list of token lists into a list of tokens */
        List<String> allTokens = new ArrayList<String>();
        for (int i = 0; i < tokens.length; i++) {
            for (int j = 0; j < tokens[i].length; j++) {
                // Skip the empty tokens produced by blank lines.
                // isEmpty() checks content; != "" only compares references.
                if (!tokens[i][j].isEmpty()) {
                    allTokens.add(tokens[i][j]);
                }
            }
        }
        return countCorpus(allTokens.toArray(new String[0]));
    }

We construct a vocabulary using the time machine dataset as the corpus.
Then we print the most frequent tokens with their indices.

.. code:: java

    Vocab vocab = new Vocab(tokens, 0, new String[0]);
    for (int i = 0; i < 10; i++) {
        String token = vocab.idxToToken.get(i);
        System.out.print("(" + token + ", " + vocab.tokenToIdx.get(token) + ") ");
    }


.. parsed-literal::
    :class: output

    (<unk>, 0) (the, 1) (i, 2) (and, 3) (of, 4) (a, 5) (to, 6) (was, 7) (in, 8) (that, 9) 
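
The effect of the frequency threshold is easiest to see on a toy corpus.
The sketch below is a simplified stand-in for the ``Vocab`` constructor,
not a replacement for it: the class name ``MinFreqSketch`` and the method
``buildIdxToToken`` are hypothetical, reserved tokens are left out, and
the ten-word sentence is made up for illustration.

.. code:: java

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class MinFreqSketch {
        // Return the index-to-token list: "<unk>" at index 0, then the
        // remaining tokens by descending frequency, skipping tokens seen
        // fewer than minFreq times.
        public static List<String> buildIdxToToken(String[] tokens, int minFreq) {
            Map<String, Integer> counter = new LinkedHashMap<>();
            for (String token : tokens) {
                counter.merge(token, 1, Integer::sum);
            }
            List<Map.Entry<String, Integer>> freqs = new ArrayList<>(counter.entrySet());
            // The sort is stable, so equally frequent tokens keep first-seen order.
            freqs.sort((a, b) -> b.getValue().compareTo(a.getValue()));

            List<String> idxToToken = new ArrayList<>();
            idxToToken.add("<unk>");
            for (Map.Entry<String, Integer> entry : freqs) {
                if (entry.getValue() >= minFreq) {
                    idxToToken.add(entry.getKey());
                }
            }
            return idxToToken;
        }

        public static void main(String[] args) {
            String[] tokens = "the time traveller and the time machine and the door".split(" ");
            System.out.println(buildIdxToToken(tokens, 0));
            System.out.println(buildIdxToToken(tokens, 2));
        }
    }

With a threshold of 0 every distinct word is kept, giving seven entries
including ``<unk>``. With a threshold of 2 only ``the``, ``time``, and
``and`` survive, and every other word would be encoded as ``<unk>``.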

Now we can convert each text line into a list of numerical indices.

.. code:: java

    for (int i : new int[] {0,10}) {
        System.out.println("Words:" + Arrays.toString(tokens[i]));
        System.out.println("Indices:" + Arrays.toString(vocab.getIdxs(tokens[i])));
    }


.. parsed-literal::
    :class: output

    Words:[the, time, machine, by, h, g, wells]
    Indices:[1, 19, 50, 40, 2183, 2184, 400]
    Words:[twinkled, and, his, usually, pale, face, was, flushed, and, animated, the]
    Indices:[2186, 3, 25, 1044, 362, 113, 7, 1421, 3, 1045, 1]
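
The mapping also works in the other direction, which is how model
outputs are turned back into text. The sketch below uses a hypothetical
four-entry vocabulary and the made-up class name ``RoundTripSketch``; it
mirrors the roles of ``tokenToIdx`` and ``idxToToken`` in ``Vocab``
without depending on that class.

.. code:: java

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class RoundTripSketch {
        // Hypothetical vocabulary; index 0 is the unknown token.
        private static final List<String> IDX_TO_TOKEN =
                Arrays.asList("<unk>", "the", "time", "machine");
        private static final Map<String, Integer> TOKEN_TO_IDX = new HashMap<>();

        static {
            for (int i = 0; i < IDX_TO_TOKEN.size(); i++) {
                TOKEN_TO_IDX.put(IDX_TO_TOKEN.get(i), i);
            }
        }

        // Tokens missing from the vocabulary fall back to index 0.
        public static int[] encode(String[] tokens) {
            int[] idxs = new int[tokens.length];
            for (int i = 0; i < tokens.length; i++) {
                idxs[i] = TOKEN_TO_IDX.getOrDefault(tokens[i], 0);
            }
            return idxs;
        }

        // Look each index up in the index-to-token list.
        public static String[] decode(int[] idxs) {
            String[] tokens = new String[idxs.length];
            for (int i = 0; i < idxs.length; i++) {
                tokens[i] = IDX_TO_TOKEN.get(idxs[i]);
            }
            return tokens;
        }

        public static void main(String[] args) {
            int[] idxs = encode(new String[] {"the", "time", "traveller"});
            System.out.println(Arrays.toString(idxs));
            System.out.println(Arrays.toString(decode(idxs)));
        }
    }

The round trip is lossy: ``traveller`` is not in this small vocabulary,
so it is encoded as 0 and decoded as ``<unk>``.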


Putting All Things Together
---------------------------

Using the above functions, we package everything into the
``loadCorpusTimeMachine`` function, which returns ``corpus``, a list of
token indices, and ``vocab``, the vocabulary of the time machine corpus.
We make two modifications here: i) we tokenize text into characters,
not words, to simplify training in later sections; ii) ``corpus`` is
a single list, not a list of token lists, since each text line in the
time machine dataset is not necessarily a sentence or a paragraph.

.. code:: java

    public Pair<List<Integer>, Vocab> loadCorpusTimeMachine(int maxTokens) throws IOException, Exception {
        /* Return token indices and the vocabulary of the time machine dataset. */
        String[] lines = readTimeMachine();
        String[][] tokens = tokenize(lines, "char");
        Vocab vocab = new Vocab(tokens, 0, new String[0]);
        // Since each text line in the time machine dataset is not necessarily a
        // sentence or a paragraph, flatten all the text lines into a single list
        List<Integer> corpus = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            for (int j = 0; j < tokens[i].length; j++) {
                // Skip the empty tokens produced by blank lines.
                if (!tokens[i][j].isEmpty()) {
                    corpus.add(vocab.getIdx(tokens[i][j]));
                }
            }
        }
        if (maxTokens > 0) {
            // Clamp to the corpus size so an oversized limit does not throw.
            corpus = corpus.subList(0, Math.min(maxTokens, corpus.size()));
        }
        return new Pair<>(corpus, vocab);
    }
    
    Pair<List<Integer>, Vocab> corpusVocabPair = loadCorpusTimeMachine(-1);
    List<Integer> corpus = corpusVocabPair.getKey();
    Vocab vocab = corpusVocabPair.getValue();
    
    System.out.println(corpus.size());
    System.out.println(vocab.length());


.. parsed-literal::
    :class: output

    170580
    28


Summary
-------

-  Text is an important form of sequence data.
-  To preprocess text, we usually split text into tokens, build a
   vocabulary to map token strings into numerical indices, and convert
   text data into token indices for models to manipulate.

Exercises
---------

1. Tokenization is a key preprocessing step. It varies for different
   languages. Try to find three other commonly used methods to
   tokenize text.
2. In the experiment of this section, tokenize text into words and vary
   the ``minFreq`` argument of the ``Vocab`` instance. How does this
   affect the vocabulary size?