Natural Language Processing
I have been working in the field of Natural Language Processing (NLP) since 1982. In this chapter we will use a few of my open source NLP projects. In the next chapter I have selected the open source OpenNLP project to provide more examples to get you started using NLP in your own projects. For my current work I usually use a combination of my own library discussed in this chapter, OpenNLP (covered in the next chapter), and two deep learning NLP libraries for the Python language (spaCy and Hugging Face pre-trained deep learning models). While I don’t cover these Python libraries in this book, I do have examples of using spaCy in my Hy Language book. Hy is a Lisp language that is implemented in Python. For difficult NLP problems like coreference resolution (also called anaphora resolution) I use deep learning models like BERT.
Deep learning is apparently “eating” the AI world but I firmly believe that hybrid systems stand the best chance of getting us to real artificial general intelligence (AGI) - time will tell as more NLP, knowledge representation, reasoning, etc., tasks are implemented in hybrid systems. Many experts in AI believe that deep learning only takes us so far, and that in order to reach artificial general intelligence we will need some hybrid of deep learning, symbolic AI, and probabilistic systems. That said, there are deep learning specialists who predict their favored technology will probably be sufficient to get to AGI.
Overview of the NLP Library and Running the Examples
We will cover a wide variety of techniques for processing text in this chapter. The part of speech (POS) tagging, text categorization, and entity extraction examples are all derived from either my open source projects or commercial projects that I developed in the 1990-2010 time frame.
The following UML class diagrams will give you an overview of my NLP library code:
The XML parsing code is for reading the file test_data/classification_tags.xml, which contains ranked word terms for the categories we cover (e.g., politics, economy, etc.); this style of term weighting is often referred to as term frequency–inverse document frequency (tf-idf).
Each main class in this library has a main method that provides a short demonstration of using the class. The Makefile has targets for running the main method for each of the top level classes:
1 names:
2 mvn install -DskipTests
3 mvn exec:java -Dexec.mainClass="com.markwatson.nlp.ExtractNames"
4
5 autotagger:
6 mvn install -DskipTests
7 mvn exec:java -Dexec.mainClass="com.markwatson.nlp.AutoTagger"
8
9 fasttag:
10 mvn install -DskipTests
11 mvn exec:java -Dexec.mainClass="com.markwatson.nlp.FastTag"
You might find it useful to run the examples before we look at the code.
Tokenizing, Stemming, and Part of Speech Tagging Text
Tokenizing text is the process of splitting a string containing text into individual tokens. Stemming is the reduction of words to abbreviated word roots, which allows similar words to be easily compared for equality. Tagging is identifying the part of speech of each word in the input text. Tagging is complicated by the fact that many words can be different parts of speech depending on context (examples: “bank the airplane,” “the river bank,” etc.). You can find the code in this section in the GitHub repository in the files src/com/markwatson/nlp/fasttag/FastTag.java and src/com/markwatson/nlp/util/Tokenizer.java. The required data files are in the directory test_data in the files lexicon.txt (for processing English text) and lexicon_medpost.txt (for processing medical text).
In this section we will also look at a public domain word stemmer that I frequently use.
Before we can process any text we need to break the text into individual tokens. Tokens can be words, numbers, and punctuation symbols. The class Tokenizer has two static methods, both of which take an input string to tokenize and return a list of token strings. The second method has an extra argument to specify the maximum number of tokens that you want returned:
1 static public List<String> wordsToList(String s)
2 static public List<String> wordsToList(String s, int maxR)
In line 2, maxR is the maximum number of tokens to return and is useful when you want to sample the first part of a very long text.
The following listing shows a fragment of example code using this class with the output:
1 String text =
2 "The ball, rolling quickly, went down the hill.";
3 List<String> tokens = Tokenizer.wordsToList(text);
4 System.out.println(text);
5 for (String token : tokens)
6 System.out.print("\""+token+"\" ");
7 System.out.println();
This code fragment produces the following output:
1 The ball, rolling quickly, went down the hill.
2 "The" "ball" "," "rolling" "quickly" "," "went"
3 "down" "the" "hill" "."
For many applications, it is better to “stem” word tokens to simplify comparison of similar words. For example “run,” “runs,” and “running” all stem to “run.” The stemmer that we will use, which I believe to be in the public domain, is in the file src/public_domain/Stemmer.java. There are two convenient APIs defined at the end of the class, one to stem a string of multiple words and one to stem a single word token:
1 public List<String> stemString(String str)
2 public String stemOneWord(String word)
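For example, assuming the Stemmer class is on your classpath, a short demo of these two APIs might look like the following; the stemmed forms shown in the comments are illustrative and may differ slightly from the actual output:

import java.util.List;

public class StemmerDemo {
  public static void main(String[] args) {
    Stemmer stemmer = new Stemmer();
    // stem every word in a string:
    List<String> stems = stemmer.stemString("The runner was running quickly");
    System.out.println(stems);                           // e.g., [the, runner, wa, run, quickli]
    // stem a single word token:
    System.out.println(stemmer.stemOneWord("running"));  // e.g., run
  }
}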
My FastTag part of speech (POS) tagger project grew out of my use, in the early 1990s, of the excellent tagger written by Eric Brill while he was at the University of Pennsylvania. He used machine learning techniques to learn transition rules for tagging text, using manually tagged text as training examples. In reading through his doctoral thesis I noticed that a few transition rules covered most of the cases, so I implemented a simple “fast tagger” in Common Lisp, Ruby, Scheme, and Java. The Java version is in the file src/com/markwatson/nlp/fasttag/FastTag.java.
The file src/com/markwatson/nlp/fasttag/README.txt contains information on where to obtain Eric Brill’s original tagging system and also defines the tags for both his English language lexicon and the Medpost lexicon. The following table shows the most commonly used tags (see the README.txt file for a complete description).
1 NN singular noun dog
2 NNS plural noun dogs
3 NNP singular proper noun California
4 CC conjunction and, but, or
5 DT determiner the, some
6 IN preposition of, in, by
7 JJ adjective large, small, green
8 PP personal pronoun I, he, you
9 RB adverb slowly
10 RBR comparative adverb slower
11 VB verb eat
12 VBN past participle verb eaten
13 VBG gerund verb eating
14 WP wh* pronoun who, what
15 WDT wh* determiner which, that
Brill’s system worked by processing manually tagged text and then creating a list of words followed by the tags found for each word. Here are a few random lines selected from the test_data/lexicon.txt file:
1 Arco NNP
2 Arctic NNP JJ
3 fair JJ NN RB
Here “Arco” is a proper noun because it is the name of a corporation. The word “Arctic” can be either a proper noun or an adjective; it is used most frequently as a proper noun so the tag “NNP” is listed before “JJ.” The word “fair” can be an adjective, singular noun, or an adverb.
The class Tagger reads the lexicon file either as a resource stream (if, for example, you put lexicon.txt in the same JAR file as the compiled Tagger and Tokenizer class files) or as a local file. Each line in the lexicon.txt file is passed through the utility method parseLine, which processes an input string using the first token in the line as a hash key and places the remaining tokens in an array that becomes the hash value. So, we would process the line “fair JJ NN RB” with a hash key of “fair” and a hash value of the array of strings ["JJ", "NN", "RB"] (only the first value is currently used but I keep the other values for future use).
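A minimal sketch of this lexicon-loading idea (illustrative only, not the exact code in the library) might look like this:

import java.util.HashMap;
import java.util.Map;

public class LexiconLineDemo {
  private static final Map<String, String[]> lexicon = new HashMap<>();

  // Sketch of parseLine: the first token on a line is the word (hash key),
  // the remaining tokens are its possible tags (hash value).
  static void parseLine(String line) {
    String[] tokens = line.trim().split("\\s+");
    if (tokens.length < 2) return;   // skip malformed lines
    String[] tags = new String[tokens.length - 1];
    System.arraycopy(tokens, 1, tags, 0, tags.length);
    lexicon.put(tokens[0], tags);
  }

  public static void main(String[] args) {
    parseLine("fair JJ NN RB");
    // the most likely tag is the first one in the array:
    System.out.println(lexicon.get("fair")[0]);   // JJ
  }
}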
When the tagger is processing a list of word tokens, it looks each token up in the hash table and stores the first possible tag type for the word. In our example, the word “fair” would be assigned (possibly temporarily) the tag “JJ.” We now have a list of word tokens and an associated list of possible tag types. We then loop through all of the word tokens, applying eight transition rules that Eric Brill’s system learned. We will look at the first rule in some detail; i is the loop variable in the range [0, number of word tokens - 1] and word holds the tag currently assigned to the token at index i (despite its name, this variable is compared against tag values in the code):
1 // rule 1: DT, {VBD | VBP} --> DT, NN
2 if (i > 0 && ret.get(i - 1).equals("DT")) {
3 if (word.equals("VBD") ||
4 word.equals("VBP") ||
5 word.equals("VB")) {
6 ret.set(i, "NN");
7 }
8 }
In English, this rule states that if a determiner (DT) at word token index i - 1 is followed by a token tagged as a past tense verb (VBD), present tense verb (VBP), or base form verb (VB), then replace the tag at index i with “NN.”
I list the remaining seven rules in a short syntax here and you can look at the Java source code to see how they are implemented:
1 rule 2: convert a noun to a number (CD) if "."
2 appears in the word
3 rule 3: convert a noun to a past participle if
4 words.get(i) ends with "ed"
5 rule 4: convert any type to adverb if it ends in "ly"
6 rule 5: convert a common noun (NN or NNS) to an
7 adjective if it ends with "al"
8 rule 6: convert a noun to a verb if the preceding
9 word is "would"
10 rule 7: if a word has been categorized as a common
11 noun and it ends with "s", then set its type
12 to plural common noun (NNS)
13 rule 8: convert a common noun to a present participle
14 verb (i.e., a gerund)
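As an illustration of how the remaining rules follow the same pattern as rule 1, here is a sketch (not the exact source) of rule 4, which retags any word ending in “ly” as an adverb; i, words, and ret are the same variables used in the rule 1 snippet:

// rule 4 (sketch): convert any tag to adverb (RB) if the word ends in "ly"
if (words.get(i).endsWith("ly")) {
  ret.set(i, "RB");
}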
My FastTag tagger is not quite as accurate as Brill’s original tagger, so you might want to use his system, which is written in C but can be run from Java as an external process or through a JNI interface.
In the next section we will use the tokenizer, stemmer, and tagger from this section to develop a system for identifying named entities in text.
Named Entity Extraction From Text
In this section we will look at identifying names of people and places in text. This can be useful for automatically tagging news articles with the people and place names that occur in the articles. The “secret sauce” for identifying names and places in text is the data in the file test_data/propername.ser – a serialized Java data file containing hash tables for human and place names. This data is read in the constructor for the class Names; it is worthwhile looking at the code if you have not used the Java serialization APIs before:
1 ObjectInputStream p = new ObjectInputStream(ins);
2 Hashtable lastNameHash = (Hashtable) p.readObject();
3 Hashtable firstNameHash = (Hashtable) p.readObject();
4 Hashtable placeNameHash = (Hashtable) p.readObject();
5 Hashtable prefixHash = (Hashtable) p.readObject();
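The stream ins in this snippet is opened elsewhere in the constructor; a minimal sketch of reading the serialized data as a plain file (assuming the relative path test_data/propername.ser) could look like this:

import java.io.FileInputStream;
import java.io.ObjectInputStream;
import java.util.Hashtable;

public class LoadProperNames {
  public static void main(String[] args) throws Exception {
    // open the serialized data file and read the four hash tables in order
    ObjectInputStream p =
        new ObjectInputStream(new FileInputStream("test_data/propername.ser"));
    Hashtable lastNameHash  = (Hashtable) p.readObject();
    Hashtable firstNameHash = (Hashtable) p.readObject();
    Hashtable placeNameHash = (Hashtable) p.readObject();
    Hashtable prefixHash    = (Hashtable) p.readObject();
    p.close();
    System.out.println("number of place names: " + placeNameHash.size());
  }
}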
If you want to see these data values, use code like this:
1 Enumeration keysE = placeNameHash.keys();
2 while (keysE.hasMoreElements()) {
3 Object key = keysE.nextElement();
4 System.out.println(key + " : " +
5 placeNameHash.get(key));
6 }
A small part of the output from running this code snippet is:
1 Mauritius : country
2 Port-Vila : country_capital
3 Hutchinson : us_city
4 Mississippi : us_state
5 Lithuania : country
Before we look at the entity extraction code and how it works, we will first look at an example of using the main APIs for the Names class. The following example uses the methods isPlaceName, isHumanName, and getProperNames:
1 System.out.println("Los Angeles: " +
2 names.isPlaceName("Los Angeles"));
3 System.out.println("President Bush: " +
4 names.isHumanName("President Bush"));
5 System.out.println("President George Bush: " +
6 names.isHumanName("President George Bush"));
7 System.out.println("President George W. Bush: " +
8 names.isHumanName("President George W. Bush"));
9 ScoredList[] ret = names.getProperNames(
10 "George Bush played golf. President \
11 George W. Bush went to London England, \
12 and Mexico to see Mary \
13 Smith in Moscow. President Bush will \
14 return home Monday.");
15 System.out.println("Human names: " +
16 ret[0].getValuesAsString());
17 System.out.println("Place names: " +
18 ret[1].getValuesAsString());
The output from running this example is:
1 Los Angeles: true
2 President Bush: true
3 President George Bush: true
4 President George W. Bush: true
5 * place name: London,
6 placeNameHash.get(name): country_capital
7 * place name: Mexico,
8 placeNameHash.get(name): country_capital
9 * place name: Moscow,
10 placeNameHash.get(name): country_capital
11 Human names: George Bush:1,
12 President George W . Bush:1,
13 Mary Smith:1,
14 President Bush:1
15 Place names: London:1, Mexico:1, Moscow:1
The complete implementation that you can read through in the source file ExtractNames.java is reasonably simple. The methods isHumanName and isPlaceName simply look up a string in either of the human or place name hash tables. For testing a single word this is very easy; for example:
1 public boolean isPlaceName(String name) {
2 return placeNameHash.get(name) != null;
3 }
The versions of these APIs that handle names containing multiple words are just a little more complicated; we need to construct a string from the words between the starting and ending indices and test to see if this new string value is a valid key in the human names or place names hash tables. Here is the code for finding multi-word place names:
1 public boolean isPlaceName(List<String> words,
2 int startIndex,
3 int numWords) {
4 if ((startIndex + numWords) > words.size()) {
5 return false;
6 }
7 if (numWords == 1) {
8 return isPlaceName(words.get(startIndex));
9 }
10 String s = "";
11 for (int i=startIndex;
12 i<(startIndex + numWords); i++) {
13 if (i < (startIndex + numWords - 1)) {
14 s = s + words.get(i) + " ";
15 } else {
16 s = s + words.get(i);
17 }
18 }
19 return isPlaceName(s);
20 }
This same scheme is used to test for multi-word human names. The top-level utility method getProperNames is used to find human and place names in text. The code in getProperNames is intentionally easy to understand but not very efficient because of all of the temporary test strings that need to be constructed.
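The core idea in getProperNames is a sliding window over the token list: for each starting index, candidate spans of one or more tokens are tested against isHumanName and isPlaceName. The following is a simplified sketch of that scan for place names only; findPlaceNames is a hypothetical helper, not a method of the library, and it assumes java.util.List and java.util.ArrayList are imported:

// Sketch: slide a window of 1..maxWords tokens over the input,
// testing each candidate span against the place name lookup.
static List<String> findPlaceNames(Names names, List<String> tokens, int maxWords) {
  List<String> found = new ArrayList<>();
  for (int start = 0; start < tokens.size(); start++) {
    // try the longest span first so "Los Angeles" wins over just "Los"
    for (int len = maxWords; len >= 1; len--) {
      if (names.isPlaceName(tokens, start, len)) {
        found.add(String.join(" ", tokens.subList(start, start + len)));
        start += len - 1;   // skip past the matched span
        break;
      }
    }
  }
  return found;
}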
Automatically Assigning Categories to Text
Here we will assign zero or more categories like “politics”, “economy”, etc. to text based on the words it contains. While the code for doing this is simple, there is usually much work involved in building a word count database for different classifications. The approach we use here is often called “bag of words” because only the words in the input text matter, not their order or their proximity to other words.
I have been working on open source products for automatic tagging and semantic extraction since the 1990s (see my old web site www.knowledgebooks.com if you are interested). In this section I will show you some simple techniques for automatically assigning tags or categories to text. We will use a set of category tags for which I have collected word frequency statistics. For example, a category of “Java” might be associated with the use of the words “Java,” “JVM,” “Sun,” etc. You can find my pre-trained tag data in the file:
1 test_data/classification_tags.xml
The Java source code for the class AutoTagger is in the file:
1 src-statistical-nlp/com/markwatson/nlp/AutoTagger.java
The AutoTagger class uses a few data structures to keep track of both the names of categories (tags) and the word count statistics for words associated with each tag name. I use a temporary hash table for processing the XML input data:
1 private static Hashtable<String, Hashtable<String, Float>> tagClasses;
The names of categories (tags) used are defined in the XML tag data file: change this file, and you alter both the tags and behavior of this utility class. Please note that the data in this XML file is from a small set of hand-labeled text found on the Web (i.e., my wife and I labelled articles as being about “religion”, “politics”, etc.).
The following listing shows a snippet of data from the XML tag data file describing some of the words (and their scores) associated with the tag “religion_buddhism”:
1 <tags>
2 <topic name="religion_buddhism">
3 <term name="buddhism" score="52" />
4 <term name="buddhist" score="50" />
5 <term name="mind" score="50" />
6 <term name="buddha" score="37" />
7 <term name="practic" score="31" />
8 <term name="teach" score="15" />
9 <term name="path" score="14" />
10 <term name="mantra" score="14" />
11 <term name="thought" score="14" />
12 <term name="school" score="13" />
13 <term name="zen" score="13" />
14 <term name="mahayana" score="13" />
15 <term name="suffer" score="12" />
16 <term name="dharma" score="12" />
17 <term name="tibetan" score="11" />
18 . . .
19 </topic>
20 . . .
21 </tags>
Notice that the term names are stemmed words and all lower case. There are 28 categories (tags) defined in the input XML file included in the GitHub repository for this book.
For data access, I also maintain an array of tag names and an associated list of the word frequency hash tables for each tag name:
1 private static String[] tagClassNames;
2 private static
3 List<Hashtable<String, Float>> hashes =
4 new ArrayList<Hashtable<String, Float>>();
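As a sketch of how such an XML file could be loaded into these data structures using the standard DOM APIs (this is illustrative only; the actual AutoTagger code uses the temporary tagClasses hash table shown earlier):

import java.io.File;
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class TagDataLoader {
  public static void main(String[] args) throws Exception {
    List<String> tagNames = new ArrayList<>();
    List<Hashtable<String, Float>> hashes = new ArrayList<>();
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new File("test_data/classification_tags.xml"));
    NodeList topics = doc.getElementsByTagName("topic");
    for (int i = 0; i < topics.getLength(); i++) {
      Element topic = (Element) topics.item(i);
      tagNames.add(topic.getAttribute("name"));       // e.g., "religion_buddhism"
      Hashtable<String, Float> termScores = new Hashtable<>();
      NodeList terms = topic.getElementsByTagName("term");
      for (int j = 0; j < terms.getLength(); j++) {
        Element term = (Element) terms.item(j);
        termScores.put(term.getAttribute("name"),     // stemmed term
                       Float.parseFloat(term.getAttribute("score")));
      }
      hashes.add(termScores);
    }
    System.out.println("loaded " + tagNames.size() + " topic tags");
  }
}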
The XML data is read and these data structures are filled during static class load time, so creating multiple instances of the class AutoTagger incurs no additional memory use. Except for an empty default class constructor, there is only one public API for this class: the method getTags, which returns the categories for input text:
1 public List<NameValue<String, Float>>
2 getTags(String text) {
To be clear, the tags returned are classification tags like “politics,” “economy,” etc. The utility class NameValue is defined in the file:
1 src-statistical-nlp/com/markwatson/nlp/util/NameValue.java
To determine the tags for input text, we keep a running score for each defined tag type. I use the internal class SFtriple to hold triple values of word, score, and tag index. I choose the tags with the highest scores as the automatically assigned tags for the input text. Scores for each tag are calculated by taking each word in the input text, stemming it, and, if the stem is in the word frequency hash table for the tag, adding the score value from the hash table to the running sum for that tag. You can refer to the AutoTagger.java source code for details. Here is an example use of class AutoTagger:
1 AutoTagger test = new AutoTagger();
2 String s = "The President went to Congress to argue
3 for his tax bill before leaving on a
4 vacation to Las Vegas to see some shows
5 and gamble.";
6 List<NameValue<String, Float>> results =
7 test.getTags(s);
8 for (NameValue<String, Float> result : results) {
9 System.out.println(result);
10 }
The output looks like:
1 [NameValue: news_economy : 1.0]
2 [NameValue: news_politics : 0.84]
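To make the scoring procedure concrete, here is a small sketch of a bag-of-words scorer in the same spirit as getTags; scoreTags is a hypothetical helper (the real implementation lives in AutoTagger.java and also normalizes and sorts the results), and it assumes the java.util collection classes and the Stemmer class are imported:

// Sketch: sum the per-tag scores of the stems that appear in the input text.
// tagClassNames and hashes correspond to the data structures shown earlier.
static Map<String, Float> scoreTags(String text,
                                    String[] tagClassNames,
                                    List<Hashtable<String, Float>> hashes) {
  Map<String, Float> scores = new HashMap<>();
  List<String> stems = new Stemmer().stemString(text);
  for (int i = 0; i < tagClassNames.length; i++) {
    float sum = 0f;
    Hashtable<String, Float> wordScores = hashes.get(i);
    for (String stem : stems) {
      Float score = wordScores.get(stem);
      if (score != null) sum += score;   // this stem contributes to the tag's score
    }
    if (sum > 0f) scores.put(tagClassNames[i], sum);
  }
  return scores;
}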
Text Clustering
Clustering text documents refers to associating similar text documents with each other. The text clustering system that I have written for my own projects is simple and effective, but it is inefficient for large numbers of documents: I perform significant semantic processing on each text document and then compare all combinations of documents, so the runtime performance is O(N²) where N is the number of text documents. If you need to cluster or compare a very large number of documents you will probably want to use a K-Means clustering algorithm instead.
I use a few different algorithms to rate the similarity of any two text documents and I will combine these depending on the requirements of the project that I am working on:
- Calculate the intersection of common words in the two documents.
- Calculate the intersection of common word stems in the two documents.
- Calculate the intersection of tags assigned to the two documents.
- Calculate the intersection of human and place names in the two documents.
In this section we will implement the second option: calculate the intersection of word stems in two documents. Without showing the package and import statements, it takes just a few lines of code to implement this algorithm when we use the Stemmer class.
The following listing shows the implementation of class ComparableDocument with comments. We start by defining constructors for documents defined by a File object and a String object:
1 public class ComparableDocument {
2 // disable default constructor calls:
3 private ComparableDocument() { }
4 public ComparableDocument(File document)
5 throws FileNotFoundException {
6 this(new Scanner(document).
7 useDelimiter("\\Z").next());
8 }
9 public ComparableDocument(String text) {
10 List<String> stems =
11 new Stemmer().stemString(text);
12 for (String stem : stems) {
13 stem_count++;
14 if (stemCountMap.containsKey(stem)) {
15 Integer count = stemCountMap.get(stem);
16 stemCountMap.put(stem, 1 + count);
17 } else {
18 stemCountMap.put(stem, 1);
19 }
20 }
21 }
In the last constructor, I simply create a count of how many times each stem occurs in the document.
The public API allows us to get the stem count hash table, the number of stems in the original document, and a numeric comparison value for comparing this document with another (this is the first version – we will add an improvement later):
1 public Map<String, Integer> getStemMap() {
2 return stemCountMap;
3 }
4 public int getStemCount() {
5 return stem_count;
6 }
7 public float
8 compareTo(ComparableDocument otherDocument) {
9 long count = 0;
10 Map<String,Integer> map2 = otherDocument.getStemMap();
11 Iterator iter = stemCountMap.keySet().iterator();
12 while (iter.hasNext()) {
13 Object key = iter.next();
14 Integer count1 = stemCountMap.get(key);
15 Integer count2 = map2.get(key);
16 if (count1!=null && count2!=null) {
17 count += count1 * count2;
18 }
19 }
20 return (float) Math.sqrt(
21 ((float)(count*count) /
22 (double)(stem_count *
23 otherDocument.getStemCount())))
24 / 2f;
25 }
26 private Map<String, Integer> stemCountMap =
27 new HashMap<String, Integer>();
28 private int stem_count = 0;
29 }
I normalize the return value for the method compareTo to return a value of 1.0 if compared documents are identical (after stemming) and 0.0 if they contain no common stems. There are four test text documents in the test_data directory and the following test code compares various combinations. Note that I am careful to test the case of comparing identical documents:
1 ComparableDocument news1 =
2     new ComparableDocument(new File("testdata/news_1.txt"));
3 ComparableDocument news2 =
4     new ComparableDocument(new File("testdata/news_2.txt"));
5 ComparableDocument econ1 =
6     new ComparableDocument(new File("testdata/economy_1.txt"));
7 ComparableDocument econ2 =
8     new ComparableDocument(new File("testdata/economy_2.txt"));
9 System.out.println("news 1 - news1: " +
10 news1.compareTo(news1));
11 System.out.println("news 1 - news2: " +
12 news1.compareTo(news2));
13 System.out.println("news 2 - news2: " +
14 news2.compareTo(news2));
15 System.out.println("news 1 - econ1: " +
16 news1.compareTo(econ1));
17 System.out.println("econ 1 - econ1: " +
18 econ1.compareTo(econ1));
19 System.out.println("news 1 - econ2: " +
20 news1.compareTo(econ2));
21 System.out.println("econ 1 - econ2: " +
22 econ1.compareTo(econ2));
23 System.out.println("econ 2 - econ2: " +
24 econ2.compareTo(econ2));
The following listing shows output that indicates mediocre results; we will soon make an improvement that makes the results better. The output for this test code is:
1 news 1 - news1: 1.0
2 news 1 - news2: 0.4457711
3 news 2 - news2: 1.0
4 news 1 - econ1: 0.3649214
5 econ 1 - econ1: 1.0
6 news 1 - econ2: 0.32748842
7 econ 1 - econ2: 0.42922822
8 econ 2 - econ2: 1.0
There is not as much differentiation in comparison scores between political news stories and economic news stories. What is up here? The problem is that I did not remove common words (and therefore common word stems) when creating stem counts for each document. I wrote a utility class NoiseWords for identifying both common words and their stems; you can see the implementation in the file NoiseWords.java. Removing noise words improves the comparison results (I added a few tests since the last printout):
1 news 1 - news1: 1.0
2 news 1 - news2: 0.1681978
3 news 1 - econ1: 0.04279895
4 news 1 - econ2: 0.034234844
5 econ 1 - econ2: 0.26178515
6 news 2 - econ2: 0.106673114
7 econ 1 - econ2: 0.26178515
Much better results! The API for com.markwatson.nlp.util.NoiseWords is a single static method:
1 public static boolean checkFor(String stem)
You can add additional noise words to the data section in the file NoiseWords.java, depending on your application.
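As a final note on this section: given compareTo scores, one simple (and admittedly naive) way to actually form clusters is to group documents whose similarity to a cluster’s first member exceeds a threshold. Here is a sketch under that assumption; cluster is a hypothetical helper, and it assumes the java.util collection classes are imported:

// Sketch: greedy threshold clustering over ComparableDocument instances.
// A document joins the first cluster whose seed document is similar enough;
// otherwise it starts a new cluster.
static List<List<ComparableDocument>> cluster(List<ComparableDocument> docs,
                                              float threshold) {
  List<List<ComparableDocument>> clusters = new ArrayList<>();
  for (ComparableDocument doc : docs) {
    boolean placed = false;
    for (List<ComparableDocument> c : clusters) {
      if (doc.compareTo(c.get(0)) >= threshold) {  // compare against the cluster seed
        c.add(doc);
        placed = true;
        break;
      }
    }
    if (!placed) {
      List<ComparableDocument> newCluster = new ArrayList<>();
      newCluster.add(doc);
      clusters.add(newCluster);
    }
  }
  return clusters;
}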
Wrapup
This chapter contains my own experiments with ad hoc NLP techniques that I often find useful in my work.
In the next chapter, Natural Language Processing Using OpenNLP, we use the Apache OpenNLP library, which I also often use in my work.