Introduction to Natural Language Processing

I have been working in the field of Natural Language Processing (NLP) since 1985, so I ‘lived through’ the revolutionary change that NLP has undergone since 2014: deep learning results began to outclass the results of previous symbolic methods.

I will not cover older symbolic methods of NLP here; rather, I refer you to my previous books Practical Artificial Intelligence Programming With Java, Loving Common Lisp, The Savvy Programmer’s Secret Weapon, and Haskell Tutorial and Cookbook for examples. We get better results using Deep Learning (DL) for NLP, and the spaCy library (https://spacy.io) that we use in this chapter provides near state-of-the-art performance. The authors of spaCy frequently update it to use the latest breakthroughs in the field.

You will learn how to apply both DL and NLP by using the state-of-the-art, full-featured library spaCy. This chapter concentrates on using spaCy from the Hy language to solve a few selected NLP problems that I use in my own work. I urge you to also review the “Guides” section of the spaCy documentation. The examples there are in Python, but after experimenting with the examples in this chapter you should have no difficulty translating any spaCy Python example to the Hy language.

If you have not already done so, install the spaCy library and the full English language model:

pip install spacy
python -m spacy download en

You can use a smaller model (which requires loading “en_core_web_sm” instead of “en” in the following examples):

pip install spacy
python -m spacy download en_core_web_sm

Exploring the spaCy Library

We will use the Hy REPL to experiment with spaCy, Lisp style. The following REPL listings are all from the same session, split into separate listings so that I can talk you through the examples:

 1 Marks-MacBook:nlp $ hy
 2 hy 0.17.0+108.g919a77e using CPython(default) 3.7.3 on Darwin
 3 => (import spacy)
 4 => (setv nlp-model (spacy.load "en"))
 5 => (setv doc (nlp-model "President George Bush went to Mexico and he had a very good\
 6  meal"))
 7 => doc
 8 President George Bush went to Mexico and he had a very good meal
 9 => (dir doc)
10 ['_', '__bytes__', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__fo\
11 rmat__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init_\
12 _', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__ne__', '__new\
13 __', '__pyx_vtable__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__\
14 setstate__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', '_bulk_merg\
15 e', '_py_tokens', '_realloc', '_vector', '_vector_norm', 'cats', 'char_span', 'count\
16 _by', 'doc', 'ents', 'extend_tensor', 'from_array', 'from_bytes', 'from_disk', 'get_\
17 extension', 'get_lca_matrix', 'has_extension', 'has_vector', 'is_nered', 'is_parsed'\
18 , 'is_sentenced', 'is_tagged', 'lang', 'lang_', 'mem', 'merge', 'noun_chunks', 'noun\
19 _chunks_iterator', 'print_tree', 'remove_extension', 'retokenize', 'sentiment', 'sen\
20 ts', 'set_extension', 'similarity', 'tensor', 'text', 'text_with_ws', 'to_array', 't\
21 o_bytes', 'to_disk', 'to_json', 'user_data', 'user_hooks', 'user_span_hooks', 'user_\
22 token_hooks', 'vector', 'vector_norm', 'vocab']

In lines 3-6 we import the spaCy library, load the English language model, and create a document from input text. What is a spaCy document? In line 9 we use the standard Python function dir to look at all names and functions defined for the object doc returned by applying a spaCy model to a string containing text. The printed value shows many built-in “dunder” (double underscore) attributes, and we can filter these out.

In lines 23-26 we use the dir function again to see the attributes and methods of this class, this time filtering out any attributes containing the characters “__”:

23 => (lfor
24 ... x (dir doc)
25 ... :if (not (.startswith x "__"))
26 ... x)
27 ['_', '_bulk_merge', '_py_tokens', '_realloc', '_vector', '_vector_norm', 'cats', 'c\
28 har_span', 'count_by', 'doc', 'ents', 'extend_tensor', 'from_array', 'from_bytes', '\
29 from_disk', 'get_extension', 'get_lca_matrix', 'has_extension', 'has_vector', 'is_ne\
30 red', 'is_parsed', 'is_sentenced', 'is_tagged', 'lang', 'lang_', 'mem', 'merge', 'no\
31 un_chunks', 'noun_chunks_iterator', 'print_tree', 'remove_extension', 'retokenize', \
32 'sentiment', 'sents', 'set_extension', 'similarity', 'tensor', 'text', 'text_with_ws\
33 ', 'to_array', 'to_bytes', 'to_disk', 'to_json', 'user_data', 'user_hooks', 'user_sp\
34 an_hooks', 'user_token_hooks', 'vector', 'vector_norm', 'vocab']
35 =>
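For comparison, the Hy lfor form above corresponds directly to a Python list comprehension. Here is the same dunder-filtering pattern applied to a plain Python object (str is used so the snippet stands alone and does not require spaCy):

```python
# Filter out "dunder" (double underscore) attribute names, as done with
# lfor in the Hy REPL session above. Any object works; str is used here
# so the snippet does not depend on spaCy.
attrs = [x for x in dir(str) if not x.startswith("__")]
print(attrs[:5])
```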

The to_json method looks promising, so we import the Python pretty-print library and look at the pretty-printed result of calling the to_json method on the document stored in doc:

 36 => (import [pprint [pprint]])
 37 => (pprint (doc.to_json))
 38 {'ents': [{'end': 21, 'label': 'PERSON', 'start': 10},
 39           {'end': 36, 'label': 'GPE', 'start': 30}],
 40  'sents': [{'end': 64, 'start': 0}],
 41  'text': 'President George Bush went to Mexico and he had a very good meal',
 42  'tokens': [{'dep': 'compound',
 43              'end': 9,
 44              'head': 2,
 45              'id': 0,
 46              'pos': 'PROPN',
 47              'start': 0,
 48              'tag': 'NNP'},
 49             {'dep': 'compound',
 50              'end': 16,
 51              'head': 2,
 52              'id': 1,
 53              'pos': 'PROPN',
 54              'start': 10,
 55              'tag': 'NNP'},
 56             {'dep': 'nsubj',
 57              'end': 21,
 58              'head': 3,
 59              'id': 2,
 60              'pos': 'PROPN',
 61              'start': 17,
 62              'tag': 'NNP'},
 63             {'dep': 'ROOT',
 64              'end': 26,
 65              'head': 3,
 66              'id': 3,
 67              'pos': 'VERB',
 68              'start': 22,
 69              'tag': 'VBD'},
 70             {'dep': 'prep',
 71              'end': 29,
 72              'head': 3,
 73              'id': 4,
 74              'pos': 'ADP',
 75              'start': 27,
 76              'tag': 'IN'},
 77             {'dep': 'pobj',
 78              'end': 36,
 79              'head': 4,
 80              'id': 5,
 81              'pos': 'PROPN',
 82              'start': 30,
 83              'tag': 'NNP'},
 84             {'dep': 'cc',
 85              'end': 40,
 86              'head': 3,
 87              'id': 6,
 88              'pos': 'CCONJ',
 89              'start': 37,
 90              'tag': 'CC'},
 91             {'dep': 'nsubj',
 92              'end': 43,
 93              'head': 8,
 94              'id': 7,
 95              'pos': 'PRON',
 96              'start': 41,
 97              'tag': 'PRP'},
 98             {'dep': 'conj',
 99              'end': 47,
100              'head': 3,
101              'id': 8,
102              'pos': 'VERB',
103              'start': 44,
104              'tag': 'VBD'},
105             {'dep': 'det',
106              'end': 49,
107              'head': 12,
108              'id': 9,
109              'pos': 'DET',
110              'start': 48,
111              'tag': 'DT'},
112             {'dep': 'advmod',
113              'end': 54,
114              'head': 11,
115              'id': 10,
116              'pos': 'ADV',
117              'start': 50,
118              'tag': 'RB'},
119             {'dep': 'amod',
120              'end': 59,
121              'head': 12,
122              'id': 11,
123              'pos': 'ADJ',
124              'start': 55,
125              'tag': 'JJ'},
126             {'dep': 'dobj',
127              'end': 64,
128              'head': 8,
129              'id': 12,
130              'pos': 'NOUN',
131              'start': 60,
132              'tag': 'NN'}]}
133 => 

The JSON data is nested dictionaries. In a later chapter on Knowledge Graphs, we will want to get named entities like people, organizations, etc. from text and use this information to automatically generate data for Knowledge Graphs. The values for the key ents (short for “entities”) will be useful. Notice that each entity in the original text is specified by its beginning and ending character offsets (the values of start and end in lines 38 and 39).
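To make the character offsets concrete, here is a small Python sketch (the text and ents value are hard-coded from the listing above, so it does not require spaCy) that slices the entity strings out of the text using the start and end values:

```python
# The text and "ents" value from the to_json output above, hard-coded.
text = "President George Bush went to Mexico and he had a very good meal"
ents = [{"end": 21, "label": "PERSON", "start": 10},
        {"end": 36, "label": "GPE", "start": 30}]

# Slice each entity's surface string out of the original text.
names = [(text[e["start"]:e["end"]], e["label"]) for e in ents]
print(names)  # [('George Bush', 'PERSON'), ('Mexico', 'GPE')]
```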

The values for the key tokens, listed on lines 42-132, contain for each token its starting and ending character offsets (start and end), the index of its syntactic head token (head), the token number (id), and the part of speech (pos and tag). We will list what the parts of speech mean later.
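To illustrate how these token fields fit together, the following Python sketch (the first four tokens are hard-coded from the listing above, so spaCy is not required) recovers each token’s word and its syntactic head from the start, end, id, and head values:

```python
text = "President George Bush went to Mexico and he had a very good meal"
# The first four entries of "tokens" from the to_json output above.
tokens = [
    {"dep": "compound", "end": 9,  "head": 2, "id": 0, "start": 0},
    {"dep": "compound", "end": 16, "head": 2, "id": 1, "start": 10},
    {"dep": "nsubj",    "end": 21, "head": 3, "id": 2, "start": 17},
    {"dep": "ROOT",     "end": 26, "head": 3, "id": 3, "start": 22},
]

# start/end are character offsets; head is the id of the head token.
words = [text[t["start"]:t["end"]] for t in tokens]
triples = [(words[t["id"]], t["dep"], words[t["head"]]) for t in tokens]
print(words)    # ['President', 'George', 'Bush', 'went']
print(triples)  # e.g. ('Bush', 'nsubj', 'went'): "Bush" is the subject of "went"
```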

We would like the words for each entity to be concatenated into a single string per entity; we do this in lines 136-137 and see the results in lines 138-139.

I like to add the entity name strings back into the dictionary representing a document and line 140 shows the use of lfor to create a list of lists where the sublists contain the entity name as a single string and the type of entity. We list the entity types supported by spaCy in the next section.

134 => doc.ents
135 (George Bush, Mexico)
136 => (for [entity doc.ents]
137 ... (print "entity text:" entity.text "entity label:" entity.label_))
138 entity text: George Bush entity label: PERSON
139 entity text: Mexico entity label: GPE
140 => (lfor entity doc.ents [entity.text entity.label_])
141 [['George Bush', 'PERSON'], ['Mexico', 'GPE']]
142 => 

We can also access each sentence separately. In this example the original text used to create our sample document contained only a single sentence, so the sents property yields just one sentence:

147 => (list doc.sents)
148 [President George Bush went to Mexico and he had a very good meal]
149 => 

The last example showing how to use a spaCy document object is listing each word with its part of speech:

150 => (for [word doc]
151 ... (print word.text word.pos_))
152 President PROPN
153 George PROPN
154 Bush PROPN
155 went VERB
156 to ADP
157 Mexico PROPN
158 and CCONJ
159 he PRON
160 had VERB
161 a DET
162 very ADV
163 good ADJ
164 meal NOUN
165 => 

The following list shows the definitions for the part of speech (POS) tags:

  • ADJ: adjective
  • ADP: adposition
  • ADV: adverb
  • AUX: auxiliary verb
  • CCONJ: coordinating conjunction
  • DET: determiner
  • INTJ: interjection
  • NOUN: noun
  • NUM: numeral
  • PART: particle
  • PRON: pronoun
  • PROPN: proper noun
  • PUNCT: punctuation
  • SCONJ: subordinating conjunction
  • SYM: symbol
  • VERB: verb
  • X: other
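If you want these descriptions available programmatically, the glossary above can be captured in a simple lookup table (this table is my own convenience sketch, not part of spaCy’s API; spaCy’s spacy.explain function provides similar descriptions):

```python
# The POS tag glossary above as a Python dictionary. A convenience for
# this chapter only, not part of spaCy's API.
POS_DESCRIPTIONS = {
    "ADJ": "adjective",         "ADP": "adposition",
    "ADV": "adverb",            "AUX": "auxiliary verb",
    "CCONJ": "coordinating conjunction",
    "DET": "determiner",        "INTJ": "interjection",
    "NOUN": "noun",             "NUM": "numeral",
    "PART": "particle",         "PRON": "pronoun",
    "PROPN": "proper noun",     "PUNCT": "punctuation",
    "SCONJ": "subordinating conjunction",
    "SYM": "symbol",            "VERB": "verb",
    "X": "other",
}

print(POS_DESCRIPTIONS["PROPN"])  # proper noun
```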

Implementing a HyNLP Wrapper for the Python spaCy Library

We will generate two libraries (in files nlp_lib.hy and coref_nlp_lib.hy). The first is a general NLP library and the second specifically solves the anaphora resolution, or coreference, problem. There are test programs for each library in the files nlp_example.hy and coref_example.hy.

For an example in a later chapter, we will use the library developed here to automatically generate Knowledge Graphs from text data. We will need the ability to find the names of people, companies, locations, etc. in text, and we use spaCy here to do this. The types of named entities on which spaCy is pre-trained include:

  • CARDINAL: any number that is not identified as a more specific type, like money, time, etc.
  • DATE
  • FAC: facilities like highways, bridges, airports, etc.
  • GPE: Countries, states (or provinces), and cities
  • LOC: any non-GPE location
  • PRODUCT
  • EVENT
  • LANGUAGE: any named language
  • MONEY: any monetary value or unit of money
  • NORP: nationalities or religious groups
  • ORG: any organization like a company, non-profit, school, etc.
  • PERCENT: a percentage, including the percent “%” character
  • PERSON
  • ORDINAL: an ordinal like “first”, “second”, etc.
  • TIME
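Once text has been reduced to [text, label] entity pairs (as the wrapper library below will do), grouping entities by type is straightforward. Here is a small Python sketch, with the pairs hard-coded from the earlier example:

```python
from collections import defaultdict

def group_entities(pairs):
    """Group [entity-text, entity-label] pairs by their label."""
    groups = defaultdict(list)
    for text, label in pairs:
        groups[label].append(text)
    return dict(groups)

print(group_entities([["George Bush", "PERSON"], ["Mexico", "GPE"]]))
# {'PERSON': ['George Bush'], 'GPE': ['Mexico']}
```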

Listing for hy-lisp-python/nlp/nlp_lib.hy:

 1 (import spacy)
 2 
 3 (setv nlp-model (spacy.load "en"))
 4 
 5 (defn nlp [some-text]
 6   (setv doc (nlp-model some-text))
 7   (setv entities (lfor entity doc.ents [entity.text entity.label_]))
 8   (setv j (doc.to_json))
 9   (setv (get j "entities") entities)
10   j)
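For reference, the body of the Hy nlp function corresponds to this Python sketch. The model call is factored out here so the snippet stands alone; in practice doc would be the result of applying a loaded spaCy model to the input text:

```python
def doc_to_dict(doc):
    """Python equivalent of the Hy nlp function above: call to_json and
    add an "entities" key holding [text, label] pairs from doc.ents.
    Here doc is a spaCy Doc, e.g. doc = nlp_model(some_text)."""
    j = doc.to_json()
    j["entities"] = [[ent.text, ent.label_] for ent in doc.ents]
    return j
```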

Listing for hy-lisp-python/nlp/nlp_example.hy:

1 #!/usr/bin/env hy
2 
3 (import [nlp-lib [nlp]])
4 
5 (print
6   (nlp "President George Bush went to Mexico and he had a very good meal"))
7 
8 (print
9   (nlp "Lucy threw a ball to Bill and he caught it"))
The output will look like (most of the token output is not shown):

1 Marks-MacBook:nlp $ ./nlp_example.hy
2 {'text': 'President George Bush went to Mexico and he had a very good meal', 'ents':\
3  [{'start': 10, 'end': 21, 'label': 'PERSON'}, {'start': 30, 'end': 36, 'label': 'GP\
4 E'}], 'sents': [{'start': 0, 'end': 64}], 'tokens': 
5 
6   ..LOTS OF OUTPUT NOT SHOWN..

Coreference (Anaphora Resolution)

Another common NLP task is coreference resolution (also called anaphora resolution), which is the process of resolving pronouns in text (e.g., he, she, it) to the preceding proper nouns they refer to. A simple example would be translating “John ran fast and he fell” to “John ran fast and John fell.” This is an easy example, but often the proper nouns that pronouns refer to appear in previous sentences, and resolving coreference can be ambiguous, requiring knowledge of common word use and grammar. This problem is now handled by deep learning transfer models like BERT.

In addition to installing spaCy you also need the library neuralcoref. Only specific versions of spaCy and neuralcoref are compatible with each other. As of July 31, 2020, the following works to get the dependencies and run the example for this section:

pip uninstall spacy neuralcoref
pip install spacy==2.1.3
python -m spacy download en
pip install neuralcoref==4.0.0
./coref_example.hy 

Please note that version 2.1.3 of spaCy is older than the default version that pip installs. You might want to create a new Python virtual environment for this example or, if you use Anaconda, use a separate Anaconda environment.

Listing of coref_nlp_lib.hy contains a wrapper for spaCy’s coreference model:

 1 (import argparse os)
 2 (import spacy neuralcoref)
 3 
 4 (setv nlp2 (spacy.load "en"))
 5 (neuralcoref.add_to_pipe nlp2)
 6 
 7 (defn coref-nlp [some-text]
 8   (setv doc (nlp2 some-text))
 9   { "corefs" doc._.coref_resolved
10     "clusters" doc._.coref_clusters
11     "scores" doc._.coref_scores})
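In Python the same wrapper would look roughly like the sketch below. It assumes neuralcoref has been added to the pipeline as above; the dictionary is read from the doc._ extension attributes that neuralcoref sets:

```python
def coref_to_dict(doc):
    """Python equivalent of the Hy coref-nlp function above: collect
    neuralcoref's extension attributes into a plain dictionary.
    Here doc is the result of applying the pipeline, e.g. doc = nlp2(text)."""
    return {"corefs": doc._.coref_resolved,
            "clusters": doc._.coref_clusters,
            "scores": doc._.coref_scores}
```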

Listing of coref_example.hy shows code to test the Hy spaCy and coreference wrapper:

1 #!/usr/bin/env hy
2 
3 (import [coref-nlp-lib [coref-nlp]])
4 
5 ;; tests:
6 (print (coref-nlp "President George Bush went to Mexico and he had a very good meal"\
7 ))
8 (print (coref-nlp "Lucy threw a ball to Bill and he caught it"))

The output will look like:

 1 Marks-MacBook:nlp $ ./coref_example.hy 
 2 {'corefs': 'President George Bush went to Mexico and President George Bush had a ver\
 3 y good meal', 'clusters': [President George Bush: [President George Bush, he]], 'sco\
 4 res': {President George Bush: {President George Bush: 1.5810412168502808}, George Bu\
 5 sh: {George Bush: 4.11817741394043, President George Bush: -1.546141266822815}, Mexi\
 6 co: {Mexico: 1.4138349294662476, President George Bush: -4.650205612182617, George B\
 7 ush: -3.666614532470703}, he: {he: -0.5704692006111145, President George Bush: 9.385\
 8 97583770752, George Bush: -1.4178757667541504, Mexico: -3.6565260887145996}, a very \
 9 good meal: {a very good meal: 1.652894377708435, President George Bush: -2.554375886\
10 9171143, George Bush: -2.13267183303833, Mexico: -1.6889561414718628, he: -2.7667927\
11 742004395}}}
12 
13 {'corefs': 'Lucy threw a ball to Bill and Bill caught a ball', 'clusters': [a ball: \
14 [a ball, it], Bill: [Bill, he]], 'scores': {Lucy: {Lucy: 0.41820740699768066}, a bal\
15 l: {a ball: 1.8033190965652466, Lucy: -2.721518039703369}, Bill: {Bill: 1.5611814260\
16 482788, Lucy: -2.8222298622131348, a ball: -1.806389570236206}, he: {he: -0.57600766\
17 42036438, Lucy: 3.054243326187134, a ball: -1.818403720855713, Bill: 3.0774276256561\
18 28}, it: {it: -1.0269954204559326, Lucy: -3.4972281455993652, a ball: -0.31290221214\
19 294434, Bill: -2.5343685150146484, he: -3.6687228679656982}}}

To recap: anaphora resolution, also called coreference resolution, identifies when two or more words or phrases in an input text refer to the same noun. This analysis usually entails identifying which noun phrases the pronouns in the text refer to.

Wrap-up

I spent several years of development time during the period from 1984 through 2015 working on natural language processing technology, and as a personal side project I sold commercial NLP libraries, written on my own time, in Ruby and Common Lisp. The state of the art of Deep Learning-enhanced NLP is very good, and the open source spaCy library makes excellent use of both conventional NLP technology and pre-trained Deep Learning models. I no longer spend much time writing my own NLP libraries; instead I use spaCy.

I urge you to read through the spaCy documentation: we covered just the basic functionality here that we will also need in the later chapter on automatically generating data for Knowledge Graphs. After working through the interactive REPL sessions and the examples in this chapter, you should be able to translate any Python API example code to Hy.