Natural Language Processing Tools
The tools developed in this chapter are modules you can reuse in your own programs. We will develop a command line program that reads lines of text from STDIN and writes semantic information to STDOUT. I have used it from a Ruby program by piping input text to a forked process and reading back a semantic representation of the input text.
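As a minimal sketch of that pattern (this code is not part of the chapter's project; it assumes the executable NlpTool-exe built later in this chapter is on your PATH, and that the tool exits when its input pipe is closed):

import System.IO
import System.Process

-- Illustrative only: spawn NlpTool-exe, write one line of text to its STDIN,
-- close the pipe so the tool sees end-of-input, and print everything the tool
-- wrote to its STDOUT.
main :: IO ()
main = do
  (Just hIn, Just hOut, _, _) <-
    createProcess ((proc "NlpTool-exe" []) { std_in  = CreatePipe
                                           , std_out = CreatePipe })
  hPutStrLn hIn "Canada and England signed a trade deal."
  hClose hIn
  hGetContents hOut >>= putStr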
Note: we previously saw a small application of the OpenAI completion LLMs to find place names in input text. We could replace most of the examples in this chapter with calls to an LLM completion API using NLP-specific prompts.
We will use this example as an external dependency of a later example in the chapter Knowledge Graph Creator.
A few of the data files I provide in this example are fairly large. For example, the file PeopleDbPedia.hs, which builds a map from people's names to the Wikipedia/DBPedia URIs for information about them, is 2.5 megabytes in size. The first time you run stack build in the project directory it will take a while, so you might want to start the build in the directory NlpTool and let it run while you read this chapter.
Here are three examples using the NlpTool command line application developed in this chapter:
Enter text (all on one line)
Canada and England signed a trade deal.
category: economics
summary: Canada and England signed a trade deal.
countries: [["Canada","<http://dbpedia.org/resource/Canada>"],
["England","<http://dbpedia.org/resource/England>"]]
Enter text (all on one line)
President George W Bush asked Congress for permission to invade Iraq.
category: news_war
summary: President George W Bush asked Congress for permission to invade Iraq.
people: [["George W Bush","<http://dbpedia.org/resource/George_W._Bush>"]]
countries: [["Iraq",""]]
Enter text (all on one line)
The British government is facing criticism from business groups over statements suggesting the U.K. is heading for a hard divorce from the European Union — and pressure from lawmakers who want Parliament to have a vote on the proposed exit terms. The government's repeated emphasis on controlling immigration sent out "signs that the door is being closed, to an extent, on the open economy, that has helped fuel investment," the head of employers' group the Confederation of British Industry, Carolyn Fairbairn, said in comments published Monday. Prime Minister Theresa May said last week that Britain would seek to retain a close relationship with the 28-nation bloc, with continued free trade in goods and services. But she said the U.K. wouldn't cede control over immigration, a conflict with the EU's principle of free movement among member states.
category: economics
summary: Prime Minister Theresa May said last week that Britain would seek to retain a close relationship with the 28-nation bloc, with continued free trade in goods and services.
credit: news text from abcnews.com
Resolve Entities in Text to DBPedia URIs
The code for this application is in the directory NlpTool.
The software and data in this chapter can be used under the terms of either the GPL version 3 license or the Apache 2 license.
There are several automatically generated, Haskell-formatted data files that I created using Ruby scripts operating on Wikipedia/DBPedia data. For the purposes of this book I include these data-specific files for your use and enjoyment, but we won't spend much time discussing them. These files are:
- BroadcastNetworkNamesDbPedia.hs
- CityNamesDbpedia.hs
- CompanyNamesDbpedia.hs
- CountryNamesDbpedia.hs
- PeopleDbPedia.hs
- PoliticalPartyNamesDbPedia.hs
- TradeUnionNamesDbPedia.hs
- UniversityNamesDbPedia.hs
As an example, let’s look at a small sample of data in PeopleDbPedia.hs:
1 module PeopleDbPedia (peopleMap) where
2
3 import qualified Data.Map as M
4
5 peopleMap = M.fromList [
6 ("Aaron Sorkin", "<http://dbpedia.org/resource/Aaron_Sorkin>"),
7 ("Bill Clinton", "<http://dbpedia.org/resource/Bill_Clinton>"),
8 ("George W Bush", "<http://dbpedia.org/resource/George_W_Bush>"),
There are 35,146 names in the file PeopleDbPedia.hs. For eight different types of entity names I have built Haskell maps that take an entity name (a String) and map it to the relevant DBPedia URI. Simple in principle, but a lot of work preparing the data. As I mentioned, we will use these data-specific files to resolve entity references in text.
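Using one of these maps is an ordinary Data.Map lookup. Here is a minimal sketch (assuming the module compiles as part of the NlpTool project):

import qualified Data.Map as M
import PeopleDbPedia (peopleMap)

main :: IO ()
main = do
  -- a hit returns the DBPedia URI wrapped in Just
  print $ M.lookup "Bill Clinton" peopleMap
  -- Just "<http://dbpedia.org/resource/Bill_Clinton>"
  print $ M.lookup "No Such Person" peopleMap
  -- Nothing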
The next listing shows the file Entities.hs. In lines 23-37 I import the entity mapping modules I just described. In this example and later code I make heavy use of the Data.Map and Data.Set modules from the containers library (see the NlpTool.cabal file).
The operator isSubsetOf defined in line 39 tests whether every element of one collection is contained in another. The built-in function all applies a predicate to every element of a collection and returns True only if the predicate returns True for each element.
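For example, in a hypothetical GHCi session with this module loaded:

*Entities> ["a", "b"] `isSubsetOf` ["c", "b", "a"]
True
*Entities> ["a", "z"] `isSubsetOf` ["c", "b", "a"]
False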
The local utility function namesHelper defined in lines 41-53 is simpler than it looks. The call to filter in line 42 applies the inline predicate in lines 43-45 (which returns True only for pairs whose second element holds a Just value) to a second list built in lines 46-53. This second list is calculated by mapping an inline function over the input argument ngrams. That inline function looks up each ngram in a DBPedia map (the second function argument): if the lookup succeeds it returns the ngram paired with its URI; otherwise it checks whether the ngram is a member of a word set (the last function argument), pairing it with an empty URI on a hit and returning Nothing on a miss.
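To make the return shape concrete, here is a hypothetical trace with toy arguments (not the real data files):

-- Assume a toy dbPediaMap containing only "Canada" and a toy wordMap
-- (a Data.Set) containing only "Iraq":
--
--   namesHelper ["Canada", "Iraq", "xyzzy"] dbPediaMap wordMap
--     => [ ("Canada", Just ("Canada", Just "<http://dbpedia.org/resource/Canada>"))
--        , ("Iraq",   Just ("Iraq",   Just "")) ]
--
-- "xyzzy" produces Nothing and is dropped by the filter. Callers such as
-- countryNames then project each element with \(_, Just (a, Just b)) -> (a, b),
-- yielding [("Canada","<http://dbpedia.org/resource/Canada>"), ("Iraq","")].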
The utility function namesHelper is then used to define functions to recognize company names, country names, people names, city names, broadcast network names, political party names, trade union names, and university names:
1 -- Copyright 2014 by Mark Watson. All rights reserved. The software and data in this\
2  project can be used under the terms of either the GPL version 3 license or the Apac\
3 he 2 license.
4
5 module Entities (companyNames, peopleNames,
6                  countryNames, cityNames, broadcastNetworkNames,
7                  politicalPartyNames, tradeUnionNames, universityNames) where
8
9 import qualified Data.Map as M
10 import qualified Data.Set as S
11 import Data.Char (toLower)
12 import Data.List (sort, intersect, intersperse)
13 import Data.Set (empty)
14 import Data.Maybe (isJust)
15
16 import Utils (splitWords, bigram, bigram_s, splitWordsKeepCase,
17               trigram, trigram_s, removeDuplicates)
18
19 import FirstNames (firstNames)
20 import LastNames (lastNames)
21 import NamePrefixes (namePrefixes)
22
23 import PeopleDbPedia (peopleMap)
24
25 import CountryNamesDbpedia (countryMap)
26 import CountryNames (countryNamesOneWord, countryNamesTwoWords, countryNamesThreeWor\
27 ds)
28
29 import CompanyNamesDbpedia (companyMap)
30 import CompanyNames (companyNamesOneWord, companyNamesTwoWords, companyNamesThreeWor\
31 ds)
32 import CityNamesDbpedia (cityMap)
33
34 import BroadcastNetworkNamesDbPedia (broadcastNetworkMap)
35 import PoliticalPartyNamesDbPedia (politicalPartyMap)
36 import TradeUnionNamesDbPedia (tradeUnionMap)
37 import UniversityNamesDbPedia (universityMap)
38
39 xs `isSubsetOf` ys = all (`elem` ys) xs
40
41 namesHelper ngrams dbPediaMap wordMap =
42   filter
43     (\x -> case x of
44              (_, Just x) -> True
45              _           -> False) $
46     map (\ngram -> (ngram,
47                     let v = M.lookup ngram dbPediaMap in
48                       if isJust v
49                         then return (ngram, v)
50                         else if S.member ngram wordMap
51                                then Just (ngram, Just "")
52                                else Nothing))
53         ngrams
54
55 helperNames1W = namesHelper
56
57 helperNames2W wrds = namesHelper (bigram_s wrds)
58
59 helperNames3W wrds = namesHelper (trigram_s wrds)
60
61 companyNames wrds =
62   let cns = removeDuplicates $ sort $
63               helperNames1W wrds companyMap companyNamesOneWord ++
64               helperNames2W wrds companyMap companyNamesTwoWords ++
65               helperNames3W wrds companyMap companyNamesThreeWords in
66   map (\(s, Just (a,Just b)) -> (a,b)) cns
67
68 countryNames wrds =
69   let cns = removeDuplicates $ sort $
70               helperNames1W wrds countryMap countryNamesOneWord ++
71               helperNames2W wrds countryMap countryNamesTwoWords ++
72               helperNames3W wrds countryMap countryNamesThreeWords in
73   map (\(s, Just (a,Just b)) -> (a,b)) cns
74
75 peopleNames wrds =
76   let cns = removeDuplicates $ sort $
77               helperNames1W wrds peopleMap Data.Set.empty ++
78               helperNames2W wrds peopleMap Data.Set.empty ++
79               helperNames3W wrds peopleMap Data.Set.empty in
80   map (\(s, Just (a,Just b)) -> (a,b)) cns
81
82 cityNames wrds =
83   let cns = removeDuplicates $ sort $
84               helperNames1W wrds cityMap Data.Set.empty ++
85               helperNames2W wrds cityMap Data.Set.empty ++
86               helperNames3W wrds cityMap Data.Set.empty in
87   map (\(s, Just (a,Just b)) -> (a,b)) cns
88
89 broadcastNetworkNames wrds =
90   let cns = removeDuplicates $ sort $
91               helperNames1W wrds broadcastNetworkMap Data.Set.empty ++
92               helperNames2W wrds broadcastNetworkMap Data.Set.empty ++
93               helperNames3W wrds broadcastNetworkMap Data.Set.empty in
94   map (\(s, Just (a,Just b)) -> (a,b)) cns
95
96 politicalPartyNames wrds =
97   let cns = removeDuplicates $ sort $
98               helperNames1W wrds politicalPartyMap Data.Set.empty ++
99               helperNames2W wrds politicalPartyMap Data.Set.empty ++
100               helperNames3W wrds politicalPartyMap Data.Set.empty in
101   map (\(s, Just (a,Just b)) -> (a,b)) cns
102
103 tradeUnionNames wrds =
104   let cns = removeDuplicates $ sort $
105               helperNames1W wrds tradeUnionMap Data.Set.empty ++
106               helperNames2W wrds tradeUnionMap Data.Set.empty ++
107               helperNames3W wrds tradeUnionMap Data.Set.empty in
108   map (\(s, Just (a,Just b)) -> (a,b)) cns
109
110 universityNames wrds =
111   let cns = removeDuplicates $ sort $
112               helperNames1W wrds universityMap Data.Set.empty ++
113               helperNames2W wrds universityMap Data.Set.empty ++
114               helperNames3W wrds universityMap Data.Set.empty in
115   map (\(s, Just (a,Just b)) -> (a,b)) cns
116
117
118 main = do
119   let s = "As read in the San Francisco Chronicle, the company is owned by John Sm\
120 ith, Bill Clinton, Betty Sanders, and Dr. Ben Jones. Ben Jones and Mr. John Smith ar\
121 e childhood friends who grew up in Brazil, Canada, Buenos Aires, and the British Vir\
122 gin Islands. Apple Computer released a new version of OS X yesterday. Brazil Brazil \
123 Brazil. John Smith bought stock in ConocoPhillips, Heinz, Hasbro, and General Motors\
124 , Fox Sports Radio. I listen to B J Cole. Awami National Party is a political party.\
125  ALAEA is a trade union. She went to Brandeis University."
126   --print $ humanNames s
127   print $ peopleNames $ splitWordsKeepCase s
128   print $ countryNames $ splitWordsKeepCase s
129   print $ companyNames $ splitWordsKeepCase s
130   print $ cityNames $ splitWordsKeepCase s
131   print $ broadcastNetworkNames $ splitWordsKeepCase s
132   print $ politicalPartyNames $ splitWordsKeepCase s
133   print $ tradeUnionNames $ splitWordsKeepCase s
134   print $ universityNames $ splitWordsKeepCase s
The following output is generated by running the test main function defined at the bottom of the file app/NlpTool.hs:
$ stack build --fast --exec NlpTool-exe
Building all executables for `NlpTool' once. After a successful build of all of them, only specified executables will be rebuilt.
NlpTool> build (lib + exe)
Preprocessing library for NlpTool-0.1.0.0..
Building library for NlpTool-0.1.0.0..
Preprocessing executable 'NlpTool-exe' for NlpTool-0.1.0.0..
Building executable 'NlpTool-exe' for NlpTool-0.1.0.0..
[1 of 2] Compiling Main
[2 of 2] Compiling Paths_NlpTool
Linking .stack-work/dist/x86_64-osx/Cabal-2.4.0.1/build/NlpTool-exe/NlpTool-exe ...
NlpTool> copy/register
Installing library in /Users/markw/GITHUB/haskell_tutorial_cookbook_examples_private_new_edition/NlpTool/.stack-work/install/x86_64-osx/7a2928fbf8188dcb20f165f77b37045a5c413cc7f63913951296700a6b7e292d/8.6.5/lib/x86_64-osx-ghc-8.6.5/NlpTool-0.1.0.0-DXKbucyA0S0AKOAcZGDl2H
Installing executable NlpTool-exe in /Users/markw/GITHUB/haskell_tutorial_cookbook_examples_private_new_edition/NlpTool/.stack-work/install/x86_64-osx/7a2928fbf8188dcb20f165f77b37045a5c413cc7f63913951296700a6b7e292d/8.6.5/bin
Registering library for NlpTool-0.1.0.0..
Enter text (all on one line)
As read in the San Francisco Chronicle, the company is owned by John Smith, Bill Clinton, Betty Sanders, and Dr. Ben Jones. Ben Jones and Mr. John Smith are childhood friends who grew up in Brazil, Canada, Buenos Aires, and the British Virgin Islands. Apple Computer released a new version of OS X yesterday. Brazil Brazil Brazil. John Smith bought stock in ConocoPhillips, Heinz, Hasbro, and General Motors, Fox Sports Radio. I listen to B J Cole. Awami National Party is a political party. ALAEA is a trade union. She went to Brandeis University.
category: news_politics
summary: ALAEA is a trade union. Apple Computer released a new version of OS X yesterday.
people: [["B J Cole","<http://dbpedia.org/resource/B._J._Cole>"]]
companies: [["Apple","<http://dbpedia.org/resource/Apple>"],["ConocoPhillips","<http://dbpedia.org/resource/ConocoPhillips>"],["Hasbro","<http://dbpedia.org/resource/Hasbro>"],["Heinz","<http://dbpedia.org/resource/Heinz>"],["San Francisco Chronicle","<http://dbpedia.org/resource/San_Francisco_Chronicle>"]]
countries: [["Brazil","<http://dbpedia.org/resource/Brazil>"],["Canada","<http://dbpedia.org/resource/Canada>"]]
Enter text (all on one line)
Note that entities that do not have Wikipedia/DBPedia entries are not resolved to DBPedia URIs.
Bag of Words Classification Model
The file Categorize.hs contains a simple bag of words classification model. To prepare the classification data, I collected a large set of labeled text with labels like "chemistry" and "computers". I ranked words by how often they appeared in the training texts for a category, normalized by how often they appeared across all of the training texts (the sketch after the following list illustrates this ranking idea). This example uses two auto-generated and data-specific Haskell files, one for single words and the other for pairs of adjacent words:
- Category1Gram.hs
- Category2Gram.hs
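Here is the promised ranking sketch. It is my reconstruction of the idea only, not the author's Ruby generation scripts, and the names are illustrative:

import qualified Data.Map as M

-- Score a word for a category by its frequency in that category's training
-- text, normalized by its frequency across all of the training text.
wordScore :: M.Map String Int -> M.Map String Int -> String -> Double
wordScore catCounts allCounts word =
  let inCat   = fromIntegral (M.findWithDefault 0 word catCounts)
      overall = fromIntegral (M.findWithDefault 1 word allCounts)
  in inCat / overall

main :: IO ()
main = do
  let chemistry  = M.fromList [("atoms", 70), ("reaction", 67)]
      everything = M.fromList [("atoms", 100), ("reaction", 300)]
  print $ wordScore chemistry everything "atoms"     -- 0.7
  print $ wordScore chemistry everything "reaction"  -- roughly 0.22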
In NLP work, single words are sometimes called 1grams and adjacent two-word pairs are referred to as 2grams. Here is a small amount of data from Category1Gram.hs:
1 module Category1Gram (onegrams) where
2
3 import qualified Data.Map as M
4
5 chemistry = M.fromList [("chemical", 1.15), ("atoms", 6.95),
6                         ("reaction", 6.7), ("energy", 6.05),
7                         ... ]
8 computers = M.fromList [("software", 4.6), ("network", 4.65),
9                         ("linux", 3.6), ("device", 3.55), ("computers", 3.05),
10                         ("storage", 2.7), ("disk", 2.3),
11                         ... ]
12 etc.
Here is a small amount of data from Category2Gram.hs:
1 module Category2Gram (twograms) where
2
3 import qualified Data.Map as M
4
5 chemistry = M.fromList [("chemical reaction", 1.55),
6                         ("atoms molecules", 0.6),
7                         ("periodic table", 0.5),
8                         ("chemical reactions", 0.5),
9                         ("carbon atom", 0.5),
10                         ... ]
11 computers = M.fromList [("computer system", 0.9),
12                         ("operating system", 0.75),
13                         ("random memory", 0.65),
14                         ("computer science", 0.65),
15                         ("computer program", 0.6),
16                         ... ]
17 etc.
It is very common to use term frequencies of single words in classification models. One problem with using single words is that the evidence a word contributes to a classification is independent of the surrounding words in the text being evaluated. By also using word pairs (two-word combinations, often called 2grams or two-grams) we pick up patterns like "not good" giving evidence for negative sentiment even though the word "good" appears in the text. For my own work I have a large corpus of 1gram, 2gram, 3gram, and 4gram data sets; for the following example program I use only the 1gram and 2gram data.
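The 2gram construction itself is simple. The book's Utils.hs provides bigram_s; a stand-alone definition with the same observable behavior might look like this (my assumption about its behavior, shown for clarity, not the book's exact code):

-- Pair each word with its successor, joined by a space.
bigram_s :: [String] -> [String]
bigram_s ws = zipWith (\a b -> a ++ " " ++ b) ws (drop 1 ws)

main :: IO ()
main = print $ bigram_s ["not", "good", "at", "all"]
-- ["not good","good at","at all"]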
The following listing shows the file Categorize.hs. Before looking at the entire example, let's focus on some of the functions I have defined for using the word frequency data to categorize text.
*Categorize> :t stemWordsInString
stemWordsInString :: String -> [Char]
*Categorize> stemWordsInString "Banking industry is sometimes known for fraud."
"bank industri is sometim known for fraud"
stemScoredWordList replaces the String key in each (String, score) pair with its stemmed form; it is used to build versions of the category data whose keys are word stems.
*Categorize> stemScoredWordList onegrams
[("chemistri",fromList [("acid",1.15),("acids",0.8),("alcohol",0.95),("atom",4.45)
Notice that "chemistri" is the stemmed form of "chemistry", "bank" of "banks", and so on. stem2 maps each category to a 2gram score map in which the keys are stemmed 2grams:
*Categorize> stem2
[("chemistry",fromList [("atom molecul",0.6),("carbon atom",0.5),("carbon carbon",0.\
5),
stem1 is like stem2, but for stemmed 1grams, not 2grams:
*Categorize> stem1
[("chemistry",fromList [("acid",0.8),("chang",1.05),("charg",0.95),("chemic",1.15),(\
"chemistri",1.45),
The function score is called with a list of words and a list of (category name, word score map) pairs. Here is an example:
*Categorize> :t score
score
:: (Enum t, Fractional a, Num t, Ord a, Ord k) =>
[k] -> [(t1, M.Map k a)] -> [(t, a)]
*Categorize> score ["atom", "molecule"] onegrams
[(0,8.2),(25,2.4)]
This output is more than a little opaque. The pair (0,8.2) means that the input words ["atom", "molecule"] have a score of 8.2 for the category at index 0, and the pair (25,2.4) means that the input words have a score of 2.4 for the category at index 25. The category at index 0 is chemistry and the category at index 25 is physics, as we can see by using the higher-level function bestCategories1 that calculates categories for a word sequence using the 1gram word data:
*Categorize> :t bestCategories1
bestCategories1 :: [[Char]] -> [([Char], Double)]
*Categorize> bestCategories1 ["atom", "molecule"]
[("chemistry",8.2),("physics",2.4)]
The function bestCategories1 uses the 1gram data. Here is an example of using it:
*Categorize> splitWords "The chemist made a periodic table and explained a chemical reaction"
["the","chemist","made","a","periodic","table","and","explained","a","chemical","reaction"]
*Categorize> bestCategories1 $ splitWords "The chemist made a periodic table and explained a chemical reaction"
[("chemistry",11.25),("health_nutrition",1.2)]
Notice that these words were also classified in the category "health_nutrition", but with a low score of 1.2; the score for "chemistry" is almost an order of magnitude larger. The bestCategories functions sort their return values in best-first order.
splitWords is used to split a string into word tokens before calling bestCategories.
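splitWords also lives in Utils.hs; a minimal stand-in matching the behavior seen in the GHCi session above (an assumption, not the book's exact definition) might be:

import Data.Char (isAlphaNum, toLower)

-- Lower-case the text, treat anything that is not a letter or digit as a
-- separator, and return the resulting tokens.
splitWords :: String -> [String]
splitWords = words . map (\c -> if isAlphaNum c then toLower c else ' ')

main :: IO ()
main = print $ splitWords "The chemist made a periodic table!"
-- ["the","chemist","made","a","periodic","table"]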
Here is the entire example in file Categorize.hs:
1 module Categorize (bestCategories, splitWords, bigram) where
2
3 import qualified Data.Map as M
4 import Data.List (sortBy)
5 import Data.Bifunctor (first) -- provides 'first', used in bestCategoriesHelper
6 import Category1Gram (onegrams)
7 import Category2Gram (twograms)
8
9 import Sentence (segment)
10
11 import Stemmer (stem)
12
13 import Utils (splitWords, bigram, bigram_s)
14
15 catnames1 = map fst onegrams
16 catnames2 = map fst twograms
17
18 stemWordsInString s = init $ concatMap ((++ " ") . stem) (splitWords s)
19
20 stemScoredWordList = map (\(str,score) -> (stemWordsInString str, score))
21
22 stem2 = map (\(category, swl) ->
23               (category, M.fromList (stemScoredWordList (M.toList swl))))
24             twograms
25
26 stem1 = map (\(category, swl) ->
27               (category, M.fromList (stemScoredWordList (M.toList swl))))
28             onegrams
29
30 scoreCat wrds amap =
31   sum $ map (\x -> M.findWithDefault 0.0 x amap) wrds
32
33 score wrds amap =
34   filter (\(a, b) -> b > 0.9) $ zip [0..] $ map (\(s, m) -> scoreCat wrds m) amap
35
36 cmpScore (a1, b1) (a2, b2) = compare b2 b1
37
38 bestCategoriesHelper wrds ngramMap categoryNames =
39   -- convert each (category index, score) pair to (category name, score)
40   map (first (categoryNames !!)) $ sortBy cmpScore $ score wrds ngramMap
41
42 bestCategories1 wrds =
43   take 3 $ bestCategoriesHelper wrds onegrams catnames1
44
45 bestCategories2 wrds =
46   take 3 $ bestCategoriesHelper (bigram_s wrds) twograms catnames2
47
48 bestCategories1stem wrds =
49   take 3 $ bestCategoriesHelper wrds stem1 catnames1
50
51 bestCategories2stem wrds =
52   take 3 $ bestCategoriesHelper (bigram_s wrds) stem2 catnames2
53
54 bestCategories :: [String] -> [(String, Double)]
55 bestCategories wrds =
56   let sum1 = M.unionWith (+) (M.fromList $ bestCategories1 wrds) (M.fromList $ best\
57 Categories2 wrds)
58       sum2 = M.unionWith (+) (M.fromList $ bestCategories1stem wrds) (M.fromList $ \
59 bestCategories2stem wrds)
60   in sortBy cmpScore $ M.toList $ M.unionWith (+) sum1 sum2
61
62 main = do
63   let s = "The sport of hocky is about 100 years old by ahdi dates. American Footb\
64 all is a newer sport. Programming is fun. Congress passed a new budget that might he\
65 lp the economy. The frontier initially was a value path. The ai research of john mcc\
66 arthy."
67   print $ bestCategories1 (splitWords s)
68   print $ bestCategories1stem (splitWords s)
69   print $ score (splitWords s) onegrams
70   print $ score (bigram_s (splitWords s)) twograms
71   print $ bestCategories2 (splitWords s)
72   print $ bestCategories2stem (splitWords s)
73   print $ bestCategories (splitWords s)
Here is the output:
1 $ stack ghci
2 :l Categorize.hs
3 *Categorize> main
4 [("computers_ai",17.900000000000002),("sports",9.75),("computers_ai_search",6.2)]
5 [("computers_ai",18.700000000000003),("computers_ai_search",8.1),("computers_ai_lear\
6 ning",5.7)]
7 [(2,17.900000000000002),(3,1.75),(4,5.05),(6,6.2),(9,1.1),(10,1.2),(21,2.7),(26,1.1)\
8 ,(28,1.6),(32,9.75)]
9 [(2,2.55),(6,1.0),(32,2.2)]
10 [("computers_ai",2.55),("sports",2.2),("computers_ai_search",1.0)]
11 [("computers_ai",1.6)]
12 [("computers_ai",40.75000000000001),("computers_ai_search",15.3),("sports",11.95),("\
13 computers_ai_learning",5.7)]
Given that the variable s contains some test text, line 4 of this output was generated by evaluating bestCategories1 (splitWords s), lines 5-6 by evaluating bestCategories1stem (splitWords s), lines 7-8 from score (splitWords s) onegrams, line 9 from score (bigram_s (splitWords s)) twograms, line 10 from bestCategories2 (splitWords s), line 11 from bestCategories2stem (splitWords s), and lines 12-13 from bestCategories (splitWords s).
I called all of these utility functions in main to demonstrate what they do, but in practice I just call the function bestCategories in my applications.
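A typical client of this module, using only the exports shown at the top of Categorize.hs, might then be as small as this sketch:

import Categorize (bestCategories, splitWords)

-- Tokenize the input text, then ask for ranked categories.
main :: IO ()
main = print $ bestCategories $ splitWords
         "Congress passed a new budget that might help the economy"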
Text Summarization
This application uses the Categorize.hs code and the 1gram and 2gram data from the last section. The algorithm I devised for this example is based on a simple idea: we categorize the text and keep track of which words provide the strongest evidence for the highest-ranked categories. We then return the few sentences from the original text that contain the largest number of these important words.
module Summarize (summarize, summarizeS) where

import qualified Data.Map as M
import Data.List.Utils (replace)
import Data.Maybe (fromMaybe)

import Categorize (bestCategories)
import Sentence (segment)
import Utils (splitWords, bigram_s, cleanText)

import Category1Gram (onegrams)
import Category2Gram (twograms)

scoreSentenceHelper words scoreMap = -- just use 1grams for now
  sum $ map (\word -> M.findWithDefault 0.0 word scoreMap) words

safeLookup key alist = fromMaybe 0 $ lookup key alist

scoreSentenceByBestCategories words catDataMaps bestCategories =
  map (\(category, aMap) ->
        (category, safeLookup category bestCategories *
                   scoreSentenceHelper words aMap))
      catDataMaps

scoreForSentence words catDataMaps bestCategories =
  sum $ map snd $ scoreSentenceByBestCategories words catDataMaps bestCategories

summarize s =
  let words = splitWords $ cleanText s
      bestCats = bestCategories words
      sentences = segment s
      result1grams = map (\sentence ->
                           (sentence,
                            scoreForSentence (splitWords sentence)
                                             onegrams bestCats))
                         sentences
      result2grams = map (\sentence ->
                           (sentence,
                            scoreForSentence (bigram_s (splitWords sentence))
                                             twograms bestCats))
                         sentences
      -- merge the 1gram and 2gram scores for each sentence:
      mergedResults = M.toList $ M.unionWith (+) (M.fromList result1grams)
                                                 (M.fromList result2grams)
      c400 = filter (\(sentence, score) -> score > 400) mergedResults
      c300 = filter (\(sentence, score) -> score > 300) mergedResults
      c200 = filter (\(sentence, score) -> score > 200) mergedResults
      c100 = filter (\(sentence, score) -> score > 100) mergedResults
      c000 = mergedResults in
  if not (null c400)
    then c400
    else if not (null c300)
           then c300
           else if not (null c200)
                  then c200
                  else if not (null c100) then c100 else c000

summarizeS s =
  let a = replace "\"" "'" $ concatMap (\x -> fst x ++ " ") $ summarize s in
    if not (null a) then a else safeFirst $ segment s
  where
    safeFirst x
      | length x > 1 = head x ++ x !! 1
      | not (null x) = head x
      | otherwise = ""

main = do
  let s = "Plunging European stocks, wobbly bonds and grave concerns about the health of Portuguese lender Banco Espirito Santo SA made last week feel like a rerun of the euro crisis, but most investors say it was no more than a blip for a resurgent region. Banco Espirito Santo has been in investors’ sights since December, when The Wall Street Journal first reported on accounting irregularities at the complex firm. Nerves frayed on Thursday when Banco Espirito Santo's parent company said it wouldn't be able to meet some short-term debt obligations."
  print $ summarize s
  print $ summarizeS s
Lazy evaluation allows the function summarize to define candidate summaries at several score thresholds (c400 down to c000) without all of them being calculated; only the candidates actually examined by the chain of if expressions are evaluated.
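Here is the same idiom in miniature (an illustrative toy, not from the book's code). Three filtered lists are named, but null only forces as much of each list as it needs, so an unused threshold costs almost nothing:

-- Toy illustration of the thresholding idiom used in summarize.
pick :: [Int] -> [Int]
pick xs =
  let big    = filter (> 400) xs
      medium = filter (> 200) xs
      rest   = xs
  in if not (null big) then big
     else if not (null medium) then medium
     else rest

main :: IO ()
main = print $ pick [50, 150, 500]  -- big is non-empty, so medium is never forced
-- [500]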
$ stack ghci
*Main ... > :l Summarize.hs
*Summarize> main
[("Nerves frayed on Thursday when Banco Espirito Santo's parent company said it woul\
dn't be able to meet some short-term debt obligations.",193.54500000000002)]
"Nerves frayed on Thursday when Banco Espirito Santo's parent company said it wouldn\
't be able to meet some short-term debt obligations. "
Part of Speech Tagging
We close out this chapter with the Haskell version of my part of speech (POS) tagger, which I originally wrote in Common Lisp and later converted to Ruby and Java. The file LexiconData.hs is similar to the lexical data files seen earlier: I define a map whose keys are words and whose values are POS tags like NNP (proper noun) and RB (adverb). The file README.md contains a complete list of the POS tag definitions.
The example code and data for this section is in the directory FastTag.
This listing shows a tiny representative part of the POS definitions in LexiconData.hs:
lexicon = M.fromList [("AARP", "NNP"), ("Clinic", "NNP"), ("Closed", "VBN"),
                      ("Robert", "NNP"), ("West-German", "JJ"),
                      ("afterwards", "RB"), ("arises", "VBZ"),
                      ("attacked", "VBN"), ...]
Before looking at the code example listing, let's see how the functions defined in fasttag.hs work in a GHCi REPL:
*Main LexiconData> bigram ["the", "dog", "ran", "around", "the", "tree"]
[["the","dog"],["dog","ran"],["ran","around"],["around","the"],["the","tree"]]
*Main LexiconData> tagHelper "car"
["car","NN"]
*Main LexiconData> tagHelper "run"
["run","VB"]
*Main LexiconData> substitute ["the", "dog", "ran", "around", "the", "tree"]
[[["the","DT"],["dog","NN"]],[["dog","NN"],["ran","VBD"]],[["ran","VBD"],["around","IN"]],[["around","IN"],["the","DT"]],[["the","DT"],["tree","NN"]]]
*Main LexiconData> fixTags $ substitute ["the", "dog", "ran", "around", "the", "tree"]
["NN","VBD","IN","DT","NN"]
Function bigram takes a list of words and returns a list of word pairs. We need the word pairs because parts of the tagging algorithm need to see each word together with the word that precedes it. In an imperative language I would loop over the words and, for the word at index i, also look at the word at index i - 1. In a functional language we avoid explicit loops, and here we instead build a list of adjacent word pairs. I like this style of functional programming, but if you come from years of using imperative languages like Java and C++ it takes some getting used to.
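The same pairing can also be written with zip and tuples (shown only for comparison; the book's bigram returns two-element lists instead):

-- Pair each word with the word that follows it; in fixTags the first element
-- of each pair plays the role of the preceding word.
pairs :: [a] -> [(a, a)]
pairs xs = zip xs (drop 1 xs)

main :: IO ()
main = print $ pairs ["the", "dog", "ran"]
-- [("the","dog"),("dog","ran")]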
tagHelper converts a word into a two-element list containing the word and its most likely tag. substitute applies tagHelper to a list of words, getting the most probable tag for each word. The function fixTags will occasionally override the default word tags based on a few rules derived from Eric Brill's paper A Simple Rule-Based Part of Speech Tagger.
Here is the entire example:
1 module Main where
2
3 import qualified Data.Map as M
4 import Data.Strings (strEndsWith, strStartsWith)
5 import Data.List (isInfixOf)
6
7 import LexiconData (lexicon)
8
9 bigram :: [a] -> [[a]]
10 bigram [] = []
11 bigram [_] = []
12 bigram xs = take 2 xs : bigram (tail xs)
13
14 containsString word substring = isInfixOf substring word
15
16 fixTags twogramList =
17   map
18     -- in the following inner function, [last, current] might be bound,
19     -- for example, to [["dog","NN"],["ran","VBD"]]
20     (\[last, current] ->
21       -- rule 1: DT, {VBD | VBP} --> DT, NN
22       if last !! 1 == "DT" && (current !! 1 == "VBD" ||
23                                current !! 1 == "VB" ||
24                                current !! 1 == "VBP")
25         then "NN"
26         else
27         -- rule 2: convert a noun to a number (CD) if "." appears in the word
28         if (current !! 1) !! 0 == 'N' && containsString (current !! 0) "."
29           then "CD"
30           else
31           -- rule 3: convert a noun to a past participle if
32           -- the current word ends with "ed"
33           if (current !! 1) !! 0 == 'N' && strEndsWith (current !! 0) "ed"
34             then "VBN"
35             else
36             -- rule 4: convert any type to adverb if it ends in "ly"
37             if strEndsWith (current !! 0) "ly"
38               then "RB"
39               else
40               -- rule 5: convert a common noun (NN or NNS) to an
41               -- adjective if it ends with "al"
42               if strStartsWith (current !! 1) "NN" &&
43                  strEndsWith (current !! 0) "al"
44                 then "JJ"
45                 else
46                 -- rule 6: convert a noun to a verb if the preceding
47                 -- word is "would"
48                 if strStartsWith (current !! 1) "NN" &&
49                    (last !! 0) == "would" -- should be case insensitive
50                   then "VB"
51                   else
52                   -- rule 7: if a word has been categorized as a
53                   -- common noun and it ends with "s",
54                   -- then set its type to plural common noun (NNS)
55                   if strStartsWith (current !! 1) "NN" &&
56                      strEndsWith (current !! 0) "s"
57                     then "NNS"
58                     else
59                     -- rule 8: convert a common noun to a present
60                     -- participle verb (i.e., a gerund)
61                     if strStartsWith (current !! 1) "NN" &&
62                        strEndsWith (current !! 0) "ing"
63                       then "VBG"
64                       else (current !! 1))
65     twogramList
66
67 substitute tks = bigram $ map tagHelper tks
68
69 tagHelper token =
70   let tags = M.findWithDefault [] token lexicon in
71     if tags == [] then [token, "NN"] else [token, tags]
72
73 tag tokens = fixTags $ substitute ([""] ++ tokens)
74
75
76 main = do
77   let tokens = ["the", "dog", "ran", "around", "the", "tree", "while",
78                 "the", "cat", "snaked", "around", "the", "trunk",
79                 "while", "banking", "to", "the", "left"]
80   print $ tag tokens
81   print $ zip tokens $ tag tokens
*Main LexiconData> main
["DT","NN","VBD","IN","DT","NN","IN","DT","NN","VBD","IN","DT",
"NN","IN","VBG","TO","DT","VBN"]
[("the","DT"),("dog","NN"),("ran","VBD"),("around","IN"),
("the","DT"),("tree","NN"),("while","IN"),("the","DT"),
("cat","NN"),("snaked","VBD"),("around","IN"),("the","DT"),
("trunk","NN"),("while","IN"),("banking","VBG"),("to","TO"),
("the","DT"),("left","VBN")]
The README.md file contains definitions of all the POS tags. Here are the ones that appear in this example's output:
DT   determiner                         the, some
NN   noun                               dog, cat, road
VBD  verb, past tense                   ate, ran
VBN  verb, past participle              attacked, closed
VBG  verb, gerund/present participle    banking, running
IN   preposition                        of, in, by
TO   the word "to"                      to
Natural Language Processing Wrap Up
NLP is a large topic. I have attempted to show you a few techniques that I use often and that are simple to implement. I hope that you reuse the code in this chapter in your own projects when you need to detect entities, classify text, summarize text, or assign part of speech tags to words in text.