### This book is licensed under the Creative Commons Attribution (CC BY) Version 3 license, which allows reuse in derived works

You are free to:

• Share - copy and redistribute the material in any medium or format.
• Adapt - remix, transform, and build upon the material for any purpose, even commercially.

You are required to give appropriate credit in any derived works.

This eBook will be updated occasionally so please periodically check the leanpub.com web page for this book for updates.

This is the eighth edition released August 2022.

If you found a copy of this book on the web and find it of value then please consider buying a copy at leanpub.com/lovinglisp to support the author and fund work for future updates. You can also download a free copy from Leanpub by setting the price to zero. Please look at my website for other books and materials.

## Preface

### Notes on the Eighth Edition Published August 2022

The main change is splitting the Knowledge Graph Navigator (KGN) chapter that features the LispWorks CAPI UI APIs into three chapters: a library for KGN functionality, a text based (console) UI, and a CAPI based UI. I added examples using the OpenAI GPT-3 APIs. There are other small corrections and improvements.

### Notes on the Seventh Edition Published March 2021

I added two short chapters to the previous edition: Knowledge Graph Sampler for Creating Small Custom Knowledge Graphs and Using Common Lisp With Wolfram/One.

### Notes on the Sixth Edition Published June 2020

Two examples optionally use the CAPI user interface toolkit provided with LispWorks Common Lisp and work with the free personal edition. The first CAPI application is Knowledge Graph Navigator and the second CAPI example is Knowledge Graph Creator. Both of these examples build up utilities for working with Knowledge Graphs and the Semantic Web.

I expanded the Plot Library chapter to generate PNG graphics files or, if you are using the free personal edition of LispWorks, to direct plotting output to a new window in interactive programs.

I added a new chapter on using the py4cl library to embed Python libraries and application code into a Common Lisp system. I provide new examples for embedding spaCy and TensorFlow applications in Common Lisp applications. In earlier editions, I used a web services interface to wrap Python code using spaCy and TensorFlow. I am leaving that chapter intact, renaming it from “Using Python Deep Learning Models In Common Lisp” to “Using Python Deep Learning Models In Common Lisp With a Web Services Interface.” The new chapter for this edition is “Using the PY4CL Library to Embed Python in Common Lisp.”

### Notes on the Fifth Edition Published September 2019

The new material in this edition includes:

• A complete application for processing text to generate data for Knowledge Graphs (targeting the open source Neo4J graph database and also supporting RDF semantic web/linked data).
• A library for accessing the state of the art spaCy natural language processing (NLP) library and also a state of the art deep learning model. These models are implemented in thin Python wrappers that use Python libraries like spaCy, PyTorch, and TensorFlow. These examples replace a simple hybrid Java and Common Lisp example in previous editions.

I have added text and explanations as appropriate throughout the book and I removed the CouchDB examples.

I have made large changes to how the code for this book is packaged. I have reorganized the example code on GitHub by providing the examples as multiple Quicklisp libraries or applications. I now do this with all of my Common Lisp code and it makes it easier to write smaller libraries that can be composed into larger applications. In my own workflow, I also like to use Makefile targets to build standalone applications that can be run on other computers without installing Lisp development environments. Please follow the directions at the end of the Preface for configuring Quicklisp for easy builds and use of the example software for this book.

### Why Use Common Lisp?

Why Common Lisp? Isn’t Common Lisp an old language? Do many people still use Common Lisp?

I believe that using Lisp languages like Common Lisp, Clojure, Racket, and Scheme is a secret weapon for agile software development. An interactive development process and live production updates feel like a breath of fresh air if you have developed on heavyweight platforms like Java Enterprise Edition (JEE).

Yes, Common Lisp is an old language, but with age comes stability and extremely good compiler technology. There is also some inconsistency between different Common Lisp systems in areas such as thread handling, but with a little up-front knowledge you can choose which Common Lisp systems will support your requirements.

### Getting an Access Key for Microsoft Bing Search APIs

You will need to set up an Azure account if you don’t already have one. I use the Bing search APIs fairly often for research but I have never spent more than about a dollar a month and usually I get no bill at all. For personal use it is a very inexpensive service.

You start by going to the web page https://azure.microsoft.com/en-us/try/cognitive-services/ and sign up for an access key. The Search APIs sign up is currently in the fourth tab in this web form. When you navigate to the Search APIs tab, select the option Bing Search APIs v7. You will get an API key that you need to store in an environment variable that you will soon need:

That is not my real subscription key!

You also set the Bing search API endpoint as an environment variable:
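
The environment variable names below are placeholders (use whatever names you export in your shell); a minimal sketch for reading the key and the endpoint from Common Lisp looks like this:

{lang="lisp",linenos=off}
~~~~~~~~
;; Hypothetical environment variable names; substitute the names you
;; actually exported. uiop ships with ASDF so no extra library is needed.
(defvar *bing-search-key* (uiop:getenv "BING_SEARCH_V7_KEY"))
(defvar *bing-search-endpoint* (uiop:getenv "BING_SEARCH_V7_ENDPOINT"))
~~~~~~~~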

### Example Search Script

Instead of using a pure Common Lisp HTTP client library I often prefer running the curl command in a separate process. The curl utility handles all possible authentication modes, handles headers, returns response data in several formats, etc. We capture the output from curl in a string that in turn gets processed by a JSON library.
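
Here is a rough sketch of that pattern, not the book's listing: it shells out to curl with uiop:run-program and parses the result with cl-json. The helper name and header handling are my own illustration.

{lang="lisp",linenos=off}
~~~~~~~~
;; Sketch only: run curl in a child process and parse the JSON response.
(ql:quickload "cl-json")

(defun curl-json (uri &key headers)
  "Fetch URI using the external curl command and return parsed JSON.
HEADERS is a list of (name . value) string pairs."
  (let* ((command (append (list "curl" "-s" uri)
                          (loop for (name . value) in headers
                                append (list "-H" (format nil "~a: ~a" name value)))))
         (response (uiop:run-program command :output :string)))
    (cl-json:decode-json-from-string response)))
~~~~~~~~

For the Bing case you would pass the subscription key as a request header and append the URL-encoded query string to the search endpoint.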

It takes very little Common Lisp code to access the Bing search APIs. The function websearch makes a generic web search query. The function get-wikidata-uri uses the websearch function by adding “site:wikidata.org” to the query and returning only the WikiData URI for the original search term. We will later see several examples. I will list the entire library with comments to follow:

We get the Bing access key and the search API endpoint in lines 8-9. Lines 10-16 create a complete call to the curl command line utility. We spawn a process to run curl and capture the string output in the variable response in lines 17-18. You might want to add a few print statements to see typical values for the variables command and response. The response data is JSON data encoded in a string, with straightforward code in lines 19-28 to parse out the values we want.

The following repl listing shows this library in use:

I have been using the Bing search APIs for many years. They are a standard part of my application building toolkit.

### Wrap-up

You can check out the wide range of Cognitive Services on the Azure site. Available APIs include: language detection, speech recognition, vision libraries for object recognition, web search, and anomaly detection in data.

In addition to using automated web scraping to get data for my personal research, I often use automated web search. I find Microsoft’s Azure Bing search APIs the most convenient to use and I like paying for services that I use.

## Accessing Relational Databases

There are good options for accessing relational databases from Common Lisp. Personally I almost always use Postgres and in the past I used either native foreign client libraries or the socket interface to Postgres. Recently, I decided to switch to CLSQL which provides a common interface for accessing Postgres, MySQL, SQLite, and Oracle databases. There are also several recent forks of CLSQL on github. We will use CLSQL in examples in this book. Hopefully while reading the Chapter on Quicklisp you installed CLSQL and the back end for one or more databases that you use for your projects.

For some database applications when I know that I will always use the embedded SQLite database (i.e., that I will never want to switch to Postgres or another database) I will just use the sqlite library as I do in the chapter Knowledge Graph Navigator.

If you have not installed CLSQL yet, then please install it now:
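
Assuming Quicklisp is configured as described in the Preface, installation is a single quickload:

{lang="lisp",linenos=off}
~~~~~~~~
(ql:quickload "clsql")
~~~~~~~~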

You also need to install one or more CLSQL backends, depending on which relational databases you use:
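
For example, the usual CLSQL backend system names look like the following; only load the ones you need (check the exact Quicklisp system names for your setup):

{lang="lisp",linenos=off}
~~~~~~~~
(ql:quickload "clsql-postgresql")  ; or "clsql-postgresql-socket"
(ql:quickload "clsql-sqlite3")
(ql:quickload "clsql-mysql")
~~~~~~~~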

The directory src/clsql_examples contains the standalone example files for this chapter.

While I often prefer hand crafting SQL queries, there seems to be a general movement in software development towards the data mapper or active record design patterns. CLSQL provides Object Relational Mapping (ORM) functionality to CLOS.

You will need to create a new database news in order to follow along with the examples in this chapter and later in this book. I will use Postgres for examples in this chapter and use the following to create a new database (my account is “markw” and the following assumes that I have Postgres configured to not require a password for this account when accessing the database from “localhost”):

We will use three example programs that you can find in the src/clsql_examples directory in the book repository on github:

• clsql_create_news_schema.lisp to create table “articles” in database “news”
• clsql_write_to_news.lisp to write test data to table “articles”
• clsql_read_from_news.lisp to read the test data back from table “articles”

The following listing shows the file src/clsql_examples/clsql_create_news_schema.lisp:
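
If you don’t have the repository at hand, the following is a minimal sketch of what such a schema-creation file can look like; the slot definitions and connection specification are illustrative, not the book’s exact code:

{lang="lisp",linenos=off}
~~~~~~~~
(ql:quickload "clsql")

;; Illustrative CLOS view class mapped to the "articles" table.
(clsql:def-view-class articles ()
  ((id    :accessor article-id    :type integer       :db-kind :key
          :db-constraints :not-null :initarg :id)
   (uri   :accessor article-uri   :type (string 200)  :initarg :uri)
   (title :accessor article-title :type (string 200)  :initarg :title)
   (text  :accessor article-text  :type (string 2000) :initarg :text)))

(defun create-articles-table ()
  ;; Connection spec is (host database user password); adjust for your setup.
  (clsql:connect '("localhost" "news" "markw" "") :database-type :postgresql)
  (clsql:create-view-from-class 'articles))
~~~~~~~~

In a repl you would then evaluate (create-articles-table) to create the table.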

In this repl listing, we create the database table “articles” using the function create-articles-table that we just defined:

The following listing shows the file src/clsql_examples/clsql_write_to_news.lisp:
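
Again as a sketch (not the original listing), writing a test row using the view class defined above might look like:

{lang="lisp",linenos=off}
~~~~~~~~
;; Illustrative only: insert one test row into the "articles" table.
(defun write-test-article ()
  (clsql:connect '("localhost" "news" "markw" "") :database-type :postgresql
                 :if-exists :old)
  (clsql:update-records-from-instance
   (make-instance 'articles
                  :id 1
                  :uri "http://example.com/article-1"
                  :title "Test article title"
                  :text "Test article text ...")))
~~~~~~~~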

You should load the file clsql_write_to_news.lisp one time in a repl to create the test data. The following listing shows file clsql_read_from_news.lisp:
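
A minimal sketch of reading the rows back, assuming the connection and view class from the earlier sketches:

{lang="lisp",linenos=off}
~~~~~~~~
;; With the ARTICLES view class defined, clsql:select returns instances.
(defun read-articles ()
  (dolist (article (clsql:select 'articles :flatp t :refresh t))
    (format t "~a: ~a~%" (article-uri article) (article-title article))))
~~~~~~~~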

You can also embed SQL where clauses in queries:
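
For example, here is a sketch using CLSQL’s square-bracket SQL reader syntax for a partial match on the title column; the search pattern is hypothetical and an open connection is assumed:

{lang="lisp",linenos=off}
~~~~~~~~
(clsql:locally-enable-sql-reader-syntax)

(defun find-articles-matching (pattern)
  ;; e.g. (find-articles-matching "%test%")
  (clsql:select [uri] [title]
                :from [articles]
                :where [like [title] pattern]))

(clsql:locally-disable-sql-reader-syntax)
~~~~~~~~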

which produces this output:

In this example, I am using a SQL like expression to perform partial text matching.

### Database Wrap Up

You learned the basics for accessing relational databases. When I am designing new systems for processing data I like to think of my Common Lisp code as being purely functional: my Lisp functions accept arguments that they do not modify and return results. I like to avoid side effects, that is, changing global state. When I do have to handle mutable state (or data) I prefer storing mutable state in an external database. I use this same approach when I use the Haskell functional programming language.

## Using MongoDB and Solr NoSQL Data Stores

Non-relational data stores are commonly used for applications that either don’t need full relational algebra or that must scale.

The MongoDB example code is in the file src/loving_snippets/mongo_news.lisp. The Solr example code is in the subdirectories src/solr_examples.

Note for the fifth edition: The Common Lisp cl-mongo library is now unsupported for versions of MongoDB later than 2.6 (released in 2016). You can install an old version of MongoDB for macOS or for Linux. I have left the MongoDB examples in this section but I can’t recommend that you use cl-mongo and MongoDB for any serious applications.

Brewer’s CAP theorem states that a distributed data storage system comprised of multiple nodes can provide at most two of the following three guarantees: all nodes always have a Consistent view of the state of data, general Availability of data even if not all nodes are functioning, and Partition tolerance so clients can still communicate with the data storage system when parts of the system are unavailable because of network failures. The basic idea is that different applications have different requirements and sometimes it makes sense to reduce system cost or improve scalability by easing back on one of these requirements.

A good example is that some applications may not need strict consistency (the first guarantee) because it is not important if clients sometimes get data that is a few seconds out of date.

MongoDB allows you to choose consistency vs. availability vs. efficiency.

I cover the Solr indexing and search service (based on Lucene) both because a Solr indexed document store is a type of NoSQL data store and also because I believe that you will find Solr very useful for building systems, if you don’t already use it.

### MongoDB

The following discussion of MongoDB is based on just my personal experience, so I am not covering all use cases. I have used MongoDB for:

• Small clusters of MongoDB nodes to analyze social media data, mostly text mining and sentiment analysis. In all cases for each application I ran MongoDB with one write master (i.e., I wrote data to this one node but did not use it for reads) and multiple read-only slave nodes. Each slave node would run on the same server that was usually performing a single bit of analytics.
• Multiple very large independent clusters for web advertising. Problems faced included trying to have some level of consistency across data centers. Replica sets were used within each data center.
• Running a single node MongoDB instance for low volume data collection and analytics.

One of the advantages of MongoDB is that it is very “developer friendly” because it supports ad-hoc document schemas and interactive queries. I mentioned that MongoDB allows you to choose consistency vs. availability vs. efficiency. When you perform MongoDB writes you can specify some granularity of what constitutes a “successful write” by requiring that a write is performed at a specific number of nodes before the client gets acknowledgement that the write was successful. This requirement adds overhead to each write operation and can cause writes to fail if some nodes are not available.

The MongoDB online documentation is very good. You don’t have to read it in order to have fun playing with the following Common Lisp and MongoDB examples, but if you find that MongoDB is a good fit for your needs after playing with these examples then you should read the documentation. I usually install MongoDB myself but it is sometimes convenient to use a hosting service. There are several well regarded services and I have used MongoHQ.

At this time there is no official Common Lisp support for accessing MongoDB but there is a useful project, Alfons Haffmans’ cl-mongo, that will allow us to write Common Lisp client applications and have access to most of the capabilities of MongoDB.

The file src/mongo_news.lisp contains the example code used in the next three sections.

The following repl listing shows the cl-mongo APIs for creating a new document, adding elements (attributes) to it, and inserting it in a MongoDB data store:
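
If you want to experiment without the book’s file, a minimal sketch of the pattern looks like this; the attribute names follow the article examples used in this chapter:

{lang="lisp",linenos=off}
~~~~~~~~
(ql:quickload "cl-mongo")

;; Use the "news" database, then build and insert one document
;; into the "articles" collection.
(cl-mongo:db.use "news")

(let ((doc (cl-mongo:make-document)))
  (cl-mongo:add-element "uri" "http://example.com/article-1" doc)
  (cl-mongo:add-element "title" "Test article title" doc)
  (cl-mongo:add-element "text" "Test article text ..." doc)
  (cl-mongo:db.insert "articles" doc))
~~~~~~~~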

In this example, three string attributes were added to a new document before it was saved.

#### Fetching Documents by Attribute

We will start by fetching and pretty-printing all documents in the collection articles and by fetching all articles as a list of nested lists where the inner nested lists contain document URI, title, and text:
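
A rough sketch of two such functions follows; the first function name is my own, while article-results->lisp-data matches the name used in the next section:

{lang="lisp",linenos=off}
~~~~~~~~
;; Pretty-print every document in the "articles" collection.
(defun print-all-articles ()
  (cl-mongo:pp (cl-mongo:db.find "articles" :all)))

;; Collect (uri title text) lists from all documents in "articles".
(defun article-results->lisp-data ()
  (loop for doc in (cl-mongo:docs (cl-mongo:db.find "articles" :all))
        collect (list (cl-mongo:get-element "uri" doc)
                      (cl-mongo:get-element "title" doc)
                      (cl-mongo:get-element "text" doc))))
~~~~~~~~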

Output for these two functions looks like:

By reusing the function article-results->lisp-data defined in the last section, we can also search for JSON documents using regular expressions matching attribute values:

I set the limit to return a maximum of ten documents. If you do not set the limit, this example code only returns one search result. The following repl listing shows the results from calling function search-articles-text:

I find using MongoDB to be especially effective when experimenting with data and code. The schema-free JSON document format, interactive queries using the mongo shell, and easy to use client libraries like cl-mongo for Common Lisp will let you experiment with a lot of ideas in a short period of time. The following listing shows the use of the interactive mongo shell. The database news is the database used in the MongoDB examples in this chapter; you will notice that I also have other databases for other projects on my laptop:

Line 1 of this listing shows starting the mongo shell. Line 4 shows how to list all databases in the data store. In line 13 I select the database “news” to use. Line 15 prints out the names of all collections in the current database “news”. Line 18 prints out all documents in the “articles” collection. You can read the documentation for the mongo shell for more options like selective queries, adding indices, etc.

When you run a MongoDB service on your laptop, also try the admin interface on http://localhost:28017/.

### A Common Lisp Solr Client

The Lucene project is one of the most widely used Apache Foundation projects. Lucene is a flexible library for preprocessing, indexing, and searching text. I have personally used Lucene on so many projects that it would be difficult to count them. The Apache Solr Project adds a network interface to the Lucene text indexer and search engine. Solr also adds other utility features to Lucene:

• While Lucene is a library to embed in your programs, Solr is a complete system.
• Solr provides good defaults for preprocessing and indexing text and also provides rich support for managing structured data.
• Provides both XML and JSON APIs using HTTP and REST.
• Supports faceted search, geospatial search, and provides utilities for highlighting search terms in surrounding text of search results.
• If your system ever grows to a very large number of users, Solr supports scaling via replication.

I hope that you will find that the Common Lisp example Solr client code in the following sections helps you make Solr part of large systems that you write using Common Lisp.

#### Installing Solr

Download a binary Solr distribution and un-tar or un-zip this Solr distribution, cd to the distribution directory, then cd to the example directory and run:

You can access the Solr Admin Web App at http://localhost:8983/solr/#/. This web app can be seen in the following screen shot:

There is no data in the Solr example index yet, so following the Solr tutorial instructions:

You will learn how to add documents to Solr directly in your Common Lisp programs in a later section.

Assuming that you have a fast Internet connection so that downloading Solr was quick, you have hopefully spent less than five or six minutes getting Solr installed and running with enough example search data for the Common Lisp client examples we will play with. Solr is a great tool for storing, indexing, and searching data. I recommend that you put off reading the official Solr documentation for now and instead work through the Common Lisp examples in the next two sections. Later, if you want to use Solr then you will need to carefully read the Solr documentation.

#### Solr’s REST Interface

The Solr REST Interface Documentation documents how to perform search using HTTP GET requests. All we need to do is implement this in Common Lisp which you will see is easy.

Assuming that you have Solr running and the example data loaded, we can try searching for documents with, for example, the word “British” using the URL http://localhost:8983/solr/select?q=British. This is a REST request URL and you can use utilities like curl or wget to fetch the XML data. I fetched the data in a web browser, as seen in the following screen shot of a Firefox web browser (I like the way Firefox formats and displays XML data):

The attributes in the returned search results need some explanation. We indexed several example XML data files, one of which contained the following XML element that we just saw as a search result:

So, the search result has the same attributes as the structured XML data that was added to the Solr search index. Solr’s capability for indexing structured data is a superset of just indexing plain text. If for example we were indexing news stories, then example input data might look like:

With this example, a search result that returned this document as a result would return attributes id, title, and text, and the values of these three attributes.

By default the Solr web service returns XML data as seen in the last screen shot. For our examples, I prefer using JSON so we are going to always add a request parameter wt=json to all REST calls. The following screen shot shows the same data returned in JSON serialization format instead of XML format, this time in a Chrome web browser (I like the way Chrome formats and displays JSON data with the JSONView Chrome Browser extension):

You can read the full JSON REST Solr documentation later, but for our use here we will use the following search patterns:

• http://localhost:8983/solr/select?q=British+One&wt=json - search for documents with either of the words “British” or “one” in them. Note that in URIs the “+” character is used to encode a space character. If you wanted a literal “+” character you would encode it with “%2B”, and a space character can also be encoded as “%20”. The default Solr search option is an OR of the search terms, unlike, for example, Google Search.
• http://localhost:8983/solr/select?q=British+AND+one&wt=json - search for documents that contain both of the words “British” and “one” in them. The search term in plain text is “British AND one”.

As we saw earlier in Network Programming it is fairly simple to use the drakma and cl-json Common Lisp libraries to call REST services that return JSON data. The function do-search defined in the next listing (all the Solr example code is in the file src/solr-client.lisp) constructs a query URI as we saw in the last section and uses the Drakma library to perform an HTTP GET operation and the cl-json library to parse the returned string containing JSON data into Lisp data structures:
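
A minimal sketch of the idea (not the book’s exact listing) follows; it joins the search terms with AND as in the second URL pattern above and assumes a local Solr instance:

{lang="lisp",linenos=off}
~~~~~~~~
(ql:quickload '("drakma" "cl-json"))

;; Ask Drakma to return application/json response bodies as text.
(push (cons "application" "json") drakma:*text-content-types*)

(defun do-search (&rest terms)
  (let ((query (format nil "~{~a~^+AND+~}" terms)))
    (cl-json:decode-json-from-string
     (drakma:http-request
      (concatenate 'string
                   "http://localhost:8983/solr/select?q=" query "&wt=json")))))
~~~~~~~~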

This example code does return the search results as Lisp list data; for example:

I might modify the search function to return just the fetched documents as a list, discarding the returned Solr meta data:
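
A sketch of that modification, assuming the do-search sketch above: cl-json turns the JSON response into nested association lists, so the documents live under the :response and :docs keys.

{lang="lisp",linenos=off}
~~~~~~~~
(defun do-search-docs (&rest terms)
  "Return only the list of matching documents, discarding Solr metadata."
  (let ((results (apply #'do-search terms)))
    (cdr (assoc :docs (cdr (assoc :response results))))))
~~~~~~~~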

There are a few more important details if you want to add Solr search to your Common Lisp applications. When there are many search results you might want to fetch a limited number of results and then “page” through them. The following strings can be added to the end of a search query:

• &rows=2 this example returns a maximum of two “rows” or two query results.
• &start=4 this example skips the first 4 available results.

A query that combines skipping results and limiting the number of returned results looks like this:

#### Common Lisp Solr Client for Adding Documents

In the last example we relied on adding example documents to the Solr search index using the directions for setting up a new Solr installation. In a real application, in addition to performing search requests for indexed documents you will need to add new documents from your Lisp applications. Using the Drakma library we will see that it is very easy to add documents.

We need to construct a bit of XML containing new documents in the form:

You can specify whatever field names (attributes) that are required for your application. You can also pass multiple <doc></doc> elements in one add request. We will want to specify documents in a Lisp-like way: a list of cons values where each cons value is a field name and a value. For the last XML document example we would like an API that lets us just deal with Lisp data like:
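
For example, one document expressed this way (the field names follow the id/title/text example above):

{lang="lisp",linenos=off}
~~~~~~~~
;; Hypothetical example data: each document is a list of (field . value)
;; cons pairs, and all values are strings.
(defvar *example-docs*
  '((("id" . "1234")
     ("title" . "Example news title")
     ("text" . "Example news story text ..."))))
~~~~~~~~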

One thing to note: the attribute names and values must be passed as strings. Other data types like integers, floating point numbers, structs, etc. will not work.

This is nicer than having to use XML, right? The first thing we need is a function to convert a list of cons values to XML. I could have used the XML Builder functionality in the cxml library that is available via Quicklisp, but for something this simple I just wrote it in pure Common Lisp with no other dependencies (also in the example file src/solr-client.lisp):
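
Here is a minimal sketch of such a conversion function. It is not the book’s exact listing, but it follows the same with-output-to-string pattern described next:

{lang="lisp",linenos=off}
~~~~~~~~
(defun documents->xml (documents)
  (with-output-to-string (stream)
    (format stream "<add>")
    (dolist (doc documents)
      (format stream "<doc>")
      (dolist (field doc)  ; each field is a (name . value) cons of strings
        (format stream "<field name=\"~a\">~a</field>"
                (car field) (cdr field)))
      (format stream "</doc>"))
    (format stream "</add>")))
~~~~~~~~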

The macro with-output-to-string on line 2 of the listing is my favorite way to generate strings. Everything written to the variable stream inside the macro call is appended to a string; this string is the return value of the macro.

The following function adds documents to the Solr document input queue but does not actually index them:
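
A sketch of such a function, assuming the documents->xml helper above and a local Solr server (the update URL may differ for your Solr version and core name):

{lang="lisp",linenos=off}
~~~~~~~~
(defun solr-add-documents (documents)
  (drakma:http-request
   "http://localhost:8983/solr/update"
   :method :post
   :content-type "text/xml"
   :content (documents->xml documents)))
~~~~~~~~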

You have noticed in line 3 that I am accessing a Solr server running on localhost and not a remote server. In an application using a remote Solr server you would need to modify this to reference your server; for example:

For efficiency Solr does not immediately add new documents to the index until you commit the additions. The following function should be called after you are done adding documents to actually add them to the index:
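
A matching sketch for the commit step:

{lang="lisp",linenos=off}
~~~~~~~~
(defun solr-commit-documents ()
  (drakma:http-request
   "http://localhost:8983/solr/update"
   :method :post
   :content-type "text/xml"
   :content "<commit></commit>"))
~~~~~~~~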

Notice that all we need is an empty element <commit></commit> that signals the Solr server that it should index all recently added documents. The following repl listing shows everything working together (I am assuming that the contents of the file src/solr-client.lisp has been loaded); not all of the output is shown in this listing:

#### Common Lisp Solr Client Wrap Up

Solr has a lot of useful features that we have not used here like supporting faceted search (drilling down in previous search results), geolocation search, and looking up indexed documents by attribute. In the examples I have shown you, all text fields are indexed but Solr optionally allows you fine control over indexing, spelling correction, word stemming, etc.

Solr is a very capable tool for storing, indexing, and searching data. I have seen Solr used effectively on projects as a replacement for a relational database or other NoSQL data stores like CouchDB or MongoDB. There is a higher overhead for modifying or removing data in Solr so for applications that involve frequent modifications to stored data Solr might not be a good choice.

### NoSQL Wrapup

There are more convenient languages than Common Lisp to use for accessing MongoDB. To be honest, my favorites are Ruby and Clojure. That said, for applications where the advantages of Common Lisp are compelling, it is good to know that your Common Lisp applications can play nicely with MongoDB.

I am a polyglot programmer: I like to use the best programming language for any specific job. When we design and build systems with more than one programming language, there are several options to share data:

• Use foreign function interfaces to call one language from another from inside one process.
• Use a service architecture and send requests using REST or SOAP.
• Use shared data stores, like relational databases, MongoDB, CouchDB and Solr.

Hopefully this chapter and the last chapter will provide most of what you need for the last option.

## Natural Language Processing

Natural Language Processing (NLP) is the automated processing of natural language text with several goals:

• Determine the parts of speech (POS tagging) of words based on the surrounding words.
• Detect if two text documents are similar.
• Categorize text (e.g., is it about the economy, politics, sports, etc.)
• Summarize text
• Determine the sentiment of text
• Detect names (e.g., place names, people’s names, product names, etc.)

We will use a library that I wrote that performs POS tagging, categorization (classification), summarization, and detects proper names.

My example code for this chapter is contained in separate Quicklisp projects located in the subdirectories:

• src/fasttag: performs part of speech tagging and tokenizes text
• src/categorize_summarize: performs categorization (e.g., detects the topic of text is news, politics, economy, etc.) and text summarization
• src/kbnlp: the top level APIs for my pure Common Lisp natural language processing (NLP) code. In later chapters we will take a different approach by using Python deep learning models for NLP that we call as a web service. I use both approaches in my own work.

I worked on this Lisp code, and also similar code in Java, from about 2001 to 2011, and again in 2019 for my application for generating knowledge graph data automatically (this is an example in a later chapter). I am going to begin the next section with a quick explanation of how to run the example code. If you find the examples interesting then you can also read the rest of this chapter where I explain how the code works.

The approach that I used in my library for categorization (word counts) is now dated. I recommend that you consider taking Andrew Ng’s course on Machine Learning on the free online Coursera system and then take one of the Coursera NLP classes for a more modern treatment of NLP.

In addition to the code for my library you might also find the linguistic data in src/linguistic_data useful.

I repackaged the NLP example code into one long file. The code used to be split over 18 source files. The code should be loaded from the src/kbnlp directory:
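
Assuming the example code is visible to Quicklisp (for example via the local-projects setup described at the end of the Preface), loading looks like:

{lang="lisp",linenos=off}
~~~~~~~~
(ql:quickload "kbnlp")
~~~~~~~~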

This also loads the projects in src/fasttag and src/categorize_summarize.

Unfortunately, it takes about a minute using SBCL to load the required linguistic data so I recommend creating a Lisp image that can be reloaded to avoid the time required to load the data:
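
A minimal sketch (the image file name is arbitrary), run in SBCL after the library and its data have been loaded:

{lang="lisp",linenos=off}
~~~~~~~~
(sb-ext:save-lisp-and-die "nlp-image")
~~~~~~~~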

In line 1 in this repl listing, I use the SBCL built-in function save-lisp-and-die to create the Lisp image file. Using save-lisp-and-die is a great technique to use whenever it takes a while to set up your work environment. Saving a Lisp image for use the next time you work on a Common Lisp project is reminiscent of working in Smalltalk where your work is saved between sessions in an image file.

Note: I often use Clozure-CL (CCL) instead of SBCL for developing my NLP libraries because CCL loads my data files much faster than SBCL.

You can now start SBCL with the NLP library and data preloaded using the Lisp image that you just created:

At the end of the file src/knowledgebooks_nlp.lisp in comments is some test code that processes much more text so that a summary is also generated; here is a bit of the output you will see if you load the test code into your repl:

The top-level function make-text-object takes one required argument that can be either a string containing text or an array of strings where each string is a word or punctuation. Function make-text-object has two optional keyword parameters: the URL where the text was found and a title.

In line 2, we check if this function was called with a string containing text in which case the function words-from-string is used to tokenize the text into an array of string tokens. Line two defines the local variable txt-obj with the value of a new text object with only three slots (attributes) defined: text, url, and title. Line 4 sets the slot text-tags to the part of speech tokens using the function part-of-speech-tagger. We use the function find-names-places in line 8 to get person and place names and store these values in the text object. In lines 11 through 17 we use the function get-word-list-category to set the categories in the text object. In line 18 we similarly use the function summarize to calculate a summary of the text and also store it in the text object. We will discuss these NLP helper functions throughout the rest of this chapter.

The function make-text-object returns a struct that is defined as:
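
The slot names below are an illustrative guess based on how the struct is used in this chapter (word tokens, part of speech tags, names, places, categories, and a summary); consult src/kbnlp for the authoritative definition:

{lang="lisp",linenos=off}
~~~~~~~~
;; Illustrative sketch only, not the exact defstruct from the library.
(defstruct text
  url
  title
  text            ; array of word tokens
  tags            ; part of speech tags, accessed via text-tags
  human-names     ; people's names found in the text
  place-names     ; place names found in the text
  category-tags   ; categories with scores
  summary)
~~~~~~~~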

### Part of Speech Tagging

This tagger is the Common Lisp implementation of my FastTag open source project. I based this project on Eric Brill’s PhD thesis (1995). He used machine learning on annotated text to learn tagging rules. I used a subset of the tagging rules that he generated that were most often used when he tested his tagger. I hand coded his rules in Lisp (and Ruby, Java, and Pascal). My tagger is less accurate, but it is fast - thus the name FastTag.

If you just need part of speech tagging (and not summarization, categorization, and top level APIs used in the last section) you can load:
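
Assuming the src/fasttag project is visible to Quicklisp:

{lang="lisp",linenos=off}
~~~~~~~~
(ql:quickload "fasttag")
~~~~~~~~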

You can find the tagger implementation in the function part-of-speech-tagger. We already saw sample output from the tagger in the last section:

The following table shows the meanings of the tags and a few example words:

| Tag  | Definition            | Example words   |
|------|-----------------------|-----------------|
| CC   | Coord Conjuncn        | and, but, or    |
| CD   | Cardinal number       | one, two        |
| DT   | Determiner            | the, some       |
| EX   | Existential there     | there           |
| FW   | Foreign Word          | mon dieu        |
| IN   | Preposition           | of, in, by      |
| JJ   | Adjective             | big             |
| JJR  | Adj., comparative     | bigger          |
| JJS  | Adj., superlative     | biggest         |
| LS   | List item marker      | 1, One          |
| MD   | Modal                 | can, should     |
| NN   | Noun, sing. or mass   | dog             |
| NNS  | Noun, plural          | dogs, cats      |
| NNP  | Proper noun, sing.    | Edinburgh       |
| NNPS | Proper noun, plural   | Smiths          |
| PDT  | Predeterminer         | all, both       |
| POS  | Possessive ending     | ’s              |
| PP   | Personal pronoun      | I, you, she     |
| PP$  | Possessive pronoun    | my, one’s       |
| RB   | Adverb                | quickly         |
| RBR  | Adverb, comparative   | faster          |
| RBS  | Adverb, superlative   | fastest         |
| RP   | Particle              | up, off         |
| SYM  | Symbol                | +, %, &         |
| TO   | “to”                  | to              |
| UH   | Interjection          | oh, oops        |
| VB   | verb, base form       | eat, run        |
| VBD  | verb, past tense      | ate             |
| VBG  | verb, gerund          | eating          |
| VBN  | verb, past part       | eaten           |
| VBP  | Verb, present         | eat             |
| VBZ  | Verb, present         | eats            |
| WDT  | Wh-determiner         | which, that     |
| WP   | Wh pronoun            | who, what       |
| WP$  | Possessive-Wh         | whose           |
| $    | Dollar sign           | $               |
| #    | Pound sign            | #               |
| "    | quote                 | "               |
| (    | Left paren            | (               |
| )    | Right paren           | )               |
| ,    | Comma                 | ,               |
| .    | Sent-final punct      | . ! ?           |
| :    | Mid-sent punct.       | : ; —           |

The function part-of-speech-tagger loops through all input words and initially assigns the most likely part of speech as specified in the lexicon. Then a subset of Brill’s rules are applied. Rules operate on the current word and the previous word.

As an example Common Lisp implementation of a rule, we look for words that are tagged as common nouns but end in “ing”, so they should be re-tagged as a gerund (verb form):
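
A minimal sketch of this kind of rule (not the exact code from the library):

{lang="lisp",linenos=off}
~~~~~~~~
;; If a word was tagged as a common noun (NN) but ends in "ing",
;; re-tag it as a gerund (VBG).
(defun fix-noun-ending-in-ing (word tag)
  (if (and (string= tag "NN")
           (> (length word) 3)
           (string= (subseq word (- (length word) 3)) "ing"))
      "VBG"
      tag))
~~~~~~~~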

You can find the lexicon data in the file src/linguistic_data/FastTagData.lisp. This file is Lisp code instead of plain data (in retrospect plain data would be better because it would load faster) and looks like:

I generated this file automatically from lexicon data using a small Ruby script. Notice that words can have more than one possible part of speech. The most common part of speech for a word is the first entry in the lexicon.

### Categorizing Text

The code to categorize text is fairly simple using a technique often called “bag of words.” I collected sample text in several different categories and for each category (like politics, sports, etc.) I calculated the evidence or weight that words contribute to supporting a category. For example, the word “president” has a strong weight for the category “politics” but not for the category “sports.” The reason is that the word “president” occurs frequently in articles and books about politics. The data file that contains the word weightings for each category is src/data/cat-data-tables.lisp. You can look at this file; here is a very small part of it:

If you only need categorization and not the other libraries developed in this chapter, you can just load this library and run the example in the comment at the bottom of the file categorize_summarize.lisp:

{lang="lisp",linenos=off}
~~~~~~~~
(ql:quickload "categorize_summarize")
(defvar x "President Bill Clinton <<2 pages of text not shown>> ")
(defvar words1 (myutils:words-from-string x))
(print words1)
(setq cats1 (categorize_summarize:categorize words1))
(print cats1)
(defvar sum1 (categorize_summarize:summarize words1 cats1))
(print sum1)
~~~~~~~~

Let’s look at the implementation, starting with creating hash tables for storing word count data for each category or topic:

This file was created by a simple Ruby script (not included with the book’s example code) that processes a list of sub-directories, one sub-directory per category. The following listing shows the implementation of function get-word-list-category that calculates category tags for input text:

One thing to notice in this listing is lines 11 through 15 where I define a nested function list-sort that takes a list of sub-lists and sorts the sublists based on the second value (which is a number) in the sublists. I often nest functions when the “inner” functions are only used in the “outer” function.

Lines 2 through 9 define several local variables used in the outer function. The global variable categoryHashtables is a list of word weighting score hash tables, one for each category. The local variable category-score-accumulation-array is initialized to an array containing the number zero in each element and will be used to “keep score” of each category. The highest scored categories will be the return value for the outer function.

Lines 17 through 27 are two nested loops. The outer loop is over each word in the input word array. The inner loop is over the number of categories. The logic is simple: for each word, check to see if it has a weighting score in each category’s word weighting score hash table and, if it does, increment the matching category’s score.

The local variable ss is set to an empty list on line 28 and in the loop in lines 29 through 38 I am copying over categories and their scores when the score is over a threshold value of 0.01. We sort the list in ss on line 39 using the inner function and then return the categories with a score greater than the median category score.

### Detecting People’s Names and Place Names

The code for detecting people and place names is in the top level API code in the package defined in src/kbnlp. This package is loaded using:

The functions that support identifying people’s names and place names in text are in the Common Lisp package kbnlp:

• find-names (words tags exclusion-list) – words is an array of strings for the words in text, tags are the parts of speech tags (from FastTag), and the exclusion list is an array of words that you want to exclude from being considered as parts of people’s names. The list of found names records starting and stopping indices for names in the array words.
• not-in-list-find-names-helper (a-list start end) – returns true if a found name has not already been added to the list used for saving people’s names in text
• find-places (words exclusion-list) – this is similar to find-names, but it finds place names. The list of found place names records starting and stopping indices for place names in the array words.
• not-in-list-find-places-helper (a-list start end) – returns true if a found place name has not already been added to the list used for saving place names in text
• build-list-find-name-helper (v indices) – This converts lists of start/stop word indices to strings containing the names
• find-names-places (txt-object) – this is the top level function that your application will call. It takes a defstruct text object as input and modifies the defstruct text by adding people’s and place names it finds in the text. You saw an example of this earlier in this chapter.

I will let you read the code and just list the top level function:

In line 2 we are using the slot accessor text-text to fetch the array of word tokens from the text object. In lines 3, 4, and 5 we are doing the same for part of speech tags, place name indices in the words array, and person names indices in the words array.

In lines 6 through 11 we are using the function build-list-find-name-helper twice to construct the person names and place names as strings given the indices in the words array. We are also using the Common Lisp built-in function remove-duplicates to get rid of duplicate names.

In lines 12 through 16 we are discarding any person names that do not contain a space, that is, we only keep names that are at least two word tokens. Lines 17 through 19 define the return value for the function: a list of lists of people and place names, using the function remove-shorter-names twice to remove shorter versions of the same names from the lists. For example, if we had two names “Mr. John Smith” and “John Smith” then we would want to drop the shorter name “John Smith” from the return list.

### Summarizing Text

The code for summarizing text is located in the directory src/categorize_summarize and can be loaded using:
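
Assuming the src/categorize_summarize project is visible to Quicklisp:

{lang="lisp",linenos=off}
~~~~~~~~
(ql:quickload "categorize_summarize")
~~~~~~~~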

The code for summarization depends on the categorization code we saw earlier.

There are many applications for summarizing text. As an example, if you are writing a document management system you will certainly want to use something like Solr to provide search functionality. Solr will return highlighted matches in snippets of indexed document field values. Using summarization, when you add documents to a Solr (or other) search index you could create a new unindexed field that contains a document summary. Then when the users of your system see search results they will see the type of highlighted matches in snippets they are used to seeing in Google, Bing, or DuckDuckGo search results, and, they will see a summary of the document.

Sounds good? The problem to solve is getting good summaries of text and the technique used may have to be modified depending on the type of text you are trying to summarize. There are two basic techniques for summarization: a practical way that almost everyone uses, and an area of research that I believe has so far seen little practical application. The techniques are sentence extraction and abstraction of text into a shorter form by combining and altering sentences. We will use sentence extraction.

How do we choose which sentences in text to extract for the summary? The idea I had in 1999 was simple. Since I usually categorize text in my NLP processing pipeline why not use the words that gave the strongest evidence for categorizing text, and find the sentences with the largest number of these words. As a concrete example, if I categorize text as being “politics”, I identify the words in the text like “president”, “congress”, “election”, etc. that triggered the “politics” classification, and find the sentences with the largest concentrations of these words.

Summarization is something that you will probably need to experiment with depending on your application. My old summarization code contained a lot of special cases, blocks of commented out code, etc. I have attempted to shorten and simplify my old summarization code for the purposes of this book as much as possible and still maintain useful functionality.

The function for summarizing text is fairly simple because when the function summarize is called by the top level NLP library function make-text-object, the input text has already been categorized. Remember from the example at the beginning of the chapter that the category data looks like this:

This category data is saved in the local variable cats on line 4 of the following listing.

The nested loops in lines 8 through 33 look a little complicated, so let’s walk through it. Our goal is to calculate an importance score for each word token in the input text and to then select a few sentences containing highly scored words. The outer loop is over the word tokens in the input text. For each word token we loop over the list of categories, looking up the current word in each category hash and incrementing the score for the current word token. As we increment the word token scores we also look for sentence breaks and save sentences.

The complicated bit of code is in lines 16 through 32, where I construct sentences and their scores and store sentences with a score above a threshold value in the list best-sentences. After the two nested loops, in lines 34 through 44 we simply sort the sentences by score and select the “best” sentences for the summary. The extracted sentences are no longer in their original order, which can have strange effects, but I like seeing the most relevant sentences first.

### Text Mining

Text mining in general refers to finding data in unstructured text. We have covered several text mining techniques in this chapter:

• Named entity recognition - the NLP library covered in this chapter recognizes person and place entity names. I leave it as an exercise for you to extend this library to handle company and product names. You can start by collecting company and product names in the files src/kbnlp/linguistic_data/names/names.companies and src/kbnlp/data/names/names.products and extend the library code.
• Categorizing text - you can increase the accuracy of categorization by adding more weighted words/terms that support categories. If you are already using Java in the systems you build, I recommend the Apache OpenNLP library that is more accurate than the simpler “bag of words” approach I used in my Common Lisp NLP library. If you use Python, then I recommend that you also try the NLTK library.
• Summarizing text.

In the next chapter I am going to cover another “data centric” topic: performing information gathering on the web. You will likely find some synergy between gathering information from the web and using NLP to create structured data from unstructured text.

## Information Gathering

This chapter covers information gathering on the web using data sources and general techniques that I have found useful. When I was planning this new book edition I had intended to also cover some basics for using the Semantic Web from Common Lisp, basically distilling some of the data from my previous book “Practical Semantic Web and Linked Data Applications, Common Lisp Edition” published in 2011. However since a free PDF is now available for that book I decided to just refer you to my previous work if you are interested in the Semantic Web and Linked Data. You can also find the Java edition of this previous book on my web site.

Gathering information from the web in realtime has some real advantages:

• You don’t need to worry about storing data locally.
• Information is up to date (depending on which web data resources you choose to use).

There are also a few things to consider:

• Data on the web may have legal restrictions on its use so be sure to read the terms and conditions on web sites that you would like to use.
• Authorship and validity of data may be questionable.

### DBPedia Lookup Service

To load and run an example, try:

Wikipedia is a great resource to have on hand but I am going to show you in this section how to access the Semantic Web version of Wikipedia, DBPedia, using the DBPedia Lookup Service. The next code listing shows the contents of the example file dbpedia-lookup.lisp in the directory src/dbpedia:

I am only capturing the attributes for DBPedia URI, label and description in this example code. If you uncomment line 41 and look at the entire response body from the call to DBPedia Lookup, you can see other attributes that you might want to capture in your applications.

Here is a sample call to the function dbpedia:dbpedia-lookup (only some of the returned data is shown):

Wikipedia, and the DBPedia linked data version of Wikipedia, are great sources of online data. If you get creative, you will be able to think of ways to modify the systems you build to pull data from DBPedia. One warning: Semantic Web/Linked Data sources on the web are not available 100% of the time. If your business applications depend on having DBPedia always available then you can follow the instructions on the DBPedia web site to install the service on one of your own servers.

### Web Spiders

When you write web spiders to collect data from the web there are two things to consider:

• Make sure you read the terms of service for web sites whose data you want to use. I have found that calling or emailing web site owners explaining how I want to use the data on their site usually works to get permission.
• Make sure you don’t access a site too quickly. It is polite to wait a second or two between fetching pages and other assets from a web site.

We have already used the Drakma web client library in this book. See the files src/dbpedia/dbpedia-lookup.lisp (covered in the last section) and src/solr_examples/solr-client.lisp (covered in the Chapter on NoSQL). Paul Nathan has written a library using Drakma to crawl a web site, with an example that prints out links as they are found. His code is available under the AGPL license at articulate-lisp.com/src/web-trotter.lisp and I recommend it as a starting point.

I find it is sometimes easier during development to make local copies of a web site so that I don’t have to use excess resources from web site hosts. Assuming that you have the wget utility installed, you can mirror a site like this:

Both of these examples have a two-second delay between HTTP requests for resources. The option -m indicates to mirror the site, recursively following all links. The -w 2 option delays for two seconds between requests. The -k option, combined as -mk in the second example, converts URI references to local file references in your local mirror. The second example on line 2 is more convenient.

We covered reading from local files in the Chapter on Input and Output. One trick I use is to simply concatenate all web pages into one file. Assuming that you created a local mirror of a web site, cd to the top level directory and use something like this:

You can then open the file and search for text in p, div, h1, etc. HTML elements to process an entire web site as one file.

### Using Apache Nutch

Apache Nutch, like Solr, is built on Lucene search technology. I use Nutch as a “search engine in a box” when I need to spider web sites and I want a local copy with a good search index.

Nutch handles a different use case than Solr, which we covered in the Chapter on NoSQL. As we saw, Solr is an effective tool for indexing and searching structured data as documents. With very little setup, Nutch can automatically keep an up-to-date index of a list of web sites, and optionally follow links to some desired depth from these “seed” web sites.

You can use the same Common Lisp client code that we used for Solr with one exception; you will need to change the root URI for the search service to:

So the modified client code src/solr_examples/solr-client.lisp needs one line changed:

Early versions of Nutch were very simple to install and configure. Later versions of Nutch have been more complex, more performant, and have more services, but it will take you longer to get set up than earlier versions. If you just want to experiment with Nutch, you might want to start with an earlier version.

The OpenSearch.org web site contains many public OpenSearch services that you might want to try. If you want to modify the example client code in src/solr-client.lisp, a good place to start is with OpenSearch services that return JSON data; the OpenSearch Community JSON formats web page lists several. Some of the services on this web page, like the New York Times service, require that you sign up for a developer’s API key.

When I start writing an application that requires web data (no matter which programming language I am using) I start by finding services that may provide the type of data I need and do my initial development with a web browser with plugin support to nicely format XML and JSON data. I do a lot of exploring and take a lot of notes before I write any code.

### Wrap Up

I tried to provide some examples and advice in this short chapter to show you that even though other languages like Ruby and Python have more libraries and tools for gathering information from the web, Common Lisp has good libraries for information gathering also and they are easily used via Quicklisp.

## Linear Algebra Using the MAGICL Library

### Installation

MAGICL uses either the BLAS or LAPACK packages for efficient array calculations and the many linear algebra functions provided in these packages. BLAS and LAPACK are partially written in FORTRAN and may not be easy to install on your system. Fear not! You can run MAGICL (somewhat slowly) using its pure Common Lisp backend, which can be installed as follows:

• cd ~/quicklisp/local-projects
• git clone https://github.com/quil-lang/magicl.git

Then you can simply use the pure Common Lisp backend; for example:
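
A small sketch of using MAGICL after cloning it as above; the from-list and @ operators shown here come from recent MAGICL releases, so check the project README if the names have changed:

{lang="lisp",linenos=off}
~~~~~~~~
(ql:quickload "magicl")

;; Multiply two 2x2 matrices of double floats.
(let ((a (magicl:from-list '(1d0 2d0 3d0 4d0) '(2 2)))
      (b (magicl:from-list '(5d0 6d0 7d0 8d0) '(2 2))))
  (magicl:@ a b))
~~~~~~~~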

That said, try to install one of the much faster backends using the documentation for installing dependencies.

## Using The CL Machine-Learning Library

The CL Machine-Learning (CLML) library was originally developed by MSI (NTT DATA Mathematical Systems Inc. in Japan) and is supported by many developers. You should visit the CLML web page for project documentation and follow the installation directions and read about the project before using the examples in this chapter. However if you just want to quickly try the following CLML examples then you can install CLML using Quicklisp:
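
For example (this pulls in many sub-systems and takes a while the first time):

{lang="lisp",linenos=off}
~~~~~~~~
(ql:quickload "clml")
~~~~~~~~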

The installation will take a while to run but after installation using the libraries via quickload is fast. You can now run the example Quicklisp project src/clml_examples:

Please be patient the first time you run this: the one-time installation of CLML takes a while, but after installation the example project loads quickly. CLML installation involves downloading and installing BLAS, LAPACK, and other libraries.

Other resources for CLML are the tutorials and contributed extensions that include support for plotting (using several libraries) and for fetching data sets.

Although CLML is fairly portable we will be using SBCL and we need to increase the heap space when starting SBCL when we want to use the CLML library:

You can refer to the documentation at https://github.com/mmaul/clml. This documentation lists the packages with some information for each package, but realistically I keep the source code for CLML open in an editor or IDE and read source code while writing code that uses CLML. I will show you with short examples how to use the KNN (K nearest neighbors) and SVM (support vector machines) APIs. We will not cover other useful CLML APIs like time series processing, Naive Bayes, PCA (principal component analysis), and general matrix and tensor operations.

Even though the learning curve is a bit steep, CLML provides a lot of functionality for machine learning, dealing with time series data, and general matrix and tensor operations.

The CLML project uses several data sets and since the few that we will use are small files, they are included in the book’s repository in directory machine_learning_data under the src directory. The first few lines of labeled_cancer_training_data.csv are:

The first line in the CSV data files specifies names for each attribute with the name of the last column being “Class” which here takes on values benign or malignant. Later, the goal will be to create models that are constructed from training data and then make predictions of the “Class” of new input data. We will look at how to build and use machine learning models later but here we concentrate on reading and using input data.

The example file clml_data_apis.lisp shows how to open a file and loop over the values for each row:

The function read-data defined in lines 11-19 uses the utility function clml.hjs.read-data:read-data-from-file to read a CSV (comma separated value) spreadsheet file from disk. The CSV file is expected to contain 10 columns (set in lines 17-18) with the first nine columns containing floating point values and the last column text data.

The function loop-over-and-print-data defined in lines 21-26 reads the CLML data set object, looping over each data sample (i.e., each row in the original spreadsheet file) and printing it.

In the next section we will use the same cancer data training file, and another test data in the same format to cluster this cancer data into similar sets, one set for non-malignant and one for malignant samples.

### K-Means Clustering of Cancer Data Set

We will now read the same University of Wisconsin cancer data set and cluster the input samples (one sample per row of the spreadsheet file) into similar classes. We will find after training a model that the data is separated into two clusters, representing non-malignant and malignant samples.

The function cancer-data-cluster-example-read-data defined in lines 33-47 is very similar to the function read-data in the last section except here we read in two data files: one for training and one for testing.

The function cluster-using-k-nn defined in lines 13-30 uses the training and test data objects to first train a model and then to test it with test data that was not used for training. Notice how we call this function in line 47: the first two arguments are the two data set objects, the third is the string “Class” that is the label for the 10th column of the original spreadsheet CSV files, and the last argument is the type of distance measurement used to compare two data samples (i.e., comparing any two rows of the training CSV data file).

The following listing shows the output from running the last code example:

### SVM Classification of Cancer Data Set

We will now reuse the same cancer data set but use a different way to classify data into non-malignant and malignant categories: Support Vector Machines (SVM). SVMs are linear classifiers which means that they work best when data is linearly separable. In the case of the cancer data, there are nine dimensions of values that (hopefully) predict one of the two output classes (or categories). If we think of the first 9 columns of data as defining a 9-dimensional space, then SVM will work well when an 8-dimensional hyperplane separates the samples into the two output classes (categories).

To make this simpler to visualize, if we just had two input columns, that defines a two-dimensional space, and if a straight line can separate most of the examples into the two output categories, then the data is linearly separable and SVM is a good technique to use. The SVM algorithm effectively determines the parameters defining this separating line (or, in the cancer data case, the 8-dimensional separating hyperplane).

What if data is not linearly separable? Then use the backpropagation neural network code in the chapter “Backpropagation Neural Networks” or the deep learning code in the chapter “Using Armed Bear Common Lisp With DeepLearning4j” to create a model.

SVM is very efficient so it often makes sense to first try SVM and if trained models are not accurate enough then use neural networks, including deep learning.

The following listing of file clml_svm_classifier.lisp shows how to read data, build a model and evaluate the model with different test data. In line 15 we use the function clml.svm.mu:svm that requires the type of kernel function to use, the training data, and testing data. Just for reference, we usually use Gaussian kernel functions for processing numeric data and linear kernel functions for handling text in natural language processing applications. Here we use a Gaussian kernel.

The function cancer-data-svm-example-read-data defined on line 40 differs from how we read and processed data earlier because we need to separate out the positive and negative training examples. The data is split in the lexically scoped function in lines 42-52. The last block of code in lines 54-82 is just top-level test code that gets executed when the file clml_svm_classifier.lisp is loaded.

The sample code prints the prediction values for the test data which I will not show here. Here are the last four lines of output showing the cumulative statistics for the test data:

### CLML Wrap Up

The CLML machine learning library is under fairly active development and I showed you enough to get started: understanding the data APIs and examples for KNN clustering and SVM classification.

A good alternative to CLML is MGL, which supports backpropagation neural networks, Boltzmann machines, and Gaussian processes.

In the next two chapters we continue with the topic of machine learning with backpropagation and Hopfield neural networks.

## Backpropagation Neural Networks

Let’s start with an overview of how these networks work and then fill in more detail later. Backpropagation networks are trained by applying training inputs to the network input layer, propagating values through the network to the output neurons, and comparing the errors (or differences) between these propagated output values and the training data output values. These output errors are backpropagated through the network and the magnitude of the backpropagated errors is used to adjust the weights in the network.

The example we look at here uses the plotlib package from an earlier chapter and the source code for the example is the file loving_snippet/backprop_neural_network.lisp.

We will use the following diagram to make this process more clear. There are four weights in this very simple network:

• W1,1 is the floating point number representing the connection strength between input_neuron1 and output_neuron1
• W2,1 connects input_neuron2 to output_neuron1
• W1,2 connects input_neuron1 to output_neuron2
• W2,2 connects input_neuron2 to output_neuron2

Before any training the weight values are all small random numbers.

Consider a training data element where the input neurons have values [0.1, 0.9] and the desired output neuron values are [0.9, 0.1], that is, the input values flipped. If the propagated output values for the current weights are [0.85, 0.5] then the value of the first output neuron has a small error abs(0.85 - 0.9) which is 0.05. However the propagated error of the second output neuron is high: abs(0.5 - 0.1) which is 0.4. Informally we see that the weights feeding output neuron 1 (W1,1 and W2,1) don’t need to be changed much but the weights feeding output neuron 2 (W1,2 and W2,2) need modification (the value of W2,2 is too large).

Of course, we would never try to manually train a network like this but it is important to have at least an informal understanding of how weights connect the flow of value (we will call this activation value later) between neurons.

In this neural network, seen in the first figure, we have four weights connecting the input and output neurons. Think of these four weights as forming a four-dimensional space where the range in each dimension is constrained to small positive and negative floating point values. At any point in this “weight space”, the numeric values of the weights define a model that maps the inputs to the outputs. The error seen at the output neurons is accumulated for each training example (applied to the input neurons). The training process is finding a point in this four-dimensional space that has low errors summed across the training data. We will use gradient descent to start with a random point in the four-dimensional space (i.e., an initial random set of weights) and move the point towards a local minimum that represents the weights in a model that is (hopefully) “good enough” at representing the training data.
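As a concrete illustration of a single gradient descent step (this is not the book’s implementation; the learning rate value and names are made up for the sketch), each weight is nudged in proportion to the activation of its source neuron and the backpropagated error of its destination neuron:

```lisp
;; Illustrative delta-rule style update for a single weight.
(defparameter *learning-rate* 0.2) ; assumed value for this sketch

(defun updated-weight (weight source-activation error-term)
  "Return a new weight value moved a small step to reduce the output error."
  (+ weight (* *learning-rate* source-activation error-term)))
```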

This process is simple enough but there are a few practical considerations:

• Sometimes the accumulated error at a local minimum is too large even after many training cycles and it is best to just restart the training process with new random weights.
• If we don’t have enough training data then the network may have enough memory capacity to memorize the training examples. This is not what we want: we want a model with just enough memory capacity (as represented by the number of weights) to form a generalized predictive model, but not so specific that it just memorizes the training examples. The solution is to start with small networks (few hidden neurons) and increase the number of neurons until the training data can be learned. In general, having a lot of training data is good and it is also good to use as small a network as possible.

In practice using backpropagation networks is an iterative process of experimenting with the size of a network.

In the example program (in the file backprop_neural_network.lisp) we use the plotting library developed earlier to visualize neuron activation and connecting weight values while the network trains.

The following three screen shots from running the function test3 defined at the bottom of the file backprop_neural_network.lisp illustrate the process of starting with random weights, getting random outputs during initial training, and as delta weights are used to adjust the weights in a network, then the training examples are learned:

In the first figure the initial weights are random so we get random mid-range values at the output neurons.

As we start to train the network, adjusting the weights, we start to see variation in the output neurons as a function of what the inputs are.

In the last figure the network is trained sufficiently well to map inputs [0, 0, 0, 1] to output values that are approximately [0.8, 0.2, 0.2, 0.3] which is close to the expected value [1, 0, 0, 0].

The example source file backprop_neural_network.lisp is long so we will only look at the more interesting parts here. Specifically we will not look at the code to plot neural networks using plotlib.

The activation values of individual neurons are limited to the range [0, 1] by first calculating their values based on the sum of the activation values of neurons in the previous layer, each multiplied by the value of the connecting weight, and then using the Sigmoid function to map the sums to the desired range. The Sigmoid function and the derivative of the Sigmoid function (dSigmoid) look like:

Here are the definitions of these functions:
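As a minimal sketch (the actual definitions in backprop_neural_network.lisp may differ in details such as clamping or float declarations), the standard logistic sigmoid and its derivative can be written as:

```lisp
;; Standard logistic sigmoid: maps any real number into the range (0, 1).
(defun sigmoid (x)
  (/ 1.0 (+ 1.0 (exp (- x)))))

;; Derivative of the sigmoid, expressed in terms of the sigmoid value itself.
(defun dsigmoid (x)
  (let ((s (sigmoid x)))
    (* s (- 1.0 s))))
```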

The function NewDeltaNetwork creates a new neural network object. This code allocates storage for the input, hidden, and output layers (I sometimes refer to neuron layers as “slabs”), and the connection weights. Connection weights are initialized to small random values.

In the following listing the function DeltaLearn processes one pass through all of the training data. Function DeltaLearn is called repeatedly until the return value is below a desired error threshold. The main loop over each training example is implemented in lines 69-187. Inside this outer loop there are two phases of training for each training example: a forward pass propagating activation from the input neurons to the output neurons via any hidden layers (lines 87-143) and then the weight correcting backpropagation of output errors while making small adjustments to weights (lines 148-187):
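To make the forward pass concrete, here is a sketch (not the book’s code) of propagating activations from one layer to the next: each destination neuron’s activation is the sigmoid of the weighted sum of the source layer’s activations.

```lisp
;; Sketch of one forward-propagation step between two layers.
;; INPUTS and OUTPUTS are vectors of activations, WEIGHTS is a 2D array
;; indexed by (source, destination). Uses the SIGMOID sketch shown earlier.
(defun propagate-layer (inputs weights outputs)
  (let ((num-in (length inputs))
        (num-out (length outputs)))
    (dotimes (j num-out)
      (let ((sum 0.0))
        (dotimes (i num-in)
          (incf sum (* (aref inputs i) (aref weights i j))))
        (setf (aref outputs j) (sigmoid sum))))
    outputs))
```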

The function DeltaRecall in the next listing can be used with a trained network to calculate outputs for new input values:

We saw three output plots earlier that were produced during a training run using the following code:

Here the function test3 defines training data for a very small test network for a moderately difficult function to learn: rotating the values in the input neurons to the right, wrapping around to the first neuron. The main loop calls the training function 3000 times, creating a plot of the network every 400 times through the loop.

Backpropagation networks have been used successfully in production for about 25 years. In the next chapter we will look at a less practical type of network, Hopfield networks, which are still interesting because, in some sense, Hopfield networks model how our brains work. In the final chapter we will look at deep learning neural networks.

## Hopfield Neural Networks

A Hopfield network (named after John Hopfield) is a recurrent network since the flow of activation through the network has loops. These networks are trained by applying input patterns and letting the network settle in a state that stores the input patterns.

The example code is in the file src/loving_snippets/Hopfield_neural_network.lisp.

The example we look at recognizes patterns that are similar to the patterns seen in training examples and maps input patterns to a similar training input pattern. The following figure shows output from the example program showing an original training pattern, a similar pattern with one cell turned on and another turned off, and the reconstructed pattern:

To be clear, we have taken one of the original input patterns the network has learned, slightly altered it, and applied it as input to the network. After cycling the network, the slightly scrambled input pattern we just applied acts as an associative memory key: the network looks up the original pattern and rewrites the input values with the original learned pattern. Hopfield networks are very different from backpropagation networks: neuron activations are forced to values of -1 or +1 (so they are not differentiable) and there are no separate output neurons.

The next example has the values of three cells modified from the original and the original pattern is still reconstructed correctly:

This last example has four of the original cells modified:

The following example program shows a type of content-addressable memory. After a Hopfield network learns a set of input patterns it can reconstruct the original patterns when shown similar patterns. This reconstruction is not always perfect.

The following function Hopfield-Init (in file Hopfield_neural_network.lisp) is passed a list of lists of training examples that will be remembered in the network. This function returns a list containing the data defining a Hopfield neural network. All data for the network is encapsulated in the list returned by this function, so multiple Hopfield neural networks can be used in an application program.

In lines 9-12 we allocate global arrays for data storage and in lines 14-18 the training data is copied.

The inner function adjustInput on lines 20-29 adjusts data values to values of -1.0 or +1.0. In lines 31-33 we are initializing all of the weights in the Hopfield network to zero.

The last nested loop, on lines 35-52, calculates the autocorrelation weight matrix from the input test patterns.
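The autocorrelation (Hebbian) learning rule is simple enough to show in a few lines. The following is an illustrative sketch, not the code from Hopfield_neural_network.lisp: each weight is the sum, over the training patterns, of the product of the two cell values (each -1.0 or +1.0), with zero self-connections.

```lisp
;; Sketch of Hebbian/autocorrelation training for a Hopfield network.
;; PATTERNS is a list of vectors whose elements are -1.0 or +1.0.
(defun autocorrelation-weights (patterns num-cells)
  (let ((weights (make-array (list num-cells num-cells) :initial-element 0.0)))
    (dolist (pattern patterns)
      (dotimes (i num-cells)
        (dotimes (j num-cells)
          (unless (= i j)           ; no self-connections
            (incf (aref weights i j)
                  (* (aref pattern i) (aref pattern j)))))))
    weights))
```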

On lines 54-56, the function returns a representation of the Hopfield network that will be used later in the function HopfieldNetRecall to find the most similar “remembered” pattern given a new (fresh) input pattern.

The following function HopfieldNetRecall iterates the network to let it settle in a stable pattern which we hope will be the original training pattern most closely resembling the noisy test pattern.

The inner (lexically scoped) function deltaEnergy defined on lines 9-12 calculates a change in energy from the old input values and the autocorrelation weight matrix. The main code uses the inner functions to iterate over the input cells, possibly modifying the cell at index i if its delta energy is greater than zero. Remember that the lexically scoped inner functions have access to the variables for the number of inputs, the number of training examples, the list of training examples, the input cell values, temporary storage, and the Hopfield network weights.
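For intuition, a single asynchronous update step can be sketched like this (again illustrative, not the book’s deltaEnergy code): the cell at index i is set to +1.0 when the weighted sum of the other cell values is positive, and to -1.0 otherwise.

```lisp
;; Sketch of one cell update during recall. CELLS is a vector of -1.0/+1.0
;; values and WEIGHTS is the autocorrelation matrix built during training.
(defun update-cell (cells weights i)
  (let ((sum 0.0))
    (dotimes (j (length cells))
      (incf sum (* (aref weights i j) (aref cells j))))
    (setf (aref cells i) (if (> sum 0.0) 1.0 -1.0))))
```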

Function test in the next listing uses three different patterns for each test. Note that only the last pattern gets plotted to the output graphics PNG file for the purpose of producing figures for this chapter. If you want to produce plots of other patterns, edit the third test pattern defined in this function. The following plotting functions are lexically scoped inner functions so they have access to the data defined in the enclosing let expression in lines 16-21:

• plotExemplar - plots a vector of data
• plot-original-inputCells - plots the original input cells from training data
• plot-inputCells - plots the modified input cells (a few cells randomly flipped in value)
• modifyInput - scrambles training inputs

The plotting functions in lines 23-62 use the plotlib library to make the plots you saw earlier. The function modifyInput in lines 64-69 randomly flips the values of the input cells, taking an original pattern and slightly modifying it.

Hopfield neural networks, at least to some extent, seem to model some aspects of human brains in the sense that they can function as content-addressable (also called associative) memories. Ideally a partial input pattern from a remembered input can reconstruct the complete original pattern. Another interesting feature of Hopfield networks is that these memories really are stored in a distributed fashion: some of the weights can be randomly altered and patterns are still remembered, but with more recall errors.

## Using Python Deep Learning Models In Common Lisp With a Web Services Interface

In older editions of this book I had an example of using the Java DeepLearning4j deep learning library from Armed Bear Common Lisp (which is implemented in Java). I no longer use hybrid Java and Common Lisp applications in my own work so I decided to remove this example and replace it with two projects that use simple Python web services that act as wrappers for state of the art deep learning models, with Common Lisp clients in the subdirectories:

• src/spacy_web_client: use the spaCy deep learning models for general NLP. I sometimes use my own pure Common Lisp NLP libraries we saw in earlier chapters and sometimes I use a Common Lisp client calling deep learning libraries like spaCy and TensorFlow.
• src/coref_web_client: coreference or anaphora resolution is the act of replacing pronouns in text with the original nouns that they refer to. This has traditionally been a very difficult and only partially solved problem until recent advances in deep learning models like BERT.

Note: in the next chapter we will cover similar functionality but we will use the py4cl library to more directly use Python and libraries like spaCy by starting another Python process and using streams for communication.

### Setting up the Python Web Services Used in this Chapter

You will need python and pip installed on your system. The source code for the Python web services is found in the directory loving-common-lisp/python.

### Installing the spaCy NLP Services

I assume that you have some familiarity with using Python. If not, you will still be able to follow these directions assuming that you have the utilities pip and python installed. I recommend installing Python and pip using Anaconda.

The server code is in the subdirectory python/python_spacy_nlp_server where you will work when performing a one time initialization. After the server is installed you can then run it from the command line from any directory on your laptop.

I recommend that you use virtual Python environments when using Python applications to separate the dependencies required for each application or development project. Here I assume that you are running in a Python version 3.6 or higher environment. First you must install the dependencies:

Then change directory to the subdirectory python/python_spacy_nlp_server in the git repo for this book and install the NLP server:

Once you install the server, you can run it from any directory on your laptop or server using:

I use deep learning models written in Python using TensorFlow or PyTorch and provide Python web services that can be used in applications I write in Haskell or Common Lisp, using web client interfaces for the services written in Python. While it is possible to directly embed models in Haskell and Common Lisp, I find it much easier and more developer friendly to wrap the deep learning models I use as REST services as I have done here. Deep learning models often only require about a gigabyte of memory, and using pre-trained models has lightweight CPU resource needs, so while I am developing on my laptop I might have two or three models running and available as wrapped REST services. For production, I configure both the Python services and my Haskell and Common Lisp applications to start automatically on system startup.

This is not a Python programming book and I will not discuss the simple Python wrapping code but if you are also a Python developer you can easily read and understand the code.

### Installing the Coreference NLP Services

I recommend that you use virtual Python environments when using Python applications to separate the dependencies required for each application or development project. Here I assume that you are running in a Python version 3.6 environment. First you should install the dependencies:

As I write this chapter the neuralcoref model and library require a slightly older version of spaCy (the current latest version is 2.1.4).

Then change directory to the subdirectory python/python_coreference_anaphora_resolution_server in the git repo for this book and install the coref server:

Once you install the server, you can run it from any directory on your laptop or server using:

While, as we saw in the last example, it is possible to directly embed models in Haskell and Common Lisp, I find it much easier and more developer friendly to wrap the deep learning models I use as REST services as I have done here. Deep learning models often only require about a gigabyte of memory, and using pre-trained models has lightweight CPU resource needs, so while I am developing on my laptop I might have two or three models running and available as wrapped REST services. For production, I configure both the Python services and my Haskell and Common Lisp applications to start automatically on system startup.

This is not a Python programming book and I will not discuss the simple Python wrapping code but if you are also a Python developer you can easily read and understand the code.

### Common Lisp Client for the spaCy NLP Web Services

Before looking at the code, I will show you typical output from running this example:

The client library is implemented in the file src/spacy_web_client/spacy-web-client.lisp:

On line 3 we define the base URL for accessing the spaCy web service, assuming that it is running on your laptop and not a remote server. On line 5 we define a defstruct named spacy-data that has two fields: a list of entities in the input text and a list of word tokens in the input text.

The function spacy-client builds a query string on lines 10-12 that consists of the base-url and the URL-encoded input query text. The drakma library, which we used before, is used to make an HTTP request to the Python spaCy server. Lines 14-15 use the flexi-streams package to convert raw byte data to UTF8 characters. Lines 16-17 use the json package to parse the UTF8 encoded string, getting two lists of strings. I left the debug printout expression in line 18 so that you can see the results of parsing the JSON data. The function make-spacy-data was generated for us by the defstruct statement on line 5.
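Putting those pieces together, the client is roughly of the following shape. This is a hedged sketch: the port number, query parameter name, and function name are placeholders, not necessarily what the example code uses.

```lisp
;; Rough sketch of a web service client using drakma, flexi-streams, and cl-json.
;; The endpoint "http://127.0.0.1:8008?text=" is a placeholder value.
(defun spacy-client-sketch (text)
  (let* ((url (concatenate 'string
                           "http://127.0.0.1:8008?text="
                           (drakma:url-encode text :utf-8)))
         (octets (drakma:http-request url))                 ; raw response bytes
         (json-string (flexi-streams:octets-to-string
                       octets :external-format :utf-8)))    ; bytes -> UTF8 string
    (json:decode-json-from-string json-string)))            ; string -> Lisp data
```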

### Common Lisp Client for the Coreference NLP Web Services

Let’s look at some typical output from this example, then we will look at the code:

Notice that pronouns in the input text are correctly replaced by the noun phrases that the pronouns refer to.

The implementation for the core client is in the file src/coref_web_client/coref.lisp:

This code is similar to the example in the last section for setting up a call to http-request but is simpler: here the Python coreference web service accepts a string as input and returns a string as output with pronouns replaced by the nouns or noun phrases that they refer to. The example in the last section had to parse returned JSON data; this example does not.

### Trouble Shooting Possible Problems - Skip if this Example Works on Your System

If you run Common Lisp in an IDE (for example in LispWorks’ IDE or VSCode with a Common Lisp plugin) make sure you start the IDE from the command line so your PATH environment variable will be set as it is in your bash or zsh shell.

Make sure you are starting your Common Lisp program or running a Common Lisp repl with the same Python installation (if you have Quicklisp installed, then you also have the package uiop installed):
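One simple way to perform this check from a repl is to ask uiop which Python executable is on the PATH (a sketch; adjust the command for your shell):

```lisp
;; Print the Python executable that child processes of this Lisp will see.
(print (uiop:run-program "which python" :output :string))
```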

### Python Interop Wrap-up

Much of my professional work in the last five years has involved deep learning models and currently most available software is written in Python. While there are libraries for calling Python code from Common Lisp, these libraries tend to not work well for Python code using libraries like TensorFlow, spaCy, PyTorch, etc., especially if the Python code is configured to use GPUs via CUDA or special hardware like TPUs. I find it simpler to wrap functionality implemented in Python as a simple web service.

## Using the PY4CL Library to Embed Python in Common Lisp

We will tackle the same problem as the previous chapter but take a different approach. Now we will use Ben Dudson’s project Py4CL that automatically starts a Python process and communicates with the Python process via a stream interface. The approach we took before is appropriate for large scale systems where you might want to scale horizontally by having Python processes running on different servers than the servers used for the Common Lisp parts of your application. The approach we now take is much more convenient for what I call “laptop development” where the management of a Python process and the communication with it are handled for you by the Py4CL library. If you need to build multi-server distributed systems for scaling reasons then use the examples in the last chapter.

While Py4CL provides a lot of flexibility for passing primitive types between Common Lisp and Python (in both directions), I find it easiest to write small Python wrappers that only use lists, arrays, numbers, and strings as arguments and return types. You might want to experiment with the examples on the Py4CL GitHub page that let you directly call Python libraries without writing wrappers. When I write code for my own projects I try to make code as simple as possible so when I need to later revisit my own code it is immediately obvious what it is doing. Since I have been using Common Lisp for almost 40 years, I often find myself reusing bits of my own old code and I optimize for making this as easy as possible. In other words I favor readability over “clever” code.

### Project Structure, Building the Python Wrapper, and Running an Example

The packaging of the Lisp code for my spacy-py4cl package is simple. Here is the listing of package.lisp for this project:

Listing of spacy-py4cl.asd:

You need to run a Python setup procedure to install the Python wrapper for spacy-py4cl on your system. Some output is removed for conciseness:

You only need to do this once unless you update to a later version of Python on your system.

If you are not familiar with Python, it is worth looking at the wrapper implementation, otherwise skip the next few paragraphs.

Here is the implementation of setup.py that specifies how to build and install the wrapper globally for use on your system:

The definition of the library in file PYTHON_SPACY_SETUP_install/spacystub/spacystub/parse.py:

Here is a Common Lisp repl session showing you how to use the library implemented in the next section:

Entities in text are identified with the starting and ending character indices that refer to the input string. For example, the entity “Mexico” starts at character position 17 and character index 23 is the character after the entity name in the input string. The entity type “GPE” refers to a country name and “PERSON” refers to a person’s name in the input text.

### Implementation of spacy-py4cl

The Common Lisp implementation for this package is simple. In line 5 the call to py4cl:python-exec starts a process to run Python and imports the function parse from my Python wrapper. The call to py4cl:import-function in line 6 finds a function named “parse” in the attached Python process and generates a Common Lisp function with the same name that handles calling into Python and converting the returned values to Common Lisp values:
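The core of this pattern can be sketched in a few lines. The module and function names below follow the wrapper described earlier in this chapter, but treat the exact import path as an assumption rather than the package’s verbatim source:

```lisp
;; Sketch: start the attached Python process, make the wrapper's parse
;; function visible, and generate a Lisp function PARSE that calls it.
(py4cl:python-exec "from spacystub.parse import parse")
(py4cl:import-function "parse")

;; After this, the generated function can be called directly from Lisp, e.g.:
;; (parse "Bill Clinton went to Mexico")
```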

While it is possible to call Python libraries directly using Py4CL, when I need to frequently use Python libraries like spaCy, TensorFlow, fast.ai, etc. in Common Lisp, I like to use wrappers with data types and APIs that are as simple as possible to communicate between a Common Lisp process and the spawned Python process.

### Trouble Shooting Possible Problems - Skip if this Example Works on Your System

When you install my wrapper library in Python on the command line, whatever your shell is (bash, zsh, etc.), you should then try to import the library in a Python repl:

If this works and the Common Lisp library spacy-py4cl does not, then make sure you are starting your Common Lisp program or running a Common Lisp repl with the same Python installation (if you have Quicklisp installed, then you also have the package uiop installed):

If you run Common Lisp in an IDE (for example in LispWorks’ IDE or VSCode with a Common Lisp plugin) make sure you start the IDE from the command line so your PATH environment variable will be set as it is in your bash or zsh shell.

### Wrap-up for Using Py4CL

While I prefer Common Lisp for general development and also AI research, there are useful Python libraries that I want to integrate into my projects. I hope that the last chapter and this chapter provide you with two solid approaches for you to use in your own work to take advantage of Python libraries.

## Semantic Web and Linked Data

I have written two previous books on the semantic web and linked data and most of my programming books have semantic web examples. Please note that the background material here on the semantic web standards RDF, RDFS, and SPARQL is shared with my book Practical Artificial Intelligence Programming With Java so if you have read that book then the first several pages of this chapter will seem familiar.

Construction of Knowledge Graphs, as we will do in later chapters, is a core technology at many corporations and organizations to prevent data silos where different database systems are poorly connected and not as useful in combination as they could be. The use of RDF data stores is a powerful technique for data interoperability within organizations. Semantic Web standards like RDF, RDFS, and SPARQL support both building Knowledge Graphs and key technologies for automating the collection and use of web data.

I worked as a contractor at Google on an internal Knowledge Graph project and I currently work at Olive AI on their Knowledge Graph team.

The semantic web is intended to provide a massive linked set of data for use by software systems just as the World Wide Web provides a massive collection of linked web pages for human reading and browsing. The semantic web is like the web in that anyone can generate any content that they want. This freedom to publish anything works for the web because we use our ability to understand natural language to interpret what we read – and often to dismiss material that based upon our own knowledge we consider to be incorrect.

Semantic web and linked data technologies are also useful for smaller amounts of data, an example being a Knowledge Graph containing information for a business. We will further explore Knowledge Graphs in the next two chapters.

The core concept for the semantic web is data integration and use from different sources. As we will soon see, the tools for implementing the semantic web are designed for encoding data and sharing data from many different sources.

I cover the semantic web in this book because I believe that semantic web technologies are complementary to AI systems for gathering and processing data on the web. As more web pages are generated by applications (as opposed to simply showing static HTML files) it becomes easier to produce both HTML for human readers and semantic data for software agents.

There are several very good semantic web toolkits for the Java language and platform. Here we use Apache Jena because it is what I often use in my own work and I believe that it is a good starting technology for your first experiments with semantic web technologies. This chapter provides an incomplete coverage of semantic web technologies and is intended as a gentle introduction to a few useful techniques and how to implement those techniques in Java. This chapter is the start of a journey in the technology that I think is as important as technologies like deep learning that get more public mindshare.

The following figure shows a layered hierarchy of data models that are used to implement semantic web applications. To design and implement these applications we need to think in terms of physical models (storage and access of RDF, RDFS, and perhaps OWL data), logical models (how we use RDF and RDFS to define relationships between data represented as unique URIs and string literals and how we logically combine data from different sources) and conceptual modeling (higher level knowledge representation and reasoning using OWL). Originally RDF data was serialized as XML data but other formats have become much more popular because they are easier to read and manually create. The top three layers in the figure might be represented as XML, or as JSON-LD (linked data JSON) or formats like N-Triples and N3 that we will use later.

### Resource Description Framework (RDF) Data Model

The Resource Description Framework (RDF) is used to encode information and the RDF Schema (RDFS) facilitates using data with different RDF encodings without the need to convert one set of schemas to another. Later, using OWL we can simply declare that one predicate is the same as another, or that one predicate is a sub-predicate of another (e.g., a property containsCity can be declared to be a sub-property of containsPlace so if something contains a city then it also contains a place), etc. The predicate part of an RDF statement often refers to a property.

RDF data was originally encoded as XML and intended for automated processing. In this chapter we will use two simple to read formats called “N-Triples” and “N3.” Apache Jena can be used to convert between all RDF formats so we might as well use formats that are easier to read and understand. RDF data consists of a set of triple values:

• subject
• predicate
• object

Some of my work with semantic web technologies deals with processing news stories, extracting semantic information from the text, and storing it in RDF. I will use this application domain for the examples in this chapter and the next chapter when we implement code to automatically generate RDF for Knowledge Graphs. I deal with triples like:

• subject: a URL (or URI) of a news article.
• predicate: a relation like “containsPerson”.
• object: a literal value like “Bill Clinton” or a URI representing Bill Clinton.

In the next chapter we will use the entity recognition library we developed in an earlier chapter to create RDF from text input.

We will use either URIs or string literals as values for objects. We will always use URIs for representing subjects and predicates. In any case URIs are usually preferred to string literals. We will see an example of this preferred use but first we need to learn the N-Triple and N3 RDF formats.

I proposed the idea that RDF was more flexible than Object Modeling in programming languages, relational databases, and XML with schemas. If we can add new attributes to existing data on the fly, how do we prevent what I might call “data chaos” as we modify existing data sources? It turns out that the solution to this problem is also the solution for encoding real semantics (or meaning) with data: we use unique URIs for RDF subjects, predicates, and objects, usually with a preference for not using string literals. The definitions of predicates are tied to a namespace and later with OWL we will state the equivalence of predicates in different namespaces with the same semantic meaning. I will try to make this idea more clear with some examples and Wikipedia has a good writeup on RDF.

Any part of a triple (subject, predicate, or object) is either a URI or a string literal. URIs encode namespaces. For example, the containsPerson predicate in the last example could be written as:

The first part of this URI is considered to be the namespace for this predicate “containsPerson.” When different RDF triples use this same predicate, this gives us some assurance that all users of this predicate understand it to have the same meaning. Furthermore, we will see later that we can use RDFS to state equivalency between this predicate (in the namespace http://knowledgebooks.com/ontology/) and predicates represented by different URIs used in other data sources. In an “artificial intelligence” sense, software that we write does not understand predicates like “containsCity”, “containsPerson”, or “isLocation” in the way that a human reader can by combining understood common meanings for the words “contains”, “city”, “is”, “person”, and “location”, but for many interesting and useful types of applications that is fine as long as the predicate is used consistently. We will see shortly that we can define abbreviation prefixes for namespaces which makes RDF and RDFS files shorter and easier to read.

The Jena library supports most serialization formats for RDF:

• Turtle
• N3
• N-Triples
• TriG
• JSON-LD
• RDF/XML
• RDF/JSON
• TriX
• RDF Binary

A statement in N-Triple format consists of three URIs (or two URIs and a string literal for the object) followed by a period to end the statement. While statements are often written one per line in a source file they can be broken across lines; it is the ending period which marks the end of a statement. The standard file extension for N-Triple format files is *.nt and the standard extension for N3 format files is *.n3.

My preference is to use N-Triple format files as output from programs that I write to save data as RDF. N-Triple files don’t use any abbreviations and each RDF statement is self-contained. I often use tools like the command line commands in Jena or RDF4J to convert N-Triple files to N3 or other formats if I will be reading them or even hand editing them. Here is an example using the N3 syntax:

The N3 format adds prefixes (abbreviations) to the N-Triple format. In practice it would be better to use the URI http://dbpedia.org/resource/China instead of the literal value “China.”

Here we see the use of an abbreviation prefix “kb:” for the namespace for my company KnowledgeBooks.com ontologies. The first term in the RDF statement (the subject) is the URI of a news article. The second term (the predicate) is “containsCountry” in the “kb:” namespace. The last item in the statement (the object) is a string literal “China.” I would describe this RDF statement in English as, “The news article at URI http://news.com/201234 mentions the country China.”

This was a very simple N3 example which we will expand to show additional features of the N3 notation. As another example, let’s look at the case where this news article also mentions the USA. Instead of adding a whole new statement like this we can combine them using N3 notation. Here we have two separate RDF statements:

We can collapse multiple RDF statements that share the same subject and optionally the same predicate:

The indentation and placement on separate lines is arbitrary - use whatever style you like that is readable. We can also add in additional predicates that use the same subject (I am going to use string literals here instead of URIs for objects to make the following example more concise but in practice prefer using URIs):

This single N3 statement represents ten individual RDF triples. Each section defining triples with the same subject and predicate has objects separated by commas and ending with a period. Please note that whatever RDF storage system you use (we will be using Jena) it makes no difference if we load RDF as XML, N-Triple, or N3 format files: internally subject, predicate, and object triples are stored in the same way and are used in the same way. RDF triples in a data store represent directed graphs that may not all be connected.

I promised you that the data in RDF data stores was easy to extend. As an example, let us assume that we have written software that is able to read online news articles and create RDF data that captures some of the semantics in the articles. If we extend our program to also recognize dates when the articles are published, we can simply reprocess articles and for each article add a triple to our RDF data store using a form like:

Here we just represent the date as a string. We can add a type to the object representing a specific date:

Furthermore, if we do not have dates for all news articles that is often acceptable because when constructing SPARQL queries you can match optional patterns. If, for example, you are looking up articles on a specific subject then some results may have a publication date attached and some might not. In practice RDF supports types and we would use a date type as seen in the last example, not a string. However, in designing the example programs for this chapter I decided to simplify our representation of URIs and often use string literals as simple strings. For many applications this isn’t a real limitation.

### Extending RDF with RDF Schema

RDF Schema (RDFS) supports the definition of classes and properties based on set inclusion. In RDFS classes and properties are orthogonal. Let’s start with looking at an example using additional namespaces:

Because the semantic web is intended to be processed automatically by software systems it is encoded as RDF. There is a problem that must be solved in implementing and using the semantic web: everyone who publishes semantic web data is free to create their own RDF schemas for storing data; for example, there is usually no single standard RDF schema definition for topics like news stories and stock market data. SKOS is a namespace containing standard schemas and the most widely used standard is schema.org. Understanding the ways of integrating different data sources that use different schemas helps in understanding the design decisions behind semantic web applications. In this chapter I often use my own schemas in the knowledgebooks.com namespace for the simple examples you see here. When you build your own production systems part of the work is searching through schema.org and SKOS to use standard namespaces and schemas when possible. The use of standard schemas helps when you link internal proprietary Knowledge Graphs used in an organization with public open data from sources like WikiData and DBPedia.

We will start with an example that is an extension of the example in the last section that also uses RDFS. We add a few additional RDF statements:

The last three lines declare that:

• The property containsCity is a sub-property of containsPlace.
• The property containsCountry is a sub-property of containsPlace.
• The property containsState is a sub-property of containsPlace.

Why is this useful? For at least two reasons:

• You can query an RDF data store for all triples that use property containsPlace and also match triples with properties equal to containsCity, containsCountry, or containsState. There may not even be any triples that explicitly use the property containsPlace.
• Consider a hypothetical case where you are using two different RDF data stores that use different properties for naming cities: cityName and city. You can define cityName to be a sub-property of city and then write all queries against the single property name city. This removes the necessity to convert data from different sources to use the same Schema. You can also use OWL to state property and class equivalency.

In addition to providing a vocabulary for describing properties and class membership by properties, RDFS is also used for logical inference to infer new triples, combine data from different RDF data sources, and to allow effective querying of RDF data stores. We will see examples of all of these features of RDFS when we later start using the Jena libraries to perform SPARQL queries.

### The SPARQL Query Language

SPARQL is a query language used to query RDF data stores. While SPARQL may initially look like SQL, we will see that there are some important differences like support for RDFS and OWL inferencing and graph-based instead of relational matching operations. We will cover the basics of SPARQL in this section and then see more examples later when we learn how to embed Jena in Java applications, and see more examples in the last chapter Knowledge Graph Navigator.

We will use the N3 format RDF file test_data/news.n3 for the examples. I created this file automatically by spidering Reuters news stories on the news.yahoo.com web site and automatically extracting named entities from the text of the articles. We saw techniques for extracting named entities from text in earlier chapters. In this chapter we use these sample RDF files.

You have already seen snippets of this file and I list the entire file here for reference, edited to fit line width: you may find the file news.n3 easier to read if you are at your computer and open the file in a text editor so you will not be limited to what fits on a book page:

In the following examples, we will use the main method in the class JenaApi (developed in the next section) that allows us to load multiple RDF input files and then to interactively enter SPARQL queries.

We will start with a simple SPARQL query for subjects (news article URLs) and objects (matching countries) with the value for the predicate equal to containsCountry. Variables in queries start with a question mark character and can have any names:
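A sketch of such a query, held in a Lisp string so it can be passed to a SPARQL client function (the variable name here is mine, not from the book’s code):

```lisp
(defparameter *contains-country-query*
  "SELECT ?subject ?object WHERE {
      ?subject <http://knowledgebooks.com/ontology#containsCountry> ?object .
   }")
```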

It is important for you to understand what is happening when we apply the last SPARQL query to our sample data. Conceptually, all the triples in the sample data are scanned, keeping the ones where the predicate part of a triple is equal to http://knowledgebooks.com/ontology#containsCountry. In practice RDF data stores supporting SPARQL queries index RDF data so a complete scan of the sample data is not required. This is analogous to relational databases where indices are created to avoid needing to perform complete scans of database tables.

In practice, when you are exploring a Knowledge Graph like DBPedia or WikiData (that are just very large collections of RDF triples), you might run a query and discover a useful or interesting entity URI in the triple store, then drill down to find out more about the entity. In a later chapter Knowledge Graph Navigator we attempt to automate this exploration process using the DBPedia data as a Knowledge Graph.

We will be using the same code to access the small example of RDF statements in our sample data as we will for accessing DBPedia or WikiData.

We can make this last query easier to read and reduce the chance of misspelling errors by using a namespace prefix:
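The same query with a namespace prefix might look like this (again a sketch held in a Lisp string; the prefix name kb: follows the convention used earlier in this chapter):

```lisp
(defparameter *contains-country-query-with-prefix*
  "PREFIX kb: <http://knowledgebooks.com/ontology#>
   SELECT ?subject ?object WHERE {
      ?subject kb:containsCountry ?object .
   }")
```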

Later in the chapter Knowledge Graph Navigator we will write an application that automatically generates SPARQL queries for the DBPedia public Knowledge Graph. These queries will be more complex than the simpler examples here. Reading this chapter before Knowledge Graph Navigator is recommended.

### Case Study: Using SPARQL to Find Information about Board of Directors Members of Corporations and Organizations

Before we write software to automate the process of using SPARQL queries to find information on DBPedia, let’s perform a few manual queries for finding information on boards of directors of corporations. To start with, we would like to find an RDF property that indicates board membership. There is a common expression for finding information on the web using search engines and also for using SPARQL queries: “follow your nose,” that is, when you see something interesting, dig down with more queries on whatever interests you.

We will find the property:


The property http://dbpedia.org/ontology/board is what we are looking for. Let’s keep “following our nose” to find examples of board members and the companies they serve:

The results are:

Let’s see what information we can find on the founder of Wikipedia, Jimmy Wales:

A few of the many results are:

### Installing the Apache Jena Fuseki RDF Server

TBD

I have a GitHub repository mark-watson/fuseki-semantic-web-dev-setup that you should clone:

This will run the SPARQL server Fuseki locally on your laptop and the default graph is “news” and you will see output like:

You can access a web interface for SPARQL queries at http://localhost:3030 or http://127.0.0.1:3030.

### Common Lisp Client Examples for the Apache Jena Fuseki RDF Server

Later in the chapter “Knowledge Graph Navigator” we will develop a simple Common Lisp SPARQL query library and use it for querying DBPedia. Here we will use it to query our local Fuseki server.

Here is an example of using the same library to query the public DBPedia SPARQL endpoint (most output is not shown):

The SPARQL library in the GitHub repository for this book also supports the commercial RDF servers AllegroGraph and Stardog.

## Automatically Generating Data for Knowledge Graphs

We develop a complete application. The Knowledge Graph Creator (KGcreator) is a tool for automating the generation of data for Knowledge Graphs from raw text data. We will see how to create a single standalone executable file using SBCL Common Lisp. The application can also be run during development from a repl. This application also implements a web application interface. In addition to the KGcreator application we will close the chapter with a utility library that processes a file of RDF in N-Triple format and generates an extension file with triples pulled from DBPedia defining URIs found in the input data file.

KGcreator generates data in two formats:

• Neo4j graph database import format (Cypher text format)
• RDF triples (N-Triple text format)

This example application works by identifying entities in text. Example entity types are people, companies, country names, city names, broadcast network names, political party names, and university names. We saw earlier code for detecting entities in the chapter on natural language processing (NLP) and we will reuse this code. We will discuss later three strategies for reusing code from different projects.

When I originally wrote KGCreator I intended to develop a commercial product. I wrote two research prototypes, one in Common Lisp (the example in this chapter) and one in Haskell (which I also use as an example in my book Haskell Tutorial and Cookbook). I decided to open source both versions of KGCreator and if you work with Knowledge Graphs I hope you find KGCreator useful in your work.

The following figure shows part of a Neo4j Knowledge Graph created with the example code. This graph has shortened labels in displayed nodes but Neo4j offers a web browser-based console that lets you interactively explore Knowledge Graphs. We don’t cover setting up Neo4j here so please use the Neo4j documentation. As an introduction to RDF data, the semantic web, and linked data you can get free copies of my two books Practical Semantic Web and Linked Data Applications, Common Lisp Edition and Practical Semantic Web and Linked Data Applications, Java, Scala, Clojure, and JRuby Edition.

Here is a detail view:

### Implementation Notes

As seen in the file src/kgcreator/package.lisp this application uses several other packages:

The packages shown on line 3 were implemented in previous chapters. The package myutils contains mostly miscellaneous string utilities that we won’t look at here; I leave it to you to read the source code.

As seen in the configuration file src/kgcreator/kgcreator.asd we split the implementation of the application into four source files:

The application is separated into four source files:

• kgcreator.lisp: top level APIs and functionality. Uses the code in neo4j.lisp and rdf.lisp. Later we will generate a standalone application that uses these top level APIs
• neo4j.lisp: generates Cypher text files that can be imported into Neo4j
• rdf.lisp: generates RDF text data that can be loaded or imported into RDF data stores
• web.lisp: a simple web application for running KGCreator

### Generating RDF Data

I leave it to you to find a tutorial on RDF data on the web, or you can get a PDF of my book “Practical Semantic Web and Linked Data Applications, Common Lisp Edition” and read the tutorial sections on RDF.

RDF data is comprised of triples, where the values for each triple are a subject, a predicate, and an object. Subjects are URIs, predicates are usually URIs, and objects are either literal values or URIs. Here are two triples written by this example application:

The following listing of the file src/kgcreator/rdf.lisp generates RDF data:

You can load all of KGCreator but just execute the test function at the end of this file using:

This code works on a list of paired files for text data and the meta data for each text file. As an example, if there is an input text file test123.txt then there would be a matching meta file test123.meta that contains the source of the data in the file test123.txt. This data source will be a URI on the web or a local file URI. The top level function rdf-from-files takes an output file path for writing the generated RDF data and a list of pairs of text and meta file paths.

A global variable *rdf-nodes-hash* will be used to remember the nodes in the RDF graph as it is generated. Please note that the function rdf-from-files is not re-entrant: it uses the global *rdf-nodes-hash* so if you are writing multi-threaded applications it will not work to execute the function rdf-from-files simultaneously in multiple threads of execution.

The function rdf-from-files (and the nested functions) are straightforward. I left a few debug printout statements in the code and when you run the test code that I left in the bottom of the file, hopefully it will be clear what rdf.lisp is doing.

### Generating Data for the Neo4j Graph Database

Now we will generate Neo4j Cypher data. In order to keep the implementation simple, both the RDF and Cypher generation code starts with raw text and performs the NLP analysis to find entities. This example could be refactored to perform the NLP analysis just one time but in practice you will likely be working with either RDF or Neo4j, so you will probably extract just the code you need from this example (i.e., either the RDF or Cypher generation code).

Before we look at the code, let’s start with a few lines of generated Neo4J Cypher import data:

The following listing of file src/kgcreator/neo4j.lisp is similar to the code that generated RDF in the last section:

You can load all of KGCreator but just execute the test function at the end of this file using:

### Implementing the Top Level Application APIs

The code in the file src/kgcreator/kgcreator.lisp uses both rdf.lisp and neo4j.lisp that we saw in the last two sections. The function get-files-and-meta looks at the contents of an input directory to generate a list of pairs, each pair containing the path to a text file and the meta file for the corresponding text file.

We are using the opts package to parse command line arguments. This will be used when we build a single file standalone executable file for the entire KGCreator application, including the web application that we will see in a later section.

You can load all of KGCreator but just execute the three test functions at the end of this file using:

### Implementing The Web Interface

When we build a standalone single file application for KGCreator, we include a simple web application interface that allows users to enter input text and see generated RDF and Neo4j Cypher data.

The file src/kgcreator/web.lisp uses the libraries cl-who, hunchentoot, and parenscript that we used earlier. The function write-files-run-code (lines 8-43) takes raw text and writes generated RDF and Neo4j Cypher data to local temporary files that are then read and formatted to HTML for display. The code in rdf.lisp and neo4j.lisp is file oriented, and I wrote web.lisp as an afterthought so it was easier writing temporary files than refactoring rdf.lisp and neo4j.lisp to write to strings.

You can load all of KGCreator and start the web application using:

You can access the web app at http://localhost:3000.

### Creating a Standalone Application Using SBCL

When I originally wrote KGCreator I intended to develop a commercial product so it was important to be able to create standalone single file executables. This is simple to do using SBCL:

As an example, you could run the application on the command line using:

### Augmenting RDF Triples in a Knowledge Graph Using DBPedia

You can augment RDF-based Knowledge Graphs that you build with the KGcreator application by using the library in the directory kg-add-dbpedia-triples.

As seen in the kg-add-dbpedia-triples.asd and package.lisp configuration files, we use two other libraries developed in this book:

The library is implemented in the file kg-add-dbpedia-triples.lisp:

TBD

### KGCreator Wrap Up

When developing applications or systems using Knowledge Graphs it is useful to be able to quickly generate test data, which is the primary purpose of KGCreator. A secondary use is to generate Knowledge Graphs for production use from text data sources. In this second use case you will want to manually inspect the generated data to verify its correctness or usefulness for your application.

## Knowledge Graph Sampler for Creating Small Custom Knowledge Graphs

I find it convenient to be able to “sample” small parts of larger knowledge graphs. The example program in this chapter accepts a list of DBPedia entity URIs, attempts to find links between these entities, and writes these nodes and discovered edges to an RDF triples file.

The code is in the directory src/kgsampler. As seen in this project’s .asd and package.lisp configuration files, we will use the sparql library we developed earlier as well as the libraries uiop and drakma:

The program starts with a list of entities and tries to find links on DBPedia between the entities. A small sample graph of the input entities and any discovered links is written to a file. The function dbpedia-as-nt spawns a process to use the curl utility to make an HTTP request to DBPedia. The function construct-from-dbpedia takes a list of entities and writes SPARQL CONSTRUCT statements, with the entity as the subject and the object filtered to a string value in the English language, to an output stream. The function find-relations runs in O(N^2) time where N is the number of input entities, so you should avoid using this program with a large number of input entities.
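To make the CONSTRUCT idea concrete, here is a hedged sketch of the kind of query text the sampler might build for a single entity URI; the helper function name and the exact FILTER expression are my assumptions, not the code in src/kgsampler:

```lisp
;; Sketch: build a SPARQL CONSTRUCT query that copies all triples with the
;; given entity as subject, keeping only English-language literal objects.
(defun construct-query-for-entity (entity-uri)
  (format nil
          "CONSTRUCT { <~a> ?p ?o } WHERE { <~a> ?p ?o . FILTER (lang(?o) = 'en') }"
          entity-uri entity-uri))
```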

I offer this code with little explanation since much of it is similar to the techniques you saw in the previous chapter Knowledge Graph Navigator.

Let’s start by running the two helper functions interactively so you can see their output (output edited for brevity). The top level function kgsampler:sample for this example takes a list of entity URIs and an output file name, and uses the functions construct-from-dbpedia and find-relations to write triples for the entities and then for the relationships discovered between entities. The following listing also calls the helper function kgsampler::find-relations to show you what its output looks like.

We now use the main function to generate an output RDF triple file:
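A sketch of such a call; the entity URIs here are only examples and the argument order follows the description above:

```lisp
;; A sketch: generate a small sample Knowledge Graph for two entities.
(kgsampler:sample
 '("http://dbpedia.org/resource/Steve_Jobs"
   "http://dbpedia.org/resource/Apple_Inc.")
 "sample-KG.nt")
```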

Output RDF N-Triple data is written to the file sample-KG.nt. A very small part of this file is listed here:

The same data in Turtle RDF format can be seen in the file sample-KG.ttl, which was produced by importing the triples file into the free edition of GraphDB and then exporting it as Turtle; I find this format easier to read. GraphDB also has visualization tools which I use here to generate an interactive graph display of this data:

This example is set up only for people and companies. I may expand it in the future to other types of entities as I need them.

This example program takes several minutes to run since many SPARQL queries are made to DBPedia. I am a non-corporate member of the DBPedia organization. Here is a membership application if you are interested in joining me there.

## Knowledge Graph Navigator Common Library Implementation

The Knowledge Graph Navigator (which I will often refer to as KGN) is a tool for processing a set of entity names and automatically exploring the public Knowledge Graph DBPedia using SPARQL queries. I started to write KGN for my own use, to automate some things I used to do manually when exploring Knowledge Graphs, and later thought that KGN might also be useful for educational purposes. KGN shows the user the auto-generated SPARQL queries, so hopefully the user will learn by seeing examples. KGN uses NLP code developed in earlier chapters and we will reuse that code, with a short review of using the APIs.

In previous versions of this book, this example was hard-wired to use LispWorks CAPI for the user interface. This old version is in src/kgn in the main GitHub repository for this book: https://github.com/mark-watson/loving-common-lisp and has a few UI components, like a progress bar, that I removed since the previous edition. The new version has separate GitHub repositories for:

If you followed the code example setup instructions in the book Preface or in the README file in the main repo https://github.com/mark-watson/loving-common-lisp then all three of these projects are available for loading via Quicklisp on your computer.

After looking at the SPARQL generated by this example for a sample query, we will start a process of bottom-up development: first writing low-level functions to automate SPARQL queries, then writing utilities we will need for the UIs developed in later chapters.

Since the DBPedia SPARQL queries are time consuming, we will also implement a caching layer using SQLite that will make the app more responsive. The cache is especially helpful during development when the same queries are repeatedly used for testing.

The code for this reusable library is in the directory src/kgn-common. This is a common library that will be used by the user interfaces developed in later chapters. There is a lot of code in the following program listings and I hope to provide you with a roadmap overview of the code, diving in on code that you might want to reuse for your own projects and some representative code for generating SPARQL queries.

Let’s start by looking at the files for the common library:

• Makefile - contains development shortcuts.
• data - data used to remove stop words from text.
• kgn-common.lisp - main code file for library.
• package.lisp - standard Common Lisp package definition.
• utils.lisp - miscellaneous utility functions.
• kgn-common.asd - standard Common Lisp ASDF definition.

### Example Output

Before we get started studying the implementation, let’s look at sample output in order to help give meaning to the code we will look at later. Consider a query that a user might type into the top query field in the KGN app:

The system will try to recognize entities in a query. If you know the DBPedia URI of an entity, like the company Apple in this example, you can use that URI directly. Note that in SPARQL, URIs are surrounded with angle bracket characters.

The application prints out automatically generated SPARQL queries. For the above listed example query the following output will be generated (some editing to fit page width):

Remember, the SPARQL is generated by KGN from natural language queries. Some more examples:

Once KGN has identified DBPedia entity URIs, it also searches for relationships between these entities:

After listing the generated SPARQL for finding information for the entities in the query, KGN searches for relationships between these entities. These discovered relationships can be seen at the end of the last listing. Please note that this step makes O(n^2) SPARQL queries, where n is the number of entities. Local caching of SPARQL queries to DBPedia helps make processing many entities practical.
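The pairwise search can be pictured with a short sketch; the helper name relations-between is hypothetical and stands in for the actual SPARQL-based relationship query:

```lisp
;; A sketch of the O(n^2) pairwise relationship search; the function
;; relations-between is a hypothetical stand-in for the real query code.
(defun find-all-entity-relationships (entity-uris)
  (loop for (e1 . rest) on entity-uris
        append (loop for e2 in rest
                     append (relations-between e1 e2))))
```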

In addition to showing generated SPARQL and discovered relationships in the middle text pane of the application, KGN also generates formatted results that are also displayed in the bottom text pane:

Hopefully after reading through the sample output and seeing the screenshot of the application, you now have a better idea of what this example application does. Now we will look at project configuration and then the implementation.

### Project Configuration and Running the Application

The following listing of kgn-common.asd shows the ten packages this example depends on (five of these are also examples in this book, and five are in the public Quicklisp repository):

Listing of package.lisp:

We use ql:quickload to load the KGN common library and call a few APIs (some output removed for brevity):
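A sketch of such a REPL session; the exported function name used here is hypothetical, but the :message-stream keyword argument is the one discussed below:

```lisp
;; A sketch: kgn-common:query-dbpedia-for-entity is a hypothetical API name.
(ql:quickload "kgn-common")

;; Print the generated SPARQL while querying:
(kgn-common:query-dbpedia-for-entity "Steve Jobs" :message-stream t)

;; Last example: suppress the printing of generated SPARQL:
(kgn-common:query-dbpedia-for-entity "Bill Gates" :message-stream nil)
```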

In this last example, using :message-stream nil effectively turns off printing generated SPARQL queries used by these APIs. You can use :message-stream t to see generated SPARQL.

Every time the KGN common library makes a web service call to DBPedia, the query and response are cached in a SQLite database in ~/.kgn_cache.db, which can greatly speed up the program, especially in development mode when testing a set of queries. This caching also takes some load off of the public DBPedia endpoint, which is a polite thing to do.

### Review of NLP Utilities Used in Application

Here is a quick review of NLP utilities we saw in an earlier chapter:

• kbnlp:make-text-object
• kbnlp::text-human-names
• kbnlp::text-place-name
• entity-uris:find-entities-in-text
• entity-uris:pp-entities

The following code snippets show example calls to the relevant NLP functions and the generated output:
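Here is a sketch of such a REPL session using the functions listed above; the sample text is arbitrary, the Quicklisp system names are assumptions, and the printed output is omitted:

```lisp
;; A sketch of REPL experiments; the system names and the hash table
;; return type of find-entities-in-text are assumptions.
(ql:quickload "kbnlp")
(ql:quickload "entity-uris")

(defvar *text* "Steve Jobs worked at Apple Computer in California.")

(defvar *text-obj* (kbnlp:make-text-object *text*))
(kbnlp::text-human-names *text-obj*)   ; recognized person names
(kbnlp::text-place-name *text-obj*)    ; recognized place names

(defvar *entities* (entity-uris:find-entities-in-text *text*))
(entity-uris:pp-entities *entities*)

;; Print the keys and values of a hash table (see the note below on
;; the Common Lisp Cookbook section "Traversing a Hash Table"):
(loop for key being the hash-keys of *entities*
        using (hash-value value)
      do (format t "~a -> ~a~%" key value))
```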

The code using loop at the end of the last REPL listing, which prints the keys and values of a hash table, is from the Common Lisp Cookbook web site, in the section “Traversing a Hash Table.”

### Developing Low-Level SPARQL Utilities

I use the standard command line curl utility program with the Common Lisp package uiop to make HTTP GET requests to the DBPedia public Knowledge Graph, and the package drakma to URL-encode parts of a query. The source code is in a separate Quicklisp library located in src/sparql-cache/sparql.lisp. A non-caching library is also available in src/sparql/sparql.lisp.
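As a rough sketch of this technique (not the exact code in sparql.lisp), a SPARQL query can be URL-encoded with drakma and sent to the DBPedia endpoint by shelling out to curl with uiop:

```lisp
;; A sketch only: send a SPARQL query to DBPedia using curl via uiop,
;; asking for JSON results; drakma is used just for URL-encoding.
(defun dbpedia-query-json (sparql-query)
  (let ((command
          (concatenate 'string
                       "curl \"https://dbpedia.org/sparql?query="
                       (drakma:url-encode sparql-query :utf-8)
                       "&format=application/sparql-results%2Bjson\"")))
    (uiop:run-program command :output :string)))
```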

In the following listing of src/sparql-cache/sparql.lisp, on lines 8, 24, 39, and 55 I use some caching code that we will look at later. The nested replace-all calls in lines 12-13 are a kludge to remove Unicode characters that occasionally caused runtime errors in the KGN application.

The code for replacing Unicode characters is messy but prevents problems later when we are using the query results in the example application.

The code (json-as-list (json:decode-json s)) on line 28 converts a deeply nested JSON response to nested Common Lisp lists. You may want to print out the list to better understand the mapcar expression on lines 31-35. There is no magic to writing expressions like this: in a REPL I set json-as-list to the results of one query and spent a minute or two experimenting with the nested mapcar expression to get it to work with my test case.

The implementation of sparql-ask-dbpedia in lines 38-58 is simpler because we don’t have to fully parse the returned SPARQL query results. A SPARQL ASK query returns a true/false answer to a query. We will use this to determine the types of entities in query text. While our NLP library identifies entity types, making additional ASK queries to DBPedia to verify entity types provides better automated results.
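Here is a sketch of how an ASK query can be used to verify an entity type; the call to sparql-ask-dbpedia assumes a simple one-argument signature and a string result, which may not match the real code:

```lisp
;; A sketch: check whether an entity URI is typed as a Person in DBPedia.
;; The signature and return value of sparql-ask-dbpedia are assumptions.
(defun entity-is-person-p (entity-uri)
  (let ((query (format nil
                       "ASK { <~a> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Person> }"
                       entity-uri)))
    ;; Assume the result is a string containing "true" or "false":
    (search "true" (sparql-ask-dbpedia query))))
```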

### Implementing the Caching Layer

While developing KGN, and also when using it as an end user, many SPARQL queries to DBPedia contain repeated entity names, so it makes sense to write a caching layer. We use a SQLite database “~/.kgn_cache.db” to store queries and responses.

The caching layer is implemented in the file src/sparql-cache/sparql.lisp and some of the relevant code is listed here:
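Here is a minimal sketch of the caching idea using the Quicklisp sqlite library (cl-sqlite); the real code in src/sparql-cache/sparql.lisp may organize this differently:

```lisp
;; A sketch only: cache query/response pairs in ~/.kgn_cache.db.
(ql:quickload "sqlite")

(defvar *cache-path*
  (namestring (merge-pathnames ".kgn_cache.db" (user-homedir-pathname))))

(defun cached-or-compute (query compute-fn)
  "Return a cached response for QUERY, or call COMPUTE-FN on QUERY and cache the result."
  (sqlite:with-open-database (db *cache-path*)
    (sqlite:execute-non-query
     db "CREATE TABLE IF NOT EXISTS query_cache (query TEXT PRIMARY KEY, response TEXT)")
    (or (sqlite:execute-single
         db "SELECT response FROM query_cache WHERE query = ?" query)
        (let ((response (funcall compute-fn query)))
          (sqlite:execute-non-query
           db "INSERT INTO query_cache (query, response) VALUES (?, ?)"
           query response)
          response))))
```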