Knowledge Graph Navigator
TBD: CONVERT EXAMPLES FROM Hylang to Python
The Knowledge Graph Navigator (which I will often refer to as KGN) is a tool for processing a set of entity names and automatically exploring the public Knowledge Graph DBPedia using SPARQL queries. I wrote KGN in Common Lisp for my own use, to automate some things I used to do manually when exploring Knowledge Graphs, and later thought that KGN might also be useful for educational purposes. KGN uses NLP code developed in earlier chapters and we will reuse that code with a short review of using the APIs.
Please note that the example is a simplified version that I first wrote in Common Lisp and is also an example in my book Loving Common Lisp, or the Savvy Programmer’s Secret Weapon that you can read for free online. If you are interested you can see screen shots of the Common Lisp version here.
The following two screen shots show the text-based user interface for this example. This example application asks the user for a list of entity names and uses SPARQL queries to discover potential matches in DBPedia. We use the Python library PyInquirer to request entity names and then to show the user a list of matches from DBPedia. The following screen shot shows these steps:
To select the entities of interest, the user presses the space bar to select or deselect an entity and the return (or enter) key to accept the selections.
After the user selects entities from the list, the list disappears. The next screen shot shows the output from this example after the user finishes selecting entities of interest:
The code for this application is in the directory kgn. You will need to install the following Python library that supports console/text user interfaces:
pip install PyInquirer
You will also need the spacy library and language model that we used in the earlier chapter on natural language processing. If you have not already done so, install these requirements:
pip install spacy
python -m spacy download en_core_web_sm
After listing the generated SPARQL for finding information for the entities in the query, KGN searches for relationships between these entities. These discovered relationships can be seen at the end of the last screen shot. Please note that this step requires O(n^2) SPARQL queries, where n is the number of entities. Local caching of SPARQL queries to DBPedia helps make processing many entities possible.
Every time KGN makes a SPARQL query web service call to DBPedia the query and response are cached in a SQLite database in ~/.kgn_hy_cache.db which can greatly speed up the program, especially in development mode when testing a set of queries. This caching also takes some load off of the public DBPedia endpoint, which is a polite thing to do.
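The function that discovers these relationships appears later in the application code; to make the O(n^2) cost concrete, here is a hypothetical Python sketch of generating one relationship query per pair of entity URIs. The function name relationship_queries and the query template are my illustration, not code from KGN:

```python
from itertools import combinations

def relationship_queries(entity_uris):
    """Generate one SPARQL query per ordered pair of entity URIs.
    The number of queries grows as O(n^2) in the number of entities,
    which is why caching matters for larger entity lists."""
    template = "select ?p {{ {} ?p {} . }} limit 10"
    queries = []
    for e1, e2 in combinations(entity_uris, 2):
        # look for direct property links in both directions
        queries.append(template.format(e1, e2))
        queries.append(template.format(e2, e1))
    return queries
```

With three entities this yields six queries: each unordered pair, queried in both directions.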
Review of NLP Utilities Used in Application
We covered NLP in a previous chapter, so the following is just a quick review. The NLP code we use is near the top of the file kgn.hy:
(import spacy)

(setv nlp-model (spacy.load "en_core_web_sm"))

(defn entities-in-text [s]
  (setv doc (nlp-model s))
  (setv ret {})
  (for [[ename etype] (lfor entity doc.ents [entity.text entity.label_])]
    (if (in etype ret)
        (setv (get ret etype) (+ (get ret etype) [ename]))
        (assoc ret etype [ename])))
  ret)
Here is an example use of this function:
=> (kgn.entities-in-text "Bill Gates, Microsoft, Seattle")
{'PERSON': ['Bill Gates'], 'ORG': ['Microsoft'], 'GPE': ['Seattle']}
The entity type “GPE” indicates that the entity is some type of location.
Developing Low-Level Caching SPARQL Utilities
Many SPARQL queries to DBPedia contain repeated entity names, both while developing KGN and when using it as an end user, so it makes sense to write a caching layer. We use a SQLite database “~/.kgn_hy_cache.db” to store queries and responses. We covered using SQLite in some detail in the chapter on datastores.
The caching layer is implemented in the file cache.hy:
(import [sqlite3 [connect version Error]])
(import json)

(setv *db-path* "kgn_hy_cache.db")

(defn create-db []
  (try
    (setv conn (connect *db-path*))
    (print version)
    (setv cur (conn.cursor))
    (cur.execute "CREATE TABLE dbpedia (query string PRIMARY KEY ASC, data json)")
    (conn.close)
    (except [e Exception] (print e))))

(defn save-query-results-dbpedia [query result]
  (try
    (setv conn (connect *db-path*))
    (setv cur (conn.cursor))
    (cur.execute "insert into dbpedia (query, data) values (?, ?)"
                 [query (json.dumps result)])
    (conn.commit)
    (conn.close)
    (except [e Exception] (print e))))

(defn fetch-result-dbpedia [query]
  (setv results [])
  (setv conn (connect *db-path*))
  (setv cur (conn.cursor))
  (cur.execute "select data from dbpedia where query = ? limit 1" [query])
  (setv d (cur.fetchall))
  (if (> (len d) 0)
      (setv results (json.loads (first (first d)))))
  (conn.close)
  results)

(create-db)
Here we store structured data from SPARQL queries as JSON data serialized as string values.
SPARQL Utilities
We will use the caching code from the last section and also the Python requests library to access the DBPedia servers. The following code is found in the file sparql.hy and also provides support for using both DBPedia and WikiData. We only use DBPedia in this chapter but when you start incorporating SPARQL queries into applications that you write, you will probably also want to use WikiData.
The function do-query-helper contains generic code for SPARQL queries and is used in functions wikidata-sparql and dbpedia-sparql:
(import json)
(import requests)
(require [hy.contrib.walk [let]])

(import [cache [fetch-result-dbpedia save-query-results-dbpedia]])

(setv wikidata-endpoint "https://query.wikidata.org/bigdata/namespace/wdq/sparql")
(setv dbpedia-endpoint "https://dbpedia.org/sparql")

(defn do-query-helper [endpoint query]
  ;; check cache:
  (setv cached-results (fetch-result-dbpedia query))
  (if (> (len cached-results) 0)
      (let ()
        (print "Using cached query results")
        ;; fetch-result-dbpedia already returns deserialized data:
        cached-results)
      (let ()
        ;; Construct a request
        (setv params {"query" query "format" "json"})
        ;; Call the API
        (setv response (requests.get endpoint :params params))
        (setv json-data (response.json))
        (setv vars (get (get json-data "head") "vars"))
        (setv results (get json-data "results"))
        (if (in "bindings" results)
            (let [bindings (get results "bindings")
                  qr (lfor binding bindings
                           (lfor var vars
                                 [var (get (get binding var) "value")]))]
              (save-query-results-dbpedia query qr)
              qr)
            []))))

(defn wikidata-sparql [query]
  (do-query-helper wikidata-endpoint query))

(defn dbpedia-sparql [query]
  (do-query-helper dbpedia-endpoint query))
Here is an example query (manually formatted for page width):
$ hy
hy 0.18.0 using CPython(default) 3.7.4 on Darwin
=> (import sparql)
table dbpedia already exists
=> (sparql.dbpedia-sparql
"select ?s ?p ?o { ?s ?p ?o } limit 1")
[[['s', 'http://www.openlinksw.com/virtrdf-data-formats#default-iid'],
['p', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'],
['o', 'http://www.openlinksw.com/schemas/virtrdf#QuadMapFormat']]]
=>
This is a wild-card SPARQL query that will match any of the 9.5 billion RDF triples in DBPedia and return just one result.
This caching layer greatly speeds up my own personal use of KGN. Without caching, queries that contain many entity references simply take too long to run.
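For reference, the non-caching core of do-query-helper might be sketched in Python as follows; bindings_to_rows is my name for the result-flattening step, which produces the same nested [variable, value] lists as the Hy code:

```python
import requests  # pip install requests

DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"

def bindings_to_rows(json_data):
    """Flatten a SPARQL JSON response into [variable, value] pairs
    for each result binding."""
    variables = json_data["head"]["vars"]
    bindings = json_data["results"].get("bindings", [])
    return [[[var, binding[var]["value"]] for var in variables]
            for binding in bindings]

def dbpedia_sparql(query):
    """Run a SPARQL query against the public DBPedia endpoint."""
    response = requests.get(DBPEDIA_ENDPOINT,
                            params={"query": query, "format": "json"})
    return bindings_to_rows(response.json())
```

In a real application you would wrap dbpedia_sparql with the caching layer from the previous section, as the Hy code does.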
Utilities to Colorize SPARQL and Generated Output
When I first had the basic functionality of KGN working, I was disappointed by how the application looked as plain text. Every editor and IDE I use colorizes text in an appropriate way, so I used standard ANSI terminal escape sequences to implement color highlighting for SPARQL queries.
The code in the following listing is in the file colorize.hy.
(require [hy.contrib.walk [let]])
(import [io [StringIO]])

;; Utilities to add ANSI terminal escape sequences to colorize text.
;; note: the following 5 functions return string values that then need to
;; be printed.

(defn blue [s] (.format "{}{}{}" "\033[94m" s "\033[0m"))
(defn red [s] (.format "{}{}{}" "\033[91m" s "\033[0m"))
(defn green [s] (.format "{}{}{}" "\033[92m" s "\033[0m"))
(defn pink [s] (.format "{}{}{}" "\033[95m" s "\033[0m"))
(defn bold [s] (.format "{}{}{}" "\033[1m" s "\033[0m"))

(defn tokenize-keep-uris [s]
  (.split s))

(defn colorize-sparql [s]
  (let [tokens
        (tokenize-keep-uris
          (.replace (.replace (.replace s "{" " { ") "}" " } ") "." " . "))
        ret (StringIO)] ;; ret is an output stream for a string buffer
    (for [token tokens]
      (if (> (len token) 0)
          (if (= (get token 0) "?")
              (.write ret (red token))
              (if (in token
                      ["where" "select" "distinct" "option" "filter"
                       "FILTER" "OPTION" "DISTINCT" "SELECT" "WHERE"])
                  (.write ret (blue token))
                  (if (= (get token 0) "<")
                      (.write ret (bold token))
                      (.write ret token)))))
      (if (not (= token "?"))
          (.write ret " ")))
    (.seek ret 0)
    (.read ret)))
You have seen colorized SPARQL in the two screen shots at the beginning of this chapter.
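The colorizer translates almost line for line into Python. The ANSI escape codes and the keyword list come from colorize.hy; accumulating tokens in a list and joining with spaces (rather than writing to a StringIO) is my variation:

```python
def blue(s): return "\033[94m" + s + "\033[0m"
def red(s): return "\033[91m" + s + "\033[0m"
def bold(s): return "\033[1m" + s + "\033[0m"

SPARQL_KEYWORDS = {"where", "select", "distinct", "option", "filter",
                   "FILTER", "OPTION", "DISTINCT", "SELECT", "WHERE"}

def colorize_sparql(s):
    """Colorize SPARQL variables red, keywords blue, and URIs bold."""
    # pad braces and periods with spaces so they split into their own
    # tokens, mirroring the Hy tokenizer
    tokens = s.replace("{", " { ").replace("}", " } ").replace(".", " . ").split()
    out = []
    for token in tokens:
        if token.startswith("?"):
            out.append(red(token))        # SPARQL variables
        elif token in SPARQL_KEYWORDS:
            out.append(blue(token))       # keywords
        elif token.startswith("<"):
            out.append(bold(token))       # URIs
        else:
            out.append(token)
    return " ".join(out)
```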
Text Utilities for Queries and Results
The application's low-level utility functions are in the file kgn-utils.hy. The function dbpedia-get-entities-by-name takes two arguments:
- The name of an entity to search for.
- A URI representing the entity type that we are looking for.
We embed a SPARQL query that has placeholders for the entity name and type. The filter expression specifies that we only want triple results with comment values in the English language by using (lang(?comment) = 'en'):
#!/usr/bin/env hy

(import [sparql [dbpedia-sparql]])
(import [colorize [colorize-sparql]])
(import [pprint [pprint]])
(require [hy.contrib.walk [let]])

(defn dbpedia-get-entities-by-name [name dbpedia-type]
  (let [sparql
        (.format "select distinct ?s ?comment {{ ?s ?p \"{}\"@en . ?s <http://www.w3.org/2000/01/rdf-schema#comment> ?comment . FILTER (lang(?comment) = 'en') . ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> {} . }} limit 15" name dbpedia-type)]
    (print "Generated SPARQL to get DBPedia entity URIs from a name:")
    (print (colorize-sparql sparql))
    (dbpedia-sparql sparql)))
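In Python, building the same query is just string formatting; the SPARQL template below is the one embedded in kgn-utils.hy, while the function name dbpedia_entity_query is my own:

```python
def dbpedia_entity_query(name, dbpedia_type):
    """Build a SPARQL query that finds DBPedia URIs and English-language
    comments for entities whose English name exactly matches `name`."""
    return (
        'select distinct ?s ?comment {{ '
        '?s ?p "{}"@en . '
        '?s <http://www.w3.org/2000/01/rdf-schema#comment> ?comment . '
        "FILTER (lang(?comment) = 'en') . "
        '?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> {} . '
        '}} limit 15').format(name, dbpedia_type)
```

The doubled braces {{ and }} are literal SPARQL braces escaped for Python's str.format, exactly as in the Hy version.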
You can see an example of the generated, colorized SPARQL in the screen shots at the beginning of this chapter.
Finishing the Main Function for KGN
We already looked at the NLP code near the beginning of the file kgn.hy. Let’s look at the remainder of the implementation.
We need a dictionary (or hash table) to convert spaCy entity type names to DBPedia type URIs:
(setv entity-type-to-type-uri
      {"PERSON" "<http://dbpedia.org/ontology/Person>"
       "GPE" "<http://dbpedia.org/ontology/Place>"
       "ORG" "<http://dbpedia.org/ontology/Organisation>"})
When we get entity results from DBPedia, the comments describing entities can be a few paragraphs of text. We want to shorten the comments so they fit in a single line of the entity selection list that we have seen earlier. The following code defines a comment shortening function and also a global variable that we will use to store the entity URIs for each shortened comment:
1 (setv short-comment-to-uri {})
2
3 (defn shorten-comment [comment uri]
4   (setv sc (+ (cut comment 0 70) "..."))
5   (assoc short-comment-to-uri sc uri)
6   sc)
In line 5, we use the function assoc to add a key and value pair to an existing dictionary short-comment-to-uri.
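The same idea in Python is a short function; here the comment-to-URI dictionary is passed as a parameter instead of being a module-level global (my variation, to keep the sketch self-contained):

```python
def shorten_comment(comment, uri, short_comment_to_uri):
    """Truncate a comment to one menu line and remember which entity
    URI that shortened comment stands for."""
    sc = comment[:70] + "..."
    short_comment_to_uri[sc] = uri
    return sc
```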
Finally, let’s look at the main application loop. In line 4 we use the function get-query (defined in the file textui.hy) to get a list of entity names from the user. In line 8 we use the function entities-in-text that we saw earlier to map text to entity types and names. In the nested loops in lines 14-27 we build one-line descriptions of people, places, and organizations that we will use to show the user a menu for selecting entities found in DBPedia from the original query. We are giving the user a chance to select only the discovered entities that they are interested in.
In lines 34-36 we convert the shortened comment strings the user selected back to DBPedia entity URIs. Finally, in line 37 we use the function entity-results->relationship-links to find relationships between the user-selected entities.
1 (defn kgn []
2   (while
3     True
4     (let [query (get-query)
5           emap {}]
6       (if (or (= query "quit") (= query "q"))
7           (break))
8       (setv elist (entities-in-text query))
9       (setv people-found-on-dbpedia [])
10       (setv places-found-on-dbpedia [])
11       (setv organizations-found-on-dbpedia [])
12       (global short-comment-to-uri)
13       (setv short-comment-to-uri {})
14       (for [key elist]
15         (setv type-uri (get entity-type-to-type-uri key))
16         (for [name (get elist key)]
17           (setv dbp (dbpedia-get-entities-by-name name type-uri))
18           (for [d dbp]
19             (setv short-comment (shorten-comment (second (second d))
20                                                  (second (first d))))
21             (if (= key "PERSON")
22                 (.extend people-found-on-dbpedia [(+ name " || " short-comment)]))
23             (if (= key "GPE")
24                 (.extend places-found-on-dbpedia [(+ name " || " short-comment)]))
25             (if (= key "ORG")
26                 (.extend organizations-found-on-dbpedia
27                          [(+ name " || " short-comment)])))))
28       (setv user-selected-entities
29             (select-entities
30               people-found-on-dbpedia
31               places-found-on-dbpedia
32               organizations-found-on-dbpedia))
33       (setv uri-list [])
34       (for [entity (get user-selected-entities "entities")]
35         (setv short-comment (cut entity (+ 4 (.index entity " || "))))
36         (.extend uri-list [(get short-comment-to-uri short-comment)]))
37       (setv relation-data (entity-results->relationship-links uri-list))
38       (print "\nDiscovered relationship links:")
39       (pprint relation-data))))
If you have not already done so, I hope you will experiment with running this example application. The first time you specify an entity name, expect some delay while DBPedia is accessed. Thereafter, the cache will make the application more responsive when you use the same name again in a different query.
Wrap-up
If you enjoyed running and experimenting with this example and want to modify it for your own projects then I hope that I provided a sufficient road map for you to do so.
I got the idea for the KGN application because I was spending quite a bit of time manually setting up SPARQL queries for DBPedia and other public sources like WikiData, and I wanted to experiment with partially automating this exploration process.