Knowledge Graph Sampler for Creating Small Custom Knowledge Graphs
I find it convenient to be able to “sample” small parts of larger knowledge graphs. The example program in this chapter accepts a list of DBPedia entity URIs, attempts to find links between these entities, and writes these nodes and discovered edges to an RDF triples file.
The code is in the directory src/kgsampler. As seen in the configuration files kgsampler.asd and package.lisp, we will use the sparql library we developed earlier as well as the libraries uiop and drakma:
    ;;;; kgsampler.asd

    (asdf:defsystem #:kgsampler
      :description "sample knowledge graphs"
      :author "Mark Watson markw@markwatson.com"
      :license "Apache 2"
      :depends-on (#:uiop #:drakma #:sparql)
      :components ((:file "package")
                   (:file "kgsampler")))

    ;;;; package.lisp

    (defpackage #:kgsampler
      (:use #:cl #:uiop #:sparql)
      (:export #:sample))
The program starts with a list of entities and tries to find links on DBPedia between the entities. A small sample graph of the input entities and any discovered links is written to a file. The function dbpedia-as-nt spawns a process that uses the curl utility to make an HTTP request to DBPedia. The function construct-from-dbpedia takes a list of entities and, for each one, runs a SPARQL CONSTRUCT query with the entity as the subject and the object filtered to string values in the English language, writing the returned triples to an output stream. The function find-relations queries DBPedia once for every ordered pair of distinct input entities, so it runs in O(N^2) time where N is the number of input entities. You should avoid using this program with a large number of input entities.
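To make the quadratic cost concrete, here is a minimal sketch that enumerates the ordered pairs of distinct entities that a pairwise search like find-relations has to consider. The function name ordered-pairs is hypothetical and not part of kgsampler, and the sketch makes no network calls.

```lisp
;; Hypothetical helper (not part of kgsampler): list every ordered pair
;; of distinct entities. A pairwise relation search needs one SPARQL
;; query per pair.
(defun ordered-pairs (entities)
  (loop for e1 in entities
        append (loop for e2 in entities
                     unless (equal e1 e2)
                       collect (list e1 e2))))

;; For N distinct entities there are N * (N - 1) pairs.
(length (ordered-pairs '("<a>" "<b>" "<c>")))   ; 3 * 2 = 6 pairs
```

With 10 distinct entities this is 90 queries, and with 100 entities it is 9,900 queries, which is why small input lists are advisable.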
I offer this code with little explanation since much of it is similar to the techniques you saw in the previous chapter Knowledge Graph Navigator.
    ;; kgsampler main program

    (in-package #:kgsampler)

    (defun dbpedia-as-nt (query)
      "Run QUERY on the DBPedia SPARQL endpoint via curl; return the response text."
      (print query) ; debug output
      (uiop:run-program
       (list
        "curl"
        (concatenate 'string
                     "https://dbpedia.org/sparql?format=text/ntriples&query="
                     ;; formats that work: csv, text/ntriples, text/ttl
                     (drakma:url-encode query :utf-8)))
       :output :string))

    (defun construct-from-dbpedia (entity-uri-list &key (output-stream t))
      "Write English-language literal triples for each entity to OUTPUT-STREAM."
      (dolist (entity-uri entity-uri-list)
        (format output-stream "~%~%# ENTITY NAME: ~A~%~%" entity-uri)
        ;; Write the response with "~A" so that any tilde characters in the
        ;; returned data are not interpreted as format directives.
        (format
         output-stream
         "~A"
         (dbpedia-as-nt
          (format nil
                  "CONSTRUCT { ~A ?p ?o } where { ~A ?p ?o . FILTER(lang(?o) = 'en') }"
                  entity-uri entity-uri)))))

    (defun ensure-angle-brackets (s)
      "make sure URIs have angle brackets"
      (if (equal #\< (char s 0))
          s
          (concatenate 'string "<" s ">")))

    (defun find-relations (entity-uri-list &key (output-stream t))
      "Write a triple for each property linking any ordered pair of distinct entities."
      (dolist (entity-uri1 entity-uri-list)
        (dolist (entity-uri2 entity-uri-list)
          (unless (equal entity-uri1 entity-uri2)
            (let ((possible-relations
                    (mapcar #'cadar
                            (sparql::dbpedia
                             (format nil
                                     "select ?p where { ~A ?p ~A . filter(!regex(str(?p), \"page\", \"i\"))} limit 50"
                                     entity-uri1 entity-uri2)))))
              (print "** possible-relations:") (print possible-relations)
              (dolist (pr possible-relations)
                (format output-stream "~A ~A ~a .~%"
                        entity-uri1
                        (ensure-angle-brackets pr)
                        entity-uri2)))))))

    (defun sample (entity-uri-list output-filepath)
      "Write triples for the entities, then for the links between them, to OUTPUT-FILEPATH."
      (with-open-file (ostream (pathname output-filepath) :direction :output :if-exists :supersede)
        (construct-from-dbpedia entity-uri-list :output-stream ostream)
        (find-relations entity-uri-list :output-stream ostream)))
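One detail is worth keeping in mind when writing text returned by a remote server with format: the text may contain tilde characters, which format treats as directives when they appear in the control string. The following minimal sketch shows the safe pattern of passing the text as an argument to "~A". The function name write-response and the example.org URIs are hypothetical and are not part of kgsampler.

```lisp
;; Hypothetical helper (not part of kgsampler): write TEXT verbatim.
;; TEXT is passed as an argument to the "~A" directive, so tildes inside
;; TEXT are never interpreted as format directives.
;; OUTPUT-STREAM follows format's conventions:
;; T means standard output, NIL means return the text as a string.
(defun write-response (text &key (output-stream t))
  (format output-stream "~A" text))

;; A URI containing a tilde is returned unchanged.
(write-response "<http://example.org/~user> <http://example.org/p> \"x\" ."
                :output-stream nil)
```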
Let’s start by running the two helper functions interactively so you can see their output (output edited for brevity). The top-level function kgsampler:sample takes a list of entity URIs and an output file name, and uses the functions construct-from-dbpedia and find-relations to write triples first for the entities and then for the relationships discovered between them. The following listing calls both helper functions directly, using the kgsampler:: prefix because they are not exported from the package. Passing :output-stream nil means no triples are written to a stream, so the listing shows the debug output that the functions print:
    $ sbcl
    * (ql:quickload "kgsampler")
    * (kgsampler::construct-from-dbpedia '("<http://dbpedia.org/resource/Bill_Gates>" "<http://dbpedia.org/resource/Steve_Jobs>") :output-stream nil)

    "CONSTRUCT { <http://dbpedia.org/resource/Bill_Gates> ?p ?o } where { <http://dbpedia.org/resource/Bill_Gates> ?p ?o . FILTER (lang(?o) = 'en') }"
    "CONSTRUCT { <http://dbpedia.org/resource/Bill_Gates> <http://purl.org/dc/terms/subject> ?o } where { <http://dbpedia.org/resource/Bill_Gates> <http://purl.org/dc/terms/subject> ?o }"

    ...

    * (kgsampler::find-relations '("<http://dbpedia.org/resource/Bill_Gates>" "<http://dbpedia.org/resource/Microsoft>") :output-stream nil)

    ("dbpedia SPARQL:"
     "select ?p where { <http://dbpedia.org/resource/Bill_Gates> ?p <http://dbpedia.org/resource/Microsoft> . filter(!regex(str(?p), \"page\", \"i\"))} limit 50"
     "n")
    "** possible-relations:"
    ("http://dbpedia.org/ontology/knownFor")
    "http://dbpedia.org/ontology/knownFor"
    ("dbpedia SPARQL:"
     "select ?p where { <http://dbpedia.org/resource/Microsoft> ?p <http://dbpedia.org/resource/Bill_Gates> . filter(!regex(str(?p), \"page\", \"i\"))} limit 50"
     "n")
    "** possible-relations:"
    ("http://dbpedia.org/property/founders" "http://dbpedia.org/ontology/foundedBy")
    "http://dbpedia.org/property/founders"
    "http://dbpedia.org/ontology/foundedBy"
    nil
We now use the main function to generate an output RDF triples file:
    $ sbcl
    * (ql:quickload "kgsampler")
    * (kgsampler:sample '("<http://dbpedia.org/resource/Bill_Gates>" "<http://dbpedia.org/resource/Steve_Jobs>" "<http://dbpedia.org/resource/Microsoft>") "test.nt")
    "CONSTRUCT { <http://dbpedia.org/resource/Bill_Gates> ?p ?o } where { <http://dbpedia.org/resource/Bill_Gates> ?p ?o . FILTER (lang(?o) = 'en') }"
    ("ndbpedia SPARQL:n"
     "select ?p where { <http://dbpedia.org/resource/Bill_Gates> ?p <http://dbpedia.org/resource/Microsoft> . filter(!regex(str(?p), \"page\", \"i\"))} limit 50"
     "n")
    "** possible-relations:"
    ("http://dbpedia.org/ontology/board")
    ("dbpedia SPARQL:"
     "select ?p where { <http://dbpedia.org/resource/Steve_Jobs> ?p <http://dbpedia.org/resource/Bill_Gates> . filter(!regex(str(?p), \"page\", \"i\"))} limit 50"
     "n")
Output RDF N-Triples data is written to the file you name in the call to sample (test.nt in the run above). A very small part of the sample output file sample-KG.nt is listed here:
    # ENTITY NAME: <http://dbpedia.org/resource/Bill_Gates>

    <http://dbpedia.org/resource/Bill_Gates> <http://dbpedia.org/ontology/abstract> "William Henry \"Bill\" Gates III (born October 28, 1955) is an American business magnate,...."@en .
    <http://dbpedia.org/resource/Bill_Gates> <http://xmlns.com/foaf/0.1/name> "Bill Gates"@en .
    <http://dbpedia.org/resource/Bill_Gates> <http://xmlns.com/foaf/0.1/surname> "Gates"@en .
    <http://dbpedia.org/resource/Bill_Gates> <http://dbpedia.org/ontology/title> "Co-Chairmanof theBill & Melinda Gates Foundation"@en .
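A quick way to sanity-check a generated file is to count the triples written for each subject. The following is a minimal sketch, not part of kgsampler: the function name count-triples-by-subject is hypothetical, and it assumes one complete triple per line (as the N-Triples format requires), subjects written as <URI>, and "#" comment lines like the ENTITY NAME headers shown above.

```lisp
;; Hypothetical helper (not part of kgsampler): count triples per subject
;; in N-Triples text read from STREAM. Blank lines and "#" comment lines
;; are skipped because they do not start with "<".
(defun count-triples-by-subject (stream)
  (let ((counts (make-hash-table :test #'equal)))
    (loop for line = (read-line stream nil nil)
          while line
          do (let ((trimmed (string-trim '(#\Space #\Tab #\Return) line)))
               (when (and (plusp (length trimmed))
                          (char= (char trimmed 0) #\<))
                 ;; The subject is everything up to the first ">".
                 (let ((end (position #\> trimmed)))
                   (when end
                     (incf (gethash (subseq trimmed 0 (1+ end)) counts 0)))))))
    counts))

;; Small in-memory example using made-up example.org triples.
(with-input-from-string
    (in (format nil "# ENTITY NAME: <http://example.org/a>~%~%~
                     <http://example.org/a> <http://example.org/p> \"x\"@en .~%~
                     <http://example.org/a> <http://example.org/q> \"y\"@en .~%"))
  (gethash "<http://example.org/a>" (count-triples-by-subject in)))
```

To run this on a real output file, wrap the call in with-open-file instead, for example (with-open-file (in "test.nt") (count-triples-by-subject in)), and inspect the returned hash table with maphash.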
The same data in Turtle RDF format can be seen in the file sample-KG.ttl. I produced it by importing the triples file into the free edition of GraphDB and exporting it as Turtle, a format that I find easier to read. GraphDB has visualization tools which I use here to generate an interactive graph display of this data:
This example is currently set up for people and companies. I may expand it in the future to other types of entities as I need them.
This example program takes several minutes to run since many SPARQL queries are made to DBPedia. I am a non-corporate member of the DBPedia organization. Here is a membership application if you are interested in joining me there.