Automatically Generating Data for Knowledge Graphs
In this chapter we develop a complete application. The Knowledge Graph Creator (KGcreator) is a tool for automating the generation of data for Knowledge Graphs from raw text data. We will see how to create a single standalone executable file using SBCL Common Lisp. The application can also be run during development from a REPL, and it also provides a web application interface. In addition to the KGcreator application, we will close the chapter with a utility library that processes a file of RDF in N-Triples format and generates an extension file with triples pulled from DBpedia that define the URIs found in the input data file.
KGcreator generates data in two formats:
- Neo4j graph database format (text format)
- RDF triples suitable for loading into any linked data/semantic web data store.
This example application works by identifying entities in text. Example entity types are people, companies, country names, city names, broadcast network names, political party names, and university names. We saw earlier code for detecting entities in the chapter on natural language processing (NLP) and we will reuse this code. We will discuss later three strategies for reusing code from different projects.
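To make the idea of entity identification concrete, here is a minimal sketch. It is not the entity detection code from the NLP chapter: the table *demo-entities* and the function find-demo-entities are hypothetical names, and the lookup is a naive substring search over a tiny hand-written table.

```lisp
;; Hypothetical lookup table: each entry is (name type dbpedia-uri).
(defparameter *demo-entities*
  '(("Canada" "country" "<http://dbpedia.org/resource/Canada>")
    ("Wall Street Journal" "company"
     "<http://dbpedia.org/resource/The_Wall_Street_Journal>")))

(defun find-demo-entities (text)
  "Return a list of (type name uri) for each table entry that occurs in TEXT."
  (loop for (name type uri) in *demo-entities*
        when (search name text)
          collect (list type name uri)))
```

Calling (find-demo-entities "News from Canada today") returns a one-element list for the country entry. A real entity finder would work on tokenized words and far larger name tables.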
When I originally wrote KGCreator I intended to develop a commercial product. I wrote two research prototypes, one in Common Lisp (the example in this chapter) and one in Haskell (which I also use as an example in my book Haskell Tutorial and Cookbook). I decided to open source both versions of KGCreator, and if you work with Knowledge Graphs I hope you find KGCreator useful in your work.
The following figure shows part of a Neo4j Knowledge Graph created with the example code. This graph has shortened labels in displayed nodes but Neo4j offers a web browser-based console that lets you interactively explore Knowledge Graphs. We don’t cover setting up Neo4j here so please use the Neo4j documentation. As an introduction to RDF data, the semantic web, and linked data you can get free copies of my two books Practical Semantic Web and Linked Data Applications, Common Lisp Edition and Practical Semantic Web and Linked Data Applications, Java, Scala, Clojure, and JRuby Edition.
Here is a detail view:
Implementation Notes
As seen in the file src/kgcreator/package.lisp this application uses several other packages:
1 (defpackage #:kgcreator
2 (:use #:cl
3 #:entities_dbpedia #:categorize_summarize #:myutils
4 #:cl-who #:hunchentoot #:parenscript)
5 (:export kgcreator))
The packages shown on line 3 were implemented in a previous chapter. The package myutils consists mostly of miscellaneous string utilities that we won’t look at here; I leave it to you to read the source code.
As seen in the configuration file src/kgcreator/kgcreator.asd, the system is built from package.lisp plus four implementation source files:
;;;; kgcreator.asd

(asdf:defsystem #:kgcreator
  :description "Describe plotlib here"
  :author "Mark Watson <mark.watson@gmail.com>"
  :license "AGPL version 3"
  :depends-on (#:entities_dbpedia #:categorize_summarize
               #:myutils #:unix-opts #:cl-who
               #:hunchentoot #:parenscript)
  :components
  ((:file "package")
   (:file "kgcreator")
   (:file "neo4j")
   (:file "rdf")
   (:file "web")))
The application is separated into four source files:
- kgcreator.lisp: top level APIs and functionality. Uses the code in neo4j.lisp and rdf.lisp. Later we will generate a standalone application that uses these top level APIs
- neo4j.lisp: generates Cypher text files that can be imported into Neo4j
- rdf.lisp: generates RDF text data that can be loaded or imported into RDF data stores
- web.lisp: a simple web application for running KGCreator
Generating RDF Data
I leave it to you to find a tutorial on RDF data on the web, or you can get a PDF of my book “Practical Semantic Web and Linked Data Applications, Common Lisp Edition” and read the tutorial sections on RDF.
RDF data is made up of triples, where each triple consists of a subject, a predicate, and an object. Subjects are URIs or blank nodes, predicates are URIs, and objects are URIs, blank nodes, or literal values. Here are two triples written by this example application:
<http://dbpedia.org/resource/The_Wall_Street_Journal>
  <http://knowledgebooks.com/schema/aboutCompanyName>
    "Wall Street Journal" .
<https://newsshop.com/june/z902.html>
  <http://knowledgebooks.com/schema/containsCountryDbPediaLink>
    <http://dbpedia.org/resource/Canada> .
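The shape of a triple can be sketched in a few lines of Common Lisp. This is an illustration only, not the code from rdf.lisp: the function names uri-p, escape-literal, and make-triple are hypothetical, the URI test is a simplifying assumption, and only double quotes and backslashes are escaped in literals.

```lisp
(defun uri-p (s)
  "True when S starts with http:// or https:// (a simplifying assumption)."
  (or (eql 0 (search "http://" s))
      (eql 0 (search "https://" s))))

(defun escape-literal (s)
  "Backslash-escape double quotes and backslashes in the literal S."
  (with-output-to-string (out)
    (loop for ch across s
          do (when (member ch '(#\" #\\))
               (write-char #\\ out))
             (write-char ch out))))

(defun make-triple (subject predicate object)
  "Format one N-Triples line. OBJECT is written as a URI if it looks like
one, otherwise as a quoted literal."
  (format nil "<~a> <~a> ~a ."
          subject predicate
          (if (uri-p object)
              (format nil "<~a>" object)
              (format nil "\"~a\"" (escape-literal object)))))
```

For example, (make-triple "http://dbpedia.org/resource/The_Wall_Street_Journal" "http://knowledgebooks.com/schema/aboutCompanyName" "Wall Street Journal") produces the first triple shown above on a single line.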
The following listing of the file src/kgcreator/rdf.lisp generates RDF data:
(in-package #:kgcreator)

;; The hash table of already-written nodes is shared by the functions below.
(let ((*rdf-nodes-hash*))

  (defun rdf-from-files (output-file-path text-and-meta-pairs)
    (setf *rdf-nodes-hash* (make-hash-table :test #'equal :size 200))
    (print (list "==> rdf-from-files" output-file-path text-and-meta-pairs ))
    (with-open-file
        (str output-file-path
             :direction :output
             :if-exists :supersede
             :if-does-not-exist :create)

      (defun rdf-from-files-handle-single-file (text-input-file meta-input-file)
        (let* ((text (file-to-string text-input-file))
               (words (myutils:words-from-string text))
               (meta (file-to-string meta-input-file)))

          ;; Write the summary and topic category triples for the source document.
          (defun generate-original-doc-node-rdf ()
            (let ((node-name (node-name-from-uri meta)))
              (if (null (gethash node-name *rdf-nodes-hash*))
                  (let* ((cats (categorize words))
                         (sum (summarize words cats)))
                    (print (list "$$$$$$ cats:" cats))
                    (setf (gethash node-name *rdf-nodes-hash*) t)
                    (format str (concatenate 'string "<" meta
                                             "> <http://knowledgebooks.com/schema/summary> \""
                                             sum "\" . ~%"))
                    (dolist (cat cats)
                      (let ((hash-check (concatenate 'string node-name (car cat))))
                        (if (null (gethash hash-check *rdf-nodes-hash*))
                            (let ()
                              (setf (gethash hash-check *rdf-nodes-hash*) t)
                              (format str
                                      (concatenate 'string "<" meta
                                                   "> <http://knowledgebooks.com/schema/"
                                                   "topicCategory> "
                                                   "<http://knowledgebooks.com/schema/"
                                                   (car cat) "> . ~%"))))))))))

          ;; Write one triple linking the document to each DBpedia entity found.
          (defun generate-dbpedia-contains-rdf (key value)
            (generate-original-doc-node-rdf)
            (let ((relation-name (concatenate 'string key "DbPediaLink")))
              (dolist (entity-pair value)
                (let* ((node-name (node-name-from-uri meta))
                       (object-node-name (node-name-from-uri (cadr entity-pair)))
                       (hash-check (concatenate 'string node-name object-node-name)))
                  (if (null (gethash hash-check *rdf-nodes-hash*))
                      (let ()
                        (setf (gethash hash-check *rdf-nodes-hash*) t)
                        (format str (concatenate 'string "<" meta
                                                 "> <http://knowledgebooks.com/schema/contains/"
                                                 key "> " (cadr entity-pair) " .~%"))))))))))


      ;; start code for rdf-from-files (output-file-path text-and-meta-pairs)
      (dolist (pair text-and-meta-pairs)
        (rdf-from-files-handle-single-file (car pair) (cadr pair))
        (let ((h (entities_dbpedia:find-entities-in-text (file-to-string (car pair)))))
          (entities_dbpedia:entity-iterator #'generate-dbpedia-contains-rdf h))))))


(defvar test_files '((#P"~/GITHUB/common-lisp/kgcreator/test_data/test3.txt"
                      #P"~/GITHUB/common-lisp/kgcreator/test_data/test3.meta")))
(defvar test_filesZZZ '((#P"~/GITHUB/common-lisp/kgcreator/test_data/test3.txt"
                         #P"~/GITHUB/common-lisp/kgcreator/test_data/test3.meta")
                        (#P"~/GITHUB/common-lisp/kgcreator/test_data/test2.txt"
                         #P"~/GITHUB/common-lisp/kgcreator/test_data/test2.meta")
                        (#P"~/GITHUB/common-lisp/kgcreator/test_data/test1.txt"
                         #P"~/GITHUB/common-lisp/kgcreator/test_data/test1.meta")))

(defun test3a ()
  (rdf-from-files "out.rdf" test_files))
You can load all of KGCreator but just execute the test function at the end of this file using:
(ql:quickload "kgcreator")
(in-package #:kgcreator)
(test3a) ; test3a is not exported, so call it from inside the package
This code works on a list of paired files for text data and the meta data for each text file. As an example, if there is an input text file test123.txt then there would be a matching meta file test123.meta that contains the source of the data in the file test123.txt. This data source will be a URI on the web or a local file URI. The top level function rdf-from-files takes an output file path for writing the generated RDF data and a list of pairs of text and meta file paths.
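The naming convention for paired files can be expressed with standard pathname functions. The following is a small sketch under that convention; the function name meta-path-for is hypothetical and is not part of KGCreator.

```lisp
(defun meta-path-for (text-path)
  "Return the pathname of the .meta file that sits next to TEXT-PATH,
keeping the directory and base name and changing only the file type."
  (make-pathname :type "meta" :defaults (pathname text-path)))
```

For the example in the text, (meta-path-for "test_data/test123.txt") yields a pathname whose name is test123 and whose type is meta.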
The variable *rdf-nodes-hash* is used to remember the nodes in the RDF graph as it is generated. Although it is named like a global variable, in this listing it is bound by the let form that encloses rdf-from-files and is shared by every call. Please note that the function rdf-from-files is therefore not re-entrant: if you are writing multi-threaded applications it will not work to execute the function rdf-from-files simultaneously in multiple threads of execution.
The function rdf-from-files (and the nested functions) are straightforward. I left a few debug printout statements in the code, and when you run the test code that I left at the bottom of the file, hopefully it will be clear what rdf.lisp is doing.
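One way around the re-entrancy limitation, sketched here rather than taken from rdf.lisp, is to give each call its own table by rebinding a special variable with let. The names *seen-nodes*, emit-once, and demo-run are hypothetical. In SBCL a dynamic binding established this way is local to the thread that made it, so concurrent calls would not share a table.

```lisp
(defvar *seen-nodes* nil
  "Hypothetical special variable; rebound for each run instead of being set.")

(defun emit-once (key line out)
  "Write LINE to the stream OUT unless KEY was already emitted in this run."
  (unless (gethash key *seen-nodes*)
    (setf (gethash key *seen-nodes*) t)
    (write-line line out)))

(defun demo-run (keyed-lines)
  "Emit each (key . line) pair at most once, using a table private to this call.
Returns the emitted text as a string."
  (let ((*seen-nodes* (make-hash-table :test #'equal)))
    (with-output-to-string (out)
      (dolist (pair keyed-lines)
        (emit-once (car pair) (cdr pair) out)))))
```

The duplicate check is the same gethash test used in the listing; only the lifetime of the table changes.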
Generating Data for the Neo4j Graph Database
Now we will generate Neo4j Cypher data. In order to keep the implementation simple, both the RDF and Cypher generation code start with raw text and perform the NLP analysis to find entities. This example could be refactored to perform the NLP analysis just one time, but in practice you will likely be working with either RDF or Neo4j and so you will probably extract just the code you need from this example (i.e., either the RDF or Cypher generation code).
Before we look at the code, let’s start with a few lines of generated Neo4j Cypher import data:
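If you did want to run the analysis only once, the refactoring could look roughly like the following sketch. All names here are hypothetical: analyze-once is not part of KGCreator, and the stub analyzer and emitters stand in for the real NLP and output code.

```lisp
(defun analyze-once (text analyzer emitters)
  "Call ANALYZER on TEXT one time and pass the result to each function in
EMITTERS. Returns the list of emitter results."
  (let ((entities (funcall analyzer text)))
    (mapcar (lambda (emit) (funcall emit entities)) emitters)))

;; Stubs standing in for the real entity finder and output generators.
(defun stub-analyzer (text)
  (declare (ignore text))
  '(("country" "Canada")))

(defun stub-rdf-emitter (entities)
  (format nil "rdf: ~a entities" (length entities)))

(defun stub-cypher-emitter (entities)
  (format nil "cypher: ~a entities" (length entities)))
```

A call such as (analyze-once "Canada is large." #'stub-analyzer (list #'stub-rdf-emitter #'stub-cypher-emitter)) runs the analyzer once and feeds both emitters.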
CREATE (newsshop_com_june_z902_html_news)-[:ContainsCompanyDbPediaLink]->(Wall_Street_Journal)
CREATE (Canada:Entity {name:"Canada", uri:"<http://dbpedia.org/resource/Canada>"})
CREATE (newsshop_com_june_z902_html_news)-[:ContainsCountryDbPediaLink]->(Canada)
CREATE (summary_of_abcnews_go_com_US_violent_long_lasting_tornadoes_threaten_oklahoma_texas_storyid63146361:Summary {name:"summary_of_abcnews_go_com_US_violent_long_lasting_tornadoes_threaten_oklahoma_texas_storyid63146361", uri:"<https://abcnews.go.com/US/violent-long-lasting-tornadoes-threaten-oklahoma-texas/story?id=63146361>", summary:"Part of the system that delivered severe weather to the central U.S. over the weekend is moving into the Northeast today, producing strong to severe storms -- damaging winds, hail or isolated tornadoes can't be ruled out. Severe weather is forecast to continue on Tuesday, with the western storm moving east into the Midwest and parts of the mid-Mississippi Valley."})
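Statements of this shape are easy to build with format. The sketch below is illustrative and is not the code from neo4j.lisp: sanitize-node-name, escape-cypher-string, cypher-node, and cypher-relation are hypothetical names, and the rule of replacing every non-alphanumeric character with an underscore is an assumption (the book's node-name-from-uri may behave differently).

```lisp
(defun sanitize-node-name (s)
  "Replace every character that is not a letter or digit with an underscore."
  (map 'string (lambda (ch) (if (alphanumericp ch) ch #\_)) s))

(defun escape-cypher-string (s)
  "Backslash-escape backslashes and double quotes for a double-quoted string."
  (with-output-to-string (out)
    (loop for ch across s
          do (when (member ch '(#\" #\\))
               (write-char #\\ out))
             (write-char ch out))))

(defun cypher-node (label name uri)
  "Build a CREATE statement for a node with name and uri properties."
  (format nil "CREATE (~a:~a {name:\"~a\", uri:\"~a\"})"
          (sanitize-node-name name) label
          (escape-cypher-string name) (escape-cypher-string uri)))

(defun cypher-relation (from relation to)
  "Build a CREATE statement for a directed relationship between two nodes."
  (format nil "CREATE (~a)-[:~a]->(~a)"
          (sanitize-node-name from) relation (sanitize-node-name to)))
```

Escaping the property values matters for summaries, which are free text and may contain double quotes.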
The following listing of file src/kgcreator/neo4j.lisp is similar to the code that generated RDF in the last section:
(in-package #:kgcreator)

(let ((*entity-nodes-hash*))

  (defun cypher-from-files (output-file-path text-and-meta-pairs)
    (setf *entity-nodes-hash* (make-hash-table :test #'equal :size 200))
    ;;(print (list "==> cypher-from-files"output-file-path text-and-meta-pairs ))
    (with-open-file
        (str output-file-path
             :direction :output
             :if-exists :supersede
             :if-does-not-exist :create)

      (defun generateNeo4jCategoryNodes ()
        (let* ((names categorize_summarize::categoryNames))
          (dolist (name names)
            (format str
                    (myutils:replace-all
                     (concatenate
                      'string "CREATE (" name ":CategoryType {name:\"" name "\"})~%")
                     "/" "_"))))
        (format str "~%"))


      (defun cypher-from-files-handle-single-file (text-input-file meta-input-file)
        (let* ((text (file-to-string text-input-file))
               (words (myutils:words-from-string text))
               (meta (file-to-string meta-input-file)))

          (defun generate-original-doc-node ()
            (let ((node-name (node-name-from-uri meta)))
              (if (null (gethash node-name *entity-nodes-hash*))
                  (let* ((cats (categorize words))
                         (sum (summarize words cats)))
                    (setf (gethash node-name *entity-nodes-hash*) t)
                    (format str (concatenate 'string "CREATE (" node-name ":News {name:\""
                                             node-name "\", uri: \"" meta
                                             "\", summary: \"" sum "\"})~%"))
                    (dolist (cat cats)
                      (let ((hash-check (concatenate 'string node-name (car cat))))
                        (if (null (gethash hash-check *entity-nodes-hash*))
                            (let ()
                              (setf (gethash hash-check *entity-nodes-hash*) t)
                              (format str (concatenate 'string "CREATE (" node-name
                                                       ")-[:Category]->("
                                                       (car cat) ")~%"))))))))))

          (defun generate-dbpedia-nodes (key entity-pairs)
            (dolist (entity-pair entity-pairs)
              (if (null (gethash (node-name-from-uri (cadr entity-pair))
                                 *entity-nodes-hash*))
                  (let ()
                    (setf (gethash (node-name-from-uri (cadr entity-pair)) *entity-nodes-hash*) t)
                    (format str
                            (concatenate 'string "CREATE (" (node-name-from-uri (cadr entity-pair)) ":"
                                         key " {name: \"" (car entity-pair)
                                         "\", uri: \"" (cadr entity-pair) "\"})~%"))))))

          (defun generate-dbpedia-contains-cypher (key value)
            (generate-original-doc-node)
            (generate-dbpedia-nodes key value)
            (let ((relation-name (concatenate 'string key "DbPediaLink")))
              (dolist (entity-pair value)
                (let* ((node-name (node-name-from-uri meta))
                       (object-node-name (node-name-from-uri (cadr entity-pair)))
                       (hash-check (concatenate 'string node-name object-node-name)))
                  (if (null (gethash hash-check *entity-nodes-hash*))
                      (let ()
                        (setf (gethash hash-check *entity-nodes-hash*) t)
                        (format str (concatenate 'string
                                                 "CREATE (" node-name ")-[:"
                                                 relation-name "]->(" object-node-name ")~%"))))))))))


      ;; start code for cypher-from-files (output-file-path text-and-meta-pairs)
      (generateNeo4jCategoryNodes) ;; just once, not for every input file
      (dolist (pair text-and-meta-pairs)
        (cypher-from-files-handle-single-file (car pair) (cadr pair))
        (let ((h (entities_dbpedia:find-entities-in-text (file-to-string (car pair)))))
          (entities_dbpedia:entity-iterator #'generate-dbpedia-contains-cypher h))))))


(defvar test_files '((#P"~/GITHUB/common-lisp/kgcreator/test_data/test3.txt"
                      #P"~/GITHUB/common-lisp/kgcreator/test_data/test3.meta")
                     (#P"~/GITHUB/common-lisp/kgcreator/test_data/test2.txt"
                      #P"~/GITHUB/common-lisp/kgcreator/test_data/test2.meta")
                     (#P"~/GITHUB/common-lisp/kgcreator/test_data/test1.txt"
                      #P"~/GITHUB/common-lisp/kgcreator/test_data/test1.meta")))

(defun test2a ()
  (cypher-from-files "out.cypher" test_files))
You can load all of KGCreator but just execute the test function at the end of this file using:
(ql:quickload "kgcreator")
(in-package #:kgcreator)
(test2a) ; test2a is not exported, so call it from inside the package
Implementing the Top Level Application APIs
The code in the file src/kgcreator/kgcreator.lisp uses both rdf.lisp and neo4j.lisp that we saw in the last two sections. The function get-files-and-meta looks at the contents of an input directory to generate a list of pairs, each pair containing the path to a text file and the meta file for the corresponding text file.
We are using the opts package (from the unix-opts library listed in kgcreator.asd) to parse command line arguments. This will be used when we build a standalone single-file executable for the entire KGCreator application, including the web application that we will see in a later section.
;; KGCreator main program

(in-package #:kgcreator)

(ensure-directories-exist "temp/")

;; Return a list of (text-file meta-file) pairs found in the directory fpath.
(defun get-files-and-meta (fpath)
  (let ((data (directory (concatenate 'string fpath "/" "*.txt")))
        (meta (directory (concatenate 'string fpath "/" "*.meta"))))
    (if (not (equal (length data) (length meta)))
        (let ()
          (princ "Error: must be matching *.meta files for each *.txt file")
          (terpri)
          '())
        (let ((ret '()))
          (dotimes (i (length data))
            (setq ret (cons (list (nth i data) (nth i meta)) ret)))
          ret))))

(opts:define-opts
  (:name :help
   :description
   "KGcreator command line example: ./KGcreator -i test_data -r out.rdf -c out.cypher"
   :short #\h
   :long "help")
  (:name :rdf
   :description "RDF output file name"
   :short #\r
   :long "rdf"
   :arg-parser #'identity) ;; <- takes an argument
  (:name :cypher
   :description "Cypher output file name"
   :short #\c
   :long "cypher"
   :arg-parser #'identity) ;; <- takes an argument
  (:name :inputdir
   :description "Input directory containing *.txt and *.meta files"
   :short #\i
   :long "inputdir"
   :arg-parser #'identity)) ;; <- takes an argument


(defun kgcreator () ;; don't need: &aux args sb-ext:*posix-argv*)
  (handler-case
      (let* ((opts (opts:get-opts))
             (input-path
               (if (find :inputdir opts)
                   (nth (1+ (position :inputdir opts)) opts)))
             (rdf-output-path
               (if (find :rdf opts)
                   (nth (1+ (position :rdf opts)) opts)))
             (cypher-output-path
               (if (find :cypher opts)
                   (nth (1+ (position :cypher opts)) opts))))
        (format t "input-path: ~a rdf-output-path: ~a cypher-output-path:~a~%"
                input-path rdf-output-path cypher-output-path)
        (if (not input-path)
            (format t "You must specify an input path.~%")
            (locally
                (declare #+sbcl(sb-ext:muffle-conditions sb-kernel:redefinition-warning))
              (handler-bind
                  (#+sbcl(sb-kernel:redefinition-warning #'muffle-warning))
                ;; stuff that emits redefinition-warning's
                (let ()
                  (if rdf-output-path
                      (rdf-from-files rdf-output-path (get-files-and-meta input-path)))
                  (if cypher-output-path
                      (cypher-from-files cypher-output-path (get-files-and-meta input-path))))))))
    (t (c)
      (format t "We caught a runtime error: ~a~%" c)
      (values 0 c)))
  (format t "~%Shutting down KGcreator - done processing~%~%"))

(defun test1 ()
  (get-files-and-meta
   "~/GITHUB/common-lisp/kgcreator/test_data"))

(defun print-hash-entry (key value)
  (format t "The value associated with the key ~S is ~S~%" key value))

(defun test2 ()
  (let ((h (entities_dbpedia:find-entities-in-text "Bill Clinton and George Bush went to Mexico and England and watched Univision. They enjoyed Dakbayan sa Dabaw and shoped at Best Buy and listened to Al Stewart. They agree on República de Nicaragua and support Sweden Democrats and Leicestershire Miners Association and both sent their kids to Darul Uloom Deoband.")))
    (entities_dbpedia:entity-iterator #'print-hash-entry h)))

(defun test7 ()
  (rdf-from-files "out.rdf" (get-files-and-meta "test_data")))
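The function get-files-and-meta pairs the nth text file with the nth meta file, which relies on the two directory listings coming back in the same order. A more defensive variant could match files by base name instead. The following is a sketch of that idea with a hypothetical function name, pair-text-and-meta; it works on lists of pathnames so it can be tried without touching the file system.

```lisp
(defun pair-text-and-meta (text-paths meta-paths)
  "Pair each pathname in TEXT-PATHS with the entry of META-PATHS that has
the same base name. Returns two values: the list of (text meta) pairs and
the list of text paths that had no matching meta file."
  (let ((pairs '())
        (unmatched '()))
    (dolist (text text-paths)
      (let ((meta (find (pathname-name text) meta-paths
                        :key #'pathname-name :test #'equal)))
        (if meta
            (push (list text meta) pairs)
            (push text unmatched))))
    (values (nreverse pairs) (nreverse unmatched))))
```

The second return value makes it possible to report exactly which text files are missing a meta file, rather than only that the counts differ.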
You can load all of KGCreator but just execute the three test functions at the end of this file using:
(ql:quickload "kgcreator")
(in-package #:kgcreator)
;; the test functions are not exported, so call them from inside the package
(test1)
(test2)
(test7)
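The kgcreator function looks up each option with find, position, and nth, which suggests that the parsed options are a flat list of alternating keys and values. Assuming that shape, the standard function getf does the same lookup in one step. The sketch below uses a literal list in place of the value returned by the option parser, and the names option-value and describe-options are hypothetical.

```lisp
(defun option-value (options name &optional default)
  "Look up NAME in the property list OPTIONS, returning DEFAULT when absent."
  (getf options name default))

(defun describe-options (options)
  "Summarize the three options used by the application as a string."
  (format nil "input-path: ~a rdf-output-path: ~a cypher-output-path: ~a"
          (option-value options :inputdir)
          (option-value options :rdf)
          (option-value options :cypher)))
```

For example, (option-value '(:inputdir "test_data" :rdf "out.rdf") :rdf) returns "out.rdf", and a missing option such as :cypher returns NIL.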
Implementing The Web Interface
When we build a standalone single file application for KGCreator, we include a simple web application interface that allows users to enter input text and see generated RDF and Neo4j Cypher data.
The file src/kgcreator/web.lisp uses the libraries cl-who, hunchentoot, and parenscript that we used earlier. The function write-files-run-code (lines 8-43) takes raw text and writes generated RDF and Neo4j Cypher data to local temporary files that are then read and formatted as HTML for display. The code in rdf.lisp and neo4j.lisp is file oriented, and I wrote web.lisp as an afterthought, so it was easier to write temporary files than to refactor rdf.lisp and neo4j.lisp to write to strings.
1 (in-package #:kgcreator)
2
3 (ql:quickload '(cl-who hunchentoot parenscript))
4
5
6 (setf (html-mode) :html5)
7
8 (defun write-files-run-code (a-uri raw-text)
9 (if (< (length raw-text) 10)
10 (list "not enough text" "not enough text")
11 ;; generate random file number
12 (let* ((filenum (+ 1000 (random 5000)))
13 (meta-name (concatenate 'string "temp/" (write-to-string filenum) ".meta"))
14 (text-name (concatenate 'string "temp/" (write-to-string filenum) ".txt"))
15 (rdf-name (concatenate 'string "temp/" (write-to-string filenum) ".rdf"))
16 (cypher-name (concatenate 'string "temp/" (write-to-string filenum) ".cypher"))
17 ret)
18 ;; write meta file
19 (with-open-file (str meta-name
20 :direction :output
21 :if-exists :supersede
22 :if-does-not-exist :create)
23 (write-string a-uri str)) ; avoid treating the URI as a format control string
24 ;; write text file
25 (with-open-file (str text-name
26 :direction :output
27 :if-exists :supersede
28 :if-does-not-exist :create)
29 (write-string raw-text str)) ; user text may contain ~ characters
30 ;; generate rdf and cypher files
31 (rdf-from-files rdf-name (list (list text-name meta-name)))
32 (cypher-from-files cypher-name (list (list text-name meta-name)))
33 ;; read files and return results
34 (setf ret
35 (list
36 (replace-all
37 (replace-all
38 (uiop:read-file-string rdf-name)
39 ">" "&gt;")
40 "<" "&lt;")
41 (uiop:read-file-string cypher-name)))
42 (print (list "ret:" ret))
43 ret)))
44
45 (defvar *h* (make-instance 'easy-acceptor :port 3000))
46
47 ;; define a handler with the arbitrary name my-greetings:
48
49 (define-easy-handler (my-greetings :uri "/") (text)
50 (setf (hunchentoot:content-type*) "text/html")
51 (let ((rdf-and-cypher (write-files-run-code "http://test.com/1" text)))
52 (print (list "*** rdf-and-cypher:" rdf-and-cypher))
53 (with-html-output-to-string
54 (*standard-output* nil :prologue t)
55 (:html
56 (:head (:title "KGCreator Demo")
57 (:link :rel "stylesheet" :href "styles.css" :type "text/css"))
58 (:body
59 :style "margin: 90px"
60 (:h1 "Enter plain text for the demo to create RDF and Cypher")
61 (:p "For more information on the KGCreator product please visit the web site:"
62 (:a :href "https://markwatson.com/products/" "Mark Watson's commercial products"))
63 (:p "The KGCreator product is a command line tool that processes all text "
64 "files in a source directory and produces both RDF data "
65 "triples for semantic web applications and Cypher input data files for the Neo4j graph database. "
66 "For the purposes of this demo the URI for your input text is hardwired to "
67 "&lt;http://test.com/1&gt; but the KGCreator product offers flexibility "
68 "for assigning URIs to data sources and further, "
69 "creates links for relationships between input sources.")
70 (:p :style "text-align: left"
71 "To try the demo paste plain text into the following form that contains "
72 "information on companies, news, politics, famous people, broadcasting "
73 "networks, political parties, countries and other locations, etc. ")
74 (:p "Do not include any special characters or character sets:")
75 (:form
76 :method :post
77 (:textarea
78 :rows "20"
79 :cols "90"
80 :name "text"
81 :value text)
82 (:br)
83 (:input :type :submit :value "Submit text to process"))
84 (:h3 "RDF:")
85 (:pre (str (car rdf-and-cypher)))
86 (:h3 "Cypher:")
87 (:pre (str (cadr rdf-and-cypher))))))))
88
89 (defun kgcweb ()
90 (hunchentoot:start *h*))
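The temporary files could be avoided if the generators wrote to a stream supplied by the caller. The standard macro with-output-to-string then captures the output directly. This is only a sketch of that refactoring: write-demo-triples is a hypothetical stand-in for a stream-oriented generator, and generate-to-string is a hypothetical helper.

```lisp
(defun write-demo-triples (out)
  "Stand-in for a generator that writes its results to the stream OUT."
  (format out "<http://test.com/1> <http://knowledgebooks.com/schema/summary> \"demo\" .~%"))

(defun generate-to-string (writer)
  "Call WRITER with a string output stream and return everything it wrote."
  (with-output-to-string (out)
    (funcall writer out)))
```

With this shape, the same writer function can target a file opened with with-open-file or an in-memory string, so the web handler would not need randomly numbered files in temp/.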
You can load all of KGCreator and start the web application using:
(ql:quickload "kgcreator")
(in-package #:kgcreator)
(kgcweb)
You can access the web app at http://localhost:3000.
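Generated RDF is full of angle brackets, which a browser treats as markup unless they are escaped before being placed in the page. A minimal escaping function, written here as a self-contained sketch with the hypothetical name escape-html, replaces the three characters that matter inside element content. The cl-who library also provides escape-string for this purpose.

```lisp
(defun escape-html (s)
  "Return S with &, <, and > replaced by HTML character entities."
  (with-output-to-string (out)
    (loop for ch across s
          do (case ch
               (#\& (write-string "&amp;" out))
               (#\< (write-string "&lt;" out))
               (#\> (write-string "&gt;" out))
               (t (write-char ch out))))))
```

Handling the ampersand in the same pass avoids escaping the entities produced for the angle brackets a second time.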
Creating a Standalone Application Using SBCL
When I originally wrote KGCreator I intended to develop a commercial product so it was important to be able to create standalone single file executables. This is simple to do using SBCL:
$ sbcl
(ql:quickload "kgcreator")
(in-package #:kgcreator)
(sb-ext:save-lisp-and-die "KGcreator"
                          :toplevel #'kgcreator :executable t)
As an example, you could run the application on the command line using:
./KGcreator -i test_data -r out.rdf -c out.cypher
Augmenting RDF Triples in a Knowledge Graph Using DBPedia
You can augment RDF-based Knowledge Graphs that you build with the KGcreator application by using the library in the directory kg-add-dbpedia-triples.
As seen in the kg-add-dbpedia-triples.asd and package.lisp configuration files, we use two other libraries developed in this book:
;;;; kg-add-dbpedia-triples.asd

(asdf:defsystem #:kg-add-dbpedia-triples
  :description "Add DBPedia triples from an input N-Triples RDF file"
  :author "markw@markwatson.com"
  :license "Apache 2"
  :depends-on (#:myutils #:sparql)
  :components ((:file "package")
               (:file "add-dbpedia-triples")))

;;;; package.lisp

(defpackage #:kg-add-dbpedia-triples
  (:use #:cl #:myutils #:sparql)
  (:export #:add-triples))
The library is implemented in the file add-dbpedia-triples.lisp, the second component named in the system definition:
(in-package #:kg-add-dbpedia-triples)

;; Query DBpedia for up to five triples about a-uri and write them to ostream.
(defun augmented-triples (a-uri ostream)
  (let ((results
          (sparql:dbpedia
           (format nil "construct { ~A ?p ?o } where { ~A ?p ?o } limit 5" a-uri a-uri))))
    (dolist (x results)
      (dolist (sop x)
        (let ((val (second sop)))
          (if (and
               (stringp val)
               (> (length val) 9)
               (or
                (equal (subseq val 0 7) "http://")
                (equal (subseq val 0 8) "https://")))
              (format ostream "<~A> " val)
              (format ostream "~A " val))))
      (format ostream " .~%"))))

;; Collect the distinct DBpedia URIs in in-file-name and write triples
;; describing each of them to out-file-name.
(defun add-triples (in-file-name out-file-name)
  (let* ((nt-data (myutils:file-to-string in-file-name))
         (tokens (myutils:tokenize-string-keep-uri nt-data))
         (uris
           (remove-duplicates
            (mapcan #'(lambda (s) (if
                                   (and
                                    (stringp s)
                                    (> (length s) 19)
                                    (equal (subseq s 0 19) "<http://dbpedia.org"))
                                   (list s)))
                    tokens)
            :test #'equal)))
    (with-open-file (str out-file-name
                         :direction :output
                         :if-exists :supersede
                         :if-does-not-exist :create)
      (dolist (uri uris)
        (augmented-triples uri str)))))
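The URI collection step in add-triples can be tried in isolation. The sketch below is self-contained and does not use myutils or sparql: split-on-whitespace is a hypothetical, simplified replacement for the book's tokenizer (it would split literals that contain spaces, which does not matter for finding URIs), dbpedia-uris is a hypothetical name, and construct-query only builds the query string without contacting DBpedia.

```lisp
(defun split-on-whitespace (s)
  "Split S into tokens separated by spaces, tabs, or line breaks."
  (let ((tokens '())
        (start nil))
    (dotimes (i (length s))
      (if (member (char s i) '(#\Space #\Tab #\Newline #\Return))
          (when start
            (push (subseq s start i) tokens)
            (setf start nil))
          (unless start (setf start i))))
    (when start (push (subseq s start) tokens))
    (nreverse tokens)))

(defun dbpedia-uris (nt-data)
  "Return the distinct tokens in NT-DATA that start with <http://dbpedia.org,
in order of first appearance."
  (remove-duplicates
   (remove-if-not (lambda (tok) (eql 0 (search "<http://dbpedia.org" tok)))
                  (split-on-whitespace nt-data))
   :test #'equal :from-end t))

(defun construct-query (uri)
  "Build the SPARQL CONSTRUCT query text used to fetch triples about URI."
  (format nil "construct { ~A ?p ?o } where { ~A ?p ?o } limit 5" uri uri))
```

Given two input lines that both mention <http://dbpedia.org/resource/Canada>, dbpedia-uris returns that URI once, so only one query per distinct URI would be sent.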
KGCreator Wrap Up
When developing applications or systems using Knowledge Graphs, it is useful to be able to quickly generate test data, which is the primary purpose of KGCreator. A secondary use is to generate Knowledge Graphs for production use from text data sources. In this second use case you will want to manually inspect the generated data to verify its correctness and usefulness for your application.