# Resolve Entity Names to DBPedia References
Dear reader, the material in this chapter is somewhat dated but I still use DBPedia as a public information source so I decided to leave this older chapter in this book (for now). As a personal research project I have collected a large data set that maps entity names (e.g., people’s names, city names, names of music groups, company names, etc.) to the DBPedia URI for each entity. I have developed libraries to use this data in Common Lisp, Haskell, and Java. Here we use the Java version of this library.
The Java library is found in the directory ner_dbpedia in the GitHub repository. The raw data for these entity to URI mappings are found in the directory ner_dbpedia/dbpedia_as_text.
This example shows the use of a standard Java and Maven packaging technique: building a JAR file that contains resource files in addition to compiled Java code. The example code reads the required data resources from the JAR file (or from the temporary target directory during development). This makes the JAR file self-contained when we use this example library in later chapters.
## DBPedia Entities
DBPedia is the structured RDF database that is automatically created from Wikipedia infoboxes. We will go into some detail on RDF data in the later chapter Semantic Web. The raw data files in the directory ner_dbpedia/dbpedia_as_text contain one entity per line: the entity name followed by its DBPedia URI. For people, the entries look like this:
```
Al Stewart <http://dbpedia.org/resource/Al_Stewart>
Alan Watts <http://dbpedia.org/resource/Alan_Watts>
```
If you visit any of these URIs using a web browser, for example http://dbpedia.org/page/Al_Stewart, you will see the DBPedia data for the entity formatted for human reading. To be clear, though, the primary purpose of information in DBPedia is for use by software, not humans.
There are 58953 entities defined with their DBPedia URI and the following listing shows the breakdown of number of entities by entity type by counting the number of lines in each resource file:
```
ner_dbpedia: $ wc -l ./src/main/resources/*.txt
     108 ./src/main/resources/BroadcastNetworkNamesDbPedia.txt
    2580 ./src/main/resources/CityNamesDbpedia.txt
    1786 ./src/main/resources/CompanyNamesDbPedia.txt
     167 ./src/main/resources/CountryNamesDbpedia.txt
   14315 ./src/main/resources/MusicGroupNamesDbPedia.txt
   35606 ./src/main/resources/PeopleDbPedia.txt
     555 ./src/main/resources/PoliticalPartyNamesDbPedia.txt
     351 ./src/main/resources/TradeUnionNamesDbPedia.txt
    3485 ./src/main/resources/UniversityNamesDbPedia.txt
   58953 total
```
The URI for each entity defines a unique identifier for real world entities as well as concepts.
## Library Implementation
The class com.markwatson.ner_dbpedia.NerMaps is a utility for reading the raw entity mapping data files and creating hash tables for these mappings:
```java
package com.markwatson.ner_dbpedia;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/**
 * Copyright Mark Watson 2020. Apache 2 license.
 */
public class NerMaps {

    // Wrap a URI in angle brackets unless it already has them.
    private static String enforceAngleBrackets(String s) {
        if (s.startsWith("<")) return s;
        return "<" + s + ">";
    }

    // Read a tab-separated "name<TAB>uri" resource file into an immutable map.
    private static Map<String, String> textFileToMap(String nerFileName) {
        var ret = new HashMap<String, String>();
        // Fail with a clear message if the resource is not on the classpath.
        try (InputStream in = Objects.requireNonNull(
                 ClassLoader.getSystemResourceAsStream(nerFileName),
                 "NER resource file not found: " + nerFileName);
             BufferedReader reader = new BufferedReader(
                 new InputStreamReader(in, StandardCharsets.UTF_8))) {
            reader.lines().forEach(line -> {
                String[] tokens = line.split("\t");
                if (tokens.length > 1) {
                    ret.put(tokens[0], enforceAngleBrackets(tokens[1]));
                }
            });
        } catch (IOException ex) {
            throw new UncheckedIOException(
                "Failed to load NER resource file: " + nerFileName, ex);
        }
        return Map.copyOf(ret);
    }

    static public final Map<String, String> broadcastNetworks = textFileToMap("BroadcastNetworkNamesDbPedia.txt");
    static public final Map<String, String> cityNames = textFileToMap("CityNamesDbpedia.txt");
    static public final Map<String, String> companyNames = textFileToMap("CompanyNamesDbPedia.txt");
    static public final Map<String, String> countryNames = textFileToMap("CountryNamesDbpedia.txt");
    static public final Map<String, String> musicGroupNames = textFileToMap("MusicGroupNamesDbPedia.txt");
    static public final Map<String, String> personNames = textFileToMap("PeopleDbPedia.txt");
    static public final Map<String, String> politicalPartyNames = textFileToMap("PoliticalPartyNamesDbPedia.txt");
    static public final Map<String, String> tradeUnionNames = textFileToMap("TradeUnionNamesDbPedia.txt");
    static public final Map<String, String> universityNames = textFileToMap("UniversityNamesDbPedia.txt");

    /**
     * Keep legacy field name as an alias so downstream code that references
     * {@code NerMaps.companyames} (the original typo) still compiles.
     */
    @Deprecated(forRemoval = true)
    static public final Map<String, String> companyames = companyNames;

    public static void main(String[] args) {
        System.out.println(
            textFileToMap("CityNamesDbpedia.txt"));
    }
}
```
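To see the parsing step in isolation, here is a minimal sketch that applies the same tab-splitting and bracket-enforcing logic to an in-memory string instead of a resource file. The class name EntityLineParser and the sample data are assumptions made for illustration; they are not part of the ner_dbpedia library.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not part of ner_dbpedia): parses "name<TAB>uri" lines.
final class EntityLineParser {

    static Map<String, String> parse(String data) {
        var map = new HashMap<String, String>();
        data.lines().forEach(line -> {
            String[] tokens = line.split("\t");
            // Lines without a tab-separated URI are silently skipped.
            if (tokens.length > 1) {
                String uri = tokens[1].startsWith("<") ? tokens[1] : "<" + tokens[1] + ">";
                map.put(tokens[0], uri);
            }
        });
        return Map.copyOf(map); // immutable snapshot
    }

    public static void main(String[] args) {
        String sample = "Al Stewart\t<http://dbpedia.org/resource/Al_Stewart>\n"
                      + "Alan Watts\thttp://dbpedia.org/resource/Alan_Watts\n";
        // The second sample URI has no brackets, so they are added.
        System.out.println(parse(sample).get("Alan Watts"));
    }
}
```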
The class com.markwatson.ner_dbpedia.TextToDbpediaUris processes an input string and uses public fields to output found entity names and matching DBPedia URIs. We will use this code later in the chapter Automatically Generating Data for Knowledge Graphs.
The code in the class TextToDbpediaUris is simple and repeats two common patterns for each entity type. We will look at some of the code here.
```java
package com.markwatson.ner_dbpedia;

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

public class TextToDbpediaUris {

    /**
     * Represents a named-entity category with its lookup map and matched results.
     */
    private record EntityCategory(String name, Map<String, String> lookupMap,
                                  Set<String> uriSet, List<String> uris, List<String> names) {
        // Record a match only the first time its URI is seen.
        void addIfAbsent(String uri, String ngram) {
            if (uriSet.add(uri)) {
                uris.add(uri);
                names.add(ngram);
            }
        }
    }

    // Precompiled patterns for tokenization
    private static final Pattern DOT = Pattern.compile("\\.");
    private static final Pattern COMMA = Pattern.compile(",");
    private static final Pattern QMARK = Pattern.compile("\\?");
    private static final Pattern SEMI = Pattern.compile(";");
    private static final Pattern NL = Pattern.compile("\n");
    private static final Pattern MULTI_SPACE = Pattern.compile(" +");

    // Entity categories; list order determines match priority
    private final List<EntityCategory> categories;

    // Public accessors preserving the original API
    public final List<String> personUris;
    public final List<String> personNames;
    public final List<String> companyUris;
    public final List<String> companyNames;
    public final List<String> cityUris;
    public final List<String> cityNames;
    public final List<String> countryUris;
    public final List<String> countryNames;
    public final List<String> broadcastNetworkUris;
    public final List<String> broadcastNetworkNames;
    public final List<String> musicGroupUris;
    public final List<String> musicGroupNames;
    public final List<String> politicalPartyUris;
    public final List<String> politicalPartyNames;
    public final List<String> tradeUnionUris;
    public final List<String> tradeUnionNames;
    public final List<String> universityUris;
    public final List<String> universityNames;

    @SuppressWarnings("unused")
    private TextToDbpediaUris() {
        this("");
    }

    public TextToDbpediaUris(String text) {
        // Initialize entity categories with their lookup maps
        var person = makeCat("person", NerMaps.personNames);
        var city = makeCat("city", NerMaps.cityNames);
        var company = makeCat("company", NerMaps.companyNames);
        var country = makeCat("country", NerMaps.countryNames);
        var broadcastNetwork = makeCat("broadcastNetwork", NerMaps.broadcastNetworks);
        var musicGroup = makeCat("musicGroup", NerMaps.musicGroupNames);
        var politicalParty = makeCat("politicalParty", NerMaps.politicalPartyNames);
        var tradeUnion = makeCat("tradeUnion", NerMaps.tradeUnionNames);
        var university = makeCat("university", NerMaps.universityNames);

        categories = List.of(person, city, company, country,
            broadcastNetwork, musicGroup, politicalParty, tradeUnion, university);

        // Wire public fields to category lists for backward compatibility
        personUris = person.uris();
        personNames = person.names();
        companyUris = company.uris();
        companyNames = company.names();
        cityUris = city.uris();
        cityNames = city.names();
        countryUris = country.uris();
        countryNames = country.names();
        broadcastNetworkUris = broadcastNetwork.uris();
        broadcastNetworkNames = broadcastNetwork.names();
        musicGroupUris = musicGroup.uris();
        musicGroupNames = musicGroup.names();
        politicalPartyUris = politicalParty.uris();
        politicalPartyNames = politicalParty.names();
        tradeUnionUris = tradeUnion.uris();
        tradeUnionNames = tradeUnion.names();
        universityUris = university.uris();
        universityNames = university.names();

        processText(text);
    }

    private static EntityCategory makeCat(String name, Map<String, String> lookupMap) {
        return new EntityCategory(name, lookupMap,
            new LinkedHashSet<>(), new ArrayList<>(), new ArrayList<>());
    }

    private void processText(String text) {
        // Pad with extra tokens so 3-grams can be formed at the end of the text
        String[] tokens = tokenize(text + " . . .");
        for (int i = 0, size = tokens.length - 2; i < size; i++) {
            String n3gram = tokens[i] + " " + tokens[i + 1] + " " + tokens[i + 2];
            String n2gram = tokens[i] + " " + tokens[i + 1];

            // Check 3-grams first (longest match wins)
            int skip = tryMatch(n3gram, 3, i);
            if (skip > 0) { i += skip - 1; continue; }

            // Check 2-grams
            skip = tryMatch(n2gram, 2, i);
            if (skip > 0) { i += skip - 1; continue; }

            // Check 1-grams
            tryMatch(tokens[i], 1, i);
        }
    }

    /**
     * Try to match an n-gram against all entity categories.
     * @return the number of tokens consumed by the match (n), or 0 if no match.
     */
    private int tryMatch(String ngram, int n, int startIndex) {
        for (var cat : categories) {
            String uri = cat.lookupMap().get(ngram);
            if (uri != null) {
                if (!uri.startsWith("<")) uri = "<" + uri + ">";
                System.out.println(cat.name() + "\t" + startIndex + "\t" + (startIndex + n - 1) + "\t" + ngram + "\t" + uri);
                cat.addIfAbsent(uri, ngram);
                return n;
            }
        }
        return 0;
    }

    private String[] tokenize(String s) {
        String result = DOT.matcher(s).replaceAll(" . ");
        result = COMMA.matcher(result).replaceAll(" , ");
        result = QMARK.matcher(result).replaceAll(" ? ");
        result = NL.matcher(result).replaceAll(" ");
        result = SEMI.matcher(result).replaceAll(" ; ");
        return MULTI_SPACE.matcher(result).replaceAll(" ").split(" ");
    }

    @Override
    public String toString() {
        var sb = new StringBuilder("TextToDbpediaUris {\n");
        for (var cat : categories) {
            if (!cat.names().isEmpty()) {
                sb.append("  ").append(cat.name()).append(": ")
                  .append(cat.names()).append(" -> ").append(cat.uris())
                  .append('\n');
            }
        }
        sb.append('}');
        return sb.toString();
    }
}
```
The empty constructor is private since it makes no sense to create an instance of TextToDbpediaUris without text input. The code supports nine entity types, and each type has a pair of public output fields: a list of matched entity names and a list of the corresponding URIs (for example, personNames and personUris).
As a matter of programming style I generally no longer use getter and setter methods, preferring a more concise coding style. I usually make output fields package default visibility (i.e., no private or public specification so the fields are public within a package and private from other packages). Here I make them public because the package ner_dbpedia developed here is meant to be used by other packages. If you prefer using getter and setter methods, modern IDEs like IntelliJ and Eclipse can generate those for you for the example code in this book.
We will handle entity names comprised of one, two, and three word sequences (n-grams). We check for longer word sequences before shorter sequences (longest-match-first priority) across all categories:
```java
private void processText(String text) {
    String[] tokens = tokenize(text + " . . .");
    for (int i = 0, size = tokens.length - 2; i < size; i++) {
        String n3gram = tokens[i] + " " + tokens[i + 1] + " " + tokens[i + 2];
        String n2gram = tokens[i] + " " + tokens[i + 1];

        // Check 3-grams first (longest match wins)
        int skip = tryMatch(n3gram, 3, i);
        if (skip > 0) { i += skip - 1; continue; }

        // Check 2-grams
        skip = tryMatch(n2gram, 2, i);
        if (skip > 0) { i += skip - 1; continue; }

        // Check 1-grams
        tryMatch(tokens[i], 1, i);
    }
}
```
To clean up the code and avoid repeating lookup logic for each of the nine entity categories, we use a helper method tryMatch that iterates through all registered EntityCategory instances and records matching entities:
```java
private int tryMatch(String ngram, int n, int startIndex) {
    for (var cat : categories) {
        String uri = cat.lookupMap().get(ngram);
        if (uri != null) {
            if (!uri.startsWith("<")) uri = "<" + uri + ">";
            System.out.println(cat.name() + "\t" + startIndex + "\t" + (startIndex + n - 1) + "\t" + ngram + "\t" + uri);
            cat.addIfAbsent(uri, ngram);
            return n;
        }
    }
    return 0;
}
```
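The longest-match-first idea can be tried out without the library's data files. The following is a minimal sketch, assuming a single small lookup table (the hypothetical LOOKUP map) in place of the nine NerMaps tables. Unlike processText, which pads the token array with extra period tokens, this sketch bounds the n-gram length by the number of remaining tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: LongestMatchDemo and LOOKUP are hypothetical names.
final class LongestMatchDemo {

    static final Map<String, String> LOOKUP = Map.of(
        "Bill Clinton", "<http://dbpedia.org/resource/Bill_Clinton>",
        "PTL Satellite Network", "<http://dbpedia.org/resource/PTL_Satellite_Network>",
        "Guatemala", "<http://dbpedia.org/resource/Guatemala>");

    static List<String> findEntities(String[] tokens) {
        var found = new ArrayList<String>();
        int i = 0;
        while (i < tokens.length) {
            int consumed = 1; // advance one token when nothing matches
            // Try the longest n-gram that fits first, then shorter ones.
            for (int n = Math.min(3, tokens.length - i); n >= 1; n--) {
                String ngram = String.join(" ", Arrays.copyOfRange(tokens, i, i + n));
                if (LOOKUP.containsKey(ngram)) {
                    found.add(ngram);
                    consumed = n; // skip past every token in the match
                    break;
                }
            }
            i += consumed;
        }
        return found;
    }

    public static void main(String[] args) {
        String[] tokens = "PTL Satellite Network covered Bill Clinton in Guatemala".split(" ");
        System.out.println(findEntities(tokens));
        // prints [PTL Satellite Network, Bill Clinton, Guatemala]
    }
}
```

Skipping past the matched tokens keeps a shorter name that is embedded in a longer one from being reported a second time.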
For tokenization, we use precompiled regular expression Pattern instances (like DOT, COMMA, QMARK, SEMI, and NL) for performance efficiency:
```java
// Precompiled patterns for tokenization
private static final Pattern DOT = Pattern.compile("\\.");
private static final Pattern COMMA = Pattern.compile(",");
private static final Pattern QMARK = Pattern.compile("\\?");
private static final Pattern SEMI = Pattern.compile(";");
private static final Pattern NL = Pattern.compile("\n");
private static final Pattern MULTI_SPACE = Pattern.compile(" +");

private String[] tokenize(String s) {
    String result = DOT.matcher(s).replaceAll(" . ");
    result = COMMA.matcher(result).replaceAll(" , ");
    result = QMARK.matcher(result).replaceAll(" ? ");
    result = NL.matcher(result).replaceAll(" ");
    result = SEMI.matcher(result).replaceAll(" ; ");
    return MULTI_SPACE.matcher(result).replaceAll(" ").split(" ");
}
```
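To make the effect of this tokenization concrete, here is a small stand-alone sketch that produces the same kind of token stream using plain String.replace calls rather than precompiled patterns. The class name TokenizeDemo is an assumption for illustration, and the sketch trims the input before splitting, which the library method does not do.

```java
import java.util.Arrays;

// Illustrative sketch only: TokenizeDemo is a hypothetical name.
final class TokenizeDemo {

    static String[] tokenize(String s) {
        // Surround punctuation with spaces so each mark becomes its own token.
        String spaced = s.replace(".", " . ")
                         .replace(",", " , ")
                         .replace("?", " ? ")
                         .replace(";", " ; ")
                         .replace("\n", " ");
        // Split on runs of spaces; trim() avoids an empty leading token.
        return spaced.trim().split(" +");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("Hello, world. Is it done?")));
        // prints [Hello, ,, world, ., Is, it, done, ?]
    }
}
```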
The following listing shows the code snippet from the unit test code in the class TextToDbpediaUrisTest that calls the TextToDbpediaUris constructor with a text sample (JUnit boilerplate code is not shown):
```java
@Test
@DisplayName("Recognises known entities in a sentence")
void recognisesKnownEntities() {
    String s = "PTL Satellite Network covered President Bill Clinton going to Guatemala and visiting the Coca Cola Company.";
    TextToDbpediaUris result = new TextToDbpediaUris(s);
    System.out.println(result);
}
```
The object **result** contains public fields for accessing the entity names and corresponding URIs. We will use these fields in the later chapters [Automatically Generating Data for Knowledge Graphs](#kgcreator) and [Knowledge Graph Navigator](#kgn).

Here is the output from running the unit test code:

{linenos=off}
```
broadcastNetwork 0 2 PTL Satellite Network <http://dbpedia.org/resource/PTL_Satellite_Network>
person 5 6 Bill Clinton <http://dbpedia.org/resource/Bill_Clinton>
country 9 10 Guatemala <http://dbpedia.org/resource/Guatemala>
company 13 14 Coca Cola <http://dbpedia.org/resource/Coca-Cola>
```
## Wrap-up for Resolving Entity Names to DBPedia References

The idea behind this example is simple but useful for information processing applications using raw text input. We will use this library later in two semantic web examples.

# Semantic Web {#semantic-web}

We will start with a tutorial on semantic web data standards like RDF, RDFS, and OWL, then implement a wrapper for the Apache Jena library, and finally take a deeper dive into some examples. You will learn how to do the following:

- Understand RDF data formats.
- Understand SPARQL queries for RDF data stores (both local and remote).
- Use the Apache Jena library to use local RDF data and perform SPARQL queries that return pure Java data structures.
- Use the Apache Jena library to query remote SPARQL endpoints like DBPedia and WikiData.
- Use the Apache Derby relational database to cache SPARQL remote queries for both efficiency and for building systems that may have intermittent access to the Internet.
- Take a deeper dive into RDF, RDFS, and OWL reasoners.

The semantic web is intended to provide a massive linked set of data for use by software systems just as the World Wide Web provides a massive collection of linked web pages for human reading and browsing. The semantic web is like the web in that anyone can generate any content that they want. This freedom to publish anything works for the web because we use our ability to understand natural language to interpret what we read, and often to dismiss material that, based upon our own knowledge, we consider to be incorrect.

Semantic web and linked data technologies are also useful for smaller amounts of data, an example being a Knowledge Graph containing information for a business. We will further explore Knowledge Graphs in the next two chapters.

The core concept for the semantic web is data integration and use from different sources. As we will soon see, the tools for implementing the semantic web are designed for encoding data and sharing data from many different sources.

I cover the semantic web in this book because I believe that semantic web technologies are complementary to AI systems for gathering and processing data on the web. As more web pages are generated by applications (as opposed to simply showing static HTML files) it becomes easier to produce both HTML for human readers and semantic data for software agents.

There are several very good semantic web toolkits for the Java language and platform. Here we use Apache Jena because it is what I often use in my own work and I believe that it is a good starting technology for your first experiments with semantic web technologies. This chapter provides an incomplete coverage of semantic web technologies and is intended as a gentle introduction to a few useful techniques and how to implement those techniques in Java. This chapter is the start of a journey in a technology that I think is as important as technologies like deep learning that get more public mindshare.

The following figure shows a layered hierarchy of data models that are used to implement semantic web applications. To design and implement these applications we need to think in terms of physical models (storage and access of RDF, RDFS, and perhaps OWL data), logical models (how we use RDF and RDFS to define relationships between data represented as unique URIs and string literals and how we logically combine data from different sources) and conceptual modeling (higher level knowledge representation and reasoning using OWL). Originally RDF data was serialized as XML data but other formats have become much more popular because they are easier to read and manually create. The top three layers in the figure might be represented as XML, as JSON-LD (JSON for linked data), or as formats like N-Triples and N3 that we will use later.

{#semantic-web-data-models}
{width: "60%"}


This chapter is meant to get you interested in this technology but is not intended as a complete guide. RDF data is the bedrock of the semantic web. I am also lightly covering RDFS/OWL modeling and Description Logic reasoners, which are important topics for more advanced semantic web projects.

## Available Tools

In the previous edition of this book I used the open source Sesame library for the material on RDF. Sesame is now called RDF4J and is part of the Eclipse organization's projects.

I decided to use the Apache Jena project in this new edition because I think it is slightly easier to set up a lightweight development environment with Jena. If you need to set up an RDF server I recommend using the [Fuseki](https://jena.apache.org/documentation/fuseki2/) server, which is part of the Apache Jena project. For client applications we will use the Jena library for working with RDF and performing SPARQL queries using the example class **JenaApis** that we implement later, and also for querying remote SPARQL endpoints (i.e., public RDF data sources with SPARQL query interfaces) like DBPedia and WikiData.

## Relational Database Model Has Problems Dealing with Rapidly Changing Data Requirements {#rdms-problems}

When people are first introduced to semantic web technologies their first reaction is often something like, “I can just do that with a database.” The relational database model is an efficient way to express and work with slowly changing data schemas. There are some clever tools for dealing with data change requirements in the database world (ActiveRecord and migrations being a good example) but it is awkward to have end users and even developers tagging on new data attributes to relational database tables.

This same limitation also applies to object oriented programming and object modeling. Even with dynamic languages that facilitate modifying classes at runtime, the options for adding attributes to existing models are just too limiting. The same argument can be made against the use of XML constrained by conformance to either DTDs or XML Schemas. It is true that RDF and RDFS can be serialized to XML using many pre-existing XML namespaces for different knowledge sources and schemas, but it turns out that this is done in a way that does not reduce the flexibility for extending data models. XML storage is really only a serialization of RDF. Many developers who are just starting to use semantic web technologies initially get confused trying to read the XML serialization of RDF, which is almost like trying to read a PDF file with a plain text editor and is something to be avoided. We will use the N-Triple and N3 formats that are simpler to read and understand.

One goal for the rest of this chapter is convincing you that modeling data with RDF and RDFS facilitates freely extending data models and also allows fairly easy integration of data from different sources using different schemas without explicitly converting data from one schema to another for reuse. You are free to add new data properties and add information to existing graphs (which we refer to as *models*).

## RDF: The Universal Data Format

The Resource Description Framework (RDF) is used to encode information and the RDF Schema (RDFS) facilitates using data with different RDF encodings without the need to convert one set of schemas to another. Later, using OWL, we can simply declare that one predicate is the same as another, or that one predicate is a sub-predicate of another (e.g., a property **containsCity** can be declared to be a sub-property of **containsPlace** so if something contains a city then it also contains a place). The predicate part of an RDF statement often refers to a property.
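
The sub-property idea can be illustrated with a few lines of plain Java. This is a minimal sketch, not how Jena or an OWL reasoner implements inference: the class SubPropertyDemo, its Triple record, and the one-entry schema map are all assumptions for illustration, and the schema is assumed to contain no cycles.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: all names here are hypothetical.
final class SubPropertyDemo {

    record Triple(String subject, String predicate, String object) {}

    // Child property -> parent property (assumed to be cycle-free).
    static final Map<String, String> SUB_PROPERTY_OF =
        Map.of("containsCity", "containsPlace");

    // Return the asserted triples plus those implied by the sub-property schema.
    static Set<Triple> infer(Set<Triple> asserted) {
        var result = new LinkedHashSet<>(asserted);
        for (Triple t : asserted) {
            String parent = SUB_PROPERTY_OF.get(t.predicate());
            while (parent != null) { // walk up the property hierarchy
                result.add(new Triple(t.subject(), parent, t.object()));
                parent = SUB_PROPERTY_OF.get(parent);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        var asserted = Set.of(new Triple("<http://news.com/201234/>",
            "containsCity", "<http://dbpedia.org/resource/Paris>"));
        // Prints the asserted containsCity triple and the implied containsPlace triple.
        infer(asserted).forEach(System.out::println);
    }
}
```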

RDF data was originally encoded as XML and intended for automated processing. In this chapter we will use two simple to read formats called "N-Triples" and "N3." Apache Jena can be used to convert between all RDF formats so we might as well use formats that are easier to read and understand. RDF data consists of a set of triple values:

- subject
- predicate
- object

Some of my work with semantic web technologies deals with processing news stories, extracting semantic information from the text, and storing it in RDF. I will use this application domain for the examples in this chapter and the next chapter when we implement code to automatically generate RDF for Knowledge Graphs. I deal with triples like:

- subject: a URL (or URI) of a news article.
- predicate: a relation like "containsPerson".
- object: a literal value like "Bill Clinton" or a URI representing Bill Clinton.

In the next chapter we will use the entity recognition library we developed in an earlier chapter to create RDF from text input.

We will use either URIs or string literals as values for objects. We will always use URIs for representing subjects and predicates. In any case URIs are usually preferred to string literals. We will see an example of this preferred use but first we need to learn the N-Triple and N3 RDF formats.

I proposed the idea that RDF was more flexible than Object Modeling in programming languages, relational databases, and XML with schemas. If we can tag new attributes on the fly to existing data, how do we prevent what I might call “data chaos” as we modify existing data sources? It turns out that the solution to this problem is also the solution for encoding real semantics (or meaning) with data: we usually use unique URIs for RDF subjects, predicates, and objects, and usually with a preference for not using string literals. The definitions of predicates are tied to a namespace and later with OWL we will state the equivalence of predicates in different namespaces with the same semantic meaning. I will try to make this idea more clear with some examples and [Wikipedia has a good writeup on RDF](https://en.wikipedia.org/wiki/Resource_Description_Framework).

In the examples in this chapter, subjects and predicates are always URIs, and an object is either a URI or a string literal. URIs encode namespaces. For example, the containsPerson predicate in the last example could be written as:

{lang="sparql",linenos=off}
```
<http://knowledgebooks.com/ontology/#containsPerson>
```
The first part of this URI is considered to be the namespace for this predicate “containsPerson.” When different RDF triples use this same predicate, this is some assurance to us that all users of this predicate agree on the same meaning. Furthermore, we will see later that we can use RDFS to state equivalency between this predicate (in the namespace http://knowledgebooks.com/ontology/) with predicates represented by different URIs used in other data sources. In an “artificial intelligence” sense, software that we write does not understand predicates like "containsCity", "containsPerson", or "isLocation" in the way that a human reader can by combining understood common meanings for the words "contains", "city", "is", "person", and "location", but for many interesting and useful types of applications that is fine as long as the predicate is used consistently. We will see shortly that we can define abbreviation prefixes for namespaces, which makes RDF and RDFS files shorter and easier to read.

The Jena library supports most serialization formats for RDF:

- Turtle
- N3
- N-Triples
- NQuads
- TriG
- JSON-LD
- RDF/XML
- RDF/JSON
- TriX
- RDF Binary

A statement in N-Triple format consists of three URIs (or two URIs and a string literal for the object) followed by a period to end the statement. While statements are often written one per line in a source file they can be broken across lines; it is the ending period which marks the end of a statement. The standard file extension for N-Triple format files is \*.nt and the standard file extension for N3 format files is \*.n3.
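
As a rough illustration of this layout, the following sketch builds one statement as a string from its three parts. It is not the Jena API: the class NTripleDemo, its Triple record, and the KB constant are hypothetical names, the namespace follows the kb: prefix used in this chapter, and only backslash and double-quote characters in literals are escaped.

```java
// Illustrative sketch only: NTripleDemo and its members are hypothetical names.
final class NTripleDemo {

    static final String KB = "http://knowledgebooks.com/ontology#";

    record Triple(String subjectUri, String predicateUri,
                  String object, boolean objectIsUri) {

        // Render the triple as a single N-Triple style statement.
        String toNTriple() {
            String obj = objectIsUri
                ? "<" + object + ">"
                : "\"" + object.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
            return "<" + subjectUri + "> <" + predicateUri + "> " + obj + " .";
        }
    }

    public static void main(String[] args) {
        var literal = new Triple("http://news.com/201234/",
            KB + "containsPerson", "Bill Clinton", false);
        var uri = new Triple("http://news.com/201234/",
            KB + "containsPerson", "http://dbpedia.org/resource/Bill_Clinton", true);
        System.out.println(literal.toNTriple());
        System.out.println(uri.toNTriple());
    }
}
```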

My preference is to use N-Triple format files as output from programs that I write to save data as RDF. N-Triple files don't use any abbreviations and each RDF statement is self-contained. I often use tools like the command line commands in Jena or RDF4J to convert N-Triple files to N3 or other formats if I will be reading them or even hand editing them. Here is an example using the N3 syntax:

{lang="sparql",linenos=off}
```
@prefix kb: <http://knowledgebooks.com/ontology#> .

<http://news.com/201234/> kb:containsCountry "China" .
```
The N3 format adds prefixes (abbreviations) to the N-Triple format. In practice it would be better to use the URI **<http://dbpedia.org/resource/China>** instead of the literal value "China."

Here we see the use of an abbreviation prefix “kb:” for the namespace of my company's KnowledgeBooks.com ontologies. The first term in the RDF statement (the subject) is the URI of a news article. The second term (the predicate) is “containsCountry” in the “kb:” namespace. The last item in the statement (the object) is a string literal “China.” I would describe this RDF statement in English as, “The news article at URI http://news.com/201234/ mentions the country China.”
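
Expanding a prefixed name is simple string substitution, which the following sketch shows. This is a simplified illustration rather than a real N3 parser: the class PrefixDemo and its PREFIXES table are hypothetical, and terms that do not start with a known prefix are returned unchanged.

```java
import java.util.Map;

// Illustrative sketch only: PrefixDemo and PREFIXES are hypothetical names.
final class PrefixDemo {

    static final Map<String, String> PREFIXES = Map.of(
        "kb", "http://knowledgebooks.com/ontology#",
        "xsd", "http://www.w3.org/2001/XMLSchema#");

    // Expand "prefix:local" to "<namespace + local>" when the prefix is known.
    static String expand(String term) {
        int colon = term.indexOf(':');
        if (colon > 0) {
            String namespace = PREFIXES.get(term.substring(0, colon));
            if (namespace != null) {
                return "<" + namespace + term.substring(colon + 1) + ">";
            }
        }
        return term; // full URIs and literals pass through untouched
    }

    public static void main(String[] args) {
        System.out.println(expand("kb:containsCountry"));
        // prints <http://knowledgebooks.com/ontology#containsCountry>
    }
}
```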

This was a very simple N3 example which we will expand to show additional features of the N3 notation. As another example, suppose this news article also mentions the USA. We could add a whole new statement, which gives us two separate RDF statements:

{lang="sparql",linenos=off}
```
@prefix kb: <http://knowledgebooks.com/ontology#> .

<http://news.com/201234/> kb:containsCountry <http://dbpedia.org/resource/China> .
<http://news.com/201234/> kb:containsCountry <http://dbpedia.org/resource/United_States> .
```
We can collapse multiple RDF statements that share the same subject and optionally the same predicate:

{lang="sparql",linenos=off}
```
@prefix kb: <http://knowledgebooks.com/ontology#> .

<http://news.com/201234/>
  kb:containsCountry <http://dbpedia.org/resource/China> ,
                     <http://dbpedia.org/resource/United_States> .
```
The indentation and placement on separate lines is arbitrary; use whatever style you like that is readable. We can also add in additional predicates that use the same subject (I am going to use string literals here instead of URIs for objects to make the following example more concise but in practice prefer using URIs):

{lang="sparql",linenos=off}
```
@prefix kb: <http://knowledgebooks.com/ontology#> .

<http://news.com/201234/>
  kb:containsCountry "China" , "USA" ;
  kb:containsOrganization "United Nations" ;
  kb:containsPerson "Ban Ki-moon" , "Gordon Brown" ,
                    "Hu Jintao" , "George W. Bush" ,
                    "Pervez Musharraf" ,
                    "Vladimir Putin" ,
                    "Mahmoud Ahmadinejad" .
```
This single N3 statement represents ten individual RDF triples. Within the statement, predicates that share the same subject are separated by semicolons, objects that share the same subject and predicate are separated by commas, and a period ends the statement. Please note that whatever RDF storage system you use (we will be using Jena) it makes no difference if we load RDF as XML, N-Triple, or N3 format files: internally subject, predicate, and object triples are stored in the same way and are used in the same way. RDF triples in a data store represent directed graphs that may not all be connected.
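
The count of ten can be checked by expanding the shorthand mechanically: one triple for every object listed under every predicate. The sketch below does this in plain Java; the class N3ExpansionDemo and its methods are hypothetical names used only for illustration, and the predicates are left in their abbreviated kb: form.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: N3ExpansionDemo is a hypothetical name.
final class N3ExpansionDemo {

    record Triple(String subject, String predicate, String object) {}

    // One triple per (predicate, object) pair that shares the subject.
    static List<Triple> expand(String subject, Map<String, List<String>> predicateObjects) {
        var triples = new ArrayList<Triple>();
        predicateObjects.forEach((predicate, objects) ->
            objects.forEach(object -> triples.add(new Triple(subject, predicate, object))));
        return triples;
    }

    // The news article example from the text.
    static List<Triple> exampleTriples() {
        var po = new LinkedHashMap<String, List<String>>();
        po.put("kb:containsCountry", List.of("China", "USA"));
        po.put("kb:containsOrganization", List.of("United Nations"));
        po.put("kb:containsPerson", List.of("Ban Ki-moon", "Gordon Brown", "Hu Jintao",
            "George W. Bush", "Pervez Musharraf", "Vladimir Putin", "Mahmoud Ahmadinejad"));
        return expand("<http://news.com/201234/>", po);
    }

    public static void main(String[] args) {
        System.out.println(exampleTriples().size()); // 2 + 1 + 7 = 10
    }
}
```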

I promised you that the data in RDF data stores was easy to extend. As an example, let us assume that we have written software that is able to read online news articles and create RDF data that captures some of the semantics in the articles. If we extend our program to also recognize dates when the articles are published, we can simply reprocess articles and for each article add a triple to our RDF data store using a form like:

{lang="sparql",linenos=off}
```
@prefix kb: <http://knowledgebooks.com/ontology#> .

<http://news.com/201234/> kb:datePublished "2008-05-11" .
```
Here we just represent the date as a string. We can add a type to the object representing a specific date:

{lang="sparql",linenos=off}
```
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix kb: <http://knowledgebooks.com/ontology#> .

<http://news.com/201234/> kb:datePublished "2008-05-11"^^xsd:date .
```
Furthermore, if we do not have dates for all news articles that is often acceptable because when constructing SPARQL queries you can match optional patterns. If for example you are looking up articles on a specific subject then some results may have a publication date attached to the results for that article and some might not. In practice RDF supports types and we would use a date type as seen in the last example, not a string. However, in designing the example programs for this chapter I decided to simplify our representation of URIs and often use string literals as simple Java strings. For many applications this isn't a real limitation.
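
If you do want typed dates, a small helper can validate a date string before writing it out as a typed literal. This is a minimal sketch using only the JDK: the class TypedLiteralDemo and its method are hypothetical names, and the output uses the full xsd:date datatype URI rather than the abbreviated prefix form.

```java
import java.time.LocalDate;

// Illustrative sketch only: TypedLiteralDemo is a hypothetical name.
final class TypedLiteralDemo {

    static final String XSD_DATE = "http://www.w3.org/2001/XMLSchema#date";

    // Validate an ISO date (yyyy-MM-dd) and render it as a typed RDF literal.
    static String dateLiteral(String isoDate) {
        // LocalDate.parse throws DateTimeParseException on malformed input.
        LocalDate date = LocalDate.parse(isoDate);
        return "\"" + date + "\"^^<" + XSD_DATE + ">";
    }

    public static void main(String[] args) {
        System.out.println(dateLiteral("2008-05-11"));
        // prints "2008-05-11"^^<http://www.w3.org/2001/XMLSchema#date>
    }
}
```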
2
3 ## Extending RDF with RDF Schema {#rdfs}
4
5 RDF Schema (RDFS) supports the definition of classes and properties based on set inclusion. In RDFS classes and properties are orthogonal. Let's start with looking at an example using additional namespaces:
6
7 {lang="sparql",linenos=off}
@prefix kb: <http://knowledgebooks.com/ontology#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dbo: <http://dbpedia.org/ontology/> .

<http://news.com/201234/> kb:containsCountry <http://dbpedia.org/resource/China> .
<http://news.com/201234/> kb:containsCountry <http://dbpedia.org/resource/United_States> .
<http://dbpedia.org/resource/China> rdfs:label "China"@en ; rdf:type dbo:Place ; rdf:type dbo:Country .

Because the semantic web is intended to be processed automatically by software systems it is encoded as RDF. There is a problem that must be solved in implementing and using the semantic web: everyone who publishes semantic web data is free to create their own RDF schemas for storing data; for example, there is usually no single standard RDF schema definition for topics like news stories and stock market data. [SKOS](https://www.w3.org/2009/08/skos-reference/skos.html) is a namespace containing standard schemas, and the most widely used standard is [schema.org](https://schema.org/docs/schemas.html). Understanding the ways of integrating different data sources that use different schemas helps to understand the design decisions behind semantic web applications. In this chapter I often use my own schemas in the knowledgebooks.com namespace for the simple examples you see here. When you build your own production systems, part of the work is searching through **schema.org** and **SKOS** to use standard namespaces and schemas when possible. The use of standard schemas helps when you link internal proprietary Knowledge Graphs used in an organization with public open data from sources like [WikiData](https://www.wikidata.org/wiki/Wikidata:Main_Page) and [DBPedia](https://wiki.dbpedia.org/about).

We will start with an example that extends the example in the last section and that also uses RDFS. We add a few additional RDF statements:

{lang="sparql",linenos=off}
@prefix kb: <http://knowledgebooks.com/ontology#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

kb:containsCity rdfs:subPropertyOf kb:containsPlace .
kb:containsCountry rdfs:subPropertyOf kb:containsPlace .
kb:containsState rdfs:subPropertyOf kb:containsPlace .

The last three lines declare that:

- The property containsCity is a sub-property of containsPlace.
- The property containsCountry is a sub-property of containsPlace.
- The property containsState is a sub-property of containsPlace.

Why is this useful? For at least two reasons:

- You can query an RDF data store for all triples that use property containsPlace and also match triples with properties equal to containsCity, containsCountry, or containsState. There may not even be any triples that explicitly use the property containsPlace.
- Consider a hypothetical case where you are using two different RDF data stores that use different properties for naming cities: **cityName** and **city**. You can define **cityName** to be a sub-property of **city** and then write all queries against the single property name **city**. This removes the necessity to convert data from different sources to use the same schema. You can also use OWL to state property and class equivalency.

In addition to providing a vocabulary for describing properties and class membership by properties, RDFS is also used for logical inference to infer new triples, combine data from different RDF data sources, and to allow effective querying of RDF data stores. We will see examples of all of these features of RDFS when we later start using the Jena libraries to perform SPARQL queries.
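
The effect of the sub-property declarations on querying can be sketched in plain Java without any RDF library. This is only a minimal illustration and not how Jena implements RDFS inference: the class name `SubPropertySketch`, the string-array representation of triples, and the sample triples are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; not part of the book's code or of Jena.
final class SubPropertySketch {

    // child property -> parent property, taken from the rdfs:subPropertyOf statements
    static final Map<String, String> SUB_PROPERTY_OF = new HashMap<>();
    static {
        SUB_PROPERTY_OF.put("kb:containsCity", "kb:containsPlace");
        SUB_PROPERTY_OF.put("kb:containsCountry", "kb:containsPlace");
        SUB_PROPERTY_OF.put("kb:containsState", "kb:containsPlace");
    }

    // True if 'property' equals 'target' or reaches it by following
    // rdfs:subPropertyOf links upward (the 'seen' set guards against cycles).
    static boolean isSameOrSubProperty(String property, String target) {
        Set<String> seen = new HashSet<>();
        String current = property;
        while (current != null && seen.add(current)) {
            if (current.equals(target)) return true;
            current = SUB_PROPERTY_OF.get(current);
        }
        return false;
    }

    // Return every triple {subject, predicate, object} whose predicate is
    // 'target' or one of its sub-properties.
    static List<String[]> match(List<String[]> triples, String target) {
        List<String[]> results = new ArrayList<>();
        for (String[] t : triples) {
            if (isSameOrSubProperty(t[1], target)) results.add(t);
        }
        return results;
    }

    public static void main(String[] args) {
        List<String[]> triples = List.of(
            new String[] {"article1", "kb:containsCity", "Chicago"},
            new String[] {"article1", "kb:containsCountry", "Japan"},
            new String[] {"article1", "kb:containsPerson", "Rich Feltes"});
        // Only the city and country triples match the more general property.
        for (String[] t : match(triples, "kb:containsPlace")) {
            System.out.println(String.join(" ", t));
        }
    }
}
```

Note that no triple in the sample data uses kb:containsPlace directly; the matches come entirely from the declared sub-property relationships.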

## The SPARQL Query Language

SPARQL is a query language used to query RDF data stores. While SPARQL may initially look like SQL, we will see that there are some important differences, like support for RDFS and OWL inferencing and graph-based instead of relational matching operations. We will cover the basics of SPARQL in this section and then see more examples later when we learn how to embed Jena in Java applications and in the last chapter [Knowledge Graph Navigator](#kgn).

We will use the N3 format RDF file test\_data/news.n3 for the examples. I created this file automatically by spidering Reuters news stories on the news.yahoo.com web site and automatically extracting named entities from the text of the articles. We saw techniques for extracting named entities from text in earlier chapters. In this chapter we use these sample RDF files.

You have already seen snippets of this file and I list the entire file here for reference, edited to fit the line width. You may find the file news.n3 easier to read if you open it in a text editor at your computer, where you are not limited to what fits on a book page:

{lang="sparql",linenos=off}
@prefix kb: <http://knowledgebooks.com/ontology#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

kb:containsCity rdfs:subPropertyOf kb:containsPlace .
kb:containsCountry rdfs:subPropertyOf kb:containsPlace .
kb:containsState rdfs:subPropertyOf kb:containsPlace .

<http://yahoo.com/20080616/usa_flooding_dc_16/>
    kb:containsCity "Burlington" , "Denver" , "St. Paul" , "Chicago" ,
        "Quincy" , "CHICAGO" , "Iowa City" ;
    kb:containsRegion "U.S. Midwest" , "Midwest" ;
    kb:containsCountry "United States" , "Japan" ;
    kb:containsState "Minnesota" , "Illinois" , "Mississippi" , "Iowa" ;
    kb:containsOrganization "National Guard" , "U.S. Department of Agriculture" ,
        "White House" , "Chicago Board of Trade" , "Department of Transportation" ;
    kb:containsPerson "Dena Gray-Fisher" , "Donald Miller" , "Glenn Hollander" ,
        "Rich Feltes" , "George W. Bush" ;
    kb:containsIndustryTerm "food inflation" , "food" , "finance ministers" , "oil" .

<http://yahoo.com/78325/ts_nm/usa_politics_dc_2/>
    kb:containsCity "Washington" , "Baghdad" , "Arlington" , "Flint" ;
    kb:containsCountry "United States" , "Afghanistan" , "Iraq" ;
    kb:containsState "Illinois" , "Virginia" , "Arizona" , "Michigan" ;
    kb:containsOrganization "White House" , "Obama administration" , "Iraqi government" ;
    kb:containsPerson "David Petraeus" , "John McCain" , "Hoshiyar Zebari" ,
        "Barack Obama" , "George W. Bush" , "Carly Fiorina" ;
    kb:containsIndustryTerm "oil prices" .

<http://yahoo.com/10944/ts_nm/worldleaders_dc_1/>
    kb:containsCity "WASHINGTON" ;
    kb:containsCountry "United States" , "Pakistan" , "Islamic Republic of Iran" ;
    kb:containsState "Maryland" ;
    kb:containsOrganization "University of Maryland" , "United Nations" ;
    kb:containsPerson "Ban Ki-moon" , "Gordon Brown" , "Hu Jintao" , "George W. Bush" ,
        "Pervez Musharraf" , "Vladimir Putin" , "Steven Kull" , "Mahmoud Ahmadinejad" .

<http://yahoo.com/10622/global_economy_dc_4/>
    kb:containsCity "Sao Paulo" , "Kuala Lumpur" ;
    kb:containsRegion "Midwest" ;
    kb:containsCountry "United States" , "Britain" , "Saudi Arabia" , "Spain" , "Italy" ,
        "India" , "France" , "Canada" , "Russia" , "Germany" , "China" , "Japan" ,
        "South Korea" ;
    kb:containsOrganization "Federal Reserve Bank" , "European Union" ,
        "European Central Bank" , "European Commission" ;
    kb:containsPerson "Lee Myung-bak" , "Rajat Nag" , "Luiz Inacio Lula da Silva" ,
        "Jeffrey Lacker" ;
    kb:containsCompany "Development Bank Managing" , "Reuters" ,
        "Richmond Federal Reserve Bank" ;
    kb:containsIndustryTerm "central bank" , "food" , "energy costs" , "finance ministers" ,
        "crude oil prices" , "oil prices" , "oil shock" , "food prices" ,
        "Finance ministers" , "Oil prices" , "oil" .

In the following examples, we will use the main method in the class **JenaApis** (developed in the next section) that allows us to load multiple RDF input files and then to interactively enter SPARQL queries.

We will start with a simple SPARQL query for subjects (news article URLs) and objects (matching countries) with the value for the predicate equal to **containsCountry**. Variables in queries start with a question mark character and can have any names:

{lang="sparql",linenos=off}
SELECT ?subject ?object
WHERE {
  ?subject <http://knowledgebooks.com/ontology#containsCountry> ?object .
}

It is important for you to understand what is happening when we apply the last SPARQL query to our sample data. Conceptually, all the triples in the sample data are scanned, keeping the ones where the predicate part of a triple is equal to **<http://knowledgebooks.com/ontology#containsCountry>**. In practice RDF data stores supporting SPARQL queries index RDF data so a complete scan of the sample data is not required. This is analogous to relational databases where indices are created to avoid needing to perform complete scans of database tables.
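
The difference between the conceptual scan and an indexed lookup can be sketched in a few lines of plain Java. This is a simplified illustration under stated assumptions, not a description of how Jena or any particular triple store builds its indices: the class `TriplePatternSketch` and the string-array triples are hypothetical, and a real store maintains several indices over subject, predicate, and object combinations.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; triples are {subject, predicate, object} arrays.
final class TriplePatternSketch {

    // Conceptual evaluation: examine every triple and keep those whose
    // predicate equals the one in the query pattern.
    static List<String[]> scan(List<String[]> triples, String predicate) {
        List<String[]> out = new ArrayList<>();
        for (String[] t : triples) {
            if (t[1].equals(predicate)) out.add(t);
        }
        return out;
    }

    // Build a predicate -> triples index once, when the data is loaded.
    static Map<String, List<String[]>> buildPredicateIndex(List<String[]> triples) {
        Map<String, List<String[]>> index = new HashMap<>();
        for (String[] t : triples) {
            index.computeIfAbsent(t[1], k -> new ArrayList<>()).add(t);
        }
        return index;
    }

    // Indexed evaluation: a single hash lookup replaces the full scan.
    static List<String[]> lookup(Map<String, List<String[]>> index, String predicate) {
        return index.getOrDefault(predicate, List.of());
    }
}
```

Both `scan` and `lookup` return the same triples for a given predicate; the index trades extra memory and load-time work for faster queries.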

In practice, when you are exploring a Knowledge Graph like DBPedia or WikiData (which are just very large collections of RDF triples), you might run a query and discover a useful or interesting entity URI in the triple store, then drill down to find out more about the entity. In a later chapter [Knowledge Graph Navigator](#kgn) we attempt to automate this exploration process using the DBPedia data as a Knowledge Graph.

We will be using the same code to access the small example of RDF statements in our sample data as we will for accessing DBPedia or WikiData.

We can make this last query easier to read and reduce the chance of spelling errors by using a namespace prefix:

{lang="sparql",linenos=off}
PREFIX kb: <http://knowledgebooks.com/ontology#>
SELECT ?subject ?object
WHERE {
  ?subject kb:containsCountry ?object .
}

**Using the command line option in the Jena wrapper example**

We will later implement the Java class **JenaApis**. You can run the method **main** in the Java class **JenaApis** using the following to load RDF input files and interactively make SPARQL queries against the RDF data in the input files:

{lang="bash",linenos=on}
$ mvn exec:java -Dexec.mainClass="com.markwatson.semanticweb.JenaApis" \
    -Dexec.args="data/news.n3 data/sample_news.nt"

The option on the second line starting with **-Dexec.args=** is one way to pass command line arguments to the method **main**. The backslash character at the end of the first line is the way to continue a long command line request in bash or zsh.

Here is an interactive example of the last SPARQL example:

{lang="bash",linenos=off}
$ mvn exec:java -Dexec.mainClass="com.markwatson.semanticweb.JenaApis" \
    -Dexec.args="data/news.n3"
Multi-line queries are OK but don't use blank lines.
Enter a blank line to process query.
Enter a SPARQL query:
PREFIX kb: <http://knowledgebooks.com/ontology#>
SELECT ?subject ?object
  WHERE { ?subject kb:containsCountry ?object . }

[QueryResult vars:[subject, object]
Rows:
 [http://news.yahoo.com/s/nm/20080616/bs_nm/global_economy_dc_4/, Russia]
 [http://news.yahoo.com/s/nm/20080616/ts_nm/usa_flooding_dc_16/, Japan]
 [http://news.yahoo.com/s/nm/20080616/bs_nm/global_economy_dc_4/, India]
 [http://news.yahoo.com/s/nm/20080616/ts_nm/worldleaders_trust_dc_1/, United States]
 [http://news.yahoo.com/s/nm/20080616/ts_nm/usa_politics_dc_2/, Afghanistan]
 [http://news.yahoo.com/s/nm/20080616/bs_nm/global_economy_dc_4/, Saudi Arabia]
 [http://news.yahoo.com/s/nm/20080616/bs_nm/global_economy_dc_4/, United States]
 [http://news.yahoo.com/s/nm/20080616/bs_nm/global_economy_dc_4/, France]
 [http://news.yahoo.com/s/nm/20080616/ts_nm/usa_politics_dc_2/, Iraq]
 [http://news.yahoo.com/s/nm/20080616/ts_nm/worldleaders_trust_dc_1/, Pakistan]
 [http://news.yahoo.com/s/nm/20080616/bs_nm/global_economy_dc_4/, Spain]
 [http://news.yahoo.com/s/nm/20080616/bs_nm/global_economy_dc_4/, Italy]
 [http://news.yahoo.com/s/nm/20080616/ts_nm/worldleaders_trust_dc_1/, Islamic Republic of Iran]
 [http://news.yahoo.com/s/nm/20080616/bs_nm/global_economy_dc_4/, Canada]
 [http://news.yahoo.com/s/nm/20080616/bs_nm/global_economy_dc_4/, Britain]
 [http://news.yahoo.com/s/nm/20080616/ts_nm/usa_politics_dc_2/, United States]
 [http://news.yahoo.com/s/nm/20080616/bs_nm/global_economy_dc_4/, South Korea]
 [http://news.yahoo.com/s/nm/20080616/bs_nm/global_economy_dc_4/, Germany]
 [http://news.yahoo.com/s/nm/20080616/ts_nm/usa_flooding_dc_16/, United States]
 [http://news.yahoo.com/s/nm/20080616/bs_nm/global_economy_dc_4/, China]
 [http://news.yahoo.com/s/nm/20080616/bs_nm/global_economy_dc_4/, Japan]

Enter a SPARQL query:

We could have filtered on any other predicate, for instance **containsPlace**. Here is another example using a match against a string literal to find all articles exactly matching the text “Maryland”:

{lang="sparql",linenos=off}
PREFIX kb: <http://knowledgebooks.com/ontology#>
SELECT ?subject
WHERE { ?subject kb:containsState "Maryland" . }

The output is:

{lang="bash",linenos=off}
Enter a SPARQL query:
PREFIX kb: <http://knowledgebooks.com/ontology#>
SELECT ?subject
WHERE { ?subject kb:containsState "Maryland" . }

[QueryResult vars:[subject]
Rows:
 [http://news.yahoo.com/s/nm/20080616/ts_nm/worldleaders_trust_dc_1/]

We can also match partial string literals against regular expressions:

{lang="sparql",linenos=off}
PREFIX kb: <http://knowledgebooks.com/ontology#>
SELECT ?subject ?object
WHERE {
  ?subject kb:containsOrganization ?object FILTER regex(?object, "University") .
}

The output is:

{lang="bash",linenos=off}
Enter a SPARQL query:
PREFIX kb: <http://knowledgebooks.com/ontology#>
SELECT ?subject ?object
WHERE {
  ?subject kb:containsOrganization ?object FILTER regex(?object, "University") .
}

[QueryResult vars:[subject, object]
Rows:
 [http://news.yahoo.com/s/nm/20080616/ts_nm/worldleaders_trust_dc_1/, University of Maryland]
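
The partial-match behavior of the regex filter can be mimicked in plain Java. This is a rough sketch with a hypothetical class name `RegexFilterSketch`; SPARQL uses the XPath regular expression flavor, which differs from `java.util.regex` in some details, so treat this only as an illustration of matching anywhere in the string rather than matching the whole string.

```java
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the book's code or of Jena.
final class RegexFilterSketch {

    // Like FILTER regex(?object, "University"): succeeds when the pattern
    // occurs anywhere in the value. Matcher.find() gives this behavior,
    // whereas Matcher.matches() would require the whole string to match.
    static boolean regexFilter(String value, String pattern) {
        return Pattern.compile(pattern).matcher(value).find();
    }

    public static void main(String[] args) {
        String[] organizations = {"University of Maryland", "United Nations"};
        for (String org : organizations) {
            System.out.println(org + " -> " + regexFilter(org, "University"));
        }
    }
}
```

An anchored pattern such as `^University` would restrict the match to values that start with the word.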

Prior to this last example query we only requested that the query return values for subject and object for triples that matched the query. However, we might want to return all triples whose subject (in this case a news article URI) is in one of the matched triples. Note that the WHERE clause now contains two triple patterns, each terminated with a period:

{lang="sparql",linenos=off}
PREFIX kb: <http://knowledgebooks.com/ontology#>
SELECT DISTINCT ?subject ?a_predicate ?an_object
WHERE {
  ?subject kb:containsOrganization ?object FILTER regex(?object, "University") .
  ?subject ?a_predicate ?an_object .
}
ORDER BY ?a_predicate ?an_object
LIMIT 10
OFFSET 5

When WHERE clauses contain more than one triple pattern to match, this is equivalent to a Boolean “and” operation. The DISTINCT clause removes duplicate results. The ORDER BY clause sorts the output in alphabetical order: in this case first by predicate (containsCity, containsCountry, etc.) and then by object. The LIMIT modifier limits the number of results returned and the OFFSET modifier sets the number of matching results to skip.

The output is:

{lang="bash",linenos=off}
Enter a SPARQL query:
PREFIX kb: <http://knowledgebooks.com/ontology#>
SELECT DISTINCT ?subject ?a_predicate ?an_object
WHERE {
  ?subject kb:containsOrganization ?object FILTER regex(?object, "University") .
  ?subject ?a_predicate ?an_object .
}
ORDER BY ?a_predicate ?an_object
LIMIT 10
OFFSET 5

[QueryResult vars:[subject, a_predicate, an_object]
Rows:
 [http://news.yahoo.com/s/nm/20080616/ts_nm/worldleaders_trust_dc_1/, http://knowledgebooks.com/ontology#containsOrganization, University of Maryland]
 [http://news.yahoo.com/s/nm/20080616/ts_nm/worldleaders_trust_dc_1/, http://knowledgebooks.com/ontology#containsPerson, Ban Ki-moon]
 [http://news.yahoo.com/s/nm/20080616/ts_nm/worldleaders_trust_dc_1/, http://knowledgebooks.com/ontology#containsPerson, George W. Bush]
 [http://news.yahoo.com/s/nm/20080616/ts_nm/worldleaders_trust_dc_1/, http://knowledgebooks.com/ontology#containsPerson, Gordon Brown]
 [http://news.yahoo.com/s/nm/20080616/ts_nm/worldleaders_trust_dc_1/, http://knowledgebooks.com/ontology#containsPerson, Hu Jintao]
 [http://news.yahoo.com/s/nm/20080616/ts_nm/worldleaders_trust_dc_1/, http://knowledgebooks.com/ontology#containsPerson, Mahmoud Ahmadinejad]
 [http://news.yahoo.com/s/nm/20080616/ts_nm/worldleaders_trust_dc_1/, http://knowledgebooks.com/ontology#containsPerson, Pervez Musharraf]
 [http://news.yahoo.com/s/nm/20080616/ts_nm/worldleaders_trust_dc_1/, http://knowledgebooks.com/ontology#containsPerson, Steven Kull]
 [http://news.yahoo.com/s/nm/20080616/ts_nm/worldleaders_trust_dc_1/, http://knowledgebooks.com/ontology#containsPerson, Vladimir Putin]
 [http://news.yahoo.com/s/nm/20080616/ts_nm/worldleaders_trust_dc_1/, http://knowledgebooks.com/ontology#containsState, Maryland]
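
The order in which the solution modifiers take effect can be sketched with Java streams: sort first, then remove duplicates, then skip OFFSET rows, then keep at most LIMIT rows. The class `SolutionModifierSketch` and its two-column rows are assumptions for this illustration, and Java's default string ordering is only a stand-in for the ordering rules that SPARQL defines.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration class; each row is a [predicate, object] pair.
final class SolutionModifierSketch {

    static List<List<String>> apply(List<List<String>> rows, int limit, int offset) {
        Comparator<List<String>> byPredicateThenObject =
            Comparator.comparing((List<String> r) -> r.get(0))
                      .thenComparing(r -> r.get(1));
        return rows.stream()
                .sorted(byPredicateThenObject) // ORDER BY ?a_predicate ?an_object
                .distinct()                    // DISTINCT removes duplicate rows
                .skip(offset)                  // OFFSET skips leading rows
                .limit(limit)                  // LIMIT caps the number of rows
                .collect(Collectors.toList());
    }
}
```

Because OFFSET and LIMIT are applied after sorting, paging through results is only repeatable when the query also has an ORDER BY clause.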

We are finished with our quick tutorial on using the SELECT query form. There are three other query forms that I am not covering in this chapter:

- [CONSTRUCT](https://www.w3.org/TR/rdf-sparql-query/#construct) – returns a new RDF graph of query results
- [ASK](https://www.w3.org/TR/rdf-sparql-query/#ask) – returns Boolean true or false indicating if a query matches any triples
- [DESCRIBE](https://www.w3.org/TR/rdf-sparql-query/#describe) – returns a new RDF graph containing matched resources

A common matching pattern that I don't cover in this chapter is [optional](https://www.w3.org/TR/rdf-sparql-query/#optionals) but the **optional** matching pattern is used in the examples in the later chapter [Knowledge Graph Navigator](#kgn).
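
As a rough intuition for what optional matching does, it behaves like a left outer join: every solution of the required pattern is kept, and the optional variable is simply left unbound when nothing matches. The plain Java sketch below is only an analogy under stated assumptions; the class `OptionalPatternSketch`, the map-based inputs, and the use of `null` for an unbound variable are all choices made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of the book's code or of Jena.
final class OptionalPatternSketch {

    // 'required' maps article -> country (the mandatory pattern) and
    // 'optional' maps article -> publication date (the optional pattern).
    // Articles without a date are still returned, with null as the date.
    static List<String[]> leftJoin(Map<String, String> required,
                                   Map<String, String> optional) {
        List<String[]> rows = new ArrayList<>();
        for (Map.Entry<String, String> e : required.entrySet()) {
            String date = optional.get(e.getKey()); // null when no date is known
            rows.add(new String[] {e.getKey(), e.getValue(), date});
        }
        return rows;
    }
}
```

Code that consumes such rows has to be prepared for the missing values.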

## Using Jena

Apache Jena is a complete Java library for developing RDF/RDFS/OWL applications and we will use it in this chapter. Other available libraries that we don't use here include RDF4J (formerly Sesame), OWLAPI, AllegroGraph, and the Protege library.

The following figure shows a UML diagram for the wrapper classes and interface that I wrote for Jena to make it easier for you to get started. My wrapper uses an in-memory RDF repository that supports inference, loading RDF/RDFS/OWL files, and performing both local and remote SPARQL queries. If you decide to use semantic web technologies in your development you will eventually want to use the full Jena APIs for programmatically creating new RDF triples, finer control of the type of repository (options are in-memory, disk based, and database), [type definitions](https://www.w3.org/TR/swbp-xsch-datatypes/) and inferencing, and programmatically using query results. That said, using my wrapper library is a good place for you to start experimenting.

Referring to the following figure, the class constructor **JenaApis** opens a new in-memory RDF triple store and supplies the public APIs we will use later. The data class **QueryResult** has public class variables for the variable names used in a query and a list of rows, one row for each query result. The class **Cache** is used internally to cache SPARQL query results, both to improve performance and to allow use without online access to a remote SPARQL endpoint like DBPedia or WikiData.

{width: "80%"}
[Figure: UML class diagram for the Jena wrapper classes]

We will look in some detail at the code in this UML Class Diagram. To improve portability to alternative RDF libraries, I wrote two wrapper classes for Jena, one class to represent query results and the other to wrap the Jena APIs that I use.

The following screen shot shows the free IntelliJ Community Edition IDE used to edit one of the unit tests and run it:

{width: "80%"}
[Figure: screen shot of the IntelliJ Community Edition IDE editing and running a unit test]

We will now look at the Java implementation of the examples for this chapter.

### Java Wrapper for Jena APIs and an Example

For portability to other RDF and semantic web libraries, when we wrap the Jena APIs we want the results to be in standard Java data classes. The following listing shows the class **QueryResult** that contains the variables used in a SPARQL query and a list of rows containing matched value bindings for these variables:

{lang="java",linenos=off}
package com.markwatson.semanticweb;

import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;

public class QueryResult implements Serializable {
  private QueryResult() { }

  public QueryResult(List<String> variableList) {
    this.variableList = List.copyOf(variableList);
  }

  // names of the variables used in the SPARQL query
  public List<String> variableList;
  // one row of value bindings per query result
  public List<List<String>> rows = new ArrayList<>();

  public List<String> getVariableList() { return variableList; }

  public List<List<String>> getRows() { return rows; }

  public String toString() {
    var sb = new StringBuilder("[QueryResult vars:" + variableList + "\nRows:\n");
    for (List<String> row : rows) {
      sb.append(" ").append(row).append("\n");
    }
    return sb.toString();
  }
}

I defined a **toString** method so when you print an instance of the class **QueryResult** you see the contained data.

The following listing shows the wrapper class **JenaApis**:

{lang="java",linenos=on}
package com.markwatson.semanticweb;

import org.apache.commons.lang3.SerializationUtils;
import org.apache.jena.query.*;
import org.apache.jena.rdf.model.*;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.riot.RDFFormat;

import java.io.FileOutputStream;
import java.io.IOException;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class JenaApis implements AutoCloseable {

  public JenaApis() {
    //model = ModelFactory.createDefaultModel(); // use if OWL reasoning not required
    model = ModelFactory.createOntologyModel(); // use OWL reasoner
  }

  public Model model() {
    return model;
  }

  public void loadRdfFile(String fpath) {
    model.read(fpath);
  }

  public void saveModelToTurtleFormat(String outputPath) throws IOException {
    try (var fos = new FileOutputStream(outputPath)) {
      RDFDataMgr.write(fos, model, RDFFormat.TRIG_PRETTY);
    }
  }

  public void saveModelToN3Format(String outputPath) throws IOException {
    try (var fos = new FileOutputStream(outputPath)) {
      RDFDataMgr.write(fos, model, RDFFormat.NTRIPLES);
    }
  }

  // Run a SPARQL SELECT query against the local in-memory model.
  public QueryResult query(String sparqlQuery) {
    try (QueryExecution qexec = QueryExecution.model(model)
        .query(sparqlQuery)
        .build()) {
      ResultSet results = qexec.execSelect();
      var qr = new QueryResult(results.getResultVars());
      for (; results.hasNext(); ) {
        QuerySolution solution = results.nextSolution();
        List<String> newResultRow = new ArrayList<>();
        for (String var : qr.variableList) {
          newResultRow.add(solution.get(var).toString());
        }
        qr.rows.add(newResultRow);
      }
      return qr;
    }
  }

  // Run a SPARQL SELECT query against a remote endpoint, using the cache
  // when the same query text has been seen before.
  public QueryResult queryRemote(String service, String sparqlQuery)
      throws SQLException {
    if (cache == null) cache = new Cache();
    byte[] b = cache.fetchResultFromCache(sparqlQuery);
    if (b != null) {
      //System.out.println("Found query in cache.");
      return SerializationUtils.deserialize(b);
    }
    try (QueryExecution qexec = QueryExecution.service(service)
        .query(sparqlQuery)
        .build()) {
      ResultSet results = qexec.execSelect();
      var qr = new QueryResult(results.getResultVars());
      for (; results.hasNext(); ) {
        QuerySolution solution = results.nextSolution();
        List<String> newResultRow = new ArrayList<>();
        for (String var : qr.variableList) {
          newResultRow.add(solution.get(var).toString());
        }
        qr.rows.add(newResultRow);
      }
      byte[] serialized = SerializationUtils.serialize(qr);
      cache.saveQueryResultInCache(sparqlQuery, serialized);
      return qr;
    }
  }

  @Override
  public void close() throws SQLException {
    if (cache != null) {
      cache.close();
    }
  }

  private Cache cache = null;
  private final Model model;

  public static void main(String[] args) {
    /*
       Execute using, for example:
         mvn exec:java -Dexec.mainClass="com.markwatson.semanticweb.JenaApis" \
             -Dexec.args="data/news.n3"
    */
    JenaApis ja = new JenaApis();
    System.out.println(args.length);
    if (args.length == 0) {
      // no RDF input file names on command line so use a default file:
      ja.loadRdfFile("data/news.n3");
    } else {
      for (String fpath : args) {
        ja.loadRdfFile(fpath);
      }
    }
    System.out.println("Multi-line queries are OK but don't use blank lines.");
    System.out.println("Enter a blank line to process query.");
    // create the Scanner once: a new Scanner per query can lose buffered input
    Scanner sc = new Scanner(System.in);
    while (true) {
      System.out.println("Enter a SPARQL query:");
      StringBuilder sb = new StringBuilder();
      while (sc.hasNextLine()) { // until no other inputs to proceed
        String s = sc.nextLine();
        if (s.equalsIgnoreCase("quit") || s.equalsIgnoreCase("exit"))
          System.exit(0);
        if (s.isEmpty()) break;
        sb.append(s);
        sb.append("\n");
      }
      QueryResult qr = ja.query(sb.toString());
      System.out.println(qr);
    }
  }
}

This code is largely self-explanatory. In the constructor, one of the two lines that assign **model** should be commented out, depending on whether you want to enable OWL reasoning. At the start of the method **queryRemote** we check to see if an instance of **Cache** has been created and, if not, create one. The argument **service** for the method **queryRemote** is a SPARQL endpoint (e.g., "https://dbpedia.org/sparql"). The class **QueryResult** implements **Serializable** so it can be converted to bytes and stored in the Derby cache database.
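
The check-cache-then-store flow described above can be sketched with only the Java standard library. This is not the book's **Cache** class: the name `QueryCacheSketch`, the in-memory `HashMap` standing in for the Derby database, and the use of `ObjectOutputStream` in place of Apache Commons `SerializationUtils` are all assumptions made to keep the example self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a persistent query cache.
final class QueryCacheSketch {

    // query text -> serialized query result
    private final Map<String, byte[]> store = new HashMap<>();

    // Convert any Serializable value to bytes.
    static byte[] serialize(Serializable value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        return bytes.toByteArray();
    }

    // Rebuild the original object from its serialized bytes.
    static Object deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    // Store a result under the exact text of the query that produced it.
    void save(String query, Serializable result) throws IOException {
        store.put(query, serialize(result));
    }

    // Return the cached result, or null when this query text was never saved.
    Object fetch(String query) throws IOException, ClassNotFoundException {
        byte[] data = store.get(query);
        return data == null ? null : deserialize(data);
    }
}
```

Because the key is the raw query string, two queries that differ only in whitespace are treated as different cache entries in this sketch.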

The method **main** implements a command line interface for accepting multiple lines of input. When the user enters a blank line then the previously entered non-blank lines are passed as a SPARQL local query. When run from the command line, you can enter one or more RDF input files to load prior to the SPARQL query loop.
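
The input-gathering part of that loop can be isolated into a small helper that is easy to test without a console. The sketch below is an assumption-laden variation, not the book's code: the class `QueryReaderSketch` and the method `readQuery` are hypothetical, and it reads from a `BufferedReader` instead of a `Scanner` so that a `StringReader` can supply test input.

```java
import java.io.BufferedReader;
import java.io.IOException;

// Hypothetical helper; not part of the book's JenaApis class.
final class QueryReaderSketch {

    // Collect lines until a blank line or end of input and return them as
    // one query string. Returns null when the user types quit or exit, or
    // when the input ends before any line was read.
    static String readQuery(BufferedReader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            if (line.equalsIgnoreCase("quit") || line.equalsIgnoreCase("exit")) {
                return null;
            }
            if (line.isEmpty()) break; // blank line ends the current query
            sb.append(line).append("\n");
        }
        if (line == null && sb.length() == 0) return null; // end of input
        return sb.toString();
    }
}
```

A caller would loop on `readQuery`, pass each non-null result to the query method, and stop when it returns null.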

The following listing shows the unit test class **JenaApisTest** that provides examples of how to:

- Create an instance of **JenaApis**.
- Run a SPARQL query against the remote public DBPedia service endpoint.
- Repeat the remote SPARQL query to show query caching using the Apache Derby relational database.
- Load three input data files in N-Triple and N3 format.
- Run a SPARQL query against the RDF data that we just loaded.
- Save the current model as RDF text files in both N-Triple and N3 format.
- Make SPARQL queries that require OWL reasoning.

{lang="java",linenos=off}
package com.markwatson.semanticweb;

import org.junit.jupiter.api.DisplayName;
import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.*;

class JenaApisTest {

  @Test
  @DisplayName("Remote SPARQL query against DBPedia with caching")
  void testRemoteSparqlQuery() throws Exception {
    try (var jenaApis = new JenaApis()) {
      // test remote SPARQL queries against DBPedia SPARQL endpoint
      QueryResult qrRemote = jenaApis.queryRemote(
          "https://dbpedia.org/sparql",
          """
          SELECT ?p WHERE {
            <http://dbpedia.org/resource/Bill_Gates> ?p <http://dbpedia.org/resource/Microsoft> .
          }
          LIMIT 10""");
      System.out.println("qrRemote:" + qrRemote);
      assertNotNull(qrRemote, "Remote query result should not be null");
      assertFalse(qrRemote.getVariableList().isEmpty(),
          "Should have at least one variable");

      System.out.println("Repeat query to test caching:");
      qrRemote = jenaApis.queryRemote(
          "https://dbpedia.org/sparql",
          "select distinct ?s { ?s ?p <http://dbpedia.org/resource/Parks> } LIMIT 10");
      System.out.println("qrRemote (hopefully from cache):" + qrRemote);
      assertNotNull(qrRemote, "Cached query result should not be null");

      jenaApis.loadRdfFile("data/rdfs_business.nt");
      jenaApis.loadRdfFile("data/sample_news.nt");
      jenaApis.loadRdfFile("data/sample_news.n3");

      QueryResult qr = jenaApis.query(
          "select ?s ?o where { ?s <http://knowledgebooks.com/title> ?o } limit 15");
      System.out.println("qr:" + qr);
      assertNotNull(qr, "Local query result should not be null");

      jenaApis.saveModelToTurtleFormat("model_save.nt");
      jenaApis.saveModelToN3Format("model_save.n3");
    }
  }

  @Test
  @DisplayName("OWL reasoning with RDFS inference")
  void testOwlReasoning() throws Exception {
    try (var jenaApis = new JenaApis()) {
      jenaApis.loadRdfFile("data/news.n3");

      QueryResult qr = jenaApis.query("""
          PREFIX kb: <http://knowledgebooks.com/ontology#>
          SELECT ?s ?o WHERE { ?s kb:containsCity ?o }""");
      System.out.println("qr:" + qr);
      assertNotNull(qr, "OWL query result should not be null");

      qr = jenaApis.query("""
          PREFIX kb: <http://knowledgebooks.com/ontology#>
          SELECT ?s ?o WHERE { ?s kb:containsPlace ?o }""");
      System.out.println("qr:" + qr);
      assertNotNull(qr, "Inferred place query result should not be null");

      qr = jenaApis.query("""
          PREFIX kb: <http://knowledgebooks.com/ontology#>
          SELECT ?o (COUNT(*) AS ?count) WHERE {
            ?s kb:containsPlace ?o
          } GROUP BY ?o""");
      System.out.println("qr:" + qr);
      assertNotNull(qr, "Aggregation query result should not be null");
      assertFalse(qr.getRows().isEmpty(), "Should have aggregated results");
    }
  }
}

To reuse the example code in this section, I recommend that you clone the entire directory **semantic_web_apache_jena** because it is set up for using Maven and default logging. If you want to use the code in an existing Java project then copy the dependencies from the file **pom.xml** to your project. If you run **mvn install** then you will have a local copy installed on your system and can just add the dependency with Maven group ID **com.markwatson** and artifact **semanticweb**.

## OWL: The Web Ontology Language {#owl}

We have already seen a few examples of using RDFS to define sub-properties in this chapter. The Web Ontology Language (OWL) extends the expressive power of RDFS. We now look at a few OWL examples and then look at parts of the Java unit test showing three SPARQL queries that use OWL reasoning. The following RDF data stores and libraries support at least some level of OWL reasoning:

- ProtegeOwlApis - compatible with the Protege Ontology editor
- Pellet - DL reasoner
- Owlim - OWL DL reasoner compatible with some versions of Sesame
- Jena - General purpose library
- OWLAPI - a simpler API using many other libraries
- Stardog - a commercial OWL and RDF reasoning system and datastore
- AllegroGraph - a commercial RDF+ and RDF reasoning system and datastore

OWL is more expressive than RDFS in that it supports cardinality, richer class relationships, and Description Logic (DL) reasoning. OWL treats the idea of classes very differently than object oriented programming languages like Java and Smalltalk. In OWL, instances of a class are referred to as individuals and class membership is determined by a set of properties that allow a DL reasoner to infer class membership of an individual (this is called entailment).

We have been using the RDF file news.n3 in previous examples and we will layer new examples by adding new triples that represent RDF, RDFS, and OWL. We saw in news.n3 the definition of three triples using **rdfs:subPropertyOf** properties to create a more general kb:containsPlace property:

{lang="sparql",linenos=off}
kb:containsCity rdfs:subPropertyOf kb:containsPlace .
kb:containsCountry rdfs:subPropertyOf kb:containsPlace .
kb:containsState rdfs:subPropertyOf kb:containsPlace .

kb:containsPlace rdf:type owl:TransitiveProperty .

kbplace:UnitedStates kb:containsState kbplace:Illinois .
kbplace:Illinois kb:containsCity kbplace:Chicago .
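
Because kb:containsState and kb:containsCity are sub-properties of kb:containsPlace, and kb:containsPlace is declared to be transitive, we can also infer that:

{lang="sparql",linenos=off}
kbplace:UnitedStates kb:containsPlace kbplace:Chicago .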
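
The inference above amounts to computing reachability over the containsPlace edges. The following plain Java sketch shows that idea under stated assumptions; it is not how an OWL reasoner such as the one in Jena is implemented, and the class `TransitiveSketch` and its map-based graph are hypothetical names used only here.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; not part of the book's code or of Jena.
final class TransitiveSketch {

    // Direct containsPlace edges (container -> contained places). Each edge
    // stands for a kb:containsState or kb:containsCity triple that has been
    // lifted to its parent property kb:containsPlace.
    static Map<String, Set<String>> directEdges() {
        Map<String, Set<String>> edges = new HashMap<>();
        edges.computeIfAbsent("kbplace:UnitedStates", k -> new HashSet<>())
             .add("kbplace:Illinois");
        edges.computeIfAbsent("kbplace:Illinois", k -> new HashSet<>())
             .add("kbplace:Chicago");
        return edges;
    }

    // Every place reachable from 'start' by following one or more edges.
    static Set<String> containedPlaces(Map<String, Set<String>> edges, String start) {
        Set<String> reached = new HashSet<>();
        collect(edges, start, reached);
        return reached;
    }

    // Depth-first traversal; the 'reached' set also protects against cycles.
    private static void collect(Map<String, Set<String>> edges, String node,
                                Set<String> reached) {
        for (String next : edges.getOrDefault(node, Set.of())) {
            if (reached.add(next)) collect(edges, next, reached);
        }
    }
}
```

Calling `containedPlaces(directEdges(), "kbplace:UnitedStates")` yields both Illinois and Chicago, the second of which was never asserted directly.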

We can also model inverse properties in OWL. For example, here we add an inverse property kb:containedIn, adding it to the example in the last listing:

{lang="sparql",linenos=off}
kb:containedIn owl:inverseOf kb:containsPlace .

Given an RDF data store that supports extended OWL DL SPARQL queries, we can now execute SPARQL queries matching the property kb:containedIn and “match” triples in the RDF triple store that have never been asserted but are inferred by the OWL reasoner.
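
What the reasoner adds for an inverse property can be pictured as swapping subject and object. The short Java sketch below illustrates only that one rule; the class `InversePropertySketch` and the string-array triples are assumptions for this example, and a real reasoner derives such triples as part of a much larger rule set.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; triples are {subject, predicate, object} arrays.
final class InversePropertySketch {

    // For every asserted (s kb:containsPlace o) produce the inferred
    // triple (o kb:containedIn s). Other predicates are ignored.
    static List<String[]> inferInverse(List<String[]> triples) {
        List<String[]> inferred = new ArrayList<>();
        for (String[] t : triples) {
            if (t[1].equals("kb:containsPlace")) {
                inferred.add(new String[] {t[2], "kb:containedIn", t[0]});
            }
        }
        return inferred;
    }
}
```

A query for `?city kb:containedIn kbplace:Illinois` could then be answered from the inferred triples even though no kb:containedIn triple was ever loaded.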

OWL DL is a very large subset of full OWL. From reading the chapter on Reasoning and the very light coverage of OWL in this section, you should understand that class membership need not be stated explicitly for an object (or individual); it can instead be inferred from the properties that the individual has.

The World Wide Web Consortium has defined three versions of the OWL language that are in increasing order of complexity: OWL Lite, OWL DL, and OWL Full. OWL DL (supports Description Logic) is the most widely used (and recommended) version of OWL. OWL Full is not computationally decidable since it supports full logic, multiple class inheritance, and other things that probably make it computationally intractable for all but smaller problems.

We will now look at some Java code from the method **testOwlReasoning** in the unit test class **JenaApisTest**.

The following is not affected by using an OWL reasoner because the property **kb:containsCity** occurs directly in the input RDF data:

{lang="java",linenos=off}
1 try (var jenaApis = new JenaApis()) {
2 jenaApis.loadRdfFile("data/news.n3");
3
4 QueryResult qr = jenaApis.query("""
5 PREFIX kb: <http://knowledgebooks.com/ontology#>
6 SELECT ?s ?o WHERE { ?s kb:containsCity ?o }""");
7 System.out.println("qr:" + qr);

The following has been edited to keep just a few output lines per result set:

{lang="sparql",linenos=off}
qr:[QueryResult vars:[s, o]
Rows:
    [http://news.yahoo.com/s/nm/20080616/ts_nm/usa_flooding_dc_16/, St. Paul]
    [http://news.yahoo.com/s/nm/20080616/ts_nm/usa_politics_dc_2/, FLINT]
    [http://news.yahoo.com/s/nm/20080616/ts_nm/usa_flooding_dc_16/, CHICAGO]
    ... output removed. note: there were 15 results for query

    [http://news.yahoo.com/s/nm/20080616/ts_nm/usa_flooding_dc_16/, Quincy]
    [http://news.yahoo.com/s/nm/20080616/ts_nm/usa_flooding_dc_16/, Iowa City]

Here we use a query that is affected by using an OWL reasoner (i.e., if OWL is not enabled there will be no query results):

{lang="java",linenos=off}
    qr = jenaApis.query("""
        PREFIX kb: <http://knowledgebooks.com/ontology#>
        SELECT ?s ?o WHERE { ?s kb:containsPlace ?o }""");
    System.out.println("qr:" + qr);

The code in the GitHub repo for this book is configured to use OWL by default. If you edited lines 21-22 in the file **JenaApis.java** to disable OWL reasoning, then revert your changes and rebuild the project.

The following has been edited to keep just a few output lines per result set:

{lang="sparql",linenos=off}
qr:[QueryResult vars:[s, o]
Rows:
    [http://news.yahoo.com/s/nm/20080616/ts_nm/usa_flooding_dc_16/, St. Paul]
    [http://news.yahoo.com/s/nm/20080616/ts_nm/usa_politics_dc_2/, FLINT]
    [http://news.yahoo.com/s/nm/20080616/ts_nm/usa_flooding_dc_16/, CHICAGO]
    [http://news.yahoo.com/s/nm/20080616/bs_nm/global_economy_dc_4/, Kuala Lumpur]
    ... output removed. note: there were 46 results for query

    global_economy_dc_4/, United States]
    [http://news.yahoo.com/s/nm/20080616/bs_nm/global_economy_dc_4/, Germany]
    [http://news.yahoo.com/s/nm/20080616/ts_nm/usa_flooding_dc_16/, United States]

We now group (aggregate) query results and count the number of times each place name occurs in the result (this query requires an OWL reasoner):

{lang="java",linenos=off}
    qr = jenaApis.query("""
        PREFIX kb: <http://knowledgebooks.com/ontology#>
        SELECT ?o (COUNT(*) AS ?count) WHERE {
          ?s kb:containsPlace ?o
        } GROUP BY ?o""");
    System.out.println("qr:" + qr);

{lang="sparql",linenos=off}
qr:[QueryResult vars:[o, count]
Rows:
    [Chicago, 1^^http://www.w3.org/2001/XMLSchema#integer]
    [Illinois, 2^^http://www.w3.org/2001/XMLSchema#integer]
    [Arizona, 1^^http://www.w3.org/2001/XMLSchema#integer]
    ... output removed. note: there were 40 results for query

    [United States, 4^^http://www.w3.org/2001/XMLSchema#integer]
    [Iowa, 1^^http://www.w3.org/2001/XMLSchema#integer]
    [Japan, 2^^http://www.w3.org/2001/XMLSchema#integer]
    [Spain, 1^^http://www.w3.org/2001/XMLSchema#integer]

Note the type **http://www.w3.org/2001/XMLSchema#integer**, attached with the **^^** notation to the integer values bound to the variable **count**.
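If you need these counts as Java integers, the printed form can be split at the **^^** marker. The following is only a sketch under the assumption that you are working with the string form shown in the output above; the class **TypedLiteral** and its method are hypothetical and are not part of the **JenaApis** wrapper.

```java
// Hypothetical helper: split a typed literal such as
// "4^^http://www.w3.org/2001/XMLSchema#integer" into value and datatype.
final class TypedLiteral {

    static final String XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer";

    // Returns the integer value, or throws if the literal is not an xsd:integer.
    // Note: xsd:integer is unbounded; this sketch only handles values that fit in an int.
    static int parseXsdInteger(String literal) {
        int sep = literal.lastIndexOf("^^");
        if (sep < 0) {
            throw new IllegalArgumentException("No ^^ datatype marker in: " + literal);
        }
        String lexical = literal.substring(0, sep);   // text before ^^
        String datatype = literal.substring(sep + 2); // datatype URI after ^^
        if (!XSD_INTEGER.equals(datatype)) {
            throw new IllegalArgumentException("Not an xsd:integer: " + datatype);
        }
        return Integer.parseInt(lexical.trim());
    }

    public static void main(String[] args) {
        System.out.println(parseXsdInteger("4^^http://www.w3.org/2001/XMLSchema#integer"));
    }
}
```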

## Semantic Web Wrap-up

Writing semantic web applications in Java is a very large topic, worthy of an entire book. In this chapter I have covered the semantic web techniques that have been most useful in my own work: storing and querying RDF and RDFS for a specific application, and using OWL when required. In the next two chapters we will see RDF used for automatically creating Knowledge Graphs from text data and for automatic navigation of Knowledge Graphs.


# Automatically Generating Data for Knowledge Graphs {#kgcreator}

Here we develop a complete application using the package developed in the earlier chapter [Resolve Entity Names to DBPedia References](#ner). The Knowledge Graph Creator (KGcreator) is a tool for automating the generation of data for Knowledge Graphs from raw text data. Here we generate RDF data for a Knowledge Graph. You might also be interested in the Knowledge Graph Creator implementation in [my Common Lisp book](https://leanpub.com/lovinglisp) that generates data for the Neo4J open source graph database in addition to generating RDF data.

KGcreator generates RDF triples suitable for loading into any linked data/semantic web data store.

This example application works by identifying entities in text. Example entity types are people, companies, country names, city names, broadcast network names, political party names, and university names. We saw code for detecting entities earlier, in the chapter on mapping named entities to DBPedia URIs, and we will reuse this code.

I originally wrote KGCreator as two research prototypes, one in Common Lisp (see my [Common Lisp book](https://leanpub.com/lovinglisp)) and one in [Haskell](https://leanpub.com/haskell-cookbook/). The example in this chapter is a port of these systems to Java.

## Implementation Notes

The implementation is contained in a single Java class **KGC**, and the **junit** test class **KgcTest** is used to process the test files included with this example.

{width: "80%"}
![Overview of the Knowledge Graph Creator](images/kgc-uml.png)

As can be seen in the following figure, I have defined static final strings for each entity type URI. For example, **PERSON_TYPE_URI** has the value **<http://www.w3.org/2000/01/rdf-schema#person>**.

{width: "80%"}
![Overview of the Knowledge Graph Creator script](images/kgc-uml.png)

The following figure shows a screen shot of this example project in the free Community Edition of IntelliJ.

{width: "80%"}
![IntelliJ project view for the examples in this chapter](images/kgc-intellij.png)

Notice in this screen shot that there are several test files in the directory **test_data**. The files with the file extension **.meta** contain a single line, which is the URI for the source of the text in the matching text file. For example, the meta file **test1.meta** provides the URI for the source of the text in the file **test1.txt**.

## Generating RDF Data

RDF data consists of triples, where each triple has a subject, a predicate, and an object. Subjects are URIs (or blank nodes), predicates are always URIs, and objects are literal values, URIs, or blank nodes. Here are two triples written by this example application:

{linenos=off}
<http://dbpedia.org/resource/The_Wall_Street_Journal>
  <http://knowledgebooks.com/schema/aboutCompanyName>
  "Wall Street Journal" .
<https://newsshop.com/june/z902.html>
  <http://knowledgebooks.com/schema/containsCountryDbPediaLink>
  <http://dbpedia.org/resource/Canada> .

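As an aside, here is a small sketch of how a triple line in this format can be assembled safely in Java. The class **NTriples** and its methods are hypothetical and are not part of KGcreator. The sketch assumes bare URIs without angle brackets as input and escapes the characters that would otherwise break a quoted literal.

```java
// Hypothetical helper for writing N-Triples style lines (not part of KGcreator).
final class NTriples {

    // Escape backslash, double quote, and line breaks inside a literal.
    // Backslash must be handled first so later escapes are not doubled.
    static String escapeLiteral(String s) {
        return s.replace("\\", "\\\\")
                .replace("\"", "\\\"")
                .replace("\n", "\\n")
                .replace("\r", "\\r");
    }

    // Triple whose object is a URI.
    static String uriObjectTriple(String subject, String predicate, String object) {
        return "<" + subject + "> <" + predicate + "> <" + object + "> .";
    }

    // Triple whose object is a plain string literal.
    static String literalObjectTriple(String subject, String predicate, String literal) {
        return "<" + subject + "> <" + predicate + "> \"" + escapeLiteral(literal) + "\" .";
    }

    public static void main(String[] args) {
        System.out.println(literalObjectTriple(
            "http://dbpedia.org/resource/The_Wall_Street_Journal",
            "http://knowledgebooks.com/schema/aboutCompanyName",
            "Wall Street Journal"));
        System.out.println(uriObjectTriple(
            "https://newsshop.com/june/z902.html",
            "http://knowledgebooks.com/schema/containsCountryDbPediaLink",
            "http://dbpedia.org/resource/Canada"));
    }
}
```

Escaping matters mostly for entity names that contain quotation marks; without it a single such name would make the whole output file unparseable.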
The following listing of the file **KGC.java** contains the implementation of the main Java class for generating RDF data. The code for the different entity types is the same, so a single helper method, **writeTriples**, is called once per entity type. The following is reformatted to fit the page width:

{lang="java",linenos=on}
package com.knowledgegraphcreator;

import com.markwatson.ner_dbpedia.TextToDbpediaUris;

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/**
 * Java implementation of Knowledge Graph Creator.
 * Copyright 2020 Mark Watson. All Rights Reserved. Apache 2 license.
 * For documentation see my book "Practical Artificial Intelligence Programming
 * With Java", chapter "Automatically Generating Data for Knowledge Graphs"
 * at https://leanpub.com/javaai that can be read free online.
 */
public class KGC {

    private static final System.Logger LOG = System.getLogger(KGC.class.getName());

    private static final String SUBJECT_URI = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#/subject>";
    private static final String LABEL_URI = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#/label>";
    private static final String COUNTRY_TYPE_URI = "<http://www.w3.org/2000/01/rdf-schema#country>";
    private static final String PERSON_TYPE_URI = "<http://www.w3.org/2000/01/rdf-schema#person>";
    private static final String COMPANY_TYPE_URI = "<http://www.w3.org/2000/01/rdf-schema#company>";
    private static final String CITY_TYPE_URI = "<http://www.w3.org/2000/01/rdf-schema#city>";
    private static final String BROADCAST_NETWORK_TYPE_URI = "<http://www.w3.org/2000/01/rdf-schema#broadcastNetwork>";
    private static final String MUSIC_GROUP_TYPE_URI = "<http://www.w3.org/2000/01/rdf-schema#musicGroup>";
    private static final String POLITICAL_PARTY_TYPE_URI = "<http://www.w3.org/2000/01/rdf-schema#politicalParty>";
    private static final String TRADE_UNION_TYPE_URI = "<http://www.w3.org/2000/01/rdf-schema#tradeUnion>";
    private static final String UNIVERSITY_TYPE_URI = "<http://www.w3.org/2000/01/rdf-schema#university>";
    private static final String TYPE_OF_URI = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>";

    /** Immutable holder for a text file and its associated metadata. */
    private record TextAndMeta(String text, String meta) {}

    private KGC() { }

    public KGC(String directoryPath, String outputRdfPath) throws IOException {
        process(directoryPath, outputRdfPath);
    }

    /**
     * Process all .txt/.meta file pairs in {@code directoryPath} and write
     * RDF triples to {@code outputRdfPath}.
     */
    public static void process(String directoryPath, String outputRdfPath) throws IOException {
        LOG.log(System.Logger.Level.INFO, "KGC processing directory: {0}", directoryPath);
        Path dirPath = Path.of(directoryPath);
        File[] directoryListing = dirPath.toFile().listFiles();
        if (directoryListing == null) {
            LOG.log(System.Logger.Level.WARNING, "Directory listing returned null for: {0}", directoryPath);
            return;
        }

        try (var out = new PrintStream(outputRdfPath)) {
            for (File child : directoryListing) {
                if (!child.toString().endsWith(".txt")) {
                    continue;
                }
                LOG.log(System.Logger.Level.DEBUG, "Processing file: {0}", child);

                // try to open the meta file with the same extension:
                String metaAbsolutePath = child.getAbsolutePath();
                Path metaPath = Path.of(metaAbsolutePath.substring(0, metaAbsolutePath.length() - 4) + ".meta");
                LOG.log(System.Logger.Level.DEBUG, "Meta file: {0}", metaPath);

                TextAndMeta data = readData(child.toPath(), metaPath);
                String metaData = "<" + data.meta().strip() + ">";
                TextToDbpediaUris kt = new TextToDbpediaUris(data.text());

                writeTriples(out, metaData, kt.personNames, kt.personUris, PERSON_TYPE_URI);
                writeTriples(out, metaData, kt.companyNames, kt.companyUris, COMPANY_TYPE_URI);
                writeTriples(out, metaData, kt.cityNames, kt.cityUris, CITY_TYPE_URI);
                writeTriples(out, metaData, kt.countryNames, kt.countryUris, COUNTRY_TYPE_URI);
                writeTriples(out, metaData, kt.broadcastNetworkNames, kt.broadcastNetworkUris, BROADCAST_NETWORK_TYPE_URI);
                writeTriples(out, metaData, kt.musicGroupNames, kt.musicGroupUris, MUSIC_GROUP_TYPE_URI);
                writeTriples(out, metaData, kt.politicalPartyNames, kt.politicalPartyUris, POLITICAL_PARTY_TYPE_URI);
                writeTriples(out, metaData, kt.tradeUnionNames, kt.tradeUnionUris, TRADE_UNION_TYPE_URI);
                writeTriples(out, metaData, kt.universityNames, kt.universityUris, UNIVERSITY_TYPE_URI);
            }
        }
    }

    /**
     * Write subject, label, and type triples for a list of named entities.
     */
    private static void writeTriples(PrintStream out, String metaData,
                                     List<String> names, List<String> uris, String typeUri) {
        for (int i = 0; i < names.size(); i++) {
            out.println(metaData + " " + SUBJECT_URI + " " + uris.get(i) + " .");
            out.println(uris.get(i) + " " + LABEL_URI + " \"" + names.get(i) + "\" .");
            out.println(uris.get(i) + " " + TYPE_OF_URI + " " + typeUri + " .");
        }
    }

    private static TextAndMeta readData(Path textPath, Path metaPath) throws IOException {
        String text = Files.readString(textPath, StandardCharsets.UTF_8);
        String meta = Files.readString(metaPath, StandardCharsets.UTF_8);
        LOG.log(System.Logger.Level.DEBUG, "Read text ({0} chars) from {1}", text.length(), textPath);
        return new TextAndMeta(text, meta);
    }
}

This code works on paired files: a text file and a meta file for each text file. For example, if there is an input text file test123.txt then there must be a matching meta file test123.meta that contains the source of the data in the file test123.txt. This data source will be a URI on the web or a local file URI. The class constructor for **KGC** takes the path of the directory containing these file pairs and an output file path for writing the generated RDF data.
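The pairing rule can be isolated in a few lines. **KGC** derives the meta file name with string slicing; the sketch below shows the same idea using the **java.nio.file.Path** API instead. The class **MetaPaths** and its method are hypothetical names used only for illustration.

```java
import java.nio.file.Path;

// Hypothetical helper: find the .meta file that pairs with a .txt file.
final class MetaPaths {

    // Derive e.g. "test_data/test123.meta" from "test_data/test123.txt".
    static Path metaPathFor(Path textPath) {
        String name = textPath.getFileName().toString();
        if (!name.endsWith(".txt")) {
            throw new IllegalArgumentException("Expected a .txt file: " + textPath);
        }
        String base = name.substring(0, name.length() - ".txt".length());
        // resolveSibling keeps the directory and swaps only the file name.
        return textPath.resolveSibling(base + ".meta");
    }

    public static void main(String[] args) {
        System.out.println(metaPathFor(Path.of("test_data", "test123.txt")));
    }
}
```

Rejecting names that do not end in **.txt** avoids silently slicing four characters off an unrelated file name.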

The **junit** test class **KgcTest** will process the local directory **test_data** and generate an RDF output file:

{lang="java",linenos=on}
package com.knowledgegraphcreator;

import org.junit.jupiter.api.Test;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import static org.junit.jupiter.api.Assertions.*;

public class KgcTest {

    @Test
    void testKGC() throws IOException {
        Path outputFile = Path.of("output_with_duplicates.rdf");
        KGC client = new KGC("test_data/", outputFile.toString());

        assertTrue(Files.exists(outputFile), "Output RDF file should be created");

        String content = Files.readString(outputFile);
        assertFalse(content.isBlank(), "Output RDF file should not be empty");

        // Verify that known entity types from test data appear in the output
        assertTrue(content.contains("<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"),
                "Output should contain RDF type triples");
        assertTrue(content.contains("<http://www.w3.org/1999/02/22-rdf-syntax-ns#/label>"),
                "Output should contain label triples");
    }
}

If specific entity names occur in multiple input files, a few duplicated RDF statements will be generated. The simplest way to deal with this is to add a one-line call to the **awk** utility to efficiently remove duplicate lines in the RDF output file. Here is a listing of the **Makefile** for this example:

{lang="bash",linenos=off}
create_data_and_remove_duplicates:
	mvn test
	echo "Removing duplicate RDF statements"
	awk '!visited[$$0]++' output_with_duplicates.rdf > output.rdf
	rm -f output_with_duplicates.rdf

If you are not familiar with **awk** and want to learn the basics then I recommend [this short tutorial](http://www.hcs.harvard.edu/~dholland/computers/awk.html).
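If you would rather stay in Java than call **awk**, the same order-preserving removal of duplicate lines can be sketched with a **LinkedHashSet**. The class **DedupLines** is a hypothetical name and is not part of the example project; reading and writing the RDF file is left out to keep the sketch short.

```java
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical Java counterpart of: awk '!visited[$0]++'
final class DedupLines {

    // Keep only the first occurrence of each line, preserving the original order.
    static List<String> removeDuplicates(List<String> lines) {
        // LinkedHashSet drops repeats and remembers insertion order.
        return List.copyOf(new LinkedHashSet<>(lines));
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "<a> <p> <b> .",
            "<c> <p> <d> .",
            "<a> <p> <b> .");
        removeDuplicates(lines).forEach(System.out::println);
    }
}
```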

## KGCreator Wrap Up

When developing applications or systems using Knowledge Graphs it is useful to be able to quickly generate test data, which is the primary purpose of KGCreator. A secondary use is to generate Knowledge Graphs for production use from text data sources. In this second use case you will want to manually inspect the generated data to verify its correctness or usefulness for your application.


# Knowledge Graph Navigator {#kgn}

The Knowledge Graph Navigator (which I will often refer to as KGN) is a tool for processing a set of entity names and automatically exploring the public Knowledge Graph [DBPedia](http://dbpedia.org) using SPARQL queries. I started to write KGN for my own use, to automate some things I used to do manually when exploring Knowledge Graphs, and later thought that KGN might also be useful for educational purposes. KGN shows the user the auto-generated SPARQL queries so hopefully the user will learn by seeing examples. KGN uses code developed in the earlier chapter [Resolve Entity Names to DBPedia References](#ner), and we will also reuse the two Java classes **JenaApis** and **QueryResult** (which wrap the Apache Jena library) from the chapter [Semantic Web](#semantic-web).

I have a [web site devoted to different versions of KGN](http://www.knowledgegraphnavigator.com/) that you might find interesting. The most full featured version of KGN, including a full user interface, is featured in my book [Loving Common Lisp, or the Savvy Programmer's Secret Weapon](https://leanpub.com/lovinglisp) that you can read free online. That version performs more speculative SPARQL queries to find information compared to the example here that I designed for ease of understanding, modification, and embedding in larger Java projects.

I chose to use DBPedia instead of WikiData for this example because DBPedia URIs are human readable. The following URIs represent the concept of a *person*. The semantic meanings of DBPedia and FOAF (friend of a friend) URIs are self-evident to a human reader while the WikiData URI is not:

{linenos=off}
http://www.wikidata.org/entity/Q215627
http://dbpedia.org/ontology/Person
http://xmlns.com/foaf/0.1/name

I frequently use WikiData in my work and WikiData is one of the most useful public knowledge bases. I have both DBPedia and WikiData SPARQL endpoints in the file **Sparql.java** that we will look at later, with the WikiData endpoint commented out. You can try manually querying WikiData at the [WikiData SPARQL endpoint](https://query.wikidata.org). For example, you might explore the WikiData URI for the *person* concept using:

{lang=sparql, linenos=off}
select ?p ?o where {
  <http://www.wikidata.org/entity/Q215627> ?p ?o
} limit 10

For the rest of this chapter we will just use DBPedia.

After looking at an interactive session using the example program for this chapter (that also includes listing automatically generated SPARQL queries) we will look at the implementation.

{width: "80%"}
![Overview of the Knowledge Graph Navigator](images/kgn-uml.png)

## Entity Types Handled by KGN

To keep this example simple we handle just four entity types:

- People
- Companies
- Cities
- Countries

The entity detection library that we use from an earlier chapter also supports the following entity types that we don't use here:

- Broadcast Networks
- Music Groups
- Political Parties
- Trade Unions
- Universities

In addition to finding detailed information for people, companies, cities, and countries we will also search for relationships between person entities and company entities. This search process consists of generating a series of SPARQL queries and calling the DBPedia SPARQL endpoint.
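To give a rough idea of what "generating a series of SPARQL queries" can look like, the following sketch builds one query per ordered pair of entity URIs, asking which properties link the first entity to the second. This is an illustration only: the class **RelationshipQueries**, the exact query text, and the LIMIT value are assumptions, not the queries that KGN actually generates. The sketch expects bare URIs without angle brackets.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of pairwise relationship queries (not KGN's actual code).
final class RelationshipQueries {

    // Build one SPARQL query per ordered pair of distinct entity URIs.
    // Both directions are generated because a property such as "spouse"
    // may be asserted from either side.
    static List<String> forAllPairs(List<String> entityUris) {
        List<String> queries = new ArrayList<>();
        for (String subject : entityUris) {
            for (String object : entityUris) {
                if (!subject.equals(object)) {
                    queries.add("SELECT DISTINCT ?p WHERE { <%s> ?p <%s> } LIMIT 10"
                        .formatted(subject, object));
                }
            }
        }
        return queries;
    }

    public static void main(String[] args) {
        forAllPairs(List.of(
            "http://dbpedia.org/resource/Bill_Gates",
            "http://dbpedia.org/resource/Microsoft"))
            .forEach(System.out::println);
    }
}
```

The number of queries grows quadratically with the number of entities, which is one reason a caching layer in front of the public endpoint is worthwhile.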

As we look at the KGN implementation I will point out where and how you can easily add support for more entity types and in the wrap-up I will suggest further projects that you might want to try implementing with this example.

## General Design of KGN with Example Output

The example application works by first having the user enter names of people and companies. Using libraries written in two previous chapters, we find entities in the user's input text, and generate SPARQL queries to DBPedia to find information about the entities and relationships between them.

We will start by looking at sample output so you have some understanding of what this implementation of KGN will and will not do. Here is the console output for the example query *"Bill Gates, Melinda Gates and Steve Jobs at Apple Computer, IBM and Microsoft"* (with some output removed for brevity). As you remember from the chapter *Semantic Web*, SPARQL query results are expressed in class **QueryResult** that contains the variables (labelled as **vars**) in a query and a list of rows (one query result per row). Near the end of the following listing we see discovered relationships between entities in the input query.

{linenos=on}
Enter entities query: Bill Gates, Melinda Gates and Steve Jobs at Apple Computer, IBM and Microsoft
Processing query: Bill Gates, Melinda Gates and Steve Jobs at Apple Computer, IBM and Microsoft
person 0 1 Bill Gates http://dbpedia.org/resource/Bill_Gates
person 4 5 Melinda Gates http://dbpedia.org/resource/Melinda_Gates
person 7 8 Steve Jobs http://dbpedia.org/resource/Steve_Jobs
company 10 11 Apple Computer http://dbpedia.org/resource/Apple_Inc.
company 14 15 IBM http://dbpedia.org/resource/IBM
company 16 17 Microsoft http://dbpedia.org/resource/Microsoft
Individual People:
Bill Gates : http://dbpedia.org/resource/Bill_Gates [QueryResult vars:[birthplace, label, comment, almamater, spouse] Rows: [http://dbpedia.org/resource/Seattle, Bill Gates, William Henry "Bill" Gates III (born October 28, 1955) is an American business magnate, investor, author and philanthropist. In 1975, Gates and Paul Allen co-founded Microsoft, which became the world’s largest PC software company. During his career at Microsoft, Gates held the positions of chairman, CEO and chief software architect, and was the largest individual shareholder until May 2014. Gates has authored and co-authored several books., http://dbpedia.org/resource/Harvard_University, http://dbpedia.org/resource/Melinda_Gates]
Melinda Gates : http://dbpedia.org/resource/Melinda_Gates [QueryResult vars:[birthplace, label, comment, almamater, spouse] Rows: [http://dbpedia.org/resource/Dallas | http://dbpedia.org/resource/Dallas,_Texas, Melinda Gates, Melinda Ann Gates (née French; born August 15, 1964), DBE is an American businesswoman and philanthropist. She is co-founder of the Bill & Melinda Gates Foundation. She worked at Microsoft, where she was project manager for Microsoft Bob, Microsoft Encarta and Expedia., http://dbpedia.org/resource/Duke_University, http://dbpedia.org/resource/Bill_Gates]
Steve Jobs : http://dbpedia.org/resource/Steve_Jobs [QueryResult vars:[birthplace, label, comment, almamater, spouse] Rows: [http://dbpedia.org/resource/San_Francisco, Steve Jobs, Steven Paul "Steve" Jobs (/ˈdʒɒbz/; February 24, 1955 – October 5, 2011) was an American information technology entrepreneur and inventor. He was the co-founder, chairman, and chief executive officer (CEO) of Apple Inc.; CEO and majority shareholder of Pixar Animation Studios; a member of The Walt Disney Company’s board of directors following its acquisition of Pixar; and founder, chairman, and CEO of NeXT Inc. Jobs is widely recognized as a pioneer of the microcomputer revolution of the 1970s and 1980s, along with Apple co-founder Steve Wozniak. Shortly after his death, Jobs’s official biographer, Walter Isaacson, described him as a "creative entrepreneur whose passion for perfection and ferocious drive revolutionized six industries: personal computers, animated movies, music, phones, tab, http://dbpedia.org/resource/Reed_College, http://dbpedia.org/resource/Laurene_Powell_Jobs]
Individual Companies:
Apple Computer : http://dbpedia.org/resource/Apple_Inc. [QueryResult vars:[industry, netIncome, label, comment, numberOfEmployees] Rows: [http://dbpedia.org/resource/Computer_hardware | http://dbpedia.org/resource/Computer_software | http://dbpedia.org/resource/Consumer_electronics | http://dbpedia.org/resource/Corporate_Venture_Capital | http://dbpedia.org/resource/Digital_distribution | http://dbpedia.org/resource/Fabless_manufacturing, 5.3394E10, Apple Inc., Apple Inc. is an American multinational technology company headquartered in Cupertino, California, that designs, develops, and sells consumer electronics, computer software, and online services. Its hardware products include the iPhone smartphone, the iPad tablet computer, the Mac personal computer, the iPod portable media player, the Apple Watch smartwatch, and the Apple TV digital media player. Apple’s consumer software includes the macOS and iOS operating systems, the iTunes media player, the Safari web browser, and the iLife and iWork creativity and productivity suites. Its online services include the iTunes Store, the iOS App Store and Mac App Store, Apple Music, and iCloud., 115000]
IBM : http://dbpedia.org/resource/IBM [QueryResult vars:[industry, netIncome, label, comment, numberOfEmployees] Rows: [http://dbpedia.org/resource/Cloud_computing | http://dbpedia.org/resource/Cognitive_computing | http://dbpedia.org/resource/Information_technology, 1.319E10, IBM, International Business Machines Corporation (commonly referred to as IBM) is an American multinational technology company headquartered in Armonk, New York, United States, with operations in over 170 countries. The company originated in 1911 as the Computing-Tabulating-Recording Company (CTR) and was renamed "International Business Machines" in 1924., 377757]
Microsoft : http://dbpedia.org/resource/Microsoft [QueryResult vars:[industry, netIncome, label, comment, numberOfEmployees] Rows: [http://dbpedia.org/resource/Computer_hardware | http://dbpedia.org/resource/Consumer_electronics | http://dbpedia.org/resource/Digital_distribution | http://dbpedia.org/resource/Software, , Microsoft, Microsoft Corporation /ˈmaɪkrəˌsɒft, -roʊ-, -ˌsɔːft/ (commonly referred to as Microsoft or MS) is an American multinational technology company headquartered in Redmond, Washington, that develops, manufactures, licenses, supports and sells computer software, consumer electronics and personal computers and services. Its best known software products are the Microsoft Windows line of operating systems, Microsoft Office office suite, and Internet Explorer and Edge web browsers. Its flagship hardware products are the Xbox video game consoles and the Microsoft Surface tablet lineup. As of 2011, it was the world’s largest software maker by revenue, and one of the world’s most valuable companies., 114000]
Individual Cities:
Seattle : http://dbpedia.org/resource/Seattle [QueryResult vars:[latitude_longitude, populationDensity, label, comment, country] Rows: [POINT(-122.33305358887 47.609722137451), 3150.979715864901, Seattle, Seattle is a West Coast seaport city and the seat of King County, Washington. With an estimated 684,451 residents as of 2015, Seattle is the largest city in both the state of Washington and the Pacific Northwest region of North America. As of 2015, it is estimated to be the 18th largest city in the United States. In July 2013, it was the fastest-growing major city in the United States and remained in the Top 5 in May 2015 with an annual growth rate of 2.1%. The Seattle metropolitan area is the 15th largest metropolitan area in the United States with over 3.7 million inhabitants. The city is situated on an isthmus between Puget Sound (an inlet of the Pacific Ocean) and Lake Washington, about 100 miles (160 km) south of the Canada–United States border. A major gateway for trade w, ]
Individual Countries:
Relationships between person Bill Gates person Melinda Gates: [QueryResult vars:[p] Rows: [http://dbpedia.org/ontology/spouse]
Relationships between person Melinda Gates person Bill Gates: [QueryResult vars:[p] Rows: [http://dbpedia.org/ontology/spouse]
Relationships between person Bill Gates company Microsoft: [QueryResult vars:[p] Rows: [http://dbpedia.org/ontology/board]
Relationships between person Steve Jobs company Apple Computer: [QueryResult vars:[p] Rows: [http://www.w3.org/2000/01/rdf-schema#seeAlso] [http://dbpedia.org/ontology/board] [http://dbpedia.org/ontology/occupation]

Since the DBPedia queries are time consuming, we use the caching layer from the earlier chapter *Semantic Web* when making SPARQL queries to DBPedia. The cache is especially helpful during development when the same queries are repeatedly used for testing.

The KGN user interface loop allows you to enter queries and see the results. There are two special options that you can enter instead of a query:

- sparql - this will print out all previous SPARQL queries used to present results. After entering this command the buffer of previous SPARQL queries is emptied. This option is useful for learning SPARQL and you might try pasting a few into the input field for the [public DBPedia SPARQL web app](http://dbpedia.org/sparql) and modifying them. We will use this command later in an example.
- demo - this will randomly choose a sample query.

## UML Class Diagram for Example Application

The following UML Class Diagram for KGN shows you an overview of the Java classes we use and their public methods and fields.

{width: "80%"}
![Overview of the Java Class UML Diagram for the Knowledge Graph Navigator](images/kgn-uml.png)

## Implementation

We will walk through the classes in the UML Class Diagram for KGN in alphabetical order, the exception being that we will look at the main program in **KGN.java** last.

The record **EntityAndDescription** holds two strings, a name and a URI reference. Because it is a Java record, the compiler generates **toString**, **equals**, and **hashCode** for us, so there is no need to override **toString** by hand to format and display the data in an instance:

{lang="java",linenos=on}
package com.knowledgegraphnavigator;

/**
 * Immutable data carrier for an entity name and its DBPedia URI.
 * Converted to a Java record for automatic toString(), equals(), hashCode().
 */
public record EntityAndDescription(String entityName, String entityUri) { }

The class **EntityDetail** defines SPARQL query templates in lines 80-154 that have slots (using **%s** for string replacement) for the URI of an entity. We use different templates for different entity types. Before we look at these SPARQL query templates, let's learn two additional features of the SPARQL language that we will need to use in these entity templates.

We mentioned the **OPTIONAL** triple matching patterns in the chapter *Semantic Web*. Before looking at the Java code, let's first look at how optional matching works. We will run the KGN application asking for information on the city Seattle and then use the **sparql** command to print the generated SPARQL produced by the method **cityResults** (most output is not shown here for brevity). I first enter the query string "Seattle" and then the command "sparql" to print out the generated SPARQL:

{linenos=on}
Enter entities query: Seattle
Individual Cities:
Seattle : http://dbpedia.org/resource/Seattle [QueryResult vars:[latitude_longitude, populationDensity, label, comment, country] Rows: [POINT(-122.33305358887 47.609722137451), 3150.979715864901, Seattle, Seattle is a West Coast seaport city and the seat of King County, Washington. With an estimated 684,451 residents as of 2015, Seattle is the largest city in both the state of Washington and the Pacific Northwest region of North America. As of 2015, it is estimated to be the 18th largest city in the United States. In July 2013, it was the fastest-growing major city in the United States and remained in the Top 5 in May 2015 with an annual growth rate of 2.1%. The Seattle metropolitan area is the 15th largest metropolitan area in the United States with over 3.7 million inhabitants. The city is situated on an isthmus between Puget Sound (an inlet of the Pacific Ocean) and Lake Washington, about 100 miles (160 km) south of the Canada–United States border. A major gateway for trade w, ]
Processing query: sparql
Generated SPARQL used to get current results:
SELECT DISTINCT
  (GROUP_CONCAT (DISTINCT ?latitude_longitude2; SEPARATOR=' | ')
      AS ?latitude_longitude)
  (GROUP_CONCAT (DISTINCT ?populationDensity2; SEPARATOR=' | ')
      AS ?populationDensity)
  (GROUP_CONCAT (DISTINCT ?label2; SEPARATOR=' | ') AS ?label)
  (GROUP_CONCAT (DISTINCT ?comment2; SEPARATOR=' | ') AS ?comment)
  (GROUP_CONCAT (DISTINCT ?country2; SEPARATOR=' | ') AS ?country) {
  <http://dbpedia.org/resource/Seattle>
      <http://www.w3.org/2000/01/rdf-schema#comment>
      ?comment2 .
  FILTER (lang(?comment2) = 'en') .
  OPTIONAL { <http://dbpedia.org/resource/Seattle>
      <http://www.w3.org/2003/01/geo/wgs84_pos#geometry>
      ?latitude_longitude2 } .
  OPTIONAL { <http://dbpedia.org/resource/Seattle>
      <http://dbpedia.org/ontology/PopulatedPlace/populationDensity>
      ?populationDensity2 } .
  OPTIONAL { <http://dbpedia.org/resource/Seattle>
      <http://dbpedia.org/ontology/country>
      ?country2 } .
  OPTIONAL { <http://dbpedia.org/resource/Seattle>
      <http://www.w3.org/2000/01/rdf-schema#label>
      ?label2 . }
} LIMIT 30

This listing was manually edited to fit the page width. The **OPTIONAL** pattern that uses the predicate **http://dbpedia.org/ontology/country** tries to find a triple stating which country Seattle is in. Please note that each triple matching pattern is generated as one line, but I had to edit it manually here to fit the page width.

The first triple matching pattern (the one for **rdfs:comment**) must match some triple in DBPedia or no results will be returned. In other words, this matching pattern is mandatory. The four optional matching patterns that follow it specify triple patterns that may be matched. In this example there is no triple matching the following statement in the DBPedia knowledge base, so the variable **country2** is not bound and the query returns no results for the variable **country**:

5 {lang="sparql",linenos=off}
Notice also the syntax for **GROUP_CONCAT** used in the **SELECT** clause of the query, for example:

{lang="sparql",linenos=off}
```sparql
(GROUP_CONCAT (DISTINCT ?country2; SEPARATOR=' | ') AS ?country)
```

This collects all values assigned to the binding variable **?country2** into a string value using the separator string " | ". Using **DISTINCT** with **GROUP_CONCAT** conveniently discards duplicate bindings for binding variables like **?country2**.
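
To make the behavior of **GROUP_CONCAT** with **DISTINCT** concrete, here is a minimal Java sketch that emulates it over a small in-memory list of triples. The class **GroupConcatSketch**, its nested **Triple** class, and the toy data are all hypothetical and are not part of KGN. The order of values in a real SPARQL result is not guaranteed, while this sketch simply keeps first-seen order.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class GroupConcatSketch {

    /** Minimal stand-in for an RDF triple (hypothetical helper, not part of KGN). */
    public static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        public Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    /**
     * Emulates GROUP_CONCAT(DISTINCT ?o; SEPARATOR=' | ') for one subject and
     * predicate: collect matching objects, drop duplicates, join with " | ".
     */
    public static String groupConcatDistinct(List<Triple> store, String subject, String predicate) {
        Set<String> distinct = new LinkedHashSet<>(); // keeps first-seen order
        for (Triple t : store) {
            if (t.subject.equals(subject) && t.predicate.equals(predicate)) {
                distinct.add(t.object);
            }
        }
        return String.join(" | ", distinct);
    }

    public static void main(String[] args) {
        // Toy data, not real DBPedia content.
        List<Triple> store = List.of(
            new Triple("Seattle", "label", "Seattle"),
            new Triple("Seattle", "label", "Seattle"),      // duplicate binding
            new Triple("Seattle", "label", "Seattle, WA"));
        System.out.println(groupConcatDistinct(store, "Seattle", "label"));
        // There is no "country" triple: like an unmatched OPTIONAL pattern,
        // nothing is collected and the result is an empty string.
        System.out.println(groupConcatDistinct(store, "Seattle", "country").isEmpty());
    }
}
```

The second call mirrors the Seattle example: with no matching triple the concatenated value is empty.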

Now that we have looked at SPARQL examples using **OPTIONAL** and **GROUP_CONCAT**, the templates at the end of the following listing should be easier to understand.

The methods **genericResults** and **genericAsString** in the following listing are not currently used in this example but I leave them as an easy way to get information, given any entity URI. You are likely to use these if you use the code for KGN in your projects.

For each entity type, for example *city*, I wrote one method like **cityResults** that returns an instance of **QueryResult** calculated by using the **JenaApis** library from the chapter *Semantic Web*. For each entity type there is another method, like **cityAsString**, that converts an instance of **QueryResult** to a formatted string for display.

We use the same code pattern for each entity type: the method **String.formatted** replaces each occurrence of **%s** in the entity template string with the string representation of the entity URI.

{lang="java",linenos=on}
```java
package com.knowledgegraphnavigator;

import com.markwatson.semanticweb.QueryResult;

import java.sql.SQLException; // Cache layer in JenaApis library throws this

public class EntityDetail {

  public static QueryResult genericResults(Sparql endpoint, String entityUri)
      throws SQLException, ClassNotFoundException {
    var query = """
        SELECT DISTINCT ?p ?o WHERE {
          %s ?p ?o .
          FILTER (!regex(str(?p), 'wiki', 'i'))
        } LIMIT 10
        """.formatted(entityUri);
    return endpoint.query(query);
  }

  public static String genericAsString(Sparql endpoint, String entityUri)
      throws SQLException, ClassNotFoundException {
    QueryResult qr = genericResults(endpoint, entityUri);
    return qr.toString();
  }

  public static QueryResult cityResults(Sparql endpoint, String entityUri)
      throws SQLException, ClassNotFoundException {
    var query = cityTemplate.formatted(entityUri, entityUri, entityUri,
                                       entityUri, entityUri);
    return endpoint.query(query);
  }

  public static String cityAsString(Sparql endpoint, String entityUri)
      throws SQLException, ClassNotFoundException {
    QueryResult qr = cityResults(endpoint, entityUri);
    return qr.toString();
  }

  public static QueryResult countryResults(Sparql endpoint, String entityUri)
      throws SQLException, ClassNotFoundException {
    // countryTemplate has four %s placeholders
    var query = countryTemplate.formatted(entityUri, entityUri,
                                          entityUri, entityUri);
    return endpoint.query(query);
  }

  public static String countryAsString(Sparql endpoint, String entityUri)
      throws SQLException, ClassNotFoundException {
    QueryResult qr = countryResults(endpoint, entityUri);
    return qr.toString();
  }

  public static QueryResult personResults(Sparql endpoint, String entityUri)
      throws SQLException, ClassNotFoundException {
    var query = personTemplate.formatted(entityUri, entityUri, entityUri,
                                         entityUri, entityUri);
    return endpoint.query(query);
  }

  public static String personAsString(Sparql endpoint, String entityUri)
      throws SQLException, ClassNotFoundException {
    QueryResult qr = personResults(endpoint, entityUri);
    return qr.toString();
  }

  public static QueryResult companyResults(Sparql endpoint, String entityUri)
      throws SQLException, ClassNotFoundException {
    var query = companyTemplate.formatted(entityUri, entityUri, entityUri,
                                          entityUri, entityUri);
    return endpoint.query(query);
  }

  public static String companyAsString(Sparql endpoint, String entityUri)
      throws SQLException, ClassNotFoundException {
    QueryResult qr = companyResults(endpoint, entityUri);
    return qr.toString();
  }

  private static final String companyTemplate = """
      SELECT DISTINCT
        (GROUP_CONCAT (DISTINCT ?industry2; SEPARATOR=' | ') AS ?industry)
        (GROUP_CONCAT (DISTINCT ?netIncome2; SEPARATOR=' | ') AS ?netIncome)
        (GROUP_CONCAT (DISTINCT ?label2; SEPARATOR=' | ') AS ?label)
        (GROUP_CONCAT (DISTINCT ?comment2; SEPARATOR=' | ') AS ?comment)
        (GROUP_CONCAT (DISTINCT ?numberOfEmployees2; SEPARATOR=' | ')
            AS ?numberOfEmployees) {
        %s <http://www.w3.org/2000/01/rdf-schema#comment> ?comment2 .
           FILTER (lang(?comment2) = 'en') .
        OPTIONAL { %s <http://dbpedia.org/ontology/industry> ?industry2 } .
        OPTIONAL { %s <http://dbpedia.org/ontology/netIncome> ?netIncome2 } .
        OPTIONAL { %s <http://dbpedia.org/ontology/numberOfEmployees>
                      ?numberOfEmployees2 } .
        OPTIONAL { %s <http://www.w3.org/2000/01/rdf-schema#label> ?label2 .
                   FILTER (lang(?label2) = 'en') }
      } LIMIT 30""";

  private static final String personTemplate = """
      SELECT DISTINCT
        (GROUP_CONCAT (DISTINCT ?birthplace2; SEPARATOR=' | ') AS ?birthplace)
        (GROUP_CONCAT (DISTINCT ?label2; SEPARATOR=' | ') AS ?label)
        (GROUP_CONCAT (DISTINCT ?comment2; SEPARATOR=' | ') AS ?comment)
        (GROUP_CONCAT (DISTINCT ?almamater2; SEPARATOR=' | ') AS ?almamater)
        (GROUP_CONCAT (DISTINCT ?spouse2; SEPARATOR=' | ') AS ?spouse) {
        %s <http://www.w3.org/2000/01/rdf-schema#comment> ?comment2 .
           FILTER (lang(?comment2) = 'en') .
        OPTIONAL { %s <http://dbpedia.org/ontology/birthPlace> ?birthplace2 } .
        OPTIONAL { %s <http://dbpedia.org/ontology/almaMater> ?almamater2 } .
        OPTIONAL { %s <http://dbpedia.org/ontology/spouse> ?spouse2 } .
        OPTIONAL { %s <http://www.w3.org/2000/01/rdf-schema#label> ?label2 .
                   FILTER (lang(?label2) = 'en') }
      } LIMIT 10""";

  private static final String countryTemplate = """
      SELECT DISTINCT
        (GROUP_CONCAT (DISTINCT ?areaTotal2; SEPARATOR=' | ') AS ?areaTotal)
        (GROUP_CONCAT (DISTINCT ?label2; SEPARATOR=' | ') AS ?label)
        (GROUP_CONCAT (DISTINCT ?comment2; SEPARATOR=' | ') AS ?comment)
        (GROUP_CONCAT (DISTINCT ?populationDensity2; SEPARATOR=' | ')
            AS ?populationDensity) {
        %s <http://www.w3.org/2000/01/rdf-schema#comment> ?comment2 .
           FILTER (lang(?comment2) = 'en') .
        OPTIONAL { %s <http://dbpedia.org/ontology/areaTotal> ?areaTotal2 } .
        OPTIONAL { %s <http://dbpedia.org/ontology/populationDensity>
                      ?populationDensity2 } .
        OPTIONAL { %s <http://www.w3.org/2000/01/rdf-schema#label> ?label2 . }
      } LIMIT 30""";

  private static final String cityTemplate = """
      SELECT DISTINCT
        (GROUP_CONCAT (DISTINCT ?latitude_longitude2; SEPARATOR=' | ')
            AS ?latitude_longitude)
        (GROUP_CONCAT (DISTINCT ?populationDensity2; SEPARATOR=' | ')
            AS ?populationDensity)
        (GROUP_CONCAT (DISTINCT ?label2; SEPARATOR=' | ') AS ?label)
        (GROUP_CONCAT (DISTINCT ?comment2; SEPARATOR=' | ') AS ?comment)
        (GROUP_CONCAT (DISTINCT ?country2; SEPARATOR=' | ') AS ?country) {
        %s <http://www.w3.org/2000/01/rdf-schema#comment> ?comment2 .
           FILTER (lang(?comment2) = 'en') .
        OPTIONAL { %s <http://www.w3.org/2003/01/geo/wgs84_pos#geometry>
                      ?latitude_longitude2 } .
        OPTIONAL { %s <http://dbpedia.org/ontology/PopulatedPlace/populationDensity>
                      ?populationDensity2 } .
        OPTIONAL { %s <http://dbpedia.org/ontology/country> ?country2 } .
        OPTIONAL { %s <http://www.w3.org/2000/01/rdf-schema#label> ?label2 .
                   FILTER (lang(?label2) = 'en') }
      } LIMIT 30""";
}
```
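
The template substitution used by methods like **cityResults** can be tried in isolation, without calling DBPedia. The following sketch is a variant, not the code used in KGN: the class **TemplateSketch** and its shortened template are hypothetical, and it uses the indexed placeholder **%1$s** so that a single argument fills every slot instead of passing the entity URI once per placeholder.

```java
public class TemplateSketch {

    // Hypothetical, shortened template (not one of the EntityDetail templates).
    // %1$s refers to the first format argument, so one argument fills every slot.
    static final String TEMPLATE =
        "SELECT ?comment ?label {\n"
        + "  %1$s <http://www.w3.org/2000/01/rdf-schema#comment> ?comment .\n"
        + "  OPTIONAL { %1$s <http://www.w3.org/2000/01/rdf-schema#label> ?label }\n"
        + "} LIMIT 5";

    /** Substitute the entity URI (including its angle brackets) into the template. */
    public static String fill(String entityUri) {
        return String.format(TEMPLATE, entityUri);
    }

    public static void main(String[] args) {
        // Prints the query text only; nothing is sent to a SPARQL endpoint.
        System.out.println(fill("<http://dbpedia.org/resource/Seattle>"));
    }
}
```

Printing the generated query like this is a quick way to check a new template before running it against a live endpoint.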
The class **EntityRelationships** in the next listing is used to find property relationships between two entity URIs. The **FILTER** in the query prevents matching statements where the property URI contains the string "wikiPage" (ignoring case), which removes properties such as DBPedia's wiki page links from the results. This class would need to be rewritten to handle, for example, the WikiData Knowledge Base instead of the DBPedia Knowledge Base. This class uses the **JenaApis** library developed in the chapter *Semantic Web*. The class **Sparql** that we will look at later wraps the use of the **JenaApis** library.

{lang="java",linenos=on}
```java
package com.knowledgegraphnavigator;

import com.markwatson.semanticweb.QueryResult;

import java.sql.SQLException;

public class EntityRelationships {

  public static QueryResult results(Sparql endpoint,
                                    String entity1Uri, String entity2Uri)
      throws SQLException, ClassNotFoundException {
    var query = """
        SELECT ?p WHERE {
          %s ?p %s .
          FILTER (!regex(str(?p), 'wikiPage', 'i'))
        } LIMIT 10
        """.formatted(entity1Uri, entity2Uri);
    return endpoint.query(query);
  }
}
```
The class **Log** in the next listing defines a shorthand **out** for calling **System.out.println**, an instance of **StringBuilder** for storing all generated SPARQL queries made to DBPedia, and a utility method for clearing the stored SPARQL queries. We use this cache of SPARQL queries to support the interactive command "sparql" in the **KGN** application. We saw this command in an earlier example, where it displayed the cached SPARQL queries that use **DISTINCT** and **GROUP_CONCAT**.

{lang="java",linenos=on}
```java
package com.knowledgegraphnavigator;

public class Log {

  public static void out(String s) {
    System.out.println(s);
  }

  /** Accumulated SPARQL queries for inspection.
   *  Note: not thread-safe (single-threaded demo). */
  public static final StringBuilder sparql = new StringBuilder();

  public static void clearSparql() {
    sparql.delete(0, sparql.length());
  }
}
```
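
The record, display, and clear cycle behind the "sparql" command can be sketched on its own. The class **QueryLogSketch** and its method names are hypothetical stand-ins, not the **Log** class itself. It uses **setLength(0)**, which empties a **StringBuilder** just like the **delete** call above.

```java
public class QueryLogSketch {

    // Single shared buffer, mirroring the static StringBuilder in Log.
    private static final StringBuilder sparql = new StringBuilder();

    /** Remember one query, followed by a blank line as a separator. */
    public static void record(String query) {
        sparql.append(query).append("\n\n");
    }

    /** Return everything recorded since the last clear. */
    public static String dump() {
        return sparql.toString();
    }

    /** Empty the buffer; equivalent to delete(0, length()). */
    public static void clear() {
        sparql.setLength(0);
    }

    public static void main(String[] args) {
        record("SELECT ?p WHERE { ?s ?p ?o } LIMIT 1");
        record("SELECT ?o WHERE { ?s ?p ?o } LIMIT 1");
        System.out.print(dump()); // what the "sparql" command would show
        clear();                  // start fresh for the next user query
    }
}
```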
The class **PrintEntityResearchResults** in the next listing takes results from multiple DBPedia queries, formats the results, and displays them. The class is never instantiated (its constructor is private). All of the work is done in the static method **printResults**, which requires the arguments:

- Sparql endpoint - we will look at the definition of class **Sparql** in the next section.
- List<EntityAndDescription> people - a list of person names and URIs.
- List<EntityAndDescription> companies - a list of company names and URIs.
- List<EntityAndDescription> cities - a list of city names and URIs.
- List<EntityAndDescription> countries - a list of country names and URIs.

I define static string values for a few ANSI terminal escape sequences for changing the default color of text written to a terminal. If you are running on Windows you may need to set the values of **RESET**, **GREEN**, **YELLOW**, **PURPLE**, and **CYAN** to empty strings "".

{lang="java",linenos=on}
```java
package com.knowledgegraphnavigator;

import static com.knowledgegraphnavigator.Log.out;
import static com.knowledgegraphnavigator.Utils.removeBrackets;

import java.sql.SQLException;
import java.util.List;

public class PrintEntityResearchResults {

  /**
   * Note for Windows users: the Windows console may not render the following
   * ANSI terminal escape sequences correctly. If you have problems, just
   * change the following to the empty string "":
   */
  public static final String RESET = "\u001B[0m"; // ANSI characters for styling
  public static final String GREEN = "\u001B[32m";
  public static final String YELLOW = "\u001B[33m";
  public static final String PURPLE = "\u001B[35m";
  public static final String CYAN = "\u001B[36m";

  private PrintEntityResearchResults() { }

  /**
   * Print detailed research results for each entity category.
   * Extracted from constructor to avoid side-effects in object construction.
   */
  public static void printResults(Sparql endpoint,
                                  List<EntityAndDescription> people,
                                  List<EntityAndDescription> companies,
                                  List<EntityAndDescription> cities,
                                  List<EntityAndDescription> countries)
      throws SQLException, ClassNotFoundException {
    out("\n" + GREEN + "Individual People:\n" + RESET);
    for (var person : people) {
      out(" " + GREEN + String.format("%-25s", person.entityName()) +
          PURPLE + " : " + removeBrackets(person.entityUri()) + RESET);
      out(EntityDetail.personAsString(endpoint, person.entityUri()));
    }
    out("\n" + CYAN + "Individual Companies:\n" + RESET);
    for (var company : companies) {
      out(" " + CYAN + String.format("%-25s", company.entityName()) +
          YELLOW + " : " + removeBrackets(company.entityUri()) + RESET);
      out(EntityDetail.companyAsString(endpoint, company.entityUri()));
    }
    out("\n" + GREEN + "Individual Cities:\n" + RESET);
    for (var city : cities) {
      out(" " + GREEN + String.format("%-25s", city.entityName()) +
          PURPLE + " : " + removeBrackets(city.entityUri()) + RESET);
      out(EntityDetail.cityAsString(endpoint, city.entityUri()));
    }
    out("\n" + GREEN + "Individual Countries:\n" + RESET);
    for (var country : countries) {
      out(" " + GREEN + String.format("%-25s", country.entityName()) +
          PURPLE + " : " + removeBrackets(country.entityUri()) + RESET);
      out(EntityDetail.countryAsString(endpoint, country.entityUri()));
    }
    out("");
  }
}
```
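
Instead of editing the color constants by hand, the colors could be switched off at run time. The following is a minimal sketch under stated assumptions: the class **AnsiSketch**, the method names, and the detection heuristic (skip colors on Windows or when the **NO_COLOR** environment variable is set, a common community convention) are my own illustration and are not part of KGN. Many current Windows terminals do render ANSI colors, so treat the heuristic as a starting point only.

```java
import java.util.Locale;

public class AnsiSketch {

    public static final String RESET = "\u001B[0m";
    public static final String GREEN = "\u001B[32m";

    /** Wrap text in a color code plus RESET, or return it unchanged when colors are off. */
    public static String colorize(String text, String color, boolean useColor) {
        return useColor ? color + text + RESET : text;
    }

    /** Heuristic (an assumption): no colors on Windows or when NO_COLOR is set. */
    public static boolean colorsLikelySupported() {
        String os = System.getProperty("os.name", "").toLowerCase(Locale.ROOT);
        return !os.startsWith("windows") && System.getenv("NO_COLOR") == null;
    }

    public static void main(String[] args) {
        boolean useColor = colorsLikelySupported();
        System.out.println(colorize("Individual People:", GREEN, useColor));
    }
}
```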
The class **Sparql** in the next listing wraps the **JenaApis** library from the chapter *Semantic Web*. I set the SPARQL endpoint for DBPedia in the constant **ENDPOINT**. The WikiData SPARQL endpoint is left in the code, commented out, just above it. The KGN application will not work with WikiData without some modifications. If you enjoy experimenting with KGN then you might want to clone it and enable it to work simultaneously with DBPedia, WikiData, and local RDF files by using three instances of the class **JenaApis**.

Notice that we statically import the **StringBuilder** **com.knowledgegraphnavigator.Log.sparql**. We will use this for storing SPARQL queries for display to the user.

{lang="java",linenos=on}
```java
package com.knowledgegraphnavigator;

import com.markwatson.semanticweb.QueryResult;
import com.markwatson.semanticweb.JenaApis;
import static com.knowledgegraphnavigator.Log.sparql;
import static com.knowledgegraphnavigator.Log.out;

import java.sql.SQLException;

public class Sparql {
  //static private String endpoint = "https://query.wikidata.org/bigdata/namespace/wdq/sparql";
  private static final String ENDPOINT = "https://dbpedia.org/sparql";
  private final JenaApis jenaApis;

  public Sparql() {
    this.jenaApis = new JenaApis();
  }

  public QueryResult query(String sparqlQuery)
      throws SQLException, ClassNotFoundException {
    //out(sparqlQuery); // debug for now...
    sparql.append(sparqlQuery); // remember the query for the "sparql" command
    sparql.append("\n\n");
    return jenaApis.queryRemote(ENDPOINT, sparqlQuery);
  }

  public static void main(String[] args) throws Exception {
    var sp = new Sparql();
    QueryResult qr = sp.query("select ?s ?p ?o where { ?s ?p ?o } limit 5");
    out(qr.toString());
  }
}
```
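
For readers curious about what a remote SPARQL call looks like on the wire, the SPARQL 1.1 Protocol allows a query to be sent as an HTTP GET with the query text URL-encoded in a **query** parameter. The sketch below only builds that request URI and sends nothing. It is not how **JenaApis** is implemented (Jena handles these details internally), and the class name **SparqlUrlSketch** is hypothetical.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SparqlUrlSketch {

    // DBPedia endpoint used by the Sparql class above.
    static final String ENDPOINT = "https://dbpedia.org/sparql";

    /** Build the GET request URI for a query: endpoint?query=<url-encoded query>. */
    public static URI requestUri(String endpoint, String sparqlQuery) {
        String encoded = URLEncoder.encode(sparqlQuery, StandardCharsets.UTF_8);
        return URI.create(endpoint + "?query=" + encoded);
    }

    public static void main(String[] args) {
        // Prints the URI only; no network request is made here.
        System.out.println(requestUri(ENDPOINT, "select ?s where { ?s ?p ?o }"));
    }
}
```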
The class **Utils** contains one utility method **removeBrackets** that is used to convert a URI in SPARQL RDF statement form:

{linenos=off}
```
<http://dbpedia.org/resource/Seattle>
```

to:

{linenos=off}
```
http://dbpedia.org/resource/Seattle
```

The single method **removeBrackets** is only used in the class **PrintEntityResearchResults**.

{lang="java",linenos=on}
```java
package com.knowledgegraphnavigator;

public class Utils {

  /** Strip the first and last characters of a URI that starts with "<". */
  public static String removeBrackets(String s) {
    if (s.startsWith("<")) return s.substring(1, s.length() - 1);
    return s;
  }
}
```
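
The version of **removeBrackets** above assumes that any string starting with "<" also ends with ">". A slightly more defensive variant is sketched below. The class name **BracketSketch** is hypothetical and this is an optional alternative, not the code used in KGN.

```java
public class BracketSketch {

    /** Strip the surrounding angle brackets only when both are present. */
    public static String removeBrackets(String s) {
        if (s != null && s.length() >= 2 && s.startsWith("<") && s.endsWith(">")) {
            return s.substring(1, s.length() - 1);
        }
        return s; // null, bare URIs, and malformed input are returned unchanged
    }

    public static void main(String[] args) {
        System.out.println(removeBrackets("<http://dbpedia.org/resource/Seattle>"));
        System.out.println(removeBrackets("http://dbpedia.org/resource/Seattle"));
    }
}
```

With this variant an input consisting of a lone "<" is returned as is, where the shorter version would throw a **StringIndexOutOfBoundsException**.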
Finally we get to the main program implemented in the class **KGN**. The interactive program runs in the class constructor: a **while** loop accepts text input from the user and exits on empty input. For each line of input the method **processQuery** detects entity names and the corresponding entity types in the input text, and uses the Java classes we just looked at to find information on DBPedia for the entities in the input text as well as finding relations between these entities. Instead of entering a list of entity names the user can also enter either of the commands *sparql* (which we saw earlier in an example) or *demo* (to use a randomly chosen example query).

We use the class **TextToDbpediaUris** at the start of **processQuery** to get the entity names and types found in the input text. You can refer back to chapter *Resolve Entity Names to DBPedia References* for details on using the class **TextToDbpediaUris**.

The calls to **buildEntityList** store entity details that are displayed by calling **PrintEntityResearchResults.printResults**. The nested loops over person entities call **EntityRelationships.results** to look for relationships between two different person URIs. The same operation is done in the next pair of nested loops to find relationships between people and companies. The final nested loops find relationships between different company entities.

The static method **main** simply creates an instance of class **KGN**, which has the side effect of running the example KGN program.

{lang="java",linenos=on}
```java
package com.knowledgegraphnavigator;

import com.markwatson.ner_dbpedia.TextToDbpediaUris;
import com.markwatson.semanticweb.QueryResult;

import static com.knowledgegraphnavigator.Log.out;
import static com.knowledgegraphnavigator.Log.sparql;
import static com.knowledgegraphnavigator.Log.clearSparql;

import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.concurrent.ThreadLocalRandom;

public class KGN {

  private static final List<String> DEMOS_LIST = List.of(
      "Bill Gates and Melinda Gates worked at Microsoft",
      "IBM opened an office in Canada",
      "Steve Jobs worked at Apple Computer and visited IBM and Microsoft in Seattle");

  /** Single Scanner instance to avoid resource leaks from repeated System.in wrapping. */
  private final Scanner consoleScanner = new Scanner(System.in);

  public KGN() throws Exception {
    var endpoint = new Sparql();
    while (true) {
      String query = getUserQueryFromConsole();
      if (query == null || query.isBlank()) {
        out("Exiting KGN.");
        break;
      }
      out("\nProcessing query:\n" + query + "\n");
      if (query.equalsIgnoreCase("sparql")) {
        out("Generated SPARQL used to get current results:\n");
        out(sparql.toString());
        out("\n");
        clearSparql();
      } else {
        processQuery(endpoint, query);
      }
    }
  }

  private void processQuery(Sparql endpoint, String query) throws Exception {
    if (query.equalsIgnoreCase("demo")) {
      query = DEMOS_LIST.get(ThreadLocalRandom.current().nextInt(DEMOS_LIST.size()));
    }
    var kt = new TextToDbpediaUris(query);

    var userSelectedPeople = buildEntityList(kt.personNames, kt.personUris);
    var userSelectedCompanies = buildEntityList(kt.companyNames, kt.companyUris);

    if (!kt.cityNames.isEmpty()) {
      out("+++++ kt.cityNames:" + kt.cityNames.toString());
    }
    var userSelectedCities = buildEntityList(kt.cityNames, kt.cityUris);

    if (!kt.countryNames.isEmpty()) {
      out("+++++ kt.countryNames:" + kt.countryNames.toString());
    }
    var userSelectedCountries = buildEntityList(kt.countryNames, kt.countryUris);

    PrintEntityResearchResults.printResults(endpoint,
        userSelectedPeople,
        userSelectedCompanies,
        userSelectedCities,
        userSelectedCountries);

    // Relationships between pairs of different people (both directions)
    for (var person1 : userSelectedPeople) {
      for (var person2 : userSelectedPeople) {
        if (person1 != person2) {
          QueryResult qr = EntityRelationships.results(endpoint,
              person1.entityUri(), person2.entityUri());
          if (!qr.rows.isEmpty()) {
            out("Relationships between person " + person1.entityName() +
                " person " + person2.entityName() + ":");
            out(qr.toString());
          }
        }
      }
    }
    // Bill Gates, Melinda Gates and Steve Jobs at Apple Computer, IBM and Microsoft in Seattle
    for (var person : userSelectedPeople) {
      for (var company : userSelectedCompanies) {
        QueryResult qr = EntityRelationships.results(endpoint,
            person.entityUri(), company.entityUri());
        if (!qr.rows.isEmpty()) {
          out("Relationships between person " + person.entityName() +
              " company " + company.entityName() + ":");
          out(qr.toString());
        }
      }
    }
    // Relationships between pairs of different companies (both directions)
    for (var company1 : userSelectedCompanies) {
      for (var company2 : userSelectedCompanies) {
        if (company1 != company2) {
          QueryResult qr = EntityRelationships.results(endpoint,
              company1.entityUri(), company2.entityUri());
          if (!qr.rows.isEmpty()) {
            out("Relationships between company " + company1.entityName() +
                " company " + company2.entityName() + ":");
            out(qr.toString());
          }
        }
      }
    }
  }

  /** Build a list of EntityAndDescription from parallel name/URI lists. */
  private static List<EntityAndDescription> buildEntityList(List<String> names,
                                                            List<String> uris) {
    var result = new ArrayList<EntityAndDescription>();
    for (int i = 0; i < names.size(); i++) {
      result.add(new EntityAndDescription(names.get(i), uris.get(i)));
    }
    return result;
  }

  private String getUserQueryFromConsole() {
    out("Enter entities query:");
    if (consoleScanner.hasNextLine()) {
      return consoleScanner.nextLine();
    }
    return "";
  }

  public static void main(String[] args) throws Exception {
    new KGN();
  }
}
```
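
The nested loops over people and over companies visit every ordered pair of different entities. Both directions are queried because an RDF triple is directed: a property linking A to B says nothing about a property linking B to A. The following sketch isolates that pairing logic. The class **PairsSketch** is hypothetical, and it works on plain strings rather than **EntityAndDescription** values.

```java
import java.util.ArrayList;
import java.util.List;

public class PairsSketch {

    /** Every ordered pair (a, b) taken from two different list positions. */
    public static List<String[]> orderedPairs(List<String> items) {
        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i < items.size(); i++) {
            for (int j = 0; j < items.size(); j++) {
                if (i != j) {
                    pairs.add(new String[] { items.get(i), items.get(j) });
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] pair : orderedPairs(List.of("IBM", "Microsoft", "Apple Computer"))) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

For n entities of one type this produces n * (n - 1) pairs, so the number of relationship queries sent to DBPedia grows quadratically with the number of entities in the input text.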
This KGN example was hopefully both interesting to you and simple enough in its implementation (because we relied heavily on code from the last two chapters) that you feel comfortable modifying it and reusing it as a part of your own Java applications.


## Wrap-up

If you enjoyed running and experimenting with this example and want to modify it for your own projects then I hope that I provided a sufficient road map for you to do so.

I suggest further projects that you might want to try implementing with this example:

- Write a web application that processes news stories and annotates them with additional data from DBPedia and/or WikiData.
- In a web or desktop application, detect entities in text and display additional information when the user's mouse cursor hovers over a word or phrase that is identified as an entity found in DBPedia or WikiData.
- Clone this KGN example and enable it to work simultaneously with DBPedia, WikiData, and local RDF files by using three instances of the class **JenaApis** and in the main application loop access all three data sources.

I had the idea for the KGN application because I was spending quite a bit of time manually setting up SPARQL queries for DBPedia (and other public sources like WikiData) and I wanted to experiment with partially automating this process. I have experimented with versions of KGN written in Java, Hy language ([Lisp running on Python that I wrote a short book on](https://leanpub.com/hy-lisp-python/read)), Swift, and Common Lisp and all four implementations take different approaches as I experimented with different ideas. You might want to check out my [web site devoted to different versions of KGN: www.knowledgegraphnavigator.com](http://www.knowledgegraphnavigator.com/).


Conclusions
===========

The material in this book was informed by my own work experience
designing systems and writing software for artificial intelligence and information processing. If
you enjoyed reading this book and you make practical use of at least some of
the material I covered, then I consider my effort to be worthwhile.

Writing software is a combination of a business activity, promoting good
for society, and an exploration to try out new ideas for self
improvement. I believe that there is sometimes a fine line between
spending too many resources tracking many new technologies versus
getting stuck using old technologies at the expense of lost
opportunities. My hope is that reading this book was an efficient and
pleasurable use of your time, letting you try some new techniques and
technologies that you had not considered before.

When we do expend resources to try new things it is almost always best
to perform many small experiments and then dig deeper into areas that
have a good chance of providing high value and capturing your interest.

Fail fast is a common meme but failure that we do not learn from is a
waste.

I have been using the Java platform from the very beginning and although
I also use many other programming languages in my work and studies, both
the Java language and the Java platform provide high efficiency, scalability,
many well-trained developers, and a wealth of existing infrastructure
software and libraries. Investment in Java development also pays when
using alternative JVM languages like JRuby, Scala, and Clojure.

If we never get to meet in person or talk on the telephone, then I would
like to thank you now for taking the time to read my book.