
Linked Data, the Semantic Web, and Knowledge Graphs

In 2001, Tim Berners-Lee, James Hendler, and Ora Lassila wrote an article for Scientific American that introduced the term Semantic Web. In this book I do not capitalize semantic web, and I use the similar term linked data somewhat interchangeably with it. Most current work with these technologies involves building corporate Knowledge Graphs. I worked with the Knowledge Graph at Google in 2013 and with the Knowledge Graph team at Olive AI during 2020-2021.

In this chapter we will only use the Hy REPL, started in the directory hy-lisp-python-book/source_code_for_examples/rdf:

$ cd hy-lisp-python-book/source_code_for_examples/rdf
$ uv sync
$ uv run hy
Hy 1.1.0 (Business Hugs) using CPython(main) 3.12.0 on Darwin
=>

In the same way that the web allows links between related web pages, linked data supports linking associated data on the web together. I view linked data as a relatively simple way to specify relationships between data sources on the web, while the semantic web has a much larger vision: it has the potential to represent the entirety of human knowledge as data on the web, in a form that software agents can use to answer questions, perform research, and infer new data from existing data.

While the “web” describes information for human readers, the semantic web is meant to provide structured data for ingestion by software agents. This distinction will become clear as we compare Wikipedia, made for human readers, with DBpedia, which uses the info boxes on Wikipedia topic pages to automatically extract RDF data describing those topics. Let’s look at the Wikipedia topic for the town I live in, Sedona, Arizona, and show how the info box on the English version of the Wikipedia topic page for Sedona, https://en.wikipedia.org/wiki/Sedona,_Arizona, maps to the DBpedia page http://dbpedia.org/page/Sedona,_Arizona. Please open both the Wikipedia and DBpedia URIs in two browser tabs and keep them open for reference.

I assume that the format of the Wikipedia page is familiar, so let’s look at the DBpedia page for Sedona, which shows in human-readable form the RDF statements that have Sedona, Arizona as their subject. RDF is used to model and represent data. An RDF statement consists of three values, so an instance of a statement is called a triple, with three parts:

subject: a URI (also referred to as a “Resource”)
property: a URI (also referred to as a “Resource”)
value: a URI (also referred to as a “Resource”) or a literal value (like a string)

The subject of each Sedona-related triple is the DBpedia resource URI http://dbpedia.org/resource/Sedona,_Arizona, which the human-readable page above describes. The subject and property references in an RDF triple will almost always be URIs that can ground an entity to information on the web. The human-readable page for Sedona lists several properties and the values of these properties. One of the properties is “dbo:areaCode”, where “dbo” is a namespace prefix (areaCode is, in this case, a DatatypeProperty).

The following two figures show an abstract representation of linked data and then a sample of linked data with actual web URIs for resources and properties:

Abstract RDF representation with 2 Resources, 2 literal values, and 3 Properties

Concrete example using RDF seen in last chapter showing the RDF representation with 2 Resources, 2 literal values, and 3 Properties

We saw a SPARQL Query (SPARQL for RDF data is similar to SQL for relational database queries) in the last chapter. Let’s look at another example using the RDF in the last figure:

    select ?v where {  <http://markwatson.com/index.rdf#Sun_ONE>
                       <http://www.ontoweb.org/ontology/1#booktitle>
                       ?v }

This query should return the result “Sun ONE Services - J2EE”. If you wanted to query for all URI resources that are books with the literal value of their titles, then you can use:

    select ?s ?v where {  ?s
                          <http://www.ontoweb.org/ontology/1#booktitle>
                          ?v }

Note that ?s and ?v are arbitrary query variable names, here standing for “subject” and “value”. You can use more descriptive variable names like:

    select ?bookURI ?bookTitle where
        { ?bookURI
          <http://www.ontoweb.org/ontology/1#booktitle>
          ?bookTitle }

We will be diving a little deeper into RDF examples in the next chapter when we write a tool for generating RDF data from raw text input. For now I want you to understand the idea of RDF statements represented as triples, that web URIs represent things, properties, and sometimes values, and that URIs can be followed manually (often called “dereferencing”) to see what they reference in human readable form.

Understanding the Resource Description Framework (RDF)

Text data on the web has some structure in the form of HTML elements like headers, page titles, anchor links, etc. but this structure is too imprecise for general use by software agents. RDF is a method for encoding structured data in a more precise way.

We used the RDF data on my web site in the last chapter to introduce the “plumbing” of using the rdflib Python library to access, manipulate, and query RDF data.

Resource Namespaces Provided in rdflib

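The following standard namespaces are predefined in rdflib. The RDF and RDFS entries link to the W3C specification documents rather than to the namespace URIs themselves: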

RDF https://www.w3.org/TR/rdf-syntax-grammar/
RDFS https://www.w3.org/TR/rdf-schema/
OWL http://www.w3.org/2002/07/owl#
XSD http://www.w3.org/2001/XMLSchema#
FOAF http://xmlns.com/foaf/0.1/
SKOS http://www.w3.org/2004/02/skos/core#
DOAP http://usefulinc.com/ns/doap#
DC http://purl.org/dc/elements/1.1/
DCTERMS http://purl.org/dc/terms/
VOID http://rdfs.org/ns/void#

Let’s look into the Friend of a Friend (FOAF) namespace. Open the FOAF link above, http://xmlns.com/foaf/0.1/, and find the definitions for the FOAF Core:

    Agent
    Person
    name
    title
    img
    depiction (depicts)
    familyName
    givenName
    knows
    based_near
    age
    made (maker)
    primaryTopic (primaryTopicOf)
    Project
    Organization
    Group
    member
    Document
    Image

and for the Social Web:

    nick
    mbox
    homepage
    weblog
    openid
    jabberID
    mbox_sha1sum
    interest
    topic_interest
    topic (page)
    workplaceHomepage
    workInfoHomepage
    schoolHomepage
    publications
    currentProject
    pastProject
    account
    OnlineAccount
    accountName
    accountServiceHomepage
    PersonalProfileDocument
    tipjar
    sha1
    thumbnail
    logo

You have now seen a few common schemas for RDF data. Another widely used schema for annotating web sites, which we won’t need for our examples here, is schema.org. Let’s now use a Hy REPL session to explore namespaces and programmatically create RDF using rdflib:

Marks-MacBook:database $ uv run hy
Hy 1.1.0 (Business Hugs) using CPython(main) 3.12.0 on Darwin
=> (import rdflib.namespace [FOAF])
=> FOAF
Namespace('http://xmlns.com/foaf/0.1/')
=> FOAF.name
rdflib.term.URIRef('http://xmlns.com/foaf/0.1/name')
=> FOAF.title
rdflib.term.URIRef('http://xmlns.com/foaf/0.1/title')
=> (import rdflib)
=> (setv graph (rdflib.Graph))
=> (setv mark (rdflib.BNode))
=> (graph.bind "foaf" FOAF)
=> (import rdflib [RDF])
=> (graph.add [mark RDF.type FOAF.Person])
=> (graph.add [mark FOAF.nick (rdflib.Literal "Mark" :lang "en")])
=> (graph.add [mark FOAF.name (rdflib.Literal "Mark Watson" :lang "en")])
=> (for [node graph] (print node))
(rdflib.term.BNode('N21c7fa7385b545eb8a7e3821b75b9cb5'), rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), rdflib.term.URIRef('http://xmlns.com/foaf/0.1/Person'))
(rdflib.term.BNode('N21c7fa7385b545eb8a7e3821b75b9cb5'), rdflib.term.URIRef('http://xmlns.com/foaf/0.1/name'), rdflib.term.Literal('Mark Watson', lang='en'))
(rdflib.term.BNode('N21c7fa7385b545eb8a7e3821b75b9cb5'), rdflib.term.URIRef('http://xmlns.com/foaf/0.1/nick'), rdflib.term.Literal('Mark', lang='en'))
=> (graph.serialize :format "pretty-xml")
b'<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
>
  <foaf:Person rdf:nodeID="N21c7fa7385b545eb8a7e3821b75b9cb5">
    <foaf:name xml:lang="en">Mark Watson</foaf:name>
    <foaf:nick xml:lang="en">Mark</foaf:nick>
  </foaf:Person>
</rdf:RDF>\n'
=> (graph.serialize :format "turtle")
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

[] a foaf:Person ;
     foaf:name "Mark Watson"@en ;
     foaf:nick "Mark"@en .

=> (graph.serialize :format "nt")
_:N21c7fa7385b545eb8a7e3821b75b9cb5 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
_:N21c7fa7385b545eb8a7e3821b75b9cb5 <http://xmlns.com/foaf/0.1/name> "Mark Watson"@en .
_:N21c7fa7385b545eb8a7e3821b75b9cb5 <http://xmlns.com/foaf/0.1/nick> "Mark"@en .
=>

Understanding the SPARQL Query Language

For the purposes of the material in this book, the two sample SPARQL queries here and in the last chapter are sufficient for you to get started using rdflib with arbitrary RDF data sources and simple queries.

The Apache Foundation has a good introduction to SPARQL that I refer you to for more information.

Wrapping the Python rdflib Library

I hope that I have provided you with enough motivation to explore RDF data sources and consider the use of linked data/semantic web technologies for your projects.

If I depend on a library, regardless of the programming language, I like to keep an up-to-date copy of the source code ready at hand. There is sometimes no substitute for having library code available to read.
