Linked Data and the Semantic Web

I am going to show you how to query semantic web data sources on the web and provide examples of how you might use this data in applications. I have written two previous books on the semantic web, one covering Common Lisp and the other covering the JVM languages Java, Scala, Clojure, and Ruby. You can read these eBooks online for free on my Leanpub author’s page. If you enjoy the light introduction in this chapter, then please do read my other eBooks, which cover RDF, RDFS, and SPARQL in more detail.

I like to think of the semantic web and linked data resources as:

  • A source of structured data on the web. The servers that expose this data for querying are called SPARQL endpoints.
  • Data represented as triples: subject, predicate, and object. The subject of one triple can be the object of another triple. Predicates are relationships; a few examples: “owns”, “is part of”, “author of”, etc.
  • Data that is accessed via the SPARQL query language.
  • A source of data that may or may not be available. SPARQL endpoints are typically free to use, and they are sometimes unavailable. Although not covered here, I sometimes work around this problem by adding a caching layer to SPARQL queries (the cache key being the SPARQL query string and the value being the query results). This caching speeds up development and unit test runs, and sometimes saves a customer demo when a required SPARQL endpoint goes offline at an inconvenient time.

DBPedia is the semantic web version of Wikipedia. The many millions of data triples that make up DBPedia are mostly derived from the structured “info boxes” on Wikipedia pages.

As you are learning SPARQL, use the DBPedia SPARQL endpoint to practice. As a practitioner who uses linked data, I start any new project by identifying SPARQL endpoints that might hold useful data. I then interactively experiment with SPARQL queries to extract the data I need. Only when I am satisfied with the choice of SPARQL endpoints and SPARQL queries do I write any code to automatically fetch linked data for my application.

Pro tip: I mentioned SPARQL query caching. I sometimes cache query results in a local database, saving the returned RDF data indexed by the SPARQL query. You can also store the cache timestamp and refresh the cache every few weeks as needed. In addition to making development and unit testing faster, your applications will be more resilient.
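
The caching idea can be sketched in a few lines. This is only a sketch under stated assumptions, not code from the book's repository: the names cachedQuery, Cache, and Seconds are hypothetical, the cache lives in memory rather than in a local database, time is passed in as plain seconds, and the fetch action is a parameter so that a fake one can stand in for the real HTTP call.

```haskell
module Main where

import qualified Data.Map.Strict as Map

-- Hypothetical types for this sketch: time is plain seconds, results are raw text.
type Seconds = Integer
type Cache   = Map.Map String (Seconds, String)

-- | Return a cached result if it is younger than maxAge; otherwise run the
-- supplied fetch action and store its result together with the current time.
cachedQuery :: Monad m
            => Seconds               -- maximum age of a cache entry
            -> Seconds               -- current time
            -> (String -> m String)  -- fetch action (for example an HTTP call)
            -> String                -- SPARQL query string, used as the cache key
            -> Cache
            -> m (String, Cache)
cachedQuery maxAge now fetch query cache =
  case Map.lookup query cache of
    Just (storedAt, result)
      | now - storedAt < maxAge -> return (result, cache)
    -- Missing or expired entry: fetch again and overwrite the old value.
    _ -> do
      result <- fetch query
      return (result, Map.insert query (now, result) cache)

main :: IO ()
main = do
  let fakeFetch q = do
        putStrLn ("fetching: " ++ q)
        return ("results for " ++ q)
      twoWeeks = 14 * 24 * 60 * 60
      query    = "select * where {?s ?p ?o} limit 1"
  (r1, cache1) <- cachedQuery twoWeeks 0 fakeFetch query Map.empty
  (r2, _)      <- cachedQuery twoWeeks 100 fakeFetch query cache1
  putStrLn r1
  putStrLn r2
```

In main the second call finds a fresh entry, so fakeFetch runs only once. A persistent version could keep the same interface and swap the Map for a database table.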

In the last chapter, “Natural Language Processing Tools”, we resolved entities in natural language text to DBPedia (the semantic web SPARQL endpoint for Wikipedia data) URIs. Here we will use some of these URIs to demonstrate fetching real world knowledge that you might want to use in applications.

The SPARQL Query Language

Figure 18. SPARQL Client Architecture

An example RDF triple in N3 notation (subject, predicate, object) might look like:

<http://www.markwatson.com>
  <http://dbpedia.org/ontology/owner>
  "Mark Watson" .

Elements of triples can be URIs or string constants. Triples are often written on a single line; I split this one across three lines to fit the page width. Here the subject is the URI for my web site, the predicate is a URI defining an ownership relationship, and the object is a string literal.

If you want to see details for any property or other URI you see, then “follow your nose” and open the URI in a web browser. For example, remove the brackets from the owner property URI http://dbpedia.org/ontology/owner and open it in a web browser. For working with RDF data programmatically, it is convenient to use full URIs. For humans reading RDF, the N3 notation is better because it supports defining standard URI prefixes for use as abbreviations; for example:

prefix ontology: <http://dbpedia.org/ontology/>

<http://www.markwatson.com>
  ontology:owner
  "Mark Watson" .

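The prefix abbreviation is purely textual: a prefixed name stands for the prefix URI with the local name appended. A minimal sketch of that expansion follows, assuming a hypothetical helper expandTerm and a prefix table held in a Map. Real RDF parsers handle many more cases (escapes, relative URIs, the empty prefix) than this sketch does.

```haskell
module Main where

import qualified Data.Map.Strict as Map

-- Prefix label -> base URI, for example "ontology" -> "http://dbpedia.org/ontology/".
type Prefixes = Map.Map String String

-- | Expand "prefix:local" to a full URI in angle brackets.
-- Full URIs, string literals, variables, and unknown prefixes are returned unchanged.
expandTerm :: Prefixes -> String -> String
expandTerm prefixes term
  | take 1 term `elem` ["<", "\"", "?"] = term
  | otherwise =
      case break (== ':') term of
        (label, ':' : local) ->
          case Map.lookup label prefixes of
            Just base -> "<" ++ base ++ local ++ ">"
            Nothing   -> term
        _ -> term

main :: IO ()
main = do
  let prefixes = Map.fromList [("ontology", "http://dbpedia.org/ontology/")]
  -- Prints: <http://dbpedia.org/ontology/owner>
  putStrLn (expandTerm prefixes "ontology:owner")
  -- A string literal passes through untouched.
  putStrLn (expandTerm prefixes "\"Mark Watson\"")
```
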
If we wanted to find all things that I own (assuming this data was in a public RDF repository, which it isn’t), then we might match the pattern:

prefix ontology: <http://dbpedia.org/ontology/>

?subject ontology:owner "Mark Watson"

We would then return all URIs bound to the variable ?subject as the query result. This is the basic idea of making SPARQL queries.
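
To make the matching idea concrete, here is a small sketch that evaluates the same pattern over an in-memory list of triples. It is an illustration only: the Node and Triple types and the function subjectsWhere are hypothetical names, the second sample triple is made up, and a real triple store uses indexes rather than a linear scan.

```haskell
module Main where

-- A term in a stored triple: a URI or a string literal.
data Node = URI String | Literal String
  deriving (Eq, Show)

type Triple = (Node, Node, Node)

-- | Subjects of all triples whose predicate and object equal the given nodes.
-- This mirrors the pattern: ?subject <predicate> <object>
subjectsWhere :: Node -> Node -> [Triple] -> [Node]
subjectsWhere predicate object triples =
  [ s | (s, p, o) <- triples, p == predicate, o == object ]

owner :: Node
owner = URI "http://dbpedia.org/ontology/owner"

-- Sample data: the triple from the text plus one hypothetical triple.
sampleTriples :: [Triple]
sampleTriples =
  [ (URI "http://www.markwatson.com", owner, Literal "Mark Watson")
  , (URI "http://example.org/some-site", owner, Literal "Someone Else")
  ]

main :: IO ()
main =
  -- Prints: URI "http://www.markwatson.com"
  mapM_ print (subjectsWhere owner (Literal "Mark Watson") sampleTriples)
```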

The following SPARQL query will be implemented later in Haskell using the HSparql library:

prefix resource: <http://dbpedia.org/resource/>
prefix dbpprop: <http://dbpedia.org/property/>
prefix foaf: <http://xmlns.com/foaf/0.1/>

SELECT *
WHERE {
    ?s dbpprop:genre resource:Web_browser .
    ?s foaf:name ?name .
} LIMIT 5

In this last SPARQL query example, the triple patterns we are trying to match are inside a WHERE clause. Notice that in the two triple patterns, the subject field of each is the variable ?s. The first pattern matches all DBPedia triples with a predicate http://dbpedia.org/property/genre and an object equal to http://dbpedia.org/resource/Web_browser. We then find all triples with the same subject but with a predicate equal to http://xmlns.com/foaf/0.1/name.

Each result from this query will contain two values for variables ?s and ?name: a DBPedia URI for some thing and the name for that thing. Later we will run this query using Haskell code and you can see what the output might look like.
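
The way the shared variable ?s ties the two patterns together can be sketched as a join over an in-memory list of triples. This is a toy model under stated assumptions: the types, the function browserNames, and the example.org sample data are all hypothetical, and nothing here reflects how the DBPedia server evaluates queries.

```haskell
module Main where

data Node = URI String | Literal String
  deriving (Eq, Show)

type Triple = (Node, Node, Node)

genre, webBrowser, foafName :: Node
genre      = URI "http://dbpedia.org/property/genre"
webBrowser = URI "http://dbpedia.org/resource/Web_browser"
foafName   = URI "http://xmlns.com/foaf/0.1/name"

-- | Bind ?s with the first pattern, then reuse the same ?s in the second
-- pattern to pick up ?name. The limit plays the role of LIMIT 5.
browserNames :: Int -> [Triple] -> [(Node, Node)]
browserNames limit triples = take limit
  [ (s, name)
  | (s, p1, o1)    <- triples, p1 == genre, o1 == webBrowser
  , (s2, p2, name) <- triples, s2 == s, p2 == foafName
  ]

-- Hypothetical sample data; EditorB has a name but is not a web browser.
sampleTriples :: [Triple]
sampleTriples =
  [ (URI "http://example.org/BrowserA", genre, webBrowser)
  , (URI "http://example.org/BrowserA", foafName, Literal "Browser A")
  , (URI "http://example.org/EditorB", foafName, Literal "Editor B")
  ]

main :: IO ()
main = mapM_ print (browserNames 5 sampleTriples)
```

Only BrowserA satisfies both patterns, so the result holds a single (?s, ?name) pair.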

Sometimes when I am using a specific SPARQL query in an application, I don’t bother defining prefixes and just use URIs in the query. As an example, suppose I want to return the Wikipedia (or DBPedia) abstract for IBM. I might use a query such as:

select * where {
  <http://dbpedia.org/resource/IBM>
  <http://dbpedia.org/ontology/abstract>
  ?o .
  FILTER langMatches(lang(?o), "EN")
} LIMIT 100

If you try this query using the web interface for DBPedia SPARQL queries, you get just one result because the FILTER clause keeps only English language results. You could also use FR for French results, DE for German results, etc.
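
When the same query shape is reused for different resources or languages, it can be generated by a small function. The sketch below assumes a hypothetical helper abstractQuery and trusted inputs: it does no escaping or validation of the URI or the language tag.

```haskell
module Main where

-- | Build the abstract query for a resource URI and a language tag.
-- Assumes both arguments are trusted; nothing is escaped.
abstractQuery :: String -> String -> String
abstractQuery resourceUri langTag = unwords
  [ "select * where {"
  , "<" ++ resourceUri ++ ">"
  , "<http://dbpedia.org/ontology/abstract>"
  , "?o ."
  , "FILTER langMatches(lang(?o), \"" ++ langTag ++ "\")"
  , "} LIMIT 100"
  ]

main :: IO ()
main = do
  putStrLn (abstractQuery "http://dbpedia.org/resource/IBM" "EN")
  putStrLn (abstractQuery "http://dbpedia.org/resource/IBM" "FR")
```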

A Haskell HTTP Based SPARQL Client

One approach to querying the DBPedia SPARQL endpoint is to build an HTTP GET request, send it to the SPARQL endpoint server, and parse the returned XML response. We will start with this simple approach:

 1 {-# LANGUAGE OverloadedStrings #-}
 2 
 3 module Main where
 4 
 5 import Network.HTTP.Client
 6 import Network.HTTP.Client.TLS (tlsManagerSettings)
 7 import Network.HTTP.Base (urlEncode)
 8 import Network.HTTP.Types.Status (statusCode)
 9 import Text.XML.HXT.Core
10 import Text.HandsomeSoup
11 import qualified Data.ByteString.Lazy.Char8 as B
12 
13 prefixUrl :: String
14 prefixUrl = "https://dbpedia.org/sparql?format=xml&query="
15 
16 buildQuery :: String -> String
17 buildQuery sparqlString = prefixUrl ++ urlEncode sparqlString
18 
19 main :: IO ()
20 main = do
21   let url = buildQuery "select ?label where {<http://dbpedia.org/resource/IBM> <http://www.w3.org/2000/01/rdf-schema#label> ?label . FILTER langMatches(lang(?label), \"EN\")}"
22   manager <- newManager tlsManagerSettings
23   initialReq <- parseRequest url
24   let req = initialReq
25               { requestHeaders =
26                   [ ("User-Agent", "HaskellSparqlClient/1.0 (educational example)")
27                   , ("Accept",     "application/sparql-results+xml")
28                   ]
29               }
30   response <- httpLbs req manager
31   let status = statusCode (responseStatus response)
32   if status /= 200
33     then putStrLn $ "HTTP error: " ++ show status
34     else do
35       let body = responseBody response
36       let doc  = readString [] (B.unpack body)
37       putStrLn "\nIBM rdfs:labels:\n"
38       labels <- runX $ doc >>> css "binding" >>> (getAttrValue "name" &&& (deep getText))
39       if null labels
40         then putStrLn "(no results — check the SPARQL endpoint or query)"
41         else mapM_ print labels

The constant prefixUrl on line 14 specifies the DBPedia SPARQL endpoint URL with an XML format parameter. The function buildQuery on lines 16-17 takes any SPARQL query, URL encodes it, and appends it to the endpoint URL. In main, we create an HTTP manager with TLS settings (line 22), build a request with custom headers that set a User-Agent and ask for XML results (lines 24-29), and then execute the request (line 30). We check the HTTP status code and, if the request succeeded, parse the XML response. On lines 36-38 I use the HXT parsing library with the HandsomeSoup css selector function to extract bindings. I covered the use of the HandsomeSoup parsing library in the chapter Web Scraping.
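
URL encoding matters here because a SPARQL query contains spaces, braces, angle brackets, and question marks, none of which can appear raw in a query string. The sketch below shows the general idea of percent-encoding with a hypothetical function percentEncode. It is not the implementation of urlEncode from Network.HTTP.Base, and its exact set of unescaped characters is an assumption of this sketch.

```haskell
module Main where

import Data.Bits (shiftR, (.&.))
import Data.Char (isAlphaNum, isAscii, ord, toUpper)
import Numeric (showHex)

-- | The UTF-8 bytes of a single character.
utf8Bytes :: Char -> [Int]
utf8Bytes c
  | n < 0x80    = [n]
  | n < 0x800   = [0xC0 + (n `shiftR` 6), cont 0]
  | n < 0x10000 = [0xE0 + (n `shiftR` 12), cont 6, cont 0]
  | otherwise   = [0xF0 + (n `shiftR` 18), cont 12, cont 6, cont 0]
  where
    n      = ord c
    -- A continuation byte carries six bits of the code point.
    cont k = 0x80 + ((n `shiftR` k) .&. 0x3F)

-- | Render one byte as %XX with two uppercase hex digits.
percentByte :: Int -> String
percentByte b = '%' : pad (map toUpper (showHex b ""))
  where
    pad [d] = ['0', d]
    pad ds  = ds

-- | Keep ASCII letters, digits, and "-_.~"; percent-encode everything else.
percentEncode :: String -> String
percentEncode = concatMap encodeChar
  where
    encodeChar c
      | isAscii c && (isAlphaNum c || c `elem` "-_.~") = [c]
      | otherwise = concatMap percentByte (utf8Bytes c)

main :: IO ()
main =
  -- Prints: select%20%3Fs%20where%20%7B%3Fs%20%3Fp%20%3Fo%7D
  putStrLn (percentEncode "select ?s where {?s ?p ?o}")
```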

We use runX to execute a series of operations on an XML document (the doc variable). We first select all elements in doc whose tag name is binding using the css function (a bare name in a CSS selector matches an element name, not a class). Next we extract the value of the name attribute from each selected element using getAttrValue and also extract the text nested inside the element using deep getText. The &&& operator combines the name attribute value and the element text into a tuple.
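
The &&& operator comes from Control.Arrow and is not specific to HXT: it feeds one input to two arrows and pairs their outputs. Ordinary functions are arrows too, so the idea can be tried without any XML. In this sketch the Binding record is a hypothetical stand-in for a parsed binding element, and the sample value is made up.

```haskell
module Main where

import Control.Arrow ((&&&))

-- | A stand-in for a parsed <binding> element: its name attribute and its text.
data Binding = Binding { bindingName :: String, bindingText :: String }

-- | Pair up two views of the same input, in the same way that
-- getAttrValue "name" &&& deep getText pairs two views of one element.
nameAndText :: Binding -> (String, String)
nameAndText = bindingName &&& bindingText

main :: IO ()
main = do
  -- Prints: ("label","IBM")
  print (nameAndText (Binding "label" "IBM"))
  -- Prints: (3,"cba")
  print ((length &&& reverse) "abc")
```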

Querying Remote SPARQL Endpoints

In this section we will write code that runs the example query for the names of web browsers against DBPedia. In the last section we made a SPARQL query using fairly low level Haskell libraries. Here we will use the high level library HSparql to build SPARQL queries and call the DBPedia SPARQL endpoint.

The example in this section can be found in SparqlClient/TestSparqlClient.hs. Because Haskell is type safe, extracting the values wrapped in query results requires knowing RDF element return types. The code defines an extractBinding helper function that pattern-matches on the various RDF node types to extract a display string:

 1 -- simple experiments with the excellent HSparql library
 2 --
 3 -- HSparql DSL mapping to raw SPARQL:
 4 --   prefix "name" (iriRef url) => PREFIX name: <url>
 5 --   var                        => a fresh ?varN variable
 6 --   triple s p o               => s p o  (in the WHERE clause)
 7 --   resource .:. "Foo"         => name:Foo  (prefixed IRI)
 8 --   SelectQuery { queryVars }  => SELECT ?var1 ?var2 ...
 9 
10 {-# LANGUAGE OverloadedStrings #-}
11 
12 module Main where
13 
14 import Database.HSparql.Connection (BindingValue(..))
15 
16 import Data.RDF hiding (triple)
17 import Database.HSparql.QueryGenerator
18 import Database.HSparql.Connection (selectQuery)
19     
20 webBrowserSelect :: Query SelectQuery
21 webBrowserSelect = do
22     resource <- prefix "dbprop" (iriRef "http://dbpedia.org/resource/")
23     dbpprop  <- prefix "dbpedia" (iriRef "http://dbpedia.org/property/")
24     foaf     <- prefix "foaf" (iriRef "http://xmlns.com/foaf/0.1/")
25     x    <- var
26     name <- var
27     triple x (dbpprop .:. "genre") (resource .:. "Web_browser")
28     triple x (foaf .:. "name") name
29 
30     return SelectQuery { queryVars = [name] }
31 
32 companyAbstractSelect :: Query SelectQuery
33 companyAbstractSelect = do
34     resource <- prefix "dbprop" (iriRef "http://dbpedia.org/resource/")
35     ontology <- prefix "ontology" (iriRef "http://dbpedia.org/ontology/")
36     o <- var
37     triple (resource .:. "Edinburgh_University_Press") (ontology .:. "abstract") o
38     return SelectQuery { queryVars = [o] }
39 
40 companyTypeSelect :: Query SelectQuery
41 companyTypeSelect = do
42     resource <- prefix "dbprop" (iriRef "http://dbpedia.org/resource/")
43     ontology <- prefix "ontology" (iriRef "http://dbpedia.org/ontology/")
44     o <- var
45     triple (resource .:. "Edinburgh_University_Press") (ontology .:. "type") o
46     return SelectQuery { queryVars = [o] }
47 
48 -- | Extract a display string from a single binding row.
49 -- Handles the main RDF node types: language-tagged literals, plain literals,
50 -- typed literals, URI nodes, and blank nodes.
51 extractBinding :: [BindingValue] -> String
52 extractBinding [Bound (LNode (PlainLL s _))] = show s  -- language-tagged literal
53 extractBinding [Bound (LNode (PlainL s))]    = show s  -- plain literal
54 extractBinding [Bound (LNode (TypedL s _))]  = show s  -- typed literal
55 extractBinding [Bound (UNode s)]             = show s  -- URI node
56 extractBinding [Bound (BNode s)]             = "_:" ++ show s  -- blank node
57 extractBinding [Bound (BNodeGen i)]          = "_:b" ++ show i -- generated blank node
58 extractBinding [Unbound]                     = "(unbound)"
59 extractBinding _                             = "(unexpected binding shape)"
60 
61 main :: IO ()
62 main = do
63   -- companyAbstractSelect => SELECT ?o WHERE { dbprop:Edinburgh_University_Press ontology:abstract ?o }
64   sq1 <- selectQuery "http://dbpedia.org/sparql" companyAbstractSelect
65   putStrLn "\nAbstracts extracted from the company abstract query results:\n"
66   case sq1 of
67     Just a  -> mapM_ (putStrLn . extractBinding) a
68     Nothing -> putStrLn "No results returned."
69 
70   -- companyTypeSelect => SELECT ?o WHERE { dbprop:Edinburgh_University_Press ontology:type ?o }
71   sq2 <- selectQuery "http://dbpedia.org/sparql" companyTypeSelect
72   putStrLn "\nTypes extracted from the company type query results:\n"
73   case sq2 of
74     Just a  -> mapM_ (putStrLn . extractBinding) a
75     Nothing -> putStrLn "No results returned."
76 
77   -- webBrowserSelect => SELECT ?name WHERE { ?x dbpedia:genre dbprop:Web_browser . ?x foaf:name ?name }
78   sq3 <- selectQuery "http://dbpedia.org/sparql" webBrowserSelect
79   putStrLn "\nWeb browser names extracted from the query results:\n"
80   case sq3 of
81     Just a  -> mapM_ (putStrLn . extractBinding) a
82     Nothing -> putStrLn "No results returned."

Haskell Code for SPARQL Queries with HSparql

The Haskell code above demonstrates using the HSparql library to query a SPARQL endpoint (specifically, DBpedia) for linked data.

The comment block at the top of the file (lines 3-8) documents how the HSparql DSL maps to raw SPARQL syntax, which is helpful when learning the library.

SPARQL Query Definitions

The code begins by defining three SPARQL queries, each constructed using the Query monad provided by HSparql. These queries are:

  • webBrowserSelect:

    • This query aims to retrieve the names of entities categorized as web browsers.
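    • It uses prefixes to shorten the URIs within the query. The labels passed to prefix (“dbprop”, “dbpedia”, “foaf”) are arbitrary strings that appear only in the generated SPARQL; the Haskell variables resource, dbpprop, and foaf are what the rest of the query code uses.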
    • It selects entities (x) that have a “genre” property linking them to the concept of a “Web_browser” and then retrieves their “name.”
  • companyAbstractSelect:

    • This query targets information about the “Edinburgh University Press.”
    • It seeks to retrieve the “abstract” associated with this entity, which provides a concise summary or description.
  • companyTypeSelect:

    • Similar to the previous query, this one focuses on the “Edinburgh University Press” but retrieves its “type,” which indicates the category or class it belongs to within the DBpedia ontology.

The extractBinding Helper

The extractBinding function (lines 51-59) pattern-matches on the various RDF node types returned by HSparql to extract a display string. It handles language-tagged literals, plain literals, typed literals, URI nodes, blank nodes, unbound values, and unexpected shapes. This is more robust than inline pattern matching. Note that every pattern except the last matches a one-element list, so it covers rows from queries that select a single variable, as all three queries here do; a row with more than one column falls through to the final catch-all case.
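
If a query selected more than one variable, each row would hold several values. One way to cope is to render each value separately and join the columns. The sketch below uses stand-in types defined locally (Node, Value) so that it runs without HSparql installed; they are simplified assumptions and not the library's BindingValue or the rdf4h node types, and showRow is a hypothetical name.

```haskell
module Main where

import Data.List (intercalate)

-- Stand-in types for this sketch only; the real ones come from HSparql and rdf4h.
data Node  = UriNode String | LiteralNode String | BlankNode String
data Value = Bound Node | Unbound

-- | Render a single value from a result row.
showValue :: Value -> String
showValue (Bound (UriNode u))     = "<" ++ u ++ ">"
showValue (Bound (LiteralNode s)) = show s
showValue (Bound (BlankNode b))   = "_:" ++ b
showValue Unbound                 = "(unbound)"

-- | Render a row of any width as tab-separated columns.
showRow :: [Value] -> String
showRow = intercalate "\t" . map showValue

main :: IO ()
main = do
  putStrLn (showRow [Bound (UriNode "http://example.org/BrowserA"), Bound (LiteralNode "Browser A")])
  putStrLn (showRow [Bound (BlankNode "b0"), Unbound])
```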

main Function

The main function serves as the entry point of the program. It performs the following actions:

  1. Query Execution: It executes each of the defined SPARQL queries against the DBpedia SPARQL endpoint using the selectQuery function. This function returns the query results wrapped in a Maybe type to handle potential query failures.

  2. Result Processing: The code uses mapM_ with extractBinding to process and print each binding row. It handles both successful query results (Just a) and potential query failures (Nothing).

  3. Output: The extracted information is printed to the console, with each result on its own line.
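
The three case expressions in main share one shape, so they could be folded into a helper. The following is a sketch of that refactoring under stated assumptions: reportResults is a hypothetical name, and it is written against generic row types so that it runs without HSparql. With the real library the render argument would be extractBinding.

```haskell
module Main where

-- | Turn a Maybe-wrapped result set into lines of output: a title line
-- followed by one line per row, or a notice when the query failed.
reportResults :: String -> (row -> String) -> Maybe [row] -> [String]
reportResults title _      Nothing     = [title, "No results returned."]
reportResults title render (Just rows) = title : map render rows

main :: IO ()
main = do
  -- Integers stand in for binding rows in this sketch.
  mapM_ putStrLn (reportResults "Types:" show (Just [1 :: Int, 2]))
  mapM_ putStrLn (reportResults "Abstracts:" show (Nothing :: Maybe [Int]))
```

Keeping the helper pure (it returns lines instead of printing) makes it easy to unit test without calling a SPARQL endpoint.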

Summary

In summary, this Haskell code is a practical example of using the HSparql library to query a SPARQL endpoint (DBpedia) and process structured data from the Semantic Web. It demonstrates constructing SPARQL queries, executing them, and handling and presenting the query results.

The output from this example with three queries to the DBPedia SPARQL endpoint will show abstracts for Edinburgh University Press, its type(s), and names of web browsers, each printed on its own line.

Linked Data and Semantic Web Wrap Up

If you enjoyed the material on linked data and DBPedia, then please do get a free copy of one of my semantic web books from the book page on my website, and look at other SPARQL and linked data tutorials on the web.

Structured and semantically labelled data, when it is available, is much easier to process and use effectively than raw text and HTML collected from web sites.