Clojure Wrapper for the Jena RDF and SPARQL Library

If you read through the optional background material in the last chapter, you have some understanding of RDF data and SPARQL queries. If you skipped the last chapter, you can still follow along with the code here.

When querying remote SPARQL endpoints like DBPedia and WikiData, I often repeat the same queries many times, especially during development and testing. Caching SPARQL query results greatly improves my developer experience. We will use the Apache Derby relational database (pure Java code and easy to embed in applications) for query caching.
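To make the caching idea concrete, here is a minimal sketch that uses an in-memory map instead of Derby. It is not the Derby-backed cache used in this project: the names cache, cached-query, fake-fetch, and call-count are hypothetical, and the fetch function is passed in as a parameter so the sketch runs without a network connection.

```clojure
;; In-memory stand-in for a persistent cache:
;; maps [endpoint sparql-query] -> query results.
(defonce cache (atom {}))

(defn cached-query
  "Return the cached results for [endpoint sparql-query] if present.
  Otherwise call (fetch-fn endpoint sparql-query), store the results,
  and return them."
  [fetch-fn endpoint sparql-query]
  (let [k [endpoint sparql-query]]
    (if-let [hit (find @cache k)]
      (val hit)
      (let [results (fetch-fn endpoint sparql-query)]
        (swap! cache assoc k results)
        results))))

;; Hypothetical fetch function that counts how often it is called
;; and returns canned results instead of contacting an endpoint.
(def call-count (atom 0))

(defn fake-fetch [_endpoint _sparql-query]
  (swap! call-count inc)
  '(("s") ("http://example.org/a")))

;; The second call with the same arguments is served from the cache.
(cached-query fake-fetch "https://dbpedia.org/sparql" "select ?s { ?s ?p ?o } limit 1")
(cached-query fake-fetch "https://dbpedia.org/sparql" "select ?s { ?s ?p ?o } limit 1")
```

An in-memory map is lost when the process exits, which is one reason to prefer an embedded database such as Derby for caching across development sessions.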

We declare both Jena and the Derby relational database libraries as dependencies in our project file:

(defproject semantic_web_jena_clj "0.1.0-SNAPSHOT"
  :description "Clojure Wrapper for Apache Jena"
  :url "https://markwatson.com"
  :license
  {:name "EPL-2.0 OR GPL-2+ WITH Classpath-exception-2.0"
   :url "https://www.eclipse.org/legal/epl-2.0/"}
  :source-paths      ["src"]
  :java-source-paths ["src-java"]
  :javac-options     ["-target" "1.8" "-source" "1.8"]
  :dependencies [[org.clojure/clojure "1.11.1"]
                 [org.apache.derby/derby "10.15.2.0"]
                 [org.apache.derby/derbytools "10.15.2.0"]
                 [org.apache.derby/derbyclient "10.15.2.0"]
                 [org.apache.jena/apache-jena-libs
                  "3.17.0" :extension "pom"]]
  :repl-options {:init-ns semantic-web-jena-clj.core})

We will use the Jena library for handling RDF and SPARQL queries and the Derby database library for implementing query caching. Please note that the directory structure for this project also includes Java code that I wrote to wrap the Jena APIs for my specific needs (some files not shown for brevity):

$ tree
.
├── LICENSE
├── README.md
├── data
│   ├── business.sql
│   ├── news.nt
│   ├── sample_news.nt
├── pom.xml
├── project.clj
├── src
│   └── semantic_web_jena_clj
│       └── core.clj
├── src-java
│   └── main
│       ├── java
│       │   └── com
│       │       └── markwatson
│       │           └── semanticweb
│       │               ├── Cache.java
│       │               ├── JenaApis.java
│       │               └── QueryResult.java
│       └── resources
│           └── log4j.xml
└── test
    └── semantic_web_jena_clj
        └── core_test.clj

While I expect that you will just use the Java code as is, there is one modification that you might want to make for your applications: I turned on OWL reasoning by default. If you don’t need OWL reasoning and you will be working with large numbers of RDF triples (tens of millions should fit nicely in-memory on your laptop), then you might want to change the following lines of code in JenaApis.java by uncommenting line 2 and commenting out line 4:

1  // use if OWL reasoning not required:
2  //model = ModelFactory.createDefaultModel();
3  // to use the OWL reasoner:
4  model = ModelFactory.createOntologyModel();

OWL reasoning is expensive, but for small RDF data sets you might as well leave it turned on.

I don’t list the file JenaApis.java here but you might want to have it open in an editor while reading the following listing of the Clojure code that wraps this Java code.

The Clojure wrapping functions are mostly self-explanatory. The main corner case is converting Java results from Jena to Clojure seq data structures, as we do in lines 13-14.

 1 (ns semantic-web-jena-clj.core
 2   (:import (com.markwatson.semanticweb JenaApis Cache
 3                                        QueryResult)))
 4 
 5 (defn- get-jena-api-model
 6   "get a default model with OWL reasoning"
 7   []
 8   (new JenaApis))
 9 
10 (defonce model (get-jena-api-model))
11 
12 (defn- results->clj [results]
13   (let [variable-list (seq (. results variableList))
14         bindings-list (seq (map seq (. results rows)))]
15     (cons variable-list bindings-list)))
16 
17 (defn load-rdf-file [fpath]
18   (. model loadRdfFile fpath))
19 
20 (defn query "SPARQL query" [sparql-query]
21   (results->clj (. model query sparql-query)))
22 
23 (defn query-remote
24   "remote service like DBPedia, etc."
25   [remote-service sparql-query]
26   (results->clj
27     (. model queryRemote remote-service sparql-query)))
28 
29 (defn query-dbpedia [sparql-query]
30   (query-remote "https://dbpedia.org/sparql"
31                 sparql-query))
32 
33 (defn query-wikidata [sparql-query]
34   (query-remote
35     "https://query.wikidata.org/bigdata/namespace/wdq/sparql" sparql-query))
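The conversion step in results->clj can be tried without Jena on the classpath. The following sketch assumes that the variable names and rows arrive as java.util.List values; the function java-results->clj and the sample data are hypothetical stand-ins, not part of the project's code.

```clojure
(import '(java.util ArrayList))

(defn java-results->clj
  "Convert a Java list of variable names and a Java list of row lists
  into a Clojure seq whose first element is the variable names and
  whose remaining elements are the rows."
  [variable-list rows]
  (cons (seq variable-list) (seq (map seq rows))))

;; Hypothetical stand-ins for the values a Java query wrapper might return.
(def sample-variables (ArrayList. ["s" "p"]))

(def sample-rows
  (ArrayList. [(ArrayList. ["http://example.org/a" "http://example.org/knows"])
               (ArrayList. ["http://example.org/b" "http://example.org/knows"])]))

;; A seq of seqs: the header first, then one seq per result row.
(def converted (java-results->clj sample-variables sample-rows))
```

Calling seq on each Java list lets the rest of the Clojure code treat the results as ordinary sequences, so functions like first, rest, and map work on them directly.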

Here is a listing of test code that loads RDF data from a file and runs a SPARQL query against it, then runs SPARQL queries against DBPedia and WikiData:

 1 (ns semantic-web-jena-clj.core-test
 2   (:require [clojure.test :refer :all]
 3             [semantic-web-jena-clj.core :refer :all]))
 4 
 5 (deftest load-data-and-sample-queries
 6   (testing
 7     "Load local triples files and SPARQL queries"
 8     (load-rdf-file "data/sample_news.nt")
 9     (let [results (query "select * { ?s ?p ?o } limit 5")]
10       (println results)
11       (is (= (count results) 6)))))
12 
13 (deftest dbpedia-test
14   (testing "Try SPARQL query to DBPedia endpoint"
15     (println
16       (query-dbpedia
17         "select ?p where { <http://dbpedia.org/resource/Bill_Gates> ?p <http://dbpedia.org/resource/Microsoft> . } limit 10"))))
18 
19 (deftest wikidata-test
20   (testing "Try SPARQL query to WikiData endpoint"
21     (println
22       ;; as written, this calls query-dbpedia; use query-wikidata to hit WikiData
23       (query-dbpedia
24         "select * where { ?subject ?property ?object . } limit 10"))))

You might question line 11: we are checking that the return value is a seq of length six, while the SPARQL statement on line 9 limits the returned results to five. The “extra” result is the first element of the seq, which is a list of the variable names from the SPARQL query.

The output will look like this (reformatted for readability; most output is not shown):

((subject property object)
 (http://www.openlinksw.com/virtrdf-data-formats#default-iid
  http://www.w3.org/1999/02/22-rdf-syntax-ns#type
  http://www.openlinksw.com/schemas/virtrdf#QuadMapFormat)
 (http://www.openlinksw.com/virtrdf-data-formats#default-iid-nullable
  http://www.w3.org/1999/02/22-rdf-syntax-ns#type
  http://www.openlinksw.com/schemas/virtrdf#QuadMapFormat)
  ...

Data consists of nested lists where the first sub-list is the SPARQL query variable names, in this case: subject property object. Subsequent sub-lists are binding values for the query variables.
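Because the first sub-list holds the variable names, it is easy to turn each binding row into a map keyed by variable. This is a small sketch, assuming the variable names and values are strings; results->maps and sample-results are hypothetical names used only for illustration.

```clojure
(defn results->maps
  "Given query results shaped as (variables row1 row2 ...), return a
  seq of maps from keywordized variable name to the bound value."
  [[variables & rows]]
  (map #(zipmap (map keyword variables) %) rows))

;; Hypothetical results in the same nested-list shape described above.
(def sample-results
  '(("subject" "property" "object")
    ("http://example.org/a" "http://example.org/p" "http://example.org/b")
    ("http://example.org/c" "http://example.org/p" "http://example.org/d")))

(def sample-maps (results->maps sample-results))

;; Look up a binding by variable name instead of by position.
(:subject (first sample-maps))
```

Destructuring with [variables & rows] also makes it clear why a query with a limit of five returns a seq of length six: one header plus five rows.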

We will use the Jena wrapper in the next chapter.
