Information Gathering
This chapter covers information gathering on the web using data sources and general techniques that I have found useful. When I was planning this new book edition, I had intended to also cover some basics for using the Semantic Web from Common Lisp, basically distilling some of the material from my previous book “Practical Semantic Web and Linked Data Applications, Common Lisp Edition” published in 2011. However, since a free PDF of that book is now available, I decided to simply refer you to my previous work if you are interested in the Semantic Web and Linked Data. You can also find the Java edition of that book on my web site.
Gathering information from the web in real time has some real advantages:
- You don’t need to worry about storing data locally.
- Information is up to date (depending on which web data resources you choose to use).
There are also a few things to consider:
- Data on the web may have legal restrictions on its use, so be sure to read the terms and conditions of web sites whose data you would like to use.
- Authorship and validity of data may be questionable.
DBPedia Lookup Service
Wikipedia is a great source of information. As you may know, you can download a data dump of all Wikipedia data, with or without version information and comments. When I want fast access to the entire set of English language Wikipedia articles, I choose the second option and just get the current pages with no comments or versioning information. This is the direct download link for current Wikipedia articles. There are no comments or user pages in this GZIP file. This is not as much data as you might think, only about 9 gigabytes compressed or about 42 gigabytes uncompressed.
To load and run an example, try:
1 (ql:quickload "dbpedia")
2 (dbpedia:dbpedia-lookup "berlin")
Wikipedia is a great resource to have on hand, but in this section I am going to show you how to access the Semantic Web version of Wikipedia, DBPedia, using the DBPedia Lookup Service. The next code listing shows the contents of the example file dbpedia-lookup.lisp in the directory src/dbpedia:
1 (ql:quickload :drakma)
2 (ql:quickload :babel)
3 (ql:quickload :s-xml)
4
5 ;; utility from http://cl-cookbook.sourceforge.net/strings.html#manip:
6 (defun replace-all (string part replacement &key (test #'char=))
7 "Returns a new string in which all the occurrences of the part
8 is replaced with replacement."
9 (with-output-to-string (out)
10 (loop with part-length = (length part)
11 for old-pos = 0 then (+ pos part-length)
12 for pos = (search part string
13 :start2 old-pos
14 :test test)
15 do (write-string string out
16 :start old-pos
17 :end (or pos (length string)))
18 when pos do (write-string replacement out)
19 while pos)))
20
21 (defstruct dbpedia-data uri label description)
22
23 (defun dbpedia-lookup (search-string)
24 (let* ((s-str (replace-all search-string " " "+"))
25 (s-uri
26 (concatenate
27 'string
28 "http://lookup.dbpedia.org/api/search.asmx/KeywordSearch?QueryString="
29 s-str))
30 (response-body nil)
31 (response-status nil)
32 (response-headers nil)
33 (xml nil)
34 ret)
35 (multiple-value-setq
36 (response-body response-status response-headers)
37 (drakma:http-request
38 s-uri
39 :method :get
40 :accept "application/xml"))
41 ;; (print (list "raw response body as XML:" response-body))
42 ;; (print (list "status:" response-status "headers:" response-headers))
43 (setf xml
44 (s-xml:parse-xml-string
45 (babel:octets-to-string response-body)))
46 (dolist (r (cdr xml))
47 ;; assumption: data is returned in the order:
48 ;; 1. label
49 ;; 2. DBPedia URI for more information
50 ;; 3. description
51 (push
52 (make-dbpedia-data
53 :uri (cadr (nth 2 r))
54 :label (cadr (nth 1 r))
55 :description
56 (string-trim
57 '(#\Space #\NewLine #\Tab)
58 (cadr (nth 3 r))))
59 ret))
60 (reverse ret)))
61
62 ;; (dbpedia-lookup "berlin")
I am only capturing the attributes for DBPedia URI, label and description in this example code. If you uncomment line 41 and look at the entire response body from the call to DBPedia Lookup, you can see other attributes that you might want to capture in your applications.
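If you would rather explore the full parsed response interactively instead of uncommenting the print statement, here is a short sketch of a helper I might use; it is not part of the dbpedia example package and assumes the same drakma, babel, and s-xml libraries are already loaded:
1 ;; An exploration helper (a sketch, not part of the example package):
2 ;; return the whole parsed Lookup response so you can inspect every
3 ;; attribute the service provides, not just URI, label and description.
4 (defun dbpedia-lookup-raw (search-string)
5   (s-xml:parse-xml-string
6    (babel:octets-to-string
7     (drakma:http-request
8      (concatenate
9       'string
10      "http://lookup.dbpedia.org/api/search.asmx/KeywordSearch?QueryString="
11      (substitute #\+ #\Space search-string))
12     :method :get
13     :accept "application/xml"))))
14
15 ;; (pprint (dbpedia-lookup-raw "berlin"))
Calling pprint on the result makes it easy to see which elements you might want to add to the dbpedia-data structure.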
Here is a sample call to the function dbpedia:dbpedia-lookup (only some of the returned data is shown):
1 * (ql:quickload "dbpedia")
2 * (dbpedia:dbpedia-lookup "berlin")
3
4 (#S(DBPEDIA-DATA
5 :URI "http://dbpedia.org/resource/Berlin"
6 :LABEL "Berlin"
7 :DESCRIPTION
8 "Berlin is the capital city of Germany and one of the 16 states of Germany.
9 With a population of 3.5 million people, Berlin is Germany's largest city
10 and is the second most populous city proper and the eighth most populous
11 urban area in the European Union. Located in northeastern Germany, it is
12 the center of the Berlin-Brandenburg Metropolitan Region, which has 5.9
13 million residents from over 190 nations. Located in the European Plains,
14 Berlin is influenced by a temperate seasonal climate.")
15 ...)
Wikipedia and the DBPedia linked data version of Wikipedia are great sources of online data. If you get creative, you will be able to think of ways to modify the systems you build to pull data from DBPedia. One warning: Semantic Web/Linked Data sources on the web are not available 100% of the time. If your business applications depend on DBPedia always being available, you can follow the instructions on the DBPedia web site to install the service on one of your own servers.
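If you do call the public endpoint, a little defensive code helps when the service is briefly down. Here is a minimal retry wrapper; this is my own sketch, not part of the dbpedia example package, and the retry count and wait time are arbitrary choices:
1 ;; A retry wrapper (a sketch; the retry count and wait time are arbitrary):
2 ;; try the lookup a few times, pausing between attempts, and return NIL
3 ;; if the service stays unavailable.
4 (defun safe-dbpedia-lookup (search-string &key (retries 3) (wait-seconds 5))
5   (dotimes (attempt retries)
6     (handler-case
7         (return-from safe-dbpedia-lookup
8           (dbpedia:dbpedia-lookup search-string))
9       (error (c)
10        (format t "~&DBPedia lookup failed (~a), retrying ...~%" c)
11        (sleep wait-seconds))))
12   nil)
13
14 ;; (safe-dbpedia-lookup "berlin")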
Web Spiders
When you write web spiders to collect data from the web, there are two things to consider:
- Make sure you read the terms of service for web sites whose data you want to use. I have found that calling or emailing web site owners explaining how I want to use the data on their site usually works to get permission.
- Make sure you don’t access a site too quickly. It is polite to wait a second or two between fetching pages and other assets from a web site.
We have already used the Drakma web client library in this book. See the files src/dbpedia/dbpedia-lookup.lisp (covered in the last section) and src/solr_examples/solr-client.lisp (covered in the Chapter on NoSQL). Paul Nathan has written a library using Drakma to crawl a web site, with an example that prints out links as they are found. His code is available under the AGPL license at articulate-lisp.com/src/web-trotter.lisp and I recommend it as a starting point.
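Before adapting a full crawler, it can help to see the polite-delay idea from the list above in isolation. The following sketch is my own code, not Paul Nathan's; it assumes you already have a list of page URIs to fetch:
1 (ql:quickload :drakma)
2
3 ;; A polite fetching sketch: download a known list of pages, pausing
4 ;; between requests. A real spider would also parse each page and
5 ;; collect new links to follow.
6 (defun polite-fetch-all (uris &key (delay-seconds 2))
7   (loop for uri in uris
8         collect (prog1
9                     (drakma:http-request uri :method :get)
10                  (sleep delay-seconds))))
11
12 ;; (polite-fetch-all '("http://knowledgebooks.com/"))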
I find it is sometimes easier during development to make local copies of a web site so that I don’t have to use excess resources from web site hosts. Assuming that you have the wget utility installed, you can mirror a site like this:
1 wget -m -w 2 http://knowledgebooks.com/
2 wget -mk -w 2 http://knowledgebooks.com/
Both of these examples insert a two-second delay between HTTP requests. The -m option tells wget to mirror the site, recursively following all links on the web site, and the -w 2 option waits two seconds between requests. In the second example the added -k option (combined here as -mk) converts URI references to local file references in your mirror, which makes the second example on line 2 more convenient.
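If you prefer to drive the mirroring from inside Lisp rather than from a shell, you can call wget through uiop:run-program (UIOP ships with ASDF, so it is available in any Quicklisp setup). This is just a convenience sketch and assumes wget is installed and on your PATH:
1 ;; A convenience sketch: shell out to wget from Lisp. Assumes wget is
2 ;; installed and on the PATH; the URI is just an example.
3 (defun mirror-site (uri)
4   (uiop:run-program (list "wget" "-mk" "-w" "2" uri)
5                     :output t
6                     :error-output t))
7
8 ;; (mirror-site "http://knowledgebooks.com/")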
We covered reading from local files in the Chapter on Input and Output. One trick I use is to simply concatenate all web pages into one file. Assuming that you created a local mirror of a web site, cd to the top level directory and use something like this:
1 cd knowledgebooks.com
2 cat *.html */*.html > ../web_site.html
You can then open the file and search for text in p, div, h1, and other HTML elements, processing an entire web site as one file.
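Once you have that single file, even a crude regular-expression pass is often enough for bulk text work. The sketch below is my own helper, not part of the book's example code; a real HTML parser (for example Plump) would be more robust:
1 (ql:quickload :cl-ppcre)
2
3 ;; A rough sketch: read the concatenated file, strip HTML tags with a
4 ;; regular expression, and collapse whitespace. A real HTML parser would
5 ;; handle edge cases better, but this is often good enough for bulk text.
6 (defun file->text (path)
7   (let* ((html (uiop:read-file-string path))
8          (no-tags (cl-ppcre:regex-replace-all "<[^>]*>" html " ")))
9     (cl-ppcre:regex-replace-all "\\s+" no-tags " ")))
10
11 ;; (file->text "../web_site.html")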
Using Apache Nutch
Apache Nutch, like Solr, is built on Lucene search technology. I use Nutch as a “search engine in a box” when I need to spider web sites and I want a local copy with a good search index.
Nutch addresses a different developer use case than Solr, which we covered in the Chapter on NoSQL. As we saw, Solr is an effective tool for indexing and searching structured data as documents. With very little setup, Nutch can automatically keep an up-to-date index of a list of web sites, and optionally follow links to some desired depth from these “seed” web sites.
You can use the same Common Lisp client code that we used for Solr with one exception: you will need to change the root URI for the search service to:
1 http://localhost:8080/opensearch?query=
So the modified client code src/solr_examples/solr-client.lisp needs one line changed:
1 (defun do-search (&rest terms)
2 (let ((query-string (format nil "~{~A~^+AND+~}" terms)))
3 (cl-json:decode-json-from-string
4 (drakma:http-request
5 (concatenate
6 'string
7 "http://localhost:8080/opensearch?query="
8 query-string
9 "&wt=json")))))
Early versions of Nutch were very simple to install and configure. Later versions of Nutch are more complex and more performant and offer more services, but they will take you longer to set up than earlier versions. If you just want to experiment with Nutch, you might want to start with an earlier version.
The OpenSearch.org web site contains many public OpenSearch services that you might want to try. If you want to modify the example client code in src/solr_examples/solr-client.lisp, services that return JSON data are the easiest to work with, and the OpenSearch Community JSON formats web page is a good place to find them. Some of the services listed there, like the New York Times service, require that you sign up for a developer's API key.
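If you experiment with several such services, it is convenient to generalize the earlier do-search client so the endpoint and key are parameters. This is my own sketch; the base URI and the api_key parameter name are placeholders, since each service documents its own URL pattern and way of passing a developer key:
1 (ql:quickload :drakma)
2 (ql:quickload :cl-json)
3 (ql:quickload :babel)
4
5 ;; A generic OpenSearch JSON client (a sketch). BASE-URI and the "api_key"
6 ;; query parameter name are placeholders; check each service's documentation.
7 (defun opensearch-json (base-uri query &key api-key)
8   (let* ((uri (concatenate
9               'string
10              base-uri
11              (drakma:url-encode query :utf-8)
12              (if api-key
13                  (concatenate 'string "&api_key=" api-key)
14                  "")))
15         (body (drakma:http-request uri :method :get)))
16     (cl-json:decode-json-from-string
17      ;; Drakma returns octets unless the content type is textual:
18      (if (stringp body) body (babel:octets-to-string body)))))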
When I start writing an application that requires web data (no matter which programming language I am using) I start by finding services that may provide the type of data I need and do my initial development with a web browser with plugin support to nicely format XML and JSON data. I do a lot of exploring and take a lot of notes before I write any code.
Wrap Up
I tried to provide some examples and advice in this short chapter to show you that even though other languages like Ruby and Python have more libraries and tools for gathering information from the web, Common Lisp also has good libraries for information gathering, and they are easily installed and used via Quicklisp.