Web Scraping Examples
Except for the first chapter on network programming techniques, this chapter, and the final chapter on what I call Knowledge Management-Lite, this book is primarily about machine learning in one form or another. As a practical matter, much of the data that people use for machine learning comes either from the web or from internal data sources. This short chapter provides some guidance and examples for getting text data from the web.
In my work I usually use the Ruby scripting language for web scraping and information gathering (as I wrote about in my Apress book Scripting Intelligence: Web 3.0 Information Gathering and Processing), but there is also good support for web scraping in Java, and since this is a book on modern Java development, we will use Java in this chapter.
Before we start a technical discussion about “web scraping” I want to point out that much of the information on the web is copyrighted, so the first thing you should do is read the terms of service for web sites to ensure that your use of “scraped” or “spidered” data conforms with the wishes of the people or organizations who own the content and pay to run the web sites you scrape.
Motivation for Web Scraping
As we will see in the next chapter on linked data, there is a huge amount of structured data available on the web via web services, semantic web/linked data markup, and APIs. That said, you will frequently find it useful to pull raw text from web sites, but this text is usually fairly unstructured and in a messy (and frequently changing) format because the pages are meant for human consumption, not for ingestion by software agents. In this chapter we will cover useful “web scraping” techniques. You will see that there is often a fair amount of work in dealing with different web design styles and layouts. To make things even more inconvenient, you may find that your information gathering agents often break because of changes to web sites.
I tend to use one of three general techniques for scraping web sites. Only the first two will be covered in this chapter:
- Use an HTML parsing library that strips all HTML markup and JavaScript from a page and returns a “pure text” block. Here the text in navigation menus, headers, etc. will be interspersed with what we usually think of as the “content” of a web site.
- Exploit HTML DOM (Document Object Model) formatting information on web sites to pick out headers, page titles, navigation menus, and large blocks of content text.
- Use a tool like [Selenium](http://docs.seleniumhq.org/) to programmatically control a web browser so your software agents can log in to sites and otherwise perform navigation. In other words, your software agents can simulate a human using a web browser.
I seldom need tools like Selenium but, as the saying goes, “when you need them, you need them.” For simple sites I favor extracting all text as a single block, using DOM processing as needed.
I am not going to cover the use of Selenium and the Java Selenium WebDriver APIs in this chapter because, as I mentioned, I tend not to use them frequently and I think you are unlikely to need them either. I refer you to the Selenium documentation if the first two approaches in the list above do not work for your application. Selenium is primarily intended for automating the testing of complex web applications, so my occasional use of it for web spidering is not the common use case.
I assume that you have some experience with HTML and DOM. For reference, the following figure shows a small part of the DOM for a page on one of my web sites:
This screen shot shows the Chrome web browser developer tools, specifically viewing the page’s DOM. Since a DOM is a tree data structure it is useful to be able to collapse or expand sub-trees in the DOM. In this figure, the HTML BODY element contains two top level DIV elements. The first DIV, which contains the navigation menu for my site, is collapsed. The second DIV contains an H2 heading and various nested DIV and P (paragraph) elements. I show this fragment of my web page not as an example of clean HTML coding but rather as an example of how messy and nested web page elements can be.
Using the jsoup Library
We will use the MIT licensed library jsoup. One reason I selected jsoup for the examples in this chapter, out of many fine libraries that provide similar functionality, is its particularly nice documentation, especially The jsoup Cookbook, which I urge you to bookmark as a general reference. In this chapter I will concentrate on just the most frequent web scraping use cases that I use in my own work.
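Some web sites respond differently to the default Java HTTP client, so it can help to set an explicit user agent string and timeout on the jsoup connection. The following is a minimal sketch; the user agent string and the ten second timeout are just placeholder assumptions that you should adjust for your own use:

Document doc = Jsoup.connect("http://www.markwatson.com")
    .userAgent("Mozilla/5.0 (compatible; example-bot)") // placeholder user agent
    .timeout(10 * 1000)                                 // ten second timeout, in milliseconds
    .get();
System.out.println("Page title: " + doc.title());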
The following bit of code uses jsoup to get the text inside all P (paragraph) elements that are nested inside DIV elements. On line 14 we use the jsoup library to fetch my home web page:
1 package com.markwatson.web_scraping;
2
3 import org.jsoup.*;
4 import org.jsoup.nodes.Document;
5 import org.jsoup.nodes.Element;
6 import org.jsoup.select.Elements;
7
8 /**
9  * Examples of using jsoup
10  */
11 public class MySitesExamples {
12
13   public static void main(String[] args) throws Exception {
14     Document doc = Jsoup.connect("http://www.markwatson.com").get();
15     Elements newsHeadlines = doc.select("div p");
16     for (Element element : newsHeadlines) {
17       System.out.println(" next element text: " + element.text());
18     }
19   }
20 }
In line 15 I am selecting the pattern that returns all P elements nested inside any DIV element, and in lines 16 through 18 I print the text inside these P elements.
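Note that jsoup selectors follow CSS selector syntax, so the pattern “div p” matches P elements nested at any depth inside a DIV. If you want only P elements that are direct children of a DIV, you can use the child combinator instead. This small sketch reuses the doc object from the listing above:

Elements directChildren = doc.select("div > p"); // only P elements whose parent is a DIV
for (Element paragraph : directChildren) {
  System.out.println(" direct child paragraph text: " + paragraph.text());
}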
For machine learning training data it is often useful to just grab all text on a web page and assume that common phrases dealing with web navigation, etc. will be dropped from learned models because they occur in many different training examples across different classifications. The following code snippet shows how to fetch the plain text from an entire web page:
Document doc = Jsoup.connect("http://www.markwatson.com").get();
String all_page_text = doc.text();
System.out.println("All text on web page:\n" + all_page_text);
All text on web page:
Mark Watson: Consultant specializing in machine learning and artificial intellig\
ence Toggle navigation Mark Watson Home page Consulting Free mentoring Blog Book\
s Open Source Fun Consultant specializing in machine learning, artificial intell\
igence, cognitive computing, and web engineering...
The 2-gram (i.e., two words in sequence) “Toggle navigation” in the last listing has nothing to do with the real content of my site; it is an artifact of using the Bootstrap CSS and JavaScript tools. “Noise” like this is often simply ignored by machine learning models if it appears on many different sites, but beware that it can be a problem and you might need to precisely fetch text from specific DOM elements. Similarly, notice that this last listing picks up the plain text from the navigation menus.
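When navigation noise is a problem, one way to be more precise is to delete the DOM elements that typically hold boilerplate before extracting the page text. The selector list in the following sketch is only an assumption about common page structure and will need adjusting for the sites you scrape:

Document doc = Jsoup.connect("http://www.markwatson.com").get();
// remove elements that usually hold navigation menus and other boilerplate
for (Element boilerplate : doc.select("nav, header, footer, script, style")) {
  boilerplate.remove();
}
System.out.println("Text without navigation boilerplate:\n" + doc.text());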
The following code snippet finds HTML anchor elements and prints the data associated with these elements:
Document doc = Jsoup.connect("http://www.markwatson.com").get();
Elements anchors = doc.select("a[href]");
for (Element anchor : anchors) {
  String uri = anchor.attr("href");
  System.out.println(" next anchor uri: " + uri);
  System.out.println(" next anchor text: " + anchor.text());
}
1 next anchor uri: #
2 next anchor text: Mark Watson
3 next anchor uri: /
4 next anchor text: Home page
5 next anchor uri: /consulting/
6 next anchor text: Consulting
7 next anchor uri: /mentoring/
8 next anchor text: Free mentoring
9 next anchor uri: http://blog.markwatson.com
10 next anchor text: Blog
11 next anchor uri: /books/
12 next anchor text: Books
13 next anchor uri: /opensource/
14 next anchor text: Open Source
15 next anchor uri: /fun/
16 next anchor text: Fun
17 next anchor uri: http://www.cognition.tech
18 next anchor text: www.cognition.tech
19 next anchor uri: https://github.com/mark-watson
20 next anchor text: GitHub
21 next anchor uri: https://plus.google.com/117612439870300277560
22 next anchor text: Google+
23 next anchor uri: https://twitter.com/mark_l_watson
24 next anchor text: Twitter
25 next anchor uri: http://www.freebase.com/m/0b6_g82
26 next anchor text: Freebase
27 next anchor uri: https://www.wikidata.org/wiki/Q18670263
28 next anchor text: WikiData
29 next anchor uri: https://leanpub.com/aijavascript
30 next anchor text: Build Intelligent Systems with JavaScript
31 next anchor uri: https://leanpub.com/lovinglisp
32 next anchor text: Loving Common Lisp, or the Savvy Programmer's Secret Weapon
33 next anchor uri: https://leanpub.com/javaai
34 next anchor text: Practical Artificial Intelligence Programming With Java
35 next anchor uri: http://markwatson.com/index.rdf
36 next anchor text: XML RDF
37 next anchor uri: http://markwatson.com/index.ttl
38 next anchor text: Turtle RDF
39 next anchor uri: https://www.wikidata.org/wiki/Q18670263
40 next anchor text: WikiData
Notice that there are different types of URIs: fragment-only (#), relative, and absolute. Any characters following a # character do not affect which web page is fetched (or which API is called), but the characters after the # are available for specifying anchor positions on a web page or extra parameters for API calls. Relative URIs like /consulting/ (as seen in line 5) are understood to be relative to the base URI of the web site.
I often require that URIs be absolute (i.e., start with a protocol like “http:” or “https:”) and the following code snippet extracts the absolute form of each anchor’s URI:
1 Elements absolute_uri_anchors = doc.select("a[href]");
2 for (Element anchor : absolute_uri_anchors) {
3   String uri = anchor.attr("abs:href");
4   System.out.println(" next anchor absolute uri: " + uri);
5   System.out.println(" next anchor absolute text: " + anchor.text());
6 }
In line 3 I specify the attribute as “abs:href” so that jsoup resolves each link to an absolute URI:
next anchor absolute text: Mark Watson
next anchor absolute uri: http://www.markwatson.com/
next anchor absolute text: Home page
next anchor absolute uri: http://www.markwatson.com/consulting/
next anchor absolute text: Consulting
next anchor absolute uri: http://www.markwatson.com/mentoring/
next anchor absolute text: Free mentoring
next anchor absolute uri: http://blog.markwatson.com
next anchor absolute text: Blog
...
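If you only need each distinct external link once, a simple follow-up (just a sketch, using a java.util.TreeSet to hold the results) is to collect the absolute URIs into a set and keep only those that start with an HTTP protocol:

Set<String> uniqueAbsoluteUris = new TreeSet<>();
for (Element anchor : doc.select("a[href]")) {
  String uri = anchor.attr("abs:href");
  // keep only absolute http/https links, skipping fragment-only anchors
  if (uri.startsWith("http://") || uri.startsWith("https://")) {
    uniqueAbsoluteUris.add(uri);
  }
}
for (String uri : uniqueAbsoluteUris) {
  System.out.println(" unique absolute uri: " + uri);
}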
Wrap Up
I have just shown you a quick reference to the most common use cases in my own work. What I didn’t show you was a complete example of organizing spidered information for reuse.
I sometimes collect training data for machine learning by using web searches with query keywords tailored to find information in specific categories. I am not covering the automation of web search in this book but I would like to refer you to an open source wrapper that I wrote for Microsoft’s Bing Search APIs on GitHub. As an example, just to give you an idea for experimentation: if you wanted to train a model to categorize text containing car descriptions into two classes, “US domestic cars” and “foreign made cars”, then you might use search queries like “cars Ford Chevy” and “cars Volvo Volkswagen Peugeot” to get example text for these two categories.
If you use the Bing Search APIs for collecting training data, then for the top ranked search results you can use the techniques covered in this chapter to retrieve the text from the original web pages. Then use one or more of the machine learning techniques covered in this book to build classification models. This is a good technique and some people might even consider it a super power.
For machine learning, I sometimes collect text in files whose names indicate the classification of the text in each file. I also often collect data in a NoSQL datastore like MongoDB or CouchDB, or use a relational database like Postgres, to store training data for future reuse.
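As a concrete (and purely hypothetical) illustration of the file-per-category approach, the following sketch fetches the plain text of a few pages per category with jsoup and writes each page to a file under a directory named for its category. The category names and URLs are made up for illustration; in practice you would substitute the pages returned by your searches:

import org.jsoup.Jsoup;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

public class SaveTrainingText {
  public static void main(String[] args) throws IOException {
    // hypothetical category names and URLs, just to illustrate the idea
    Map<String, List<String>> urisByCategory = new HashMap<>();
    urisByCategory.put("domestic_cars",
        Arrays.asList("http://example.com/ford", "http://example.com/chevy"));
    urisByCategory.put("foreign_cars",
        Arrays.asList("http://example.com/volvo", "http://example.com/peugeot"));
    for (Map.Entry<String, List<String>> entry : urisByCategory.entrySet()) {
      // one directory per category; the file names encode the classification
      Path categoryDir = Paths.get("training_data", entry.getKey());
      Files.createDirectories(categoryDir);
      int count = 0;
      for (String uri : entry.getValue()) {
        String pageText = Jsoup.connect(uri).get().text();
        Files.write(categoryDir.resolve("example_" + (++count) + ".txt"),
                    pageText.getBytes(StandardCharsets.UTF_8));
      }
    }
  }
}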