Information Gathering
I often write software to automatically collect and use data from the web and other sources. In this chapter I have collected utility code that I have written over the years into a small library supporting two approaches to web scraping as well as GeoNames lookup. This code is simple, but I hope you will find it useful.
The following UML class diagram shows the public APIs of the libraries developed in this chapter:
Web Scraping Examples
As a practical matter, much of the data used for machine learning comes either from the web or from internal data sources. This section provides some guidance and examples for getting text data from the web.
Before we start a technical discussion about web scraping, I want to point out that much of the information on the web is copyrighted. The first thing you should do is read the terms of service of the web sites you target to ensure that your use of “scraped” or “spidered” data conforms to the wishes of the people or organizations who own the content and pay to run the sites.
Motivation for Web Scraping
There is a huge amount of structured data available on the web via web services, semantic web/linked data markup, and APIs. That said, you will frequently find it useful to pull raw text from web sites, but this text is usually fairly unstructured and in a messy (and frequently changing) format, since web pages are meant for human consumption and not for ingestion by software agents. In this chapter we will cover useful “web scraping” techniques. You will see that there is often a fair amount of work in dealing with different web design styles and layouts. To make things even more inconvenient, your information gathering agents will often break because of changes to web sites.
I tend to use one of three general techniques for scraping web sites. Only the first two will be covered in this chapter:
- Use an HTML parsing library that strips all HTML markup and JavaScript from a page and returns a “pure text” block. The text in navigation menus, headers, etc. will be interspersed with what we might usually think of as the “content” of a web site.
- Exploit HTML DOM (Document Object Model) formatting information on web sites to pick out headers, page titles, navigation menus, and large blocks of content text.
- Use a tool like Selenium to programmatically control a web browser so your software agents can log in to sites and otherwise perform navigation. In other words, your software agents can simulate a human using a web browser.
I seldom need to use tools like Selenium, but as the saying goes, “when you need them, you need them.” For simple sites I favor extracting all text as a single block, using DOM processing as needed.
I am not going to cover the use of Selenium and the Java Selenium WebDriver APIs in this chapter because, as I mentioned, I tend not to use them frequently and I think you are unlikely to need to either. I refer you to the Selenium documentation if the first two approaches in the list above do not work for your application. Selenium is primarily intended for automating the testing of complex web applications, so my occasional use of it in web spidering is not the common use case.
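That said, to give you a feel for the approach, here is a minimal sketch of driving a browser from Java with Selenium. This is not part of the library developed in this chapter; it assumes the selenium-java dependency and a matching browser driver (here geckodriver for Firefox), and the URL and form element names are placeholders:

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class SeleniumSketch {
  public static void main(String[] args) {
    WebDriver driver = new FirefoxDriver(); // requires geckodriver on your PATH
    try {
      driver.get("https://example.com/login"); // placeholder URL
      // Fill in a login form and submit it, simulating a human user:
      driver.findElement(By.name("username")).sendKeys("me@example.com");
      driver.findElement(By.name("password")).sendKeys("secret");
      driver.findElement(By.cssSelector("button[type=submit]")).click();
      // After navigation, the fully rendered page source is available:
      System.out.println(driver.getPageSource());
    } finally {
      driver.quit(); // always shut the browser down
    }
  }
}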
I assume that you have some experience with HTML and the DOM. The DOM (Document Object Model) represents an HTML page as a tree data structure.
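As a quick illustration that the DOM is a tree, the following sketch (using the jsoup library introduced in the next section, and Java 11+ for String.repeat) parses a small HTML string and prints each element indented under its parent:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DomTreeDemo {
  public static void main(String[] args) {
    Document doc = Jsoup.parse(
        "<html><body><div><p>one</p><p>two</p></div></body></html>");
    printTree(doc.body(), 0);
  }

  // Recursively walk the element tree, indenting children under parents
  static void printTree(Element e, int depth) {
    System.out.println("  ".repeat(depth) + e.tagName());
    for (Element child : e.children()) {
      printTree(child, depth + 1);
    }
  }
}

This prints body at the root, div indented one level below it, and the two p elements indented two levels.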
Web Scraping Using the Jsoup Library
We will use the MIT-licensed library jsoup. One reason I selected jsoup for the examples in this chapter, out of the many fine libraries that provide similar functionality, is its particularly nice documentation, especially the jsoup Cookbook, which I urge you to bookmark as a general reference. In this chapter I will concentrate on just the web scraping use cases that occur most frequently in my own work.
The following bit of example code uses jsoup to get the text inside all P (paragraph) elements that are descendants of DIV elements. On line 14 we use the jsoup library to fetch my home web page:
 1 package com.markwatson.web_scraping;
 2
 3 import org.jsoup.*;
 4 import org.jsoup.nodes.Document;
 5 import org.jsoup.nodes.Element;
 6 import org.jsoup.select.Elements;
 7
 8 /**
 9  * Examples of using jsoup
10  */
11 public class MySitesExamples {
12
13   public static void main(String[] args) throws Exception {
14     Document doc = Jsoup.connect("https://markwatson.com")
15         .userAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.0; rv:77.0) Gecko/20100101 Firefox/77.0")
16         .timeout(2000).get();
17     Elements newsHeadlines = doc.select("div p");
18     for (Element element : newsHeadlines) {
19       System.out.println(" next element text: " + element.text());
20     }
21     String all_page_text = doc.text();
22     System.out.println("All text on web page:\n" + all_page_text);
23     Elements anchors = doc.select("a[href]");
24     for (Element anchor : anchors) {
25       String uri = anchor.attr("href");
26       System.out.println(" next anchor uri: " + uri);
27       System.out.println(" next anchor text: " + anchor.text());
28     }
29     Elements absolute_uri_anchors = doc.select("a[href]");
30     for (Element anchor : absolute_uri_anchors) {
31       String uri = anchor.attr("abs:href");
32       System.out.println(" next anchor absolute uri: " + uri);
33       System.out.println(" next anchor absolute text: " + anchor.text());
34     }
35
36   }
37 }
On line 17 I select the pattern “div p”, which returns all P elements that are descendants (at any depth) of a DIV element, and in lines 18-20 I print the text inside these P elements. Note that if you only want P elements that are direct children of a DIV, you would use the selector “div > p” instead.
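The difference between the descendant selector “div p” and the child selector “div > p” is easy to see in isolation. Here is a small sketch that parses an inline HTML string (the HTML is just an example):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorDemo {
  public static void main(String[] args) {
    Document doc = Jsoup.parse(
        "<div><p>direct child</p><span><p>nested deeper</p></span></div>");
    // "div p" matches P elements at any depth under a DIV:
    System.out.println(doc.select("div p").size());   // prints 2
    // "div > p" matches only P elements that are direct children of a DIV:
    System.out.println(doc.select("div > p").size()); // prints 1
  }
}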
For machine learning training data it is often useful to just grab all of the text on a web page and assume that common phrases dealing with web navigation, etc. will be dropped from learned models because they occur in many different training examples for different classifications. In the above listing, lines 21-22 show how to fetch and print the plain text of the entire web page. Line 23 selects the anchor elements, and the loop in lines 24-28 prints each anchor's URI and text. The code in lines 29-34 does the same thing except that relative URIs are converted to absolute URIs.
The output might look like the following (most of the output from running the example file MySitesExamples.java is not shown):
1 next element text: I am the author of 20+ books on Artificial Intelligence, Common Lisp, Deep Learning, Haskell, Java, Ruby, JavaScript, and the Semantic Web. I have 55 US Patents.
2 next element text: My customer list includes: Google, Capital One, CompassLabs, Disney, SAIC, Americast, PacBell, CastTV, Lutris Technology, Arctan Group, Sitescout.com, Embed.ly, and Webmind Corporation.
3 All text on web page:
4 Mark Watson: consultant specializing in Common Lisp, deep learning and natural language processing
5 learning Toggle navigation Mark Watson consultant and author specializing in Common Lisp development and AI
6 ...
7
8 next anchor uri: #
9 next anchor text: Mark Watson consultant and author specializing in Common Lisp development and AI research projects and commercial products
10 next anchor uri: /
11 next anchor text: Home page
12 next anchor uri: /consulting
13 next anchor text: Consulting
14 next anchor uri: /blog
15 next anchor text: My Blog
16 next anchor uri: /books
17 next anchor text: My Books
18 next anchor uri: /opensource
19 next anchor text: Open Source
20 ...
21 next anchor absolute uri: https://markwatson.com#
22 next anchor absolute text: Mark Watson consultant and author specializing in Common Lisp development and AI research projects and commercial products
23 next anchor absolute uri: https://markwatson.com/
24 next anchor absolute text: Home page
25 next anchor absolute uri: https://markwatson.com/consulting
26 ...
The 2-gram (i.e., two words in sequence) “Toggle navigation” in the last listing has nothing to do with the real content of my site; it is an artifact of using the Bootstrap CSS and JavaScript tools. Often “noise” like this is simply ignored by machine learning models if it appears on many different sites, but beware that it might be a problem, in which case you may need to fetch text precisely from specific DOM elements. Similarly, notice that this last listing picks up the plain text from the navigation menus.
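If navigation noise is a problem for your application, one approach (a sketch, not part of the library developed in this chapter; the list of element selectors is just an example and will vary by site) is to remove the offending elements from the DOM before extracting text:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class CleanTextDemo {
  public static void main(String[] args) throws Exception {
    Document doc = Jsoup.connect("https://markwatson.com").get();
    // Remove elements that typically hold navigation and boilerplate,
    // then extract the remaining plain text:
    doc.select("nav, header, footer, script, style").remove();
    System.out.println(doc.text());
  }
}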
Notice that there are different types of URIs: fragment-only (just #), relative, and absolute. Any characters following a # character do not affect which web page is shown (or which API is called); they are available for specifying anchor positions on a web page or extra parameters for API calls. Relative URIs like /consulting are understood to be relative to the base URI of the web site.
I often require that URIs be absolute (i.e., that they start with a protocol like “http:” or “https:”), and lines 29-34 show how to select just absolute URI anchors. In line 31 I specify the attribute as “abs:href” so that jsoup resolves each link against the page's base URI.
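This base URI resolution can be seen in isolation by parsing an HTML fragment with an explicit base URI. Here is a small sketch (the URIs are just examples):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbsoluteUriDemo {
  public static void main(String[] args) {
    // The second argument to parse is the base URI used to resolve relative links:
    Document doc = Jsoup.parse(
        "<a href='/consulting'>Consulting</a>", "https://markwatson.com");
    Element anchor = doc.select("a[href]").first();
    System.out.println(anchor.attr("href"));     // prints: /consulting
    System.out.println(anchor.attr("abs:href")); // prints: https://markwatson.com/consulting
  }
}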