Information Gathering
I often write software to automatically collect and use data from the web and other sources. In this chapter I have collected utility code that I have written over the years into a small library that supports two approaches to web scraping as well as GeoNames lookup. This code is simple, but I hope you will find it useful.
Web Scraping Examples
As a practical matter, much of the data that many people use for machine learning either comes from the web or from internal data sources. This section provides some guidance and examples for getting text data from the web.
Before we start a technical discussion about web scraping, I want to point out that much of the information on the web is copyrighted. The first thing you should do is read the terms of service for a web site to ensure that your use of “scraped” or “spidered” data conforms with the wishes of the persons or organizations who own the content and pay to run the web sites you scrape.
Motivation for Web Scraping
There is a huge amount of structured data available on the web via web services, semantic web/linked data markup, and APIs. That said, you will frequently find that the data you need is only available as raw text on web sites. This text is usually fairly unstructured and in a messy (and frequently changing) format because web pages are meant for human consumption, not for ingestion by software agents. In this chapter we will cover useful “web scraping” techniques. You will see that there is often a fair amount of work in dealing with different web design styles and layouts. To make things even more inconvenient, you might find that your information gathering software agents often break because of changes in web sites.
I tend to use one of three general techniques for scraping web sites. Only the first two will be covered in this chapter:
- Use an HTML parsing library that strips all HTML markup and JavaScript from a page and returns a “pure text” block of text. The text in navigation menus, headers, etc. will be interspersed with what we might usually think of as “content” from a web site.
- Exploit HTML DOM (Document Object Model) formatting information on web sites to pick out headers, page titles, navigation menus, and large blocks of content text.
- Use a tool like Selenium to programmatically control a web browser so your software agents can log in to a site and otherwise perform navigation. In other words, your software agents can simulate a human using a web browser.
I seldom need to use tools like Selenium but as the saying goes “when you need them, you need them.” For simple sites I favor extracting all text as a single block and use DOM processing as needed.
I am not going to cover the use of Selenium and the Java Selenium WebDriver APIs in this chapter because, as I mentioned, I do not use them frequently and I think that you are unlikely to need them either. I refer you to the Selenium documentation if the first two approaches in the list above do not work for your application. Selenium is primarily intended for building automated tests of complex web applications, so my occasional use of it for web spidering is not the common use case.
I assume that you have some experience with HTML and DOM. DOM is a tree data structure.
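To make the tree structure concrete, here is a minimal sketch that uses only the XML DOM parser and XPath support built into the JDK. The class name DomTreeSketch and the sample markup are hypothetical, and the snippet is deliberately well-formed XML; real web pages usually are not, which is one reason to prefer a forgiving HTML parser. The sketch contrasts selecting P elements anywhere below a DIV with selecting only the P elements that are direct children of a DIV.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomTreeSketch {
    // Hypothetical, well-formed sample page: one P directly under the DIV,
    // and one P nested one level deeper inside a SECTION.
    static final String PAGE =
        "<html><body><div><p>direct child</p>"
        + "<section><p>nested deeper</p></section></div></body></html>";

    // Parse PAGE into a DOM tree and return the text of every node matching the XPath.
    static List<String> textOf(String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(PAGE)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i).getTextContent());
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Descendants: every P anywhere below a DIV.
        System.out.println(textOf("//div//p")); // [direct child, nested deeper]
        // Direct children only: P elements whose parent is a DIV.
        System.out.println(textOf("//div/p"));  // [direct child]
    }
}
```

The same descendant versus direct-child distinction exists in CSS selector syntax, where “div p” matches descendants and “div > p” matches direct children.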
Web Scraping Using the Jsoup Library
We will use the MIT-licensed library jsoup. One reason I selected jsoup for the examples in this chapter, out of many fine libraries that provide similar functionality, is its particularly nice documentation, especially The jsoup Cookbook, which I urge you to bookmark as a general reference. In this chapter I will concentrate on just the web scraping use cases that come up most frequently in my own work.
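The following example code uses jsoup to get the text inside all P (paragraph) elements that are descendants of any DIV element. The selector “div p” matches P elements nested at any depth inside a DIV; matching only direct children would be written “div > p”. On line 14 we use the jsoup library to fetch my home web page: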
1 package com.markwatson.web_scraping;
2
3 import org.jsoup.*;
4 import org.jsoup.nodes.Document;
5 import org.jsoup.nodes.Element;
6 import org.jsoup.select.Elements;
7
8 /**
9 * Examples of using jsoup
10 */
11 public class MySitesExamples {
12
13 public static void main(String[] args) throws Exception {
14 Document doc = Jsoup.connect("https://markwatson.com")
15 .userAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 14_0; rv:120.0) Gecko/20100101 Firefox/120.0")
16 .timeout(5000).get();
17 Elements newsHeadlines = doc.select("div p");
18 for (var element : newsHeadlines) {
19 System.out.println(" next element text: " + element.text());
20 }
21 String allPageText = doc.text();
22 System.out.println("All text on web page:\n" + allPageText);
23 Elements anchors = doc.select("a[href]");
24 for (var anchor : anchors) {
25 String uri = anchor.attr("href");
26 System.out.println(" next anchor uri: " + uri);
27 System.out.println(" next anchor text: " + anchor.text());
28 }
29 Elements absoluteUriAnchors = doc.select("a[href]");
30 for (var anchor : absoluteUriAnchors) {
31 String uri = anchor.attr("abs:href");
32 System.out.println(" next anchor absolute uri: " + uri);
33 System.out.println(" next anchor absolute text: " + anchor.text());
34 }
35
36 }
37 }
In line 17 I use the selector “div p”, which returns all P elements nested anywhere inside a DIV element, and in lines 18-20 I print the text inside these P elements.
For training data for machine learning it is useful to just grab all text on a web page and assume that common phrases dealing with web navigation, etc. will be dropped from learned models because they occur in many different training examples for different classifications. In the above listing, line 21 shows how to fetch the plain text from an entire web page. The code on line 23 fetches anchor elements and the loop in lines 24-28 prints out this anchor data as URI and text. The code in lines 29-34 does the same thing except that we convert relative URIs to absolute URIs.
Running the example file MySitesExamples.java produces output like the following (most of the output is not shown):
 next element text: I am the author of 20+ books on Artificial Intelligence, Common Lisp, Deep Learning, Haskell, Java, Ruby, JavaScript, and the Semantic Web. I have 55 US Patents.
 next element text: My customer list includes: Google, Capital One, CompassLabs, Disney, SAIC, Americast, PacBell, CastTV, Lutris Technology, Arctan Group, Sitescout.com, Embed.ly, and Webmind Corporation.
All text on web page:
Mark Watson: consultant specializing in Common Lisp, deep learning and natural language processing
learning Toggle navigation Mark Watson consultant and author specializing in Common Lisp development and AI
...

 next anchor uri: #
 next anchor text: Mark Watson consultant and author specializing in Common Lisp development and AI research projects and commercial products
 next anchor uri: /
 next anchor text: Home page
 next anchor uri: /consulting
 next anchor text: Consulting
 next anchor uri: /blog
 next anchor text: My Blog
 next anchor uri: /books
 next anchor text: My Books
 next anchor uri: /opensource
 next anchor text: Open Source
...
 next anchor absolute uri: https://markwatson.com#
 next anchor absolute text: Mark Watson consultant and author specializing in Common Lisp development and AI research projects and commercial products
 next anchor absolute uri: https://markwatson.com/
 next anchor absolute text: Home page
 next anchor absolute uri: https://markwatson.com/consulting
...
The 2-gram (i.e., two words in sequence) “Toggle navigation” in the last listing has nothing to do with the real content of my site and is an artifact of using the Bootstrap CSS and JavaScript tools. Often “noise” like this is simply ignored by machine learning models if it appears on many different sites, but beware that this might be a problem and you might need to precisely fetch text from specific DOM elements. Similarly, notice that this last listing picks up the plain text from the navigation menus.
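If navigation noise like this does become a problem, one simple heuristic is to look for word pairs that appear on every page you have scraped from a site and treat them as boilerplate candidates. The following is only a sketch of that idea under my own assumptions: the class name SharedPhraseSketch, its method names, and the two sample page texts are hypothetical and are not part of the chapter's library.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class SharedPhraseSketch {
    // Split text on whitespace and collect each pair of adjacent words.
    static Set<String> bigrams(String text) {
        String[] words = text.strip().split("\\s+");
        Set<String> out = new LinkedHashSet<>();
        for (int i = 0; i + 1 < words.length; i++) {
            out.add(words[i] + " " + words[i + 1]);
        }
        return out;
    }

    // Return the 2-grams that occur in every page text: likely navigation boilerplate.
    static Set<String> sharedBigrams(List<String> pages) {
        if (pages.isEmpty()) {
            return Set.of();
        }
        Set<String> shared = new LinkedHashSet<>(bigrams(pages.get(0)));
        for (String page : pages.subList(1, pages.size())) {
            shared.retainAll(bigrams(page));
        }
        return shared;
    }

    public static void main(String[] args) {
        // Hypothetical page texts, as might be returned by extracting all text from two pages.
        List<String> pages = List.of(
            "Toggle navigation Mark Watson books",
            "Toggle navigation consulting services");
        System.out.println(sharedBigrams(pages)); // [Toggle navigation]
    }
}
```

With only a handful of pages this will also flag genuine content that happens to repeat, so treat the result as a list of candidates to review rather than a filter to apply blindly.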
Notice that there are different types of URIs like #, relative, and absolute. Any characters following a # character (the fragment) are not sent to the server, so they do not affect which web page is returned (or which API is called). They are available to the browser and to client-side code, for example to specify anchor positions on a web page. Relative URIs like /consulting are understood to be relative to the base URI of the web site.
I often require that URIs be absolute URIs (i.e., they start with a protocol like “http:” or “https:”), and lines 29-34 of the MySitesExamples listing show how to get an absolute URI for every anchor. In line 31 I specify the attribute as “abs:href” so that jsoup resolves each href value against the base URI of the page.
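The different kinds of URIs, and the way a relative URI is resolved against a base URI, can also be explored with the java.net.URI class from the standard library. This is a minimal sketch for experimentation and not a description of how jsoup is implemented; the class name UriKindsSketch and the sample paths such as /books/index.html are hypothetical.

```java
import java.net.URI;

class UriKindsSketch {
    // Resolve an href (absolute, relative, or fragment-only) against a base URI.
    static String resolve(String base, String href) {
        return URI.create(base).resolve(href).toString();
    }

    // A URI is absolute when it has a scheme such as "https".
    static boolean isAbsolute(String href) {
        return URI.create(href).isAbsolute();
    }

    // The fragment is the part after '#', or null when there is none.
    static String fragment(String href) {
        return URI.create(href).getFragment();
    }

    public static void main(String[] args) {
        // A path starting with '/' replaces the whole path of the base URI.
        System.out.println(resolve("https://markwatson.com/", "/consulting"));
        // prints https://markwatson.com/consulting

        // A path without a leading '/' is resolved relative to the base document's directory.
        System.out.println(resolve("https://markwatson.com/books/index.html", "consulting"));
        // prints https://markwatson.com/books/consulting

        System.out.println(isAbsolute("/consulting"));                   // false
        System.out.println(isAbsolute("https://markwatson.com/"));       // true
        System.out.println(fragment("https://markwatson.com/books#toc")); // toc
    }
}
```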
Web Spidering Using the Java HttpClient and jsoup
Here is another web spidering example that differs from the earlier jsoup example. Here we implement a spider that fetches pages with the HttpClient class from the Java standard library and, as the imports in the listing show, parses the returned HTML with jsoup.
package com.markwatson.info_spiders;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.*;

/**
 * This simple web spider returns a list of lists, each containing two
 * strings representing "URL" and "text". Specifically, I do not return links on each page.
 */

/**
 * Copyright Mark Watson 2008-2020. All Rights Reserved.
 * License: Apache 2
 */

public class WebSpider {
    public WebSpider(String rootUrl, int maxReturnedPages) throws Exception {
        String host = URI.create(rootUrl).getHost();
        System.out.println("+ host: " + host);
        var urls = new ArrayList<String>();        // URLs waiting to be fetched
        var alreadyVisited = new HashSet<String>();
        urls.add(rootUrl);
        int numFetched = 0;

        try (var httpClient = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(10))
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build()) {

            // Stop once maxReturnedPages pages have been fetched
            while (numFetched < maxReturnedPages && !urls.isEmpty()) {
                try {
                    System.out.println("+ urls: " + urls);
                    String urlStr = urls.removeFirst();
                    System.out.println("+ url_str: " + urlStr);
                    // Only follow links that stay on the root site and are new
                    if (urlStr.toLowerCase().contains(host) && !alreadyVisited.contains(urlStr)) {
                        alreadyVisited.add(urlStr);

                        var request = HttpRequest.newBuilder()
                                .uri(URI.create(urlStr))
                                .timeout(Duration.ofSeconds(15))
                                .header("User-Agent", "Mozilla/5.0 (compatible; JavaAIBook/1.0)")
                                .GET()
                                .build();

                        HttpResponse<String> response = httpClient.send(request, HttpResponse.BodyHandlers.ofString());
                        if (response.statusCode() != 200) {
                            System.out.println("Skipping " + urlStr + " (HTTP " + response.statusCode() + ")");
                            continue;
                        }

                        Document doc = Jsoup.parse(response.body(), urlStr);
                        numFetched++;
                        String text = doc.text();

                        // Skip any pages where text on page is identical to existing
                        // page (e.g., http://example.com and http://example.com/index.html)
                        boolean duplicate = urlContentLists.stream()
                                .anyMatch(ls -> text.equals(ls.get(1)));

                        if (!duplicate) {
                            // Pause briefly to avoid hitting the site too frequently
                            try {
                                Thread.sleep(500);
                            } catch (InterruptedException ie) {
                                Thread.currentThread().interrupt();
                            }

                            // Queue every link on the page as an absolute URI
                            Elements anchors = doc.select("a[href]");
                            for (Element anchor : anchors) {
                                String linkStr = anchor.attr("abs:href");
                                if (!linkStr.isEmpty()) {
                                    urls.add(linkStr);
                                }
                            }
                            urlContentLists.add(List.of(urlStr, text));
                        }
                    }
                } catch (IOException ex) {
                    System.out.println("Error: " + ex);
                    ex.printStackTrace();
                }
            }
        }
    }

    private final List<List<String>> urlContentLists = new ArrayList<>();

    public List<List<String>> getUrlContentLists() {
        return Collections.unmodifiableList(urlContentLists);
    }

    /** @deprecated Use {@link #getUrlContentLists()} instead. Kept for backward compatibility. */
    @Deprecated
    public List<List<String>> url_content_lists = urlContentLists;
}
The test class WebClientTest shows how to use this class:
WebSpider client = new WebSpider("http://pbs.org", 10);
System.out.println("Found URIs: " + client.getUrlContentLists());
Here is the output for the test class WebClientTest:
+ host: pbs.org
+ urls: [http://pbs.org]
+ url_str: http://pbs.org
Found URIs: [[http://pbs.org, ]]
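The WebSpider class decides whether to follow a link with the test urlStr.toLowerCase().contains(host), which also accepts any URL that merely mentions the host name somewhere, for example in a query string. The following sketch shows a stricter check built on java.net.URI, plus a helper that drops the fragment so that two links to different anchors on the same page are treated as one page. The class name SpiderUrlSketch and both method names are hypothetical; this is one possible refinement, not part of the WebSpider class.

```java
import java.net.URI;
import java.util.Locale;

class SpiderUrlSketch {
    // True when the URL's host is exactly the target host or one of its subdomains.
    static boolean sameSite(String url, String host) {
        try {
            String urlHost = URI.create(url).getHost();
            if (urlHost == null) {
                return false;
            }
            String h = urlHost.toLowerCase(Locale.ROOT);
            String target = host.toLowerCase(Locale.ROOT);
            return h.equals(target) || h.endsWith("." + target);
        } catch (IllegalArgumentException e) {
            return false; // not a syntactically valid URI
        }
    }

    // Remove any "#fragment" suffix; the fragment does not change which page is fetched.
    static String withoutFragment(String url) {
        int i = url.indexOf('#');
        return i < 0 ? url : url.substring(0, i);
    }

    public static void main(String[] args) {
        System.out.println(sameSite("http://www.pbs.org/shows", "pbs.org"));         // true
        System.out.println(sameSite("http://example.com/?ref=pbs.org", "pbs.org"));  // false
        System.out.println(withoutFragment("http://pbs.org/a#top"));                 // http://pbs.org/a
    }
}
```

Applying withoutFragment before adding a URL to the visited set would be one way to avoid fetching the same page once per anchor.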
Client for GeoNames Service
The GeoNames service looks up information about locations. You need to sign up for a free account (http://www.geonames.org/login), and your GeoNames account name needs to be stored in an environment variable GEONAMES, which is accessed in Java code using:
System.getenv("GEONAMES")
The implementation file GeoNamesClient.java uses the utility class GeoNameData that we will look at later:
package com.markwatson.info_spiders;

import org.geonames.*;

import java.util.ArrayList;
import java.util.List;

/**
 * Copyright Mark Watson 2008-2020. All Rights Reserved.
 * License: Apache 2
 */

// You will need a free GeoNames account. Sign up: https://www.geonames.org/login
// Then, set an environment variable: export GEONAMES=your-geonames-account-name

public class GeoNamesClient {
    public GeoNamesClient() {
    }

    private List<GeoNameData> helper(String name, String type) throws Exception {
        var ret = new ArrayList<GeoNameData>();

        String geonamesAccountName = System.getenv("GEONAMES");
        if (geonamesAccountName == null || geonamesAccountName.isBlank()) {
            throw new IllegalStateException("""
                GeoNames API key not configured.
                You will need a free GeoNames account.
                Sign up: https://www.geonames.org/login
                Then, set an environment variable:
                export GEONAMES=your-geonames-account-name""");
        }
        WebService.setUserName(geonamesAccountName);

        var searchCriteria = new ToponymSearchCriteria();
        searchCriteria.setStyle(Style.LONG);
        searchCriteria.setQ(name);
        ToponymSearchResult searchResult = WebService.search(searchCriteria);
        for (Toponym toponym : searchResult.getToponyms()) {
            if (toponym.getFeatureClassName() != null &&
                toponym.getFeatureClassName().contains(type) &&
                toponym.getName().contains(name) &&
                valid(toponym.getName())) {
                ret.add(new GeoNameData(toponym));
            }
        }
        return ret;
    }

    private boolean valid(String str) {
        return str.chars().noneMatch(Character::isDigit);
    }

    public List<GeoNameData> getCityData(String cityName) throws Exception {
        return helper(cityName, "city");
    }

    public List<GeoNameData> getCountryData(String countryName) throws Exception {
        return helper(countryName, "country");
    }

    public List<GeoNameData> getStateData(String stateName) throws Exception {
        List<GeoNameData> states = helper(stateName, "state");
        for (GeoNameData state : states) {
            state.geoType = GeoNameData.GeoType.STATE;
        }
        return states;
    }

    public List<GeoNameData> getRiverData(String riverName) throws Exception {
        return helper(riverName, "stream");
    }

    public List<GeoNameData> getMountainData(String mountainName) throws Exception {
        return helper(mountainName, "mountain");
    }
}
The class GeoNamesClient in the last listing uses the class GeoNameData which processes the structured data returned from the GeoNames service and provides public fields to access this information and an implementation of toString to pretty-print the data to a string:
package com.markwatson.info_spiders;

import org.geonames.Toponym;

/**
 * Copyright Mark Watson 2008-2020. All Rights Reserved.
 * License: Apache-2.0
 */
public class GeoNameData {
    public enum GeoType {
        CITY, COUNTRY, STATE, RIVER, MOUNTAIN, UNKNOWN
    }

    public int geoNameId = 0;
    public GeoType geoType = GeoType.UNKNOWN;
    public String name = "";
    public double latitude = 0;
    public double longitude = 0;
    public String countryCode = "";

    public GeoNameData(Toponym toponym) {
        geoNameId = toponym.getGeoNameId();
        latitude = toponym.getLatitude();
        longitude = toponym.getLongitude();
        name = toponym.getName();
        countryCode = toponym.getCountryCode();
        String featureClassName = toponym.getFeatureClassName();
        if (featureClassName != null) {
            geoType = switch (featureClassName) {
                case String s when s.startsWith("city") -> GeoType.CITY;
                case String s when s.startsWith("country") -> GeoType.COUNTRY;
                case String s when s.startsWith("state") -> GeoType.STATE;
                case String s when s.startsWith("stream") -> GeoType.RIVER;
                case String s when s.startsWith("mountain") -> GeoType.MOUNTAIN;
                default -> GeoType.UNKNOWN;
            };
        }
    }

    public GeoNameData() {
    }

    @Override
    public String toString() {
        return "[GeoNameData: %s, type: %s, country code: %s, ID: %d, latitude: %.4f, longitude: %.4f]"
                .formatted(name, geoType, countryCode, geoNameId, latitude, longitude);
    }
}
The test class GeoNamesClientTest shows how to use these two classes:
GeoNamesClient client = new GeoNamesClient();
System.out.println(client.getCityData("Paris")); pause();
System.out.println(client.getCountryData("Canada")); pause();
System.out.println(client.getStateData("California")); pause();
System.out.println(client.getRiverData("Amazon")); pause();
System.out.println(client.getMountainData("Whitney"));
The output from this test is shown below:
[[GeoNameData: Paris, type: CITY, country code: FR, ID: 2988507, latitude: 48.85341, longitude: 2.3488], [GeoNameData: Le Touquet-Paris-Plage, type: CITY, country code: FR, ID: 2999139, latitude: 50.52432, longitude: 1.58571], [GeoNameData: Paris, type: CITY, country code: US, ID: 4717560, latitude: 33.66094, longitude: -95.55551], [GeoNameData: Balneario Nuevo Paris, type: CITY, country code: UY, ID: 3441475, latitude: -34.85, longitude: -56.23333], [GeoNameData: Paris, type: CITY, country code: BY, ID: 8221628, latitude: 55.15464, longitude: 27.38456], [GeoNameData: Paris, type: CITY, country code: TG, ID: 2364431, latitude: 7.15, longitude: 1.08333]]
[[GeoNameData: Canada, type: COUNTRY, country code: CA, ID: 6251999, latitude: 60.10867, longitude: -113.64258], [GeoNameData: Canada Bay, type: COUNTRY, country code: AU, ID: 7839706, latitude: -33.8659, longitude: 151.11591]]
[[GeoNameData: Baja California Sur, type: STATE, country code: MX, ID: 4017698, latitude: 25.83333, longitude: -111.83333], [GeoNameData: Baja California, type: STATE, country code: MX, ID: 4017700, latitude: 30.0, longitude: -115.0], [GeoNameData: California, type: STATE, country code: US, ID: 5332921, latitude: 37.25022, longitude: -119.75126]]
[[GeoNameData: Amazon Bay, type: RIVER, country code: PG, ID: 2133985, latitude: -10.30264, longitude: 149.36313]]
[[GeoNameData: Mount Whitney, type: MOUNTAIN, country code: US, ID: 5409018, latitude: 36.57849, longitude: -118.29194], [GeoNameData: Whitney Peak, type: MOUNTAIN, country code: AQ, ID: 6628058, latitude: -76.43333, longitude: -126.05], [GeoNameData: Whitney Point, type: MOUNTAIN, country code: AQ, ID: 6628059, latitude: -66.25, longitude: 110.51667], [GeoNameData: Whitney Island, type: MOUNTAIN, country code: RU, ID: 1500850, latitude: 81.01149, longitude: 60.88737], [GeoNameData: Whitney Island, type: MOUNTAIN, country code: AQ, ID: 6628055, latitude: -69.66187, longitude: -68.50341], [GeoNameData: Whitney Meadow, type: MOUNTAIN, country code: US, ID: 5409010, latitude: 36.43216, longitude: -118.26648], [GeoNameData: Whitney Peak, type: MOUNTAIN, country code: US, ID: 5444110, latitude: 39.43276, longitude: -106.47309], [GeoNameData: Whitney Portal, type: MOUNTAIN, country code: US, ID: 5409011, latitude: 36.58882, longitude: -118.22592], [GeoNameData: Whitney Mountain, type: MOUNTAIN, country code: US, ID: 4136375, latitude: 36.40146, longitude: -93.91742], [GeoNameData: Whitney Bridge Dip, type: MOUNTAIN, country code: AU, ID: 11878190, latitude: -28.61241, longitude: 153.16546], [GeoNameData: Whitney Point, type: MOUNTAIN, country code: US, ID: 5815920, latitude: 47.76037, longitude: -122.85127], [GeoNameData: Whitney Pass, type: MOUNTAIN, country code: US, ID: 5409024, latitude: 36.55577, longitude: -118.2812], [GeoNameData: Whitney Island, type: MOUNTAIN, country code: CA, ID: 6181293, latitude: 58.6505, longitude: -78.71621]]
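Because the helper method in GeoNamesClient keeps any result whose name contains the search string, the output includes partial matches such as “Le Touquet-Paris-Plage” and “Canada Bay”. If your application only wants places whose name matches exactly, you can filter the returned list afterwards. The sketch below illustrates the idea with a stand-in record named Place so that it runs without the GeoNames library; Place, ExactMatchSketch, and exactMatches are hypothetical names, and with the real classes you would compare against the name field of GeoNameData instead.

```java
import java.util.List;

class ExactMatchSketch {
    // Stand-in for GeoNameData with just the two fields this sketch needs.
    record Place(String name, String countryCode) {}

    // Keep only the results whose name equals the query, ignoring case.
    static List<Place> exactMatches(List<Place> results, String query) {
        return results.stream()
            .filter(p -> p.name().equalsIgnoreCase(query))
            .toList();
    }

    public static void main(String[] args) {
        // Names and country codes taken from the city output shown above.
        List<Place> results = List.of(
            new Place("Paris", "FR"),
            new Place("Le Touquet-Paris-Plage", "FR"),
            new Place("Paris", "US"),
            new Place("Balneario Nuevo Paris", "UY"));
        // Only the two entries named exactly "Paris" remain.
        System.out.println(exactMatches(results, "Paris"));
    }
}
```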
Wrap-up for Information Gathering
Access to data is an advantage large companies usually have over individuals and small organizations. That said, there is a lot of free information on the web and I hope my simple utility classes we have covered here will be of some use to you.
I respect the rights of people and organizations who put information on the web. This includes:
- Read the terms of service on web sites to make sure your use of the site’s data is compliant, and avoid accessing any one web site too frequently.
- When you access services like DBpedia and GeoNames, consider caching the results so that you don’t ask the service for the same information repeatedly. This is particularly important during development and testing. In a later chapter we will see how to use the Apache Derby database to cache SPARQL queries to the DBpedia service.
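As a small illustration of the caching advice, here is a minimal in-memory sketch. It is not the Derby-based cache mentioned above, and nothing is persisted between runs. The class name LookupCacheSketch is hypothetical, and the lookup function passed to it stands in for whatever remote call you want to avoid repeating.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

class LookupCacheSketch<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> remoteLookup;

    // remoteLookup is the expensive call (for example, a web service request).
    LookupCacheSketch(Function<K, V> remoteLookup) {
        this.remoteLookup = remoteLookup;
    }

    // Call remoteLookup only the first time a key is requested; reuse the stored value afterwards.
    V get(K key) {
        return cache.computeIfAbsent(key, remoteLookup);
    }

    public static void main(String[] args) {
        AtomicInteger remoteCalls = new AtomicInteger();
        // Stand-in for a remote service: counts how often it is actually invoked.
        var cached = new LookupCacheSketch<String, String>(name -> {
            remoteCalls.incrementAndGet();
            return "result for " + name;
        });
        cached.get("Paris");
        cached.get("Paris");
        System.out.println("remote calls: " + remoteCalls.get()); // remote calls: 1
    }
}
```

An in-memory map like this is enough to keep a test run from issuing the same request repeatedly; a persistent store is the better choice when results should survive between runs.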