
Web Scraping

It is important to respect the property rights of web site owners and abide by their terms and conditions for use. The Wikipedia article on Fair Use provides a good overview of using copyrighted material.

The web scraping code we develop here uses the Swift library SwiftSoup, which is loosely modeled on the BeautifulSoup libraries available in other programming languages.

For my work and research, I have been most interested in using web scraping to collect text data for natural language processing. Other common applications include writing AI news collection and summarization assistants and trying to predict stock prices based on comments in social media, which is what we did at Webmind Corporation in 2000 and 2001.

I wrote a simple web scraping library, available at https://github.com/mark-watson/WebScraping_swift, that you can use in your projects by putting the following dependency in your Package.swift file:

    dependencies: [
         .package(url: "git@github.com:mark-watson/WebScraping_swift.git",
             .branch("main")),
    ],
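For context, here is a sketch of how a complete manifest for a small command line tool might look. The package and target name MyScraper is hypothetical, and I am assuming that the library product and package are both named WebScraping_swift (matching the module name imported in the tests later in this chapter). This sketch uses the HTTPS form of the repository URL, which works without SSH keys configured for GitHub.

```swift
// swift-tools-version:5.5
import PackageDescription

// Hypothetical manifest for a command line tool that uses the library.
// "MyScraper" is a made-up name; the product name "WebScraping_swift"
// is an assumption based on the module name used in this chapter.
let package = Package(
    name: "MyScraper",
    dependencies: [
        .package(url: "https://github.com/mark-watson/WebScraping_swift.git",
                 .branch("main")),
    ],
    targets: [
        .executableTarget(
            name: "MyScraper",
            dependencies: [
                .product(name: "WebScraping_swift",
                         package: "WebScraping_swift"),
            ]),
    ]
)
```

With a manifest like this, the tool's source would go in Sources/MyScraper/main.swift and could start with import WebScraping_swift.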

Here is the main implementation file for the library:

import Foundation
import SwiftSoup

/// Fetch the page at `uri` and return its visible text with the HTML removed.
/// Stops the program if the URI is invalid or the fetch or parse fails.
public func webPageText(uri: String) -> String {
    guard let myURL = URL(string: uri) else {
        print("Error: \(uri) doesn't seem to be a valid URL")
        fatalError("invalid URI")
    }
    let html = try! String(contentsOf: myURL, encoding: .ascii)
    let doc: Document = try! SwiftSoup.parse(html)
    let plain_text = try! doc.text()
    return plain_text
}

/// Return the text of every element matching the selector `headerName`
/// (for example "h1"). Returns an empty array if the fetch or parse fails.
func webPageHeadersHelper(uri: String, headerName: String) -> [String] {
    var ret: [String] = []
    guard let myURL = URL(string: uri) else {
        print("Error: \(uri) doesn't seem to be a valid URL")
        fatalError("invalid URI")
    }
    do {
        let html = try String(contentsOf: myURL, encoding: .ascii)
        let doc: Document = try SwiftSoup.parse(html)
        let h1_headers = try doc.select(headerName)
        for el in h1_headers {
            let h1 = try el.text()
            ret.append(h1)
        }
    } catch {
        print("Error")
    }
    return ret
}

/// Return the text of all <h1> elements on the page.
public func webPageH1Headers(uri: String) -> [String] {
    return webPageHeadersHelper(uri: uri, headerName: "h1")
}

/// Return the text of all <h2> elements on the page.
public func webPageH2Headers(uri: String) -> [String] {
    return webPageHeadersHelper(uri: uri, headerName: "h2")
}

/// Return a [text, href] pair for every <a> element on the page.
public func webPageAnchors(uri: String) -> [[String]] {
    var ret: [[String]] = []
    guard let myURL = URL(string: uri) else {
        print("Error: \(uri) doesn't seem to be a valid URL")
        fatalError("invalid URI")
    }
    do {
        let html = try String(contentsOf: myURL, encoding: .ascii)
        let doc: Document = try SwiftSoup.parse(html)
        let anchors = try doc.select("a")
        for a in anchors {
            let text = try a.text()
            let a_uri = try a.attr("href")
            // Fragment-only links such as "#books" get the page URI prepended.
            if a_uri.hasPrefix("#") {
                ret.append([text, uri + a_uri])
            } else {
                ret.append([text, a_uri])
            }
        }
    } catch {
        print("Error")
    }
    return ret
}

This Swift code defines several functions that can be used to scrape information from a web page located at a given URI.

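The webPageText function takes a URI as input and returns the plain text content of the web page located at that URI. It first checks that the URI can be parsed as a URL, stopping the program with fatalError if it cannot, and then reads the content of the web page using the String(contentsOf:encoding:) initializer. It then uses the parse function of the SwiftSoup library to parse the HTML and calls text() on the resulting document to extract the plain text. Because these calls use try!, a download or parsing failure also terminates the program.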
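If you would rather have the caller decide what to do about a bad URI or a failed download, one option is a throwing variant of webPageText. The following is only a sketch and is not part of the library: the ScrapeError type and the function names plainText(fromHTML:) and webPageTextThrowing(uri:) are hypothetical, and I am assuming the page is UTF-8 encoded, where the library code decodes as ASCII.

```swift
import Foundation
import SwiftSoup

// Hypothetical error type for this sketch; not part of WebScraping_swift.
enum ScrapeError: Error {
    case invalidURL(String)
}

// Parse an HTML string and return its visible text.
func plainText(fromHTML html: String) throws -> String {
    let doc: Document = try SwiftSoup.parse(html)
    return try doc.text()
}

// Fetch a page and return its plain text. Errors are thrown to the
// caller instead of ending the program with fatalError or try!.
func webPageTextThrowing(uri: String) throws -> String {
    guard let url = URL(string: uri) else {
        throw ScrapeError.invalidURL(uri)
    }
    // Assumption: the page is UTF-8 encoded.
    let html = try String(contentsOf: url, encoding: .utf8)
    return try plainText(fromHTML: html)
}
```

A caller would wrap webPageTextThrowing(uri:) in a do/catch block and print or log the caught error. Splitting out plainText(fromHTML:) also makes it possible to test the parsing step on a fixed HTML string without network access.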

The webPageH1Headers and webPageH2Headers functions use the webPageHeadersHelper function to extract the H1 and H2 header texts respectively from the web page located at a given URI. The webPageHeadersHelper function uses the same technique as the webPageText function to read and parse the HTML content of the page. It then selects the headers using the specified headerName parameter and extracts the text of the headers.

The webPageAnchors function extracts all the anchor tags <a> from the web page located at a given URI, along with their corresponding text and URI. It reads and parses the HTML content of the page in the same way as webPageHeadersHelper, selects the anchor tags using the “a” selector, and extracts their text and href attributes. When an href is a fragment that starts with “#”, the function prepends the page URI so that the result is a complete link.
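The library code prepends the page URI only to hrefs that start with “#”, so other relative links such as /privacy.html are returned unchanged. If you need absolute links in every case, Foundation can resolve an href against the page URL. The helper name absoluteLink below is hypothetical and is not part of the library; this is a minimal sketch of the idea.

```swift
import Foundation

// Hypothetical helper: resolve an href found on a page against the
// URL of that page. Returns nil if either string cannot be parsed.
func absoluteLink(href: String, pageURI: String) -> String? {
    guard let base = URL(string: pageURI),
          let resolved = URL(string: href, relativeTo: base) else {
        return nil
    }
    // absoluteString combines the base URL and the relative part.
    return resolved.absoluteString
}
```

For example, resolving "/privacy.html" against "https://markwatson.com/index.html" yields "https://markwatson.com/privacy.html", while an href that is already absolute is returned as it was.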

Overall, these functions provide a simple way to scrape information from a web page and extract specific information such as plain text, header texts, and anchor tags.

I wrote these utility functions to get the plain text from a web site, HTML header text, and anchors. You can clone this library and extend it for other types of HTML elements you may need to process.
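As a sketch of what such an extension could look like, the following two functions work on an HTML string that has already been fetched. The names elementTexts and imageSources are hypothetical and do not exist in the library; they use the same SwiftSoup calls (parse, select, text, attr) as the code above.

```swift
import Foundation
import SwiftSoup

// Hypothetical helper: text of every element matching a CSS selector,
// for example "p" or "li". Returns an empty array on a parse error.
func elementTexts(html: String, selector: String) -> [String] {
    var ret: [String] = []
    do {
        let doc: Document = try SwiftSoup.parse(html)
        for el in try doc.select(selector) {
            ret.append(try el.text())
        }
    } catch {
        print("Error: \(error)")
    }
    return ret
}

// Hypothetical helper: the src attribute of every <img> element.
// Images without a src attribute are skipped.
func imageSources(html: String) -> [String] {
    var ret: [String] = []
    do {
        let doc: Document = try SwiftSoup.parse(html)
        for img in try doc.select("img") {
            let src = try img.attr("src")
            if !src.isEmpty {
                ret.append(src)
            }
        }
    } catch {
        print("Error: \(error)")
    }
    return ret
}
```

Taking the HTML as a parameter, rather than a URI, means one download can feed several extraction functions.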

The test program shows how to call the APIs in the library:

import XCTest
import Foundation
import SwiftSoup

@testable import WebScraping_swift

final class WebScrapingTests: XCTestCase {
    func testGetWebPage() {
        let text = webPageText(uri: "https://markwatson.com")
        print("\n\n\tTEXT FROM MARK's WEB SITE:\n\n", text)
    }

    func testToShowSwiftSoupExamples() {
        let myURLString = "https://markwatson.com"
        let h1_headers = webPageH1Headers(uri: myURLString)
        print("\n\n++ h1_headers:", h1_headers)
        let h2_headers = webPageH2Headers(uri: myURLString)
        print("\n\n++ h2_headers:", h2_headers)
        let anchors = webPageAnchors(uri: myURLString)
        print("\n\n++ anchors:", anchors)
    }

    static var allTests = [("testGetWebPage", testGetWebPage),
                           ("testToShowSwiftSoupExamples",
                            testToShowSwiftSoupExamples)]
}

This Swift test program tests the functionality of the WebScraping_swift library. It defines two test functions: testGetWebPage and testToShowSwiftSoupExamples.

The testGetWebPage function uses the webPageText function to retrieve the plain text content of my website located at “https://markwatson.com”. It then prints the retrieved text to the console.

The testToShowSwiftSoupExamples function demonstrates the use of the webPageH1Headers, webPageH2Headers, and webPageAnchors functions on the same website. It extracts and prints the H1 and H2 header texts and the anchor tags.

The allTests variable is an array of tuples that map the test function names to the corresponding function references. It lists the test functions for platforms where XCTest cannot discover them automatically, which historically meant Linux. On Apple platforms XCTest finds methods whose names start with test on its own.

Overall, this Swift test program demonstrates how to use the functions defined in the WebScraping_swift library to extract specific information from a web page.

Here we run the unit tests (with much of the output not shown for brevity):

$ swift test

	TEXT FROM MARK's WEB SITE:

 Mark Watson: AI Practitioner and Polyglot Programmer | Mark Watson    Read my Blog    Fun stuff    My Books    My Open Source Projects    Hire Me    Free Mentoring    Privacy Policy Mark Watson: AI Practitioner and Polyglot Programmer I am the author of 20+ books on Artificial Intelligence, Common Lisp, Deep Learning, Haskell, Clojure, Java, Ruby, Hy language, and the Semantic Web. I have 55 US Patents. My customer list includes: Google, Capital One, Olive AI, CompassLabs, Disney, SAIC, Americast, PacBell, CastTV, Lutris Technology, Arctan Group, Sitescout.com, Embed.ly, and Webmind Corporation.

++ h1_headers: ["Mark Watson: AI Practitioner and Polyglot Programmer", "The books that I have written", "Fun stuff", "Open Source", "Hire Me", "Free Mentoring", "Privacy Policy"]

++ h2_headers: ["I am the author of 20+ books on Artificial Intelligence, Common Lisp, Deep Learning, Haskell, Clojure, Java, Ruby, Hy language, and the Semantic Web. I have 55 US Patents.", "Other published books:"]

++ anchors: [["Read my Blog", "https://mark-watson.blogspot.com"], ["Fun stuff", "https://markwatson.com#fun"], ["My Books", "https://markwatson.com#books"], ["My Open Source Projects", "https://markwatson.com#opensource"], ["Hire Me", "https://markwatson.com#consulting"], ["Free Mentoring", "https://markwatson.com#mentoring"], ["Privacy Policy", "https://markwatson.com/privacy.html"], ["leanpub", "https://leanpub.com/u/markwatson"], ["GitHub", "https://github.com/mark-watson"], ["LinkedIn", "https://www.linkedin.com/in/marklwatson/"], ["Twitter", "https://twitter.com/mark_l_watson"], ["leanpub", "https://leanpub.com/lovinglisp"], ["leanpub", "https://leanpub.com/haskell-cookbook/"], ["leanpub", "https://leanpub.com/javaai"], 
]
Test Suite 'All tests' passed at 2021-08-06 17:37:11.062.
	 Executed 2 tests, with 0 failures (0 unexpected) in 0.471 (0.472) seconds

Running in the Swift REPL

$ swift run --repl
[1/1] Build complete!
Launching Swift REPL with arguments: -I/Users/markw_1/GIT_swift_book/WebScraping_swift/.build/arm64-apple-macosx/debug -L/Users/markw_1/GIT_swift_book/WebScraping_swift/.build/arm64-apple-macosx/debug -lWebScraping_swift__REPL
Welcome to Apple Swift version 5.5 (swiftlang-1300.0.29.102 clang-1300.0.28.1).
Type :help for assistance.
  1> import WebScraping_swift
  2> webPageText(uri: "https://markwatson.com")
$R0: String = "Mark Watson: AI Practitioner and Polyglot Programmer | Mark Watson    Read my Blog    Fun stuff    My Books    My Open Source Projects    Privacy Policy Mark Watson: AI Practitioner and Polyglot Programmer I am the author of 20+ books on Artificial Intelligence, Common Lisp, Deep Learning, Haskell, Clojure, Java, Ruby, Hy language, and the Semantic Web. I have 55 US Patents. My customer list includes: Google, Capital One, Babylist, Olive AI, CompassLabs, Disney, SAIC, Americast, PacBell, CastTV, Lutris Technology, Arctan Group, Sitescout.com, Embed.ly, and Webmind Corporation"...
  3>

This chapter finishes a quick introduction to using Swift and Swift packages for command line utilities. The remainder of this book comprises machine learning, natural language processing, and semantic web/linked data examples.
