Web Scraping

Note: the code for this example was replaced on November 9, 2024.

In my past work I usually used the Ruby and Python scripting languages for web scraping, but since I now use Haskell more often for projects both large and small, I also use it for web scraping, data collection, and data cleaning tasks. If you worked through the tutorial chapter on impure Haskell programming, then you already know most of what you need to understand this chapter. Here we will walk through a few short examples of common web scraping tasks.

Before we start a tutorial about web scraping, I want to point out that much of the information on the web is copyrighted. The first thing you should do is read the terms of service for web sites to ensure that your use of scraped data conforms with the wishes of the people or organizations who own the content and pay to run the sites you are scraping.

As we saw in the last chapter on linked data, there is a huge amount of structured data available on the web via web services, semantic web/linked data markup, and APIs. That said, you will frequently find useful text (usually HTML) on web sites. This text is often at least partially unstructured, and it comes in messy and frequently changing formats because web pages are meant for human consumption; making them easy to parse and use by software agents is not a priority for web site owners. Here is the code for the entire example in the directory haskell_tutorial_cookbook_examples/WebScraping (a code description follows the listing):

{-# LANGUAGE OverloadedStrings #-}

import Network.HTTP.Simple
import Text.HTML.TagSoup
import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import qualified Data.ByteString.Lazy.Char8 as BL8
import Data.Maybe (mapMaybe)

main :: IO ()
main = do
    -- Fetch the HTML content
    response <- httpLBS "https://markwatson.com/"
    let body = BL8.unpack $ getResponseBody response
        tags = parseTags body

    -- Extract and print headers
    let headers = getResponseHeaders response
    putStrLn "Headers:"
    mapM_ print headers

    -- Extract and print all text content
    let texts = extractTexts tags
    putStrLn "\nText Content:"
    TIO.putStrLn texts

    -- Extract and print all links
    let links = extractLinks tags
    putStrLn "\nLinks:"
    mapM_ TIO.putStrLn links

-- Function to extract all text content from tags
extractTexts :: [Tag String] -> Text
extractTexts =
  T.unwords . map (T.strip . T.pack) . filter (not . null) . mapMaybe maybeTagText

-- Function to extract all links from tags
extractLinks :: [Tag String] -> [Text]
extractLinks = map (T.pack . fromAttrib "href") . filter isATag
  where
    isATag (TagOpen "a" _) = True
    isATag _               = False

This Haskell program retrieves and processes the content of the webpage at https://markwatson.com/. It utilizes the http-conduit library to perform an HTTP GET request, fetching the HTML content of the specified URL. The response body, initially in a lazy ByteString format, is converted to a String using BL8.unpack to facilitate subsequent parsing operations.
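The example passes the URL to httpLBS as a string literal, which OverloadedStrings converts to a Request. If you want more control, for example setting a User-Agent header (some sites reject requests without one) and checking the status code before parsing, you can build the Request explicitly. The following sketch is not part of the example program; it uses the same Network.HTTP.Simple API, and the function name and User-Agent string are placeholders I chose for illustration:

{-# LANGUAGE OverloadedStrings #-}

import Network.HTTP.Simple
import qualified Data.ByteString.Lazy.Char8 as BL8

-- Fetch a page with an explicit Request so we can set a User-Agent header
-- and check the status code before handing the body to the parser.
fetchPage :: String -> IO (Maybe String)
fetchPage url = do
  request <- parseRequest url
  let request' = setRequestHeader "User-Agent" ["haskell-tutorial-scraper/0.1"] request
  response <- httpLBS request'
  if getResponseStatusCode response == 200
    then return (Just (BL8.unpack (getResponseBody response)))
    else return Nothing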

For parsing the HTML content, the program employs the TagSoup library, which is adept at handling both well-formed and malformed HTML. The parseTags function processes the HTML String into a list of Tag String elements, representing the structure of the HTML document. This parsed representation enables efficient extraction of specific components, such as headers, text content, and hyperlinks.
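To see concretely what parseTags produces, here is a short GHCi session (not part of the example program) run on a tiny HTML fragment:

ghci> import Text.HTML.TagSoup
ghci> parseTags "<p class=\"x\">Hello, <a href=\"/home\">home</a></p>" :: [Tag String]
[TagOpen "p" [("class","x")],TagText "Hello, ",TagOpen "a" [("href","/home")],TagText "home",TagClose "a",TagClose "p"]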

The program defines two functions, extractTexts and extractLinks, to extract text content and hyperlinks, respectively. The extractTexts function filters the parsed tags to identify text nodes, removes any empty strings, converts them to Text, strips leading and trailing whitespace, and concatenates them into a single Text value. The extractLinks function filters for anchor tags, extracts their href attributes, and converts these URLs to Text.
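If you also want the visible anchor text associated with each link, TagSoup's sections, innerText, and isTagOpenName functions make that straightforward. The helper below is not part of the example program; it is a small sketch (the name extractLinkPairs is my own) that pairs each href attribute with the text inside the anchor tag:

import Text.HTML.TagSoup

-- Pair each anchor's href attribute with its visible link text.
extractLinkPairs :: [Tag String] -> [(String, String)]
extractLinkPairs tags =
  [ (fromAttrib "href" open, innerText inner)
  | open:rest <- sections (isTagOpenName "a") tags
  , let inner = takeWhile (not . isTagCloseName "a") rest ]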

In the main function, after fetching and parsing the HTML content, the program retrieves and prints the HTTP response headers using getResponseHeaders. It then calls extractTexts to obtain and display the textual content of the webpage, followed by extractLinks to list all hyperlinks present in the HTML. This structured approach allows for a clear and organized extraction of information from the specified webpage.
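To build and run the example with cabal, the executable needs the http-conduit, tagsoup, text, and bytestring packages. The exact .cabal file in the book's example repository may differ; this is only a minimal sketch of the stanza:

executable TagSoupTest
  main-is:          TagSoupTest.hs
  build-depends:    base,
                    http-conduit,
                    tagsoup,
                    text,
                    bytestring
  default-language: Haskell2010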

Here is some example output (shortened for brevity):

$ cabal run TagSoupTest
Headers:
("Date","Sat, 09 Nov 2024 18:12:46 GMT")
("Content-Type","text/html; charset=utf-8")
("Transfer-Encoding","chunked")
("Connection","keep-alive")
("Last-Modified","Mon, 04 Nov 2024 22:52:48 GMT")
("Access-Control-Allow-Origin","*")

Text Content:
Mark Watson: AI Practitioner and Author of 20+ AI Books | Mark Watson
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'G-MJNL6DY9ZQ');
Read My Blog on Blogspot      Read My Blog on Substack      Consulting
Fun stuff      My Books      Open Source      Privacy Policy
Mark Watson AI Practitioner and Consultant Specializing in Large Language Models,
LangChain/Llama-Index Integrations, Deep Learning, and the Semantic Web
I am the author of 20+ books on Artificial Intelligence, Python, Common Lisp,
Deep Learning, Haskell, Clojure, Java, Ruby, Hy language, and the Semantic Web.
I have 55 US Patents.
My customer list includes: Google, Capital One, Babylist, Olive AI, CompassLabs,
Mind AI, Disney, SAIC, Americast, PacBell, CastTV, Lutris Technology, Arctan Group,
Sitescout.com, Embed.ly, and Webmind Corporation.

Links:
https://mark-watson.blogspot.com/
https://marklwatson.substack.com
#consulting
#fun
#books
#opensource
https://markwatson.com/privacy.html

Web Scraping Wrap Up

There are many Haskell library options for web scraping and cleaning data. In this chapter I showed you just what I use in my projects.

The material in this chapter and the chapters on text processing and linked data should be sufficient to get you started using online data sources in your applications.