Web Scraping

Web scraping, the automated extraction of content from web pages, is a powerful technique for research, data analysis, and building intelligent applications. Before we dive into the code, it is important to discuss responsible and legal web scraping practices. Always read and respect a site’s robots.txt file and its terms of service before scraping. Limit the rate of your requests so you do not place an undue burden on web servers; a delay of a second or two between requests is good practice. Prefer public APIs when they are available, and only scrape content that is publicly accessible. Be aware that some jurisdictions have laws (such as the Computer Fraud and Abuse Act in the United States or the GDPR in Europe) that restrict automated data collection. When in doubt, contact the site owner for permission. In my own work I routinely email website owners to explain how I plan to use their data, and this approach has served me well. If you treat web scraping the way you would treat visiting someone’s home, politely and with respect for the owner’s resources, you will stay on solid ethical ground.
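The rate-limiting advice can be illustrated with a minimal sketch. The function name fetch-politely and its delay-seconds parameter are hypothetical, not part of the chapter's scripts; the fetch function is passed in as an argument so the pacing logic can be tried without touching the network. The sketch does not consult robots.txt.

```lisp
(defun fetch-politely (urls fetch-fn &key (delay-seconds 1))
  "Calls FETCH-FN on each URL in order, sleeping DELAY-SECONDS between
   consecutive requests. Returns a list of (url . result) pairs.
   Hypothetical helper for illustration only."
  (loop for (url . remaining) on urls
        collect (cons url (funcall fetch-fn url))
        ;; Pause only when another request will follow.
        when remaining
          do (sleep delay-seconds)))

;; Possible use with Drakma (needs network access, so left as a comment):
;; (fetch-politely '("https://markwatson.com") #'drakma:http-request
;;                 :delay-seconds 2)
```

Passing the fetch function as a parameter is a design choice made for this sketch; a real script could simply call drakma:http-request inside the loop.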

In this chapter we develop three self-contained scripts, found in the src/webscraping/ directory: html-headers.lisp, page-text.lisp, and page-markdown.lisp. Together they demonstrate a progression from simple HTML inspection to full content extraction. All three examples use the Drakma HTTP client to fetch pages and the Plump HTML parser to build a DOM tree. The first example also uses CLSS, a CSS-selector engine for querying parsed HTML. The second and third examples use CL-PPCRE, a fast regular-expression library, for text cleanup. Each script loads its dependencies via Quicklisp at the top of the file so you can run it directly from the command line.

Extracting HTML Headers

Our first example, html-headers.lisp, is the simplest: fetch a web page and print the text content of every heading tag, h1 through h6. This is useful for quickly surveying the structure of a page: what sections does it contain, and how is the content organized?

 1 (ql:quickload '(:drakma :plump :clss))
 2 
 3 (defun fetch-and-print-headers (url)
 4   "Fetches the URL and prints text from H1 to H6 tags using drakma, plump, and clss."
 5   (format t "Fetching ~A...~%" url)
 6   (let* ((html-content (drakma:http-request url))
 7          (parsed-html (plump:parse html-content)))
 8     (dolist (tag '("h1" "h2" "h3" "h4" "h5" "h6"))
 9       (format t "~A sections:~%" tag)
10       (let ((nodes (clss:select tag parsed-html)))
11         (loop for node across nodes do
12           (let ((text (plump:text node)))
13             (when text
14               (format t "  - ~A~%"
15                 (string-trim '(#\Space #\Tab #\Newline #\Return) text)))))))))
16 
17 (fetch-and-print-headers "https://markwatson.com")

The flow is straightforward. Line 6 calls drakma:http-request to download the page; for HTML responses Drakma returns the body as a string. Line 7 parses it into a Plump DOM tree. We then loop over each heading level (line 8), use clss:select with a CSS selector string to find matching nodes (line 10), and extract the text content with plump:text (line 12). The string-trim call on line 15 strips leading and trailing whitespace from each heading’s text.
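The script above assumes the request succeeds and that the body is text. Here is a hedged sketch of a guard one might add: drakma:http-request returns the status code as its second value, and it returns an octet vector rather than a string for non-text content types. The helper name usable-html is hypothetical.

```lisp
(defun usable-html (body status)
  "Returns BODY when STATUS is a 2xx code and BODY is a string; otherwise NIL.
   Hypothetical helper, not part of the chapter's scripts."
  (and (integerp status)
       (<= 200 status 299)
       (stringp body)
       body))

;; Possible use with Drakma (needs network access, so left as a comment):
;; (multiple-value-bind (body status) (drakma:http-request url)
;;   (let ((html (usable-html body status)))
;;     (if html
;;         (plump:parse html)
;;         (format t "Skipping ~A (status ~A)~%" url status))))
```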

You can run this example from the command line:

sbcl --load html-headers.lisp --eval "(sb-ext:exit)"

Here is a snippet of the output when run against my personal site:

Fetching https://markwatson.com...
h1 sections:
  - Mark Watson: AI Practitioner and Author
h2 sections:
  - Current Projects
  - Free Books
  - Commercial Books
h3 sections:
h4 sections:
h5 sections:
h6 sections:

This tiny script is already quite useful. You could extend it to crawl a list of URLs and build a table of contents for an entire site.
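As a starting point for the table-of-contents idea, here is a minimal sketch that walks a parsed page in document order and prints an indented outline. The function names heading-level, heading-outline, and print-outline are hypothetical; the sketch parses an HTML string directly, so fetching each URL with Drakma is left out.

```lisp
(ql:quickload :plump)

(defun heading-level (tag)
  "Returns 1-6 for the tag names h1-h6 (any case), otherwise NIL."
  (when (and (= (length tag) 2)
             (char-equal (char tag 0) #\h))
    (let ((level (digit-char-p (char tag 1))))
      (when (and level (<= 1 level 6))
        level))))

(defun heading-outline (node)
  "Walks NODE in document order; returns a list of (level . text) pairs."
  (cond
    ;; A heading element contributes one entry; we do not descend into it.
    ((and (typep node 'plump:element)
          (heading-level (plump:tag-name node)))
     (list (cons (heading-level (plump:tag-name node))
                 (string-trim '(#\Space #\Tab #\Newline #\Return)
                              (plump:text node)))))
    ;; Any other container: collect the entries of its children.
    ((typep node 'plump:nesting-node)
     (loop for child across (plump:children node)
           append (heading-outline child)))
    (t '())))

(defun print-outline (outline &optional (stream *standard-output*))
  "Prints each heading indented by two spaces per level below h1."
  (loop for (level . text) in outline
        do (format stream "~A- ~A~%"
                   (make-string (* 2 (1- level)) :initial-element #\Space)
                   text)))

;; Example:
;; (print-outline
;;   (heading-outline
;;     (plump:parse "<h1>Site</h1><div><h2>Books</h2></div>")))
```

Unlike the per-tag loop in html-headers.lisp, this walk keeps headings in the order they appear on the page, which is what an outline needs.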

Extracting Page Content as Plain Text

Our second example, page-text.lisp, goes beyond headers and extracts the full readable text of a web page, stripping out scripts, styles, navigation, footers, and other boilerplate. The result is clean plain text suitable for natural language processing, summarization, or feeding into an LLM.

 1 (ql:quickload '(:drakma :plump :cl-ppcre))
 2 
 3 (defun text-node-p (node)
 4   (typep node 'plump:text-node))
 5 
 6 (defun element-node-p (node)
 7   (typep node 'plump:element))
 8 
 9 (defun ignore-tag-p (tag)
10   (member tag '("script" "style" "head" "nav" "header"
11                 "footer" "iframe" "noscript")
12           :test #'string-equal))
13 
14 (defun get-element-spacing (tag text)
15   "Returns the text wrapped in appropriate layout formatting
16    or markers based on the tag."
17   (cond
18     ((string-equal tag "h1")
19       (format nil "~%__H1__~%~A~%__H1__~%" text))
20     ((string-equal tag "h2")
21       (format nil "~%__H2__~%~A~%__H2__~%" text))
22     ((member tag '("h3" "h4" "h5" "h6") :test #'string-equal)
23       (format nil "~%~A~%~%" text))
24     ((member tag '("p" "blockquote" "pre" "ul" "ol")
25              :test #'string-equal)
26       (format nil "~%~A~%~%" text))
27     ((string-equal tag "br")
28       (format nil "~A~%" text))
29     ((member tag '("li" "tr") :test #'string-equal)
30       (format nil "~A~%" text))
31     ((member tag '("div" "article" "section" "aside" "main")
32              :test #'string-equal)
33       (if (and (> (length text) 0)
34                (not (char= (char text (1- (length text)))
35                             #\Newline)))
36           (format nil "~A~%" text)
37           text))
38     (t text)))
39 
40 (defun get-clean-text (node)
41   "Recursively collects text from NODE, skipping non-content
42    tags and inserting linebreaks/markers for tags."
43   (cond
44     ((text-node-p node)
45       (plump:text node))
46     ((element-node-p node)
47       (let ((tag (plump:tag-name node)))
48         (if (ignore-tag-p tag)
49             ""
50             (let ((text (with-output-to-string (s)
51                           (loop for child across (plump:children node)
52                                 do (write-string
53                                      (get-clean-text child) s)))))
54               (get-element-spacing tag text)))))
55     ((typep node 'plump:nesting-node)
56       (with-output-to-string (s)
57         (loop for child across (plump:children node)
58               do (write-string (get-clean-text child) s))))
59     (t "")))

The key design decision here is the ignore-tag-p function (line 9), which filters out tags that contain non-content material: script, style, head, nav, header, footer, iframe, and noscript. The get-element-spacing function (line 14) maps each HTML element to the appropriate whitespace treatment: h1 and h2 headings are wrapped in marker strings, h3 through h6 headings and paragraph-like blocks get a leading newline and a trailing blank line, list items and table rows end with a single newline, and so on.

The marker strings __H1__ and __H2__ are a clever technique: they act as placeholders that the whitespace cleanup pass (below) can later replace with the correct number of blank lines, without being collapsed during the intermediate cleanup steps.

(defun clean-whitespace (text)
  "Cleans up excessive spaces and newlines in the text,
   preserving spacing for H1/H2."
  (let* ((n (format nil "~%"))
         ;; 1. Empty out lines that contain only whitespace
         (text (cl-ppcre:regex-replace-all
                 "(?m)^[ \\t]+$" text ""))
         ;; 2. Collapse runs of spaces and tabs into one space
         (text (cl-ppcre:regex-replace-all
                 "[ \\t]+" text " "))
         ;; 3. Collapse 3+ newlines to at most 2
         (text (cl-ppcre:regex-replace-all
                 (format nil "~A{3,}" n)
                 text (format nil "~A~A" n n)))
         ;; 4. Replace __H1__ markers with 4 newlines
         (text (cl-ppcre:regex-replace-all
                 (format nil "~A*__H1__~A*" n n)
                 text (format nil "~A~A~A~A" n n n n)))
         ;; 5. Replace __H2__ markers with 3 newlines
         (text (cl-ppcre:regex-replace-all
                 (format nil "~A*__H2__~A*" n n)
                 text (format nil "~A~A~A" n n n)))
         ;; 6. Trim leading/trailing whitespace
         (text (string-trim
                 '(#\Space #\Tab #\Newline #\Return) text)))
    text))

(defun fetch-and-print-text (url)
  "Fetches the URL and prints cleaned up text content."
  (format t "Fetching ~A...~%" url)
  (let* ((html-content (drakma:http-request url))
         (parsed-html (plump:parse html-content))
         (raw-text (get-clean-text parsed-html))
         (cleaned-text (clean-whitespace raw-text)))
    (format t "~A~%" cleaned-text)))

(fetch-and-print-text "https://markwatson.com")

The clean-whitespace function applies five regular-expression passes using CL-PPCRE (empty out whitespace-only lines, collapse runs of spaces and tabs, collapse three or more newlines into two, and expand the __H1__ and __H2__ markers into four and three newlines respectively), then trims leading and trailing whitespace from the whole result with string-trim.
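To see why the order of the passes matters, here is a small sketch that isolates two of them: the newline collapse and the __H1__ expansion. The helper names collapse-newlines and expand-h1-markers are hypothetical; they reuse the same regular expressions as clean-whitespace.

```lisp
(ql:quickload :cl-ppcre)

(defparameter *nl* (string #\Newline))

(defun collapse-newlines (text)
  "Collapses runs of three or more newlines into exactly two."
  (cl-ppcre:regex-replace-all (format nil "~A{3,}" *nl*)
                              text
                              (format nil "~A~A" *nl* *nl*)))

(defun expand-h1-markers (text)
  "Replaces each __H1__ marker and its adjacent newlines with four newlines."
  (cl-ppcre:regex-replace-all (format nil "~A*__H1__~A*" *nl* *nl*)
                              text
                              (format nil "~A~A~A~A" *nl* *nl* *nl* *nl*)))

(defparameter *sample*
  (format nil "intro~5%__H1__~%Title~%__H1__~3%body"))

;; Collapse first, then expand: the heading keeps four newlines on each side.
(defparameter *markers-last*
  (expand-h1-markers (collapse-newlines *sample*)))

;; Expand first, then collapse: the extra spacing is squeezed back to two.
(defparameter *markers-first*
  (collapse-newlines (expand-h1-markers *sample*)))
```

In this sketch *markers-last* keeps the wide spacing around the title while *markers-first* loses it, which is the reason the markers are expanded only after the collapsing pass has run.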

Sample output (truncated):

Fetching https://markwatson.com...



Mark Watson: AI Practitioner and Author



Current Projects

I am currently working on the 6th edition of my book
"Loving Common Lisp, or the Savvy Programmer's Secret Weapon."
I also develop AI-powered applications using Common Lisp,
Haskell, and Python.



Free Books

Loving Common Lisp, or the Savvy Programmer's Secret Weapon
A Lisp Programmer Living in Python-Land: The Hy Programming Language
Practical Artificial Intelligence Programming With Clojure

Converting a Web Page to Markdown

Our third and most sophisticated example, page-markdown.lisp, converts a full web page into well-formed Markdown. This is especially useful for feeding web content into large language models: Markdown keeps the structure of the document (headings, lists, links, emphasis) while dropping most of the markup noise of raw HTML.

(ql:quickload '(:drakma :plump :cl-ppcre))

(defun text-node-p (node)
  (typep node 'plump:text-node))

(defun element-node-p (node)
  (typep node 'plump:element))

(defun ignore-tag-p (tag)
  (member tag '("script" "style" "head" "noscript" "iframe")
          :test #'string-equal))

(defun block-element-p (tag)
  (member tag '("p" "div" "li" "br" "h1" "h2" "h3" "h4"
                "h5" "h6" "tr" "article" "section" "aside")
          :test #'string-equal))

(defun html-to-markdown (node)
  "Recursively converts a plump HTML node into Markdown."
  (cond
    ((text-node-p node)
      (plump:text node))
    ((element-node-p node)
      (let* ((tag (plump:tag-name node))
             (inner-md
               (with-output-to-string (s)
                 (loop for child across (plump:children node)
                       do (write-string
                            (html-to-markdown child) s)))))
        (cond
          ((ignore-tag-p tag) "")
          ((string-equal tag "h1")
            (format nil "~%__H1__~%# ~A~%__H1__~%"
              (string-trim '(#\Space #\Tab #\Newline #\Return)
                           inner-md)))
          ((string-equal tag "h2")
            (format nil "~%__H2__~%## ~A~%__H2__~%"
              (string-trim '(#\Space #\Tab #\Newline #\Return)
                           inner-md)))
          ((string-equal tag "h3")
            (format nil "~%### ~A~%~%"
              (string-trim '(#\Space #\Tab) inner-md)))
          ((string-equal tag "h4")
            (format nil "~%#### ~A~%~%"
              (string-trim '(#\Space #\Tab) inner-md)))
          ((string-equal tag "h5")
            (format nil "~%##### ~A~%~%"
              (string-trim '(#\Space #\Tab) inner-md)))
          ((string-equal tag "h6")
            (format nil "~%###### ~A~%~%"
              (string-trim '(#\Space #\Tab) inner-md)))
          ((string-equal tag "p")
            (format nil "~%~A~%~%"
              (string-trim '(#\Space #\Tab #\Newline #\Return)
                           inner-md)))
          ((string-equal tag "br") (format nil "~%"))
          ((string-equal tag "strong")
            (format nil "**~A**" inner-md))
          ((string-equal tag "b")
            (format nil "**~A**" inner-md))
          ((string-equal tag "em")
            (format nil "*~A*" inner-md))
          ((string-equal tag "i")
            (format nil "*~A*" inner-md))
          ((string-equal tag "code")
            (format nil "`~A`" inner-md))
          ((string-equal tag "pre")
            (format nil "~%```~%~A~%```~%" inner-md))
          ((string-equal tag "a")
            (let ((href (plump:attribute node "href")))
              (if (and href
                       (> (length (string-trim
                                    '(#\Space #\Tab)
                                    inner-md)) 0))
                  (format nil "[~A](~A)"
                    (string-trim
                      '(#\Space #\Tab #\Newline #\Return)
                      inner-md)
                    href)
                  inner-md)))
          ((string-equal tag "img")
            (let ((src (plump:attribute node "src"))
                  (alt (or (plump:attribute node "alt")
                           "image")))
              (if src
                  (format nil "![~A](~A)" alt src)
                  "")))
          ((string-equal tag "li")
            (format nil "* ~A~%"
              (string-trim '(#\Space #\Tab #\Newline #\Return)
                           inner-md)))
          ((block-element-p tag)
            (format nil "~%~A~%" inner-md))
          (t inner-md))))
    ((typep node 'plump:nesting-node)
      (with-output-to-string (s)
        (loop for child across (plump:children node)
              do (write-string (html-to-markdown child) s))))
    (t "")))

The html-to-markdown function is the heart of this example. It walks the DOM tree recursively. For each element node it first converts all children to Markdown (the inner-md string), then wraps the result in the appropriate Markdown syntax based on the tag: # prefixes for headings, **...** for bold, *...* for italic, [text](url) for links, ![alt](src) for images, * for list items, and triple backticks for code blocks.
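To make the children-first recursion concrete, here is a deliberately reduced sketch that handles only text, strong, em, and a elements. The function name inline-md is hypothetical and this is not the chapter's converter; it only isolates the pattern of converting the children first and wrapping the result afterwards.

```lisp
(ql:quickload :plump)

(defun inline-md (node)
  "Reduced illustration: converts text, <strong>, <em>, and <a> to Markdown.
   Every other element simply passes its converted children through."
  (cond
    ((typep node 'plump:text-node)
     (plump:text node))
    ((typep node 'plump:nesting-node)
     (let ((inner (with-output-to-string (s)
                    ;; Convert the children first ...
                    (loop for child across (plump:children node)
                          do (write-string (inline-md child) s))))
           ;; The root node is a nesting node without a tag name.
           (tag (if (typep node 'plump:element)
                    (plump:tag-name node)
                    "")))
       ;; ... then wrap the combined result according to the tag.
       (cond
         ((string-equal tag "strong") (format nil "**~A**" inner))
         ((string-equal tag "em") (format nil "*~A*" inner))
         ((string-equal tag "a")
          (format nil "[~A](~A)"
                  inner
                  (or (plump:attribute node "href") "")))
         (t inner))))
    (t "")))

;; Example:
;; (inline-md
;;   (plump:parse
;;     "<p>See <a href=\"https://example.com\">the <strong>docs</strong></a> now</p>"))
```

Because the children are converted before the parent is wrapped, nested markup such as bold text inside a link composes without any special handling.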

The whitespace cleanup function applies the same passes as the plain-text version:

(defun clean-markdown-whitespace (text)
  "Cleans up excessive spaces and newlines in the Markdown,
   preserving spacing for H1/H2."
  (let* ((n (format nil "~%"))
         (text (cl-ppcre:regex-replace-all
                 "(?m)^[ \\t]+$" text ""))
         (text (cl-ppcre:regex-replace-all
                 "[ \\t]+" text " "))
         (text (cl-ppcre:regex-replace-all
                 (format nil "~A{3,}" n)
                 text (format nil "~A~A" n n)))
         (text (cl-ppcre:regex-replace-all
                 (format nil "~A*__H1__~A*" n n)
                 text (format nil "~A~A~A~A" n n n n)))
         (text (cl-ppcre:regex-replace-all
                 (format nil "~A*__H2__~A*" n n)
                 text (format nil "~A~A~A" n n n)))
         (text (string-trim
                 '(#\Space #\Tab #\Newline #\Return) text)))
    text))

(defun fetch-and-print-markdown (url)
  "Fetches the URL and prints content converted to Markdown."
  (format t "Fetching ~A...~%" url)
  (let* ((html-content (drakma:http-request url))
         (parsed-html (plump:parse html-content))
         (raw-markdown (html-to-markdown parsed-html))
         (cleaned-markdown
           (clean-markdown-whitespace raw-markdown)))
    (format t "~A~%" cleaned-markdown)))

(fetch-and-print-markdown "https://markwatson.com")

Sample output (truncated):

Fetching https://markwatson.com...



# Mark Watson: AI Practitioner and Author



## Current Projects

I am currently working on the 6th edition of my book
**Loving Common Lisp, or the Savvy Programmer's Secret Weapon.**
I also develop AI-powered applications using Common Lisp,
Haskell, and Python.



## Free Books

* [Loving Common Lisp](https://leanpub.com/lovinglisp)
* [A Lisp Programmer Living in Python-Land](https://leanpub.com/hy-lisp-python)

Wrap Up

The three scripts in this chapter form a practical toolkit for extracting information from the web using Common Lisp. The header extractor gives you a quick structural overview, the plain-text extractor yields clean readable content, and the Markdown converter produces richly formatted output ideal for downstream processing.

Here are some project ideas that build on this web scraping code:

Build a personal knowledge base. Scrape articles and blog posts you read frequently, convert them to Markdown, and store them in a local file system or database for full-text search. This is especially powerful when combined with the embedding and vector search techniques covered in later chapters.
Create a site-structure analyzer. Extend the header extraction script to crawl an entire site (following internal links) and build a hierarchical table of contents. This is invaluable for auditing large documentation sites or wikis.
Feed web content to an LLM. Use the Markdown converter to scrape a page and pass the cleaned output directly to an LLM API (such as the OpenAI, Ollama, or Gemini interfaces covered elsewhere in this book) for summarization, question answering, or translation.
Monitor pages for changes. Run the plain-text extractor on a schedule and diff successive snapshots to detect when a page’s content changes, useful for tracking product prices, news updates, or government filings.
Extract structured data. Adapt the CSS-selector technique from the header example to pull specific data fields (prices, dates, names) from pages with consistent HTML structure, and export the results as CSV or JSON.
Combine with the Lightpanda browser client. For pages that require JavaScript rendering, use the Lightpanda interface from the previous chapter to fetch the fully rendered HTML, then pass that HTML through the text or Markdown extraction functions developed here.
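The change-monitoring idea in the list above can be sketched with nothing but standard Common Lisp file I/O. The function names read-file-string and snapshot-changed-p are hypothetical, and the sketch assumes the page text has already been extracted (for example by the plain-text functions in this chapter) and that one snapshot file per page is acceptable.

```lisp
(defun read-file-string (path)
  "Returns the contents of PATH as a string, or NIL if the file is missing."
  (with-open-file (in path :if-does-not-exist nil)
    (when in
      (let* ((buffer (make-string (file-length in)))
             (end (read-sequence buffer in)))
        ;; FILE-LENGTH may exceed the character count, so trim to what was read.
        (subseq buffer 0 end)))))

(defun snapshot-changed-p (path new-text)
  "Compares NEW-TEXT with the snapshot stored at PATH, then overwrites the
   snapshot. Returns T when there was no snapshot or the text differs."
  (let ((old-text (read-file-string path)))
    (with-open-file (out path :direction :output
                              :if-exists :supersede
                              :if-does-not-exist :create)
      (write-string new-text out))
    (not (and old-text (string= old-text new-text)))))

;; Possible use (the file name is an assumption for illustration):
;; (when (snapshot-changed-p "markwatson-snapshot.txt" cleaned-text)
;;   (format t "Page content changed.~%"))
```

This only reports that something changed; producing an actual diff of the two versions would need an additional library or an external tool.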

Optional Practice Problems

Extraction of Hyperlinks into an Association List: In page-text.lisp, hyperlink tags <a> are processed by simply extracting their inner text content. Write a function extract-all-links that queries the parsed DOM tree using CLSS (you will need to add :clss to the ql:quickload list) and collects every link as an association list of (anchor-text . url) pairs. Filter out links with empty anchor text, and resolve relative links against the base URL.
Robots.txt Parser and Rate Limiting Compliance: To comply with the responsible scraping guidelines in the introduction of this chapter, write a utility function allowed-by-robots-p in a new file or as a helper. Fetch and parse the /robots.txt file for a given URL, check whether the current user agent is allowed to access the target path, and read any Crawl-delay directive so you can sleep for that long before making requests via Drakma.
Markdown Table Converter: The html-to-markdown function in page-markdown.lisp handles block elements, headers, bold/italic markup, links, images, and lists, but does not support HTML tables (<table>, <tr>, <th>, <td>). Extend html-to-markdown to parse HTML table nodes and render them as correctly formatted GitHub Flavored Markdown (GFM) tables, including table header separators.
Dynamic User-Agent and Request Headers Customization: Some websites block requests that carry Drakma’s default User-Agent header. Create a configuration utility or a helper function in html-headers.lisp that randomly selects a User-Agent string from a pre-defined list of modern browser values (Chrome, Firefox, Safari) and adds custom request headers like Accept-Language or Referer to the Drakma request.
Recursive Site Crawler with Level Limits (DFS/BFS): Build a recursive site crawler in page-text.lisp that starts from a seed URL, extracts all internal links (using your link extraction function), and recursively scrapes pages up to a maximum depth limit (e.g. depth 2). Store the scraped pages as individual Markdown files, keeping track of visited URLs in a hash table to avoid infinite recursion.
Automatic Encoding Detection and Recovery: When Drakma fetches HTML content as a string, it may misinterpret the encoding if the server does not declare a character set in the Content-Type response header. Write a wrapper function fetch-html-with-encoding-recovery in page-text.lisp that handles encoding recovery. If the returned string contains malformed characters, search the raw binary response for <meta charset="..."> tags, and re-decode the byte array using flexi-streams:octets-to-string with the detected encoding.
