Read A Lisp Programmer Living in Python-Land: The Hy Programming Language

A Lisp Programmer Living in Python-Land: The Hy Programming Language

Mark Watson

Cover Material, Copyright, and License
Preface
- Requests from the Author
- Setting Up Your Development Environment
- What is Lisp Programming Style?
- Hy is Python, But With a Lisp Syntax
- How This Book Reflects My Views on Artificial Intelligence and the Future of Society and Technology
- About the Book Cover
Introduction to the Hy Language
- Using Python Libraries
- Global vs. Local Variables
- Using Python Code in Hy Programs
- Using Hy Libraries in Python Programs
- Replacing the Python slice (cut) Notation with the Hy Functional Form
- Iterating Through a List With Index of Each Element
- Formatted Output
- Importing Libraries from Different Directories on Your Laptop
- Hy Looks Like Clojure: How Similar Are They?
- Plotting Data Using the Numpy and the Matplotlib Libraries
- Bonus Points: Configuration for macOS and ITerm2 for Generating Plots Inline in a Hy REPL and Shell
Why Lisp?
- I Hated the Waterfall Method in the 1970s but Learned to Love a Bottom-Up Programming Style
- First Introduction to Lisp
- Commercial Product Development and Deployment Using Lisp
- Performing Bottom Up Development Inside a REPL is a Lifestyle Choice
Writing Web Applications
- Getting Started With Flask: Using Python Decorators in Hy
- Using Jinja2 Templates To Generate HTML
- Handling HTTP Sessions and Cookies
- Deploying Hy Language Flask Apps to Google Cloud Platform AppEngine
- Wrap Up
Responsible Web Scraping
- Using the Python BeautifulSoup Library in the Hy Language
- Getting HTML Links from the DemocracyNow.org News Web Site
- Getting Summaries of Front Page from the NPR.org News Web Site
Using the Brave Search APIs
- Setting an Environment Variable for the Access Key for Brave Search APIs
- Example Search Script
- Wrap-up
Deep Learning
- Simple Multi-layer Perceptron Neural Networks
- Deep Learning
- Using Keras and TensorFlow to Model The Wisconsin Cancer Data Set
- Using a LSTM Recurrent Neural Network to Generate English Text Similar to the Philosopher Nietzsche’s Writing
Natural Language Processing
- Exploring the spaCy Library
- Implementing a HyNLP Wrapper for the Python spaCy Library
- Wrap-up
Datastores
- Sqlite
- PostgreSQL
- RDF Data Using the “rdflib” Library
- Wrap-up
Linked Data, the Semantic Web, and Knowledge Graphs
- Understanding the Resource Description Framework (RDF)
- Resource Namespaces Provided in rdflib
- Understanding the SPARQL Query Language
- Wrapping the Python rdflib Library
Knowledge Graph Creator
- Recommended Industrial Use of Knowledge Graphs
- Design of KGCreator Application
- Problems with using Literal Values in RDF
- Revisiting This Example Using URIs Instead of Literal Values
- Wrap-up
Knowledge Graph Navigator
- Review of NLP Utilities Used in Application
- Utilities to Colorize SPARQL and Generated Output
- Text Utilities for Queries and Results
- Finishing the Main Function for KGN
- Wrap-up
Using OpenAI GPT
- OpenAI Text Completion API
Using Google Gemini API
- REST Interface
- Using Google’s Python Package to Access Gemini
- Wrap Up for Using the Gemini APIs
Running Local LLMs Using Ollama
- Completions
- Tool Use
- Wrap Up for Running Local LLMs Using Ollama
Agents Using the Agno Agent Framework Running On a Local Ollama Model
- An Agent For Answering Questions About A Specific Web Site
- Wrap Up for Agno Agent Example
Using Perplexity Sonar Model for Combined Web Search and LLM Based Reasoning
- A Hy Language Client Library for Perplexity
- Example Output
- Wrap Up for Using Perplexity
Using LangChain to Chain Together Large Language Models
- Installing Necessary Packages
- Basic Usage and Examples
- Creating Embeddings
- Using LangChain Vector Stores to Query Documents
- LangChain Wrap Up
Large Language Models Experiments Using Google Colab
Book Wrap-up

Cover Material, Copyright, and License

This eBook will be updated occasionally so please periodically check the leanpub.com web page for this book for updates.

If you read my eBooks free online then please consider hiring me for consulting work https://markwatson.com.

This is this edition released August 2025.

Please visit the author’s website.

Preface

While this is a book on the Hy Lisp language, we have a wider theme here. In an age where artificial intelligence (AI) is a driver of the largest corporations and government agencies, the question is how do individuals and small organizations take advantage of AI technologies given the disadvantages of small scale. The material I chose to write about here is selected to help you, dear reader, survive as a healthy small fish in a big bond.

I have been using Lisp languages professionally since 1982 and have written books covering the Common Lisp and Scheme languages. Most of my career has involved working on AI projects so tools for developing AI applications will be a major theme. In addition to covering the Hy language, you will get experience with AI tools and techniques that will help you craft your own AI platforms regardless of whether you are a consultant, work at a startup, or a corporation.

The latest version of this book (updated in August 2025) has major code changes:

Code examples modified to work with the latest Hy version 1.1.0.
The README files for the code examples and book text now reflect the author’s use of the uv Python utility to run Hy scripts.
Addition of a chapter to run LLMs locally using Ollama.

The code examples can be found in my GitHub repository https://github.com/mark-watson/hy-lisp-python-book that contains code examples in the directory source_code_for_examples and the Markdown manuscript files for this book in the directory manuscript.

This book covers many programming topics using the Lisp language Hy that compiles to Python AST and is compatible with code, libraries, and frameworks written in Python. The main topics we will cover and write example applications for are:

Deep Learning
Large Langauge Models (LLMS)
Relational and graph databases
Web app development
Web scraping
Accessing semantic web and linked data sources like Wikipedia, DBpedia, and Wikidata
Automatically constructing Knowledge Graphs from text documents, semantic web and linked data
Natural Language Processing (NLP) using Deep Learning

The topics were chosen because of my work experience and the theme of this book is how to increase programmer productivity and happiness using a Lisp language in a bottom-up development style. This style relies heavily on the use of an interactive REPL for exploring APIs and writing new code. I chose the above topics based on my experience working as a developer and researcher. Please note: you will see the term REPL frequently in this book. REPL stands for Read Eval Print Loop.

Some of the examples are very simple (e.g., the web app examples) while some are more complex (e.g., Deep Learning and knowledge graph examples). Regardless of the simplicity or complexity of the examples I hope that you find the code interesting, useful in your projects, and fun to experiment with.

Requests from the Author

This book will always be available to read free online at https://leanpub.com/hy-lisp-python/read.

That said, I appreciate it when readers purchase my books because the income enables me to spend more time writing.

Hire the Author as a Consultant

I am available for short consulting projects. Please see https://markwatson.com.

Setting Up Your Development Environment

In August 2025 I changed the way that I build and run both Hy language and Python scripts and programs. I now use uv and install dependencies in each Hy or Python project directory.

To free disk space for the venv directories, I define a top level Makefile in the GitGub repository for the example programs https://github.com/mark-watson/hy-lisp-python-book:

1 clean:
2 	rm -rf */__pycache__ */venv

What is Lisp Programming Style?

I will give some examples here and also show exploratory Hy language REPL examples later in the book. How often do you search the web for documentation on how to use a library, write some code only to discover later that you didn’t use the API correctly? I reduce the amount of time that I spend writing code by having a Lisp REPL open so that I can experiment with API calls and returned results while reading the documentation.

When I am working on new code or a new algorithm I like to have a Lisp REPL open and try short snippets of code to get working code for solving low level problems, building up to more complex code. As I figure out how to do things I enter code that works and which I want to keep in a text editor and then convert this code into my own library. I then iterate on loading my new library into a REPL and stress test it, look for API improvements, etc.

I find, in general, that a “bottom-up” approach gets me to working high quality systems faster than spending too much time doing up front planning and design. The problem with spending too much up front time on design is that we change our minds as to what makes the most sense to solve a problem as we experiment with code. I try to avoid up front time spent on work that I will have to re-work or even toss out.

Hy is Python, But With a Lisp Syntax

When I need a library for a Hy project I search for Python libraries and either write a thin Hy language “wrapper” around the Python library or just call the Python APIs directly from Hy code. You will see many examples of both approaches in this book.

How This Book Reflects My Views on Artificial Intelligence and the Future of Society and Technology

Since starting work on AI in 1982 I have seen the field progress from a niche technology where even international conferences had small attendances to a field that is generally viewed as transformative. In the USA there is legitimate concern that economic adversaries like China will exceed our abilities to develop core AI technologies and integrate these technologies into commercial and military systems. As I write this in February 2020, some people in our field including myself believe that the Chinese company Baidu may have already passed Google and Microsoft in applied AI.

Even though most of my professional work in the last five years has been in Deep Learning (and before that I worked with the Knowledge Graph at Google on a knowledge representation problem and application), I believe that human level Artificial General Intelligence (AGI) will use hybrid Deep Learning, “old fashioned” symbolic AI, and techniques that we have yet to discover.

This belief that Deep Learning will not get us to AGI capabilities is a motivation for me to use the Hy language because it offers transparent access to Python Deep Learning frameworks with a bottom-up Lisp development style that I have used for decades using symbolic AI and knowledge representation.

I hope you find that Hy meets your needs as it does my own.

About the Book Cover

The official Hy Language logo is an octopus:

The Hy Language logo Cuddles by Karen Rustad

Usually I use photographs that I take myself for covers of my LeanPub books. Although I have SCUBA dived since I was 13 years old, sadly I have no pictures of an octopus that I have taken myself. I did find a public domain picture I liked (that is the cover of this book) on Wikimedia. Cover Credit: Thanks to Wikimedia user Pseudopanax for placing the cover image in the public domain.

I thank my wife Carol for editing this manuscript, finding typos, and suggesting improvements.

I would like to thank Pascal (Reddit user chuchana) for corrections and suggestions. I would like to thank Carlos Ungil for catching a typo and reporting it. I would like to thank Jud Taylor for finding several typo errors. I would like to thank Dave Smythe for finding some typos.

Introduction to the Hy Language

The Hy programming language is a Lisp language that inter-operates smoothly with Python. We start with a few interactive examples that I encourage you to experiment with as you read. Then we will look at Hy data types and commonly used built-in functions that are used in the remainder of this book.

I assume that you know at least a little Python and more importantly the Python ecosystem and general tools like uv and pip.

Please start by installing uv on your laptop or server. Use either of the following commands on macOS or Linux:

pip install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

On Windows use one of these commands:

irm https://astral.sh/uv/install.ps1 | iex

Using Python Libraries

Using Python libraries like TensorFlow, Keras, BeautifulSoup, etc. are the reason I use the Hy language. Importing Python code and libraries and calling out to Python is simple and here we look at sufficient examples so that you will understand example code that we will look at later.

Note: starting in August 2025 the example programs in the GitHub repository XX are in individual directores, each pre-configured for use with uv

For example, in the chapter Responsible Web Scraping we will use the BeautifulSoup library. We will look at some Python code snippets and the corresponding Hy language versions of these snippets. Let’s first look at a Python example that we will then convert to Hy:

1 from bs4 import BeautifulSoup
2 
3 raw_data = '<html><body><a href="http://markwatson.com">Mark</a></body></html>'
4 soup = BeautifulSoup(raw_data)
5 a_tags = soup.find_all("a")
6 print("a tags:", a_tags)

In the following listing notice how we import other code and libraries in Hy. The special form setv is used to define variables in a local context. Since the setv statements in lines 4, 6, and 7 are used at the top level, they are global in the Python/Hy module named after the root name of the source file.

 1 $ cd hy-lisp-python-book/source_code_for_examples/webscraping
 2 $ uv run hy
 3 Hy 1.1.0 (Business Hugs) using CPython(main) 3.12.0 on Darwin
 4 => (import bs4 [BeautifulSoup])
 5 => (setv raw-data "<html><body><a href=\"http://markwatson.com\">Mark</a></body></ht\
 6 ml>")
 7 => (setv soup (BeautifulSoup raw-data "lxml"))
 8 => (setv a (.find-all soup "a"))
 9 => (print "atags:" a)
10 atags: [<a href="http://markwatson.com">Mark</a>]
11 => (type a)
12 <class 'bs4.element.ResultSet'>
13 => (dir a)
14 ['__add__', '__class__', '__contains__', '__delattr__', '__delitem__', '__dict__', '\
15 __dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribut\
16 e__', '__getitem__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__in\
17 it_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__mul__', \
18 '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__r\
19 mul__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '\
20 __weakref__', 'append', 'clear', 'copy', 'count', 'extend', 'index', 'insert', 'pop'\
21 , 'remove', 'reverse', 'sort', 'source']

Notice in lines 3 and 6 that we can have “-“ characters inside of variable and function names (raw-data and find-all in this case) in the Hy language where we might use “_” underscore characters in Python. Like Python, we can use type get get the type of a value and dir to see what symbols are available for a object.

Global vs. Local Variables

Although I don’t generally recommend it, sometimes it is convenient to export local variables defined with setv to be global variables in the context of the current module (that is defined by the current source file). As an example:

 1 $ uv run hy
 2 Hy 1.1.0 (Business Hugs) using CPython(main) 3.12.0 on Darwin
 3 => (defn foo []
 4 ... (global x)
 5 ... (setv x 1)
 6 ... (print x))
 7 => (foo)
 8 1
 9 => x
10 1
11 =>

Before executing function foo the global variable x is undefined (unless you coincidentally already defined somewhere else). When function foo is called, a global variable x is defined and then it equal to the value 1.

Using Python Code in Hy Programs

If there is a Python source file, named for example, test.py in the same directory as a Hy language file:

1 def factorial (n):
2   if n < 2:
3     return 1
4   return n * factorial(n - 1)

This code will be in a module named test because that is the root source code file name. We might import the Python code using the following in Python:

1 import test
2 
3 print(test.factorial(5))

and we can use the following in Hy to import the Python module test (defined in test.py):

1 (import test)
2 
3 (print (test.factorial 5))

Running this interactively in Hy:

1 $ uv run hy
2 Hy 1.1.0 (Business Hugs) using CPython(main) 3.12.0 on Darwin
3 => (import test)
4 => test
5 <module 'test' from '/Users/markw/GITHUB/hy-lisp-python/test.py'>
6 => (print (test.factorial 5))
7 120

If we only wanted to import BeautifulSoup from the Python BeautifulSoup library bs4 we can specify this in the import form:

1 (import bs4 [BeautifulSoup])

Using Hy Libraries in Python Programs

There is nothing special about importing and using Hy library code or your own Hy scripts in Python programs. The directory hy-lisp-python-book/source_code_for_examples/use_hy_in_python in the git repository for this book https://github.com/mark-watson/hy-lisp-python-book contains an example Hy script get_web_page.hy that is a slightly modified version of code we will explain and use in the later chapter on web scraping and a short Python script use_hy_stuff.py that uses a function defined in Hy:

get_web_page.hy:

 1 (import argparse os)
 2 (import urllib.request [Request urlopen])
 3 
 4 (defn get-raw-data-from-web [aUri [anAgent
 5                                    {"User-Agent" "HyLangBook/1.0"}]]
 6   (setv req (Request aUri :headers anAgent))
 7   (setv httpResponse (urlopen req))
 8   (setv data (.read httpResponse))
 9   data)
10 
11 (defn main_hy []
12   (print (get-raw-data-from-web "http://markwatson.com")))

We define two functions here. Notice the optional argument anAgent defined in lines 4-5 where we provide a default value in case the calling code does not provide a value. In the next Python listing we import the file in the last listing and call the Hy function main on line 4 using the Python calling syntax.

Hy is the same as Python once it is compiled to an abstract syntax tree (AST).

hy-lisp-python/use_in_python:

1 import hy
2 from get_web_page import main_hy
3 
4 main_hy()

What I want you to understand and develop a feeling for is that Hy and Python are really the same but with a different syntax and that both languages can easily be used side by side.

Replacing the Python slice (cut) Notation with the Hy Functional Form

In Python we use a special notation for extracting sub-sequences from lists or strings:

$ uv run python
Python 3.12.0 (main, Oct  2 2023, 20:56:14) [Clang 16.0.3 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> s = '0123456789'
>>> s[2:4]
'23'
>>> s[-4:]
'6789'
>>> s[-4:-1]
'678'
>>>

In Hy this would be:

$ uv run hy    
Hy 1.1.0 (Business Hugs) using CPython(main) 3.12.0 on Darwin
=> (setv s "0123456789")
=> (cut s 2 4)
'23'
=> (cut s -4)
'6789'
=> (cut s -4 -1)
'678'
=>

It also works to use cut with setv to destructively change a list; for example:

=> (setv x [0 1 2 3 4 5 6 7 8])
=> x
[0, 1, 2, 3, 4, 5, 6, 7, 8]
=> (cut x 2 4)
[2, 3]
=> (setv (cut x 2 4) [22 33])
=> x
[0, 1, 22, 33, 4, 5, 6, 7, 8]

Iterating Through a List With Index of Each Element

We will use lfor as a form of Python list comprehension; for example:

 1 => (setv sentence "The ball rolled")
 2 => (lfor i (enumerate sentence) i)
 3 [(0, 'T'), (1, 'h'), (2, 'e'), (3, ' '), (4, 'b'), (5, 'a'), (6, 'l'), (7, 'l'), (8,\
 4  ' '), (9, 'r'), (10, 'o'), (11, 'l'), (12, 'l'), (13, 'e'), (14, 'd')]
 5 => (setv vv (lfor i (enumerate sentence) i))
 6 => vv
 7 [(0, 'T'), (1, 'h'), (2, 'e'), (3, ' '), (4, 'b'), (5, 'a'), (6, 'l'), (7, 'l'), (8,\
 8  ' '), (9, 'r'), (10, 'o'), (11, 'l'), (12, 'l'), (13, 'e'), (14, 'd')]
 9 => (for [[a b] vv]
10 ... (print a b))
11 0 T
12 1 h
13 2 e
14 3  
15 4 b
16 5 a
17 6 l
18 7 l
19 8  
20 9 r
21 10 o
22 11 l
23 12 l
24 13 e
25 14 d
26 =>

On line 2, the expression (enumerate sentence) generates one character at a time from a string. enumerate operating on a list will generate one list element at a time.

Line 9 shows an example of destructuring: the values in the list vv are tuples (tuples are like lists but are immutable, that is, once a tuple is constructed the values it holds can not be changed) with two values. The values in each tuple are copied into binding variables in the list [a b]. We could have used the following code instead but it is more verbose:

=> (for [x vv]
    (setv a (first x))
    (setv b (second x))
... (print a b))
0 T
1 h
2 e
3  
4 b
 . . .
13 e
14 d
=>

Formatted Output

I suggest using the Python format method when you need to format output. In the following repl listing, you can see a few formatting options: insert any Hy data into a string (line 3), print values with a specific width and right justified (in line 5 the width for both values is 15 characters), print values with a specific width and left justified (in line 7), and limiting the number of characters values can be expressed as (in line 9 the object “cat” is expressed as just the first two characters and the value 3.14159 is expressed as just three numbers, the period not counting).

$ uv run hy    
Hy 1.1.0 (Business Hugs) using CPython(main) 3.12.0 on Darwin
=> (.format "first: {} second: {}" "cat" 3.14159)
'first: cat second: 3.14159'
=> (.format "first: {:>15} second: {:>15}" "cat" 3.14159)
'first:             cat second:         3.14159'
=> (.format "first: {:15} second: {:15}" "cat" 3.14159)
'first: cat             second:         3.14159'
=> (.format "first: {:.2} second: {:.3}" "cat" 3.14159)
'first: ca second: 3.14'
=>

Notice the calling .format here returns a string value rather than writing to an output stream.

Importing Libraries from Different Directories on Your Laptop

I usually write applications by first implementing simpler low-level utility libraries that are often not in the same directory path as the application that I am working on. Let’s look at a simple example of accessing the library nlp_lib.hy in the directory hy-lisp-python/nlp from the directory hy-lisp-python/webscraping:

 1 $ pwd
 2 /Users/markw/GITHUB/hhy-lisp-python-book
 3 $ cd webscraping 
 4 $ uv run hy    
 5 Hy 1.1.0 (Business Hugs) using CPython(main) 3.12.0 on Darwin
 6 => (import sys)
 7 => (sys.path.insert 1 "../nlp")
 8 => (import nlp-lib [nlp])
 9 => (nlp "President George Bush went to Mexico and he had a very good meal")
10 {'text': 'President George Bush went to Mexico and he had a very good meal', 
11   ...
12  'entities': [['George Bush', 'PERSON'], ['Mexico', 'GPE']]}
13 => (import [coref-nlp-lib [coref-nlp]])
14 => (coref-nlp "President George Bush went to Mexico and he had a very good meal")
15 {'corefs': 'President George Bush went to Mexico and President George Bush had a ver\
16 y good meal',  ...  }}}
17 =>

Here I did not install the library nlp_lib.hy using Python setuptools (which I don’t cover in this book, you can read the documentation) as a library on the system. I rely on relative paths between the library directory and the application code that uses the library.

On line 6 I am inserting the library directory into the Python system load path so the import statement on line 8 can find the nlp-lib library and on line 13 can find the coref-nlp-lib library.

Hy Looks Like Clojure: How Similar Are They?

Clojure is a dynamic general purpose Lisp language for the JVM. One of the great Clojure features is support of immutable data (read only after creation) that makes multi-threaded code easier to write and maintain.

Unfortunately, Clojure’s immutable data structures cannot be easily implemented efficiently in Python so the Hy language does not support immutable data, except for tuples. Otherwise the syntax for defining functions, using maps/hash tables/dictionaries, etc. is similar between the two languages.

The original Hy language developer Paul Tagliamonte was clearly inspired by Clojure.

The book Serious Python by Julien Danjou has an entire chapter (Chapter 9) on the Python AST (abstract syntax tree), an introduction to Hy, and an interview with Paul Tagliamonte. Recommended!

This podcast in 2015 interviews Hy developers Paul Tagliamonte, Tuukka Turto, and Morten Linderud. You can see the current Hy contributor list on github.

Plotting Data Using the Numpy and the Matplotlib Libraries

Data visualization is a common task when working with numeric data. In a later chapter on Deep Learning we will use two functions, the relu and sigmoid functions. Here we will use a few simple Hy language scripts to plot these functions.

The Numpy library supports what is called “broadcasting” in Python. In the function sigmoid that we define in the following REPL, we can pass either a single floating point number or a Numpy array as an argument. When we pass a Numpy array, then the function sigmoid is applied to each element of the Numpy array:

 1 $ uv run hy    
 2 Hy 1.1.0 (Business Hugs) using CPython(main) 3.12.0 on Darwin
 3 => (import numpy :as np)
 4 => (import matplotlib.pyplot :as plt)
 5 => 
 6 => (defn sigmoid [x]
 7 ...   (/ 1.0 (+ 1.0 (np.exp (- x)))))
 8 => (sigmoid 0.2)
 9 0.549833997312478
10 => (sigmoid 2)
11 0.8807970779778823
12 => (np.array [-5 -2 0 2 5])
13 array([-5, -2,  0,  2,  5])
14 => (sigmoid (np.array [-5 -2 0 2 5]))
15 array([0.00669285, 0.11920292, 0.5, 0.88079708, 0.99330715])
16 =>

The git repository directory hy-lisp-python/matplotlib contains two similar scripts for plotting the sigmoid and relu functions. Here is the script to plot the sigmoid function:

 1 (import numpy :as np)
 2 (import matplotlib.pyplot :as plt)
 3 
 4 (defn sigmoid [x]
 5   (/ 1.0 (+ 1.0 (np.exp (- x)))))
 6 
 7 (setv X (np.linspace -8 8 50))
 8 (plt.plot X (sigmoid X))
 9 (plt.title "Sigmoid Function")
10 (plt.ylabel "Sigmoid")
11 (plt.xlabel "X")
12 (plt.grid)
13 (plt.show)

The generated plot looks like this on macOS (Matplotlib is portable and also works on Windows and Linux):

Bonus Points: Configuration for macOS and ITerm2 for Generating Plots Inline in a Hy REPL and Shell

On the macOS ITerm2 terminal app and on most Linux terminal apps, it is possible to get inline matplotlib plots in a shell (bash, zsh, etc.), in Emacs, etc. This will take some setup work but it is well worth it especially if you work on remote servers via SSH or tmux. Here is the setup for macOS:

1   pip3 install itermplot

The add the following to your .profile, .bash_profile, or .zshrc (depending on your shell setup):

1   export MPLBACKEND="module://itermplot"

Here we run an example from the last section in a zsh shell (bash, etc. also should work):

Inline matplotlib use in zsh shell in an ITerm on macOS

The best part of generating inline plots is during interactive REPL-based coding sessions:

Inline matplotlib use in a Hy REPL on macOS

If you use a Mac laptop to SSH into a remote Linux server you need to install itermplot and set the environment variable MPLBACKEND on the remote server.

Why Lisp?

Now that we have learned the basics of the Hy Lisp language in the last chapter, I would like to move our conversation to a broader question of why we would want to use Lisp. I want to start with my personal history of why I turned to Lisp languages in the late 1970s for almost all of my creative and research oriented development and later transitioned to also using Lisp languages in production.

I Hated the Waterfall Method in the 1970s but Learned to Love a Bottom-Up Programming Style

I graduated UCSB in the mid 1970s with a degree in Physics and took a job as a scientific programmer in the 100% employee owned company SAIC. My manager had a PhD in Computer Science and our team and the organization we were in used what is known as the waterfall method where systems were designed carefully from the top down, carefully planned mostly in their entirety, and then coded up. We, and the whole industry I would guess, wasted a lot of time with early planning and design work that had to be discarded or heavily modified after some experience implementing the system.

What would be better? I grew to love bottom-up programming. When I was given a new project I would start by writing and testing small procedures for low level operations, things I was sure I would need. I then aggregated the functionality into higher levels of control logic, access to data, etc. Finally I would write the high level application.

I mostly did this for a while writing code in FORTRAN at SAIC and using Algol for weekend consulting work Salk Institute, working on hooking up lab equipment to minicomputers in Roger Guillemin’s lab (he won a Nobel Prize during that time, which was exciting). Learning Algol, a very different language than FORTRAN, helped broaden my perspectives.

I wanted a better programming language! I also wanted a more productive way to do my job both as a programmer and to make the best use of the few free hours a week that I had for my own research and learning about artificial intelligence (AI). I found my “better way” of development by adopting a bottom-up style that involves first writing low level libraries and utilities and then layering complete programs on top of well-tested low level code.

First Introduction to Lisp

In the late 1970s I discovered a Lisp implementation on my company’s DECsystem-10 timesharing computer. I had heard of Lisp when reading Bertram Raphael’s book “THE THINKING COMPUTER. Mind Inside Matter” and I learned Lisp on my own time and then, during lunch hour, taught a one day a week class to anyone at work who wanted to learn Lisp. After a few months of Lisp experience I received permission to teach an informal lunch time class to teach anyone working in my building who wanted to to learn Lisp on our DECsystem-10.

Lisp is the perfect language to support the type of bottom-up iterative programming style that I like.

Commercial Product Development and Deployment Using Lisp

My company, SAIC, identified AI as an important technology in the early 1980s. Two friends at work (Bob Beyster who founded SAIC and Joe Walkush who was our corporate treasurer and who liked Lisp from his engineering studies at MIT) arranged for the company to buy a hardware Lisp Machine, a Xerox 1108 for me. I ported Charles Forgy’s expert system development language OPS5 to run on InterLisp-D on the Xerox Lisp Machines and we successfully sold this as a product. When Coral Common Lisp was released for the Apple Macintosh in 1984, I switched my research and development to the Mac and released ExperOPS5, which also sold well, and used Common Lisp to write the first prototypes for SAIC’s ANSim neural network library. I converted my code to C++ to productize it. We also continued to use Lisp for IR&D projects and while working on the DARPA NMRD project.

Even though I proceeded to use C++ for much of my development, as well as writing C++ books for McGraw-Hill and J. Riley publishers, Lisp remained my “thinking and research” language.

Performing Bottom Up Development Inside a REPL is a Lifestyle Choice

It is my personal choice to prefer a bottom up style of coding, effectively extending the Hy (or other Lisp) language to look like something that looks custom designed and built to solve a specific problem. This is possible in Lisp languages because once a function is defined, it is for our purposes part of the Hy language. If, for example, you are writing a web application that uses a database then I believe that it makes sense to first write low level functions to perform operations that you know you will need, for example, for creating and updating customer data from the database, utility functions used in a web application (which we cover in the next chapter), etc. For the rest of your application, you use these new low level functions as if they were built into the language.

When I need to write a new low-level function, I start in a REPL and define variables (with test values) for what the function arguments will be. I then write the code for the function one line at a time using these “arguments” in expressions that will later be copied to a Hy source file. Immediately seeing results in a REPL helps me catch mistakes early, often a misunderstanding of the type or values of intermediate calculations. This style of coding works for me and I hope you like it also.

Writing Web Applications

Python has good libraries and frameworks for building web applications and here we will use the Flask library and framework “under the hood” and write two simple Hy Language web applications. We will start with a simple “Hello World” example in Python, see how to reformulate it in Hy, and then proceed with more complex examples that will show how to use HTML generating templates, sessions, and cookies to store user data for the next time they visit your web site. In a later chapter we will cover use of the SQLite and PostgreSQL databases which are commonly used to persist data for users in web applications. This pattern involves letting a user login and store a unique token for the user in a web browser cookie. In principle, you can do the same with web browser cookies but if a user visits your web site with a different browser or device then they will not have access to the data stored in cookies on a previous visit.

I like lightweight web frameworks. In Ruby I use Sinatra, in Haskell I use Spock, and when I built Java web apps I liked lightweight tools like JSP. Flask is simple but capable and using it from Hy is productive and fun. In addition to using lightweight frameworks I like to deploy web apps in the simplest way possible. We will close this chapter by discussing how to use the Heroku and Google Cloud Platform AppEngine platforms.

Getting Started With Flask: Using Python Decorators in Hy

You will need the Python Flask library but it is pre-configured in the uv environment in the directory hy-lisp-python-book/source_code_for_examples/webapp.

We will use the Hy macro with-decorator to replace Python code with annotations. Here the decorator @app.route is used to map a URI pattern with a Python callback function. In the following case we define the behavior when the index page of a web app is accessed:

1 from flask import Flask
2 
3 @app.route('/')
4   def index():
5      return "Hello World !")
6 
7 app.run()

I first used Flask with the Hy language after seeing a post of code from HN user “volent”, seen in the file flask_test.hy in the directory hy-lisp-python-book/source_code_for_examples/webapp that is functionally equivalent to the above Python code snippet:

1 ;; snippet by HN user volent and modifed for
2 ;; Hy 0.26.0 with a comment from stackoverflow user plokstele:
3 
4 (import flask [Flask])
5 (setv app (Flask "Flask test"))
6 (defn [(.route app "/")] index [] "Hello World !")
7 (app.run)

I liked this example and after experimenting with the code, I then started using Hy and Flask. Please try running this example to make sure you are setup properly with Flask:

$ uv run hy flask_test.hy
 * Serving Flask app 'Flask test'
 * Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use\
 a production WSGI server instead.
 * Running on http://127.0.0.1:5000
Press CTRL+C to quit

Open http://127.0.0.1:5000/ in your web browser:

Using Jinja2 Templates To Generate HTML

Jinja2 is a templating system that allows HTML markup to be supplemented with Python variable references and simple Python loops, etc. The values of application variables can be stored in a context and the HTML template has the variables values substituted with current values before returning a HTML response to the user’s web browser.

By default Jinja2 templates are stored in a subdirectory named templates. The template for this example can be found in the file hy-lisp-python/webapp/templates/template1.j2 that is shown here:

 1 <html>
 2   <head>
 3     <title>Testing Jinja2 and Flask with the Hy language</title>
 4   </head>
 5   <body>
 6      {% if name %}
 7        <h1>Hello {{name}}</h1>
 8      {% else %}
 9        <h1>Hey, please enter your name!</h1>
10      {% endif %}
11     
12     <form method="POST" action="/response">
13       Name: <input type="text" name="name" required>
14       <input type="submit" value="Submit">
15     </form>
16   </body>
17 </html>

Note that in line 6 we are using a Python if expression to check if the variable name is defined in the current app execution context.

In the context of a running Flask app, the following will render the above template with the variable name defined as None:

1 (render_template "template1.j2")

We can set values as named parameters for variables used in the template, for example:

1 (render_template "template1.j2" :name "Mark")

I am assuming that you understand the basics or HTML and also GET and POST operations in HTTP requests.

The following Flask web app defines behavior for rendering the template without the variable name set and also a HTML POST handler to pass the name entered on the HTML form back to the POST response handler:

 1 (import flask [Flask render_template request])
 2 
 3 (setv app (Flask "Flask and Jinja2 test"))
 4 
 5 (defn [(.route app "/")]
 6   index []
 7     (render_template "template1.j2"))
 8 
 9 (defn [(.route app "/response" :methods ["POST"])]
10   response []
11     (setv name (request.form.get "name"))
12     (print name)
13     (render_template "template1.j2" :name name))
14 
15 (app.run)

Please note that there is nothing special about the names inside the with-decorator code blocks: the functions index and response could have arbitrary names like a123 an b17. I used the function names index and response because they help describe what the functions do.

Open http://127.0.0.1:5000/ in your web browser:

Flask web app using a Jinja2 Template after entering my name and submitting the HTML input form

Handling HTTP Sessions and Cookies

There is a special variable session that Flask maintains for each client of a Flask web app. Different people using a web app will have independent sessions. In a web app, we can set a session value by treating the session for a given user as a dictionary:

=> (setv (get session "name") "Mark")
=> session
{'name': 'Mark'}

Inside a Jinja2 template you can use a simple Python expression to place a session variable’s value into the HTML generated from a template:

{{ session['name'] }}

In a web app you can access the session using:

(get session "name")

In order to set the value of a named cookie, we can:

1 (import flask [Flask render_template request make-response])
2 
3 (defn [(.route app "/response" :methods ["POST"])]
4   response []
5     (setv name (request.form.get "name"))
6     (print name)
7     (setv a-response (make-response (render-template "template1.j2" :name name)))
8     (a-response.set-cookie "hy-cookie" name)
9     a-response)

Values of named cookies can be retrieved using:

(request.cookies.get "name")

The value for request is defined in the execution context by Flask when handling HTTP requests. Here is a complete example of handling cookies in the file cookie_test.hy:

 1 (import flask [Flask render_template request make-response])
 2 
 3 (setv app (Flask "Flask and Jinja2 test"))
 4 
 5 (defn [(.route app "/")]
 6   index []
 7     (setv cookie-data (request.cookies.get "hy-cookie"))
 8     (print "cookie-data:" cookie-data)
 9     (setv a-response (render_template "template1.j2" :name cookie-data))
10     a-response)
11 
12 (defn [(.route app "/response" :methods ["POST"])]
13   response []
14     (setv name (request.form.get "name"))
15     (print name)
16     (setv a-response (make-response (render-template "template1.j2" :name name)))
17     (a-response.set-cookie "hy-cookie" name)
18     a-response)
19 
20 (app.run)

I suggest that you not only try running this example as-is but also try changing the template, and generally experiment with the code. Making even simple code changes helps to understand the code better.

Deploying Hy Language Flask Apps to Google Cloud Platform AppEngine

The example for this section is in a separate github repository that you should clone or copy to a new project for a starter project if you intend to deploy to AppEngine.

This AppEngine example is very similar to that in the last section except that it also serves a static asset and has a small Python stub main program to load the Hy language library and import the Hy language code.

Here is the Python stub main program:

1 import hy
2 import flask_test
3 from flask_test import app
4 
5 if __name__ == '__main__':
6     # Used when running locally only. When deploying to Google App
7     # Engine, a webserver process such as Gunicorn will serve the app.
8     app.run(host='localhost', port=9090, debug=True)

The Hy app is slightly different than we saw in the last section. On line 6 we specify the location of static assets and we do not call the run() method on the app object.

 1 (import flask [Flask render_template request])
 2 (import os)
 3 
 4 (setv port (int (os.environ.get "PORT" 5000)))
 5 
 6 (setv app (Flask "Flask test" :static_folder "./static" :static_url_path "/static"))
 7 
 8 (defn [(.route app "/")]
 9   index []
10     (render_template "template1.j2"))
11 
12 (defn [(.route app "/response" :methods ["POST"])]
13   response []
14     (setv name (request.form.get "name"))
15     (print name)
16     (render_template "template1.j2" :name name))
17     
18 (app.run)

I assume that you have some experience with GCP and have the following:

GCP command line tools installed.
You have created a new project on the GCP AppEngine console named something like hy-gcp-test (if you choose a name already in use, you will get a warning).

After cloning or otherwise copying this project, you use the command line tools to deploy and test your Flask app:

gcloud auth login
gcloud config set project hy-gcp-test
gcloud app deploy
gcloud app browse

If you have problems, look at your logs:

gcloud app logs tail -s default

You can edit changes locally and test locally using:

python main.py

Any changes can be tested by deploying again:

gcloud app deploy

Please note that every time you deploy, a new instance is created. You will want to use the GCP AppEngine console to remove old instances, and remove all instances when you are done.

Going Forward

You can make a copy of this example, create a GitHub repo, and follow the above directions as a first step to creating Hy language application on AppEngine. The Google Cloud Platform has many services that you can use in your app (using the Python APIs, called from your Hy program), including:

Storage and Databases.
Big Data.
Machine Learning.

Wrap Up

I like to be able to implement simple things simply, without a lot of ceremony. Once you work through these examples I hope you feel that you can generate Hy and Flask based web apps quickly and with very little code required.

To return to the theme of bottom-up programming, I find that starting with short low level utility functions and placing them in a separate file makes reuse simple and makes future similar projects even easier. For each language I work with, I collect snippets of useful code and short utilities kept in separate files. When writing code I start looking in my snippets directory for the language I am using to implement low level functionality even before doing a web search. When I work in Common Lisp I keep all low level code that I have written in small libraries contained in a single Quicklisp source root directory and for Python and Hy I use Python’s setuptools library to generate libraries that are installed globally on my laptop for easy reuse. It is worth some effort to organize your work for future reuse.

Responsible Web Scraping

I put the word “Responsible” in the chapter title to remind you that just because it is easy (as we will soon see) to pull data from web sites, it is important to respect the property rights of web site owners and abide by their terms and conditions for use. This Wikipedia article on Fair Use provides a good overview of using copyright material.

The web scraping code we develop here uses the Python BeautifulSoup and URI libraries.

For my work and research, I have been most interested in using web scraping to collect text data for natural language processing but other common applications include writing AI news collection and summarization assistants, trying to predict stock prices based on comments in social media which is what we did at Webmind Corporation in 2000 and 2001, etc.

Using the Python BeautifulSoup Library in the Hy Language

There are many good libraries for parsing HTML text and extracting both structure (headings, what is in bold font, etc.) and embedded raw text. I particularly like the Python Beautiful Soup library and we will use it here.

In line 4 for the following listing of file get_web_page.hy, I am setting the default user agent to a descriptive string “HyLangBook” but for some web sites you might need to set this to appear as a Firefox or Chrome browser (iOS, Android, Windows, Linux, or macOS). The function get-raw-data gets the entire contents of a web site as a single string value.

1 (import urllib.request [Request urlopen])
2 
3 (defn get-raw-data-from-web [aUri [anAgent {"User-Agent" "HyLangBook/1.0"}]]
4   (setv req (Request aUri :headers anAgent))
5   (setv httpResponse (urlopen req))
6   (setv data (.read httpResponse))
7   data)

Let’s test this function in a REPL:

 1 $ uv run hy                 
 2 Hy 1.1.0 (Business Hugs) using CPython(main) 3.12.0 on Darwin
 3 => (import get-page-data [get-raw-data-from-web])
 4 => (get-raw-data-from-web "http://knowledgebooks.com")
 5 b'<!DOCTYPE html><html><head><title>KnowledgeBooks.com - research on the Knowledge M\
 6 anagement, and the Semantic Web ...'
 7 => 
 8 => (import get-page-data [get-page-html-elements])
 9 => (get-page-html-elements "http://knowledgebooks.com")
10 {'title': [<title>KnowledgeBooks.com - research on the Knowledge Management, and the\
11  Semantic Web </title>],
12 'a': [<a class="brand" href="#">KnowledgeBooks.com  </a>,  ...
13 =>

This REPL session shows the the function get-raw-data-from-web defined in the previous listing returns a web page as a string. In line 9 we use a function get-page-html-elements to find all elements in a string containing HTML. This function is defined in the next listing and shows how to parse and process the string contents of a web pages. Note: you will need to install the lxml library for this example (using pip or pip3 depending on your Python configuration):

1 pip install lxml

The following listing of file get_page_data.hy uses the Beautiful Soup library to parse the string data for HTML text from a web site. The function get-page-html-elements returns names and associated data with each element in HTML represented as a string (the extra code on lines 20-24 is just debug example code):

 1 (import get_web_page [get-raw-data-from-web])
 2 
 3 (import bs4 [BeautifulSoup])
 4 
 5 (defn get-element-data [anElement]
 6   {"text" (.getText anElement)
 7    "name" (. anElement name)
 8    "class" (.get anElement "class")
 9    "href" (.get anElement "href")})
10 
11 (defn get-page-html-elements [aUri]
12   (setv raw-data (get-raw-data-from-web aUri))
13   (setv soup (BeautifulSoup raw-data "lxml"))
14   (setv title (.find_all soup "title"))
15   (setv a (.find_all soup "a"))
16   (setv h1 (.find_all soup "h1"))
17   (setv h2 (.find_all soup "h2"))
18   {"title" title "a" a "h1" h1 "h2" h2})
19 
20 (setv elements (get-page-html-elements "http://markwatson.com"))
21 
22 (print (get elements "a"))
23 
24 (for [ta (get elements "a")] (print (get-element-data ta)))

The function get-element-data defined in lines 5-9 accepts as an argument an HTML element object (as defined in the Beautiful soup library) and extracts data, if available, for text, name, class, and href values. The function get-page-html-elements defied in lines 11-18 accepts as an argument a string containing a URI and returns a dictionary (or map, or hash table) containing lists of all a, h1, h2, and title elements in the web page pointed to by the input URI. You can modify get-page-html-elements to add additional HTML element types, as needed.

Here is the output (with many lines removed for brevity):

 1 {'text': 'Mark Watson artificial intelligence consultant and author',
 2  'name': 'a', 'class': ['navbar-brand'], 'href': '#'}
 3 {'text': 'Home page', 'name': 'a', 'class': None, 'href': '/'}
 4 {'text': 'My Blog', 'name': 'a', 'class': None,
 5  'href': 'https://mark-watson.blogspot.com'}
 6 {'text': 'GitHub', 'name': 'a', 'class': None,
 7  'href': 'https://github.com/mark-watson'}
 8 {'text': 'Twitter', 'name': 'a', 'class': None, 'href': 'https://twitter.com/mark_l_\
 9 watson'}
10 {'text': 'WikiData', 'name': 'a', 'class': None, 'href': 'https://www.wikidata.org/w\
11 iki/Q18670263'}

Getting HTML Links from the DemocracyNow.org News Web Site

I financially support and rely on both NPR.org and DemocracyNow.org news as my main sources of news so I will use their news sites for examples here and in the next section. Web sites differ so much in format that it is often necessary to build highly customized web scrapers for individual web sites and to maintain the web scraping code as the format of the site changes in time.

Before working through this example and/or the example in the next section use the file Makefile to fetch data:

make data

This should copy the home pages for both web sites to the files:

democracynow_home_page.html (used here)
npr_home_page.html (used for the example in the next section)

The following listing shows democracynow_front_page.hy

 1 (import get-web-page [get-web-page-from-disk])
 2 (import bs4 [BeautifulSoup])
 3 
 4 ;; you need to run 'make data' to fetch sample HTML data for dev and testing
 5 
 6 (defn get-democracy-now-links []
 7   (setv test-html (get-web-page-from-disk "democracynow_home_page.html"))
 8   (setv bs (BeautifulSoup test-html :features "lxml"))
 9   (setv all-anchor-elements (.findAll bs "a"))
10   (lfor e all-anchor-elements
11           :if (> (len (.get-text e)) 0)
12           (, (.get e "href") (.get-text e))))
13 
14 (when (= __name__ "__main__")
15   (for [[uri text] (get-democracy-now-links)]
16     (print uri ":" text)))

This simply prints our URIs and text (separated with the string “:”) for each link on the home page. On line 13 we discard any anchor elements that do not contain text. On line 14 the comma character at the start of the return list indicates that we are constructing a tuple. Lines 16-18 define a main function that is used when running this file at the command line. This is similar to how main functions can be defined in Python to allow a library file to also be run as a command line tool.

A few lines of output from today’s front page is:

/2020/1/7/the_great_hack_cambridge_analytica : Meet Brittany Kaiser, Cambridge Analy\
tica Whistleblower Releasing Troves of New Files from Data Firm
/2019/11/8/remembering_orangeburg_massacre_1968_south_carolina : Remembering the 196\
8 Orangeburg Massacre When Police Shot Dead Three Unarmed Black Students
/2020/1/15/democratic_debate_higher_education_universal_programs : Democrats Debate \
Wealth Tax, Free Public College & Student Debt Relief as Part of New Economic Plan
/2020/1/14/dahlia_lithwick_impeachment : GOP Debate on Impeachment Witnesses Intensi\
fies as Pelosi Prepares to Send Articles to Senate
/2020/1/14/oakland_california_moms_4_housing : Moms 4 Housing: Meet the Oakland Moth\
ers Facing Eviction After Two Months Occupying Vacant House
/2020/1/14/luis_garden_acosta_martin_espada : “Morir Soñando”: Martín Espada Reads P\
oem About Luis Garden Acosta, Young Lord & Community Activist

The URIs are relative to the root URI https://www.democracynow.org/.

Getting Summaries of Front Page from the NPR.org News Web Site

This example is similar to the example in the last section except that text from home page links is formatted to provide a daily news summary. I am assuming that you ran the example in the last section so the web site home pages have been copied to local files.

The following listing shows npr_front_page_summary.hy

 1 (import get-web-page [get-web-page-from-disk])
 2 (import bs4 [BeautifulSoup])
 3 
 4 ;; you need to run 'make data' to fetch sample HTML data for dev and testing
 5 
 6 (defn get-npr-links []
 7   (setv test-html (get-web-page-from-disk "npr_home_page.html"))
 8   (setv bs (BeautifulSoup test-html :features "lxml"))
 9   (setv all-anchor-elements (.findAll bs "a"))
10   (setv filtered-a
11     (lfor e all-anchor-elements
12           :if (> (len (.get-text e)) 0)
13           #((.get e "href") (.get-text e))))
14   filtered-a)
15 
16 (defn create-npr-summary []
17   (setv links (get-npr-links))
18   (setv filtered-links (lfor [uri text] links :if (> (len (.strip text)) 40) (.strip\
19  text)))
20   (.join "\n\n" filtered-links))
21 
22 (when (= __name__ "__main__")
23   (print (create-npr-summary)))

In lines 12-15 we are filtering out (or removing) all anchor HTML elements that do not contain text. The following shows a few lines of the generated output for data collected today:

January 16, 2020  Birds change the shape of their wings far more than
planes. The complexities of bird flight have posed a major design challenge
for scientists trying to translate the way birds fly into robots.

FBI Vows To Warn More Election Officials If Discovering A Cyberattack

January 16, 2020  The bureau was faulted after the Russian attack on the
2016 election for keeping too much information from state and local
authorities. It says it'll use a new policy going forward.

Ukraine Is Investigating Whether U.S. Ambassador Yovanovitch Was Surveilled

January 16, 2020  Ukraine's Internal Affairs Ministry says it's asking the
FBI to help determine whether international laws were broken, or "whether it
is just a bravado and a fake information" from a U.S. politician.

Electric Burn: Those Who Bet Against Elon Musk And Tesla Are Paying A Big Price

January 16, 2020  For years, Elon Musk skeptics have shorted Tesla stock, confident \
the electric carmaker was on the brink of disaster. Instead, share value has skyrock\
eted, costing short sellers billions.

TSA Says It Seized A Record Number Of Firearms At U.S. Airports Last Year

The examples seen here are simple but should be sufficient to get you started gathering text data from the web.

Using the Brave Search APIs

You will need to get a free API key at https://brave.com/search/api/ to use the following code examples. You can use the search API 2000 times a month for free or pay $5/month to get 20 million API calls a month.

Setting an Environment Variable for the Access Key for Brave Search APIs

Once you get a key for https://brave.com/search/api/ set the following environment variable:

export BRAVE_SEARCH_API_KEY=BSGhQ-Nd-......

That is not my real subscription key!

Example Search Script

The following shows the file brave.hy:

It takes very little Hy code to access the Brave search APIs. Here we define a function named brave_search that takes one parameter query. We get the API subscription ket from an environment variable, define the URI for the Brave search endpoint, and set up an HTTP request to this endpoint. I encourage you, dear reader, to experiment with printing out the HTTP response to see all data returned from the Brave search API. Here we only collect the tile, URL, and description for each search result:

(import os requests)
(import pprint [pprint])

(defn brave_search [query]
  (setv subscription-key (get os.environ "BRAVE_SEARCH_API_KEY"))
  (setv endpoint "https://api.search.brave.com/res/v1/web/search")

  ;; Construct a request
  (setv params {"q" query})
  (setv headers {"X-Subscription-Token" subscription-key})

  ;; Call the API
  (setv response (requests.get endpoint :headers headers :params params))

  ;; Pull out results
  (setv results (get (get (response.json) "web") "results"))

  ;; Create a list of lists containing title, URL, and description
  (setv res (lfor result results
                 [(get result "title")
                  (get result "url")
                  (get result "description")]))

  ;; Return the results
  res)

;; Example usage:
;;(setv search-results (brave-search "site:wikidata.org Sedona Arizona"))
;;(pprint search-results)

You can use search hints like “site:wikidata.org” to only search specific web sites. In the following example I use the search query:

1 "site:wikidata.org Sedona Arizona"

The example call:

(setv search-results (brave-search "site:wikidata.org Sedona Arizona"))
(pprint search-results)

produces the output (edited for brevity):

```
$ uv run hy -i brave.hy 
Hy 1.1.0 (Business Hugs) using CPython(main) 3.12.0 on Darwin
=> (brave-search "site:wikidata.org Sedona Arizona")
[['Sedona - Wikidata',
  'https://m.wikidata.org/wiki/Q80041',
  'city in counties of Yavapai and Coconino, <strong>Arizona</strong>, United '
  'States'],
 ['Category:People from Sedona, Arizona - Wikidata',
  'https://www.wikidata.org/wiki/Q8748837',
  'All structured data from the main, Property, Lexeme, and EntitySchema '
  'namespaces is available under the Creative Commons CC0 License; text in the '
  'other namespaces is available under the Creative Commons '
  'Attribution-ShareAlike License; additional terms may apply.'],
 ['Category:Films set in Sedona, Arizona - Wikidata',
  'https://www.wikidata.org/wiki/Q25109087',
  'All structured data from the main, Property, Lexeme, and EntitySchema '
  'namespaces is available under the Creative Commons CC0 License; text in the '
  'other namespaces is available under the Creative Commons '
  'Attribution-ShareAlike License; additional terms may apply.'],

 ...

Running Hy using uv with the -i parameter loads the file and then puts the user into a Hy REPL.

Wrap-up

In addition to using automated web scraping to get data for my personal research, I often use automated web search. I find the Brave search APIs are the most convenient to use and I like paying for services that I use. The search engine Duck Duck Go also provides free search APIs but even though I use Duck Duck Go for 90% of my manual web searches, when I build automated systems I prefer to rely on services that I pay for.

Deep Learning

Most of my professional career since 2014 has involved Deep Learning, mostly with TensorFlow using the Keras APIs. In the late 1980s I was on a DARPA neural network technology advisory panel for a year. I wrote the first prototype of the SAIC ANSim neural network library commercial product, and I wrote the neural network prediction code for a bomb detector my company designed and built for the FAA for deployment in airports. More recently I have used GAN (generative adversarial networks) models for synthesizing numeric spreadsheet data and LSTM (long short term memory) models to synthesize highly structured text data like nested JSON and for NLP (natural language processing). I have 55 USA and several European patents using neural network and Deep Learning technology.

The Hy language utilities and example programs we develop here all use TensorFlow and Keras “under the hood” to do the heavy lifting. Keras is a simpler to use API for TensorFlow and I usually use Keras rather than the lower level TensorFlow APIs.

There are other libraries and frameworks that might interest you in addition to TensorFlow and Keras. I particularly like the Flux library for the Julia programming language. Currently Python has the most comprehensive libraries for Deep Learning but other languages that support differential computing (more on this later) like Julia and Swift may gain popularity in the future.

Here we will learn a vocabulary for discussing Deep Learning neural network models, look at possible architectures, and show two Hy language examples that should be sufficient to get you familiar to using Keras with the Hy language. If you already have Deep Learning application development experience you might want to skip the following review material and go directly to the Hy language examples.

If you want to use Deep Learning professionally, there are two specific online resources that I recommend: Andrew Ng leads the efforts at deeplearning.ai and Jeremy Howard leads the efforts at fast.ai. Here I will show you how to use a few useful techniques. Andrew and Jeremy will teach you skills that may lead a professional level of expertise if you take their courses.

There are many Deep Learning neural architectures in current practical use; a few types that I use are:

Multi-layer perceptron networks with many fully connected layers. An input layer contains placeholders for input data. Each element in the input layer is connected by a two-dimensional weight matrix to each element in the first hidden layer. We can use any number of fully connected hidden layers, with the last hidden layer connected to an output layer.
Convolutional networks for image processing and text classification. Convolutions, or filters, are small windows that can process input images (filters are two-dimensional) or sequences like text (filters are one-dimensional). Each filter uses a single set of learned weights independent of where the filter is applied in an input image or input sequence.
Autoencoders have the same number of input layer and output layer elements with one or more hidden fully connected layers. Autoencoders are trained to produce the same output as training input values using a relatively small number of hidden layer elements. Autoencoders are capable of removing noise in input data.
LSTM (long short term memory) process elements in a sequence in order and are capable of remembering patterns that they have seen earlier in the sequence.
GAN (generative adversarial networks) models comprise two different and competing neural models, the generator and the discriminator. GANs are often trained on input images (although in my work I have applied GANs to two-dimensional numeric spreadsheet data). The generator model takes as input a “latent input vector” (this is just a vector of specific size with random values) and generates a random output image. The weights of the generator model are trained to produce random images that are similar to how training images look. The discriminator model is trained to recognize if an arbitrary output image is original training data or an image created by the generator model. The generator and discriminator models are trained together.

The core functionality of libraries like TensorFlow are written in C++ and take advantage of special hardware like GPUs, custom ASICs, and devices like Google’s TPUs. Most people who work with Deep Learning models don’t need to even be aware of the low level optimizations used to make training and using Deep Learning models more efficient. That said, in the following section I am going to show you how simple neural networks are trained and used.

Simple Multi-layer Perceptron Neural Networks

I use the terms Multi-layer perceptron neural networks, backpropagation neural networks and delta-rule networks interchangeably. Backpropagation refers to the model training process of calculating the output errors when training inputs are passed in the forward direction from input layer, to hidden layers, and then to the output layer. There will be an error which is the difference between the calculated outputs and the training outputs. This error can be used to adjust the weights from the last hidden layer to the output layer to reduce the error. The error is then backprogated backwards through the hidden layers, updating all weights in the model. I have detailed example code in any of my older artificial intelligence books. Here I am satisfied to give you an intuition to how simple neural networks are trained.

The basic idea is that we start with a network initialized with random weights and for each training case we propagate the inputs through the network towards the output neurons, calculate the output errors, and back-up the errors from the output neurons back towards the input neurons in order to make small changes to the weights to lower the error for the current training example. We repeat this process by cycling through the training examples many times.

The following figure shows a simple backpropagation network with one hidden layer. Neurons in adjacent layers are connected by floating point connection strength weights. These weights start out as small random values that change as the network is trained. Weights are represented in the following figure by arrows; in the code the weights connecting the input to the output neurons are represented as a two-dimensional array.

Example Backpropagation network with One Hidden Layer

Each non-input neuron has an activation value that is calculated from the activation values of connected neurons feeding into it, gated (adjusted) by the connection weights. For example, in the above figure, the value of Output 1 neuron is calculated by summing the activation of Input 1 times weight W1,1 and Input 2 activation times weight W2,1 and applying a “squashing function” like Sigmoid or Relu (see figures below) to this sum to get the final value for Output 1’s activation value. We want to flatten activation values to a relatively small range but still maintain relative values. To do this flattening we use the Sigmoid function that is seen in the next figure, along with the derivative of the Sigmoid function which we will use in the code for training a network by adjusting the weights.

Sigmoid Function and Derivative of Sigmoid Function (SigmoidP)

Simple neural network architectures with just one or two hidden layers are easy to train using backpropagation and I have examples of from-scratch code for this several of my previous books. You can see Java and Common Lisp from-scratch implementations in two of my books that you can read online: Practical Artificial Intelligence Programming With Java and Loving Common Lisp, or the Savvy Programmer’s Secret Weapon. However, here we are using Hy to write models using the TensorFlow framework which has the huge advantage that small models you experiment with on your laptop can be scaled to more parameters (usually this means more neurons in hidden layers which increases the number of weights in a model) and run in the cloud using multiple GPUs.

Except for pedantic purposes, I now never write neural network code from scratch, instead I take advantage of the many person-years of engineering work put into the development of frameworks like TensorFlow, PyTorch, mxnet, etc. We now move on to two examples built with TensorFlow.

Deep Learning

Deep Learning models are generally understood to have many more hidden layers than simple multi-layer perceptron neural networks and often comprise multiple simple models combined together in series or in parallel. Complex architectures can be iteratively developed by manually adjusting the size of model components, changing the components, etc. Alternatively, model architecture search can be automated. At Capital One I used Google’s AdaNet project that efficiently searches for effective model architectures inside a single TensorFlow session. The model architecture used here is simple: one input layer representing the input values in a sample of University of Wisconsin cancer data, one hidden layer, and an output layer consisting of one neuron whose activation value will be interpreted as a prediction of benign or malignant.

The material in this chapter is intended to serve two purposes:

If you are already familiar with Deep Learning and TensorFlow then the examples here will serve to show you how to call the TensorFlow APIs from Hy.
If you have little or no exposure with Deep Learning then the short Hy language examples will provide you with concise code to experiment with and you can then decide to study further.

Once again, I recommend that you consider taking two online Deep Learning course sequences. For no cost, Jeremy Howard provides lessons at fast.ai that are very good and the later classes use PyTorch which is a framework that is similar to TensorFlow. For a modest cost Andrew Ng provides classes at deeplearning.ai that use TensorFlow. I have been working in the field of machine learning since the 1980s, but I still take Andrew’s online classes to stay up-to-date. In the last eight years I have taken his Stanford University machine learning class twice and also his complete course sequence using TensorFlow. I have also worked through much of Jeremy’s material. I recommend both course sequences without reservation.

Using Keras and TensorFlow to Model The Wisconsin Cancer Data Set

The University of Wisconsin cancer database has 646 samples. Each sample has 9 input values and one output value:

Cl.thickness: Clump Thickness
Cell.size: Uniformity of Cell Size
Cell.shape: Uniformity of Cell Shape
Marg.adhesion: Marginal Adhesion
Epith.c.size: Single Epithelial Cell Size
Bare.nuclei: Bare Nuclei
Bl.cromatin: Bland Chromatin
Normal.nucleoli: Normal Nucleoli
Mitoses: Mitoses
Class: Class (0 for benign, 1 for malignant)

Each row represents a sample with different measurements related to cell properties, and the final column ‘Class’ indicates whether the sample is benign (0) or malignant (1).

Let’s perform some basic analysis on this data:

Check for missing values.
Get the summary statistics of the dataset.
Check the balance of the classes (benign and malignant).

Here’s the analysis:

Missing Values: There are no missing values in the dataset. Each column has complete data.
Summary Statistics: The mean and median (50%) values of most features are quite different, indicating that the data distribution for these features might be skewed.
The range (min to max) for all features is from 1 to 10, indicating that the measurements are likely based on a scale or ranking system of 1 to 10.
Class Balance: The dataset is somewhat imbalanced. Approximately 65% of the samples are benign (0) and 35% are malignant (1). This imbalance might influence the performance of machine learning models trained on this data.

Now, it would be beneficial to visualize the data to get a better understanding of the distribution of each feature and the relationship between different features:

Here are histograms of each feature, broken down by class (benign or malignant). Some observations:

Cl.thickness, Cell.size, Cell.shape, Marg.adhesion, Epith.c.size, Bare.nuclei, Bl.cromatin, Normal.nucleoli: For these features, higher values seem to be associated with the malignant class. This might suggest that these characteristics are significant in the determination of malignancy. Mitoses: This feature shows a different trend, with a majority of both benign and malignant cases having low mitoses values. However, there are more malignant cases with higher mitoses values than benign cases.

We will use separate training and test files hy-lisp-python/deeplearning/train.csv and hy-lisp-python/deeplearning/test.csv. Here are a few samples from the training file:

6,2,1,1,1,1,7,1,1,0
2,5,3,3,6,7,7,5,1,1
10,4,3,1,3,3,6,5,2,1
6,10,10,2,8,10,7,3,3,1
5,6,5,6,10,1,3,1,1,1
1,1,1,1,2,1,2,1,2,0
3,7,7,4,4,9,4,8,1,1
1,1,1,1,2,1,2,1,1,0

After you look at this data, if you did not have much experience with machine learning then it might not be obvious how to build a model to accept a sample for a patient like we see in the Wisconsin data set and then predict if the sample implies benign or cancerous outcome for the patient. Using TensorFlow with a simple neural network model, we will implement a model in about 40 lines of Hy code to implement this example.

Since there are nine input values we will need nine input neurons that will represent the input values for a sample in either training or separate test data. These nine input neurons (created in lines 9-10 in the following listing) will be completely connected to twelve neurons in a hidden layer. Here, completely connected means that each of the nine input neurons is connected via a weight to each hidden layer neuron. There are 9 * 12 = 108 weights between the input and hidden layers. There is a single output layer neuron that is connected to each hidden layer neuron.

Notice that in lines 12 and 14 in the following listing that we specify a relu activation function while the activation function connecting the hidden layer to the output layer uses the sigmoid activation function that we saw plotted earlier.

There is an example in the git example repo directory hy-lisp-python/matplotlib in the file plot_relu.hy that generated the following figure:

The following listing shows the use of the Keras TensorFlow APIs to build a model (lines 9-19) with one input layer, two hidden layers, and an output layer with just one neuron. After we build the model, we define two utility functions train (lines 21-23) to train a model given training inputs (x argument) and corresponding training outputs (y** argument), and we also define predict (lines 25-26) using a trained model to make a cancer or benign prediction given test input values (x-data argument).

Lines 28-33 show a utility function load-data that loads a University of Wisconsin cancer data set CSV file, scales the input and output values to the range [0.0, 1.0] and returns a list containing input (x-data) and target output data (y-data). You may want to load this example in a REPL and evaluate load-data on one of the CSV files.

The function main (lines 35-45) loads training and test (evaluation of model accuracy on data not used for training), trains a model, and then tests the accuracy of the model on the test (evaluation) data:

 1 #!/usr/bin/env hy
 2 
 3 (import argparse)
 4 (import os)
 5 (import keras.models [Sequential])
 6 (import keras.layers [Dense])
 7 (import keras.optimizers [RMSprop])
 8 
 9 (import pandas [read-csv])
10 (import pandas)
11 
12 (defn build-model []
13   (setv model (Sequential))
14   (.add model (Dense 9
15                  :activation "relu"))
16   (.add model (Dense 12
17                  :activation "relu"))
18   (.add model (Dense 1
19                  :activation "sigmoid"))
20   (.compile model :loss      "binary_crossentropy"
21                   :optimizer (RMSprop))
22   model)
23 
24 (defn first [x] (get x 0))
25 
26 (defn train [batch-size model x y]
27   (for [it (range 50)]
28     (.fit model x y :batch-size batch-size :epochs 10 :verbose False)))
29 
30 (defn predict [model x-data]
31     (.predict model x-data))
32 
33 (defn load-data [file-name]
34   (setv all-data (read-csv file-name :header None))
35   (setv x-data10 (. all-data.iloc [#((slice 0 10) [0 1 2 3 4 5 6 7 8])] values))
36   (setv x-data (* 0.1 x-data10))
37   (setv y-data (. all-data.iloc [#((slice 0 10) [9])] values))
38   [x-data y-data])
39 
40 (defn main []
41   (setv xyd (load-data "train.csv"))
42   (setv model (build-model))
43   (setv xytest (load-data "test.csv"))
44   (train 10 model (. xyd [0]) (. xyd [1]))
45   (print "* predictions (calculated, expected):")
46   (setv predictions (list (map first (predict model (. xytest [0])))))
47   (setv expected (list (map first (. xytest [1]))))
48   (print
49     (list
50       (zip predictions expected))))
51 
52 
53 (main)

The following listing shows the output:

1 $ uv run hy wisconsin.hy
2 * predictions (calculated, expected):
3 [(0.9998953, 1), (0.9999737, 1), (0.9172243, 1), (0.9975936, 1), (0.38985246, 0), (0\
4 .4301587, 0), (0.99999213, 1), (0.855, 0), (0.3810781, 0), (0.9999431, 1)]

Let’s look at the first test case: the “real” output from the training data is a value of 1 and the calculated predicted value (using the trained model) is 0.9759052. In making predictions, we can choose a cutoff value, 0.5 for example, and interpret any calculated prediction value less than the cutoff as a Boolean false prediction and calculated prediction value greater to or equal to the cutoff value is a Boolean true prediction.

Using a LSTM Recurrent Neural Network to Generate English Text Similar to the Philosopher Nietzsche’s Writing

We will translate a Python example program from Google’s Keras documentation (listing of LSTM.py that is included with the example Hy code) to Hy. This is a moderately long example and you can use the original Python and the translated Hy code as a general guide for converting other models implemented in Python using Keras that you want use in Hy. I have, in most cases, kept the same variable names to make it easier to compare the Python and Hy code.

Note that using the nietzsche.txt data set requires a fair amount of memory. If your computer has less than 16G of RAM, you might want to reduce the size of the training text by first running the following example until you see the printout “Create sentences and next_chars data…” then kill the program. The first time you run this program, the training data is fetched from the web and stored locally. You can manually edit the file ~/.keras/datasets/nietzsche.txt to remove 75% of the data by:

1     pushd ~/.keras/datasets/
2     mv nietzsche.txt nietzsche_large.txt
3     head -800 nietzsche_large.txt > nietzsche.txt
4     popd

The next time you run the example, the Keras example data loading utilities will notice a local copy and even though the file now is much smaller, the data loading utilities will not download a new copy.

When I start training a new Deep Learning model I like to monitor system resources using the top command line activity, watching for page faults when training on a CPU which might indicate that I am trying to train too large a model for my system memory. If you are using CUDA and a GPU then use the CUDA command line utilities for monitoring the state of the GPU utilization. It is beyond the scope of this introductory tutorial, but the tool TensorBoard is very useful for monitoring the state of model training.

There are a few things that make the following example code more complex than the example using the University of Wisconsin cancer data set. We need to convert each character in the training data to a one-hot encoding which is a vector of all 0.0 values except for a single value of 1.0. I am going to show you a short REPL session so that you understand how this works and then we will look at the complete Hy code example.

 1 $ hy
 2 => (import keras.callbacks [LambdaCallback])
 3 Using TensorFlow backend.
 4 => (import keras.src.callbacks [LambdaCallback])
 5 => (import keras.src.models [Sequential])
 6 => (import keras.src.layers [Dense LSTM])
 7 => (import keras.src.optimizers [RMSprop])
 8 => (import keras.src.utils [get_file])
 9 => (import numpy :as np) ;; note the syntax for aliasing a module name
10 => (import random sys io)
11 => (with [f (io.open "/Users/markw/.keras/datasets/nietzsche.txt" :encoding "utf-8")]
12 ... (setv text (.read f)))
13 => (cut text 98 130)
14 'philosophers, in so far as they '
15 => (setv chars (sorted (list (set text))))
16 => chars
17 ['\n', ' ', '!', '"', "'", '(', ')', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6\
18 ', '7', '8', '9', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', '\
19 K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', '_', 'a', \
20 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r',\
21  's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
22 => (setv char_indices (dict (lfor i (enumerate chars) (, (last i) (first i)))))
23 => char_indices
24 {'\n': 0, ' ': 1, '!': 2, '"': 3, "'": 4, '(': 5, ')': 6, ',': 7, '-': 8, '.': 9, '0\
25 ': 10, '1': 11, '2': 12, '3': 13, '4': 14, '5': 15, '6': 16, '7': 17, '8': 18, '9': \
26 19, ':': 20, ';': 21, '?': 22, 'A': 23, 'B': 24, 'C': 25, 'D': 26, 'E': 27, 'F': 28,\
27  'G': 29, 'H': 30, 'I': 31, 'J': 32, 'K': 33, 'L': 34, 'M': 35, 'N': 36, 'O': 37, 'P\
28 ': 38, 'Q': 39, 'R': 40, 'S': 41, 'T': 42, 'U': 43, 'V': 44, 'W': 45, 'X': 46, 'Y': \
29 47, '_': 48, 'a': 49, 'b': 50, 'c': 51, 'd': 52, 'e': 53, 'f': 54, 'g': 55, 'h': 56,\
30  'i': 57, 'j': 58, 'k': 59, 'l': 60, 'm': 61, 'n': 62, 'o': 63, 'p': 64, 'q': 65, 'r\
31 ': 66, 's': 67, 't': 68, 'u': 69, 'v': 70, 'w': 71, 'x': 72, 'y': 73, 'z': 74}
32 => (setv indices_char (dict (lfor i (enumerate chars) i)))
33 => indices_char
34 {0: '\n', 1: ' ', 2: '!', 3: '"', 4: "'", 5: '(', 6: ')', 7: ',', 8: '-', 9: '.', 10\
35 : '0', 11: '1', 12: '2', 13: '3', 14: '4', 15: '5', 16: '6', 17: '7', 18: '8', 19: '\
36 9', 20: ':', 21: ';', 22: '?', 23: 'A', 24: 'B', 25: 'C', 26: 'D', 27: 'E', 28: 'F',\
37  29: 'G', 30: 'H', 31: 'I', 32: 'J', 33: 'K', 34: 'L', 35: 'M', 36: 'N', 37: 'O', 38\
38 : 'P', 39: 'Q', 40: 'R', 41: 'S', 42: 'T', 43: 'U', 44: 'V', 45: 'W', 46: 'X', 47: '\
39 Y', 48: '_', 49: 'a', 50: 'b', 51: 'c', 52: 'd', 53: 'e', 54: 'f', 55: 'g', 56: 'h',\
40  57: 'i', 58: 'j', 59: 'k', 60: 'l', 61: 'm', 62: 'n', 63: 'o', 64: 'p', 65: 'q', 66\
41 : 'r', 67: 's', 68: 't', 69: 'u', 70: 'v', 71: 'w', 72: 'x', 73: 'y', 74: 'z'}
42 => (setv maxlen 40)
43 => (setv s "Oh! I saw 1 dog (yesterday)")
44 => (setv x_pred (np.zeros [1 maxlen (len chars)]))
45 => (for [[t char] (lfor j (enumerate s) j)]
46 ... (setv (get x_pred 0 t (get char_indices char)) 1))
47 => x_pred
48 array([[[0., 0., 0., ..., 0., 0., 0.],
49         [0., 0., 0., ..., 0., 0., 0.],
50         [0., 0., 1., ..., 0., 0., 0.],   // here 1. is the third character "!"
51         ...,
52         [0., 0., 0., ..., 0., 0., 0.],
53         [0., 0., 0., ..., 0., 0., 0.],
54         [0., 0., 0., ..., 0., 0., 0.]]])
55 =>

For lines 48-54, each line represents a single character one-hot encoded. Notice how the third character shown on line 50 has a value of “1.” at index 2, which corresponds to the one-hot encoding of the letter “!”.

Now that you have a feeling for how one-hot encoding works, hopefully the following example will make sense to you. We will further discuss one-hot-encoding after the next code listing. For training, we take 40 characters (the value of the variable maxlen) at a time, and using one one-hot encode a character at a time as input and the target output will be the one-hot encoding of the following character in the input sequence. We are iterating on training the model for a while and then given a few characters of text, predict a likely next character - and keep repeating this process. The generated text is then used as input to the model to generate yet more text. You can repeat this process until you have generated sufficient text.

This is a powerful technique that I used to model JSON with complex deeply nested schemas and then generate synthetic JSON in the same schema as the training data. Here, training a model to mimic the philosopher Nietzsche’s writing is much easier than learning highly structured data like JSON:

 1 #!/usr/bin/env hy
 2 
 3 ;; This example was translated from the Python example in the Keras
 4 ;; documentation at: https://keras.io/examples/lstm_text_generation/ that
 5 ;; was written with very old versions of tensorflow and keras.
 6 ;; This Hy version is translated to use current versions of keras and
 7 ;; tensorflow:
 8 
 9 (import keras.src.callbacks [LambdaCallback])
10 (import keras.src.models [Sequential])
11 (import keras.src.layers [Dense LSTM])
12 (import keras.src.optimizers [RMSprop])
13 (import keras.src.utils [get_file])
14 (import numpy :as np) ;; note the syntax for aliasing a module name
15 (import random sys io)
16 
17 (setv path
18       (get_file        ;; this saves a local copy in ~/.keras/datasets
19         "nietzsche.txt"
20         :origin "https://s3.amazonaws.com/text-datasets/nietzsche.txt"))
21 
22 (with [f (io.open path :encoding "utf-8")]
23   (setv text (.read f))) ;; note: sometimes we use (.lower text) to
24                          ;;       convert text to all lower case
25 (print "corpus length:" (len text))
26 
27 (setv chars (sorted (list (set text))))
28 (print "total chars (unique characters in input text):" (len chars))
29 ;;(setv char_indices (dict (lfor i (enumerate chars) (, (last i) (first i)))))
30 (setv char_indices (dict (lfor i (enumerate chars) #((get i -1) (get i 0)))))
31 (setv indices_char (dict (lfor i (enumerate chars) i)))
32 
33 ;; cut the text in semi-redundant sequences of maxlen characters
34 (setv maxlen 40)
35 (setv step 3) ;; when we sample text, slide sampling window 3 characters
36 (setv sentences (list))
37 (setv next_chars (list))
38 
39 (print "Create sentences and next_chars data...")
40 (for [i (range 0 (- (len text) maxlen) step)]
41   (.append sentences (cut text i (+ i maxlen)))
42   (.append next_chars (get text (+ i maxlen))))
43 
44 (print "Vectorization...")
45 (setv x (np.zeros [(len sentences) maxlen (len chars)] :dtype bool))
46 (setv y (np.zeros [(len sentences) (len chars)] :dtype bool))
47 (for [[i sentence] (lfor j (enumerate sentences) j)]
48   (for [[t char] (lfor j (enumerate sentence) j)]
49     (setv (get x i t (get char_indices char)) 1))
50   (setv (get y i (get char_indices (get next_chars i))) 1))
51 (print "Done creating one-hot encoded training data.")
52 
53 (print "Building model...")
54 (setv model (Sequential))
55 (.add model (LSTM 128 :input_shape [maxlen (len chars)]))
56 (.add model (Dense (len chars) :activation "softmax"))
57 
58 (setv optimizer (RMSprop 0.01))
59 (.compile model :loss "categorical_crossentropy" :optimizer optimizer)
60 
61 (defn sample [preds &optional [temperature 1.0]]
62   (setv preds (.astype (np.array preds) "float64"))
63   (setv preds (/ (np.log preds) temperature))
64   (setv exp_preds (np.exp preds))
65   (setv preds (/ exp_preds (np.sum exp_preds)))
66   (setv probas (np.random.multinomial 1 preds 1))
67   (np.argmax probas))
68 
69 (defn on_epoch_end [epoch [not-used None]]
70   (print)
71   (print "----- Generating text after Epoch:" epoch)
72   (setv start_index (random.randint 0 (- (len text) maxlen 1)))
73   (for [diversity [0.2 0.5 1.0 1.2]]
74     (print "----- diversity:" diversity)
75     (setv generated "")
76     (setv sentence (cut text start_index (+ start_index maxlen)))
77     (setv generated (+ generated sentence))
78     (print "----- Generating with seed:" sentence)
79     (sys.stdout.write generated)
80     (for [i (range 400)]
81       (setv x_pred (np.zeros [1 maxlen (len chars)]))
82       (for [[t char] (lfor j (enumerate sentence) j)]
83         (setv (get x_pred 0 t (get char_indices char)) 1))
84 ;;      (setv preds (first (model.predict x_pred :verbose 0)))
85       (setv preds (get (model.predict x_pred :verbose 0) 0))
86       ;;;(print "** preds=" preds)
87       (setv next_index (sample preds diversity))
88       (setv next_char (get indices_char next_index))
89       (setv sentence (+ (cut sentence 1) next_char))
90       (sys.stdout.write next_char)
91       (sys.stdout.flush))
92     (print)))
93 
94 (setv print_callback (LambdaCallback :on_epoch_end on_epoch_end))
95 
96 (model.fit x y :batch_size 128 :epochs 60 :callbacks [print_callback])

We run this example using:

1 uv run hy lstm.hy

In lines 52-54 we defined a model using the Keras APIs and in lines 56-57 compiled the model using a categorical crossentropy loss function with an RMSprop optimizer.

In lines 59-65 we define a function sample that takes a first required argument preds which is a one-hot predicted encoded character that might look like (maxlen or 40 values):

[2.80193929e-02 6.78635418e-01 7.85831537e-04 4.92034527e-03 . . . 6.62320468e-04 9.14627407e-03 2.31375365e-04]

Now, here the predicted one hot encoding values are not strictly 0 or 1, rather they are small floating point numbers of a single number much larger than the others. The largest number is 6.78635418e-01 at index 1 which corresponds to a one-hot encoding for a “ “ space character.

If we print out the number of characters in text and the unique list of characters (variable chars) in the training text file nietzsche.txt we see:

corpus length: 600893
['\n', ' ', '!', '"', "'", '(', ')', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6\
', '7', '8', '9', ':', ';', '=', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', '\
J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', \
'[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n',\
 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'Æ', 'ä', 'æ', 'é', 'ë'\
]

A review of one-hot encoding:

Let’s review our earlier discussion of one-hot encoding with a simpler case. It is important to understand how we one-hot encode input text to inputs for the model and decode back one-hot vectors to text when we use a trained model to generate text. It will help to see the dictionaries for converting characters to indices and then reverse indices to original characters as we saw earlier, some output removed:

char_indices:
 {'\n': 0, ' ': 1, '!': 2, '"': 3, "'": 4, '(': 5, ')': 6, ',': 7, '-': 8, '.': 9, '\
0': 10, '1': 11, '2': 12, '3': 13, '4': 14, '5': 15, '6': 16, '7': 17, '8': 18, '9':\
 19, 
   . . .
 'f': 58, 'g': 59, 'h': 60, 'i': 61, 'j': 62, 'k': 63, 'l': 64, 'm': 65, 'n': 66, 'o\
': 67, 'p': 68, 'q': 69, 'r': 70, 's': 71, 't': 72, 'u': 73, 'v': 74, 'w': 75, 'x': \
76, 'y': 77, 'z': 78, 'Æ': 79, 'ä': 80, 'æ': 81, 'é': 82, 'ë': 83}
indices_char:
 {0: '\n', 1: ' ', 2: '!', 3: '"', 4: "'", 5: '(', 6: ')', 7: ',', 8: '-', 9: '.', 1\
0: '0', 11: '1', 12: '2', 13: '3', 14: '4', 15: '5', 16: '6', 17: '7', 18: '8', 19: \
'9', 
   . . .
 'o', 68: 'p', 69: 'q', 70: 'r', 71: 's', 72: 't', 73: 'u', 74: 'v', 75: 'w', 76: 'x\
', 77: 'y', 78: 'z', 79: 'Æ', 80: 'ä', 81: 'æ', 82: 'é', 83: 'ë'}

We prepare the input and target output data in lines 43-48 in the last code listing. Using a short string, let’s look in the next REPL session listing at how these input and output training examples are extracted for an input string:

 1 $ uv run hy
 2 => (setv text "0123456789abcdefg")
 3 => (setv maxlen 4)
 4 => (setv i 3)
 5 => (cut text i (+ i maxlen))
 6 '3456'
 7 => (cut text (+ 1 maxlen))
 8 '56789abcdefg'
 9 => (setv i 4)                 ;; i is the for loop variable for
10 => (cut text i (+ i maxlen))  ;; defining sentences and next_chars
11 '4567'
12 => (cut text (+ i maxlen))
13 '89abcdefg'
14 =>

So the input training sentences are each maxlen characters long and the next-chars target outputs each start with the character after the last character in the corresponding input training sentence.

This script pauses during each training epoc to generate text given diversity values of 0.2, 0.5, 1.0, and 1.2. The smaller the diversity value the more closely the generated text matches the training text. The generated text is more realistic after many training epocs. In the following, I list a highly edited copy of running through several training epochs. I only show generated text for diversity equal to 0.2:

 1 $ uv run hy wisconsin.hy
 2 ----- Generating text after Epoch: 0
 3 ----- diversity: 0.2
 4 ----- Generating with seed: ocity. Equally so, gratitude.--Justice r
 5 ocity. Equally so, gratitude.--Justice read in the become to the conscience the seen\
 6 er and the conception that the becess of the power to the procentical that the becau\
 7 se and the prostice of the prostice and the will to the conscience of the power of t\
 8 he perhaps the self-distance of the all the soul and the world and the soul of the s\
 9 oul of the world and the soul and an an and the profound the self-dister the all the\
10  belief and the
11 
12 ----- Generating text after Epoch: 8
13 ----- diversity: 0.2
14 ----- Generating with seed: nations
15 laboring simultaneously under th
16 nations
17 laboring simultaneously under the subjection of the soul of the same to the subjecti\
18 on of the subjection of the same not a strong the soul of the spiritual to the same \
19 really the propers to the stree be the subjection of the spiritual that is to probab\
20 ly the stree concerning the spiritual the sublicities and the spiritual to the proce\
21 ssities the spirit to the soul of the subjection of the self-constitution and proper\
22 s to the
23 
24 ----- Generating text after Epoch: 14
25 ----- diversity: 0.2
26 ----- Generating with seed:  to which no other path could conduct us
27  to which no other path could conduct us a stronger that is the self-delight and the\
28  strange the soul of the world of the sense of the sense of the consider the such a \
29 state of the sense of the sense of the sense of such a sandine and interpretation of\
30  the process of the sense of the sense of the sense of the soul of the process of th\
31 e world in the sense of the sense of the spirit and superstetion of the world the se\
32 nse of the
33 
34 ----- Generating text after Epoch: 17
35 ----- diversity: 0.2
36 ----- Generating with seed: hemselves although they could easily hav
37 hemselves although they could easily have been moral morality and the self-in which \
38 the self-in the world to the same man in the standard to the possibility that is to \
39 the strength of the sense-in the former the sense-in the special and the same man in\
40  the consequently the soul of the superstition of the special in the end to the poss\
41 ible that it is will not be a sort of the superior of the superstition of the same m\
42 an to the same man

Here we trained on examples, translated to English, of the philosopher Nietzsche. I have used similar code to this example to train on highly structured JSON data and the resulting LSTM bsed model was usually able to generate similarly structured JSON. I have seen other examples where the training data was code in C++.

How is this example working? The model learns what combinations of characters tend to appear together and in what order.

In the next chapter we will use pre-trained Deep Learning models for natural language processing (NLP).

Natural Language Processing

I have been working in the field of Natural Language Processing (NLP) since 1985 so I ‘lived through’ the revolutionary change in NLP that has occurred since 2014: deep learning results out-classed results from previous symbolic methods.

We won’t be Large Language Models (LLMs) in this chapter, we will cover LLMs in six chapters at the end of this book. Here we use the spaCy NLP library that is simple to use and provides good results.

I will not cover older symbolic methods of NLP here, rather I refer you to my previous books Practical Artificial Intelligence Programming With Java, [Loving Common Lisp, The Savvy Programmer’s Secret Weapon, and Haskell Tutorial and Cookbook for examples. We get better results using Deep Learning (DL) for NLP and the library spaCy (https://spacy.io) that we use in this chapter provides near state-of-the-art performance. The authors of spaCy frequently update it to use the latest breakthroughs in the field.

You will learn how to apply both DL and NLP by using the state-of-the-art full-feature library spaCy. This chapter concentrates on how to use spaCy in the Hy language for solutions to a few selected problems in NLP that I use in my own work. I urge you to also review the “Guides” section of the spaCy documentation where examples are in Python but after experimenting with the examples in this chapter you should have no difficulty in translating any spaCy Python examples to the Hy language.

If you have not already done so install the spaCy library and the full English language model:

Install uv if it is not already on your system.

One time setup:

1 $ cd hy-lisp-python-book/source_code_for_examples/nlp
2 $ uv run python -m spacy download en_core_web_sm

Exploring the spaCy Library

We will use the Hy REPL to experiment with spaCy, Lisp style. The following REPL listings are all from the same session, split into separate listings so that I can talk you through the examples:

 1 Marks-MacBook:nlp $ uv run hy
 2 Hy 1.1.0 (Business Hugs) using CPython(main) 3.12.0 on Darwin
 3 => (import spacy)
 4 => (setv nlp-model (spacy.load "enen_core_web_sm"))
 5 => (setv doc (nlp-model "President George Bush went to Mexico and he had a very good\
 6  meal"))
 7 => doc
 8 President George Bush went to Mexico and he had a very good meal
 9 => (dir doc)
10 ['_', '__bytes__', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__fo\
11 rmat__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init_\
12 _', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__ne__', '__new\
13 __', '__pyx_vtable__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__\
14 setstate__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', '_bulk_merg\
15 e', '_py_tokens', '_realloc', '_vector', '_vector_norm', 'cats', 'char_span', 'count\
16 _by', 'doc', 'ents', 'extend_tensor', 'from_array', 'from_bytes', 'from_disk', 'get_\
17 extension', 'get_lca_matrix', 'has_extension', 'has_vector', 'is_nered', 'is_parsed'\
18 , 'is_sentenced', 'is_tagged', 'lang', 'lang_', 'mem', 'merge', 'noun_chunks', 'noun\
19 _chunks_iterator', 'print_tree', 'remove_extension', 'retokenize', 'sentiment', 'sen\
20 ts', 'set_extension', 'similarity', 'tensor', 'text', 'text_with_ws', 'to_array', 't\
21 o_bytes', 'to_disk', 'to_json', 'user_data', 'user_hooks', 'user_span_hooks', 'user_\
22 token_hooks', 'vector', 'vector_norm', 'vocab']

In lines 3-6 we import the spaCy library, load the English language model, and create a document from input text. What is a spaCy document? In line 9 we use the standard Python function dir to look at all names and functions defined for the object doc returned from applying a spaCy model to a string containing text. The value printed shows many built in “dunder” (double underscore attributes), and we can remove these:

In lines 23-26 we use the dir function again to see the attributes and methods for this class, but filter out any attributes containing the characters “__”:

23 => (lfor x (dir doc) :if (not (.startswith x "__")) x)
24 ['_', '_bulk_merge', '_py_tokens', '_realloc', '_vector', '_vector_norm', 'cats', 'c\
25 har_span', 'count_by', 'doc', 'ents', 'extend_tensor', 'from_array', 'from_bytes', '\
26 from_disk', 'get_extension', 'get_lca_matrix', 'has_extension', 'has_vector', 'is_ne\
27 red', 'is_parsed', 'is_sentenced', 'is_tagged', 'lang', 'lang_', 'mem', 'merge', 'no\
28 un_chunks', 'noun_chunks_iterator', 'print_tree', 'remove_extension', 'retokenize', \
29 'sentiment', 'sents', 'set_extension', 'similarity', 'tensor', 'text', 'text_with_ws\
30 ', 'to_array', 'to_bytes', 'to_disk', 'to_json', 'user_data', 'user_hooks', 'user_sp\
31 an_hooks', 'user_token_hooks', 'vector', 'vector_norm', 'vocab']
32 =>

The to_json method looks promising so we will import the Python pretty print library and look at the pretty printed result of calling the to_json method on our document stored in doc:

 36 => (import [pprint [pprint]])
 37 => (pprint (doc.to_json))
 38 {'ents': [{'end': 21, 'label': 'PERSON', 'start': 10},
 39           {'end': 36, 'label': 'GPE', 'start': 30}],
 40  'sents': [{'end': 64, 'start': 0}],
 41  'text': 'President George Bush went to Mexico and he had a very good meal',
 42  'tokens': [{'dep': 'compound',
 43              'end': 9,
 44              'head': 2,
 45              'id': 0,
 46              'pos': 'PROPN',
 47              'start': 0,
 48              'tag': 'NNP'},
 49             {'dep': 'compound',
 50              'end': 16,
 51              'head': 2,
 52              'id': 1,
 53              'pos': 'PROPN',
 54              'start': 10,
 55              'tag': 'NNP'},
 56             {'dep': 'nsubj',
 57              'end': 21,
 58              'head': 3,
 59              'id': 2,
 60              'pos': 'PROPN',
 61              'start': 17,
 62              'tag': 'NNP'},
 63             {'dep': 'ROOT',
 64              'end': 26,
 65              'head': 3,
 66              'id': 3,
 67              'pos': 'VERB',
 68              'start': 22,
 69              'tag': 'VBD'},
 70             {'dep': 'prep',
 71              'end': 29,
 72              'head': 3,
 73              'id': 4,
 74              'pos': 'ADP',
 75              'start': 27,
 76              'tag': 'IN'},
 77             {'dep': 'pobj',
 78              'end': 36,
 79              'head': 4,
 80              'id': 5,
 81              'pos': 'PROPN',
 82              'start': 30,
 83              'tag': 'NNP'},
 84             {'dep': 'cc',
 85              'end': 40,
 86              'head': 3,
 87              'id': 6,
 88              'pos': 'CCONJ',
 89              'start': 37,
 90              'tag': 'CC'},
 91             {'dep': 'nsubj',
 92              'end': 43,
 93              'head': 8,
 94              'id': 7,
 95              'pos': 'PRON',
 96              'start': 41,
 97              'tag': 'PRP'},
 98             {'dep': 'conj',
 99              'end': 47,
100              'head': 3,
101              'id': 8,
102              'pos': 'VERB',
103              'start': 44,
104              'tag': 'VBD'},
105             {'dep': 'det',
106              'end': 49,
107              'head': 12,
108              'id': 9,
109              'pos': 'DET',
110              'start': 48,
111              'tag': 'DT'},
112             {'dep': 'advmod',
113              'end': 54,
114              'head': 11,
115              'id': 10,
116              'pos': 'ADV',
117              'start': 50,
118              'tag': 'RB'},
119             {'dep': 'amod',
120              'end': 59,
121              'head': 12,
122              'id': 11,
123              'pos': 'ADJ',
124              'start': 55,
125              'tag': 'JJ'},
126             {'dep': 'dobj',
127              'end': 64,
128              'head': 8,
129              'id': 12,
130              'pos': 'NOUN',
131              'start': 60,
132              'tag': 'NN'}]}
133 =>

The JSON data is nested dictionaries. In a later chapter on Knowledge Graphs, we will want to get the named entities like people, organizations, etc., from text and use this information to automatically generate data for Knowledge Graphs. The values for the key ents (stands for “entities”) will be useful. Notice that the words in the original text are specified by beginning and ending text token indices (values of head and end in lines 52 to 142).

The values for the key tokens listed on lines 42-132 contains the head (or starting index, ending index, the token number (id), and the part of speech (pos). We will list what the parts of speech mean later.

We would like the words for each entity to be concatenated into a single string for each entity and we do this here in lines 136-137 and see the results in lines 138-139.

I like to add the entity name strings back into the dictionary representing a document and line 140 shows the use of lfor to create a list of lists where the sublists contain the entity name as a single string and the type of entity. We list the entity types supported by spaCy in the next section.

134 => doc.ents
135 (George Bush, Mexico)
136 => (for [entity doc.ents]
137 ... (print "entity text:" entity.text "entity label:" entity.label_))
138 entity text: George Bush entity label: PERSON
139 entity text: Mexico entity label: GPE
140 => (lfor entity doc.ents [entity.text entity.label_])
141 [['George Bush', 'PERSON'], ['Mexico', 'GPE']]
142 =>

We can also access each sentence as a separate string. In this example the original text used to create our sample document had only a single sentence so the sents property returns a list containing a single string:

147 => (list doc.sents)
148 [President George Bush went to Mexico and he had a very good meal]
149 =>

The last example showing how to use a spaCy document object is listing each word with its part of speech:

150 => (for [word doc]
151 ... (print word.text word.pos_))
152 President PROPN
153 George PROPN
154 Bush PROPN
155 went VERB
156 to ADP
157 Mexico PROPN
158 and CCONJ
159 he PRON
160 had VERB
161 a DET
162 very ADV
163 good ADJ
164 meal NOUN
165 =>

The following list shows the definitions for the part of speech (POS) tags:

ADJ: adjective
ADP: adposition
ADV: adverb
AUX: auxiliary verb
CONJ: coordinating conjunction
DET: determiner
INTJ: interjection
NOUN: noun
NUM: numeral
PART: particle
PRON: pronoun
PROPN: proper noun
PUNCT: punctuation
SCONJ: subordinating conjunction
SYM: symbol
VERB: verb
X: other

Implementing a HyNLP Wrapper for the Python spaCy Library

We will generate two libraries (in the file nlp_lib.hy). The first is a general NLP library. The test program is in the file nlp_example.hy.

For an example in a later chapter, we will use the library developed here to automatically generate Knowledge Graphs from text data. We will need the ability to find person, company, location, etc. names in text. We use spaCy here to do this. The types of named entities on which spaCy is pre-trained includes:

CARDINAL: any number that is not identified as a more specific type, like money, time, etc.
DATE
FAC: facilities like highways, bridges, airports, etc.
GPE: Countries, states (or provinces), and cities
LOC: any non-GPE location
PRODUCT
EVENT
LANGUAGE: any named language
MONEY: any monetary value or unit of money
NORP: nationalities or religious groups
ORG: any organization like a company, non-profit, school, etc.
PERCENT: any number in [0, 100] followed by the percent % character
PERSON
ORDINAL: any number spelled out, like “one”, “two”, etc.
TIME

Listing for hy-lisp-python/nlp/nlp_lib.hy:

 1 (import spacy)
 2 
 3 (setv nlp-model (spacy.load "en_core_web_sm"))
 4 
 5 (defn nlp [some-text]
 6   (setv doc (nlp-model some-text))
 7   (setv entities (lfor entity doc.ents [entity.text entity.label_]))
 8   (setv j (doc.to_json))
 9   (setv (get j "entities") entities)
10   j)

Listing for hy-lisp-python/nlp/nlp_example.hy:

1 (import nlp-lib [nlp])
2 
3 (print
4   (nlp "President George Bush went to Mexico and he had a very good meal"))
5 
6 (print
7   (nlp "Lucy threw a ball to Bill and he caught it"))

 1 Marks-MacBook:nlp $ uv run hy nlp_example.hy 
 2 {'text': 'President George Bush went to Mexico and he had a very good meal', 'ents':\
 3  [{'start': 10, 'end': 21, 'label': 'PERSON'}, {'start': 30, 'end': 36, 'label': 'GP\
 4 E'}], 'sents': [{'start': 0, 'end': 64}], 'tokens': [{'id': 0, 'start': 0, 'end': 9,\
 5  'tag': 'NNP', 'pos': 'PROPN', 'morph'
 6 
 7  ...
 8 
 9 {'text': 'Lucy threw a ball to Bill and he caught it', 'ents': [{'start': 0, 'end': \
10 4, 'label': 'PERSON'}, {'start': 21, 'end': 25, 'label': 'PERSON'}], 'sents': [{'sta\
11 rt': 0, 'end': 42}], 'tokens': [{'id': 0, 'start': 0, 'end': 4, 'tag': 'NNP', 'pos':\
12  'PROPN', 'morph': 'Number=Sing', 'lemma': 'Lucy', 'dep': 'nsubj', 'head': 1}, {'id'\
13 : 1, 'start': 5, 'end': 10, 'tag': 'VBD', 'pos': 'VERB', 'morph': 'Tense=Past|VerbFo\
14 rm=Fin', 'lemma': 'throw', 'dep': 'ROOT', 'head': 1}, {'id': 2, 'start': 11, 'end': \
15 12, 'tag
16 
17  ...
18 
19   ..LOTS OF OUTPUT NOT SHOWN..

Wrap-up

I spent several years of development time during the period from 1984 through 2015 working on natural language processing technology and as a personal side project I sold commercial NLP libraries that I wrote on my own time in Ruby and Common Lisp. The state-of-the-art of Deep Learning enhanced NLP is very good and the open source spaCy library makes excellent use of both conventional NLP technology and pre-trained Deep Learning models. I no longer spend very much time writing my own NLP libraries and instead use spaCy or more recently LLMs that we cover later.

I urge you to read through the spaCy documentation because we covered just basic functionality here that we will also need in the later chapter on automatically generating data for Knowledge Graphs. After working through the interactive REPL sessions and the examples in this chapter, you should be able to translate any Python API example code to Hy.

Datastores

I use flat files and the PostgreSQL relational database for most data storage and processing needs in my consulting business over the last twenty years. For work on large data projects at Compass Labs and Google I used Hadoop and Big Table. I will not cover big data datastores here, rather I will concentrate on what I think of as “laptop development” requirements: a modest amount of data and optimizing speed of development and ease of infrastructure setup. We will cover three datastores:

Sqlite single-file-based relational database
PostgreSQL relational database
RDF library rdflib that is useful for semantic web and linked data applications

For graph data we will stick with RDF because it is a fairly widely used standard. Google, Microsoft, Yahoo and Yandex support schema.org for defining schemas for structured data on the web. In the next chapter we will go into more details on RDF, here we look at the “plumbing” for using the rdflib library to manipulate and query RDF data and how to export RDF data in several formats. Then in a later chapter, we will develop tools to automatically generate RDF data from raw text as a tool for developing customized Knowledge Graphs.

In one of my previous previous books Loving Common Lisp, or the Savvy Programmer’s Secret Weapon I also covered the general purpose graph database Neo4j which I like to use for some use cases, but for the purposes of this book we stick with RDF.

Sqlite

We will cover two relational databases: Sqlite and PostgreSQL. Sqlite is an embedded database. There are Sqlite libraries for many programming languages and here we use the Python library.

The following examples are simple but sufficient to show you how to open a single file Sqlite database, add data, modify data, query data, and delete data. I assume that you have some familiarity with relational databases, especially concepts like data columns and rows, and SQL queries.

Let’s start with putting common code for using Sqlite into a reusable library in the file sqlite_lib.hy:

 1 (import sqlite3)
 2 
 3 (defn create-db [db-file-path] ;; db-file-path can also be ":memory:"
 4   (setv conn (sqlite3.connect db-file-path))
 5   (print version)
 6   (conn.close))
 7 
 8 (defn connection [db-file-path] ;; db-file-path can also be ":memory:"
 9   (sqlite3.connect db-file-path))
10 
11 (defn query [conn sql [variable-bindings None]]
12   (setv cur (conn.cursor))
13   (if variable-bindings
14     (cur.execute sql variable-bindings)
15     (cur.execute sql))
16   (cur.fetchall))

The function create-db in lines 3-6 creates a database from a file path if it does not already exist. The function connection (lines 8-9) creates a persistent connection to a database defined by a file path to the single file used for a Sqlite database. This connection can be reused. The function query (lines 11-16) requires a connection object and a SQL query represented as a string, makes a database query, and returns all matching data in nested lists.

The following listing of file sqlite_example.hyshows how to use this simple library:

 1 #!/usr/bin/env hy
 2 
 3 (import sqlite-lib [create-db connection query])
 4 
 5 (defn test_sqlite-lib []
 6   (setv conn (connection ":memory:")) ;; "test.db"))
 7   (query conn "CREATE TABLE people (name TEXT, email TEXT);")
 8   (print
 9     (query conn "INSERT INTO people VALUES ('Mark', 'mark@markwatson.com')"))
10   (print
11     (query conn "INSERT INTO people VALUES ('Kiddo', 'kiddo@markwatson.com')"))
12   (print
13     (query conn "SELECT * FROM people"))
14   (print
15     (query conn "UPDATE people SET name = ? WHERE email = ?"
16       ["Mark Watson" "mark@markwatson.com"]))
17   (print
18     (query conn "SELECT * FROM people"))
19   (print
20     (query conn "DELETE FROM people  WHERE name=?" ["Kiddo"]))
21     (print
22     (query conn "SELECT * FROM people"))
23   (conn.close))
24 
25 (test_sqlite-lib)

We opened an in-memory database in lines 7 and 8 but we could have also created a persistent database on disk using, for example, “test_database.db” instead of :memory. In line 9 we create a database table with just two columns, each column holding string values.

In lines 15, 20, and 24 we are using a wild card query using the asterisk character to return all column values for each matched row in the database.

Running the example program produces the following output:

1 $ uv run hy sqlite_example.hy
2 []
3 []
4 [('Mark', 'mark@markwatson.com'), ('Kiddo', 'kiddo@markwatson.com')]
5 []
6 [('Mark Watson', 'mark@markwatson.com'), ('Kiddo', 'kiddo@markwatson.com')]
7 []
8 [('Mark Watson', 'mark@markwatson.com')]

Line 2 shows the version of SQlite we are using. The lists in lines 1-2, 4, and 6 are empty because the functions to create a table, insert data into a table, update a row in a table, and delete rows do not return values.

In the next section we will see how PostgreSQL treats JSON data as a native data type. For sqlite, you can store JSON data as a “dumped” string value but you can’t query by key/value pairs in the data. You can encode JSON as a string and then decode it back to JSON (or as a dictionary) using:

1 (import [json [dumps loads]])
2 
3 (setv json-data .....)
4 (setv s-data (json.dumps json-data))
5 (setv restored-json-data (json.loads s-data))

PostgreSQL

We just saw use cases for the Sqlite embedded database. Now we look at my favorite general purpose database, PostgreSQL. The PostgreSQL database server is available as a managed service on most cloud providers and it is easy to also run a PostgreSQL server on your laptop or on a VPS or server.

We will use the psycopg PostgreSQL adapter that is compatible with CPython and can be installed using:

1     pip install psycopg2

The following material is self-contained but before using PostgreSQL and psycopg in your own applications I recommend that you reference the psycopg documentation.

Notes for Using PostgreSQL and Setting Up an Example Database “hybook” on macOS and Linux

The following two sections may help you get PostgreSQL set up on macOS and Linux.

macOS

For macOS we use the PostgreSQL application and we will start by using the postgres command line utility to create a new database and table in this database. Using postgres account, create a new database hybook:

 1 Marks-MacBook:datastores $ psql -d "postgres"
 2 psql (9.6.3)
 3 Type "help" for help.
 4 
 5 postgres=# \d
 6 No relations found.
 7 postgres=# CREATE DATABASE hybook;
 8 CREATE DATABASE
 9 postgres=# \q
10 Marks-MacBook:datastores $

Create a table news in database hybook:

1 markw $ psql -d "hybook"
2 psql (9.6.3)
3 Type "help" for help.
4 
5 hybook=# CREATE TABLE news (uri VARCHAR(50) not null, title VARCHAR(50), articletext\
6  VARCHAR(500), nlpdata VARCHAR(50)); 
7 CREATE TABLE
8 hybook=#

Linux

For Ubuntu Linux first install PostgreSQL and then use sudo to use the account postgres:

To start a local server:

1 sudo su - postgres
2 /usr/lib/postgresql/10/bin/pg_ctl -D /var/lib/postgresql/10/main -l logfile start

and to stop the server:

1 sudo su - postgres
2 /usr/lib/postgresql/10/bin/pg_ctl -D /var/lib/postgresql/10/main -l logfile stop

When the PostgreSQL server is running we can use the psql command line program:

 1 sudo su - postgres
 2 psql
 3 
 4 postgres@pop-os:~$ psql -d "hybook"
 5 psql (10.7 (Ubuntu 10.7-0ubuntu0.18.10.1))
 6 Type "help" for help.
 7 
 8 hybook=# CREATE TABLE news (uri VARCHAR(50) not null, title VARCHAR(50), articletext\
 9  VARCHAR(500), nlpdata VARCHAR(50));
10 CREATE TABLE
11 hybook=# \d
12         List of relations
13  Schema | Name | Type  |  Owner   
14 --------+------+-------+----------
15  public | news | table | postgres
16 (1 row)

Using Hy with PostgreSQL

When using Hy (or any other Lisp language and also for Haskell), I usually start both coding and experimenting with new libraries and APIs in a REPL. Let’s do that here to see from a high level how we can use psycopg on the table news in the database hybook that we created in the last section:

 1 Marks-MacBook:datastores $ hy
 2 => (import json psycopg2)
 3 => (setv conn (psycopg2.connect :dbname "hybook" :user "markw"))
 4 => (setv cur (conn.cursor))
 5 => (cur.execute "INSERT INTO news VALUES (%s, %s, %s, %s)"
 6       ["http://knowledgebooks.com/schema" "test schema"
 7       "text in article" (json.dumps {"type" "news"})])
 8 => (conn.commit)
 9 => (cur.execute "SELECT * FROM news")
10 => (for [record cur]
11 ... (print record))
12 ('http://knowledgebooks.com/schema', 'test schema', 'text in article',
13  '{"type": "news"}')
14 => (cur.execute "SELECT nlpdata FROM news")
15 => (for [record cur]
16 ... (print record))
17 ('{"type": "news"}',)
18 => (cur.execute "SELECT nlpdata FROM news")
19 => (for [record cur]
20 ... (print (json.loads (first record))))
21 {'type': 'news'}
22 =>

In lines 6-8 and 13-14 you notice that I am using PostgreSQL’s native JSON support.

As with most of the material in this book, I hope that you have a Hy REPL open and are experimenting with the APIs and code in the book’s interactive REPL examples.

The file postgres_lib.hy wraps commonly used functionality for accessing a database, adding, modifying, and querying data in a short reusable library:

1 (defn connection-and-cursor [dbname username]
2   (setv conn (connect :dbname dbname :user username))
3   (setv cursor (conn.cursor))
4   [conn cursor])
5 
6 (defn query [cursor sql [variable-bindings None]]
7   (if variable-bindings
8     (cursor.execute sql variable-bindings)
9     (cursor.execute sql)))

The function query in lines 8-11 executes any SQL comands so in addition to querying a database, it can also be used with appropriate SQL commands to delete rows, update rows, and create and destroy tables.

The following file postgres_example.hy contains examples for using the library we just defined:

 1 #!/usr/bin/env hy
 2 
 3 (import postgres-lib [connection-and-cursor query])
 4 
 5 (defn test-postgres-lib []
 6   (setv [conn cursor] (connection-and-cursor "hybook" "markw"))
 7   (query cursor "CREATE TABLE people (name TEXT, email TEXT);")
 8   (conn.commit)
 9   (query cursor "INSERT INTO people VALUES ('Mark',  'mark@markwatson.com')")
10   (query cursor "INSERT INTO people VALUES ('Kiddo', 'kiddo@markwatson.com')")
11   (conn.commit)
12   (query cursor "SELECT * FROM people")
13   (print (cursor.fetchall))
14   (query cursor "UPDATE people SET name = %s WHERE email = %s"
15       ["Mark Watson" "mark@markwatson.com"])
16   (query cursor "SELECT * FROM people")
17   (print (cursor.fetchall))
18   (query cursor "DELETE FROM people  WHERE name = %s" ["Kiddo"])
19   (query cursor "SELECT * FROM people")
20   (print (cursor.fetchall))
21   (query cursor "DROP TABLE people;")
22   (conn.commit)
23   (conn.close))
24 
25 (test-postgres-lib)

Here is the output from this example Hy script:

1 Marks-MacBook:datastores $ ./postgres_example.hy
2 [('Mark', 'mark@markwatson.com'), ('Kiddo', 'kiddo@markwatson.com')]
3 [('Kiddo', 'kiddo@markwatson.com'), ('Mark Watson', 'mark@markwatson.com')]
4 [('Mark Watson', 'mark@markwatson.com')]

I use PostgreSQL more than any other datastore and taking the time to learn how to manage PostgreSQL servers and write application software will save you time and effort when you are prototyping new ideas or developing data oriented product at work. I love using PostgreSQL and personally, I only use Sqlite for very small database tasks or applications.

RDF Data Using the “rdflib” Library

While the last two sections on Sqlite and PostgreSQL provided examples that you are likely to use in your own work, we will now turn to something more esoteric but still useful, the RDF notations for using data schema and RDF triple graph data in semantic web, linked data, and Knowledge Graph applications. I used graph databases working with Google’s Knowledge Graph when I worked there and I have had several consulting projects using linked data. I currently work on the Knowledge Graph team at Olive AI. You will need to understand the material in this section for the two chapters that take a deeper dive into the semantic web and linked data and also develop an example that automatically creates Knowledge Graphs.

In my work I use RDF as a notation for graph data, RDFS (RDF Schema) to define formally data types and relationship types in RDF data, and occasionally OWL (Web Ontology Language) for reasoning about RDF data and inferring new graph triple data from data explicitly defined. Here we will only cover RDF since it is the most practical linked data tool and I refer you to my other semantic web books for deeper coverage of RDF as well as RDFS and OWL.

We will go into some detail on using semantic web and linked data resources in the next chapter. Here we will study the use of library rdflib as a data store, reading RDF data from disk and from web resources, adding RDF statements (which are triples containing a subject, predicate, and object) and for serializing an in-memory graph to a file in one of the standard RDF XML, turtle, or NT formats.

You need to install both the rdflib library and the plugin for using the JSON-LD format:

rdflib
rdflib-jsonld

These libraries and Hy are confiured for the uv environment in the directory hy-lisp-python-book/source_code_for_examples/rdf.

The following REPL session shows importing the rdflib library, fetching RDF (in XML format) from my personal web site, printing out the triples in the graph in NT format, and showing how the graph can be queried. I added most of this RDF to my web site in 2005, with a few updates since then. The following REPL session is split up into several listings (with some long output removed) so I can explain how the rdflib is being used. In the first REPL listing I load an RDF file in XML format from my web site and print it in NT format. NT format can have either subject/predicate/object all on one line separated by spaces and terminated by a period or as shown below, the subject is on one line with predicate and objects printed indented on two additional lines. In both cases a period character “.” is used to terminate search RDF NT statement. The statements are displayed in arbitrary order.

 1 datastores $ uv run hy
 2 Hy 1.1.0 (Business Hugs) using CPython(main) 3.12.0 on Darwin
 3 => (import rdflib [Graph])
 4 => (setv graph (Graph))
 5 => (graph.parse "https://www.w3.org/2000/10/rdf-tests/RDF-Model-Syntax_1.0/ms_4.1_1.\
 6 rdf")
 7 => (for [[subject predicate object] graph]
 8 ... (print subject "\n  " predicate "\n  " object " ."))
 9 N4836630a0d1b4585ab03d381011c092d 
10    http://description.org/schema/attributedTo 
11    Ralph Swick  .
12 N4836630a0d1b4585ab03d381011c092d 
13    http://www.w3.org/1999/02/22-rdf-syntax-ns#subject 
14    http://www.w3.org/Home/Lassila  .
15 N4836630a0d1b4585ab03d381011c092d 
16    http://www.w3.org/1999/02/22-rdf-syntax-ns#type 
17    http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement  .
18 N4836630a0d1b4585ab03d381011c092d 
19    http://www.w3.org/1999/02/22-rdf-syntax-ns#object 
20    Ora Lassila  .
21 N4836630a0d1b4585ab03d381011c092d 
22    http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate 
23    http://description.org/schema/Creator  .
24 => (for [[subject predicate object] graph] (print subject "\n  " predicate "\n  " ob\
25 ject " ."))
26 N0606a4fc1dce4d79843ead3a26db6c76 
27    http://www.w3.org/1999/02/22-rdf-syntax-ns#type 
28    http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement  .
29 N0606a4fc1dce4d79843ead3a26db6c76 
30    http://description.org/schema/attributedTo 
31    Ralph Swick  .
32 N0606a4fc1dce4d79843ead3a26db6c76 
33    http://www.w3.org/1999/02/22-rdf-syntax-ns#object 
34    Ora Lassila  .
35 N0606a4fc1dce4d79843ead3a26db6c76 
36    http://www.w3.org/1999/02/22-rdf-syntax-ns#subject 
37    http://www.w3.org/Home/Lassila  .
38 N0606a4fc1dce4d79843ead3a26db6c76 
39    http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate 
40    http://description.org/schema/Creator  .
41 =>

There are several available formats for serializing RDF data. Here we will serialize using the JSON-LD format (later we will also see examples for serializing in NT and Turtle formats):

 1 => (import rdflib [plugin])
 2 => (import rdflib.serializer [Serializer])
 3 => (print (graph.serialize :format "json-ld" :indent 2))
 4 [
 5   {
 6     "@id": "_:N0606a4fc1dce4d79843ead3a26db6c76",
 7     "@type": [
 8       "http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement"
 9     ],
10     "http://description.org/schema/attributedTo": [
11       {
12         "@value": "Ralph Swick"
13       }
14     ],
15     "http://www.w3.org/1999/02/22-rdf-syntax-ns#object": [
16       {
17         "@value": "Ora Lassila"
18       }
19     ],
20     "http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate": [
21       {
22         "@id": "http://description.org/schema/Creator"
23       }
24     ],
25     "http://www.w3.org/1999/02/22-rdf-syntax-ns#subject": [
26       {
27         "@id": "http://www.w3.org/Home/Lassila"
28       }
29     ]
30   }
31 ]

JSON-LD is convenient for implementing APIs that are intended for use by developers who are not familiar with RDF technology.

We will cover the SPARQL query language in more detail in the next chapter but for now, notice that SPARQL is similar to SQL queries. SPARQL queries can find triples in a graph matching simple patterns, match complex patterns, and update and delete triples in a graph. The following simple SPARQL query finds all triples with the predicate equal to http://www.w3.org/2000/10/swap/pim/contact#company and prints out the subject and object of any matching triples:

84 => (for [[subject object]
85 ... (graph.query
86 ...  "select ?subject ?object where { ?subject <http://description.org/schema/attrib\
87 utedTo> ?object }")]
88 ... (print subject "attributedTo: " object))
89 N0606a4fc1dce4d79843ead3a26db6c76 attributedTo:  Ralph Swick

We will see more examples of the SPARQL query language in the next chapter. For now, notice that the general form of a select query statement is a list of query variables (names beginning with a question mark) and a where clause in curly brackets that contains matching patterns. This SPARQL query is simple, but like SQL queries, SPARQL queries can get very complex. I only lightly cover SPARQL in this book. You can get PDF copies of my two older semantic web books for free: Practical Semantic Web and Linked Data Applications, Java, Scala, Clojure, and JRuby Edition and Practical Semantic Web and Linked Data Applications, Common Lisp Edition. There are links to relevant git repos and other information on my book web page.

In addition to the Turtle format I also use the simpler NT format that puts URI prefixes inline and unlike Turtle does not use prefix abrieviations. Here in line 159 we serialize to NT format:

159 => (graph.serialize :format "nt")
160 _:N0606a4fc1dce4d79843ead3a26db6c76
161   <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
162   <http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement> .
163 _:N0606a4fc1dce4d79843ead3a26db6c76
164   <http://description.org/schema/attributedTo>
165   "Ralph Swick\" .
166 _:N0606a4fc1dce4d79843ead3a26db6c76
167   <http://www.w3.org/1999/02/22-rdf-syntax-ns#object>
168   "Ora Lassila\" .
169 _:N0606a4fc1dce4d79843ead3a26db6c76
170   <http://www.w3.org/1999/02/22-rdf-syntax-ns#subject>
171   <http://www.w3.org/Home/Lassila> .
172 _:N0606a4fc1dce4d79843ead3a26db6c76
173   <http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate> <http://description.org/sch\
174 ema/Creator> .
175 =>

Using RDFLIB with in-memory RDF triple storage is very convenient with small or mid-size RDF data sets as long as initializing the data store by reading a local file containing RDF triples is a fast operation. If I need to use large RDF data sets I prefer to not use rdflib and instead use SPARQL to access a free or open source standalone RDF data store like OpenLink Virtuoso or GraphDB™ Free Edition. I also like and recommend the commercial AllegroGraph and Stardog RDF server products.

Wrap-up

We will go into much more detail on practical uses of RDF and SPARQL in the next chapter. I hope that you worked through the REPL examples in this section because if you understand the basics of using the rdflib then you will have an easier time understanding the more abstract material in the next chapter.

Linked Data, the Semantic Web, and Knowledge Graphs

Tim Berners Lee, James Hendler, and Ora Lassila wrote in 2001 an article for Scientific American where they introduced the term Semantic Web. Here I do not capitalize semantic web and use the similar term linked data somewhat interchangeably with semantic web. Most work using these technologies now is building corporate Knowledge Graphs. I worked at Google with their Knowledge Graph in 2013 and I worked with the Knowledge Graph team at Olive AI during 2020-2021.

In ths chapter we will only be using the Hy REPL in the directory hy-lisp-python-book/source_code_for_examples/rdf:

1 $ cd hy-lisp-python-book/source_code_for_examples/rdf
2 $ uv sync
3 $ uv run hy
4 Hy 1.1.0 (Business Hugs) using CPython(main) 3.12.0 on Darwin
5 =>

In the same way that the web allows links between related web pages, linked data supports linking associated data on the web together. I view linked data as a relatively simple way to specify relationships between data sources on the web while the semantic web has a much larger vision: the semantic web has the potential to be the entirety of human knowledge represented as data on the web in a form that software agents can work with to answer questions, perform research, and to infer new data from existing data.

While the “web” describes information for human readers, the semantic web is meant to provide structured data for ingestion by software agents. This distinction will be clear as we compare WikiPedia, made for human readers, with DBPedia which uses the info boxes on WikiPedia topics to automatically extract RDF data describing WikiPedia topics. Let’s look at the WikiPedia topic for the town I live in, Sedona Arizona, and show how the info box on the English version of the WikiPedia topic page for Sedona https://en.wikipedia.org/wiki/Sedona,_Arizona maps to the DBPedia page http://dbpedia.org/page/Sedona,_Arizona. Please open both of these WikiPedia and DBPedia URIs in two browser tabs and keep them open for reference.

I assume that the format of the WikiPedia page is familiar so let’s look at the DBPedia page for Sedona that in human readable form shows the RDF statements with Sedona Arizona as the subject. RDF is used to model and represent data. RDF is defined by three values so an instance of an RDF statement is called a triple with three parts:

subject: a URI (also referred to as a “Resource”)
property: a URI (also referred to as a “Resource”)
value: a URI (also referred to as a “Resource”) or a literal value (like a string)

The subject for each Sedona related triple is the above URI for the DBPedia human readable page. The subject and property references in an RDF triple will almost always be a URI that can ground an entity to information on the web. The human readable page for Sedona lists several properties and the values of these properties. One of the properties is “dbo:areaCode” where “dbo” is a name space reference (in this case for a DatatypeProperty).

The following two figures show an abstract representation of linked data and then a sample of linked data with actual web URIs for resources and properties:

Abstract RDF representation with 2 Resources, 2 literal values, and 3 Properties

Concrete example using RDF seen in last chapter showing the RDF representation with 2 Resources, 2 literal values, and 3 Properties

We saw a SPARQL Query (SPARQL for RDF data is similar to SQL for relational database queries) in the last chapter. Let’s look at another example using the RDF in the last figure:

1     select ?v where {  <http://markwatson.com/index.rdf#Sun_ONE>
2                        <http://www.ontoweb.org/ontology/1#booktitle>
3                        ?v }

This query should return the result “Sun ONE Services - J2EE”. If you wanted to query for all URI resources that are books with the literal value of their titles, then you can use:

1     select ?s ?v where {  ?s
2                           <http://www.ontoweb.org/ontology/1#booktitle>
3                           ?v }

Note that ?s and ?v are arbitrary query variable names, here standing for “subject” and “value”. You can use more descriptive variable names like:

1     select ?bookURI ?bookTitle where 
2         { ?bookURI
3           <http://www.ontoweb.org/ontology/1#booktitle>
4           ?bookTitle }

We will be diving a little deeper into RDF examples in the next chapter when we write a tool for generating RDF data from raw text input. For now I want you to understand the idea of RDF statements represented as triples, that web URIs represent things, properties, and sometimes values, and that URIs can be followed manually (often called “dereferencing”) to see what they reference in human readable form.

Understanding the Resource Description Framework (RDF)

Text data on the web has some structure in the form of HTML elements like headers, page titles, anchor links, etc. but this structure is too imprecise for general use by software agents. RDF is a method for encoding structured data in a more precise way.

We used the RDF data on my web site in the last chapter to introduce the “plumbing” of using the rdflib Python library to access, manipulate, and query RDF data.

Resource Namespaces Provided in rdflib

The following standard namespaces are predefined in rdflib:

RDF https://www.w3.org/TR/rdf-syntax-grammar/
RDFS https://www.w3.org/TR/rdf-schema/
OWL http://www.w3.org/2002/07/owl#
XSD http://www.w3.org/2001/XMLSchema#
FOAF http://xmlns.com/foaf/0.1/
SKOS http://www.w3.org/2004/02/skos/core#
DOAP http://usefulinc.com/ns/doap#
DC http://purl.org/dc/elements/1.1/
DCTERMS http://purl.org/dc/terms/
VOID http://rdfs.org/ns/void#

Let’s look into the Friend of a Friend (FOAF) namespace. Click on the above link for FOAF http://xmlns.com/foaf/0.1/ and find the definitions for the FOAF Core:

 1     Agent
 2     Person
 3     name
 4     title
 5     img
 6     depiction (depicts)
 7     familyName
 8     givenName
 9     knows
10     based_near
11     age
12     made (maker)
13     primaryTopic (primaryTopicOf)
14     Project
15     Organization
16     Group
17     member
18     Document
19     Image

and for the Social Web:

 1 nick
 2 mbox
 3 homepage
 4 weblog
 5 openid
 6 jabberID
 7 mbox_sha1sum
 8 interest
 9 topic_interest
10 topic (page)
11 workplaceHomepage
12 workInfoHomepage
13 schoolHomepage
14 publications
15 currentProject
16 pastProject
17 account
18 OnlineAccount
19 accountName
20 accountServiceHomepage
21 PersonalProfileDocument
22 tipjar
23 sha1
24 thumbnail
25 logo

You now have seen a few common Schemas for RDF data. Another Schema that is widely used for annotating web sites, that we won’t need for our examples here, is schema.org. Let’s now use a Hy REPL session to explore namespaces and programatically create RDF using rdflib:

 1 Marks-MacBook:database $ uv run hy
 2 Hy 1.1.0 (Business Hugs) using CPython(main) 3.12.0 on Darwin
 3 => (import rdflib.namespace [FOAF])
 4 => FOAF
 5 Namespace('http://xmlns.com/foaf/0.1/')
 6 => FOAF.name
 7 rdflib.term.URIRef('http://xmlns.com/foaf/0.1/name')
 8 => FOAF.title
 9 rdflib.term.URIRef('http://xmlns.com/foaf/0.1/title')
10 => (import rdflib)
11 => (setv graph (rdflib.Graph))
12 => (setv mark (rdflib.BNode))
13 => (graph.bind "foaf" FOAF)
14 => (import rdflib [RDF])
15 => (graph.add [mark RDF.type FOAF.Person])
16 => (graph.add [mark FOAF.nick (rdflib.Literal "Mark" :lang "en")])
17 => (graph.add [mark FOAF.name (rdflib.Literal "Mark Watson" :lang "en")])
18 => (for [node graph] (print node))
19 (rdflib.term.BNode('N21c7fa7385b545eb8a7e3821b7cb5'), rdflib.term.URIRef('http://www\
20 .w3.org/1999/02/22-rdf-syntax-ns#type'), rdflib.term.URIRef('http://xmlns.com/foaf/0\
21 .1/Person'))
22 (rdflib.term.BNode('N21c7fa7385b545eb8a7e3821b7cb5'), rdflib.term.URIRef('http://xml\
23 ns.com/foaf/0.1/name'), rdflib.term.Literal('Mark Watson', lang='en'))
24 (rdflib.term.BNode('N21c7fa7385b545eb8a7e3821b7cb5'), rdflib.term.URIRef('http://xml\
25 ns.com/foaf/0.1/nick'), rdflib.term.Literal('Mark', lang='en'))
26 => (graph.serialize :format "pretty-xml")
27 b'<?xml version="1.0" encoding="utf-8"?>
28 <rdf:RDF
29     xmlns:foaf="http://xmlns.com/foaf/0.1/"
30     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
31 >
32   <foaf:Person rdf:nodeID="N21c7fa7385b545eb8a7e3821b75b9cb5">
33     <foaf:name xml:lang="en">Mark Watson</foaf:name>
34     <foaf:nick xml:lang="en">Mark</foaf:nick>
35   </foaf:Person>
36 </rdf:RDF>\n'
37 => (graph.serialize :format "turtle")
38 @prefix foaf: <http://xmlns.com/foaf/0.1/> .
39 @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
40 @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
41 @prefix xml: <http://www.w3.org/XML/1998/namespace> .
42 @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
43 
44 [] a foaf:Person ;
45      foaf:name "Mark Watson"@en ;
46      foaf:nick "Mark"@en .
47 
48 => (graph.serialize :format "nt")
49 _:N21c7fa7385b545eb8a7e3821b75b9cb5
50    <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
51    <http://xmlns.com/foaf/0.1/Person> .
52 _:N21c7fa7385b545eb8a7e3821b75b9cb5 <http://xmlns.com/foaf/0.1/name> "Mark Watson"@e\
53 n .
54 _:N21c7fa7385b545eb8a7e3821b75b9cb5 <http://xmlns.com/foaf/0.1/nick> "Mark"@en .
55 =>

Understanding the SPARQL Query Language

For the purposes of the material in this book, the two sample SPARQL queries here and in the last chapter are sufficient for you to get started using rdflib with arbitrary RDF data sources and simple queries.

The Apache Foundation has a good introduction to SPARQL that I refer you to for more information.

Wrapping the Python rdflib Library

I hope that I have provided you with enough motivation to explore RDF data sources and consider the use of linked data/semantic web technologies for your projects.

If I depend on a library, regardless of the programming language, I like to keep an up-to-date copy of the source code ready at hand. There is sometimes no substitute for having library code available to read.

Knowledge Graph Creator

A Knowledge Graph, that I often abbreviate as KG, is a graph database using a schema to define types (both objects and relationships between objects) and properties that link property values to objects. The term “Knowledge Graph” is both a general term and also sometimes refers to the specific Knowledge Graph used at Google which I worked with while working there in 2013. Here, we use KG to reference the general technology of storing knowledge in graph databases.

The application we develop here, the Knowledge Graph Creator (which I often refer to as KGCreator) is a utility that I use to generate small Knowledge Graphs from input text.

Knowledge engineering and knowledge representation are disciplines that started in the 1980s and are still both current research topics and used in industry. I view linked data, the semantic web, and KGs as extensions of this earlier work.

We base our work here on RDF. There is a general type of KGs that are also widely used in industry and that we will not cover here: property graphs, as used in Neo4J. Property graphs are general graphs that place no restrictions on the number of links a graph node may have and allow general data structures to be stored as node data and for the property links between nodes. Property links can have attributes, like nodes in the graph.

Semantic web data as represented by subject/property/value RDF triples are more constrained than property graphs but support powerful logic inferencing to better use data that is implicit in a graph but not explicitly stated (i.e., data is more easily inferred).

We covered RDF data in some detail in the last chapter. Here we will implement a toolset for converting unstructured text into RDF data using a few schema definitions from schema.org. I believe in both the RDF and the general graph database approaches but here we will just use RDF.

Historically Knowledge Graphs used semantic web technology like Resource Description Framework (RDF) and Web Ontology Language (OWL). I wrote two books in 2010 on semantic web technologies and you can get free PDFs for the Common Lisp version (code is here) and the Java/Clojure/Scala version (code is here). These free books might interest you after working through the material in this chapter.

I have an ongoing personal research project for creating knowledge graphs from various data sources. You can read more at my KGCreator web site. I have simplified versions of my KGCreator software implemented in both my Haskell Book and in my most recent Common Lisp book. The example here is similar to my Common Lisp implementation, except that it is implemented in the Hy language and I only support generating RDF. The examples in my Haskell and Common Lisp books also generate data for the Neo4J graph database.

What is a KG? It is a modern way to organize and access structured data and integrate data and metadata with other automated systems.

A Knowledge Graph is different from just a graph database containing graph data. The difference is that a KG will in general use Schemas, Taxonomy’s and Ontology’s that define the allowed types and structure of data and allowed relationships.

There is also an executable aspect of KGs since their primary use may be to support other systems in an organization.

Recommended Industrial Use of Knowledge Graphs

Who needs a KG? How do you get started?

If people in your organization are spending much time doing general web search, it might be a signal that you should maintain your organization’s curated knowledge in a human searchable and software accessible way. A possible application is an internal search engine that mixes public web search APIs with search for knowledge used internally inside your organization.

Here are a few use cases:

At Google we used their Knowledge Graph for researching new internal systems that were built on their standard Knowledge Graph, with new schemas and data added.
Digital transformations: start by using a KG to hold metadata for current data in already existing databases. A KG of metadata can provide you with a virtual data lake. It is common to build a large data lake and then have staff not be able to find data. Don’t try to do everything at once.
Capture and preserve senior human expertise. The act of building an Ontology for in-house knowledge helps to understand how to organize data and provides people with a common vocabulary to discuss and model business processes.
KYC (Know Your Customer) applications using data from many diverse data sources.
Take advantage of expertise in a domain (e.g., healthcare or financial services) to build a Taxonomy and Ontology to use to organize available data. For most domains, there are standard existing Schemas, Taxonomy’s and Ontology’s that can be identified and used as-is or extended for your organization.

To get started:

Start small with just one use case.
Design a Schema that identifies object types and relationships
Write some acceptance test cases that you want a prototype to be able to serve as a baseline to develop against.
Avoid having too many stakeholders in early prototype projects — try to choose stakeholders based on potential stakeholders’ initial enthusiasm.

A good way to start is to identify a single problem, determine the best data sources to use, define an Ontology that is just sufficient to solve the current problem and build a prototype “vertical slice” application. Lessons learned with a quick prototype will inform you on what was valuable and what to put effort into when expanding your KG. Start small and don’t try to build a huge system without taking many small development and evaluation steps.

What about KGs for small organizations? Small companies have less development resources but starting small and implementing a system that models the key data relationships, customer relationships, etc., does not require excessive resources. Just capturing where data comes from and who is responsible for maintaining important data sources can be valuable.

What about KGs for individuals? Given the effort involved in building custom KGs, one possible individual use case is developing KGs for commercial sale.

The application that we develop next is one way to quickly bootstrap a new KG by populating it with automatically generated RDF than can be manually curated by removing statements and adding new statements as appropriate.

Design of KGCreator Application

The example application developed here processes input text files in the sub-directory test_data. For each file with the extension .txt in test_data, there should be a matching file with the extension .meta that contains the origin URI for the corresponding text file. The git repository for this book has a few files in test_data that you can experiment with or replace with your own data:

$ ls test_data 
test1.meta test1.txt test2.meta test2.txt test3.meta test3.txt

The *.txt files contain plain text for analysis and the *.meta files contain the original web source URI for the corresponding *.txt files. Using the spaCy library and Python/Hy’s standard libraries for file access, the KGCreator is simple to implement. Here is the overall design of this example:

Overview of the Knowledge Graph Creator script

We will develop two versions of the Knowledge Graph Creator. The first generates RDF that uses string values for the object part of generated RDF statements. The second implementation attempts to resolve these string values to DBPedia URIs.

Using only the spaCy NLP library that we used earlier and the built in Hy/Python libraries, this first example (uses strings a object values) is implemented in just 58 lines of Hy code that is seen in the following three code listings:

 1 #!/usr/bin/env hy
 2 
 3 (import os [scandir])
 4 (import os.path [splitext exists])
 5 (import spacy)
 6 
 7 (setv nlp-model (spacy.load "en"))
 8 
 9 (defn find-entities-in-text [some-text]
10   (defn clean [s]
11     (.strip (.replace s "\n" " ")))
12   (setv doc (nlp-model some-text))
13   (map list (lfor entity doc.ents [(clean entity.text) entity.label_])))

In lines 3 and 4 we import three standard Python utilities we need for finding all files in a directory, checking to see if a file exists, and splitting text into tokens. In line 7 we load the English language spaCy model and save the value of the model in the variable nlp-model. The function find-entities-in-text uses the spaCy English language model to find entities like organizations, people, etc., in text and cleans entity names by removing new line characters and other unnecessary white space (nested function clean in lines 10 and 11). We can run a test in a REPL:

=> (list (find-entities-in-text "John Smith went to Los Angeles to work at IBM"))
[['John Smith', 'PERSON'], ['Los Angeles', 'GPE'], ['IBM', 'ORG']]

The function find-entities-in-text returns a map object so I wrapped the results in a list to print out the entities in the test sentence. The entity types used by spaCy were defined in an earlier chapter, here we just use the entity types defined in lines 21-26 in the following listing:

14 (defn data2Rdf [meta-data entities fout]
15   (for [[value abbreviation] entities]
16     (if (in abbreviation e2umap)
17       (.write fout (+ "<" meta-data ">\t" (get e2umap abbreviation) "\t" "\""
18                        value "\"" " .\n")))))
19 
20 (setv e2umap {
21   "ORG" "<https://schema.org/Organization>"
22   "LOC" "<https://schema.org/location>"
23   "GPE" "<https://schema.org/location>"
24   "NORP" "<https://schema.org/nationality>"
25   "PRODUCT" "<https://schema.org/Product>"
26   "PERSON" "<https://schema.org/Person>"})

In lines 28-39 we open an output file for writing generated RDF data and loop through all text files in the input directory and call the function process-file for each text + meta file pair in the input directory:

28 (defn process-directory [directory-name output-rdf]
29   (with [frdf (open output-rdf "w")]
30     (with [entries (scandir directory-name)]
31       (for [entry entries]
32         (setv [_ file-extension] (splitext entry.name))
33         (if (= file-extension ".txt")
34             (do
35               (setv check-file-name (+ (cut entry.path 0 -4) ".meta"))
36               (if (exists check-file-name)
37                   (process-file entry.path check-file-name frdf)
38                   (print "Warning: no .meta file for" entry.path
39                          "in directory" directory-name))))))))

40 (defn process-file [txt-path meta-path frdf]
41   
42   (defn read-data [text-path meta-path]
43     (with [f (open text-path)] (setv t1 (.read f)))
44     (with [f (open meta-path)] (setv t2 (.read f)))
45     [t1 t2])
46   
47   (defn modify-entity-names [ename]
48     (.replace ename "the " ""))
49   
50   (setv [txt meta] (read-data txt-path meta-path))
51   (setv entities (find-entities-in-text txt))
52   (setv entities ;; only operate on a few entity types
53         (lfor [e t] entities
54               :if (in t ["NORP" "ORG" "PRODUCT" "GPE" "PERSON" "LOC"])
55               [(modify-entity-names e) t]))
56   (data2Rdf meta entities frdf))
57 
58 (process-directory "test_data" "output.rdf")

Run using:

40 $ uv sync
41 $ uv run hy kgcreator.hy

We will look at generated output, problems with it, and how to fix these problems in the next section.

Problems with using Literal Values in RDF

Using the Hy script in the last section, let’s look at some of the generated RDF for the text files in the input test directory (most output is not shown). In each triple the first item, the subject, is the URI of the data source, the second item in each statement is a URI representing a relationship (or property), and the third item is a literal string value:

<https://newsshop.com/may/a1023.html>
  <https://schema.org/nationality>	"Portuguese" .
<https://newsshop.com/may/a1023.html>
  <https://schema.org/Organization>	"Banco Espirito Santo SA" .
<https://newsshop.com/may/a1023.html>
  <https://schema.org/Person>	      "John Evans" .
<https://newsshop.com/may/a1023.html>
  <https://schema.org/Organization>	"Banco Espirito" .
<https://newsshop.com/may/a1023.html>
  <https://schema.org/Organization>	"The Wall Street Journal" .
<https://newsshop.com/may/a1023.html>
  <https://schema.org/Organization>	"IBM" .
<https://newsshop.com/may/a1023.html>
  <https://schema.org/location>	"Canada" .
<https://newsshop.com/may/a1023.html>
  <https://schema.org/Organization>	"Australian Broadcasting Corporation" .
<https://newsshop.com/may/a1023.html>
  <https://schema.org/Person>	"Frank Smith" .
<https://newsshop.com/may/a1023.html>
  <https://schema.org/Organization>	"Australian Writers Guild" .
<https://newsshop.com/may/a1023.html>
  <https://schema.org/Organization>	"American University" .
<https://localnews.com/june/z902.html>
  <https://schema.org/Organization>	"The Wall Street Journal" .
<https://localnews.com/june/z902.html>
  <https://schema.org/location>	"Mexico" .
<https://localnews.com/june/z902.html>
  <https://schema.org/location>	"Canada" .
<https://localnews.com/june/z902.html>
  <https://schema.org/Person>	"Bill Clinton" .
<https://localnews.com/june/z902.html>
  <https://schema.org/Organization>	"IBM" .
<https://localnews.com/june/z902.html>
  <https://schema.org/Organization>	"Microsoft" .
<https://abcnews.go.com/US/violent-long-lasting-tornadoes-threaten-oklahoma-texas/st\
ory?id=63146361>
  <https://schema.org/Person>	"Jane Deerborn" .
<https://abcnews.go.com/US/violent-long-lasting-tornadoes-threaten-oklahoma-texas/st\
ory?id=63146361>
  <https://schema.org/location>	"Texas" .

Let’s visualize the results in a bash shell:

$ git clone https://github.com/fatestigma/ontology-visualization
$ cd ontology-visualization
$ chmod +x ontology_viz.py
$ ./ontology_viz.py -o test.dot output.rdf  -O ontology.ttl
$ # copy the file output.rdf from examples repo directory hy-lisp-python/kgcreator
$ dot -Tpng -o test.png test.dot
$ open test.png

Edited to fit on the page, the output looks like:

Because we used literal values, notice how for example the node for the entity IBM is not shared and thus a software agent using this RDF data cannot, for example, infer relationships between two news sources that both have articles about IBM. We will work on a solution to this problem in the next section.

Revisiting This Example Using URIs Instead of Literal Values

Note that in the figure in the previous section that nodes for literal values (e.g., for “IBM”) are not shared. In this section we will copy the file kgcreator.hy to kgcreator_uri.hy add a few additions to map string literal values for entity names to http://dbpedia.org URIs by individually searching Google using the pattern “DBPedia ‘entity name’” and defining a new map v2umap for mapping literal values to DBPedia URIs.

Note: In a production system (not a book example), I would use https://www.wikidata.org database download to download all of WikiData (which includes DBPedia data) and use a fuzzy text matching to find WikiData URIs for string literals. The compressed WikiData JSON data file is about 50 GB. Here we will manually find DBPedia for entity names that are in the example data.

In kgcreator_uri.hy we add a map v2umap for selected entity literal names to DBPedia URIs that I manually created using a web search on the DBPedia domain:

(setv v2umap { ;; object literal value to URI mapping
  "IBM" "<http://dbpedia.org/page/IBM>"
  "The Wall Street Journal" "<http://dbpedia.org/page/The_Wall_Street_Journal>"
  "Banco Espirito" "<http://dbpedia.org/page/Banco_Esp%C3%ADrito_Santo>"
  "Australian Broadcasting Corporation"
  "http://dbpedia.org/page/Australian_Broadcasting_Corporation"
  "Australian Writers Guild"
  "http://dbpedia.org/page/Australian_Broadcasting_Corporation"
  "Microsoft" "http://dbpedia.org/page/Microsoft"})

We also make a change in the function data2Rdf to use the map v2umap:

(defn data2Rdf [meta-data entities fout]
  (for [[value abbreviation] entities]
    (setv a-literal (+ "\"" value "\""))
    (if (in value v2umap) (setv a-literal (get v2umap value)))
    (if (in abbreviation e2umap)
      (.write fout (+ "<" meta-data ">\t" (get e2umap abbreviation)
                      "\t" a-literal " .\n")))))

Here is some of the generated RDF that has changed:

<https://newsshop.com/may/a1023.html>
  <https://schema.org/Organization>
  <http://dbpedia.org/page/IBM> .
<https://newsshop.com/may/a1023.html>
  <https://schema.org/Organization>
  <http://dbpedia.org/page/Banco_Esp%C3%ADrito_Santo> .

Run using:

40 $ uv sync
41 $ uv run hy kgcreator_uri.hy

Now when we visualize generated RDF, we share nodes for The Wall Street Journal and IBM:

Part of the RDF graph that shows shared nodes when URIs are used for RDF values instead of literal strings

While literal values sometimes are useful in generated RDF, using literals for the values in RDF triples prevents types of queries and inference that can be performed on the data.

Wrap-up

In the field of Artificial Intelligence there are two topics that get me the most excited and I have been fortunate to be paid to work on both: Deep Learning and Knowledge Graphs. Here we have just touched the surface for creating data for Knowledge Graphs but I hope that between this chapter and the material on RDF in the chapter Datastores that you have enough information and experience playing with the examples to get started prototyping a Knowledge Graph in your organization. My advice is to “start small” by picking a problem that your organization has that can be solved by not moving data around, but rather, by creating a custom Knowledge Graph for metadata for existing information in your organization.

Knowledge Graph Navigator

The Knowledge Graph Navigator (which I will often refer to as KGN) is a tool for processing a set of entity names and automatically exploring the public Knowledge Graph DBPedia using SPARQL queries. I wrote KGN in Common Lisp for my own use to automate some things I used to do manually when exploring Knowledge Graphs, and later thought that KGN might be useful also for educational purposes. KGN uses NLP code developed in earlier chapters and we will reuse that code with a short review of using the APIs.

Please note that the example is a simplified version that I first wrote in Common Lisp and is also an example in my book Loving Common Lisp, or the Savvy Programmer’s Secret Weapon that you can read free online.

The code for this application is in the directory kgn and this example is pre-configured to use uv.

One time only, you will need to download spacy language model that we used in the earlier chapter on natural language processing. Install this language model requirement in the directory kgn (edited to fit page width):

$ pwd
~/GITHUB/hy-lisp-python-book/source_code_for_examples/kgn
$ uv run python -m spacy download en_core_web_sm

The following listing shows the text based user interface for this example. This example application asks the user for a list of entity names and uses SPARQL queries to discover potential matches in DBPedia.

 1 $ uv run hy kgn.hy
 2 Enter a list of entities: Bill Gates worked at Microsoft
 3 Generated SPARQL to get DBPedia entity URIs from a name:
 4 select distinct ?s ?comment { ?s ?p "Bill Gates"@en . ?s <http://www.w3.org/2000/01/\
 5 rdf-schema#comment> ?comment . FILTER (lang(?comment) = 'en') . ?s <http://www.w3.or\
 6 g/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Person> . } limit 15 
 7 Generated SPARQL to get DBPedia entity URIs from a name:
 8 select distinct ?s ?comment { ?s ?p "Microsoft"@en . ?s <http://www.w3.org/2000/01/r\
 9 df-schema#comment> ?comment . FILTER (lang(?comment) = 'en') . ?s <http://www.w3.org\
10 /1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Organisation> . } limit\
11  15 
12 Please select entities from the list below:
13   1. Bill Gates || Cascade Investment, L.L.C. is an American holding company...
14   2. Bill Gates || William Henry Gates III (born October 28, 1955) is an Ame...
15   3. Bill Gates || Simon Wood is a British cook and winner of the 2015 editi...
16   4. Bill Gates || Harry Roy Lewis (born 1947) is an American computer scien...
17   5. Bill Gates || Jerry P. Dyer (born May 3, 1959) is an American politicia...
18   6. Microsoft || Press Play ApS was a Danish video game development studio ...
19   7. Microsoft || The AMD Professional Gamers League (PGL), founded around 1...
20   8. Microsoft || The CSS Working Group (Cascading Style Sheets Working Grou...
21   9. Microsoft || Microsoft Corporation is an American multinational technol...
22   10. Microsoft || Secure Islands Technologies Ltd. was an Israeli privately...
23   11. Microsoft || Microsoft Innovation Centers (MICs) are local government ...
24 
25 Enter the numbers of the entities you want to process, separated by commas (e.g., 1,\
26  3): 2,9
27 []
28 1
29 Bill Gates || William Henry Gates III (born October 28, 1955) is an American busines\
30 ...
31 8
32 Microsoft || Microsoft Corporation is an American multinational technology corporat.\
33 ..
34 ****** user-selected-entities
35 ['Bill Gates || William Henry Gates III (born October 28, 1955) is an American '
36  'busines...',
37  'Microsoft || Microsoft Corporation is an American multinational technology '
38  'corporat...']
39 Generated SPARQL to get relationships between two entities:
40 SELECT DISTINCT ?p { <http://dbpedia.org/resource/Bill_Gates> ?p <http://dbpedia.org\
41 /resource/Microsoft> . FILTER (!regex(str(?p), 'wikiPage', 'i')) } LIMIT 5 
42 Generated SPARQL to get relationships between two entities:
43 SELECT DISTINCT ?p { <http://dbpedia.org/resource/Microsoft> ?p <http://dbpedia.org/\
44 resource/Bill_Gates> . FILTER (!regex(str(?p), 'wikiPage', 'i')) } LIMIT 5 
45 Generated SPARQL to get relationships between two entities:
46 SELECT DISTINCT ?p { <http://dbpedia.org/resource/Microsoft> ?p <http://dbpedia.org/\
47 resource/Bill_Gates> . FILTER (!regex(str(?p), 'wikiPage', 'i')) } LIMIT 5 
48 Generated SPARQL to get relationships between two entities:
49 SELECT DISTINCT ?p { <http://dbpedia.org/resource/Bill_Gates> ?p <http://dbpedia.org\
50 /resource/Microsoft> . FILTER (!regex(str(?p), 'wikiPage', 'i')) } LIMIT 5 
51 
52 Discovered relationship links:
53 [['<http://dbpedia.org/resource/Bill_Gates>',
54   '<http://dbpedia.org/resource/Microsoft>',
55   [['p', 'http://dbpedia.org/ontology/knownFor']]],
56  ['<http://dbpedia.org/resource/Bill_Gates>',
57   '<http://dbpedia.org/resource/Microsoft>',
58   [['p', 'http://dbpedia.org/property/founders']]],
59  ['<http://dbpedia.org/resource/Bill_Gates>',
60   '<http://dbpedia.org/resource/Microsoft>',
61   [['p', 'http://dbpedia.org/ontology/foundedBy']]],
62  ['<http://dbpedia.org/resource/Microsoft>',
63   '<http://dbpedia.org/resource/Bill_Gates>',
64   [['p', 'http://dbpedia.org/property/founders']]],
65  ['<http://dbpedia.org/resource/Microsoft>',
66   '<http://dbpedia.org/resource/Bill_Gates>',
67   [['p', 'http://dbpedia.org/ontology/foundedBy']]],
68  ['<http://dbpedia.org/resource/Microsoft>',
69   '<http://dbpedia.org/resource/Bill_Gates>',
70   [['p', 'http://dbpedia.org/ontology/knownFor']]]]
71 Enter a list of entities:

To select found entities of interest, type the entity index numbers you want analyzed. In the last example we chose indices 2 and 9:

1 Enter the numbers of the entities you want to process, separated
2 by commas (e.g., 1, 3): 2,9

Note: in the last listing, if you run this example yourself you will see that generated SPARQL queries are colorized for better readability. This colorization does not appear in the last listing.

After listing the generated SPARQL for finding information for the entities in the query, KGN searches for relationships between these entities. These discovered relationships can be seen at the end of the last screen shot. Please note that this step makes SPARQL queries on O(n^2) where n is the number of entities. Local caching of SPARQL queries to DBPedia helps make processing many entities possible.

Review of NLP Utilities Used in Application

We covered NLP in a previous chapter, so the following is just a quick review. The NLP code we use is near the top of the file kgn.hy:

(import os)
(import sys)
(import pprint [pprint])

(import textui [select-entities get-query])
(import kgnutils [dbpedia-get-entities-by-name first second])
(import relationships [entity-results->relationship-links])

(import spacy)

(setv nlp-model (spacy.load "en_core_web_sm"))

(defn entities-in-text [s]
  (setv doc (nlp-model s))
  (setv ret {})
  (for
    [[ename etype] (lfor entity doc.ents [entity.text entity.label_])]
    
    (if (in etype ret)
        (setv (get ret etype) (+ (get ret etype) [ename]))
        (setv (get ret etype) [ename])))
  ret)

Here is an example use of this function:

=> (kgn.entities-in-text "Bill Gates, Microsoft, Seattle")
{'PERSON': ['Bill Gates'], 'ORG': ['Microsoft'], 'GPE': ['Seattle']}

The entity type “GPE” indicates that the entity is some type of location.

SPARQL Utilities

We will use the caching code from the last section and also the standard Python library requests to access the DBPedia servers. The following code is found in the file sparql.hy and also provides support for using both DBPedia and WikiData. We only use DBPedia in this chapter but when you start incorporating SPARQL queries into applications that you write, you will also probably want to use WikiData.

The function do-query-helper contains generic code for SPARQL queries and is used in functions wikidata-sparql and dbpedia-sparql:

(import json)
(import requests)

(setv wikidata-endpoint "https://query.wikidata.org/bigdata/namespace/wdq/sparql")
(setv dbpedia-endpoint "https://dbpedia.org/sparql")

(defn do-query-helper [endpoint query]
  ;; Construct a request
  (setv params { "query" query "format" "json"})
        
  ;; Call the API
  (setv response (requests.get endpoint :params params))
        
  (setv json-data (response.json))
        
  (setv vars (get (get json-data "head") "vars"))
        
  (setv results (get json-data "results"))
        
  (if (in "bindings" results)
    (do
      (setv bindings (get results "bindings"))
      (setv qr
            (lfor binding bindings
                (lfor var vars
                   [var (get (get binding var) "value")])))
      qr)
    []))

(defn wikidata-sparql [query]
  (do-query-helper wikidata-endpoint query))

(defn dbpedia-sparql [query]
  (do-query-helper dbpedia-endpoint query))

Here is an example query (manually formatted for page width):

$ hy
=> (import sparql)
=> (sparql.dbpedia-sparql
     "select ?s ?p ?o { ?s ?p ?o } limit 1")
[[['s', 'http://www.openlinksw.com/virtrdf-data-formats#default-iid'],
  ['p', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'],
  ['o', 'http://www.openlinksw.com/schemas/virtrdf#QuadMapFormat']]]
=>

This is a wild-card SPARQL query that will match any of the 9.5 billion RDF triples in DBPedia and return just one result.

This caching layer greatly speeds up my own personal use of KGN. Without caching, queries that contain many entity references simply take too long to run.

Utilities to Colorize SPARQL and Generated Output

When I first had the basic functionality of KGN working, I was disappointed by how the application looked as normal text. Every editor and IDE I use colorizes text in an appropriate way so I used standard ANSI terminal escape sequences to implement color hilting SPARQL queries.

The code in the following listing is in the file colorize.hy.

(require [hy.contrib.walk [let]])
(import [io [StringIO]])

;; Utilities to add ANSI terminal escape sequences to colorize text.
;; note: the following 5 functions return string values that then need to
;;       be printed.

(defn blue [s] (.format "{}{}{}" "\033[94m" s "\033[0m"))
(defn red [s] (.format "{}{}{}" "\033[91m" s "\033[0m"))
(defn green [s] (.format "{}{}{}" "\033[92m" s "\033[0m"))
(defn pink [s] (.format "{}{}{}" "\033[95m" s "\033[0m"))
(defn bold [s] (.format "{}{}{}" "\033[1m" s "\033[0m"))

(defn tokenize-keep-uris [s]
  (.split s))

(defn colorize-sparql [s]
  (setv tokens
        (tokenize-keep-uris
          (.replace (.replace s "{" " { ") "}" " } ")))
  (setv ret (StringIO)) ;; ret is an output stream for a string buffer
  (for [token tokens]
    (when (> (len token) 0)
        (if (= (get token 0) "?")
            (.write ret (red token))
            (if (in
                  token
                  ["where" "select" "distinct" "option" "filter"
                    "FILTER" "OPTION" "DISTINCT" "SELECT" "WHERE"])
                (.write ret (blue token))
                (if (= (get token 0) "<")
                    (.write ret (bold token))
                    (.write ret token)))))
    (when (not (= token "?"))
        (.write ret " ")))
  (.seek ret 0)
  (.read ret))

Text Utilities for Queries and Results

The application low level utility functions are in the file kgn-utils.hy. The function dbpedia-get-entities-by-name requires two arguments:

The name of an entity to search for.
A URI representing the entity type that we are looking for.

We embed a SPARQL query that has placeholders for the entity name and type. The filter expression specifies that we only want triple results with comment values in the English language by using (lang(?comment) = ‘en’):

 1 (import sparql [dbpedia-sparql])
 2 (import colorize [colorize-sparql])
 3 
 4 (import pprint [pprint])
 5 
 6 (defn dbpedia-get-entities-by-name [name dbpedia-type]
 7   (setv sparql
 8         (.format "select distinct ?s ?comment {{ ?s ?p \"{}\"@en . ?s <http://www.w3\
 9 .org/2000/01/rdf-schema#comment>  ?comment  . FILTER  (lang(?comment) = 'en') . ?s <\
10 http://www.w3.org/1999/02/22-rdf-syntax-ns#type> {} . }} limit 15" name dbpedia-type\
11 ))
12   (print "Generated SPARQL to get DBPedia entity URIs from a name:")
13   (print (colorize-sparql sparql))
14   (dbpedia-sparql sparql))
15 
16 ;;(pprint (dbpedia-get-entities-by-name "Bill Gates" "<http://dbpedia.org/ontology/P\
17 erson>"))
18 
19 (defn first [a-list]
20   (get a-list 0))
21 
22 (defn second [a-list]
23   (get a-list 1))

Finishing the Main Function for KGN

We already looked at the NLP code near the beginning of the file kgn.hy. Let’s look at the remainder of the implementation.

We need a dictionary (or hash table) to convert spaCy entity type names to DBPedia type URIs:

1 (setv entity-type-to-type-uri
2       {"PERSON" "<http://dbpedia.org/ontology/Person>"
3        "GPE" "<http://dbpedia.org/ontology/Place>"
4        "ORG" "<http://dbpedia.org/ontology/Organisation>"
5        })

When we get entity results from DBPedia, the comments describing entities can be a few paragraphs of text. We want to shorten the comments so they fit in a single line of the entity selection list that we have seen earlier. The following code defines a comment shortening function and also a global variable that we will use to store the entity URIs for each shortened comment:

1 (setv short-comment-to-uri {})
2 
3 (defn shorten-comment [comment uri]
4   (setv sc (+ (cut comment 0 70) "..."))
5   (assoc short-comment-to-uri sc uri)
6   sc)

In line 5, we use the function assoc to add a key and value pair to an existing dictionary short-comment-to-uri.

Finally, let’s look at the main application loop. In line 4 we are using the function get-query (defined in file textui.hy) to get a list of entity names from the user. In line 7 we use the function entities-in-text that we saw earlier to map text to entity types and names. In the nested loops in lines 13-26 we build one-line descriptions of people, place, and organizations that we will use to show the user a menu for selecting entities found in DBPedia from the original query. We are giving the use a chance to select only the discovered entities that they are interested in.

In lines 34-36 we are converting the shortened comment strings the user selected back to DBPedia entity URIs. Finally in line 36 we use the function entity-results->relationship-links to find relationships between the user selected entities.

 1 (defn kgn []
 2   (while
 3     True
 4     (setv query (get-query))
 5     (when (or (= query "quit") (= query "q"))
 6         (break))
 7     (setv elist (entities-in-text query))
 8     (setv people-found-on-dbpedia [])
 9     (setv places-found-on-dbpedia [])
10     (setv organizations-found-on-dbpedia [])
11     (global short-comment-to-uri)
12     (setv short-comment-to-uri {})
13     (for [key elist]
14       (setv type-uri (get entity-type-to-type-uri key))
15       (for [name (get elist key)]
16         (setv dbp (dbpedia-get-entities-by-name name type-uri))
17         (for [d dbp]
18           (setv
19             short-comment
20             (shorten-comment (second (second d)) (second (first d))))
21           (when (= key "PERSON")
22               (.extend people-found-on-dbpedia [(+ name  " || " short-comment)]))
23           (when (= key "GPE")
24               (.extend places-found-on-dbpedia [(+ name  " || " short-comment)]))
25           (when (= key "ORG")
26               (.extend organizations-found-on-dbpedia
27                        [(+ name  " || " short-comment)])))))
28     (setv user-selected-entities
29           (select-entities
30             people-found-on-dbpedia
31             places-found-on-dbpedia
32             organizations-found-on-dbpedia))
33     (setv uri-list [])
34     (for [entity (get user-selected-entities "entities")]
35       (setv short-comment (cut entity (+ 4 (.index entity " || ")) (len entity)))
36       (.extend uri-list [(get short-comment-to-uri short-comment)]))
37     (setv relation-data (entity-results->relationship-links uri-list))
38     (print "\nDiscovered relationship links:")
39     (pprint relation-data)))

If you have not already done so, I hope you experiment running this example application. The first time you specify an entity name expect some delay while DBPedia is accessed. Thereafter the cache will make the application more responsive when you use the same name again in a different query.

Wrap-up

If you enjoyed running and experimenting with this example and want to modify it for your own projects then I hope that I provided a sufficient road map for you to do so.

I got the idea for the KGN application because I was spending quite a bit of time manually setting up SPARQL queries for DBPedia and other public sources like WikiData, and I wanted to experiment with partially automating this exploration process.

Using OpenAI GPT

I use frequently use the OpenAI APIs in my work. In this chapter we use the GPT-5 API since it works well for our examples.

OpenAI Text Completion API

OpenAI GPT (Generative Pre-trained Transformer) models like gpt-4o, gpt-4o-mini, and gpt-5 are advanced language processing models developed by OpenAI. There are three general classes of OpenAI API services:

GPT which performs a variety of natural language tasks.
Codex which translates natural language to code.
DALL·E which creates and edits original images.

GPT-5 is capable of generating human-like text, completing tasks such as language translation, summarization, and question answering, and much more.

Overall, the OpenAI APIs provide a powerful and easy-to-use tool for developers to integrate advanced language processing capabilities into their applications, and can be a game changer for developers looking to add natural language processing capabilities to their projects.

The following examples are derived from the official set of cookbook examples at https://github.com/openai/openai-cookbook. The first example calls the OpenAI gpt-4o-mini Completion API with a sample of input text and the model completes the text.

Here is a listing or the source file openai/text_completion.hy:

 1 (import os)
 2 (import openai)
 3 
 4 (setv openai.api_key (os.environ.get "OPENAI_KEY"))
 5 
 6 (setv client (openai.OpenAI))
 7 
 8 (defn completion [query] ; return a Completion object
 9   (setv
10     completion
11     (client.chat.completions.create
12       :model "gpt-5"
13       :messages
14       [{"role" "user"
15         "content" query
16         }]))
17   (print completion)
18   (get completion.choices 0))
19 
20 (setv x (completion "how to fix leaky faucet?"))
21 
22 (print x.message.content)

Every time you run this example you get different output. Here is one example run (output truncated for brevity):

 1 Fixing a leaky faucet can be a straightforward process, and you can often do it your\
 2 self with some basic tools. Here’s a step-by-step guide:
 3 
 4 ### Tools and Materials Needed:
 5 - Adjustable wrench
 6 - Screwdriver (flathead or Phillips, depending on your faucet)
 7 - Replacement parts (O-rings, washers, or cartridge, depending on your faucet type)
 8 - Plumber's grease
 9 - Towel or rag
10 
11 ### Steps to Fix a Leaky Faucet:
12 
13 1. **Turn Off the Water Supply**:
14    - Look for shut-off valves under the sink and turn them clockwise to close. If th\
15 ere are no shut-off valves, you may need to turn off the main water supply to your h\
16 ome.
17 
18 2. **Drain the Faucet**:
19    - Open the faucet to let any remaining water drain out.
20 
21 etc.

Using Google Gemini API

As I write this chapter in May 2025, I primarily choose Google Gemini when I use commercial LLM APIs (most of my work involves running local LLM models using Ollama).

Overall, the Google Gemini APIs provide a powerful and easy-to-use tool for developers to integrate advanced language processing capabilities into their applications, and can be a game changer for developers looking to add natural language processing capabilities to their projects.

Google Gemini offers two features that set it apart from other commercial APIs:

Supports a one million token context size.
Very low cost.

We will look at two ways to access Gemini and we will look at examples for each technique:

Use the Python requests library to use Gemini’s REST style interface.
Use Google’s Python google-genai package (and we will look at tool use in the same example).

REST Interface

The following example calls the Gemini completion API and stores user chat in a persistent context.

Here is a listing or the source file google-gemini/chat.hy:

 1 (import os)
 2 (import requests)
 3 (import json) ;; Explicitly import json for dumps
 4 
 5 ;; Get API key from environment variable (standard practice)
 6 (setv api-key (os.getenv "GOOGLE_API_KEY"))
 7 
 8 ;; Gemini API endpoint
 9 (setv api-url f"https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-f\
10 lash:generateContent?key={api-key}")
11 
12 ;; Initialize the chat history (Note: Gemini uses 'user' and 'model')
13 (setv chat-history [])
14 
15 (defn call-gemini [chat-history user-input]
16   "Calls the Gemini API with the chat history and user input using requests."
17 
18   (setv headers {"Content-Type" "application/json"})
19 
20   ;; Build the contents list, correctly alternating roles.
21   (setv contents [])
22   (for [message chat-history]
23     (.append contents message))
24   (.append contents {"role" "user" "parts" [{"text" user-input}]})
25 
26   (setv data {
27               "contents" contents
28               "generationConfig" {
29                                   "maxOutputTokens" 200
30                                   "temperature" 1.2
31                                   }})
32 
33   ;; Use json.dumps to convert the Python/Hy dict to a JSON string
34   (setv response (requests.post api-url :headers headers :data (json.dumps data)))
35 
36   ;; Raise HTTPError for bad responses (4xx or 5xx)
37   (. response raise_for-status)
38 
39   ;; Return the JSON response as a Hy dictionary/list
40   (response.json))
41 
42 ;; --- Main Chat Loop ---
43 (while True
44   ;; Get user input from the console
45   (setv user-input (input "You: "))
46 
47 
48   ;; Call the Gemini API
49   (setv response-data (call-gemini chat-history user-input))
50 
51   ;; Debug print (optional)
52   ;; (print "Raw response data:" response-data)
53 
54   ;; Extract and print the assistant's message
55   ;; Using sequential gets for clarity, assumes expected structure
56   (setv candidates (get response-data "candidates"))
57   (setv first-candidate (get candidates 0))
58   (setv content (get first-candidate "content"))
59   
60   (setv parts (get content "parts"))
61 
62   (setv assistant-message (get (get parts 0) "text"))
63   (print "Assistant:" assistant-message)
64 
65   ;; Append BOTH user and assistant messages to chat history (important for context)
66   (.append chat-history {"role" "user" "parts" [{"text" user-input}]})
67   (.append chat-history {"role" "model" "parts" [{"text" assistant-message}]}))

This example differs from the OpenAI API example in the previous chapter in two ways:

It implements a chat (multiple user input conversation) interface.
It uses the low level Python requests library since the Google Gemini library has some incompatibilities with the Hy language system.

Here is a sample output showing how the user chat complex is used:

 1 $ uv sync
 2 $ uv run hy chat.hy                    
 3 You: set the value of the variable X to 1 + 7
 4 Assistant: python
 5 X = 1 + 7
 6 
 7 
 8 This code will:
 9 
10 1. **Calculate:**  1 + 7, which results in 8.
11 2. **Assign:** Assign the value 8 to the variable named `X`.
12 
13 You: print the value of X + 3
14 Assistant: python
15 X = 1 + 7  # Make sure X is defined as 8
16 print(X + 3)
17 
18 
19 This code will:
20 
21 1. **Calculate:** Take the current value of X (which is 8) and add 3 to it, resultin\
22 g in 11.
23 2. **Print:** Display the result (11) on the console.
24 
25 You: print the value of X + 3
26 Assistant: python
27 X = 1 + 7  # Make sure X is defined as 8
28 print(X + 3)
29 
30 This code will:
31 
32 1. **Calculate:** Take the current value of `X` (which is 8) and add 3 to it, result\
33 ing in 11.
34 2. **Print:** Display the result (11) on the console.
35 
36 You:

Using Google’s Python Package to Access Gemini

We use the package google-genai in the example context_url.hy:

 1 (import os)
 2 (import google [genai])
 3 (import json) ;; Explicitly import json for dumps
 4 (import pprint [pprint])
 5 
 6 ;; Set enviroment variable: "GOOGLE_API_KEY"
 7 
 8 (setv client (genai.Client))
 9       
10 (defn context_qa [prompt]
11   "Calls the Gemini API using url_context tool with a prompt containing both a URI a\
12 nd user question"
13 
14   (setv
15     response
16     (client.models.generate_content
17       :model "gemini-2.5-flash"
18       :contents prompt
19       :config {"tools" [{"url_context" {}}]}))
20 
21   (return response.text))
22 
23 (when (= __name__ "__main__")
24   (print
25     (context_qa
26       "https://markwatson.com What musical instruments does Mark Watson play?")))

The tool url_context is called automatically when a URI is present in a user prompt. A prompt can also contain multiple URIs and they are all used in generating text from the input prompt.

The output for this example is:

1 $ uv run hy context_url.hy                                    
2 Mark Watson plays the guitar, didgeridoo, and American Indian flute.

If you want to reuse this example without using tools, just remove the option :config {“tools” [{“url_context” {}}]}.

The GitHub repository for the this Google package also contains useful examples and documentation links: https://github.com/googleapis/python-genai.

Wrap Up for Using the Gemini APIs

There are many good commercial LLM APIs (and I have most of them) but I currently most frequently use Gemini for two reasons: supports a one million token context size and is very low cost.

I discuss Gemini in more detail in another book that you can read online: https://leanpub.com/solo-ai/read.

Running Local LLMs Using Ollama

We saw in previus chapters how to use LLMs from commercial providers, for example GPT-5 from OpenAI and Gemini-2.5-flash from Google. Here we run smaller models on our own laptops or servers. You need to install Ollama: https://ollama.com.

Install Ollama and then download a model we will experiment with:

1 $ ollama pull llama3.2:latest
2 $ ollama serve

The first line is run one time to download a model. The second line is run whenever you want to call the local Ollama service.

Completions

Here we look at a “hello world” type simple example: we pass a text prompt to a local Ollama server instance. This is similar to previous examples for GPT-5 and Gemini-2.5.

The example code is in the file completion.hy:

(import ollama)

(defn completion [prompt]
  ; Initiate chat with the model
  (setv response
        (ollama.chat
          :model "llama3.2:latest"
          :messages [{"role" "user" "content" user-prompt}]))
  (print response)
  (return response.message.content))

;;;; Test code:

; User prompt
(setv
  user-prompt
  "Sally is 77, Bill is 32, and Alex is 44 years old. Pairwise, what are their age d\
ifferences? Be concise."
  )

(print
 (completion user-prompt))

The output looks like:

$ uv run hy completion.hy
model='llama3.2:latest' created_at='2025-08-15T22:35:33.82621Z' done=True done_reaso\
n='stop' total_duration=1099944375 load_duration=81426584 prompt_eval_count=56 promp\
t_eval_duration=98267500 eval_count=54 eval_duration=919762917 message=Message(role=\
'assistant', content='Here are the pairwise age differences:\n\n- Sally and Bill: 77\
 - 32 = 45\n- Sally and Alex: 77 - 44 = 33\n- Bill and Alex: 32 - 44 = -12 (Bill is \
younger)', thinking=None, images=None, tool_name=None, tool_calls=None)
Here are the pairwise age differences:

- Sally and Bill: 77 - 32 = 45
- Sally and Alex: 77 - 44 = 33
- Bill and Alex: 32 - 44 = -12 (Bill is younger)

Tool Use

In the context of this book using tools means that we define functions in the Hy language and configure a Large Language Model to use these tools.

Integrating tool use with Ollama represents a pivotal step in the evolution of local AI, bridging the gap between offline language models and interactive, real-world applications. This capability, often referred to as function calling, allows LLMs running on your own hardware to execute external code and query APIs, breaking the confines of their static training data. By equipping a local model with tools, developers can empower applications using LLMs to, for example, fetch live weather data, search a database, or control other software services. This transforms the LLM from a simple text-generation engine into a dynamic agent capable of performing complex, multi-step tasks and interacting directly with its environment, all while maintaining the privacy and control inherent to the Ollama ecosystem.

Use of Python docstrings at runtime: the Ollama Python SDK leverages docstrings as a crucial part of its runtime function calling mechanism. When defining functions that will be called by the LLM, the docstrings serve as structured metadata that gets parsed and converted into a JSON schema format. This schema describes the function’s parameters, their types, and expected behavior, which is then used by the model to understand how to properly invoke the function. The docstrings follow a specific format that includes parameter descriptions, type hints, and return value specifications, allowing the SDK to automatically generate the necessary function signatures that the LLM can understand and work with.

During runtime execution, when the LLM determines it needs to call a function, it first reads these docstring-derived schemas to understand the function’s interface. The SDK parses these docstrings using Python’s introspection capabilities (through the inspect module) and matches the LLM’s intended function call with the appropriate implementation. This system allows for a clean separation between the function’s implementation and its interface description, while maintaining human-readable documentation that serves as both API documentation and runtime function calling specifications. The docstring parsing is done lazily at runtime when the function is first accessed, and the resulting schema is typically cached to improve performance in subsequent calls.

Here is the sample tools library defined in tools.hy:

(import os)
(import httpx)
(import markdownify [markdownify])

(defn list-directory []
  "Lists files and directories in the current working directory"
  ; Args:
  ;   None
  ; Returns:
  ;   string containing the current directory name, followed by list of files in the\
 directory
  (setv current-dir (os.path.realpath "."))
  (setv files (.listdir os))

  (return f"Contents of current directory {current-dir} is: {files}"))

(defn read-file-contents [file-path]
  "Reads the contents of a file, given an input file-path"
  ; Args:
  ;   file-path: The path to the file
  ; Returns:
  ;   The contents of the file as a string
  (with [f (open file-path "r")]
    (.read f)))

(defn uri-to-markdown [uri]
  "Fetches HTML from a URI and converts it to markdown."
  (setv response (httpx.get uri))
  (.raise-for-status response)
  ; Convert the HTML text to Markdown
  (setv md (markdownify response.text))
  (return f"# Content from {uri}\n\n{md}"))

In the next example we will configure a LLM to call the tools (or functions) defined in the last code listing.

Example in ollama_tools_examples.hy:

(import ollama)

(import tools  [list-directory])
(import tools  [read-file-contents])
(import tools  [uri-to-markdown])

; (print (list-directory))
; (print (read-file-contents "requirements.txt"))
; (print (uri-to-markdown "https://markwatson.com"))

; Map function names to function objects
(setv available-functions {
  "list_directory" list-directory
  "read_file_contents" read-file-contents
  "uri_to_markdown" uri-to-markdown
})

; User prompt
(setv
  user-prompt 
;;  "read the 'requirements.txt' file"
  "convert 'https://markwatson.com' to markdown.")
;;  "Please list the contents of the current directory, read the 'requirements.txt' \
file, and convert 'https://markwatson.com' to markdown.")

; Initiate chat with the model
(setv response (ollama.chat
  :model "llama3.2:latest"
  :messages [{"role" "user" "content" user-prompt}]
  :tools [list-directory read-file-contents uri-to-markdown]
))

(print response)

;;(print (get response.message.tool_calls 0).name)

; Process the model's response
(for [tool-call (or response.message.tool_calls {"name" "none"})]
  (print tool-call)
  (print tool-call.function)
  (setv function-to-call (.get available-functions tool-call.function.name))
  (setv arguments tool-call.function.arguments)
  (print arguments)
  (print function-to-call)
  (if function-to-call
    (do
      (setv result (function-to-call #** tool-call.function.arguments))
      (print f"\n\n** Output of {tool-call.function.name}: {result}")
    )
    (print f"\n\n** Function {(.name tool-call.function)} not found.")
  )
)

This Hy script demonstrates how to integrate a large language model with external tools using the Ollama library. It begins by importing three functions: list-directory, read-file-contents, and uri-to-markdown from the local file tools.hy that we saw earlier. These functions are then mapped by their string names to the actual function objects in a dictionary called available-functions. This mapping serves as a registry, allowing the program to dynamically call a function based on a name provided by the language model. A user prompt is defined, asking the model to perform a task that requires one of the available tools.

The core of the script involves sending the prompt and the list of available tools to the Llama 3.2 model via the chat function. The model analyzes the request and, instead of generating a text-only reply, it returns a response object containing a “tool call” instruction. The script then iterates through any tool calls in the response. For each call, it retrieves the function name and arguments, looks up the corresponding function in the available-functions dictionary, and executes it with the provided arguments. The final result from the tool is then printed to the console, completing the request.

Wrap Up for Running Local LLMs Using Ollama

I spend most of my development time working with smaller LLMs running on Ollama (LM Studio is another good choice for running locally).

There are obvious privacy and security advantages running LLMs locally and very interesting and useful engineering problems can sometimes be solved with smaller models.

Agents Using the Agno Agent Framework Running On a Local Ollama Model

The example in this chapter uses a local LLM running on Ollama. The examples for this chapter are found in the directory agents_agno. If yuo skipped reading the previous chapter, please review the opening material for running the Ollama service.

An Agent For Answering Questions About A Specific Web Site

Here we construct a sophisticated web scraping agent using the agno library. This program defines a specialized tool, scrape-website-content which leverages the requests and BeautifulSoup libraries to fetch and parse the textual content from any given URL, stripping away common non-content elements like navigation bars and scripts. This tool is then integrated into an Agent powered by a local Ollama model. The agent is configured with a detailed description, a step-by-step instruction set, and a defined output format, guiding it to first scrape a user-provided URL and then answer a specific question based only on the extracted information, ensuring a focused and verifiable response.

 1 (import textwrap [dedent])
 2 (import os requests)
 3 (import bs4 [BeautifulSoup])
 4 (import agno.agent [Agent])
 5 (import agno.models.ollama [Ollama])
 6 (import agno.tools [tool])
 7 
 8 (tool ;; in Python this would be a @tool annotation
 9   (defn scrape-website-content [url]
10     "Fetches and extracts the clean, textual content from a given webpage URL.
11     Use this tool when you need to read the contents of a specific web page to
12     answer a question.
13     
14     Args:
15         url (str): The full, valid URL of the webpage to be scraped
16         (e.g., 'https://example.com').
17         
18     Returns:
19         str: The extracted text content of the webpage.
20     "
21     (try
22       ;; Set a User-Agent header to mimic a real browser.
23       (let
24         [headers
25          {"User-Agent" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36\
26  (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"}
27          response (requests.get url :headers headers :timeout 10)]
28         (.raise_for_status response)
29         (let [soup (BeautifulSoup response.text "html.parser")]
30           ;; Remove unwanted tags
31           (for [tag (soup ["script" "style" "nav" "footer" "aside"])]
32             (.decompose tag))
33           (let [text (.get_text soup :separator "\n" :strip True)]
34             (if (not text)
35               f"Connected to {url}, but no text content could be extracted."
36               f"Successfully scraped content from {url}:\n\n{text}")))))))
37 
38 ;; Initialize the web scraping and analysis agent
39 (setv scraper-agent (Agent
40   :model (Ollama :id "qwen3:30b")
41   :tools [scrape-website-content]
42   :description (dedent
43     "You are an expert web scraping and analysis agent. You follow a strict process:
44 
45     - Given a URL in a prompt, you will first use the appropriate tool to scrape
46       its content.
47     - You will then carefully read the scraped content to understand it thoroughly.
48     - Finally, you will answer the user's question based *only* on the information
49       contained within that specific URL's content.")
50   
51   ;; The instructions are refined to provide a clear, step-by-step reasoning process.
52   :instructions (dedent
53     "1. Scrape Phase 🕸️
54        - Analyze the user's prompt to identify the target URL.
55        - Invoke the `scrape` tool with the identified URL.
56 
57     2. Analysis Phase 📊
58        - Carefully read the entire content returned by the `scrape` tool.
59        - Systematically extract the specific information required to answer the
60          user's question.
61 
62     3. Answering Phase ✍️
63        - Formulate a concise and accurate answer based exclusively on the scraped
64          information.
65        - If the information is not present, state that clearly.
66 
67     4. Quality Control ✓
68        - Reread the original query and your answer to ensure it is accurate
69          and relevant.")
70   
71   :expected_output (dedent
72     "# {Answer based on website content}
73     
74     **Source:** {URL provided by the user}")
75   
76   :markdown True
77   :show_tool_calls True
78   :add_datetime_to_instructions True))
79 
80 ;; Main execution block
81 (when (= __name__ "__main__")
82   (setv prompt "Using the web site https://markwatson.com Consultant Mark Watson has\
83  written Common Lisp, semantic web, Clojure, Java, and AI books. What musical instru\
84 ments does he play?")
85   
86   (.print-response scraper-agent
87     prompt
88     :stream True))

This code is divided into two main parts: the tool definition and the agent configuration.

The first part is the definition of the scrape-website-content function acts as the agent’s primary capability. It takes a URL, uses the requests library to perform an HTTP GET request (while mimicking a browser’s User-Agent header to improve compatibility), and then processes the resulting HTML with BeautifulSoup. Critically, it removes tags like <script>, <style>, <nav>, and <footer> that typically contain boilerplate or non-essential content. This cleaning step is vital as it provides the language model with a concise and relevant block of text, free from the noise of web page structure and styling, allowing it to focus on the core information needed to answer the user’s query.

The second part initializes the Agent from the agno library. This is where the AI’s behavior is defined. It’s configured to use a specific Ollama model and is given access to the scrape-website-content tool we defined. The description and instructions parameters are crucial; they act as a system prompt that programs the agent’s workflow, forcing it into a strict sequence of scraping, analyzing, and then answering. By specifying expected_output, we enforce a consistent structure on the agent’s final response. The main execution block demonstrates a practical example, asking the agent to find information about musical instruments from a specific website, which triggers the entire scrape-and-answer process.

Note: The Agno framework prints beautiful colored bounding boxes around blocks of output text. In the following listing these bounding boxes, represented by four specific Unicode characters, just show up here as tiny box-characters.

 1 $ uv run hy web_site_qa.hy
 2 ┏━ Message ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 3 ┃                                                                               ┃
 4 ┃ Using the web site https://markwatson.com Consultant Mark Watson has written  ┃
 5 ┃ Common Lisp, semantic web, Clojure, Java, and AI books. What musical          ┃
 6 ┃ instruments does he play?                                                     ┃
 7 ┃                                                                               ┃
 8 ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
 9 ┏━ Tool Calls ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
10 ┃                                                                               ┃
11 ┃ • scrape_website_content(url=https://markwatson.com)                          ┃
12 ┃                                                                               ┃
13 ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
14 ┏━ Response (11.8s) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
15 ┃                                                                               ┃
16 ┃ Mark Watson plays the guitar, didgeridoo, and American Indian flute.          ┃
17 ┃                                                                               ┃
18 ┃ Source: https://markwatson.com                                                ┃
19 ┃                                                                               ┃
20 ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

Wrap Up for Agno Agent Example

The Python source code repository for Agno is found here: https://github.com/agno-agi/agno.

Documentation is found here: https://docs.agno.com/introduction.

There were a few Hy-specific nuances for using Agno with the Hy language. Hopefully, dear reader, the example here serves as a good example fr writing your own aganet in the Hy language.

Using Perplexity Sonar Model for Combined Web Search and LLM Based Reasoning

This chapter combines ideas from the earlier chapter “Using the Microsoft Bing Search APIs” and using LLMs for reasoning. We will use a commercial API from Perplexity. You need a Perplexity API key: https://docs.perplexity.ai/home

I buy $5 of credits at a time and these credits usually last me a few months of experimenting, your mileage may vary. If you use this API in production you will want to check the pricing information and use the cheapest model that will work for you: https://docs.perplexity.ai/guides/pricing

Perplexity’s Sonar API represents a significant advancement in leveraging large language models by seamlessly integrating real-time web search capabilities directly into the reasoning process. Unlike traditional approaches that might require complex orchestration between separate search APIs and LLMs, Sonar allows the language model to autonomously access and synthesize current information from the web as it formulates a response. This dynamic interaction ensures that the generated outputs are not limited to the potentially outdated knowledge contained within the model’s training data, thereby providing more accurate, relevant, and up-to-date answers. The API offers different models, such as Sonar, Sonar Pro, Sonar Reasoning, and Sonar Deep Research, each tailored for varying levels of search depth, reasoning complexity, and the ability to handle intricate, multi-step queries. A key benefit of using the Perplexity API for this combined approach is the ability to ground LLM responses in verifiable information, often accompanied by citations to the sources found during the real-time search. This enhances the trustworthiness and reliability of the AI’s output, which is particularly valuable for applications requiring factual accuracy. You, dear reader, can utilize the Perplexity API to build applications that require access to current events, perform in-depth research, or provide answers to questions where the information is constantly changing, effectively overcoming the limitations of models that rely solely on static knowledge and mitigating the risk of generating incorrect or fabricated information.

A Hy Language Client Library for Perplexity

The following Hy code defines a function search_llm that interacts with the Perplexity AI API to answer a user query. It begins by importing necessary libraries like os to retrieve the API key from the environment variable PERPLEXITY_API_KEY and the Python OpenAI library (Perplexity offers an OpenAI compatibility feature that we use here), which is configured to use the Perplexity API endpoint. A standard system message is defined to instruct the AI on its role as a programming and tech assistant utilizing web search and reasoning. The search_llm function takes a query string, constructs a conversation history including the system message and the user’s query, initializes the openai.OpenAI client pointing to the Perplexity API’s base URL and using the retrieved key, sends the messages to the “sonar” model via the chat completions endpoint, extracts the text content of the first message from the AI’s response, and returns this content string. Finally for testing, the script calls the search_llm function with a specific question about Mark Watson’s musical instruments and prints the resulting answer from the AI.

Here is a listing of the file perplexity_search_llm/search_llm.hy:

 1 (import os)
 2 (import openai)
 3 (import pprint [pprint]) ; Import pprint for potentially pretty printing responses
 4 
 5 ;; Set your Perplexity API key from an environment variable
 6 (setv YOUR-API-KEY (os.environ.get "PERPLEXITY_API_KEY"))
 7 
 8 ;; Define the messages for the conversation using triple quotes for multiline content
 9 (setv system-message
10       {"role" "system"
11        "content" "You are an artificial intelligence assistant for helping a user wi\
12 th programming and tech questions using web search and reasoning."})
13 
14 (defn search_llm [query]
15   (setv user-message
16         {"role" "user"
17          "content" query})
18   
19   (setv messages [system-message user-message])
20 
21   ;; Initialize the OpenAI client, pointing to the Perplexity API base URL
22   (setv client (openai.OpenAI :api-key YOUR-API-KEY
23                               :base-url "https://api.perplexity.ai"))
24 
25   ;; --- Chat completion without streaming ---
26   (setv response (client.chat.completions.create
27                    :model "sonar" ; Use a model supported by Perplexity
28                    :messages messages))
29   (setv choices-list (. response choices))
30   (setv first-choice (get choices-list 0))
31   (setv message-object-result (. first-choice message))
32   (setv content-string (. message-object-result content))
33   ;;(print content-string)
34   content-string)
35 
36 (print (search_llm "Consultant Mark Watson has written many books on AI, Lisp and th\
37 e semantic web. What musical instruments does he play?"))

In this code, I unpack the response data in several steps so you can use pprint to inspect the response and optionally use other data returned by the Perplexity API.

Example Output

Here is the output from the test query at the bottom of the last program listing:

 1 $ venv/bin/hy search_llm.hy                                                         
 2 Mark Watson, the consultant known for writing books on AI, Lisp, and the semantic we\
 3 b, plays several musical instruments as a hobby. These include the **guitar**, **did\
 4 geridoo**, and **American Indian flute**[4]. There is no indication in the available\
 5  information that he is professionally involved in music or plays any other instrume\
 6 nts beyond these. 
 7 
 8 It's worth noting that there are other individuals with the name Mark Watson who are\
 9  involved in music professionally, such as Mark Watson (@markwatsonmusic) and anothe\
10 r Mark Watson who is a bass baritone[1][5]. However, the consultant Mark Watson ment\
11 ioned in the query is distinct from these individuals and is known for his work in t\
12 echnology and AI[4].

Wrap Up for Using Perplexity

While the Perplexity API could be expensive to use in high API call volume production applications, I find it to be very useful for simplifying my code when I need to combine web search with LLM use.

Using LangChain to Chain Together Large Language Models

Harrison Chase started the LangChain project in October 2022 and as I write the first version of this chapter back in May 2023 the GitHub repository for LangChain https://github.com/hwchase17/langchain has over 200 contributors.

Note: this chapter and material last updated August 15, 2025.

The material in this chapter is a very small subset of material in my recent Python book LangChain and LlamaIndex Projects Lab Book: Hooking Large Language Models Up to the Real World. Using GPT-5, ChatGPT, and Hugging Face Models in Applications. that you can read for free online by using the link Free To Read Online.

LangChain is a framework for building applications with large language models (LLMs) through chaining different components together. Some of the applications of LangChain are chatbots, generative question-answering, summarization, data-augmented generation and more. LangChain can save time in building chatbots and other systems by providing a standard interface for chains, agents and memory, as well as integrations with other tools and end-to-end examples. We refer to “chains” as sequences of calls (to an LLMs and a different program utilities, cloud services, etc.) that go beyond just one LLM API call. LangChain provides a standard interface for chains, many integrations with other tools, and end-to-end chains for common applications. Often you will find existing chains already written that meet the requirements for your applications.

For example, one can create a chain that takes user input, formats it using a PromptTemplate, and then passes the formatted response to a Large Language Model (LLM) for processing.

While LLMs are very general in nature which means that while they can perform many tasks effectively, they often can not directly provide specific answers to questions or tasks that require deep domain knowledge or expertise. LangChain provides a standard interface for agents, a library of agents to choose from, and examples of end-to-end agents.

LangChain Memory is the concept of persisting state between calls of a chain or agent. LangChain provides a standard interface for memory, a collection of memory implementations, and examples of chains/agents that use memory². LangChain provides a large collection of common utils to use in your application. Chains go beyond just a single LLM call, and are sequences of calls (whether to an LLM or a different utility). LangChain provides a standard interface for chains, lots of integrations with other tools, and end-to-end chains for common applications.

LangChain can be integrated with one or more model providers, data stores, APIs, etc.

Installing Necessary Packages

We are using uv as a package manager and to run the examples. Run using:

1 $ cd hy-lisp-python-book/source_code_for_examples/langchain_examples
2 $ uv sync
3 $ uv run hy country_information.hy
4 $ uv run hy directions_template.hy
5 $ uv run hy doc_search.hy

Basic Usage and Examples

While I try to make the material in this book independent, something you can enjoy with no external references, you should also take advantage of the high quality documentation and the individual detailed guides for prompts, chat, document loading, indexes, etc.

As we work through some examples please keep in mind what it is like to use the ChatGPT web application: you enter text and get respponses. The way you prompt ChatGPT is obviously important if you want to get useful responses. In code examples we automate and formalize this manual process.

You need to choose a LLM to use. We will usually choose the GPT-3.5 API from OpenAI because it is general purpose and much less expensive than OpenAI’s previous model APIs. You will need to sign up for an API key and set it as an environment variable:

1 export OPENAI_API_KEY="YOUR KEY GOES HERE"

Both the libraries openai and langchain will look for this environment variable and use it. We will look at a few simple examples in a Hy REPL. We will start by just using OpenAI’s text prediction API that accepts a prompt and then continues generating text from that prompt:

1 $ uv run hy
2 Hy 1.1.0 using CPython(main)  3.12.0 on Darwin
3 => (import langchain_openai.llms [OpenAI])
4 => (setv llm (OpenAI :temperature 0.8))
5 => (llm "John got into his new sports car, and he drove it")
6 " to work. He felt really proud that he was able to afford the car and even parked i\
7 t in a prime spot so everyone could see. He felt like he had really made it."
8 =>

The temperature should have a value between 0 and 1. Use a small temperature value to get repeatable results and a large temperature value if you want very different completions each time you pass the same prompt text.

Our next example is in the source file directions_template.hy and uses the PromptTemplate class. A prompt template is a reproducible way to generate a prompt. It contains a text string (“the template”), that can take in a set of parameters from the end user and generate a prompt. The prompt template may contain language model instructions, few-shot examples to improve the model’s response, or specific questions for the model to answer.

 1 (import langchain.prompts [PromptTemplate])
 2 (import langchain_openai.llms [OpenAI])
 3 
 4 (setv llm (OpenAI :temperature 0.9))
 5 
 6 (defn get_directions [thing_to_do]
 7    (setv
 8      prompt
 9      (PromptTemplate
10        :input_variables ["thing_to_do"]
11        :template "How do I {thing_to_do}?"))
12     (setv
13       prompt_text
14       (prompt.format :thing_to_do thing_to_do))
15     ;; Print out generated prompt when you are getting started:
16     (print "\n" prompt_text ":")
17     (llm prompt_text))
18 
19 (print (get_directions "get to the store"))
20 (print (get_directions "hang a picture on the wall"))

You could just write Hy string manipulation code to create a prompt but using the utility class PromptTemplate is more legible and works with any number of prompt input variables. In this example, the prompt template is really simple. For more complex Python examples see the LangChain prompt documentation. We will later see a more complex prompt example.

Let’s change directory to hy-lisp-python/langchain and run two examples in a Hy REPL:

 1 $ uv run hy
 2 Hy 1.1.0 using CPython(main)  3.12.0 on Darwin
 3 => (import directions_template [get_directions])
 4 => (print (get_directions "hang a picture on the wall"))
 5 
 6  How do I hang a picture on the wall? :
 7 
 8 
 9 1. Gather necessary items: picture, level, appropriate hardware for your wall type (\
10 nails, screws, anchors, etc).
11 
12 2. Select the location of the picture on the wall. Use a level to ensure that the pi\
13 cture is hung straight. 
14 
15 3. Mark the wall where the hardware will be placed.
16 
17 4. Securely attach the appropriate hardware to the wall. 
18 
19 5. Hang the picture and secure with the hardware. 
20 
21 6. Stand back and admire your work!
22 => (print (get_directions "get to the store"))
23 
24  How do I get to the store? :
25 
26 
27 The best way to get to the store depends on your location. If you are using public t\
28 ransportation, you can use a bus or train to get there. If you are driving, you can \
29 use a GPS or maps app to find the fastest route.
30 =>

The next example in the file country_information.hy is derived from an example in the LangChain documentation. In this example we use PromptTemplate that contains the pattern we would like the LLM to use when returning a response.

 1 (import langchain.prompts [PromptTemplate])
 2 (import langchain_openai.llms [OpenAI])
 3 
 4 (setv llm (OpenAI :temperature 0.9))
 5 
 6 (setv
 7   template
 8   "Predict the capital and population of a country.\n\nCountry: {country_name}\nCapi\
 9 tal:\nPopulation:\n")
10 
11 (defn get_country_information [country_name]
12   (print "Processing " country_name ":")
13   (setv
14      prompt
15      (PromptTemplate
16        :input_variables ["country_name"]
17        :template template))
18   (setv
19       prompt_text
20       (prompt.format :country_name country_name))
21   ;; Print out generated prompt when you are getting started:
22   (print "\n" prompt_text ":")
23   (llm prompt_text))
24 
25 (print (get_country_information "Germany"))
26 ;; (print (get_country_information "Canada"))

You can use the ChatGPT web interface to experiment with prompts. When you find a pattern that works well then write a Python script like the last example, changing the data you supply in the PromptTemplate instance.

Here are two examples of this code for getting information about Canada and Germany:

 1 $ uv run hy
 2 Hy 1.1.0 using CPython(main)  3.12.0 on Darwin
 3 => (import country_information [get_country_information])
 4 => (print (get_country_information "Canada"))
 5 Processing  Canada :
 6 
 7  Predict the capital and population of a country.
 8 
 9 Country: Canada
10 Capital:
11 Population:
12  :
13 
14 Capital: Ottawa
15 Population: 37,592,000 (as of 2019)
16 => (print (get_country_information "Germany"))
17 Processing  Germany :
18 
19  Predict the capital and population of a country.
20 
21 Country: Germany
22 Capital:
23 Population:
24  :
25 
26 Capital: Berlin
27 Population: 83 million
28 =>

We print the generated prompt and you can try copying this text (here for Canada) into the ChatGPT web app:

1 Predict the capital and population of a country.
2 
3 Country:Canada
4 Capital:
5 Population:

So there is no magic here. We are simply generating prompts that contain context data.

Creating Embeddings

We will reference the LangChain embeddings documentation. We can use a Hy REPL to see what text to vector space embeddings might look like:

 1 $ uv run hy
 2 Hy 1.1.0 using CPython(main)  3.12.0 on Darwin
 3 => (import langchain_openai [OpenAIEmbeddings])
 4 => (setv embeddings (OpenAIEmbeddings))
 5 => (setv text "Mary has blond hair and John has brown hair. Mary lives in town and J\
 6 ohn lives in the country.")
 7 => (setv doc_embeddings (embeddings.embed_documents [text]))
 8 => doc_embeddings
 9 [[0.007754440331396565 0.0008957661819527747 -0.003335848878474548 -0.01803736554483\
10 232 -0.017987297643789046 0.028564378295111985 -0.013368429464419828 0.0047096176469\
11 93997..]]
12 => (setv query_embedding (embeddings.embed_query "Does John live in the city?"))
13 => query_embedding
14 [0.028118159621953964 0.011476404033601284 -0.009456867352128029 ...]

Notice that the doc_embeddings is a list where each list element is the embeddings for one input text document. The query_embedding is a single embedding. Please read the above linked embedding documentation.

We will use vector stores to store calculated embeddings for future use in the next example.

Using LangChain Vector Stores to Query Documents

We will reference the LangChain Vector Stores documentation. Weneed to install a few libraries that are pre-configured in the file pyproject.toml that uv uses:

chroma
chromadb
unstructured
pdf2image
pytesseract

The next document query example is contained in a single script hy-lisp-python-book/source_code_for_examples/langchain/doc_search.hy with three document queries at the end of the script. In this example we read the text file documents in the directory hy-lisp-python-book/source_code_for_examples/langchain/data and create a local embeddings datastore we use for natural language queries:

 1 (import langchain.text_splitter [CharacterTextSplitter])
 2 (import langchain_community.vectorstores [Chroma])
 3 (import langchain_openai.embeddings [OpenAIEmbeddings])
 4 (import langchain_community.document_loaders [DirectoryLoader UnstructuredMarkdownLo\
 5 ader])
 6 (import langchain.chains [VectorDBQA])
 7 (import langchain_openai.llms [OpenAI])
 8 
 9 (setv embeddings (OpenAIEmbeddings))
10 
11 (setv loader (DirectoryLoader "./data/" :glob "**/*.txt" :loader_cls UnstructuredMar\
12 kdownLoader))
13 (setv documents (loader.load))
14 (print documents)
15 
16 (setv
17   text_splitter
18   (CharacterTextSplitter :chunk_size 2500 :chunk_overlap 0))
19 
20 (setv
21   texts
22   (text_splitter.split_documents documents))
23 
24 (setv
25   docsearch
26   (Chroma.from_documents texts  embeddings))
27 
28 (setv
29   qa
30   (VectorDBQA.from_chain_type
31     :llm (OpenAI)
32     :chain_type "stuff"
33     :vectorstore docsearch))
34 
35 (defn query [q]
36   (print "Query: " q)
37   (print "Answer: " (qa.run q)))
38 
39 (query "What kinds of equipment are in a chemistry laboratory?")
40 (query "What is Austrian School of Economics?")
41 (query "Why do people engage in sports?")
42 (query "What is the effect of body chemistry on exercise?")

The DirectoryLoader class is useful for loading a directory full of input documents. In this example we specified that we only want to process text files, but the file matching pattern could have also specified PDF files, etc.

The output is:

 1 $ uv sync
 2 $ uv run hy doc_search.hy
 3 Using embedded DuckDB without persistence: data will be transient
 4 Query:  What kinds of equipment are in a chemistry laboratory?
 5 Answer:   A chemistry laboratory typically contains various glassware, as well as ot\
 6 her equipment such as beakers, flasks, test tubes, Bunsen burners, hot plates, and o\
 7 ther materials used for conducting experiments.
 8 Query:  What is Austrian School of Economics?
 9 Answer:   The Austrian School of economics is a school of economic thought that emph\
10 asizes the spontaneous organizing power of the price mechanism. Austrians hold that \
11 the complexity of subjective human choices makes mathematical modelling of the evolv\
12 ing market extremely difficult and advocate a "laissez faire" approach to the econom\
13 y. Austrian School economists advocate the strict enforcement of voluntary contractu\
14 al agreements between economic agents, and hold that commercial transactions should \
15 be subject to the smallest possible imposition of forces they consider to be (in par\
16 ticular the smallest possible amount of government intervention). The Austrian Schoo\
17 l derives its name from its predominantly Austrian founders and early supporters, in\
18 cluding Carl Menger, Eugen von Böhm-Bawerk and Ludwig von Mises.
19 Query:  Why do people engage in sports?
20 Answer:   People engage in sports because they are enjoyable activities that involve\
21  physical athleticism or dexterity, and are governed by rules to ensure fair competi\
22 tion and consistent adjudication of the winner.
23 Query:  What is the effect of body chemistry on exercise?
24 Answer:   Body chemistry can affect the transfer of energy from one chemical substan\
25 ce to another, as well as the efficiency of energy-producing systems that do not rel\
26 y on oxygen, such as anaerobic exercise. It can also affect the body's ability to pr\
27 oduce enough moisture, which can lead to dry eye and other symptoms.

If you use this example to index a large number of documents you will want to store the index for future use. Then any application can reuse your local index. If you add documents to your data directory then re-run the script to create the local index. You can see examples of persistent vector stores in my LangChain book.

LangChain Wrap Up

I wrote a Python book that goes into greater detail on both LangChain as well as the library LlamaIndex that are often used together. You can buy my book LangChain and LlamaIndex Projects Lab Book: Hooking Large Language Models Up to the Real World or read it free online using the Free To Read Online Link.

Large Language Models Experiments Using Google Colab

In addition to using LLM APIs from OpenAI, Cohere, etc. you can also run smaller LLMs locally on your laptop if you have enough memory (and optionally a good GPU). When I experiment with self-hosted LLMs I usually run them in the cloud using either Google Colab or a leased GPU server from Lambda Labs.

We will use the Hugging Face tiiuae/falcon-7b model. You can read the Hugging Face documentation for the tiiuae/falcon-7b model.

Google Colab directly supports only the Python and R languages. We can use Hy by using the %%writefile test.hy script magic to write the contents of a cell to a local file, in this case Hy language source code. For interactive development we will use the script magic %%script bash to run hy test.hy because this will use the same process when we re-evaluate the notebook cell. If we would run our Hy script using !hy test.hy then each time we evaluate the cell we would get a fresh Linux process, so the previous caching of model files, etc. would be repeated.

Here we use the Colab notebook that is shown here:

If you have a laptop that can run this example, you can also run it locally by installing the dependencies:

1 pip install hy transformers accelerate einops

Here is the code example:

 1 (import transformers [AutoTokenizer pipeline])
 2 (import torch)
 3 
 4 (setv model "tiiuae/falcon-7b")
 5 (setv tokenizer (AutoTokenizer.from_pretrained model))
 6 (setv pipel
 7   (pipeline "text-generation" :model model :tokenizer tokenizer
 8                               :torch_dtype torch.bfloat16
 9                               :device_map "auto"))
10 (setv sequences
11   (pipel "Sam bought a new sports car and wanted to see Mary. Sam got in his sports \
12 car and"
13        :max_length 100 :do_sample True :top_k 10
14        :num_return_sequences 1 :eos_token_id tokenizer.eos_token_id))
15 (print sequences)

The generated text varies for each run. Here is example output:

1 Sam bought a new sports car and wanted to see Mary. Sam got in his sports car and dr\
2 ove the 20 miles to Mary’s house. The weather was perfect and the road was nice and \
3 smooth. It didn’t take long to get there and Sam had a lot of time to relax on his w\
4 ay. The car was a lot of fun to drive because it had all kinds of new safety feature\
5 s, and Sam really felt in control of his sports car.

As I write this chapter in September 2023, more small LLM models are being released that can run on laptops. If you use M1 or M2 Macs and you have at least 16G of shared memory, it is now also easier to run LLMs locally. Macs with 64G or more shared memory are very capable of both local self-hosted fine tuning and inference. While it is certainly simpler to use APIs from OpenAI and other vendors there are privacy and control advantages to running self-hosted models.

Book Wrap-up

I love programming in Lisp languages but I often need to use Python libraries for Deep Learning and NLP. The Hy language is a good fit for me, it is simple to install along with the Python libraries that I use for my work and it is a fun language to write code in. Most importantly, Hy fits well with the type of iterative bottom-up REPL-based development that I prefer.

I hope that you enjoyed this short book and that at least a few things that you have learned here will both help you in your work and give you ideas for new personal projects.

Best regards,

Mark Watson

May 23, 2023

Table of Contents

Cover Material, Copyright, and License

Preface

Requests from the Author

Hire the Author as a Consultant

Setting Up Your Development Environment

What is Lisp Programming Style?

Hy is Python, But With a Lisp Syntax

How This Book Reflects My Views on Artificial Intelligence and the Future of Society and Technology

About the Book Cover

Introduction to the Hy Language

Using Python Libraries

Global vs. Local Variables

Using Python Code in Hy Programs

Using Hy Libraries in Python Programs

Replacing the Python slice (cut) Notation with the Hy Functional Form

Iterating Through a List With Index of Each Element

Formatted Output

Importing Libraries from Different Directories on Your Laptop

Hy Looks Like Clojure: How Similar Are They?

Plotting Data Using the Numpy and the Matplotlib Libraries

Bonus Points: Configuration for macOS and ITerm2 for Generating Plots Inline in a Hy REPL and Shell

Why Lisp?

I Hated the Waterfall Method in the 1970s but Learned to Love a Bottom-Up Programming Style

First Introduction to Lisp

Commercial Product Development and Deployment Using Lisp

Performing Bottom Up Development Inside a REPL is a Lifestyle Choice

Writing Web Applications

Getting Started With Flask: Using Python Decorators in Hy

Using Jinja2 Templates To Generate HTML

Handling HTTP Sessions and Cookies

Deploying Hy Language Flask Apps to Google Cloud Platform AppEngine

Going Forward

Wrap Up

Responsible Web Scraping

Using the Python BeautifulSoup Library in the Hy Language

Getting HTML Links from the DemocracyNow.org News Web Site

Getting Summaries of Front Page from the NPR.org News Web Site

Using the Brave Search APIs

Setting an Environment Variable for the Access Key for Brave Search APIs

Example Search Script

Wrap-up

Deep Learning

Simple Multi-layer Perceptron Neural Networks

Deep Learning

Using Keras and TensorFlow to Model The Wisconsin Cancer Data Set

Using a LSTM Recurrent Neural Network to Generate English Text Similar to the Philosopher Nietzsche’s Writing

Natural Language Processing

Exploring the spaCy Library

Implementing a HyNLP Wrapper for the Python spaCy Library

Wrap-up

Datastores

Sqlite

PostgreSQL

Notes for Using PostgreSQL and Setting Up an Example Database “hybook” on macOS and Linux

macOS

Linux

Using Hy with PostgreSQL

RDF Data Using the “rdflib” Library

Wrap-up

Linked Data, the Semantic Web, and Knowledge Graphs

Understanding the Resource Description Framework (RDF)

Resource Namespaces Provided in rdflib

Understanding the SPARQL Query Language

Wrapping the Python rdflib Library

Knowledge Graph Creator

Recommended Industrial Use of Knowledge Graphs

Design of KGCreator Application

Problems with using Literal Values in RDF

Revisiting This Example Using URIs Instead of Literal Values

Wrap-up

Knowledge Graph Navigator

Review of NLP Utilities Used in Application

SPARQL Utilities

Utilities to Colorize SPARQL and Generated Output

Text Utilities for Queries and Results

Finishing the Main Function for KGN

Wrap-up

Using OpenAI GPT

OpenAI Text Completion API