Practical Artificial Intelligence Development With Racket
Mark Watson

Preface

I have been using Lisp languages since the 1970s. In 1982 my company bought a Lisp Machine for my use. A Lisp Machine provided an “all batteries included” working environment, but now no one seriously uses Lisp Machines. In this book I try to lead you, dear reader, through a process of creating a “batteries included” working environment using Racket Scheme.

The latest edition is always available for purchase at https://leanpub.com/racket-ai. You can also read free online at https://leanpub.com/racket-ai/read. I offer the purchase option for readers who wish to directly support my work.

If you read my eBooks free online then please consider tipping me at https://markwatson.com/#tip.

This is a “live book”: there will never be a second edition. As I add material and make corrections, I simply update the book, and the free-to-read online copy and all eBook formats for purchase are updated.

I have been developing commercial Artificial Intelligence (AI) tools and applications since the 1980s and I usually use the Lisp languages Common Lisp, Clojure, Racket Scheme, and Gambit Scheme. Here you will find Racket code that I wrote for my own use and that I am wrapping in a book in the hope that it will also be useful to you, dear reader.

Mark Watson

I wrote this book for both professional programmers and home hobbyists who already know how to program in Racket (or another Scheme dialect) and who want to learn practical AI programming and information processing techniques. I have tried to make this an enjoyable book to work through. In the style of a “cook book,” the chapters can be read in any order.

You can find the code examples and the files for the manuscript for this book in the following GitHub repository:

https://github.com/mark-watson/Racket-AI-book

Git pull requests with code improvements for either the source code or manuscript markdown files will be appreciated by me and the readers of this book.

License for Book Manuscript: Creative Commons

Copyright 2022-2024 Mark Watson. All rights reserved.

This book may be shared using the Creative Commons “share and share alike, no modifications, no commercial reuse” license.

Book Example Programs

The following diagram shows the Racket software examples configured for your local laptop. Several combined examples build both a Racket package that gets installed locally and command line programs that get built and deployed to ~/bin. Other examples are either a command line tool or a Racket package.

Example programs are packages and/or command line tools

Racket, Scheme, and Common Lisp

I like Common Lisp slightly more than Racket and other Scheme dialects, even though Common Lisp is ancient and has defects. Then why do I use Racket? Racket is a family of languages, a very good IDE, and a rich ecosystem supported by many core Racket developers and Racket library authors. Choosing Racket Scheme was an easy decision, but there are also other Scheme dialects that I enjoy using:

  • Gambit/C Scheme
  • Gerbil Scheme (based on Gambit/C)
  • Chez Scheme

Personal Artificial Intelligence Journey: or, Life as a Lisp Developer

I have been interested in AI since reading Bertram Raphael’s excellent book Thinking Computer: Mind Inside Matter in the early 1980s. I have also had the good fortune to work on many interesting AI projects including the development of commercial expert system tools for the Xerox LISP machines and the Apple Macintosh, development of commercial neural network tools, application of natural language and expert systems technology, medical information systems, application of AI technologies to Nintendo and PC video games, and the application of AI technologies to the financial markets. I have also applied statistical natural language processing techniques to analyzing social media data from Twitter and Facebook. I worked at Google on their Knowledge Graph and I managed a deep learning team at Capital One where I was awarded 55 US patents.

I enjoy AI programming, and hopefully this enthusiasm will also infect you.

Acknowledgements

I produced the manuscript for this book using the leanpub.com publishing system and I recommend leanpub.com to other authors.

Editor: Carol Watson

Thanks to the following people who found typos in this and earlier book editions: David Rupp.

A Quick Racket Tutorial

If you are an experienced Racket developer then feel free to skip this chapter! I wrote this tutorial to cover just the aspects of using Racket that you, dear reader, will need in the book example programs.

I assume that you have read the section Racket Essentials in The Racket Guide written by Matthew Flatt, Robert Bruce Findler, and the PLT group. Here I just cover some basics of getting started so you can enjoy the later code examples without encountering “road blocks.”

Installing Packages

The DrRacket IDE lets you interactively install packages. I prefer using the command line so, for example, I would install SQLite support using:

raco pkg install sqlite-table

We can then require the code in this package in our Racket programs:

(require sqlite-table)

Note that when the argument to require is a symbol (not a string), modules are searched for and loaded from your system. When the argument is a string like “utils.rkt”, a module is loaded from a file in the current directory.

Installing Local Packages In Place

In a later chapter Natural Language Processing (NLP) we define a fairly complicated local package. This package has one unusual requirement that you may or may not need in your own projects: My NLP library requires static linguistic data files that are stored in the directory Racket-AI-book-code/nlp/data. If I am in the directory Racket-AI-book-code/nlp working on the Racket code, it is simple enough to just open the files in ./data/….

The default for installing your own Racket packages is to link to the original source directory on your laptop’s file system. Let’s walk through this. First, I will make sure my library code is compiled and then install the code in the current directory:

cd Racket-AI-book-code/nlp/
raco make *.rkt
raco pkg install --scope user

Then I can run the racket REPL (or DrRacket) on my laptop and use my NLP package by requiring the code in this package in our Racket programs (shown in a REPL):

> (require nlp)
loading lex-hash......done.
> (parts-of-speech (list->vector '("the" "cat" "ran")))
'#("DT" "NN" "VBD")
> (find-place-names
    '#("George" "Bush" "went" "to" "San" "Diego"
       "and" "London") '())
'("London" "San Diego")
> 

Mapping Over Lists

We will be using functions that take other functions as arguments:

> (range 5)
'(0 1 2 3 4)
> (define (1+ n) (+ n 1))
> (map 1+ (range 5))
'(1 2 3 4 5)
> (map + (range 5) '(100 101 102 103 104))
'(100 102 104 106 108)
> 
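As a small supplementary sketch (not taken from the book's repository), here are a few other higher-order functions from the standard racket language that follow the same pattern as map. Note that Racket already provides add1, which does the same job as the 1+ helper defined above.

```racket
#lang racket

;; map with an anonymous function
(define squares (map (lambda (n) (* n n)) (range 5)))

;; filter keeps the elements for which the predicate is true
(define evens (filter even? (range 10)))

;; foldl combines the elements into a single value, starting from 0
(define total (foldl + 0 (range 5)))

;; the built-in add1 can replace a hand-written 1+ function
(define incremented (map add1 (range 5)))

;; for/list is often a readable alternative to map
(define labeled
  (for/list ([n (range 3)])
    (format "item-~a" n)))
```

Here squares is '(0 1 4 9 16), evens is '(0 2 4 6 8), total is 10, incremented is '(1 2 3 4 5), and labeled is '("item-0" "item-1" "item-2").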

Hash Tables

The following listing shows the file misc_code/hash_tests.rkt:

#lang racket

(define h1 (hash "dog" '("friendly" 5) "cat" '("not friendly" 2))) ;; not mutable

(define cat (hash-ref h1 "cat"))

(define h2 (make-hash)) ;; mutable
(hash-set! h2 "snake" '("smooth" 4))

;; make-hash also accepts an optional association list (a list of key/value pairs):
(define h3 (make-hash '(("dog" . ("friendly" 5)) ("cat" . ("not friendly" 2)))))
(hash-set! h3 "snake" '("smooth" 4))
(hash-set! h3 "cat" '("sometimes likeable" 3)) ;; overwrite key value

;; for can be used with hash tables; the list is built with list
;; (not quote) so that k and v are evaluated:

(for ([(k v) h3]) (println (list "key:" k "value:" v)))

Here is a listing of the output window after running this file and then manually evaluating h1, h2, and h3 in the REPL (like all listings in this book, I manually edit the output to fit page width). The iteration order of a hash table is unspecified, so the three key/value lines may appear in a different order:

Welcome to DrRacket, version 8.10 [cs].
Language: racket, with debugging; memory limit: 128 MB.
'("key:" "cat" "value:" ("sometimes likeable" 3))
'("key:" "dog" "value:" ("friendly" 5))
'("key:" "snake" "value:" ("smooth" 4))
> h1
'#hash(("cat" . ("not friendly" 2)) ("dog" . ("friendly" 5)))
> h2
'#hash(("snake" . ("smooth" 4)))
> h3
'#hash(("cat" . ("sometimes likeable" 3))
                ("dog" . ("friendly" 5))
                ("snake" . ("smooth" 4)))
> 
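The listing above covers creation, lookup, and mutation. As a supplementary sketch (the names pets and counts are made up for illustration), the following shows three other operations that are often useful: a lookup with a default value, a functional update of an immutable table, and counting with hash-update!.

```racket
#lang racket

;; Immutable table: hash-set returns a new table, the original is unchanged.
(define pets (hash "dog" '("friendly" 5) "cat" '("not friendly" 2)))
(define pets2 (hash-set pets "snake" '("smooth" 4)))

;; hash-ref takes an optional third argument that is returned when the
;; key is missing (without it, a missing key raises an error).
(define missing (hash-ref pets "bird" #f))

;; Mutable table used as a counter: hash-update! applies add1 to the
;; current value, starting from 0 when the key is not yet present.
(define counts (make-hash))
(for ([word '("the" "cat" "the")])
  (hash-update! counts word add1 0))

;; Sorting the keys gives a stable order for display.
(define sorted-keys (sort (hash-keys pets2) string<?))
```

After running this, (hash-count pets) is still 2, (hash-count pets2) is 3, missing is #f, (hash-ref counts "the") is 2, and sorted-keys is '("cat" "dog" "snake").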

Racket Structure Types

A structure type is like a list that has named elements. When you define a structure the Racket system generates getter and setter functions to access and change structure field values. Racket also generates a constructor function with the structure name. Let’s look at a simple example in a Racket REPL of creating a structure with mutable elements:

> (struct person (name age email) #:mutable)
> (define henry (person "Henry Higgans" 45 "henry@higgans.com"))
> (person-age henry)
45
> (set-person-age! henry 46)
> (person-age henry)
46
> (set-person-email! henry "henryh674373551@gmail.com")
> (person-email henry)
"henryh674373551@gmail.com"
> 

If you don’t add #:mutable to a struct definition, then no set-NAME-ATTRIBUTE! methods are generated.
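When a structure is left immutable, an updated copy can still be made with struct-copy. This is a minimal sketch of that alternative, not something the book examples depend on:

```racket
#lang racket

;; No #:mutable here, so no set-person-...! functions are generated.
(struct person (name age email))

(define henry (person "Henry Higgans" 45 "henry@higgans.com"))

;; struct-copy builds a new instance with the listed fields replaced;
;; the original instance is left untouched.
(define older-henry (struct-copy person henry [age 46]))
```

Here (person-age henry) is still 45 while (person-age older-henry) is 46, and both share the same name and email values.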

Racket also supports object oriented programming style classes with methods. I don’t use classes in the book examples so you, dear reader, can read the official Racket documentation on classes if you want to use Racket in a non-functional way.

Simple HTTP GET and POST Operations

We will be using HTTP GET and POST requests in later chapters for web scraping and accessing remote APIs, such as those for OpenAI GPT-4, Hugging Face, etc. We will see more detail later but for now, you can try a simple example:

#lang racket

(require net/http-easy)
(require html-parsing)
(require net/url xml xml/path)
(require racket/pretty)

(define res-stream
  (get "https://markwatson.com" #:stream? #t))

(define lst
  (html->xexp (response-output res-stream)))

(response-close! res-stream)

(displayln "\nParagraph text:\n")

(pretty-print (take (se-path*/list '(p) lst) 8))

(displayln "\nLI text:\n")

(pretty-print (take (se-path*/list '(li) lst) 8))

The output is:

Paragraph text:

'("My customer list includes: Google, Capital One, Babylist, Olive AI, CompassLabs, Mind AI, Disney, SAIC, Americast, PacBell, CastTV, Lutris Technology, Arctan Group, Sitescout.com, Embed.ly, and Webmind Corporation."
  "I have worked in the fields of general\n"
  "      artificial intelligence, machine learning, semantic web and linked data, and\n"
  "      natural language processing since 1982."
  "My eBooks are available to read for FREE or you can purchase them at "
  (a (@ (href "https://leanpub.com/u/markwatson")) "leanpub")
  "."
  "My standard ")

LI text:

'((@ (class "list f4 f3-ns fw4 dib pr3"))
  "\n"
  "            "
  (& nbsp)
  (& nbsp)
  (a
   (@
    (class "hover-white no-underline white-90")
    (href "https://mark-watson.blogspot.com")
    (target "new")
    (title "Mark's Blog on Blogspot"))
   "\n"
   "              Read my Blog\n"
   "            ")
  "\n"
  "          ")
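To experiment with se-path*/list without a network connection, it can be applied to a literal x-expression. The following sketch uses a small made-up page (some-page is not from the book's code) and only the standard xml/path library. Note that it uses the x-expression attribute style, while html->xexp produces SXML-style (@ ...) attribute lists as seen in the output above.

```racket
#lang racket

(require xml/path)

;; A tiny hand-written page for illustration.
(define some-page
  '(html (body (p ((class "awesome")) "Hey") (p "Bar"))))

;; Contents of every p element, concatenated into one list.
(define all-p-text (se-path*/list '(p) some-page))

;; se-path* returns only the first match.
(define first-p-text (se-path* '(p) some-page))

;; Asking for body returns its child elements.
(define body-children (se-path*/list '(body) some-page))
```

Here all-p-text is '("Hey" "Bar") and first-p-text is "Hey", while body-children contains the two p elements themselves.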

Using Racket ~/.racketrc Initialization File

In my Racket workflow I don’t usually use ~/.racketrc to define initial forms that are automatically loaded when starting the racket command line tool or the DrRacket application. That said I do like to use ~/.racketrc for temporary initialization forms when working on a specific project to increase the velocity of interactive development.

Here is an example use:

$ cat ~/.racketrc
(define (foo-list x)
  (list x x x))
$ racket
Welcome to Racket v8.10 [cs].
> (foo-list 3.14159)
'(3.14159 3.14159 3.14159)
> 

If you have local and public libraries you frequently load you can permanently keep require forms for them in ~/.racketrc but that will slightly slow down the startup times of racket and DrRacket.

Tutorial Wrap Up

The rest of this book consists of example Racket programs that I have written for my own enjoyment and that I hope will also be useful to you, dear reader. Please refer to The Racket Guide at https://docs.racket-lang.org/guide/ for more technical detail on effectively using the Racket language and ecosystem.

Datastores

For my personal research projects the only datastores that I often use are the embedded relational database SQLite and Resource Description Framework (RDF) datastores that might be local to my laptop or public Knowledge Graphs like DBPedia and WikiData. The use of RDF data and the SPARQL query language is part of the fields of the semantic web and linked data.

Accessing Public RDF Knowledge Graphs - a DBPedia Example

I will not cover RDF data and the SPARQL query language in great detail here. Rather, please see the RDF and SPARQL tutorial material in my Common Lisp book: Loving Common Lisp, or the Savvy Programmer’s Secret Weapon.

In the following Racket code example for accessing data on DBPedia using SPARQL, the primary objective is to interact with DBpedia’s SPARQL endpoint to query information regarding a person based on their name or URI. The code is structured into several functions, each encapsulating a specific aspect of the querying process, thereby promoting modular design and ease of maintenance.

Function Definitions:

  • sparql-dbpedia-for-person: This function takes a person-uri as an argument and constructs a SPARQL query to retrieve the comment and website associated with the person. The at-expression form @string-append{...} (enabled by #lang at-exp racket) helps in constructing the SPARQL query string by concatenating the literal text and the person-uri argument.
  • sparql-dbpedia-person-uri: Similar to the above function, this function accepts a person-name argument and constructs a SPARQL query to fetch the URI and comment of the person from DBpedia.
  • sparql-query->hash: This function encapsulates the logic for sending the constructed SPARQL query to the DBpedia endpoint. It takes a query argument, encodes it into a URL format, and sends an HTTP request to the DBpedia SPARQL endpoint. The response, expected in JSON format, is then converted to a Racket expression using string->jsexpr.
  • json->listvals: This function is designed to transform the JSON expression obtained from the SPARQL endpoint into a more manageable list of values. It processes the hash data structure, extracting the relevant bindings and converting them into a list format.
  • gd (Data Processing Function): This function processes the data structure obtained from json->listvals. It defines four inner functions gg1, gg2, gg3, and gg4, each designed to handle a specific number of variables returned in the SPARQL query result. It uses a case statement to determine which inner function to call based on the length of the data.
  • sparql-dbpedia: This is the entry function which accepts a sparql argument, invokes sparql-query->hash to execute the SPARQL query, and then calls gd to process the resulting data structure.

Usage Flow:

The typical flow would be to call sparql-dbpedia-person-uri with a person’s name to obtain the person’s URI and comment from DBpedia. Following that, sparql-dbpedia-for-person can be invoked with the obtained URI to fetch more detailed information like websites associated with the person. The results from these queries are then processed through sparql-query->hash, json->listvals, and gd to transform the raw JSON response into a structured list format, making it easier to work with within the Racket environment.

#lang at-exp racket

(provide sparql-dbpedia-person-uri)
(provide sparql-query->hash)
(provide json->listvals)
(provide sparql-dbpedia)

(require net/url)
(require net/uri-codec)
(require json)
(require racket/pretty)

(define (sparql-dbpedia-for-person person-uri)
  @string-append{
     SELECT
      (GROUP_CONCAT(DISTINCT ?website; SEPARATOR="  |  ")
                                   AS ?website) ?comment {
      OPTIONAL {
       @person-uri
       <http://www.w3.org/2000/01/rdf-schema#comment>
       ?comment . FILTER (lang(?comment) = 'en')
      } .
      OPTIONAL {
       @person-uri
       <http://dbpedia.org/ontology/wikiPageExternalLink>
       ?website
        . FILTER( !regex(str(?website), "dbpedia", "i"))
      }
     } LIMIT 4})

(define (sparql-dbpedia-person-uri person-name)
  @string-append{
    SELECT DISTINCT ?personuri ?comment {
      ?personuri
        <http://xmlns.com/foaf/0.1/name>
        "@person-name"@"@"en .
      ?personuri
        <http://www.w3.org/2000/01/rdf-schema#comment>
        ?comment .
             FILTER  (lang(?comment) = 'en') .
}})


;; Send the query to the DBpedia endpoint and parse the JSON response.
(define (sparql-query->hash query)
  (call/input-url
   (string->url
    (string-append
     "https://dbpedia.org/sparql?query="
     (uri-encode query)))
   get-pure-port
   (lambda (port)
     (string->jsexpr (port->string port)))
   '("Accept: application/json")))

;; Convert the parsed JSON into one list per query variable, where
;; each element is a (variable-name value) pair.
(define (json->listvals a-hash)
  ;; Look up the head and results sections by key rather than
  ;; relying on the unspecified order returned by hash->list.
  (let* ((vars (hash-ref (hash-ref a-hash 'head) 'vars))
         (b (hash-ref (hash-ref a-hash 'results) 'bindings)))
    (for/list
        ([var vars])
      (for/list ([bc b])
        (let ((a-value
               (hash-ref
                (hash-ref
                 bc
                 (string->symbol var)) 'value)))
          (list var a-value))))))


;; Regroup the per-variable lists into one list per result row.
(define gd (lambda (data)

    (let ((jd (json->listvals data)))

      (define gg1
        (lambda (jd) (map list (car jd))))
      (define gg2
        (lambda (jd) (map list (car jd) (cadr jd))))
      (define gg3
        (lambda (jd)
          (map list (car jd) (cadr jd) (caddr jd))))
      (define gg4
        (lambda (jd)
          (map list
               (car jd) (cadr jd)
               (caddr jd) (cadddr jd))))

      ;; jd is computed once above and reused here
      (case (length jd)
        [(1) (gg1 jd)]
        [(2) (gg2 jd)]
        [(3) (gg3 jd)]
        [(4) (gg4 jd)]
        [else
         (error "sparql queries with 1 to 4 vars")]))))


(define sparql-dbpedia
  (lambda (sparql)
    (gd (sparql-query->hash sparql))))

;; (sparql-dbpedia (sparql-dbpedia-person-uri "Steve Jobs"))

Let’s try an example in a Racket REPL:

> (sparql-dbpedia (sparql-dbpedia-person-uri "Steve Jobs"))
'((("personuri" "http://dbpedia.org/resource/Steve_Jobs")
   ("comment"
    "Steven Paul Jobs (February 24, 1955 – October 5, 2011) was an American entrepreneur, industrial designer, media proprietor, and investor. He was the co-founder, chairman, and CEO of Apple; the chairman and majority shareholder of Pixar; a member of The Walt Disney Company's board of directors following its acquisition of Pixar; and the founder, chairman, and CEO of NeXT. He is widely recognized as a pioneer of the personal computer revolution of the 1970s and 1980s, along with his early business partner and fellow Apple co-founder Steve Wozniak."))
  (("personuri" "http://dbpedia.org/resource/Steve_Jobs_(film)")
   ("comment"
    "Steve Jobs is a 2015 biographical drama film directed by Danny Boyle and written by Aaron Sorkin. A British-American co-production, it was adapted from the 2011 biography by Walter Isaacson and interviews conducted by Sorkin, and covers 14 years (1984–1998) in the life of Apple Inc. co-founder Steve Jobs. Jobs is portrayed by Michael Fassbender, with Kate Winslet as Joanna Hoffman and Seth Rogen, Katherine Waterston, Michael Stuhlbarg, and Jeff Daniels in supporting roles."))
  (("personuri" "http://dbpedia.org/resource/Steve_Jobs_(book)")
   ("comment"
    "Steve Jobs is the authorized self-titled biography of American business magnate and Apple co-founder Steve Jobs. The book was written at the request of Jobs by Walter Isaacson, a former executive at CNN and TIME who has written best-selling biographies of Benjamin Franklin and Albert Einstein. The book was released on October 24, 2011, by Simon & Schuster in the United States, 19 days after Jobs's death. A film adaptation written by Aaron Sorkin and directed by Danny Boyle, with Michael Fassbender starring in the title role, was released on October 9, 2015.")))

In practice, I start exploring data on DBPedia using the SPARQL query web app https://dbpedia.org/sparql. I experiment with different SPARQL queries for whatever application I am working on and then embed those queries in programs written in Racket, Common Lisp, Clojure, and the other programming languages I use.

In addition to using DBPedia I often also use the WikiData public Knowledge Graph and local RDF data stores hosted on my laptop with Apache Jena. I might add examples for these two use cases in future versions of this live eBook.

Sqlite

Using SQLite in Racket is simple so we will just look at a simple example. We will be using the Racket source file sqlite.rkt in the directory Racket-AI-book-code/misc_code for the code snippets in this REPL:

$ racket
Welcome to Racket v8.10 [cs].
> (require db)
> (require sqlite-table)
> (define db-file "test.db")
> (define db (sqlite3-connect #:database db-file #:mode 'create))
> (query-exec db
     "create temporary table the_numbers (n integer, d varchar(20))")
> (query-exec db
     "create  table person (name varchar(30), age integer, email varchar(20))")
> (query-exec db
     "insert into person values ('Mary', 34, 'mary@test.com')")
> (query-rows db "select * from person")
'(#("Mary" 34 "mary@test.com"))
> 

Here we see how to interact with a SQLite database using the db and sqlite-table libraries in Racket. The sqlite3-connect function is used to connect to the SQLite database specified by the string value of db-file. The #:mode 'create keyword argument indicates that a new database should be created if it doesn’t already exist. The database connection object is bound to the identifier db.

The first query-exec call creates a temporary table named the_numbers, which is not used further in this example. The second query-exec call creates a permanent table named person with three columns: name of type varchar(30), age of type integer, and email of type varchar(20). The next query-exec call inserts a new row into the person table with the values ‘Mary’, 34, and ‘mary@test.com’. There is a function query that we don’t use here that returns the types of the columns returned by a query. We use the alternative function query-rows that only returns the query results with no type information.
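The example above writes the values directly into the SQL text. As a supplementary sketch, the db library also accepts query parameters, which avoids building SQL strings by hand. The names conn and add-person! are made up for this illustration, and an in-memory database is used so that no file is left behind.

```racket
#lang racket

(require db)

;; 'memory creates a private in-memory database for this connection.
(define conn (sqlite3-connect #:database 'memory))

(query-exec conn
  "create table person (name varchar(30), age integer, email varchar(20))")

;; The ? placeholders are filled from the extra arguments.
(define (add-person! name age email)
  (query-exec conn "insert into person values (?, ?, ?)" name age email))

(add-person! "Mary" 34 "mary@test.com")
(add-person! "Bob" 27 "bob@test.com")

;; query-rows returns a list of vectors, one vector per row.
(define over-30
  (query-rows conn "select name, age from person where age > ?" 30))

;; query-value is convenient when a query returns exactly one value.
(define person-count
  (query-value conn "select count(*) from person"))

(disconnect conn)
```

Here over-30 holds a single row for Mary and person-count is 2.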

Implementing a Simple RDF Datastore With Partial SPARQL Support in Racket

This chapter explains a Racket implementation of a simple RDF (Resource Description Framework) datastore with partial SPARQL (SPARQL Protocol and RDF Query Language) support. We’ll cover the core RDF data structures, query parsing and execution, helper functions, and the main function with example queries. The file rdf_sparql.rkt can be found online at https://github.com/mark-watson/Racket-AI-book/source-code/simple_RDF_SPARQL.

Before looking at the code we look at sample use and output. The function test demonstrates the usage of the RDF datastore and SPARQL query execution:

(define (test)
  (set! rdf-store '())

  (add-triple "John" "age" "30")
  (add-triple "John" "likes" "pizza")
  (add-triple "Mary" "age" "25")
  (add-triple "Mary" "likes" "sushi")
  (add-triple "Bob" "age" "35")
  (add-triple "Bob" "likes" "burger")

  (print-all-triples)

  (define (print-query-results query-string)
    (printf "Query: ~a\n" query-string)
    (let ([results (execute-sparql-query query-string)])
      (printf "Final Results:\n")
      (if (null? results)
          (printf "  No results\n")
          (for ([result results])
            (printf "  ~a\n"
                    (string-join
                     (map (lambda (pair)
                            (format "~a: ~a" (car pair) (cdr pair)))
                          result)
                     ", "))))
      (printf "\n")))

  (print-query-results "select * where { ?name age ?age . ?name likes ?food }")
  (print-query-results "select ?s ?o where { ?s likes ?o }")
  (print-query-results "select * where { ?name age ?age . ?name likes pizza }"))

This function test:

  1. Initializes the RDF store with sample data.
  2. Prints all triples in the datastore.
  3. Defines a print-query-results function to execute and display query results.
  4. Executes three example SPARQL queries:
    • Query all name-age-food combinations.
    • Query all subject-object pairs for the “likes” predicate.
    • Query all people who like pizza and their ages.

Function test generates this output:

All triples in the datastore:
Bob likes burger
Bob age 35
Mary likes sushi
Mary age 25
John likes pizza
John age 30

Query: select * where { ?name age ?age . ?name likes ?food }
Final Results:
  ?age: 35, ?name: Bob, ?food: burger
  ?age: 25, ?name: Mary, ?food: sushi
  ?age: 30, ?name: John, ?food: pizza

Query: select ?s ?o where { ?s likes ?o }
Final Results:
  ?s: Bob, ?o: burger
  ?s: Mary, ?o: sushi
  ?s: John, ?o: pizza

Query: select * where { ?name age ?age . ?name likes pizza }
Final Results:
  ?age: 30, ?name: John

1. Core RDF Data Structures and Basic Operations

There are two parts to this example in file rdf_sparql.rkt: a simple unindexed RDF datastore and a partial SPARQL query implementation that supports compound where clause matches like: select * where { ?name age ?age . ?name likes pizza }.

1.1 RDF Triple Structure

The foundation of our RDF datastore is the triple structure:

(struct triple (subject predicate object) #:transparent)

This structure represents an RDF triple, consisting of a subject, predicate, and object. The #:transparent keyword makes the structure’s fields accessible for easier debugging and printing.
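One practical effect of #:transparent, shown in this small sketch, is that equal? compares transparent structures field by field. The opaque-triple type is a hypothetical counterpart defined only for comparison.

```racket
#lang racket

(struct triple (subject predicate object) #:transparent)

;; Hypothetical opaque version, defined only to show the difference.
(struct opaque-triple (subject predicate object))

(define t1 (triple "John" "likes" "pizza"))
(define t2 (triple "John" "likes" "pizza"))

;; Transparent structures with equal fields are equal?.
(define transparent-equal? (equal? t1 t2))

;; Opaque structures are only equal? to themselves.
(define opaque-equal?
  (equal? (opaque-triple "John" "likes" "pizza")
          (opaque-triple "John" "likes" "pizza")))
```

Here transparent-equal? is #t and opaque-equal? is #f.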

1.2 RDF Datastore

The RDF datastore is implemented as a simple list:

(define rdf-store '())

1.3 Basic Operations

Two fundamental operations are defined for the datastore:

  1. Adding a triple:

(define (add-triple subject predicate object)
  (set! rdf-store (cons (triple subject predicate object) rdf-store)))

  2. Removing a triple:

(define (remove-triple subject predicate object)
  (set! rdf-store
        (filter (lambda (t)
                  (not (and (equal? (triple-subject t) subject)
                            (equal? (triple-predicate t) predicate)
                            (equal? (triple-object t) object))))
                rdf-store)))

2. Query Parsing and Execution

2.1 SPARQL Query Structure

A simple SPARQL query is represented by the sparql-query structure:

(struct sparql-query (select-vars where-patterns) #:transparent)

2.2 Query Parsing

The parse-sparql-query function takes a query string and converts it into a sparql-query structure:

(define (parse-sparql-query query-string)
  (define tokens (filter (lambda (token) (not (member token '("{" "}") string=?)))
                         (split-string query-string)))
  (define select-index (index-of tokens "select" string-ci=?))
  (define where-index (index-of tokens "where" string-ci=?))
  (define (sublist lst start end)
    (take (drop lst start) (- end start)))
  (define select-vars (sublist tokens (add1 select-index) where-index))
  (define where-clause (drop tokens (add1 where-index)))
  (define where-patterns (parse-where-patterns where-clause))
  (sparql-query select-vars where-patterns))
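The helpers split-string and parse-where-patterns are not shown in this chapter. The following is a hypothetical sketch of what they might do, assuming that tokens are separated by whitespace and that triple patterns in the where clause are separated by a "." token. The names tokenize-sketch and parse-where-patterns-sketch are made up here, and the real definitions in rdf_sparql.rkt may differ.

```racket
#lang racket

;; Hypothetical stand-in for the tokenizing step: split on whitespace
;; and drop the brace tokens.
(define (tokenize-sketch query-string)
  (filter (lambda (token) (not (member token '("{" "}"))))
          (string-split query-string)))

;; Hypothetical stand-in for parse-where-patterns: group the tokens
;; into patterns, starting a new pattern at every "." token.
(define (parse-where-patterns-sketch tokens)
  (let loop ([tokens tokens] [current '()] [patterns '()])
    (define (flush)
      (if (null? current)
          patterns
          (cons (reverse current) patterns)))
    (cond
      [(null? tokens) (reverse (flush))]
      [(string=? (car tokens) ".") (loop (cdr tokens) '() (flush))]
      [else (loop (cdr tokens) (cons (car tokens) current) patterns)])))

(define tokens
  (tokenize-sketch
   "select * where { ?name age ?age . ?name likes pizza }"))

;; Everything after the "where" token belongs to the where clause.
(define patterns
  (parse-where-patterns-sketch (cdr (member "where" tokens))))
```

Here patterns is '(("?name" "age" "?age") ("?name" "likes" "pizza")), a list of three-element patterns that a query executor can match against stored triples.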

2.3 Query Execution

The main query execution function is execute-sparql-query:

(define (execute-sparql-query query-string)
  (let* ([query (parse-sparql-query query-string)]
         [where-patterns (sparql-query-where-patterns query)]
         [select-vars (sparql-query-select-vars query)]
         [results (execute-where-patterns where-patterns)]
         [projected-results (project-results results select-vars)])
    projected-results))

This function parses the query, executes the WHERE patterns, and projects the results based on the SELECT variables.

3. Helper Functions and Utilities

Several helper functions are implemented to support query execution:

  1. variable?: Checks if a string is a SPARQL variable (starts with ‘?’).
  2. triple-to-binding: Converts a triple to a binding based on a pattern.
  3. query-triples: Filters triples based on a given pattern.
  4. apply-bindings: Applies bindings to a pattern.
  5. merge-bindings: Merges two sets of bindings.
  6. project-results: Projects the final results based on the SELECT variables.

(define (variable? str)
  (and (string? str) (> (string-length str) 0) (char=? (string-ref str 0) #\?)))

(define (triple-to-binding t [pattern #f])
  (define binding '())
  (when (and pattern (variable? (first pattern)))
    (set! binding (cons (cons (first pattern) (triple-subject t)) binding)))
  (when (and pattern (variable? (second pattern)))
    (set! binding (cons (cons (second pattern) (triple-predicate t)) binding)))
  (when (and pattern (variable? (third pattern)))
    (set! binding (cons (cons (third pattern) (triple-object t)) binding)))
  binding)

(define (query-triples subject predicate object)
  (filter
   (lambda (t)
    (and
      (or (not subject) (variable? subject) (equal? (triple-subject t) subject))
      (or (not predicate) (variable? predicate) (equal? (triple-predicate t) predicate))
      (or (not object) (variable? object) (equal? (triple-object t) object))))
   rdf-store))

(define (apply-bindings pattern bindings)
  (map (lambda (item)
         (if (variable? item)
             (or (dict-ref bindings item #f) item)
             item))
       pattern))

(define (merge-bindings binding1 binding2)
  (append binding1 binding2))

(define (project-results results select-vars)
  (if (equal? select-vars '("*"))
      (map remove-duplicate-bindings results)
      (map (lambda (result)
             (remove-duplicate-bindings
              (map (lambda (var)
                     (cons var (dict-ref result var #f)))
                   select-vars)))
           results)))
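The function execute-where-patterns is not listed in this chapter. The following self-contained sketch shows one way that compound where clauses could be evaluated: each pattern is matched against the store after substituting the variables bound so far, and the new bindings are appended. The names store, substitute, match-triple, and execute-where-patterns-sketch are made up for this illustration and are not the book's implementation. For simplicity the sketch does not handle a variable that appears twice within a single pattern.

```racket
#lang racket

(struct triple (subject predicate object) #:transparent)

;; A small fixed store for the sketch (the book's code uses the
;; mutable rdf-store list instead).
(define store
  (list (triple "John" "age" "30")
        (triple "John" "likes" "pizza")
        (triple "Mary" "age" "25")
        (triple "Mary" "likes" "sushi")))

(define (variable? s)
  (and (string? s) (> (string-length s) 0)
       (char=? (string-ref s 0) #\?)))

;; Replace variables that already have a binding with their values.
(define (substitute pattern bindings)
  (for/list ([item pattern])
    (if (variable? item) (dict-ref bindings item item) item)))

;; Match one pattern against one triple: returns the new bindings
;; (possibly empty) on success and #f on failure.
(define (match-triple pattern t)
  (define fields
    (list (triple-subject t) (triple-predicate t) (triple-object t)))
  (and (for/and ([item pattern] [field fields])
         (or (variable? item) (equal? item field)))
       (for/list ([item pattern] [field fields]
                  #:when (variable? item))
         (cons item field))))

;; Start with one empty solution and extend it pattern by pattern.
(define (execute-where-patterns-sketch patterns)
  (for/fold ([solutions (list '())])
            ([pattern patterns])
    (for*/list ([bindings solutions]
                [t store]
                [new (in-value
                      (match-triple (substitute pattern bindings) t))]
                #:when new)
      (append bindings new))))

(define results
  (execute-where-patterns-sketch
   '(("?name" "age" "?age") ("?name" "likes" "pizza"))))
```

Here results is '((("?name" . "John") ("?age" . "30"))), which agrees with the third example query in the test output shown earlier: only John likes pizza.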

Conclusion

This implementation provides a basic framework for an RDF datastore with partial SPARQL support in Racket. While it lacks many features of a full-fledged RDF database and SPARQL engine, it demonstrates the core concepts and can serve as a starting point for more complex implementations. The code is simple and can be fun to experiment with.

Web Scraping

I often write software to automatically collect and use data from the web and other sources. As a practical matter, much of the data that many people use for machine learning comes from either the web or from internal data sources. This section provides some guidance and examples for getting text data from the web.

Before we start a technical discussion about web scraping I want to point out that much of the information on the web is copyrighted, so first you should read the terms of service for web sites to ensure that your use of “scraped” or “spidered” data conforms with the wishes of the persons or organizations who own the content and pay to run scraped web sites.

We start with low-level Racket code examples in the GitHub repository for this book in the directory Racket-AI-book-code/misc_code. We will then implement a standalone library in the directory Racket-AI-book-code/webscrape.

Getting Started Web Scraping

All of the examples in this section can be found in the Racket code snippet files in the directory Racket-AI-book-code/misc_code.

I have edited the output for brevity in the following REPL output:

$ racket
Welcome to Racket v8.10 [cs].
> (require net/http-easy)
> (require html-parsing)
> (require net/url xml xml/path)
> (require racket/pretty)
> (define res-stream
    (get "https://markwatson.com" #:stream? #t))
> res-stream
#<response>
> (define lst
    (html->xexp (response-output res-stream)))
> lst
'(*TOP*
  (*DECL* DOCTYPE html)
  "\n"
  (html
   (@ (lang "en-us"))
   "\n"
   "  "
   (head
    (title
     "Mark Watson: AI Practitioner and Author of 20+ AI Books | Mark Watson")
  ...

Different element types are html, head, p, h1, h2, etc. If you are familiar with XPath operations for XML data, then the function se-path*/list will make more sense to you. The function se-path*/list takes a path, given as a list of element types, and recursively searches an input s-expression for every sublist that matches the path, returning the contents of the matches. In the following example we extract all elements of type p:

> (se-path*/list '(p) lst) ;; get text from all p elements
'("My customer list includes: Google, Capital One, Babylist, Olive AI, CompassLabs, Mind AI, Disney, SAIC, Americast, PacBell, CastTV, Lutris Technology, Arctan Group, Sitescout.com, Embed.ly, and Webmind Corporation."
  "I have worked in the fields of general\n"
  "   artificial intelligence, machine learning, semantic web and linked data, and\n"
  "      natural language processing since 1982."
  "My eBooks are available to read for FREE or you can   purchase them at "
  (a (@ (href "https://leanpub.com/u/markwatson")) "leanpub")
  ...

> (define lst-p (se-path*/list '(p) lst))
> (filter (lambda (s) (string? s)) lst-p) ;; keep only text strings
'("My customer list includes: Google, Capital One, Babylist, Olive AI, CompassLabs, Mind AI, Disney, SAIC, Americast, PacBell, CastTV, Lutris Technology, Arctan Group, Sitescout.com, Embed.ly, and Webmind Corporation."

#<procedure:string-normalize-spaces>
> (string-normalize-spaces
   (string-join
    (filter (lambda (s) (string? s)) lst-p)
    "\n"))
"My customer list includes: Google, Capital One, Babylist, Olive AI, CompassLabs, Mind AI, Disney, SAIC, Americast, PacBell, CastTV, Lutris Technology, Arctan Group, Sitescout.com, Embed.ly, and Webmind Corporation. I have worked in the fields of general artificial intelligence, machine learning, semantic web and linked data, and natural language processing since 1982.
  ...
"

Now we will extract HTML anchor links:

> (se-path*/list '(href) lst) ;; get all links from HTML as a list
'("/index.css"
  "https://mark-watson.blogspot.com"
  "#fun"
  "#books"
  "#opensource"
  "https://markwatson.com/privacy.html"
  "https://leanpub.com/u/markwatson"
  "/nda.txt"
  "https://mastodon.social/@mark_watson"
  "https://twitter.com/mark_l_watson"
  "https://github.com/mark-watson"
  "https://www.linkedin.com/in/marklwatson/"
  "https://markwatson.com/index.rdf"
  "https://www.wikidata.org/wiki/Q18670263"
 ...
)
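
The two extraction calls above can also be tried without a network connection. The following is a minimal sketch that runs se-path*/list on a small hand-written s-expression shaped like the output of html->xexp; the sample page and the names sample, p-text, and links are invented for illustration.

```racket
#lang racket
(require xml/path)

;; Hand-written sample in the same shape that html->xexp produces
;; (attributes appear in an (@ ...) sublist).
(define sample
  '(*TOP*
    (html
     (body
      (p "First paragraph.")
      (p "Second paragraph with a "
         (a (@ (href "https://example.com")) "link")
         ".")
      (a (@ (href "/local.html")) "local")))))

;; Contents of every p element: strings mixed with nested elements.
(define p-items (se-path*/list '(p) sample))

;; Keep only the top-level text strings of the p elements.
(define p-text (filter string? p-items))

;; Every href value found anywhere in the page.
(define links (se-path*/list '(href) sample))
```

Note that filtering with string? drops the nested a element, so the link text "link" is not part of p-text. This is the same behavior you see in the REPL output above.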

Implementation of a Racket Web Scraping Library

The web scraping library listed below can be found in the directory Racket-AI-book-code/webscrape. The following listing of webscrape.rkt should look familiar after reading the code snippets in the last section.

The provided Racket Scheme code defines three functions to interact with and process web resources: web-uri->xexp, web-uri->text, and web-uri->links.

web-uri->xexp:

- Requires the libraries net/http-easy and html-parsing, plus net/url, xml, and xml/path.
- Given a URI (a-uri), it creates a stream (a-stream) using the get function from the net/http-easy library to fetch the contents of the URI.
- Converts the HTML content of the URI to an S-expression (xexp) using the html->xexp function from the html-parsing library.
- Closes the response stream using response-close! and returns the xexp.

web-uri->text:

- Calls web-uri->xexp to convert the URI content to an xexp.
- Utilizes se-path*/list from the xml/path library to extract all paragraph elements (p) from the xexp.
- Filters the paragraph elements to retain only strings (excluding nested tags or other structures).
- Joins these strings with a newline separator and normalizes spaces using string-normalize-spaces from the racket/string library (included in #lang racket).

web-uri->links:

- Similar to web-uri->text, it starts by converting URI content to an xexp.
- Utilizes se-path*/list to extract all href attributes from the xexp.
- Filters these href attributes to retain only those that are external links (those beginning with “http”), using string-prefix? from the srfi/13 library.

In summary, these functions collectively enable the extraction and processing of HTML content from a specified URI, converting HTML to a more manageable S-expression format, and then extracting text and links as required.

#lang racket

(require net/http-easy)
(require html-parsing)
(require net/url xml xml/path)
(require srfi/13) ;; for string-prefix? in (string-prefix? prefix string) order

;; export the three public functions so other modules can require them
(provide web-uri->xexp web-uri->text web-uri->links)

;; fetch a-uri and parse the HTML response body into an s-expression
(define (web-uri->xexp a-uri)
  (let* ((a-stream
          (get a-uri #:stream? #t))
         (lst (html->xexp (response-output a-stream))))
    (response-close! a-stream)
    lst))

;; return the top-level text of all p elements as one normalized string
(define (web-uri->text a-uri)
  (let* ((a-xexp
          (web-uri->xexp a-uri))
         (p-elements (se-path*/list '(p) a-xexp))
         (lst-strings
          (filter
           (lambda (s) (string? s))
           p-elements)))
    (string-normalize-spaces
     (string-join lst-strings "\n"))))

;; return all href values that start with "http"
(define (web-uri->links a-uri)
  (let* ((a-xexp
          (web-uri->xexp a-uri)))
    ;; we want only external links so filter out local links:
    (filter
     (lambda (s) (string-prefix? "http" s))
     (se-path*/list '(href) a-xexp))))
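
One detail worth a small experiment is the argument order of string-prefix?. The srfi/13 version takes the prefix first, while the racket/string version takes the string first. The sketch below loads both under prefixes (srfi: and rkt: are names chosen here for illustration) and applies the same external-link filter to an invented list of links.

```racket
#lang racket/base
(require (prefix-in srfi: srfi/13)
         (prefix-in rkt: racket/string))

;; Invented sample of href values, mixing local and external links.
(define links
  '("/index.css" "#books" "https://leanpub.com/u/markwatson" "http://kbsportal.com"))

;; srfi/13 order: (string-prefix? prefix string)
(define external-srfi
  (filter (lambda (s) (srfi:string-prefix? "http" s)) links))

;; racket/string order: (string-prefix? string prefix)
(define external-rkt
  (filter (lambda (s) (rkt:string-prefix? s "http")) links))
```

Both filters keep only the two links that start with "http". If you drop the srfi/13 require from a module, remember to swap the arguments.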

Here are a few examples in a Racket REPL (most output omitted for brevity):

> (web-uri->xexp "https://knowledgebooks.com")
'(*TOP*
  (*DECL* DOCTYPE html)
  "\n"
  (html
   "\n"
   "\n"
   (head
    "\n"
    "    "
    (title "KnowledgeBooks.com - research on the Knowledge Management, and the Semantic Web ")
    "\n"
 ...

> (web-uri->text "https://knowledgebooks.com")
"With the experience of working on Machine Learning and Knowledge Graph applications
 ...

> (web-uri->links "https://knowledgebooks.com")
'("http://markwatson.com"
  "https://oceanprotocol.com/"
  "https://commoncrawl.org/"
  "http://markwatson.com/consulting/"
  "http://kbsportal.com")

If you want to install this library on your laptop using linking (so that requiring the library follows a link to the source code in the directory Racket-AI-book-code/webscrape), run the following command in the library source directory Racket-AI-book-code/webscrape:

raco pkg install --scope user

Using the OpenAI, Anthropic, Mistral, and Local Hugging Face Large Language Model APIs in Racket

As I write the first version of this chapter in October 2023, Peter Norvig and Blaise Agüera y Arcas just wrote an article Artificial General Intelligence Is Already Here making the case that we might already have Artificial General Intelligence (AGI) because of the capabilities of Large Language Models (LLMs) to solve new tasks.

In the development of practical AI systems, LLMs like those provided by OpenAI, Anthropic, and Hugging Face have emerged as pivotal tools for numerous applications including natural language processing, generation, and understanding. These models, powered by deep learning architectures, encapsulate a wealth of knowledge and computational capabilities. As a Racket Scheme enthusiast embarking on the journey of intertwining the elegance of Racket with the power of these modern language models, you are opening a gateway to a realm of possibilities that we begin to explore here.

The OpenAI and Anthropic APIs serve as gateways to some of the most advanced language models available today. By accessing these APIs, developers can harness the power of these models for a variety of applications. Here, we delve deeper into the distinctive features and capabilities that these APIs offer, which could be harnessed through a Racket interface.

OpenAI provides an API for developers to access models like GPT-4. The OpenAI API is designed with simplicity and ease of use in mind, making it a favorable choice for developers. It provides endpoints for different types of interactions, be it text completion, translation, or semantic search among others. We will use the chat completions API in this chapter. The robustness and versatility of the OpenAI API make it a valuable asset for anyone looking to integrate advanced language understanding and generation capabilities into their applications.

On the other hand, Anthropic is a newer entrant in the field but with a strong emphasis on building models that are not only powerful but also understandable and steerable. The Anthropic API serves as a portal to access their language models. While the detailed offerings and capabilities might evolve, the core ethos of Anthropic is to provide models that developers can interact with in a more intuitive and controlled manner. This aligns with a growing desire within the AI community for models that are not black boxes, but instead, offer a level of interpretability and control that makes them safer and more reliable to use in different contexts. We will use the Anthropic completion API.

What if you want the total control of running open LLMs on your own computers? The company Hugging Face maintains a huge repository of pre-trained models. Some of these models are licensed for research only but many are licensed (e.g., using Apache 2) for any commercial use. Many of the Hugging Face models are derived from Meta and other companies. We will use the llama.cpp server at the end of this chapter to run our own LLM on a laptop and access it via Racket code.

Lastly, this chapter will delve into practical examples showing the synergy between systems developed in Racket and the LLMs. Whether it’s automating creative writing, conducting semantic analysis, or building intelligent chatbots, the fusion of Racket with OpenAI, Anthropic, and Hugging Face’s LLMs provides many opportunities for you, dear reader, to write innovative software that utilizes the power of LLMs.

Introduction to Large Language Models

Large Language Models (LLMs) represent a huge advance in the evolution of artificial intelligence, particularly in the domain of natural language processing (NLP). They are trained on vast corpora of text data, learning to predict subsequent words in a sequence, which imbues them with the ability to generate human-like text, comprehend the semantics of language, and perform a variety of language-related tasks. The architecture of these models, typically based on deep learning paradigms such as Transformer, empowers them to encapsulate intricate patterns and relationships within language. These models are trained utilizing substantial computational resources.

The utility of LLMs extends across a broad spectrum of applications including but not limited to text generation, translation, summarization, question answering, and sentiment analysis. Their ability to understand and process natural language makes them indispensable tools in modern AI-driven solutions. However, with great power comes great responsibility. The deployment of LLMs raises imperative considerations regarding ethics, bias, and the potential for misuse. Moreover, the black-box nature of these models presents challenges in interpretability and control, which are active areas of research in the quest to make LLMs more understandable and safe. The advent of LLMs has undeniably propelled the field of NLP to new heights, yet the journey towards fully responsible and transparent utilization of these powerful models is an ongoing endeavor. I recommend reading material at Center for Humane Technology for issues of the safe use of AI. You might also be interested in a book I wrote in April 2023 Safe For Humans AI: A “humans-first” approach to designing and building AI systems (link for reading my book free online).

Using the OpenAI APIs in Racket

We will now have some fun using Racket Scheme and OpenAI’s APIs. The combination of Racket’s language features and programming environment with OpenAI’s linguistic models opens up many possibilities for developing sophisticated AI-driven applications.

Our goal is straightforward interaction with OpenAI’s APIs. The communication between your Racket code and OpenAI’s models is orchestrated through well-defined API requests and responses, allowing for a seamless exchange of data. The following sections will show the technical aspects of interfacing Racket with OpenAI’s APIs, showcasing how requests are formulated, transmitted, and how the JSON responses are handled. Whether your goal is to automate content generation, perform semantic analysis on text data, or build intelligent systems capable of engaging in natural language interactions, the code snippets and explanations provided will serve as a valuable resource in understanding and leveraging the power of AI through Racket and OpenAI’s APIs.

The Racket code listed below defines a helper function, helper-openai, and three public functions, question-openai, completion-openai, and embeddings-openai, that interact with the OpenAI API. The first two use the GPT-4o model for text generation. The function helper-openai accepts prefix and prompt arguments and constructs a JSON payload following OpenAI’s chat models schema: the value of prompt-data is a string containing a single user message made of the prefix followed by the provided prompt. The auth lambda function within helper-openai sets the headers needed for the HTTP request, including the authorization header populated with the OpenAI API key obtained from the environment variable OPENAI_API_KEY. The function post from the net/http-easy library issues a POST request to the OpenAI API endpoint “https://api.openai.com/v1/chat/completions” with the crafted JSON payload and authentication headers. The response from the API is then parsed as JSON, and the content of the message from the first choice is extracted and returned. The function question-openai calls helper-openai with the prefix “Answer the question: ”.

The function completion-openai, on the other hand, serves the specific use case of continuing text from a given prompt. It calls helper-openai with the prefix “Continue writing from the following text: ”, so developers can request text extensions from the OpenAI API by providing only the initial text. The function embeddings-openai follows the same pattern, but it posts the input text to the endpoint “https://api.openai.com/v1/embeddings” using the text-embedding-ada-002 model and returns the embedding from the first element of the data field in the response.

This example was updated May 13, 2024 when OpenAI released the new GPT-4o model.

#lang racket

(require net/http-easy)
(require racket/set)
(require racket/pretty)

(provide question-openai completion-openai embeddings-openai)

;; add the API key and content type headers to each request
(define (auth-openai uri headers params)
  (values
   (hash-set*
    headers
    'authorization
    (string-append "Bearer " (getenv "OPENAI_API_KEY"))
    'content-type "application/json")
   params))

;; send one user message (prefix followed by prompt) to the chat
;; completions endpoint and return the text of the first choice.
;; Note: prompt is spliced into the JSON text as-is, so it must not
;; contain unescaped double quotes or newlines.
(define (helper-openai prefix prompt)
  (let* ((prompt-data
          (string-append
           "{\"messages\": [ {\"role\": \"user\","
           " \"content\": \"" prefix
           prompt
           "\"}], \"model\": \"gpt-4o\"}"))
         (p
          (post
           "https://api.openai.com/v1/chat/completions"
           #:auth auth-openai
           #:data prompt-data))
         (r (response-json p)))
    ;;(pretty-print r)
    (hash-ref
     (hash-ref (first (hash-ref r 'choices)) 'message)
     'content)))

(define (question-openai prompt)
  (helper-openai "Answer the question: " prompt))

(define (completion-openai prompt)
  (helper-openai "Continue writing from the following text: "
    prompt))

;; return the embedding (a list of numbers) for text
(define (embeddings-openai text)
  (let* ((prompt-data
          (string-append
           "{\"input\": \"" text "\","
           " \"model\": \"text-embedding-ada-002\"}"))
         (p
          (post
           "https://api.openai.com/v1/embeddings"
           #:auth auth-openai
           #:data prompt-data))
         (r (response-json p)))
    (hash-ref
     (first (hash-ref r 'data))
     'embedding)))
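
The listing above builds the request body by concatenating strings, so a prompt that contains a double quote or a newline would produce invalid JSON. One possible alternative, shown here only as a sketch, is to build the body as a Racket hash and serialize it with jsexpr->string from the json library. The function name chat-payload is hypothetical and is not part of the book's library.

```racket
#lang racket/base
(require json)

;; Hypothetical helper: build the chat request body as a JSON string.
;; jsexpr->string escapes quotes and newlines inside the prompt.
(define (chat-payload prefix prompt [model "gpt-4o"])
  (jsexpr->string
   (hasheq 'model model
           'messages (list (hasheq 'role "user"
                                   'content (string-append prefix prompt))))))

;; A prompt that would break naive string concatenation.
(define tricky-prompt "She said \"hi\".\nWho spoke?")

(define payload (chat-payload "Answer the question: " tricky-prompt))

;; Parsing the payload back shows that the prompt survived intact.
(define round-trip (string->jsexpr payload))
```

The resulting string could be passed as the #:data argument in place of prompt-data.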

The output looks like (output from the second example shortened for brevity):

> (question-openai "Mary is 30 and Harry is 25. Who is older?")
"Mary is older than Harry."
> (displayln
    (completion-openai
      "Frank bought a new sports car. Frank drove"))
Frank bought a new sports car. Frank drove it out of the dealership with a wide grin on his face. The sleek, aerodynamic design of the car hugged the road as he accelerated, feeling the power under his hands. The adrenaline surged through his veins, and he couldn't help but let out a triumphant shout as he merged onto the highway.

As he cruised down the open road, the wind whipping through his hair, Frank couldn't help but reflect on how far he had come. It had been a lifelong dream of his to own a sports car, a symbol of success and freedom in his eyes. He had worked tirelessly, saving every penny, making sacrifices along the way to finally make this dream a reality.
...
> 
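
To make the response handling concrete without calling the API, here is a minimal sketch that parses a hand-written JSON string containing only the fields the code reads (choices, message, content). The sample response is invented and much shorter than a real one, and first-choice-content is a hypothetical helper name. Returning #f when there are no choices is a design choice of this sketch, not something the book's code does.

```racket
#lang racket
(require json)

;; Invented, abbreviated response with only the fields read by the code.
(define sample-response
  (string-append
   "{\"choices\": [{\"index\": 0, \"message\": "
   "{\"role\": \"assistant\", \"content\": \"Mary is older than Harry.\"}}]}"))

;; Hypothetical helper: return the first choice's text,
;; or #f when the parsed response has no choices.
(define (first-choice-content r)
  (define choices (hash-ref r 'choices '()))
  (and (pair? choices)
       (hash-ref (hash-ref (first choices) 'message) 'content)))

(define answer (first-choice-content (string->jsexpr sample-response)))
(define no-answer (first-choice-content (hasheq)))
```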

Using the Anthropic APIs in Racket

The Racket code listed below defines two functions, question and completion, which facilitate interaction with the Anthropic API to access a language model named claude-instant-1 for text generation purposes. The function question takes two arguments: a prompt and a max-tokens value, which are used to construct a JSON payload that will be sent to the Anthropic API. Inside the function, several Racket libraries are utilized for handling HTTP requests and processing data. A POST request is initiated to the Anthropic API endpoint “https://api.anthropic.com/v1/complete” with the crafted JSON payload. This payload includes the prompt text, maximum tokens to sample, and specifies the model to be used. The auth lambda function is used to inject necessary headers for authentication and specifying the API version. Upon receiving the response from the API, it extracts the completion field from the JSON response, trims any leading or trailing whitespace, and returns it.

The function completion is defined to provide a more specific use-case scenario, where it is intended to continue text from a given prompt. It also accepts a max-tokens argument to limit the length of the generated text. This function internally calls the function question with a modified prompt that instructs the model to continue writing from the provided text. By doing so, it encapsulates the common task of text continuation, making it easy to request text extensions by simply providing the initial text and desired maximum token count. Through these defined functions, the code offers a structured way to interact with the Anthropic API for generating text responses or completions in a Racket Scheme environment.

#lang racket

(require net/http-easy)
(require racket/set)
(require pprint)

(provide question completion)

;; send prompt to the Anthropic completion endpoint and return the
;; trimmed completion text; max-tokens limits the length of the reply
(define (question prompt max-tokens)
  (let* ((prompt-data
          (string-append
           "{\"prompt\": \"\\n\\nHuman: "
           prompt
           "\\n\\nAssistant: \", \"max_tokens_to_sample\": "
           (number->string  max-tokens)
           ", \"model\": \"claude-instant-1\" }"))
         ;; add the API key, API version, and content type headers
         (auth (lambda (uri headers params)
                 (values
                  (hash-set*
                   headers
                   'x-api-key
                     (getenv "ANTHROPIC_API_KEY")
                   'anthropic-version "2023-06-01"
                   'content-type "application/json")
                  params)))
         (p
          (post
           "https://api.anthropic.com/v1/complete"
           #:auth auth
           #:data prompt-data))
         (r (response-json p)))
    (string-trim (hash-ref r 'completion))))

(define (completion prompt max-tokens)
  (question
   (string-append
    "Continue writing from the following text: "
    prompt)
   max-tokens))
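
Both the OpenAI and Anthropic listings read the API key with getenv, which returns #f when the variable is not set. A small guard can turn that case into a readable error before any request is made. The following is a sketch only: require-env is a hypothetical helper name, and the variable names used in the demonstration are invented.

```racket
#lang racket/base

;; Hypothetical helper: return the value of an environment variable,
;; or raise an error that names the missing variable.
(define (require-env name)
  (or (getenv name)
      (error 'require-env "environment variable ~a is not set" name)))

;; Demonstration with invented variable names.
(void (putenv "RACKET_AI_BOOK_DEMO_KEY" "abc123"))
(define found (require-env "RACKET_AI_BOOK_DEMO_KEY"))

;; Capture the error message for a variable that is not set.
(define missing-message
  (with-handlers ([exn:fail? exn-message])
    (require-env "RACKET_AI_BOOK_DEMO_KEY_THAT_IS_NOT_SET")))
```

In the listings, (getenv "ANTHROPIC_API_KEY") or (getenv "OPENAI_API_KEY") could be replaced by a call to such a helper.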

We will try the same examples we used with OpenAI APIs in the previous section:

$ racket
> (require "anthropic.rkt")
> (question "Mary is 30 and Harry is 25. Who is older?" 20)
"Mary is older than Harry. Mary is 30 years old and Harry is 25 years old."
> (completion "Frank bought a new sports car. Frank drove" 200)
"Here is a possible continuation of the story:\n\nFrank bought a new sports car. Frank drove excitedly to show off his new purchase. The sleek red convertible turned heads as he cruised down the street with the top down. While stopping at a red light, Frank saw his neighbor Jane walking down the sidewalk. He pulled over and called out to her, \"Hey Jane, check out my new ride! Want to go for a spin?\" Jane smiled and said \"Wow that is one nice car! I'd love to go for a spin.\" She hopped in and they sped off down the road, the wind in their hair. Frank was thrilled to show off his new sports car and even more thrilled to share it with his beautiful neighbor Jane. Little did he know this joyride would be the beginning of something more between them."
> 

While I usually use the OpenAI APIs, I always like to have alternatives when I am using 3rd party infrastructure, even for personal research projects. The Anthropic LLMs definitely have a different “feel” than the OpenAI models, and I enjoy using both.

Using a Local Hugging Face Llama2-13b-orca Model with Llama.cpp Server

Now we look at an approach to run LLMs locally on your own computers.

Diving into AI unveils many ways where modern language models play a pivotal role in bridging the gap between machines and human language. Among the many open and public models, I chose Hugging Face’s Llama2-13b-orca model because of its support for natural language processing tasks. To truly harness the potential of Llama2-13b-orca, an interface to Racket code is essential. This is where we use the Llama.cpp Server as a conduit between the local instance of the Hugging Face model and the applications that seek to utilize it. The combination of Llama2-13b-orca with the llama.cpp server code will meet our requirements for local deployment and ease of installation and use.

Installing and Running Llama.cpp server with a Llama2-13b-orca Model

The llama.cpp server acts as a conduit for translating REST API requests to the respective language model APIs. By setting up and running the llama.cpp server, a channel of communication is established, allowing Racket code to interact with these language models in a seamless manner. There is also a Python library to encapsulate running models inside a Python program (a subject I leave to my Python AI books).

I run the llama.cpp service easily on an M2 Mac with 16 GB of memory. Start by cloning the llama.cpp project and building it:

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make
mkdir models

Then get a model file from https://huggingface.co/TheBloke/OpenAssistant-Llama2-13B-Orca-8K-3319-GGUF and copy it to the ./models directory:

$ ls -lh models
8.6G openassistant-llama2-13b-orca-8k-3319.Q5_K_M.gguf

Note that there are many different variations of this model that trade off quality for memory use. I am using one of the larger models. If you only have 8 GB of memory, try a smaller model.

Run the REST server:

./server -m models/openassistant-llama2-13b-orca-8k-3319.Q5_K_M.gguf -c 2048

We can test the REST server using the curl utility:

$ curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Answer the question: Mary is 30 years old and Sam is 25. Who is older and by how much?","n_predict": 128, "top_k": 1}'
{"content":"\nAnswer: Mary is older than Sam by 5 years.","generation_settings":{"frequency_penalty":0.0,"grammar":"","ignore_eos":false,"logit_bias":[],"mirostat":0,"mirostat_eta":0.10000000149011612,"mirostat_tau":5.0,"model":"models/openassistant-llama2-13b-orca-8k-3319.Q5_K_M.gguf","n_ctx":2048,"n_keep":0,"n_predict":128,"n_probs":0,"penalize_nl":true,"presence_penalty":0.0,"repeat_last_n":64,"repeat_penalty":1.100000023841858,"seed":4294967295,"stop":[],"stream":false,"temp":0.800000011920929,"tfs_z":1.0,"top_k":1,"top_p":0.949999988079071,"typical_p":1.0},"model":"models/openassistant-llama2-13b-orca-8k-3319.Q5_K_M.gguf","prompt":"Answer the question: Mary is 30 years old and Sam is 25. Who is older and by how much?","stop":true,"stopped_eos":true,"stopped_limit":false,"stopped_word":false,"stopping_word":"","timings":{"predicted_ms":960.595,"predicted_n":13,"predicted_per_second":13.53327885321077,"predicted_per_token_ms":73.89192307692308,"prompt_ms":539.3580000000001,"prompt_n":27,"prompt_per_second":50.05951520140611,"prompt_per_token_ms":19.976222222222223},"tokens_cached":40,"tokens_evaluated":27,"tokens_predicted":13,"truncated":false}

The important part of the output is:

"content":"Answer: Mary is older than Sam by 5 years."
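
Pulling that one field out in Racket takes only a few lines. The sketch below parses a shortened copy of the response shown above (most fields removed by hand) with string->jsexpr from the json library and trims the leading newline; the names sample and answer are chosen here for illustration.

```racket
#lang racket
(require json)

;; Shortened, hand-edited copy of the server response shown above.
(define sample
  (string-append
   "{\"content\":\"\\nAnswer: Mary is older than Sam by 5 years.\","
   "\"stop\":true,\"tokens_predicted\":13,\"truncated\":false}"))

;; Parse the JSON text into a hash with symbol keys.
(define r (string->jsexpr sample))

;; The generated text starts with a newline, so trim surrounding whitespace.
(define answer (string-trim (hash-ref r 'content)))
```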

In the next section we will write a simple library to extract data from Llama.cpp server responses.

A Racket Library for Using a Local Llama.cpp server with a Llama2-13b-orca Model

The following Racket code is designed to interface with a local instance of a Llama.cpp server to interact with a language model for generating text completions. This setup is particularly beneficial when there’s a requirement to have a local language model server, reducing latency and ensuring data privacy. We start by requiring libraries for handling HTTP requests and responses. The functionality of this code is encapsulated in three functions: helper, question, and completion, each serving a unique purpose in the interaction with the Llama.cpp server.

The helper function provides common functionality, handling the core logic of constructing the HTTP request, sending it to the Llama.cpp server, and processing the response. It accepts a prompt argument which forms the basis of the request payload. A JSON string is constructed with three key fields: prompt, n_predict, and top_k, which respectively contain the text prompt, the number of tokens to generate, and a parameter to control the diversity of the generated text. A debug line with displayln is used to output the constructed JSON payload to the console, aiding in troubleshooting. The function post is employed to send a POST request to the Llama.cpp server hosted locally on port 8080 at the /completion endpoint, with the constructed JSON payload as the request body. Upon receiving the response, it’s parsed into a Racket hash data structure, and the content field, which contains the generated text, is extracted and returned.

The question and completion functions serve as specialized interfaces to the helper function, crafting specific prompts aimed at answering a question and continuing a text, respectively. The question function prefixes the provided question text with “Answer: “ to guide the model’s response, while the completion function prefixes the provided text with a phrase instructing the model to continue from the given text. Both functions then pass these crafted prompts to the helper function, which in turn handles the interaction with the Llama.cpp server and extracts the generated text from the response.

The following code is in the file llama_local.rkt:

#lang racket

(require net/http-easy)
(require racket/set)
(require pprint)

;; send prompt to the local llama.cpp server and return the generated text.
;; Note: prompt is spliced into the JSON text as-is, so it must not
;; contain unescaped double quotes or newlines.
(define (helper prompt)
  (let* ((prompt-data
          (string-append
           "{\"prompt\": \""
           prompt
           "\", \"n_predict\": 256, \"top_k\": 1}"))
         ;; debug output: show the JSON payload before sending it
         (ignore (displayln prompt-data))
         (p
          (post
           "http://localhost:8080/completion"
           #:data prompt-data))
         (r (response-json p)))
    (hash-ref r 'content)))

(define (question question)
  (helper (string-append "Answer: " question)))

(define (completion prompt)
  (helper
   (string-append
    "Continue writing from the following text: "
    prompt)))
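
The helper above fixes n_predict and top_k inside the JSON string. As a sketch of one way to make them adjustable, the payload can be built from a hash and serialized with jsexpr->string, which also escapes any quotes in the prompt. The function name llama-payload and its keyword arguments are hypothetical; the three field names are the ones used in the listing.

```racket
#lang racket/base
(require json)

;; Hypothetical helper: build the /completion request body.
;; Defaults match the values hard-coded in the listing above.
(define (llama-payload prompt
                       #:n-predict [n-predict 256]
                       #:top-k [top-k 1])
  (jsexpr->string
   (hasheq 'prompt prompt
           'n_predict n-predict
           'top_k top-k)))

(define default-payload
  (string->jsexpr (llama-payload "Answer: Who is older?")))

(define short-payload
  (string->jsexpr (llama-payload "Answer: Who is older?" #:n-predict 64)))
```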

We can try this in a Racket REPL (output of the second example is edited for brevity):

> (question "Mary is 30 and Harry is 25. Who is older?")
{"prompt": "Answer: Mary is 30 and Harry is 25. Who is older?", "n_predict": 256, "top_k": 1}
"\nAnswer: Mary is older than Harry."
> (completion "Frank bought a new sports car. Frank drove")
{"prompt": "Continue writing from the following text: Frank bought a new sports car. Frank drove", "n_predict": 256, "top_k": 1}
" his new sports car to work every day. He was very happy with his new sports car. One day, while he was driving his new sports car, he saw a beautiful girl walking on the side of the road. He stopped his new sports car and asked her if she needed a ride. The beautiful girl said yes, so Frank gave her a ride in his new sports car. They talked about many things during the ride to work. When they arrived at work, Frank asked the beautiful girl for her phone number. She gave him her phone number, and he promised to call her later that day...."
> (question "Mary is 30 and Harry is 25. Who is older and by how much?")
{"prompt": "Answer: Mary is 30 and Harry is 25. Who is older and by how much?", "n_predict": 256, "top_k": 1}
"\nAnswer: Mary is older than Harry by 5 years."
> 

Using a Local Mistral-7B Model with Ollama.ai

Now we look at another approach to run LLMs locally on your own computers. The Ollama.ai project supplies a simple-to-install application for macOS and Linux (Windows support expected soon). When you download and run the application, it will install a command line tool ollama that we use here.

Installing and Running Ollama.ai server with a Mistral-7B Model

The Mistral model is the best 7B LLM that I have used (as I write this chapter in October 2023). When you run the ollama command line tool it will download and cache for future use the requested model.

For example, the first time we run ollama requesting the mistral LLM, you see that it is downloading the model:

$ ollama run mistral
pulling manifest
pulling 6ae280299950... 100% |███████████████████████████████████████████████| (4.1/4.1 GB, 13 MB/s)
pulling fede2d8d6c1f... 100% |██████████████████████████████████████████████████████| (29/29 B, 20 B/s)
pulling b96850d2e482... 100% |███████████████████████████████████████████████████| (307/307 B, 170 B/s)
verifying sha256 digest
writing manifest
removing any unused layers
success
>>> Mary is 30 and Bill is 25. Who is older and by how much?
Mary is older than Bill by 5 years.

>>> /?
Available Commands:
  /set         Set session variables
  /show        Show model information
  /bye         Exit
  /?, /help    Help for a command

Use """ to begin a multi-line message.

>>>

When you run the ollama command line tool, it also runs a REST API server, which we use later. The next time you run the mistral model, there is no download delay:

$ ollama run mistral
>>> ^D
$ ollama run mistral
>>> If I am driving between Sedona Arizona and San Diego, what sites should I visit as a tourist?

There are many great sites to visit when driving from Sedona, Arizona to San Diego. Here are some suggestions:

* Grand Canyon National Park - A must-see attraction in the area, the Grand Canyon is a massive and awe-inspiring natural wonder that offers countless opportunities for outdoor activities such as hiking, camping, and rafting.
* Yuma Territorial Prison State Historic Park - Located in Yuma, Arizona, this former prison was once the largest and most secure facility of its kind in the world. Today, visitors can explore the site and learn about its history through exhibits and guided tours.
* Joshua Tree National Park - A unique and otherworldly landscape in southern California, Joshua Tree National Park is home to a variety of natural wonders, including towering trees, giant boulders, and scenic trails for hiking and camping.
* La Jolla Cove - Located just north of San Diego, La Jolla Cove is a beautiful beach and tidal pool area that offers opportunities for snorkeling, kayaking, and exploring marine life.
* Balboa Park - A cultural and recreational hub in the heart of San Diego, Balboa Park is home to numerous museums, gardens, theaters, and other attractions that offer a glimpse into the city's history and culture.

>>> 

While we use the mistral LLM here, there are many more available models listed in the GitHub repository for Ollama.ai: https://github.com/jmorganca/ollama.

A Racket Library for Using a Local Ollama.ai REST Server with a Mistral-7B Model

The example code in the file ollama_ai_local.rkt is very similar to the example code in the last section. The main changes are a different REST service URI and the format of the returned JSON response:

#lang racket

(require net/http-easy)

;; Send a prompt to the local Ollama server and return the generated text.
;; Note: the prompt is spliced into the JSON body as-is, so it must not
;; contain double quotes or newlines.
(define (helper prompt)
  (let* ((prompt-data
          (string-append
           "{\"prompt\": \""
           prompt
           "\", \"model\": \"mistral\", \"stream\": false}"))
         (ignore (displayln prompt-data)) ; print the request body
         (p
          (post
           "http://localhost:11434/api/generate"
           #:data prompt-data))
         (r (response-json p)))
    (hash-ref r 'response)))

(define (question-ollama-ai-local question)
  (helper (string-append "Answer: " question)))

(define (completion-ollama-ai-local prompt)
  (helper
   (string-append
    "Continue writing from the following text: "
    prompt)))

;; EMBEDDINGS:

;; Return the embedding vector (a list of numbers) for the input text.
(define (embeddings-ollama text)
  (let* ((prompt-data
          (string-append
           "{\"prompt\": \"" text "\","
           " \"model\": \"mistral\"}"))
         (p
          (post
           "http://localhost:11434/api/embeddings"
           #:data prompt-data))
         (r (response-json p)))
    (hash-ref r 'embedding)))

;; (embeddings-ollama "Here is an article about llamas...")
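The helper in this listing builds its JSON request body with string-append, which yields invalid JSON as soon as a prompt contains a double quote or a newline. Here is a minimal sketch of one way to avoid that, assuming Racket's standard json library. The function name ollama-generate-body and its #:model keyword are my own illustrative names and are not part of ollama_ai_local.rkt:

```racket
#lang racket

(require json)

;; Hypothetical helper (not in ollama_ai_local.rkt): describe the request
;; body as a hash and let jsexpr->string handle quoting and escaping.
(define (ollama-generate-body prompt #:model [model "mistral"])
  (jsexpr->string
   (hasheq 'prompt prompt
           'model model
           'stream #f)))

;; A prompt containing quotes and a newline still produces valid JSON:
(define body (ollama-generate-body "He said \"hello\"\nand left."))
(displayln body)

;; Parsing the body again recovers the original prompt unchanged.
(displayln (hash-ref (string->jsexpr body) 'prompt))
```

The resulting string can be passed to post with #:data in the same way as the hand-built string in the listing.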

The function embeddings-ollama can be used to create embedding vectors from text input. Embeddings are used for chat with local documents, web sites, etc. We will run the same examples we used in the last section for comparison:

> (question "Mary is 30 and Harry is 25. Who is older and by how much?")
{"prompt": "Answer: Mary is 30 and Harry is 25. Who is older and by how much?", "model": "mistral", "stream": false}
"Answer: Mary is older than Harry by 5 years."
> (completion "Frank bought a new sports car. Frank drove")
{"prompt": "Continue writing from the following text: Frank bought a new sports car. Frank drove", "model": "mistral", "stream": false}
"Frank drove his new sports car around town, enjoying the sleek design and powerful engine. The car was a bright red, which caught the attention of everyone on the road. Frank couldn't help but smile as he cruised down the highway, feeling the wind in his hair and the sun on his face.\n\nAs he drove, Frank couldn't resist the urge to test out the car's speed and agility. He weaved through traffic, expertly maneuvering the car around curves and turns. The car handled perfectly, and Frank felt a rush of adrenaline as he pushed it to its limits.\n\nEventually, Frank found himself at a local track where he could put his new sports car to the test. He revved up the engine and took off down the straightaway, hitting top speeds in no time. The car handled like a dream, and Frank couldn't help but feel a sense of pride as he crossed the finish line.\n\nAfterwards, Frank parked his sports car and walked over to a nearby café to grab a cup of coffee. As he sat outside, sipping his drink and watching the other cars drive by, he couldn't help but think about how much he loved his new ride. It was the perfect addition to his collection of cars, and he knew he would be driving it for years to come."
> 

While I often use larger and more capable proprietary LLMs like Claude 2.1 and GPT-4, smaller open models from Mistral are very capable and sufficient for most of my experiments embedding LLMs in application code. As I write this, you can run Mistral models locally and through commercially hosted APIs.

Retrieval Augmented Generation of Text Using Embeddings

Retrieval-Augmented Generation (RAG) is a framework that combines the strengths of pre-trained large language models (LLMs) with retrievers. Retrievers are system components for accessing knowledge from external sources of text data. In RAG, a retriever selects relevant documents or passages from a corpus, and a generator produces a response based on both the retrieved information and the input query. The process typically follows these steps, which we will use in the example Racket code:

  • Query Encoding: The input query is encoded into a vector representation.
  • Document Retrieval: A retriever system uses the query representation to fetch relevant documents or passages from an external corpus.
  • Document Encoding: The retrieved documents are encoded into vector representations.
  • Joint Encoding: The query and document representations are combined, often concatenated or mixed via attention mechanisms.
  • Generation: A generator, usually an LLM, is used to produce a response based on the joint representation.

RAG enables the LLM to access and leverage external text data sources, which is crucial for tasks that require information beyond what the LLM has been trained on. It’s a blend of retrieval-based and generation-based approaches, aimed at boosting the factual accuracy and informativeness of generated responses.
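The retrieval and generation steps can be sketched without any API calls. The following is only an illustration under stated assumptions: the three-element "embeddings" are made up, and the names top-chunks and build-prompt are hypothetical helpers rather than functions from the book's code. A real system would obtain embeddings from a model and send the final prompt to an LLM:

```racket
#lang racket

;; Dot product of two equal-length lists of numbers.
(define (dot-product a b)
  (for/sum ([x (in-list a)] [y (in-list b)])
    (* x y)))

;; chunks is a list of (cons text embedding) pairs.
;; Return the texts of the k chunks that score highest against the query.
(define (top-chunks query-embedding chunks k)
  (define ranked
    (sort chunks >
          #:key (lambda (chunk) (dot-product query-embedding (cdr chunk)))))
  (map car (take ranked (min k (length ranked)))))

;; Combine the retrieved text with the user's question into one prompt.
(define (build-prompt context-texts question)
  (string-append (string-join context-texts " ")
                 " Question: " question))

;; Toy 3-element "embeddings", made up for illustration only.
(define toy-chunks
  (list (cons "Chemistry is the study of matter." '(0.9 0.1 0.0))
        (cons "Sports improve fitness." '(0.1 0.9 0.0))
        (cons "Alcohols are organic compounds." '(0.8 0.0 0.2))))

;; Pretend this is the encoding of a chemistry question.
(define query-embedding '(1.0 0.0 0.0))

(build-prompt (top-chunks query-embedding toy-chunks 2)
              "What is chemistry?")
```

With these toy vectors the two chemistry chunks score 0.9 and 0.8 while the sports chunk scores 0.1, so only the chemistry text ends up in the prompt.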

Example Implementation

In the following short Racket example program (file Racket-AI-book-code/embeddingsdb/embeddingsdb.rkt) I implement some ideas of a RAG architecture. At file load time the text files in the subdirectory data are read, split into “chunks”, and each chunk along with its parent file name and OpenAI text embedding is stored in a local SQLite database. When a user enters a query, the OpenAI embedding is calculated, and this embedding is matched against the embeddings of all chunks using the dot product of two 1536 element embedding vectors. The “best” chunks are concatenated together and this “context” text is passed to GPT-4 along with the user’s original query. Here I describe the code in more detail:

The provided Racket code uses a local SQLite database and OpenAI’s APIs for calculating text embeddings and for text completions.

Utility Functions:

  • floats->string and string->floats are utility functions for converting between a list of floats and its string representation.
  • read-file reads a file’s content.
  • join-strings joins a list of strings with a specified separator.
  • truncate-string truncates a string to a specified length.
  • interleave merges two lists by interleaving their elements.
  • break-into-chunks breaks a text into chunks of a specified size.
  • string-to-list and decode-row are utility functions for parsing and processing database rows.
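To make the chunking and the embedding serialization concrete, here is a small standalone sketch. It repeats three of the utility definitions from the listing later in this section so that it runs on its own, and applies them to tiny made-up inputs:

```racket
#lang racket

;; Definitions repeated from the embeddingsdb.rkt listing so that this
;; snippet is self-contained.
(define (floats->string floats)
  (string-join (map number->string floats) " "))

(define (string->floats str)
  (map string->number (string-split str)))

(define (break-into-chunks text chunk-size)
  (let loop ((start 0) (chunks '()))
    (if (>= start (string-length text))
        (reverse chunks)
        (loop (+ start chunk-size)
              (cons (substring text start
                               (min (+ start chunk-size) (string-length text)))
                    chunks)))))

(break-into-chunks "abcdefgh" 3)     ; => '("abc" "def" "gh")
(floats->string '(0.5 -1.25 3.0))    ; => "0.5 -1.25 3.0"
(string->floats "0.5 -1.25 3.0")     ; => '(0.5 -1.25 3.0)
```

Note that break-into-chunks cuts at fixed character offsets, so a chunk boundary can fall in the middle of a word or sentence.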

Database Setup:

  • A database connection is established to “test.db” and a table named “documents” is created with columns for document_path, content, and embedding.

Document Management:

  • insert-document inserts a document and its associated information into the database.
  • get-document-by-document-path and all-documents are utility functions for querying documents from the database.
  • create-document reads a document from a file path, breaks it into chunks, computes embeddings for each chunk via a function embeddings-openai, and inserts these into the database.

Semantic Matching and Interaction:

  • execute-to-list and dot-product are utility functions for database queries and vector operations.
  • semantic-match performs a semantic search by calculating the dot product of embeddings of the query and documents in the database. It then aggregates contexts of documents with a similarity score above a certain threshold, and sends a new query constructed with these contexts to OpenAI for further processing.
  • QA is a wrapper around semantic-match for querying.
  • CHAT initiates a loop for user interaction where each user input is processed through semantic-match to generate a response, maintaining a context of the previous chat.
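semantic-match compares a raw dot product against a fixed cutoff. That is a reasonable similarity measure when the embedding vectors have unit length, which to my knowledge is the case for OpenAI embeddings. Embeddings from other sources may not be normalized, and for those a cosine similarity is a safer score. The following is a minimal sketch, not code from embeddingsdb.rkt; the names norm, cosine-similarity, and matches-above are illustrative:

```racket
#lang racket

(define (dot-product a b)
  (for/sum ([x (in-list a)] [y (in-list b)])
    (* x y)))

;; Euclidean length of a vector stored as a list of numbers.
(define (norm v)
  (sqrt (dot-product v v)))

;; Cosine similarity: the dot product after scaling both vectors to
;; length 1. Assumes neither vector is all zeros.
(define (cosine-similarity a b)
  (/ (dot-product a b) (* (norm a) (norm b))))

;; Hypothetical helper: keep the texts whose embedding scores above cutoff.
;; docs is a list of (list text embedding).
(define (matches-above query-embedding docs [cutoff 0.7])
  (for/list ([doc (in-list docs)]
             #:when (> (cosine-similarity query-embedding (second doc)) cutoff))
    (first doc)))

(cosine-similarity '(3.0 4.0) '(6.0 8.0)) ; same direction => 1.0
(cosine-similarity '(1.0 0.0) '(0.0 1.0)) ; orthogonal     => 0.0
```

For unit-length vectors both norms are 1, so cosine-similarity and dot-product return the same value and the simpler dot product used in the listing is sufficient.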

Test Code:

  • The test function creates documents by reading from the specified file paths and runs some queries using the QA function.

The code uses a local SQLite database to store and manage document embeddings and the OpenAI API for generating embeddings and performing semantic searches based on user queries. Two functions are exported in case you want to use this example as a library: create-document and QA. Note: in the test code at the bottom of the listing, change the absolute path to reflect where you cloned the GitHub repository for this book.

#lang racket

(require db)
(require llmapis)

(provide create-document QA)

; Function to convert list of floats to string representation
(define (floats->string floats)
  (string-join (map number->string floats) " "))

; Function to convert string representation back to list of floats
(define (string->floats str)
  (map string->number (string-split str)))

; Read the entire contents of a file as one string
(define (read-file infile)
  (file->string infile))

(define (join-strings separator list)
  (string-join list separator))

(define (truncate-string string length)
  (substring string 0 (min length (string-length string))))

(define (interleave list1 list2)
  (if (or (null? list1) (null? list2))
      (append list1 list2)
      (cons (car list1)
            (cons (car list2)
                  (interleave (cdr list1) (cdr list2))))))

(define (break-into-chunks text chunk-size)
  (let loop ((start 0) (chunks '()))
    (if (>= start (string-length text))
        (reverse chunks)
        (loop (+ start chunk-size)
              (cons (substring text start
                               (min (+ start chunk-size) (string-length text)))
                    chunks)))))

(define (string-to-list str)
  (map string->number (string-split str)))

; A row is a vector: #(document_path content embedding-as-string)
(define (decode-row row)
  (let ((id (vector-ref row 0))
        (context (vector-ref row 1))
        (embedding (string-to-list
                    (read-line (open-input-string (vector-ref row 2))))))
    (list id context embedding)))

(define
  db
  (sqlite3-connect #:database "test.db" #:mode 'create
                   #:use-place #t))

; Create the table; the error raised when it already exists is ignored
(with-handlers ([exn:fail? (lambda (ex) (void))])
  (query-exec
   db
   "CREATE TABLE documents (document_path TEXT, content TEXT, embedding TEXT);"))

(define (insert-document document-path content embedding)
  (printf "~%insert-document:~%  content:~a~%~%" content)
  (query-exec
   db
   "INSERT INTO documents (document_path, content, embedding) VALUES (?, ?, ?);"
   document-path content (floats->string embedding)))

(define (get-document-by-document-path document-path)
  (map decode-row
       (query-rows
         db
         "SELECT * FROM documents WHERE document_path = ?;"
         document-path)))

(define (all-documents)
  (map
   decode-row
   (query-rows
    db
    "SELECT * FROM documents;")))

; Split a file into 200-character chunks and store each chunk with its
; embedding; chunks whose embedding call fails are silently skipped
(define (create-document fpath)
  (let ((contents (break-into-chunks (file->string fpath) 200)))
    (for-each
     (lambda (content)
       (with-handlers ([exn:fail? (lambda (ex) (void))])
         (let ((embedding (embeddings-openai content)))
           (insert-document fpath content embedding))))
     contents)))

;; Assuming a function to fetch documents from database
(define (execute-to-list db query)
  (query-rows db query))

;; dot product of two lists of floating point numbers:
(define (dot-product a b)
  (cond
    [(or (null? a) (null? b)) 0]
    [else
     (+ (* (car a) (car b))
        (dot-product (cdr a) (cdr b)))]))

(define (semantic-match query custom-context [cutoff 0.7])
  (let ((emb (embeddings-openai query))
        (ret '()))
    (for-each
     (lambda (doc)
       (let* ((context (second doc))
              (embedding (third doc))
              (score (dot-product emb embedding)))
         (when (> score cutoff)
           (set! ret (cons context ret)))))
     (all-documents))
    (printf "~%semantic-search: ret=~a~%" ret)
    (let* ((context (string-join (reverse ret) " . "))
           (query-with-context
            (string-join (list context custom-context "Question:" query) " ")))
      (question-openai query-with-context))))

(define (QA query [quiet #f])
  (let ((answer (semantic-match query "")))
    (unless quiet
      (printf "~%~%** query: ~a~%** answer: ~a~%~%" query answer))
    answer))

(define (CHAT)
  (let ((messages '(""))
        (responses '("")))
    (let loop ()
      (printf "~%Enter chat (STOP or empty line to stop) >> ")
      (let ((string (read-line)))
        (cond
         ((or (string=? string "STOP") (< (string-length string) 1))
          (list (reverse messages) (reverse responses)))
         (else
          (let* ((custom-context
                  (string-append
                   "PREVIOUS CHAT: "
                   (string-join (reverse messages) " ")))
                 (response (semantic-match string custom-context)))
            (set! messages (cons string messages))
            (set! responses (cons response responses))
            (printf "~%Response: ~a~%" response)
            (loop))))))))

;; ... test code ...

;; Test semantic document search using GPT APIs and the local vector database
(define (test)
  (create-document
    "/Users/markw/GITHUB/Racket-AI-book-code/embeddingsdb/data/sports.txt")
  (create-document
    "/Users/markw/GITHUB/Racket-AI-book-code/embeddingsdb/data/chemistry.txt")
  (QA "What is the history of the science of chemistry?")
  (QA "What are the advantages of engaging in sports?"))

Let’s look at a few examples from a Racket REPL:

> (QA "What is the history of the science of chemistry?")
** query: What is the history of the science of chemistry?
** answer: The history of the science of chemistry dates back thousands of years. Ancient civilizations such as the Egyptians, Greeks, and Chinese were experimenting with various substances and observing chemical reactions even before the term "chemistry" was coined.

The foundations of modern chemistry can be traced back to the works of famous scholars such as alchemists in the Middle Ages. Alchemists sought to transform common metals into gold and discover elixirs of eternal life. Although their practices were often based on mysticism and folklore, it laid the groundwork for the understanding of chemical processes and experimentation.

In the 17th and 18th centuries, significant advancements were made in the field of chemistry. Prominent figures like Robert Boyle and Antoine Lavoisier began to understand the fundamental principles of chemical reactions and the concept of elements. Lavoisier is often referred to as the "father of modern chemistry" for his work in establishing the law of conservation of mass and naming and categorizing elements.

Throughout the 19th and 20th centuries, chemistry continued to progress rapidly. The development of the periodic table by Dmitri Mendeleev in 1869 revolutionized the organization of elements. The discovery of new elements, the formulation of atomic theory, and the understanding of chemical bonding further expanded our knowledge.

Chemistry also played a crucial role in various industries and technologies, such as the development of synthetic dyes, pharmaceuticals, plastics, and materials. The emergence of quantum mechanics and spectroscopy in the early 20th century opened up new avenues for understanding the behavior of atoms and molecules.

Today, chemistry is an interdisciplinary science that encompasses various fields such as organic chemistry, inorganic chemistry, physical chemistry, analytical chemistry, and biochemistry. It continues to evolve and make significant contributions to society, from developing sustainable materials to understanding biological processes and addressing global challenges such as climate change.

In summary, the history of the science of chemistry spans centuries, starting from ancient civilizations to the present day, with numerous discoveries and advancements shaping our understanding of the composition, properties, and transformations of matter.

This output is the combination of data found in the text files in the directory Racket-AI-book-code/embeddingsdb/data and the data that OpenAI GPT-4 was trained on. Since the local “document” file chemistry.txt is very short, most of this output is derived from the innate knowledge GPT-4 has from its training data.

In order to show that this example is also using data in the local “document” text files, I manually edited the file data/chemistry.txt adding the following made-up organic compound:

ZorroOnian Alcohol is another organic compound with the formula C 6 H 10 O.

GPT-4 was never trained on my made-up data so it has no idea what the non-existent compound ZorroOnian Alcohol is. The following answer is retrieved via RAG from the local document data (for brevity, most of the output for adding the local document files to the embedding index is not shown):

> (create-document
   "/Users/markw/GITHUB/Racket-AI-book-code/embeddingsdb/data/chemistry.txt")

insert-document:
  content:Amyl alcohol is an organic compound with the formula C 5 H 12 O. ZorroOnian Alcohol is another organic compound with the formula C 6 H 10 O. All eight isomers of amyl alcohol are known.

  ...

> (QA "what is the formula for ZorroOnian Alcohol")

** query: what is the formula for ZorroOnian Alcohol
** answer: The formula for ZorroOnian Alcohol is C6H10O.

There is also a chat interface:

Enter chat (STOP or empty line to stop) >> who is the chemist Robert Boyle

Response: Robert Boyle was an Irish chemist and physicist who is known as one of the pioneers of modern chemistry. He is famous for Boyle's Law, which describes the inverse relationship between the pressure and volume of a gas, and for his experiments on the properties of gases. He lived from 1627 to 1691.

Enter chat (STOP or empty line to stop) >> Where was he born?

Response: Robert Boyle was born in Lismore Castle, County Waterford, Ireland.

Enter chat (STOP or empty line to stop) >> 
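The follow-up question "Where was he born?" can be answered because CHAT passes the earlier user messages along with each new query. The sketch below shows how one chat turn's prompt is put together, following the string-join calls in semantic-match and CHAT. The function name chat-prompt and the sample retrieved chunk are made up for illustration, and no API call is made:

```racket
#lang racket

;; Assemble the prompt for one chat turn from the retrieved chunks, the
;; earlier user messages, and the new question.
(define (chat-prompt retrieved-chunks previous-messages question)
  (let* ((context (string-join retrieved-chunks " . "))
         (custom-context
          (string-append "PREVIOUS CHAT: "
                         (string-join previous-messages " "))))
    (string-join (list context custom-context "Question:" question) " ")))

(chat-prompt '("Robert Boyle was a chemist.") ; made-up retrieved chunk
             '("who is the chemist Robert Boyle")
             "Where was he born?")
```

Only the user's earlier messages are included in this context, not the model's earlier responses, which matches what the CHAT function in the listing does.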

Retrieval Augmented Generation Wrap Up

Retrieval Augmented Generation (RAG) is one of the best use cases for semantic search. Another way to write RAG applications is to use a web search API to get context text for a query, and add this context data to whatever context data you have in a local embeddings data store.

Natural Language Processing

I have a Natural Language Processing (NLP) library that I wrote in Common Lisp. Here we will use code that I wrote in pure Scheme and converted to Racket.

The NLP library is still a work in progress so please check for future updates to this live eBook.

Since we will use the example code in this chapter as a library we start by defining a main.rkt file:

#lang racket/base

(require "fasttag.rkt")
(require "names.rkt")

(provide parts-of-speech)
(provide find-human-names)
(provide find-place-names)

There are two main source files for the NLP library: fasttag.rkt and names.rkt.

The following listing of fasttag.rkt is a conversion of original code I wrote in Java and later translated to Common Lisp. The provided Racket Scheme code is designed to perform part-of-speech tagging for a given list of words. The code begins by loading a hash table (lex-hash) from a data file (“data/tag.dat”), where each key-value pair maps a word to its possible part of speech. Then it defines several helper functions and transformation rules for categorizing words based on various syntactic and morphological criteria.

The core function, parts-of-speech, takes a vector of words and returns a vector of corresponding parts of speech. Inside this function, a number of rules are applied to each word to refine its part of speech based on both its individual characteristics and its context within the vector. For instance, Rule 1 changes the part of speech to “NN” (noun) if the previous word is “DT” (determiner) and the current word is categorized as a verb form (“VBD”, “VBP”, or “VB”). Rule 2 changes a noun to a cardinal number (“CD”) if the word contains a period, and so on. The function applies these rules in sequence, updating the part of speech for each word accordingly.

The parts-of-speech function iterates over each word in the input vector, checks it against lex-hash, and then applies the predefined rules. The result is a new vector of tags, one for each input word, where each tag represents the most likely part of speech for that word, based on the rules and the original lexicon.

#lang racket

(require srfi/13) ; the string SRFI
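(require racket/runtime-path)

;; directory holding tag.dat; assumed here to be "data" next to this file
(define-runtime-path my-data-path "data")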

(provide parts-of-speech)

;; FastTag.lisp
;;
;; Conversion of KnowledgeBooks.com Java FastTag to Scheme
;;
;; Copyright 2002 by Mark Watson.  All rights reserved.
;;


(display "loading lex-hash...")
(log-info "loading lex-hash" "starting")
(define lex-hash
  (let ((hash (make-hash)))
    (with-input-from-file
        (string-append (path->string my-data-path) "/tag.dat")
      (lambda ()
        (let loop ()
          (let ((p (read)))
            (if (list? p) (hash-set! hash (car p) (cadr p)) #f)
            (if (eof-object? p) #f (loop))))))
    hash))
(display "...done.")
(log-info "loading lex-hash" "ending")

(define (string-suffix? pattern str)
  (let loop ((i (- (string-length pattern) 1)) (j (- (string-length str) 1)))
    (cond
     ((negative? i) #t)
     ((negative? j) #f)
     ((char=? (string-ref pattern i) (string-ref str j))
      (loop (- i 1) (- j 1)))
     (else #f))))
;;
; parts-of-speech
;
;  input: a vector of words (each a string)
;  output: a vector of parts of speech
;;

(define (parts-of-speech words)
  (display "\n+ tagging:") (display words)
  (let ((ret '())        ; tags assigned so far, most recent first
        (last-tag #f)    ; tag assigned to the previous word
        (last-word #f))  ; the previous word
    (for-each
     (lambda (w)
       ;; use the first lexicon tag; words not in the lexicon default to "NN"
       (define tag (first (hash-ref lex-hash w '("NN"))))
       ;; apply transformation rules in sequence:

       ; rule 1: DT, {VBD, VBP, VB} --> DT, NN
       (when (and (equal? last-tag "DT")
                  (member tag '("VBD" "VBP" "VB")))
         (set! tag "NN"))

       ; rule 2: convert a noun to a number if a "." appears in the word
       (when (and (regexp-match? #rx"^N" tag)
                  (regexp-match? #rx"[.]" w))
         (set! tag "CD"))

       ; rule 3: convert a noun to a past participle if word ends with "ed"
       (when (and (regexp-match? #rx"^N" tag)
                  (string-suffix? "ed" w))
         (set! tag "VBN"))

       ; rule 4: convert any type to an adverb if it ends with "ly"
       (when (string-suffix? "ly" w)
         (set! tag "RB"))

       ; rule 5: convert a common noun (NN or NNS) to an adjective
       ;         if it ends with "al"
       (when (and (member tag '("NN" "NNS"))
                  (string-suffix? "al" w))
         (set! tag "JJ"))

       ; rule 6: convert a noun to a verb if the preceding word is "would"
       (when (and (regexp-match? #rx"^NN" tag)
                  (equal? last-word "would"))
         (set! tag "VB"))

       ; rule 7: if a word has been categorized as a common noun and it
       ;         ends with "s", then set its type to a plural noun (NNS)
       (when (and (equal? tag "NN")
                  (string-suffix? "s" w))
         (set! tag "NNS"))

       ; rule 8: convert a common noun to a present participle verb
       ;         (i.e., a gerund)
       (when (and (regexp-match? #rx"^NN" tag)
                  (string-suffix? "ing" w))
         (set! tag "VBG"))

       (set! last-tag tag)
       (set! last-word w)
       (set! ret (cons tag ret)))
     (vector->list words))  ;; not very efficient !!
    (list->vector (reverse ret))))
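To see the lookup-then-rules idea in isolation, here is a small self-contained sketch with a made-up four-word lexicon and only two of the rules (rule 1 and rule 4). The names toy-lexicon and toy-tag are hypothetical. Unlike fasttag.rkt, this sketch relies on the string-suffix? that ships with #lang racket, which takes the word first and the suffix second:

```racket
#lang racket

;; Toy lexicon, made up for this sketch; the real one is loaded from tag.dat.
(define toy-lexicon
  (hash "the" "DT" "dog" "NN" "run" "VB" "barked" "VBD"))

;; Tag a vector of words: lexicon lookup (default "NN"), then two rules
;; applied in sequence.
(define (toy-tag words)
  (define-values (tags _last)
    (for/fold ([tags '()] [last-tag #f]) ([w (in-vector words)])
      (define looked-up (hash-ref toy-lexicon w "NN"))
      ;; rule 1: a verb form directly after a determiner becomes a noun
      (define after-rule-1
        (if (and (equal? last-tag "DT")
                 (member looked-up '("VBD" "VBP" "VB")))
            "NN"
            looked-up))
      ;; rule 4: any word ending in "ly" becomes an adverb
      (define tag
        (if (string-suffix? w "ly") "RB" after-rule-1))
      (values (cons tag tags) tag)))
  (list->vector (reverse tags)))

;; "run" follows the determiner "the", so rule 1 retags it as a noun;
;; "loudly" is not in the lexicon and ends in "ly", so rule 4 tags it RB.
(toy-tag (vector "the" "run" "barked" "loudly"))
```

The result is a vector with the tags DT, NN, VBD, and RB, one per input word.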

The following listing of file names.rkt identifies human and place names in text. The Racket Scheme code is a script for Named Entity Recognition (NER). It is specifically designed to recognize human names and place names in given text:

  • It provides two main functions: find-human-names and find-place-names.
  • Uses two kinds of data: human names and place names, loaded from text files.
  • Employs Part-of-Speech tagging through an external fasttag.rkt module.
  • Uses hash tables and lists for efficient look-up.
  • Handles names with various components (prefixes, first name, last name, etc.)

The function process-one-word-per-line reads each line of a file and applies a given function func on it.

Initial data preparation consists of defining the hash tables *last-name-hash*, *first-name-hash*, and *place-name-hash* and populating them with last names, first names, and place names, respectively, from the specified data files.

We define two Named Entity Recognition (NER) functions:

  1. find-human-names: Takes a word vector and an exclusion list.
    • Utilizes parts-of-speech tags.
    • Checks for names that have 1 to 4 words.
    • Adds names to ret list if conditions are met, considering the exclusion list.
    • Returns processed names (ret2).
  2. find-place-names: Similar to find-human-names, but specifically for place names.
    • Works on 1 to 3 word place names.
    • Returns processed place names.

We define one helper function, not-in-list-find-names-helper, which ensures that an identified name does not overlap with another name or with an entry in the exclusion list.

Overall, the code is fairly optimized for its purpose, utilizing hash tables for constant-time look-up and lists to store identified entities.
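The two core ideas described above, hash-table lookup of name parts and an overlap check against ranges already found, can be sketched on their own. This is an illustration with made-up name tables, not the code from names.rkt. The names range-free? and find-two-word-names are hypothetical, and the sketch handles only two-word "First Last" names:

```racket
#lang racket

;; Toy name tables, made up for this sketch; names.rkt loads its tables
;; from data files.
(define first-names (make-hash '(("John" . #t) ("Mary" . #t))))
(define last-names (make-hash '(("Smith" . #t) ("Jones" . #t))))

;; #t when neither start nor end falls inside an already recorded
;; (start end) range, in the spirit of not-in-list-find-names-helper.
(define (range-free? ranges start end)
  (for/and ([r (in-list ranges)])
    (not (or (<= (first r) start (second r))
             (<= (first r) end (second r))))))

;; Return (start end) index ranges of "First Last" pairs; end is exclusive.
(define (find-two-word-names word-vector)
  (define len (vector-length word-vector))
  (reverse
   (for/fold ([found '()]) ([i (in-range (max 0 (- len 1)))])
     (if (and (range-free? found i (+ i 2))
              (hash-ref first-names (vector-ref word-vector i) #f)
              (hash-ref last-names (vector-ref word-vector (+ i 1)) #f))
         (cons (list i (+ i 2)) found)
         found))))

(find-two-word-names
 (vector "Today" "John" "Smith" "met" "Mary" "Jones"))
;; => '((1 3) (4 6))
```

The full library additionally checks part-of-speech tags and handles name prefixes and longer names, as the listing below shows.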

#lang racket

(require "fasttag.rkt")
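(require racket/runtime-path)

;; directory holding the name data files; assumed to be "data" next to this file
(define-runtime-path my-data-path "data")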
(provide find-human-names)
(provide find-place-names)

(define (process-one-word-per-line file-path func)
  (with-input-from-file file-path
    (lambda ()
      (let loop ()
        (let ([l (read-line)])
          ;; stop at end of file; otherwise pass the line to func
          (unless (eof-object? l)
            (func l)
            (loop)))))))

(define *last-name-hash* (make-hash))
(process-one-word-per-line 
  (string-append
    (path->string my-data-path)
    "/human_names/names.last")
  (lambda (x) (hash-set! *last-name-hash* x #t)))
(define *first-name-hash* (make-hash))
(process-one-word-per-line
  (string-append
    (path->string my-data-path)
    "/human_names/names.male")
  (lambda (x) (hash-set! *first-name-hash* x #t)))
(process-one-word-per-line
  (string-append
    (path->string my-data-path)
    "/human_names/names.female")
  (lambda (x) (hash-set! *first-name-hash* x #t)))

(define *place-name-hash* (make-hash))
(process-one-word-per-line
  (string-append
    (path->string my-data-path)
    "/placenames.txt")
  (lambda (x) (hash-set!  *place-name-hash* x #t)))
(define *name-prefix-list*
  '("Mr" "Mrs" "Ms" "Gen" "General" "Maj" "Major" "Doctor" "Vice" "President" 
	"Lt" "Premier" "Senator" "Congressman" "Prince" "King" "Representative"
	"Sen" "St" "Dr"))

(define (not-in-list-find-names-helper a-list start end)
  (let ((rval #t))
    (do ((x a-list (cdr x)))
	((or
	  (null? x)
	  (let ()
	    (if (or
		 (and
		  (>= start (caar x))
		  (<= start (cadar x)))
		 (and
		  (>= end (caar x))
		  (<= end (cadar x))))
		(set! rval #f)
                #f)
	    (not rval)))))
    rval))

;; return a list of sublists, each sublist looks like:
;;    (("John" "Smith") (11 12) 0.75) ; last number is an importance rating
(define (find-human-names word-vector exclusion-list)
  (define (score result-list)
    (- 1.0 (* 0.2 (- 4 (length result-list)))))
  (let ((tags (parts-of-speech word-vector))
        (ret '()) (ret2 '()) (x '())
        (len (vector-length word-vector))
        (word #f))
    (display "\ntags: ") (display tags)
    ;;(dotimes (i len)
    (for/list ([i (in-range len)])
      (set! word (vector-ref word-vector i))
      (display "\nword: ") (display word)
      ;; process 4 word names:      HUMAN NAMES
      (when (< i (- len 3))
        ;; case #1: single element from '*name-prefix-list*' followed by "."
        (when (and
               (not-in-list-find-names-helper ret i (+ i 4))
               (not-in-list-find-names-helper exclusion-list i (+ i 4))
               (member word *name-prefix-list*)
               (equal? "." (vector-ref word-vector (+ i 1)))
               (hash-ref *first-name-hash* (vector-ref word-vector (+ i 2)) #f)
               (hash-ref *last-name-hash* (vector-ref word-vector (+ i 3)) #f)
               (string-prefix? (vector-ref tags (+ i 2)) "NN")
               (string-prefix? (vector-ref tags (+ i 3)) "NN"))
          (set! ret (cons (list i (+ i 4)) ret)))
        ;; case #2: two elements from '*name-prefix-list*'
        (when (and
               (not-in-list-find-names-helper ret i (+ i 4))
               (not-in-list-find-names-helper exclusion-list i (+ i 4))
               (member word *name-prefix-list*)
               (member (vector-ref word-vector (+ i 1)) *name-prefix-list*)
               (hash-ref *first-name-hash* (vector-ref word-vector (+ i 2)) #f)
               (hash-ref *last-name-hash* (vector-ref word-vector (+ i 3)) #f)
               (string-prefix? (vector-ref tags (+ i 2)) "NN")
               (string-prefix? (vector-ref tags (+ i 3)) "NN"))
          (set! ret (cons (list i (+ i 4)) ret))))
      ;; process 3 word names:      HUMAN NAMES
      (if (< i (- len 2))
          (if (and
               (not-in-list-find-names-helper ret i (+ i 3))
               (not-in-list-find-names-helper exclusion-list i (+ i 3)))
              (if (or
                   (and
                    (member word *name-prefix-list*)
                    (hash-ref *first-name-hash* (vector-ref word-vector (+ i 1)) #f)
                    (hash-ref *last-name-hash* (vector-ref word-vector (+ i 2)) #f)
                    (string-prefix? (vector-ref tags (+ i 1)) "NN")
                    (string-prefix? (vector-ref tags (+ i 2)) "NN"))
                   (and
                    (member word *name-prefix-list*)
                    (member (vector-ref word-vector (+ i 1)) *name-prefix-list*)
                    (hash-ref *last-name-hash* (vector-ref word-vector (+ i 2)) #f)
                    (string-prefix? (vector-ref tags (+ i 1)) "NN")
                    (string-prefix? (vector-ref tags (+ i 2)) "NN"))
                   (and
                    (member word *name-prefix-list*)
                    (equal? "." (vector-ref word-vector (+ i 1)))
                    (hash-ref *last-name-hash* (vector-ref word-vector (+ i 2)) #f)
                    (string-prefix? (vector-ref tags (+ i 2)) "NN"))
                   (and
                    (hash-ref *first-name-hash* word #f)
                    (hash-ref *first-name-hash* (vector-ref word-vector (+ i 1)) #f)
                    (hash-ref *last-name-hash* (vector-ref word-vector (+ i 2)) #f)
                    (string-prefix? (vector-ref tags i) "NN")
                    (string-prefix? (vector-ref tags (+ i 1)) "NN")
                    (string-prefix? (vector-ref tags (+ i 2)) "NN")))
                  (set! ret (cons (list i (+ i 3)) ret))
                  #f)
              #f)
          #f)
      ;; process 2 word names:      HUMAN NAMES
      (if (< i (- len 1))
          (if (and
               (not-in-list-find-names-helper ret i (+ i 2))
               (not-in-list-find-names-helper exclusion-list i (+ i 2)))
              (if (or
                   (and
                    (member word '("Mr" "Mrs" "Ms" "Doctor" "President" "Premier"))
                    (string-prefix? (vector-ref tags (+ i 1)) "NN")
                    (hash-ref *last-name-hash* (vector-ref word-vector (+ i 1)) #f))
                   (and
                    (hash-ref *first-name-hash* word #f)
                    (hash-ref *last-name-hash* (vector-ref word-vector (+ i 1)) #f)
                    (string-prefix? (vector-ref tags i) "NN")
                    (string-prefix? (vector-ref tags (+ i 1)) "NN")))
                  (set! ret (cons (list i (+ i 2)) ret))
                  #f)
              #f)
          #f)
      ;; 1 word names:      HUMAN NAMES
      (if (hash-ref *last-name-hash* word #f)
          (if (and
               (string-prefix? (vector-ref tags i) "NN")
               (not-in-list-find-names-helper ret i (+ i 1))
               (not-in-list-find-names-helper exclusion-list i (+ i 1)))
              (set! ret (cons (list i (+ i 1)) ret))
              #f)
          #f))
    ;; TBD: calculate importance rating based on number of occurrences of name in text:
    (set! ret2
          (map (lambda (index-pair)
                 (string-replace
                  (string-join (vector->list (vector-copy  word-vector (car index-pair) (cadr index-pair))))
                  " ." "."))
               ret))
    ret2))

(define (find-place-names word-vector exclusion-list)  ;; PLACE
  (define (score result-list)
    (- 1.0 (* 0.2 (- 4 (length result-list)))))
  (let ((tags (parts-of-speech word-vector))
        (ret '()) (ret2 '()) (x '())
        (len (vector-length word-vector))
        (word #f))
    (display "\ntags: ") (display tags)
    ;; 'for' rather than 'for/list': the body is run only for its side effects on ret
    (for ([i (in-range len)])
      (set! word (vector-ref word-vector i))
      (display "\nword: ") (display word) (display "\n")
      ;; process 3 word names: PLACE
      (if (< i (- len 2))
          (if (and
               (not-in-list-find-names-helper ret i (+ i 3))
               (not-in-list-find-names-helper exclusion-list i (+ i 3)))
              (let ((p-name (string-append word " " (vector-ref word-vector (+ i 1)) " " (vector-ref word-vector (+ i 2)))))
                (if (hash-ref *place-name-hash* p-name #f)
                    (set! ret (cons (list i (+ i 3)) ret))
                    #f))
              #f)
          #f)
      ;; process 2 word names:  PLACE
      (if (< i (- len 1))
          (if (and
               (not-in-list-find-names-helper ret i (+ i 2))
               (not-in-list-find-names-helper exclusion-list i (+ i 2)))
              (let ((p-name (string-append word " " (vector-ref word-vector (+ i 1)))))
                (if (hash-ref *place-name-hash* p-name #f)
                    (set! ret (cons (list i (+ i 2)) ret))
                    #f))
              #f)
          #f)
      ;; 1 word names:   PLACE
      (if (hash-ref *place-name-hash* word #f)
          (if (and
               (string-prefix? (vector-ref tags i) "NN")
               (not-in-list-find-names-helper ret i (+ i 1))
               (not-in-list-find-names-helper exclusion-list i (+ i 1)))
              (set! ret (cons (list i (+ i 1)) ret))
              #f)
          #f))
    ;; TBD: calculate importance rating based on number of occurrences of name in text: can use (count-substring..) defined in utils.rkt
    (set! ret2
          (map (lambda (index-pair)
                 (string-join (vector->list (vector-copy  word-vector (car index-pair) (cadr index-pair))) " "))
               ret))
    ret2))

#|
(define nn (find-human-names '#("President" "George" "Bush" "went" "to" "San" "Diego" "to" "meet" "Ms" "." "Jones" "and" "Gen" "." "Pervez" "Musharraf" ".") '()))
(display (find-place-names '#("George" "Bush" "went" "to" "San" "Diego" "and" "London") '()))
|#

Let’s try some examples in a Racket REPL:

> Racket-AI-book-code/nlp $ racket
Welcome to Racket v8.10 [cs].
> (require nlp)
loading lex-hash......done.#f
> (find-human-names '#("President" "George" "Bush" "went" "to" "San" "Diego" "to" "meet" "Ms" "." "Jones" "and" "Gen" "." "Pervez" "Musharraf" ".") '())

+ tagging:#(President George Bush went to San Diego to meet Ms . Jones and Gen . Pervez Musharraf .)
tags: #(NNP NNP NNP VBD TO NNP NNP TO VB NNP CD NNP CC NNP CD NN NN CD)
word: President
word: George
word: Bush
word: went
word: to
word: San
word: Diego
word: to
word: meet
word: Ms
word: .
word: Jones
word: and
word: Gen
word: .
word: Pervez
word: Musharraf
word: .'("Gen. Pervez Musharraf" "Ms. Jones" "San" "President George Bush")
> (find-place-names '#("George" "Bush" "went" "to" "San" "Diego" "and" "London") '())

+ tagging:#(George Bush went to San Diego and London)
tags: #(NNP NNP VBD TO NNP NNP CC NNP)
word: George

word: Bush

word: went

word: to

word: San

word: Diego

word: and

word: London
'("London" "San Diego")
> 
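
The helper not-in-list-find-names-helper is what keeps a shorter match from being reported inside a span that was already recorded. As a minimal sketch of the same inclusive range test (the name span-free? is hypothetical and is not part of the nlp library), the loop can be expressed with ormap:

```racket
#lang racket

;; Sketch only: same test as not-in-list-find-names-helper.
;; spans is a list of (lo hi) index pairs. The result is #t when neither
;; start nor end falls inside any recorded pair (both bounds inclusive).
(define (span-free? spans start end)
  (not (ormap (lambda (span)
                (let ([lo (first span)]
                      [hi (second span)])
                  (or (<= lo start hi)
                      (<= lo end hi))))
              spans)))

(span-free? '((13 17) (9 12)) 0 3)    ; no recorded span is touched
(span-free? '((13 17) (9 12)) 11 12)  ; collides with (9 12)
```

Because both bounds are inclusive, a candidate that starts exactly at the end index of an earlier match is also rejected: (span-free? '((5 6)) 6 7) is #f.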

NLP Wrap Up

The NLP library is still a work in progress so please check for updates to this live eBook and the GitHub repository for this book:

https://github.com/mark-watson/Racket-AI-book-code

Knowledge Graph Navigator

The Knowledge Graph Navigator (which I will often refer to as KGN) is a tool for processing a set of entity names and automatically exploring the public Knowledge Graph DBPedia using SPARQL queries. I started to write KGN for my own use to automate some things I used to do manually when exploring Knowledge Graphs, and later thought that KGN might also be useful for educational purposes. KGN shows the user the auto-generated SPARQL queries that it runs, so hopefully the user will learn by seeing examples.

I cover SPARQL and linked data/Knowledge Graphs in previous books I have written. While I give you a brief background here, I ask interested readers to look at either of the following for more details:

  • The chapter Knowledge Graph Navigator in my book Loving Common Lisp, or the Savvy Programmer’s Secret Weapon
  • The chapters Background Material for the Semantic Web and Knowledge Graphs, Knowledge Graph Navigator in my book Practical Artificial Intelligence Programming With Clojure

We use the Natural Language Processing (NLP) library from the last chapter to find human and place names in input text and then construct SPARQL queries to access data from DBPedia.

The KGN application is still a work in progress so please check for updates to this live eBook. The following screenshots show the current version of the application:

KGN finds multiple RDF subjects for “Steve Jobs” so a dialog is presented to choose one

I have implemented parts of KGN in several languages: Common Lisp, Java, Clojure, Racket Scheme, Swift, Python, and Hy. The most complete version of KGN, including a full user interface, is in my book Loving Common Lisp, or the Savvy Programmer’s Secret Weapon that you can read free online. That version performs more speculative SPARQL queries to find information than the example here, which I designed for ease of understanding and modification. I am not covering the basics of RDF data and SPARQL queries here. While I provide sufficient background material to understand the code, please read the relevant chapters in my Common Lisp book for more background material.

KGN query results

We will be running an example using data containing three person entities, one company entity, and one place entity. The following figure shows a very small part of the DBPedia Knowledge Graph that is centered around these entities. The data for this figure was collected by an example Knowledge Graph Creator from my Common Lisp book:

File dbpedia_sample.nt loaded into the free version of GraphDB

I chose to use DBPedia instead of WikiData for this example because DBPedia URIs are human readable. The following URIs represent the concept of a person. The semantic meanings of DBPedia and FOAF (friend of a friend) URIs are self-evident to a human reader while the WikiData URI is not:

http://www.wikidata.org/entity/Q215627
http://dbpedia.org/ontology/Person
http://xmlns.com/foaf/0.1/name

I frequently use WikiData in my work and WikiData is one of the most useful public knowledge bases. I have both DBPedia and WikiData SPARQL endpoints in the example code that we will look at later, with the WikiData endpoint commented out. You can try manually querying WikiData at the WikiData SPARQL endpoint. For example, you might explore the WikiData URI for the person concept using:

select ?p ?o where {
 <http://www.wikidata.org/entity/Q215627> ?p ?o .
} limit 10

For the rest of this chapter we will just use DBPedia or data copied from DBPedia.

After looking at an interactive session using the example program for this chapter we will look at the implementation.

Entity Types Handled by KGN

To keep this example simple we handle just two entity types:

  • People
  • Places

The Common Lisp version of KGN also searches for relationships between entities. This search process consists of generating a series of SPARQL queries and calling the DBPedia SPARQL endpoint. I may add this feature to the Racket version of KGN in the future.

KGN Implementation

The example application works by processing the person and place names found in the user’s query. We generate SPARQL queries to DBPedia to find information about these entities.

We are using two libraries developed for this book that can be found in the directories Racket-AI-book-code/sparql and Racket-AI-book-code/nlp to supply support for SPARQL queries and natural language processing.

SPARQL Client Library

We already looked at code examples for making simple SPARQL queries in the chapter Datastores. Here we continue with more examples that we need for the KGN application.

The following listing shows Racket-AI-book-code/sparql/sparql.rkt where we implement several functions for interacting with DBPedia’s SPARQL endpoint. There are two functions sparql-dbpedia-for-person and sparql-dbpedia-person-uri crafted for constructing SPARQL queries. The function sparql-dbpedia-for-person takes a person URI and formulates a query to fetch associated website links and comments, limiting the results to four. On the other hand, the function sparql-dbpedia-person-uri takes a person name and builds a query to obtain the person’s URI and comments from DBpedia. Both functions utilize string manipulation to embed the input parameters into the SPARQL query strings. There are similar functions for places.

Another function sparql-query->hash executes SPARQL queries against the DBPedia endpoint. It takes a SPARQL query string as an argument, sends an HTTP request to the DBpedia SPARQL endpoint, and expects a JSON response. The call/input-url function is used to send the request, with uri-encode ensuring the query string is URL-encoded. The response is read from the port, converted to a JSON expression using the function string->jsexpr, and is expected to be in a hash form which is returned by this function.
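
To see what the encoding step does without contacting DBPedia, here is a small sketch that only builds the request URL. The helper name dbpedia-query-url is hypothetical; the endpoint string and the use of uri-encode follow the description above.

```racket
#lang racket
(require net/uri-codec)

;; Hypothetical helper: builds the request URL but makes no network call.
(define (dbpedia-query-url query)
  (string-append "https://dbpedia.org/sparql?query=" (uri-encode query)))

(define q "select ?s ?p ?o { ?s ?p ?o } limit 2")

;; The printed URL contains no raw spaces or braces.
(displayln (dbpedia-query-url q))

;; Decoding the encoded query gives back the original text.
(equal? (uri-decode (uri-encode q)) q)
```

Keeping URL construction separate from the HTTP call makes it easy to inspect a generated query in the REPL before sending it.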

Lastly, there are two functions json->listvals and gd for processing the JSON response from DBPedia. The function json->listvals extracts the variable bindings from the SPARQL result and organizes them into lists. The function gd further processes these lists based on the number of variables in the query result, creating lists of lists which represent the variable bindings in a structured way. The sparql-dbpedia function serves as an interface to these functionalities, taking a SPARQL query string, executing the query via sparql-query->hash, and processing the results through gd to provide a structured output. This arrangement encapsulates the process of querying DBPedia and formatting the results, making it convenient for further use within a Racket program.

We already saw most of the following code listing in the previous chapter Datastores. The following listings in this chapter will be updated in future versions of this live eBook when I finish writing the KGN application.

Part of solving this problem is constructing SPARQL queries as strings. We will look in some detail at one utility function sparql-dbpedia-for-person that constructs a SPARQL query string for fetching data from DBpedia about a specific person. The function takes one parameter, person-uri, which is expected to be the URI of a person in the DBpedia dataset. The query string is built by appending strings, including the dynamic insertion of the person-uri parameter value. Here’s a breakdown of how the code works:

  1. Function Definition: The function sparql-dbpedia-for-person is defined with one parameter, person-uri. This parameter is used to dynamically insert the person’s URI into the SPARQL query.
  2. String Appending (@string-append): The @string-append{...} form is Racket’s at-expression syntax (the @ syntax also used by Scribble), which is available when a source file starts with #lang at-exp racket. The reader turns the text between the braces into a sequence of string literals, with any @-escaped expressions left in place as Racket expressions, and passes them all as arguments to the standard string-append function. This lets us write a multi-line SPARQL query without escaping the double quotes inside it.
  3. SPARQL Query Construction: The function constructs a SPARQL query with the following key components:
    • SELECT Clause: This part of the query specifies what information to return. It uses GROUP_CONCAT to aggregate multiple ?website values into a single string, separated by " | ", and also selects the ?comment variable.
    • OPTIONAL Clauses: Two OPTIONAL blocks are included:
      • The first block attempts to fetch English comments (?comment) associated with the person, filtering to ensure the language of the comment is English (lang(?comment) = 'en').
      • The second block fetches external links (?website) associated with the person but filters out any URLs containing “dbpedia” (case-insensitive), most likely to avoid self-references within DBpedia.
    • Dynamic URI Insertion: Each occurrence of @person-uri inside the braces is an escape back into Racket, so the value of the person-uri argument is spliced into the query at that position. This targets the SPARQL query at a specific DBpedia resource.
    • LIMIT Clause: The query is limited to return at most 4 results with LIMIT 4.
  4. Usage of @ Escapes: No separate string replacement step is needed because the at-expression reader handles the insertion. The value passed as person-uri must already be valid in a SPARQL triple pattern, for example a URI wrapped in angle brackets. In sparql-dbpedia-person-uri, the sequence "@person-name"@"@"en uses the escape @"@" to emit a literal @ character, producing a language-tagged literal such as "Bill Gates"@en.

In summary, the sparql-dbpedia-for-person function dynamically constructs a SPARQL query to fetch English comments and external links (excluding DBpedia links) for a given person from DBpedia, with the results limited to a maximum of 4 entries. Racket’s at-expression syntax, combined with string-append, allows for the dynamic insertion of the person’s URI into the query.
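
The @ notation in these query builders is easy to try in isolation. The following sketch assumes a file that starts with #lang at-exp racket; the function names comment-query and name-literal, and the example.org URI used to exercise them, are made up for illustration and are not part of the book’s library.

```racket
#lang at-exp racket

;; Sketch: the text between the braces becomes string arguments to
;; string-append, and @person-uri is evaluated as a Racket expression.
(define (comment-query person-uri)
  @string-append{
    SELECT ?comment {
      @person-uri <http://www.w3.org/2000/01/rdf-schema#comment> ?comment .
      FILTER (lang(?comment) = 'en')
    } LIMIT 4})

;; @"@" is an escaped string that produces a literal @ character,
;; which is how a language tag such as @en can be written.
(define (name-literal person-name)
  @string-append{"@person-name"@"@"en})

(display (comment-query "<http://example.org/some-person>"))
(newline)
(display (name-literal "Bill Gates"))
```

The value passed to comment-query includes the angle brackets, because the function splices its argument into the triple pattern exactly as given.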

#lang at-exp racket  ; at-exp enables the @string-append{...} syntax

(provide sparql-dbpedia-person-uri)
(provide sparql-query->hash)
(provide json->listvals)
(provide sparql-dbpedia)

(require net/url)
(require net/uri-codec)
(require json)
(require racket/pretty)

;; Build a query for the English comment and external links of a person.
;; person-uri must be valid in a triple pattern, e.g. a URI in angle brackets.
(define (sparql-dbpedia-for-person person-uri)
  @string-append{
     SELECT
      (GROUP_CONCAT(DISTINCT ?website; SEPARATOR="  |  ")
                                   AS ?website) ?comment {
      OPTIONAL {
       @person-uri
       <http://www.w3.org/2000/01/rdf-schema#comment>
       ?comment . FILTER (lang(?comment) = 'en')
      } .
      OPTIONAL {
       @person-uri
       <http://dbpedia.org/ontology/wikiPageExternalLink>
       ?website
        . FILTER( !regex(str(?website), "dbpedia", "i"))
      }
     } LIMIT 4})

;; Build a query that finds URIs and English comments for a person name.
(define (sparql-dbpedia-person-uri person-name)
  @string-append{
    SELECT DISTINCT ?personuri ?comment {
      ?personuri
        <http://xmlns.com/foaf/0.1/name>
        "@person-name"@"@"en .
      ?personuri
        <http://www.w3.org/2000/01/rdf-schema#comment>
        ?comment .
             FILTER  (lang(?comment) = 'en') .
}})

;; Send the query to DBPedia and parse the JSON response into a hash.
(define (sparql-query->hash query)
  (call/input-url
   (string->url
    (string-append
     "https://dbpedia.org/sparql?query="
     (uri-encode query)))
   get-pure-port
   (lambda (port)
     (string->jsexpr (port->string port)))
   '("Accept: application/json")))

;; Return one list per query variable; each list holds a
;; (variable-name value) pair for every result row.
;; Keys are looked up by name rather than by position in the hash.
(define (json->listvals a-hash)
  (let ((vars (hash-ref (hash-ref a-hash 'head) 'vars))
        (b (hash-ref (hash-ref a-hash 'results) 'bindings)))
    (for/list ([var vars])
      (for/list ([bc b])
        (list var
              (hash-ref
               (hash-ref bc (string->symbol var))
               'value))))))

;; Regroup the per-variable lists into one list per result row.
(define gd (lambda (data)
    (let ((jd (json->listvals data)))

      (define gg1
        (lambda (jd) (map list (car jd))))
      (define gg2
        (lambda (jd) (map list (car jd) (cadr jd))))
      (define gg3
        (lambda (jd)
          (map list (car jd) (cadr jd) (caddr jd))))
      (define gg4
        (lambda (jd)
          (map list
               (car jd) (cadr jd)
               (caddr jd) (cadddr jd))))

      (case (length jd)
        [(1) (gg1 jd)]
        [(2) (gg2 jd)]
        [(3) (gg3 jd)]
        [(4) (gg4 jd)]
        [else
         (error "sparql queries with 1 to 4 vars")]))))

(define sparql-dbpedia
  (lambda (sparql)
    (gd (sparql-query->hash sparql))))

The function gd converts JSON data to Scheme nested lists and then extracts the values for up to four variables.
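
To make the two result shapes concrete without a network call, here is a sketch that runs the same kind of extraction on a small hand-written value. The sample data is invented (it only mimics the layout of the standard SPARQL JSON results format, as string->jsexpr would return it), and the helper names columns and rows are hypothetical.

```racket
#lang racket

;; Invented sample shaped like a parsed SPARQL JSON result (not real DBPedia data).
(define sample
  (hasheq 'head (hasheq 'vars '("personuri" "comment"))
          'results
          (hasheq 'bindings
                  (list
                   (hasheq 'personuri (hasheq 'type "uri" 'value "http://example.org/p1")
                           'comment (hasheq 'type "literal" 'value "first comment"))
                   (hasheq 'personuri (hasheq 'type "uri" 'value "http://example.org/p2")
                           'comment (hasheq 'type "literal" 'value "second comment"))))))

;; Column-oriented view, in the style of json->listvals: one list per variable.
(define (columns js)
  (define vars (hash-ref (hash-ref js 'head) 'vars))
  (define bindings (hash-ref (hash-ref js 'results) 'bindings))
  (for/list ([var vars])
    (for/list ([b bindings])
      (list var (hash-ref (hash-ref b (string->symbol var)) 'value)))))

;; Row-oriented view, in the style of gd: one list per result row.
;; (apply map list ...) transposes any non-zero number of columns.
(define (rows js)
  (apply map list (columns js)))

(rows sample)
```

Using (apply map list ...) is one way to avoid a separate case for each variable count; it is shown here as an alternative sketch, not as the library’s implementation.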

NLP Library

We implemented a library in the chapter Natural Language Processing that we use here.

Please make sure you have read that chapter before the following sections.

Implementation of KGN Application Code

The file Racket-AI-book-code/kgn/main.rkt contains library boilerplate and the file Racket-AI-book-code/kgn/kgn.rkt contains the application code. The Racket code interacts with the DBPedia SPARQL endpoint to retrieve information about persons or places based on a user’s string query. The code is organized into several functions that handle different steps of the process:

Query Parsing and Entity Recognition:

The parse-query function takes a string query-str and tokenizes it into a list of words after replacing the characters “.” and “?” with spaces. It then checks for the keywords “who” or “where” to infer the type of query: person or place. Using find-human-names and find-place-names (defined in the chapter Natural Language Processing), it extracts the entity names from the tokens. Depending on the type of query and the entities found, it returns a list holding the entity type and name, or a list starting with unknown if no relevant entity is identified.
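
The tokenizing and keyword steps can be tried on their own, without loading the nlp library. In this sketch the helper names tokenize-query and question-type are hypothetical; the string operations mirror the ones described above.

```racket
#lang racket

;; Replace "." and "?" with spaces, then split on whitespace.
(define (tokenize-query query-str)
  (string-split
   (string-replace (string-replace query-str "." " ") "?" " ")))

;; Classify a token list by keyword; the match is case-sensitive.
(define (question-type tokens)
  (cond [(member "who" tokens) 'person]
        [(member "where" tokens) 'place]
        [else 'unknown]))

(tokenize-query "where is San Francisco?")
(question-type (tokenize-query "where is San Francisco?"))
(question-type (tokenize-query "who is Bill Gates"))
```

Because member compares strings exactly, a query that starts with a capitalized “Who” or “Where” is classified as unknown by this sketch.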

SPARQL Query Construction and Execution:

The functions get-person-results and get-place-results take a name string, construct a SPARQL query to get information about the entity from DBPedia, execute the query, and process the results. They utilize the sparql-dbpedia-person-uri, sparql-query->hash, and json->listvals functions that we listed previously to construct the query, execute it, and convert the returned JSON data to a list, respectively.
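
Both functions finish by pairing each URI with its comment. A minimal sketch of that pairing step on invented data follows (zip-values is a hypothetical stand-in that uses the same map-based approach as extract-name-uri-and-comment):

```racket
#lang racket

;; Each input list holds (variable-name value) pairs for one query variable.
;; The result pairs up the values position by position.
(define (zip-values l1 l2)
  (map (lambda (a b) (list (second a) (second b)))
       l1 l2))

(zip-values
 '(("personuri" "http://example.org/p1") ("personuri" "http://example.org/p2"))
 '(("comment" "first comment") ("comment" "second comment")))
```

Note that map requires both lists to have the same length, which holds when every result row binds both variables.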

Query Interface:

The ui-query-helper function acts as the top-level utility for processing a string query to generate a SPARQL query, execute it, and return the results. It first calls parse-query to understand the type of query and the entity in question. Depending on whether the query is about a person or a place, it invokes get-person-results or get-place-results, respectively, to get the relevant information from DBPedia. It then returns a list containing the SPARQL query and the results, or #f if the query type is unknown.

This code structure facilitates the breakdown of a user’s natural language query into actionable SPARQL queries to retrieve and present information about identified entities from a structured data source like DBPedia.

(require racket/pretty)
(require nlp)

(provide get-person-results)
(provide ui-query-helper)

(require "sparql-utils.rkt")

;; Note: person-uri holds the generated SPARQL query text, not a URI.
(define (get-person-results person-name-string)
  (let ((person-uri (sparql-dbpedia-person-uri person-name-string)))
    (let* ((hash-data (sparql-query->hash person-uri)))
      (list
       person-uri
       (extract-name-uri-and-comment
        (first (json->listvals hash-data)) (second (json->listvals hash-data)))))))


;; sparql-dbpedia-place-uri is expected to be provided by sparql-utils.rkt
(define (get-place-results place-name-string)
  (let ((place-uri (sparql-dbpedia-place-uri place-name-string)))
    (let* ((hash-data (sparql-query->hash place-uri)))
      (list
       place-uri
       (extract-name-uri-and-comment
        (first (json->listvals hash-data)) (second (json->listvals hash-data)))))))


(define (parse-query query-str)
  (let ((cleaned-query-tokens
         (string-split (string-replace  (string-replace query-str "." " ") "?" " "))))
    (printf "\n+ + + cleaned-query-tokens:~a\n" cleaned-query-tokens)
    (if (member "who" cleaned-query-tokens)
        (let ((person-names (find-human-names (list->vector cleaned-query-tokens) '())))
          (printf "\n+ + person-names= ~a\n" person-names)
          (if (> (length person-names) 0)
              (list 'person (first person-names))   ;; for now, return the first name found
              (list 'unknown query-str)))  ;; no person name found
        (if (member "where" cleaned-query-tokens)
            (let ((place-names (find-place-names (list->vector cleaned-query-tokens) '())))
              (printf "\n+ + place-names= ~a\n" place-names)
              (if (> (length place-names) 0)
                  (list 'place (first place-names))   ;; for now, return the first place name found
                  (list 'unknown query-str)))
            (list 'unknown query-str))))) ;; no person or place name match so just return original query

(define (ui-query-helper query-str)  ;; top level utility function for string query -> 1) generated sparql 2) query function
  (display "in ui-query-helper: query-str=") (display query-str)
  (let* ((parse-results (parse-query query-str))
         (question-type (first parse-results))
         (entity-name (second parse-results)))
    (display (list parse-results question-type entity-name))
    (if (equal? question-type 'person)
        (let* ((results2 (get-person-results entity-name))
               (sparql (car results2))
               (results (second results2)))
          (printf "\n++  results: ~a\n" results)
          (list sparql results))
        (if (equal? question-type 'place)
            (let* ((results2 (get-place-results entity-name))
                   (sparql (car results2))
                   (results (second results2)))
              (list sparql results))
            #f))))

The file Racket-AI-book-code/kgn/dialog-utils.rkt contains the user interface specific code for implementing a dialog box.

(require htdp/gui)
(require racket/gui/base)
(require racket/pretty)
(provide make-selection-functions)

(define (make-selection-functions parent-frame title)
  (let* ((dialog
          (new dialog%
               [label title]
               [parent parent-frame]
               [width 440]
               [height 480]))
         (close-callback
          (lambda (button event)
            (send dialog show #f)))
         (entity-chooser-dialog-list-box
          (new list-box%
               [label ""]
               [choices (list "aaaa" "bbbb")]
               [parent dialog]
               [callback (lambda (click event)
                           (if (equal? (send event get-event-type) 'list-box-dclick)
                               (close-callback click event)
                               #f))]))
         (quit-button
          (new button% [parent dialog]
             [label "Select an entity"]
             [callback  close-callback]))
         (set-new-items-and-show-dialog
          (lambda (a-list)
            (send entity-chooser-dialog-list-box set-selection 0)
            (send entity-chooser-dialog-list-box set a-list)
            (send dialog show #t)))
         (get-selection-index (lambda () (first (send entity-chooser-dialog-list-box get-selections)))))
    (list set-new-items-and-show-dialog get-selection-index)))

The local file sparql-utils.rkt contains additional utility functions for accessing information in DBPedia.

#lang at-exp racket  ; at-exp enables the @string-append{...} syntax

(provide sparql-dbpedia-for-person)
(provide sparql-dbpedia-person-uri)
(provide sparql-query->hash)
(provide json->listvals)
(provide extract-name-uri-and-comment)

(require net/url)
(require net/uri-codec)
(require json)
(require racket/pretty)

(define ps-encoded-by "ps:P702")
(define wdt-instance-of "wdt:P31")
(define wdt-in-taxon "wdt:P703")
(define wd-human "wd:Q15978631")
(define wd-mouse "wd:Q83310")
(define wd-rat "wd:Q184224")
(define wd-gene "wd:Q7187")

(define (sparql-dbpedia-for-person person-uri)
  @string-append{
     SELECT
      (GROUP_CONCAT(DISTINCT ?website; SEPARATOR="  |  ") AS ?website) ?comment {
      OPTIONAL { @person-uri <http://www.w3.org/2000/01/rdf-schema#comment> ?comment . FILTER (lang(?comment) = 'en') } .
      OPTIONAL { @person-uri <http://dbpedia.org/ontology/wikiPageExternalLink> ?website . FILTER( !regex(str(?website), "dbpedia", "i"))} .
     } LIMIT 4})

(define (sparql-dbpedia-person-uri person-name)
  @string-append{
    SELECT DISTINCT ?personuri ?comment {
      ?personuri <http://xmlns.com/foaf/0.1/name> "@person-name"@"@"en .
      ?personuri <http://www.w3.org/2000/01/rdf-schema#comment>  ?comment .
                  FILTER  (lang(?comment) = 'en') .
}})

(define (sparql-query->hash query)
  (call/input-url (string->url (string-append "https://dbpedia.org/sparql?query=" (uri-encode query)))
                      get-pure-port
                      (lambda (port)
                        (string->jsexpr (port->string port)))
                      '("Accept: application/json")))

;; Return one list per query variable; each list holds a
;; (variable-name value) pair for every result row.
;; Keys are looked up by name rather than by position in the hash.
(define (json->listvals a-hash)
  (let ((vars (hash-ref (hash-ref a-hash 'head) 'vars))
        (b (hash-ref (hash-ref a-hash 'results) 'bindings)))
    (for/list ([var vars])
      (for/list ([bc b])
        (list var (hash-ref (hash-ref bc (string->symbol var)) 'value))))))

(define extract-name-uri-and-comment (lambda (l1 l2)
              (map ;; perform a "zip" action on two lists
               (lambda (a b)
                 (list (second a) (second b)))
               l1 l2)))

The local file kgn.rkt is the main program for this application.

(require htdp/gui)  ;; note: when building executable, choose GRacket, not Racket to get *.app bundle
(require racket/gui/base)
(require racket/match)
(require racket/pretty)
(require scribble/text/wrap)

;; Sample queries:
;;   who is Bill Gates
;;   where is San Francisco
;; (only who/where queries are currently handled)

;;(require "utils.rkt")
(require nlp)
(require "main.rkt")
(require "dialog-utils.rkt")

(define count-substring 
  (compose length regexp-match*))

(define (short-string s)
  (if (< (string-length s) 75)
      s
      (substring s 0 73)))

(define dummy (lambda (button event) (display "\ndummy\n"))) ;; this will be redefined after UI objects are created

(let ((query-callback (lambda (button event) (dummy button event))))
  (match-let* ([frame (new frame% [label "Knowledge Graph Navigator"]
                           [height 400] [width 608] [alignment '(left top)])]
               [(list set-new-items-and-show-dialog get-selection-index) ; returns list of 2 callback functions
                (make-selection-functions frame "Test selection list")]
               [query-field (new text-field%
                                 [label "  Query:"] [parent frame]
                                 [callback
                                  (lambda( k e)
                                    (if (equal? (send e get-event-type) 'text-field-enter) (query-callback k e) #f))])]
               [a-container (new pane%
                                 [parent frame] [alignment '(left top)])]
               [a-message (new message%
                               [parent frame] [label "  Generated SPARQL:"])]
               [sparql-canvas (new text-field%
                                   (parent frame) (label "")
                                   [min-width 380] [min-height 200]
                                   [enabled #f])]
               [a-message-2 (new message% [parent frame] [label "  Results:"])]
               [results-canvas (new text-field%
                                    (parent frame) (label "")
                                    [min-height 200] [enabled #f])]
               [a-button (new button% [parent a-container]
                              [label "Process: query -> generated SPARQL -> results from DBPedia"]
                              [callback query-callback])])
    (display "\nbefore setting new query-callback\n")
    (set!
     dummy ;; override dummy lambda defined earlier
     (lambda (button event)
       (display "\n+ in query-callback\n")
       (let ((query-helper-results-all (ui-query-helper (send query-field get-value))))
         (if (equal? query-helper-results-all #f)
             (let ()
               (send  sparql-canvas set-value "no generated SPARQL")
               (send  results-canvas set-value "no results"))
             (let* ((sparql-results (first query-helper-results-all))
                    (query-helper-results-uri-and-description (cadr query-helper-results-all))
                    (uris (map first query-helper-results-uri-and-description))
                    (query-helper-results (map second query-helper-results-uri-and-description)))
               (display "\n++ query-helper-results:\n") (display query-helper-results) (display "\n")
               (if (= (length query-helper-results) 1)
                   (let ()
                     (send  sparql-canvas set-value sparql-results)
                     (send  results-canvas set-value
                            (string-append (string-join  (wrap-line (first query-helper-results) 95) "\n") "\n\n" (first uris))))
                   (if (> (length query-helper-results) 1)
                       (let ()
                         (set-new-items-and-show-dialog (map short-string query-helper-results))
                         (set! query-helper-results
                               (let ((sel-index (get-selection-index)))
                                 (if (> sel-index -1)
                                     (list-ref query-helper-results sel-index)
                                     '(""))))
                         (set! uris (list-ref uris (get-selection-index)))
                         (display query-helper-results)
                         (send  sparql-canvas set-value sparql-results)
                         (send  results-canvas set-value
                                (string-append (string-join  (wrap-line query-helper-results 95) "\n") "\n\n" uris)))
                       (send  results-canvas set-value (string-append "No results for: " (send query-field get-value))))))))))
     (send frame show #t)))

The two screenshots seen earlier show the GUI application running.
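
The kgn.rkt listing defines dummy as a placeholder and later replaces it with set! once the widgets exist, while the widgets themselves are given a thin wrapper that calls dummy. The following is a minimal sketch of that late-binding pattern without any GUI code. The names handler, make-fake-button, and seen-events are hypothetical stand-ins that I introduce only for illustration.

```racket
#lang racket

;; Placeholder handler, standing in for `dummy` in kgn.rkt.
(define handler (lambda (event) 'not-ready))

;; Hypothetical stand-in for a GUI widget: it keeps the callback it was
;; created with and invokes it whenever it is "clicked".
(define (make-fake-button callback)
  (lambda (event) (callback event)))

;; The widget receives a thin wrapper that looks up `handler` at call time,
;; so a later set! is visible even though the widget already exists.
(define button (make-fake-button (lambda (event) (handler event))))

;; Clicking before the real handler is installed reaches the placeholder.
(define before (button 'click))

;; Install the real handler now that everything it depends on is defined.
(define seen-events '())
(set! handler
      (lambda (event)
        (set! seen-events (cons event seen-events))
        'handled))

;; The same button now reaches the new handler.
(define after (button 'click))
```

Passing handler directly to make-fake-button would capture the placeholder procedure itself, so the later set! would have no effect on the button. The wrapper lambda is what defers the lookup.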

Knowledge Graph Navigator Wrap Up

I hope this KGN example was both interesting and simple enough in its implementation to serve as a jumping-off point for your own projects.

I had the idea for the KGN application because I was spending quite a bit of time manually setting up SPARQL queries for DBPedia (and other public sources like WikiData) and I wanted to experiment with partially automating this process. I have also written versions of KGN in Java, the Hy language (a Lisp running on Python, about which I wrote a short book), Swift, and Common Lisp. Each of these four implementations takes a different approach because I used each one to try out different ideas.

Conclusions

The material in this book was informed by my own work interests and experiences. If you enjoyed reading it and you make practical use of at least some of the material I covered, then I consider my effort to be worthwhile.

Racket is a language that many people use for both fun personal projects and for professional development. I have tried, dear reader, to make the case here that Racket is a practical language that integrates well with my workflows on both Linux and macOS.

Writing software is a combination of a business activity, promoting good for society, and an exploration to try out new ideas for self-improvement. I believe that there is sometimes a fine line between spending too many resources tracking new technologies and getting stuck using old technologies at the cost of lost opportunities. My hope is that reading this book was an efficient and pleasurable use of your time, letting you try some new techniques and technologies that you had not considered before.

If we never get to meet in person or talk on the telephone, then I would like to thank you now for taking the time to read my book.