Retrieval Augmented Generation of Text Using Embeddings
Retrieval-Augmented Generation (RAG) is a framework that combines the strengths of pre-trained large language models (LLMs) with retrievers. Retrievers are system components for accessing knowledge from external sources of text data. In RAG, a retriever selects relevant documents or passages from a corpus, and a generator produces a response based on both the retrieved information and the input query. The process typically follows these steps, which the example Racket code implements in simplified form:
- Query Encoding: The input query is encoded into a vector representation.
- Document Retrieval: A retriever system uses the query representation to fetch relevant documents or passages from an external corpus.
- Document Encoding: The retrieved documents are encoded into vector representations.
- Joint Encoding: The query and document representations are combined, often concatenated or mixed via attention mechanisms.
- Generation: A generator, usually an LLM, produces a response based on the joint representation.
RAG enables the LLM to access and leverage external text data sources, which is crucial for tasks that require information beyond what the LLM has been trained on. It’s a blend of retrieval-based and generation-based approaches, aimed at boosting the factual accuracy and informativeness of generated responses.
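The steps above can be sketched end to end with a toy example. This is a minimal sketch under stated assumptions, not the chapter's implementation: toy-embed is a hypothetical stand-in for a real embedding model (it only counts vocabulary words), and the names vocabulary, retrieve, and build-prompt are invented for illustration. The resulting prompt string is what would be handed to an LLM in the generation step.

```racket
#lang racket

;; Hypothetical stand-in for an embedding model (illustration only):
;; count how often each vocabulary word occurs in the text.
(define vocabulary '("chemistry" "sports" "alcohol" "team"))

(define (toy-embed text)
  (let ((words (string-split (string-downcase text))))
    (map (lambda (v) (count (lambda (w) (string=? w v)) words))
         vocabulary)))

;; Dot product of two equal-length lists of numbers.
(define (dot a b) (apply + (map * a b)))

;; Retrieval: keep every passage whose score against the query
;; is strictly greater than cutoff.
(define (retrieve query passages cutoff)
  (let ((q (toy-embed query)))
    (filter (lambda (p) (> (dot q (toy-embed p)) cutoff))
            passages)))

;; Input for the generator: retrieved context followed by the question.
(define (build-prompt query passages)
  (string-append (string-join passages " . ") " Question: " query))

(define passages
  '("chemistry is the study of matter"
    "sports build team spirit"))

(define prompt
  (let ((query "what is chemistry"))
    (build-prompt query (retrieve query passages 0))))

(displayln prompt)
```

Only the chemistry passage shares a vocabulary word with the query, so it is the only context placed in front of the question.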
Example Implementation
In the following short Racket example program (file Racket-AI-book/source-code/embeddingsdb/embeddingsdb.rkt) I implement some ideas of a RAG architecture. At file load time the text files in the subdirectory data are read, split into “chunks”, and each chunk along with its parent file name and OpenAI text embedding is stored in a local SQLite database. When a user enters a query, the OpenAI embedding is calculated, and this embedding is matched against the embeddings of all chunks using the dot product of two 1536-element embedding vectors. The “best” chunks are concatenated together and this “context” text is passed to GPT-4 along with the user’s original query. Here I describe the code in more detail:
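A plain dot product works as a similarity score when the embedding vectors have unit length, because it then equals the cosine similarity. OpenAI documents its embeddings as normalized to length 1, but that is worth verifying for the model you use. The following sketch, using small made-up vectors and the hypothetical helper names norm, normalize, and cosine-similarity, illustrates the relationship:

```racket
#lang racket

;; Dot product of two equal-length lists of numbers.
(define (dot-product a b) (apply + (map * a b)))

;; Euclidean length of a vector.
(define (norm v) (sqrt (dot-product v v)))

;; Scale a vector to unit length.
(define (normalize v)
  (let ((n (norm v)))
    (map (lambda (x) (/ x n)) v)))

;; Cosine of the angle between two vectors of any length.
(define (cosine-similarity a b)
  (/ (dot-product a b) (* (norm a) (norm b))))

(define a '(3.0 4.0))
(define b '(4.0 3.0))

;; Both expressions give approximately 0.96.
(cosine-similarity a b)
(dot-product (normalize a) (normalize b))
```

If a different embedding model returned vectors that are not normalized, dividing by the norms as in cosine-similarity would keep scores comparable across chunks.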
The provided Racket code uses a local SQLite database and OpenAI’s APIs for calculating text embeddings and for text completions.
Utility Functions:
- floats->string and string->floats are utility functions for converting between a list of floats and its string representation.
- read-file reads a file’s content.
- join-strings joins a list of strings with a specified separator.
- truncate-string truncates a string to a specified length.
- interleave merges two lists by interleaving their elements.
- break-into-chunks breaks a text into chunks of a specified size.
- string-to-list and decode-row are utility functions for parsing and processing database rows.
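To make the chunking and serialization helpers concrete, here is a small usage sketch. The three definitions are copied from this chapter's listing; the sample inputs are made up for illustration.

```racket
#lang racket

;; Serialize a list of floats as a space-separated string.
(define (floats->string floats)
  (string-join (map number->string floats) " "))

;; Parse a space-separated string back into a list of floats.
(define (string->floats str)
  (map string->number (string-split str)))

;; Split text into consecutive pieces of at most chunk-size characters.
(define (break-into-chunks text chunk-size)
  (let loop ((start 0) (chunks '()))
    (if (>= start (string-length text))
        (reverse chunks)
        (loop (+ start chunk-size)
              (cons (substring text start
                               (min (+ start chunk-size) (string-length text)))
                    chunks)))))

(break-into-chunks "abcdefghij" 4)                  ; => '("abcd" "efgh" "ij")
(string->floats (floats->string '(0.25 -1.5 3.0)))  ; => '(0.25 -1.5 3.0)
```

Note that chunks are cut at fixed character offsets, so a chunk boundary can fall in the middle of a word or sentence.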
Database Setup:
- Database connection is established to “test.db” and a table named “documents” is created with columns for document_path, content, and embedding.
Document Management:
- insert-document inserts a document and its associated information into the database.
- get-document-by-document-path and all-documents are utility functions for querying documents from the database.
- create-document reads a document from a file path, breaks it into chunks, computes embeddings for each chunk via a function embeddings-openai, and inserts these into the database.
Semantic Matching and Interaction:
- execute-to-list and dot-product are utility functions for database queries and vector operations.
- semantic-match performs a semantic search by calculating the dot product of embeddings of the query and documents in the database. It then aggregates contexts of documents with a similarity score above a certain threshold, and sends a new query constructed with these contexts to OpenAI for further processing.
- QA is a wrapper around semantic-match for querying.
- CHAT initiates a loop for user interaction where each user input is processed through semantic-match to generate a response, maintaining a context of the previous chat.
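A fixed similarity threshold can return no chunks at all, or a great many, depending on the query. One common alternative, shown here only as a sketch and not what semantic-match in this chapter does, is to rank chunks by score and keep the k best. The function top-k-chunks and the two-element sample embeddings are hypothetical.

```racket
#lang racket

;; Dot product of two equal-length lists of numbers.
(define (dot-product a b) (apply + (map * a b)))

;; docs is a list of (content embedding) lists. Returns the contents of
;; the k documents with the highest dot-product score against
;; query-embedding, best first.
(define (top-k-chunks query-embedding docs k)
  (let* ((scored (map (lambda (doc)
                        (cons (dot-product query-embedding (second doc))
                              (first doc)))
                      docs))
         (ranked (sort scored > #:key car)))
    (map cdr (take ranked (min k (length ranked))))))

(define docs
  '(("about sports"    (0.1 0.9))
    ("about chemistry" (0.9 0.1))
    ("about cooking"   (0.5 0.5))))

(top-k-chunks '(1.0 0.0) docs 2)  ; => '("about chemistry" "about cooking")
```

Capping the number of chunks also bounds the size of the context text that is sent to the LLM.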
Test Code:
- test function creates documents by reading from specified file paths, and performs some queries using the QA function.
The code uses a local SQLite database to store and manage document embeddings, and the OpenAI API to generate embeddings and to answer user queries from the retrieved context. The semantic search itself is the local dot-product comparison. Two functions are exported in case you want to use this example as a library: create-document and QA. Note: in the test code at the bottom of the listing, change the absolute path to reflect where you cloned the GitHub repository for this book.
#lang racket

(require db)
(require llmapis) ; supplies embeddings-openai and question-openai used below

(provide create-document QA)

; Function to convert list of floats to string representation
(define (floats->string floats)
  (string-join (map number->string floats) " "))

; Function to convert string representation back to list of floats
(define (string->floats str)
  (map string->number (string-split str)))

; Read the entire contents of a file as a string
(define (read-file infile)
  (with-input-from-file infile
    (lambda ()
      (port->string))))

(define (join-strings separator list)
  (string-join list separator))

(define (truncate-string string length)
  (substring string 0 (min length (string-length string))))

(define (interleave list1 list2)
  (if (or (null? list1) (null? list2))
      (append list1 list2)
      (cons (car list1)
            (cons (car list2)
                  (interleave (cdr list1) (cdr list2))))))

; Split text into consecutive chunks of at most chunk-size characters
(define (break-into-chunks text chunk-size)
  (let loop ((start 0) (chunks '()))
    (if (>= start (string-length text))
        (reverse chunks)
        (loop (+ start chunk-size)
              (cons (substring text start (min (+ start chunk-size) (string-length text)))
                    chunks)))))

(define (string-to-list str)
  (map string->number (string-split str)))

; Convert a database row (a vector) to a list: (document-path content embedding)
(define (decode-row row)
  (let ((document-path (vector-ref row 0))
        (context (vector-ref row 1))
        (embedding (string-to-list (read-line (open-input-string (vector-ref row 2))))))
    (list document-path context embedding)))

(define
  db
  (sqlite3-connect #:database "test.db" #:mode 'create
                   #:use-place #t))

; Create the table; the error raised when it already exists is ignored
(with-handlers ([exn:fail? (lambda (ex) (void))])
  (query-exec
   db
   "CREATE TABLE documents (document_path TEXT, content TEXT, embedding TEXT);"))

(define (insert-document document-path content embedding)
  (printf "~%insert-document:~% content:~a~%~%" content)
  (query-exec
   db
   "INSERT INTO documents (document_path, content, embedding) VALUES (?, ?, ?);"
   document-path content (floats->string embedding)))

(define (get-document-by-document-path document-path)
  (map decode-row
       (query-rows
        db
        "SELECT * FROM documents WHERE document_path = ?;"
        document-path)))

(define (all-documents)
  (map
   decode-row
   (query-rows
    db
    "SELECT * FROM documents;")))

; Chunk a file into 200-character pieces and store each piece with its embedding.
; Note: any error for a chunk (for example a failed API call) is silently skipped.
(define (create-document fpath)
  (let ((contents (break-into-chunks (file->string fpath) 200)))
    (for-each
     (lambda (content)
       (with-handlers ([exn:fail? (lambda (ex) (void))])
         (let ((embedding (embeddings-openai content)))
           (insert-document fpath content embedding))))
     contents)))

;; Assuming a function to fetch documents from database
(define (execute-to-list db query)
  (query-rows db query))

;; dot product of two lists of floating point numbers:
(define (dot-product a b)
  (cond
    [(or (null? a) (null? b)) 0]
    [else
     (+ (* (car a) (car b))
        (dot-product (cdr a) (cdr b)))]))

; Collect every chunk scoring above cutoff and pass it as context to the LLM
(define (semantic-match query custom-context [cutoff 0.7])
  (let ((emb (embeddings-openai query))
        (ret '()))
    (for-each
     (lambda (doc)
       (let* ((context (second doc))
              (embedding (third doc))
              (score (dot-product emb embedding)))
         (when (> score cutoff)
           (set! ret (cons context ret)))))
     (all-documents))
    (printf "~%semantic-search: ret=~a~%" ret)
    (let* ((context (string-join (reverse ret) " . "))
           (query-with-context (string-join (list context custom-context "Question:" query) " ")))
      (question-openai query-with-context))))

(define (QA query [quiet #f])
  (let ((answer (semantic-match query "")))
    (unless quiet
      (printf "~%~%** query: ~a~%** answer: ~a~%~%" query answer))
    answer))

(define (CHAT)
  (let ((messages '(""))
        (responses '("")))
    (let loop ()
      (printf "~%Enter chat (STOP or empty line to stop) >> ")
      (let ((string (read-line)))
        (cond
          ((or (string=? string "STOP") (< (string-length string) 1))
           (list (reverse messages) (reverse responses)))
          (else
           (let* ((custom-context
                   (string-append
                    "PREVIOUS CHAT: "
                    (string-join (reverse messages) " ")))
                  (response (semantic-match string custom-context)))
             (set! messages (cons string messages))
             (set! responses (cons response responses))
             (printf "~%Response: ~a~%" response)
             (loop))))))))

;; ... test code ...

(define (test)
  "Test Semantic Document Search Using GPT APIs and local vector database"
  (create-document
   "/Users/markw/GITHUB/Racket-AI-book/source-code/embeddingsdb/data/sports.txt")
  (create-document
   "/Users/markw/GITHUB/Racket-AI-book/source-code/embeddingsdb/data/chemistry.txt")
  (QA "What is the history of the science of chemistry?")
  (QA "What are the advantages of engaging in sports?"))
Let’s look at a few examples from a Racket REPL:
> (QA "What is the history of the science of chemistry?")
** query: What is the history of the science of chemistry?
** answer: The history of the science of chemistry dates back thousands of years. Ancient civilizations such as the Egyptians, Greeks, and Chinese were experimenting with various substances and observing chemical reactions even before the term "chemistry" was coined.

The foundations of modern chemistry can be traced back to the works of famous scholars such as alchemists in the Middle Ages. Alchemists sought to transform common metals into gold and discover elixirs of eternal life. Although their practices were often based on mysticism and folklore, it laid the groundwork for the understanding of chemical processes and experimentation.

In the 17th and 18th centuries, significant advancements were made in the field of chemistry. Prominent figures like Robert Boyle and Antoine Lavoisier began to understand the fundamental principles of chemical reactions and the concept of elements. Lavoisier is often referred to as the "father of modern chemistry" for his work in establishing the law of conservation of mass and naming and categorizing elements.

Throughout the 19th and 20th centuries, chemistry continued to progress rapidly. The development of the periodic table by Dmitri Mendeleev in 1869 revolutionized the organization of elements. The discovery of new elements, the formulation of atomic theory, and the understanding of chemical bonding further expanded our knowledge.

Chemistry also played a crucial role in various industries and technologies, such as the development of synthetic dyes, pharmaceuticals, plastics, and materials. The emergence of quantum mechanics and spectroscopy in the early 20th century opened up new avenues for understanding the behavior of atoms and molecules.

Today, chemistry is an interdisciplinary science that encompasses various fields such as organic chemistry, inorganic chemistry, physical chemistry, analytical chemistry, and biochemistry. It continues to evolve and make significant contributions to society, from developing sustainable materials to understanding biological processes and addressing global challenges such as climate change.

In summary, the history of the science of chemistry spans centuries, starting from ancient civilizations to the present day, with numerous discoveries and advancements shaping our understanding of the composition, properties, and transformations of matter.
This output is the combination of data found in the text files in the directory Racket-AI-book/source-code/embeddingsdb/data and the data that OpenAI GPT-4 was trained on. Since the local “document” file chemistry.txt is very short, most of this output is derived from the innate knowledge GPT-4 has from its training data.
In order to show that this example is also using data in the local “document” text files, I manually edited the file data/chemistry.txt adding the following made-up organic compound:
ZorroOnian Alcohol is another organic compound with the formula C 6 H 10 O.
GPT-4 was never trained on my made-up data so it has no idea what the non-existent compound ZorroOnian Alcohol is. The following answer is retrieved via RAG from the local document data (for brevity, most of the output for adding the local document files to the embedding index is not shown):
> (create-document
   "/Users/markw/GITHUB/Racket-AI-book/source-code/embeddingsdb/data/chemistry.txt")

insert-document:
 content:Amyl alcohol is an organic compound with the formula C 5 H 12 O. ZorroOnian Alcohol is another organic compound with the formula C 6 H 10 O. All eight isomers of amyl alcohol are known.

...

> (QA "what is the formula for ZorroOnian Alcohol")

** query: what is the formula for ZorroOnian Alcohol
** answer: The formula for ZorroOnian Alcohol is C6H10O.
There is also a chat interface:
Enter chat (STOP or empty line to stop) >> who is the chemist Robert Boyle

Response: Robert Boyle was an Irish chemist and physicist who is known as one of the pioneers of modern chemistry. He is famous for Boyle's Law, which describes the inverse relationship between the pressure and volume of a gas, and for his experiments on the properties of gases. He lived from 1627 to 1691.

Enter chat (STOP or empty line to stop) >> Where was he born?

Response: Robert Boyle was born in Lismore Castle, County Waterford, Ireland.

Enter chat (STOP or empty line to stop) >>
Retrieval Augmented Generation Wrap Up
Retrieval Augmented Generation (RAG) is one of the best use cases for semantic search. Another way to write RAG applications is to use a web search API to get context text for a query, and add this context data to whatever context data you have in a local embeddings data store.