Question Answering Using OpenAI APIs and a Local Embeddings Vector Database
The examples in this chapter are inspired by the Python LangChain and LlamaIndex projects, with just the parts I need for my projects written from scratch in Clojure. In March 2023 I also wrote a Python book that you might be interested in, “LangChain and LlamaIndex Projects Lab Book: Hooking Large Language Models Up to the Real World Using GPT-3, ChatGPT, and Hugging Face Models in Applications”: https://leanpub.com/langchain
The GitHub repository for this example can be found here: https://github.com/mark-watson/Clojure-AI-Book-Code/tree/main/docs_qa. We will be using an OpenAI API wrapper from the last chapter that you should have installed with lein install on your local system.
We use two models in this example: a vector embedding model and a text completion model. The vector embedding model is used to generate a vector embedding for each “chunk” of an input document. Here we break documents into 200-character chunks and calculate a vector embedding for each chunk. A vector dot product between two embedding vectors tells us how semantically similar two chunks of text are. We will also calculate embedding vectors for user queries and use those to find chunks that might be useful for answering the query. Useful chunks are concatenated to form a prompt for a GPT text completion model.
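To make the dot product comparison concrete, here is a minimal sketch using hand-made three-dimensional vectors in place of real embeddings. The vectors and the helper names (dot-product, normalize and the three toy vars) are assumptions for illustration only; the chapter's code calls the dot-product function from the openai-api.core wrapper instead. The sketch assumes the vectors are scaled to unit length, in which case the dot product equals the cosine similarity.

```clojure
;; Local stand-in for a dot product function (hypothetical helper, not the
;; wrapper library's implementation).
(defn dot-product [a b]
  (reduce + (map * a b)))

;; Scale a vector to unit length so dot products behave like cosine similarity.
(defn normalize [v]
  (let [magnitude (Math/sqrt (dot-product v v))]
    (mapv #(/ % magnitude) v)))

;; Toy "embeddings"; real embedding vectors have many more dimensions.
(def chemistry-chunk (normalize [0.9 0.1 0.2]))
(def economics-chunk (normalize [0.1 0.9 0.3]))
(def chemistry-query (normalize [0.8 0.2 0.1]))

;; The query scores higher against the chunk that points in a similar direction.
(println "query vs. chemistry chunk:" (dot-product chemistry-query chemistry-chunk))
(println "query vs. economics chunk:" (dot-product chemistry-query economics-chunk))
```

The first printed score is larger than the second, which is the property the chapter relies on when selecting chunks for a query.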
Implementing a Local Vector Database for Document Embeddings
For interactive development we will read all text files in the data directory, create a global variable doc-strings containing the string contents of each file, and then create another global variable doc-chunks where each document string has been broken down into smaller chunks. For each chunk, we will call the OpenAI API for calculating document embeddings and store the embeddings for each chunk in the global variable chunk-embeddings.
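The chunking step can be tried on its own at the REPL. The sketch below uses the same break-into-chunks definition that appears in the vectordb.clj listing later in this chapter, but with a short sample string and scaled-down sizes (20-character chunks and a minimum length of 10) that are assumptions chosen so the result is easy to read. The chapter's code uses 200-character chunks and keeps only chunks longer than 40 characters.

```clojure
;; Same chunking function as in the chapter's vectordb.clj listing.
(defn break-into-chunks [s chunk-size]
  (map #(apply str %) (partition-all chunk-size s)))

;; A 44-character sample string (hypothetical test data).
(def sample "The quick brown fox jumps over the lazy dog.")

(def sample-chunks (break-into-chunks sample 20))
;; => ("The quick brown fox " "jumps over the lazy " "dog.")

;; Drop chunks that are too short to be useful, as the chapter does with
;; its 40-character minimum.
(def kept-chunks (filter #(> (count %) 10) sample-chunks))
;; => ("The quick brown fox " "jumps over the lazy ")

(println sample-chunks)
(println kept-chunks)
```

Note that fixed-size chunking splits text wherever the character count falls, which is why chunks in the sample output later in the chapter begin and end mid-word.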
When we want to query the documents in the data directory, we calculate an embedding vector for the query and use a dot product calculation to find all chunks that are semantically similar to the query. The original text for these matching chunks is then combined with the user’s query and passed to an OpenAI API for text completion.
For this example, we use an in-memory store of embedding vectors and chunk text. Each text document is broken into smaller chunks of text. The embedding for each chunk is stored in chunk-embeddings and the chunk text is stored in doc-chunks; the two are paired in embeddings-with-chunk-texts. These pairs are used to find the chunks most similar to a prompt, and the matching chunks are used to generate a response to the prompt.
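The lookup described above can be sketched without any API calls by using a tiny hand-made store. Everything in this block (toy-store, dot, matching-chunks and the two-dimensional vectors) is hypothetical illustration data; only the idea of filtering embedding and text pairs by a dot product threshold, and the 0.79 cutoff, come from the chapter's core.clj.

```clojure
;; Hypothetical in-memory store: a vector of [embedding chunk-text] pairs.
(def toy-store
  [[[1.0 0.0] "Amyl alcohol is an organic compound."]
   [[0.0 1.0] "Economics is a social science."]
   [[0.8 0.6] "Chemistry studies matter and its changes."]])

;; Local dot product helper (stand-in for the wrapper library's function).
(defn dot [a b]
  (reduce + (map * a b)))

(defn matching-chunks
  "Return the text of every pair whose embedding scores above threshold
  against the query embedding."
  [query-embedding store threshold]
  (->> store
       (filter (fn [[emb _]] (> (dot query-embedding emb) threshold)))
       (map second)))

(println (matching-chunks [1.0 0.0] toy-store 0.79))
;; => (Amyl alcohol is an organic compound. Chemistry studies matter and its changes.)
```

This is a linear scan over every stored pair, which is fine for a handful of local documents; larger collections would likely call for an indexed vector store.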
Create Local Embeddings Vectors From Local Text Files with OpenAI GPT APIs
The code for handling OpenAI API calls is in the library openai_api in the GitHub repository for this book. You need to install that example project locally using:
lein install
The code using text embeddings is located in src/docs_qa/vectordb.clj:
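(ns docs-qa.vectordb
  (:require [clojure.java.io]
            [clojure.string]
            [openai-api.core]))

;; Parse a space-separated string of numbers into a sequence of floats.
(defn string-to-floats [s]
  (map
   #(Float/parseFloat %)
   (clojure.string/split s #" ")))

;; Return s unchanged if it is shorter than max-length, else its prefix.
(defn truncate-string [s max-length]
  (if (< (count s) max-length)
    s
    (subs s 0 max-length)))

;; Split a string into consecutive chunks of at most chunk-size characters.
(defn break-into-chunks [s chunk-size]
  (let [chunks (partition-all chunk-size s)]
    (map #(apply str %) chunks)))

;; Read the contents of every entry under dir-path.
;; rest skips the first element of file-seq, which is the directory itself.
(defn document-texts-from_dir [dir-path]
  (map
   #(slurp %)
   (rest
    (file-seq
     (clojure.java.io/file dir-path)))))

;; Break every document into 200-character chunks and flatten the result.
(defn document-texts-to-chunks [strings]
  (flatten
   (map #(break-into-chunks % 200) strings)))

(def directory-path "data")

(def doc-strings (document-texts-from_dir directory-path))

;; Keep only chunks longer than 40 characters.
(def doc-chunks
  (filter
   #(> (count %) 40)
   (document-texts-to-chunks doc-strings)))

;; One embedding vector per chunk, from the OpenAI embeddings API.
(def chunk-embeddings
  (map #(openai-api.core/embeddings %) doc-chunks))

;; Pair each embedding with the text of its chunk.
(def embeddings-with-chunk-texts
  (map vector chunk-embeddings doc-chunks))

;;(clojure.pprint/pprint
;; (first embeddings-with-chunk-texts))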
If we uncomment the print statement in the last two lines of code, we see the first embedding vector and its corresponding chunk text:
[[-0.011284076
  -0.0110755935
  -0.011531647
  -0.0374746
  -0.0018975098
  -0.0024985236
  0.0057560513 ...
  ]
 "Amyl alcohol is an organic compound with the formula C 5 H 12 O. All eight isomers of amyl alcohol are known. The most important is isobutyl carbinol, this being the chief constituent of fermentation "]
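One detail of the vectordb.clj listing above is worth keeping in mind: Clojure's map and filter return lazy sequences, so defining chunk-embeddings does not by itself call the embeddings API. The calls happen when the sequence is first consumed. The following sketch demonstrates the difference with a counting stand-in for the API call; fake-embedding, call-count and the sample chunks are hypothetical names used only for this illustration.

```clojure
;; Counts how many times the stand-in "API" is invoked.
(def call-count (atom 0))

;; Hypothetical stand-in for an embeddings API call.
(defn fake-embedding [text]
  (swap! call-count inc)
  [(double (count text))])

(def chunks ["first chunk" "second chunk" "third chunk"])

;; map is lazy: defining the var does not invoke fake-embedding.
(def lazy-embeddings (map fake-embedding chunks))
(println "calls after lazy def:" @call-count)   ;; prints 0

;; mapv is eager: all three calls happen immediately.
(def eager-embeddings (mapv fake-embedding chunks))
(println "calls after eager def:" @call-count)  ;; prints 3
```

If you would rather pay the embedding cost up front, wrapping the map call in doall or switching to mapv is one option; whether that is preferable depends on how you use the REPL during development.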
Using Local Embeddings Vector Database with OpenAI GPT APIs
The main application code is in the file src/docs_qa/core.clj:
(ns docs-qa.core
  (:require [clojure.java.jdbc :as jdbc]
            [clojure.string]
            [openai-api.core :refer :all]
            [docs-qa.vectordb :refer :all])
  (:gen-class))

;; Embed the query, keep every chunk whose embedding has a dot product
;; above 0.79 with the query embedding, and join the chunk texts.
(defn best-vector-matches [query]
  (clojure.string/join
   " ."
   (let [query-embedding
         (openai-api.core/embeddings query)]
     (map
      second
      (filter
       (fn [emb-text-pair]
         (let [emb (first emb-text-pair)]
           (> (openai-api.core/dot-product
               query-embedding
               emb)
              0.79)))
       docs-qa.vectordb/embeddings-with-chunk-texts)))))

(defn answer-prompt [prompt]
  (openai-api.core/answer-question
   prompt))

(defn -main
  []
  (println "Loading text files in ./data/, performing chunking and getting OpenAI embeddings...")
  (answer-prompt "do nothing") ;; force initiation
  (print "...done loading data and getting local embeddings.\n")
  (loop []
    (println "Enter a query:")
    (let [input (read-line)]
      (if (empty? input)
        (println "Done.")
        (do
          ;; Build a single-line prompt from the matching chunks and the query.
          (let [text (best-vector-matches input)
                prompt
                (clojure.string/replace
                 (clojure.string/join
                  "\n"
                  ["With the following CONTEXT:\n\n"
                   text
                   "\n\nANSWER:\n\n"
                   input])
                 #"\s+" " ")]
            (println "** PROMPT:" prompt)
            (println (answer-prompt prompt)))
          (recur))))))
The main example function reads the text files in ./data/, chunks the files, and uses the OpenAI APIs to get embeddings for each chunk. The main function then loops, reading questions about your local documents. For each question, the most relevant chunks are identified and combined with your question into a prompt, and then the generated prompt and the answer to the question are printed. You can enter an empty line or control-D to stop the example program:
$ lein run
Loading text files in ./data/, performing chunking and getting OpenAI embeddings...
...done loading data and getting local embeddings.
Enter a query:
What is Chemistry. How useful, really, are the sciences. Is Amyl alcohol is an organic compound?
PROMPT: With the following CONTEXT: Amyl alcohol is an organic compound with the formula C 5 H 12 O. All eight isomers of amyl alcohol are known. The most important is isobutyl carbinol, this being the chief constituent of fermentation .een 128 and 132 C only being collected. The 1730 definition of the word "chemistry", as used by Georg Ernst Stahl, meant the art of resolving mixed, compound, or aggregate bodies into their principles .; and of composing such bodies from those principles. In 1837, Jean-Baptiste Dumas considered the word "chemistry" to refer to the science concerned with the laws and effects of molecular forces.[16] .This definition further evolved until, in 1947, it came to mean the science of substances: their structure, their properties, and the reactions that change them into other substances - a characterizat .ion accepted by Linus Pauling.[17] More recently, in 1998, the definition of "chemistry" was broadened to mean the study of matter and the changes it undergoes, as phrased by Professor Raymond Chang. .ther aggregates of matter. This matter can be studied in solid, liquid, or gas states, in isolation or in combination. The interactions, reactions and transformations that are studied in chemistry are ANSWER: What is Chemistry. How useful, really, are the sciences. Is Amyl alcohol is an organic compound?

Chemistry is the science of substances: their structure, their properties, and the reactions that change them into other substances. Amyl alcohol is an organic compound with the formula C5H12O. All eight isomers of amyl alcohol
Enter a query:
What is the Austrian School of Economics?
PROMPT: With the following CONTEXT: The Austrian School (also known as the Vienna School or the Psychological School ) is a Schools of economic thought|school of economic thought that emphasizes the spontaneous organizing power of the p .rice mechanism. Austrians hold that the complexity of subjective human choices makes mathematical modelling of the evolving market extremely difficult (or Undecidable and advocate a "laissez faire" ap .proach to the economy. Austrian School economists advocate the strict enforcement of voluntary contractual agreements between economic agents, and hold that commercial transactions should be subject t .o the smallest possible imposition of forces they consider to be (in particular the smallest possible amount of government intervention). The Austrian School derives its name from its predominantly Au .strian founders and early supporters, including Carl Menger, Eugen von Böhm-Bawerk and Ludwig von Mises. Economics is the social science that analyzes the production, distribution, and consumption of .growth, and monetary and fiscal policy. The professionalization of economics, reflected in the growth of graduate programs on the subject, has been described as "the main change in economics since .essional study; see Master of Economics. Economics is the social science that studies the behavior of individuals, households, and organizations (called economic actors, players, or agents), whe . govern the production, distribution and consumption of goods and services in an exchange economy.[3] An approach to understanding these processes, through the study of agent behavior under scarcity, ANSWER: What is the Austrian School of Economics?


The Austrian School of Economics is a school of economic thought that emphasizes the spontaneous organizing power of the price mechanism and advocates a "laissez faire" approach to the economy. It is named after its predominantly Austrian founders and early supporters, including
Enter a query:
Done.
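The prompt lines in the sample output are produced by joining a fixed template with the matching chunk text and the query, then collapsing all runs of whitespace into single spaces. The sketch below pulls that step out of -main into a standalone function so it can be tried without any API calls. The function name build-prompt and the sample arguments are assumptions for illustration; the template strings and the whitespace regex are taken from the core.clj listing.

```clojure
(require '[clojure.string :as str])

(defn build-prompt
  "Combine context text and a question into a single-line prompt,
  using the same template and whitespace collapsing as core.clj."
  [context question]
  (str/replace
   (str/join "\n"
             ["With the following CONTEXT:\n\n"
              context
              "\n\nANSWER:\n\n"
              question])
   #"\s+" " "))

(println (build-prompt "Amyl alcohol is an organic compound."
                       "What is amyl alcohol?"))
;; prints: With the following CONTEXT: Amyl alcohol is an organic compound. ANSWER: What is amyl alcohol?
```

Because every newline is collapsed to a space, the model sees the context, the ANSWER: marker and the question as one continuous line, which matches the prompts shown in the sample run.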
Wrap Up for Using Local Embeddings Vector Database to Enhance the Use of GPT3 APIs with Local Documents
As I write this in May 2023, I have been working almost exclusively with OpenAI APIs for the last year and using the Python libraries for LangChain and LlamaIndex for the last three months.
I started writing the examples in this chapter for my own use, implementing a tiny subset of the LangChain and LlamaIndex libraries in Clojure for creating local embedding vector data stores and for interactive chat using my own data.