Document Question Answering Using OpenAI GPT4 APIs and a Local Embeddings Vector Database
The examples in this chapter are inspired by the Python LangChain and LlamaIndex projects, with just the parts I need for my projects written from scratch in Swift. I wrote a Python book “LangChain and LlamaIndex Projects Lab Book: Hooking Large Language Models Up to the Real World Using GPT-3, ChatGPT, and Hugging Face Models in Applications” in March 2023 (https://leanpub.com/langchain) that you might also be interested in.
The GitHub repository for this example can be found here: https://github.com/mark-watson/Docs_QA_Swift.
The entire example is in one Swift source file, main.swift; all of the program listings in this chapter come from this single file.
We use two models in this example: the text-embedding-ada-002 embedding model and the gpt-4o-mini chat model. The embedding model converts a piece of text into a vector that can be compared to other embedding vectors for semantic similarity, and the gpt-4o-mini model generates a response to a prompt.
Extending the String Type
import Foundation
import NaturalLanguage

// String utilities:

extension String {
    func removeCharacters(from forbiddenChars: CharacterSet) -> String {
        let passed = self.unicodeScalars.filter { !forbiddenChars.contains($0) }
        return String(String.UnicodeScalarView(passed))
    }

    func removeCharacters(from: String) -> String {
        return removeCharacters(from: CharacterSet(charactersIn: from))
    }

    func plainText() -> String {
        return self.removeCharacters(from: "\"`()%$#@[]{}<>")
            .replacingOccurrences(of: "\n", with: " ")
    }
}
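Here is a small illustrative snippet (not part of the Docs_QA_Swift repository) showing what plainText() does: it removes the forbidden punctuation characters and replaces newlines with spaces:

// Example use of the String extension defined above (illustrative only):
let raw = "John (and \"Sally\") bought\na new car"
print(raw.plainText())
// prints: John and Sally bought a new car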
Implementing a Local Vector Database for Document Embeddings
let openai_key = ProcessInfo.processInfo.environment["OPENAI_KEY"]!

let openAiHost = "https://api.openai.com/v1/embeddings"

func openAiHelper(body: String) -> String {
    var ret = ""
    var content = "{}"
    let requestUrl = URL(string: openAiHost)!
    var request = URLRequest(url: requestUrl)
    request.httpMethod = "POST"
    request.httpBody = body.data(using: String.Encoding.utf8)
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.setValue("Bearer " + openai_key, forHTTPHeaderField: "Authorization")
    let task = URLSession.shared.dataTask(with: request) { (data, response, error) in
        if let error = error {
            print("-->> Error accessing OpenAI servers: \(error)")
            return
        }
        if let data = data, let s = String(data: data, encoding: .utf8) {
            content = s
            //print("** s=", s)
            CFRunLoopStop(CFRunLoopGetMain())
        }
    }
    task.resume()
    CFRunLoopRun()
    let c = String(content)
    let i1 = c.range(of: "\"embedding\":")
    if let r1 = i1 {
        let i2 = c.range(of: "]")
        if let r2 = i2 {
            ret = String(String(String(c[r1.lowerBound..<r2.lowerBound]).dropFirst(15)).dropLast(2))
        }
    }
    return ret
}

public func embeddings(someText: String) -> [Float] {
    let body: String = "{\"input\": \"" + someText + "\", \"model\": \"text-embedding-ada-002\" }"
    return readList(openAiHelper(body: body))
}

func dotProduct(_ list1: [Float], _ list2: [Float]) -> Float {
    if list1.count != list2.count {
        //fatalError("Lists must have the same length.")
        print("WARNING: Lists must have the same length: \(list1.count) != \(list2.count)")
        return 0.0
    }

    var result: Float = 0

    for i in 0..<list1.count {
        result += list1[i] * list2[i]
    }

    return result
}
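A note on the similarity measure: OpenAI returns text-embedding-ada-002 vectors normalized to (approximately) unit length, so the raw dot product closely tracks cosine similarity. If you swap in an embedding model that does not normalize its output, a cosine similarity helper along these lines (my own sketch, not in the repository) is the safer choice:

// Hypothetical helper (not in Docs_QA_Swift): cosine similarity is the
// dot product divided by the product of the vector magnitudes, so it is
// insensitive to vector length.
func cosineSimilarity(_ list1: [Float], _ list2: [Float]) -> Float {
    if list1.count != list2.count || list1.isEmpty {
        return 0.0
    }
    var dot: Float = 0
    var norm1: Float = 0
    var norm2: Float = 0
    for i in 0..<list1.count {
        dot += list1[i] * list2[i]
        norm1 += list1[i] * list1[i]
        norm2 += list2[i] * list2[i]
    }
    if norm1 == 0 || norm2 == 0 {
        return 0.0
    }
    return dot / (norm1.squareRoot() * norm2.squareRoot())
}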
The source file contains example code for creating embeddings and using the dot product to find semantic similarity:
let emb1 = embeddings(someText: "John bought a new car")
let emb2 = embeddings(someText: "Sally drove to the store")
let emb3 = embeddings(someText: "The dog saw a cat")
let dotProductResult1 = dotProduct(emb1, emb2)
print(dotProductResult1)
let dotProductResult2 = dotProduct(emb1, emb3)
print(dotProductResult2)
The output is:
0.8416926
0.79411536
For this example, we use an in-memory store of embedding vectors and chunk text. Each text document is broken into smaller chunks; every chunk is embedded, the embedding is appended to embeddingsStore, and the chunk text is appended to the parallel chunks array. At query time the embeddingsStore and chunks arrays are used to find the chunks most similar to a prompt, and those chunks supply the context for generating a response to the prompt.
var embeddingsStore: Array<[Float]> = Array()
var chunks: Array<String> = Array()

func addEmbedding(_ embedding: [Float]) {
    embeddingsStore.append(embedding)
    //print("Added embedding: count=\(embeddingsStore.count) \(embedding)")
}

func addChunk(_ chunk: String) {
    chunks.append(chunk)
}
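The two arrays are kept index-aligned: embeddingsStore[i] holds the embedding for chunks[i]. As a hypothetical convenience (this helper is not in the repository), finding the single best-matching chunk for a query embedding is just a linear scan over the store; the query function defined later uses a similarity threshold instead so that several relevant chunks can contribute context:

// Hypothetical helper (not in Docs_QA_Swift): return the index of the
// stored chunk whose embedding has the highest dot product with the
// query embedding, or nil if the store is empty.
func bestChunkIndex(for queryEmbedding: [Float]) -> Int? {
    var bestIndex: Int? = nil
    var bestScore = -Float.greatestFiniteMagnitude
    for i in 0..<embeddingsStore.count {
        let score = dotProduct(queryEmbedding, embeddingsStore[i])
        if score > bestScore {
            bestScore = score
            bestIndex = i
        }
    }
    return bestIndex
}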
Creating Local Embedding Vectors From Local Text Files With OpenAI GPT APIs
func readList(_ input: String) -> [Float] {
    return input.split(separator: ",\n").compactMap {
        Float($0.trimmingCharacters(in: .whitespaces))
    }
}

let fileManager = FileManager.default
let currentDirectoryURL = URL(fileURLWithPath: fileManager.currentDirectoryPath)
let dataDirectoryURL = currentDirectoryURL.appendingPathComponent("data")

// Top level code expression to process all *.txt files in the data/ directory:

do {
    let directoryContents = try fileManager.contentsOfDirectory(at: dataDirectoryURL,
                                                                includingPropertiesForKeys: nil)
    let txtFiles = directoryContents.filter { $0.pathExtension == "txt" }
    for txtFile in txtFiles {
        let content = try String(contentsOf: txtFile)
        let chnks = segmentTextIntoChunks(text: content.plainText(),
                                          max_chunk_size: 100)
        for chunk in chnks {
            let embedding = embeddings(someText: chunk)
            if embedding.count > 0 {
                addEmbedding(embedding)
                addChunk(chunk)
            }
        }
    }
} catch {
    print("Error processing files in the data/ directory: \(error)")
}

func segmentTextIntoSentences(text: String) -> [String] {
    let tokenizer = NLTokenizer(unit: .sentence)
    tokenizer.string = text
    let sentences = tokenizer.tokens(for: text.startIndex..<text.endIndex).map {
        token -> String in
        return String(text[token.lowerBound..<token.upperBound])
    }
    return sentences
}

func segmentTextIntoChunks(text: String, max_chunk_size: Int) -> [String] {
    let sentences = segmentTextIntoSentences(text: text)
    var chunks: Array<String> = Array()
    var currentChunk = ""
    var currentChunkSize = 0
    for sentence in sentences {
        if currentChunkSize + sentence.count < max_chunk_size {
            currentChunk += sentence
            currentChunkSize += sentence.count
        } else {
            chunks.append(currentChunk)
            currentChunk = sentence
            currentChunkSize = sentence.count
        }
    }
    // Append the final partially-filled chunk (otherwise the last
    // sentences of a document would be dropped):
    if currentChunk.count > 0 {
        chunks.append(currentChunk)
    }
    return chunks
}
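The following illustrative snippet (not in the repository) shows how the chunker groups whole sentences until adding the next sentence would push a chunk past max_chunk_size; each chunk and its character count are printed:

// Illustrative call (not in Docs_QA_Swift); the exact chunk boundaries
// depend on how NLTokenizer splits the sentences.
let sampleText = "Chemistry is the study of matter. " +
                 "It examines composition, structure, and properties. " +
                 "Modern chemistry grew out of alchemy."
let sampleChunks = segmentTextIntoChunks(text: sampleText, max_chunk_size: 80)
for (i, chunk) in sampleChunks.enumerated() {
    print("chunk \(i) (\(chunk.count) chars): \(chunk)")
}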
Using a Local Embeddings Vector Database With OpenAI GPT APIs
We use the OpenAI chat completion API with the gpt-4o-mini model (code reformatted to fit the page width):
let openAiQaHost = "https://api.openai.com/v1/chat/completions"

func openAiQaHelper(body: String) -> String {
    var ret = ""
    var content = "{}"
    let requestUrl = URL(string: openAiQaHost)!
    var request = URLRequest(url: requestUrl)
    request.httpMethod = "POST"
    request.httpBody = body.data(using: String.Encoding.utf8)
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.setValue("Bearer " + openai_key, forHTTPHeaderField: "Authorization")
    let task = URLSession.shared.dataTask(with: request) { (data, response, error) in
        if let error = error {
            print("-->> Error accessing OpenAI servers: \(error)")
            return
        }
        if let data = data, let s = String(data: data, encoding: .utf8) {
            content = s
            CFRunLoopStop(CFRunLoopGetMain())
        }
    }
    task.resume()
    CFRunLoopRun()
    let c = String(content)
    //print("DEBUG response c:", c)
    // Pull the returned content out of the response string instead
    // of using a JSON parser:
    let i1 = c.range(of: "\"content\":")
    if let r1 = i1 {
        let i2 = c.range(of: "\"}")
        if let r2 = i2 {
            ret = String(
                String(
                    String(c[r1.lowerBound..<r2.lowerBound])
                        .dropFirst(11)))
        }
    }
    return ret
}

func questionAnswering(context: String, question: String) -> String {
    let body =
        "{ \"model\": \"gpt-4o-mini\", \"messages\": [ {\"role\": \"system\", \"content\": \"" +
        context + "\"}, {\"role\": \"user\", \"content\": \"" + question + "\"}]}"

    //print("DEBUG body:", body)

    let answer = openAiQaHelper(body: body)
    if let i1 = answer.range(of: "\"content\":") {
        // The variable answer is a string containing JSON. We want to
        // extract the value of the "content" key and we do so without
        // using a JSON parser.
        return String(answer[answer.startIndex..<i1.lowerBound])
    }
    return answer
}

// Top level query interface:

func query(_ query: String) -> String {
    let queryEmbedding = embeddings(someText: query)
    var contextText = ""
    for i in 0..<embeddingsStore.count {
        let dotProductResult = dotProduct(queryEmbedding, embeddingsStore[i])
        if dotProductResult > 0.8 {
            contextText.append(chunks[i])
            contextText.append(" ")
        }
    }
    //print("\n\n+++++++ contextText = \(contextText)\n\n")
    let answer = questionAnswering(context: contextText, question: query)
    //print("* * debug: query: ", query)
    //print("* * debug: answer:", answer)
    return answer
}

print(query("What is the history of chemistry?"))
print(query("What is the definition of sports?"))
print(query("What is the microeconomics?"))
The output for these three questions looks like:
The history of chemistry dates back to ancient times when people began to manipulate materials to produce useful products. The ancient Egyptians were skilled in metallurgy and used various chemicals to embalm bodies. The Greeks were interested in theories of matter and sought to understand the nature of substances.\n\nDuring the Middle Ages, alchemy became popular, with alchemists seeking to transform base metals into gold and searching for an elixir of life. While alchemy was considered a pseudoscience, it did lead to important discoveries such as the distillation of alcohol and the discovery of various acids.\n\nThe Scientific Revolution of the 17th century brought about significant changes in chemistry. The work of Robert Boyle, Antoine Lavoisier, and others laid the foundation for modern chemistry. Lavoisier is considered the father of modern chemistry for his work in establishing the law of conservation of mass, which states that matter cannot be created or destroyed.\n\nThe 19th century saw the development of organic chemistry, as scientists sought to understand the chemistry of carbon-based compounds, which make up many biological molecules. The 20th century brought about significant advances in analytical chemistry, as well as the development of quantum mechanics and the discovery of the structure of DNA, which revolutionized the field of biochemistry.\n\nToday, chemistry plays a critical role in fields such as medicine, agriculture, materials science, and environmental science.

Sports can be defined as activities involving physical athleticism, physical dexterity, and governed by rules to ensure fair competition and consistent adjudication of the winner. The term \"sport\" originally meant leisure, but it now primarily refers to physical activities that involve competition at various levels of skill and proficiency. Some organizations also include all physical activity and exercise in the definition of sport.

Microeconomics is a branch of economics that focuses on the behavior and decision-making of individual units within an economy, such as households, firms, and industries. It examines how these units interact in various markets to determine the prices of goods and services and how resources are allocated efficiently. Microeconomics also considers the role of government policies and regulations in influencing these interactions and outcomes. Topics studied in microeconomics include supply and demand, market structures, consumer behavior, production and cost analysis, and welfare analysis.
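The helper functions in this chapter pull the embedding and the answer text out of the raw JSON responses by string slicing, which keeps the code short but is brittle: it can break if OpenAI changes the response layout, or if the generated content itself contains the delimiter strings being searched for. A more robust alternative (my own sketch, not part of the repository) is to parse the chat completion response with JSONSerialization; the same approach works for the embeddings response by reading data[0].embedding as an array of numbers:

// Hypothetical alternative (not in Docs_QA_Swift): extract the assistant
// message from a chat completion response with JSONSerialization instead
// of string slicing. Returns an empty string if the JSON does not have
// the expected shape.
func extractChatContent(fromJson jsonText: String) -> String {
    guard let data = jsonText.data(using: .utf8),
          let parsed = try? JSONSerialization.jsonObject(with: data),
          let obj = parsed as? [String: Any],
          let choices = obj["choices"] as? [[String: Any]],
          let message = choices.first?["message"] as? [String: Any],
          let content = message["content"] as? String
    else {
        return ""
    }
    return content
}

Building the request bodies with JSONSerialization.data(withJSONObject:) would likewise avoid the escaping problems that arise when the context or question contains double quotes or newlines.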
Wrap Up for Using a Local Embeddings Vector Database to Enhance the Use of GPT APIs With Local Documents
As I write this in early April 2023, I have been working almost exclusively with OpenAI APIs for the last year and using the Python libraries for LangChain and LlamaIndex for the last three months.
I started writing the examples in this chapter for my own use, implementing a tiny subset of the LangChain and LlamaIndex libraries in Swift in order to write efficient command line utilities for creating local embedding vector data stores and for interactive chat using my own data.
By writing about my “scratching my own itch” command line experiments here, I hope to get pull requests for https://github.com/mark-watson/Docs_QA_Swift from readers who are interested in helping to extend this code with new functionality.