More Useful Libraries for Working with Unstructured Text Data
Here we look at examples using two libraries that I find useful for my work: EmbedChain and Kor.
EmbedChain Wrapper for LangChain Simplifies Application Development
Taranjeet Singh developed a very nice wrapper library, EmbedChain (https://github.com/embedchain/embedchain), that simplifies writing “query your own data” applications by choosing good defaults for the LangChain library.
I will show one simple example that I run on my laptop to search the contents of all of the books I have written as well as a large number of research papers. You can find my example in the GitHub repository for this book in the directory langchain-book-examples/embedchain_test. As usual, you will need an OpenAI API account and to set the environment variable OPENAI_API_KEY to the value of your key.
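Before running the scripts, it can help to confirm that the key is actually visible to Python. Here is a small, optional check (the helper name check_openai_key is my own, not part of EmbedChain):

```python
import os

def check_openai_key(env=os.environ):
    """Return True if OPENAI_API_KEY is set and non-empty in env."""
    return bool(env.get("OPENAI_API_KEY"))

if __name__ == "__main__":
    if check_openai_key():
        print("OPENAI_API_KEY is set")
    else:
        print("Please set OPENAI_API_KEY before running the examples")
```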
I have copied PDF files for all of this content to the directory ~/data on my laptop. Building the local vector embedding data store takes a short while, so I use two Python scripts. The first script, process_pdfs.py, is shown here:
# reference: https://github.com/embedchain/embedchain

from embedchain import App
import os

test_chat = App()

my_books_dir = "/Users/mark/data/"

for filename in os.listdir(my_books_dir):
    if filename.endswith('.pdf'):
        print("processing filename:", filename)
        test_chat.add("pdf_file",
                      os.path.join(my_books_dir, filename))
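EmbedChain hides the retrieval step, but the idea behind a local vector store is easy to sketch: each text chunk is stored with its embedding vector, and a query returns the chunks whose vectors are most similar to the query's embedding, usually by cosine similarity. Here is a minimal pure-Python illustration with toy three-dimensional vectors (real embeddings have hundreds of dimensions); this is my own sketch of the general technique, not EmbedChain's internals:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunks, k=2):
    """chunks is a list of (text, embedding) pairs; return the best k texts."""
    scored = sorted(chunks,
                    key=lambda c: cosine_similarity(query_vec, c[1]),
                    reverse=True)
    return [text for text, _ in scored[:k]]

# Toy "embeddings" standing in for chunks of the indexed PDFs:
chunks = [("Haskell lists", [0.9, 0.1, 0.0]),
          ("Common Lisp editing", [0.1, 0.9, 0.0]),
          ("Web scraping", [0.0, 0.2, 0.9])]

print(top_k([0.8, 0.2, 0.1], chunks, k=1))  # prints ['Haskell lists']
```

A real system embeds the query with the same model used for the documents, retrieves the top matches, and passes them to the LLM as context; EmbedChain wires all of that together for you.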
Here is a demo Python script app.py that makes three queries:
from embedchain import App

test_chat = App()

def test(q):
    print(q)
    print(test_chat.query(q), "\n")

test("How can I iterate over a list in Haskell?")
test("How can I edit my Common Lisp files?")
test("How can I scrape a website using Common Lisp?")
The output looks like:
$ python app.py
How can I iterate over a list in Haskell?
To iterate over a list in Haskell, you can use recursion or higher-order functions like `map` or `foldl`.

How can I edit my Common Lisp files?
To edit Common Lisp files, you can use Emacs with the Lisp editing mode. By setting the default auto-mode-alist in Emacs, whenever you open a file with the extensions ".lisp", ".lsp", or ".cl", Emacs will automatically use the Lisp editing mode. You can search for an "Emacs tutorial" online to learn how to use the basic Emacs editing commands.

How can I scrape a website using Common Lisp?
One way to scrape a website using Common Lisp is to use the Drakma library. Paul Nathan has written a library using Drakma called web-trotter.lisp, which is available under the AGPL license at articulate-lisp.com/src/web-trotter.lisp. This library can be a good starting point for your scraping project. Additionally, you can use the wget utility to make local copies of a website. The command "wget -m -w 2 http:/knowledgebooks.com/" can be used to mirror a site with a two-second delay between HTTP requests for resources. The option "-m" indicates to recursively follow all links on the website, and the option "-w 2" adds a two-second delay between requests. Another option, "wget -mk -w 2 http:/knowledgebooks.com/", converts URI references to local file references on your local mirror. Concatenating all web pages into one file can also be a useful trick.
Kor Library
The Kor library was written by Eugene Yurtsev. Kor is useful for using LLMs to extract structured data from unstructured text. Kor works by generating appropriate prompt text that explains to GPT-3.5 what information to extract, then appending the text to be processed.
The GitHub repository for Kor is under active development, so please check the project for updates. The documentation is at https://eyurtsev.github.io/kor/.
For the following example, I modified an example in the Kor documentation for extracting dates in text.
# From documentation: https://eyurtsev.github.io/kor/

from kor.extraction import create_extraction_chain
from kor.nodes import Object, Text, Number
from langchain.chat_models import ChatOpenAI
from pprint import pprint
import warnings; warnings.filterwarnings('ignore')

llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0,
    max_tokens=2000,
    frequency_penalty=0,
    presence_penalty=0,
    top_p=1.0,
)

schema = Object(
    id="date",
    description=(
        "Any dates found in the text. Should be output in the format:"
        " January 12, 2023"
    ),
    attributes=[
        # Each example is an (input text, expected extraction) pair:
        Text(id="month",
             description="The month of the date",
             examples=[("Someone met me on December 21, 1995",
                        "December"),
                       ("Let's meet up on January 12, 2023 and discuss"
                        " our yearly budget",
                        "January")])
    ],
)

chain = create_extraction_chain(llm, schema, encoder_or_encoder_class='json')

pred = chain.predict_and_parse(text="I will go to California May 1, 2024")['data']
print("* month mentioned in text=", pred)
Sample output:
$ python dates.py
* month mentioned in text= {'date': {'month': 'May'}}
Kor is a library focused on extracting structured data from text. You can get the same effects by writing your own prompts manually for GPT-style LLMs, but using Kor can save development time.
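To make that comparison concrete, here is a rough sketch of the hand-written alternative: build an instruction prompt, append the text to be processed, and parse the model's JSON reply. The prompt wording and the helper name make_extraction_prompt are my own illustration, not what Kor actually generates:

```python
import json

def make_extraction_prompt(text):
    # Tell the model what to extract and the exact output format,
    # then append the text to be processed -- the pattern Kor automates.
    return ("Extract any dates from the text below. Respond with JSON of "
            'the form {"date": {"month": "<month name>"}} and nothing else.'
            "\n\nText: " + text)

prompt = make_extraction_prompt("I will go to California May 1, 2024")
print(prompt)

# The model's reply would then be parsed; for example:
reply = '{"date": {"month": "May"}}'
print(json.loads(reply)["date"]["month"])  # prints May
```

Maintaining prompts like this by hand gets tedious as schemas grow, which is where Kor's schema objects and generated few-shot examples pay off.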