LLM Techniques for Structured Data Conversion

Here we look at a simple example of converting CSV spreadsheet files to JSON, but the idea of using LLMs for data conversion is general purpose.

Using LLMs helps handle ambiguity. Traditional symbolic AI methods often struggle with the nuances of human language. LLMs use context to resolve ambiguity, which can make extraction more accurate.

LLMs are also effective at handling complex or previously unseen formats, often given only a single example in the prompt (one-shot prompting). LLMs are trained on vast amounts of diverse text data, making them more adaptable to unexpected variations in data formats than rule-based approaches.

Using LLMs for application development can reduce manual effort by automating many parts of the conversion process that traditionally required significant human intervention and the creation of detailed extraction rules.

Example Prompt for Converting CSV Files to JSON

In the prompt we supply one example, with two data rows, of converting between these two formats:

Given the example below, convert a CSV spreadsheet text file to a JSON text file:

Example:
CSV:
name,address, email
John Doe, 1234 Maple Street, Springfield,johndoe@example.com
"Jane Smith", "5678 Oak Avenue, Anytown", jane@smith764323.com
Output:
{
  "name": "John Doe",
  "address": "1234 Maple Street, Springfield",
  "email": "johndoe@example.com"
}
{
  "name": "Jane Smith",
  "address": "5678 Oak Avenue, Anytown",
  "email": "jane@smith764323.com"
}

Process Text: "{input_csv}"
Output:
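
To see why the first data row in the example is a useful test of ambiguity, here is a minimal sketch (not part of the book's example code) that feeds the same three CSV lines to Python's standard csv module. The variable names are illustrative. Because the address in the John Doe row contains an unquoted comma, a rule-based parser splits that row into four fields instead of three, while the quoted Jane Smith row parses into three fields as intended:

```python
import csv
import io

# The CSV lines from the example prompt, copied verbatim.
example_csv = '''name,address, email
John Doe, 1234 Maple Street, Springfield,johndoe@example.com
"Jane Smith", "5678 Oak Avenue, Anytown", jane@smith764323.com
'''

# skipinitialspace=True ignores the blank after each comma, so that
# quoted fields preceded by a space are still recognized as quoted.
rows = list(csv.reader(io.StringIO(example_csv), skipinitialspace=True))

# Number of fields the rule-based parser found in each row.
field_counts = [len(row) for row in rows]
for row in rows:
    print(len(row), row)
```

The John Doe row only converts correctly if something infers that "Springfield" belongs to the address, which is the kind of contextual judgment the prompt asks the LLM to make.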

Example Code for Converting CSV Files to JSON

The example in file structured_data_conversion/person_data.py reads the prompt template file and substitutes the CSV data from the test file test.csv for the {input_csv} placeholder. The resulting prompt is passed to the OpenAI chat completions API:

import os

from openai import OpenAI

# Create the client using the key in the OPENAI_API_KEY environment variable
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Read the prompt template from a text file
with open('prompt.txt', 'r') as file:
    prompt_template = file.read()

# Read the CSV data and substitute it for the {input_csv} placeholder
with open('test.csv', 'r') as file:
    input_csv = file.read()
prompt = prompt_template.replace("{input_csv}", input_csv)

# Use the OpenAI chat completions API to generate a response with GPT-4
completion = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "user",
            "content": prompt,
        },
    ],
)

print(completion.choices[0].message.content)

Here is the test CSV input file:

last_name,first_name,email
"Jackson",Michael,mj@musicgod.net
Jordan,Michael,"mike@retired.com"
Smith, John, john@acme41.com

Notice that this file quotes strings inconsistently and has stray spaces after the commas in the last row, which makes it more representative of data you might see in the wild. The generated JSON looks like:

{
  "last_name": "Jackson",
  "first_name": "Michael",
  "email": "mj@musicgod.net"
}
{
  "last_name": "Jordan",
  "first_name": "Michael",
  "email": "mike@retired.com"
}
{
  "last_name": "Smith",
  "first_name": "John",
  "email": "john@acme41.com"
}
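
The output above is a sequence of JSON objects, not a single JSON array, so json.loads would reject it as a whole. As a minimal sketch (the helper names parse_concatenated_json and parse_csv_baseline are hypothetical and not part of the book's example code), the objects can be read one at a time with the standard library's json.JSONDecoder.raw_decode and then cross-checked against a rule-based parse of the same input using csv.DictReader. The sketch assumes the model reply contains only JSON objects, with no surrounding commentary:

```python
import csv
import io
import json


def parse_concatenated_json(text):
    """Parse a string that holds several JSON values back to back."""
    decoder = json.JSONDecoder()
    objects = []
    pos = 0
    while pos < len(text):
        # raw_decode does not accept leading whitespace, so skip it here.
        if text[pos].isspace():
            pos += 1
            continue
        obj, pos = decoder.raw_decode(text, pos)
        objects.append(obj)
    return objects


def parse_csv_baseline(csv_text):
    """Rule-based parse, used only to sanity-check the LLM output."""
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    return [dict(row) for row in reader]


# The test CSV and the generated JSON shown in this section.
test_csv = '''last_name,first_name,email
"Jackson",Michael,mj@musicgod.net
Jordan,Michael,"mike@retired.com"
Smith, John, john@acme41.com
'''

llm_output = '''
{"last_name": "Jackson", "first_name": "Michael", "email": "mj@musicgod.net"}
{"last_name": "Jordan", "first_name": "Michael", "email": "mike@retired.com"}
{"last_name": "Smith", "first_name": "John", "email": "john@acme41.com"}
'''

records = parse_concatenated_json(llm_output)
baseline = parse_csv_baseline(test_csv)
print("LLM output matches rule-based parse:", records == baseline)
```

A check like this only works when the input is regular enough for a rule-based parser to handle. For messier input it is still useful to confirm that the reply parses as JSON and that every record has the expected keys.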
