Using Local Large Language Models

Local Large Language Models (LLMs) are AI models, similar to GPT-3 or GPT-4, but are designed to be run on local machines or secured leased cloud-based servers rather than public cloud-based platforms that may or may not respect the privacy of data. For the sake of our discussion I call both LLMs run on your local laptop as well as those run on a leased VPS as “local LLMs.”

In addition to control over your own data, local models allow users to harness the power of language models without needing a continuous internet connection or relying on a remote server. Local LLMs can be used in a variety of applications ranging from drafting emails, creating content, to developing interactive chatbots. Additionally, due to their local nature, they can be customized to specific tasks or domains, making them a valuable tool for businesses and organizations with unique requirements.

One of the primary advantages of using local LLMs is the enhanced privacy they offer. Since the models are run locally on users’ own hardware, there’s no need for user inputs to be sent over the internet, reducing the potential for data leaks or breaches. This is a critical benefit for organizations handling sensitive information or for individual users concerned about their privacy. Local LLMs also offer better control over data and processing, as users are not tied to the terms and conditions of a cloud-based service provider.

However, despite their advantages, local LLMs also have their challenges. The most significant of these is the computational resources required to run such models. Large language models are resource-intensive and need powerful hardware to operate effectively. This can be a limiting factor for many users, particularly smaller businesses or individuals without access to high-performance computing resources. Furthermore, setting up and maintaining a local LLM might require technical expertise that not all users possess, making them less accessible for non-technical users.

The scalability of local LLMs can also be a challenge. While cloud-based models can easily be scaled up to handle increased load by adding more server capacity, scaling local LLMs would typically require purchasing and setting up additional hardware. This can make local LLMs less flexible and more costly in scenarios with varying or unpredictable demand. Despite these challenges, local LLMs still represent an important development in the field of AI, providing users with greater privacy, control, and customization potential for their AI applications.

Available Public Large Language Models

Corporations like Meta, Google, and Hugging Face all often share models that they have trained with the public. This is good and bad. The good part is that these models are expensive to train and once you have a copy of a model you can run it locally or on a leased VPS in a secure privacy-preserving way. The bad aspects of some of these models usually a result of training data:

Data used to train models may contain text that is bigoted or contains incorrect information.
Data used to train models may not be effective in meeting your application’s requirements. See the chapter Fine Tuning for ways to mitigate these problems.

Some examples of publicly available LLMs include GPT-Neo, GPT-J, GPT-NeoX, XLNet, Roberta, DeBERTa, XLM-RoBERTa, DistilBERT, and OPT-175B.

These models excel at a wide range of tasks such as reading comprehension, text classification, sentiment analysis, and others. The efficacy with which they accomplish tasks, and the range of tasks at which they are capable, seems to be a function of the amount of resources (data, parameter-size, computing power) devoted to training them.

It is important to note that publicly available large language models do not provide a degree of confidence for the accuracy of their output. One main challenge is that they are not explicitly designed to provide truthful answers; rather, they are primarily trained to generate text that follows the patterns of human language.

In terms of recommendations for their use, it is important to be aware of their limitations and potential biases. It is also important to use them responsibly and in accordance with any licensing or usage restrictions set by the provider.

StabilityAI’s StableLM Using lil-parrot Library

Install for the lil-parrot library with GPU enabled from:

1     https://github.com/Lightning-AI/lit-gpt

The largest instruction tuned model is for good some purposes and weak on others.

1 cd lit-gpt
2 python scripts/download.py --repo_id stabilityai/stablelm-tuned-alpha-7b
3 python scripts/convert_hf_checkpoint.py --checkpoint_dir checkpoints/stabilityai/stablelm-tuned-alpha-7b
4 python chat/base.py --checkpoint_dir checkpoints/stabilityai/stablelm-tuned-alpha-7b

It writes decent Python programs, and wrote an pretty good poem about a parrot. It does not do well with context text, then a query.

Google’s Flan-T5-XXL

The Flan-T5-XXL model can run on a Linux server with a 48G A6000 GPU. Here are memory requirements I measured during my experiments:

google/flan-t5-xxl 30 to 90% memory used, 16 bit float weights
google/flan-t5-xl 39% of GPU memory used (32 float weights)

The storage on disk (stored in ~/.cache/huggingface/hub is:

11G models—google—flan-t5-xl
42G models—google—flan-t5-xxl

In my experiments flan-t5-xxl can write simple poetry and it was also able write business plans. Here is a short example Python script (derived from examples in Hugging Face documentation) that you can use to get started:

 1 # pip install transformers accelerate
 2 
 3 import torch
 4 from transformers import T5Tokenizer, T5ForConditionalGeneration
 5 
 6 tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xxl")
 7 model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xxl", device_map="auto", torch_dtype=torch.float16)
 8 
 9 #print("model.config.max_new_tokens:", model.config.max_new_tokens)
10 #model.config.max_new_tokens = 300
11 #print("model.config.max_new_tokens:", model.config.max_new_tokens)
12 
13 def generate(input_text, max_new_tokens = 100):
14     input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
15 
16     outputs = model.generate(input_ids, max_new_tokens = max_new_tokens)
17     print(tokenizer.decode(outputs[0]))
18 
19 generate("translate English to German: How old are you?")
20 generate("Bob is 23 years old. Mary is 34 years old. Sam is 17 years old. Who is older than 30?")
21 generate("What is the capital of California?")
22 generate("Write a 6 line poem about my pet parrot escaping out the window", max_new_tokens = 600)
23 generate("Write a business plan for selling computer art online, including pricing advice", max_new_tokens = 800)

Running FastChat with Vicuna LLMs

The GitHub repository for FastChat is https://github.com/lm-sys/FastChat. I recommend cloning this repository and spending some time reading the documentation in the docs directory.

I run FastChat with the Vicuna LLMs on a Lambda Labs GPU VPS. I run the 7B and 13B models on a VPS with a single Nvidia A6000. I use a VPS with two Nvidia A6000s to run the 33B model.

To run on a 2 A6000 GPU VPS:

1 pip install fschat transformers bitsandbytes 
2 export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
3 export 'PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:1024'
4 python -m fastchat.serve.cli --model-path lmsys/vicuna-33b-v1.3 --num-gpus 2

Here is example output:

1 USER: Bill is 25 years old and Mary is 30 years old. Who is older?
2 ASSISTANT: Mary is older than Bill. At 30 years old, she is 5 years older than Bill, who is 25 years old.

You can also run FastChat in server mode and it emulates the OpenAI APIs so it is possible to run applications either using OpenAI APIs, or FastChat as a service. Here are the directions for replacing OpenAI APIs with FastChat in an application using LangChain: https://github.com/lm-sys/FastChat/blob/main/docs/langchain_integration.md. In this example both an embedding model and a text completion model are used.

Building a Chat Application Using Text From My Books and Prompt Engineering

For this example we use a new library EmbedChain (that in turn uses LangChain and LlamaIndex) and this example is in the GitHub repository for this book in directory safe-for-humans-AI-software/experiments/embedchain_test.

We start with the script process_pdfs.py that reads PDFs for my books and uses an OpenAI embedding model to create local text chunk embeddings vector datastore that is written to the subdirectory db. Here is a listing of process_pdfs.py:

 1 # https://github.com/embedchain/embedchain
 2 
 3 from embedchain import App
 4 import os
 5 
 6 test_chat = App()
 7 
 8 my_books_dir = "/Users/markwatson/Library/Mobile Documents/com~apple~CloudDocs/Documents/my book PDFs/"
 9 
10 for filename in os.listdir(my_books_dir):
11     if filename.endswith('.pdf'):
12         print("processing filename:", filename)
13         test_chat.add("pdf_file", os.path.join(my_books_dir, filename))

If you have a directory of PDFs on your laptop that you would like to use, just change the path for my_books_dir.

The script app.py is a simple application that allows us to chat against our local text chunk index using an OpenAI text completion model:

 1 # https://github.com/embedchain/embedchain
 2 
 3 from embedchain import App
 4 
 5 test_chat = App()
 6 
 7 def test(q):
 8     print(q)
 9     print(test_chat.query(q), "\n")
10 
11 test("How can I iterate over a list in Haskell?")
12 test("How can I edit my Common Lisp files?")
13 test("How can I scrape a website using Common Lisp?")

Here is the sample output for this script:

1 embedchain_test $ p app.py 
2 How can I iterate over a list in Haskell?
3 To iterate over a list in Haskell, you can use recursion or higher-order functions like `map` or `foldl`. 
4 
5 How can I edit my Common Lisp files?
6 To edit Common Lisp files, you can use Emacs with the Lisp editing mode. By setting the default auto-mode-alist in Emacs, whenever you open a file with the extensions ".lisp", ".lsp", or ".cl", Emacs will automatically use the Lisp editing mode. You can search for an "Emacs tutorial" online to learn how to use the basic Emacs editing commands. 
7 
8 How can I scrape a website using Common Lisp?
9 One way to scrape a website using Common Lisp is to use the Drakma library. Paul Nathan has written a library using Drakma called web-trotter.lisp, which is available under the AGPL license at articulate-lisp.com/src/web-trotter.lisp. This library can be a good starting point for web scraping in Common Lisp. Additionally, you can use the wget utility to make local copies of a website. The command "wget -m -w 2 http:/knowledgebooks.com/" can be used to mirror a website with a two-second delay between HTTP requests for resources. The option "-m" indicates to recursively follow all links on the website, and the option "-w 2" adds a two-second delay between requests. Another option, "wget -mk -w 2 http:/knowledgebooks.com/", converts URI references to local file references on your local mirror. Concatenating all web pages into one file can also be a useful trick.

The EmbedChain library abstracts away many of the details for using LangChain and LlamaIndex by using reasonable defaults.

Up next

Fine-Tuning LLMs Using Your Data