Running Local LLMs Using Ollama

We saw an example at the end of the last chapter that used the Llama.cpp project to run a local model with LangChain. As I update this chapter in April 2024, I now most often use the Ollama app (download, documentation, and a list of supported models are available at https://ollama.ai). Ollama has a good command line interface and also runs a REST service that the examples in this chapter use.
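
The LangChain examples later in this chapter talk to that REST service through LangChain's Ollama wrapper, but the service can also be called directly. Here is a minimal sketch (not part of the book's example code) that posts a prompt to the /api/generate endpoint using the Python requests library; it assumes that ollama serve is running on the default port 11434 and that the mistral:instruct model has been pulled:

# Minimal sketch: call the Ollama REST service directly (assumes "ollama serve"
# is running on the default port 11434 and mistral:instruct has been pulled).
import requests

payload = {
    "model": "mistral:instruct",   # any model shown by "ollama list" should work
    "prompt": "how much is 1 + 2?",
    "stream": False,               # ask for a single JSON response, not a stream
}

r = requests.post("http://localhost:11434/api/generate", json=payload, timeout=120)
r.raise_for_status()
print(r.json()["response"])        # the generated completion text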

Ollama works very well on Apple Silicon, on systems with an NVIDIA GPU, and on high-end CPU-only systems. My Mac has an M2 SoC with 32G of unified memory, which is suitable for running fairly large LLMs efficiently, but most of the examples here run fine with 16G of memory.

Most of this chapter consists of Python code examples that use Ollama to run local LLMs. However, the Ollama command line interface is also useful for interactive experiments. Another useful development technique is to write prompts in individual text files such as p1.txt, p2.txt, etc., and run a prompt (on macOS and Linux) using:

$ ollama run llama3:instruct < p1.txt

After the response is printed, either stay in the Ollama REPL or type /bye to exit.

Simple Use of a Local Mistral Model Using LangChain

We now look at a simple example of asking questions and generating text completions with a local Mistral model. The Ollama support in LangChain requires that Ollama be running as a service on your laptop:

ollama serve

Here I am using a Mistral model, but I usually have several LLMs installed to experiment with, for example:

$ ollama list
NAME                       ID              SIZE      MODIFIED
phi4-reasoning:plus        f0ad3edce8e4    11 GB     4 days ago
qwen3:8b                   e4b5fd7f8af0    5.2 GB    6 days ago
qwen3:30b                  2ee832bc15b5    18 GB     6 days ago
gemma3:12b-it-qat          5d4fa005e7bb    8.9 GB    2 weeks ago
gemma3:4b-it-qat           d01ad0579247    4.0 GB    2 weeks ago
gemma3:27b-it-qat          29eb0b9aeda3    18 GB     2 weeks ago
openthinker:latest         4e61774f7d1c    4.7 GB    3 weeks ago
deepcoder:latest           12bdda054d23    9.0 GB    3 weeks ago
llava:7b                   8dd30f6b0cb1    4.7 GB    3 weeks ago
qwq:latest                 38ee5094e51e    19 GB     3 weeks ago
llama3.2:1b                baf6a787fdff    1.3 GB    3 weeks ago
mistral-small:latest       8039dd90c113    14 GB     3 weeks ago
granite3-dense:latest      5c2e6f3112f4    1.6 GB    3 weeks ago
phi4-mini:latest           78fad5d182a7    2.5 GB    3 weeks ago
reader-lm:latest           33da2b9e0afe    934 MB    3 weeks ago
smollm2:latest             cef4a1e09247    1.8 GB    3 weeks ago
unsloth_AZ_1B:latest       0b3006d8395a    807 MB    3 weeks ago
nomic-embed-text:latest    0a109f422b47    274 MB    3 weeks ago
cogito:32b                 0b4aab772f57    19 GB     3 weeks ago
deepseek-r1:32b            38056bbcbb2d    19 GB     3 weeks ago
deepseek-r1:8b             28f8fd6cdc67    4.9 GB    3 weeks ago
llama3.2:latest            a80c4f17acd5    2.0 GB    3 weeks ago
qwen2.5:14b                7cdf5a0187d5    9.0 GB    3 weeks ago
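
Note that this list changes over time and may not include the Mistral instruct model used in the following examples. If a model is missing, pull it first; for example (model names and tags change, so check the model library on the Ollama site):

$ ollama pull mistral:instruct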

Here is the file ollama_langchain/test.py:

# requires "ollama serve" to be running in a terminal

from langchain.llms import Ollama

llm = Ollama(
    model="mistral:instruct",  # any model shown by "ollama list" can be used here
    verbose=False,
)

s = llm("how much is 1 + 2?")
print(s)

s = llm("If Sam is 27, Mary is 42, and Jerry is 33, what are their age differences?")
print(s)

Here is the output:

$ python test.py
1 + 2 = 3.

To calculate their age differences, we simply subtract the younger person's age from the older person's age. Here are the calculations:
- Sam is 27 years old, and Mary is 42 years old, so their age difference is 42 - 27 = 15 years.
- Mary is 42 years old, and Jerry is 33 years old, so their age difference is 42 - 33 = 9 years.
- Jerry is 33 years old, and Sam is 27 years old, so their age difference is 33 - 27 = 6 years.
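
Depending on your installed LangChain version, calling the Ollama LLM object directly as in the listing above may print a deprecation warning. A minimal sketch of the newer Runnable-style call, assuming a recent LangChain release:

# newer LangChain releases prefer invoke() over calling the LLM object directly
s = llm.invoke("how much is 1 + 2?")
print(s)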

Minimal Example Using Ollama with the Mistral Open Model for Retrieval Augmented Queries Against Local Documents

The following listing of the file ollama_langchain/rag_test.py demonstrates creating a persistent embeddings datastore and then reusing it. In production, this example would be split into two separate Python scripts (a sketch of the second, query-only script follows the main listing below):

  • Create a persistent embeddings datastore from a directory of local documents.
  • Open a persisted embeddings datastore and use it for queries against local documents.

Creating a local persistent embeddings datastore for the example text files in ../data/*.txt takes about 90 seconds on my Mac mini.

# requires "ollama serve" to be running in another terminal

from langchain.llms import Ollama
from langchain.embeddings.ollama import OllamaEmbeddings
from langchain.chains import RetrievalQA

from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders.directory import DirectoryLoader

# Create the index (this can be reused later):

loader = DirectoryLoader('../data', glob='**/*.txt')

data = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100)
all_splits = text_splitter.split_documents(data)

persist_directory = 'cache'

vectorstore = Chroma.from_documents(
    documents=all_splits,
    embedding=OllamaEmbeddings(model="mistral:instruct"),
    persist_directory=persist_directory)

vectorstore.persist()

# Try reloading the index from disk and using it for search:

persist_directory = 'cache'

vectorstore = Chroma(
    persist_directory=persist_directory,
    embedding_function=OllamaEmbeddings(model="mistral:instruct")
)

llm = Ollama(base_url="http://localhost:11434",
             model="mistral:instruct",
             verbose=False,
             )

retriever = vectorstore.as_retriever()

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type='stuff',
    retriever=retriever,
    verbose=True,
)

while True:
    query = input("Ask a question: ")
    response = qa_chain(query)
    print(response['result'])
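
As mentioned before the listing, a production version would separate indexing from querying. A minimal sketch of the second, query-only script, assuming the cache directory was already created by the indexing code above (the file name rag_query.py is hypothetical):

# rag_query.py (hypothetical file name): reuse an existing Chroma index.
# Requires "ollama serve" to be running and the 'cache' directory already
# created by the indexing code in rag_test.py above.

from langchain.llms import Ollama
from langchain.embeddings.ollama import OllamaEmbeddings
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma

vectorstore = Chroma(
    persist_directory='cache',
    embedding_function=OllamaEmbeddings(model="mistral:instruct")
)

llm = Ollama(base_url="http://localhost:11434", model="mistral:instruct")

qa_chain = RetrievalQA.from_chain_type(
    llm=llm, chain_type='stuff',
    retriever=vectorstore.as_retriever())

print(qa_chain("Who says that economics is bullshit?")['result'])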

Here is an example run of the rag_test.py script. The first question uses the innate knowledge contained in the Mistral-7B model, while the second question uses the text files in the directory ../data as local documents. The test input file economics.txt has been edited to add the name of a fictional economist; I added this data to show that the second question is answered from the local document store.

$ python rag_test.py
> Entering new RetrievalQA chain...

> Finished chain.

11 + 2 = 13
Ask a question: Who says that economics is bullshit?


> Entering new RetrievalQA chain...

> Finished chain.

Pauli Blendergast, an economist who teaches at the University of Krampton Ohio, is known for saying that economics is bullshit.
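
To verify which chunks from ../data were used to answer a question, you can either query the retriever directly or build the chain so that it also returns its source documents. This is a sketch using the same LangChain APIs as the listing above:

# inspect the document chunks the retriever returns for a query
docs = retriever.get_relevant_documents("Who says that economics is bullshit?")
for doc in docs:
    print(doc.metadata.get("source"), doc.page_content[:80])

# or build the chain so its response also contains the source documents
qa_chain = RetrievalQA.from_chain_type(
    llm=llm, chain_type='stuff', retriever=retriever,
    return_source_documents=True)
response = qa_chain("Who says that economics is bullshit?")
for doc in response["source_documents"]:
    print(doc.metadata.get("source"))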

Wrap Up for Running Local LLMs Using Ollama

As I update this chapter, most of my personal LLM experiments involve running models locally on my Mac mini (or sometimes in Google Colab), even though the models available through the OpenAI, Anthropic, and other commercial APIs are more capable. I find that the Ollama project is currently the easiest and most convenient way to run local models as REST services or embedded in Python scripts, as in the two examples in this chapter.