Ollama is an open-source framework that enables users to run
large language models (LLMs) locally on their computers, facilitating tasks like text summarization, chatbot development, and more. It supports various models, including Llama 3, Mistral, and Gemma, and offers flexibility in model sizes and quantization options to balance performance and resource usage. Ollama provides a command-line interface and an HTTP API for seamless integration into applications, making advanced AI capabilities accessible without relying on cloud services. Ollama is available on macOS,
Linux, and Windows.
A main theme of this book is the advantage of running models privately on either your personal computer or a computer at work.
While many commercial LLM API vendors offer options to not use your prompt data and the output generated from your prompts to train their systems,
there is no better privacy and security than running open weight
models on your own hardware.
This book is about running Large Language Models (LLMs)
on your own hardware using Ollama. We will be using both the Ollama Python SDK library’s native support for passing text and images to LLMs as well as Ollama’s OpenAI API compatibility layer that lets you take any of the projects you may already run using OpenAI’s APIs and port them easily to run locally on Ollama.
To be clear, dear reader, although I have a strong preference for running smaller LLMs on my own hardware, I also frequently use commercial LLM API vendors like Anthropic,
OpenAI, ABACUS.AI, GROQ, and Google to take advantage of features like advanced models and scalability using cloud-based hardware.
About the Author
I am an AI practitioner and consultant specializing in large language models, LangChain/Llama-Index integrations, deep learning, and the semantic web. I have authored over 20 books on topics including artificial intelligence, Python, Common Lisp, deep learning, Haskell, Clojure, Java, Ruby, the Hy language, and the semantic web. I have 55 U.S. patents. Please check out my home page and social media: my personal web site https://markwatson.com, X/Twitter, my blog on Blogspot, and my blog on Substack.
Running local models using tools like Ollama can enhance privacy when dealing with sensitive data. Let’s delve into why privacy is crucial and how Ollama contributes to improved security.
Why is privacy important?
Privacy is paramount for several reasons:
Protection from Data Breaches: When data is processed by third-party services, it becomes vulnerable to potential data breaches. Storing and processing data locally minimizes this risk significantly. This is especially critical for sensitive information like personal details, financial records, or proprietary business data.
Compliance with Regulations: Many industries are subject to stringent data privacy regulations, such as GDPR, HIPAA, and CCPA. Running models locally can help organizations maintain compliance by ensuring data remains under their control.
Maintaining Confidentiality: For certain applications, like handling legal documents or medical records, maintaining confidentiality is of utmost importance. Local processing ensures that sensitive data isn’t exposed to external parties.
Data Ownership and Control: Individuals and organizations have a right to control their own data. Local models empower users to maintain ownership and make informed decisions about how their data is used and shared.
Preventing Misuse: By keeping data local, you reduce the risk of it being misused by third parties for unintended purposes, such as targeted advertising, profiling, or even malicious activities.
Security Improvements with Ollama
Ollama, as a tool for running large language models (LLMs) locally, offers several security advantages:
Data Stays Local: Ollama allows you to run models on your own hardware, meaning your data never leaves your local environment. This eliminates the need to send data to external servers for processing.
Reduced Attack Surface: By avoiding external communication for model inference, you significantly reduce the potential attack surface for malicious actors. There’s no need to worry about vulnerabilities in third-party APIs or network security.
Control over Model Access: With Ollama, you have complete control over who has access to your models and data. This is crucial for preventing unauthorized access and ensuring data security.
Transparency and Auditability: Running models locally provides greater transparency into the processing pipeline. You can monitor and audit the model’s behavior more easily, ensuring it operates as intended.
Customization and Flexibility: Ollama allows you to customize your local environment and security settings according to your specific needs. This level of control is often not possible with cloud-based solutions.
It’s important to note that while Ollama enhances privacy and security, it’s still crucial to follow general security best practices for your local environment. This includes keeping your operating system and software updated, using strong passwords, and implementing appropriate firewall rules.
Setting Up Your Computing Environment for Using Ollama and the Book Example Programs
There is a GitHub repository that I have prepared for you, dear reader, both to support working through the examples in this book and, hopefully, to provide utilities for your own projects.
You need to git clone the following repository:
https://github.com/mark-watson/OllamaExamples that contains tools I have written in Python that you can use with Ollama as well as utilities I wrote to avoid repeated code in the book examples. There are also application level example files that have the string “example” in the file names. Tool library files begin with “tool” and files starting with “Agent” contain one of several approaches to writing Agents.
Python Build Tools
The requirements.txt file contains the library requirements for all code developed in this book. My preference is to use venv and maintain a separate Python environment for each of the few hundred Python projects I have on my laptop. I keep a personal directory ~/bin on my PATH and I use the following script venv_setup.sh in the ~/bin directory to use a requirements.txt file to set up a virtual environment:
I sometimes like to use the much faster uv build and package management tool:
There are many other good options like Anaconda, miniconda, poetry, etc.
Using Ollama From the Command Line
Working with Ollama from the command line provides a straightforward and efficient way to interact with large language models locally. The basic command structure starts with ollama run modelname, where modelname could be a model like 'llama3', 'mistral', or 'codellama'. You can pass a prompt directly as a command line argument, pipe in a file using shell redirection, and add the --verbose flag to see token usage and generation metrics. For example, ollama run llama2 "Your question here" answers a single question and exits, which is convenient for scripting.
One powerful technique is using Ollama’s model tags to maintain different versions or configurations of the same base model. For any model on the Ollama web site, you can view all available model tags, for example: https://ollama.com/library/llama2/tags.
The ollama list command helps you track installed models, and ollama rm modelname keeps your system clean. For development work, the --format json flag outputs responses in JSON format, making it easier to parse in scripts or applications; for example:
Using JSON Format
Analysis of Images
Advanced users can leverage Ollama’s multimodal capabilities and streaming options. For models like llava, you can pipe in image files using standard input or file paths. For example:
While I only cover command line use in this one short chapter, I use Ollama in command line mode several hours a week for software development, usually using a Qwen coding LLM:
I find that the qwen2.5-coder:14b model performs well for my most often used programming languages: Python, Common Lisp, Racket Scheme, and Haskell.
I also enjoy experimenting with the QwQ reasoning model even though it is so large it barely runs on my 32G M2 Pro system:
Analysis of Source Code Files
Here, assuming we are in the main directory of the GitHub repository for this book, we can ask for an analysis of the tool for using SQLite databases (most output is not shown):
Unfortunately, when using the command ollama run qwen2.5-coder:14b < tool_sqlite.py, Ollama processes the input from the file and then exits the REPL. There’s no built-in way to stay in the Ollama REPL. However, if you want to analyze code and then interactively chat about the code, ask for code modifications, etc., you can try:
Start Ollama:
Paste the source code of tool_sqlite.py into the Ollama REPL
Ask for advice, for example: “Please add code to print out the number of input and output tokens that are used by Ollama when calling function_caller.process_request(query)”
Short Examples
Here we look at a few short examples before moving on to the libraries we develop later in the book and to longer, application-style example programs that use Ollama to solve more difficult problems.
Using The Ollama Python SDK with Image and Text Prompts
We saw an example of image processing in the last chapter using Ollama command line mode. Here we do the same thing using a short Python script that you can find in the file short_programs/Ollama_sdk_image_example.py:
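The repository has the actual listing; as a rough sketch of the idea (not the author's exact file, and the image path here is hypothetical), passing an image and a text prompt to a multimodal model with the Ollama Python SDK can look like this:

```python
# Hypothetical sketch, not the repository file: send an image plus a text
# prompt to a multimodal model using the Ollama Python SDK.
import ollama

response = ollama.chat(
    model="llava:7b",  # any multimodal model you have pulled locally
    messages=[
        {
            "role": "user",
            "content": "Describe what you see in this image.",
            "images": ["data/sample_image.jpg"],  # hypothetical image path
        }
    ],
)
print(response["message"]["content"])
```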
The output may look like the following when you run this example:
Using the OpenAI Compatibility APIs with Local Models Running on Ollama
If you frequently use the OpenAI APIs for either your own LLM projects or work projects, you might want to simply use the same SDK library from OpenAI but specify a local Ollama REST endpoint:
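The following is only a minimal sketch of the idea (the book's repository contains the full example); it assumes the OpenAI Python SDK pointed at Ollama's OpenAI compatible endpoint on the default port:

```python
# Minimal sketch: use the OpenAI Python SDK against a local Ollama server.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",  # required by the SDK but ignored by Ollama
)

completion = client.chat.completions.create(
    model="llama3.2:latest",  # any model you have pulled locally
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Explain what a GGUF file is in two sentences."},
    ],
)
print(completion.choices[0].message.content)
```

The only changes from typical OpenAI client code are the base_url and the placeholder API key, which is why porting existing OpenAI projects to local models is usually straightforward.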
The output might look like (following listing is edited for brevity):
In the next chapter we start developing tools that can be used for “function calling” with Ollama.
LLM Tool Calling with Ollama
There are several example Python tool utilities in the GitHub repository https://github.com/mark-watson/OllamaExamples that we will use for function calling that start with the “tool” prefix:
We postpone using the tools tool_llm_eval.py and tool_judge_results.py until the next chapter, Automatic Evaluation of LLM Results.
If you have not done so yet, please clone the repository for my Ollama book examples using:
Use of Python docstrings at runtime:
The Ollama Python SDK leverages docstrings as a crucial part of its runtime function calling mechanism. When defining functions that will be called by the LLM, the docstrings serve as structured metadata that gets parsed and converted into a JSON schema format. This schema describes the function’s parameters, their types, and expected behavior, which is then used by the model to understand how to properly invoke the function. The docstrings follow a specific format that includes parameter descriptions, type hints, and return value specifications, allowing the SDK to automatically generate the necessary function signatures that the LLM can understand and work with.
During runtime execution, when the LLM determines it needs to call a function, it first reads these docstring-derived schemas to understand the function’s interface. The SDK parses these docstrings using Python’s introspection capabilities (through the inspect module) and matches the LLM’s intended function call with the appropriate implementation. This system allows for a clean separation between the function’s implementation and its interface description, while maintaining human-readable documentation that serves both as API documentation and runtime function calling specifications. The docstring parsing is done lazily at runtime when the function is first accessed, and the resulting schema is typically cached to improve performance in subsequent calls.
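As an illustration of the docstring format (this is a hypothetical tool, not one from the repository), a function intended for Ollama tool calling might be written like this:

```python
# Hypothetical tool function: the docstring supplies the metadata that the
# Ollama Python SDK converts into a JSON schema for the model.
def get_word_count(file_path: str, ignore_blank_lines: bool = True) -> str:
    """
    Count the number of words in a text file.

    Args:
        file_path: Path to the text file to analyze.
        ignore_blank_lines: If True, blank lines are skipped before counting.

    Returns:
        A human-readable string describing the word count.
    """
    with open(file_path, "r", encoding="utf-8") as f:
        lines = f.readlines()
    if ignore_blank_lines:
        lines = [line for line in lines if line.strip()]
    count = sum(len(line.split()) for line in lines)
    return f"The file {file_path} contains {count} words."
```

In recent versions of the Ollama Python SDK, passing such a function in the tools list of ollama.chat is enough for the SDK to derive the schema from the type hints and the Args section of the docstring.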
Example Showing the Use of Tools Developed Later in this Chapter
The source file ollama_tools_examples.py contains simple examples of using the tools we develop later in this chapter. We will look at example code using the tools, then at the implementation of the tools. In this example source file we first import these tools:
This code demonstrates the integration of a local LLM with custom tool functions for file system operations and web content processing. It imports three utility functions for listing directories, reading file contents, and converting URLs to markdown, then maps them to a dictionary for easy access.
The main execution flow involves sending a user prompt to the Ollama hosted model (here we are using the small IBM “granite3-dense” model), which requests directory listing, file reading, and URL conversion operations. The code then processes the model’s response by iterating through any tool calls returned, executing the corresponding functions, and printing their results. Error handling is included for cases where requested functions aren’t found in the available tools dictionary.
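The repository file contains the full example; the overall shape of the call-and-dispatch loop, sketched from the description above (the prompt wording is illustrative, and the module name for the web tool is a guess), is roughly:

```python
# Rough sketch of the flow in ollama_tools_examples.py (not the exact listing).
import ollama
from tool_file_dir import list_directory            # repository module
from tool_file_contents import read_file_contents   # repository module
from tool_web_search import uri_to_markdown         # hypothetical module name

available_functions = {
    "list_directory": list_directory,
    "read_file_contents": read_file_contents,
    "uri_to_markdown": uri_to_markdown,
}

user_prompt = (
    "List the files in the current directory, read requirements.txt, "
    "and convert https://markwatson.com to markdown."
)

response = ollama.chat(
    model="granite3-dense",
    messages=[{"role": "user", "content": user_prompt}],
    tools=list(available_functions.values()),  # recent SDK versions accept Python functions
)

for tool_call in response.message.tool_calls or []:
    fn = available_functions.get(tool_call.function.name)
    if fn is None:
        print(f"Function {tool_call.function.name} not found")
        continue
    result = fn(**tool_call.function.arguments)
    print(f"{tool_call.function.name} ->\n{result}\n")
```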
Here is sample output from using these three tools (most output removed for brevity and blank lines added for clarity):
Please note that the text extracted from a web page is mostly plain text. Section heads are maintained but the format is changed to markdown format. In the last (edited for brevity) listing, the HTML H1 element with the text Fun Stuff is converted to markdown:
You have now looked at example tool use. We will now implement several tools in this chapter and the next. We will look at the first tool for reading and writing files in fine detail and then more briefly discuss the other tools in the https://github.com/mark-watson/OllamaExamples repository.
Tool for Reading and Writing File Contents
This tool is meant to be combined with other tools, for example a summarization tool and a file reading tool might be used to process a user prompt to summarize a specific local file on your laptop.
Here is the contents of tool utility tool_file_contents.py:
read_file_contents
This function provides file reading capabilities with robust error handling with parameters:
file_path (str): Path to the file to read
encoding (str, optional): File encoding (defaults to “utf-8”)
Features:
Uses pathlib.Path for cross-platform path handling
Checks file existence before attempting to read
Returns file contents with descriptive message
Comprehensive error handling
LLM Integration:
Includes metadata for function discovery
Returns descriptive string responses instead of raising exceptions
write_file_contents
This function handles file writing operations with built-in safety features. The parameters are:
file_path (str): Path to the file to write
content (str): Content to write to the file
encoding (str, optional): File encoding (defaults to “utf-8”)
mode (str, optional): Write mode (‘w’ for write, ‘a’ for append)
Features:
Automatically creates parent directories
Supports write and append modes
Uses context managers for safe file handling
Returns operation status messages
LLM Integration:
Includes detailed metadata for function calling
Provides clear feedback about operations
Common Features of both functions:
Type hints for better code clarity
Detailed docstrings that are used at runtime by the tool/function calling code. The text in the docstrings is supplied as context to the LLM currently in use.
Proper error handling
UTF-8 default encoding
Context managers for file operations
Metadata for LLM function discovery
Design Benefits for LLM Integration: the utilities are optimized for LLM function calling by:
Returning descriptive string responses
Including metadata for function discovery
Handling errors gracefully
Providing clear operation feedback
Using consistent parameter patterns
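The actual implementation is in tool_file_contents.py in the repository; the following is only a condensed, hypothetical sketch of the two functions as described above, with the LLM metadata omitted and error messages worded differently:

```python
# Condensed sketch of read_file_contents / write_file_contents as described
# above; see tool_file_contents.py in the repository for the real code.
from pathlib import Path

def read_file_contents(file_path: str, encoding: str = "utf-8") -> str:
    """
    Read and return the contents of a text file.

    Args:
        file_path: Path to the file to read.
        encoding: File encoding (defaults to "utf-8").

    Returns:
        A descriptive string containing the file contents or an error message.
    """
    path = Path(file_path)
    if not path.exists():
        return f"Error: file {file_path} does not exist"
    try:
        contents = path.read_text(encoding=encoding)
        return f"Contents of {file_path}:\n{contents}"
    except Exception as e:  # return a message rather than raising, for LLM use
        return f"Error reading {file_path}: {e}"

def write_file_contents(file_path: str, content: str,
                        encoding: str = "utf-8", mode: str = "w") -> str:
    """
    Write (or append) content to a text file, creating parent directories.

    Args:
        file_path: Path to the file to write.
        content: Content to write to the file.
        encoding: File encoding (defaults to "utf-8").
        mode: "w" to overwrite or "a" to append.

    Returns:
        A status message describing the result of the operation.
    """
    path = Path(file_path)
    try:
        path.parent.mkdir(parents=True, exist_ok=True)
        with open(path, mode, encoding=encoding) as f:
            f.write(content)
        return f"Successfully wrote {len(content)} characters to {file_path}"
    except Exception as e:
        return f"Error writing {file_path}: {e}"
```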
Tool for Getting File Directory Contents
This tool is similar to the last tool so here we just list the worker function from the file tool_file_dir.py:
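A minimal, hypothetical sketch of such a worker function follows; the repository version differs in details such as the exact parameters and return format:

```python
# Minimal sketch of a directory-listing tool; see tool_file_dir.py for the
# repository implementation.
import os

def list_directory(path: str = ".", show_hidden: bool = False) -> str:
    """
    List the files and subdirectories in a directory.

    Args:
        path: Directory to list (defaults to the current directory).
        show_hidden: If True, include entries starting with a dot.

    Returns:
        A descriptive string listing the directory contents.
    """
    try:
        entries = sorted(os.listdir(path))
        if not show_hidden:
            entries = [e for e in entries if not e.startswith(".")]
        return f"Contents of {path}: " + ", ".join(entries)
    except Exception as e:
        return f"Error listing {path}: {e}"
```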
Tool for Accessing SQLite Databases Using Natural Language Queries
The example file tool_sqlite.py serves two purposes here:
Test and example code: utility function _create_sample_data creates several database tables and the function main serves as an example program.
The Python class definitions SQLiteTool and OllamaFunctionCaller are meant to be copied and used in your applications.
This code provides a natural language interface for interacting with an SQLite database. It uses a combination of Python classes, SQLite, and Ollama for running a language model to interpret user queries and execute corresponding database operations. Below is a breakdown of the code:
Database Setup and Error Handling: a custom exception class, DatabaseError, is defined to handle database-specific errors. The database is initialized with three tables: example, users, and products. These tables are populated with sample data for demonstration purposes.
SQLiteTool Class: the SQLiteTool class is a singleton that manages all SQLite database operations. Key features include:
Singleton Pattern: Ensures only one instance of the class is created.
Database Initialization: Creates tables (example, users, products) if they do not already exist.
Sample Data: Populates the tables with predefined sample data.
Context Manager: Safely manages database connections using a context manager.
Utility Methods:
get_tables: Retrieves a list of all tables in the database.
get_table_schema: Retrieves the schema of a specific table.
execute_query: Executes a given SQL query and returns the results.
Sample Data Creation:
A helper function, _create_sample_data, is used to populate the database with sample data. It inserts records into the example, users, and products tables. This ensures the database has some initial data for testing and demonstration.
OllamaFunctionCaller Class:
The OllamaFunctionCaller class acts as the interface between natural language queries and database operations. Key components include:
Integration with Ollama LLM: Uses the Ollama language model to interpret natural language queries.
Function Definitions: Defines two main functions:
query_database: Executes SQL queries on the database.
list_tables: Lists all tables in the database.
Prompt Generation: Converts user input into a structured prompt for the language model.
Response Parsing: Parses the language model’s response into a JSON object that specifies the function to call and its parameters.
Request Processing: Executes the appropriate database operation based on the parsed response.
Function Definitions:
The OllamaFunctionCaller class defines two main functions that can be called based on user input:
query_database: Executes a SQL query provided by the user and returns the results of the query.
list_tables: Lists all tables in the database and is useful for understanding the database structure.
Request Processing Workflow:
The process_request method handles the entire workflow of processing a user query:
Input: Takes a natural language query from the user.
Prompt Generation: Converts the query into a structured prompt for the Ollama language model.
Response Parsing: Parses the language model’s response into a JSON object.
Function Execution: Calls the appropriate function (query_database or list_tables) based on the parsed response.
Output: Returns the results of the database operation.
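The full class is in tool_sqlite.py; stripped of the supporting code, the heart of process_request follows the shape sketched below. The prompt wording, model choice, and JSON parsing here are illustrative, not the repository code:

```python
# Illustrative sketch of the prompt / parse / dispatch cycle described above.
# The real OllamaFunctionCaller in tool_sqlite.py is more complete.
import json
import ollama

def process_request(user_query: str, sqlite_tool) -> str:
    prompt = (
        "You translate user requests into database operations.\n"
        "Respond with JSON only, in the form:\n"
        '{"function": "query_database", "parameters": {"sql": "..."}}\n'
        'or {"function": "list_tables", "parameters": {}}\n\n'
        f"User request: {user_query}"
    )
    response = ollama.chat(
        model="llama3.2:latest",   # assumed model choice
        messages=[{"role": "user", "content": prompt}],
    )
    try:
        call = json.loads(response.message.content)
    except json.JSONDecodeError:
        return "Could not parse the model's response as JSON"

    if call["function"] == "list_tables":
        return str(sqlite_tool.get_tables())
    elif call["function"] == "query_database":
        return str(sqlite_tool.execute_query(call["parameters"]["sql"]))
    return f"Unknown function: {call['function']}"
```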
Main test/example function:
The main function demonstrates how the system works with sample queries. It initializes the OllamaFunctionCaller and processes a list of example queries, such as:
“Show me all tables in the database.”
“Get all users from the users table.”
“What are the top 5 products by price?”
For each query, the system interprets the natural language input, executes the corresponding database operation, and prints the results.
Summary:
This code creates a natural language interface for interacting with an SQLite database. It works as follows:
Database Management: The SQLiteTool class handles all database operations, including initialization, querying, and schema inspection.
Natural Language Processing: The OllamaFunctionCaller uses the Ollama language model to interpret user queries and map them to database functions.
Execution: The system executes the appropriate database operation and returns the results to the user.
This approach allows users to interact with the database using natural language instead of writing SQL queries directly, making it more user-friendly and accessible.
The output looks like this:
Tool for Summarizing Text
Tools that are used by LLMs can themselves also use other LLMs. The tool defined in the file tool_summarize_text.py might be triggered by a user prompt such as “summarize the text in local file test1.txt” or “summarize text from web page https://markwatson.com”, where it is used together with other tools for reading a local file’s contents, fetching a web page, etc.
We will start by looking at the file tool_summarize_text.py and then look at an example in example_chain_web_summary.py.
This Python code implements a text summarization tool using the Ollama chat model. The core function summarize_text takes two parameters: the main text to summarize and an optional context string. The function operates by constructing a prompt that instructs the model to provide a concise summary without additional commentary. It includes logic where, if the input text is very short (less than 50 characters), it defaults to using the context parameter instead. Additionally, if there is substantial context provided (more than 50 characters), it prepends this context to the prompt. The function utilizes the Ollama chat model “llama3.2:latest” to generate the summary, structuring the request with a system message containing the prompt and a user message containing the text to be summarized. The program includes metadata for Ollama integration, specifying the function name, description, and parameter details, and exports the summarize_text function through __all__.
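A condensed sketch of the function just described follows; the repository file contains the complete version along with its Ollama metadata and __all__ export:

```python
# Condensed sketch of summarize_text as described above; the repository
# version in tool_summarize_text.py also defines Ollama metadata.
import ollama

def summarize_text(text: str, context: str = "") -> str:
    prompt = "Provide a concise summary of the text. Do not add commentary."
    if len(text) < 50 and context:
        text = context          # fall back to the context for very short inputs
    elif len(context) > 50:
        prompt = f"Context: {context}\n\n{prompt}"
    response = ollama.chat(
        model="llama3.2:latest",
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": text},
        ],
    )
    return response.message.content
```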
Here is an example of using this tool that you can find in the file example_chain_web_summary.py. Please note that this example also uses the web search tool that is discussed in the next section.
Here is the output edited for brevity:
Tool for Web Search and Fetching Web Pages
This code provides a set of functions for web searching and HTML content processing, with the main functions being uri_to_markdown, search_web, brave_search_summaries, and brave_search_text. The uri_to_markdown function fetches content from a given URI and converts HTML to markdown-style text, handling various edge cases and cleaning up the text by removing multiple blank lines and spaces while converting HTML entities. The search_web function is a placeholder that’s meant to be implemented with a preferred search API, while brave_search_summaries implements actual web searching using the Brave Search API, requiring an API key from the environment variables and returning structured results including titles, URLs, and descriptions. The brave_search_text function builds upon brave_search_summaries by fetching search results and then using uri_to_markdown to convert the content of each result URL to text, followed by summarizing the content using a separate summarize_text function. The code also includes utility functions like replace_html_tags_with_text which uses BeautifulSoup to strip HTML tags and return plain text, and includes proper error handling, logging, and type hints throughout. The module is designed to be integrated with Ollama and exports uri_to_markdown and search_web as its primary interfaces.
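As an illustration of the uri_to_markdown part of this module (a simplified sketch under my own assumptions, not the repository code), fetching a page and reducing it to markdown-style text might look like:

```python
# Simplified sketch of uri_to_markdown: fetch a page and reduce it to
# markdown-style text. The repository version handles more edge cases.
import re
import requests
from bs4 import BeautifulSoup

def uri_to_markdown(uri: str) -> str:
    response = requests.get(uri, timeout=20)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Convert H1/H2 headings to markdown headers, keep the rest as plain text.
    for level in (1, 2):
        for h in soup.find_all(f"h{level}"):
            h.replace_with(f"\n{'#' * level} {h.get_text(strip=True)}\n")

    text = soup.get_text(separator="\n")
    text = re.sub(r"\n{3,}", "\n\n", text)   # collapse multiple blank lines
    text = re.sub(r"[ \t]{2,}", " ", text)   # collapse runs of spaces
    return text.strip()
```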
Tools Wrap Up
We have looked at the implementations and example uses of several tools. In the next chapter we continue our study of tool use with the application of judging the accuracy of output generated by LLMs: basically, LLMs judging the accuracy of other LLMs to reduce hallucinations, inaccurate output, etc.
Automatic Evaluation of LLM Results: More Tool Examples
As Large Language Models (LLMs) become increasingly integrated into production systems and workflows, the ability to systematically evaluate their performance becomes crucial. While qualitative assessment of LLM outputs remains important, organizations need robust, quantitative methods to measure and compare model performance across different prompts, use cases, and deployment scenarios. This has led to the development of specialized tools and frameworks designed specifically for LLM evaluation.
The evaluation of LLM outputs presents unique challenges that set it apart from traditional natural language processing metrics. Unlike straightforward classification or translation tasks, LLM responses often require assessment across multiple dimensions, including factual accuracy, relevance, coherence, creativity, and adherence to specified formats or constraints. Furthermore, the stochastic nature of LLM outputs means that the same prompt can generate different responses across multiple runs, necessitating evaluation methods that can account for this variability.
Modern LLM evaluation tools address these challenges through a combination of automated metrics, human-in-the-loop validation, and specialized frameworks for prompt testing and response analysis. These tools can help developers and researchers understand how well their prompts perform, identify potential failure modes, and optimize prompt engineering strategies. By providing quantitative insights into LLM performance, these evaluation tools enable more informed decisions about model selection, prompt design, and system architecture in LLM-powered applications.
In this chapter we take a simple approach:
Capture the chat history including output for an interaction with a LLM.
Generate a prompt containing the chat history, model output, and a request to a different LLM to evaluate the output generated by the first LLM. We request that the final output of the second LLM be a score of ‘G’ or ‘B’ (good or bad) judging the accuracy of the first LLM’s output.
We look at several examples in this chapter of approaches you might want to experiment with.
Tool For Judging LLM Results
Here we implement our simple approach of using a second LLM to evaluate the output of the first LLM that generated a response to user input.
The following listing shows the tool tool_judge_results.py:
This Python code defines a function judge_results that takes an original prompt sent to a Large Language Model (LLM) and the generated response from the LLM, then attempts to judge the accuracy of the response.
Here’s a breakdown of the code:
The main function judge_results takes two parameters:
original_prompt: The initial prompt sent to an LLM
llm_gen_results: The output from the LLM that needs evaluation
The function judge_results returns a dictionary with two keys:
judgement: Single character (‘B’ for Bad, ‘G’ for Good, ‘E’ for Error)
reasoning: Detailed explanation of the judgment
The evaluation process is:
Creates a conversation with two messages:
System message: Sets the context for evaluation
User message: Combines the original prompt and results for evaluation
Uses the Qwen 2.5 Coder (14B parameter) model through Ollama
Expects a Y/N response at the end of the evaluation
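The listing itself is in tool_judge_results.py; as a hedged sketch of the evaluation call described above (the prompt wording and the mapping from the model's final Y/N answer to the 'G'/'B' judgement are illustrative):

```python
# Illustrative sketch of judge_results; see tool_judge_results.py for the
# actual prompt wording and response parsing.
import ollama

def judge_results(original_prompt: str, llm_gen_results: str) -> dict:
    messages = [
        {"role": "system",
         "content": "You judge whether an LLM response correctly answers a prompt. "
                    "Explain your reasoning, then end with a single Y or N."},
        {"role": "user",
         "content": f"Original prompt:\n{original_prompt}\n\n"
                    f"Generated response:\n{llm_gen_results}\n\n"
                    "Is the response accurate?"},
    ]
    try:
        response = ollama.chat(model="qwen2.5-coder:14b", messages=messages)
        reasoning = response.message.content.strip()
        verdict = reasoning.rstrip(". ").split()[-1].upper()
        judgement = "G" if verdict.startswith("Y") else "B"
        return {"judgement": judgement, "reasoning": reasoning}
    except Exception as e:
        return {"judgement": "E", "reasoning": str(e)}
```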
Sample output
Evaluating LLM Responses Given a Chat History
Here we try a different approach by asking the second “judge” LLM to evaluate the output of the first LLM based on specific criteria like “Response accuracy”, “Helpfulness”, etc.
The following listing shows the tool utility tool_llm_eval.py:
We will use these five evaluation criteria:
Response accuracy
Coherence and clarity
Helpfulness
Task completion
Natural conversation flow
The main function evaluate_llm_conversation uses these steps:
Receives chat history and optional parameters
Formats the conversation into a readable string
Creates a detailed evaluation prompt
Sends prompt to Ollama for evaluation
Cleans and parses the response
Returns structured evaluation results
Sample Output
A Tool for Detecting Hallucinations
Here we use a text template file templates/anti_hallucinations.txt to define the prompt template for checking a user input, a context, and the resulting output by another LLM (most of the file is not shown for brevity):
Here is the tool tool_anti_hallucination.py that uses this template:
This code implements a hallucination detection system for Large Language Models (LLMs) using the Ollama framework. The core functionality revolves around the detect_hallucination function, which takes three parameters: user input, context, and LLM output, and evaluates whether the output contains hallucinated content by utilizing another LLM (llama3.2) as a judge. The system reads a template from a file to structure the evaluation prompt.
The implementation includes type hints and error handling, particularly for JSON parsing of the response. The output is structured as a JSON object containing a hallucination score (between 0.0 and 1.0) and a list of reasoning points. The code also includes a test harness that demonstrates the system’s usage with a mathematical example, checking for accuracy in age difference calculations. The modular design allows for easy integration into larger systems through the explicit export of the detect_hallucination function.
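A rough sketch of detect_hallucination as described follows; the real prompt wording lives in the template file, and the placeholder names and JSON field names shown here are my assumptions:

```python
# Rough sketch of detect_hallucination; the repository version reads the full
# prompt template from templates/anti_hallucinations.txt.
import json
import ollama

def detect_hallucination(user_input: str, context: str, llm_output: str) -> dict:
    with open("templates/anti_hallucinations.txt", "r", encoding="utf-8") as f:
        template = f.read()
    # Assumes the template uses named placeholders like {user_input}.
    prompt = template.format(
        user_input=user_input, context=context, llm_output=llm_output
    )
    response = ollama.chat(
        model="llama3.2:latest",
        messages=[{"role": "user", "content": prompt}],
    )
    try:
        # Expected JSON shape: {"score": 0.0-1.0, "reasoning": ["...", ...]}
        return json.loads(response.message.content)
    except json.JSONDecodeError:
        return {"score": 1.0, "reasoning": ["Could not parse model output"]}
```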
The output looks something like this:
Wrap Up
Here we looked at several examples for using one LLM to rate the accuracy, usefulness, etc. of another LLM given an input prompt. There are two topics in this book that I spend most of my personal LLM research time on: automatic evaluation of LLM results, and tool using agents (the subject of the next chapter).
Building Agents with Ollama and the Hugging Face smolagents Library
We have seen a few useful examples of tool use (function calling) and now we will build on tool use to build both single agents and multi-agent systems. There are commercial and open source resources to build agents, CrewAI and LangGraph being popular choices. We will follow a different learning path here, preferring to use the smolagents library. Please bookmark https://github.com/huggingface/smolagents for reference while working through this chapter.
Each example program and utility for this chapter uses the prefix smolagents_ in the Python file name.
Note: We are using the 2 GB model Llama3.2:latest here. Different models support tools and agents differently.
Choosing Specific LLMs for Writing Agents
As agents perform tasks like interpreting user input, carrying out Chain of Thought (CoT) reasoning, observing the output from tool calls, and following plan steps one by one, LLM errors, hallucinations, and inconsistencies accumulate. When using Ollama we prefer the most powerful models that we can run on our hardware.
Here we use Llama3.2:latest, which is recognized for its function calling capabilities, facilitating seamless integration with various tools.
As you work through the examples here using different local models running on Ollama, you might encounter compounding error problems. When I am experimenting with ideas for implementing agents, I sometimes keep two versions of my code: one for a local model and one using either of the commercial models GPT-4o or Claude Sonnet 3.5. Comparing the same agent setup using different models might provide some insight into whether runtime agent problems are caused by your code or by the model you are using.
Installation notes
As I write this chapter on January 2, 2025, smolagents needs to be run with an older version of Python:
The first two lines of the requirements.txt file specify the smolagents specific requirements:
Overview of the Hugging Face smolagents Library
The smolagents library https://github.com/huggingface/smolagents is built around a minimalist and modular architecture that emphasizes simplicity and composability. The core components are cleanly separated into the file agents.py for agent definitions, tools.py for tool implementations, and related support files. This design philosophy allows developers to easily understand, extend, and customize the components while maintaining a small codebase footprint - true to the “smol” name.
This library implements a tools-first approach where capabilities are encapsulated as discrete tools that agents can use. The tools.py file in the smolagents implementation defines a clean interface for tools with input/output specifications, making it straightforward to add new tools. This tools-based architecture enables agents to have clear, well-defined capabilities while maintaining separation of concerns between the agent logic and the actual implementation of capabilities.
Agents are designed to be lightweight and focused on specific tasks rather than trying to be general-purpose. The BaseAgent class provides core functionality while specific agents like WebAgent extend it for particular use cases. This specialization allows the agents to be more efficient and reliable at their designated tasks rather than attempting to be jack-of-all-trades.
Overview for LLM Agents (optional section)
You might want to skip this section if you want to quickly work through the examples in this chapter and review this material later.
In general, we use the following steps to build agent based systems:
Define agents (e.g., Researcher, Writer, Editor, Judge outputs of other models and agents).
Assign tasks (e.g., research, summarize, write, double check the work of other agents).
Use an orchestration framework to manage task sequencing and collaboration.
Features of Agents:
Retrieval-Augmented Generation (RAG): Enhance agents’ knowledge by integrating external documents or databases.
Example: An agent that retrieves and summarizes medical research papers.
Memory Management: Enable agents to retain context across interactions.
Example: A chatbot that remembers user preferences over time.
Tool Integration: Equip agents with tools like web search, data scraping, or API calls.
Example: An agent that fetches real-time weather data and provides recommendations. We will use tools previously developed in this book.
Examples of Real-World Applications
Healthcare: Agents that analyze medical records and provide diagnostic suggestions.
Education: Virtual tutors that explain complex topics using Ollama’s local models.
Customer Support: Chatbots that handle inquiries without relying on cloud services.
Content Creation: Agents that generate articles, summaries, or marketing content.
Let’s Write Some Code
I am still experimenting with LLM-based agents. Please accept the following examples as my personal works in progress.
“Hello World” smolagents Example
Here we look at a simple example taken from the smolagents documentation and converted to run using local models with Ollama. Here is a listing of file smolagents_test.py:
Understanding the smolagents and Ollama Example
This code demonstrates a simple integration between smolagents (a tool-calling framework) and Ollama (a local LLM server). Here’s what the code accomplishes:
Core Components
Utilizes smolagents for creating AI agents with tool capabilities
Integrates with a local Ollama server running llama3.2
Implements a basic weather checking tool (though humorously hardcoded)
Model Configuration
The code sets up a LiteLLM model instance that connects to a local Ollama server on port 11434. It’s configured to use the llama3.2 model and supports optional API key authentication.
Weather Tool Implementation
The code defines a weather-checking tool using the @tool decorator. While it accepts a location parameter and an optional celsius flag, this example version playfully returns the same dramatic weather report regardless of the input location.
Agent Setup and Execution
The implementation creates a ToolCallingAgent with the weather tool and the configured model. Users can query the agent about weather conditions in any location, though in this example it always returns the same humorous response about terrible weather conditions.
Key Features
Demonstrates tool-calling capabilities through smolagents
Shows local LLM integration using Ollama
Includes proper type hinting for better code clarity
Provides an extensible structure for adding more tools
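The actual listing is in smolagents_test.py; the shape of the example, based on the smolagents documentation and the description above (the model id string, port, and weather message are illustrative), is roughly:

```python
# Rough sketch of smolagents_test.py, based on the smolagents docs example.
from smolagents import ToolCallingAgent, LiteLLMModel, tool

model = LiteLLMModel(
    model_id="ollama_chat/llama3.2",       # routed through LiteLLM to Ollama
    api_base="http://localhost:11434",      # local Ollama server
    api_key="not-needed",                   # optional; Ollama ignores it
)

@tool
def get_weather(location: str, celsius: bool = False) -> str:
    """
    Get the current weather at the given location.

    Args:
        location: The location to get the weather for.
        celsius: Whether to return the temperature in Celsius.
    """
    # Hardcoded joke response, as in the original example.
    return f"The weather in {location} is awful: freezing rain and gale-force wind."

agent = ToolCallingAgent(tools=[get_weather], model=model)
print(agent.run("What is the weather like in Paris?"))
```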
Python Tools Compatible with smolagents
The tools I developed in previous chapters are not quite compatible with the smolagents library so I wrap a few of the tools I previously wrote in the utility smolagents_tools.py:
This code defines a wrapper module containing three tool functions designed for compatibility with the smolagents framework. The module includes sa_list_directory(), which lists files and directories in the current working directory with an optional parameter to include dot files; read_file_contents(), which takes a file path as input and returns the contents of that file as a string while handling potential errors and file encoding; and summarize_directory(), which provides a concise summary of the current directory by counting the total number of files and directories. All functions are decorated with @tool for integration with smolagents, and the code imports necessary modules including pathlib for file operations, typing for type hints, and pprint for formatted output. The functions rely on an external list_directory() function imported from tool_file_dir.py, and they provide clear documentation through docstrings explaining their parameters, functionality, and return values. Error handling is implemented particularly in the file reading function to gracefully handle cases where files don’t exist or cannot be read properly.
A complete smolagents Example using Three Tools
This listing shows the script smolagents_agent_test.py:
This code demonstrates the creation of an AI agent using the smolagents library, specifically configured to work with file system operations. It imports three specialized tools from smolagents_tools: sa_list_directory for listing directory contents, summarize_directory for providing directory summaries, and read_file_contents for accessing file contents. The code sets up a LiteLLMModel instance that connects to a local Ollama server running the llama3.2 model on port 11434, with provisions for API key authentication if needed. A ToolCallingAgent is then created with these three file-system-related tools, enabling it to interact with and analyze the local file system. The agent is instructed to examine the current directory through a natural language query, asking for both a listing and description of the files present. There’s also a second section that would have asked the agent to specifically analyze Python programs in the directory and identify those related to LLM performance evaluation, showing the agent’s potential for more complex file analysis tasks. This setup effectively creates an AI-powered file system navigator that can understand and respond to natural language queries about directory contents and file analysis.
Output from the First Example: “List the Python programs in the current directory, and then tell me which Python programs in the current directory evaluate the performance of LLMs?”
In the following output please notice that sometimes tool use fails and occasionally wrong assumptions are made, but after a long chain of thought (CoT) process the final result is good.
The output for the query “Which python scripts evaluate the performance of LLMs?” is:
This is a lot of debug output to list in a book, but I want you, dear reader, to get a feeling for how the output generated by tools becomes the data that an agent observes before determining the next step in its plan.
This output shows the execution of the example smolagent-based agent that analyzes Python files in a directory looking for Python files containing code to evaluate the output results of LLMs. The agent follows a systematic approach by first listing all files using the sa_list_directory tool, then using sa_summarize_directory to provide detailed analysis of the contents.
The agent successfully identified all Python programs in the directory and specifically highlighted three files that evaluate LLM performance: tool_anti_hallucination.py (which checks for false information generation), tool_llm_eval.py (for general LLM evaluation), and tool_summarize_text.py (which likely tests LLM summarization capabilities). The execution includes detailed step-by-step logging, showing input/output tokens and duration for each step, demonstrating the agent’s methodical approach to file analysis and classification.
Output from the Second Example: “What are the files in the current directory? Describe the current directory”
In this section we look at another agent processing cycle. Again, pay attention to the output of tools, and whether the agent can observe tool output and make sense of it (often the agent can’t!)
It is fairly normal for tools to fail with errors and it is important that agents can observe a failure and move on to try something else.
This output shows the agent performing a directory analysis using multiple tool calls, primarily utilizing sa_list_directory and sa_summarize_directory to examine the contents of the current working directory. The analysis revealed a Python-based project focused on natural language processing (NLP) and agent-based systems, containing various components including example scripts, testing files, and utility tools. The agent executed multiple iterations to gather and process information about the directory structure, with each step taking between 1.58 and 18.89 seconds to complete.
The final analysis identified key project components including a Makefile for build automation, example scripts demonstrating text summarization and graph-based algorithms, testing scripts for smolagent (Small Model-based Language Agent) and OLLAMA tools, and various utility scripts for tasks like anti-hallucination, database interactions, and web searching. The directory structure suggests this is a development and testing environment for NLP-related technologies, complete with its own virtual environment and dependency management through requirements.txt. The agent’s analysis provided detailed insights into the purpose and organization of the codebase while maintaining a focus on its NLP and agent-based systems orientation.
Output from Third Example: “Read the text in the file ‘data/economics.txt’ file and then summarize this text.”
This output shows a sequence of steps where the agent repeatedly calls directory listing and summarization tools to understand the contents of a Python project directory. The agent uses tools like sa_list_directory and sa_summarize_directory to gather information, with each step building on previous observations to form a more complete understanding of the codebase.
Through multiple iterations, the agent analyzes a directory containing various Python files related to NLP and agent-based systems. The files include examples of text summarization, graph processing with Kuzu, language model evaluation tools, and various utility scripts. The agent ultimately produces a comprehensive summary categorizing the files into groups like build scripts, example code, testing scripts, and tool implementations, while noting the project appears to be focused on demonstrating and testing NLP-related technologies. This output log shows the agent taking about 75 seconds total across 6 steps to complete its analysis, with each step consuming progressively more tokens as it builds its understanding.
Agents Wrap Up
There are several options for LLM agent frameworks. I especially like smolagents because it works fairly well with smaller models run with Ollama. I have experimented with other agent frameworks that work well with Claude, GPT-4o, etc., but fail more frequently when used with smaller LLMs.
Using the Unsloth Library on Google Colab to Fine Tune Models for Ollama
This is a book about running local LLMs using Ollama. That said, I use a Mac M2 Pro with 32G of memory and while my computer could be used for fine tuning models, I prefer using cloud assets. I frequently use Google’s Colab for running deep learning and other experiments.
We will be using three Colab notebooks in this chapter:
Colab notebook 1: Colab URI for this chapter is a modified copy of an Unsloth demo notebook. Here we create simple training data to quickly verify the process of fine tuning on Colab using Unsloth and exporting to a local Ollama model on a laptop. We fine tune the 1B model unsloth/Llama-3.2-1B-Instruct.
Colab notebook 2: Colab URI uses my dataset on fun things to do in Arizona. We fine tune the model unsloth/Llama-3.2-1B-Instruct.
Colab notebook 3: Colab URI. This is identical to the example in Colab notebook 2 except that we fine tune the larger 3B model unsloth/Llama-3.2-3B-Instruct.
The Unsloth fine-tuning library is a Python-based toolkit designed to simplify and accelerate the process of fine-tuning large language models (LLMs). It offers a streamlined interface for applying popular techniques like LoRA (Low-Rank Adaptation), prefix-tuning, and full-model fine-tuning, catering to both novice and advanced users. The library integrates seamlessly with Hugging Face Transformers and other prominent model hubs, providing out-of-the-box support for many state-of-the-art pre-trained models. By focusing on ease of use, Unsloth reduces the boilerplate code needed for training workflows, allowing developers to focus on task-specific adaptation rather than low-level implementation details.
One of Unsloth’s standout features is its efficient resource utilization, enabling fine-tuning even on limited hardware such as single-GPU setups. It achieves this through parameter-efficient fine-tuning techniques and gradient checkpointing, which minimize memory overhead. Additionally, the library supports mixed-precision training, significantly reducing computational costs without compromising model performance. With robust logging and built-in tools for hyperparameter optimization, Unsloth empowers developers to achieve high-quality results with minimal experimentation. It is particularly well-suited for applications like text summarization, chatbots, and domain-specific language understanding tasks.
Colab Notebook 1: A Quick Test of Fine Tuning and Deployment to Ollama on a Laptop
We start by installing the Unsloth library and all dependencies, then uninstalling just the Unsloth library and reinstalling the latest version from source code on GitHub:
Now create a model and tokenizer:
Now add LoRA adapters:
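Those two notebook cells are not reproduced here; a sketch of both steps, using the standard Unsloth APIs from its demo notebooks (the hyperparameter values shown are typical defaults, not necessarily the ones in the notebook), looks like this:

```python
# Sketch of the model-creation and LoRA-adapter cells, following the standard
# Unsloth demo notebook API; hyperparameter values are typical defaults.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,          # 4-bit quantization to fit in Colab memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                        # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)
```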
The original Unsloth example notebook used Maxime Labonne’s FineTome-100k dataset for its fine tuning data. Since I wanted to fine tune with my own test data, I printed out some of Maxime Labonne’s data after it was loaded into a Dataset object. Here are a few snippets to show you, dear reader, the format of the data that I will reproduce:
I used a small Python script on my laptop to get the format correct for my test data:
Output is:
If you look at the notebook for this chapter on Colab you will see that I copied the last Python script as-is to the notebook, replacing code in the original Unsloth demo notebook.
The following code (copied from the Unsloth demo notebook) slightly reformats the prompts and then trains using the modified dataset:
The output is (edited for brevity and to remove a token warning):
The notebook has a few more tests:
The output is:
Warning on Limitations of this Example
We used very little training data and in the call to SFTTrainer we didn’t even train one epoch:
This allows us to fine tune a previously trained model very quickly for this short demo.
We will use much more training data in the next chapter to finetune a model to be an expert in recreational locations in the state of Arizona.
Save trained model and tokenizer to a GGUF File on the Colab Notebook’s File System
To experiment in the Colab Notebook Linux environment we can save the data locally:
To run this fine tuned model on our laptop, we export a GGUF file on the notebook’s file system that can then be downloaded:
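Unsloth provides a helper for GGUF export; a minimal sketch of the call (the quantization method shown is a common choice, not necessarily the one used in the notebook):

```python
# Export the fine-tuned model to GGUF so it can be imported into Ollama.
# q4_k_m is a common quantization choice; the notebook may use another.
model.save_pretrained_gguf("model", tokenizer, quantization_method="q4_k_m")
```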
In the demo notebook, you can see where the GGUF file was written:
Copying the GGUF File to Your Laptop and Creating an Ollama Modelfile
Depending on how fast your Internet speed is, it might take five or ten minutes to download the GGUF file since it is about 1G in size:
We also will need to copy the generated Ollama Modelfile (that the Unsloth library created for us):
The contents of the file are:
After downloading the GGUF file to my laptop, I made a slight edit to the generated Modelfile to set the path to the GGUF file on line 1:
Once the model is downloaded to your laptop, create a local Ollama model to use:
I can now use the model unsloth that was just created on my laptop:
Notice that the fine tuned model has learned the new data and still retains the functionality of the original model.
Fine Tuning Test Wrap Up
This was a short example that can be run on a free Google Colab notebook. Now we will use a larger fine tuning training data set.
Fine Tuning Using a Fun Things To Do in Arizona Data Set
I created a GitHub repository for the Arizona fine tuning data set that contains small individual JSON files and a larger file ArizonaFun.json that is a concatenation of the smaller files. Let’s look at az_flagstaff_parks.json (edited to remove some text for brevity):
There are a total of 40 fine tuning examples in the file ArizonaFun.json. You can see in the second and third Colab notebooks for this chapter I just pasted the JSON data from the file ArizonaFun.json into a cell:
Unfortunately, while the fine tuned model often performs well, it also sometimes hallucinates. Here is an example of using the fine tuned model in the Colab notebook:
The output is:
This answer is correct.
The second Colab notebook also contains code cells for downloading the fine tuned model and the directions for importing the model into Ollama that we saw earlier also apply here.
Third Colab Notebook That Fine Tunes a Larger Model
There are only two changes made to the second notebook:
We now fine tune a 3B model unsloth/Llama-3.2-3B-Instruct.
Because the fine tuned model is large, I added code to store the model in Google Drive:
I created an empty folder LLM on my Google Drive before running this code.
Fine Tuning Wrap Up
I don’t usually fine tune models. I usually use larger prompt contexts and include one shot or two shot examples. That said there are good use cases for fine tuning small models with your data and I hope the simple examples in this chapter will save you time if you have an application requiring fine tuning.
Reasoning with Large Language Models
The Chinese tech conglomerate Alibaba’s MarcoPolo Team released the advanced Marco-o1 model at the end of 2024.
This model is designed to excel in open-ended problem-solving and complex reasoning tasks, going beyond traditional AI models that focus on structured tasks like coding or math. For reference the repository for the model is https://github.com/AIDC-AI/Marco-o1. From the README in this repository: “Marco-o1 Large Language Model (LLM) is powered by Chain-of-Thought (CoT) fine-tuning, Monte Carlo Tree Search (MCTS), reflection mechanisms, and innovative reasoning strategies—optimized for complex real-world problem-solving tasks.”
A Simple Example
I very much enjoy experimenting with the Marco-o1 model in the Ollama REPL. Let’s start with a very simple prompt that most models can solve. Here, we want to see the structure of Marco-o1’s CoT (chain of thought) process:
We will look at a more difficult example later.
Key Features of Marco-o1
Here are some key characteristics of Marco-o1:
Advanced Reasoning Techniques: It utilizes Chain-of-Thought (CoT) fine-tuning and Monte Carlo Tree Search (MCTS) to enhance its reasoning capabilities. CoT allows the model to trace its thought patterns, making the problem-solving process more transparent. MCTS enables exploration of multiple reasoning paths by assigning confidence scores to different tokens. Reference: https://arxiv.org/html/2411.14405
Self-Reflection: A unique feature is its ability to self-reflect, evaluating its reasoning, identifying inaccuracies, and iterating on its outputs for improved results. This leads to higher accuracy and adaptability.
Multilingual Mastery: Marco-o1 excels in translation, handling cultural nuances, idiomatic expressions, and colloquialisms effectively. This makes it a powerful tool for global communication.
Focus on Open-Ended Problems: Unlike models focused on structured tasks with definitive answers, Marco-o1 tackles open-ended problems where clear evaluation metrics might be absent[1].
Strong Performance: It has shown significant improvements in reasoning and translation benchmarks, including increased accuracy on the MGSM dataset (both English and Chinese) and strong performance in machine translation tasks[1].
Open Source Datasets and Implementation: Alibaba has released Marco-o1’s datasets and implementation guides on GitHub, encouraging collaboration and further advancements in AI research.
A More Complex Example: City Traffic Planning
Let’s end this chapter with a more complex example:
I often use the state of the art commercial LLM APIs for models like Claude Sonnet 3.5, GPT-4o, o1, Grok-2, etc. to brainstorm ideas and help me think through and plan out new projects. I find it exciting to be able to run a close to state of the art reasoning LLM on my personal computer using Ollama!
Using Property Graph Database with Ollama
I have a long history of working with Knowledge Graphs (at Google and OliveAI) and I usually use RDF graph databases and the SPARQL query language. I have recently developed a preference for property graph databases because recent research has shown that using LLMs with RDF-based graphs runs into context size problems due to large schemas, overlapping relations, and complex identifiers that exceed LLM context windows. Property graph databases like Neo4J and Kuzu (which we use in this chapter) have more concise schemas.
It is true that Google and other players are teasing ‘infinite context’ LLMs but since this book is about running smaller models locally I have chosen to only show a property graph example.
Overview of Property Graphs
Property graphs represent a powerful and flexible data modeling paradigm that has gained significant traction in modern database systems and applications. At its core, a property graph is a directed graph structure where both vertices (nodes) and edges (relationships) can contain properties in the form of key-value pairs, providing rich contextual information about entities and their connections. Unlike traditional relational databases that rely on rigid table structures, property graphs offer a more natural way to represent highly connected data while maintaining the semantic meaning of relationships. This modeling approach is particularly valuable when dealing with complex networks of information where the relationships between entities are just as important as the entities themselves.
The distinguishing characteristics of property graphs make them especially well-suited for handling real-world data scenarios where relationships are multi-faceted and dynamic. Each node in a property graph can be labeled with one or more types (such as Person, Product, or Location) and can hold any number of properties that describe its attributes. Similarly, edges can be typed (like “KNOWS”, “PURCHASED”, or “LOCATED_IN”) and augmented with properties that qualify the relationship, such as timestamps, weights, or quality scores. This flexibility allows for sophisticated querying and analysis of data patterns that would be cumbersome or impossible to represent in traditional relational schemas. The property graph model has proven particularly valuable in domains such as social network analysis, recommendation systems, fraud detection, and knowledge graphs, where understanding the intricate web of relationships between entities is crucial for deriving meaningful insights.
Example Using Ollama, LangChain, and the Kuzu Property Graph Database
The example shown here is derived from an example in the LangChain documentation: https://python.langchain.com/docs/integrations/graphs/kuzu_db/. I modified the example to use a local model running on Ollama instead of the OpenAI APIs. Here is the file graph_kuzu_property_example.py:
This code demonstrates the implementation of a graph database using Kuzu, integrated with LangChain for question-answering capabilities. The code initializes a database connection and establishes a schema with two node types (Movie and Person) and a relationship type (ActedIn), creating a graph structure suitable for representing actors and their film appearances.
The implementation populates the database with specific data about “The Godfather” trilogy and two prominent actors (Al Pacino and Robert De Niro). It uses Cypher-like query syntax to create nodes for both movies and actors, then establishes relationships between them using the ActedIn relationship type. The data model represents a typical many-to-many relationship between actors and movies.
This example then sets up a question-answering chain using LangChain, which combines the Kuzu graph database with the Ollama language model (specifically the qwen2.5-coder:14b model). This chain enables natural language queries against the graph database, allowing users to ask questions about actor-movie relationships and receive responses based on the stored graph data. The implementation includes two example queries to demonstrate the system’s functionality.
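The repository file contains the full program; a compressed sketch of the same flow, adapted from the LangChain Kuzu integration documentation (exact import paths vary between LangChain versions, and recent versions may also require an explicit safety flag), looks like this:

```python
# Compressed sketch of graph_kuzu_property_example.py, adapted from the
# LangChain Kuzu integration docs; import paths vary by LangChain version.
import kuzu
from langchain_community.graphs import KuzuGraph
from langchain_community.chains.graph_qa.kuzu import KuzuQAChain
from langchain_ollama import ChatOllama

db = kuzu.Database("movies_db")
conn = kuzu.Connection(db)

# Schema: Movie and Person nodes, ActedIn relationship.
conn.execute("CREATE NODE TABLE Movie (name STRING, PRIMARY KEY (name))")
conn.execute("CREATE NODE TABLE Person (name STRING, PRIMARY KEY (name))")
conn.execute("CREATE REL TABLE ActedIn (FROM Person TO Movie)")

# A few sample rows for The Godfather trilogy data described above.
conn.execute("CREATE (:Person {name: 'Al Pacino'})")
conn.execute("CREATE (:Person {name: 'Robert De Niro'})")
conn.execute("CREATE (:Movie {name: 'The Godfather: Part II'})")
conn.execute("""
    MATCH (p:Person {name: 'Al Pacino'}), (m:Movie {name: 'The Godfather: Part II'})
    CREATE (p)-[:ActedIn]->(m)
""")

graph = KuzuGraph(db)
chain = KuzuQAChain.from_llm(
    llm=ChatOllama(model="qwen2.5-coder:14b", temperature=0.0),
    graph=graph,
    verbose=True,
    # newer LangChain versions may also require allow_dangerous_requests=True
)
print(chain.invoke("Who acted in The Godfather: Part II?"))
```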
Here is the output from this example:
The Cypher query language is commonly used in property graph databases. Here is a sample query:
This Cypher query performs a graph pattern matching operation to find actors who appeared in “The Godfather: Part II”. Let’s break it down:
MATCH initiates a pattern matching operation
(p:Person) looks for nodes labeled as “Person” and assigns them to variable p
-[:ActedIn]-> searches for “ActedIn” relationships pointing outward
(m:Movie) matches Movie nodes specifically with the name property equal to “The Godfather: Part II”
RETURN p.name returns only the name property of the matched Person nodes
Based on the previous code’s data, this query would return “Al Pacino” and “Robert De Niro” since they both acted in that specific film.
Using LLMs to Create Graph Databases from Text Data
Using Kuzu with local LLMs is simple to implement, as seen in the last section. If you use large property graph databases hosted with Kuzu or Neo4J, then the example in the last section is hopefully sufficient to get you started implementing natural language interfaces to property graph databases.
Now we will do something very different: use LLMs to generate data for property graphs, that is, to convert text into Python code that creates a Kuzu property graph database.
Specifically, we use the approach:
Use the last example file graph_kuzu_property_example.py as an example for Claude Sonnet 3.5 to understand the Kuzu Python APIs.
Have Claude Sonnet 3.5 read the file data/economics.txt and create a schema for a new graph database and populate the schema from the contents of the file data/economics.txt.
Ask Claude Sonnet 3.5 to also generate query examples.
Except for my adding the utility function query_and_print_result, this code was generated by Claude Sonnet 3.5:
How might you use this example? Using one or two shot prompting in LLM input prompts to specify data formats and other information, and then generating structured data or Python code, is a common implementation pattern for using LLMs.
Here, the “structured data” I asked an LLM to output was Python code.
I cheated in this example by using what is currently the best code generation LLM: Claude Sonnet 3.5. I also tried this same exercise using Ollama with the model qwen2.5-coder:14b and the results were not quite as good. This is a great segue into the final chapter, Book Wrap Up.
Book Wrap Up
Dear reader, I have been paid for “AI work” (for many interpretations of what that even means) since 1982. I certainly find LLMs to be the most exciting tool for moving the field of AI further and faster than anything else that I have used in the last 43 years.
I am also keenly interested in privacy and open source so I must admit a strong bias towards using open source software, open weight LLMs, and also systems and infrastructure like Ollama that enable me to control my own data. The content of this book is tailored to my own interests but I hope that I have, dear reader, covered many of your interests also.
In the last example in the previous chapter I “pulled a fast one” in that I didn’t use a local model running with Ollama. Instead I used what is the most powerful commercial LLM, Claude Sonnet 3.5, because it generates better code than any model that I can run on my Mac with 32G of unified memory using Ollama. In my work, I balance my personal desire for data privacy and control over the software and hardware I use with practical compromises like using state of the art models running on massive cloud compute resources.