Local AI Engineering with Ollama

Run, understand, customize, fine-tune, and build agentic apps on your own hardware

This book is 100% complete. Last updated on 2026-06-19.

Pull a model onto a machine you own, shape it with a Modelfile, fine-tune your own adapter, and build a chat app that calls tools and talks to an MCP server, all running on your own hardware. By the end, you'll know exactly where owning your AI beats renting it, and where it doesn't.

Minimum price

$28.99

$31.99

Also available for 1 book credit with a Reader Membership

PDF
EPUB
WEB
APP

About the Book

The model you depend on lives on someone else's hardware. They can change the price, change the rules, or retire it entirely, and you cannot stop them.

In August 2025, OpenAI retired GPT-4o overnight. In June 2026, an export directive forced Anthropic to suspend Fable 5 and Mythos 5. In a single afternoon, teams lost the models they had built on.

They just did not own what they ran.

Local AI Engineering with Ollama is how you stop renting and start owning. You take the model, the price, and the rules back into your own hands: run any model you want, when you want, where you want, and change how it behaves without a meter running.

This is a practical book for developers who can run a command and edit a file but have no Machine Learning degree and want none. It skips the marketing and jumps into building things that run, on hardware you already own, with the network unplugged. Every command was executed on a real machine, and every output you see (JSON responses, error messages, token counts, training logs) came from an actual session, not from documentation. Where Ollama behaved differently than its own docs, the book says so and pins the version.

This book moves in one direction: from running your first model to shipping an agent that runs on your own hardware. Each chapter ends with something working, and each skill below builds on the one before it. By the end you will be able to:

  • Understand what a model is actually doing: You will learn how text becomes tokens, how tokens become predictions, and what weights, embeddings, attention, and the KV cache really are. Just enough to make decisions, with every concept tied to a setting you will later change.
  • Install Ollama and size your hardware honestly: You will learn to install the runtime, tell whether a model fits in your RAM or VRAM before you download it, and read the tradeoffs between parameter count, quantization, and speed so you stop pulling models you will delete an hour later.
  • Pick, pull, and manage models: You will learn to read the Ollama library and Hugging Face GGUF repos, choose the right quantization (Q4_K_M, Q5_K_M, Q8_0, and the rest), and manage what is on disk and in memory with list, show, ps, stop, copy, and remove.
  • Drive Ollama from its API: You will move past the CLI and talk to Ollama the way your apps will, over HTTP, so anything you build (a script, a backend, an agent) can run models without a human typing commands. You will also learn to read tokens-per-second straight off the API so you can compare models and hardware on numbers, not vibes.
  • Control the context window: You will take control of how much your model remembers in a single conversation, so you can stop a model from silently forgetting the start of a long chat and start sizing the context window deliberately for the job at hand. You will also learn to see exactly what gets sent to the model on each turn, which is the difference between guessing why a model misbehaves and knowing.
  • Operate a model under real conditions: You will learn to tune behavior at runtime with temperature, top_p, top_k, penalties, and seed, control how long models stay loaded with keep-alive, and set concurrency so one model can serve parallel requests without falling over.
  • Package a custom model with a Modelfile: You will turn a general-purpose model into a customized one that does a specific job the same way every time, then ship it as a single named artifact a teammate can pull and run with zero setup.
  • Fine-tune a model on your own data: You will learn when prompting stops being enough and training begins, then fine-tune Granite to turn plain English into SQL using QLoRA with Unsloth, understand SFT versus preference tuning, and export the result to GGUF to run it in Ollama.
  • Build against the Python SDK: You will stop parsing raw JSON by hand and start building real Python programs against Ollama, with typed responses your editor can autocomplete and your code can trust, ending with a small CLI that does the everyday model-management jobs from inside your own tooling.
  • Build a working chat loop and see why it forgets: You will write a REPL that sends one message and prints one reply, then watch it fail to recall the previous turn, the concrete proof that the model itself holds no state.
  • Give the conversation a memory: You will keep a running message list and resend it every turn, so the assistant can follow a multi-turn conversation within a session.
  • Stream replies and accept multi-line input: You will print tokens the moment they arrive instead of waiting for the full reply, and take pasted, multi-line prompts without breaking the loop.
  • Keep long chats inside the context window: You will build chats that keep working past the point where they normally break, dropping the oldest turns on your terms so the prompt never overflows the context window and the model never silently forgets where it started.
  • Summarize old turns instead of dropping them: You will replace hard trimming with a second model that condenses earlier messages, wired in through LangChain's summarization middleware, so a long conversation keeps its gist instead of its raw length.
  • Cache replies in Redis: You will return repeated questions instantly from a cache, cutting both latency and the compute you spend regenerating the same answer.
  • Add long-term memory that survives restarts: You will wire in mem0 so the assistant recalls facts about a user across separate sessions, not just within the current one, and handle the background writes cleanly on exit.
  • Give the model tools to fetch live data: You will add function calling so the model can invoke your Python functions for things it cannot know, like the current weather or air quality, and guard it with a prompt that makes it admit ignorance instead of inventing numbers when a tool fails.
  • Source those tools from an external MCP server: You will swap your hand-written tools for ones served over MCP, so the same agent gains capabilities you did not write and do not have to maintain, and you will see why the M times N integration problem becomes M plus N.
  • Put a graphical interface in front of Ollama: You will stand up Open WebUI in Docker against a local or remote Ollama, pull models and chat with your own documents from the browser, and lock it down with the admin approval gate that turns a personal install into something you can safely hand to a team.
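As a taste of the "numbers, not vibes" skill in the list above, here is a minimal sketch of reading generation speed off an Ollama response. It assumes the final response body of /api/generate or /api/chat carries eval_count (tokens generated) and eval_duration (time spent generating, in nanoseconds). The helper name tokens_per_second is hypothetical, and the sample values are illustrative, not measured on any machine.

```python
def tokens_per_second(response: dict) -> float:
    """Return generation speed from a completed Ollama response body.

    Assumes the body contains:
      eval_count    - number of tokens the model generated
      eval_duration - time spent generating them, in nanoseconds
    """
    eval_count = response["eval_count"]
    eval_duration_ns = response["eval_duration"]
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    # Convert nanoseconds to seconds before dividing.
    return eval_count / (eval_duration_ns / 1e9)


# Illustrative values only: 120 tokens generated in 4 seconds.
sample = {"eval_count": 120, "eval_duration": 4_000_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/s")
```

Running the same prompt against two models or two machines and comparing this one number is usually enough for a first, rough comparison.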

The through line is the build. You do not just learn what Ollama does; you leave with a model you customized, a model you trained, and an agent you assembled, all of it running on hardware you own.

If you can run a command and edit a file, you are qualified! Downloadable code included.


So what are you waiting for? Stop renting, start owning, and get a model running tonight.

About the Author

Aymen El Amri

Aymen El Amri is an author, entrepreneur, trainer, and polymath software engineer who has excelled in a range of roles and responsibilities in the field of technology, including DevOps & Cloud Native, Cloud Architecture, Python, NLP, Data Science, and more.

Aymen has trained hundreds of software engineers and written multiple books and courses read by thousands of other developers and software engineers.

Aymen El Amri has a practical approach to teaching, breaking down complex concepts into easy-to-understand language and providing real-world examples that resonate with his audience.

Some projects he founded are FAUN.dev(), eralabs.io, and Marketto. You can find Aymen on Twitter and LinkedIn.

The Leanpub Podcast

Episode 88

An Interview with Aymen El Amri

Table of Contents

Local AI Engineering with Ollama

  1. Why This Book Exists
  2. What You Will Learn
  3. Who Is This Book For?
  4. About the Author

How to Get the Most Out of This Book

  1. What This Book Asks of You
  2. How to Read This So It Sticks
  3. The Companion Kit
  4. Recommended Environment
  5. Conventions
  6. Heredoc
  7. Callout Markers

What’s the Point of Local AI?

  1. Should You Run AI Locally, or Just Use an API?
  2. Why People Run Local Even When the API Is Cheaper
  3. A Sensible Default

What Is Ollama?

  1. What It Actually Solves
  2. Why Local, and Why Now
  3. What Ollama Is, and What It Is Not

Core Concepts: From Tokens and Embeddings to Quantization and KV Cache

  1. What Is a Token?
  2. Embeddings
  3. What Is a Neural Network?
  4. What Is a Weight?
  5. Why This Matters in Practice
  6. Training vs Inference
  7. What “Open Weights” Means and Why You Should Care
  8. What Weights Are Not
  9. What Are Inference Parameters?
  10. Temperature
  11. Top P
  12. Top K
  13. Presence and Frequency Penalties
  14. Seed
  15. What Is GGUF and Why Does It Exist?
  16. Tokenizer
  17. Model’s Settings
  18. Generation Defaults
  19. Chat Template
  20. One File, Everything Inside
  21. Quantization: Trade Precision You Don’t Need for Memory You Do
  22. Transformer Models
  23. The KV Cache

Requirements and Setup

  1. Installing Ollama
  2. Notes on Hardware Support

Picking and Pulling Models

  1. Understanding What You Can Run on Your System
  2. Pulling Models
  3. Understanding How Models Are Stored
  4. Where to Find Models
  5. Ollama’s Official Library
  6. Hugging Face GGUF Repos
  7. Your Own GGUF Files via a Modelfile
  8. Reading Ollama’s Model Library
  9. Anatomy of a Model Entry
  10. Capability Tags
  11. The Full Tag (What Comes after the Colon)
  12. Categories of Models on the Library

Running Models and Understanding How They Work inside Ollama

  1. Running a Model
  2. One-Shot Mode
  3. Running Models with the Ollama API
  4. Prerequisites
  5. /api/pull: Download a Model
  6. /api/generate: Single-Shot Completion
  7. /api/chat: Multi-Turn Conversations
  8. Holding State
  9. The Context Window
  10. Images
  11. generate vs chat: When to Use Which
  12. Ollama Conversation Flow

The Context Window

  1. How a Model “Remembers” with Ollama
  2. What Actually Goes to the Model
  3. The Context Window: num_ctx
  4. Silent Truncation: The Trap

Controlling and Tuning Model Behavior at Runtime

  1. The /set Command
  2. Using the API to Control the Model
  3. The /save and /load Commands
  4. Erasing History with /clear

Working with the Model Library

  1. Understanding What’s Loaded
  2. Inspecting Models
  3. Using the CLI
  4. Using the API
  5. Listing Saved and Loaded Models
  6. Stopping a Loaded Model
  7. Removing a Model
  8. Copying a Model

Keep-Alive and Memory Control

  1. Why Keep Models Loaded at All
  2. Setting Keep-Alive Globally
  3. Setting Keep-Alive per Request
  4. Forcing an Unload Right Now
  5. Multiple Models in Memory at Once
  6. Picking a Keep-Alive That Makes Sense

Concurrency: Parallel Requests and the Queue

  1. How Many Requests One Model Handles at Once
  2. Choosing the Right Number of Parallel Slots
  3. How Many Waiting Requests Are Tolerated
  4. How They Interact

Building, Running, and Sharing Custom Models for Ollama (Modelfile)

  1. Step 1: Put the Model under Your Own Name
  2. Step 2: Give It One Job with SYSTEM
  3. Step 3: Control the Output with PARAMETER
  4. Step 4: Stop the Chatter with a Stop Sequence
  5. Step 5: Teach the Format by Example with MESSAGE
  6. Step 6: See and Pin the Prompt with TEMPLATE
  7. Step 7: Package It and Read It Back
  8. Sharing Your Model

Creating a Fine-Tuned Model (English to SQL)

  1. Why Fine-Tune
  2. How Fine-Tuning Works
  3. LoRA (Low-Rank Adaptation)
  4. QLoRA (Quantized Low-Rank Adaptation)
  5. What You Teach: Correct Answers vs Preferences
  6. The Dataset
  7. Base Models vs Instruct Models
  8. Fine-Tuning the Model
  9. Step 1: Clone the Repo and Install Unsloth
  10. Step 2: Load the Base Model in 4-Bit
  11. Where the Model Comes From
  12. Which Model We Use
  13. Step 3: Attach the LoRA Adapter
  14. Step 4: Load the Dataset
  15. Step 5: Format Each Row into Granite’s Chat Format
  16. Step 6: Set Up the Trainer
  17. Step 7: Train Only on the Answers
  18. Step 8: Train
  19. Step 9: Save the Adapter
  20. Step 10: Run the Script
  21. Understanding What Happened
  22. Fine-Tuning in the Browser with Unsloth Studio
  23. Install Studio
  24. Start the Server
  25. Open It in Your Browser

Running Your Fine-Tuned Model in Ollama

  1. Step 1: Export to GGUF
  2. Step 2: Create the Model in Ollama
  3. Step 3: Run and Test the Fine-Tuned Model

Building a Management CLI for Ollama Using the SDK

  1. Setup and Requirements
  2. Configure Host and Model with .env
  3. First Call
  4. Building a Management CLI for Ollama

Building Advanced Agents: Introduction

  1. Pass 1: A Bare-Minimum Chat Loop against a Local Ollama Model
  2. Step 1: Build the Client
  3. Step 2: The REPL Loop
  4. Step 3: Send One Message, Print One Reply
  5. Step 4: Running the REPL

Building Advanced Agents: Conversation History

  1. Pass 2: Keeping a Conversation History
  2. Step 1: Create a List to Hold the Conversation
  3. Step 2: Append the User’s Message before Sending
  4. Step 3: Send the Whole History, Not Just the Last Message
  5. Step 4: Append the Model’s Reply Too
  6. Step 5: Run the REPL

Building Advanced Agents: Streaming and Multi-Line Input

  1. Pass 3: Stream the Reply Token-by-Token and Accept Multi-Line Input
  2. Step 1: Read Multi-Line Input
  3. Step 2: Handle the Empty Submission in the Main Loop
  4. Step 3: Raise the Client Timeout
  5. Step 4: Ask the API to Stream
  6. Step 5: Save the Reassembled Reply to History
  7. Step 6: Run the REPL

Building Advanced Agents: Long Conversations

  1. Pass 4: Trim the History so It Never Outgrows the Model’s Context Window
  2. Step 1: Define a Size Budget
  3. Step 2: Write the Trimming Function
  4. Step 3: Call Trimming at the Right Moment
  5. Step 4: Run the REPL

Building Advanced Agents: Summarization with LangChain

  1. Pass 5: Swap Hard Trimming for an Automatic Summary of Older Messages
  2. Why LangChain
  3. Step 1: Build the Chat Model with ChatOllama
  4. Step 2: Build a Second Model for Summarizing
  5. Step 3: Wrap the Model in an Agent with Summarization Middleware
  6. Step 4: Use LangChain Message Classes for History and Stream the Reply
  7. Step 5: Run the REPL
  8. Cost Notes

Building Advanced Agents: Caching

  1. Pass 6: Cache Model Replies in Redis so Repeated Questions Come Back Instantly
  2. Step 1: Start a Redis Server
  3. Step 2: Add the Redis Settings to config.py
  4. Step 3: Turn On the Global LLM Cache
  5. Step 4: Heads-Up on Streaming Behavior
  6. Step 5: Run the REPL

Building Advanced Agents: Long-Term Memory with mem0

  1. Pass 7: Give the Chat a Long-Term Memory That Survives Restarts
  2. Step 1: Add the Memory Settings
  3. Step 2: Build the Memory Object
  4. Step 3: Identify the User at Startup
  5. Step 4: Look Up Relevant Facts before Answering
  6. Step 5: Write to Memory in the Background
  7. Step 6: Inject the Memory Block at the Top of Each Turn
  8. Step 7: Save the Turn (Skip Empty Replies)
  9. Step 8: Wait for Pending Writes on /bye
  10. Step 9: Run the REPL

Building Advanced Agents: Function-Calling

  1. Pass 8: Let the Model Call Python Functions (“Tools”) to Fetch Live Data
  2. The Shape of a Tool Call, End to End
  3. Step 1: Write a Private Helper for Geocoding
  4. Step 2: Define Real Tools with @tool
  5. Step 3: Add an Anti-Hallucination System Prompt
  6. Step 4: Wire Tools and a Retry Middleware into the Agent
  7. Step 5: Prepend the Anti-Hallucination Message
  8. Step 6: Run the REPL

Building Advanced Agents: Integrating MCP Servers

  1. LangChain and MCP
  2. Pass 9: Get Tools from an External MCP Server Instead of Writing Them In-Process
  3. Step 1: Start the MCP Server
  4. Step 2: Add the MCP Settings to config.py
  5. Step 3: Connect to the Server and Fetch Tools
  6. Step 4: Make main Async
  7. Step 5: input() from inside Async Code
  8. Step 6: Wire the Network-Fetched Tools into the Agent
  9. Step 7: Prepend TOOL_GUIDANCE to Each Turn’s Messages
  10. Step 8: Use the Async Stream
  11. Step 9: Run the REPL

User-Friendly Interfaces for Ollama

  1. Local Chat UIs for Ollama
  2. Ollama App (Built-In)
  3. Open WebUI
  4. LibreChat
  5. AnythingLLM
  6. LobeChat
  7. Jan
  8. Hollama
  9. Comparison
  10. Installing and Using Open WebUI
  11. Installation
  12. Running Open WebUI on a Separate Machine
  13. The Features You Will Actually Use
  14. Common Failures

Afterword: Where to Go from Here

  1. What’s Next?
  2. Keep Going
  3. Your Feedback Matters
