The model you depend on lives on someone else's hardware. They can change the price, change the rules, or retire it entirely, and you cannot stop them.
In August 2025, OpenAI retired GPT-4o overnight. In June 2026, an export directive forced Anthropic to suspend Fable 5 and Mythos 5. Teams lost models they had built on in an afternoon.
They just did not own what they ran.
Local AI Engineering with Ollama is how you stop renting and start owning. You take the model, the price, and the rules back into your own hands: run any model you want, when you want, where you want, and change how it behaves without a meter running.
This is a practical book for developers who can run a command and edit a file but have no Machine Learning degree and want none. It skips the marketing and jumps into building things that run, on hardware you already own, with the network unplugged. Every command was executed on a real machine, and every output you see (JSON responses, error messages, token counts, training logs) came from an actual session, not from documentation. Where Ollama behaved differently than its own docs, the book says so and pins the version.
This book moves in one direction: from running your first model to shipping an agent that runs on your own hardware. Each chapter ends with something working, and each skill below builds on the one before it. By the end you will be able to:
- Understand what a model is actually doing: You will learn how text becomes tokens, how tokens become predictions, and what weights, embeddings, attention, and the KV cache really are. Just enough to make decisions, with every concept tied to a setting you will later change.
- Install Ollama and size your hardware honestly: You will learn to install the runtime, tell whether a model fits in your RAM or VRAM before you download it, and read the tradeoffs between parameter count, quantization, and speed so you stop pulling models you will delete an hour later.
- Pick, pull, and manage models: You will learn to read the Ollama library and Hugging Face GGUF repos, choose the right quantization (Q4_K_M, Q5_K_M, Q8_0, and the rest), and manage what is on disk and in memory with list, show, ps, stop, copy, and remove.
- Drive Ollama from its API: You will move past the CLI and talk to Ollama the way your apps will, over HTTP, so anything you build (a script, a backend, an agent) can run models without a human typing commands. You will also learn to read tokens-per-second straight off the API so you can compare models and hardware on numbers, not vibes.
- Control the context window: You will take control of how much your model remembers in a single conversation, so you can stop a model from silently forgetting the start of a long chat and start sizing the context window deliberately for the job at hand. You will also learn to see exactly what gets sent to the model on each turn, which is the difference between guessing why a model misbehaves and knowing.
- Operate a model under real conditions: You will learn to tune behavior at runtime with temperature, top_p, top_k, penalties, and seed, control how long models stay loaded with keep-alive, and set concurrency so one model can serve parallel requests without falling over.
- Package a custom model with a Modelfile: You will turn a general-purpose model into a customized one that does a specific job the same way every time, then ship it as a single named artifact a teammate can pull and run with zero setup.
- Fine-tune a model on your own data: You will learn when prompting stops being enough and training begins, then fine-tune Granite to turn plain English into SQL using QLoRA with Unsloth, understand SFT versus preference tuning, and export the result to GGUF to run it in Ollama.
- Build against the Python SDK: You will stop parsing raw JSON by hand and start building real Python programs against Ollama, with typed responses your editor can autocomplete and your code can trust, ending with a small CLI that does the everyday model-management jobs from inside your own tooling.
- Build a working chat loop and see why it forgets: You will write a REPL that sends one message and prints one reply, then watch it fail to recall the previous turn, the concrete proof that the model itself holds no state.
- Give the conversation a memory: You will keep a running message list and resend it every turn, so the assistant can follow a multi-turn conversation within a session.
- Stream replies and accept multi-line input: You will print tokens the moment they arrive instead of waiting for the full reply, and take pasted, multi-line prompts without breaking the loop.
- Keep long chats inside the context window: You will build chats that keep working past the point where they normally break, dropping the oldest turns on your terms so the prompt never overflows the context window and the model never silently forgets where it started.
- Summarize old turns instead of dropping them: You will replace hard trimming with a second model that condenses earlier messages, wired in through LangChain's summarization middleware, so a long conversation keeps its gist instead of its raw length.
- Cache replies in Redis: You will return repeated questions instantly from a cache, cutting both latency and the compute you spend regenerating the same answer.
- Add long-term memory that survives restarts: You will wire in mem0 so the assistant recalls facts about a user across separate sessions, not just within the current one, and handle the background writes cleanly on exit.
- Give the model tools to fetch live data: You will add function calling so the model can invoke your Python functions for things it cannot know, like the current weather or air quality, and guard it with a prompt that makes it admit ignorance instead of inventing numbers when a tool fails.
- Source those tools from an external MCP server: You will swap your hand-written tools for ones served over MCP, so the same agent gains capabilities you did not write and do not have to maintain, and you will see why the M times N integration problem becomes M plus N.
- Put a graphical interface in front of Ollama: You will stand up Open WebUI in Docker against a local or remote Ollama, pull models and chat with your own documents from the browser, and lock it down with the admin approval gate that turns a personal install into something you can safely hand to a team.
The through line is the build. You do not just learn what Ollama does; you leave with a model you customized, a model you trained, and an agent you assembled, all of it running on hardware you own.
If you can run a command and edit a file, you are qualified! Downloadable code included.
So what are you waiting for to stop renting, start owning, and get a model running tonight?