Preface

Using AI models does not always mean using hyperscalers like Google and OpenAI. This book covers use cases where smaller models that meet system requirements can save money, can run locally for privacy and security, give you more control over your systems, and help preserve the environment.

Dear reader, I first used neural network technology in the 1980s (serving on a DARPA advisory panel for neural network (NN) tools, commercial software, and NN hardware products), worked with deep learning models from 2014 through 2019 (I managed a deep learning team at Capital One), and since 2020 have primarily used LLMs for application development and research.

As I write this in January 2026, many people I interact with insist on using the best, and usually most expensive, available AI models. If this, dear reader, is your philosophy, then I hope to at least slightly change your mind in this book. Personally, I prefer strong models like Google Gemini 3 Pro or OpenAI's GPT-5.2 only for research and learning; for practical engineering projects I prefer defining acceptable performance metrics for embedded LLMs and using the fastest and least expensive options that meet those targets.

Choosing Which “Small AI” Tools to Use in This Book

Mostly due to my own preferences, I have chosen a very narrow range of tools for use in this book:

  • Open weight/open source Large Language Models (LLMs) run locally using Ollama (see the sketch after this list).
  • OPTIONAL: running open weight/open source LLMs on Ollama Cloud (free for low volume use, otherwise $20/month).
  • Google’s very low cost (or free for limited monthly use) gemini-3-flash API.
  • For Python examples: the uv tool and various libraries that we will install as we need them.
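
To make the local-model option concrete, here is a minimal sketch that chats with a small model through Ollama's official Python library (installable with `uv pip install ollama`). The model name gemma3:1b is an assumption; substitute any model you have already pulled.

```python
# Minimal sketch: chat with a small local model via the ollama
# Python library. Assumes the Ollama server is running and that a
# model has been pulled, e.g.:  ollama pull gemma3:1b
# (the model name is an assumption; use any model you have installed)

import ollama

response = ollama.chat(
    model="gemma3:1b",
    messages=[{"role": "user", "content": "In one sentence, what is an LLM?"}],
)
print(response["message"]["content"])
```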

I have left several vendors like Anthropic and OpenAI off the table in my tool selection. I asked Gemini to estimate energy use per 1 million input tokens for commercial LLM APIs (some models are more ‘token efficient’, so take the following as a very rough proxy for environmental cost):

Model Name           Energy Use (per 1M tokens)
Gemini 3 Flash       ~0.6 kWh
Gemini 3 Pro         ~1.2 kWh
GLM-4.7              ~2.2 kWh
Qwen-3-Coder-480B    ~2.5 kWh
Claude 4 Opus        ~3.5 kWh
OpenAI GPT-5         ~5.0–8.0 kWh
OpenAI o3            ~33.0 kWh

Gemini 3 Flash is an excellent all-around model and I will use it in this book. Later in this book we will look at the litellm library, which allows programs to switch easily between models and model providers; a brief preview follows.
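
As a preview, here is a hedged sketch of that idea: the same litellm completion() call can target a local Ollama model or a hosted Gemini model, with only the model string changing. The exact model identifiers are assumptions; check litellm's provider documentation for current names.

```python
# Sketch of provider switching with litellm: the same completion()
# call targets a local Ollama model and a hosted Gemini model.
# (Model identifiers are assumptions; consult the litellm docs.)

from litellm import completion

messages = [{"role": "user", "content": "Why use small LLMs?"}]

# Local model served by Ollama:
local = completion(model="ollama/gemma3:1b", messages=messages)

# Hosted Gemini model (expects GEMINI_API_KEY in the environment):
hosted = completion(model="gemini/gemini-3-flash", messages=messages)

# litellm normalizes responses to the OpenAI format:
print(local.choices[0].message.content)
print(hosted.choices[0].message.content)
```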

Energy use is much lower running very small models with Ollama on a local Mac, Windows, or Linux laptop. Here are some estimates for running on a MacBook M4:

Model Name     Energy Use (per 1M tokens)
Gemma 3 1B     ~0.02 kWh
Qwen 3 1.7B    ~0.04 kWh
Qwen 3 4B      ~0.08 kWh

We will use several small Google Gemma models in examples, as well as occasionally using Alibaba’s Qwen models. For a sense of scale, a modern LED light bulb draws about 0.009 kW and a classic incandescent bulb about 0.06 kW, so generating a million tokens with Gemma 3 1B uses roughly as much energy as running an LED bulb for two hours.

Obviously, large commercial models are more capable, but smaller models can also be very effective. A recurring theme in this book is selecting models that are adequate for the task at hand.

Resources

  • I have written separate books on using Ollama and on LM Studio (another tool for running models on a personal computer) that you can read online for free at https://leanpub.com/ollama and https://leanpub.com/LMstudio by clicking the links labeled Free To Read Online.
  • You will want a Google Gemini API key. The free tier is fine for casual use; get a key at Google AI Studio (https://aistudio.google.com) by clicking Get API Key in the left menu panel. A short sketch for checking that your key works follows this list.
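
Once you have a key, a minimal sketch like the following can verify that it works. It assumes Google's google-genai Python package (`uv pip install google-genai`) and the key exported as the GEMINI_API_KEY environment variable; the model identifier follows this book's naming and may differ from what AI Studio lists.

```python
# Minimal sketch for verifying a Gemini API key. Assumes the
# google-genai package and the key in the GEMINI_API_KEY environment
# variable. The model identifier is an assumption; check the model
# list in Google AI Studio.

from google import genai

client = genai.Client()  # picks up GEMINI_API_KEY from the environment
response = client.models.generate_content(
    model="gemini-3-flash",
    contents="Reply with the single word: OK",
)
print(response.text)
```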