What if you could turn any open-weights language model into a domain expert that knows your contracts, your databases, and your tools, and ship it without a research lab budget?
Inside Large Language Models, Volume II is the book that takes the foundation built in Volume I and turns it into a working production system. It is the book for the engineer who has stopped wondering how attention works and started wondering why their fine-tuning bill is bigger than their server bill, why a seven-billion-parameter model is the largest they can fit on their hardware, and how the production teams shipping LLMs to millions of users actually pull it off.
Volume II picks up where Volume I left off. The transformer is built. The model is trained. Now what? The next nine chapters answer that question end to end: how inference actually works token by token, how to align a model with human preferences using RLHF, how to fine-tune billion-parameter models on consumer hardware with LoRA and QLoRA, how to make production inference ten times cheaper, and how to build four real applied systems on real data.
Along the way you will:
See every production technique worked out the same way the math was in Volume I. When the book introduces the KV cache, you watch a concrete attention computation grow token by token and see exactly which tensors get cached and which get recomputed. When it introduces quantisation, you take a real weight matrix from FP32 down to INT4 and check that the dequantised version still produces sensible outputs. There is no "this is an industry standard," no "the framework handles it." You see the bytes.
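To give a taste of that level of detail, here is a toy sketch of symmetric per-tensor INT4 quantisation in plain Python. It is illustrative only, not the book's implementation: the function names are ours, and real schemes add per-group scales, zero-points, and bit-packing.

```python
# Toy symmetric INT4 quantisation: map floats to integers in [-8, 7]
# plus one scale factor, then recover approximate floats.

def quantise_int4(weights):
    """Quantise a list of floats to INT4 codes with a shared scale."""
    scale = max(abs(w) for w in weights) / 7.0  # 7 = largest INT4 magnitude
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantise(q, scale):
    """Recover approximate float weights from the INT4 codes."""
    return [qi * scale for qi in q]

weights = [0.42, -1.37, 0.08, 0.91, -0.55]
q, scale = quantise_int4(weights)
recovered = dequantise(q, scale)
# Every recovered value lands within half a quantisation step of the original.
```

Each original weight comes back within `scale / 2` of its true value, which is exactly the kind of round-trip check the book asks you to perform on a real weight matrix.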
Understand fine-tuning at the level where you can pick the right tool. Full fine-tuning, LoRA, QLoRA, parameter-efficient tuning, instruction tuning, RLHF with PPO. Each method gets a chapter that explains the problem it solves, the math behind it, the cost trade-off, and a working PyTorch implementation you can run on your laptop. By the end you can look at a new problem and pick the cheapest method that will actually solve it, rather than reaching for whatever was in the last tutorial you read.
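The LoRA cost trade-off in particular is easy to state concretely. The sketch below is ours, not the book's code, and uses toy dimensions: freeze the original weight matrix W and train only a low-rank update B @ A, so the trainable parameter count scales with the rank r instead of the full matrix size.

```python
# Toy LoRA update: W stays frozen; only A (r x d_in) and B (d_out x r)
# are trained. At inference the effective weight is W + (alpha / r) * B @ A.
# Names alpha and r follow the LoRA paper's convention; values are illustrative.

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

d_out, d_in, r, alpha = 4, 4, 1, 2.0
W = [[1.0 if i == j else 0.0 for j in range(d_in)] for i in range(d_out)]  # frozen
A = [[0.1] * d_in]                  # r x d_in, trainable
B = [[0.5] for _ in range(d_out)]   # d_out x r, trainable

delta = matmul(B, A)                # rank-1 update, d_out x d_in
W_eff = [[W[i][j] + (alpha / r) * delta[i][j] for j in range(d_in)]
         for i in range(d_out)]

# Trainable parameters: r * (d_in + d_out) = 8, versus d_in * d_out = 16
# for full fine-tuning; at realistic sizes the gap is orders of magnitude.
```

At d_in = d_out = 4096 and r = 8, the same arithmetic gives roughly 65k trainable parameters against 16.8 million, which is why LoRA fits on consumer hardware.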
Build four end-to-end applied projects on real data. A contract-type classifier trained on real legal documents (Chapter 14). A legal-document assistant fine-tuned with QLoRA on a real legal corpus (Chapter 15). A text-to-SQL system that translates natural language into working database queries (Chapter 16). A function-calling system that teaches an LLM to use your APIs and powers the AI agents and agentic workflows everyone is building right now (Chapter 17). Every project has runnable code, a real dataset, and a step-by-step walkthrough from data preparation to a deployed model.
Make production inference fast and cheap. The KV cache. Prefix caching. Quantisation. Continuous batching. Speculative decoding. Each one is broken down with concrete examples, real numbers, and the reasoning behind why it works. You will understand why putting variables at the end of your prompt makes API calls ten times cheaper, why a 70-billion-parameter model fits on a single consumer GPU after QLoRA, and why the same prompt sometimes produces different outputs at temperature zero.
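The KV-cache idea behind several of these techniques fits in a few lines. The sketch below is a toy single-head illustration in plain Python (ours, not the book's code): each generation step appends one new key and value to the cache and reuses everything already stored, instead of recomputing attention inputs for the whole sequence.

```python
# Toy KV cache for single-head attention. Real engines cache per layer
# and per head, and the cached vectors are W_K @ token and W_V @ token;
# here the raw token vectors stand in for both.
import math

def attend(q, keys, values):
    """Scaled dot-product attention for one query over cached keys/values."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d)]

k_cache, v_cache = [], []
for token_vec in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:
    # One append per step; earlier entries are reused, never recomputed.
    k_cache.append(token_vec)
    v_cache.append(token_vec)
    out = attend(token_vec, k_cache, v_cache)
```

Because the cache is keyed by position, a shared prompt prefix can be computed once and reused across requests, which is the mechanism behind the prefix-caching savings mentioned above.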
Who this book is for:
Software engineers who have shipped with the OpenAI or Claude API and are tired of paying for capabilities they could get for a fraction of the price by fine-tuning a model themselves.
Machine-learning engineers who need to take an open-weights model and adapt it to a specific business domain, on a real budget, on real hardware.
Practitioners building agents and agentic systems who need to understand function calling at the level beneath the framework abstractions.
Engineering managers and tech leads who need to make build-versus-buy decisions about LLM features and want the technical depth to defend those decisions in front of a CFO.
What makes this book different:
Most fine-tuning content online is either toy examples on famous datasets (which never transfer to real work) or library tours that teach you which Hugging Face button to click without explaining why.
Inside Large Language Models, Volume II takes the third path. It teaches the underlying mechanics of every production technique, then walks through four real applied projects from data preparation to deployed model. You finish the book with code you can actually use, models you have actually trained, and the judgement to know which technique fits your next problem before you have written a single line of code.