- Unit 0 — Foundation & Setup
- Unit 1 — Tensor & Memory
- Unit 2 — Math Kernels
- Unit 3 — Matrix Multiplication
- Unit 4 — Tokenizer
- Unit 5 — Transformer
- Unit 6 — Mixture of Experts
- Unit 7 — KV Cache
- Unit 8 — Flash Attention
- Unit 9 — Quantization
- Unit 10 — Model Loading
- Unit 11 — Sampling & Generation
- Unit 12 — Speculative Decoding
- Unit 13 — Serving & Batching
- Unit 14 — Performance (Engine Complete)
- Unit 15 — Capstone: GPU/CUDA
Build an LLM Inference Engine in C++
A Challenge-Driven Guide to Building a CPU-First Inference Engine in C++20
Build a complete LLM inference engine in C++ — from a blank
project to a working Transformer that loads a real model and
generates text. Forged one challenge at a time, with tests
that prove every piece works before you move on.
Minimum price: $24.00
Buying multiple copies for your team? See below for a discount!
About the Book
Build an LLM Inference Engine in C++ — Through Challenges
You don't truly understand how large language models run
until you've built the engine yourself.
This book takes you from a blank C++ project to a complete,
working inference engine that loads a real Llama-family model
and generates text — one challenge at a time.
What you'll build:
- A strided tensor system with zero-copy views and arena allocation
- Math kernels: RMSNorm, SwiGLU, softmax, GEMM with SIMD
- A byte-level BPE tokenizer
- A full Transformer: RoPE, GQA, Flash Attention, KV Cache
- int8/int4 quantization with direct block multiplication
- GGUF model loading with mmap
- Sampling, streaming, speculative decoding, and continuous batching
- An optional CUDA capstone for the heaviest kernels
By Unit 14, the engine runs a real model on CPU. Every concept
earns its place right after you've built the thing it improves.
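To give a sense of the scale of a single challenge, here is a minimal sketch of one kernel named in the list above, RMSNorm. It is an illustration only and is not taken from the book: the function name `rms_norm`, its signature, the use of `std::vector<float>`, and the default `eps` are all assumptions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not the book's code): RMSNorm over a single vector.
//   y[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i]
// Assumes weight.size() >= x.size(); no bounds checking is done.
std::vector<float> rms_norm(const std::vector<float>& x,
                            const std::vector<float>& weight,
                            float eps = 1e-5f) {
    // Accumulate the sum of squares of the input.
    float sum_sq = 0.0f;
    for (float v : x) sum_sq += v * v;

    // Reciprocal of the root mean square, with eps for numerical stability.
    const float mean_sq = sum_sq / static_cast<float>(x.size());
    const float inv_rms = 1.0f / std::sqrt(mean_sq + eps);

    // Scale each element and apply the learned per-channel weight.
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = x[i] * inv_rms * weight[i];
    }
    return y;
}
```

A scalar loop like this is the kind of reference version a test can pin down before a SIMD variant replaces it.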
Who this is for: C++20 developers comfortable with algorithms
and memory layout who want to understand what actually happens
inside an LLM runtime — not by reading, but by building.
Team Discounts
Get a team discount on this book!
- Up to 3 members: minimum price $60.00, suggested price $60.00
- Up to 5 members: minimum price $96.00, suggested price $96.00
- Up to 10 members: minimum price $168.00, suggested price $168.00
- Up to 15 members: minimum price $240.00, suggested price $240.00
- Up to 25 members: minimum price $360.00, suggested price $360.00
Bundles that include this book
- Bundle pricing: minimum price $27.00, suggested price $27.00 (bought separately: $32.00)
- Bundle pricing: minimum price $40.00, suggested price $40.00 (bought separately: $56.00)
About the Author
Hatem M. is a programmer and technical author whose work focuses on modern C++, large language models, and AI systems.
His books combine first-principles explanations with complete implementations and reproducible experiments. They include C++ Algorithmic Mastery, an eight-volume series on algorithms and problem solving; Build an LLM Inference Engine in C++, which constructs a GPT-style inference engine from scratch; LLM Quantization: From the Bits Up, which develops the theory and practice of neural network quantization from the bit level upward; and C++ Autopsy, a forensic investigation of ten subtle C++ bugs that compiled successfully, ran correctly, and still produced the wrong answers.
The Leanpub 60 Day 100% Happiness Guarantee
Within 60 days of purchase you can get a 100% refund on any Leanpub purchase, in two clicks.
Free Updates. DRM Free.
If you buy a Leanpub book, you get free updates for as long as the author updates the book! Many authors use Leanpub to publish their books in-progress, while they are writing them. All readers get free updates, regardless of when they bought the book or how much they paid (including free).
Most Leanpub books are available in PDF (for computers) and EPUB (for phones, tablets and Kindle). The formats that a book includes are shown at the top right corner of this page.
Finally, Leanpub books don't have any DRM copy-protection nonsense, so you can easily read them on any supported device.

