Build an LLM Inference Engine in C++

A Challenge-Driven Guide to Building a CPU-First Inference Engine in C++20

This book is 100% complete. Last updated on 2026-06-22.

Build a complete LLM inference engine in C++, from a blank project to a working Transformer that loads a real model and generates text. Forged one challenge at a time, with tests that prove every piece works before you move on.

Minimum price

$24.00

Also available for 1 book credit with a Reader Membership

Buying multiple copies for your team? See below for a discount!

PDF

About the Book

Build an LLM Inference Engine in C++ — Through Challenges

You don't truly understand how large language models run until you've built the engine yourself.

This book takes you from a blank C++ project to a complete, working inference engine that loads a real Llama-family model and generates text, one challenge at a time.

What you'll build:

- A strided tensor system with zero-copy views and arena allocation

- Math kernels: RMSNorm, SwiGLU, softmax, GEMM with SIMD

- A byte-level BPE tokenizer

- A full Transformer: RoPE, GQA, Flash Attention, KV Cache

- int8/int4 quantization with direct block multiplication

- GGUF model loading with mmap

- Sampling, streaming, speculative decoding, and continuous batching

- An optional CUDA capstone for the heaviest kernels

By Unit 14, the engine runs a real model on CPU. Every concept earns its place right after you've built the thing it improves.

Who this is for: C++20 developers comfortable with algorithms and memory layout who want to understand what actually happens inside an LLM runtime, not by reading but by building.

Team Discounts

Get a team discount on this book!

  • Up to 3 members: minimum price $60.00, suggested price $60.00
  • Up to 5 members: minimum price $96.00, suggested price $96.00
  • Up to 10 members: minimum price $168.00, suggested price $168.00
  • Up to 15 members: minimum price $240.00, suggested price $240.00
  • Up to 25 members: minimum price $360.00, suggested price $360.00

About the Author

Hatem M.

Hatem M. is a programmer and technical author whose work focuses on modern C++, large language models, and AI systems.

His books combine first-principles explanations with complete implementations and reproducible experiments. They include C++ Algorithmic Mastery, an eight-volume series on algorithms and problem solving; Build an LLM Inference Engine in C++, which constructs a GPT-style inference engine from scratch; LLM Quantization: From the Bits Up, which develops the theory and practice of neural network quantization from the bit level upward; and C++ Autopsy, a forensic investigation of ten subtle C++ bugs that compiled successfully, ran correctly, and still produced the wrong answers.

Table of Contents

  • Unit 0 — Foundation & Setup
  • Unit 1 — Tensor & Memory
  • Unit 2 — Math Kernels
  • Unit 3 — Matrix Multiplication
  • Unit 4 — Tokenizer
  • Unit 5 — Transformer
  • Unit 6 — Mixture of Experts
  • Unit 7 — KV Cache
  • Unit 8 — Flash Attention
  • Unit 9 — Quantization
  • Unit 10 — Model Loading
  • Unit 11 — Sampling & Generation
  • Unit 12 — Speculative Decoding
  • Unit 13 — Serving & Batching
  • Unit 14 — Performance (Engine Complete)
  • Unit 15 — Capstone: GPU/CUDA

Get the free sample chapters

The free sample is available in PDF or EPUB, and can also be read online.
