Table of Contents
Chapter 0 — Building from Scratch
Sets expectations for the book. Explains what "from scratch" truly means, what it does not mean, and what prerequisites the reader needs. Introduces the overall journey of building a small language model from first principles.
Chapter 1 — Understanding Neural Networks: The Foundations of Modern AI
Covers the core building blocks of neural networks: neurons, weights, biases, activation functions, forward and backward propagation, loss functions, optimizers, and training challenges. This chapter builds the intuition needed before diving into transformers.
Chapter 2 — PyTorch Fundamentals: The Building Blocks of Deep Learning
Introduces tensors, tensor operations, reshaping, indexing, GPU support, and key PyTorch APIs. Builds the practical foundation needed to implement neural network components later in the book.
Chapter 3 — GPUs: The Computational Engine Behind LLM Training
Explains CPU vs. GPU architecture, VRAM, tensor cores, FLOPS, monitoring GPU memory, avoiding OOM errors, and understanding how deep learning workloads run on hardware. Provides context for training performance and hardware choices.
Chapter 4 — Where Intelligence Comes From: A Deep Look at Data
Focuses on why data quality matters more than architecture. Explores real-world datasets like Common Crawl, Books, Wikipedia, StackExchange, and GitHub. Discusses scaling laws, data curation, deduplication, and multi-stage training datasets.
Chapter 5 — Understanding Language Models: From Foundations to Small-Scale Design
Explains what a language model is mathematically, why scaling matters, emergent abilities, transformer basics, and why building smaller custom models remains valuable. Sets the stage for designing your own LLM.
Chapter 6 — Tokenizer: How Language Models Break Text into Meaningful Units
Introduces character, word, and subword tokenization. Explains why tokenization exists, how it affects downstream model performance, and why small models must choose their vocabulary carefully.
Chapter 7 — Understanding Embeddings, Positional Encodings, and RoPE
Discusses embeddings as dense vector representations, positional encodings (integer, binary, sinusoidal), their limitations, and why RoPE (Rotary Position Embedding) became the modern standard. Includes intuitive and mathematical explanations.
Chapter 8 — Understanding Attention: From Self-Attention to Multi-Head Attention
Covers the attention mechanism step by step: queries, keys, values, dot products, scaling, softmax, causal masks, multi-head attention, and detailed PyTorch-style breakdowns. Builds intuition for how transformers process context.
Chapter 9 — Making Inference Fast: KV Cache, Multi-Query, and Grouped-Query Attention
Explains the inference loop, KV caching, why only the last token matters at each decoding step, and how cache size affects memory and speed. Introduces Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) to reduce KV cache memory costs while preserving performance.
Chapter 10 — Inside the Transformer Block: RMSNorm, SwiGLU, and Residual Connections
Breaks down the internal block structure, normalization layers, why RMSNorm is preferred, how SwiGLU works, and why residual connections enable deeper networks and improve gradient flow. Prepares the reader to assemble full transformer blocks.
Chapter 11 — Building Qwen from Scratch
A hands-on implementation chapter covering tokenization, dataset preparation (TinyStories), RoPE, RMSNorm, GQA, SwiGLU, transformer blocks, causal masks, loss computation, the generation loop, and the training loop, culminating in a full Qwen-style model.
Chapter 12 — Quantization
Explains how LLM weights are stored, numerical precision, integer vs. floating-point formats, 8-bit/4-bit quantization, BitsAndBytes usage, perplexity evaluation, and how quantization affects performance and accuracy.
Chapter 13 — Mixture of Experts
Introduces the MoE architecture, sparse activation, expert routing, top-k gating, load balancing, and the historical evolution of MoE from 1990s research to modern implementations like DeepSeek. Includes conceptual and mathematical explanations.
Chapter 14 — Training Small Language Models: A Practical Journey
Covers architectural choices, tokenizer selection, dataset curation, debugging, GPU selection, memory optimization, training loops, and evaluation strategies. Wraps up the end-to-end pipeline for training effective small language models.