Mathematics of Reinforcement Learning: From Bellman Equations to Q-Learning VOL-1
A Mathematical Journey through Dynamic Programming and Optimal Decision-Making
Author: Anshuman Mishra, M.Tech (Computer Science)
Assistant Professor, Doranda College, Ranchi University
COPYRIGHT PAGE
© 2025 Anshuman Mishra, M.Tech (Computer Science)
All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopying, recording, or otherwise—without the prior written permission of the author or publisher, except for brief quotations used in reviews, academic references, or scholarly works.
First Edition: 2025
DISCLAIMER
This book is designed to provide academic and research-based knowledge on Mathematics of Reinforcement Learning, including the principles of dynamic programming, Bellman equations, Q-learning, and related computational models. The information contained herein is intended solely for educational purposes for students, teachers, and researchers in computer science, mathematics, and artificial intelligence.
While every effort has been made to ensure the accuracy of the contents, the author and publisher make no representations or warranties with respect to the accuracy or completeness of the contents of this book. The examples, algorithms, and derivations have been thoroughly checked, but errors may still exist. The author and publisher shall not be liable for any damages arising from the use of the material contained herein.
The mathematical examples and algorithms are for educational and illustrative purposes only.
Readers implementing algorithms for research or practical projects are encouraged to verify results independently and consult additional resources as needed.
All trademarks, trade names, or logos mentioned belong to their respective owners. Any resemblance of examples or case studies to actual data, individuals, or organizations is purely coincidental.
BOOK DESCRIPTION
Title: Mathematics of Reinforcement Learning: From Bellman Equations to Q-Learning VOL-1
Subtitle: A Mathematical Journey through Dynamic Programming and Optimal Decision-Making
Author: Anshuman Mishra, M.Tech (Computer Science)
Assistant Professor, Doranda College, Ranchi University
About the Book
The 21st century marks a revolutionary transformation in artificial intelligence (AI), where machines are not only learning from data but are also learning how to act intelligently in dynamic environments. Among the various branches of AI, Reinforcement Learning (RL) stands as the mathematical and conceptual foundation that allows computers and robots to make autonomous decisions through trial and reward.
This book, Mathematics of Reinforcement Learning, serves as a bridge between mathematical theory and practical algorithms, enabling readers to deeply understand the mathematical intuition behind learning systems that think, adapt, and optimize behavior.
Unlike traditional AI books that focus only on algorithmic implementation, this book unfolds the complete mathematical foundation—from Bellman equations and dynamic programming to Monte Carlo methods, temporal-difference learning, and Q-learning. Each topic is mathematically derived, systematically explained, and complemented with step-by-step numerical examples and proofs.
This book is written specifically for:
· Undergraduate and postgraduate students (B.Tech, BCA, MCA, M.Sc. AI, Data Science)
· Teachers and researchers in artificial intelligence and applied mathematics
· Industry professionals and developers seeking deeper theoretical clarity in RL
Philosophy Behind the Book
Most introductory books on reinforcement learning explain algorithms but rarely delve into why these algorithms work or how their mathematical properties guarantee convergence, stability, and optimality. This book aims to unveil the mathematics that drives intelligence, presenting reinforcement learning not as a set of black-box algorithms but as a beautifully structured mathematical framework grounded in linear algebra, probability, optimization, and dynamic programming.
Each chapter begins with fundamental theory and builds toward algorithmic application, showing how every step—from expectation computation to Bellman optimization—can be rigorously formulated using mathematical logic.
The goal is to empower readers to not only use reinforcement learning but to understand and innovate upon it.
Structure and Organization
This book is divided into seven modules and twenty comprehensive chapters, organized in an intuitive learning sequence.
Module I: Foundations of Reinforcement Learning
This module begins with the basic building blocks (agents, environments, states, actions, and rewards) and introduces readers to the concept of learning through interaction.
Chapters 1 to 3 explore:
· The mathematical definitions of Markov Processes and Decision Models
· The essential linear algebra and probability theory underlying reinforcement learning
· The formal structure of Markov Decision Processes (MDPs) and Bellman equations
By the end of this module, the reader understands the theoretical backbone of RL, paving the way for algorithmic exploration.
Module II: Bellman Equations and Dynamic Programming
Here, the mathematics of optimality takes center stage. The Bellman equations are explored in full depth, in both their expectation and optimality formulations, along with proofs of convergence and computational methods.
Dynamic programming methods such as policy evaluation, policy iteration, and value iteration are introduced with complete derivations and worked-out numerical examples. The connection between dynamic programming and reinforcement learning is clearly established, showing how each step in the algorithm emerges from a recursive mathematical structure.
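To give a flavor of the recursive structure described above, value iteration can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a listing from the book: the two-state MDP, the names `value_iteration`, `P`, and `R`, and the parameters `gamma`, `tol`, and `max_iters` are all hypothetical choices made for this example.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Repeatedly apply the Bellman optimality backup until values settle.

    P[s][a] is a list of next-state probabilities; R[s][a] is the
    expected immediate reward for taking action a in state s.
    """
    n_states = len(P)
    n_actions = len(P[0])
    V = [0.0] * n_states

    def q_value(s, a, values):
        # One-step lookahead: reward plus discounted expected next value.
        expected_next = sum(p * values[s2] for s2, p in enumerate(P[s][a]))
        return R[s][a] + gamma * expected_next

    for _ in range(max_iters):
        V_new = [max(q_value(s, a, V) for a in range(n_actions))
                 for s in range(n_states)]
        delta = max(abs(x - y) for x, y in zip(V_new, V))
        V = V_new
        if delta < tol:
            break

    # Greedy policy with respect to the final value estimates.
    policy = [max(range(n_actions), key=lambda a: q_value(s, a, V))
              for s in range(n_states)]
    return V, policy


# Hypothetical two-state MDP: action 0 = "stay", action 1 = "move".
# Staying in state 1 pays reward 1; every other choice pays 0.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # from state 0: stay -> 0, move -> 1
    [[0.0, 1.0], [1.0, 0.0]],  # from state 1: stay -> 1, move -> 0
]
R = [
    [0.0, 0.0],  # rewards in state 0
    [1.0, 0.0],  # rewards in state 1
]

V, policy = value_iteration(P, R)
print(V, policy)
```

With a discount factor of 0.9, the values settle near 9 for state 0 and 10 for state 1, and the greedy policy moves out of state 0 and stays in state 1.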
Module III: Monte Carlo and Temporal-Difference Learning
This module blends probability, sampling, and prediction. It explains how learning can happen from experience through Monte Carlo estimation and Temporal Difference (TD) learning.
Readers learn the relationships between bias, variance, convergence speed, and data efficiency. The transition from offline to online learning is demonstrated through examples like the Blackjack problem and Random Walk prediction.
Eligibility traces and TD(λ) methods are explained rigorously with mathematical equivalence proofs, bridging theory with implementation.
Module IV: Control Algorithms - From Sarsa to Q-Learning
The heart of reinforcement learning, learning to control, is covered in this section.
Starting with on-policy control (Sarsa) and progressing to off-policy control (Q-Learning), readers explore the mathematical mechanisms that enable agents to learn optimal strategies.
The derivation of the Q-learning update rule from the Bellman optimality principle is shown step-by-step, providing a strong conceptual understanding of how agents converge to optimal policies.
Comparisons between different approaches (Sarsa, Expected Sarsa, and Q-Learning) are backed with numerical and graphical examples.
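The three methods compared in this module differ only in the bootstrap target used inside the same temporal-difference update. The sketch below illustrates that difference for a tabular action-value function; the function names, the sample Q-table, and the values of `alpha` and `gamma` are assumptions made for this example and are not taken from the book.

```python
def sarsa_target(Q, r, s_next, a_next, gamma):
    # On-policy: bootstrap from the action actually taken next.
    return r + gamma * Q[s_next][a_next]


def expected_sarsa_target(Q, r, s_next, action_probs, gamma):
    # Bootstrap from the policy-weighted average of next action values.
    expected = sum(p * q for p, q in zip(action_probs, Q[s_next]))
    return r + gamma * expected


def q_learning_target(Q, r, s_next, gamma):
    # Off-policy: bootstrap from the greedy (maximum) next action value,
    # a sampled form of the Bellman optimality backup.
    return r + gamma * max(Q[s_next])


def td_update(q_sa, target, alpha):
    # Move the current estimate a fraction alpha toward the target.
    return q_sa + alpha * (target - q_sa)


# Hypothetical Q-table: two states, two actions.
Q = [[0.0, 0.0], [1.0, 3.0]]
r, s_next, gamma, alpha = 1.0, 1, 0.9, 0.5

targets = {
    "sarsa": sarsa_target(Q, r, s_next, a_next=0, gamma=gamma),
    "expected_sarsa": expected_sarsa_target(Q, r, s_next, [0.5, 0.5], gamma),
    "q_learning": q_learning_target(Q, r, s_next, gamma),
}
new_q = td_update(Q[0][0], targets["q_learning"], alpha)
print(targets, new_q)
```

In this example the Q-learning target is the largest of the three because it always assumes the best next action, regardless of which action the behavior policy would choose.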
Module V: Advanced Mathematical Tools and Extensions
At this point, the book transitions from classical reinforcement learning to advanced formulations.
Topics include:
· Policy Gradient Theorem and its derivation
· Actor-Critic architecture with detailed gradient calculations
· Regularization and constrained optimization for safe and stable learning
· Entropy and KL-Divergence based formulations for robust policy optimization
Readers are introduced to Lagrangian optimization in RL, showing how constraints can be mathematically imposed to ensure balanced exploration and exploitation.
Module VI: Deep and Approximate Reinforcement Learning
This section connects traditional reinforcement learning to deep neural networks and function approximation.
The mathematical underpinnings of Deep Q-Networks (DQN) are derived, explaining loss functions, gradient backpropagation, and the role of target networks.
Advanced architectures such as Double DQN, Dueling Networks, Prioritized Replay, and Proximal Policy Optimization (PPO) are also presented with mathematical clarity.
Through carefully designed examples, the book shows how deep learning integrates with reinforcement learning, an integration that underlies modern AI systems such as AlphaGo and autonomous robots.
Module VII: Theoretical and Research Perspectives
The final section consolidates all mathematical insights, focusing on proofs, convergence theorems, and future research directions.
It contains:
· Rigorous proofs of TD and Q-learning convergence
· Stability analysis using stochastic approximation theory
· Exploration of open challenges such as safe RL, explainable RL, and quantum RL
This section encourages teachers and researchers to extend the theoretical boundaries of reinforcement learning.
Pedagogical Features
To ensure clarity and academic depth, each chapter includes:
· Conceptual Explanation: Theoretical context and motivation
· Mathematical Derivation: Step-by-step proofs and equations
· Algorithm Design: Pseudocode for each major algorithm
· Numerical Examples: Solved problems for classroom and self-practice
· Visual Illustrations: Graphical understanding of value functions and convergence
· Exercises and Research Notes: For deeper investigation
This structure makes the book equally useful for students learning the subject, teachers designing course material, and researchers developing new models.
Why This Book Is Unique
1. Mathematical Depth: Every equation is derived and explained, not merely presented.
2. Pedagogical Precision: Structured for both classroom teaching and independent study.
3. Balanced Approach: Covers both classical RL (Bellman, DP, Q-learning) and modern RL (DQN, PPO, Actor-Critic).
4. Research Orientation: Provides open problems, mathematical proofs, and advanced theoretical questions.
5. Language Clarity: Written in simple, academic English with minimal jargon.
While most books treat RL as a subset of machine learning, this book presents RL as a pure mathematical science of decision-making under uncertainty.