Mathematics of Reinforcement Learning VOL-2

This book is 100% complete. Last updated on 2026-05-17.

Pedagogical Features

To ensure clarity and academic depth, each chapter includes:

·        Conceptual Explanation: Theoretical context and motivation

·        Mathematical Derivation: Step-by-step proofs and equations

·        Algorithm Design: Pseudocode for each major algorithm

·        Numerical Examples: Solved problems for classroom and self-practice

·        Visual Illustrations: Graphical understanding of value functions and convergence

·        Exercises and Research Notes: For deeper investigation

This structure makes the book equally useful for students learning the subject, teachers designing course material, and researchers developing new models.

Why This Book Is Unique

1.      Mathematical Depth: Every equation is derived and explained, not merely presented.

2.      Pedagogical Precision: Structured for both classroom teaching and independent study.

3.      Balanced Approach: Covers both classical RL (Bellman, DP, Q-learning) and modern RL (DQN, PPO, Actor-Critic).

4.      Research Orientation: Provides open problems, mathematical proofs, and advanced theoretical questions.

5.      Language Clarity: Written in simple, academic English with minimal jargon.

While most books treat RL as a subset of machine learning, this book presents RL as a pure mathematical science of decision-making under uncertainty.

Minimum price

$19.00

$29.00

PDF
EPUB

About the Book

The 21st century marks a revolutionary transformation in artificial intelligence (AI), where machines are not only learning from data but are also learning how to act intelligently in dynamic environments. Among the various branches of AI, Reinforcement Learning (RL) stands as the mathematical and conceptual foundation that allows computers and robots to make autonomous decisions through trial and reward.

This book, Mathematics of Reinforcement Learning, serves as a bridge between mathematical theory and practical algorithms, enabling readers to deeply understand the mathematical intuition behind learning systems that think, adapt, and optimize behavior.

Unlike traditional AI books that focus only on algorithmic implementation, this book unfolds the complete mathematical foundation—from Bellman equations and dynamic programming to Monte Carlo methods, temporal-difference learning, and Q-learning. Each topic is mathematically derived, systematically explained, and complemented with step-by-step numerical examples and proofs.

This book is written specifically for:

·        Undergraduate and postgraduate students (B.Tech, BCA, MCA, M.Sc. AI, Data Science)

·        Teachers and researchers in artificial intelligence and applied mathematics

·        Industry professionals and developers seeking deeper theoretical clarity in RL

Philosophy Behind the Book

Most introductory books on reinforcement learning explain algorithms but rarely delve into why these algorithms work or how their mathematical properties guarantee convergence, stability, and optimality. This book aims to unveil the mathematics that drives intelligence, presenting reinforcement learning not as a set of black-box algorithms but as a beautifully structured mathematical framework grounded in linear algebra, probability, optimization, and dynamic programming.

Each chapter begins with fundamental theory and builds toward algorithmic application, showing how every step—from expectation computation to Bellman optimization—can be rigorously formulated using mathematical logic.

The goal is to empower readers to not only use reinforcement learning but to understand and innovate upon it.

Structure and Organization

This book is divided into seven modules and twenty comprehensive chapters, organized in an intuitive learning sequence.

Module I: Foundations of Reinforcement Learning

It begins with the basic building blocks—agents, environments, states, actions, and rewards—and introduces readers to the concept of learning through interaction.
Chapters 1 to 3 explore:

·        The mathematical definitions of Markov Processes and Decision Models

·        The essential linear algebra and probability theory underlying reinforcement learning

·        The formal structure of Markov Decision Processes (MDPs) and Bellman equations

By the end of this module, the reader understands the theoretical backbone of RL, paving the way for algorithmic exploration.

Module II: Bellman Equations and Dynamic Programming

Here, the mathematics of optimality takes center stage. The Bellman equations are explored in full depth—both expectation and optimality formulations—along with proofs of convergence and computational methods.
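For reference, the two formulations named above are commonly written as follows. The notation is an assumption borrowed from a widely used textbook convention (state-value functions v_π and v_*, dynamics p(s', r | s, a), discount factor γ); the book's own symbols may differ.

```latex
\begin{align}
v_\pi(s) &= \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_\pi(s') \,\bigr]
  && \text{(Bellman expectation equation)} \\
v_*(s) &= \max_{a} \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_*(s') \,\bigr]
  && \text{(Bellman optimality equation)}
\end{align}
```

The only difference is how the action is handled: the expectation form averages over the policy's action probabilities, while the optimality form takes the maximum over actions.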

Dynamic programming methods such as policy evaluation, policy iteration, and value iteration are introduced with complete derivations and worked-out numerical examples. The connection between dynamic programming and reinforcement learning is clearly established, showing how each step in the algorithm emerges from a recursive mathematical structure.
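As an illustration of the recursive structure described above, here is a minimal value iteration sketch. It is not taken from the book: the two-state MDP, the dictionary layout `P[state][action]`, and the names `q_value`, `value_iteration`, and `TOY_MDP` are all hypothetical choices made for this example.

```python
def q_value(P, V, s, a, gamma):
    """Expected one-step return of taking action a in state s, then following V."""
    return sum(prob * (reward + gamma * V[s_next]) for prob, s_next, reward in P[s][a])


def value_iteration(P, gamma=0.9, tol=1e-10):
    """Apply the Bellman optimality backup until the largest change is below tol."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(q_value(P, V, s, a, gamma) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # in-place update; later states already see the new value
        if delta < tol:
            break
    # Read off a greedy policy from the converged values.
    policy = {s: max(P[s], key=lambda a: q_value(P, V, s, a, gamma)) for s in P}
    return V, policy


# Hypothetical deterministic MDP (an assumption for illustration only):
# P[state][action] = [(probability, next_state, reward), ...]
TOY_MDP = {
    0: {"stay": [(1.0, 0, 0.0)], "move": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "move": [(1.0, 0, 0.0)]},
}

V, policy = value_iteration(TOY_MDP)
print(V, policy)
```

With a discount of 0.9, staying in state 1 forever is worth 2 / (1 - 0.9) = 20, and moving there from state 0 is worth 1 + 0.9 × 20 = 19, which is what the iteration converges to.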

Module III: Monte Carlo and Temporal-Difference Learning

This module blends probability, sampling, and prediction. It explains how learning can happen from experience through Monte Carlo estimation and Temporal Difference (TD) learning.
Readers learn how bias, variance, convergence speed, and data efficiency trade off between the two approaches. The move from Monte Carlo updates, made offline at the end of an episode, to TD updates, made online after every step, is demonstrated through examples such as the Blackjack problem and Random Walk prediction.

Eligibility traces and TD(λ) methods are explained rigorously with mathematical equivalence proofs, bridging theory with implementation.
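To make the idea of learning from experience concrete, the following sketch applies tabular TD(0) prediction to a small random walk. It is an illustration under stated assumptions, not the book's code: the five-state layout, the reward of 1 on the right exit, the step size, the seed, and the function names `td0_update` and `random_walk_td0` are all chosen for this example.

```python
import random


def td0_update(v_s, reward, v_next, alpha, gamma=1.0):
    """One TD(0) step: move V(s) toward the bootstrapped target r + gamma * V(s')."""
    td_error = reward + gamma * v_next - v_s
    return v_s + alpha * td_error


def random_walk_td0(episodes=20000, alpha=0.01, seed=0):
    """Estimate state values of a 5-state random walk with terminal states at both ends."""
    rng = random.Random(seed)
    n = 5
    # Index 0 and n + 1 are terminal (value 0); non-terminal states start at 0.5.
    V = [0.0] + [0.5] * n + [0.0]
    for _ in range(episodes):
        s = 3  # every episode starts in the centre state
        while 0 < s < n + 1:
            s_next = s + rng.choice((-1, 1))
            reward = 1.0 if s_next == n + 1 else 0.0  # reward only on the right exit
            V[s] = td0_update(V[s], reward, V[s_next], alpha)
            s = s_next
    return V[1:-1]


estimates = random_walk_td0()
print([round(v, 2) for v in estimates])
```

For this walk the true values are 1/6, 2/6, ..., 5/6, so the printed estimates should land close to those fractions. Because the update uses V(s') instead of the full return, each value is refreshed after every step and not only at the end of the episode.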

Module IV: Control Algorithms — From Sarsa to Q-Learning

The heart of reinforcement learning—learning to control—is covered in this section.
Starting with on-policy control (Sarsa) and progressing to off-policy control (Q-Learning), readers explore the mathematical mechanisms that enable agents to learn optimal strategies.

The derivation of the Q-learning update rule from the Bellman optimality principle is shown step-by-step, providing a strong conceptual understanding of how agents converge to optimal policies.
Comparisons between different approaches (Sarsa, Expected Sarsa, and Q-Learning) are backed with numerical and graphical examples.
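The three methods compared above share one update, Q(s, a) ← Q(s, a) + α (target − Q(s, a)), and differ only in the target. The sketch below shows that difference on a single transition. All function names and numbers are hypothetical and chosen for illustration; the epsilon-greedy behaviour policy is an assumption.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy over a dict {action: value}."""
    n = len(q_values)
    greedy = max(q_values, key=q_values.get)
    return {a: epsilon / n + (1.0 - epsilon if a == greedy else 0.0) for a in q_values}


def sarsa_target(reward, gamma, q_next, next_action):
    """On-policy: bootstrap from the action actually taken in the next state."""
    return reward + gamma * q_next[next_action]


def expected_sarsa_target(reward, gamma, q_next, epsilon):
    """Bootstrap from the expected next value under the epsilon-greedy policy."""
    probs = epsilon_greedy_probs(q_next, epsilon)
    return reward + gamma * sum(probs[a] * q_next[a] for a in q_next)


def q_learning_target(reward, gamma, q_next):
    """Off-policy: bootstrap from the greedy (maximum) next value."""
    return reward + gamma * max(q_next.values())


def td_update(q_sa, target, alpha):
    """Shared update rule: move Q(s, a) a fraction alpha toward the target."""
    return q_sa + alpha * (target - q_sa)


# Hypothetical transition: reward 1, discount 0.5, next-state action values below.
q_next = {"left": 1.0, "right": 3.0}
print(sarsa_target(1.0, 0.5, q_next, "left"))      # exploratory action was sampled
print(expected_sarsa_target(1.0, 0.5, q_next, 0.2))
print(q_learning_target(1.0, 0.5, q_next))
```

When the behaviour policy happens to explore (here, choosing "left"), the Sarsa target drops, the Q-learning target ignores the exploration, and Expected Sarsa sits between the two.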

Module V: Advanced Mathematical Tools and Extensions

At this point, the book transitions from classical reinforcement learning to advanced formulations.
Topics include:

·        Policy Gradient Theorem and its derivation

·        Actor-Critic architecture with detailed gradient calculations

·        Regularization and constrained optimization for safe and stable learning

·        Entropy and KL-Divergence based formulations for robust policy optimization

Readers are introduced to Lagrangian optimization in RL, showing how constraints can be mathematically imposed to ensure balanced exploration and exploitation.
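As a pointer to what these topics look like on paper, the policy gradient theorem and a generic Lagrangian for a constrained policy are often stated as below. The symbols are assumptions for this sketch (on-policy state distribution μ, action-value function q_π, reward objective J_R, constraint objective J_C with threshold d, multiplier λ) and are not taken from the book.

```latex
\begin{align}
\nabla_\theta J(\theta)
  &\propto \sum_{s} \mu(s) \sum_{a} q_\pi(s, a)\, \nabla_\theta \pi(a \mid s, \theta)
   = \mathbb{E}_\pi\!\bigl[\, q_\pi(S, A)\, \nabla_\theta \ln \pi(A \mid S, \theta) \,\bigr] \\[4pt]
\max_{\theta}\; J_R(\theta) \quad &\text{subject to} \quad J_C(\theta) \le d \\[4pt]
\mathcal{L}(\theta, \lambda) &= J_R(\theta) - \lambda\,\bigl( J_C(\theta) - d \bigr),
  \qquad \lambda \ge 0, \qquad
  \min_{\lambda \ge 0}\, \max_{\theta}\; \mathcal{L}(\theta, \lambda)
\end{align}
```

The multiplier λ acts as an adjustable penalty: it grows while the constraint is violated and shrinks toward zero once the constraint is satisfied.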

Module VI: Deep and Approximate Reinforcement Learning

This section connects traditional reinforcement learning to deep neural networks and function approximation.
The mathematical underpinnings of Deep Q-Networks (DQN) are derived, explaining loss functions, gradient backpropagation, and the role of target networks.
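In its commonly cited form, the DQN objective is a squared Bellman error in which the bootstrapped target is computed with a separate, periodically copied set of weights. The notation is an assumption for this sketch: θ for the online network, θ⁻ for the target network, and D for the replay buffer.

```latex
\begin{align}
y &= r + \gamma \max_{a'} Q(s', a'; \theta^{-}) \\
L(\theta) &= \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}
  \Bigl[ \bigl( y - Q(s, a; \theta) \bigr)^{2} \Bigr] \\
\nabla_\theta L(\theta) &= -2\, \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}
  \Bigl[ \bigl( y - Q(s, a; \theta) \bigr)\, \nabla_\theta Q(s, a; \theta) \Bigr]
\end{align}
```

The target y is treated as a constant when differentiating, which is why θ⁻ does not appear in the gradient.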

Advanced architectures such as Double DQN, Dueling Networks, Prioritized Replay, and Proximal Policy Optimization (PPO) are also presented with mathematical clarity.
Through carefully designed examples, the book shows how deep learning integrates with reinforcement learning, resulting in modern AI systems like AlphaGo and autonomous robots.

Module VII: Theoretical and Research Perspectives

The final section consolidates all mathematical insights, focusing on proofs, convergence theorems, and future research directions.
It contains:

·        Rigorous proofs of TD and Q-learning convergence

·        Stability analysis using stochastic approximation theory

·        Exploration of open challenges such as safe RL, explainable RL, and quantum RL

This section encourages teachers and researchers to extend the theoretical boundaries of reinforcement learning.

About the Author

Anshuman Mishra

Anshuman Kumar Mishra is a seasoned educator and prolific author with over 20 years of experience in the teaching field. He has a deep passion for technology and a strong commitment to making complex concepts accessible to students at all levels. With an M.Tech in Computer Science from BIT Mesra, he brings both academic expertise and practical experience to his work.

Currently serving as an Assistant Professor at Doranda College, Anshuman has been a guiding force for many aspiring computer scientists and engineers, nurturing their skills in various programming languages and technologies. His teaching style is focused on clarity, hands-on learning, and making students comfortable with both theoretical and practical aspects of computer science.

Throughout his career, Anshuman Kumar Mishra has authored over 25 books on a wide range of topics including Python, Java, C, C++, Data Science, Artificial Intelligence, SQL, .NET, Web Programming, Data Structures, and more. His books have been well-received by students, professionals, and institutions alike for their straightforward explanations, practical exercises, and deep insights into the subjects.

Anshuman's approach to teaching and writing is rooted in his belief that learning should be engaging, intuitive, and highly applicable to real-world scenarios. His experience in both academia and industry has given him a unique perspective on how to best prepare students for the evolving world of technology.

In his books, Anshuman aims not only to impart knowledge but also to inspire a lifelong love for learning and exploration in the world of computer science and programming.

Table of Contents

Book Title: Mathematics of Reinforcement Learning
Subtitle: From Bellman Equations to Q-Learning: A Mathematical Journey through Dynamic Programming and Optimal Decision-Making
Author: Anshuman Mishra, M.Tech (Computer Science), Assistant Professor, Doranda College, Ranchi University

Module IV: Control Algorithms — From Sarsa to Q-Learning (VOL-2)

Chapter 10: On-Policy Control
10.1 Introduction to On-Policy Control
10.2 Sarsa Algorithm and Update Rule
10.3 Expected Sarsa Algorithm
10.4 Epsilon-Greedy Exploration Strategy
10.5 Comparison between Sarsa and Q-Learning
10.6 Example: Windy Grid World Problem
10.7 Pseudocode for Sarsa Implementation
10.8 Convergence and Limitations
10.9 Parameter Sensitivity in Sarsa
10.10 Summary and Exercise Problems

Chapter 11: Off-Policy Control
11.1 Introduction to Off-Policy Methods
11.2 Importance Sampling Concept
11.3 Off-Policy Evaluation and Estimation
11.4 Q-Learning Algorithm: Derivation and Equation
11.5 Proof of Q-Learning Convergence
11.6 Example: Q-Table Update Step-by-Step
11.7 Comparison: On-Policy vs. Off-Policy
11.8 Double Q-Learning Overview
11.9 Practical Implementation Example
11.10 Summary and Exercise Problems

Chapter 12: Function Approximation in Reinforcement Learning
12.1 Curse of Dimensionality and Need for Approximation
12.2 Linear Function Approximation
12.3 Gradient Descent in Value Function Estimation
12.4 Mean Squared Error Minimization
12.5 Least Squares Temporal Difference (LSTD) Method
12.6 Example: Function Approximation in Continuous State Space
12.7 Stability and Divergence in Approximation
12.8 Feature Engineering in RL
12.9 Practical Implementation Guidelines
12.10 Summary and Exercise Problems

Module V: Advanced Mathematical Tools and Extensions

Chapter 13: Policy Gradient and Actor-Critic Methods
13.1 Policy Parameterization and Representation
13.2 Policy Gradient Theorem
13.3 REINFORCE Algorithm
13.4 Variance Reduction Techniques
13.5 Actor-Critic Framework
13.6 Derivation of Actor and Critic Updates
13.7 Example: Policy Gradient in 2-State Environment
13.8 Practical Algorithm Steps
13.9 Convergence of Policy Gradient Methods
13.10 Summary and Exercise Problems

Chapter 14: Constrained and Regularized Reinforcement Learning
14.1 Motivation for Constrained RL
14.2 Regularization in Policy Optimization
14.3 KL-Divergence and Entropy Regularization
14.4 Lagrangian Formulation in RL
14.5 Example: Soft Actor-Critic Intuition
14.6 Safe RL and Risk-Aware Optimization
14.7 Dual Methods and Convergence Proof
14.8 Entropy-Regularized Value Functions
14.9 Practical Case Study
14.10 Summary and Exercise Problems

Chapter 15: Exploration vs. Exploitation Mathematics
15.1 The Exploration–Exploitation Dilemma
15.2 Multi-Armed Bandit Problem Formulation
15.3 Upper Confidence Bound (UCB) Approach
15.4 Thompson Sampling
15.5 Information-Theoretic Exploration
15.6 Entropy and Mutual Information in RL
15.7 Example: Bandit Problem with UCB
15.8 Derivation of Regret Bounds
15.9 Practical Implementation Notes
15.10 Summary and Exercise Problems

Module VI: Deep and Approximate Reinforcement Learning

Chapter 16: Mathematical Foundation of Deep Q-Networks (DQN)
16.1 Role of Neural Networks in RL
16.2 Function Approximation and Bellman Error
16.3 Loss Function Derivation in DQN
16.4 Gradient Descent and Backpropagation
16.5 Target Networks and Experience Replay
16.6 Example: Deep Q-Learning in Maze Environment
16.7 Pseudocode of DQN Algorithm
16.8 Convergence and Stability Issues
16.9 Performance Evaluation Metrics
16.10 Summary and Exercise Problems

Chapter 17: Advanced Deep RL Architectures
17.1 Double DQN and Its Mathematical Motivation
17.2 Dueling DQN and Advantage Function
17.3 Prioritized Experience Replay
17.4 Multi-Agent Reinforcement Learning
17.5 Hierarchical RL and Options Framework
17.6 Continuous Control with Deep Deterministic Policy Gradient (DDPG)
17.7 Trust Region Policy Optimization (TRPO)
17.8 Proximal Policy Optimization (PPO)
17.9 Case Study: Autonomous Navigation
17.10 Summary and Exercise Problems

Module VII: Theoretical and Research Perspectives

Chapter 18: Convergence Analysis and Stability
18.1 Theoretical Convergence of Value Iteration
18.2 Robbins-Monro Conditions
18.3 Convergence of TD and Q-Learning
18.4 Asynchronous and Batch RL Convergence
18.5 Nonlinear Function Approximators
18.6 Stability of Actor-Critic Methods
18.7 Error Propagation in Approximation
18.8 Regularization for Stability
18.9 Open Problems in Theoretical RL
18.10 Summary and Exercise Problems

Chapter 19: Mathematical Proofs and Derivations
19.1 Derivation of Bellman Expectation Equation
19.2 Derivation of Bellman Optimality Equation
19.3 Proof of Policy Gradient Theorem
19.4 Proof of TD(λ) Convergence
19.5 Proof of Q-Learning Convergence
19.6 Contraction Mapping Theorem and Application in RL
19.7 Proof of Dynamic Programming Convergence
19.8 Lemmas and Theorems for RL
19.9 Example: Analytical Proof Walkthroughs
19.10 Summary and Exercise Problems

Chapter 20: Applications and Future Directions
20.1 RL Applications in Robotics
20.2 RL in Finance and Trading Systems
20.3 RL in Healthcare Decision Making
20.4 RL in Games and Simulation
20.5 Quantum Reinforcement Learning
20.6 Safe and Explainable RL
20.7 RL in Edge and IoT Systems
20.8 Research Challenges and Open Questions
20.9 Future Directions and Trends
20.10 Summary and Research Exercises
