The Leanpub 60 Day 100% Happiness Guarantee
Within 60 days of purchase you can get a 100% refund on any Leanpub purchase, in two clicks.
See full terms...
Foundations: From Your First DataFrame to Production-Ready Joins and Aggregations
PySpark from page one. Ten chapters that take a Python user who knows pandas and turn them into someone who can write, read, and debug production PySpark, without a three-chapter detour through distributed-computing theory.
Minimum price
$19.00
$24.00
About the Book
A practical, depth-first guide to Apache Spark for engineers who want to actually
ship code, not just pass an interview.
Volume 1 of Spark 4.0 from Scratch takes a Python user who knows pandas (or nothing) and turns them into someone who can write, read, and debug PySpark code in a real production codebase. Ten chapters of foundations, each grounded in the same running retail-analytics dataset, so the concepts stack instead of scatter.
Key Features
- Current with Spark 4.0: Catalyst, Tungsten, Adaptive Query Execution, the
unified DataFrame/Dataset API, and Spark Connect mentioned where it matters.
- PySpark from page one: every code example, every API, every error
message you'll actually see. No Scala detours.
- Real performance discipline early: partition sizing, shuffle reading,
explain() output, the Spark UI tabs, covered as you go, not deferred to
a final chapter that no reader reaches.
- The running example threads through every chapter: one retail-analytics
dataset, used in Chapter 1's first SparkSession and still alive in Chapter
10's grouped aggregations. Concepts stack.
Every chapter ends with scenario-based Knowledge Check questions modeled on how senior engineers interview each other, not "what does count() return," but "why is your groupBy producing one task that takes thirty minutes."
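To give a flavor of the "one task that takes thirty minutes" scenario, here is a minimal pure-Python sketch of key skew. It is not taken from the book and does not use Spark: the `partition_sizes` helper, the integer store IDs, and the row counts are all hypothetical, and Python's built-in `hash` stands in for Spark's own partitioning hash. It also ignores any partial aggregation Spark may perform before the shuffle. The idea it illustrates is only that rows sharing a key land in the same partition, so one hot key can leave a single task with most of the work.

```python
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count how many rows each partition receives under simple hash partitioning.

    Hypothetical helper for illustration; Python's hash() is a stand-in
    for the hash function a real engine would use.
    """
    sizes = Counter()
    for key in keys:
        # Every row with the same key maps to the same partition index.
        sizes[hash(key) % num_partitions] += 1
    return [sizes.get(p, 0) for p in range(num_partitions)]


# Made-up retail data: store 0 is a "hot" key with 100,000 rows,
# while stores 1 through 7 have 1,000 rows each.
keys = [0] * 100_000 + [store for store in range(1, 8) for _ in range(1_000)]

sizes = partition_sizes(keys, num_partitions=8)
largest = max(sizes)
average = sum(sizes) / len(sizes)

print("rows per partition:", sizes)
print("largest partition is", round(largest / average, 1), "times the average")
```

In this toy setup, the partition that receives store 0 holds the bulk of the rows, which is the shape of problem that a skewed groupBy or join produces at cluster scale.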
What you will learn
- Read and write every common data format on the Spark side of the boundary
- Build correct PySpark pipelines with a clear grasp of transformations and actions
- Choose between SQL and the DataFrame API and mix them when needed
- Pick the right join strategy and verify it in the query plan
- Tune partitions, spot shuffles, and read the Spark UI without guessing
- Handle skew with broadcast hints and Adaptive Query Execution
- Use explain() as the first debugging tool you reach for
Who this book is for
Python developers, data analysts, and data engineers who are tired of toy examples and want a Spark book that respects their time. Basic Python is assumed (functions, dictionaries, list comprehensions); Appendix A includes a refresher on the parts of Python that matter most for PySpark.
About the Author
Ritesh Modi is Head of AI at MarketOnce and a former Forward Deployed Engineer at Microsoft. He has spent more than a decade building and shipping production systems across cloud, distributed computing, and applied machine learning, working with organizations ranging from global enterprises to fast-moving startups. His recent work focuses on applied large language models, designing systems that turn pretrained models into reliable, task-specific tools.
Ritesh has authored multiple technology books and speaks regularly at industry conferences on AI, cloud architecture, and software engineering. His writing philosophy rests on a simple belief: the best technical books are written by practitioners who still remember what it felt like to not understand something, not by experts who have forgotten. Every explanation in this book was tested against that standard: if it would not have made sense to him when he was first learning this material, it was rewritten until it did.
He writes, shares ideas, and connects with readers at www.riteshmodi.com. When he is not writing or building AI systems, he can be found mentoring engineers, exploring new architectures, or debugging a training run that should have converged three hours ago.