
Spark 4.0 from Scratch

Foundations: From Your First DataFrame to Production-Ready Joins and Aggregations

PySpark from page one. Ten chapters that take a Python user who knows pandas and turn them into someone who can write, read, and debug production PySpark, without a three-chapter detour through distributed-computing theory.

Minimum price

$19.00

$24.00


Also available for 1 book credit with a Reader Membership

PDF

About the Book

A practical, depth-first guide to Apache Spark for engineers who want to actually ship code, not just pass an interview.

Volume 1 of Spark 4.0 from Scratch takes a Python user who knows pandas (or nothing) and turns them into someone who can write, read, and debug PySpark code in a real production codebase. It has ten chapters of foundations, each grounded in the same running retail-analytics dataset, so the concepts stack instead of scatter.

Key Features

- Spark 4.0 current: Catalyst, Tungsten, Adaptive Query Execution, the unified DataFrame/Dataset API, Spark Connect mentions where they matter.

- PySpark from page one: every code example, every API, every error message you'll actually see. No Scala detours.

- Real performance discipline early: partition sizing, shuffle reading, explain() output, the Spark UI tabs, covered as you go, not deferred to a final chapter that no reader reaches.

- The running example threads through every chapter: one retail-analytics dataset, used in Chapter 1's first SparkSession and still alive in Chapter 10's grouped aggregations. Concepts stack.

Every chapter ends with scenario-based Knowledge Check questions modeled on how senior engineers interview each other: not "what does count() return?" but "why is your groupBy producing one task that takes thirty minutes?"

What you will learn

- Read and write every common data format on the Spark side of the boundary

- Build correct PySpark pipelines from transformations and actions

- Choose between SQL and the DataFrame API and mix them when needed

- Pick the right join strategy and verify it in the query plan

- Tune partitions, spot shuffles, and read the Spark UI without guessing

- Handle skew with broadcast hints and Adaptive Query Execution

- Use explain() as the first debugging tool you reach for

Who this book is for

Python developers, data analysts, and data engineers who are tired of toy examples and want a Spark book that respects their time. Basic Python is assumed (functions, dictionaries, list comprehensions); Appendix A includes a refresher on the parts of Python that matter most for PySpark.

Table of Contents

  • Welcome to Spark. Your first SparkSession on page 8, not in Chapter 4. The big-data problem, the Spark-vs-pandas decision, a sales-CSV walkthrough that ends with a real groupBy.
  • Spark Architecture. Driver, executors, DAG, stages, tasks, shuffles, lazy evaluation, Catalyst, Tungsten. The full lifecycle of a Spark job, traced step by step so you can read any query plan that follows.
  • RDDs: The Foundation. Why RDDs still matter in the DataFrame era. Narrow vs wide transformations, pair RDDs, lineage and fault tolerance, partitioning strategies that decide your job's parallelism.
  • DataFrames and Datasets. Schemas, every file format you'll touch, the column-expression API, and the Catalyst optimizer that quietly makes your code fast.
  • Spark SQL. Temp views, CTEs, mixed SQL/DataFrame pipelines, Hive integration, the catalog API. SQL as a first-class citizen, not a fallback.
  • Data Sources and Formats. CSV, JSON, Parquet, ORC, Avro deep dives. Partitioning, bucketing, schema evolution, and the Delta Lake introduction that sets up Volume 2.
  • Transformations Deep Dive. Column expressions, explode and flatten, pivot and unpivot, repartition vs coalesce, deduplication, sorting, and method chaining that stays readable.
  • Actions and Output. What actually triggers execution, collect safety, write modes that won't surprise you, partitioned and bucketed writes, controlling file count.
  • Joins and Set Operations. Every join type with one decision tree. Broadcast vs sort-merge, skew detection in the Spark UI, salting the hot key, the six strategies you actually need.
  • Aggregations and Grouping. groupBy, pivot, cube, rollup, approximate functions, conditional aggregations, grouping sets. Reporting and feature engineering, side by side.



About the Author

Ritesh Modi

Ritesh Modi is Head of AI at MarketOnce and a former Forward Deployed Engineer at Microsoft. He has spent more than a decade building and shipping production systems across cloud, distributed computing, and applied machine learning, working with organizations ranging from global enterprises to fast-moving startups. His recent work focuses on applied large language models, designing systems that turn pretrained models into reliable, task-specific tools.

Ritesh has authored multiple technology books and speaks regularly at industry conferences on AI, cloud architecture, and software engineering. His writing philosophy rests on a simple belief: the best technical books are written by practitioners who still remember what it felt like to not understand something, not by experts who have forgotten. Every explanation in this book was tested against that standard: if it would not have made sense to him when he was first learning this material, it was rewritten until it did.

He writes, shares ideas, and connects with readers at www.riteshmodi.com. When he is not writing or building AI systems, he can be found mentoring engineers, exploring new architectures, or debugging a training run that should have converged three hours ago.
