Spark 4.0 from Scratch

Advanced Processing & Production Mastery

Structured Streaming, MLlib, GraphFrames, performance tuning, testing and CI, and the lakehouse. Eleven chapters that take a competent PySpark user from "the job runs" to "the on-call team trusts the job."

Minimum price

$19.00

$24.00

Also available for 1 book credit with a Reader Membership

PDF
About the Book

The second volume of Spark 4.0 from Scratch turns a competent PySpark user into a production Spark engineer. Its eleven chapters cover everything that sits between "the job runs" and "the on-call team trusts the job."

Key Features

  • Structured Streaming, end to end: sources, sinks, watermarks, stateful processing, stream-stream joins, exactly-once with checkpointing, and the monitoring surface that catches a stuck stream before the dashboard does.
  • MLlib without hand-waving: Pipelines, feature engineering, classification, regression, clustering, hyperparameter tuning, model saving, and a complete model lifecycle on real data.
  • GraphFrames: vertices, edges, motif finding, PageRank, connected components, triangle counting, and the fraud-detection patterns the payment-processing crowd actually uses.
  • Performance tuning as a diagnostic discipline: the Spark UI tabs, explain() reading, partition sizing, AQE, dynamic partition pruning, manual skew salting, and Z-ordering for data skipping.
  • Testing and CI/CD for Spark: pytest fixtures, GitHub Actions pipelines, code-quality linting, deployment patterns including canary and blue/green.
  • The Lakehouse, in depth: Delta Lake (time travel, OPTIMIZE, ZORDER, VACUUM, Change Data Feed), Apache Iceberg (hidden partitioning, partition evolution, REST catalog), the trade-off table, and modern table modelling.

What you will learn

- Process unbounded streams with watermarks, stateful operators, and stream-stream joins

- Train and evaluate ML pipelines at scale with MLlib's Estimator/Transformer API

- Use GraphFrames for PageRank, connected components, motifs, and fraud patterns

- Diagnose any slow Spark job from the UI and explain() output

- Write unit and integration tests for PySpark code that actually catch bugs

- Deploy Spark applications to Kubernetes, YARN, or standalone clusters

- Design Delta Lake and Iceberg tables for production lakehouses

- Implement medallion architecture with bronze/silver/gold layers and SCDs

Who this book is for

Data engineers and data scientists who finished Volume 1 (or already write PySpark professionally) and now need streaming, ML, performance discipline, and production deployment skills. Some chapters touch lightly on ML and graph theory; no prior experience in either is required, but a hunger to ship matters.

Table of Contents

  • Window Functions. Top-N per group, running totals, moving averages, gap-and-island, session detection from raw event logs. The analytics that groupBy cannot express.
  • User-Defined Functions (UDFs). Python UDFs, Pandas UDFs (the 3-100x speedup), grouped map for per-group model fitting, and the rule for when to avoid UDFs entirely.
  • Structured Streaming. Sources, sinks, watermarks, stateful processing, stream-stream joins, exactly-once with checkpointing, monitoring stuck queries before the dashboard does.
  • Machine Learning with MLlib. Pipelines, feature engineering, classification, regression, clustering, hyperparameter tuning, and a complete model lifecycle on real data.
  • Graph Processing with GraphFrames. Vertices, edges, motif finding, PageRank, connected components, triangle counting, and the fraud-detection patterns the payments crowd uses.
  • Performance Tuning. Spark UI tabs read like an X-ray, query plan reading, partition sizing, AQE, dynamic partition pruning, manual salting for skew, Z-ordering for data skipping.
  • Deploying Spark Applications. spark-submit, packaging, client vs cluster mode, the resource-calculation formula, dynamic allocation, Kubernetes, log configuration that doesn't drown you.
  • End-to-End Data Pipeline Project. Bronze/silver/gold with SCDs, a reusable data-quality framework, incremental processing, the full e-commerce capstone wired together.
  • Testing and CI/CD for Spark. pytest fixtures that survive a session-scoped SparkSession, chispa for DataFrame equality, GitHub Actions pipelines, blue/green and canary deployment.
  • The Lakehouse: Delta Lake, Iceberg, and Modern Table Design. Time travel, OPTIMIZE, ZORDER, partition evolution, schema evolution, and the medallion patterns that make data trustworthy.

About the Author

Ritesh Modi

Ritesh Modi is Head of AI at MarketOnce and a former Forward Deployed Engineer at Microsoft. He has spent more than a decade building and shipping production systems across cloud, distributed computing, and applied machine learning, working with organizations ranging from global enterprises to fast-moving startups. His recent work focuses on applied large language models, designing systems that turn pretrained models into reliable, task-specific tools.

Ritesh has authored multiple technology books and speaks regularly at industry conferences on AI, cloud architecture, and software engineering. His writing philosophy rests on a simple belief: the best technical books are written by practitioners who still remember what it felt like to not understand something, not by experts who have forgotten. Every explanation in this book was tested against that standard: if it would not have made sense to him when he was first learning this material, it was rewritten until it did.

He writes, shares ideas, and connects with readers at www.riteshmodi.com. When he is not writing or building AI systems, he can be found mentoring engineers, exploring new architectures, or debugging a training run that should have converged three hours ago.
