The Leanpub 60 Day 100% Happiness Guarantee
Within 60 days of purchase you can get a 100% refund on any Leanpub purchase, in two clicks.
See full terms...

Four volumes. 76 chapters. 2,000+ pages. The complete data-engineering arc from your first spark.read.csv to a production multi-agent system on Databricks, written for the engineer who gets paged when the pipeline breaks at 2 a.m.
Bought separately
$96.00
Minimum price
$65.00
$80.00
About the Bundle
The full Data Engineering with Agents and AI, in four volumes
Most Spark and Databricks books pick one slice (beginner Spark, performance tuning, or just the AI pieces) and stop there. The reader is left to stitch four books from three publishers into one coherent path.
This bundle is that path, written by one engineer, in one voice, with one running retail-analytics dataset that threads from Volume 1's first SparkSession all the way to Volume 4's multi-agent supervisor. 76 chapters. Roughly 2,000 pages. Built so you can read straight through or pick the volume that matches your week.
What each volume gives you
Volume 1. Foundations - PySpark from page one. The execution model, RDDs (still worth understanding), DataFrames, Spark SQL, every file format you'll touch, transformations and actions as separate disciplines, the half-dozen join strategies Spark actually uses, and aggregations including cube, rollup, and approximate functions. By the end you can read, write, and debug PySpark in a production codebase.
Volume 2. Advanced Processing and Production Mastery. Window functions, UDFs done right, Structured Streaming end to end (watermarks, stateful processing, stream-stream joins, exactly-once), MLlib's full lifecycle, GraphFrames, performance tuning as a measured discipline, the deployment surface (spark-submit, Kubernetes, dynamic allocation), a complete medallion-architecture project, testing and CI/CD, and the lakehouse with Delta Lake and Iceberg side by side.
Volume 3. Databricks for Practitioners: The Production Lakehouse Playbook. The platform-and-data-engineering volume. Workspaces, classic and serverless compute, cluster policies, Unity Catalog from metastore to volume, privileges and ABAC, Governed Tags, lineage, Entra ID identity, service principals, managed Delta and Iceberg tables with UniForm, ingestion with Lakeflow Connect and Auto Loader, Lakeflow Spark Declarative Pipelines, Jobs and scheduling, Declarative Automation Bundles, CI/CD with GitHub Actions, observability through system tables, and performance tuning with Photon, AQE, Liquid Clustering, and the Query Profile UI.
Volume 4. Databricks for Practitioners: The AI Lakehouse Playbook. The complete production AI playbook. Databricks SQL, external BI, AI/BI Dashboards, Genie, AI SQL functions, Model Serving, Foundation Model APIs, Vector Search and RAG, MLflow 3 and the UC Model Registry, Feature Store, MLOps, Lakehouse Monitoring (drift detection), distributed deep learning, Agent Bricks, the Multi-Agent Supervisor and MCP, Lakebase (operational Postgres for AI apps), and a capstone that wires everything into a full retail intelligence application.
Why a bundle instead of four separate books
Two reasons stand out.
Coherence. Concepts introduced in Volume 1 still hold in Volume 4. The retail-analytics dataset that loads in Chapter 1's first SparkSession is the same dataset the Chapter 54 capstone serves through a multi-agent system.
Updates. Leanpub's killer feature: every update to any volume reaches every bundle buyer, free, for the lifetime of the book. Databricks ships quickly; this bundle keeps up.
Who this is for
Data engineers, platform engineers, ML engineers, and architects who want a single coherent learning path from Apache Spark fundamentals to production AI on Databricks. Beginners can start with Volume 1; senior engineers can jump to Volume 3 or 4. The bundle covers both paths.
About the Books
A practical, depth-first guide to Apache Spark for engineers who want to actually ship code, not just pass an interview.
Volume 1 of Spark 4.0 from Scratch takes a Python user who knows pandas (or nothing) and turns them into someone who can write, read, and debug PySpark code in a real production codebase. 10 chapters of foundations, each grounded in the same running retail-analytics dataset, so the concepts stack instead of scatter.
Key Features
- Spark 4.0 current: Catalyst, Tungsten, Adaptive Query Execution, the unified DataFrame/Dataset API, Spark Connect mentions where they matter.
- PySpark from page one: every code example, every API, every error message you'll actually see. No Scala detours.
- Real performance discipline early: partition sizing, shuffle reading, explain() output, the Spark UI tabs, covered as you go, not deferred to a final chapter that no reader reaches.
- The running example threads through every chapter: one retail-analytics dataset, used in Chapter 1's first SparkSession and still alive in Chapter 10's grouped aggregations. Concepts stack.
Every chapter ends with scenario-based Knowledge Check questions modeled on how senior engineers interview each other, not "what does count() return," but "why is your groupBy producing one task that takes thirty minutes."
What you will learn
- Read and write every common data format on the Spark side of the boundary
- Build correct PySpark pipelines from transformations and actions
- Choose between SQL and the DataFrame API and mix them when needed
- Pick the right join strategy and verify it in the query plan
- Tune partitions, spot shuffles, and read the Spark UI without guessing
- Handle skew with broadcast hints and Adaptive Query Execution
- Use explain() as the first debugging tool you reach for
Who this book is for
Python developers, data analysts, and data engineers who are tired of toy examples and want a Spark book that respects their time. Basic Python is assumed (functions, dictionaries, list comprehensions); Appendix A includes a refresher on the parts of Python that matter most for PySpark.
The second volume of Spark 4.0 from Scratch turns a competent PySpark user into a production Spark engineer. Eleven chapters covering everything that sits between "the job runs" and "the on-call team trusts the job."
What you will learn
- Process unbounded streams with watermarks, stateful operators, and stream-stream joins
- Train and evaluate ML pipelines at scale with MLlib's Estimator/Transformer API
- Use GraphFrames for PageRank, connected components, motifs, and fraud patterns
- Diagnose any slow Spark job from the UI and explain() output
- Write unit and integration tests for PySpark code that actually catch bugs
- Deploy Spark applications to Kubernetes, YARN, or standalone clusters
- Design Delta Lake and Iceberg tables for production lakehouses
- Implement medallion architecture with bronze/silver/gold layers and SCDs
Who this book is for
Data engineers and data scientists who finished Volume 1 (or already write PySpark professionally) and now need streaming, ML, performance discipline, and production deployment skills. Some chapters touch ML and graph theory lightly; no prior experience is required, but a hunger to ship matters.
A practical, depth-first guide to running Databricks in production.
The complete platform-and-data-engineering playbook for the engineers who own pipelines, govern catalogs, and keep workloads on schedule. Current to 2026, with examples on Azure Databricks and concepts that apply unchanged on AWS and GCP.
Key Features
Examples run on Azure Databricks. What happens inside Databricks is identical on AWS and GCP; where the cloud seams differ (identity, storage, secrets, networking), chapters name the AWS and GCP equivalents explicitly.
What you will learn
- Architect a governed Databricks workspace from metastore to volume
- Configure Unity Catalog with privileges, ABAC, Governed Tags, and lineage
- Integrate Microsoft Entra ID identity, SCIM, and service principals
- Build ingestion with Lakeflow Connect, Auto Loader, and streaming tables
- Author bronze-silver-gold pipelines with Lakeflow Spark Declarative Pipelines
- Deploy from Git with Declarative Automation Bundles and GitHub Actions
- Observe billing, audit, query history, and lineage through system tables
- Tune performance with Photon, AQE, Liquid Clustering, and the Query Profile UI
Who this book is for
Data engineers, platform engineers, and architects who already know PySpark and now need to run it on Databricks at production depth. A working knowledge of PySpark, Spark SQL, and Delta Lake is expected. Readers new to Spark should start with Volumes 1 and 2 of the series.
Table of Contents
1. Databricks: The Platform on Top of Spark. Why Databricks exists, what it adds on top of open-source Spark, and how the layers stack so you can see the platform clearly.
2. Workspaces, Notebooks, and Git Folders. The workspace surface, multiplayer notebook ergonomics, Git Folders, and the seven traps that ruin notebook-driven development.
3. Compute: Classic, Serverless, and Cluster Policies. Six compute types with one decision tree. Photon eligibility, cluster policies that prevent runaway costs, the cold-start math.
4. Unity Catalog Architecture. Metastore, catalogs, schemas, tables, volumes, the three-grant cascade. How every securable connects, traced end to end with a single SELECT query.
5. Access Control: Privileges, ABAC, and Governed Tags. The grant model, attribute-based access for cross-table policies, tag-driven row filters and column masks that scale to 400 tables.
6. Identity: Entra ID, SCIM, and Service Principals. Users, groups, SPs; OAuth flows; the anti-patterns the auditor will catch, including the JDBC-string trap and the printed-secret trap.
7. Managed Tables: Delta, Iceberg, and UniForm. The default Delta path, Iceberg v3 features, REST Catalog access from DuckDB, and when to use which format.
8. Liquid Clustering and Predictive Optimization. CLUSTER BY versus PARTITION BY, the four-column limit, Predictive Optimization, and when PO does what you would otherwise script.
9. System Tables and Platform Observability. Billing, audit, query history, lineage. The SQL surface for everything the UI shows, so you can put a dashboard on top of any of it.
10. Ingestion: Lakeflow Connect, Auto Loader, and Streaming Tables. File-discovery modes, file-notification vs directory listing, CDC pipelines, partner connectors, the four trigger modes.
11. Lakeflow Spark Declarative Pipelines. The CREATE PIPELINE shape, bronze/silver/gold patterns, the DLT-to-SDP migration path, the validation contracts that catch bad data early.
12. Lakeflow Jobs and Scheduling. Tasks, dependencies, retries, file-arrival triggers, repair-and-rerun. The Job that runs the nightly bronze-silver-gold pipeline, fully traced.
13. Declarative Automation Bundles. Project structure, targets, deploy/run/destroy, the YAML decoded clause by clause, the staging-vs-production override patterns.
14. CI/CD with GitHub Actions. OIDC federation (no client secrets), six PR validations, staging vs production gates, the full promote-the-pipeline worked example with rollback.
15. Performance: Photon, AQE, and the Query Profile UI. Four EXPLAIN flavors, reading the plan bottom-up, the salting decision tree, the cache vs MV vs raw-query choice.
16. Metric Views and the Bridge to Volume 4. The CREATE METRIC VIEW shape, composing metrics from metrics, the certified-semantics layer every dashboard in Volume 4 reads through.
The complete production AI playbook for Databricks.
RAG. Agent Bricks. Multi-agent supervisors. Lakebase. Feature Store. MLflow 3. Lakehouse Monitoring. Foundation Model APIs. Vector Search. AI Gateway. Every AI surface Databricks shipped at GA in 2025-2026, taught by a senior practitioner, current to 2026.
Key Features
- RAG end to end with Vector Search, embedding models, hybrid retrieval, and citation grounding
- Agent Bricks + Multi-Agent Supervisor + MCP with the eval loop that catches routing failures before production
- Lakebase: operational Postgres at sub-10ms reads, the layer that turns AI models and agents into user-facing apps
- Full ML lifecycle: Feature Store, MLflow 3, Lakehouse Monitoring, AI Gateway
AI on Databricks went from research project to production platform in 2025-2026. Foundation Model APIs, Vector Search, Agent Bricks, the Multi-Agent Supervisor, MLflow 3, Lakehouse Monitoring, and Lakebase all shipped at GA. Volume 4 is the book that uses them together.
It walks the full Mosaic AI surface. Model Serving for production inference. Foundation Model APIs for prebuilt frontier models. Vector Search and Retrieval-Augmented Generation for grounding AI in your own data. MLflow 3 and the UC Model Registry for the experiment-to-production lifecycle. Feature Store for features that bridge offline training and online serving.
The agentic chapters are the centerpiece. Agent Bricks ships declarative single-agent systems (classification, information extraction) without writing prompts. The Multi-Agent Supervisor orchestrates specialists (Genie spaces, custom agents, Agent Brick children, and external tools via the Model Context Protocol), with the eval loop that catches wrong routing, tool hallucination, and unbounded recursion before production.
Lakebase, the operational Postgres shipped with Databricks in 2026, gives AI apps the sub-ten-millisecond reads they need to serve users, the layer that turns models and agents into real products. Lakehouse Monitoring catches the quiet drift that destroys ML systems while every dashboard stays green. The capstone wires Lakebase, the agents, and the analytics into a full retail intelligence app. Azure examples; concepts apply on AWS and GCP.