The Data and AI Engineering Playbook

Four volumes. 76 chapters. 2,000+ pages. The complete data-engineering arc from your first spark.read.csv to a production multi-agent system on Databricks, written for the engineer who gets paged when the pipeline breaks at 2 a.m.

Bought separately: $96.00

Minimum price: $65.00

Also available for 3 book credits with a Reader Membership

These books have a total suggested price of $96.00. Get them now for only $65.00!

About the Bundle

The full Data Engineering with Agents and AI, in four volumes

Most Spark and Databricks books pick one slice (beginner Spark, performance tuning, or just the AI pieces) and stop there. The reader is left to stitch four books from three publishers into one coherent path.

This bundle is that path, written by one engineer, in one voice, with one running retail-analytics dataset that threads from Volume 1's first SparkSession all the way to Volume 4's multi-agent supervisor. 76 chapters. Roughly 2,000 pages. Built so you can read straight through or pick the volume that matches your week.

What each volume gives you

Volume 1. Foundations. PySpark from page one. The execution model, RDDs (still worth understanding), DataFrames, Spark SQL, every file format you'll touch, transformations and actions as separate disciplines, the half-dozen join strategies Spark actually uses, and aggregations including cube, rollup, and approximate functions. By the end you can read, write, and debug PySpark in a production codebase.

Volume 2. Advanced Processing and Production Mastery. Window functions, UDFs done right, Structured Streaming end to end (watermarks, stateful processing, stream-stream joins, exactly-once), MLlib's full lifecycle, GraphFrames, performance tuning as a measured discipline, the deployment surface (spark-submit, Kubernetes, dynamic allocation), a complete medallion-architecture project, testing and CI/CD, and the lakehouse with Delta Lake and Iceberg side by side.

Volume 3. Databricks for Practitioners: The Production Lakehouse Playbook. The platform-and-data-engineering volume. Workspaces, classic and serverless compute, cluster policies, Unity Catalog from metastore to volume, privileges and ABAC, Governed Tags, lineage, Entra ID identity, service principals, managed Delta and Iceberg tables with UniForm, ingestion with Lakeflow Connect and Auto Loader, Lakeflow Spark Declarative Pipelines, Jobs and scheduling, Declarative Automation Bundles, CI/CD with GitHub Actions, observability through system tables, and performance tuning with Photon, AQE, Liquid Clustering, and the Query Profile UI.

Volume 4. Databricks for Practitioners: The AI Lakehouse Playbook. The complete production AI playbook. Databricks SQL, external BI, AI/BI Dashboards, Genie, AI SQL functions, Model Serving, Foundation Model APIs, Vector Search and RAG, MLflow 3 and the UC Model Registry, Feature Store, MLOps, Lakehouse Monitoring (drift detection), distributed deep learning, Agent Bricks, the Multi-Agent Supervisor and MCP, Lakebase (operational Postgres for AI apps), and a capstone that wires everything into a full retail intelligence application.

Why a bundle instead of four separate books

Two reasons.

Coherence. Concepts introduced in Volume 1 still hold in Volume 4. The retail-analytics dataset that loads in Chapter 1's first SparkSession is the same dataset the Chapter 54 capstone serves through a multi-agent system.

Updates. Leanpub's killer feature: every update to any volume reaches every bundle buyer, free, for the lifetime of the book. Databricks ships quickly; this bundle keeps up.

Who this is for

Data engineers, platform engineers, ML engineers, and architects who want a single coherent learning path from Apache Spark fundamentals to production AI on Databricks. Beginners can start with Volume 1; senior engineers can jump to Volume 3 or 4. The bundle covers both paths.

Books

About the Books

Spark 4.0 from Scratch

Foundations: From Your First DataFrame to Production-Ready Joins and Aggregations

A practical, depth-first guide to Apache Spark for engineers who want to actually ship code, not just pass an interview.

Volume 1 of Spark 4.0 from Scratch takes a Python user who knows pandas (or nothing) and turns them into someone who can write, read, and debug PySpark code in a real production codebase. Ten chapters of foundations, each grounded in the same running retail-analytics dataset, so the concepts stack instead of scatter.

Key Features

- Spark 4.0 current: Catalyst, Tungsten, Adaptive Query Execution, the unified DataFrame/Dataset API, Spark Connect mentions where they matter.

- PySpark from page one: every code example, every API, every error message you'll actually see. No Scala detours.

- Real performance discipline early: partition sizing, shuffle reading, explain() output, the Spark UI tabs, covered as you go, not deferred to a final chapter that no reader reaches.

- The running example threads through every chapter: one retail-analytics dataset, used in Chapter 1's first SparkSession and still alive in Chapter 10's grouped aggregations. Concepts stack.

Every chapter ends with scenario-based Knowledge Check questions modeled on how senior engineers interview each other: not "what does count() return," but "why is your groupBy producing one task that takes thirty minutes?"
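To give a feel for the thirty-minute-task question, here is a minimal plain-Python sketch of key skew and salting. It is not taken from the book and does not use Spark: `partition_of`, the partition count, the salt count, and the store names are all hypothetical stand-ins chosen for illustration. It shows one hot key landing in a single partition, then the same rows spread out once a salt suffix is added, with a second pass to recombine the partial counts.

```python
from collections import Counter
import zlib

NUM_PARTITIONS = 8  # hypothetical partition count
NUM_SALTS = 8       # hypothetical number of salt buckets


def partition_of(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministic stand-in for a hash partitioner (assumption, not Spark's)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def rows_per_partition(keys):
    """Count how many rows land in each partition."""
    return Counter(partition_of(k) for k in keys)


# Hypothetical sales rows: one store dominates the data.
keys = ["store_hot"] * 9_000 + [f"store_{i}" for i in range(1_000)]

# Without salting, every row for the hot key hashes to the same partition.
plain = rows_per_partition(keys)

# Salting: append a rotating suffix so the hot key becomes several keys.
salted_keys = [f"{k}#{i % NUM_SALTS}" for i, k in enumerate(keys)]
salted = rows_per_partition(salted_keys)

# Stage 1: partial counts per salted key. Stage 2: strip the salt and combine.
partial = Counter(salted_keys)
final = Counter()
for salted_key, n in partial.items():
    final[salted_key.rsplit("#", 1)[0]] += n

print("largest partition, unsalted:", max(plain.values()))
print("largest partition, salted:  ", max(salted.values()))
```

The second aggregation stage is the price of salting: the grouped result is only correct after the partial counts are recombined under the original key.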

What you will learn

- Read and write every common data format on the Spark side of the boundary

- Build correct PySpark pipelines from transformations and actions

- Choose between SQL and the DataFrame API and mix them when needed

- Pick the right join strategy and verify it in the query plan

- Tune partitions, spot shuffles, and read the Spark UI without guessing

- Handle skew with broadcast hints and Adaptive Query Execution

- Use explain() as the first debugging tool you reach for

Who this book is for

Python developers, data analysts, and data engineers who are tired of toy examples and want a Spark book that respects their time. Basic Python is assumed (functions, dictionaries, list comprehensions); Appendix A includes a refresher on the parts of Python that matter most for PySpark.

Table of Contents

  • Welcome to Spark. Your first SparkSession on page 8, not in Chapter 4. The big-data problem, the Spark-vs-pandas decision, a sales-CSV walkthrough that ends with a real groupBy.
  • Spark Architecture. Driver, executors, DAG, stages, tasks, shuffles, lazy evaluation, Catalyst, Tungsten. The full lifecycle of a Spark job, traced step by step so you can read any query plan that follows.
  • RDDs: The Foundation. Why RDDs still matter in the DataFrame era. Narrow vs wide transformations, pair RDDs, lineage and fault tolerance, partitioning strategies that decide your job's parallelism.
  • DataFrames and Datasets. Schemas, every file format you'll touch, the column-expression API, and the Catalyst optimizer that quietly makes your code fast.
  • Spark SQL. Temp views, CTEs, mixed SQL/DataFrame pipelines, Hive integration, the catalog API. SQL as a first-class citizen, not a fallback.
  • Data Sources and Formats. CSV, JSON, Parquet, ORC, Avro deep dives. Partitioning, bucketing, schema evolution, and the Delta Lake introduction that sets up Volume 2.
  • Transformations Deep Dive. Column expressions, explode and flatten, pivot and unpivot, repartition vs coalesce, deduplication, sorting, and method chaining that stays readable.
  • Actions and Output. What actually triggers execution, collect safety, write modes that won't surprise you, partitioned and bucketed writes, controlling file count.
  • Joins and Set Operations. Every join type with one decision tree. Broadcast vs sort-merge, skew detection in the Spark UI, salting the hot key, the six strategies you actually need.
  • Aggregations and Grouping. groupBy, pivot, cube, rollup, approximate functions, conditional aggregations, grouping sets. Reporting and feature engineering, side by side.

Spark 4.0 from Scratch

Advanced Processing & Production Mastery

The second volume of Spark 4.0 from Scratch turns a competent PySpark user into a production Spark engineer. Eleven chapters covering everything that sits between "the job runs" and "the on-call team trusts the job."

Key Features

  • Structured Streaming, end to end: sources, sinks, watermarks, stateful processing, stream-stream joins, exactly-once with checkpointing, and the monitoring surface that catches a stuck stream before the dashboard does.
  • MLlib without hand-waving: Pipelines, feature engineering, classification, regression, clustering, hyperparameter tuning, and model saving, for a complete model lifecycle on real data.
  • GraphFrames: vertices, edges, motif finding, PageRank, connected components, triangle counting, and the fraud-detection patterns the payment-processing crowd actually uses.
  • Performance tuning as a diagnostic discipline: the Spark UI tabs, explain() reading, partition sizing, AQE, dynamic partition pruning, manual skew salting, the Z-order story.
  • Testing and CI/CD for Spark: pytest fixtures, GitHub Actions pipelines, code-quality linting, deployment patterns including canary and blue/green.
  • The Lakehouse, in depth: Delta Lake (time travel, OPTIMIZE, ZORDER, VACUUM, Change Data Feed), Apache Iceberg (hidden partitioning, partition evolution, REST catalog), the trade-off table, and modern table modelling.

What you will learn

- Process unbounded streams with watermarks, stateful operators, and stream-stream joins

- Train and evaluate ML pipelines at scale with MLlib's Estimator/Transformer API

- Use GraphFrames for PageRank, connected components, motifs, and fraud patterns

- Diagnose any slow Spark job from the UI and explain() output

- Write unit and integration tests for PySpark code that actually catch bugs

- Deploy Spark applications to Kubernetes, YARN, or standalone clusters

- Design Delta Lake and Iceberg tables for production lakehouses

- Implement medallion architecture with bronze/silver/gold layers and SCDs

Who this book is for

Data engineers and data scientists who finished Volume 1 (or already write PySpark professionally) and now need streaming, ML, performance discipline, and production deployment skills. Some chapters touch ML and graph theory lightly; no prior experience is required, but a hunger to ship matters.

Table of Contents

  • Window Functions. Top-N per group, running totals, moving averages, gap-and-island, session detection from raw event logs. The analytics that groupBy cannot express.
  • User-Defined Functions (UDFs). Python UDFs, Pandas UDFs (the 3-100x speedup), grouped map for per-group model fitting, and the rule for when to avoid UDFs entirely.
  • Structured Streaming. Sources, sinks, watermarks, stateful processing, stream-stream joins, exactly-once with checkpointing, and monitoring that catches stuck queries before the dashboard does.
  • Machine Learning with MLlib. Pipelines, feature engineering, classification, regression, clustering, hyperparameter tuning, and a complete model lifecycle on real data.
  • Graph Processing with GraphFrames. Vertices, edges, motif finding, PageRank, connected components, triangle counting, and the fraud-detection patterns the payments crowd uses.
  • Performance Tuning. Spark UI tabs read like an X-ray, query plan reading, partition sizing, AQE, dynamic partition pruning, manual salting for skew, Z-ordering for data skipping.
  • Deploying Spark Applications. spark-submit, packaging, client vs cluster mode, the resource-calculation formula, dynamic allocation, Kubernetes, log configuration that doesn't drown you.
  • End-to-End Data Pipeline Project. Bronze/silver/gold with SCDs, a reusable data-quality framework, incremental processing, the full e-commerce capstone wired together.
  • Testing and CI/CD for Spark. pytest fixtures that survive a session-scoped SparkSession, chispa for DataFrame equality, GitHub Actions pipelines, blue/green and canary deployment.
  • The Lakehouse: Delta Lake, Iceberg, and Modern Table Design. Time travel, OPTIMIZE, ZORDER, partition evolution, schema evolution, and the medallion patterns that make data trustworthy.
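The Window Functions chapter listed above mentions session detection from raw event logs. As a rough illustration of that idea, and not the book's own code, the sketch below uses plain Python rather than PySpark. It follows the usual recipe: order each user's events by time, compare each event with the previous one (the role `lag()` plays in a window function), flag a new session when the gap exceeds a threshold, and keep a running count of the flags. The 30-minute threshold, the `assign_sessions` helper, and the sample events are assumptions made for the example.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # hypothetical inactivity threshold


def assign_sessions(events, gap=SESSION_GAP):
    """Return (user, timestamp, session_number) tuples, ordered by user and time."""
    result = []
    last_seen = {}   # user -> timestamp of that user's previous event
    session_no = {}  # user -> running session count
    for user, ts in sorted(events):
        prev = last_seen.get(user)
        # A first event, or a gap longer than the threshold, opens a new session.
        if prev is None or ts - prev > gap:
            session_no[user] = session_no.get(user, 0) + 1
        last_seen[user] = ts
        result.append((user, ts, session_no[user]))
    return result


t0 = datetime(2026, 1, 1, 9, 0)
events = [
    ("alice", t0),
    ("alice", t0 + timedelta(minutes=10)),
    ("alice", t0 + timedelta(hours=2)),  # long gap: opens a second session
    ("bob", t0 + timedelta(minutes=5)),
]

sessions = assign_sessions(events)
for user, ts, number in sessions:
    print(user, ts.isoformat(), number)
```

A groupBy alone cannot express this, because the session number of each row depends on the row before it within the same user's ordered history.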

Databricks for Practitioners: Volume 1

The Production Lakehouse Playbook: Platform, Governance, and Data Engineering

A practical, depth-first guide to running Databricks in production.

The complete platform-and-data-engineering playbook for the engineers who own pipelines, govern catalogs, and keep workloads on schedule. Current to 2026, with examples on Azure Databricks and concepts that apply unchanged on AWS and GCP.

Key Features

  • Build the governed Databricks platform layer by layer, from workspaces and compute to Unity Catalog, identity, and access control
  • Ship production data pipelines with Lakeflow Spark Declarative Pipelines, Jobs, Declarative Automation Bundles, and CI/CD from Git
  • Tune for cost and performance with Photon, Adaptive Query Execution, Liquid Clustering, and the Query Profile UI

Examples run on Azure Databricks. What happens inside Databricks is identical on AWS and GCP; where the cloud seams differ (identity, storage, secrets, networking), chapters name the AWS and GCP equivalents explicitly.

What you will learn

- Architect a governed Databricks workspace from metastore to volume

- Configure Unity Catalog with privileges, ABAC, Governed Tags, and lineage

- Integrate Microsoft Entra ID identity, SCIM, and service principals

- Build ingestion with Lakeflow Connect, Auto Loader, and streaming tables

- Author bronze-silver-gold pipelines with Lakeflow Spark Declarative Pipelines

- Deploy from Git with Declarative Automation Bundles and GitHub Actions

- Observe billing, audit, query history, and lineage through system tables

- Tune performance with Photon, AQE, Liquid Clustering, and the Query Profile UI

Who this book is for

Data engineers, platform engineers, and architects who already know PySpark and now need to run it on Databricks at production depth. A working knowledge of PySpark, Spark SQL, and Delta Lake is expected. Readers new to Spark should start with Volumes 1 and 2 of the series.

Table of Contents

1. Databricks: The Platform on Top of Spark. Why Databricks exists, what it adds on top of open-source Spark, and how the layers stack so you can see the platform clearly.

2. Workspaces, Notebooks, and Git Folders. The workspace surface, multiplayer notebook ergonomics, Git Folders, and the seven traps that ruin notebook-driven development.

3. Compute: Classic, Serverless, and Cluster Policies. Six compute types with one decision tree. Photon eligibility, cluster policies that prevent runaway costs, the cold-start math.

4. Unity Catalog Architecture. Metastore, catalogs, schemas, tables, volumes, the three-grant cascade. How every securable connects, traced end to end with a single SELECT query.

5. Access Control: Privileges, ABAC, and Governed Tags. The grant model, attribute-based access for cross-table policies, tag-driven row filters and column masks that scale to 400 tables.

6. Identity: Entra ID, SCIM, and Service Principals. Users, groups, SPs; OAuth flows; the anti-patterns the auditor will catch, including the JDBC-string trap and the printed-secret trap.

7. Managed Tables: Delta, Iceberg, and UniForm. The default Delta path, Iceberg v3 features, REST Catalog access from DuckDB, and when to use which format.

8. Liquid Clustering and Predictive Optimization. CLUSTER BY versus PARTITION BY, the four-column limit, Predictive Optimization, and when PO does what you would otherwise script.

9. System Tables and Platform Observability. Billing, audit, query history, lineage. The SQL surface for everything the UI shows, so you can put a dashboard on top of any of it.

10. Ingestion: Lakeflow Connect, Auto Loader, and Streaming Tables. File-discovery modes, file-notification vs directory listing, CDC pipelines, partner connectors, the four trigger modes.

11. Lakeflow Spark Declarative Pipelines. The CREATE PIPELINE shape, bronze/silver/gold patterns, the DLT-to-SDP migration path, the validation contracts that catch bad data early.

12. Lakeflow Jobs and Scheduling. Tasks, dependencies, retries, file-arrival triggers, repair-and-rerun. The Job that runs the nightly bronze-silver-gold pipeline, fully traced.

13. Declarative Automation Bundles. Project structure, targets, deploy/run/destroy, the YAML decoded clause by clause, the staging-vs-production override patterns.

14. CI/CD with GitHub Actions. OIDC federation (no client secrets), six PR validations, staging vs production gates, the full promote-the-pipeline worked example with rollback.

15. Performance: Photon, AQE, and the Query Profile UI. Four EXPLAIN flavors, reading the plan bottom-up, the salting decision tree, the cache vs MV vs raw-query choice.

16. Metric Views and the Bridge to Volume 4. The CREATE METRIC VIEW shape, composing metrics from metrics, the certified-semantics layer every dashboard in Volume 4 reads through.

Databricks for Practitioners: Volume 2

The AI Lakehouse and Agentic Playbook: Analytics, Mosaic AI, Agents, and Lakebase

The complete production AI playbook for Databricks.

RAG. Agent Bricks. Multi-agent supervisors. Lakebase. Feature Store. MLflow 3. Lakehouse Monitoring. Foundation Model APIs. Vector Search. AI Gateway. Every AI surface Databricks shipped at GA in 2025-2026, taught by a senior practitioner, current to 2026.

Key Features

- RAG end to end with Vector Search, embedding models, hybrid retrieval, and citation grounding

- Agent Bricks + Multi-Agent Supervisor + MCP with the eval loop that catches routing failures before production

- Lakebase: operational Postgres at sub-10ms reads, the layer that turns AI models and agents into user-facing apps

- Full ML lifecycle: Feature Store, MLflow 3, Lakehouse Monitoring, AI Gateway

AI on Databricks went from research project to production platform in 2025-2026. Foundation Model APIs, Vector Search, Agent Bricks, the Multi-Agent Supervisor, MLflow 3, Lakehouse Monitoring, and Lakebase all shipped at GA. Volume 4 is the book that uses them together.

It walks the full Mosaic AI surface. Model Serving for production inference. Foundation Model APIs for prebuilt frontier models. Vector Search and Retrieval-Augmented Generation for grounding AI in your own data. MLflow 3 and the UC Model Registry for the experiment-to-production lifecycle. Feature Store for features that bridge offline training and online serving.

The agentic chapters are the centerpiece. Agent Bricks ships declarative single-agent systems (classification, information extraction) without writing prompts. The Multi-Agent Supervisor orchestrates specialists (Genie spaces, custom agents, Agent Brick children, and external tools via the Model Context Protocol), with the eval loop that catches wrong routing, tool hallucination, and unbounded recursion before production.

Lakebase, the operational Postgres shipped with Databricks in 2026, gives AI apps the sub-ten-millisecond reads they need to serve users, the layer that turns models and agents into real products. Lakehouse Monitoring catches the quiet drift that destroys ML systems while every dashboard stays green. The capstone wires Lakebase, the agents, and the analytics into a full retail intelligence app. Azure examples; concepts apply on AWS and GCP.
