Modern Data Pipelines Testing Techniques
A Visual Guide
About the Book
Any software product deteriorates rapidly without disciplined testing.
However, testing data pipelines is a hellish experience for new data developers.
Unfortunately, existing training on data pipeline testing gives only a scattered view of the techniques involved. This book offers a full view of modern data pipeline testing techniques in a highly visual, coherent body of work. I hope it helps you in your career.
Why bother testing data pipelines? Billions of budget dollars regularly depend on the work of the data scientists, data engineers, and machine learning engineers behind the countless data pipelines that inform critical business decisions.
Table of Contents
Chapter 1: Testing Your Patience
- Data Pipeline Transitive Failure Modes: The Reality Check
- Bad Data Devs Lifestyle
- TDD + CICD to the rescue?
- Objections to TDD for Data Work
- Sources of Data Validation Complexity
- The Data Product Promise No One Can Keep
- Fighting Against The Manual Auto-Pilot
- Observability vs. Testing vs. Monitoring
- Test-Driven Theater vs Continuous Delivery Theater
Chapter 2: Core Types of Data Pipeline Tests
- Discovering Holistic Testing
- Types of Tests: Test Boundaries
- Types of Tests: Test Sizes
- Types of Tests: Data Product Testing Quadrant
- Types of Tests: Write-Audit-Publish
- Types of Tests: Testing Grid
- Types of Tests: Code Scale vs Data Scale Testing Grid
- Types of Tests: Structuring Data Quality Tests
- Types of Tests: Pointwise vs Pairwise vs Composite
- Types of Tests: Testing SQL Queries
- Types of Tests: Assembling The Testing Parts + Bug Tests
- Feedback Levels vs. Testing Scales
- Test Pyramids and Test Summits
Chapter 3: Supporting Components for Data Pipelines Tests
- Supporting Pattern: Static vs Dynamic Test Data Generation
- Supporting Pattern: Data Copies, Clones, and Snapshots
- Supporting Pattern: Reverse Data Plane to Support Testing
- Supporting Pattern: Parallel Dev-Test Data Streams
Chapter 4: Testing Legacy Data Pipelines
- Legacy Testing Pattern I: Before Touching Anything -- End to End Characterization Tests
- Legacy Testing Pattern III: Semantic Monitoring
- Legacy Testing Pattern IV: Data Processing Platform Alerts
- Legacy Testing Pattern V: Co-Control Data Contracts
- Legacy Testing Pattern VI: Legacy Pipelines Golden Rule
Chapter 5: Design for Testability
- Designing Hidden Data Pipelines
- Designing Temporally Decoupled Data Pipelines
- Designing Debuggable Data Pipelines
- Designing Encapsulated Data Pipelines
- Designing Right-Tool-For-The-Job Data Pipelines
- Designing Feature Engineering Data Pipelines
- Designing Iceberg Data Pipelines
Chapter 6: Data-oriented Development Environments
- What Can You Do From Your Laptop?
- Optimal Data Development Environment
- Fundamental Data Dev Repo Components
- Coding Timeline vs Data Job Timeline
Chapter 7: Deploying Data Pipelines
- Useful CICD Workflows for Data Pipelines
- Data Pipeline Release Lifecycle
- Testable Scheduled Jobs CICD Workflow
- Database Schema Versioning Rationale
- Database Schema Versioning Golden Rule
- Database Schema Migrations - Fields Strategies
- Database Schema Migrations - Hidden Things To Test
Chapter 8: Tips for Data Organizations
- Data Organization Testability Score Cards vs Your Average Data Dev
- When To Give Up On Testing Data Pipelines
- Actors In A Data Product
- Organizational Friction To Disable Data Pipelines Testability
- Organizational Changes To Enable Data Pipelines Testability
Chapter 9: Is This It?
- With Great Responsibility Comes Great Capped Autonomy
- Data Dev Autonomy Destruction Cookbook
- The Fear Of Obsolescence
Outro
- References
- Release Notes
Causes Supported

Tree-Nation
You reforest the world
https://tree-nation.com
Tree-Nation is the largest reforestation platform, enabling citizens and companies to plant trees around the world.