Testing Spark Applications
Writing Spark code is hard, and writing well-designed, performant Spark tests is even harder. You need a robust test suite to identify performance bottlenecks in your code and refactor with ease. This book teaches you how to write a beautiful test suite and how to run the tests whenever code is pushed to the master branch.
About the Book
The book covers Scala testing basics with the Scalatest framework and uses the spark-fast-tests library to demonstrate column equality testing and DataFrame equality testing. Spark tests can run slowly, so the book provides several practical workflows to keep them running quickly. Spark code frequently reads from and writes to disk, and the book covers how to write tests for code with I/O. Configuring a test suite properly can make it around 70% faster, and the book explains the configuration options you should have on your radar. Complex transformations (e.g. aggregations) and column types (e.g. MapType, ArrayType, StructType, BinaryType) have special testing considerations that are addressed in separate chapters.
This book has a heavy emphasis on software engineering best practices and will teach you skills that are useful for any language or framework.
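To give a feel for the column equality testing described above, here is a minimal sketch rather than an excerpt from the book. It assumes Spark, Scalatest 3.1 or later, and spark-fast-tests are on the test classpath. The names `SparkSessionTestWrapper`, `LowerCaseSpec`, and the column names are illustrative choices, and setting `spark.sql.shuffle.partitions` to 1 is one commonly used test-suite setting, shown here as an assumption about what a fast configuration might include.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lower}
import org.scalatest.funspec.AnyFunSpec
import com.github.mrpowers.spark.fast.tests.ColumnComparer

// Illustrative helper trait: one local SparkSession shared by the test suites.
trait SparkSessionTestWrapper {
  lazy val spark: SparkSession = SparkSession
    .builder()
    .master("local")
    .appName("spark test example")
    // Assumption: tiny test datasets do not need the default 200 shuffle partitions.
    .config("spark.sql.shuffle.partitions", "1")
    .getOrCreate()
}

// Illustrative spec: checks that lower() produces the hand-written expected column.
class LowerCaseSpec
    extends AnyFunSpec
    with SparkSessionTestWrapper
    with ColumnComparer {

  import spark.implicits._

  it("lowercases the name column") {
    // Each row pairs an input value with the value we expect after the transformation.
    val df = Seq(
      ("ALICE", "alice"),
      ("Bob", "bob")
    ).toDF("name", "expected")
      .withColumn("actual", lower(col("name")))

    // Fails with a row-by-row message if any "actual" value differs from "expected".
    assertColumnEquality(df, "actual", "expected")
  }
}
```

Keeping the input and the expected output side by side in one small DataFrame makes the test read like documentation of the transformation's behavior.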
Table of Contents
- Introduction
  - Messy data
  - Nightmare deploys
  - Empower refactoring
  - Tests encourage code that doesn’t have side effects
  - Identifying code bottlenecks
  - Test suites document behavior
  - Technologies used
- Testing Scala with Scalatest
  - Writing a simple test
  - Directory organization
  - build.sbt
  - More tests
  - Running tests and configuring output
  - assertThrows
  - assertDoesNotCompile
  - Other assertions
  - Other test formats
  - Test library alternatives
  - Testing Spark applications
  - Next steps
- Column Equality Tests
  - Custom DataFrame Transformation Refresher
  - Spark project setup
  - assertColumnEquality with spark-fast-tests
  - Conclusion
- Quieting Test Output
  - Customizing test suite output
- Creating DataFrames for Tests
  - toDF
  - createDataFrame
  - createDF
  - Including spark-daria in your projects
  - Next steps
- DataFrame Equality Tests
  - Simple example
  - assertSmallDataFrameEquality error messages
  - Next steps
- Running Tests
  - Running from the SBT console
  - Running a single test file
  - Running a single test
  - Best workflow
- Approximate Equality
  - Difference between double, float and decimal
  - When assertColumnEquality falls short
  - assertFloatTypeColumnEquality to the rescue
  - assertApproximateDataFrameEquality
  - Conclusion
- Testing User Defined Functions
  - Creating a UDF
  - Testing a UDF
  - Check the UDF fails with null input
  - The billion dollar mistake
  - Verifying test failure in the test suite
  - Next steps
- Testing Spark Column Functions
  - Simple example
  - How Spark functions handle null
  - Important takeaway
  - Why print DataFrames from the test suite?
  - Next steps
- Testing Filesystem Reads
  - Untestable code
  - Setting the path as a param
  - Testing with the config pattern
  - Elegant testing with dependency injection
  - Abstracting custom transformation to a separate function
  - Next steps
- Testing Filesystem Writes
  - Simple example
  - Rude tests leave garbage behind
  - Performance considerations
  - Next steps
- Identifying Bottlenecks
  - Let’s find the bottleneck
  - Benchmarking individual transformations
  - Contrived but representative
  - Conclusion
- Organizing Tests
  - Some pure Scala
  - Poorly organized Spark tests
  - Test suite rules to follow
  - Good Spark test organization
  - Tests should be descriptive and document behavior
  - Quantifying performance difference
  - Next steps
- Test Suite Configuration
  - Shuffle partitions
  - javaOptions
  - Conclusion
- Testing Aggregations
  - groupBy refresher
  - groupBy with two columns
  - groupBy with filters
  - Conclusions