Writing Beautiful Apache Spark Code
Processing massive datasets with ease
About the Book
This book teaches Spark fundamentals and shows you how to build production-grade libraries and applications. It took the Spark community years to develop the best practices outlined here. This book will fast-track your Spark learning journey and put you on the path to mastery.
Table of Contents
-
Introduction
- Typical painful workflow
- Productionalizing advanced analytics models is hard
- Why Scala?
- Who should read this book?
- Is this book for data engineers or data scientists?
- Beautiful Spark philosophy
- DataFrames vs. RDDs
- Spark streaming
- Machine learning
- The “coalesce test” for evaluating learning resources
- Will we cover the entire Spark SQL API?
- How this book is organized
- Spark programming levels
- Note about Spark versions
-
Running Spark Locally
- Starting the console
- Running Scala code in the console
- Accessing the SparkSession in the console
- Console commands
-
Databricks Community
- Creating a notebook and cluster
- Running some code
- Next steps
-
Introduction to DataFrames
- Creating DataFrames
- Adding columns
- Filtering rows
- More on schemas
- Creating DataFrames with createDataFrame()
- Next Steps
-
Working with CSV files
- Reading a CSV file into a DataFrame
- Writing a DataFrame to disk
- Reading CSV files in Databricks Notebooks
-
Just Enough Scala for Spark Programmers
- Scala function basics
- Currying functions
- object
- trait
- package
- Implicit classes
- Next steps
-
Column Methods
- A simple example
- Instantiating Column objects
- gt
- substr
- + operator
- lit
- isNull
- isNotNull
- when / otherwise
- Next steps
-
Introduction to Spark SQL functions
- High level review
- lit() function
- when() and otherwise() functions
- Writing your own SQL function
- Next steps
-
User Defined Functions (UDFs)
- Simple UDF example
- Using Column Functions
- Conclusion
-
Chaining Custom DataFrame Transformations in Spark
- Dataset Transform Method
- Transform Method with Arguments
-
Whitespace data munging with Spark
- trim(), ltrim(), and rtrim()
- singleSpace()
- removeAllWhitespace()
- Conclusion
-
Defining DataFrame Schemas with StructField and StructType
- Defining a schema to create a DataFrame
- StructField
- Defining schemas with the :: operator
- Defining schemas with the add() method
- Common errors
- LongType
- Next steps
-
Different approaches to manually create Spark DataFrames
- toDF
- createDataFrame
- createDF
- How we’ll create DataFrames in this book
-
Dealing with null in Spark
- What is null?
- Spark uses null by default sometimes
- nullable Columns
- Native Spark code
- Scala null Conventions
- User Defined Functions
- Spark Rules for Dealing with null
-
Using JAR Files Locally
- Starting the console with a JAR file
- Adding JAR file to an existing console session
- Attaching JARs to Databricks clusters
- Review
-
Working with Spark ArrayType columns
- Scala collections
- Splitting a string into an ArrayType column
- Directly creating an ArrayType column
- array_contains
- explode
- collect_list
- Single column array functions
- Generic single column array functions
- Multiple column array functions
- Split array column into multiple columns
- Closing thoughts
-
Working with Spark MapType Columns
- Scala maps
- Creating MapType columns
- Fetching values from maps with element_at()
- Appending MapType columns
- Creating MapType columns from two ArrayType columns
- Converting Arrays to Maps with Scala
- Merging maps with map_concat()
- Using StructType columns instead of MapType columns
- Writing MapType columns to disk
- Conclusion
-
Adding StructType columns to DataFrames
- StructType overview
- Appending StructType columns
- Using StructTypes to eliminate order dependencies
- Order dependencies can be a big problem in large Spark codebases
-
Working with dates and times
- Creating DateType columns
- year(), month(), dayofmonth()
- minute(), second()
- datediff()
- date_add()
- Next steps
-
Performing operations on multiple columns with foldLeft
- foldLeft review in Scala
- Eliminating whitespace from multiple columns
- snake_case all columns in a DataFrame
- Wrapping foldLeft operations in custom transformations
- Next steps
-
Equality Operators
- =
-
Introduction to Spark Broadcast Joins
- Conceptual overview
- Simple example
- Analyzing physical plans of joins
- Eliminating the duplicate city column
- Diving deeper into explain()
- Next steps
-
Partitioning Data in Memory
- Intro to partitions
- coalesce
- Increasing partitions
- repartition
- Differences between coalesce and repartition
- Real World Example
-
Partitioning on Disk with partitionBy
- Memory partitioning vs. disk partitioning
- Simple example
- partitionBy with repartition(5)
- partitionBy with repartition(1)
- Partitioning datasets with a max number of files per partition
- Partitioning dataset with max rows per file
- Partitioning dataset with max rows per file pre Spark 2.2
- Small file problem
- Conclusion
-
Fast Filtering with Spark PartitionFilters and PushedFilters
- Normal DataFrame filter
- partitionBy()
- PartitionFilters
- PushedFilters
- Partitioning in memory vs. partitioning on disk
- Disk partitioning with skewed columns
- Next steps
-
Scala Text Editing
- Syntax highlighting
- Import reminders
- Import hints
- Argument type checking
- Flagging unnecessary imports
- When to use text editors and Databricks notebooks?
-
Structuring Spark Projects
- Project name
- Package naming convention
- Typical library structure
- Applications
-
Introduction to SBT
- Sample code
- Running SBT commands
- build.sbt
- libraryDependencies
- sbt test
- sbt doc
- sbt console
- sbt package / sbt assembly
- sbt clean
- Next steps
-
Managing the SparkSession, The DataFrame Entry Point
- Accessing the SparkSession
- Example of using the SparkSession
- Creating a DataFrame
- Reading a DataFrame
- Creating a SparkSession
- Reusing the SparkSession in the test suite
- SparkContext
- Conclusion
-
Testing Spark Applications
- Hello World Example
- Testing a User Defined Function
- A Real Test
- How Testing Improves Your Codebase
- Running a Single Test File
-
Environment Specific Config in Spark Scala Projects
- Basic use case
- Environment specific code antipattern
- Overriding config
- Setting the PROJECT_ENV variable for test runs
- Other implementations
- Next steps
-
Building Spark JAR Files with SBT
- JAR File Basics
- Building a Thin JAR File
- Building a Fat JAR File
- Next Steps
-
Shading Dependencies in Spark Projects with SBT
- When shading is useful
- How to shade the spark-daria dependency
- Conclusion
-
Dependency Injection with Spark
- Code with a dependency
- Injecting a path
- Injecting an entire DataFrame
- Conclusion
-
Broadcasting Maps
- Simple example
- Refactored code
- Building Maps from data files
- Conclusion
-
Validating Spark DataFrame Schemas
- Custom Transformations Refresher
- A Custom Transformation Making a Bad Assumption
- Column Presence Validation
- Full Schema Validation
- Documenting DataFrame Assumptions is Especially Important for Chained DataFrame Transformations
- Conclusion