Writing Beautiful Apache Spark Code
Processing massive datasets with ease
About the Book
This book teaches Spark fundamentals and shows you how to build production-grade libraries and applications. The best practices outlined here took the Spark community years to develop. This book will fast-track your Spark learning journey and put you on the path to mastery.
Table of Contents
- Introduction
  - Typical painful workflow
  - Productionalizing advanced analytics models is hard
  - Why Scala?
  - Who should read this book?
  - Is this book for data engineers or data scientists?
  - Beautiful Spark philosophy
  - DataFrames vs. RDDs
  - Spark streaming
  - Machine learning
  - The “coalesce test” for evaluating learning resources
  - Will we cover the entire Spark SQL API?
  - How this book is organized
  - Spark programming levels
  - Note about Spark versions
- Running Spark Locally
  - Starting the console
  - Running Scala code in the console
  - Accessing the SparkSession in the console
  - Console commands
- Databricks Community
  - Creating a notebook and cluster
  - Running some code
  - Next steps
- Introduction to DataFrames
  - Creating DataFrames
  - Adding columns
  - Filtering rows
  - More on schemas
  - Creating DataFrames with createDataFrame()
  - Next Steps
- Working with CSV files
  - Reading a CSV file into a DataFrame
  - Writing a DataFrame to disk
  - Reading CSV files in Databricks Notebooks
- Just Enough Scala for Spark Programmers
  - Scala function basics
  - Currying functions
  - object
  - trait
  - package
  - Implicit classes
  - Next steps
- Column Methods
  - A simple example
  - Instantiating Column objects
  - gt
  - substr
  - + operator
  - lit
  - isNull
  - isNotNull
  - when / otherwise
  - Next steps
- Introduction to Spark SQL functions
  - High level review
  - lit() function
  - when() and otherwise() functions
  - Writing your own SQL function
  - Next steps
- User Defined Functions (UDFs)
  - Simple UDF example
  - Using Column Functions
  - Conclusion
- Chaining Custom DataFrame Transformations in Spark
  - Dataset Transform Method
  - Transform Method with Arguments
- Whitespace data munging with Spark
  - trim(), ltrim(), and rtrim()
  - singleSpace()
  - removeAllWhitespace()
  - Conclusion
- Defining DataFrame Schemas with StructField and StructType
  - Defining a schema to create a DataFrame
  - StructField
  - Defining schemas with the :: operator
  - Defining schemas with the add() method
  - Common errors
  - LongType
  - Next steps
- Different approaches to manually create Spark DataFrames
  - toDF
  - createDataFrame
  - createDF
  - How we’ll create DataFrames in this book
- Dealing with null in Spark
  - What is null?
  - Spark uses null by default sometimes
  - nullable Columns
  - Native Spark code
  - Scala null Conventions
  - User Defined Functions
  - Spark Rules for Dealing with null
- Using JAR Files Locally
  - Starting the console with a JAR file
  - Adding JAR file to an existing console session
  - Attaching JARs to Databricks clusters
  - Review
- Working with Spark ArrayType columns
  - Scala collections
  - Splitting a string into an ArrayType column
  - Directly creating an ArrayType column
  - array_contains
  - explode
  - collect_list
  - Single column array functions
  - Generic single column array functions
  - Multiple column array functions
  - Split array column into multiple columns
  - Closing thoughts
- Working with Spark MapType Columns
  - Scala maps
  - Creating MapType columns
  - Fetching values from maps with element_at()
  - Appending MapType columns
  - Creating MapType columns from two ArrayType columns
  - Converting Arrays to Maps with Scala
  - Merging maps with map_concat()
  - Using StructType columns instead of MapType columns
  - Writing MapType columns to disk
  - Conclusion
- Adding StructType columns to DataFrames
  - StructType overview
  - Appending StructType columns
  - Using StructTypes to eliminate order dependencies
  - Order dependencies can be a big problem in large Spark codebases
- Working with dates and times
  - Creating DateType columns
  - year(), month(), dayofmonth()
  - minute(), second()
  - datediff()
  - date_add()
  - Next steps
- Performing operations on multiple columns with foldLeft
  - foldLeft review in Scala
  - Eliminating whitespace from multiple columns
  - snake_case all columns in a DataFrame
  - Wrapping foldLeft operations in custom transformations
  - Next steps
- Equality Operators
  - =
- Introduction to Spark Broadcast Joins
  - Conceptual overview
  - Simple example
  - Analyzing physical plans of joins
  - Eliminating the duplicate city column
  - Diving deeper into explain()
  - Next steps
- Partitioning Data in Memory
  - Intro to partitions
  - coalesce
  - Increasing partitions
  - repartition
  - Differences between coalesce and repartition
  - Real World Example
- Partitioning on Disk with partitionBy
  - Memory partitioning vs. disk partitioning
  - Simple example
  - partitionBy with repartition(5)
  - partitionBy with repartition(1)
  - Partitioning datasets with a max number of files per partition
  - Partitioning dataset with max rows per file
  - Partitioning dataset with max rows per file pre Spark 2.2
  - Small file problem
  - Conclusion
- Fast Filtering with Spark PartitionFilters and PushedFilters
  - Normal DataFrame filter
  - partitionBy()
  - PartitionFilters
  - PushedFilters
  - Partitioning in memory vs. partitioning on disk
  - Disk partitioning with skewed columns
  - Next steps
- Scala Text Editing
  - Syntax highlighting
  - Import reminders
  - Import hints
  - Argument type checking
  - Flagging unnecessary imports
  - When to use text editors and Databricks notebooks?
- Structuring Spark Projects
  - Project name
  - Package naming convention
  - Typical library structure
  - Applications
- Introduction to SBT
  - Sample code
  - Running SBT commands
  - build.sbt
  - libraryDependencies
  - sbt test
  - sbt doc
  - sbt console
  - sbt package / sbt assembly
  - sbt clean
  - Next steps
- Managing the SparkSession, The DataFrame Entry Point
  - Accessing the SparkSession
  - Example of using the SparkSession
  - Creating a DataFrame
  - Reading a DataFrame
  - Creating a SparkSession
  - Reusing the SparkSession in the test suite
  - SparkContext
  - Conclusion
- Testing Spark Applications
  - Hello World Example
  - Testing a User Defined Function
  - A Real Test
  - How Testing Improves Your Codebase
  - Running a Single Test File
- Environment Specific Config in Spark Scala Projects
  - Basic use case
  - Environment specific code antipattern
  - Overriding config
  - Setting the PROJECT_ENV variable for test runs
  - Other implementations
  - Next steps
- Building Spark JAR Files with SBT
  - JAR File Basics
  - Building a Thin JAR File
  - Building a Fat JAR File
  - Next Steps
- Shading Dependencies in Spark Projects with SBT
  - When shading is useful
  - How to shade the spark-daria dependency
  - Conclusion
- Dependency Injection with Spark
  - Code with a dependency
  - Injecting a path
  - Injecting an entire DataFrame
  - Conclusion
- Broadcasting Maps
  - Simple example
  - Refactored code
  - Building Maps from data files
  - Conclusion
- Validating Spark DataFrame Schemas
  - Custom Transformations Refresher
  - A Custom Transformation Making a Bad Assumption
  - Column Presence Validation
  - Full Schema Validation
  - Documenting DataFrame Assumptions is Especially Important for Chained DataFrame Transformations
  - Conclusion