Email the Author
You can use this page to email MrDataPsycho about From Pandas to PySpark DataFrame.
About the Book
Pandas is a popular Python library for data processing, but it runs on a single machine and struggles with datasets that do not fit in memory. Apache Spark, a distributed analytics engine, offers significant performance improvements for such workloads.
This book will help improve your Python-based data processing by leveraging Apache Spark’s distributed, parallel processing capabilities through the PySpark library. You’ll start by reading data into a PySpark DataFrame before performing basic input/output operations, such as renaming columns, selecting, and writing data. You’ll move on to transformation functions like aggregation, statistical analysis, and joins before creating custom, user-defined functions. At each step, you’ll get a quick Pandas review before being walked through the equivalent approach in the more robust PySpark library.
By following the book’s content and exercises, you’ll be able to quickly and reliably process large amounts of data, even when it is stored across multiple files, using PySpark.
Some key takeaways from the book:
● A working knowledge of Apache Spark and the PySpark library for Python
● A strong understanding of the advantages of using PySpark instead of Pandas for processing large datasets
● The ability to create, analyze, and produce visualizations using PySpark
● Hands-on experience reading, transforming, and analyzing real-world data using PySpark
● The ability to write production-grade code that your colleagues will appreciate
*** New chapter and updates will be released at the end of summer 2023
About the Author
I am a Data Science professional who has been working in the industry for more than five years. Besides my full-time job, I also like to explore new technology. I share my experience with the tools and technologies used in Data Science through blogs and books.