Introduction
What is this book?
Today, most people enter the world of Data Science through the buzz and allure of “AI.” We tackle Kaggle challenges, voraciously consume Stack Overflow, and eat, live, and breathe through the Jupyter Notebook. Python, along with its “killer app” of Machine Learning, has done nothing short of revolutionize the way we “do data science,” and the world is a more interesting place because of it!
The Big Cloud providers, and many open source tools, have done wonders to democratize this technology. But, ‘easy access’ to high technology comes with a cost - we can easily go too far, rely too much on the tools we have today, and forget how to build the tools we need to truly transform our individual projects.
Most of the time, your impact as a Data Scientist is limited by your ability to enact your ideas - not by the ideas themselves. You can train a model on ‘clean’ data using scikit-learn or fastai, or run an ANOVA, in a notebook. But enacting that idea means much more. It means getting to the data in the first place. It means knowing how to store it. It means processing your data at scale. It means running your processing script, reliably, every day on fresh data. It means testing that script. It means collaborating on that script with a coworker - or 10 - as the project scales. It means curating a library and building tools to solve the same problem for 5 new projects. It means packaging a model up for distribution - sharing it with another data scientist, or deploying it as a service.
It means changing the way you think about problems by adopting new paradigms that accelerate you - and your work - across your organization. It means building an approach to data science within the broader Python ecosystem.
This book is about Python, and how to be an effective Python programmer as a Data Scientist. We will learn the advanced Python skills you need to accelerate your work and solve the real, daily problems you face in your DS role.
Who is it for?
The ideal reader has at least one year of active, daily Python use and is ready to learn more. You should be comfortable writing functions and classes, and importing packages and modules. You should not be confused about basic syntax, data types, etc., and when you see something you don’t understand, you should be comfortable reading the docs or the source code of the packages you work with, rather than resorting to Stack Overflow for every question. You should know how to debug an error - if you are not used to reading and interpreting error messages or launching a line-by-line debugger, you should get comfortable breaking and debugging code before you dive into a book like this.
Often, I get feedback from students who find this material too challenging - “it’s like sitting down to a piano lesson and learning how to tune a piano” and “it’s like showing up for an intro to Russian class only to learn the class is taught in Russian.” These, in fact, are very appropriate descriptions - as an advanced subject, this book is written in native Python, and we learn exactly when and how to tune Python for advanced, but very common, needs. It is extremely technical, and earns the advanced designation. I only hope you will find the material valuable.
What will you learn?
Congratulations on starting this journey - it is tough, but I promise it will be rewarding. Roughly speaking, we will cover four main themes in this book, learning advanced Python syntax and tools in the context of solving real-world Data Science problems. The themes are:
- Workflows
  - How to test, deploy, share, merge, version, review, and iterate on your code. We will look at environment and packaging tools - pipenv, conda, poetry - versioning tools, templating with cookiecutter, how to work with git, and more.
- Skeletons
  - The bones upon which we build projects. These frameworks give us new perspectives and approaches - how to build ETL pipelines with Airflow and Luigi, how to structure and train a model with MLOps tools, how to compose new frameworks and build repeatable tools using context managers, decorators, and descriptors.
- Data
  - How to store, process, and search data, big and small, row-based or columnar, using Parquet or SQL, and scaling with tools like Dask.
- Algorithms
  - The universal CS concepts that help many projects, including hashing, memoization techniques, JIT and static compilation, visualization, data sketching, and more.
I hope this whets your appetite, and that you will join me and your fellow scientists on this journey.
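As a small taste of two of the themes above (decorators from Skeletons, memoization from Algorithms), here is a minimal sketch of a memoizing decorator. The names `memoize`, `slow_square`, and `calls` are illustrative assumptions rather than code from the course, and the sketch only handles hashable positional arguments; the standard library's `functools.lru_cache` is a more complete alternative.

```python
import functools


def memoize(func):
    """Cache results of func, keyed by its positional arguments.

    Illustrative sketch only: arguments must be hashable, and the
    cache grows without bound.
    """
    cache = {}

    @functools.wraps(func)  # preserve func's name and docstring
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]

    return wrapper


calls = []  # records each time the underlying function actually runs


@memoize
def slow_square(x):
    """Stand-in for an expensive computation (hypothetical example)."""
    calls.append(x)
    return x * x


first = slow_square(4)   # computed, then stored in the cache
second = slow_square(4)  # served from the cache; slow_square body is not re-run
```

After both calls, `calls` contains a single entry, which shows the second result came from the cache rather than from re-running the function.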
The state of the book
This book originated as slides and lecture notes I developed while teaching CSCI E-29, Advanced Python for Data Science, at Harvard Extension School. Currently, the book contains a light transformation of those notes from slides into prose, which leaves a lot of opportunity to improve and flesh out the text. After all the notes are transcribed, the diagrams reformatted for the book medium, etc., I will come back to rewrite the text as needed for textbook form.