Data Science Bootstrap
A practical guide to getting organized for your best data science work
About the Book
Hey there, data scientist! Have you found yourself struggling with your compute environment, such as package conflicts, this cryptic term called "environment variables", this weird concept called "containers", and technology terms that feel like an entire stack of things to learn? How about being lost amongst your multitude of projects, and not being able to get organized? This book, a linearized form of my freely available knowledge base, will give you a concise and practical guide to getting up-to-speed and organized, to help you do your best data science work.
The content in this book has been battle-tested since 2013, when I first switched over from bench science to computational science. You'll benefit from my distilled experience: I learned the hard lessons that come from not applying "best data science practices" to my projects, so you can avoid the same mistakes. You'll also benefit from the many times I have guided colleagues and data practitioners new to data science through getting their systems set up. You'll see the backing philosophies, the "whys" that explain why we ought to do certain things a certain way, as well as manifestations of those "whys".
End the struggle with getting your computer to do what you want. Gain full control over it instead. Come and learn how!
Get bootstrapped on your data science projects
- Why this knowledge base exists
- Where I think you, the reader, are coming from
- Things you’ll learn
- Apply these ideas just-in-time
- Ways to support the project
- The philosophies that ground the bootstrap
- Data scientists should strive to know every last detail about their compute stack
- There should be one, and preferably only one, obvious source of truth for things
- Eliminate drudgery by investing in automation
- Organize your projects by leveraging categories
Configure your machine
- Initial setup
- Getting Anaconda Python installed
- Master the shell
- Further configuration
Install homebrew on your machine
- Why install Homebrew?
- How do we install Homebrew?
- Once you’re done…
- What about Linux machines?
Install a suite of really cool utilities on your machine using homebrew
- What utilities are recommended?
- Install these really cool utilities
- See also
Install Anaconda on your machine
- What is anaconda
- Why use anaconda?
- How to get anaconda?
- Next steps
- Level-up your conda skills
Configure your conda installation
- Why you would want to configure your conda installation
- How to configure your condarc
- Other conda-related pages to look at
Bootstrap your base conda environment
- Why would you want to install some packages in your base conda environment
- How to bootstrap your base conda environment
Take full control of your shell environment variables
- Why control your environment variables
- How do I control my environment variables
Create shell command aliases for your commonly used commands
- Why create shell aliases
- How do I create aliases?
- Where do I store these aliases?
- What are some aliases that could be useful?
Install zsh and oh-my-zsh for shell hacks
- Why install zsh
- How do I install zsh and oh-my-zsh
Leverage dotfiles to get your machine configured quickly
- Why create a dotfiles repository
- How do you structure a dotfiles repository
Configure Jupyter and Jupyter Lab
- Jupyter configuration files
Install Docker on your machine
- Why you will need Docker
- How do you install Docker
Install and configure git on your machine
- Why do we need Git
- How to install Git
- How to configure Git with basic information
- How to configure Git with fancy features
Automate the bootstrapping of your new computer
- Why automate your configuration
- How to create a bootstrap script
- Get prepped per project
Follow the rule of one-to-one in managing your projects
- What is this rule all about
- Why is this important
- When can we break this rule
Sanely name things consistently
- Why should you name things consistently
- What constitutes a “sane” name?
One project should get one git repository
- Why one project should get one Git repository
- How to get this implemented
Set up an awesome default gitignore for your projects
- Why set up a “gitignore” file?
- How do I set up an awesome “gitignore” file?
How is a
Adhere to best git practices
- Why adhere to best Git practices?
- What best practices should we adhere to?
Set up pre-commit hooks to automate checks before making git commits
- Why use pre-commit hooks?
- How do I set up pre-commit hooks?
- What pre-commit hooks are good to install?
- How does this relate to continuous integration pipeline checks?
Set up your project with a sane directory structure
- Why set up your project with a sane directory structure
- What does a sane directory look like
Place custom source code inside a lightweight package
- Why write a package for your custom source code
- How to create a custom source package for a project
- How often should the package be updated?
Keep your notebooks organized with logical categories
- Prototyping notebooks go under
- Documentation notebooks go under
- Application notebooks go under
Create one conda environment per project
- Why use one conda environment per project
- How do you set up your conda environment files
- How do you decide which versions of packages to use?
- When do you upgrade/install new packages?
- Ensure your environment kernels are available to Jupyter
- Further tips
Create runtime environment variable configuration files for each of your projects
- Why configure environment variables per project
- How to configure environment variables for your project
Use scripts to automate routine execution of tasks
- Where should these scripts live?
- How do I decide what language to write those scripts in?
- What else should I pay attention to when building these scripts?
Configuration file overview
- What configuration files go with which tools?
Write effective documentation for your projects
- Why write documentation
- How do you write useful documentation
- What tools should we use to write documentation?
- What principles should we keep in mind when writing docs?
Install code checking tools to help write better code
- Why install code checking tools?
- What kind of things should I check for?
Define project-wide constants inside your custom package
- Why you would want to define project-wide constants
- How do you define project-wide constants
Write tests that test your custom code
- Why write tests for your code
- How do you write tests
Build a continuous integration pipeline for your source
- What is a continuous integration pipeline
- Why write a continuous integration pipeline
- How to build a CI pipeline
Use pyprojroot to define relative paths to the project root
- Why you should use pyprojroot
- How do you use pyprojroot effectively
Create configuration files for code checking tools
- Why configure code checking tools using configuration files?
- What configuration files belong with which code checking tools?
- When do I create these configuration files?
- Navigate the packaging world
Prioritize conda to install packages
- Why should you use conda for packages
- How to search for conda-installable versions of packages
Use pip only when you cannot find packages on conda
- When you can use pip
- How to use pip with conda environments
Use docker containers for system-level packages
- Why you might need to use Docker
- How do we use Docker
- How to handle data
Write data descriptor files for your data sources
- Why write data descriptor files
- How do you write data descriptor files
- Alternatives to data descriptor files
Define single sources of truth for your data sources
- Why define single sources of truth for data
- Examples of single sources of data truth in action
Validate your data wherever practically possible
- What is “data validation”?
- When to validate data
- Parallels to software testing
- Tools for validating data
Build your projects thinking in terms of pipelines
- Why think in “pipelines”?
- What to look out for in pipelining tools?
- What pipelining tools exist?
Iteratively scope out and define the most appropriate data structures for your problem
- Why you need to define good data structures
- How to design good data structures for a problem
Never commit data into version control repositories
- Why you should never commit data to Git
- Add data to .gitignore
- See also
Programmatically clean your data rather than manually
- Why clean your data programmatically
- What tools can we use to programmatically clean our data?
- How do we design good data cleaning pipelines?
- Choose and customize your development environment
Use VSCode to help you with software development and collaboration
- Why you might want to consider using VSCode
- How do we get those extension packs?
- Alternatives to VSCode
Configure VSCode for maximum productivity
- How do we configure VSCode?
- What built-in options should I configure VSCode for maximum productivity?
- What VSCode extensions help with productivity?
Use Jupyter as an experimentation playground
- What are the use cases for Jupyter?
- How do you get Jupyter?
- How do you get Jupyter to recognize your environment’s Python?
- How do you launch Jupyter?
Turbocharge Jupyter Lab using Language Servers
- Why install Jupyter LSP?
Shell-based plain text editors give you a quick way to edit texts
- Why you should learn how to use shell-based text editors
- Most important shell text editor usage steps
- Which shell text editors exist?
- Further reading
Enhance nano with syntax highlighting
- Why you might want syntax highlighting in nano
- How to upgrade nano with syntax highlighting
- Bonus links