Data Science Bootstrap

A practical guide to getting organized for your best data science work

About the Book

Hey there, data scientist! Have you found yourself struggling with your compute environment, such as package conflicts, this cryptic term called "environment variables", this weird concept called "containers", and technology terms that feel like an entire stack of things to learn? How about being lost amongst your multitude of projects, and not being able to get organized? This book, a linearized form of my freely available knowledge base, will give you a concise and practical guide to getting up-to-speed and organized, to help you do your best data science work.

The content in this book has been battle-tested since 2013, when I first switched over from bench science to computational science. You'll benefit from my distilled experience: I learned the hard lessons that come from not applying "best data science practices" to my projects, so you can avoid the same mistakes. You'll also benefit from the many times I have guided colleagues and data practitioners who are new to data science through getting their systems set up. You'll see the backing philosophies, the "whys" that explain why we ought to do things a certain way, as well as manifestations of those "whys".

End the struggle with getting your computer to do what you want. Gain full control over it instead. Come and learn how!

About the Author

Eric Ma

As Principal Data Scientist at Moderna, Eric leads the Data Science and Artificial Intelligence (Research) team to accelerate science to the speed of thought. Prior to Moderna, he was at the Novartis Institutes for Biomedical Research conducting biomedical data science research with a focus on using Bayesian statistical methods in the service of discovering medicines for patients. Prior to Novartis, he was an Insight Health Data Fellow in the summer of 2017 and defended his doctoral thesis in the Department of Biological Engineering at MIT in the spring of 2017.

Eric is also an open-source software developer and has led the development of pyjanitor, a clean API for cleaning data in Python, and nxviz, a visualization package for NetworkX. He is also on the core developer team of NetworkX and PyMC. In addition, he gives back to the community through code contributions, blogging, teaching, and writing.

His personal life motto is found in the Gospel of Luke 12:48.

Table of Contents

  • Get bootstrapped on your data science projects
    • Why this knowledge base exists
    • Where I think you, the reader, are coming from
    • Things you’ll learn
    • Apply these ideas just-in-time
    • Ways to support the project
  • The philosophies that ground the bootstrap
  • Data scientists should strive to know every last detail about their compute stack
  • There should be one, and preferably only one, obvious source of truth for things
  • Eliminate drudgery by investing in automation
  • Organize your projects by leveraging categories
  • Configure your machine
    • Initial setup
    • Getting Anaconda Python installed
    • Master the shell
    • Further configuration
  • Install homebrew on your machine
    • Why install Homebrew?
    • How do we install Homebrew?
    • Once you’re done…
    • What about Linux machines?
  • Install a suite of really cool utilities on your machine using homebrew
    • What utilities are recommended?
    • Install these really cool utilities
    • See also
  • Install Anaconda on your machine
    • What is anaconda
    • Why use anaconda?
    • How to get anaconda?
    • Next steps
    • Level-up your conda skills
  • Configure your conda installation
    • Why you would want to configure your conda installation
    • How to configure your condarc
    • Other conda-related pages to look at
  • Bootstrap your base conda environment
    • Why would you want to install some packages in your base conda environment
    • How to bootstrap your base conda environment
  • Take full control of your shell environment variables
    • Why control your environment variables
    • How do I control my environment variables
  • Create shell command aliases for your commonly used commands
    • Why create shell aliases
    • How do I create aliases?
    • Where do I store these aliases?
    • What are some aliases that could be useful?
  • Install zsh and oh-my-zsh for shell hacks
    • Why install zsh
    • How do I install zsh and oh-my-zsh
  • Leverage dotfiles to get your machine configured quickly
    • Why create a dotfiles repository
    • How do you structure a dotfiles repository
  • Configure Jupyter and Jupyter Lab
    • Jupyter configuration files
  • Install Docker on your machine
    • Why you will need Docker
    • How do you install Docker
  • Install and configure git on your machine
    • Why do we need Git
    • How to install Git
    • How to configure Git with basic information
    • How to configure Git with fancy features
  • Automate the bootstrapping of your new computer
    • Why automate your configuration
    • How to create a bootstrap script
  • Get prepped per project
  • Follow the rule of one-to-one in managing your projects
    • What is this rule all about
    • Why is this important
    • When can we break this rule
  • Sanely name things consistently
    • Why should you name things consistently
    • What constitutes a “sane” name?
  • One project should get one git repository
    • Why one project should get one Git repository
    • How to get this implemented
  • Set up an awesome default gitignore for your projects
    • Why setup a “gitignore” file?
    • How do I set up an awesome “gitignore” file?
    • How is a .gitignore file parsed?
  • Adhere to best git practices
    • Why adhere to best Git practices?
    • What best practices should we adhere to?
  • Set up pre-commit hooks to automate checks before making git commits
    • Why use pre-commit hooks?
    • How do I set up pre-commit hooks?
    • What pre-commit hooks are good to install?
    • How does this relate to continuous integration pipeline checks?
  • Set up your project with a sane directory structure
    • Why setup your project with a sane directory structure
    • What does a sane directory look like
  • Place custom source code inside a lightweight package
    • Why write a package for your custom source code
    • How to create a custom source package for a project
    • How often should the package be updated?
  • Keep your notebooks organized with logical categories
    • Prototyping notebooks go under notebooks/
    • Documentation notebooks go under docs/
    • Application notebooks go under app/
  • Create one conda environment per project
    • Why use one conda environment per project
    • How do you set up your conda environment files
    • How do you decide which versions of packages to use?
    • When do you upgrade/install new packages?
    • Ensure your environment kernels are available to Jupyter
    • Further tips
  • Create runtime environment variable configuration files for each of your projects
    • Why configure environment variables per project
    • How to configure environment variables for your project
  • Use scripts to automate routine execution of tasks
    • Where should these scripts live?
    • How do I decide what language to write those scripts in?
    • What else should I pay attention to when building these scripts?
  • Configuration file overview
    • What configuration files go with which tools?
  • Write effective documentation for your projects
    • Why write documentation
    • How do you write useful documentation
    • What tools should we use to write documentation?
    • What principles should we keep in mind when writing docs?
    • Resources
  • Install code checking tools to help write better code
    • Why install code checking tools?
    • What kind of things should I check for?
  • Define project-wide constants inside your custom package
    • Why you would want to define project-wide constants
    • How do you define project-wide constants
  • Write tests that test your custom code
    • Why write tests for your code
    • How do you write tests
  • Build a continuous integration pipeline for your source
    • What is a continuous integration pipeline
    • Why write a continuous integration pipeline
    • How to build a CI pipeline
  • Use pyprojroot to define relative paths to the project root
    • Why you should use pyprojroot
    • How do you use pyprojroot effectively
  • Create configuration files for code checking tools
    • Why configure code checking tools using configuration files?
    • What configuration files belong with which code checking tools?
    • When do I create these configuration files?
  • Navigate the packaging world
  • Prioritize conda to install packages
    • Why should you use conda for packages
    • How to search for conda-installable versions of packages
  • Use pip only when you cannot find packages on conda
    • When you can use pip
    • How to use pip with conda environments
  • Use docker containers for system-level packages
    • Why you might need to use Docker
    • How do we use Docker
  • Handling data
    • How to handle data
  • Write data descriptor files for your data sources
    • Why write data descriptor files
    • How do you write data descriptor files
    • Alternatives to data descriptor files
  • Define single sources of truth for your data sources
    • Why define single sources of truth for data
    • Examples of single sources of data truth in action
  • Validate your data wherever practically possible
    • What is “data validation”?
    • When to validate data
    • Parallels to software testing
    • Tools for validating data
  • Build your projects thinking in terms of pipelines
    • Why think in “pipelines”?
    • What to look out for in pipelining tools?
    • What pipelining tools exist?
  • Iteratively scope out and define the most appropriate data structures for your problem
    • Why you need to define good data structures
    • How to design good data structures for a problem
  • Never commit data into version control repositories
    • Why you should never commit data to Git
    • Add data to .gitignore
    • See also
  • Programmatically clean your data rather than manually
    • Why clean your data programmatically
    • What tools can we use to programmatically clean our data?
    • How do we design good data cleaning pipelines?
  • Choose and customize your development environment
  • Use VSCode to help you with software development and collaboration
    • Why you might want to consider using VSCode
    • How do we get those extension packs?
    • Alternatives to VSCode
  • Configure VSCode for maximum productivity
    • How do we configure VSCode?
    • What built-in options should I configure VSCode for maximum productivity?
    • What VSCode extensions help with productivity?
  • Use Jupyter as an experimentation playground
    • What are the use cases for Jupyter?
    • How do you get Jupyter?
    • How do you get Jupyter to recognize your environment’s Python?
    • How do you launch Jupyter?
  • Turbocharge Jupyter Lab using Language Servers
    • Why install Jupyter LSP?
    • Prerequisites
    • Installation
    • References
  • Shell-based plain text editors give you a quick way to edit texts
    • Why you should learn how to use shell-based text editors
    • Most important shell text editor usage steps
    • Which shell text editors exist?
    • Further reading
  • Enhance nano with syntax highlighting
    • Why you might want syntax highlighting in nano
    • How to upgrade nano with syntax highlighting
  • Bonus links
