Data Science Bootstrap
A practical guide to getting organized for your best data science work
About the Book
Hey there, data scientist! Have you found yourself struggling with your compute environment, such as package conflicts, this cryptic term called "environment variables", this weird concept called "containers", and technology terms that feel like an entire stack of things to learn? How about being lost amongst your multitude of projects, and not being able to get organized? This book, a linearized form of my freely available knowledge base, will give you a concise and practical guide to getting up-to-speed and organized, to help you do your best data science work.
The content in this book has been battle-tested since 2013, when I first switched over from bench science to computational science. You'll benefit from my distilled experience, where I learned all the hard lessons that come from not applying "best data science practices" to my projects, so you can avoid the same mistakes. You'll also benefit from the countless times I have guided colleagues and data practitioners new to data science through getting their systems set up. You'll see the backing philosophies, the "whys" that explain why we ought to do things a certain way, as well as manifestations of those "whys".
End the struggle with getting your computer to do what you want. Gain full control over it instead. Come and learn how!
Table of Contents
-
The Data Science Bootstrap Notes
- Where I think you, the reader, are coming from
- Things you’ll learn
- Apply these ideas just-in-time
- Changes from the first edition
- Ways to support the project
-
Philosophies
- How these philosophies work together
-
You should know your computing stack
- See this philosophy in action
-
Automate and standardize everywhere possible
- See this philosophy in action
-
You should always know the source of truth
- See this philosophy in action
-
Categorize everything that you can
- See this philosophy in action
-
Putting it all together
- How you’ll see these philosophies in action
-
Setup your machine
- Why this is important
- What you’ll learn in this section
-
Configure your shell
- Install Starship
- Configure environment variables
- Create shell aliases
- Troubleshooting
- Quick Reference
-
Install and configure system-wide software
- Install package managers
- Install software
- Configure your PATH
- TL;DR: Quick installation commands
- Troubleshooting
- Quick Reference
-
Install and configure Git on your machine
- Why do we need Git
- How to install Git
- How to configure Git with basic information
- How to configure Git with fancy features
- Troubleshooting
- Productivity Tip: Shell Aliases
- Quick Reference
-
Install uv to manage and install Python-based command line tools
- Further reading
-
Install Homebrew on your Mac (fallback package manager)
- Why install Homebrew?
- When to use Homebrew
- How to install Homebrew
- Using Homebrew on Linux
- See also
-
Install and configure direnv for environment management
- Why we need direnv
- How to install direnv
- How to configure direnv
- Loading .env files automatically (direnv >= 2.31.0)
- Troubleshooting
-
Leverage dotfiles to get your machine configured quickly
- Why create a dotfiles repository
- How to structure a dotfiles repository
- Examples and resources
-
Configure VSCode for maximum productivity
- How do I access VSCode settings?
- What built-in settings have transformed my workflow?
- What extensions have actually improved my productivity?
- What keyboard shortcuts do I actually use?
- How do I handle project-specific settings?
- What about collaborative coding?
- Remember: start simple, grow gradually
-
Master your shell for data science productivity
- Why shell mastery matters for data scientists
- What you’ll learn in this section
-
Take full control of your shell environment variables
- Why control your environment variables
- How do I control my environment variables
-
Create shell command aliases for your commonly used commands
- Why create shell aliases
- How to create aliases
- Where to store these aliases
- Useful aliases to get started
- Git aliases cheat sheet
-
Shell commands cheat sheet
- Basic navigation and file operations
- File permissions and ownership
- Process management
- Network and system info
- Archive and compression
- Git shortcuts
- Text processing
- Advanced patterns
- Time-saving tips
-
Shell-based text editors
- Why should I care about shell editors?
- What do I actually need to learn?
- My recommendation: start with nano
- What about vim and emacs?
- My philosophy on shell editors
- Getting started
-
Manage and configure your projects
- Follow the 1:1:1:1… rule
- When can we break this rule
-
Start with a sane repository structure
- How to structure a standard repository structure
- Automate the scaffolding of new projects
-
Use pixi for maximally ergonomic and reproducible environments
- The Cheat Sheet of pixi commands
- Long-term reproducibility through lock files
- Composable multi-environment projects
-
Structure your source code repository sanely
- Phase 1: Initial Exploration
- Phase 2: Emerging Patterns
- Phase 3: One-off Scripts
- Phase 4: Production Structure
- Leveraging AI Assistants
- Core Development Principles
-
Store your project documentation in your project repository
- Introduction to the Diataxis framework
- When to add documentation
- Code comments as documentation
- AI assistance in documentation
- Reference documentation
- Automating documentation with CI/CD
-
Use CI/CD to automate tasks
- Key Concepts of CI/CD
- Environment Considerations
- Configuration and Environment Variables
- Leveraging Pixi Environment
- Practical Examples with GitHub Actions
-
Use data catalogs to manage data
- What are data catalogs?
- Traditional data catalog examples
- Modern ML data storage with xarray and zarr
- What are the advantages of data catalogs?
- When should you use data catalogs?
-
Choose your data formats wisely
- The binary vs text format decision
- The hybrid approach: Binary source, text derivatives
- High-dimensional data: The xarray advantage
- Format-specific recommendations
- How to implement this in practice
- The bottom line
-
Take advantage of uv for one-off projects
- Using PEP723 for script dependencies
- Self-contained notebooks
- Benefits of this approach
- Real-world examples
- Conclusion
-
Configuration files guide
- Why do we even need all these config files?
- What are the core configuration files you should know about?
- What about documentation configuration?
- How do I handle environment variables?
- What’s the quick reference for which tools use which files?
- What are my best practices for configuration files?
- How do I get started with all this?
-
Set environment variables in a .env file
- Why configure environment variables per project
- How to configure environment variables for your project
-
Name things consistently
- What constitutes a “sane” name?
-
Skills for Effective Data Science
- Core Technical Skills
- Effective Ways of Working
-
How to write software tests
- Why should I bother writing tests?
- How do I actually write tests?
- What do real tests look like?
- What about testing data assumptions?
- Don’t forget to test error conditions
- How do I actually run these tests?
- What are some advanced patterns worth knowing?
- How do I organize my tests?
- How do I know if I’m testing enough?
- How does this fit into my development workflow?
- What mistakes should I avoid?
- How do I get started?
- Refactor code
-
Working with AI tools
- The speed of thought
- The right kind of lazy
- Effective patterns for AI interaction
- Beyond code generation
- Moving forward
-
Collaborating on Data Science Projects
- The power of pair programming
- Leveraging AI in collaborative work
- The science in data science projects
- Effective work distribution
- Handling merge conflicts without the drama
- The art of managing unproductive patches
-
Use notebooks effectively
- Choose Marimo over Jupyter for reactivity
- Notebooks as prototyping tools, not production code
- Data access best practices
- Scratch pad vs. Report-style notebooks
- Refactor with the help of AI
- Publish notebooks and strip outputs before committing
- Jupyter hygiene (if you must use Jupyter)
-
Looking back, moving forward
- What we’ve covered together
- The journey ahead
- A personal note
- Where to go from here
- Final thoughts
- Notes