Name: Data Science Bootstrap Notes
Brand: Leanpub
Price: 7.99 USD
Availability: InStock

Your model is solid. Your statistics are sound. But your environment just broke, your project is a folder of untitled notebooks, and the code that ran fine yesterday won't run today.

Most data science programs teach the math, the models, and the domain. Almost none teach how to set up the machine underneath it all, structure a project so it survives six months, or build a workflow that runs identically on your laptop, your teammate's machine, and in production. That missing half is where data scientists lose days, miss deadlines, and quietly wonder whether everyone else has this figured out.

They don't. They just hit the same walls earlier, or they had a mentor.

Data Science Bootstrap is that mentor. It distills a decade of hard-won infrastructure lessons, battle-tested since I left bench science for computational work in 2013, into a short guide you can read in a weekend. You get the four philosophies that turn chaos into compounding productivity, the exact tools and configs to set up a real machine, a project structure that scales, the core skills that make your work hold up, and the modern AI-augmented workflow that lets you move at the speed of thought.

What's inside:

The four philosophies that turn chaos into compounding productivity, so each new project starts further ahead than the last.
How to configure a Mac or Linux machine for data science in an afternoon, with a dotfiles repo you can version and replicate anywhere, which means setting up a new machine or onboarding a new teammate takes hours, not days.
Reproducible environments with pixi that end "it works on my machine," so your code runs identically on your laptop, your colleague's machine, and in CI.
A project structure that scales from a throwaway script to production, which means future-you isn't lost in a swamp of untitled notebooks.
Shell mastery that turns twenty keystrokes into two, so the boring stuff stops eating your day.
How to write tests for data science code and models, so you catch breaks before they reach production and change your code without fear.
The modern AI-augmented workflow: repository memory with AGENTS.md, reusable skills, and agentic data science with marimo, so your coding agents compound in leverage instead of starting from zero.
CI/CD automation that builds, tests, and ships while you sleep, which means the drudgery disappears and you focus on the work that matters.

Why the eBook, when the knowledge base is free?

The online version will always be free; that's the point of the project. The eBook gives you the same content linearized into a focused read, portable offline in PDF and EPUB, and updated free for as long as I maintain it. It's also the most direct way to support the hundreds of hours behind it.

Who wrote this:

I'm Eric Ma, Senior Principal Data Scientist at Moderna, where I lead the Data Science and AI Research team. I hold an ScD from MIT, previously worked in biomedical data science at Novartis, am a past core developer on NetworkX and PyMC, and created pyjanitor and nxviz. Every practice here was forged in real projects, real collaborations, and real deadlines, not in a classroom.

The Data Science Bootstrap Notes

Where I think you, the reader, are coming from
Things you’ll learn
Apply these ideas just-in-time
Changes from the first edition
Ways to support the project

Philosophies

How these philosophies work together

You should know your computing stack

See this philosophy in action

Automate and standardize everywhere possible

See this philosophy in action

You should always know the source of truth

Repository standards for tools and agents
See this philosophy in action

Categorize everything that you can

See this philosophy in action

Putting it all together

How you’ll see these philosophies in action

Set up your machine

Why this is important
What you’ll learn in this section

Configure your shell

Install Starship
Configure environment variables
Create shell aliases
Troubleshooting
Quick Reference

Install and configure system-wide software

Install package managers
Install software
Configure your PATH
TL;DR: Quick installation commands
Troubleshooting
Quick Reference

Install and configure Git on your machine

Why do we need Git
How to install Git
How to configure Git with basic information
How to configure Git with fancy features
Troubleshooting
Productivity Tip: Shell Aliases
Quick Reference

Install `uv` to manage and install Python-based command line tools

Further reading

Install Homebrew on your Mac (fallback package manager)

Why install Homebrew?
When to use Homebrew
How to install Homebrew
Using Homebrew on Linux
See also

Install and configure `direnv` for environment management

Why we need direnv
How to install direnv
How to configure direnv
Loading .env files automatically
Troubleshooting

Leverage dotfiles to get your machine configured quickly

Why create a dotfiles repository
How to structure a dotfiles repository
Examples and resources

Configure VSCode for maximum productivity

How do I access VSCode settings?
What built-in settings have transformed my workflow?
What extensions have actually improved my productivity?
What keyboard shortcuts do I actually use?
How do I handle project-specific settings?
What about collaborative coding?
AI Agent Harnesses
Remember: start simple, grow gradually

Master your shell for data science productivity

Why shell mastery matters for data scientists
What you’ll learn in this section

Take full control of your shell environment variables

Why control your environment variables
How do I control my environment variables

Create shell command aliases for your commonly used commands

Why create shell aliases
How to create aliases
Where to store these aliases
Useful aliases to get started
Git aliases cheat sheet
Port management aliases
Enhancing built-in commands with functions

Shell commands cheat sheet

Basic navigation and file operations
File permissions and ownership
Process management
Network and system info
Archive and compression
Git shortcuts
Text processing
Advanced patterns
Time-saving tips

Shell-based text editors

Why should I care about shell editors?
What do I actually need to learn?
My recommendation: start with nano
What about vim and emacs?
My philosophy on shell editors
Getting started

Manage and configure your projects

Follow the 1:1:1:1… rule
When can we break this rule

Start with a sane repository structure

How to structure a standard repository structure
Automate the scaffolding of new projects

Use pixi for maximally ergonomic and reproducible environments

A practical onboarding path (clone to first command)
Choose one config layout on purpose
Pixi command cheat sheet
Long-term reproducibility through lock files
Composable multi-environment projects

Structure your source code repository sanely

Phase 1: Initial Exploration
Phase 2: Emerging Patterns
Phase 3: One-off Scripts
Phase 4: Production Structure
Leveraging AI Assistants
Core Development Principles

Store your project documentation in your project repository

Introduction to the Diataxis framework
When to add documentation
Code comments as documentation
AI assistance in documentation
Reference documentation
Automating documentation with CI/CD
Auditing and improving documentation

Use CI/CD to automate tasks

Key Concepts of CI/CD
Environment Considerations
Configuration and Environment Variables
Leveraging Pixi Environment
Practical Examples with GitHub Actions

Use data catalogs to manage data

What are data catalogs?
Traditional data catalog examples
Modern ML data storage with xarray and zarr
What are the advantages of data catalogs?
When should you use data catalogs?

Choose your data formats wisely

The binary vs text format decision
The hybrid approach: Binary source, text derivatives
High-dimensional data: The xarray advantage
Format-specific recommendations
How to implement this in practice
The bottom line

Take advantage of `uv` for one-off projects

Using PEP723 for script dependencies
Self-contained notebooks
Benefits of this approach
Real-world examples
Conclusion

Configuration files guide

Why do we even need all these config files?
What are the core configuration files you should know about?
What about documentation configuration?
How do I handle environment variables?
What’s the quick reference for which tools use which files?
What are my best practices for configuration files?
How do I get started with all this?

Set environment variables in a `.env` file

Why configure environment variables per project
How to configure environment variables for your project

Name things consistently

What constitutes a “sane” name?

Skills for Effective Data Science

Core Technical Skills
Effective Ways of Working

How to write software tests

Using AI coding agents for tests
What this chapter covers (and what it skips)
Why should I bother writing tests?
How do I actually write tests?
What do real tests look like?
What about testing data assumptions?
Reproducibility habits that tests can enforce
Smoke checks for models and training code
Don’t forget to test error conditions
How do I actually run these tests?
What are some advanced patterns worth knowing?
How do I organize my tests?
How do I know if I’m testing enough?
How does this fit into my development workflow?
What mistakes should I avoid?
How do I get started?

Refactor code

Collaborating on Data Science Projects

The power of pair programming
Leveraging AI in collaborative work
The science in data science projects
Effective work distribution
Pull requests, review, and what CI enforces
Handling merge conflicts without the drama
The art of managing unproductive patches
Scaling tacit knowledge

Use notebooks effectively

Choose Marimo over Jupyter for reactivity
Notebooks as prototyping tools, not production code
Data access best practices
Scratch pad vs. Report-style notebooks
Refactor with the help of AI
Publish notebooks and strip outputs before committing
Jupyter hygiene (if you must use Jupyter)

Working with AI tools

The speed of thought
The right kind of lazy
AI as a mirror of human capital
Effective patterns for AI interaction
Beyond code generation
See this in action
Moving forward

Building repository memory with `AGENTS.md`

What is AGENTS.md?
The two jobs of repository memory
Training an employee, not programming a bot
Durable norms and corrections
Examples worth spelling out
Bootstrap your AGENTS.md
See this in action

Skills as reusable playbooks

What is a skill?
Recommended installation
Starter pack
Tacit knowledge and the “ESL benefit”
The iteration loop
See this in action

Compounding agent improvement

The maturity model
Decision rule: AGENTS.md vs. Skills
Markdown as an executable language
The metacognition habit
Where this is going

Safe automation with coding agents

Auto-approve safe command line commands
Enable automatic web search
Know your emergency stop shortcuts
Correct agent behavior in real-time
Write prescriptive prompts for complex tasks
Use plan mode for complex tasks
Managing multiple background agents
See this in action

Making surgical changes to large codebases

The process
Why this works
Building your mental model
The constraint of three
See this in action

Advanced workflows with coding agents

Working with specialized tools
Working across repositories
See this in action

Agentic data science with marimo pair

What marimo pair gives you
Get set up
Frame the question before you start typing
Work in small, precise asks
When the loop breaks down
How this connects to the rest of the book

Looking back, moving forward

What we’ve covered together
The journey ahead
A personal note
Where to go from here
Final thoughts

Free Updates. DRM Free.

If you buy a Leanpub book, you get free updates for as long as the author updates the book! Many authors use Leanpub to publish their books in-progress, while they are writing them. All readers get free updates, regardless of when they bought the book or how much they paid (including free).

Most Leanpub books are available in PDF (for computers) and EPUB (for phones, tablets and Kindle).

Finally, Leanpub books don't have any DRM copy-protection nonsense, so you can easily read them on any supported device.
