Data Analysis for the Life Sciences

This book is 100% complete

Completed on 2015-09-23

About the Book

The unprecedented advance in digital technology during the second half of the 20th century has produced a measurement revolution that is transforming science. In the life sciences, data analysis is now part of practically every research project. Genomics, in particular, is being driven by new measurement technologies that permit us to observe certain molecular entities for the first time. These observations are leading to discoveries analogous to identifying microorganisms and other breakthroughs permitted by the invention of the microscope. Choice examples of these technologies are microarrays and next-generation sequencing. This book covers several of the statistical concepts and data-analytic skills needed to succeed in data-driven life science research. We go from relatively basic concepts related to computing p-values to advanced topics related to analyzing high-throughput data.

While statistics textbooks focus on mathematics, this book focuses on using a computer to perform data analysis. Instead of explaining the mathematics and theory and then showing examples, we start by stating a practical data-related challenge. The book then includes the computer code that provides a solution to the problem and helps illustrate the concepts behind the solution. By running the code yourself, and seeing data generation and analysis happen live, you will get a better intuition for the concepts, the mathematics, and the theory. The book was created using R Markdown, and we make all of this code available to the reader. This means that readers can replicate all the figures and analyses used to create the book.


Table of Contents

  • Acknowledgements
  • Introduction
    • What Does This Book Cover?
    • How Is This Book Different?
  • Getting Started
    • Installing R
    • Installing RStudio
    • Learn R Basics
    • Installing Packages
    • Importing Data into R
    • Brief Introduction to dplyr
    • Mathematical Notation
  • Inference
    • Introduction
    • Random Variables
    • The Null Hypothesis
    • Distributions
    • Probability Distribution
    • Normal Distribution
    • Populations, Samples and Estimates
    • Central Limit Theorem and t-distribution
    • Central Limit Theorem in Practice
    • t-tests in Practice
    • The t-distribution in Practice
    • Confidence Intervals
    • Power Calculations
    • Monte Carlo Simulation
    • Parametric Simulations for the Observations
    • Permutation Tests
    • Association Tests
  • Exploratory Data Analysis
    • Quantile Quantile Plots
    • Boxplots
    • Scatterplots And Correlation
    • Stratification
    • Bi-variate Normal Distribution
    • Plots To Avoid
    • Misunderstanding Correlation (Advanced)
    • Robust Summaries
    • Wilcoxon Rank Sum Test
  • Matrix Algebra
    • Motivating Examples
    • Matrix Notation
    • Solving System of Equations
    • Vectors, Matrices and Scalars
    • Matrix Operations
    • Examples
  • Linear Models
    • The Design Matrix
    • The Mathematics Behind lm()
    • Standard Errors
    • Interactions and Contrasts
    • Linear Model with Interactions
    • Analysis of variance
    • Co-linearity
    • Rank
    • Removing Confounding
    • The QR Factorization (Advanced)
    • Going Further
  • Inference For High Dimensional Data
    • Introduction
    • Inference in Practice
    • Procedures
    • Error Rates
    • The Bonferroni Correction
    • False Discovery Rate
    • Direct Approach to FDR and q-values (Advanced)
    • Basic Exploratory Data Analysis
  • Statistical Models
    • The Binomial Distribution
    • The Poisson Distribution
    • Maximum Likelihood Estimation
    • Distributions for Positive Continuous Values
    • Bayesian Statistics
    • Hierarchical Models
  • Distance and Dimension Reduction
    • Introduction
    • Euclidean Distance
    • Distance in High Dimensions
    • Dimension Reduction Motivation
    • Singular Value Decomposition
    • Projections
    • Rotations
    • Multi-Dimensional Scaling Plots
    • Principal Component Analysis
  • Basic Machine Learning
    • Clustering
    • Conditional Probabilities and Expectations
    • Smoothing
    • Bin Smoothing
    • Loess
    • Class Prediction
    • Cross-validation
  • Batch Effects
    • Confounding
    • Confounding: High-throughput Example
    • Discovering Batch Effects with EDA
    • Gene Expression Data
    • Motivation for Statistical Approaches
    • Adjusting for Batch Effects with Linear Models
    • Factor Analysis
    • Modeling Batch Effects with Factor Analysis

About the Authors

Rafael A Irizarry

Rafael Irizarry is a Professor of Biostatistics and Computational Biology at the Dana-Farber Cancer Institute and a Professor of Biostatistics at the Harvard T.H. Chan School of Public Health. For the past 17 years, Dr. Irizarry’s research has focused on the analysis of genomics data.

Michael I Love

Michael Love is an Assistant Professor in the Departments of Biostatistics and Genetics at the University of North Carolina at Chapel Hill. Dr. Love uses statistical models to discover biologically relevant patterns in genomic datasets, and develops open-source statistical software for the Bioconductor Project.

About the Contributors

Alexandra Nones
Alexandra proofread the book in its various stages.
Heather Sternshein
Heather helped coordinate the online course that gave birth to this book.
Karl Broman
Karl contributed the "Plots To Avoid" section.
Stephanie Hicks
Stephanie contributed some of the exercises.
