### Advanced Machine Learning Made Easy - Volume 1

###### From Theory to Practice with NumPy and scikit-learn, Volume 1: Generalized Linear Models

# About the Book

Machine learning is a highly interdisciplinary field that provides a set of tools for modeling and understanding complex datasets. It is a branch of computer science that uses statistical techniques to give computer systems the ability to "learn" from data without being explicitly programmed. This three-volume book series covers a wide variety of topics in machine learning, focusing only on supervised and unsupervised learning, and is intended for those who want to become data scientists or experts in machine learning. The first volume covers generalized linear models, which are simpler yet constitute the basics of machine learning. The series can be seen as a compilation of thousands of pieces of information gathered from different materials and put together into a concise description of this emerging field.

The book series takes the approach of first building up the basic concepts and then providing the mathematical framework to derive each machine learner step by step. It can also be seen as a concise description of the algorithms included in the scikit-learn library, but the primary goal is to provide a good understanding of not just the APIs of scikit-learn and statsmodels, but also of what is under the hood: how the algorithms work and what the pros and cons of each one are. No single approach will perform well in all possible applications - no universal machine learning algorithm exists that works well on all problems - so, without understanding all the cogs and their interaction inside the machine (learner), it is impossible to select the best algorithm for a particular problem. Each chapter is accompanied by lab exercises, available as Jupyter Notebooks at https://github.com/FerencFarkasPhD/Advanced-Machine-Learning.
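For readers who have not yet used scikit-learn, the uniform estimator interface that the book examines both from the outside and under the hood follows a fit/predict pattern. The snippet below is a minimal sketch of that pattern using a toy dataset, not an example taken from the book itself:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data with one explanatory variable: y = 2x exactly
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# Every scikit-learn estimator exposes the same two core methods:
model = LinearRegression()
model.fit(X, y)                 # estimate the model parameters from data
pred = model.predict([[5.0]])   # apply the fitted model to a new observation

print(pred)  # close to [10.0]
```

The same fit/predict pair applies to every estimator covered in the book, from ridge regression to logistic regression, which is what makes pipelines and grid search (Appendix A) composable.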

This book series is intended for undergraduate and graduate students, as well as software developers, experimental scientists, engineers, and financial professionals with strong math backgrounds who wish to improve their machine learning skills. Some mathematical background, equivalent to a one-semester undergraduate course in each of the following fields, is therefore preferred: linear algebra, multivariate differential calculus, probability theory, and statistics. It is also assumed that the reader has a basic knowledge of computer science principles, including, but not limited to, data structures and algorithms. Basic programming skills, some knowledge of Python, the SciPy stack, and Jupyter Notebook are also required to carry out the lab exercises accompanying the book.

Although reading the three-volume series requires a solid math background, those who lack the necessary math skills should not run away in panic. The author is an engineer, not a mathematician, and sees math only as a tool that deepens the understanding of the problem at hand and helps find the optimal solution to a practical problem. Thus, all mathematical formulas used by machine learning algorithms are introduced by formulating the background and providing additional intuition beforehand. With that in mind, minimal mathematical knowledge might also be sufficient for an eager learner to understand the book. Moreover, the mathematical expression for each machine learning algorithm is derived step by step, with clear explanations, intuitive examples, and supporting figures.

Those who do not possess a deep mathematical background but have some programming knowledge should see the vectors and matrices used in the mathematical framework as counterparts of the multidimensional arrays used in computer programs, and the mathematical formulas as sequences of array manipulations. Thus, the mathematical formulas presented in the book can be converted directly into single lines of Python code using array manipulations. This approach is supported by both the Appendix and the lab exercises.
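As an illustration of this formula-to-code correspondence (a sketch on synthetic data, not an example from the book), the ordinary least squares estimate \(\hat{\beta} = (X^T X)^{-1} X^T y\) from Chapter 3 translates into a single NumPy expression:

```python
import numpy as np

# Synthetic data: 100 samples, an intercept column plus 2 features
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
true_beta = np.array([1.0, 2.0, -3.0])
y = X @ true_beta + 0.01 * rng.normal(size=100)

# The formula beta = (X^T X)^(-1) X^T y as one line of array manipulations
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

print(beta_hat)  # close to [1.0, 2.0, -3.0]
```

In production code one would prefer `np.linalg.lstsq` or `np.linalg.solve` over an explicit matrix inverse for numerical stability, but the one-line form mirrors the textbook formula term by term.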

#### Table of Contents

- Preface
- Acknowledgments
- Chapter 1: Introduction
- 1.1 How to Use the Book
- 1.2 Definition of Machine Learning
- 1.3 Data Science and Machine Learning
- 1.4 Types of Machine Learning Algorithms
- 1.5 Data Science Methodology
- 1.5.1 Business Understanding
- 1.5.2 Analytic Approach
- 1.5.3 Data Requirements
- 1.5.4 Data Collection
- 1.5.5 Data Understanding
- 1.5.6 Data Preparation
- 1.5.7 Modeling
- 1.5.8 Evaluation
- 1.5.9 Deployment
- 1.5.10 Feedback

- 1.6 Open Source Tools for Data Science
- 1.6.1 Python
- 1.6.2 Jupyter Notebook and JupyterLab
- 1.6.3 SciPy
- 1.6.4 NumPy
- 1.6.5 Matplotlib
- 1.6.6 Seaborn
- 1.6.7 Pandas
- 1.6.8 Scikit-learn
- 1.6.9 StatsModels

- 1.7 Structure of the Book
- References

- Chapter 2: Simple Linear Regression
- 2.1 Introduction
- 2.2 Case Study: Is Height Hereditary?
- 2.3 Simple Dataset with One Explanatory Variable
- 2.4 Random Variables
- 2.4.1 Discrete Random Variables
- 2.4.2 Continuous Random Variables
- 2.4.2.1 Normal Distributions
- 2.4.2.2 The Central Limit Theorem

- 2.4.3 Monte Carlo Simulation

- 2.5 Population vs. Sample
- 2.5.1 Population
- 2.5.1.1 Population Attributes
- 2.5.1.2 Estimate of Population Attribute

- 2.5.2 Sample
- 2.5.2.1 Sample Mean
- 2.5.2.2 Sample Variance
- 2.5.2.3 Sample Covariance
- 2.5.2.4 Sample Correlation Coefficient

- 2.6 The Simple Linear Regression Model
- 2.6.1 Fitting the Best Line
- 2.6.2 Assessing Goodness-of-Fit in a Regression Model
- 2.6.3 Regression Standard Error vs. Coefficient of Determination
- 2.6.4 Interpreting the Model Parameters

- 2.7 Assumptions of the Ordinary Least Squares
- 2.7.1 The Regression Model is Linear in the Coefficients
- 2.7.2 The Population Error has Zero Mean
- 2.7.3 All Independent Variables are Uncorrelated with the Error
- 2.7.4 The Error Terms are Uncorrelated with Each Other
- 2.7.5 The Error Term has a Constant Variance
- 2.7.6 No Perfect Multicollinearity Between Input Variables
- 2.7.7 The Error Term is Normally Distributed (Optional)

- 2.8 Outlier, Leverage, and Influential Point
- 2.8.1 Mean Absolute Error
- 2.8.2 Median Absolute Error

- 2.9 Graphical Analysis of Outliers
- 2.10 Sampling Error and Prediction Uncertainty
- 2.10.1 Variation Due to Sampling
- 2.10.2 Confidence Intervals
- 2.10.2.1 Confidence Interval for the Population Mean
- 2.10.2.2 Confidence Interval for the Regression Parameters
- 2.10.2.3 Confidence Interval for the Conditional Mean

- 2.10.3 Prediction Interval
- 2.10.4 Analysis of Variance
- 2.10.4.1 The One Way ANOVA Table
- 2.10.4.2 Significance Level, Statistical Power, and p-values

- 2.11 Data Preprocessing
- 2.12 The Importance of Graphing
- 2.13 Summary
- Lab Exercises
- References

- Chapter 3: Multiple Linear Regression
- 3.1 Introduction
- 3.2 Simple Dataset with Two Explanatory Variables
- 3.3 OLS for Multiple Linear Regressions
- 3.3.1 Analytical Solution
- 3.3.2 Adjusted Coefficient of Determination
- 3.3.3 Result of the Analysis of the Simple Dataset
- 3.3.4 Expected Values and Variances of OLS
- 3.3.5 Case Study: Advertising
- 3.3.6 Confidence Intervals and Regions
- 3.3.6.1 Confidence and Prediction Intervals
- 3.3.6.2 Confidence Regions for Regression Coefficients

- 3.3.7 Maximum Likelihood Estimator

- 3.4 Outliers, Leverages, and Influential Observations
- 3.4.1 Analyzing Residuals
- 3.4.2 Leverages
- 3.4.3 Identifying Influential Observations
- 3.4.3.1 DFFITS
- 3.4.3.2 DFBETAS
- 3.4.3.3 Cook’s Distances
- 3.4.3.4 Covariance Ratio

- 3.4.4 Case Study: Salary
- 3.4.5 Outlier Masking and Swamping
- 3.4.6 Modified z-score
- 3.4.7 Graphical Diagnostics
- 3.4.7.1 Boxplot
- 3.4.7.2 Violinplot
- 3.4.7.3 Histogram
- 3.4.7.4 Scatter Plot Matrix
- 3.4.7.5 Forward Response Plot
- 3.4.7.6 Residual Plot
- 3.4.7.7 Normal Q-Q Plot
- 3.4.7.8 Leverage versus Squared Residual Plot
- 3.4.7.9 Influence Plot
- 3.4.7.10 Cook’s Distances vs. Index Plot

- 3.4.8 Strategy for Dealing with Problematic Data Points

- 3.5 Data Preprocessing
- 3.5.1 Centering
- 3.5.2 Standardization
- 3.5.3 Feature Scaling
- 3.5.3.1 Min-Max Scaling
- 3.5.3.2 Max-Abs Scaling
- 3.5.3.3 Robust Scaling

- 3.5.4 Summary of Scaling Transformations
- 3.5.5 Whitening
- 3.5.5.1 ZCA (Mahalanobis) Whitening
- 3.5.5.2 PCA Whitening
- 3.5.5.3 Cholesky Whitening
- 3.5.5.4 Summary of Whitening Transformations

- 3.5.6 Nonlinear Transformations
- 3.5.6.1 Logarithmic Transformations
- 3.5.6.2 Case Study: Mammal Species
- 3.5.6.3 Power Transformations
- 3.5.6.4 Quantile Transformation
- 3.5.6.5 Normalization

- 3.5.7 Generating Polynomial Features
- 3.5.8 Encoding Categorical Features
- 3.5.8.1 Binary Variables
- 3.5.8.2 Ordinal Variables
- 3.5.8.3 Nominal Variables and One-hot Encoding
- 3.5.8.4 Case Study: Salary Discrimination
- 3.5.8.5 Summary of Association Measures

- 3.5.9 Discretization
- 3.5.9.1 Binarization
- 3.5.9.2 Binning

- 3.5.10 Handling Missing Data
- 3.5.10.1 Deletion Methods
- 3.5.10.2 Imputation Methods

- 3.5.11 Feature Engineering

- 3.6 Moving Beyond the OLS Assumptions
- 3.6.1 Multicollinearity
- 3.6.1.1 Variance Inflation
- 3.6.1.2 Reducing Data-based Multicollinearity
- 3.6.1.3 Reducing Structural Multicollinearity
- 3.6.1.4 Principal Component Regression
- 3.6.1.5 Case Study: Body Fat

- 3.6.2 Removing the Additive Assumption
- 3.6.3 Polynomial Regression
- 3.6.4 Weighted Least Squares
- 3.6.5 Autocorrelation
- 3.6.5.1 Generalized Least Squares
- 3.6.5.2 Feasible Generalized Least Squares
- 3.6.5.3 Case Study: The Demand for Ice Cream

- 3.7 Learning Theory
- 3.7.1 The Bias-Variance Trade-off
- 3.7.2 Generalization Error vs. Training Error
- 3.7.3 Estimating Generalization Error in Practice
- 3.7.3.1 Hold-out Method
- 3.7.3.2 Data Leakage
- 3.7.3.3 Validation Set
- 3.7.3.4 Cross-Validation
- 3.7.3.5 Nested Cross-Validation
- 3.7.3.6 Cross-Validation and Cross-Testing

- 3.7.4 Case Study: House Prices Prediction
- 3.7.4.1 Exploratory Data Analysis
- 3.7.4.2 Bias and Variance vs. Model Complexity
- 3.7.4.3 Training and Testing Error vs. Model Complexity
- 3.7.4.4 Estimation of Generalization Error in Practice
- 3.7.4.5 Model Assessment

- 3.7.5 Dataset Shift

- 3.8 Feature Selection
- 3.8.1 Selecting Features Based on Importance
- 3.8.2 Univariate Feature Selection
- 3.8.2.1 Low Variance
- 3.8.2.2 Chi-squared Test
- 3.8.2.3 F-test
- 3.8.2.4 Mutual Information

- 3.8.3 Wrapper Methods
- 3.8.3.1 Best Subset Selection
- 3.8.3.2 Model Selection Based on Information Criterion
- 3.8.3.3 Stepwise Selection
- 3.8.3.4 Stepwise Selection with Cross-Validation

- 3.9 Regularization
- 3.9.1 Ridge Regression
- 3.9.1.1 Problem Formulation
- 3.9.1.2 Scale Variance of Ridge Regression
- 3.9.1.3 Bias and Variance of Ridge Estimator
- 3.9.1.4 Relation between Ridge Regression and PCA
- 3.9.1.5 Geometric Interpretation of the Ridge Estimator
- 3.9.1.6 Probabilistic Interpretation of Ridge Regression
- 3.9.1.7 Choice of Hyperparameter

- 3.9.2 Lasso Regression
- 3.9.2.1 Problem Formulation
- 3.9.2.2 Geometric Interpretation of the Lasso Regression
- 3.9.2.3 Probabilistic Interpretation of the Lasso Regression
- 3.9.2.4 Soft-Thresholding

- 3.9.3 Elastic Regression

- 3.10 Summary
- Lab Exercises
- References

- Chapter 4: Time Series Analysis
- 4.1 Introduction to Time Series
- 4.1.1 Correlogram

- 4.2 Basic Linear Models
- 4.2.1 White Noise
- 4.2.2 Random Walk
- 4.2.3 The Backshift Operator
- 4.2.4 Strict Stationarity

- 4.3 Moving Average (MA)
- References

- Chapter 5: Optimization Methods
- 5.1 Introduction to Optimization
- 5.1.1 Reason for Using Optimization in Machine Learning
- 5.1.2 Optimization Algorithms

- 5.2 Gradient Descent
- 5.2.1 Basics of Gradient Descent
- 5.2.2 Batch Gradient Descent
- 5.2.3 Stochastic Gradient Descent
- 5.2.4 Mini-Batch Gradient Descent
- 5.2.5 Simulated Annealing

- 5.3 Stochastic Average Gradient
- 5.4 Coordinate Descent
- 5.5 Newton’s Method and Quasi-Newton Methods
- 5.5.1 Conjugate Gradient Descent
- 5.5.2 Newton-CG Method
- 5.5.3 LBFGS

- 5.6 Iteratively Reweighted Least Squares
- 5.7 Summary
- Lab Exercises
- References

- Chapter 6: Logistic Regression
- 6.1 Introduction
- 6.2 Why not approach classification through regression?
- 6.3 Binomial Logistic Regression
- 6.3.1 Problem Formulation
- 6.3.2 Cross-Entropy
- 6.3.3 Probabilistic Interpretation

- 6.4 Evaluation Metrics for Classifiers
- 6.4.1 Accuracy and Error Rate
- 6.4.2 Confusion Matrix
- 6.4.3 Sensitivity and Specificity
- 6.4.4 Receiver Operating Characteristics
- 6.4.5 Area Under the Curve

- 6.5 Handling Class Imbalances
- 6.6 Multinomial Logistic Regression
- 6.7 Learning Theory Revisited
- 6.7.1 Basic Concepts
- 6.7.2 Approximation Error vs. Estimation Error
- 6.7.2.1 Bayes risk and Bayes estimator
- 6.7.2.2 Empirical Risk Minimization
- 6.7.2.3 Excess Risk and Error Decomposition

- 6.7.3 Probably Approximately Correct
- 6.7.3.1 “No Free Lunch” Theorem

- 6.8 Case Study: Handwritten Digits Recognition

- Lab Exercises
- References

- Appendix A: scikit-learn API reference
- A.1 Introduction
- A.2 Common Methods
- A.3 Grid Search
- A.4 Pipelines and Composite Estimators
- A.5 Generalized Linear Models
- A.5.1 Linear Regression Models
- A.5.2 Metrics
- A.5.3 Data Preprocessing
- A.5.4 Linear Classifiers
- A.5.5 Classification Metrics
- A.5.6 Model Selection

- Appendix B: Brief Introduction to NumPy
- B.1 Introduction
- B.2 Basic Data Structure of NumPy
- B.3 Scalars
- B.4 Vectors
- B.4.1 Vector Initialization
- B.4.2 Accessing Vector Elements
- B.4.3 Adding and Removing Elements
- B.4.4 Operations on Vectors
- B.4.4.1 Arithmetic Operations
- B.4.4.2 Mathematical Functions
- B.4.4.3 Statistical Functions
- B.4.4.4 Vector Dot Product

- B.4.5 Sorting and Searching

- B.5 Matrices
- B.5.1 Matrix Initialization
- B.5.2 Accessing Matrix Elements
- B.5.3 Operations on Matrices
- B.5.4 Matrix Dot Product
- B.5.5 Linear Algebra

- B.6 Array Manipulations
- B.6.1 Changing Array Shape
- B.6.2 Joining and Splitting Arrays

- Index
