Exploratory Data Analysis and Feature Engineering

Before training any machine learning model, you need to understand your data. Exploratory Data Analysis (EDA) is the process of examining a dataset to summarize its main characteristics, find patterns, detect anomalies, and check assumptions. Feature engineering is the art of creating new input variables — or transforming existing ones — to improve model performance.

These steps often make the difference between a mediocre model and a good one. As the saying goes: “garbage in, garbage out.”

The requirements for this chapter are:

1 uv pip install scikit-learn pandas numpy

The examples for this chapter are in the directory source-code/data_analysis_and_feature_engineering.

We continue using the California Housing dataset from the previous chapter.

Exploratory Data Analysis

Loading and Inspecting the Data

The first thing to do with any dataset is to understand its shape, types, and basic statistics:

 1 import numpy as np
 2 import pandas as pd
 3 from sklearn.datasets import fetch_california_housing
 4 
 5 housing = fetch_california_housing()
 6 df = pd.DataFrame(housing.data, columns=housing.feature_names)
 7 df["MedHouseVal"] = housing.target
 8 
 9 print(f"Shape: {df.shape[0]} rows × {df.shape[1]} columns")
10 print(df.dtypes)

Running our eda.py script gives us:

 1 $ python eda.py
 2 === Dataset Overview ===
 3 Shape: 20640 rows × 9 columns
 4 
 5 Column types:
 6 MedInc         float64
 7 HouseAge       float64
 8 AveRooms       float64
 9 AveBedrms      float64
10 Population     float64
11 AveOccup       float64
12 Latitude       float64
13 Longitude      float64
14 MedHouseVal    float64

All columns are floating point. Next, summary statistics:

1 === Summary Statistics ===
2          MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup
3 count  20640.00  20640.00  20640.00   20640.00    20640.00  20640.00
4 mean       3.87     28.64      5.43       1.10     1425.48      3.07
5 std        1.90     12.59      2.47       0.47     1132.46     10.39
6 min        0.50      1.00      0.85       0.33        3.00      0.69
7 max       15.00     52.00    141.91      34.07    35682.00   1243.33

Notice how different the scales are: Population ranges from 3 to 35,682, while AveBedrms ranges from 0.33 to 34.07. This tells us we will need feature scaling before training most models.

Also notice the extreme maximum values for AveRooms (141.91) and AveOccup (1,243.33) — these are likely outliers or data quality issues.
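
These statistics come from pandas' describe(); the following is a minimal sketch of how the table above might be produced, though the exact formatting in eda.py may differ:

# Select a subset of columns and report count, mean, std, min, max, rounded for readability
cols = ["MedInc", "HouseAge", "AveRooms", "AveBedrms", "Population", "AveOccup"]
print(df[cols].describe().loc[["count", "mean", "std", "min", "max"]].round(2))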

Checking for Missing Values

Missing data can silently break your models or introduce bias. Always check:

1 === Missing Values ===
2 No missing values found.

This dataset is clean, but real-world data rarely is. We will practice handling missing values in the feature engineering section below.
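
The check itself is short; a sketch of what the script might run:

# Count missing values per column and report only the columns that have any
missing = df.isnull().sum()
missing = missing[missing > 0]
print(missing if not missing.empty else "No missing values found.")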

Correlation Analysis

Understanding which features correlate with the target helps guide feature selection and engineering:

1 === Correlation with MedHouseVal ===
2   MedInc                +0.6881
3   AveRooms              +0.1519
4   Latitude              -0.1442
5   HouseAge              +0.1056
6   AveBedrms             -0.0467
7   Longitude             -0.0460
8   Population            -0.0246
9   AveOccup              -0.0237

MedInc (median income) stands out with a correlation of +0.69 — by far the strongest predictor. This aligns with what we saw from the regression coefficients in the previous chapter. The other features have relatively weak linear correlations, suggesting that non-linear relationships or feature combinations might be more informative.
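
A sketch of how the list above can be computed, sorting features by the absolute value of their correlation with the target:

# Pearson correlation of every feature with the target, strongest first
corr = df.corr()["MedHouseVal"].drop("MedHouseVal")
for name in corr.abs().sort_values(ascending=False).index:
    print(f"  {name:<20} {corr[name]:+.4f}")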

Outlier Detection

The IQR (Interquartile Range) method flags values that fall more than 1.5 × IQR below Q1 or above Q3:

1 === Outlier Counts (IQR method) ===
2   MedInc                  681 outliers (3.3%)
3   AveRooms                511 outliers (2.5%)
4   AveBedrms              1424 outliers (6.9%)
5   Population             1196 outliers (5.8%)
6   AveOccup                711 outliers (3.4%)
7   MedHouseVal            1071 outliers (5.2%)

Nearly 7% of AveBedrms values are outliers. In practice, you would investigate whether these are genuine extreme values or data errors, and decide whether to clip, transform, or remove them depending on your use case.
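
A minimal sketch of the IQR rule applied column by column:

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for each column
for col in ["MedInc", "AveRooms", "AveBedrms", "Population", "AveOccup", "MedHouseVal"]:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    n_out = ((df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)).sum()
    print(f"  {col:<20} {n_out:>5} outliers ({n_out / len(df):.1%})")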

Feature Engineering

Feature engineering is where domain knowledge meets data science. By creating new features that better represent the underlying patterns, we can significantly improve model performance — sometimes more than choosing a fancier algorithm.

Creating New Features

We can derive meaningful features by combining existing ones:

1 df["RoomsPerHousehold"] = df["AveRooms"] / df["AveOccup"]
2 df["BedroomRatio"] = df["AveBedrms"] / df["AveRooms"]
3 df["PopPerHousehold"] = df["Population"] / df["HouseAge"]

  • RoomsPerHousehold: a proxy for house size relative to occupancy.
  • BedroomRatio: the fraction of rooms that are bedrooms (a measure of house layout).
  • PopPerHousehold: Population divided by HouseAge, a rough proxy for fast-growing areas (newer areas with large populations).

Encoding Categorical Features

Many real-world datasets contain categorical variables (e.g., “color”, “region”, “type”). Most ML algorithms require numerical inputs, so we need to encode these.

In our example, we create a categorical feature by binning latitude into California regions, then one-hot encode it:

1 df["Region"] = pd.cut(
2     df["Latitude"],
3     bins=[32, 35, 38, 42],
4     labels=["South", "Central", "North"]
5 )
6 
7 region_dummies = pd.get_dummies(df["Region"], prefix="Region", dtype=int)
8 df = pd.concat([df, region_dummies], axis=1)

Running this gives:

1 === Region Distribution ===
2 Region
3 South      11294
4 Central     6331
5 North       3015

One-hot encoding creates a separate binary column for each category (Region_South, Region_Central, Region_North). This avoids imposing a false numerical ordering on the categories.
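
To see what the encoding looks like, you can peek at a few rows of the new columns (a quick check, not part of the chapter's script):

# Each row has exactly one of the three binary region columns set to 1
print(df[["Region", "Region_South", "Region_Central", "Region_North"]].head())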

Handling Missing Data

Real datasets almost always have missing values. Common strategies include:

  • Drop rows: simple but loses data.
  • Fill with mean/median: preserves dataset size; median is more robust to outliers.
  • Fill with a model prediction: more sophisticated but adds complexity.

We demonstrate median imputation:

1 # Simulate 5% missing values
2 rng = np.random.default_rng(42)
3 mask = rng.random(len(df)) < 0.05
4 df.loc[mask, "PopPerHousehold"] = np.nan
5 
6 # Fill with median
7 median_val = df["PopPerHousehold"].median()
8 df["PopPerHousehold"] = df["PopPerHousehold"].fillna(median_val)

Running this gives:

1 === Handling Missing Data ===
2 Introduced 1028 missing values in PopPerHousehold
3 Filled with median: 41.833
4 Remaining missing: 0
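
In a real pipeline you would usually do the same with scikit-learn's SimpleImputer, fitting it on the training split only so the test set does not influence the median. A sketch, assuming a standard 80/20 split:

from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Split first, then learn the median from the training split only
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

imputer = SimpleImputer(strategy="median")
train_vals = imputer.fit_transform(train_df[["PopPerHousehold"]])
test_vals = imputer.transform(test_df[["PopPerHousehold"]])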

Feature Scaling

Features on vastly different scales cause problems for distance-based algorithms (K-NN, K-Means) and gradient-based optimizers. StandardScaler transforms each feature to have zero mean and unit variance:

1 === Feature Scaling Impact ===
2 Before scaling:
3   MedInc           mean=      3.87  std=      1.90
4   AveRooms         mean=      5.43  std=      2.47
5   Population       mean=   1425.48  std=   1132.46
6 After StandardScaler:
7   MedInc           mean=    0.0000  std=    1.0000
8   AveRooms         mean=    0.0000  std=    1.0000
9   Population       mean=   -0.0000  std=    1.0000

After scaling, all features are on the same footing. Remember to always fit the scaler on training data only and apply it to both train and test sets to prevent data leakage.
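
A sketch of that train-only pattern; the column subset and split parameters here are illustrative:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df[["MedInc", "AveRooms", "Population"]]
y = df["MedHouseVal"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)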

Measuring the Impact

The ultimate test of feature engineering is whether it improves model performance. We compare the R² of a Linear Regression model trained on the original 8 features with one trained on the expanded set of 14 features:

1 === Model Comparison: Original vs. Engineered Features ===
2   Original (8 features)            = 0.5758
3   Engineered (14 features)         = 0.6622

Our engineered features improved R² from 0.58 to 0.66, a relative improvement of about 15% in explained variance, using the exact same algorithm. This demonstrates why feature engineering is often more valuable than model selection for improving results.
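
A minimal sketch of such a comparison; the feature lists and split parameters below are assumptions, and the chapter's script may differ in detail:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

original = list(housing.feature_names)  # the 8 original features
engineered = original + [
    "RoomsPerHousehold", "BedroomRatio", "PopPerHousehold",
    "Region_South", "Region_Central", "Region_North",
]

# Train the same model on both feature sets and compare held-out R²
for label, cols in [("Original", original), ("Engineered", engineered)]:
    X_train, X_test, y_train, y_test = train_test_split(
        df[cols], df["MedHouseVal"], test_size=0.2, random_state=42
    )
    model = LinearRegression().fit(X_train, y_train)
    r2 = r2_score(y_test, model.predict(X_test))
    print(f"  {label} ({len(cols)} features) = {r2:.4f}")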

EDA and Feature Engineering Wrap-up

In this chapter we covered the essential data preparation skills that precede model training:

  • EDA helps you understand your data through summary statistics, correlation analysis, and outlier detection. Never skip this step.
  • Feature engineering transforms raw data into more informative inputs: creating derived features, encoding categories, handling missing values, and scaling.
  • The payoff is real: our engineered features produced a roughly 15% relative improvement in R² with zero algorithm changes.

These techniques apply to every machine learning project, whether you are using classic algorithms from scikit-learn or deep learning frameworks. In the next part of this book, we move into deep learning.