Data Analysis in Haskell with DataFrame
Haskell’s purity and strong static types lend themselves to expressive, reliable data manipulation. In this chapter, we will explore the dataframe library (version 1.0.0.0), a powerful tool for tabular data analysis, akin to Pandas in Python but built on Haskell’s robust type system.
We will walk through an example project that analyzes a California housing dataset. We will cover how to read and write CSV files, compute summary statistics, derive new columns, and perform complex filtering, sorting, and aggregations.
Setup and Boilerplate
Before working directly with tabular data, we initialize our project with the required extensions and imports. We use OverloadedStrings for cleaner string literals and TemplateHaskell to generate type-checked column references at compile time.
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TemplateHaskell #-}

module Main where

import qualified DataFrame as D
import qualified DataFrame.Functions as F
import DataFrame.Operators -- re-exports (|>), (.==), (.>=), etc.
import Data.Text (Text)
The dataframe library uses Template Haskell to read the CSV headers at compile time. This ensures type-safe column references without tedious manual definition:
-- Template Haskell: inspect the CSV at compile time and generate typed
-- column-reference bindings such as `total_rooms`, `households`, etc.
$(F.declareColumnsFromCsvFile "./data/housing.csv")
Loading Data and Summary Statistics
To get started, we read the CSV into a DataFrame. The library provides operations similar to those of familiar database and analysis tools:
main :: IO ()
main = do
  -- Load the dataset
  df <- D.readCsv "./data/housing.csv"

  putStrLn "First 5 rows:"
  print (D.take 5 df)
Once the dataframe is loaded, initial exploration is the natural next step. To get a quick grasp of each column's type, null counts, and distribution, use the summary functions:
  putStrLn "\n=== Column Descriptions ==="
  print (D.describeColumns df)

  putStrLn "\n=== Summary Statistics ==="
  print (D.summarize df)
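To build intuition for what a per-column summary reports, here is a plain-Haskell sketch of computing minimum, mean, and maximum over one numeric column. This is illustrative only, not the library's implementation:

```haskell
import Data.List (foldl')

-- Mean of a non-empty list of Doubles.
mean :: [Double] -> Double
mean xs = foldl' (+) 0 xs / fromIntegral (length xs)

-- (min, mean, max) for one numeric column.
summaryStats :: [Double] -> (Double, Double, Double)
summaryStats xs = (minimum xs, mean xs, maximum xs)

main :: IO ()
main = print (summaryStats [1.5, 3.0, 4.5])  -- (1.5,3.0,4.5)
```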
Deriving New Columns
Often, an analysis requires quantities that are not present in the original dataset. You can derive new columns by applying arithmetic to existing typed columns with the D.derive function, using the pipe operator |> to sequence transformations.
Let’s compute the number of rooms per household, population density per household, and bedrooms relative to total rooms:
  -- Derive new columns: rooms_per_household, population_per_household,
  -- bedrooms_per_room
  let enriched =
        df
          |> D.derive "rooms_per_household"
               (F.toDouble (F.col @Int "total_rooms") /
                F.toDouble (F.col @Int "households"))
          |> D.derive "population_per_household"
               (F.toDouble (F.col @Int "population") /
                F.toDouble (F.col @Int "households"))
          |> D.derive "bedrooms_per_room"
               (F.toDouble (F.col @Int "total_bedrooms") /
                F.toDouble (F.col @Int "total_rooms"))
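The |> pipe used above is ordinary reverse function application. The library exports its own version; for intuition, it can be defined in two lines and used to chain any functions:

```haskell
-- Reverse application: feed a value forward through a pipeline of functions.
(|>) :: a -> (a -> b) -> b
x |> f = f x
infixl 1 |>

main :: IO ()
main = print ([1 .. 10] |> filter even |> map (* 3) |> sum)  -- 90
```

Because |> is left-associative with low precedence, each stage reads top to bottom in source order, mirroring method chaining in Pandas.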
Filtering and Selection
To narrow the data to a subset of interest, D.filter keeps only the rows satisfying a predicate, while D.select keeps only the named columns. Let’s isolate the high-value properties along the coastline:
  -- Filter: keep high-value coastal properties
  let expensive =
        enriched
          |> D.filter (F.col @Double "median_house_value") (> 400000)
          |> D.filter (F.col @Text "ocean_proximity")
               (\p -> p `elem` ["NEAR BAY", "NEAR OCEAN", "<1H OCEAN"])

  -- Select relevant columns only
  let selected =
        expensive
          |> D.select ["ocean_proximity", "median_income", "median_house_value",
                       "rooms_per_household", "population_per_household"]
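Conceptually, each D.filter step behaves like Prelude's filter applied row by row. A hand-rolled analogue of the two coastal-filter steps over a plain list of records (purely for intuition, with a made-up District type):

```haskell
-- Hypothetical row type standing in for one dataframe row.
data District = District { value :: Double, proximity :: String }
  deriving Show

-- Keep rows that pass BOTH predicates, like the two chained D.filter calls.
expensiveCoastal :: [District] -> [District]
expensiveCoastal =
  filter (\d -> value d > 400000
             && proximity d `elem` ["NEAR BAY", "NEAR OCEAN", "<1H OCEAN"])

main :: IO ()
main = print (length (expensiveCoastal
  [ District 500000 "NEAR BAY"   -- passes both predicates
  , District 100000 "INLAND"     -- fails both
  , District 450000 "ISLAND"     -- fails the proximity check
  ]))  -- 1
```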
Sorting
Analyses often need rows in a particular order. To arrange the dataset by house value in descending order, use D.sortBy:
  -- Sort by median house value descending
  let sorted =
        enriched
          |> D.sortBy [D.Desc (F.col @Double "median_house_value")]
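Descending order is the same trick you would use on plain lists: sort on the key wrapped in Data.Ord's Down, which reverses the comparison. For intuition:

```haskell
import Data.List (sortOn)
import Data.Ord (Down (..))

-- Sort (name, value) pairs by value, highest first.
descendingByValue :: [(String, Double)] -> [(String, Double)]
descendingByValue = sortOn (Down . snd)

main :: IO ()
main = print (descendingByValue [("a", 1), ("b", 3), ("c", 2)])
-- [("b",3.0),("c",2.0),("a",1.0)]
```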
Grouping and Aggregation
A common analytical pattern is to group rows by the distinct values of a categorical column and compute aggregate statistics for each group.
For instance, we can group districts by their proximity to the ocean and compute aggregates over each group. Notice we name the computed columns using F.as.
  -- Group by ocean proximity, aggregate metrics
  let byProximity =
        enriched
          |> D.groupBy ["ocean_proximity"]
          |> D.aggregate
               [ F.count @Double (F.col @Double "median_house_value")
                   `F.as` "num_districts"
               , F.mean @Double (F.col @Double "median_house_value")
                   `F.as` "avg_house_value"
               , F.mean @Double (F.col @Double "median_income")
                   `F.as` "avg_income"
               , F.mean @Double (F.col @Double "rooms_per_household")
                   `F.as` "avg_rooms_per_hh"
               ]
          |> D.sortBy [D.Desc (F.col @Double "avg_house_value")]
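Group-then-aggregate is the classic split/apply/combine pattern. A plain-Haskell sketch of what a grouped mean computes, using Data.Map to accumulate a running (sum, count) per key (illustrative only, not the library's implementation):

```haskell
import qualified Data.Map.Strict as Map

-- Average value per category: accumulate (sum, count) per key, then divide.
avgByGroup :: [(String, Double)] -> Map.Map String Double
avgByGroup rows =
  fmap (\(s, n) -> s / fromIntegral n) $
    Map.fromListWith (\(s1, n1) (s2, n2) -> (s1 + s2, n1 + n2 :: Int))
      [ (key, (v, 1)) | (key, v) <- rows ]

main :: IO ()
main = print (avgByGroup [("INLAND", 100), ("INLAND", 300), ("NEAR BAY", 200)])
-- fromList [("INLAND",200.0),("NEAR BAY",200.0)]
```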
Writing Data to Files
Finally, we use D.writeCsv to save the enriched dataframe back to disk as a CSV file ready for sharing:
  -- Write enriched dataset to CSV
  D.writeCsv "./data/housing_enriched.csv" enriched
  putStrLn "\nEnriched dataset written to ./data/housing_enriched.csv"
The dataframe library pairs a pipe-forward functional style with static typing, letting us perform common Pandas-like analyses fluently without leaving the type safety of Haskell behind.
Here is partial output from running this example (output wider than 100 columns is not shown; run the example to see the full output):
$ cabal run dataframe_example
HEAD is now at 5790ef4 Fix Functions module compilation in ghc 9.10 (#194)
Configuration is affected by the following files:
- cabal.project

=== Column Descriptions ===
---------------------------------------------------------------
Column Name        | # Non-null Values | # Null Values | Type
-------------------|-------------------|---------------|-------
Text               | Int               | Int           | Text
-------------------|-------------------|---------------|-------
ocean_proximity    | 50                | 0             | Text
median_house_value | 50                | 0             | Double
median_income      | 50                | 0             | Double
households         | 50                | 0             | Int
population         | 50                | 0             | Int
total_bedrooms     | 50                | 0             | Int
total_rooms        | 50                | 0             | Int
housing_median_age | 50                | 0             | Int
latitude           | 50                | 0             | Double
longitude          | 50                | 0             | Double

----------------------------------------------------------------------

=== Average House Value by Ocean Proximity ===
--------------------------------------------------------------------------------------------
ocean_proximity | num_districts | avg_house_value    | avg_income        | avg_rooms_per_hh
----------------|---------------|--------------------|-------------------|------------------
Text            | Int           | Double             | Double            | Double
----------------|---------------|--------------------|-------------------|------------------
<1H OCEAN       | 10            | 484000.0           | 5.88              | 5.891623561548036
BAY             | 5             | 450000.0           | 5.5               | 5.679258351578158
NEAR OCEAN      | 10            | 333000.0           | 4.3               | 5.738715647098
NEAR BAY        | 10            | 314480.0           | 4.99608           | 5.736540211611564
INLAND          | 15            | 212333.33333333334 | 3.266666666666667 | 5.655433028354371

----------------------------------------------------------------------

=== Inland Districts with Median Income >= $40k ===
-------------------------------------------------------------------------------
longitude | latitude | median_income | median_house_value | rooms_per_household
----------|----------|---------------|--------------------|--------------------
Double    | Double   | Double        | Double             | Double
----------|----------|---------------|--------------------|--------------------
-122.0    | 37.4     | 4.6           | 330000.0           | 5.6521739130434785
-119.8    | 36.65    | 4.5           | 240000.0           | 5.6716417910447765
-121.95   | 37.38    | 4.2           | 310000.0           | 5.641025641025641
-119.75   | 36.62    | 4.0           | 210000.0           | 5.660377358490566

----------------------------------------------------------------------

Enriched dataset written to ./data/housing_enriched.csv
$