Data Analysis in Haskell with DataFrame
Haskell’s mathematical purity allows for highly performant and expressive data manipulation algorithms. In this chapter, we will explore the dataframe library (version 1.0.0.0), a powerful tool for tabular data analysis, akin to Pandas in Python but built on Haskell’s robust type system.
We will walk through an example project that analyzes a California housing dataset. We will cover how to read and write CSV files, compute summary statistics, derive new columns, and perform complex filtering, sorting, and aggregations.
Setup and Boilerplate
Before working directly with tabular data, we initialize our project with the required extensions and imports. We make use of OverloadedStrings for cleaner string literals and ScopedTypeVariables for explicit type applications on column references.
1 {-# LANGUAGE OverloadedStrings #-}
2 {-# LANGUAGE ScopedTypeVariables #-}
3
4 module Main where
5
6 import qualified DataFrame as D
7 import qualified DataFrame.Functions as F
8 import DataFrame.Operators -- re-exports (|>), (.==), (.>=), etc.
9 import Data.Text (Text)
10 import Control.Exception (IOException, try)
11 import System.Exit (exitFailure)
All column references in this example use string-based F.col @T "name" syntax. The dataframe library also supports a Template Haskell splice (F.declareColumnsFromCsvFile) that generates typed column-reference bindings at compile time; if you want that extra compile-time column safety, add {-# LANGUAGE TemplateHaskell #-} and uncomment the splice.

Loading Data and Summary Statistics
To get started, we read the CSV into a DataFrame layout. The dataframe library affords commands similar to familiar database and analysis tools. We wrap the load in try/catch so we get a clear error message if the CSV is missing:
1 main :: IO ()
2 main = do
3 -- Wrap in try/catch so we get a clear error message if the CSV is missing.
4 result <- try (D.readCsv "./data/housing.csv") :: IO (Either IOException D.DataFrame)
5 df <- case result of
6 Left err -> do
7 putStrLn $ "ERROR: Could not load CSV file: " <> show err
8 putStrLn "Make sure you are running from the dataframe_example project root."
9 exitFailure
10 Right v -> pure v
11
12 putStrLn "\n=== California Housing Dataset ==="
13 putStrLn "First 5 rows:"
14 print (D.take 5 df)
Once the dataframe is loaded, initial exploration is crucial. To quickly get a grasp of your data’s distribution and missing values, use summary capabilities:
1 putStrLn "\n=== Column Descriptions ==="
2 print (D.describeColumns df)
3
4 putStrLn "\n=== Summary Statistics ==="
5 print (D.summarize df)
Deriving New Columns
Often, your analysis requires data metrics not present in the original dataset. You can easily derive new characteristics by applying arithmetic on existing typed columns using the D.derive function and a custom pipe operator |> to sequence data transformations.
Let’s compute the number of rooms per household, population density per household, and bedrooms relative to total rooms:
1 -- rooms_per_household, bedrooms_per_room, population_per_household
2 let enriched =
3 df
4 |> D.derive "rooms_per_household"
5 (F.toDouble (F.col @Int "total_rooms") / F.toDouble (F.col @Int "households"))
6 |> D.derive "population_per_household"
7 (F.toDouble (F.col @Int "population") / F.toDouble (F.col @Int "households"))
8 |> D.derive "bedrooms_per_room"
9 (F.toDouble (F.col @Int "total_bedrooms") / F.toDouble (F.col @Int "total_rooms"))
Filtering and Selection
For narrower views of data domains, the D.filter and D.select combinators slice via row bounds and column scopes. Let’s isolate the high-value properties along the coastline:
1 -- Filter: keep high-value coastal properties
2 let expensive =
3 enriched
4 |> D.filter (F.col @Double "median_house_value") (> 400000)
5 |> D.filter (F.col @Text "ocean_proximity") (\p -> p `elem` ["NEAR BAY", "NEAR OCEAN", "<1H OCEAN"])
6
7 -- Select relevant columns only
8 let selected =
9 expensive
10 |> D.select ["ocean_proximity", "median_income", "median_house_value",
11 "rooms_per_household", "population_per_household"]
Sorting
Data representation often necessitates ordering by weight or date. To organize our dataset sequentially descending by value, D.sortBy applies the required transformation:
1 -- Sort by median house value descending
2 let sorted =
3 enriched
4 |> D.sortBy [D.Desc (F.col @Double "median_house_value")]
Grouping and Aggregation
Analyzing wide margins requires grouping data under unique traits and gathering statistical clusters.
For instance, we can group attributes by their proximity to the ocean and compute aggregate values. Notice we leverage aliases using F.as for these computed categories.
1 -- Group by ocean proximity, aggregate
2 let byProximity =
3 enriched
4 |> D.groupBy ["ocean_proximity"]
5 |> D.aggregate
6 [ F.count @Double (F.col @Double "median_house_value") `F.as` "num_districts"
7 , F.mean @Double (F.col @Double "median_house_value") `F.as` "avg_house_value"
8 , F.mean @Double (F.col @Double "median_income") `F.as` "avg_income"
9 , F.mean @Double (F.col @Double "rooms_per_household")`F.as` "avg_rooms_per_hh"
10 ]
11 |> D.sortBy [D.Desc (F.col @Double "avg_house_value")]
Writing Data to Files
Finally, we use writeCsv to save the enriched dataframe into a fresh text representation ready for broader sharing.
1 -- Write enriched dataset to CSV
2 D.writeCsv "./data/housing_enriched.csv" enriched
3 putStrLn "\nEnriched dataset written to ./data/housing_enriched.csv"
The dataframe structure showcases Haskell’s elegant pipe-forward functional capability and statically typed robustness, letting us perform common Pandas-like analyses fluently, without leaving the type safety of Haskell behind.
Here is partial output from running this example (output that spans > 100 columns not show - run the example to see the full output):
1 $ cabal run dataframe_example
2 HEAD is now at 5790ef4 Fix Functions module compilation in ghc 9.10 (#194)
3 Configuration is affected by the following files:
4 - cabal.project
5
6 === Column Descriptions ===
7 ---------------------------------------------------------------
8 Column Name | # Non-null Values | # Null Values | Type
9 -------------------|-------------------|---------------|-------
10 Text | Int | Int | Text
11 -------------------|-------------------|---------------|-------
12 ocean_proximity | 50 | 0 | Text
13 median_house_value | 50 | 0 | Double
14 median_income | 50 | 0 | Double
15 households | 50 | 0 | Int
16 population | 50 | 0 | Int
17 total_bedrooms | 50 | 0 | Int
18 total_rooms | 50 | 0 | Int
19 housing_median_age | 50 | 0 | Int
20 latitude | 50 | 0 | Double
21 longitude | 50 | 0 | Double
22
23 ----------------------------------------------------------------------
24
25 === Average House Value by Ocean Proximity ===
26 --------------------------------------------------------------------------------------------
27 ocean_proximity | num_districts | avg_house_value | avg_income | avg_rooms_per_hh
28 ----------------|---------------|--------------------|-------------------|------------------
29 Text | Int | Double | Double | Double
30 ----------------|---------------|--------------------|-------------------|------------------
31 <1H OCEAN | 10 | 484000.0 | 5.88 | 5.891623561548036
32 BAY | 5 | 450000.0 | 5.5 | 5.679258351578158
33 NEAR OCEAN | 10 | 333000.0 | 4.3 | 5.738715647098
34 NEAR BAY | 10 | 314480.0 | 4.99608 | 5.736540211611564
35 INLAND | 15 | 212333.33333333334 | 3.266666666666667 | 5.655433028354371
36
37 ----------------------------------------------------------------------
38
39 === Inland Districts with Median Income >= $40k ===
40 -------------------------------------------------------------------------------
41 longitude | latitude | median_income | median_house_value | rooms_per_household
42 ----------|----------|---------------|--------------------|--------------------
43 Double | Double | Double | Double | Double
44 ----------|----------|---------------|--------------------|--------------------
45 -122.0 | 37.4 | 4.6 | 330000.0 | 5.6521739130434785
46 -119.8 | 36.65 | 4.5 | 240000.0 | 5.6716417910447765
47 -121.95 | 37.38 | 4.2 | 310000.0 | 5.641025641025641
48 -119.75 | 36.62 | 4.0 | 210000.0 | 5.660377358490566
49
50 ----------------------------------------------------------------------
51
52 Enriched dataset written to ./data/housing_enriched.csv
53 $