Data Analysis in Haskell with DataFrame
Haskell’s purity and strong static types lend themselves to expressive, reliable data manipulation. In this chapter, we will explore the dataframe library (version 1.0.0.0), a powerful tool for tabular data analysis, akin to Pandas in Python but built on Haskell’s robust type system.
We will walk through an example project that analyzes a California housing dataset. We will cover how to read and write CSV files, compute summary statistics, derive new columns, and perform complex filtering, sorting, and aggregations.
Setup and Boilerplate
Before working directly with tabular data, we initialize our project with the required extensions and imports. We use OverloadedStrings for cleaner string literals and TemplateHaskell to generate type-checked column references at compile time.
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TemplateHaskell #-}

module Main where

import qualified DataFrame as D
import qualified DataFrame.Functions as F
import DataFrame.Operators -- re-exports (|>), (.==), (.>=), etc.
import Data.Text (Text)
The dataframe library uses Template Haskell to read the CSV headers at compile time. This ensures type-safe column references without tedious manual definition:
-- Template Haskell: inspect the CSV at compile time and generate typed
-- column-reference bindings such as `total_rooms`, `households`, etc.
$(F.declareColumnsFromCsvFile "./data/housing.csv")
Loading Data and Summary Statistics
To get started, we read the CSV into a DataFrame. The library provides operations similar to those of familiar database and analysis tools:
main :: IO ()
main = do
  -- Load the dataset
  df <- D.readCsv "./data/housing.csv"

  putStrLn "First 5 rows:"
  print (D.take 5 df)
Once the dataframe is loaded, initial exploration is the natural next step. To get a quick grasp of each column's type, null counts, and distribution, use the summary functions:
  putStrLn "\n=== Column Descriptions ==="
  print (D.describeColumns df)

  putStrLn "\n=== Summary Statistics ==="
  print (D.summarize df)
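To build intuition for what a per-column summary reports, here is a plain-Haskell sketch of computing minimum, mean, and maximum over one numeric column. This is illustrative only, not the library's implementation:

```haskell
import Data.List (foldl')

-- Mean of a non-empty list of Doubles.
mean :: [Double] -> Double
mean xs = foldl' (+) 0 xs / fromIntegral (length xs)

-- (min, mean, max) for one numeric column.
summaryStats :: [Double] -> (Double, Double, Double)
summaryStats xs = (minimum xs, mean xs, maximum xs)

main :: IO ()
main = print (summaryStats [1.5, 3.0, 4.5])  -- (1.5,3.0,4.5)
```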
Deriving New Columns
Often, an analysis requires quantities that are not present in the original dataset. You can derive new columns by applying arithmetic to existing typed columns with the D.derive function, using the pipe operator |> to sequence transformations.
Let’s compute the number of rooms per household, population density per household, and bedrooms relative to total rooms:
  -- Derive new columns: rooms_per_household, population_per_household,
  -- bedrooms_per_room
  let enriched =
        df
          |> D.derive "rooms_per_household"
               (F.toDouble (F.col @Int "total_rooms") /
                F.toDouble (F.col @Int "households"))
          |> D.derive "population_per_household"
               (F.toDouble (F.col @Int "population") /
                F.toDouble (F.col @Int "households"))
          |> D.derive "bedrooms_per_room"
               (F.toDouble (F.col @Int "total_bedrooms") /
                F.toDouble (F.col @Int "total_rooms"))
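The |> pipe used above is ordinary reverse function application. The library exports its own version; for intuition, it can be defined in two lines and used to chain any functions:

```haskell
-- Reverse application: feed a value forward through a pipeline of functions.
(|>) :: a -> (a -> b) -> b
x |> f = f x
infixl 1 |>

main :: IO ()
main = print ([1 .. 10] |> filter even |> map (* 3) |> sum)  -- 90
```

Because |> is left-associative with low precedence, each stage reads top to bottom in source order, mirroring method chaining in Pandas.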
Filtering and Selection
To narrow the data to a subset of interest, D.filter keeps only the rows satisfying a predicate, while D.select keeps only the named columns. Let’s isolate the high-value properties along the coastline:
  -- Filter: keep high-value coastal properties
  let expensive =
        enriched
          |> D.filter (F.col @Double "median_house_value") (> 400000)
          |> D.filter (F.col @Text "ocean_proximity")
               (\p -> p `elem` ["NEAR BAY", "NEAR OCEAN", "<1H OCEAN"])

  -- Select relevant columns only
  let selected =
        expensive
          |> D.select ["ocean_proximity", "median_income", "median_house_value",
                       "rooms_per_household", "population_per_household"]
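Conceptually, each D.filter step behaves like Prelude's filter applied row by row. A hand-rolled analogue of the two coastal-filter steps over a plain list of records (purely for intuition, with a made-up District type):

```haskell
-- Hypothetical row type standing in for one dataframe row.
data District = District { value :: Double, proximity :: String }
  deriving Show

-- Keep rows that pass BOTH predicates, like the two chained D.filter calls.
expensiveCoastal :: [District] -> [District]
expensiveCoastal =
  filter (\d -> value d > 400000
             && proximity d `elem` ["NEAR BAY", "NEAR OCEAN", "<1H OCEAN"])

main :: IO ()
main = print (length (expensiveCoastal
  [ District 500000 "NEAR BAY"   -- passes both predicates
  , District 100000 "INLAND"     -- fails both
  , District 450000 "ISLAND"     -- fails the proximity check
  ]))  -- 1
```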
Sorting
Analyses often need rows in a particular order. To arrange the dataset by house value in descending order, use D.sortBy:
  -- Sort by median house value descending
  let sorted =
        enriched
          |> D.sortBy [D.Desc (F.col @Double "median_house_value")]
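Descending order is the same trick you would use on plain lists: sort on the key wrapped in Data.Ord's Down, which reverses the comparison. For intuition:

```haskell
import Data.List (sortOn)
import Data.Ord (Down (..))

-- Sort (name, value) pairs by value, highest first.
descendingByValue :: [(String, Double)] -> [(String, Double)]
descendingByValue = sortOn (Down . snd)

main :: IO ()
main = print (descendingByValue [("a", 1), ("b", 3), ("c", 2)])
-- [("b",3.0),("c",2.0),("a",1.0)]
```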
Grouping and Aggregation
A common analytical pattern is to group rows by the distinct values of a categorical column and compute aggregate statistics for each group.
For instance, we can group districts by their proximity to the ocean and compute aggregates over each group. Notice we name the computed columns using F.as.
  -- Group by ocean proximity, aggregate metrics
  let byProximity =
        enriched
          |> D.groupBy ["ocean_proximity"]
          |> D.aggregate
               [ F.count @Double (F.col @Double "median_house_value")
                   `F.as` "num_districts"
               , F.mean @Double (F.col @Double "median_house_value")
                   `F.as` "avg_house_value"
               , F.mean @Double (F.col @Double "median_income")
                   `F.as` "avg_income"
               , F.mean @Double (F.col @Double "rooms_per_household")
                   `F.as` "avg_rooms_per_hh"
               ]
          |> D.sortBy [D.Desc (F.col @Double "avg_house_value")]
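Group-then-aggregate is the classic split/apply/combine pattern. A plain-Haskell sketch of what a grouped mean computes, using Data.Map to accumulate a running (sum, count) per key (illustrative only, not the library's implementation):

```haskell
import qualified Data.Map.Strict as Map

-- Average value per category: accumulate (sum, count) per key, then divide.
avgByGroup :: [(String, Double)] -> Map.Map String Double
avgByGroup rows =
  fmap (\(s, n) -> s / fromIntegral n) $
    Map.fromListWith (\(s1, n1) (s2, n2) -> (s1 + s2, n1 + n2 :: Int))
      [ (key, (v, 1)) | (key, v) <- rows ]

main :: IO ()
main = print (avgByGroup [("INLAND", 100), ("INLAND", 300), ("NEAR BAY", 200)])
-- fromList [("INLAND",200.0),("NEAR BAY",200.0)]
```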
Writing Data to Files
Finally, we use D.writeCsv to save the enriched dataframe back to disk as a CSV file ready for sharing:
  -- Write enriched dataset to CSV
  D.writeCsv "./data/housing_enriched.csv" enriched
  putStrLn "\nEnriched dataset written to ./data/housing_enriched.csv"
The dataframe library pairs a pipe-forward functional style with static typing, letting us perform common Pandas-like analyses fluently without leaving the type safety of Haskell behind.
Here is partial output from running this example (output wider than 100 columns is not shown; run the example to see the full output):
$ cabal run dataframe_example
HEAD is now at 5790ef4 Fix Functions module compilation in ghc 9.10 (#194)
Configuration is affected by the following files:
- cabal.project

=== Column Descriptions ===
---------------------------------------------------------------
Column Name        | # Non-null Values | # Null Values | Type
-------------------|-------------------|---------------|-------
Text               | Int               | Int           | Text
-------------------|-------------------|---------------|-------
ocean_proximity    | 50                | 0             | Text
median_house_value | 50                | 0             | Double
median_income      | 50                | 0             | Double
households         | 50                | 0             | Int
population         | 50                | 0             | Int
total_bedrooms     | 50                | 0             | Int
total_rooms        | 50                | 0             | Int
housing_median_age | 50                | 0             | Int
latitude           | 50                | 0             | Double
longitude          | 50                | 0             | Double

----------------------------------------------------------------------

=== Average House Value by Ocean Proximity ===
--------------------------------------------------------------------------------------------
ocean_proximity | num_districts | avg_house_value    | avg_income        | avg_rooms_per_hh
----------------|---------------|--------------------|-------------------|------------------
Text            | Int           | Double             | Double            | Double
----------------|---------------|--------------------|-------------------|------------------
<1H OCEAN       | 10            | 484000.0           | 5.88              | 5.891623561548036
BAY             | 5             | 450000.0           | 5.5               | 5.679258351578158
NEAR OCEAN      | 10            | 333000.0           | 4.3               | 5.738715647098
NEAR BAY        | 10            | 314480.0           | 4.99608           | 5.736540211611564
INLAND          | 15            | 212333.33333333334 | 3.266666666666667 | 5.655433028354371

----------------------------------------------------------------------

=== Inland Districts with Median Income >= $40k ===
-------------------------------------------------------------------------------
longitude | latitude | median_income | median_house_value | rooms_per_household
----------|----------|---------------|--------------------|--------------------
Double    | Double   | Double        | Double             | Double
----------|----------|---------------|--------------------|--------------------
-122.0    | 37.4     | 4.6           | 330000.0           | 5.6521739130434785
-119.8    | 36.65    | 4.5           | 240000.0           | 5.6716417910447765
-121.95   | 37.38    | 4.2           | 310000.0           | 5.641025641025641
-119.75   | 36.62    | 4.0           | 210000.0           | 5.660377358490566

----------------------------------------------------------------------

Enriched dataset written to ./data/housing_enriched.csv
$