Anomaly Detection
Anomaly detection is a technique for finding data points that don’t fit the normal pattern. Unlike standard classification where we have roughly equal numbers of examples for each class, anomaly detection is designed for problems where “normal” examples vastly outnumber “abnormal” ones. Think of credit card fraud (millions of legitimate transactions vs. a handful of fraudulent ones), network intrusion detection, manufacturing defect spotting, or medical diagnosis.
The core idea is simple: learn what “normal” looks like, then flag anything that deviates too far from that model.
In this chapter we build a Gaussian anomaly detector from scratch in Common Lisp and apply it to the Wisconsin Breast Cancer dataset. The dataset contains 647 samples of cell measurements, each labeled as either benign (normal) or malignant (anomalous). Our goal is to train a model on mostly normal data and then use it to identify anomalies — samples that the model considers unusual.
The source code for this chapter is in the directory src/anomaly_detection.
What Is a Gaussian Distribution?
Before diving into the code, let’s understand the key idea behind our detector. A Gaussian distribution (also called a “bell curve” or “normal distribution”) describes how values of a measurement tend to cluster around an average value. Most measurements fall close to the average, and very few fall far away.
Every Gaussian distribution is characterized by two numbers:
- Mean (μ) — the average value. This is the center of the bell curve.
- Variance (σ²) — how spread out the values are. A small variance means values cluster tightly around the mean; a large variance means they are more spread out.
The Gaussian probability density function (PDF) tells us how likely a particular value is, given our distribution:
p(x) = (1 / (√(2π) · σ)) · exp(-(x - μ)² / (2σ²))
If we compute p(x) and get a high number, the value x fits well with our model of “normal.” If p(x) is very low, then x is far from the average — it’s an anomaly.
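To make the formula concrete, here is a small self-contained sketch (the pdf helper and its inputs are illustrative, not part of the chapter's code):

```lisp
;; Illustrative one-dimensional Gaussian PDF (hypothetical helper,
;; not part of the chapter's detector code).
(defun pdf (x mu sigma-sq)
  "Probability density of X under a Gaussian with mean MU and variance SIGMA-SQ."
  (let ((sigma (sqrt sigma-sq)))
    (* (/ 1.0d0 (* (sqrt (* 2.0d0 pi)) sigma))
       (exp (/ (- (expt (- x mu) 2))
               (* 2.0d0 sigma-sq))))))

(pdf 5.0d0 5.0d0 1.0d0)  ; at the mean: ~0.3989, the peak of the bell curve
(pdf 8.0d0 5.0d0 1.0d0)  ; three standard deviations out: ~0.0044
```

A value three standard deviations from the mean is roughly ninety times less dense than one at the mean, which is exactly the contrast the detector exploits.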
How the Detector Works
Our detector works in four steps:
1. Split the data into three groups: training (60%), cross-validation (28%), and test (12%). The training set is mostly normal examples. This mirrors how you would use the system in practice: train on data you believe to be normal, then test on new data.
2. Fit a Gaussian model to the training data. For each feature (measurement), we compute the mean μ and variance σ². This tells us what “normal” looks like for each feature independently.
3. Tune the epsilon threshold. For each data sample, we compute its average probability across all features. If this probability is below a threshold called epsilon, we classify it as an anomaly. But what value of epsilon works best? We try 200 different values and pick the one that makes the fewest mistakes on the cross-validation set. This is called hyperparameter tuning.
4. Evaluate on test data. With the best epsilon selected, we run the detector on previously unseen test data and report precision, recall, and the F1 score.
The Wisconsin Breast Cancer Dataset
The dataset contains 647 samples with 9 numeric features each (such as “Clump Thickness,” “Cell Size Uniformity,” and “Bare Nuclei”) scored on a 1–10 scale, plus a target label: 2 for benign and 4 for malignant. In our preprocessing we remap the target to 0 (benign/normal) and 1 (malignant/anomaly).
The raw features are integer-valued and skewed, which is not ideal for a Gaussian model. We apply two preprocessing steps (matching the original Java implementation):
- Log transform — applying log(x + 1.2) to each feature makes the distribution more bell-shaped.
- Min-max normalization — scaling each row’s features to the range [0, 1] so that all features contribute equally.
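A quick self-contained illustration of these two steps (separate from the chapter's preprocessing code, which appears later; the three input values are made up):

```lisp
;; Illustrative only: log-transform three values, then min-max
;; normalize them so the smallest maps to 0.0 and the largest to 1.0.
(let* ((row #(0.1d0 0.5d0 1.0d0))
       (logged (map 'vector (lambda (x) (log (+ x 1.2d0))) row))
       (mn (reduce #'min logged))
       (mx (reduce #'max logged))
       (range (- mx mn)))
  (map 'vector (lambda (x) (/ (- x mn) range)) logged))
;; first element => 0.0d0, last element => 1.0d0
```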
Project Structure
The code is split into two files:
- anomaly-detection.lisp — the core detection algorithm (package, data structures, training, evaluation)
- wisconsin.lisp — CSV loading, preprocessing, histograms, and the run-wisconsin entry point
Walking Through the Code
The Detector Data Structure
We use a defstruct to hold the model state:
```lisp
(defstruct detector
  "Gaussian anomaly detector.
Fits per-feature mean and variance, then tunes an
epsilon threshold via cross-validation."
  (num-features 0 :type fixnum)
  (mu #() :type simple-vector)
  (sigma-sq #() :type simple-vector)
  (best-eps 0.02d0 :type double-float)
  (training #() :type simple-vector)
  (cross-validation #() :type simple-vector)
  (testing #() :type simple-vector))
```
The mu slot holds the per-feature means, sigma-sq holds the per-feature variances, and best-eps is the learned epsilon threshold. The three data splits — training, cross-validation, and testing — are stored so we can use them during the training process.
Computing Mean and Variance
The compute-mu function calculates the average value of each feature across all training examples:
```lisp
(defun compute-mu (examples num-features)
  "Compute per-feature mean over EXAMPLES."
  (let ((n (length examples))
        (mu (make-array num-features
                        :initial-element 0.0d0)))
    (when (zerop n)
      (return-from compute-mu mu))
    (loop for ex across examples do
      (loop for f below num-features do
        (incf (aref mu f) (aref ex f))))
    (loop for f below num-features do
      (setf (aref mu f) (/ (aref mu f) n)))
    mu))
```
This is straightforward: sum all values for each feature, then divide by the number of examples. The inner loop iterates over features and accumulates into the mu array.
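A quick sanity check, assuming compute-mu from above is loaded: with two examples, each entry of the result is simply that column's average.

```lisp
;; Assumes COMPUTE-MU from above is loaded.
(compute-mu (vector #(1.0d0 4.0d0)
                    #(3.0d0 8.0d0))
            2)
;; => #(2.0d0 6.0d0)
```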
The compute-sigma-sq function calculates the variance, that is, how much each feature’s values deviate from the mean:
```lisp
(defun compute-sigma-sq (examples mu num-features)
  "Compute per-feature variance over EXAMPLES.
Skips the last column (target label)."
  (let ((n (length examples))
        (sigma-sq (make-array num-features
                              :initial-element 0.0d0)))
    (loop for f below (1- num-features) do
      (let ((sum 0.0d0))
        (loop for ex across examples do
          (let ((diff (- (aref ex f) (aref mu f))))
            (incf sum (* diff diff))))
        (setf (aref sigma-sq f)
              (max (/ sum n) 1.0d-10))))
    sigma-sq))
```
For each feature, we compute the sum of squared differences from the mean, then divide by n. The (max ... 1.0d-10) guard prevents division by zero later when a feature has identical values across all examples. Notice that we skip the last column ((1- num-features)) because that column is the target label, not a measurement.
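Continuing the sanity check (again assuming the function above is loaded): for feature values 1 and 3 with mean 2, the variance is ((1 - 2)² + (3 - 2)²) / 2 = 1. The label column is skipped, so its slot keeps the 0.0 initial value.

```lisp
;; Assumes COMPUTE-SIGMA-SQ from above is loaded. With num-features = 2,
;; only feature 0 gets a variance; column 1 is treated as the label.
(compute-sigma-sq (vector #(1.0d0 0.0d0)
                          #(3.0d0 1.0d0))
                  #(2.0d0 0.5d0)
                  2)
;; => #(1.0d0 0.0d0)
```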
The Gaussian PDF
The heart of the algorithm is the gaussian-p function. For a given data sample, it computes the Gaussian probability for each feature and returns their average:
```lisp
(defun gaussian-p (x mu sigma-sq num-features)
  "Compute average Gaussian PDF p(x) across features."
  (let ((sum 0.0d0))
    (loop for f below (1- num-features) do
      (let* ((s2 (aref sigma-sq f))
             (sigma (sqrt s2))
             (diff (- (aref x f) (aref mu f)))
             (exponent (/ (- (* diff diff))
                          (* 2.0d0 s2))))
        (incf sum (* (/ 1.0d0 (* +sqrt-2-pi+ sigma))
                     (exp exponent)))))
    (/ sum num-features)))
```
Let’s trace through what happens for one feature:
- We retrieve the variance s2 and compute the standard deviation sigma = √s2.
- We compute diff — how far this sample’s value is from the mean.
- The exponent is -(diff²) / (2 · σ²). When diff is large (the value is far from the mean), this exponent becomes a large negative number, making exp(exponent) very small.
- We multiply by the normalization constant 1 / (√(2π) · σ) to get a proper probability density.
- We sum these per-feature probabilities and divide by the number of features to get an average.
A normal sample will have values close to the mean across most features, producing a relatively high average probability. An anomalous sample will have unusual values in several features, producing a low average probability.
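To see this numerically, compare a sample at the mean with one four standard deviations away (assuming gaussian-p and the +sqrt-2-pi+ constant from the chapter are loaded; the mu and sigma-sq values here are made up for illustration). Note that a density can exceed 1.0 when the variance is small; only the relative ordering matters for thresholding.

```lisp
;; Assumes GAUSSIAN-P and +SQRT-2-PI+ from the chapter are loaded.
;; Two features plus a label column; mu and sigma-sq are illustrative.
(let ((mu #(0.5d0 0.5d0 0.0d0))
      (sigma-sq #(0.01d0 0.01d0 0.01d0)))
  (list (gaussian-p #(0.5d0 0.5d0 0.0d0) mu sigma-sq 3)    ; at the mean
        (gaussian-p #(0.9d0 0.1d0 1.0d0) mu sigma-sq 3)))  ; 4 sigma away
;; the first score is roughly 2.7; the second is under 0.001
```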
Training: Finding the Best Epsilon
The train function searches for the best epsilon threshold:
```lisp
(defun train (det)
  "Train the detector by sweeping epsilon values on
cross-validation data, then evaluate on test data."
  (sb-int:with-float-traps-masked
      (:invalid :overflow :divide-by-zero)
    (let ((best-err 1d10)
          (best-e 0.001d0))
      (loop for i below 200
            for eps = (+ 0.001d0 (* 0.005d0 i))
            for err = (train-helper det eps)
            do (when (<= err best-err)
                 (setf best-err err
                       best-e eps)))
      (format t "~%**** Best epsilon = ~,4F~%"
              best-e)
      (setf (detector-best-eps det) best-e)
      (train-helper det best-e)
      (test-model det best-e)
      det)))
```
This tries 200 epsilon values from 0.001 to 0.996 in steps of 0.005. For each value, train-helper counts how many cross-validation examples are misclassified. The epsilon with the fewest errors wins. The <= comparison (rather than <) ensures that when multiple epsilon values tie, we pick the highest one — this gives the decision boundary more margin and tends to generalize better.
Note the sb-int:with-float-traps-masked wrapper. SBCL, unlike Java, traps floating-point exceptions by default. The log transform and Gaussian PDF computations can occasionally produce extreme values (especially during the first few epsilon trials), so we mask these traps to match Java’s behavior of silently propagating special float values.
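The chapter does not list train-helper itself. Based on the description above, a plausible sketch of its error-counting role might look like the following (train-helper-sketch is a hypothetical name; the real implementation may differ):

```lisp
;; Hypothetical sketch of TRAIN-HELPER's error counting: for a candidate
;; EPS, count the cross-validation samples whose prediction (p < eps
;; means "anomaly") disagrees with the 0/1 label in the last column.
;; Assumes GAUSSIAN-P and the DETECTOR struct are loaded.
(defun train-helper-sketch (det eps)
  (let ((nf (detector-num-features det))
        (errors 0))
    (loop for ex across (detector-cross-validation det)
          do (let* ((p (gaussian-p ex (detector-mu det)
                                   (detector-sigma-sq det) nf))
                    (predicted (if (< p eps) 1 0))
                    (actual (round (aref ex (1- nf)))))
               (unless (= predicted actual)
                 (incf errors))))
    errors))
```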
Evaluating the Model
The test-model function evaluates the trained detector on held-out test data and computes three standard metrics:
- Precision — of the samples flagged as anomalies, what fraction actually are? (Measures false alarm rate.)
- Recall — of the actual anomalies, what fraction did we catch? (Measures miss rate.)
- F1 score — the harmonic mean of precision and recall. An F1 of 1.0 means perfect detection with no false alarms.
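These metrics follow directly from the confusion counts: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2 · precision · recall / (precision + recall). A small illustrative helper (f1-metrics is a made-up name, not part of the chapter's API) reproduces the arithmetic for 21 true positives, 3 false positives, and 1 false negative:

```lisp
;; Illustrative helper (not part of the chapter's API).
(defun f1-metrics (tp fp fn)
  (let* ((precision (/ tp (+ tp fp 0.0d0)))
         (recall    (/ tp (+ tp fn 0.0d0)))
         (f1 (/ (* 2.0d0 precision recall)
                (+ precision recall))))
    (list precision recall f1)))

(f1-metrics 21 3 1)
;; => precision 0.875, recall ~0.9545, F1 ~0.9130
```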
Data Preprocessing
The preprocess-wisconsin function in wisconsin.lisp transforms the raw CSV data to make it suitable for Gaussian modeling:
```lisp
(defun preprocess-wisconsin (rows)
  "Apply log-transform, min-max scaling, and remap
the target label to [0, 1]."
  (let ((result (make-array (length rows))))
    (loop for row in rows
          for idx from 0 do
      (let ((xs (make-array 10 :element-type t)))
        ;; copy and scale features by 0.1
        (loop for i below 10 do
          (setf (aref xs i) (aref row i)))
        (loop for i below 9 do
          (setf (aref xs i) (* (aref xs i) 0.1d0)))
        ;; log transform
        (let ((mn 1.0d6) (mx -1.0d6))
          (loop for i below 9 do
            (setf (aref xs i)
                  (log (+ (aref xs i) 1.2d0)))
            (when (< (aref xs i) mn)
              (setf mn (aref xs i)))
            (when (> (aref xs i) mx)
              (setf mx (aref xs i))))
          ;; min-max normalise to [0, 1]
          (let ((range (- mx mn)))
            (if (< range 1.0d-10)
                (loop for i below 9 do
                  (setf (aref xs i) 0.5d0))
                (loop for i below 9 do
                  (setf (aref xs i)
                        (/ (- (aref xs i) mn)
                           range))))))
        ;; remap target: [2, 4] -> [0, 1]
        (setf (aref xs 9)
              (* (- (aref xs 9) 2.0d0) 0.5d0))
        (setf (aref result idx) xs)))
    result))
```
Each row goes through three transformations:
- Scale by 0.1 — the raw values are integers 1–10; dividing by 10 puts them in [0.1, 1.0].
- Log transform — log(x + 1.2) pushes the distribution toward a bell shape. The offset 1.2 ensures we never take the log of zero.
- Per-row min-max normalization — maps values to [0, 1] so that all features contribute equally. The guard for range < 1e-10 handles the rare case where all features in a single row have the same value (which would cause a division by zero).
Finally, the target label is remapped from {2, 4} to {0, 1}.
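A quick trace of one row makes the pipeline concrete (assuming preprocess-wisconsin from above is loaded; the raw scores here are made up):

```lisp
;; Assumes PREPROCESS-WISCONSIN from above is loaded.
;; One raw row: nine 1-10 scores plus the 2/4 label (here 4 = malignant).
(let ((rows (list #(1 1 1 1 2 1 3 1 1 4))))
  (aref (preprocess-wisconsin rows) 0))
;; the smallest feature maps to 0.0, the largest (index 6) to 1.0,
;; and the final element becomes 1.0d0 (malignant)
```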
Running the Example
To run the full pipeline:
```
cd src/anomaly_detection
sbcl --noinform --load test.lisp
```
Here is typical output (the exact numbers vary due to the random data split):
```
=== Running anomaly detection test ===
Loaded 648 examples.

Training set: 284
Cross-val set: 176
Test set: 59

**** Best epsilon = 0.9810


-- best epsilon = 0.9810
-- test examples = 59
-- false positives = 3
-- true positives = 21
-- false negatives = 1
-- true negatives = 34
-- precision = 0.8750
-- recall = 0.9545
-- F1 = 0.9130

Model parameters:
  best epsilon = 0.9810
  num features = 10

--- Assertions ---
All assertions passed.

=== Test complete ===
```
The detector achieves an F1 score above 0.85 — meaning it correctly identifies most malignant samples while producing few false alarms. This is a strong result for such a simple model, and demonstrates that the Gaussian approach works well when the normal data is roughly bell-curve shaped (which our log transform helps ensure).
Using the API in Your Own Code
You can also use the anomaly detection module interactively from the REPL:
```lisp
(require :asdf)
(push (truename "./") asdf:*central-registry*)
(asdf:load-system "anomaly-detection")

;; Run the full Wisconsin pipeline:
(anomaly-detection:run-wisconsin)

;; Or build and train on your own data:
(let ((det (anomaly-detection:build-detector
            my-data 10)))
  (anomaly-detection:train det)
  ;; Check if a new sample is an anomaly:
  (anomaly-detection:anomaly-p det some-vector))
```
The anomaly-p function returns T if the detector considers the input vector anomalous and NIL otherwise. This makes it easy to integrate into a larger pipeline — for example, flagging suspicious transactions in a stream of financial data.
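The chapter does not show anomaly-p's body, but conceptually it is a single threshold comparison. A hypothetical sketch (anomaly-p-sketch is a made-up name; it assumes gaussian-p and the detector accessors are loaded):

```lisp
;; Hypothetical sketch: a sample is anomalous when its average
;; Gaussian probability falls below the tuned epsilon threshold.
(defun anomaly-p-sketch (det x)
  (< (gaussian-p x
                 (detector-mu det)
                 (detector-sigma-sq det)
                 (detector-num-features det))
     (detector-best-eps det)))
```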
Understanding the Evaluation Metrics
If you are new to machine learning, the evaluation output deserves a closer look:
- True positives (TP) — anomalies correctly identified as anomalies.
- True negatives (TN) — normal samples correctly identified as normal.
- False positives (FP) — normal samples incorrectly flagged as anomalies (false alarms).
- False negatives (FN) — anomalies that slipped through undetected (misses).
In medical diagnosis, false negatives are especially dangerous — a malignant tumor classified as benign could delay treatment. Our detector’s high recall (above 0.95) means it catches nearly all anomalies, at the cost of a few false alarms. This is generally the right tradeoff for safety-critical applications.
Wrap Up
We built a Gaussian anomaly detector from scratch, using nothing but Common Lisp’s built-in math and array operations. The approach is simple yet effective:
- Model each feature as a Gaussian distribution (mean + variance).
- Score new samples by how well they fit the model.
- Use cross-validation to find the best decision threshold.
This technique works best when you have many normal examples and relatively few anomalies — exactly the scenario where traditional classification struggles because there aren’t enough positive examples to learn from.
The code in this chapter was ported from a Java implementation and corrected to use the proper Gaussian PDF formula. The same algorithm can be applied to any numeric dataset where you need to detect outliers or unusual patterns.