Anomaly Detection Machine Learning Example

Anomaly detection models are used in one very specific class of use cases: when you have many negative (non-anomaly) examples and relatively few positive (anomaly) examples. We can refer to this as an unbalanced training set. For training we will ignore positive examples, create a model of “how things should be,” and hopefully be able to detect anomalies different from the original negative examples.

If you have a large training set of both negative and positive examples then do not use anomaly detection models. If your training examples are balanced then use a classification model as we will see later in the chapter Deep Learning Using Deeplearning4j.

Motivation for Anomaly Detection

There are two other examples in this book using the University of Wisconsin cancer data. These other examples are supervised learning. Anomaly detection as we do it in this chapter is, more or less, unsupervised learning.

When should we use anomaly detection? You should use supervised learning algorithms like neural networks and logistic classification when there are roughly equal number of available negative and positive examples in the training data. The University of Wisconsin cancer data set is fairly evenly split between negative and positive examples.

Anomaly detection should be used when you have many negative (“normal”) examples and relatively few positive (“anomaly”) examples. For the example in this chapter we will simulate scarcity of positive (“anomaly”) results by preparing the data using the Wisconsin cancer data as follows:

  • We will split the data into training (60%), cross validation (20%) and testing (20%).
  • For the training data, we will discard all but two positive (“anomaly”) examples. We do this to simulate the real world test case where some positive examples are likely to end up in the training data in spite of the fact that we would prefer the training data to only contain negative (“normal”) examples.
  • We will use the cross validation data to find a good value for the epsilon meta parameter.
  • After we find a good epsilon value, we will calculate the F1 measurement for the model.

Math Primer for Anomaly Detection

We are trying to model “normal” behavior and we do this by taking each feature and fitting a Gaussian (bell curve) distribution to each feature. The learned parameters for a Gaussian distribution are the mean of the data (where the bell shaped curve is centered) and the variance. You might be more familiar with the term standard deviation, Code Test. Variance is defined as Code Test.

We will need to calculate the probability: Code Test where Code Test is the mean and Code Test is the squared variance:

Code Test

where Code Test are the samples and we can calculate the squared variance as:

Code Test

We calculate the parameters of Code Test and Code Test for each feature. A bell shaped distribution in two dimensions is easy to visualize as is an inverted bowl shape in three dimensions. What if we have many features? Well, the math works and don’t worry about not being able to picture it in your mind.

AnomalyDetection Utility Class

The class AnomalyDetection developed in this section is fairly general purpose. It processes a set of training examples and for each feature calculates Code Test and Code Test. We are also training for a third parameter: an epsilon “cutoff” value: if for a given input vector if Code Test evaluates to a value greater than epsilon then the input vector is “normal”, less than epsilon implies that the input vector is an “anomaly.” The math for calculating these three features from training data is fairly easy but the code is not: we need to organize the training data and search for a value of epsilon that minimizes the error for a cross validation data set.

To be clear: we separate the input examples into three separate sets of training, cross validation, and testing data. We use the training data to set the model parameters, use the cross validation data to learn an epsilon value, and finally use the testing data to get precision, recall, and F1 scores that indicate how well the model detects anomalies in data not used for training and cross validation.

I present the example program as one long listing, with more code explanation after the listing. Please note the long loop over each input training example starting at line 28 and ending on line 74. The code in lines 25 through 44 processes the input training data sample into three disjoint sets of training, cross validation, and testing data. Then the code in lines 45 through 63 copies these three sets of data to Java arrays.

The code in lines 65 through 73 calculates, for a training example, the value of $\mu$ (the variable mu in the code).

Please note in the code example that I prepend class variables used in methods with “this.” even when it is not required. I do this for legibility and is a personal style.

The following UML class diagram will give you an overview before we dive into the code:

UML class diagram for the anomaly detection example
Architecture diagram
  1 package com.markwatson.anomaly_detection;
  2 
  3 import java.util.ArrayList;
  4 import java.util.List;
  5 import java.util.Random;
  6 
  7 /**
  8  * AnomalyDetection is a general purpose class for building anomaly detection
  9  * models. You should use this type of model when you have mostly negative
 10  * training examples with relatively few positive examples and you need a model
 11  * that detects positive (anomaly) inputs.
 12  *
 13  * <p>The algorithm computes feature means and variances from training data,
 14  * then uses a multivariate Gaussian probability density to score new examples.
 15  * An epsilon threshold (tuned on cross-validation data) separates normal
 16  * examples from anomalies.</p>
 17  */
 18 public class AnomalyDetection {
 19 
 20   /**
 21    * Structured container for evaluation metrics returned by {@link #test}.
 22    */
 23   public record TestResult(
 24       double precision,
 25       double recall,
 26       double f1,
 27       int truePositives,
 28       int falsePositives,
 29       int trueNegatives,
 30       int falseNegatives
 31   ) {}
 32 
 33   /** Seed for reproducible train/cross-validation/test splits. */
 34   private static final long DEFAULT_SEED = 42L;
 35 
 36   private static final double SQRT_2_PI = Math.sqrt(2.0 * Math.PI);
 37 
 38   private double bestEpsilon = 0.02;
 39   private double[] mu;            // [num_features]
 40   private double[] sigmaSquared;  // [num_features]
 41   private int numFeatures;
 42 
 43   private double[][] trainingExamples;
 44   private double[][] crossValidationExamples;
 45   private double[][] testingExamples;
 46   private int numTrainingExamples;
 47   private int numCrossValidationExamples;
 48   private int numTestingExamples;
 49 
 50   public AnomalyDetection() { }
 51 
 52   /**
 53    * Build an anomaly detection model from the supplied training data.
 54    *
 55    * <p>The data is split into training (60 %), cross-validation, and test
 56    * sets. Only normal (negative) examples go into the training set, with
 57    * ~10 % of positive examples intentionally leaked in to make the model
 58    * robust.</p>
 59    *
 60    * @param numFeatures          number of features (including the target column)
 61    * @param numTrainingExamples  number of rows in {@code trainingExamples}
 62    * @param trainingExamples     [numTrainingExamples][numFeatures]
 63    */
 64   public AnomalyDetection(int numFeatures, int numTrainingExamples, double[][] trainingExamples) {
 65     var training = new ArrayList<double[]>();
 66     var crossValidation = new ArrayList<double[]>();
 67     var testing = new ArrayList<double[]>();
 68     int outcomeIndex = numFeatures - 1; // index of target outcome
 69     var rng = new Random(DEFAULT_SEED);
 70 
 71     for (int i = 0; i < numTrainingExamples; i++) {
 72       if (rng.nextDouble() < 0.6) { // 60% training
 73         // Only keep normal (negative) examples in training, except
 74         // remember the reason for using this algorithm is that it
 75         // works with many more negative examples than positive
 76         // examples, and that the algorithm works even with some positive
 77         // examples mixed into the training set. The random test is to
 78         // allow about 10% positive examples to get into the training set:
 79         if (trainingExamples[i][outcomeIndex] < 0.5 || rng.nextDouble() < 0.1)
 80           training.add(trainingExamples[i]);
 81       } else if (rng.nextDouble() < 0.7) {
 82           crossValidation.add(trainingExamples[i]);
 83       } else {
 84         testing.add(trainingExamples[i]);
 85       }
 86     }
 87     this.numTrainingExamples = training.size();
 88     this.numCrossValidationExamples = crossValidation.size();
 89     this.numTestingExamples = testing.size();
 90 
 91     this.trainingExamples = training.toArray(new double[0][]);
 92     this.crossValidationExamples = crossValidation.toArray(new double[0][]);
 93     this.testingExamples = testing.toArray(new double[0][]);
 94 
 95     this.mu = new double[numFeatures];
 96     this.sigmaSquared = new double[numFeatures];
 97     this.numFeatures = numFeatures;
 98     for (int nf = 0; nf < this.numFeatures; nf++) {
 99       double sum = 0;
100       for (int nt = 0; nt < this.numTrainingExamples; nt++) sum += this.trainingExamples[nt][nf];
101       this.mu[nf] = sum / this.numTrainingExamples;
102     }
103   }
104 
105   public double[] muValues()     { return mu; }
106   public double[] sigmaSquared() { return sigmaSquared; }
107   public double bestEpsilon()    { return bestEpsilon; }
108 
109   /**
110    * Train the model using a range of epsilon values. Epsilon is a
111    * hyper parameter that we want to find a value that
112    * minimizes the error count.
113    */
114   public void train() {
115     double bestErrorCount = 1e10;
116     for (int epsilonLoop = 0; epsilonLoop < 200; epsilonLoop++) {
117       double epsilon = 0.001 + 0.005 * epsilonLoop;
118       double errorCount = trainHelper(epsilon);
119       if (errorCount <= bestErrorCount) {
120         bestErrorCount = errorCount;
121         bestEpsilon = epsilon;
122       }
123     }
124     System.out.println("\n**** Best epsilon value = " + bestEpsilon);
125 
126     // retrain for the best epsilon setting the maximum likelihood parameters
127     // which are now in epsilon, mu[] and sigmaSquared[]:
128     trainHelper(bestEpsilon);
129 
130     // finally, we are ready to see how the model performs with test data:
131     test(bestEpsilon);
132   }
133 
134   /**
135    * Calculate probability p(x; mu, sigma_squared) using the
136    * Gaussian PDF:  p(x_i) = (1 / (sqrt(2*pi) * sigma))
137    *                         * exp(-(x_i - mu)^2 / (2 * sigma^2))
138    *
139    * @param x - example vector
140    * @return average probability across features
141    */
142   private double p(double[] x) {
143     double sum = 0;
144     // use (numFeatures - 1) to skip target output:
145     for (int nf = 0; nf < numFeatures - 1; nf++) {
146       double sigma = Math.sqrt(sigmaSquared[nf]);
147       double exponent = -(x[nf] - mu[nf]) * (x[nf] - mu[nf]) / (2.0 * sigmaSquared[nf]);
148       sum += (1.0 / (SQRT_2_PI * sigma)) * Math.exp(exponent);
149     }
150     return sum / numFeatures;
151   }
152 
153   /**
154    * Returns {@code true} if the input vector is classified as an anomaly.
155    *
156    * @param x example vector
157    * @return true when the Gaussian probability falls below the best epsilon threshold
158    */
159   public boolean isAnomaly(double[] x) {
160     double sum = 0;
161     // use (numFeatures - 1) to skip target output:
162     for (int nf = 0; nf < numFeatures - 1; nf++) {
163       double sigma = Math.sqrt(sigmaSquared[nf]);
164       double exponent = -(x[nf] - mu[nf]) * (x[nf] - mu[nf]) / (2.0 * sigmaSquared[nf]);
165       sum += (1.0 / (SQRT_2_PI * sigma)) * Math.exp(exponent);
166     }
167     return (sum / numFeatures) < bestEpsilon;
168   }
169 
170   /**
171    * @deprecated Use {@link #isAnomaly(double[])} instead (fixed spelling).
172    */
173   @Deprecated(since = "1.1", forRemoval = true)
174   public boolean isAnamoly(double[] x) {
175     return isAnomaly(x);
176   }
177 
178   private double trainHelper(double epsilon) {
179     int outcomeIndex = numFeatures - 1;
180     // use (numFeatures - 1) to skip target output:
181     for (int nf = 0; nf < numFeatures - 1; nf++) {
182       double sum = 0;
183       for (int nt = 0; nt < numTrainingExamples; nt++) {
184         sum += (trainingExamples[nt][nf] - mu[nf]) * (trainingExamples[nt][nf] - mu[nf]);
185       }
186       sigmaSquared[nf] = sum / numTrainingExamples;
187     }
188     double crossValidationErrorCount = 0;
189     for (int nt = 0; nt < numCrossValidationExamples; nt++) {
190       var x = crossValidationExamples[nt];
191       double calculatedValue = p(x);
192       if (x[outcomeIndex] > 0.5) { // target training output is ANOMALY
193         // if the calculated value is greater than epsilon
194         // then this x vector is not an anomaly (e.g., it
195         // is a normal sample):
196         if (calculatedValue > epsilon) crossValidationErrorCount += 1;
197       } else { // target training output is NORMAL
198         if (calculatedValue < epsilon) crossValidationErrorCount += 1;
199       }
200     }
201     System.out.println("   cross_validation_error_count = " + crossValidationErrorCount + " for epsilon = " + epsilon);
202     return crossValidationErrorCount;
203   }
204 
205   private TestResult test(double epsilon) {
206     int outcomeIndex = numFeatures - 1;
207     int numFalsePositives = 0, numFalseNegatives = 0;
208     int numTruePositives = 0, numTrueNegatives = 0;
209     for (int nt = 0; nt < numTestingExamples; nt++) {
210       var x = testingExamples[nt];
211       double calculatedValue = p(x);
212       if (x[outcomeIndex] > 0.5) { // target training output is ANOMALY
213         if (calculatedValue > epsilon) numFalseNegatives++;
214         else                           numTruePositives++;
215       } else { // target training output is NORMAL
216         if (calculatedValue < epsilon) numFalsePositives++;
217         else                           numTrueNegatives++;
218       }
219     }
220     double precision = numTruePositives / (double) (numTruePositives + numFalsePositives);
221     double recall = numTruePositives / (double) (numTruePositives + numFalseNegatives);
222     double f1 = 2.0 * precision * recall / (precision + recall);
223 
224     System.out.println("""
225 
226          -- best epsilon = %s
227          -- number of test examples = %d
228          -- number of false positives = %d
229          -- number of true positives = %d
230          -- number of false negatives = %d
231          -- number of true negatives = %d
232          -- precision = %.4f
233          -- recall = %.4f
234          -- F1 = %.4f""".formatted(
235         bestEpsilon, numTestingExamples,
236         numFalsePositives, numTruePositives,
237         numFalseNegatives, numTrueNegatives,
238         precision, recall, f1));
239 
240     return new TestResult(precision, recall, f1,
241         numTruePositives, numFalsePositives, numTrueNegatives, numFalseNegatives);
242   }
243 }

Once the training data and the values of {$$}\mu{/$$} (the varible mu in the code) are defined for each feature we can define the method train in lines 86 through 104 that calculated the best epsilon “cutoff” value for the training data set using the method train_helper defined in lines 138 through 165. We use the “best” epsilon value by testing with the separate cross validation data set; we do this by calling the method test that is defined in lines 167 through 198.

Example Using the University of Wisconsin Cancer Data

The example in this section loads the University of Wisconsin data and uses the class AnomalyDetection developed in the last section to find anomalies, which for this example will be input vectors that represented malignancy in the original data.

 1 package com.markwatson.anomaly_detection;
 2 
 3 import java.io.IOException;
 4 import java.nio.file.Files;
 5 import java.nio.file.Path;
 6 import java.util.List;
 7 
 8 /**
 9  * Train a deep belief network on the University of Wisconsin Cancer Data Set.
10  */
11 public class WisconsinAnomalyDetection {
12 
13   private static final boolean PRINT_HISTOGRAMS = true;
14   private static final int NUM_HISTOGRAM_BINS = 5;
15 
16   public static void main(String[] args) throws IOException {
17 
18     var lines = Files.readAllLines(Path.of("data/cleaned_wisconsin_cancer_data.csv"));
19     var trainingData = new double[lines.size()][];
20     int lineCount = 0;
21 
22     for (var line : lines) {
23       var sarr = line.strip().split(",");
24       var xs = new double[10];
25       for (int i = 0; i < 10; i++) xs[i] = Double.parseDouble(sarr[i]);
26       for (int i = 0; i < 9; i++) xs[i] *= 0.1;
27 
28       // make the data look more like a Gaussian (bell curve shaped) distribution:
29       double min = 1.e6, max = -1.e6;
30       for (int i = 0; i < 9; i++) {
31         xs[i] = Math.log(xs[i] + 1.2);
32         if (xs[i] < min) min = xs[i];
33         if (xs[i] > max) max = xs[i];
34       }
35       for (int i = 0; i < 9; i++) xs[i] = (xs[i] - min) / (max - min);
36 
37       xs[9] = (xs[9] - 2) * 0.5; // make target output be [0,1] instead of [2,4]
38       trainingData[lineCount++] = xs;
39     }
40 
41     if (PRINT_HISTOGRAMS) {
42       PrintHistogram.histogram("Clump Thickness", trainingData, 0, 0.0, 1.0, NUM_HISTOGRAM_BINS);
43       PrintHistogram.histogram("Uniformity of Cell Size", trainingData, 1, 0.0, 1.0, NUM_HISTOGRAM_BINS);
44       PrintHistogram.histogram("Uniformity of Cell Shape", trainingData, 2, 0.0, 1.0, NUM_HISTOGRAM_BINS);
45       PrintHistogram.histogram("Marginal Adhesion", trainingData, 3, 0.0, 1.0, NUM_HISTOGRAM_BINS);
46       PrintHistogram.histogram("Single Epithelial Cell Size", trainingData, 4, 0.0, 1.0, NUM_HISTOGRAM_BINS);
47       PrintHistogram.histogram("Bare Nuclei", trainingData, 5, 0.0, 1.0, NUM_HISTOGRAM_BINS);
48       PrintHistogram.histogram("Bland Chromatin", trainingData, 6, 0.0, 1.0, NUM_HISTOGRAM_BINS);
49       PrintHistogram.histogram("Normal Nucleoli", trainingData, 7, 0.0, 1.0, NUM_HISTOGRAM_BINS);
50       PrintHistogram.histogram("Mitoses", trainingData, 8, 0.0, 1.0, NUM_HISTOGRAM_BINS);
51     }
52 
53     var detector = new AnomalyDetection(10, lineCount - 1, trainingData);
54 
55     // the train method will print training results like
56     // precision, recall, and F1:
57     detector.train();
58 
59     // get best model parameters:
60     double bestEpsilon = detector.bestEpsilon();
61     var meanValues = detector.muValues();
62     var sigmaSquared = detector.sigmaSquared();
63     System.out.println("\nModel parameters:");
64     System.out.println("  best epsilon = " + bestEpsilon);
65     System.out.println("  num features = " + meanValues.length);
66   }
67 }

Data used by an anomaly detecton model should have (roughly) a Gaussian (bell curve shape) distribution. What form does the cancer data have? Unfortunately, each of the data features seems to either have a greater density at the lower range of feature values or large density at the extremes of the data feature ranges. This will cause our model to not perform as well as we would like. Here are the inputs displayed as five-bin histograms:

 1 Clump Thickness
 2 0   177
 3 1   174
 4 2   154
 5 3   63
 6 4   80
 7 
 8 Uniformity of Cell Size
 9 0   393
10 1   86
11 2   54
12 3   43
13 4   72
14 
15 Uniformity of Cell Shape
16 0   380
17 1   92
18 2   58
19 3   54
20 4   64
21 
22 Marginal Adhesion
23 0   427
24 1   85
25 2   42
26 3   37
27 4   57
28 
29 Single Epithelial Cell Size
30 0   394
31 1   117
32 2   76
33 3   28
34 4   33
35 
36 Bare Nuclei
37 0   409
38 1   44
39 2   33
40 3   27
41 4   135
42 
43 Bland Chromatin
44 0   298
45 1   182
46 2   41
47 3   97
48 4   30
49 
50 Normal Nucleoli
51 0   442
52 1   56
53 2   39
54 3   37
55 4   74
56 
57 Mitoses
58 0   567
59 1   42
60 2   8
61 3   17
62 4   14

I won’t do it in this example, but the feature “Bare Nuclei” should be removed because it is not even close to being a bell-shaped distribution. Another thing that you can do (recommended by Andrew Ng in his Coursera Machine Learning class) is to take the log of data and otherwise transform it to something that looks more like a Gaussian distribution. In the class WisconsinAnomalyDetection, you could for example, transform the data using something like:

1   // make the data look more like a Gaussian (bell curve shaped) distribution:
2   double min = 1.e6, max = -1.e6;
3   for (int i=0; i<9; i++) {
4     xs[i] = Math.log(xs[i] + 1.2);
5     if (xs[i] < min) min = xs[i];
6     if (xs[i] > max) max = xs[i];
7   }
8   for (int i=0; i<9; i++) xs[i] = (xs[i] - min) / (max - min);

The constant 1.2 in line 4 is a tuning parameter that I got by trial and error by iterating on adjusting the factor and looking at the data histograms.

In a real application you would drop features that you can not transform to something like a Gaussian distribution.

Here are the results of running the code as it is in the github repository for this book:

 1  -- best epsilon = 0.28
 2  -- number of test examples = 63
 3  -- number of false positives = 0.0
 4  -- number of true positives = 11.0
 5  -- number of false negatives = 8.0
 6  -- number of true negatives = 44.0
 7  -- precision = 1.0
 8  -- recall = 0.5789473684210527
 9  -- F1 = 0.7333333333333334
10 
11 
12 Using the trained model:
13 malignant result = true, benign result = false

How do we evaluate these results? The precision value of 1.0 means that there were no false positives. False positives are predictions of a true result when it should have been false. The value 0.578 for recall means that of all the samples that should have been classified as positive, we only predicted about 57.8% of them. The F1 score is calculated as two times the product of precision and recall, divided by the sum of precision plus recall.