Anomaly Detection Machine Learning Example
Anomaly detection models are used in one very specific class of use cases: when you have many negative (non-anomaly) examples and relatively few positive (anomaly) examples. We can refer to this as an unbalanced training set. For training we will ignore positive examples, create a model of “how things should be,” and hopefully be able to detect anomalies different from the original negative examples.
If you have a large training set of both negative and positive examples then do not use anomaly detection models. If your training examples are balanced then use a classification model as we will see later in the chapter Deep Learning Using Deeplearning4j.
Motivation for Anomaly Detection
There are two other examples in this book using the University of Wisconsin cancer data. These other examples are supervised learning. Anomaly detection as we do it in this chapter is, more or less, unsupervised learning.
When should we use anomaly detection? You should use supervised learning algorithms like neural networks and logistic classification when there are roughly equal number of available negative and positive examples in the training data. The University of Wisconsin cancer data set is fairly evenly split between negative and positive examples.
Anomaly detection should be used when you have many negative (“normal”) examples and relatively few positive (“anomaly”) examples. For the example in this chapter we will simulate scarcity of positive (“anomaly”) results by preparing the data using the Wisconsin cancer data as follows:
- We will split the data into training (60%), cross validation (20%) and testing (20%).
- For the training data, we will discard all but two positive (“anomaly”) examples. We do this to simulate the real world test case where some positive examples are likely to end up in the training data in spite of the fact that we would prefer the training data to only contain negative (“normal”) examples.
- We will use the cross validation data to find a good value for the epsilon meta parameter.
- After we find a good epsilon value, we will calculate the F1 measurement for the model.
Math Primer for Anomaly Detection
We are trying to model “normal” behavior and we do this by taking each feature and fitting a Gaussian (bell curve) distribution to each feature. The learned parameters for a Gaussian distribution are the mean of the data (where the bell shaped curve is centered) and the variance.
You might be more familiar with the term standard deviation,
.
Variance is defined as
.
We will need to calculate the probability:
where
is the mean and
is the squared variance:

where
are the samples and we can calculate the squared variance as:

We calculate the parameters of
and
for each feature. A bell shaped distribution in two dimensions is easy to visualize as is an inverted bowl shape in three dimensions. What if we have many features? Well, the math works and don’t worry about not being able to picture it in your mind.
AnomalyDetection Utility Class
The class AnomalyDetection developed in this section is fairly general purpose.
It processes a set of training examples and for each feature calculates
and
. We are also training for a third parameter: an epsilon “cutoff” value: if for a given input vector if
evaluates
to a value greater than epsilon then the input vector is “normal”, less than epsilon implies that the input vector is an “anomaly.” The math for calculating these three features from training data is fairly easy but the code is not: we need to organize the training data and search for a value of epsilon that minimizes the error for a cross validation data set.
To be clear: we separate the input examples into three separate sets of training, cross validation, and testing data. We use the training data to set the model parameters, use the cross validation data to learn an epsilon value, and finally use the testing data to get precision, recall, and F1 scores that indicate how well the model detects anomalies in data not used for training and cross validation.
I present the example program as one long listing, with more code explanation after the listing. Please note the long loop over each input training example starting at line 28 and ending on line 74. The code in lines 25 through 44 processes the input training data sample into three disjoint sets of training, cross validation, and testing data. Then the code in lines 45 through 63 copies these three sets of data to Java arrays.
The code in lines 65 through 73 calculates, for a training example, the value of $\mu$ (the variable mu in the code).
Please note in the code example that I prepend class variables used in methods with “this.” even when it is not required. I do this for legibility and is a personal style.
The following UML class diagram will give you an overview before we dive into the code:
1 package com.markwatson.anomaly_detection;
2
3 import java.util.ArrayList;
4 import java.util.List;
5 import java.util.Random;
6
7 /**
8 * AnomalyDetection is a general purpose class for building anomaly detection
9 * models. You should use this type of model when you have mostly negative
10 * training examples with relatively few positive examples and you need a model
11 * that detects positive (anomaly) inputs.
12 *
13 * <p>The algorithm computes feature means and variances from training data,
14 * then uses a multivariate Gaussian probability density to score new examples.
15 * An epsilon threshold (tuned on cross-validation data) separates normal
16 * examples from anomalies.</p>
17 */
18 public class AnomalyDetection {
19
20 /**
21 * Structured container for evaluation metrics returned by {@link #test}.
22 */
23 public record TestResult(
24 double precision,
25 double recall,
26 double f1,
27 int truePositives,
28 int falsePositives,
29 int trueNegatives,
30 int falseNegatives
31 ) {}
32
33 /** Seed for reproducible train/cross-validation/test splits. */
34 private static final long DEFAULT_SEED = 42L;
35
36 private static final double SQRT_2_PI = Math.sqrt(2.0 * Math.PI);
37
38 private double bestEpsilon = 0.02;
39 private double[] mu; // [num_features]
40 private double[] sigmaSquared; // [num_features]
41 private int numFeatures;
42
43 private double[][] trainingExamples;
44 private double[][] crossValidationExamples;
45 private double[][] testingExamples;
46 private int numTrainingExamples;
47 private int numCrossValidationExamples;
48 private int numTestingExamples;
49
50 public AnomalyDetection() { }
51
52 /**
53 * Build an anomaly detection model from the supplied training data.
54 *
55 * <p>The data is split into training (60 %), cross-validation, and test
56 * sets. Only normal (negative) examples go into the training set, with
57 * ~10 % of positive examples intentionally leaked in to make the model
58 * robust.</p>
59 *
60 * @param numFeatures number of features (including the target column)
61 * @param numTrainingExamples number of rows in {@code trainingExamples}
62 * @param trainingExamples [numTrainingExamples][numFeatures]
63 */
64 public AnomalyDetection(int numFeatures, int numTrainingExamples, double[][] trainingExamples) {
65 var training = new ArrayList<double[]>();
66 var crossValidation = new ArrayList<double[]>();
67 var testing = new ArrayList<double[]>();
68 int outcomeIndex = numFeatures - 1; // index of target outcome
69 var rng = new Random(DEFAULT_SEED);
70
71 for (int i = 0; i < numTrainingExamples; i++) {
72 if (rng.nextDouble() < 0.6) { // 60% training
73 // Only keep normal (negative) examples in training, except
74 // remember the reason for using this algorithm is that it
75 // works with many more negative examples than positive
76 // examples, and that the algorithm works even with some positive
77 // examples mixed into the training set. The random test is to
78 // allow about 10% positive examples to get into the training set:
79 if (trainingExamples[i][outcomeIndex] < 0.5 || rng.nextDouble() < 0.1)
80 training.add(trainingExamples[i]);
81 } else if (rng.nextDouble() < 0.7) {
82 crossValidation.add(trainingExamples[i]);
83 } else {
84 testing.add(trainingExamples[i]);
85 }
86 }
87 this.numTrainingExamples = training.size();
88 this.numCrossValidationExamples = crossValidation.size();
89 this.numTestingExamples = testing.size();
90
91 this.trainingExamples = training.toArray(new double[0][]);
92 this.crossValidationExamples = crossValidation.toArray(new double[0][]);
93 this.testingExamples = testing.toArray(new double[0][]);
94
95 this.mu = new double[numFeatures];
96 this.sigmaSquared = new double[numFeatures];
97 this.numFeatures = numFeatures;
98 for (int nf = 0; nf < this.numFeatures; nf++) {
99 double sum = 0;
100 for (int nt = 0; nt < this.numTrainingExamples; nt++) sum += this.trainingExamples[nt][nf];
101 this.mu[nf] = sum / this.numTrainingExamples;
102 }
103 }
104
105 public double[] muValues() { return mu; }
106 public double[] sigmaSquared() { return sigmaSquared; }
107 public double bestEpsilon() { return bestEpsilon; }
108
109 /**
110 * Train the model using a range of epsilon values. Epsilon is a
111 * hyper parameter that we want to find a value that
112 * minimizes the error count.
113 */
114 public void train() {
115 double bestErrorCount = 1e10;
116 for (int epsilonLoop = 0; epsilonLoop < 200; epsilonLoop++) {
117 double epsilon = 0.001 + 0.005 * epsilonLoop;
118 double errorCount = trainHelper(epsilon);
119 if (errorCount <= bestErrorCount) {
120 bestErrorCount = errorCount;
121 bestEpsilon = epsilon;
122 }
123 }
124 System.out.println("\n**** Best epsilon value = " + bestEpsilon);
125
126 // retrain for the best epsilon setting the maximum likelihood parameters
127 // which are now in epsilon, mu[] and sigmaSquared[]:
128 trainHelper(bestEpsilon);
129
130 // finally, we are ready to see how the model performs with test data:
131 test(bestEpsilon);
132 }
133
134 /**
135 * Calculate probability p(x; mu, sigma_squared) using the
136 * Gaussian PDF: p(x_i) = (1 / (sqrt(2*pi) * sigma))
137 * * exp(-(x_i - mu)^2 / (2 * sigma^2))
138 *
139 * @param x - example vector
140 * @return average probability across features
141 */
142 private double p(double[] x) {
143 double sum = 0;
144 // use (numFeatures - 1) to skip target output:
145 for (int nf = 0; nf < numFeatures - 1; nf++) {
146 double sigma = Math.sqrt(sigmaSquared[nf]);
147 double exponent = -(x[nf] - mu[nf]) * (x[nf] - mu[nf]) / (2.0 * sigmaSquared[nf]);
148 sum += (1.0 / (SQRT_2_PI * sigma)) * Math.exp(exponent);
149 }
150 return sum / numFeatures;
151 }
152
153 /**
154 * Returns {@code true} if the input vector is classified as an anomaly.
155 *
156 * @param x example vector
157 * @return true when the Gaussian probability falls below the best epsilon threshold
158 */
159 public boolean isAnomaly(double[] x) {
160 double sum = 0;
161 // use (numFeatures - 1) to skip target output:
162 for (int nf = 0; nf < numFeatures - 1; nf++) {
163 double sigma = Math.sqrt(sigmaSquared[nf]);
164 double exponent = -(x[nf] - mu[nf]) * (x[nf] - mu[nf]) / (2.0 * sigmaSquared[nf]);
165 sum += (1.0 / (SQRT_2_PI * sigma)) * Math.exp(exponent);
166 }
167 return (sum / numFeatures) < bestEpsilon;
168 }
169
170 /**
171 * @deprecated Use {@link #isAnomaly(double[])} instead (fixed spelling).
172 */
173 @Deprecated(since = "1.1", forRemoval = true)
174 public boolean isAnamoly(double[] x) {
175 return isAnomaly(x);
176 }
177
178 private double trainHelper(double epsilon) {
179 int outcomeIndex = numFeatures - 1;
180 // use (numFeatures - 1) to skip target output:
181 for (int nf = 0; nf < numFeatures - 1; nf++) {
182 double sum = 0;
183 for (int nt = 0; nt < numTrainingExamples; nt++) {
184 sum += (trainingExamples[nt][nf] - mu[nf]) * (trainingExamples[nt][nf] - mu[nf]);
185 }
186 sigmaSquared[nf] = sum / numTrainingExamples;
187 }
188 double crossValidationErrorCount = 0;
189 for (int nt = 0; nt < numCrossValidationExamples; nt++) {
190 var x = crossValidationExamples[nt];
191 double calculatedValue = p(x);
192 if (x[outcomeIndex] > 0.5) { // target training output is ANOMALY
193 // if the calculated value is greater than epsilon
194 // then this x vector is not an anomaly (e.g., it
195 // is a normal sample):
196 if (calculatedValue > epsilon) crossValidationErrorCount += 1;
197 } else { // target training output is NORMAL
198 if (calculatedValue < epsilon) crossValidationErrorCount += 1;
199 }
200 }
201 System.out.println(" cross_validation_error_count = " + crossValidationErrorCount + " for epsilon = " + epsilon);
202 return crossValidationErrorCount;
203 }
204
205 private TestResult test(double epsilon) {
206 int outcomeIndex = numFeatures - 1;
207 int numFalsePositives = 0, numFalseNegatives = 0;
208 int numTruePositives = 0, numTrueNegatives = 0;
209 for (int nt = 0; nt < numTestingExamples; nt++) {
210 var x = testingExamples[nt];
211 double calculatedValue = p(x);
212 if (x[outcomeIndex] > 0.5) { // target training output is ANOMALY
213 if (calculatedValue > epsilon) numFalseNegatives++;
214 else numTruePositives++;
215 } else { // target training output is NORMAL
216 if (calculatedValue < epsilon) numFalsePositives++;
217 else numTrueNegatives++;
218 }
219 }
220 double precision = numTruePositives / (double) (numTruePositives + numFalsePositives);
221 double recall = numTruePositives / (double) (numTruePositives + numFalseNegatives);
222 double f1 = 2.0 * precision * recall / (precision + recall);
223
224 System.out.println("""
225
226 -- best epsilon = %s
227 -- number of test examples = %d
228 -- number of false positives = %d
229 -- number of true positives = %d
230 -- number of false negatives = %d
231 -- number of true negatives = %d
232 -- precision = %.4f
233 -- recall = %.4f
234 -- F1 = %.4f""".formatted(
235 bestEpsilon, numTestingExamples,
236 numFalsePositives, numTruePositives,
237 numFalseNegatives, numTrueNegatives,
238 precision, recall, f1));
239
240 return new TestResult(precision, recall, f1,
241 numTruePositives, numFalsePositives, numTrueNegatives, numFalseNegatives);
242 }
243 }
Once the training data and the values of {$$}\mu{/$$} (the varible mu in the code) are defined for each feature we can define the method train in lines 86 through 104 that calculated the best epsilon “cutoff” value for the training data set using the method train_helper defined in lines 138 through 165. We use the “best” epsilon value by testing with the separate cross validation data set; we do this by calling the method test that is defined in lines 167 through 198.
Example Using the University of Wisconsin Cancer Data
The example in this section loads the University of Wisconsin data and uses the class AnomalyDetection developed in the last section to find anomalies, which for this example will be input vectors that represented malignancy in the original data.
1 package com.markwatson.anomaly_detection;
2
3 import java.io.IOException;
4 import java.nio.file.Files;
5 import java.nio.file.Path;
6 import java.util.List;
7
8 /**
9 * Train a deep belief network on the University of Wisconsin Cancer Data Set.
10 */
11 public class WisconsinAnomalyDetection {
12
13 private static final boolean PRINT_HISTOGRAMS = true;
14 private static final int NUM_HISTOGRAM_BINS = 5;
15
16 public static void main(String[] args) throws IOException {
17
18 var lines = Files.readAllLines(Path.of("data/cleaned_wisconsin_cancer_data.csv"));
19 var trainingData = new double[lines.size()][];
20 int lineCount = 0;
21
22 for (var line : lines) {
23 var sarr = line.strip().split(",");
24 var xs = new double[10];
25 for (int i = 0; i < 10; i++) xs[i] = Double.parseDouble(sarr[i]);
26 for (int i = 0; i < 9; i++) xs[i] *= 0.1;
27
28 // make the data look more like a Gaussian (bell curve shaped) distribution:
29 double min = 1.e6, max = -1.e6;
30 for (int i = 0; i < 9; i++) {
31 xs[i] = Math.log(xs[i] + 1.2);
32 if (xs[i] < min) min = xs[i];
33 if (xs[i] > max) max = xs[i];
34 }
35 for (int i = 0; i < 9; i++) xs[i] = (xs[i] - min) / (max - min);
36
37 xs[9] = (xs[9] - 2) * 0.5; // make target output be [0,1] instead of [2,4]
38 trainingData[lineCount++] = xs;
39 }
40
41 if (PRINT_HISTOGRAMS) {
42 PrintHistogram.histogram("Clump Thickness", trainingData, 0, 0.0, 1.0, NUM_HISTOGRAM_BINS);
43 PrintHistogram.histogram("Uniformity of Cell Size", trainingData, 1, 0.0, 1.0, NUM_HISTOGRAM_BINS);
44 PrintHistogram.histogram("Uniformity of Cell Shape", trainingData, 2, 0.0, 1.0, NUM_HISTOGRAM_BINS);
45 PrintHistogram.histogram("Marginal Adhesion", trainingData, 3, 0.0, 1.0, NUM_HISTOGRAM_BINS);
46 PrintHistogram.histogram("Single Epithelial Cell Size", trainingData, 4, 0.0, 1.0, NUM_HISTOGRAM_BINS);
47 PrintHistogram.histogram("Bare Nuclei", trainingData, 5, 0.0, 1.0, NUM_HISTOGRAM_BINS);
48 PrintHistogram.histogram("Bland Chromatin", trainingData, 6, 0.0, 1.0, NUM_HISTOGRAM_BINS);
49 PrintHistogram.histogram("Normal Nucleoli", trainingData, 7, 0.0, 1.0, NUM_HISTOGRAM_BINS);
50 PrintHistogram.histogram("Mitoses", trainingData, 8, 0.0, 1.0, NUM_HISTOGRAM_BINS);
51 }
52
53 var detector = new AnomalyDetection(10, lineCount - 1, trainingData);
54
55 // the train method will print training results like
56 // precision, recall, and F1:
57 detector.train();
58
59 // get best model parameters:
60 double bestEpsilon = detector.bestEpsilon();
61 var meanValues = detector.muValues();
62 var sigmaSquared = detector.sigmaSquared();
63 System.out.println("\nModel parameters:");
64 System.out.println(" best epsilon = " + bestEpsilon);
65 System.out.println(" num features = " + meanValues.length);
66 }
67 }
Data used by an anomaly detecton model should have (roughly) a Gaussian (bell curve shape) distribution. What form does the cancer data have? Unfortunately, each of the data features seems to either have a greater density at the lower range of feature values or large density at the extremes of the data feature ranges. This will cause our model to not perform as well as we would like. Here are the inputs displayed as five-bin histograms:
1 Clump Thickness
2 0 177
3 1 174
4 2 154
5 3 63
6 4 80
7
8 Uniformity of Cell Size
9 0 393
10 1 86
11 2 54
12 3 43
13 4 72
14
15 Uniformity of Cell Shape
16 0 380
17 1 92
18 2 58
19 3 54
20 4 64
21
22 Marginal Adhesion
23 0 427
24 1 85
25 2 42
26 3 37
27 4 57
28
29 Single Epithelial Cell Size
30 0 394
31 1 117
32 2 76
33 3 28
34 4 33
35
36 Bare Nuclei
37 0 409
38 1 44
39 2 33
40 3 27
41 4 135
42
43 Bland Chromatin
44 0 298
45 1 182
46 2 41
47 3 97
48 4 30
49
50 Normal Nucleoli
51 0 442
52 1 56
53 2 39
54 3 37
55 4 74
56
57 Mitoses
58 0 567
59 1 42
60 2 8
61 3 17
62 4 14
I won’t do it in this example, but the feature “Bare Nuclei” should be removed because it is not even close to being a bell-shaped distribution. Another thing that you can do (recommended by Andrew Ng in his Coursera Machine Learning class) is to take the log of data and otherwise transform it to something that looks more like a Gaussian distribution. In the class WisconsinAnomalyDetection, you could for example, transform the data using something like:
1 // make the data look more like a Gaussian (bell curve shaped) distribution:
2 double min = 1.e6, max = -1.e6;
3 for (int i=0; i<9; i++) {
4 xs[i] = Math.log(xs[i] + 1.2);
5 if (xs[i] < min) min = xs[i];
6 if (xs[i] > max) max = xs[i];
7 }
8 for (int i=0; i<9; i++) xs[i] = (xs[i] - min) / (max - min);
The constant 1.2 in line 4 is a tuning parameter that I got by trial and error by iterating on adjusting the factor and looking at the data histograms.
In a real application you would drop features that you can not transform to something like a Gaussian distribution.
Here are the results of running the code as it is in the github repository for this book:
1 -- best epsilon = 0.28
2 -- number of test examples = 63
3 -- number of false positives = 0.0
4 -- number of true positives = 11.0
5 -- number of false negatives = 8.0
6 -- number of true negatives = 44.0
7 -- precision = 1.0
8 -- recall = 0.5789473684210527
9 -- F1 = 0.7333333333333334
10
11
12 Using the trained model:
13 malignant result = true, benign result = false
How do we evaluate these results? The precision value of 1.0 means that there were no false positives. False positives are predictions of a true result when it should have been false. The value 0.578 for recall means that of all the samples that should have been classified as positive, we only predicted about 57.8% of them. The F1 score is calculated as two times the product of precision and recall, divided by the sum of precision plus recall.