Anomaly Detection Machine Learning Example
Anomaly detection models are used in one very specific class of use cases: when you have many negative (non-anomaly) examples and relatively few positive (anomaly) examples. We can refer to this as an unbalanced training set. During training we will (mostly) ignore positive examples, build a model of “how things should be,” and then flag new inputs that differ too much from the negative examples used to build the model.
If you have a large training set of both negative and positive examples, do not use an anomaly detection model; if your training examples are balanced, use a classification model instead, as we will see later in the chapter Deep Learning Using Deeplearning4j.
Motivation for Anomaly Detection
There are two other examples in this book using the University of Wisconsin cancer data. These other examples are supervised learning. Anomaly detection as we do it in this chapter is, more or less, unsupervised learning.
When should we use anomaly detection? You should use supervised learning algorithms like neural networks and logistic classification when there are roughly equal numbers of available negative and positive examples in the training data. The University of Wisconsin cancer data set is fairly evenly split between negative and positive examples.
Anomaly detection should be used when you have many negative (“normal”) examples and relatively few positive (“anomaly”) examples. For the example in this chapter we will simulate scarcity of positive (“anomaly”) results by preparing the data using the Wisconsin cancer data as follows:
- We will split the data into training (about 60%) and use the remainder for cross validation and testing (as the code is written, roughly 28% and 12% respectively).
- For the training data, we will discard all but two positive (“anomaly”) examples. We do this to simulate the real world test case where some positive examples are likely to end up in the training data in spite of the fact that we would prefer the training data to only contain negative (“normal”) examples.
- We will use the cross validation data to find a good value for the epsilon hyperparameter.
- After we find a good epsilon value, we will calculate the F1 score for the model.
Math Primer for Anomaly Detection
We are trying to model “normal” behavior and we do this by fitting a Gaussian (bell curve) distribution to each feature. The learned parameters for a Gaussian distribution are the mean of the data (where the bell-shaped curve is centered) and the variance. You might be more familiar with the term standard deviation, {$$}\sigma{/$$}; the variance is defined as {$$}\sigma^2{/$$}.
We will need to calculate the probability of a value x given the mean and variance of a probability distribution: {$$}p(x; \mu, \sigma^2){/$$} where {$$}\mu{/$$} is the mean and {$$}\sigma^2{/$$} is the variance:
{$$}p(x; \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}{/$$}
where {$$}x_i{/$$} are the m samples and we can calculate the variance as:
{$$}\sigma^2 = \frac{1}{m}\displaystyle\sum_{i=1}^{m}(x_i - \mu)^2{/$$}
We calculate the parameters {$$}\mu{/$$} and {$$}\sigma^2{/$$} for each feature. A bell-shaped distribution in two dimensions is easy to visualize, as is an inverted bowl shape in three dimensions. What if we have many features? The math still works even if we can no longer picture the distribution in our minds.
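To make the primer concrete, here is a minimal sketch, separate from the book's example code, that fits {$$}\mu{/$$} and {$$}\sigma^2{/$$} to a single feature and then evaluates the density formula above (the sample values are made up for illustration):

1 // A minimal sketch, not part of the book's example code, that fits a
2 // Gaussian to one feature and evaluates the density formula above.
3 // The sample values are made up for illustration.
4 public class GaussianFeatureSketch {
5 public static void main(String[] args) {
6 double[] samples = {0.2, 0.3, 0.25, 0.4, 0.35};
7 int m = samples.length;
8 // mean: mu = (1/m) * sum of the x_i values
9 double mu = 0;
10 for (double x : samples) mu += x;
11 mu /= m;
12 // variance: sigma^2 = (1/m) * sum of (x_i - mu)^2
13 double sigmaSquared = 0;
14 for (double x : samples) sigmaSquared += (x - mu) * (x - mu);
15 sigmaSquared /= m;
16 // density p(x; mu, sigma^2), exactly as in the formula above:
17 double x = 0.9; // a value far from the mean should get a low density
18 double p = (1.0 / (Math.sqrt(sigmaSquared) * Math.sqrt(2.0 * Math.PI)))
19 * Math.exp(-((x - mu) * (x - mu)) / (2.0 * sigmaSquared));
20 System.out.println("mu = " + mu + ", sigma^2 = " + sigmaSquared +
21 ", p(0.9) = " + p);
22 }
23 }

For these sample values the mean is 0.3 and the variance is 0.005, so the test value 0.9 gets a vanishingly small density, which is exactly the property an anomaly detector exploits.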
AnomalyDetection Utility Class
The class AnomalyDetection developed in this section is fairly general purpose. It processes a set of training examples and for each feature calculates {$$}\mu{/$$} and {$$}\sigma^2{/$$}. We are also training for a third parameter: an epsilon “cutoff” value: if {$$}p(x; \mu, \sigma^2){/$$} evaluates to a value greater than epsilon for a given input vector, then the input vector is “normal”; a value less than epsilon implies that the input vector is an “anomaly.” The math for calculating these parameters from training data is fairly easy but the code is not: we need to organize the training data and search for a value of epsilon that minimizes the error for a cross validation data set.
To be clear: we separate the input examples into three separate sets of training, cross validation, and testing data. We use the training data to set the model parameters, use the cross validation data to learn an epsilon value, and finally use the testing data to get precision, recall, and F1 scores that indicate how well the model detects anomalies in data not used for training and cross validation.
I present the example program as one long listing, with more code explanation after the listing. Please note the long constructor that starts on line 23 and ends on line 74; it loops over each input training example. The code in lines 25 through 44 splits the input training data into three disjoint sets of training, cross validation, and testing data. Then the code in lines 45 through 63 copies these three sets of data to Java arrays.
The code in lines 65 through 73 calculates the value of {$$}\mu{/$$} (the variable mu in the code) for each feature.
Please note in the code example that I prepend class variables used in methods with “this.” even when it is not required. I do this for legibility; it is a personal style choice.
Here is the code for the AnomalyDetection class:
1 package com.markwatson.anomaly_detection;
2
3 import java.util.ArrayList;
4 import java.util.List;
5
6 /**
7 * Created by markw on 10/7/15.
8 */
9 public class AnomalyDetection {
10
11 public AnomalyDetection() { }
12
13 /**
14 * AnomalyDetection is a general purpose class for building anomaly detection
15 * models. You should use this type of model when you have mostly negative
16 * training examples with relatively few positive examples and you need a model
17 * that detects positive (anomaly) inputs.
18 *
19 * @param num_features
20 * @param num_training_examples
21 * @param training_examples [num_training_examples][num_features]
22 */
23 public AnomalyDetection(int num_features, int num_training_examples,
24 double [][] training_examples) {
25 List<double[]> training = new ArrayList<>();
26 List<double []> cross_validation = new ArrayList<>();
27 List<double []> testing = new ArrayList<>();
28 int outcome_index = num_features - 1; // index of target outcome
29 for (int i=0; i<num_training_examples; i++) {
30 if (Math.random() < 0.6) { // 60% training
31 // Only keep normal (negative) examples in training, except
32 // remember the reason for using this algorithm is that it
33 // works with many more negative examples than positive
34 // examples, and that the algorithm works even with some positive
35 // examples mixed into the training set. The random test is to
36 // allow about 10% positive examples to get into the training set:
37 if (training_examples[i][outcome_index] < 0.5 || Math.random() < 0.1)
38 training.add(training_examples[i]);
39 } else if (Math.random() < 0.7) { // ~70% of the remainder: cross validation
40 cross_validation.add(training_examples[i]);
41 } else {
42 testing.add(training_examples[i]);
43 }
44 }
45 this.num_training_examples = training.size();
46 this.num_cross_validation_examples = cross_validation.size();
47 this.num_testing_examples = testing.size();
48
49 this.training_examples = new double[this.num_training_examples][];
50 for (int k=0; k<this.num_training_examples; k++) {
51 this.training_examples[k] = training.get(k);
52 }
53
54 this.cross_validation_examples = new double[num_cross_validation_examples][];
55 for (int k=0; k<num_cross_validation_examples; k++) {
56 this.cross_validation_examples[k] = cross_validation.get(k);
57 }
58
59 this.testing_examples = new double[num_testing_examples][];
60 for (int k=0; k<num_testing_examples; k++) {
61 // each row of testing_examples has num_features values:
62 this.testing_examples[k] = testing.get(k);
63 }
64
65 this.mu = new double[num_features];
66 this.sigma_squared = new double[num_features];
67 this.num_features = num_features;
68 for (int nf = 0; nf < this.num_features; nf++) { // mean of each feature
69 double sum = 0;
70 for (int nt = 0; nt < this.num_training_examples; nt++)
71 sum += this.training_examples[nt][nf]; // use the filtered training set
72 this.mu[nf] = sum / this.num_training_examples;
73 }
74 }
75
76 public double [] muValues() { return mu; }
77 public double [] sigmaSquared() { return sigma_squared; }
78 public double bestEpsilon() { return best_epsilon; }
79
80 /**
81 *
82 * Train the model using a range of epsilon values. Epsilon is a
83 * hyper parameter that we want to find a value that
84 * minimizes the error count.
85 */
86 public void train() {
87 double best_error_count = 1e10;
88 for (int epsilon_loop=0; epsilon_loop<40; epsilon_loop++) {
89 double epsilon = 0.05 + 0.01 * epsilon_loop;
90 double error_count = train_helper(epsilon);
91 if (error_count < best_error_count) {
92 best_error_count = error_count;
93 best_epsilon = epsilon;
94 }
95 }
96 System.out.println("\n**** Best epsilon value = " + best_epsilon );
97
98 // retrain for the best epsilon setting the maximum likelihood parameters
99 // which are now in epsilon, mu[] and sigma_squared[]:
100 train_helper(best_epsilon);
101
102 // finally, we are ready to see how the model performs with test data:
103 test(best_epsilon);
104 }
105
106 /**
107 * calculate a simplified probability score p(x; mu, sigma_squared)
108 *
109 * @param x - example vector
110 * @return the averaged per-feature score for the input vector x
111 */
112 private double p(double [] x) {
113 double sum = 0;
114 // use (num_features - 1) to skip target output:
115 for (int nf=0; nf<num_features - 1; nf++) {
116 sum += (1.0 / (SQRT_2_PI * sigma_squared[nf])) *
117 Math.exp(- (x[nf] - mu[nf]) * (x[nf] - mu[nf])); // note: no / (2 * sigma_squared)
118 }
119 return sum / num_features;
120 }
121
122 /**
123 * returns true if the input vector is an anomaly
124 *
125 * @param x - example vector
126 * @return
127 */
128 public boolean isAnamoly(double [] x) {
129 double sum = 0;
130 // use (num_features - 1) to skip target output:
131 for (int nf=0; nf<num_features - 1; nf++) {
132 sum += (1.0 / (SQRT_2_PI * sigma_squared[nf])) *
133 Math.exp(- (x[nf] - mu[nf]) * (x[nf] - mu[nf]));
134 }
135 return (sum / num_features) < best_epsilon;
136 }
137
138 private double train_helper(double epsilon) {
139 // use (num_features - 1) to skip target output:
140 for (int nf = 0; nf < this.num_features - 1; nf++) {
141 double sum = 0;
142 for (int nt=0; nt<this.num_training_examples; nt++) {
143 sum += (this.training_examples[nt][nf] - mu[nf]) *
144 (this.training_examples[nt][nf] - mu[nf]);
145 }
146 sigma_squared[nf] = (1.0 / num_features) * sum; // note: scaled by num_features, not m
147 }
148 double cross_validation_error_count = 0;
149 for (int nt=0; nt<this.num_cross_validation_examples; nt++) {
150 double[] x = this.cross_validation_examples[nt];
151 double calculated_value = p(x);
152 if (x[num_features - 1] > 0.5) { // target training output is ANOMALY
153 // if the calculated value is greater than epsilon
154 // then this x vector is not an anomaly (i.e., it
155 // is a normal sample):
156 if (calculated_value > epsilon) cross_validation_error_count += 1;
157 } else { // target training output is NORMAL
158 if (calculated_value < epsilon) cross_validation_error_count += 1;
159 }
160 }
161 System.out.println(" cross_validation_error_count = " +
162 cross_validation_error_count +
163 " for epsilon = " + epsilon);
164 return cross_validation_error_count;
165 }
166
167 private double test(double epsilon) {
168 double num_false_positives = 0, num_false_negatives = 0;
169 double num_true_positives = 0, num_true_negatives = 0;
170 for (int nt=0; nt<this.num_testing_examples; nt++) {
171 double[] x = this.testing_examples[nt];
172 double calculated_value = p(x);
173 if (x[num_features - 1] > 0.5) { // target training output is ANOMALY
174 if (calculated_value > epsilon) num_false_negatives++;
175 else num_true_positives++;
176 } else { // target training output is NORMAL
177 if (calculated_value < epsilon) num_false_positives++;
178 else num_true_negatives++;
179 }
180 }
181 double precision = num_true_positives /
182 (num_true_positives + num_false_positives);
183 double recall = num_true_positives /
184 (num_true_positives + num_false_negatives);
185 double F1 = 2 * precision * recall / (precision + recall);
186
187 System.out.println("\n\n -- best epsilon = " + this.best_epsilon);
188 System.out.println(" -- number of test examples = " +
189 this.num_testing_examples);
190 System.out.println(" -- number of false positives = " + num_false_positives);
191 System.out.println(" -- number of true positives = " + num_true_positives);
192 System.out.println(" -- number of false negatives = " + num_false_negatives);
193 System.out.println(" -- number of true negatives = " + num_true_negatives);
194 System.out.println(" -- precision = " + precision);
195 System.out.println(" -- recall = " + recall);
196 System.out.println(" -- F1 = " + F1);
197 return F1;
198 }
199
200 private double best_epsilon = 0.02;
201 private double [] mu; // [num_features]
202 private double [] sigma_squared; // [num_features]
203 private int num_features;
204 private static final double SQRT_2_PI = 2.50662827463; // Math.sqrt(2.0 * Math.PI)
205
206
207 private double[][] training_examples; // [num_training_examples][num_features]
208 // [num_cross_validation_examples][num_features]:
209 private double[][] cross_validation_examples;
210 private double[][] testing_examples; // [num_testing_examples][num_features]
211 private int num_training_examples;
212 private int num_cross_validation_examples;
213 private int num_testing_examples;
214
215 }
Once the training data and the values of {$$}\mu{/$$} (the variable mu in the code) are defined for each feature, we can define the method train in lines 86 through 104, which finds the best epsilon “cutoff” value using the method train_helper defined in lines 138 through 165. We choose the “best” epsilon value by testing each candidate against the separate cross validation data set inside train_helper; after the best epsilon is found, the method test defined in lines 167 through 198 evaluates the final model on the held-out testing data. One more note on the code: the score calculated in the method p (lines 112 through 120) is a simplified version of the Gaussian density from the math primer. The exponent is not divided by {$$}2\sigma^2{/$$} (so deviations are not scaled by each feature’s variance), the per-feature values are averaged rather than multiplied together, and sigma_squared is scaled by num_features rather than by the number of training examples. Because the epsilon cutoff is learned by searching over this same score, the model is self-consistent, but keep in mind that the score is not the exact textbook density.
Example Using the University of Wisconsin Cancer Data
The example in this section loads the University of Wisconsin data and uses the class AnomalyDetection developed in the last section to find anomalies, which for this example are input vectors that represent malignancy in the original data.
1 package com.markwatson.anomaly_detection;
2
3 import java.io.*;
4 import org.apache.commons.io.FileUtils;
5
6 /**
7 * Train an anomaly detection model on the University of Wisconsin Cancer Data Set.
8 */
9 public class WisconsinAnomalyDetection {
10
11 private static boolean PRINT_HISTOGRAMS = true;
12
13 public static void main(String[] args) throws Exception {
14
15 String [] lines =
16 FileUtils.readFileToString(
17 new File("data/cleaned_wisconsin_cancer_data.csv")).split("\n");
18 double [][] training_data = new double[lines.length][];
19 int line_count = 0;
20 for (String line : lines) {
21 String [] sarr = line.split(",");
22 double [] xs = new double[10];
23 for (int i=0; i<10; i++) xs[i] = Double.parseDouble(sarr[i]);
24 for (int i=0; i<9; i++) xs[i] *= 0.1;
25 xs[9] = (xs[9] - 2) * 0.5; // make target output be [0,1] instead of [2,4]
26 training_data[line_count++] = xs;
27 }
28
29 if (PRINT_HISTOGRAMS) {
30 PrintHistogram.historam("Clump Thickness", training_data, 0, 0.0, 1.0, 10);
31 PrintHistogram.historam("Uniformity of Cell Size", training_data,
32 1, 0.0, 1.0, 10);
33 PrintHistogram.historam("Uniformity of Cell Shape", training_data,
34 2, 0.0, 1.0, 10);
35 PrintHistogram.historam("Marginal Adhesion", training_data,
36 3, 0.0, 1.0, 10);
37 PrintHistogram.historam("Single Epithelial Cell Size", training_data,
38 4, 0.0, 1.0, 10);
39 PrintHistogram.historam("Bare Nuclei", training_data, 5, 0.0, 1.0, 10);
40 PrintHistogram.historam("Bland Chromatin", training_data, 6, 0.0, 1.0, 10);
41 PrintHistogram.historam("Normal Nucleoli", training_data, 7, 0.0, 1.0, 10);
42 PrintHistogram.historam("Mitoses", training_data, 8, 0.0, 1.0, 10);
43 }
44
45 AnomalyDetection detector = new AnomalyDetection(10, line_count - 1,
46 training_data);
47
48 // the train method will print training results like
49 // precision, recall, and F1:
50 detector.train();
51
52 // get best model parameters:
53 double best_epsilon = detector.bestEpsilon();
54 double [] mean_values = detector.muValues();
55 double [] sigma_squared = detector.sigmaSquared();
56
57 // to use this model, use the method AnomalyDetection.isAnamoly(double []):
58
59 double [] test_malignant = new double[] {0.5,1,1,0.8,0.5,0.5,0.7,1,0.1};
60 double [] test_benign = new double[] {0.5,0.4,0.5,0.1,0.8,0.1,0.3,0.6,0.1};
61 boolean malignant_result = detector.isAnamoly(test_malignant);
62 boolean benign_result = detector.isAnamoly(test_benign);
63 System.out.println("\n\nUsing the trained model:");
64 System.out.println("malignant result = " + malignant_result +
65 ", benign result = " + benign_result);
66 }
67 }
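The PrintHistogram utility class called above is not listed in this chapter. The following is a minimal sketch consistent with how it is called: the (misspelled) method name historam matches the calls in WisconsinAnomalyDetection, and I am assuming the final parameter is the number of bins. The version in the GitHub repository may well differ; for example, the histograms printed below use five bins.

1 package com.markwatson.anomaly_detection;
2
3 /**
4 * Minimal sketch of the PrintHistogram utility, which is not listed in
5 * this chapter. I am assuming the final parameter is the number of bins;
6 * the version in the GitHub repository may differ (the output shown
7 * below uses five bins).
8 */
9 public class PrintHistogram {
10 public static void historam(String title, double[][] data, int featureIndex,
11 double min, double max, int numBins) {
12 int[] bins = new int[numBins];
13 for (double[] row : data) {
14 if (row == null) continue; // skip any unfilled rows
15 int bin = (int) ((row[featureIndex] - min) / (max - min) * numBins);
16 if (bin < 0) bin = 0;
17 if (bin >= numBins) bin = numBins - 1; // clamp values equal to max
18 bins[bin]++;
19 }
20 System.out.println("\n" + title);
21 for (int i = 0; i < numBins; i++) System.out.println(i + " " + bins[i]);
22 }
23 }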
Data used by an anomaly detection model should have a (roughly) Gaussian (bell curve) distribution. What form does the cancer data have? Unfortunately, each of the data features seems to have either a greater density at the lower range of feature values or a large density at the extremes of the feature’s range. This will cause our model to not perform as well as we would like. Here are the inputs displayed as five-bin histograms:
1 Clump Thickness
2 0 177
3 1 174
4 2 154
5 3 63
6 4 80
7
8 Uniformity of Cell Size
9 0 393
10 1 86
11 2 54
12 3 43
13 4 72
14
15 Uniformity of Cell Shape
16 0 380
17 1 92
18 2 58
19 3 54
20 4 64
21
22 Marginal Adhesion
23 0 427
24 1 85
25 2 42
26 3 37
27 4 57
28
29 Single Epithelial Cell Size
30 0 394
31 1 117
32 2 76
33 3 28
34 4 33
35
36 Bare Nuclei
37 0 409
38 1 44
39 2 33
40 3 27
41 4 135
42
43 Bland Chromatin
44 0 298
45 1 182
46 2 41
47 3 97
48 4 30
49
50 Normal Nucleoli
51 0 442
52 1 56
53 2 39
54 3 37
55 4 74
56
57 Mitoses
58 0 567
59 1 42
60 2 8
61 3 17
62 4 14
I won’t do it in this example, but the feature “Bare Nuclei” should be removed because it is not even close to having a bell-shaped distribution. Another thing that you can do (recommended by Andrew Ng in his Coursera Machine Learning class) is to take the log of the data or otherwise transform it into something that looks more like a Gaussian distribution. In the class WisconsinAnomalyDetection you could, for example, transform the data using something like:
1 // make the data look more like a Gaussian (bell curve shaped) distribution:
2 double min = 1.e6, max = -1.e6;
3 for (int i=0; i<9; i++) {
4 xs[i] = Math.log(xs[i] + 1.2);
5 if (xs[i] < min) min = xs[i];
6 if (xs[i] > max) max = xs[i];
7 }
8 for (int i=0; i<9; i++) xs[i] = (xs[i] - min) / (max - min);
The constant 1.2 in line 4 is a tuning parameter that I found by trial and error, adjusting the factor and looking at the resulting data histograms.
In a real application you would drop features that you cannot transform into something resembling a Gaussian distribution, as in the sketch below.
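For example, here is a hypothetical helper, not part of the book's example code, for dropping a feature column such as “Bare Nuclei” (feature index 5) before training:

1 // Hypothetical helper, not in the book's example code: return a copy of
2 // the data with one feature column removed.
3 static double[][] dropFeature(double[][] data, int dropIndex) {
4 double[][] result = new double[data.length][];
5 for (int i = 0; i < data.length; i++) {
6 double[] newRow = new double[data[i].length - 1];
7 int k = 0;
8 for (int j = 0; j < data[i].length; j++) {
9 if (j != dropIndex) newRow[k++] = data[i][j];
10 }
11 result[i] = newRow;
12 }
13 return result;
14 }

With “Bare Nuclei” removed you would construct the model with nine features instead of ten, for example training_data = dropFeature(training_data, 5) followed by new AnomalyDetection(9, line_count - 1, training_data). Note that the target output column then moves from index 9 to index 8, which is why the AnomalyDetection code tests x[num_features - 1] rather than a hard-coded index.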
Here are the results of running the code as it is in the GitHub repository for this book:
1 -- best epsilon = 0.28
2 -- number of test examples = 63
3 -- number of false positives = 0.0
4 -- number of true positives = 11.0
5 -- number of false negatives = 8.0
6 -- number of true negatives = 44.0
7 -- precision = 1.0
8 -- recall = 0.5789473684210527
9 -- F1 = 0.7333333333333334
10
11
12 Using the trained model:
13 malignant result = true, benign result = false
How do we evaluate these results? The precision value of 1.0 means that there were no false positives; false positives are predictions of a positive (anomaly) result for samples that are actually negative. The recall value of 0.578 means that of all the samples that should have been classified as positive, we only predicted about 57.8% of them. The F1 score is two times the product of precision and recall, divided by the sum of precision and recall.
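Using the counts printed above (11 true positives, 0 false positives, and 8 false negatives), the calculations work out as:

{$$}precision = \frac{11}{11 + 0} = 1.0{/$$}

{$$}recall = \frac{11}{11 + 8} \approx 0.579{/$$}

{$$}F1 = \frac{2 \cdot 1.0 \cdot 0.579}{1.0 + 0.579} \approx 0.733{/$$}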