Anomaly Detection Machine Learning Example

Anomaly detection models address one specific class of use cases: those in which you have many negative (non-anomaly) examples and relatively few positive (anomaly) examples. We can refer to this as an unbalanced training set. For training we will largely ignore the positive examples, build a model of “how things should be,” and then use that model to flag inputs that differ significantly from the negative examples it was trained on.

If you have a large training set containing both negative and positive examples then do not use anomaly detection models. If your training examples are balanced then use a classification model, as we will see later in the chapter Deep Learning Using Deeplearning4j.

Motivation for Anomaly Detection

There are two other examples in this book that use the University of Wisconsin cancer data. Those examples use supervised learning. Anomaly detection as we do it in this chapter is, more or less, unsupervised learning.

When should we use anomaly detection? You should use supervised learning algorithms like neural networks and logistic classification when there are roughly equal numbers of negative and positive examples available in the training data. The University of Wisconsin cancer data set is fairly evenly split between negative and positive examples.

Anomaly detection should be used when you have many negative (“normal”) examples and relatively few positive (“anomaly”) examples. For the example in this chapter we will simulate scarcity of positive (“anomaly”) results by preparing the Wisconsin cancer data as follows:

  • We will randomly split the data into training (about 60%), cross validation (about 28%), and testing (about 12%) sets; these fractions follow from the random tests in the constructor code below.
  • For the training data we will discard most of the positive (“anomaly”) examples, randomly letting only about 10% of them through. We do this to simulate the real world case where a few positive examples are likely to end up in the training data even though we would prefer the training data to contain only negative (“normal”) examples.
  • We will use the cross validation data to find a good value for the epsilon hyperparameter.
  • After we find a good epsilon value, we will calculate the F1 score for the model.

Math Primer for Anomaly Detection

We are trying to model “normal” behavior, and we do this by fitting a Gaussian (bell curve) distribution to each feature. The learned parameters of a Gaussian distribution are the mean (where the bell shaped curve is centered) and the variance. You might be more familiar with the standard deviation {$$}\sigma{/$$}; the variance is defined as {$$}\sigma^2{/$$}.

We will need to calculate the probability of a value x given the mean and variance of a probability distribution, {$$}P(x : \mu, \sigma^2){/$$}, where {$$}\mu{/$$} is the mean and {$$}\sigma^2{/$$} is the variance:

{$$}P(x : \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-(x - \mu)^2 / (2\sigma^2)}{/$$}

where {$$}x_i{/$$} are the samples and we can calculate the variance as:

{$$}\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu)^2{/$$}

We calculate the parameters {$$}\mu{/$$} and {$$}\sigma^2{/$$} for each feature. A bell shaped distribution is easy to visualize in two dimensions, as is an inverted bowl shape in three dimensions. What if we have many features? The math still works, so don’t worry about not being able to picture it in your mind.
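
To make the math concrete, here is a minimal, self-contained Java sketch that fits {$$}\mu{/$$} and {$$}\sigma^2{/$$} to a single feature and then evaluates the Gaussian density. The class and method names are illustrative only; they are not part of the AnomalyDetection class developed below:

    public class GaussianSketch {
      // mean of one feature across m samples
      static double fitMean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
      }
      // variance: (1/m) * sum of (x_i - mu)^2
      static double fitVariance(double[] xs, double mu) {
        double sum = 0;
        for (double x : xs) sum += (x - mu) * (x - mu);
        return sum / xs.length;
      }
      // Gaussian density P(x : mu, sigma^2)
      static double gaussian(double x, double mu, double sigmaSquared) {
        double sigma = Math.sqrt(sigmaSquared);
        return (1.0 / (sigma * Math.sqrt(2.0 * Math.PI)))
               * Math.exp(-(x - mu) * (x - mu) / (2.0 * sigmaSquared));
      }
      public static void main(String[] args) {
        double[] feature = {0.2, 0.3, 0.25, 0.4, 0.35};
        double mu = fitMean(feature);
        double sigmaSquared = fitVariance(feature, mu);
        System.out.println("mu = " + mu + ", sigma^2 = " + sigmaSquared);
        System.out.println("P(0.3) = " + gaussian(0.3, mu, sigmaSquared));
      }
    }

The AnomalyDetection class in the next section performs this same fitting for every feature at once.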

AnomalyDetection Utility Class

The class AnomalyDetection developed in this section is fairly general purpose. It processes a set of training examples and, for each feature, calculates {$$}\mu{/$$} and {$$}\sigma^2{/$$}. We also train a third parameter, an epsilon “cutoff” value: if {$$}P(x : \mu, \sigma^2){/$$} for a given input vector evaluates to a value greater than epsilon then the input vector is “normal”; a value less than epsilon implies that the input vector is an “anomaly.” The math for calculating these parameters from the training data is fairly easy but the code is not: we need to organize the training data and search for a value of epsilon that minimizes the error on a cross validation data set.
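
In miniature, that cutoff is just a threshold test. Here is a hedged sketch; the isNormal helper and its score argument are illustrative only, and the real decision logic lives in the isAnamoly method of the listing below:

    // minimal sketch of the epsilon cutoff rule: a combined density score
    // at or above epsilon means "normal", below epsilon means "anomaly"
    static boolean isNormal(double score, double epsilon) {
      return score >= epsilon;
    }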

To be clear: we separate the input examples into three separate sets of training, cross validation, and testing data. We use the training data to set the model parameters, use the cross validation data to learn an epsilon value, and finally use the testing data to get precision, recall, and F1 scores that indicate how well the model detects anomalies in data not used for training and cross validation.

I present the example program as one long listing, with more code explanation after the listing. Please note the long constructor starting at line 23 and ending on line 74. The code in lines 25 through 44 processes the input training data, one example at a time, into three disjoint sets of training, cross validation, and testing data. Then the code in lines 45 through 63 copies these three sets of data to Java arrays.

The code in lines 65 through 73 calculates, for each feature, the value of {$$}\mu{/$$} (the variable mu in the code) averaged over the training examples.

Please note in the code example that I prepend class variables used in methods with “this.” even when it is not required. I do this for legibility; it is a personal style choice.

The following UML class diagram will give you an overview before we dive into the code:

UML class diagram for the anomaly detection example
  1 package com.markwatson.anomaly_detection;
  2 
  3 import java.util.ArrayList;
  4 import java.util.List;
  5 
  6 /**
  7  * Created by markw on 10/7/15.
  8  */
  9 public class AnomalyDetection {
 10 
 11   public AnomalyDetection() { }
 12 
 13   /**
 14    * AnomalyDetection is a general purpose class for building anomaly detection
 15    * models. You should use this type of model when you have mostly negative
 16    * training examples with relatively few positive examples and you need a model
 17    * that detects positive (anomaly) inputs.
 18    *
 19    * @param num_features
 20    * @param num_training_examples
 21    * @param training_examples [num_training_examples][num_features]
 22    */
 23   public AnomalyDetection(int num_features, int num_training_examples,
 24                           double [][] training_examples) {
 25     List<double[]> training = new ArrayList<>();
 26     List<double []> cross_validation = new ArrayList<>();
 27     List<double []> testing = new ArrayList<>();
 28     int outcome_index = num_features - 1; // index of target outcome
 29     for (int i=0; i<num_training_examples; i++) {
 30       if (Math.random() < 0.6) { // 60% training
 31         // Only keep normal (negative) examples in training, except
 32         // remember the reason for using this algorithm is that it
 33         // works with many more negative examples than positive
 34         // examples, and that the algorithm works even with some positive
 35         // examples mixed into the training set. The random test is to
 36         // allow about 10% positive examples to get into the training set:
 37         if (training_examples[i][outcome_index] < 0.5 || Math.random() < 0.1)
 38           training.add(training_examples[i]);
 39       } else if (Math.random() < 0.7) {
 40           cross_validation.add(training_examples[i]);
 41       } else {
 42         testing.add(training_examples[i]);
 43       }
 44     }
 45     this.num_training_examples = training.size();
 46     this.num_cross_validation_examples = cross_validation.size();
 47     this.num_testing_examples = testing.size();
 48 
 49     this.training_examples = new double[this.num_training_examples][];
 50     for (int k=0; k<this.num_training_examples; k++) {
 51       this.training_examples[k] = training.get(k);
 52     }
 53 
 54     this.cross_validation_examples = new double[num_cross_validation_examples][];
 55     for (int k=0; k<num_cross_validation_examples; k++) {
 56       this.cross_validation_examples[k] = cross_validation.get(k);
 57     }
 58 
 59     this.testing_examples = new double[num_testing_examples][];
 60     for (int k=0; k<num_testing_examples; k++) {
 61       // dimensions of [num_testing_examples][num_features]:
 62       this.testing_examples[k] = testing.get(k);
 63     }
 64 
 65     this.mu = new double[num_features];
 66     this.sigma_squared = new double[num_features];
 67     this.num_features = num_features;
 68     for (int nf = 0; nf < this.num_features; nf++) { // mean of each feature
 69       double sum = 0;
 70       for (int nt = 0; nt < this.num_training_examples; nt++)
 71         sum += training_examples[nt][nf];
 72       this.mu[nf] = sum / this.num_training_examples;
 73     }
 74   }
 75 
 76   public double [] muValues()     { return mu; }
 77   public double [] sigmaSquared() { return sigma_squared; }
 78   public double bestEpsilon()     { return best_epsilon; }
 79 
 80   /**
 81    *
 82    * Train the model using a range of epsilon values. Epsilon is a
 83    * hyperparameter; we search for the value that minimizes the
 84    * cross validation error count.
 85    */
 86   public void train() {
 87     double best_error_count = 1e10;
 88     for (int epsilon_loop=0; epsilon_loop<40; epsilon_loop++) {
 89       double epsilon = 0.05 + 0.01 * epsilon_loop;
 90       double error_count = train_helper(epsilon);
 91       if (error_count < best_error_count) {
 92         best_error_count = error_count;
 93         best_epsilon = epsilon;
 94       }
 95     }
 96     System.out.println("\n**** Best epsilon value = " + best_epsilon );
 97 
 98     // retrain for the best epsilon setting the maximum likelihood parameters
 99     // which are now in epsilon, mu[] and sigma_squared[]:
100     train_helper(best_epsilon);
101 
102     // finally, we are ready to see how the model performs with test data:
103     test(best_epsilon);
104   }
105 
106   /**
107    * calculate probability p(x; mu, sigma_squared)
108    *
109    * @param x - example vector
110    * @return
111    */
112   private double p(double [] x) {
113     double sum = 0;
114     // sum a simplified per-feature density, skipping the target output column:
115     for (int nf=0; nf<num_features - 1; nf++) {
116       sum += (1.0 / (SQRT_2_PI * sigma_squared[nf])) *
117              Math.exp(- (x[nf] - mu[nf]) * (x[nf] - mu[nf]));
118     }
119     return sum / num_features;
120   }
121 
122   /**
123    * returns true if the input vector is an anomaly
124    *
125    * @param x - example vector
126    * @return
127    */
128   public boolean isAnamoly(double [] x) {
129     double sum = 0;
131     // sum a simplified per-feature density, skipping the target output column:
131     for (int nf=0; nf<num_features - 1; nf++) {
132       sum += (1.0 / (SQRT_2_PI * sigma_squared[nf])) *
133              Math.exp(- (x[nf] - mu[nf]) * (x[nf] - mu[nf]));
134     }
135     return (sum / num_features) < best_epsilon;
136   }
137 
138   private double train_helper(double epsilon) {
139     // use (num_features - 1) to skip target output:
140     for (int nf = 0; nf < this.num_features - 1; nf++) {
141       double sum = 0;
142       for (int nt=0; nt<this.num_training_examples; nt++) {
143         sum += (this.training_examples[nt][nf] - mu[nf]) *
144                (this.training_examples[nt][nf] - mu[nf]);
145       }
146       sigma_squared[nf] = (1.0 / num_features) * sum; // note: scales by 1/num_features, not 1/m
147     }
148     double cross_validation_error_count = 0;
149     for (int nt=0; nt<this.num_cross_validation_examples; nt++) {
150       double[] x = this.cross_validation_examples[nt];
151       double calculated_value = p(x);
152       if (x[9] > 0.5) { // target training output is ANOMALY
153         // if the calculated value is greater than epsilon
154         // then this x vector is not an anomaly (i.e., it
155         // is a normal sample):
156         if (calculated_value > epsilon) cross_validation_error_count += 1;
157       } else { // target training output is NORMAL
158         if (calculated_value < epsilon) cross_validation_error_count += 1;
159       }
160     }
161     System.out.println("   cross_validation_error_count = " +
162                        cross_validation_error_count +
163                        " for epsilon = " + epsilon);
164     return cross_validation_error_count;
165   }
166 
167   private double test(double epsilon) {
168     double num_false_positives = 0, num_false_negatives = 0;
169     double num_true_positives = 0, num_true_negatives = 0;
170     for (int nt=0; nt<this.num_testing_examples; nt++) {
171       double[] x = this.testing_examples[nt];
172       double calculated_value = p(x);
173       if (x[9] > 0.5) { // target training output is ANOMALY
174         if (calculated_value > epsilon) num_false_negatives++;
175         else                            num_true_positives++;
176       } else { // target training output is NORMAL
177         if (calculated_value < epsilon) num_false_positives++;
178         else                            num_true_negatives++;
179       }
180     }
181     double precision = num_true_positives /
182                        (num_true_positives + num_false_positives);
183     double recall = num_true_positives /
184                     (num_true_positives + num_false_negatives);
185     double F1 = 2 * precision * recall / (precision + recall);
186 
187     System.out.println("\n\n -- best epsilon = " + this.best_epsilon);
188     System.out.println(" -- number of test examples = " +
189                        this.num_testing_examples);
190     System.out.println(" -- number of false positives = " + num_false_positives);
191     System.out.println(" -- number of true positives = " + num_true_positives);
192     System.out.println(" -- number of false negatives = " + num_false_negatives);
193     System.out.println(" -- number of true negatives = " + num_true_negatives);
194     System.out.println(" -- precision = " + precision);
195     System.out.println(" -- recall = " + recall);
196     System.out.println(" -- F1 = " + F1);
197     return F1;
198   }
199 
200   double best_epsilon = 0.02;
201   private double [] mu;  // [num_features]
202   private double [] sigma_squared; // [num_features]
203   private int num_features;
204   private static final double SQRT_2_PI = 2.50662827463; // sqrt(2 * PI)
205 
206 
207   private double[][] training_examples; // [num_training_examples][num_features]
208   // [num_cross_validation_examples][num_features]:
209   private double[][] cross_validation_examples;
210   private double[][] testing_examples; // [num_testing_examples][num_features]
211   private int num_training_examples;
212   private int num_cross_validation_examples;
213   private int num_testing_examples;
214 
215 }

Once the training data and the values of {$$}\mu{/$$} (the variable mu in the code) are defined for each feature, we can define the method train in lines 86 through 104, which finds the best epsilon “cutoff” value by calling the method train_helper, defined in lines 138 through 165, over a range of candidate epsilon values. train_helper scores each candidate against the separate cross validation data set. Finally, train calls the method test, defined in lines 167 through 198, to evaluate the model on the held out testing data.

Example Using the University of Wisconsin Cancer Data

The example in this section loads the University of Wisconsin data and uses the class AnomalyDetection developed in the last section to find anomalies, which for this example are input vectors that represent malignancy in the original data.

 1 package com.markwatson.anomaly_detection;
 2 
 3 import java.io.*;
 4 import org.apache.commons.io.FileUtils;
 5 
 6 /**
 7  * Train an anomaly detection model on the University of Wisconsin Cancer Data Set.
 8  */
 9 public class WisconsinAnomalyDetection {
10 
11   private static boolean PRINT_HISTOGRAMS = true;
12 
13   public static void main(String[] args) throws Exception {
14 
15     String [] lines =
16       FileUtils.readFileToString(
17          new File("data/cleaned_wisconsin_cancer_data.csv")).split("\n");
18     double [][] training_data = new double[lines.length][];
19     int line_count = 0;
20     for (String line : lines) {
21       String [] sarr = line.split(",");
22       double [] xs = new double[10];
23       for (int i=0; i<10; i++) xs[i] = Double.parseDouble(sarr[i]);
24       for (int i=0; i<9; i++) xs[i] *= 0.1;
25       xs[9] = (xs[9] - 2) * 0.5; // make target output be [0,1] instead of [2,4]
26       training_data[line_count++] = xs;
27     }
28 
29     if (PRINT_HISTOGRAMS) {
30       PrintHistogram.historam("Clump Thickness", training_data, 0, 0.0, 1.0, 10);
31       PrintHistogram.historam("Uniformity of Cell Size", training_data,
32                               1, 0.0, 1.0, 10);
33       PrintHistogram.historam("Uniformity of Cell Shape", training_data,
34                               2, 0.0, 1.0, 10);
35       PrintHistogram.historam("Marginal Adhesion", training_data,
36                               3, 0.0, 1.0, 10);
37       PrintHistogram.historam("Single Epithelial Cell Size", training_data,
38                               4, 0.0, 1.0, 10);
39       PrintHistogram.historam("Bare Nuclei", training_data, 5, 0.0, 1.0, 10);
40       PrintHistogram.historam("Bland Chromatin", training_data, 6, 0.0, 1.0, 10);
41       PrintHistogram.historam("Normal Nucleoli", training_data, 7, 0.0, 1.0, 10);
42       PrintHistogram.historam("Mitoses", training_data, 8, 0.0, 1.0, 10);
43     }
44     
45     AnomalyDetection detector = new AnomalyDetection(10, line_count - 1,
46                                                      training_data);
47 
48     // the train method will print training results like
49     // precision, recall, and F1:
50     detector.train();
51 
52     // get best model parameters:
53     double best_epsilon = detector.bestEpsilon();
54     double [] mean_values = detector.muValues();
55     double [] sigma_squared = detector.sigmaSquared();
56 
57     // to use this model, use the method AnomalyDetection.isAnamoly(double []):
58 
59     double [] test_malignant = new double[] {0.5,1,1,0.8,0.5,0.5,0.7,1,0.1};
60     double [] test_benign = new double[] {0.5,0.4,0.5,0.1,0.8,0.1,0.3,0.6,0.1};
61     boolean malignant_result = detector.isAnamoly(test_malignant);
62     boolean benign_result = detector.isAnamoly(test_benign);
63     System.out.println("\n\nUsing the trained model:");
64     System.out.println("malignant result = " + malignant_result +
65                        ", benign result = " + benign_result);
66   }
67 }

Data used by an anomaly detection model should have a (roughly) Gaussian (bell curve shaped) distribution. What form does the cancer data have? Unfortunately, each of the data features has its density concentrated either at the low end of the feature’s range or at the extremes of the range. This will keep our model from performing as well as we would like. Here are the inputs displayed as five-bin histograms:

 1 Clump Thickness
 2 0   177
 3 1   174
 4 2   154
 5 3   63
 6 4   80
 7 
 8 Uniformity of Cell Size
 9 0   393
10 1   86
11 2   54
12 3   43
13 4   72
14 
15 Uniformity of Cell Shape
16 0   380
17 1   92
18 2   58
19 3   54
20 4   64
21 
22 Marginal Adhesion
23 0   427
24 1   85
25 2   42
26 3   37
27 4   57
28 
29 Single Epithelial Cell Size
30 0   394
31 1   117
32 2   76
33 3   28
34 4   33
35 
36 Bare Nuclei
37 0   409
38 1   44
39 2   33
40 3   27
41 4   135
42 
43 Bland Chromatin
44 0   298
45 1   182
46 2   41
47 3   97
48 4   30
49 
50 Normal Nucleoli
51 0   442
52 1   56
53 2   39
54 3   37
55 4   74
56 
57 Mitoses
58 0   567
59 1   42
60 2   8
61 3   17
62 4   14

I won’t do it in this example, but the feature “Bare Nuclei” should be removed because its distribution is not even close to bell-shaped. Another thing that you can do (recommended by Andrew Ng in his Coursera Machine Learning class) is to take the log of the data, or otherwise transform it into something that looks more like a Gaussian distribution. In the class WisconsinAnomalyDetection you could, for example, transform the data using something like:

1   // make the data look more like a Gaussian (bell curve shaped) distribution:
2   double min = 1.e6, max = -1.e6;
3   for (int i=0; i<9; i++) {
4     xs[i] = Math.log(xs[i] + 1.2);
5     if (xs[i] < min) min = xs[i];
6     if (xs[i] > max) max = xs[i];
7   }
8   for (int i=0; i<9; i++) xs[i] = (xs[i] - min) / (max - min);

The constant 1.2 in line 4 is a tuning parameter that I found by trial and error: adjusting the factor and looking at the resulting data histograms.
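
If you would rather automate that trial and error, a sketch like the following loops over candidate offsets; the candidate values, and the use of the mean rescaled value as a crude proxy for how centered each distribution is, are my own assumptions rather than anything from the book’s code:

    // Sketch: try several offsets for the log transform; for each, apply the
    // same per-example log-and-rescale used above and print the mean rescaled
    // value (values nearer 0.5 suggest a more centered distribution).
    static void tryOffsets(double[][] data) {
      double[] candidates = {1.0, 1.1, 1.2, 1.5, 2.0};
      for (double offset : candidates) {
        double total = 0;
        int count = 0;
        for (double[] xs : data) {
          double min = 1e6, max = -1e6;
          double[] t = new double[9];
          for (int i = 0; i < 9; i++) {
            t[i] = Math.log(xs[i] + offset);
            if (t[i] < min) min = t[i];
            if (t[i] > max) max = t[i];
          }
          if (max == min) continue; // skip degenerate rows
          for (int i = 0; i < 9; i++) {
            total += (t[i] - min) / (max - min);
            count++;
          }
        }
        System.out.println("offset " + offset +
                           ": mean rescaled value = " + (total / count));
      }
    }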

In a real application you would drop features that you cannot transform into something resembling a Gaussian distribution.
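
As a sketch of that idea, dropping a feature column such as “Bare Nuclei” (feature index 5) before training might look like the following; dropColumn is a hypothetical helper, not part of the book’s classes:

    // sketch: return a copy of the data with one feature column removed
    static double[][] dropColumn(double[][] data, int column) {
      double[][] result = new double[data.length][];
      for (int i = 0; i < data.length; i++) {
        double[] row = new double[data[i].length - 1];
        int k = 0;
        for (int j = 0; j < data[i].length; j++) {
          if (j != column) row[k++] = data[i][j];
        }
        result[i] = row;
      }
      return result;
    }

You would then construct AnomalyDetection with num_features reduced by one.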

Here are the results of running the code as it is in the github repository for this book:

 1  -- best epsilon = 0.28
 2  -- number of test examples = 63
 3  -- number of false positives = 0.0
 4  -- number of true positives = 11.0
 5  -- number of false negatives = 8.0
 6  -- number of true negatives = 44.0
 7  -- precision = 1.0
 8  -- recall = 0.5789473684210527
 9  -- F1 = 0.7333333333333334
10 
11 
12 Using the trained model:
13 malignant result = true, benign result = false

How do we evaluate these results? The precision value of 1.0 means that there were no false positives; a false positive is a prediction of an anomaly (a malignant sample) when the sample is actually normal. The recall value of about 0.579 means that, of all the samples that should have been classified as positive, we only detected about 57.9% of them. The F1 score combines the two: two times the product of precision and recall, divided by the sum of precision and recall.
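
Plugging the test run above into these definitions as a worked check:

{$$}precision = \frac{11}{11 + 0} = 1.0, \quad recall = \frac{11}{11 + 8} \approx 0.579{/$$}

{$$}F_1 = \frac{2 \cdot 1.0 \cdot 0.579}{1.0 + 0.579} \approx 0.733{/$$}

These agree with the precision, recall, and F1 values printed by the program.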