(Almost) Everything You Really Need To Know About Machine Learning

In this chapter we will look at modern best practices for machine learning. I will cover some theory along with simple implementations that you can use for experimenting and for small or medium scale machine learning tasks. For larger scale problems we will cover Google’s TensorFlow system in a later chapter. This is a practical chapter: you can learn some theory and some practical techniques here, and later use the chapter as a reference. I will use the notation that Andrew Ng uses in his online Machine Learning class at Coursera. I highly recommend this class!


Introduction to Supervised and Unsupervised Learning

When we train a model, what are we really doing? We start with training cases that represent the kind of data we expect to process in the future, and we use that existing data to build a model: a prescription for processing similar data.

In supervised learning the representative training data includes both example inputs and the desired output for each example input. An important goal is building a generalized model, not simply a memory system that remembers the training inputs and outputs. A model that simply memorizes the training examples and does not generalize is said to suffer from overfitting. In later chapters we will look at TensorFlow examples that can optionally use two state-of-the-art techniques for reducing overfitting, regularization and dropout, but in this chapter we will prevent overfitting by limiting the number of parameters in a model and by training with a sufficient amount of data. In the case of simple neural networks, model parameters are the weights of connections between neurons. For more complicated deep learning neural networks, we also have hyperparameters that specify the number of neuron layers, the degree of connectivity between neurons in adjacent layers, etc.

Models have specific learned parameters. Most of the machine learning examples in this book involve neural networks; in this case the learned parameters are the weight values of the network, as introduced in the first chapter on cognitive computing. Another type of learned parameter is the set of hypothesis coefficients in a linear regression model.

In all types of learned models, regularization refers to penalizing models with large-valued learned parameters. When we train a model, we use a cost function to measure “how far off” the model is from the desired behavior, with smaller values of the cost function being better (in the sense of improving accuracy). If learned parameters are large, the cost function is increased by adding a penalty term: the product of a constant called the regularization constant and a measure of how large (in absolute value) the learned parameters are.
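The idea above can be sketched in a few lines of Python. This is a minimal illustration using L2 (squared-value) regularization on a linear regression cost; the function and parameter names are my own, not from any particular library:

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Mean squared error cost plus an L2 regularization penalty.

    theta: learned parameters, X: input matrix (first column all ones
    for the bias term), y: target values, lam: regularization constant.
    """
    m = len(y)
    predictions = X @ theta
    mse = np.sum((predictions - y) ** 2) / (2 * m)
    # Penalize large parameter values; by convention the bias
    # term theta[0] is excluded from the penalty.
    penalty = (lam / (2 * m)) * np.sum(theta[1:] ** 2)
    return mse + penalty
```

With `lam = 0` this reduces to the ordinary cost; increasing `lam` pushes training toward smaller parameter values at the expense of fit on the training data.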

The idea of using dropout is specific to more complex deep learning neural networks. As you know from the introduction to cognitive computing in the first chapter, neural networks consist of simulated neurons and the connections between them. During training we randomly turn off a subset of neurons, and we do not modify the connection weights between “dropped” neurons and other neurons. Using dropout allows us to use very complex networks (both in the number of layers and the number of neurons in each layer) without overfitting the training data.
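A minimal sketch of the dropout operation itself, assuming the common “inverted dropout” formulation (surviving activations are scaled up so the expected output magnitude is unchanged); libraries like TensorFlow provide this for you, so this is for illustration only:

```python
import numpy as np

def dropout(activations, keep_prob, rng):
    """Randomly zero a fraction (1 - keep_prob) of neuron activations
    during training, scaling survivors by 1/keep_prob."""
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob
```

At prediction time dropout is turned off and all neurons are used.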

Even though we will not cover unsupervised learning in this book, I wanted to mention situations where you may want to process training data that consists only of inputs, with no desired outputs. This might seem counterintuitive, so let’s look at an example of unsupervised learning: clustering unlabelled data into different sets. The data might be the set of all PDF documents in a company’s content management system. Even if the documents carry no classification labels, we can still automatically cluster them into sets of similar documents. An interesting feature of unsupervised clustering, a “feature” that is sometimes a problem, is that the clusters do not have human-recognizable names or identities. The PDF files in a content management system might fall into the rough categories HR, Sales, and Finance; an unsupervised learning system would hopefully separate the documents into three different unnamed categories. After automatic clustering occurs, a human reader can look at the auto-generated document sets and label them with tags like “HR”. When labelled training data is available, it is more intuitive to use supervised learning, and we will primarily use supervised learning in this book. If you are interested in unsupervised clustering, I have KMeans clustering examples you can work with in my book Power Java.
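To make the clustering idea concrete, here is a bare-bones k-means sketch in NumPy (real projects should use a library implementation; this simplified version omits convergence checks and handles only dense numeric feature vectors, such as document vectors produced by the text techniques later in this chapter):

```python
import numpy as np

def kmeans(points, k, iterations=20, seed=0):
    """Cluster unlabelled points into k unnamed groups."""
    rng = np.random.default_rng(seed)
    # Start with k distinct data points as the initial cluster centers.
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iterations):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(points[:, None] - centers[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers
```

Note that the returned labels are just arbitrary integers 0..k-1; attaching meaningful names like “HR” is left to a human reader.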

Feature Engineering

A large advantage of deep learning is that it automatically determines which features in the data are important for a specific task. Neural network architecture is still important, but much of the manual effort of feature selection is handled fairly automatically by deep learning networks. As an example of the “old way” of manual feature selection and precalculating features, I will use something I worked on: time delay neural networks (recurrent neural networks) for recognizing phonemes in digitized audio of a human speaker. In the 1980s when I did this work, I preprocessed the raw sound wave data by sliding a window through it, taking a Fast Fourier Transform (FFT) of each windowed segment, squaring the magnitudes to form a power spectral density, and selecting the frequency bands that might be most useful for phoneme recognition. Even guided by the results of a paper by Alex Waibel on the subject, determining which features to use and tuning parameters like window size was a lot of effort. Recent work has used deep learning networks whose input is raw time series audio data: the network effectively learns how to perform a Fourier transform and other feature extraction on its own. If I had had modern libraries like TensorFlow in the 1980s, I would have saved a lot of development time and achieved better results.
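The preprocessing pipeline just described can be sketched in a few lines of NumPy. The window size and hop length here are illustrative values, exactly the kind of parameters that had to be hand-tuned:

```python
import numpy as np

def power_spectrum_frames(signal, window_size, hop):
    """Slide a window through an audio signal, take the FFT of each
    windowed segment, and square the magnitudes to form a power
    spectral density per frame."""
    frames = []
    for start in range(0, len(signal) - window_size + 1, hop):
        # A Hanning window reduces spectral leakage at the frame edges.
        frame = signal[start:start + window_size] * np.hanning(window_size)
        spectrum = np.fft.rfft(frame)
        frames.append(np.abs(spectrum) ** 2)  # power spectral density
    return np.array(frames)
```

Selecting which frequency bands of these frames to feed to the network was the manual feature-selection step that deep networks now learn on their own.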

I still believe that it is extremely important for machine learning practitioners to be skilled at feature identification, extraction, and general data preparation. Don’t rely solely on deep learning. In this section we review some examples and techniques.

Understanding your data is key to using machine learning techniques effectively, and understanding the features in your data is a key part of data cleaning. Regardless of whether you use techniques that we will cover later, like maximum entropy classification models or neural networks for recognizing images or performing natural language processing, once you learn how to use the available open source machine learning libraries you will find that much of your time is spent understanding your data, cleaning it, and organizing it for reuse and multiple uses.

Feature Engineering for Text Data

There are several ways to look at text data: character by character, tokenized into a stream of words, as a stream of stemmed words, or segmented into individual sentences; we can keep extra punctuation like commas and semicolons, keep or remove surrounding quotation marks, etc. Text collections are often converted to a matrix representation using the following scheme:

  • Maintain a dictionary of unique words where each word is mapped to an integer index from zero to the number of unique words minus one.
  • Input text is stored with each document represented as a row in the matrix. The value in a column is 0 if the word mapped to that column index does not appear in the document, or 1 if it does.
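The two steps above can be sketched in plain Python (the function name is my own, and the tokenization here is a simple whitespace split for illustration):

```python
def documents_to_matrix(documents):
    """Build a binary document-term matrix: one row per document,
    one column per unique word, cell value 1 if the word occurs."""
    vocabulary = {}
    tokenized = [doc.lower().split() for doc in documents]
    # Map each unique word to an integer column index, 0..N-1.
    for tokens in tokenized:
        for word in tokens:
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    # Fill in one row per document.
    matrix = [[0] * len(vocabulary) for _ in tokenized]
    for i, tokens in enumerate(tokenized):
        for word in tokens:
            matrix[i][vocabulary[word]] = 1
    return matrix, vocabulary
```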

Other representations of inputs texts include a “bag of words” or “bag of ngrams.” A “bag of words” (BOW) is the set of unique words in an input document. Ngrams are instances of adjacent words in text. A 1gram is a single word, a 2gram is two words that appear adjacent to each other, a 3gram is three adjacent words, etc.
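Both representations are one-liners in Python; a minimal sketch:

```python
def ngrams(tokens, n):
    """Return all n-grams: tuples of n adjacent tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_words(tokens):
    """A bag of words is simply the set of unique tokens."""
    return set(tokens)
```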

Feature Engineering for Numeric Data

Here we will be concerned with data represented as a one-dimensional vector of numeric values. For data at each index in an input vector, it is common to:

  • Normalize data to a specific range ([0.0, 1.0] or [-1.0, 1.0] are common ranges)
  • Subtract the mean value at each index (e.g., data at a specific index that occurs in the range [1000, 2000] with a mean value of 1200 might have the mean subtracted from each input value with the resulting data normalized to the range [0.0, 1.0])
  • Take the log of the data. This technique is often used when developing anomaly detection systems for any parameters that do not have a Gaussian distribution.
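The preprocessing steps above are each one line of NumPy; the helper names here are my own illustrations:

```python
import numpy as np

def scale_to_unit_range(x):
    """Normalize values to the range [0.0, 1.0]."""
    return (x - x.min()) / (x.max() - x.min())

def mean_center(x):
    """Subtract the mean so the data is centered at zero."""
    return x - x.mean()

# For positive-valued data that is not Gaussian distributed,
# np.log(x) is the log transform mentioned above.
```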

It might seem that considering one-dimensional data is limiting, but even image data is usually treated as a one-dimensional vector. For example, a black and white image that is 64x64 pixels might be “linearized” into a one-dimensional vector with 4096 elements, with the value of each pixel mapped to the range [0.0, 1.0], where 0.0 represents black pixels, 1.0 represents white pixels, and grey-scale values fall between 0.0 and 1.0.
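This linearization is a reshape plus a division in NumPy; a small sketch with a made-up image:

```python
import numpy as np

# A hypothetical 64x64 grey-scale image with pixel values 0-255.
image = np.zeros((64, 64), dtype=np.uint8)
image[32, 32] = 255  # a single white pixel

# "Linearize" to a 4096-element vector, mapping pixels to [0.0, 1.0].
vector = image.reshape(-1).astype(np.float64) / 255.0
```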