Table of Contents
- Part I - A Dive into Human Cognition and Cognitive Science
- An Introduction to Cognitive Computing
- An Overview of Linguistics
- Part II - Practical Applications of Cognitive Computing
- (Almost) Everything You Really Need To Know About Machine Learning
- Using TensorFlow for Implementing Deep Neural Networks
- Tools and Techniques for Natural Language Processing
- Installation and Introduction to the spaCy NLP Library
- Using spaCy for Assigning Part of Speech Tags
- Using spaCy for Entity Recognition
- Introduction to the OpenNLP Library and Installation Notes
- Classification Example Using the OpenNLP Library
- Training a New Categorization Model using Facebook’s fastText
- NLP Wrap Up
- Book Wrap Up
- Appendix A - Installing TensorFlow
- Appendix B - Installing OpenNLP
- Appendix C - Installing spaCy
- Appendix D - Using Cloud Computing Platforms
Cognitive computing is an experimental process, and this book will help you set up an effective laboratory for experimenting with deep learning neural networks, general machine learning, and natural language processing. Cognitive computing is closely associated with, and overlaps significantly with, the study of artificial intelligence, knowledge representation, linguistics, psychology, and neuroscience. We will also take an excursion into the science of cognition, and I hope you find this book both interesting and deeply practical.
The purpose of this short book is to serve as both an introduction to cognitive science and cover a few practical modern techniques. For the examples, there is an emphasis on machine learning and on natural language processing.
Author Biography and Experience
I am a software developer specializing generally in artificial intelligence, and specifically in natural language processing (NLP), text analytics, neural networks, knowledge representation, and expert systems. I started my nearly lifelong journey exploring artificial intelligence technologies by learning the Lisp family of programming languages and teaching lunchtime classes at my company SAIC. In 1982 we acquired a hardware Lisp machine, and I started attending AI conferences and working on two commercial products (expert system tools for the Macintosh and Xerox Lisp Machines, and a neural network library).
My purpose in writing this book is to provide you, dear reader, with an introduction and understanding of cognition (human and artificial) and to teach you some techniques that I believe will likely be useful in your own exploration of cognition and artificial intelligence. Writing this book is an opportunity for me to renew my own interests in cognitive computing and take you along for the ride. I hope that you enjoy the ride!
Programming Examples in this Book
In addition to studying the theory, this book is also grounded in practicality. We will look at different strategies for setting up a “laboratory” for the software experiments in this book. Depending on what I am working on, I use a laptop, a leased large memory server (or VPS), or cloud computing services; the latter make particular sense when working with deep learning neural networks, which perform much better on servers with one or more GPUs.
To make the material more approachable and to get you started with your own experiments, this book draws on the author’s current projects involving natural language processing (NLP), machine learning, and deep neural networks. In general it approaches big data problems from the perspective of automating data extraction, processing, and curation with automated systems that complement, and may eventually replace, human labor.
For these examples, I use a variety of programming languages:
- Python for the TensorFlow deep learning examples and for the spaCy Natural Language Processing (NLP) examples
- Java for Maximum entropy classification models
- TypeScript for two neural network examples
You can reuse the examples in this book under either the Apache 2 or GPL v3 license - your choice. I prefer, and use, a dual licensing approach, basically letting people use whichever license works best for them. If you improve the code examples in any way, please send me GitHub pull requests so that other readers and I can benefit from your contributions.
Why Use the Cloud?
Most of the data and software experiments we will cover in this book can be done on your laptop, assuming that you have sufficient memory and disk space. However, it is my preference to run even small experiments in the cloud because I can organize and reuse data, have the open source software I use already installed, and keep my own development projects in one place. I currently own two Mac laptops, two Linux laptops, and a Chromebook. By keeping my research and development assets in the cloud I have easy access to my work from any local computing environment I am using.
I use both a laptop and a large VPS with 60 GB of memory and 16 cores on the Google Cloud Platform that I can turn off and on in a few minutes, as I need it. I find it less expensive to pay about $0.60/hour to rent a server than to buy a maxed-out computer for my office.
However, for the purposes of this book you can use a laptop (the more memory the better) or a VPS. Nothing we do in the examples requires special hardware, but the deep learning examples will run faster with a GPU if you have one available. I chose small data sets for the deep learning examples, and the example code does not require a GPU.
A Request from the Author
I spent time writing this book to help you, dear reader. I release this book under the Creative Commons “share and share alike, no modifications, no commercial reuse” license and set the minimum purchase price to $4.00 in order to reach the most readers. Under this license you can share a PDF version of this book with your friends and coworkers. If you found this book on the web (or it was given to you) and if it provides value to you then please consider doing one of the following to support my future writing efforts and also to support future updates to this book:
- Purchase a copy of this book at https://leanpub.com/cognitive-computing
- Buy one of my other eBooks available at https://leanpub.com/mark-watson
- Hire me as a consultant
I enjoy writing and your support helps me write new editions and updates for my books and to develop new book projects. Thank you!
I would like to thank the following people for their help:
Carol Watson: structural and copy editing. Her copy editing services can be found at ModernEditing.com if you would like help with your own writing projects.
Part I - A Dive into Human Cognition and Cognitive Science
This section of the book will introduce you to the basics of the science that forms the foundation of Cognition Technology and Artificial Intelligence (AI). We will cover Philosophy, Neuroscience, Knowledge Management and Knowledge Representation, Linguistics, and later a broad range of practical technologies that I hope you find useful in your own projects.
An Introduction to Cognitive Computing
We will start with a broad overview of Cognitive Computing before proceeding to more detailed information on human brain science and a more formal treatment of cognitive science. I have been using artificial neural networks to solve practical engineering problems since the 1980s, and I find the recent spectacular breakthroughs in the capabilities of deep learning neural networks inspiring: close to human level performance in image and speech recognition, as well as human level expertise at the difficult game of Go. In this chapter we will look at an overview of distributed neural networks and symbolic approaches to cognitive computing.
The word cognitive refers to thinking. When we talk about cognitive computing we intend to understand how we can model the human brain in sufficient detail to write programs that simulate processing input into an internal representation and generating output. In later chapters we will start by training simple feed forward neural network classification models, using both short standalone examples written in TypeScript and the TensorFlow library. This is a relatively easy problem to train a model for, since there is no memory of events (episodic memory), which would require more complicated neural network architectures. Later examples will use deep learning neural network architectures to model complex relationships between different system inputs. When we speak of learning a model, we mean the process of learning a set of parameters that map input patterns to desired outputs. In the case of neural networks, the parameters are the numeric values of the connection weights in the network. Modern neural network architectures mimic the human brain closely enough to make studying human cognition a practical addition to the math and software implementation techniques for simulated neural networks.
There are two extremes in modeling the human brain: abstracting away the biological details of the brain, like how neurons fire, with the goal of solving practical problems, or creating detailed (and computationally expensive) low level biological models. We will use the first approach. That said, we want to understand how our brains work, largely because it is intellectually stimulating to understand how we can quickly learn new things, process episodic data and keep it in isolation in the hippocampus, chain together deductions, etc.
The following diagram shows cognitive psychologist George Miller’s foundation for the field of cognitive science:
Here, we are most interested in:
- Philosophy: the study of Ontologies and Knowledge Representation
- Artificial Intelligence (AI): to use machine learning to achieve human level (or better) performance on tasks previously considered to be the domain of human experts
- Linguistics: understanding language will help us to better extract useful information from English language text
- Neuroscience: to better understand, at least in part, how our minds work
In later chapters we will use different open source projects to tackle the practical AI and computational linguistics aspects of cognitive science.
An Overview of How the Human Brain Works
In the following subsections we will look at neural network models of memory: artificial and biological neural networks.
The following figure shows a representation of human neurons:
We will use this figure to discuss a model for biological neurons collecting activation, spiking, and firing. When a neuron fires, it sends a signal outwards through a synapse that in turn connects to the inputs (dendrites) of other neurons. Dendrites branch out from a neuron, like branches on a tree, to collect the output from other neurons when they fire.
Inputs to neurons usually carry a positive (excitatory) signal, but they may also carry a negative (inhibitory) signal. Later, when we look at artificial neural networks, these types of connections will be called positive and negative weights. Neurons also receive a small input signal that leaks from nearby tissue. Later, when we use artificial neural networks, this leakage signal will be called a “bias” input.
This process of activation energy flowing through a real (human or animal) neural network is largely “feed forward” in one direction and performs pattern matching, memory storage, and reactive control. The human brain can also retain episodic memories: sequences of events. The hippocampus is largely responsible for these time sequenced memories. Later, when we look at artificial neural networks, we will sometimes use feedback loops that can remember sequences in addition to the patterns that a feed forward network can remember. Artificial neural networks with feedback loops are called recurrent neural networks, and they mimic processes in our brains. Current state of the art practice uses Long Short Term Memory (LSTM), a type of recurrent neural network capable of recognizing time sequenced patterns whose elements are widely separated in time. The TensorFlow library has support for experimenting with LSTMs. An analogy that I like is recognizing a recurring theme while listening to a symphony. The hippocampus provides neural loops that learn specific patterns and can recognize these patterns in the future.
The chemical processes in a human brain are fairly well understood and certainly much more complicated than the relatively simple models for artificial neurons and the connections between them, but I find it reassuring that these simple artificial neural network models that are so effective for solving engineering problems are also similar in spirit to how our brains work.
Have you ever wondered how your brain can do such a good job at pattern recognition? We can recognize people in different clothing or hair styles after years of not seeing them, and so on. This flexibility is possible because neurons, and the connections between them, in the neocortex form hierarchical memory structures. At the top of these hierarchical clusters, closest to the input signals, collections of neurons with their interconnections learn to recognize fine grained features, like the shape of facial elements (nose, eyes, mouth, etc.), hair color, beards and mustaches, and eye color. As we proceed down this hierarchy, neural clusters recognize increasingly abstract patterns; for example, whether the person in front of us is a man or a woman. The main point here is that the memory structures in our brain that provide this flexibility are both distributed and hierarchical. Another key feature of our brains is that neuron activation tends to be sparse; that is, relatively few neurons are firing in response to vision, hearing, and touch at any instant in time, and those that are firing are responding either to these direct external inputs or to signals propagated by connected neurons that are firing from earlier external stimulus. Later we will use deep learning neural networks that learn to self-organize similar hierarchical clusters that recognize specific features in data. Deep learning neural networks often use sparse connections between simulated neurons.
One defining feature of human brains is what the philosopher David Chalmers calls “the hard problem”: explaining why we have inner experiences (qualia), such as daydreams or other thoughts that are independent of external stimulus from our environment. As I write this in February 2017, I know of no AI systems that have such qualia.
Introduction to Artificial Neural Networks
The following figure shows the simplest type of artificial neural network, which from now on I will simply call a “neural network.” Two input signals X1 and X2 set the activation energy of the two neurons at the bottom of the figure. The connection between the input neuron on the left and the single output neuron is characterized by a weight value W1, and the other input neuron is connected by a weight with strength W2. To calculate the activation of the output neuron, we multiply the first input neuron’s activation by W1 and add to it the value from multiplying the second input neuron’s activation by W2. In order to keep activation values bounded, the output of a neuron (like the output neuron in this figure) is often gated by a “squashing” function.
The following figure shows the shape of a typical Sigmoid “squashing function.” Details vary, here the range is [0,1] but some people use a range of [-0.5, 0.5]:
We will use the Sigmoid function for simple feed forward neural networks. Later we will also use the hyperbolic tangent function as a squashing function but the general shape and effect is similar. We will also be using the derivative of the Sigmoid function for learning the connection weights. These functions are defined in the source file neural_networks/neural_net.ts in the github repository for this book. Here are the definitions of these two functions:
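While the exact code in neural_networks/neural_net.ts may differ in identifier names and details, the two functions can be sketched like this:

```typescript
// Sigmoid "squashing" function: maps any real input into the range (0, 1).
function sigmoid(x: number): number {
  return 1.0 / (1.0 + Math.exp(-x));
}

// Derivative of the sigmoid, used by back-propagation to adjust weights.
// This form takes the raw (pre-squashed) input value x.
function sigmoidDerivative(x: number): number {
  const s = sigmoid(x);
  return s * (1.0 - s);
}
```

Note that sigmoid(0) is 0.5 and the derivative peaks at 0.25 when x is 0, which is why weight learning is fastest for neurons whose activations are near the middle of the sigmoid’s range.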
We will look at the example code in neural_networks/neural_net.ts in detail in a later chapter, when we see how a technique called back-propagation lets us train the weights in a network from training examples. For simple networks with no hidden layers (like the first example in this section), the network forms a linear decision surface. In this example the neural network inputs X1 and X2 form a two-dimensional area; the network with no hidden neurons has a straight line decision surface that divides this two-dimensional surface into the set of points where the output value of the squashing function is less than 0.5 and the set of points for which the value is greater than or equal to 0.5. What does this mean? Simply, it means that if we consider the network to output an on/off (or true/false) value from the output neuron, then the weights W1 and W2 define a straight line on this two-dimensional plane, with all points on one side of the line corresponding to an “on” output and points on the other side of the line corresponding to an “off” output.
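Because the sigmoid crosses 0.5 exactly where its input is zero, the on/off decision for this simple network reduces to the sign of the weighted sum; here is a sketch (ignoring any bias input for simplicity):

```typescript
// Classify a point (x1, x2) with a no-hidden-layer network: "on" when the
// squashed activation would be >= 0.5, i.e. when w1*x1 + w2*x2 >= 0.
// The straight line w1*x1 + w2*x2 = 0 is the linear decision surface.
function isOn(x1: number, x2: number, w1: number, w2: number): boolean {
  return x1 * w1 + x2 * w2 >= 0;
}
```

For example, with weights W1 = 0.5 and W2 = 0.5, the point (1, 1) falls on the “on” side of the line, while flipping the signs of the weights puts it on the “off” side.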
Imagine a simple neural network (no hidden neurons) with three inputs X1, X2, and X3. These three values define the location of a point in a three-dimensional space, and the “decision surface” is now a two-dimensional plane through this three-dimensional space. The fun begins when we have many inputs. For example, if we have 20 inputs, that defines a 20-dimensional space and if there are no hidden layer neurons then the “decision surface” is a 19-dimensional hyperplane. Don’t worry if you have a difficult time visualizing this - everyone does!
If we add a layer of hidden neurons, with two input neurons forming a two-dimensional space, then the decision surface is still one-dimensional, but it can now be a curve rather than a straight line.
The following figure shows 3 input neurons, 4 “hidden” neurons, and 1 output neuron:
In this figure we have three input neurons on the bottom, each input neuron is connected to each of four hidden layer neurons, and each of the four hidden layer neurons is connected to the single output neuron.
This neural model is defined by 16 parameters: the values that are learned for each weight in the neural network. Later when we use TensorFlow to work with deep learning we will use neural networks with many layers and thousands or tens of thousands of parameters. For work I train networks with millions of parameters while companies like Google and Microsoft use networks with billions of parameters for tasks like speech recognition and generating natural language descriptions of input images.
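The count of 16 parameters comes from multiplying the sizes of adjacent layers: 3 inputs times 4 hidden neurons gives 12 weights, and 4 hidden neurons times 1 output gives 4 more. A small sketch of this bookkeeping (ignoring any bias weights):

```typescript
// Count the connection weights in a fully connected feed forward network.
// layerSizes lists the number of neurons in each layer, input to output.
function weightCount(layerSizes: number[]): number {
  let count = 0;
  for (let i = 0; i < layerSizes.length - 1; i++) {
    count += layerSizes[i] * layerSizes[i + 1]; // weights between adjacent layers
  }
  return count;
}

// weightCount([3, 4, 1]) returns 16, matching the figure.
```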
The following shows a recurrent neural network, similar to neural structures in the hippocampus in our brains:
The feedback loops, like the structures of the hippocampus in our brains, allow a network to remember patterns that occur closely together in time. Recurrent neural networks can be implemented using TensorFlow, although this book does not include examples of recurrent networks.
Symbolic Representation of Facts and Rules (Expert Systems)
Early artificial intelligence (AI) systems used symbolic representations of the world. Symbolic representations refer to using a symbolic label like “car” to represent cars in the real world. We attempt to represent knowledge with these placeholder symbols and instead of reasoning about real world objects and events we reason with the abstract symbol representations of the real world.
As a simple and concrete example, “person” represents the class of human beings, and we might associate attributes like age, weight, and name with instances of this class of entities. In a rule based system, we might define a class “driver” which is a subclass of “person,” and a rule that an instance of class “driver” must have a value greater than or equal to 16 for the attribute “age.” If we have an instance of “person” named Sam and Sam is 14 years old, then this rule could be used to reason that Sam cannot be a driver.
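The driver rule can be sketched as a simple check over symbolic attributes; the class and attribute names here follow the example in the text:

```typescript
// An instance of the symbolic class "person" with a few attributes.
interface Person {
  name: string;
  age: number;
}

// Rule: an instance of class "driver" must have age >= 16.
function satisfiesDriverRule(p: Person): boolean {
  return p.age >= 16;
}

const sam: Person = { name: "Sam", age: 14 };
// satisfiesDriverRule(sam) is false, so the system can infer that
// Sam cannot be a driver.
```

A real production system generalizes this idea: rules are data rather than hand written functions, and an interpreter matches them against the fact store.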
Expert systems are built from manually written rules, a storage system for “facts,” and an interpreter that matches rules against facts and runs (or “fires”) rules, which can add, delete, or modify facts in the system.
Historically, one of the most important production system interpreters was OPS5, which I used frequently in my work in the 1980s. Another production system is Soar, a cognitive architecture that includes a rule based system.
OPS5 and Soar are not much used now but they have great historic importance in the field of AI. I spent several years in the 1980s working on symbolic expert systems, symbolic knowledge representation schemes like Conceptual dependency theory, and symbolic Natural Language Processing (NLP).
Symbolic Models for Natural Language Processing
Much of the interesting research and practical applications of Natural Language Processing (NLP) are now done using statistical models and deep learning models. We will cover both of these technologies with code examples in some depth in later chapters. Here, for historic reasons, I would like to briefly cover symbolic approaches to NLP.
In my work in the 1980s I used Augmented Transition Networks (ATNs). ATNs were used to parse text using a network where individual nodes can have attached memory to remember things like the subject of a sentence, attributes of nouns used in a sentence, etc. ATNs have the advantage of being fairly simple to write, and the disadvantage that they need to be crafted for limited vocabularies. A lot of work can be involved in trying to extend them to vocabularies for more than simple topics or applications.
In this diagram, the node “preposition” will recognize a word that is a preposition, remove the word from the word sequence, and attempt to pass the remaining part of the word sequence to each node connected with an outgoing arrow. When any node successfully recognizes the last word in the word sequence, the trail of nodes represents the parse of the sentence. Consider the following example:
Here the ATN assigns the part of speech “preposition” to the word “the,” part of speech “noun” to “dog” and the part of speech “verb” to the word “ran.” ATN based systems get very complex when they need to support large vocabularies and they are difficult to maintain.
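The flavor of this kind of parsing can be sketched with a toy transition network. The lexicon and the part of speech labels here follow the diagram; real ATNs also carry registers that remember attributes such as the subject of the sentence:

```typescript
// Toy lexicon mapping words to the part of speech labels used in the diagram.
const lexicon: { [word: string]: string } = {
  the: "preposition", // label as used in the diagram
  dog: "noun",
  big: "adjective",
  ran: "verb",
};

// The allowed paths through the toy transition network.
const paths: string[][] = [
  ["preposition", "noun", "verb"],
  ["preposition", "adjective", "noun", "verb"],
];

// A sentence parses when its part of speech sequence matches an allowed path.
function parses(words: string[]): boolean {
  const tags = words.map(w => lexicon[w]);
  return paths.some(
    path => path.length === tags.length && path.every((t, i) => t === tags[i])
  );
}
```

Here parses(["the", "dog", "ran"]) succeeds, while a word sequence that follows no path through the network is rejected. Scaling this idea up to a large vocabulary is exactly where ATNs become hard to maintain.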
The Public’s Perception of Cognitive Computing and Artificial Intelligence
I started working in the field of AI in the early 1980s and there were relatively few of us back then. Now various subfields of AI and cognitive computing are some of the fastest growing segments of the tech sector. For the general public there are at least three topics that have attracted attention:
- IBM’s Deep Blue defeating world chess champion Garry Kasparov in 1997
- Google/DeepMind’s AlphaGo defeating Lee Sedol who is one of the strongest Go players in the world in 2016
- AI and automation “taking people’s jobs”
The game of Go is of particular interest to me personally because I wrote the world’s first commercially available Go playing program “Honinbo Warrior” for the Apple II computer. I wrote this program in UCSD Pascal. Around the same time I also had the opportunity to play the national Go champion of South Korea and the women’s world Go champion. I am a big fan of the game and I argue that it is much more complex than chess. AlphaGo’s victory over Lee Sedol in the spring of 2016 was a watershed moment in AI and I watched the games played live on the Internet. I felt that I was living through an important moment in history. Exciting times!
It is almost a certainty that AI and cognitive computing systems will replace most human workers, and the sociological challenge of dealing with unemployment is probably greater than the technical challenges of accomplishing this. It is currently an open debate whether future AIs will pose a danger to the human race, but in spite of rapid advances in our field, I don’t worry about this problem - at least not yet.
An Overview of Linguistics
Linguistics is the science of human languages. Although our brains are very general purpose “devices,” there is ample evidence that the brain is partially pre-wired to learn languages. Human languages differ substantially in structure, and for simplicity we will consider English.
We learn to speak at an early age by interacting with older people. During this process a model of language is formed containing an understanding of individual words, words that occur frequently together, and eventually sentence structure, parts of speech, and grammar. Language is so ingrained into how we think that we don’t realize how much we know about language and how it maps to the real physical world and events. At the end of the last chapter we discussed symbols representing things in the real world, but the difference with the human brain is that there is a distributed representation of things stored in neurons and the connections between neurons. When we study deep learning, we will see how these representations are stored in artificial neural networks.
Linguistics deals with the study of sounds, the syntactic structure of language, the underlying meanings of words (semantics), and the association of language with background world knowledge that we all have (pragmatics).
Computational linguistics is covered in some detail in the next chapter. Here we cover the main ideas of linguistics and define terms that will be useful later. We will use the context of looking at the requirements for a robot that can carry on conversations to take a high level view of the science of linguistics.
A robot needs to:
- Listen to all sounds in a room.
- Separate out sounds that are probably human speech in the environment.
- Separate out just the speech of the person the robot is talking with.
- Perform some form of Fourier analysis to convert frequency information in speech to a power spectral density (think of the volume display on your hi-fi that shows the volume of bass, mid-range, and higher frequencies).
- Recognize sub-word sounds (phonemes) and isolate and recognize individual words.
- Use an understanding of the structure of sentences and statistics of which words tend to occur together to correct errors in word identification.
- Understand the context of the human being conversed with and attempt to match incoming words with commands, requests for information, etc.
- Build a conversation model from input words, real world understanding, and multiple hypotheses of what the human might want.
- Generate speech for the human listener and update the model for what the conversation is about.
When we listen to speech, our ears process sound waves that are characterized by power (sound level) and frequency. The eardrum, curved passages in the inner ear, cochlea, and cochlear nerves capture this information for specific areas of the brain that process sound.
Sounds are thus transformed to a distributed representation of phonemes (standard building blocks for constructing words) and sequences of phonemes are transformed into a representation of words, then sentences that we hear.
Morphology is the construction of words from sub-parts called morphemes. Consider the word “unhappy,” which consists of the morphemes “un” and “happy.” The meaning of this word depends on understanding what each constituent morpheme contributes to the meaning of the word.
Syntax is different for different languages like English, German, Spanish, Chinese, etc. We looked briefly in the last chapter at an ATN that recognized the syntax for a very small subset of English, namely the ability to recognize the patterns preposition -> noun -> verb and preposition -> adjective -> noun -> verb. When we hear a sentence that does not make sense, it might be because we can’t “parse” the syntax, or because the meaning (semantics) of the sentence makes no sense to us.
As an example, “car red go” makes no sense to us syntactically, but viewing the sentence as a “bag of words” (BOW), where we don’t care about word order, we get the general idea that a red car is moving.
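A bag of words is just a count of word occurrences with order discarded; a minimal sketch:

```typescript
// Build a bag of words: a map from each word to its occurrence count.
// Because word order is discarded, "car red go" and "go red car"
// produce exactly the same bag.
function bagOfWords(text: string): Map<string, number> {
  const bag = new Map<string, number>();
  for (const word of text.toLowerCase().split(/\s+/).filter(w => w.length > 0)) {
    bag.set(word, (bag.get(word) ?? 0) + 1);
  }
  return bag;
}
```

Despite its simplicity, this representation is the starting point for many of the statistical text classification techniques we use later in the book.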
As another example, “the tree ran fast” does not make sense semantically because in our mental model of the world a tree does not move, except for slow growth and its branches moving in the wind. A tree can only run in a cartoon.
Semantics is the meaning in language. Semantics relies on the understanding of syntax and also on our model for the world (pragmatics).
Coupled with semantics is the “grounding problem.” We interpret the color blue through our experiences of seeing a blue sky, blue articles of clothing, etc. An artist thinks more deeply about a color than a typical person. It is an open question whether an AI without a body and the ability to interact with the physical world could ever be a general purpose intelligence.
Pragmatics is a refinement of semantics: the understanding of language in a specific context. This can involve understanding who is in a conversation and what we know about them, and a model of what they may know or not know, and what topics are being discussed.
Put simply, pragmatics is understanding, or at least partially understanding, language in context.
While cognitive computing deals with very specific tasks like adding a calendar entry from a speech request to your cellphone or recognizing which animals are in a picture, general AI (often called AGI, for artificial general intelligence) is the science of combining knowledge of the world, language, knowledge of how to perform tasks, etc. into integrated systems.
NLP, including pragmatics, is a good step towards development of general AIs.
Linguistics Wrap Up And Practical Applications
We just took an overview of the field of linguistics and studied how linguistics fits into the study of cognition. Later we will ground the theory seen here with practical examples of processing natural language text. I use the terms Natural Language Processing (NLP) and Computational Linguistics to mean almost the same thing, with the difference being that NLP is more a collection of software engineering techniques and practices for processing and extracting information from text, while Computational Linguistics deals more with the science of understanding linguistics and thus, to some degree, cognition.
We will later look at several practical examples of NLP:
- Using Convolutional Neural Networks for classifying text
- Using the spaCy library for part of speech tagging and entity detection
- Using the OpenNLP library for classifying text
- Using the OpenNLP library for Recognizing Entity Names in Text
Part II - Practical Applications of Cognitive Computing
We use Deep Learning Neural Networks for classification, Knowledge Representation, and Natural Language Processing.
When I started working in the field of AI in 1982 everything I did was symbolic AI, and this lasted until 1988 when I started using neural networks to solve practical engineering problems at work, participated in a one year DARPA neural network advisory panel, and wrote the first version of the SAIC ANSim neural network toolkit. Since then most of my work has involved machine learning and specifically neural networks, but I was motivated to include the overview material on symbolic AI in this book because I don’t personally believe that distributed neural network machine learning is the final word in cognitive computing and AI. I think that the next major advancement in our field will be hybrid symbolic and neural systems, so mentioning symbolic AI seems appropriate.
We will now dive deeper into practical applications of Natural Language Processing and neural networks.
One broad topic I will not be covering is Knowledge Representation (Ontologies and practical examples using DBPedia) and the practical study of the broad field of intelligent Knowledge Management. While I spend a fair amount of my time working on knowledge management technologies (including my work at Compass Labs, Google, and Webmind Corporation), I have written about knowledge management in my previous books, including free PDF versions of Practical Semantic Web and Linked Data Applications, Common Lisp Edition and Practical Semantic Web and Linked Data Applications, Java, Scala, Clojure, and JRuby Edition, as well as my books on Leanpub Practical Artificial Intelligence Programming With Java and Power Java.
I will cover in the remainder of this book technologies that I believe will be most useful to you: natural language processing and neural networks (including deep learning neural networks).
(Almost) Everything You Really Need To Know About Machine Learning
In this chapter we will look at modern best practices for machine learning. I will cover some theory with simple implementations that you can use for experimenting and for small or medium scale machine learning tasks. For larger scale problems we will cover Google’s TensorFlow system in a later chapter. This is a practical chapter: you will learn some theory and some practical techniques, and you can also use it later as a reference. I will use the notation that Andrew Ng uses in his online Machine Learning class at Coursera. I highly recommend this class!
Introduction to Supervised and Unsupervised Learning
When we train a model, what are we really doing? We start with training cases that represent data similar to what we might need to process in the future and we want to use existing data to build a model or prescription for processing similar data.
In supervised learning the representative training data includes both example inputs and the desired output for each example input. An important concept is building a generalized model, not simply a memory system that remembers the training inputs and outputs. A model that simply remembers the training examples and does not generalize is said to suffer from overfitting. In later chapters we will look at TensorFlow examples that can optionally use two state of the art techniques for reducing overfitting: regularization and dropout. In this chapter we will prevent overfitting by limiting the number of parameters in a model and by training with a sufficient amount of data. In the case of simple neural networks, model parameters are the weights of connections between neurons. For more complicated deep learning neural networks, we also have hyper-parameters that specify the number of neuron layers, the degree of connectivity (weights) between neurons in adjacent layers, etc.
Models have specific learned parameters. Most of the machine learning examples in this book involve neural networks and in this case the learned models are the weight values in the neural network covered in the first chapter that introduced cognitive computing. Other types of learned parameters are the hypothesis coefficients in linear regression models.
In all types of learned models, regularization refers to penalizing models with large valued learned parameters. When we train a model, we use a cost function to measure “how far off” a model is from the desired behavior, with smaller values of the cost function being better (in the sense of improving accuracy). If learned parameters are large, the cost function is increased by a penalty term: a constant called the regularization constant multiplied by a measure of how large (in absolute value) the learned parameters are.
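To make this concrete, here is a minimal sketch in plain Python of a regularized cost using the common L2 (squared magnitude) form of the penalty. This is my own illustration of the idea, not code from the book’s repository, and the helper name regularized_cost is hypothetical:

```python
def regularized_cost(predictions, targets, weights, reg_lambda=0.1):
    """Mean squared error plus an L2 penalty on the learned parameters.

    The regularization constant reg_lambda scales a penalty that grows
    with the squared magnitude of the weights, so models with large
    parameter values receive a higher (worse) cost.
    """
    n = len(predictions)
    mse = sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n
    l2_penalty = reg_lambda * sum(w * w for w in weights)
    return mse + l2_penalty

# A model that fits slightly worse but has small weights can win
# overall once the penalty for large weights is included:
cost_small = regularized_cost([1.1, 1.9], [1.0, 2.0], [0.5, -0.5])
cost_large = regularized_cost([1.0, 2.0], [1.0, 2.0], [30.0, -30.0])
```

The key design point is that the penalty depends only on the weights, not on the data, so it steers training toward smaller parameter values regardless of how well the model fits.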
The idea of using dropouts is specific to more complex deep learning neural networks. As you know from the introduction to cognitive computing in the first chapter, neural networks consist of simulated neurons and the connections between them. During training we randomly turn off a subset of neurons and do not modify the connection weights between “dropped” neurons and other neurons. Using dropouts allows us to use very complex networks (both in the number of layers and the number of neurons in each layer) without overfitting the training data.
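A minimal sketch of the dropout idea in plain Python (an illustration only; the function apply_dropout is hypothetical, and real deep learning libraries handle this for you):

```python
import random

def apply_dropout(activations, drop_fraction=0.5, rng=None):
    """Randomly zero out a subset of neuron activations during training.

    Dropped neurons contribute nothing on this training pass, and in a
    full implementation the weights attached to them are not updated.
    """
    rng = rng or random.Random()
    return [0.0 if rng.random() < drop_fraction else a for a in activations]

# Activations of one hidden layer, with roughly half dropped:
layer = [0.7, 0.2, 0.9, 0.4, 0.8, 0.1]
thinned = apply_dropout(layer, drop_fraction=0.5, rng=random.Random(42))
```

Each training example sees a different random subset of the network, which is what prevents any single pathway from memorizing the training data.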
Even though we will not cover unsupervised learning in this book, I wanted to mention situations where you may want to process training data that consists only of inputs with no desired outputs. This might seem counterintuitive, so let’s look at an example of unsupervised learning: clustering unlabelled data into different sets. The data might be the set of all PDF documents in a company’s content management system. Even if the documents carry no classification labels, we can still automatically cluster them into sets of similar documents. An interesting feature of unsupervised clustering, a “feature” that is sometimes a problem, is that the clusters do not have human recognizable names or identities. The PDF files in a content management system might fall into the rough categories HR, Sales, and Finance; we would hope that an unsupervised learning system separates the documents into three corresponding unnamed categories. After automatic clustering occurs, a human reader can look at the auto-generated document sets and label them with tags like “HR”. When labelled training data is available, it is more intuitive to use supervised learning, and we will primarily use supervised learning in this book. If you are interested in unsupervised clustering, I have examples you can work with in my book Power Java.
Linear regression is used for data with one or more independent variables that we denote with Xi where i is the index of the independent variable. There is one dependent variable that we denote as Y. We will use Tom Alexander’s library regression-js in this section.
Working with a set of training data we build a model that can predict an output value Y given values for the independent variables Xi. We will work with the general case of multiple independent variables (also commonly referred to as features).
Consider the following figure showing a scatter plot of independent X values with the corresponding dependent Y values:
Using Tom Alexander’s regression.js library we can fit a straight line through these points (see code in cognitive_computing_book_examples/linear_regression):
In general, whenever the relationship in the data is approximately linear, that is, it can be fit well by a straight line, linear regression is an efficient means of building a predictive model. In this case the model is an equation: given a value for X, the model predicts the corresponding Y value.
In the example file test_linear.js we specify a straight line fit:
You could also fit a quadratic curve using the third argument value of 2, a cubic equation using a value of 3, etc.
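To see what such a fit computes, here is a plain Python sketch of the closed-form least squares straight line fit (my own illustration; regression-js performs this fit, and polynomial fits, for you):

```python
def fit_line(xs, ys):
    """Closed-form least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x:
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Points lying near the line y = 2x + 1:
slope, intercept = fit_line([0, 1, 2, 3, 4], [1.1, 2.9, 5.2, 6.8, 9.0])
```

For this noisy data the fit recovers a slope near 2 and an intercept near 1, which is exactly the kind of model a straight line fit produces.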
While linear regression is often adequate and something that you want to “try first,” it only models linear relationships - this is very similar to neural networks that have no hidden neurons (only input neurons directly connected to one or more output neurons).
More Complex Models Using Google’s Neural Network Playground
You will want to have a web browser open to the Playground while working through this section. You should see a display similar to what is in the following figure:
We will use the neural network library from the Google Playground for several examples and you can find the code in the directory google_typescript_neural_network.
Using one of the simplest examples for classifying data into two sets where the data is linearly separable:
In the following figure, we have trained the model that separates the data into two output categories:
In addition to selecting one of four possible input data sets, there are several options that you can (and should!) experiment with:
- Learning rate: choosing a value that is too large can prevent finding a model that reduces the error (cost) function. Choosing a value that is too small makes the learning process take longer.
- Activation: we have seen plots of the Sigmoid activation function. Other choices are rectified linear (ReLU), hyperbolic tangent, or linear.
- Regularization: used to increase the cost function for large model parameters and to prevent overfitting of data. L1 regularization tends to zero out some model parameters while L2 regularization scales all model parameters closer to zero.
- Features: the Neural Network Playground offers several types of features you can try, both individually and in combination: values of the X1 and X2 inputs, X1 squared, X2 squared, X1 times X2, sin(X1), and sin(X2).
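The activation choices listed above are all simple functions; here is a plain Python sketch of three of them (my own illustration):

```python
import math

def sigmoid(x):
    """Squashes any input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Hyperbolic tangent: squashes any input into the range (-1, 1)."""
    return math.tanh(x)

def relu(x):
    """Rectified linear: passes positive inputs through, zeroes negatives."""
    return max(0.0, x)
```

Experimenting in the Playground with each of these makes the differences obvious: sigmoid and tanh saturate for large inputs, while ReLU does not.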
In the next section we will develop a simple neural network library “from scratch” that uses backpropagation. In the following section, we will use the library that Google wrote for the Neural Network Playground, but we will use it without any UI code (examples run from the command line).
Backpropagation Neural Networks
We will start with a simple implementation in the TypeScript language of one of the most popular neural network models. In the introduction to cognitive science we saw the sigmoid and derivative-of-sigmoid functions defined in the file neural_networks/neural_net.ts. The following example has the benefits of brevity and no external dependencies. The following listing contains this entire example.
There are two sets of weights: those connecting the input neurons to the hidden neurons and those connecting the hidden neurons to the output neurons. The network is fully connected between layers: each input neuron is connected to each hidden neuron and each hidden neuron is connected to each output neuron. In lines 8 and 9 we set the learning rates for both sets of weights. If the learning rates are too large then it is possible to fail to find a set of weights that fits the training data. If the learning rates are too low then it can take a very long time to train a network.
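To make the two sets of weights concrete, here is a minimal plain Python sketch of the forward pass for such a fully connected network (an illustration of the structure described above, not the TypeScript code in neural_net.ts):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, w_in_hidden, w_hidden_out):
    """Forward pass: inputs -> hidden layer -> output layer.

    w_in_hidden[i][h] connects input neuron i to hidden neuron h;
    w_hidden_out[h][o] connects hidden neuron h to output neuron o.
    """
    hidden = [sigmoid(sum(inputs[i] * w_in_hidden[i][h]
                          for i in range(len(inputs))))
              for h in range(len(w_in_hidden[0]))]
    outputs = [sigmoid(sum(hidden[h] * w_hidden_out[h][o]
                           for h in range(len(hidden))))
               for o in range(len(w_hidden_out[0]))]
    return outputs

# Two inputs, two hidden neurons, one output, with made-up weights:
out = forward([1.0, 0.0],
              [[0.5, -0.5], [0.3, 0.8]],
              [[1.0], [-1.0]])
```

Backpropagation then adjusts both weight matrices by propagating the output error backward, which is what the TypeScript example implements.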
The following example code implements an unusual feature: if the network does not reach a low error threshold after many training iterations then the weights in the network are automatically reset to small random values. I usually add this feature when I implement backpropagation neural networks. The decision to reset the weights and to restart the training process is made in lines 170-173.
Output from running the example looks like:
A More Complex Neural Network Example
We will now use a featureful, if not high performance, example by Google that we saw earlier in the Google Neural Network Playground web demo.
In the git repository directory tensorflow_playground I have a copy of the Google example deep learning neural network library in the file nn.ts used in their web-based demo. We will use this utility library in this section and in the next section on recurrent networks.
The example file testXOR.ts uses Google’s demo library:
The output looks like (some output removed for brevity, also, each time you run the program you see different results because initial weights are assigned randomly):
Google’s TypeScript code in the file nn.ts is an excellent tool for experimenting with small or even medium size data sets because it is easy to change the network architecture and training options. In the next chapter we will use Google’s open source TensorFlow library that can scale to handle very large problems.
A large advantage of deep learning is automatically determining the features in data that are important for a specific task. Neural network architecture is still important, but much of the manual effort of feature selection is handled fairly automatically by deep learning networks. As an example of the “old way” of manual feature selection and precalculating features, I will describe something I worked on: using time delay neural networks to recognize phonemes in digitized audio of a human speaker. In the 1980s when I did this work, I preprocessed the sound wave data by sliding a window through the data, taking a Fast Fourier Transform (FFT) of the data, squaring the data to form a power spectral density, and selecting the frequency bands that might be most useful for phoneme recognition. Even guided by the results of a paper by Alex Waibel on the subject, determining which features to use and tuning parameters like window size was a lot of effort. Recent work has used a deep learning network with raw time series audio data as input. The network effectively learns how to do a Fourier transform and other feature definition all on its own. If I had modern libraries like TensorFlow in the 1980s I would have saved a lot of development time and achieved better results.
I still believe that it is extremely important for machine learning practitioners to be skilled at feature identification, extraction, and general data preparation. Don’t rely solely on deep learning. In this section we review some examples and techniques.
Understanding your data is key to using machine learning techniques. Understanding the features in data is a key part of data cleaning activities. Regardless of whether you use techniques that we will later use like maximum entropy classification models, neural networks for recognizing images or performing natural language processing, etc., once you learn how to use available open source machine learning libraries, you will find that much of your time will be spent understanding your data, cleaning your data, organizing data for reuse and multiple uses, etc.
Feature Engineering for Text Data
There are several ways to look at text data: character by character, tokenized into a stream of words, as a stream of stemmed words, or segmented into individual sentences; we can also choose whether to keep extra punctuation like commas and semicolons and whether to keep or remove surrounding quotation marks. Text collections are often converted to a matrix notation using the following scheme:
- Maintain a dictionary of unique words where each word is mapped to an integer index from zero to the number of unique words minus one.
- Input text is stored with documents represented as rows of the matrix; a column value is zero if the word mapped to that column index does not appear in the document, or one if the document does contain that word.
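The two steps above can be sketched in a few lines of plain Python (my own illustration; the function name build_term_matrix is hypothetical):

```python
def build_term_matrix(documents):
    """Map each unique word to a column index, then encode each document
    as a row of 0/1 values (1 if the word occurs in the document)."""
    vocab = {}
    for doc in documents:
        for word in doc.split():
            vocab.setdefault(word, len(vocab))
    matrix = []
    for doc in documents:
        row = [0] * len(vocab)
        for word in doc.split():
            row[vocab[word]] = 1
        matrix.append(row)
    return vocab, matrix

vocab, matrix = build_term_matrix(["the dog ran", "the cat sat"])
```

Each row is one document and each column is one dictionary word, which is the representation that most text classifiers consume.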
Other representations of inputs texts include a “bag of words” or “bag of ngrams.” A “bag of words” (BOW) is the set of unique words in an input document. Ngrams are instances of adjacent words in text. A 1gram is a single word, a 2gram is two words that appear adjacent to each other, a 3gram is three adjacent words, etc.
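Ngrams are easy to generate; a minimal plain Python sketch (my own illustration):

```python
def ngrams(words, n):
    """Return every run of n adjacent words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

tokens = "the dog ran very fast".split()
bigrams = ngrams(tokens, 2)
# [('the', 'dog'), ('dog', 'ran'), ('ran', 'very'), ('very', 'fast')]
```

A "bag of ngrams" feature set is just the set of these tuples, which captures some local word order that a plain bag of words loses.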
Feature Engineering for Numeric Data
Here we will be concerned with data represented as a one-dimensional vector of numeric values. For data at each index in an input vector, it is common to:
- Normalize data to a specific range ([0.0, 1.0] or [-1.0, 1.0] are common ranges)
- Subtract the mean value at each index (e.g., data at a specific index that occurs in the range [1000, 2000] with a mean value of 1200 might have the mean subtracted from each input value with the resulting data normalized to the range [0.0, 1.0])
- Take the log of data. This technique is often used when developing anomaly detection systems for any parameters that do not have a Gaussian distribution.
It might seem that considering one-dimensional data is limiting, but even image data is usually treated as a one-dimensional vector. For example, a black and white image that is 64x64 pixels might be “linearized” to a one-dimensional vector with 4096 elements and the value of each pixel might be mapped to the range [0.0, 1.0] where 0.0 represents black pixels, 1.0 represents white pixels, with grey-scale values falling between 0.0 and 1.0.
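These preparation steps are simple to express; here is a plain Python sketch (my own illustration; the function names are hypothetical):

```python
def normalize(values, low=0.0, high=1.0):
    """Linearly rescale values into the range [low, high]."""
    lo, hi = min(values), max(values)
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]

def mean_subtract(values):
    """Center values by subtracting their mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

# Data in the range [1000, 2000] rescaled to [0.0, 1.0]:
scaled = normalize([1000.0, 1200.0, 2000.0])
```

In practice you would compute the scaling constants from the training data and reuse them unchanged when preparing new data at prediction time.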
Using TensorFlow for Implementing Deep Neural Networks
There are several open source libraries and frameworks for implementing deep learning neural networks. We will look at two examples using TensorFlow in this chapter: processing numeric cancer data to build a predictive model and using a convolutional network for text classification. Many deep learning examples you might see on the web use image recognition as the example application, but I concentrate here on two types of problems for which I use deep learning networks in my own work.
It is worth mentioning other good deep learning toolkits:
- Keras is a high level library for specifying deep networks and can use the lower level Tensorflow and Theano as back ends. Keras is especially good for rapidly prototyping different network architectures.
- Theano is a lower level library that supports both CPU and GPU execution models.
- Caffe is a widely used C++ library for applying deep neural networks to image recognition.
- Deeplearning4j is a library written for the JVM. I covered Deeplearning4j in my book Power Java with examples for deep belief networks.
The key feature of deep learning neural networks is the use of many layers of neurons with different characteristics. Unlike the simpler, densely connected backprop neural networks that we have looked at, the connections between neurons and layers of neurons in deep learning networks tend to be sparse. Another key feature is the use of dropout, where for each learning cycle a large fraction of the neurons in the network is randomly turned off.
We will use both Python 2.7 and Python 3.5 for examples in this book. Check Appendix A for detailed installation instructions. After working through the examples in this chapter, you may want to read through and experiment with Google’s Tensorflow examples.
Processing Cancer Data With a BackPropagation Network
This example is found in the subdirectory tensorflow_examples/cancer_deep_learning_model in the github repository for this book. We will use the University of Wisconsin Cancer Data in the file train.csv that has the following format:
We will use a feedforward network with three hidden layers, each hidden layer containing twelve fully connected neurons. In the following listing, lines 1 and 2 import the TensorFlow and numpy libraries. The numpy library supplies data types and functionality for handling matrix operations.
We will train the network using the file train.csv and then test the trained network with different data defined in the file test.csv. We load these two data files using the TensorFlow contrib library in lines 4-9.
In line 11 we define the input data. Each value will be treated as a real (floating point) value, and there are 9 inputs per training sample. We build a classifier in lines 13-15 using the feature columns defined in line 11, specifying the number and size of hidden neuron layers, and specifying that there are two output classes (not malignant or malignant).
The classifier uses the test.csv test set in line 23 to calculate the accuracy over all samples in the test data (Accuracy: 0.960265) and in lines 26-30 I show you how to use the trained model with arbitrary new test data.
I assume that you have TensorFlow installed. Run the example using:
We can use the tensorboard utility (installed with TensorFlow) to look at the status of the loss function during training:
tensorboard --logdir=/var/folders/zp/mzt6r26x6ks4h3th294xt6th0000gn/T/tmpziGJ0n/ --port=8080
Using Convolutional Deep Learning Networks for Text Classification
Convolutional networks are feed forward networks with several differences to the backpropagation networks we have been using:
- Neurons are sparsely connected by weights
- Convolutional neuron layers closest to the inputs share weights and are used to learn local features
- Rectified linear (ReLU) neuron response functions are usually used, unlike the Sigmoid response function that we have seen earlier
- It is common to use dropout, a process of randomly choosing a set of neurons to “turn off” when showing the network a training example.
Convolutional networks are often used for image recognition, so let’s consider for a moment how shared weight convolutional layers work. Imagine that you are training a network to recognize handwritten numbers in the range [0,9]. Each shared weights convolutional filter will learn to recognize a feature, like the curved line at the top of hand drawn 2 and 3 characters. In both training and later recognition, the inputs to this feature detector will be shifted across the input field so this particular feature detector will recognize the curved line at the top of a 2 or 3 character no matter where that line segment is located in the input field. I visualize a feature detector as being a small two-dimensional patch, perhaps 10x10 pixels, that is slid across a larger two-dimensional input field (e.g., 64x64 pixels if this is the size of the input training images).
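The sliding shared-weight detector idea can be sketched in one dimension: a single set of filter weights is applied at every window position, so the same feature is detected wherever it occurs in the input (my own plain Python illustration):

```python
def convolve_1d(signal, filter_weights):
    """Slide a shared-weight filter across a signal, producing one
    response value per window position (a dot product per position)."""
    n = len(filter_weights)
    return [sum(w * s for w, s in zip(filter_weights, signal[i:i + n]))
            for i in range(len(signal) - n + 1)]

# A "rising edge" detector fires (+1) wherever a 0 -> 1 step occurs,
# no matter where in the signal that step is located:
responses = convolve_1d([0, 0, 1, 0, 0, 1, 0], [-1, 1])
```

Because one small set of weights is reused at every position, the filter learns a position-independent feature with far fewer parameters than a fully connected layer would need.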
Note: we will also solve this same problem, of classifying text as belonging to a specific topic, in the next chapter using both a Java library for maximum entropy and also Facebook’s fasttext tools. You will find the approaches in the next chapter to be easier, but using deep learning neural networks for natural language processing (NLP) is state of the art and is worth the extra effort for some applications.
In this section we use a convolutional network to recognize patterns in input text, so our feature detection window will be a small window of adjacent characters or words that during training is slid across a much longer sequence of adjacent characters or words in the input text. We will use entire words represented by word indices; that is, we process input text by forming a dictionary that maps each unique word in our training text to a word index and then replace the words in training (and testing) data with the integer indices for the words.
For example, if this dictionary mapped “the” to 1002, “dog” to 209, “ran” to 5402, “very” to 809, “fast” to 10022, and “today” to 2219 then the input text “the dog ran very fast today” gets transformed to the sequence [1002, 209, 5402, 809, 10022, 2219]. We use sliding filter sizes of 3, 4, and 5 words; this is set up as a parameter in the example code:
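Using the hypothetical word dictionary from this example, the transformation and the 3-word sliding windows look like this in plain Python (my own illustration):

```python
word_index = {"the": 1002, "dog": 209, "ran": 5402,
              "very": 809, "fast": 10022, "today": 2219}

def encode(text):
    """Replace each word with its integer dictionary index."""
    return [word_index[w] for w in text.split()]

def windows(indices, size):
    """All runs of `size` adjacent word indices (one filter position each)."""
    return [indices[i:i + size] for i in range(len(indices) - size + 1)]

encoded = encode("the dog ran very fast today")
# [1002, 209, 5402, 809, 10022, 2219]
```

A filter of size 3 then sees the windows [1002, 209, 5402], [209, 5402, 809], and so on, one per position as it slides across the encoded text.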
The example we use is in cognitive_computing_book_examples/deep_learning_cnn_text_classifier and requires:
- Python 2.7
- TensorFlow version 1.0.0 (or above)
- The Python library stemming
Assuming that you have installed TensorFlow for both Python 2.7 and Python 3.5 using directions in the last section, you can run this example using:
Later we will look at screen shots of the tensorboard application that reads training logs and displays accuracy, loss function, etc. during training.
The file main.py contains code for both training a Convolutional Neural Network (CNN) and using a trained model stored on disk to classify new text samples. The example in main.py is licensed using the Apache 2 license and uses some of the code from another Apache 2 licensed project by Denny Britz, which in turn is derived from Yoon Kim’s paper “Convolutional Neural Networks for Sentence Classification”
The example program in this section is long so we will break the source file into pieces (each labelled with line numbers starting with line 1) and show them in order of occurrence in the source file for discussion.
In lines 25-35 we import the required libraries for this example. We used the numpy and TensorFlow libraries in the last section. The library stemming.porter2 is used to stem words (e.g., the words “worked” and “works” would both stem to “work”). This example reads text data files in the subdirectory data with examples for text discussing chemistry, computers, economics, and music. We will be training a model to classify text into one of these four categories. The Python standard utilities listdir, isfile, and join will be used for reading this data. The library csv will be used to write out sample classifications for the test text.
Depending on how much RAM you have, running this example might run out of memory. If you get out of memory errors then try reducing the value of BATCH_SIZE set on line 39.
We set a flag USE_STEMMING on line 44 to true if we want to stem training and testing text (which we do).
In the next code listing, in line 2 we set the fraction of training data withheld for cross validation to 0.2 (20%). In line 6 we set the character encoding size to 64 distinct characters. In line 7 we set the adjacent word sequence filter sizes to 3 words, 4 words, and 5 words. We turn off regularization in line 12 (assuming that overfitting the training data will not be a problem). In line 16 we set the batch size to 32; the batch size is the number of training examples used for each batch of weight updates. In line 17 we set the number of training epochs (the number of times we process the training data) to 10. This is a small value, but the text classification problem that we are solving is relatively simple. The checkpoint value set in line 23 defines how many training cycles occur between saves of the network model to disk.
In the following code listing we set initial values for global variables in lines 2-6. Line 8 defines a few “noise words” (or “stop words”) that are discarded in input text. The assumption is that removing commonly used words speeds up training and should not affect classification accuracy. In practice, convolutional deep learning networks are so powerful that they will notice and discard common words used in the training data that do not separate the classes.
The function clean_text defined in lines 10-19 removes any characters that are not letters from the input text and stems words in the input (i.e., maps words like “seeing” to “see”).
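Here is a minimal sketch of such a cleaning function (my own approximation; the version in main.py also stems each word using stemming.porter2, which I omit here):

```python
import re

def clean_text(text):
    """Lowercase the text and strip out anything that is not a letter
    or a space, collapsing runs of whitespace into single spaces."""
    text = re.sub(r"[^a-zA-Z ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()

cleaned = clean_text("Chemistry: atoms & molecules, 2nd edition!")
# "chemistry atoms molecules nd edition"
```

Note that stripping digits can leave fragments like "nd" behind; in practice these rare tokens have little effect on classification accuracy.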
The function prepare_input_data defined in lines 22-40 is interesting: each text file in the subdirectory data is read, and it is assumed that the text file name is the classification of the text in the file (e.g., the file “chemistry” contains a lot of text on the subject of chemistry). Each separate line in the training files is one training example. Note that since there are four classes, the array output_map will have four elements; all elements have the value zero except for the element with the index of the classification, which has a value of one. The return value of the function is a list containing all cleaned training data and the output classification map for each training example. You may want to add a print statement before the return statement to see the values of examples and y.
The function batch_iter defined in lines 1-14 of the next listing takes a set of training data, optionally shuffles it, and returns the data. Note that we shuffle the data during training but when we later test the trained network we do not want to shuffle the data.
The function cnn_text_classifier defined in the following code listing builds a TensorFlow graph and returns this graph as the value of the function. The placeholder variable input_x holds an input example for the network while the placeholder variable input_y holds the labeled classification for the input, that is, the target value that we would like the network to produce when it sees the inputs in input_x. Lines 11-13 define the weights in the neural network.
On line 12, we set the Tensorflow device to use the CPU, assuming that an appropriate GPU is not available on your system. If your computer has a supported GPU you can look at the directions in the Tensorflow documentation to configure for its use.
The following code listing shows the training process that uses the graph defined in the function cnn_text_classifier that we just looked at and which we call in lines 27-34 in the following listing:
The following listing shows the code for testing a trained convolutional network using new test data:
Try testing the trained model:
We can use tensorboard to visualize the graph, model parameters, and accuracy:
Deep Learning Wrap Up
The tutorials at tensorflow.org provide many examples of using TensorFlow. Deep learning neural networks have rapidly advanced the state of the art in many application areas. At the natural language processing conference NAACL 2016, it seemed that about half the papers reported results using deep learning neural networks.
Regardless of your application, you are likely to find both academic papers and example systems solving similar problems using TensorFlow or other deep learning libraries. I recommend that you also look at Google’s TensorFlow examples.
In the next chapter we continue to look at examples of natural language processing using more conventional maximum entropy machine learning models.
Tools and Techniques for Natural Language Processing
I am going to show you examples using three libraries for NLP that I find useful: spaCy (Python), OpenNLP (Java), and Facebook’s fasttext (C++). Two libraries that I also use frequently and recommend, but will not use here, are the Stanford NLP tools (Java) and NLTK (Python).
We will use spaCy for part of speech (POS) tagging, which is identifying how words are used: as nouns, verbs, adjectives, prepositions, etc. This is not as simple as it might seem since some words can be used differently. Consider the word “bank” which can be used as a noun (the river bank, go to the bank to cash a check) and as a verb (bank the airplane to the left). We will also use spaCy for entity recognition, which is identifying words in text that are people, places, organizations, products, etc.
We will use OpenNLP for classifying text, specifically we will train a model using labeled training data to classify text as one of twelve categories (news, sports, business, physics, etc.). Knowing how to train models for text classification is a valuable skill!
In addition to the classification model that we train from scratch, we will use pre-trained models supplied with OpenNLP for segmenting sentences and performing entity detection as we also did with spaCy.
As a good alternative to OpenNLP for text classification, we will also use Facebook’s fasttext tool to train classification models. fasttext is especially useful in applications where we need to frequently train new models because of its speed and light memory requirements. An example application is frequently training new models of users’ behavior while interacting with a system.
Installation and Introduction to the spaCy NLP Library
Instructions for installing spaCy can be found in Appendix C. Briefly, if you use conda you can simply run:
The last line will install about 0.5 gigabytes of linguistic data and training corpora on your laptop.
Then to use spaCy:
The following four sections show Python code examples, each in a single file. Since importing the spacy library takes a while to load a lot of data into memory, I generally use spaCy in an interactive REPL.
Using spaCY for Assigning Part of Speech Tags
A single word can be used as multiple parts of speech. For example, “bank” can be a noun in “the river bank” or a verb in “the pilot wanted to bank the plane.” It is common to tag input text to identify the part of speech of each word.
In line 2 of the following Python code listing we load the English language models for spaCy:
The function nlp defined in line 2 creates a new spaCy text document from Unicode input text. In lines 5-7 we print out each word with its lemma (standard form), part of speech tag, and a more readable form of the part of speech, as can be seen in this listing:
Using spaCY for Entity Recognition
Entity recognition is the identification of entities like people, places, and organizations in input text. In the following Python code listing, in line 2 we load the English language models for spaCy and in line 3 create a spaCy document object from Unicode input text:
In lines 6-8 we print out the entities identified in the text:
The label definitions as defined in the spaCy documentation include:
- PERSON - people
- NORP - nationalities or religious or political groups
- FACILITY - buildings, airports, etc.
- ORG - companies, institutions, etc.
- GPE - geographical locations like countries, states, and cities
- LOC - non-GPE locations
- PRODUCT - products like cars, food, etc.
- EVENT - named historical events
- WORK_OF_ART - titles of books, paintings, songs, etc.
- LANGUAGE - human languages
- DATE - a date or time period
- TIME - times shorter than a day
- MONEY - monetary values
spaCy has many additional functions that you can read about on the spaCy documentation web page. When performing NLP using Python I also use the NLTK NLP toolkit. For my work I often use the Ruby language for NLP, specifically my own KBSPortal.com commercial product. In the next few sections we will use the Java OpenNLP library and framework.
Introduction to the OpenNLP Library and Installation Notes
Instructions for installing OpenNLP can be found in Appendix B, but all required files are in the GitHub repository for this book in the directory cognitive_computing_book_examples/opennlp_maxent.
OpenNLP is an Apache Foundation project that is written in Java and supports standard NLP operations like:
- tokenization: identifying individual words and punctuation in text
- sentence segmentation: identifying sentence boundaries in text
- part-of-speech tagging: assigning a part of speech (e.g., “noun,” “verb,” “adverb,” etc.) to each word in text
- named entity extraction: finding mentions of proper names (e.g., people’s names, company names, place names, etc.) in text
- text classification: assigning a label (“news,” “health,” “sports,” etc.) to text
- chunking: like part-of-speech tagging but also deals with short word sequences like noun and verb phrases
- parsing: generating a parse tree for a sentence; usually there are many possible parse trees for any given sentence and these are assigned probabilities of being correct
- coreference resolution: finding all word sequences that refer to an entity (e.g., a person, company, etc.) in text
We will look at two examples: text classification and entity detection.
Assuming that you have Java 8 (including the full JDK) and Maven installed on your system, the examples in the directory cognitive_computing_book_examples/opennlp_maxent are self-contained. I include the pre-trained model files we will be using, and the OpenNLP Java dependencies are resolved using Maven. We will also train a model file for text classification using twelve categories; I supply the training text for this model.
Even though the examples for this book are self-contained, if you use OpenNLP for your projects then I recommend that you also install the full source code on your laptop to use as a reference to augment the online OpenNLP documentation.
Classification Example Using the OpenNLP Library
In building a deep learning text categorization model in the last example, the deep neural network learned which features to use and to some extent the architecture of the network itself. In this section we use a more traditional approach of performing feature engineering ourselves and then use a maximum entropy machine learning model to categorize text based on the features we selected and training data that we use to train the model. In a later section we will also look at an alternative machine learning library for the same learning problem: Facebook’s fasttext tools.
When we train a classification model, the features (or inputs) for the model can be individual words, adjacent word pairs (2-grams), three adjacent words (3-grams), structural information from the source documents (e.g., does a word appear in the document title?), etc. As you would expect, some features provide more useful information than others. Maximum entropy is a formal way of determining the value of features. These types of classification models calculate the probabilities that a set of category labels apply to a sample of text.
You may also have heard the term entropy in the context of the strength of passwords. Entropy measures the degree of randomness. The password “cat” has lower entropy than “8!s” because the probability of “cat” occurring in text (in a dictionary or elsewhere) is greater than that of “8!s.” The machine learning algorithms implemented in OpenNLP will learn which features provide the most utility in separating text samples into different categories. As a very simple example, given many training examples for the twelve categories we will use, the algorithm will learn that the feature for the word “the” is common across the training examples for every category and thus contributes very little information for separating them. We want features that maximize entropy.
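The password comparison can be made concrete with a rough back-of-the-envelope estimate: if each character is drawn uniformly from some character pool, it contributes log2(pool size) bits of entropy. The pool list below is my own simplification, not a standard from any library:

```python
import math
import string

def naive_password_entropy(password):
    """Rough entropy estimate in bits: length times log2 of the size of the
    smallest standard character pool that covers every character used."""
    pools = [string.ascii_lowercase,                                     # 26 characters
             string.ascii_letters,                                       # 52 characters
             string.ascii_letters + string.digits,                       # 62 characters
             string.ascii_letters + string.digits + string.punctuation]  # 94 characters
    for pool in pools:
        if all(ch in pool for ch in password):
            return len(password) * math.log2(len(pool))
    return len(password) * 7.0  # fall back: treat as full 7-bit ASCII

# "cat" draws only on the 26 lowercase letters; "8!s" forces the 94-character pool.
assert naive_password_entropy("cat") < naive_password_entropy("8!s")
```

Real password-strength estimators also penalize dictionary words like "cat," which is why the true gap is even larger than this character-pool estimate suggests.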
Many classifiers use a “bag of words” (BOW) as the features representing a sample of natural language text. Here the order of words is not important. Clearly, word order independence in features “throws away” useful information! If we are training a maximum entropy model on the two categories “positive sentiment” and “negative sentiment,” the word ordering in the phrase “not good” contains useful information for classifying text by sentiment. We can “fix” this problem by also including features of two adjacent words, where in this case “not good” is an input feature along with the individual words “not” and “good.”
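The “fix” just described, individual words plus adjacent pairs, can be sketched in a few lines of Python; text_features is my own illustrative helper, not an OpenNLP API:

```python
def text_features(text):
    """Bag-of-words features plus 2-gram features (pairs of adjacent words)."""
    words = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams

# The 2-gram "not good" survives as a single feature even though
# the bag-of-words features alone lose the word ordering.
features = text_features("the movie was not good")
```

A sentiment model trained on such features can learn a negative weight for the “not good” 2-gram that overrides the positive weight of the word “good” on its own.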
The important point is that we want the training process to determine the value of the types of features fed to a maximum entropy model. Based on the training data, a model is trained by maximizing the entropy of the available features, assigning different weights to features. This maximum entropy approach is a little weaker than the deep learning text classification example (in the last chapter) where feature selection was part of the automated learning process; the reason is that we may not be able to determine ourselves which features will be useful. Still, maximum entropy classification is a strong and practical approach, and as you will see we will get excellent results with fast training times.
The examples in this section and the next are updated versions of the OpenNLP example programs in my book Power Java where I also cover training new recognition models.
Training a Model
The OpenNLP class DoccatTrainer can process specially formatted input text files and produce categorization models using maximum entropy, a technique that handles data with many features. Features automatically extracted from text and used in a model are things like the words in a document and word adjacency. Maximum entropy models can recognize multiple classes. When testing a model on new text data, the probabilities of all possible classes add up to the value 1 (this normalization is often referred to as “softmax”). For example, we will be training a classifier on twelve categories, and the probabilities of these categories for some test input text add up to the value of one:
In this example almost all of the probability density belongs to the category “news.” These results are for the short text string: “The White House is often mentioned in the news about U.S. foreign policy. Members of Congress and the President are worried about the next election and may pander to voters by promising tax breaks. Diplomacy with Iran, Iraq, and North Korea is non existent in spite of a worry about nuclear weapons. A uni-polar world refers to the hegemony of one country, often a militaristic empire. War started with a single military strike. The voting public wants peace not war. Democrats and Republicans argue about policy.” For longer text documents it is common for several categories to have larger probabilities assigned to them.
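The “softmax” normalization mentioned above can be sketched in a few lines; the raw category scores here are invented for illustration, not actual OpenNLP output:

```python
import math

def softmax(scores):
    """Convert raw model scores into probabilities that sum to 1."""
    m = max(scores)                            # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three categories, e.g. "news", "economy", "sports":
probs = softmax([4.2, 0.5, -1.3])
assert abs(sum(probs) - 1.0) < 1e-9  # probabilities always sum to one
```

Because the scores are exponentiated, one dominant score (here 4.2) captures almost all of the probability density, which matches the “news” result described above.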
The format of the input file for training a maximum entropy classifier is simple but has to be correct: each line starts with a category name followed by sample text for that category, all on one line. Please note that I have already trained the model, producing the model file models/en-text-categorization.bin, so you don’t need to run the example in this section unless you want to regenerate this model file.
The file sample_category_training_text.txt contains four lines, defining four categories. Here are two lines from this file (I edited the following to look better on the printed page, but these are just two lines in the file):
Here is one training example each for the categories COMPUTERS and ECONOMY.
You must format the training file perfectly. As an example, if you have empty (or blank) lines in your input training file then you will get an error like:
The OpenNLP documentation has examples for writing custom Java code to build models but I usually just use the command line tool; for example:
The model is written to the relative file path models/en-text-categorization.bin. The training file I am using is tiny, so the model is trained in a few seconds. For serious applications, the more training text the better! By default the DoccatTrainer tool uses the default text feature generator, which uses word frequencies in documents but ignores word ordering. As I mention in the next section, I sometimes like to mix word frequency feature generation with 2-gram features (that is, frequencies of two adjacent words). In this case you cannot simply use the DoccatTrainer command line tool; you need to write a little Java code yourself so that you can plug another feature generator into the alternative API:
As I also mention in the next section, the last argument would look like:
For most purposes the default word frequency (or bag of words) feature generator is probably okay so using the command line tool is a good place to start. Here is the output from running the DoccatTrainer command line tool:
We will use our new trained model file en-text-categorization.bin in the next section.
Using Our Trained Model and Standard Models for Sentence Segmentation and Entity Recognition
This section does double duty, showing both how to use the classification model we created in the last section and also how to use the pre-trained models supplied with OpenNLP for sentence segmentation and entity detection.
The code that uses the model we trained in the last section is short enough to list in its entirety:
In lines 33 through 42 we initialize the static data for an instance of the class DoccatModel that loads the model file created in the last section.
A new instance of the class DocumentCategorizerME is created in line 28 each time we want to classify input text. I called the one-argument constructor for this class, which uses the default feature generator. An alternative constructor is:
The default feature generator is BagOfWordsFeatureGenerator which just uses word frequencies for classification. This is reasonable for smaller training sets as we used in the last section but when I have a large amount of training data available I prefer to combine BagOfWordsFeatureGenerator with NGramFeatureGenerator. You would use the constructor call:
The following listings intersperse example code snippets that use the TextClassifier class with the output printed by each code snippet. Here we use the sentence segmenting model:
The sentence segmentation model found two sentences, “Apple Computer, Microsoft, and Google are in the tech sector.” and “Each is very profitable according to Mr. Smith.”, in the input text. Notice that this model correctly did not split out a new sentence around the period in the character sequence “Mr. Smith,” which is an error that simple sentence splitters often make.
The next code snippet finds entities in input text:
Here are some tests for the text classification model we built:
Training a New Categorization Model using Facebook’s fasttext
In my own NLP work, classifying text, as we did with OpenNLP in a previous section, is a common part of my workflow. fasttext trains and applies models very efficiently, and in the last year I have added fasttext to the set of tools I use.
For reference, please see the research paper Bag of Tricks for Efficient Text Classification. This paper makes the point that deep learning is not always required if we already know which features to use. The developers at Facebook who wrote fasttext use both bag of words (BOW) and n-grams. We have already seen 2-grams and 3-grams, adjacent double and triple word sequences; an n-gram is a sequence of “n” adjacent words.
Install Facebook’s fasttext
You can find the C++ source code in the GitHub repo. If you just want to use fasttext for classification, then all you need is a C++ compiler, which is our use case here.
The build has no dependencies beyond a C++ compiler and standard tools; on my macOS and Linux laptops it is built with ‘make’. Make sure the ‘fasttext’ executable created by the build process is on your PATH.
Training and Using a Model
We will use the same data as we did in the previous OpenNLP example. The format of the file is slightly different: for OpenNLP each training example was on one line with the classification label at the start of the line followed by a tab character, while for fasttext each line begins with a token like __label__computer or __label__economic that specifies the classification label for the text on that line.
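Converting between the two formats is mechanical. A sketch of the transformation (my own helper; the lower-casing of labels is an assumption about how the fasttext training file was prepared):

```python
def opennlp_to_fasttext(line):
    """Convert one OpenNLP training line ("LABEL<tab>sample text") into
    fasttext's format ("__label__label sample text")."""
    label, text = line.split("\t", 1)            # label is everything before the first tab
    return "__label__" + label.lower() + " " + text.strip()
```

Running each line of the OpenNLP training file through a function like this produces a file fasttext can train on directly.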
The following command trains a model:
Notice that the training time is fast! On my MacBook, training with fasttext took about 1/4 of a second while training with the same data using OpenNLP took a few seconds.
We can test against the test example file:
In the first case, the last argument “1” indicates that we want only the most probable classification label for each bit of test text. In the second example, we ask for the top five classification labels.
You can also pipe data into the fasttext executable file:
which produces: __label__chemistry
Piping data in for prediction can be done from scripting languages like Ruby and Python. You can also link against the fasttext library. There is a Ruby language client example in the directory facebook_fasttext named ruby_example.rb:
It may look inefficient to spawn a new process to call fasttext, but it only takes a few milliseconds to spawn fasttext on my MacBook.
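A comparable sketch in Python uses the standard subprocess module; fasttext’s predict mode accepts “-” as the input file name to read from standard input. The helper names here are mine, and the sketch assumes the fasttext executable is on your PATH:

```python
import subprocess

def fasttext_command(model_path, k=1):
    """Argument list for fasttext's predict mode; "-" means read text from stdin."""
    return ["fasttext", "predict", model_path, "-", str(k)]

def fasttext_predict(model_path, text, k=1):
    """Spawn the fasttext executable, pipe the text in, and return the
    predicted label token(s), e.g. ["__label__chemistry"]."""
    result = subprocess.run(fasttext_command(model_path, k),
                            input=text, capture_output=True,
                            text=True, check=True)
    return result.stdout.split()
```

As with the Ruby example, the per-call process-spawn overhead is small enough for many scripting workflows; for high-throughput services you would link the library instead.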
NLP Wrap Up
Natural language processing has been a large part of my career since the 1980s, and my personal interest in NLP has led me to use many more NLP techniques than we have covered in this short book. Still, I believe that this knowledge of NLP will be of use to you, and the good news is that open source libraries and frameworks for NLP are now very good and easy enough to use in your own projects. As I did in the machine learning chapters, here I have tried to show a few representative examples to get you started.
Book Wrap Up
Thank you for reading my book. I hope that I have met my goals of providing you with both interesting material to expand the way you think about human and computer cognition and also to provide you with actionable knowledge in your own work.
My Other Books That Might Interest You
- Interested in Machine Learning using Java? Buy my book Power Java
- Interested in General Artificial Intelligence material using Java? Buy my book Practical Artificial Intelligence Programming with Java
- Interested in Knowledge Management, Knowledge Representation and Linked Data (Semantic Web)? Get a free PDF of my books Practical Semantic Web and Linked Data Applications, Java, Scala, Clojure, and JRuby Edition and Practical Semantic Web and Linked Data Applications, Common Lisp Edition
Appendix A - Installing TensorFlow
The following instructions set up the latest TensorFlow for both Python 2.7 and Python 3.5 for CPU-only operation on Linux and macOS.
If you want to install with GPU support, follow the instructions at tensorflow.org.
At the time I am writing these directions, the current version of TensorFlow is tensorflow-1.0.0.
Dealing With New Versions of TensorFlow
After installing a TensorFlow environment, I like to check the exact version that is installed:
I use version 1.0.0 in most of the examples in this book so hopefully the APIs will be stable in the future.
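When checking compatibility it helps to compare versions numerically rather than as strings, since “10.0.0” sorts before “2.0.0” as text. A small sketch; comparing the result against tf.__version__ is left to the reader, since TensorFlow may not be installed where you run this:

```python
def version_tuple(version):
    """Parse a version string such as "1.0.0" into a comparable tuple of ints."""
    return tuple(int(part) for part in version.split("."))

# String comparison gets ordering wrong; tuple comparison gets it right.
assert "10.0.0" < "2.0.0"                                # lexicographic surprise
assert version_tuple("10.0.0") > version_tuple("2.0.0")  # numeric, as intended
assert version_tuple("1.0.0") == (1, 0, 0)
```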
I will try to keep the examples in this book up to date with future versions of TensorFlow, so check back with the GitHub code repository for the examples in this book.
Appendix B - Installing OpenNLP
OpenNLP is written in Java and includes pre-trained model files.
All of the files you need for the examples in this book are in the GitHub repository in the directory cognitive_computing_book_examples/opennlp_maxent.
Optionally, you may also want to download at least a binary distribution from the OpenNLP download page and you might want to download the source code for reference.
Appendix C - Installing spaCy
The spaCy NLP library is compatible with 64-bit CPython versions 2.6+ or 3.3+ and runs on Linux, macOS and Windows.
You can find general installation instructions on the spaCy installation web page, but if you use conda to manage Python installations, you can simply use:
The last line will install about 0.5 gigabytes of linguistic data and training corpora on your laptop.
Appendix D - Using Cloud Computing Platforms
I have been using Amazon AWS since 2006 for customer work, and I used Microsoft’s Azure while I was a member of Microsoft’s BizSpark program. AWS and Azure are great services, but as I started to write this book in April 2016, Google was actively enhancing its Google Cloud Platform (GCP) services, and its pricing is very competitive. It does not particularly matter which cloud provider you use.
I keep a large VPS (60 GB of memory, 16 CPU cores) inactive on GCP and just start it for a few hours as needed. Currently, it costs about $0.65/hour to run this instance. I find it much less expensive to rent a large VPS for a few hours a month than to buy a more powerful computer for my office. When I need a large memory server for extended periods of time (generally, more than a month), I rent a physical server from Hetzner or a large memory VPS from OVH. The following instructions should also be useful for alternative server providers.
I decide on what services to rent based on cost and convenience.
Since I use several different laptops, and since I do much of my development at the command line, I generally prefer to keep my working materials and software set up on a remote server.