# Power Java

## Preface

Please note that this book was published in the fall of 2015 and updated in July 2016 for version 2.0 of the Apache Spark machine learning library.

I have been programming since high school and since the early 1980s I have worked on artificial intelligence, neural networks, machine learning, and general web engineering projects. Most of my professional work is reflected in the examples in this book. These example programs were also chosen for their technological relevance to the rapidly changing technical scene: big data, the use of machine learning in systems that touch most parts of our lives, and networked devices. I then narrowed the list of topics based on a public survey I announced on my blog. Many thanks to the people who took the time to take this survey. It is my hope that the Java example programs in this book will be useful in your projects. Hopefully you will also have a lot of fun working through these examples!

Java is a flexible language with a huge collection of open source libraries and utilities. Java gets some criticism for being a verbose programming language. I have my own coding style that is concise but may break some of the things you have learned about “proper” use of the language. The Java language has seen many upgrades since its introduction over 20 years ago. This book requires and uses the features of Java 8, so please update to the latest JDK if you have not already done so. You will also need to have Maven installed on your system. I also provide project files for the free Community Edition of the IntelliJ IDE.

Everything you learn in this book can be used with some effort in the alternative JVM languages Clojure, JRuby, and Scala. In addition to Java I frequently use Clojure, Haskell, and Ruby in my work.

### Book Outline

This book consists of eight chapters that I believe show the power of the Java language to good effect:

• Network programming techniques for the Internet of Things (IoT)
• Natural Language Processing using OpenNLP including using existing models and creating your own models
• Machine learning using the Spark MLlib library
• Anomaly Detection Machine Learning
• Deep Learning using Deeplearning4j
• Web Scraping
• Using rich semantic and linked data sources on the web to enrich the data models you use in your applications
• Java Strategies for Knowledge Management-Lite using Cloud Data Resources

The first chapter on IoT is a tutorial on network programming techniques for IoT development. I have also used these same techniques for multiplayer game development and distributed virtual reality systems, and also in the design and implementation of a world-wide nuclear test monitoring system. This chapter stands on its own and is not connected to any other material in this book.

The second chapter shows you how to use the OpenNLP library to train your own classifiers, tag parts of speech, and generally process English language text. Both this chapter and the next chapter, on machine learning using the Spark MLlib library, use machine learning techniques.

The fourth chapter provides an example of anomaly detection using the University of Wisconsin cancer database. The fifth chapter is a short introduction to pulling plain text and semi-structured data from web sites.

The last two chapters are for information architects or developers who would like to develop information design and knowledge management skills. These chapters cover linked data (semantic web) and knowledge management techniques.

The source code for the examples can be found at https://github.com/mark-watson/power-java and is released under the Apache 2 license. I have tried to use only existing libraries in the examples that are either Apache 2 or MIT style licensed. In general I prefer Free Software licenses like the GPL, LGPL, and AGPL, but for examples in a book where I expect readers to sometimes reuse entire example programs or at least small snippets of code, a license that allows use in commercial products makes more sense.

There is a subdirectory in this github repository for each chapter, each with its own maven pom.xml file to build and run the examples.

The chapters are independent of each other so please feel free to skip around when reading and experimenting with the sample programs.

This book is available for purchase at https://leanpub.com/powerjava.

You might be interested in other books that I have self-published via Leanpub.

My older books published by Springer-Verlag, McGraw-Hill, Morgan Kaufmann, Apress, Sybex, M&T Press, and J. Wiley are listed on the books page of my web site.

One of the major themes of this book is machine learning. In addition to my general technical blog I have a separate blog that contains information on using machine learning and cognition technology, blog.cognition.tech, along with an associated website supporting cognition technology.

### If You Did Not Buy This Book

I frequently find copies of my books on the web. If you have a copy of this book and did not buy it please consider paying the minimum purchase price of 4 at leanpub.com/powerjava.

## Network Programming Techniques for the Internet of Things

This chapter will show you techniques of network programming relevant to developing Internet of Things (IoT) projects and products using local TCP/IP and UDP networking. We will not cover the design of hardware or the design of IoT user experiences. Specifically, we will look at techniques for using local directory services to publish and look up available services, and techniques for communicating efficiently using UDP, multicast, and broadcast.

This chapter is a tutorial on network programming techniques that I believe you will find useful for developing IoT applications. The material on User Datagram Protocol (UDP) and multicast is also useful for network game development. I am not covering some important material: the design of user experiences and devices, and IoT devices that use local low power radios to connect cooperating devices. That said, it is worth thinking about what motivates the development of IoT devices and we will do this in the next section.

There are emerging standards for communication between IoT devices and open source projects like TinyOS and Contiki that are C language based, not Java based, so I won’t discuss them. Oracle supports the Java ME Embedded profile that is used in some IoT products, but in this chapter I want to concentrate on network programming techniques and example programs that run on stock Java (including Android devices).

### Motivation for IoT

We are used to the physical constraints of using computing devices.
When I was in high school in the mid 1960s and took a programming class at a local college I had to make a pilgrimage to the computer center, wait for my turn to use a keypunch machine, walk over to submit my punch cards, and stand around and wait and eventually get a printout and my punch cards returned to me. Later, interactive terminals allowed me to work in a more comfortable physical environment. Jumping ahead almost fifty years, I can now use my smartphone to SSH into my servers, watch movies, and even use a Java IDE. As I write this book, perhaps 70% of Internet use is done on mobile devices.

IoT devices complete this process of integrating computing devices into our lives with fewer physical constraints on us. Our future will include clothing, small personal items like pens, cars, furniture, etc., all linked in a private and secure data fabric. Unless you are creating content (e.g., writing software, editing pictures and video, or performing a knowledge engineering task) it is likely that you won’t think too much about the devices that you interact with and in general will prefer mobile devices. This is a similar experience to driving a car, where tasks like braking and steering are “automatic” and don’t require much conscious thought.

We are physical creatures. Many of us gesture with our hands while talking, move things around on our desk while we are thinking, and generally use our physical environment. Amazing products will combine physical objects with computing and information management.

Before diving into techniques for communication between IoT devices, we will digress briefly in the next section to see how to run the sample programs for this chapter.

### Running the example programs

If you want to try running the example programs let’s set that up before going through the network programming techniques and code later in this chapter.
Assuming that you have cloned the github repository for the examples in this book, go to the subdirectory internet_of_things, open two shell windows, and use the following commands to run the examples in the same order that they are presented (these commands are in the README.md file so you can copy and paste them). These commands have been split to multiple lines to fit the page width of this book:

#### Build

#### Run service discovery

#### Run UDP experiments

Line 2 compiles and builds the examples. For each of the example programs you will run two programs on the command line. For example, for the service discovery example, lines 5 and 6 (which would be entered on one line in a terminal window) start a test service and lines 7 and 8 run the service test client. Similarly, lines 10 through 20 show how to run the services and test clients for the UDP experiments and the multicast experiments. I repeat these instructions with each example as we look at the code, but I thought that you might enjoy running all three examples before we look at their implementations later in this chapter. In each example, run the two example programs in different shell windows. For more fun, run the examples using different laptops or any device that can run Java code.

The following figure shows the project for this chapter in the Community Edition of IntelliJ:

This chapter starts with a design pattern that you might use for writing IoT applications. Then the following three major sections show you how to use directory lookups, use User Datagram Protocol (UDP) instead of TCP/IP, and then multicast/broadcast. The material in this chapter is fairly elementary, but it seems like many developers don’t have experience with lower level network programming techniques. My hope is that you will understand when you might want to use a lower level protocol like UDP instead of TCP and when you might find directory lookups useful in your applications.
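As a sketch only, a typical Maven build-and-run session for one pair of these examples might look like the following; the `exec:java` goals and the class names here are my assumptions for illustration, not the actual commands from the README.md (use the repository's README for the real ones):

```shell
# Build everything for this chapter (run from the internet_of_things subdirectory)
mvn clean package

# Shell window 1: start a test service (hypothetical class name)
mvn exec:java -Dexec.mainClass="com.markwatson.iot.RegisterService"

# Shell window 2: run the matching test client (hypothetical class name)
mvn exec:java -Dexec.mainClass="com.markwatson.iot.LookupService"
```
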
### Design Pattern

The pattern that the examples in the following sections support is simple (in principle): small devices register services that they offer and look for services offered by other devices. Devices use User Datagram Protocol (UDP) and multicast to communicate when providing and using services. Once a consumer has discovered a service provider (host, IP address, port, and service types) the consumer can also open a direct TCP/IP socket connection to the provider to use a newly discovered service. The following diagram shows several IoT use cases:

• A consumer uses the JmDNS library to broadcast on the local network a request for services provided.
• The JmDNS library on the provider side hears a service description request and broadcasts available services.
• The JmDNS library on the consumer side listens for the broadcast from the provider.
• Using a service description, the consumer makes a direct socket connection to the provider for a service.
• The provider periodically broadcasts updated data for its service.
• The consumer listens for broadcasts of updated data and uses new data as it becomes available.

This figure shows a few use cases. You can implement these patterns of interaction between IoT devices using application specific software you write.

### Directory Lookups

In the next two sub-sections we will be using the Java JmDNS library that supports a local multicast Domain Name Service (DNS) and service discovery. JmDNS is compatible with Apple’s Bonjour discovery service, which is also available for installation on Microsoft Windows. JmDNS has been successfully used in Android apps if you change the defaults on your phone to accept multicast (since Android version 4.1, Network Service Discovery (NSD)). Android development is outside the scope of this chapter and I refer you to the NSD documentation and example programs if you are interested.
I tried the example Android NsdChat application on my Galaxy S III (Android 4.4.2) and Note 4 (Android 5.0.1) and this example app would be a good place for you to start if you are interested in using Android apps to experiment with IoT development.

#### Create and Register a Service

The example program in this section registers a service using the JmDNS Java library that listens on port 1234. In the next section we look at another example program that connects to this service. The service is created in lines 10 and 11, defining the service name (notice that a period is added to the end of the service name FOO123), the port that the server listens on (1234 in this case), and the last argument is any string data that you also want to pass to clients looking up information on service FOO123. When you run the client in the next section you will see that the client receives this information from the local DNS lookup. Running this example produces the following output (edited to fit the page width):

#### Lookup a Service by Name

If you have the service created by the example program in the last section running, then the following example program will look up the service. We use the JmDNS library in lines 12 and 14 to find a service by name. In line 14 the first argument “_http._tcp.local.” indicates that we are searching the local network (for example, your home network created by a router from your local ISP). Running this example produces the following output (edited to fit the page width):

Do you want to find all local services? Then you can also create a listener that will notify you of all local HTTP services. There is an example program in the JmDNS distribution on the path src/sample/java/samples/DiscoverServices.java written by Werner Randelshofer. I changed the package name to com.markwatson.iot and added it to the git repository for this book.
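A minimal sketch of the register-and-lookup pattern just described might look like the following. This is my own condensed version, not the book's listing; it assumes the JmDNS library is on the classpath, and reuses the service name FOO123 and port 1234 from the text:

```java
import javax.jmdns.JmDNS;
import javax.jmdns.ServiceInfo;
import java.net.InetAddress;

public class JmDnsSketch {
  public static void main(String[] args) throws Exception {
    JmDNS jmdns = JmDNS.create(InetAddress.getLocalHost());

    // Register: service type, name, port, and a free-form payload string for clients.
    ServiceInfo info =
        ServiceInfo.create("_http._tcp.local.", "FOO123", 1234, "extra service data");
    jmdns.registerService(info);

    // Lookup by type and name on the local network (blocks briefly while resolving).
    ServiceInfo found = jmdns.getServiceInfo("_http._tcp.local.", "FOO123");
    if (found != null) {
      System.out.println(found.getName() + " on port " + found.getPort());
    }

    jmdns.unregisterAllServices();
    jmdns.close();
  }
}
```

In practice the register and lookup halves run in separate programs on separate devices, as in the book's two example programs.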
You can try it using the following commands:

While it is often useful to find all local services, I expect that for your own projects you will be writing both service and client applications that cooperate on your local network. In this case you know the service names and can simply look them up as in the example in this section.

### User Datagram Protocol Network Programming

You are most likely familiar with using TCP/IP in network programming. TCP/IP provides some guarantees that the sender is notified on failure to deliver data to a receiving program. User Datagram Protocol (UDP) makes no such guarantees of notification of message delivery and is much more efficient than TCP/IP. On local area networks data delivery is more robust to failure compared to the Internet, and since we are programming for devices on a local network it makes some sense to use a lower level and more efficient protocol.

#### Example UDP Server

The example program in this section implements a simple service that listens on port 9005 for UDP packets that are client requests. The UDP packets received from clients are assumed to be text; this example slightly modifies the text and returns it to the client. In line 11 we create a new DatagramSocket listener on port 9005. On lines 13 and 14 we create a new empty DatagramPacket that is used in line 16 to hold data from a client. The example code prints out packet data, changes the text from the client, and returns the modified text to the client. Running this example produces the following output (edited to fit the page width):

Note that you must also run the example program from the next section to get the output in lines 4 and 5. This is why I listed the commands at the beginning of the chapter for running all of the examples: I thought that if you ran all of the examples according to these earlier directions, the example programs would be easier to understand.
#### Example UDP Client

The example program in this section implements a simple client for the UDP server in the last section. A sample text string is sent to the server and the response is printed. Running this example produces the following output (edited to fit the page width):

I have used UDP network programming in many projects and I will mention two of them here. In the 1980s I was an architect and developer on the NMRD project that used 38 seismic research stations around the world to collect data that we used to determine if any country was conducting nuclear explosive tests. We used a guaranteed messaging system for meta control data but used low level UDP for much of the raw data transfer. The other application was a communications library for networked racecar and hovercraft games that ran on a local network. UDP is very good for local network game development because it is efficient, the chance of losing data packets is very low, and even if packets are lost game play is usually not affected. Remember that when using UDP you are not guaranteed delivery of data and not guaranteed notification of failure!

### Multicast/Broadcast Network Programming

The examples in the next two sections are similar to the examples in the last two sections with one main difference: a server periodically broadcasts data using UDP. The server does not care if any client is listening or not, it simply broadcasts data. If any clients are listening they will probably receive the data. I say that they will probably receive the data because UDP provides no delivery guarantees. There are specific INET addresses that can be used for local multicast in the range of addresses between 224.0.0.0 and 224.0.0.255. We will use 224.0.0.5 in the two example programs. These are local addresses that are invisible to the Internet.
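Before the multicast examples, here is the UDP server and client described above condensed into one self-contained sketch of mine (not the book's listings): the book's versions are separate programs using port 9005, while this sketch uses an ephemeral port so both ends can run in one process, and the "slightly modifies the text" step is a stand-in `transform` method of my own:

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

public class UdpEchoSketch {

  // Server side: wait for one text packet, modify it, and send it back.
  public static void serveOnce(DatagramSocket socket) throws Exception {
    byte[] buffer = new byte[1024];
    DatagramPacket request = new DatagramPacket(buffer, buffer.length);
    socket.receive(request);  // blocks until a client sends data
    String text = new String(request.getData(), 0, request.getLength(), StandardCharsets.UTF_8);
    byte[] reply = transform(text).getBytes(StandardCharsets.UTF_8);
    socket.send(new DatagramPacket(reply, reply.length, request.getAddress(), request.getPort()));
  }

  // Stand-in for "slightly modifies the text" (my choice of modification).
  public static String transform(String clientText) {
    return "echo: " + clientText;
  }

  // Client side: send one line of text and wait for the reply.
  public static String request(InetAddress host, int port, String text) throws Exception {
    try (DatagramSocket socket = new DatagramSocket()) {
      socket.setSoTimeout(2000);  // UDP gives no delivery guarantee; don't wait forever
      byte[] data = text.getBytes(StandardCharsets.UTF_8);
      socket.send(new DatagramPacket(data, data.length, host, port));
      byte[] buffer = new byte[1024];
      DatagramPacket response = new DatagramPacket(buffer, buffer.length);
      socket.receive(response);
      return new String(response.getData(), 0, response.getLength(), StandardCharsets.UTF_8);
    }
  }

  public static void main(String[] args) throws Exception {
    try (DatagramSocket server = new DatagramSocket(0)) {  // ephemeral port; the book uses 9005
      Thread serverThread = new Thread(() -> {
        try { serveOnce(server); } catch (Exception ignored) { }
      });
      serverThread.start();
      String reply = request(InetAddress.getLoopbackAddress(), server.getLocalPort(), "hello");
      System.out.println(reply);
      serverThread.join();
    }
  }
}
```

Note how little code UDP needs compared to TCP: there is no connection setup at all, just packets.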
#### Multicast Broadcast Server

The following example is very similar to the one seen two sections ago (Java class UDPServer) except we do not wait for a client request. In line 13 we create an INET address for “224.0.0.5” and in line 14 create a DatagramSocket that is reused for broadcasting 10 messages with a time delay of half a second between messages. The broadcast messages are defined in line 16 and the code in lines 18 to 21 creates a new DatagramPacket with the message text data that gets broadcast to the local network in line 21. Running this example produces the following output (output edited for brevity):

Unlike the UDPServer example two sections ago, you will get this output regardless of whether you are running the listening client in the next section.

#### Multicast Broadcast Listener

The example in this section listens for broadcasts by the server example in the last section. The broadcast server and clients need to agree on the INET address and a port number (9002 in these examples). The client creates an InetAddress in line 15 and allocates storage for the received text messages in line 16. We create a MulticastSocket on port 9002 in line 17 and in line 19 join the broadcast server’s group. In lines 21 through 27 we loop forever, listening for broadcast messages and printing them out. Running this example produces the following output (output edited for brevity):

### Wrap Up on IoT

Even though I personally have some concerns about security and privacy for IoT applications and products, I am enthusiastic about the possibilities of making our lives simpler and more productive with small networked devices. When we work and play we want as much as possible to not think about our computing devices so that we can concentrate on our activities. I have barely scratched the surface of useful IoT technologies in this chapter, but I hope that I have shown you network programming techniques that you can use in your projects (IoT and also networked game programming).
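As one final code sketch for this chapter, the multicast broadcaster and listener described above can also be condensed. This is my own sketch, not the book's listings, using the 224.0.0.5 group address and port 9002 from the text:

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.MulticastSocket;
import java.nio.charset.StandardCharsets;

public class MulticastSketch {
  static final String GROUP = "224.0.0.5";  // local multicast range: 224.0.0.0 - 224.0.0.255
  static final int PORT = 9002;

  // Broadcaster: send a message whether or not anyone is listening.
  public static void broadcast(String message) throws Exception {
    InetAddress group = InetAddress.getByName(GROUP);
    try (DatagramSocket socket = new DatagramSocket()) {
      byte[] data = message.getBytes(StandardCharsets.UTF_8);
      socket.send(new DatagramPacket(data, data.length, group, PORT));
    }
  }

  // Listener: join the group and return the next message that arrives
  // (no delivery guarantee; time out after three seconds).
  public static String receiveOne() throws Exception {
    InetAddress group = InetAddress.getByName(GROUP);
    try (MulticastSocket socket = new MulticastSocket(PORT)) {
      socket.joinGroup(group);
      byte[] buffer = new byte[1024];
      DatagramPacket packet = new DatagramPacket(buffer, buffer.length);
      socket.setSoTimeout(3000);
      socket.receive(packet);
      socket.leaveGroup(group);
      return new String(packet.getData(), 0, packet.getLength(), StandardCharsets.UTF_8);
    }
  }

  public static void main(String[] args) throws Exception {
    for (int i = 0; i < 10; i++) {  // mirror the text: 10 messages, half a second apart
      broadcast("message " + i);
      Thread.sleep(500);
    }
  }
}
```

Run `main` in one shell window and call `receiveOne` in a loop from a second program (or a second machine) to see the broadcasts arrive.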
As I write this book there are companies supplying IoT hardware development kits; deciding on one kit and software framework and creating your own projects is a good way to proceed.

## Natural Language Processing Using OpenNLP

I have worked in the field of Natural Language Processing (NLP) since the early 1980s. Many more people are now interested in the field of NLP and the techniques have changed drastically. NLP has usually been considered part of the field of artificial intelligence (AI), and in the 1980s and 1990s there was more interest in symbolic AI, that is, the manipulation of high level symbols that are meaningful to people because of our knowledge of the real world. As an example, the symbol car means something to us, but to AI software this symbol in itself has no meaning except for possible semantic relationships to other symbols like driver and road. There is still much work in NLP that deals with words as symbols, syntactic and semantic parsers being two examples. What has changed is a strong reliance on statistical and machine learning techniques.

In this chapter we will use the open source (Apache 2 license) OpenNLP project that uses machine learning to create models of language. Currently, OpenNLP has support for Danish, German, English, Spanish, Portuguese, and Swedish. I include in the github repository some trained models for English that are used in the examples in this chapter. You can download models for other languages at the web page for OpenNLP 1.5 models (we are using version 1.6.0 of OpenNLP in this book, which uses the version 1.5 models).

I use OpenNLP for some of the NLP projects that I do for my consulting customers. I have also written my own NLP toolkit KBSPortal.com that is a commercial product. For customers who can use the GPL license I sometimes use the Stanford NLP libraries that, like OpenNLP, are written in Java.
The following figure shows the project for this chapter in the Community Edition of IntelliJ:

We will use pre-trained models for tokenizing text, recognizing the names of organizations, people, and locations, and tagging parts of speech for words in input text. We will also train a new model (the file opennlp/models/en-newscat.bin in the github repository) for recognizing the category of input text. The section on training new maximum entropy classification models using your own training data is probably the material in this chapter that you will use the most in your own projects. We will train one model to recognize the categories of COMPUTERS, ECONOMY, HEALTH, and POLITICS. You should then have the knowledge for training your own models using your own training texts for the categories that you need for your applications.

We will use both some pre-trained models that are included with the OpenNLP distribution and the classification model that we will soon create. The classification model appears again in the next chapter, when we learn how to perform scalable machine learning using the Apache Spark platform and look at techniques for processing very large collections of text to discover information.

After building a classification model we finish up this chapter with an interesting topic: statistically parsing sentences to discover the most probable linguistic structure of each sentence in input text. We will not use parsing in the rest of this book, so you may skip the last section of this chapter if you are not currently interested in parsing sentences into linguistic components like noun and verb phrases, proper nouns, nouns, adjectives, etc.

### Using OpenNLP Pre-Trained Models

Assuming that you have cloned the github repository for this book, you can fetch the maven dependencies, compile the code, and run the unit tests using the command:

The model files, including the categorization model you will learn to build later in this chapter, are found in the subdirectory models.
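As an illustration of loading and using pre-trained models, here is a compact sketch of mine (not a listing from the book's NLP class) showing how the OpenNLP 1.6 tokenizer and sentence splitter are typically loaded and called; it assumes OpenNLP is on the classpath and that the en-token.bin and en-sent.bin model files are in the repository's models subdirectory:

```java
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Arrays;

public class NlpSketch {
  public static void main(String[] args) throws Exception {
    try (InputStream tokStream = new FileInputStream("models/en-token.bin");
         InputStream sentStream = new FileInputStream("models/en-sent.bin")) {
      // Each model file is loaded once; the ...ME classes wrap the trained models.
      Tokenizer tokenizer = new TokenizerME(new TokenizerModel(tokStream));
      SentenceDetectorME splitter = new SentenceDetectorME(new SentenceModel(sentStream));

      String text = "The cat is on the mat. The dog chased the cat.";
      for (String sentence : splitter.sentDetect(text)) {  // split into sentences...
        System.out.println(Arrays.toString(tokenizer.tokenize(sentence)));  // ...then into words
      }
    }
  }
}
```
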
The unit tests in src/test/java/com/markwatson/opennlp/NLPTest.java provide examples for using the code we develop in this chapter. The Java example code for tokenization (splitting text into individual words), splitting sentences, and recognizing organizations, locations, and people in text is all in the Java class NLP. You can look at the source code in the repository for this book. Here I will just show a few snippets of the code to make clear how to load and use pre-trained models. I use static class initialization to load the model files:

The first operation that you will usually start with for processing natural language text is breaking input text into individual words and sentences. Here is the code for using the tokenizing code that separates text stored as a Java String into individual words:

Here is the similar code for breaking text into individual sentences:

Here is some sample code to use sentenceSplitter:

In line 4 the static method NLP.sentenceSplitter returns an array of strings. In line 5 I use a common Java idiom for printing arrays by using the static method Arrays.asList to convert the array of strings into a List<String> object. The trick is that the List class has a toString method that formats the list nicely for printing. Here is the output of this code snippet (edited for page width and clarity):

The code for finding organizations, locations, and people’s names is almost identical, so I will only show the code in the next listing for recognizing locations. Please look at the methods companyNames and personNames in the class com.markwatson.opennlp.NLP to see the implementations for finding the names of companies and people. The public methods in the class com.markwatson.opennlp.NLP are overloaded to take either a single string value, which gets tokenized inside of the method, or input text that has already been tokenized into a String tokens[] object.
In the last example the method starting on line 1 accepts an input string and the overloaded method starting on line 5 accepts an array of strings. Often you will want to tokenize text stored in a single input string into tokens and reuse the tokens for calling several of the public methods in com.markwatson.opennlp.NLP that can take input that is already tokenized. In line 2 we simply tokenize the input text and call the method that accepts tokenized input text. In line 6 we create a HashSet<String> object that will hold the return value of a set of location names. The NameFinderME object locationFinder returns an array of Span objects. The Span class is used to represent a sequence of adjacent words. The Span class has a public static attribute length and instance methods getStart and getEnd that return the indices of the beginning and ending (plus one) of a span in the original input text. Here is some sample code to use locationNames along with the output (edited for page width and clarity):

Note that the pre-trained model does not recognize when city and state names are associated.

### Training a New Categorization Model for OpenNLP

The OpenNLP class DoccatTrainer can process specially formatted input text files and produce categorization models using maximum entropy, which is a technique that handles data with many features. Features that are automatically extracted from text and used in a model are things like the words in a document and word adjacency. Maximum entropy models can recognize multiple classes. In testing a model on new text data, the probabilities of all possible classes add up to the value 1 (this is often referred to as “softmax”).
For example, we will be training a classifier on four categories, and the probabilities of these categories for some test input text add up to the value of one:

The format of the input file for training a maximum entropy classifier is simple but has to be correct: each line starts with a category name, followed by sample text for that category, which must all be on one line. Please note that I have already trained the model, producing the model file models/en-newscat.bin, so you don’t need to run the example in this section unless you want to regenerate this model file.

The file sample_category_training_text.txt contains four lines, defining four categories. Here are two lines from this file (I edited the following to look better on the printed page, but these are just two lines in the file):

Here is one training example each for the categories COMPUTERS and ECONOMY. You must format the training file perfectly. As an example, if you have empty (or blank) lines in your input training file then you will get an error like:

The OpenNLP documentation has examples for writing custom Java code to build models but I usually just use the command line tool; for example:

The model is written to the relative file path models/en-newscat.bin. The training file I am using is tiny so the model is trained in a few seconds. For serious applications, the more training text the better!

By default the DoccatTrainer tool uses the default text feature generator, which uses word frequencies in documents but ignores word ordering. As I mention in the next section, I sometimes like to mix word frequency feature generation with 2-gram feature generation (that is, frequencies of two adjacent words). In this case you cannot simply use the DoccatTrainer command line tool.
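As an aside, for the default feature generator a typical DoccatTrainer command line invocation, per the OpenNLP 1.6 CLI documentation, looks roughly like this (the flag values are my assumptions based on the file names used in this chapter):

```shell
bin/opennlp DoccatTrainer -model models/en-newscat.bin -lang en \
    -data sample_category_training_text.txt -encoding UTF-8
```
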
You need to write a little Java code yourself so that you can plug another feature generator in using the alternative API:

As I also mention in the next section, the last argument would look like:

For most purposes the default word frequency (or bag of words) feature generator is probably OK, so using the command line tool is a good place to start. Here is the output from running the DoccatTrainer command line tool:

We will use our newly trained model file en-newscat.bin in the next section. Please note that in this simple example I used very little data, just a few hundred words for each training category. I have used the OpenNLP maximum entropy library on various projects, mostly to good effect, but I used many thousands of words for each category. The more data, the better.

### Using Our New Trained Classification Model

The code that uses the model we trained in the last section is short enough to list in its entirety:

In lines 33 through 42 we initialize the static data for an instance of the class DoccatModel that loads the model file created in the last section. A new instance of the class DocumentCategorizerME is created in line 28 each time we want to classify input text. I called the one argument constructor for this class that uses the default feature detector. An alternative constructor is:

The default feature generator is BagOfWordsFeatureGenerator, which just uses word frequencies for classification. This is reasonable for smaller training sets, as we used in the last section, but when I have a large amount of training data available I prefer to combine BagOfWordsFeatureGenerator with NGramFeatureGenerator. You would use the constructor call:

The following listings show, interspersed, example code snippets for using the NewsClassifier class followed by the output printed by each code snippet:

### Using the OpenNLP Parsing Model

We will use the parsing model that is included in the OpenNLP distribution to parse English language input text.
You are unlikely to use a statistical parsing model in your work, but I think you will enjoy the material in this section. If you are more interested in practical techniques then skip to the next chapter, which covers more machine learning techniques. The following example code listing is long (68 lines) but I will explain the interesting parts after the listing:

The OpenNLP parsing model is read from a file in lines 45 through 53. The static variable parserModel (an instance of class ParserModel) is created in line 49 and used in lines 17 and 18 to parse input text. It is instructive to look at the intermediate calculation results. The variable parser defined in line 17 has a value of:

Note that the parser returned 5 results because we specified this number in line 18. For a long sentence the parser generates a very large number of possible parses and returns, in order of probability of being correct, the number of results we requested.

The OpenNLP chunking parser code prints out results in a flat list, one result on a line. This is difficult to read, which is why I wrote the method prettyPrint (lines 21 through 41) to print the parse results indented. Here is the output from the last code example (the first parse shown is all on one line but line wraps in the following listing):

In the 1980s I spent much time on syntax level parsing. While the example in this section is a statistical parsing model, I don’t find these models very relevant to my own work, but I wanted to include a probabilistic parsing example for completeness in this chapter.

OpenNLP is a great resource for Java programmers and its Apache 2 license is “business friendly.” If you can use software with a GPL license then please also look at the Stanford NLP libraries.

## Machine Learning Using Apache Spark

I have used Java for Machine Learning (ML) for nearly twenty years using my own code and also great open source projects like Weka, Mahout, and Mallet.
For the material in this chapter I will use a relatively new open source machine learning library, MLlib, that is included in the Apache Spark project. Spark is similar to Hadoop in providing functionality for using multiple servers to process massively large data sets. Spark preferentially deals with data in memory and can provide near real time analysis of large data sets. We saw a specific use case of machine learning in the previous chapter on OpenNLP: building maximum entropy classification models. This chapter will introduce another good ML library.

Machine learning is part of the fields of data science and artificial intelligence. There are several steps involved in building applications with ML:

• Understand what problems your organization has and what data is available to solve these problems. In other words, figure out where you should spend your effort. You should also understand early in the process how you will evaluate the usefulness and quality of the results.

• Collecting data. This step might involve collecting numeric time series data from instruments, for example, to train regression models, or collecting text for natural language processing ML projects as we saw in the last chapter.

• Cleaning data. This might involve detecting and removing errors in data from faulty sensors or incomplete collection of text data from the web. In some cases data might be annotated (or labelled) in this step.

• Integrating data from different sources and organizing it for experiments. Data that has been collected and cleaned might end up being used for multiple projects so it makes sense to organize it for ease of use. This might involve storing data on disk or cloud storage like S3 with metadata stored in a database. Metadata could include date of collection, data source, steps taken to clean the data, and references to models already built using this data.
• The main topic of this chapter: analyzing data and creating models with machine learning (ML). This will often be experimental, with many different machine learning algorithms used for quick and dirty experiments. These fast experiments should tell you which approach works best for this data set. Once you decide which ML algorithm to use, you can spend the time to refine the features in the data used by the learning algorithms and to tune model parameters. It is usually best, however, to first quickly try a few approaches before settling on one and then investing a lot of time in building a model.

• Interpreting and using the results of models. Often models will be used to process other data sets to automatically label data (as we do in this chapter with text data from Wikipedia) or to make predictions from new sources of time series numerical data.

• Deploying trained models as embedded parts of a larger system. This is an optional step - sometimes it is sufficient to use a model to understand a problem well enough to know how to solve it. Usually, however, a model will be used repeatedly, often automatically, as part of normal data processing and decision workflows. I will take some care in this chapter to indicate how to reuse models embedded in your Java applications.

In the examples in this book I have supplied test data with the example programs. Please keep in mind that although I have listed these steps for data mining and ML model building as a sequence, in practice this is a very iterative process. Data collection and cleaning might be reworked after seeing preliminary results of building models using different algorithms. We might iteratively tune model parameters and also improve models by adding or removing the features in the data that are used to build models. What do I mean by “features” of data?
For text data, features can be: words in the text, collections of adjacent words (ngrams), structure of the text (e.g., taking advantage of HTML or other markup), punctuation, etc. For numeric data, features might be the minimum and maximum of data points in a sequence, or features in frequency space (calculated by Fast Fourier Transforms (FFTs), sometimes as a Power Spectral Density (PSD), which is an FFT with the phase information in the data discarded), etc.

Cleaning data means different things for different types of data. In cleaning text data we might segment text into sentences, perform word stemming (i.e., map words like “banking” or “banks” to their stem or root “bank”), and sometimes convert all text to lower case. For numerical data we might have to scale data attributes to a specific numeric range - we will do this later in this chapter when using the University of Wisconsin cancer data sets.

The previous seven steps that I outlined are my view of the process of discovering information sources, processing the data, and producing actionable knowledge from data. Other frameworks that you are likely to run across are Knowledge Discovery in Data (KDD) and the CCC Big Data Pipeline, which both use steps similar to those that I have just outlined.

There are, broadly speaking, four types of ML models:

• Regression - predicts a numerical value. Inputs are numeric values.

• Classification - predicts a discrete category (in the simplest case, yes or no). Inputs are numeric values.

• Clustering - partitions a set of data items into disjoint sets. Data items might be text documents, numerical data sets from sensor readings, or sets of structured data that include both text and numbers. Inputs must be converted to numeric values.

• Recommendation - decides whether to recommend an object based on a user’s previous actions and/or by matching a user to similar users and using those users’ actions.

You will see later when dealing with text that we need to convert words to numbers (each word is assigned a unique numeric ID).
A document is represented by a sparse vector that has a non-zero element for each unique word index. Further, we can divide machine learning problems into two general types:

• Supervised learning - input features used for training are labeled with the desired prediction. The example of supervised learning we will look at is building a regression model using cancer symptom data to predict malignant vs. benign for a new input feature set.

• Unsupervised learning - input features are not labeled. The examples of unsupervised learning that we will look at are: using K-Means to cluster similar Wikipedia documents and using word2vec to find related words in documents.

My goal in this chapter is to introduce you to practical techniques using Spark and machine learning algorithms. After you have worked through the material I recommend reading through the Spark documentation to see the wider variety of available models. Also consider taking Andrew Ng’s excellent Machine Learning class.

### Scaling Machine Learning

While the ability to scale data mining and ML operations to very large data sets is sometimes important, for development and for working through the examples in this chapter you do not need to set up a cluster of servers to run Spark. We will use Spark in developer mode and you can run the examples on your laptop. It would be best if your laptop has at least 2 gigabytes of RAM to run the ML examples. The important thing to remember is that the techniques you learn in this chapter can be scaled to very large data sets as needed without modifying your application source code.

The Apache Spark project originated at the University of California at Berkeley. Spark has client support for Java, Scala, and Python. The examples in this chapter will all be in Java. If you are a Scala developer, after working through the examples you might want to take a look at the Scala Spark documentation since Spark is written in Scala and Scala has the best client support.
The Python Spark and MLlib APIs are also very nice to use, so if you are a Python developer you should check out the Python support. Spark is an evolutionary step forward from map reduce systems like Hadoop or Google’s Map Reduce. In working through the examples in this chapter you will likely think that you could accomplish the same effect with less code and runtime overhead. For small amounts of data that is certainly true but, just as with Hadoop, a little more verbosity in the code gives you the ability to scale your applications over large data sets. We put up with extra complexity for scalability. Spark has utility classes for reading and transforming data that make the example programs simpler than writing the data import code yourself.

I suggest that you download the source code for the Spark 2.0.0 distribution from the Apache Spark download page since there are many example programs that you can use for ideas and for reference. The Java client examples for the Spark machine learning library MLlib can be found in the directory:

If you download the binary distribution, then the examples are in:

where the directory spark-2.0.0 was created when you downloaded the Spark source code distribution version 2.0.0. These examples use provided data files containing numerical data; each line specifies a local Spark vector data type. The first K-Means clustering example in this chapter is derived from an MLlib example included with the Spark distribution. The second K-Means example, which clusters Wikipedia articles, is quite a bit different because we need to write new code to convert text into numeric feature vectors. I provide sample text that I manually selected from Wikipedia articles.

We will start this chapter by showing you how to set up Spark and MLlib on your laptop and then look at a simple “Hello World” introductory program for generating word counts from input text.
We will then implement logistic regression and also take a brief look at the JavaKMeans example program from the Spark MLlib examples, using it to understand the requirement of reducing data to a set of numeric feature vectors. We will then develop examples using Wikipedia data, which will require writing custom converters from plain text to numeric feature vectors.

### Setting Up Spark On Your Laptop

Assuming that you have cloned the github project for this book, the Spark examples are in the sub-directory machine_learning_spark. The maven pom.xml file contains the Spark dependencies. You can install these dependencies using:

There are four examples in this chapter. If you want to skip ahead and run the code before reading the rest of the chapter, run each maven test separately:

Spark has all of Hadoop as a dependency because Spark is able to interact with Hadoop and Hadoop file systems. It will take a short while to download all of these dependencies but that only needs to be done one time. To augment the material in this chapter you can refer to the Spark Programming Guide. We will also be using material in the Spark Machine Learning Library (MLlib) Guide. The following figure shows the project for this chapter in the Community Edition of IntelliJ:

### Hello Spark - a Word Count Example

The word count example is frequently used to introduce map reduce, and I will also use it here as a “hello world” style Spark example. In the following listing, lines 3 through 6 import the required classes. In lines 3 and 4, “RDD” stands for Resilient Distributed Dataset. Spark was written in Scala, so we will sometimes see Java wrapper classes that are recognizable by the prefix “Java.” Each RDD is a collection of objects of some type that can be distributed across multiple servers in a Spark cluster. RDDs are fault tolerant: there is sufficient information in a Spark cluster to recreate any lost data after a server or software failure.
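The book's listing uses Spark RDD operations; to make the map and reduce steps concrete without the Spark dependency, here is a hedged plain Java 8 sketch of the same counting logic (the class and method names are my own, not from the book's code):

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class WordCount {
    // Tokenize on whitespace, stripping simple punctuation: a crude
    // stand-in for the tokenizer method discussed in the text.
    static String[] tokenize(String line) {
        return line.toLowerCase().replaceAll("[^a-z\\s]", " ").trim().split("\\s+");
    }

    // The "map" step emits each token; the "reduceByKey" step corresponds
    // to the grouping and counting done here by Collectors.groupingBy.
    static Map<String, Long> count(String text) {
        return Arrays.stream(tokenize(text))
                .collect(Collectors.groupingBy(Function.identity(),
                                               Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(count("The dog saw the cat."));
    }
}
```

With Spark, the same two steps become distributed operations over a JavaRDD, but the shape of the computation is identical.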
Lines 14 through 17 define the simple tokenizer method that takes English text and maps it to a list of strings. I broke this out into a separate method as a reminder that punctuation needs to be handled, and a separate method makes it easier for you to experiment with the custom tokenization that might be needed for your particular text input files. One possibility is using the OpenNLP tokenizer model that we used in the last chapter. For example, we would want the string “tree.” to be tokenized as [“tree”, “.”] and not [“tree.”]. This method is used in line 23 and is called for each line in the input text file.

In this example, in line 22, we are reading a local text file “data/test1.txt”, but in general the argument to JavaSparkContext.textFile() can be a URI specifying a file on a Hadoop file system (HDFS), an Amazon S3 object, or a Cassandra file system. Pay some attention to the type of the variable lines defined in line 22: the type JavaRDD&lt;String&gt; specifies an RDD collection where the distributed objects are strings.

The code in lines 29 through 32 is really not very complicated if we parse it out a bit at a time. The type of counts is an RDD distributed collection where each object in the collection is a pair of string and integer values. The code in lines 25 through 27 takes each token in the original text file and produces a pair (or Tuple2) consisting of the token string and the integer 1. As an example, an array of three tokens like:

would be converted to an array of three pairs like:

The statement in line 28 uses the method reduceByKey, which partitions the array of pairs by unique key value (the key being the first element in each pair) and sums the values for each matching key. The rest of this example shows two ways to extract data from a distributed RDD collection back to a Java process as local data.
In line 29 we are pulling the collection of word counts into a Java Map&lt;String,Integer&gt; collection, and in line 31 we are doing the same thing but pulling the data into a list of Scala tuples. Here is what the output of the program looks like:

### Introducing the Spark MLlib Machine Learning Library

As I update this chapter in the spring of 2016, the Spark MLlib library is at version 2.0. I will update the git repository for this book periodically and I will try to keep the code examples compatible with the current stable versions of Spark and MLlib. The examples in the following sections show how to apply the MLlib library to problems of logistic regression, using the K-Means algorithm to cluster similar documents, and using the word2vec algorithm to find strong associations between words in documents.

### MLlib Logistic Regression Example Using University of Wisconsin Cancer Database

We will use supervised learning in this section to build a regression model that maps a numeric feature vector of symptoms to a numeric value, where values close to one indicate malignancy and values close to zero indicate a benign growth. MLlib supports three types of linear models: Support Vector Machine (SVM), logistic regression (models for membership in one or more classes or categories, with outputs as floating point numbers in the range [0..1]), and linear regression (models that predict a continuous numeric output). The Spark MLlib documentation (http://spark.apache.org/docs/latest/mllib-linear-methods.html) on these three linear models is very good, and their sample programs are similar, so it is easy enough to switch the type of linear model that you use. We will use logistic regression for the example in this section. Andrew Ng, in his excellent Machine Learning class, said that the first model he tries when starting a new problem is a linear model. We will repeat this example in the next section using SVM instead of logistic regression.
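To make the shape of a logistic model concrete, here is a hedged plain Java sketch (my own illustration, not the book's MLlib code; the weights are made-up values, not output from the book's training run). A trained logistic model is just a weight vector, and a prediction is the sigmoid of the dot product of the weights and the input features:

```java
public class LogisticModel {
    // Sigmoid squashes any real value into the open interval (0, 1).
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Prediction: sigmoid of the dot product of weights and features.
    // Values near 1 suggest the positive class (malignant in this
    // chapter's example), values near 0 the negative class (benign).
    static double predict(double[] weights, double[] features) {
        double z = 0.0;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * features[i];
        }
        return sigmoid(z);
    }

    public static void main(String[] args) {
        double[] weights = {4.0, -2.5, 3.0};   // hypothetical trained weights
        double[] features = {0.9, 0.1, 0.8};   // features already scaled to [0..1]
        System.out.println(predict(weights, features));
    }
}
```

MLlib's LogisticRegression classes learn the weight vector from the LabeledPoint training data; the prediction step it performs is essentially the one sketched here.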
If you read my book Build Intelligent Systems with JavaScript then the training and test data set used in this section will look familiar: the University of Wisconsin breast cancer data set. From the University of Wisconsin web page, here are the attributes with their allowed ranges in this data set:

The file data/university_of_wisconson_data_.txt contains the raw data. We will use this sample data set with two changes: we remove the sample code column, and the remaining nine attribute values are scaled to the range [0..1] while the class is mapped to 0 (benign) or 1 (malignant). So our input features are scaled numerical values in the range [0..1] indicating clump thickness, uniformity of cell size, etc. The label that specifies the desired prediction is the class (2 for benign, 4 for malignant in the raw data, which we map to 0 or 1).

In the following listing, data is read from the training file and scaled in lines 27 through 44. Here I am using some utility methods provided by MLlib. In line 28 I am using the method JavaSparkContext.textFile to read the text file into what is effectively a list of strings (one string per input line). Lines 29 through 44 create a list of instances of the class LabeledPoint, which contains a training label with its associated numeric feature data. The container class JavaRDD acts as a list of elements that can be automatically managed across a Spark server cluster. When in development mode with a local JavaSparkContext everything runs in a single JVM, so an instance of class JavaRDD behaves a lot like a simple in-memory list.

Once the data is read and converted to numeric features, the next thing we do is randomly split the data in the input file into two separate sets: one for training and one for testing our model after training. This data splitting is done in lines 47 and 48. Line 49 copies the training data and asks the current Spark context to try to keep this data cached in memory.
In line 50 we copy the testing data but don’t specify that the test data needs to be cached in memory. There are two sections of code left to discuss. In lines 53 through 75 I am training the model, saving the model to disk, and testing the model while it is still in memory. In lines 77 through 89 I am reloading the saved model from disk and showing how to use it with two new numeric test vectors. This last bit of code is meant as a demonstration of embedding a model in Java code and using it.

The following three lines are the output of this example program. The first line of output indicates that the error for training the model is low. The last two lines show the output from reloading the model from disk and using it to evaluate two new numeric feature vectors:

This example shows a common pattern: reading training data from disk, scaling it, and building a model that is saved to disk for later use. This program can almost be reused as-is for new types of training data: you should only have to modify lines 27 through 44.

### MLlib SVM Classification Example Using University of Wisconsin Cancer Database

The example in this section is very similar to that in the last section. The only differences are using the class SvmClassifier instead of the class LogisticRegression that we used in the last section and setting the number of training iterations to 500. We are using the same University of Wisconsin cancer data set in this section - refer to the previous section for a discussion of this data set. Since the code for this section is very similar to the last example I will not list it. Please refer to the source file SvmClassifier.java and the accompanying test file in the github repository for this book.
In the repository subdirectory machine_learning_spark you will need to remove the model files generated when running the example in the previous section before running this example:

The output for this example will look like:

Note that the test data mean square error is larger than the value of 0.050140726106582885 that we obtained using logistic regression in the last section.

### MLlib K-Means Example Program

We will use unsupervised learning in this section, specifically the K-Means clustering algorithm. This clustering process partitions inputs into groups that share common features. Later in this chapter we will look at an example where the input will be some randomly chosen Wikipedia articles. K-Means analysis is useful for many types of data where you might want to cluster members of some set into different sub-groups. As an example, you might be a biologist studying fish. For each species of fish you might have attributes like length, weight, location, fresh/salt water, etc. You can use K-Means to divide fish species into similar sub-groups. If your job deals with marketing to people on social media, you might want to cluster people (based on what attributes you have for users of social media) into sub-groups. If you have sparse data on some users, you might be able to infer missing data from other similar users.

The simple example in this section uses numeric feature vectors. Keep in mind, however, our goal of processing Wikipedia text in a later section. In the next section we will see how to convert text to feature vectors, and finally, in the section after that, we will cluster Wikipedia articles. Implementing K-Means is simple, but using the MLlib implementation has the advantage of being able to process very large data sets. The first step in calculating K-Means is choosing the number of desired clusters NC. Then NC cluster centers are chosen randomly.
We then perform an inner loop until either the cluster centers stop changing or a specified number of iterations is done:

• For each data point, assign it to the closest cluster center.

• For each cluster center, move the center to the average location of the data points assigned to that cluster.

In practice you will repeat this clustering process many times and use the cluster centers with the minimum distortion. The distortion is the sum of the squares of the distances from each data point in a cluster to the final cluster center. So, before we look at the Wikipedia example (clustering Wikipedia articles) later in this chapter, I want to first review the JavaKMeansExample.java example from the Spark MLlib examples directory to see how to perform K-Means clustering on sample data (from the Spark source code distribution):

This is a very simple example that already uses numeric data. I copied the JavaKMeansExample.java file (changing the package name) to the github repository for this book. The most important thing you need to learn in this section is the required type of the input data: vectors of double precision floating point numbers. Each sample of data needs to be converted to a numeric feature vector. For some applications this is easy. As a hypothetical example, you might have a set of features that measure weather in your town: low temperature, high temperature, average temperature, percentage of sunshine during the day, and wind speed. All of these features are numeric and can be used as a feature vector as-is. You might have 365 samples for a given year and would like to cluster the data to see which days are similar to each other. What if in this hypothetical example one of the features was a string representing the month? One obvious way to convert the data would be to map “January” to 1, “February” to 2, etc. We will see in the next section when we cluster Wikipedia articles that converting data to a numeric feature vector is not always so easy.
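The two-step loop described above can be sketched in plain Java (a hedged illustration of the algorithm, not the MLlib implementation; for simplicity it seeds the centers with the first k points instead of random points):

```java
import java.util.Arrays;

public class SimpleKMeans {
    // Squared Euclidean distance between two feature vectors.
    static double dist2(double[] a, double[] b) {
        double d = 0.0;
        for (int i = 0; i < a.length; i++) d += (a[i] - b[i]) * (a[i] - b[i]);
        return d;
    }

    // Returns the cluster index assigned to each data point.
    static int[] cluster(double[][] points, int k, int maxIterations) {
        int dim = points[0].length;
        double[][] centers = new double[k][];
        for (int i = 0; i < k; i++)            // deterministic seeding:
            centers[i] = points[i].clone();    // use the first k points
        int[] assignment = new int[points.length];
        for (int iter = 0; iter < maxIterations; iter++) {
            // Step 1: assign each point to its closest center.
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (dist2(points[p], centers[c]) < dist2(points[p], centers[best]))
                        best = c;
                assignment[p] = best;
            }
            // Step 2: move each center to the mean of its assigned points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < dim; d++) sums[assignment[p]][d] += points[p][d];
            }
            for (int c = 0; c < k; c++)
                if (counts[c] > 0)
                    for (int d = 0; d < dim; d++) centers[c][d] = sums[c][d] / counts[c];
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] data = {{0.0, 0.0}, {0.1, 0.1}, {0.2, 0.2},
                           {9.0, 9.0}, {9.1, 9.1}, {9.2, 9.2}};
        System.out.println(Arrays.toString(cluster(data, 2, 10)));
    }
}
```

On this toy data the first three points end up in one cluster and the last three in another. The MLlib version performs the same loop, but distributed over an RDD of vectors so it can scale to very large data sets.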
We will use the tiny sample data set provided with the MLlib KMeans example:

Here we have 6 samples and each sample has 3 features. From inspection, it looks like there are two clusters in the data: the first three rows are in one cluster and the last three rows are in a second cluster. The order in which feature vectors are processed is not important; the samples in this file are simply organized so you can easily see the two clusters. When we run this sample program asking for two clusters we get:

We will go through the code in the sample program since it follows the same general pattern we use throughout the rest of this chapter. Here is a slightly modified version of JavaKMeans.java with a few small changes from the version in the MLlib examples directory:

The inner class ParsePoint is used to generate a single numeric feature vector from the input data. In this simple case the input data is a string like “9.1 9.1 9.1” that needs to be tokenized, with each token converted to a double precision floating point number. The corresponding inner class in the Wikipedia example will be much more complex. In line 24 we are creating a Spark execution context. In all of the examples in this chapter we use a “local” context, which means that Spark will run inside the same JVM as our example code. In production the Spark context could access a Spark cluster utilizing multiple servers. Lines 25 and 27 introduce the use of the class JavaRDD, the Java wrapper for the Scala Resilient Distributed Dataset class used internally to implement Spark. This data class can live inside a single JVM when we use a local Spark context or be spread over many servers when using a Spark server cluster. In line 25 we are reading the input strings and in line 27 we are using the inner class ParsePoint to convert the input text lines to numeric feature vectors. Lines 29 and 30 use the static method KMeans.train to create a trained clustering model.
Lines 33 through 35 print out the clusters (a listing of the output was shown earlier in this section). This simple example, along with the “Hello Spark” example in the last section, has shown you how to run Spark on your laptop in developer mode. In the next section we will see how to convert input text into numeric feature vectors.

### Converting Text to Numeric Feature Vectors

In this section I develop a utility class TextToSparseVector that converts a set of text documents into sparse numeric feature vectors. Spark provides dense vector classes (like an array) and sparse vector classes that contain index and value pairs. For the text feature vectors in this example we will have a unique feature ID for each unique word stem (or word root) across all of the combined input documents. We start by reading in all of the input text and forming a Map where the keys are word stems and the values are unique IDs. Any given input document is then represented by a sparse vector: for each word, the word’s stem is mapped to its unique ID and the vector element at that integer ID is set. A short document with 100 unique words would have 100 elements of the sparse vector set.

Listing of the class TextToSparseVector:

The public method String[] bestWords(double[] cluster) is used for display purposes: it will be interesting to see the words in a document that provided the most evidence for its inclusion in a cluster. This Java class will be used in the next section to cluster similar Wikipedia articles.

### Using K-Means to Cluster Wikipedia Articles

The Wikipedia article training files contain one article per line, with the article title appearing at the beginning of the line. I have already performed some data cleaning: capturing HTML, stripping HTML tags to yield plain text for the articles, and then organizing the data one article per line in the input text file. I am providing two input files: one containing 41 articles and one containing 2001 articles.
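The stem-to-ID mapping described in the previous section can be sketched in plain Java. This is a hedged stand-in for the book's TextToSparseVector class: the toy suffix-stripping stemmer and all names here are my own, and the real class uses proper stemming and a MAX_WORDS limit:

```java
import java.util.HashMap;
import java.util.Map;

public class SparseTextVectors {
    // Toy stemmer that strips a few common suffixes. The real class
    // uses genuine word stemming; this is only an illustration.
    static String stem(String word) {
        String w = word.toLowerCase();
        if (w.endsWith("ing")) return w.substring(0, w.length() - 3);
        if (w.endsWith("s") && w.length() > 3) return w.substring(0, w.length() - 1);
        return w;
    }

    // First pass over all documents: assign each unique stem an integer ID.
    static Map<String, Integer> buildVocabulary(String[] documents) {
        Map<String, Integer> ids = new HashMap<>();
        for (String doc : documents)
            for (String token : doc.split("\\s+"))
                ids.putIfAbsent(stem(token), ids.size());
        return ids;
    }

    // One document -> sparse vector, represented here as a map from
    // stem ID to the number of times that stem occurs in the document.
    static Map<Integer, Double> toSparseVector(String doc, Map<String, Integer> ids) {
        Map<Integer, Double> vector = new HashMap<>();
        for (String token : doc.split("\\s+"))
            vector.merge(ids.get(stem(token)), 1.0, Double::sum);
        return vector;
    }

    public static void main(String[] args) {
        String[] docs = {"banks banking bank", "fishing boats"};
        Map<String, Integer> ids = buildVocabulary(docs);
        System.out.println(toSparseVector(docs[0], ids));
    }
}
```

In the Spark code the index/value pairs would be handed to a Spark sparse vector constructor, but the underlying mapping is the same.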
One challenge we face is converting the text of an article into a numeric feature vector. This was easy to do in an earlier section using cancer data already in numeric form; I just had to discard one data attribute and scale the remaining data attributes. In addition to converting input data into numeric feature vectors we also need to decide how many clusters we want the K-Means algorithm to partition our data into.

Given the class TextToSparseVector developed in the last section, the code to cluster Wikipedia articles is fairly straightforward. The class WikipediaKMeans, shown in the following listing, is fairly simple because most of the work is done converting text to feature vectors using the class TextToSparseVector that we saw in the last section. The Spark setup code is also similar to that seen in our tiny HelloSpark example. The parsePoint method defined in lines 33 to 40 uses the method tokensToSparseVector (defined in class TextToSparseVector) to convert the text in one line of the input file (remember that each line contains the text for an entire Wikipedia article) to a sparse feature vector.

There are two important parameters that you will want to experiment with. These parameters are defined on lines 44 and 45 and set the desired number of clusters and the number of iterations you want to run to form clusters. I suggest initially leaving iterations set to 100, as it is in the code, and then varying number_of_clusters. If you are using this code in one of your own projects with a different data set then you might also want to experiment with running more iterations. If you cluster a very large number of text documents then you may have to increase the constant value MAX_WORDS set in line 9 of the class TextToSparseVector. In line 48 we are creating a new Spark context that is local and will run in the same JVM as your application code.
Lines 49 and 50 read in the text data from Wikipedia and create a JavaRDD of vectors that is passed to the K-Means clustering code called in lines 51 to 53. Here are a few cluster indices for the 2001 input Wikipedia articles, printed out by cluster index. When you run the sample program, the list of documents assigned to eight different clusters is a lot of output; in the following I am listing just a small bit of the output for the first four clusters.

If you look carefully at the generated clusters you will notice that one cluster seems not to be so useful since there are really two or three topics represented in it. This suggests re-running the clustering with a larger number of requested clusters (i.e., increasing the value of number_of_clusters). On the other hand, Cluster 1 seems very good: almost all of its articles deal with sports. This example program can be reused in your applications with few changes. You will probably want to manually assign meaningful labels to each cluster index and store the label with each document. For example, the cluster at index 1 would probably be labeled as “sports.”

### Using SVM for Text Classification

Support Vector Machines (SVM) are a popular set of algorithms for learning models that classify data sets. The Spark machine learning library provides APIs for SVM models for text classification:

This code calculates prediction values less than zero for the EDUCATION class and greater than zero for the HEALTH class:

### Using word2vec To Find Similar Words In Documents

Google released their word2vec library as Apache 2 licensed open source. The Deeplearning4j project and Spark’s MLlib both contain implementations. As I write this chapter the Deeplearning4j version has more flexibility but we will use the MLlib version here. We cover Deeplearning4j in a later chapter. We will use the sample data file from the Deeplearning4j project in this section.
The word2vec library can identify which other words in a text are strongly related to any given word. If you run word2vec on a large sample of English language text you will find that the word woman is closely associated with the word man, the word child with the word mother, the word road with the word car, etc. The following listing shows the class Word2VecRelatedWords, which is fairly simple because all we need to do is tokenize text, discard noise (or stop) words, and convert the input text to the correct Spark data types for processing by the MLlib class Word2VecModel. The output from this example program is:

While word2vec results are just the result of statistical processing, and the code does not understand the semantics of the English language, the results are still impressive.

### Chapter Wrap Up

In the previous chapter we used machine learning for natural language processing using the OpenNLP project. In the examples in this chapter we used the Spark MLlib library for logistic regression, document clustering, and determining which words are closely associated with each other. You can go far using OpenNLP and Spark MLlib in your projects. There are a few things to keep in mind, however. One of the most difficult aspects of machine learning is finding and preparing training data. This process will be very dependent on what kind of project you are working on and what sources of data you have.

One of the most powerful machine learning techniques is artificial neural networks. I have a long history of using neural networks (I was on a DARPA neural network advisory panel and wrote the first version of the commercial ANSim neural network product in the late 1980s). I decided not to cover simple neural network models in this book because I have already written about neural networks in one chapter of my book Practical Artificial Intelligence Programming in Java.
“Deep learning” neural networks are also very effective (if difficult to use) for some types of machine learning problems (e.g., image recognition and speech recognition) and we will use the Deeplearning4j project in a later chapter. In the next chapter we cover another type of machine learning: anomaly detection.

## Anomaly Detection Machine Learning Example

Anomaly detection models are used in one very specific class of use cases: when you have many negative (non-anomaly) examples and relatively few positive (anomaly) examples. For training we will ignore positive examples, create a model of “how things should be”, and hopefully be able to detect anomalies that differ from the original negative examples. If you have a large training set of both negative and positive examples then do not use anomaly detection models.

### Motivation for Anomaly Detection

There are two other examples in this book using the University of Wisconsin cancer data. Those other examples use supervised learning. Anomaly detection as we do it in this chapter is, more or less, unsupervised learning.

When should we use anomaly detection? You should use supervised learning algorithms like neural networks and logistic classification when there are roughly equal numbers of available negative and positive examples in the training data. The University of Wisconsin cancer data set is fairly evenly split between negative and positive examples. Anomaly detection should be used when you have many negative (“normal”) examples and relatively few positive (“anomaly”) examples. For the example in this chapter we will simulate scarcity of positive (“anomaly”) results by preparing the data from the Wisconsin cancer data as follows:

• We will split the data into training (60%), cross validation (20%) and testing (20%) sets.
• For the training data, we will discard all but two positive (“anomaly”) examples.
We do this to simulate the real world case where some positive examples are likely to end up in the training data even though we would prefer the training data to contain only negative (“normal”) examples.

• We will use the cross validation data to find a good value for the epsilon meta parameter.
• After we find a good epsilon value, we will calculate the F1 measurement for the model.

### Math Primer for Anomaly Detection

We are trying to model “normal” behavior and we do this by fitting a Gaussian (bell curve) distribution to each feature. The learned parameters for a Gaussian distribution are the mean of the data (where the bell shaped curve is centered) and the variance. You might be more familiar with the term standard deviation, $$\sigma$$. Variance is defined as $$\sigma^2$$.

We will need to calculate the probability of a value x given the mean and variance of a probability distribution: $$P(x; \mu, \sigma^2)$$ where $$\mu$$ is the mean and $$\sigma^2$$ is the variance:

$$P(x; \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-(x - \mu)^2 / (2\sigma^2)}$$

Where the $$x_i$$ are the samples, we can calculate the variance as:

$$\sigma^2 = \frac{\displaystyle\sum_{i=1}^{m}(x_i - \mu)^2}{m}$$

We calculate the parameters $$\mu$$ and $$\sigma^2$$ for each feature. A bell shaped distribution in two dimensions is easy to visualize, as is an inverted bowl shape in three dimensions. What if we have many features? The math still works, so don’t worry about not being able to picture it in your mind.

### AnomalyDetection Utility Class

The class AnomalyDetection developed in this section is fairly general purpose. It processes a set of training examples and for each feature calculates $$\mu$$ and $$\sigma^2$$.
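The per-feature calculations just described are straightforward to express directly in Java. The following is a minimal sketch of the math only (the class and method names are mine, not those of the AnomalyDetection class):

```java
public class GaussianStats {

    // Mean of the samples.
    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    // Variance: the average squared distance from the mean.
    static double variance(double[] xs, double mu) {
        double sum = 0.0;
        for (double x : xs) sum += (x - mu) * (x - mu);
        return sum / xs.length;
    }

    // P(x; mu, sigma^2) for a Gaussian distribution.
    static double probability(double x, double mu, double variance) {
        double sigma = Math.sqrt(variance);
        return Math.exp(-(x - mu) * (x - mu) / (2.0 * variance))
                / (sigma * Math.sqrt(2.0 * Math.PI));
    }

    public static void main(String[] args) {
        double[] feature = {2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0};
        double mu = mean(feature);
        double var = variance(feature, mu);
        System.out.println("mu = " + mu + ", variance = " + var);
        System.out.println("P(5.0) = " + probability(5.0, mu, var));
    }
}
```

A full anomaly detector would run these calculations once per feature column and multiply the per-feature probabilities together for each input vector.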
We also train a third parameter, an epsilon “cutoff” value: for a given input vector, if $$P(x; \mu, \sigma^2)$$ evaluates to a value greater than epsilon then the input vector is “normal”; a value less than epsilon implies that the input vector is an “anomaly.” The math for calculating these parameters from training data is fairly easy but the code is not: we need to organize the training data and search for a value of epsilon that minimizes the error for a cross validation data set.

To be clear: we separate the input examples into three separate sets of training, cross validation, and testing data. We use the training data to set the model parameters, use the cross validation data to learn an epsilon value, and finally use the testing data to get precision, recall, and F1 scores that indicate how well the model detects anomalies in data not used for training and cross validation.

I present the example program as one long listing, with more code explanation after the listing. Please note the long loop over each input training example starting at line 28 and ending on line 74. The code in lines 25 through 44 processes the input training data into three disjoint sets of training, cross validation, and testing data. Then the code in lines 45 through 63 copies these three sets of data to Java arrays. The code in lines 65 through 73 calculates, for a training example, the value of $$\mu$$ (the variable mu in the code).

Please note in the code example that I prepend class variables used in methods with “this.” even when it is not required. I do this for legibility; it is a personal style. Once the training data and the values of $$\mu$$ (the variable mu in the code) are defined for each feature, we can define the method train in lines 86 through 104 that calculates the best epsilon “cutoff” value for the training data set using the method train_helper defined in lines 138 through 165.
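To make the epsilon search concrete, here is a self-contained sketch of the idea, independent of the book's AnomalyDetection class: sweep candidate epsilon values over a set of cross validation probabilities and keep the value that maximizes the F1 score. The data, class, and method names here are hypothetical:

```java
import java.util.Arrays;

public class EpsilonSearch {
    // Sweep epsilon between the smallest and largest probability seen on the
    // cross validation set; an example is predicted "anomaly" when its
    // probability is below epsilon. Returns {bestEpsilon, bestF1}.
    static double[] bestEpsilon(double[] probs, int[] labels) {
        double min = Arrays.stream(probs).min().getAsDouble();
        double max = Arrays.stream(probs).max().getAsDouble();
        double step = (max - min) / 1000.0;
        if (step == 0.0) step = 1e-9; // guard against identical probabilities
        double bestEps = min, bestF1 = 0.0;
        for (double eps = min; eps <= max; eps += step) {
            int tp = 0, fp = 0, fn = 0;
            for (int i = 0; i < probs.length; i++) {
                boolean predictedAnomaly = probs[i] < eps;
                if (predictedAnomaly && labels[i] == 1) tp++;
                else if (predictedAnomaly && labels[i] == 0) fp++;
                else if (!predictedAnomaly && labels[i] == 1) fn++;
            }
            if (tp == 0) continue; // F1 undefined with no true positives
            double precision = tp / (double) (tp + fp);
            double recall = tp / (double) (tp + fn);
            double f1 = 2.0 * precision * recall / (precision + recall);
            if (f1 > bestF1) { bestF1 = f1; bestEps = eps; }
        }
        return new double[]{bestEps, bestF1};
    }

    public static void main(String[] args) {
        // Synthetic cross validation data: three "normal" examples with high
        // probabilities and two "anomalies" with low probabilities.
        double[] probs = {0.9, 0.8, 0.7, 0.05, 0.02};
        int[] labels = {0, 0, 0, 1, 1};
        double[] r = bestEpsilon(probs, labels);
        System.out.println("epsilon = " + r[0] + ", F1 = " + r[1]);
    }
}
```

The real implementation works the same way in spirit: the cross validation set, not the training set, drives the choice of epsilon.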
We find the “best” epsilon value by testing against the separate cross validation data set; we do this by calling the method test that is defined in lines 167 through 198.

### Example Using the University of Wisconsin Cancer Data

The example in this section loads the University of Wisconsin data and uses the class AnomalyDetection developed in the last section to find anomalies, which for this example will be input vectors that represent malignancy in the original data.

Data used by an anomaly detection model should have (roughly) a Gaussian (bell curve) distribution. What form does the cancer data have? Unfortunately, each of the data features seems to have either a greater density at the lower range of feature values or a larger density at the extremes of the data feature ranges. This will cause our model to not perform as well as we would like. Here are the inputs displayed as five-bin histograms:

I won’t do it in this example, but the feature “Bare Nuclei” should be removed because it is not even close to being a bell-shaped distribution. Another thing that you can do (recommended by Andrew Ng in his Coursera Machine Learning class) is to take the log of data and otherwise transform it to something that looks more like a Gaussian distribution. In the class WisconsinAnomalyDetection you could, for example, transform the data using something like:

The constant 1.2 in line 4 is a tuning parameter that I got by trial and error, iterating on adjusting the factor and looking at the data histograms. In a real application you would drop features that you cannot transform to something like a Gaussian distribution. Here are the results of running the code as it is in the github repository for this book:

How do we evaluate these results? The precision value of 1.0 means that there were no false positives. False positives are predictions of a true result when it should have been false.
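The transformation listing itself appears in the book's repository; as a sketch of the idea only, assuming the transform simply adds a small offset before taking the log (the 1.2 offset is the tuning constant mentioned above; the class and method names are mine):

```java
public class FeatureTransform {
    // Hypothetical tuning constant found by trial and error while
    // inspecting the five-bin feature histograms.
    static final double OFFSET = 1.2;

    // Log-transform every value of one feature column so that its
    // distribution looks more like a Gaussian.
    static double[] logTransform(double[] featureColumn) {
        double[] out = new double[featureColumn.length];
        for (int i = 0; i < featureColumn.length; i++) {
            out[i] = Math.log(featureColumn[i] + OFFSET);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] raw = {1.0, 3.0, 10.0};
        for (double v : logTransform(raw)) {
            System.out.println(v);
        }
    }
}
```

After transforming, you would re-plot the histograms and keep only the features that now look roughly bell shaped.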
The value 0.578 for recall means that of all the samples that should have been classified as positive, we only predicted about 57.8% of them. The F1 score is calculated as two times the product of precision and recall, divided by the sum of precision and recall.

## Deep Learning Using Deeplearning4j

The Deeplearning4j.org Java library supports several neural network algorithms, including support for Deep Learning (DL). We will look at an example of DL implementing Deep Belief networks using the same University of Wisconsin cancer database that we used in the chapters on machine learning with Spark and on anomaly detection. Deep learning refers to neural networks with many layers, possibly with weights connecting neurons in non-adjacent layers, which makes it possible to model temporal and spatial patterns in data.

I am going to assume that you have some knowledge of simple backpropagation neural networks. If you are unfamiliar with neural networks you might want to pause and do a web search for “neural networks backpropagation tutorial” or read the neural network chapter in my book Practical Artificial Intelligence Programming With Java, 4th edition.

The difficulty in training many-layer networks with backpropagation is that the delta errors in the layers near the input layer (and far from the output layer) get very small, and training can take a very long time even on modern processors and GPUs. Geoffrey Hinton and his colleagues created a new technique for pretraining weights. In 2012 I took a Coursera course taught by Hinton and some colleagues titled “Neural Networks for Machine Learning” and the material may still be available online when you read this book. For Java developers, Deeplearning4j is a great starting point for experimenting with deep learning. If you also use Python then there are good tutorials and other learning assets at deeplearning.net.
### Deep Belief Networks

Deep Belief Networks (DBNs) are a type of deep neural network containing multiple hidden neuron layers where there are no connections between neurons inside any specific hidden layer. Each hidden layer learns features based on the training data and the values of the weights from the previous hidden layer. By previous layer I refer to the connected layer that is closer to the input layer. A DBN can learn more abstract features, with more abstraction in the hidden layers “further” from the input layer.

DBNs are first trained a layer at a time. Initially a set of training inputs is used to train the weights between the input and the first hidden layer of neurons. Technically, as we preliminarily train each succeeding pair of layers we are training a restricted Boltzmann machine (RBM) to learn a new set of features. It is enough for you to know at this point that an RBM consists of two layers, input and output, that are completely connected (every neuron in the first layer has a weighted connection to each neuron in the next layer) and that there are no connections between neurons within a layer. As we progressively train a DBN, the output layer of one RBM becomes the input layer for the next layer pair used in preliminary training.

Once the hidden layers are all preliminarily trained, backpropagation learning is used to retrain the entire network: delta errors are calculated by comparing the forward pass outputs with the training outputs, and errors are back-propagated to update the weights in the layers proceeding back towards the first hidden layer. The important thing to understand is that backpropagation tends not to work well for networks with many hidden layers because the back propagated errors get smaller as we process backwards towards the input neurons, which would cause network training to be very slow. By precomputing weights using RBM pairs, we start closer to a set of weights that minimizes errors over the training set (and the test set).
### Deep Belief Example

The following screenshot shows an IntelliJ project (you can use the free community or professional version for the examples in this book) for the example in this chapter:

The Deeplearning4j library uses user-written Java classes to import training and testing data into a form that the Deeplearning4j library can use. The following listing shows the implementation of the class WisconsinDataSetIterator that iterates through the University of Wisconsin cancer data set:

The WisconsinDataSetIterator constructor calls its super class with an instance of the class WisconsinDataFetcher (defined in the next listing) to read the Comma Separated Values (CSV) spreadsheet data from the file data/cleaned_wisconsin_cancer_data.csv:

In line 14, the last argument 9 defines which column in the input CSV file contains the label data. This value is zero indexed, so if you look at the input file data/cleaned_wisconsin_cancer_data.csv this will be the last column. Values of 4 in the last column indicate malignant and values of 2 indicate not malignant.

The following listing shows the definition of the class DeepBeliefNetworkWisconsinData that reads the University of Wisconsin cancer data set using the code in the last two listings, randomly selects part of it to use for training and for testing, creates a DBN, and tests it.

The value of the variable numHidden set in line 51 refers to the number of neurons in each hidden layer. Setting numberOfLayers to 3 in line 52 indicates that we will use just a single hidden layer since this value (3) also counts the input and output layers. The network is configured and constructed in lines 83 through 104. If we increased the number of hidden layers (something that you might do for more complex problems) then you would repeat lines 88 through 97 to add each new hidden layer, and you would change the layer indices (first argument) as appropriate in calls to the chained method .layer().
In line 71 we set fractionOfDataForTraining to 0.7, which means that we will use 70% of the available data for training and 30% for testing. It is very important not to use training data for testing because performance on recognizing training data should always be good, assuming that you have enough memory capacity in the network (i.e., enough hidden layers and enough neurons in each hidden layer). In lines 78 through 81 we divide our data into disjoint training and testing sets.

In line 86 we set three meta learning parameters: the learning rate for the first set of weights between the input and hidden layer (1e-1), the use of regularization, and the learning rate for the hidden to output weights (2e-4). In line 95 we set the dropout factor, here saying that we will randomly not use 25% of the neurons for any given training example. Along with regularization, using dropout helps prevent overfitting. Overfitting occurs when a neural network fails to generalize and learns noise in the training data. The goal is to learn important features that affect the utility of a trained model for processing new data; we don’t want to learn random noise in the data as important features. Another way to prevent overfitting is to use the smallest possible number of neurons in the hidden layers that still performs well on the independent test data.

After training the model in lines 105 through 107, we test the model (lines 110 through 120) on the separate test data. The program output when I ran the model is:

The F1 score is calculated as twice precision times recall, all divided by precision plus recall. We would like F1 to be as close to 1.0 as possible and it is common to spend a fair amount of time experimenting with meta learning parameters to increase F1. It is also fairly common to try to automatically learn good values for the meta learning parameters.
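The random 70% / 30% division of data described above can be sketched in plain Java, independent of the Deeplearning4j data classes; the class and method names here are my own:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class DataSplit {
    // Shuffle the examples and split them into disjoint training and
    // testing sets. A fixed seed makes the split reproducible.
    static <T> List<List<T>> split(List<T> data, double fractionForTraining, long seed) {
        List<T> copy = new ArrayList<>(data);
        Collections.shuffle(copy, new Random(seed));
        int cut = (int) Math.round(copy.size() * fractionForTraining);
        List<List<T>> result = new ArrayList<>();
        result.add(new ArrayList<>(copy.subList(0, cut)));      // training set
        result.add(new ArrayList<>(copy.subList(cut, copy.size()))); // testing set
        return result;
    }

    public static void main(String[] args) {
        List<Integer> examples = new ArrayList<>();
        for (int i = 0; i < 10; i++) examples.add(i);
        List<List<Integer>> sets = split(examples, 0.7, 42L);
        System.out.println("training size = " + sets.get(0).size()
                + ", testing size = " + sets.get(1).size());
    }
}
```

The same helper extends naturally to the three-way training/validation/testing split used when tuning meta parameters.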
We won’t do this here, but the process involves splitting the data into three disjoint sets: training, validation, and testing. The meta parameters are varied, training is performed, and the best set of meta parameters is selected using the validation data. Finally, we test the network as defined by those meta parameters and the weights learned for them with the separate test data to see what the effective F1 score is.

### Deep Learning Wrapup

I first used complex neural network topologies in the late 1980s for phoneme (speech) recognition, specifically using time delay neural networks, and I gave a talk about this work at the IEEE First Annual International Conference on Neural Networks in San Diego, California, June 21-24, 1987. Back then neural networks were not really considered to be a great technology for this application, but now Google, Microsoft, and other companies are using deep (many layered) neural networks for speech and image recognition. Exciting work is also being done in the field of natural language processing. I have provided just a small example in this chapter that you can experiment with easily. I wanted to introduce you to Deeplearning4j because I think it is probably the easiest way for Java developers to get started working with many layered neural networks, and I refer you to the project documentation.

## Web Scraping Examples

Except for the first chapter on network programming techniques, this chapter, and the final chapter on what I call Knowledge Management-Lite, this book is primarily about machine learning in one form or another. As a practical matter, much of the data that many people use for machine learning either comes from the web or from internal data sources. This short chapter provides some guidance and examples for getting text data from the web.
In my work I usually use the Ruby scripting language for web scraping and information gathering (as I wrote about in my Apress book Scripting Intelligence: Web 3.0 Information Gathering and Processing), but there is also good support for using Java for web scraping, and since this is a book on modern Java development, we will use Java in this chapter.

Before we start a technical discussion about “web scraping” I want to point out that much of the information on the web is copyrighted, and the first thing that you should do is read the terms of service for web sites to ensure that your use of “scraped” or “spidered” data conforms with the wishes of the persons or organizations who own the content and pay to run the scraped web sites.

### Motivation for Web Scraping

As we will see in the next chapter on linked data, there is a huge amount of structured data available on the web via web services, semantic web/linked data markup, and APIs. That said, you will frequently find it useful to pull raw text from web sites, but this text is usually fairly unstructured and in a messy (and frequently changing) format, since web pages are meant for human consumption and not for ingestion by software agents. In this chapter we will cover useful “web scraping” techniques. You will see that there is often a fair amount of work in dealing with different web design styles and layouts. To make things even more inconvenient, you might find that your software information gathering agents will often break because of changes in web sites.

I tend to use one of three general techniques for scraping web sites. Only the first two will be covered in this chapter:

• Use an HTML parsing library that strips all HTML markup and Javascript from a page and returns a “pure text” block of text. Here the text in navigation menus, headers, etc. will be interspersed with what we might usually think of as the “content” of a web site.
• Exploit the HTML DOM (Document Object Model) formatting information on web sites to pick out headers, page titles, navigation menus, and large blocks of content text.
• Use a tool like [Selenium](http://docs.seleniumhq.org/) to programmatically control a web browser so your software agents can log in to sites and otherwise perform navigation. In other words, your software agents can simulate a human using a web browser.

I seldom need to use tools like Selenium but, as the saying goes, “when you need them, you need them.” For simple sites I favor extracting all text as a single block and using DOM processing as needed. I am not going to cover the use of Selenium and the Java Selenium WebDriver APIs in this chapter because, as I mentioned, I tend not to use them frequently and I think that you are unlikely to need to do so either. I refer you to the Selenium documentation if the first two approaches in the last list do not work for your application. Selenium is primarily intended for building automated testing of complex web applications, so my occasional use of it in web spidering is not the common use case.

I assume that you have some experience with HTML and the DOM. For reference, the following figure shows a small part of the DOM for a page on one of my web sites:

This screenshot shows the Chrome web browser developer tools, specifically viewing the page’s DOM. Since a DOM is a tree data structure it is useful to be able to collapse or expand sub-trees in the DOM. In this figure, the HTML BODY element contains two top level DIV elements. The first DIV, which contains the navigation menu for my site, is collapsed. The second DIV contains an H2 heading and various nested DIV and P (paragraph) elements. I show this fragment of my web page not as an example of clean HTML coding but rather as an example of how messy and nested web page elements can be.

### Using the jsoup Library

We will use the MIT licensed library jsoup.
One reason I selected jsoup for the examples in this chapter, out of many fine libraries that provide similar functionality, is its particularly nice documentation, especially the jsoup Cookbook which I urge you to bookmark as a general reference. In this chapter I will concentrate on just the most frequent web scraping use cases from my own work.

The following bit of code uses jsoup to get the text inside all P (paragraph) elements that are direct children of any DIV element. On line 14 we use the jsoup library to fetch my home web page: In line 15 I select the pattern that returns all P elements that are direct children of any DIV element, and in lines 16 to 18 I print the text inside these P elements.

For training data for machine learning it is useful to just grab all the text on a web page and assume that common phrases dealing with web navigation, etc. will be dropped from learned models because they occur in many different training examples for different classifications. The following code snippet shows how to fetch the plain text from an entire web page: The 2-gram (i.e., two words in sequence) “Toggle navigation” in the last listing has nothing to do with the real content on my site and is an artifact of using the Bootstrap CSS and Javascript tools. Often “noise” like this is simply ignored by machine learning models if it appears on many different sites, but beware that this might be a problem and you might need to precisely fetch text from specific DOM elements. Similarly, notice that this last listing picks up the plain text from the navigation menus.

The following code snippet finds HTML anchor elements and prints the data associated with these elements: Notice that there are different types of URIs like #, relative, and absolute.
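These three kinds of URI value can be told apart with a simple string check. The following plain-Java sketch (the class name is my own) shows the distinction I use when deciding whether to follow a link:

```java
public class UriKind {
    // Classify an href value as a fragment ("#..."), an absolute URI
    // (starts with a protocol), or a relative URI (everything else).
    static String classify(String uri) {
        if (uri.startsWith("#")) return "fragment";
        if (uri.startsWith("http:") || uri.startsWith("https:")) return "absolute";
        return "relative";
    }

    public static void main(String[] args) {
        for (String u : new String[]{"#contact", "/consulting/", "http://example.com/"}) {
            System.out.println(u + " -> " + classify(u));
        }
    }
}
```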
Any characters following a # character do not affect the routing of which web page is shown (or which API is called), but the characters after the # character are available for use in specifying anchor positions on a web page or extra parameters for API calls. Relative URIs like /consulting/ (as seen in line 5) are understood to be relative to the base URI of the web site. I often require that URIs be absolute (i.e., start with a protocol like “http:” or “https:”) and the following code snippet selects just the absolute URI anchors: In line 3 I specify the attribute as “abs:href” to be more selective:

### Wrap Up

I have just shown you a quick reference to the most common use cases in my own work. What I didn’t show you was an example of organizing spidered information for reuse. I sometimes collect training data for machine learning by using web searches with query keywords tailored to find information in specific categories. I am not covering automating web search in this book, but I would like to refer you to an open source wrapper that I wrote for Microsoft’s Bing Search APIs on github. As an example, just to give you an idea for experimentation: if you wanted to train a model to categorize text containing car descriptions into the two classes “US domestic cars” and “foreign made cars”, then you might use search queries like “cars Ford Chevy” and “cars Volvo Volkswagen Peugeot” to get example text for these two categories. If you use the Bing Search APIs for collecting training data, then for the top ranked search results you can use the techniques covered in this chapter to retrieve the text from the original web pages. Then use one or more of the machine learning techniques covered in this book to build classification models. This is a good technique and some people might even consider it a super power. For machine learning, I sometimes collect text in files whose names indicate the classification of the text in each file.
I often also collect data in a NoSQL datastore like MongoDB or CouchDB, or use a relational database like Postgres, to store training data for future reuse.

## Linked Data

This chapter introduces you to techniques for information architecture and development using linked data and semantic web technologies. This chapter is paired with the next chapter on knowledge management, so please read these two chapters in order.

Note: I published “Practical Semantic Web and Linked Data Applications: Java, JRuby, Scala, and Clojure Edition” and you can get a free PDF of this book here. While the technical details of that book are still relevant, I am now more opinionated about which linked data and semantic web technologies are most useful, so this chapter is more focused. With some improvements, I use the Java client code for DBPedia search and SPARQL queries from my previous book. I also include new versions that return query results as JSON.

I am going to assume that you have fetched a free copy of my book “Practical Semantic Web and Linked Data Applications” and will use chapters 6 and 7 for an introduction to the Resource Description Framework (RDF and RDFS) and chapter 8 for a tutorial on the SPARQL query language. It is okay to read this background material later if you want to dive right into the examples in this chapter. I do provide a quick introduction to SPARQL and RDF in the next section, but please do also get a free copy of my semantic web book for a longer treatment.

We use the standard query language SPARQL to retrieve information from RDF data sources in much the same way that we use SQL to retrieve information from relational databases. In this chapter we will look at consuming linked data using SPARQL queries against public RDF data stores and searching DBPedia.
Using the code from the earlier chapter “Natural Language Processing Using OpenNLP” we will also develop an example application that annotates text by tagging entities with unique DBPedia URIs for the entities (we will use this example again in the next chapter in a code example for knowledge management).

Why use linked data? Before we jump into technical details I want to give you some motivation for why you might use linked data and semantic web technologies to organize and access information. Suppose that your company sells products to customers and your task is to augment the relational databases used for ordering, shipping, and customer data with a new system that needs to capture relationships between products, customers, and information sources on the web that might help both your customers and your customer support staff increase the utility of your products to your customers. Customers, products, and information sources are entities that will sometimes be linked with properties or relationships to other entities. This new data will be unstructured and you do not know ahead of time all of the relationships and the new types of entities that the new system will use.

If you come from a relational database background, using any sort of graph database might be uncomfortable at first. As motivation, consider that Google uses its Knowledge Graph (I customized the Knowledge Graph when I worked at Google) for internal applications, to add value to search results, and to power Google Now. Facebook uses its Open Graph to store information about users, user posts, and relationships between users and posted material. These graph data stores are key assets for Google and Facebook. Just because you don’t deal with huge amounts of data does not mean that the flexibility of graph data is not useful for your projects! We will use the DBPedia graph database in this chapter for the example programs.
DBPedia (which Google used as core knowledge when developing their Knowledge Graph) is a rich information source of typed data for products, people, companies, places, etc. Connecting internal data in your organization to external data sources like DBPedia increases the utility of your private data. You can make this data sharing relationship symmetrical by selectively publishing some of your internal data externally as linked data for the benefit of your customers and business partners.

The key idea is to design the shape of your information (what entities do you deal with, and what types of properties or relationships exist between different types of entities), ingest the data into a graph database with a plan for keeping everything up to date, and then provide APIs or a query language so application programmers can access this linked data when developing new applications. In this chapter we will use semantic web standards with the SPARQL query language to access data, but these ideas should also apply to using other graph databases like Neo4J or non-relational data stores like MongoDB or CouchDB.

### Example Code

The examples for this chapter are set up to run either from the command line or using IntelliJ. There are three examples in this chapter. If you want to try running the code before reading the rest of the chapter, run each maven test separately: The following figure shows the project for this chapter in the Community Edition of IntelliJ:

### Overview of RDF and SPARQL

The Resource Description Framework (RDF) is used to store information as subject/predicate/object triple values. RDF data was originally encoded as XML and intended for automated processing. In this chapter we will use two easy-to-read formats called “N-Triples” and “N3.” RDF data consists of a set of triple values:

• subject
• predicate
• object

As an example, suppose our application involves news stories and we want a way to store and query metadata.
We might do this by extracting semantic information from the text and storing it in RDF. I will use this application domain for the examples in this chapter. We might use triples like:

• subject: a URL (or URI) of a news article
• predicate: a relation like “containsPerson”
• object: a value like “Bill Clinton”

As previously mentioned, we will use either URIs or string literals as values for subjects and objects. We will always use URIs for the values of predicates. In any case URIs are usually preferred to string literals because they are unique. We will see an example of this preferred use, but first we need to learn the N-Triple and N3 RDF formats.

Any part of a triple (subject, predicate, or object) is either a URI or a string literal. URIs encode namespaces. For example, the containsPerson predicate in the last example could properly be written as:

The first part of this URI is considered to be the namespace for (what we will use as a predicate) “containsPerson.” When different RDF triples use this same predicate, this gives us some assurance that all users of this predicate subscribe to the same meaning. Furthermore, we will see in the section on RDFS that we can use RDFS to state equivalency between this predicate (in the namespace http://knowledgebooks.com/ontology/) and predicates represented by different URIs used in other data sources. In an “artificial intelligence” sense, software that we write does not understand a predicate like “containsPerson” in the way that a human reader can by combining understood common meanings for the words “contains” and “person”, but for many interesting and useful types of applications that is fine as long as the predicate is used consistently. We will see shortly that we can define abbreviation prefixes for namespaces, which makes RDF and RDFS files shorter and easier to read.

A statement in N-Triple format consists of three URIs (or string literals – any combination) followed by a period to end the statement.
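A triple and its N-Triple serialization are easy to model directly in Java. The following minimal sketch (my own class, not from the book's examples) writes a single statement, putting URIs in angle brackets and quoting string literals:

```java
public class Triple {
    final String subject;   // always a URI in our examples
    final String predicate; // always a URI
    final String object;    // a URI or a string literal

    Triple(String subject, String predicate, String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }

    // Serialize as one N-Triple statement: URIs go in angle brackets,
    // string literals are double quoted, and a period ends the statement.
    String toNTriple() {
        String obj = object.startsWith("http")
                ? "<" + object + ">"
                : "\"" + object + "\"";
        return "<" + subject + "> <" + predicate + "> " + obj + " .";
    }

    public static void main(String[] args) {
        Triple t = new Triple("http://news.com/201234",
                "http://knowledgebooks.com/ontology/containsCountry", "China");
        System.out.println(t.toNTriple());
    }
}
```

This is the kind of output format I use when programs save extracted data as RDF for later loading into a triple store.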
While statements are often written one per line in a source file they can be broken across lines; it is the ending period which marks the end of a statement. The standard file extension for N-Triple format files is *.nt and the standard file extension for N3 format files is *.n3. My preference is to use N-Triple format files as output from programs that I write to save data as RDF. I often use Sesame to convert N-Triple files to N3 if I will be reading them or even hand editing them. You will see why I prefer the N3 format when we look at an example:

Here we see the use of an abbreviation prefix “kb:” for the namespace for my company KnowledgeBooks.com ontologies. The first term in the RDF statement (the subject) is the URI of a news article. The second term (the predicate) is “containsCountry” in the “kb:” namespace. The last item in the statement (the object) is a string literal “China.” I would describe this RDF statement in English as, “The news article at URI http://news.com/201234 mentions the country China.”

This was a very simple N3 example which we will expand to show additional features of the N3 notation. As another example, suppose that this news article also mentions the USA. Instead of adding a whole new statement like this:

we can combine them using N3 notation. N3 allows us to collapse multiple RDF statements that share the same subject and optionally the same predicate:

We can also add in additional predicates that use the same subject:

This single N3 statement represents ten individual RDF triples. Each section defining triples with the same subject and predicate has objects separated by commas and ends with a period. Please note that whatever RDF storage system we use (we will be using Sesame) it makes no difference if we load RDF as XML, N-Triple, or N3 format files: internally subject, predicate, and object triples are stored in the same way and are used in the same way.

I promised you that the data in RDF data stores was easy to extend.
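The combined N3 form discussed above can be sketched as follows; the article URI and the specific property names are illustrative assumptions consistent with the earlier discussion:

```n3
@prefix kb: <http://knowledgebooks.com/ontology/> .

<http://news.com/201234>
    kb:containsCountry "China" , "USA" ;
    kb:containsPerson  "Bill Clinton" .
```

Objects sharing the same subject and predicate are separated by commas, a semicolon introduces another predicate for the same subject, and the period ends the statement.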
As an example, let us assume that we have written software that is able to read online news articles and create RDF data that captures some of the semantics in the articles. If we extend our program to also recognize dates when the articles are published, we can simply reprocess articles and for each article add a triple to our RDF data store using a form like:

Furthermore, if we do not have dates for all news articles that is often acceptable depending on the application.

SPARQL is a query language used to query RDF data stores. While SPARQL may initially look like SQL, there are some important differences like support for RDFS and OWL inferencing. We will cover the basics of SPARQL in this section. We will use the following sample RDF data for the discussion in this section:

In the following examples, we will look at queries but not the results. We will start with a simple SPARQL query for subjects (news article URLs) and objects (matching countries) with the value for the predicate equal to containsCountry:

Variables in queries start with a question mark character and can have any names. We can make this query easier and reduce the chance of misspelling errors by using a namespace prefix:

We could have filtered on any other predicate, for instance containsPlace. Here is another example using a match against a string literal to find all articles exactly matching the text “Maryland.” The following queries were copied from Java source files and were embedded as string literals so you will see quotation marks backslash escaped in these examples. If you were entering these queries into a query form you would not escape the quotation marks.
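A hedged sketch of such an exact string literal match using a namespace prefix; the kb:containsState property name is an assumption for illustration (in a Java string literal the quotation marks around "Maryland" would be backslash escaped):

```sparql
PREFIX kb: <http://knowledgebooks.com/ontology/>

SELECT ?subject
WHERE {
  ?subject kb:containsState "Maryland" .
}
```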

We can also match partial string literals against regular expressions:
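A sketch of such a partial match using the SPARQL regex filter function; the kb:containsPlace property is an assumption carried over from the earlier examples:

```sparql
PREFIX kb: <http://knowledgebooks.com/ontology/>

SELECT ?subject ?object
WHERE {
  ?subject kb:containsPlace ?object .
  FILTER regex(?object, "Mary")
}
```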

Prior to this last example query we only requested that the query return values for subject and predicate for triples that matched the query. However, we might want to return all triples whose subject (in this case a news article URI) is in one of the matched triples. Note that there are two matching triples, each terminated with a period:

When WHERE clauses contain more than one triple pattern to match, this is equivalent to a Boolean “and” operation. The DISTINCT clause removes duplicate results. The ORDER BY clause sorts the output in alphabetical order: in this case first by predicate (containsCity, containsCountry, etc.) and then by object. The LIMIT modifier limits the number of results returned and the OFFSET modifier sets the number of matching results to skip.
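Putting these modifiers together, such a query might look like the following sketch; the kb:containsCountry property and the sample value are assumptions carried over from the earlier examples:

```sparql
PREFIX kb: <http://knowledgebooks.com/ontology/>

SELECT DISTINCT ?subject ?predicate ?object
WHERE {
  ?subject kb:containsCountry "China" .
  ?subject ?predicate ?object .
}
ORDER BY ?predicate ?object
LIMIT 10
OFFSET 5
```

The two triple patterns in the WHERE clause act as a Boolean “and”: only subjects that mention China contribute triples to the result.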

We are finished with our quick tutorial on using the SELECT query form. There are three other query forms that I am not covering in this chapter:

• CONSTRUCT – returns a new RDF graph of query results
• ASK – returns Boolean true or false indicating if a query matches any triples
• DESCRIBE – returns a new RDF graph containing matched resources

### SPARQL Query Client

There are many available public RDF data sources. Some examples of publicly available SPARQL endpoints are:

• DBPedia, which encodes the structured information in Wikipedia info boxes as RDF data. The DBPedia data is also available as a free download for hosting on your own SPARQL endpoint.
• LinkDb, which contains human genome data.
• Library of Congress: there is currently no SPARQL endpoint provided, but subject matter SPARQL data is available for download for use on your own SPARQL endpoint and there is a search facility that can return RDF data results.
• BBC Programs and Music
• FactForge, which combines RDF data from many sources including DBPedia, Geonames, Wordnet, and Freebase. The SPARQL query page has many great examples to get you started. For example, look at the SPARQL for the query and results for “Show the distance from London of airports located at most 50 miles away from it.”

There are many public SPARQL endpoints on the web but I need to warn you that they are not always available. The raw RDF data for most public endpoints is usually available so if you are building a system relying on a public data set you should consider hosting a copy of the data on one of your own servers. I usually use one of the following RDF repositories and SPARQL query interfaces: Stardog, Virtuoso, and Sesame. That said, there are many other fine RDF repository server options, both commercial and open source. After reading this chapter, you may want to read a fairly recent blog article I wrote on importing and using the OpenCYC world knowledge base in a local repository.

In the next two sections you will learn how to make SPARQL queries against RDF data stores on the web, getting results both as JSON and parsed out to Java data values. A major theme in the next chapter is combining data in the “cloud” with local data.

#### JSON Client

We will be using the SPARQL query for finding people who were born in Arizona:
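The query in the book's source may differ in detail, but a minimal version using the standard DBPedia ontology property dbo:birthPlace looks like this sketch:

```sparql
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

SELECT ?person
WHERE {
  ?person dbo:birthPlace dbr:Arizona .
}
LIMIT 25
```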

You can run the unit tests for the DBPedia lookups and SPARQL queries using:

Here is a small bit of the output produced when running the unit test for this class with this example SPARQL query:

This example uses the DBPedia SPARQL endpoint. There is another public DBPedia service for looking up names and we will look at it in the next section.

### DBPedia Entity Lookup

Later in this chapter I will show you what the raw DBPedia data looks like. For this section we will access DBPedia through a public search API. The following two sub-sections show example code for getting the DBPedia lookup results as Java data and as JSON data.

#### Java Data Client

For the DBPedia Lookup service used in this section I only list a small part of the example program. The example is fairly long so I refer you to the source code.

The DBPedia Lookup service returns XML response data by default. The DBpediaLookupClient class is fairly simple. It encodes a query string, calls the web service, and parses the XML payload that is returned from the service.
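A minimal sketch of the XML handling, using only the JDK's built-in DOM parser; the element names (Result, Label, URI, Description) and the sample payload are assumptions modeled on the lookup service's response format, not copied from the book's source:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.*;
import org.w3c.dom.*;

public class LookupXmlSketch {
    // Parse a DBPedia Lookup style XML payload into a list of maps, one map
    // per Result element. The element names are assumptions for this sketch.
    public static List<Map<String, String>> parse(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        List<Map<String, String>> results = new ArrayList<>();
        NodeList nodes = doc.getElementsByTagName("Result");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element result = (Element) nodes.item(i);
            Map<String, String> row = new HashMap<>();
            for (String tag : new String[]{"Label", "URI", "Description"}) {
                NodeList children = result.getElementsByTagName(tag);
                if (children.getLength() > 0) {
                    row.put(tag, children.item(0).getTextContent().trim());
                }
            }
            results.add(row);
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        String sample =
            "<ArrayOfResult>" +
            "  <Result>" +
            "    <Label>London</Label>" +
            "    <URI>http://dbpedia.org/resource/London</URI>" +
            "    <Description>Capital city of England.</Description>" +
            "  </Result>" +
            "</ArrayOfResult>";
        for (Map<String, String> row : parse(sample)) {
            System.out.println(row.get("Label") + " -> " + row.get("URI"));
        }
    }
}
```

The same parse method would then be applied to the XML string fetched over HTTP from the lookup service.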

You can run the unit tests for the DBPedia lookups and SPARQL queries using:

Sample output for looking up “London” showing values for the map keys “Description”, “Label”, and “URI” in the fetched data looks like:

If you run this example program you will see that the description text prints all on one line; it is line wrapped in the above example.

This example returns a list of maps, where each list item is a map in which each key is a variable name and each value is the value for that key. In the next section we will look at similar code that instead returns JSON data.

#### JSON Client

The example in this section is simpler than the one in the last section because we just need to return the payload from the lookup web service as-is (i.e., as JSON data encoded in a string).

You can run the unit tests for the DBPedia lookups and SPARQL queries using:

The following example for the lookup for “Bill Clinton” shows sample JSON (as a string) content that I “pretty printed” using IntelliJ for readability:

### Annotate Text with DBPedia Entity URIs

I have used DBPedia, the semantic web/linked data version of Wikipedia, in several projects. One useful pattern is using a DBPedia URI as a unique identifier for a real world entity. For example, there are many people named Bill Clinton but usually this name refers to a President of the USA. It is useful to annotate people’s names, company names, etc. in text with the DBPedia URI for that entity. We will do this in a quick and dirty way (but hopefully still useful!) in this section and then solve the same problem in a different way in the next section.

In this section we simply look for exact matches in text for the descriptions for DBPedia URIs. In the next section we tokenize the descriptions and also tokenize the text to match entity names with URIs. I will also show you some samples of the raw DBPedia RDF data in the next section.

For one of my own projects I created mappings of entity names to DBPedia URIs for nine entity classes. For the example in this section I use three of my entity class mappings: people, countries, and companies. The data for these entity classes is stored in text files in the subdirectory dbpedia_entities. Later in this chapter I will show you what the raw DBPedia dump data looks like.

The example program in this section modifies input text, adding DBPedia URIs after entities recognized in the text.

The annotation code in the following example is simple but not very efficient. See the source code for comments discussing this that I left out of the following listing (edited to fit the page width):
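The general shape of that quick and dirty approach can be sketched as follows; the class name, the single hard-coded entity, and the annotation format are hypothetical simplifications, and the repeated scan of the text per entity is exactly the kind of inefficiency noted above:

```java
import java.util.*;

public class EntityAnnotatorSketch {
    // Map from entity name to DBPedia URI; in the book's example this data
    // is loaded from the text files in dbpedia_entities.
    private final Map<String, String> nameToUri;

    public EntityAnnotatorSketch(Map<String, String> nameToUri) {
        this.nameToUri = nameToUri;
    }

    // Naive exact-match annotation: after each entity name found in the
    // text, insert its DBPedia URI in parentheses. Not efficient for large
    // entity maps since it scans the text once per entity.
    public String annotate(String text) {
        String result = text;
        for (Map.Entry<String, String> entry : nameToUri.entrySet()) {
            result = result.replace(entry.getKey(),
                    entry.getKey() + " (" + entry.getValue() + ")");
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> entities = new HashMap<>();
        entities.put("Bill Clinton", "http://dbpedia.org/resource/Bill_Clinton");
        EntityAnnotatorSketch annotator = new EntityAnnotatorSketch(entities);
        // prints: Bill Clinton (http://dbpedia.org/resource/Bill_Clinton) visited China.
        System.out.println(annotator.annotate("Bill Clinton visited China."));
    }
}
```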

If you run the unit tests for this code the output looks like:

### Resolving Named Entities in Text to Wikipedia URIs

We will solve the same problem in this section as we looked at in the last section: resolving named entities to DBPedia (Wikipedia) URIs, but we will use an alternative approach.

In some of my work I find it useful to resolve named entities (e.g., people, organizations, products, locations, etc.) in text to unique DBPedia URIs. The DBPedia semantic web data, which we will also discuss in the next chapter, is derived from the “info boxes” in Wikipedia articles. I find DBPedia to be an incredibly useful knowledge source. When I worked at Google with the Knowledge Graph we were using a core of knowledge from Wikipedia/DBPedia and Freebase with a lot of other information sources. DBPedia is already (reasonably) clean data and is ready to be used as-is in your applications.

The idea behind the example in this chapter is simple: the RDF data dumps from DBPedia can be processed to capture an entity name and the corresponding DBPedia entity URI.

The raw DBPedia files can be downloaded from Resolving Named Entities (the latest version when I wrote this chapter) and, as an example, the SKOS categories dataset looks like (reformatted to be a little easier to read):

SKOS (Simple Knowledge Organization System) defines hierarchies of categories that are useful for categorizing real world entities, actions, etc. Categories are then used for RDF versions of Wikipedia article titles, abstracts, geographic data, etc. Here is a small sample of RDF that maps subject URIs to text labels:

As a final example of what the raw DBPedia RDF dump data looks like, here are some RDF statements that define abstracts for articles (text descriptions are shortened in the following listing to fit on one line):

I have already pulled entity names and URIs for nine classes of entities and this data is available in the github repo for this book in the directory linked_data/dbpedia_entities.

#### Java Named Entity Recognition Library

We already saw the use of Named Entity Recognition (NER) in the chapter using OpenNLP. Here we do something different. For my own projects I have data that I have processed that maps entity names to DBPedia URIs and Clojure code to use this data. For the purposes of this book, I am releasing my data under the Apache 2 license so you can use it in your projects and I converted my Clojure NER library to Java.

The following figure shows the project for this chapter in IntelliJ:

The following code example uses my data for mapping nine classes of entities to DBPedia URIs:

The class TextToDbpediaUris uses the nine NER types defined in class NerMaps. In the following listing the code for all but the NER type Broadcast News Networks is not shown for brevity:

Here is an example showing how to use this code and the output:

The second and third values are word indices (first word in sequence and last word in sequence from the input tokenized text).
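To make the word index convention concrete, here is a small hedged sketch; the class and method names are hypothetical and not from the book's source:

```java
import java.util.*;

public class WordSpanSketch {
    // Find the first occurrence of an entity name in tokenized text and
    // return the indices of its first and last tokens, or null if absent.
    public static int[] findSpan(String[] tokens, String[] entityTokens) {
        for (int i = 0; i <= tokens.length - entityTokens.length; i++) {
            boolean match = true;
            for (int j = 0; j < entityTokens.length; j++) {
                if (!tokens[i + j].equals(entityTokens[j])) {
                    match = false;
                    break;
                }
            }
            if (match) return new int[]{i, i + entityTokens.length - 1};
        }
        return null;
    }

    public static void main(String[] args) {
        String[] tokens = "President Bill Clinton spoke today".split(" ");
        int[] span = findSpan(tokens, "Bill Clinton".split(" "));
        // prints "1 2": the indices of the tokens "Bill" and "Clinton"
        System.out.println(span[0] + " " + span[1]);
    }
}
```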

There are many types of applications that can benefit from using “real world” information from sources like DBPedia.

### Combining Data from Public and Private Sources

A powerful idea is combining public and private linked data; for example, combining the vast real world knowledge stored in DBPedia with an RDF data store containing private data specific to your company. There are two ways to do this:

• Use SPARQL queries to join data from multiple SPARQL endpoints
• Assuming you have an RDF repository of private data, import external sources like DBPedia into your local data store

It is beyond the scope of my coverage of SPARQL but data can be joined from multiple endpoints using the SERVICE keyword in a SPARQL WHERE clause. In a similar way, public sources like DBPedia can be imported into your local RDF repository: this both makes queries more efficient and makes your system more robust in the face of failures of third party services.
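A hedged sketch of such a federated query using the SERVICE keyword; the kb: property and the join through rdfs:label are illustrative assumptions, not taken from the book's source:

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX kb:   <http://knowledgebooks.com/ontology/>

SELECT ?article ?personName ?birthPlace
WHERE {
  ?article kb:containsPerson ?personName .
  SERVICE <http://dbpedia.org/sparql> {
    ?person rdfs:label ?personName ;
            dbo:birthPlace ?birthPlace .
  }
}
```

In practice you would also need to handle language tags on rdfs:label values when joining on string literals.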

One other data source that can be very useful to integrate into your applications is web search. I decided not to cover Java examples for calling public web search APIs in this book but I will refer you to my project on github that wraps the Microsoft Bing Search API. I mention this as an example pattern of using local data (notes and stored PDF documents) with public data APIs.

We will talk more about combining different data sources in the next chapter.

### Wrap Up for Linked Data

The semantic web and linked data are large topics and my goal in this chapter was to provide an easy to follow introduction to hopefully interest you in the subject and motivate you for further study. If you have not seen graph data before then RDF likely seems a little strange to you. There are more general graph databases like the commercial Neo4j product (a free community edition is available) which provides an alternative to RDF data stores. If you would like to experiment with running your own RDF data store and SPARQL endpoint there are many fine open source and commercial products to choose from. If you read my web blog you have seen my experiments using the free edition of the Stardog RDF data store.

## Java Strategies for Working with Cloud Data: Knowledge Management-Lite

Knowledge Management (KM) is the collection, organization and maintenance of knowledge. The last chapter introduced you to techniques for information architecture and development using linked data and semantic web technologies and is paired with this chapter so please read these two chapters in order.

KM is a vast subject but I am interested in exploring one particular idea in this chapter: strategies for combining information stored in local data sources (relational and NoSQL data stores) with cloud data sources (e.g., Google Drive, Microsoft OneDrive, and Dropbox) to produce fused information that in turn may provide actionable knowledge. Think about your workflow: do you use Google Docs and Calendars? Microsoft Office 365 documents in the Azure cloud? I use both companies’ cloud services in my work. Sometimes cloud data just exists to back up local data on my laptops but usually I use cloud data for storing:

• very large data sets
• data that needs to be available for a wide range of devices
• data that is valuable and needs to be replicated

I have been following the field of KM for many years but in the last few years I have been more taken with the idea of fusing together public information sources with private data. The flexibility of hybrid private data storage in the cloud and locally stored is a powerful pattern:

In this chapter we use data sources of documents in the cloud and local relational databases.

When I first planned on writing this chapter the example program that I wanted to write would access a user’s data in real time from either Google’s or Microsoft’s cloud services. I decided that the complexity of implementing OAuth2 to access Google and Microsoft services was too much for a book example because of the need for a custom web app to support an example application authenticating with these services.

What I decided to do was export (using Google Takeout) a small set of test data from my Google account for my calendar, Google Drive, and GMail and use a sanitized version of this data for the example programs in this chapter. Google Drive documents export as Microsoft document formats (Microsoft Word docx files, Excel xlsx spreadsheet files), calendar data exports in standard ics format files, and email in the standard mbox format. In other words, this example could also be applicable for data stored in Microsoft Office 365. While using canned data does not make a totally satisfactory demo it does allow me to concentrate on techniques for accessing the content of popular document types while ignoring the complexities of authenticating users to access their data on Google and Microsoft cloud services.

If you want to experiment with the example programs in this chapter with your own data you can use www.google.com/settings/takeout/ to download your own data and replace my sanitized Google Takeout data - doing this will make the example programs more interesting and hopefully motivate you to experiment further with them.

A real system using the techniques that we will cover could be implemented as a web app running on either Google Cloud Services or Microsoft Azure that would authenticate a user with their Google or Microsoft accounts, and have live access to the user’s cloud data.

We will be using open source Java libraries to access the data in Microsoft Excel and Word formats (note that Google exports Google docs in these formats as does Microsoft’s Office 365).

One excellent tool for building KM systems that I am not covering is web search. Automatically accessing a web-wide search index to provide system users with more information to make decisions is something to consider when designing KM and decision support systems. I decided not to cover Java examples for calling public web search APIs in this book but you can use my project on github that wraps the Microsoft Bing Search API. I once wrote a service using search APIs to annotate notes and stored documents. Using DBPedia (as we did in the last chapter) and Bing search was easy and allowed me to annotate text (with hyperlinks to web information sources) in near real time.

### Motivation for Knowledge Management

Knowledge workers are different than previous kinds of workers who were trained to do a specific job repetitively. Knowledge workers tend to combine information from different sources to make decisions.

When workers leave, if their knowledge is not codified then the organization loses value. This situation will get worse in the future as baby boomers retire in larger numbers. So, the broad goals of KM are to capture employee knowledge, to facilitate sharing knowledge and information between workers company wide, and provide tools to increase workers’ velocity in making good decisions that are based on solid information and previous organizational knowledge.

For the purposes of this chapter we will discuss KM from the viewpoint of making maximum use of all available data sources, including shared cloud-based sources of information. We will do this through the implementation of Java code examples for accessing common document formats.

Since KM is a hugely broad topic I am going to concentrate on a few relatively simple examples that will give you ideas for both capturing and using knowledge and in some cases also provide useful code from the examples in this book that you can use:

• Capture your email in a relational database and assign categories using some of the techniques we learned in the NLP and machine learning chapters.
• Store web page URIs and content snippets for future use.
• Annotate text in a document store by resolving entity names to unique DBPedia URIs.
• Use Postgres as a local database to store data discovered by the example software in this chapter.

Microsoft Office 365, which I use, also has APIs for integrating your own programs into an Office 365 work flow. I decided to use Google Drive in this chapter because it is free to use but I also encourage you to experiment with the Office 365 APIs. You can take free Office 365 API classes offered online at edX.org. We will use Google Drive as a data source for the example code developed in this chapter.

As I mentioned earlier in this chapter, in order to build a real system based on the ideas and example code in this chapter you will need to write a web application that can authenticate users of your system to access data in Google Drive and/or in Microsoft OneDrive. In addition to real time access to shared data in the cloud, another important part of organizational KM systems that we are not discussing in this chapter is providing user and access roles. In the examples in this chapter we will ignore these organizational requirements and concentrate on the process of automatic discovery of useful metadata that helps to organize shared data.

There are many KM products that cater to organizations or individuals. Depending on your requirements you might want to mix custom software with off the shelf software tools. Another important aspect of KM is the concept of master data which is creating a uniform data model and storage strategy for diverse (and often legacy) databases. This is partly what I worked on at Google. Even for the three simple examples we will develop in this chapter there are interesting issues involving master data that I will return to later.

Most KM systems deal with internal data that is stored on local servers. While I do provide some very simple examples of using the Postgres relational database later in this chapter, I wanted to get you thinking about using multiple data sources, so concentrating on data stored in commercial cloud services makes sense.

### Using Google Drive Cloud Takeout Service

As I mentioned in the introduction to this chapter, while it is fairly easy to write a web app for Microsoft Azure that uses the Office 365 APIs and a web app for AppEngine (or Google Cloud) that uses the Google Drive data access APIs, building a web app with the OAuth2 support required to use these APIs is outside of the scope for an example program. So, I am cheating a little. Google has a “Google Takeout” service that everyone who uses Google services should periodically use to save all of their data from Google services.

In this section and contained sub-sections we will use a “Google Takeout” dump as the data used by the example programs. The following subsections contain utilities for reading and parsing:

• Microsoft Word docx files, the format that Google Takeout uses to export Google Drive word processing documents.
• iCal calendar files, the standard format that Google Calendar data is exported as.
• Email mbox files, the format that GMail is exported as.

These are all widely used file formats so I hope the utilities developed in the next three sub-sections will be generally useful to you. The following figure shows a screen grab of the IntelliJ Community Edition project browser showing just the data files that I exported using Google Takeout. I “sanitized” these files by changing names, email addresses, etc. to use for the example programs in this chapter.

We will be using the Apache Tika and iCal4j libraries to implement the code in the next three sub-sections. The following figure shows the IntelliJ source and test files used in this chapter:

#### Processing Microsoft Excel Spreadsheet xlsx and Microsoft Word docx Files

We will be using the Tika library, which parses all sheets (i.e., separate pages) in Excel spreadsheet files. The output is a nicely formatted text file that is newline and tab character delimited. The following utility class parses this plain text representation into Java data for more convenient use:

When a new instance of the class ExcelData is created you can access the individual sheets and the rows in each sheet from the public data in sheetsAndRows.

The first index of sheetsAndRows provides access to individual sheets, the second index provides access to the rows in a selected sheet, and the third (inner) index provides access to the individual columns in a selected row. The public utility method toString provides an example of accessing this data. The class ExcelData is used in the example code in the next listing.
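The nested indexing can be sketched with a simplified parser; the convention that blank lines separate sheets, newlines separate rows, and tabs separate columns is an assumption modeled on the description above and may differ from the actual Tika output:

```java
import java.util.*;

public class ExcelTextSketch {
    // Parse tab and newline delimited text (as produced by Tika for an
    // xlsx file) into sheets -> rows -> column values. Blank lines are
    // assumed to separate sheets in this sketch.
    public static List<List<List<String>>> parse(String text) {
        List<List<List<String>>> sheets = new ArrayList<>();
        for (String sheetText : text.split("\n\n")) {
            List<List<String>> rows = new ArrayList<>();
            for (String line : sheetText.split("\n")) {
                if (line.trim().isEmpty()) continue;
                rows.add(Arrays.asList(line.split("\t")));
            }
            if (!rows.isEmpty()) sheets.add(rows);
        }
        return sheets;
    }

    public static void main(String[] args) {
        String sample = "name\tage\nMary\t41\n\ntotal\t1";
        List<List<List<String>>> sheets = parse(sample);
        // First index: sheet, second index: row, third index: column.
        // Here we access the first column of the second row of sheet 0,
        // which prints "Mary".
        System.out.println(sheets.get(0).get(1).get(0));
    }
}
```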

The following utility class PoiMicrosoftFileReader has methods for processing Excel and Word files:

The static public method DocxToText transforms a file path specifying a Microsoft Word docx file to the plain text in that file. The public static method readXlsx transforms a file path for an Excel xlsx file into an instance of the class ExcelData that we looked at earlier.

The following output was generated from running the test programs GoogleDriveTest.testDocxToText and GoogleDriveTest.testExcelReader. I am using jUnit tests as a convenient way to provide examples of calling APIs in the example code. Here is the test code and its output:

#### Processing iCal Calendar Files

The iCal format is the standard way that Google, Apple, Microsoft, and other vendors export calendar data. The following code uses the net.fortuna.ical4j library, which provides a rich API for dealing with iCal formatted files. In the following example we are using a small part of this API.

The following output was generated from running the test program GoogleDriveTest.testCalendarFileReader. Here is the test code and its output:

The utility method toString for the class ReadCalendarFiles prints for each entry three attributes: a date stamp, a summary, and a description.

#### Processing Email mbox Files

In this section we use the Apache Tika library for parsing email mbox formatted files. There is a commented out printout on line 36; you might want to uncomment this and look at the full message text to see all available metadata for each email as it is processed. You might want to use more of the metadata in your applications than I am using in the example code.

The following output was generated from running the test program GoogleDriveTest.testMboxFileReader. Here is the test code and its output (edited for brevity and clarity):

You now can parse and use information from exported Microsoft Office documents. As I mentioned before, when you export Google documents this process also generates documents in the Microsoft formats.

We will now develop a simple set of utility classes for easily storing structured data and “less structured” JSON data in a Postgres database. We will also provide support for text search in database table columns containing text data.

I use many different data stores in my work: Amazon S3 and DynamoDB, Google Big Table, Hadoop File System, MongoDB, CouchDB, Cassandra, etc. Lots of good options for different types of projects!

That said, Postgres is my “Swiss Army knife” data store offering a rock solid relational database, full text indexing, and native support for schema-less JSON documents. In this short section I am going to cover JDBC access to Postgres and the common use cases that you might encounter if you also want to use Postgres as your “Swiss Army knife” data store.

The following listing shows these utilities, encapsulated in the Java class PostgresUtilities. In line 16 we set the server name, Postgres account name, and password for connecting to a database. Here the server name is “localhost” since I am running both the example program and the postgres server on my laptop. I pass the connection parameters in the connection URI on line 14 but the commented out lines 25 and 26 show you how to set connection properties in your code if you prefer that.

I use the private method convertResultSetToList to convert a result set object (class java.sql.ResultSet) to pure Java data. This method returns a list in which each element is a map representing a row: the keys of the map are table column names and the values are the row values for those columns.

Besides the constructor, this class contains three public methods: doUpdate, doQuery, and textSearch. Each takes as an argument a string containing an SQL statement. I assume that you know the SQL language, and if not the following test code for the class PostgresUtilities provides a few simple SQL statement examples for creating tables, adding and updating rows in a table, querying to find specific rows, and performing full text search.
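The kinds of SQL statements the test code exercises might look like the following sketch; the table and column names are hypothetical, and the full text search uses standard Postgres tsvector/tsquery syntax:

```sql
-- Hypothetical table and column names for illustration:
CREATE TABLE news (id SERIAL PRIMARY KEY, uri VARCHAR(200), content TEXT);

INSERT INTO news (uri, content)
  VALUES ('http://news.com/201234', 'The article mentions China and the USA.');

-- Postgres full text search against the content column:
SELECT uri FROM news
  WHERE to_tsvector('english', content) @@ to_tsquery('China');
```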

The test program PostgresTest contains examples for using this utility class. In the following snippets I will alternate code from PostgresTest with the outputs from the snippets:

The doUpdate method returns the number of modified rows in the database. From the output you see that creating a table changes no rows in the database:

The output is:

The output is:

The output is:

The output is:

The output is:

The output is:

The output is:

There is a lot of hype concerning NoSQL data stores and some of this hype is well deserved: when you have a massive amount of data to handle then splitting it across many servers with systems like the Hadoop File System or Cassandra allows you to process large data sets using many lower cost servers. That said, for many applications you don’t have very large data sets, so general purpose database systems like Postgres make more sense. Ideally you will wrap data access in your own code to minimize future code changes if you switch data stores.

### Wrap Up

Much of my career has involved building systems to store and provide access to information to support automated and user in the loop systems. Knowledge Management is the branch of computer science that helps turn data into information and information into knowledge. I titled this chapter “Knowledge Management-Lite” because I am just providing you with ideas for your own KM projects and hopefully changing for the better how you look at the storage and use of data.

## Book Wrap Up

This book has provided you with my personal views on how to process information using machine learning, natural language processing, and various data store technologies. My hope is that at least some of my ideas for building software systems have been of use to you and given you ideas that you can use when you design and build your applications.

My philosophy in general for building systems that solve real problems in the world includes the following principles:

• Learn new technologies as a background exercise to “keep a full toolbelt.” This is especially important now that technology is changing rapidly and often new technologies make it easier to implement new functionality. A major reason that I wrote this book is to introduce you to some of the technology that I have found to be most useful.
• Make sure you solve the right problems. Build systems that have an impact on your company or organization, and on the world in general by building things that you think will have the greatest positive impact.