Resolve Entity Names to DBPedia References

As a personal research project I have collected a large data set that maps entity names (people's names, city names, names of music groups, company names, etc.) to the DBPedia URI for each entity. I have developed libraries to use this data in Common Lisp, Haskell, and Java. Here we use the Java version of this library.

The Java library is found in the directory ner_dbpedia in the GitHub repository. The raw data for these entity-to-URI mappings is found in the directory ner_dbpedia/dbpedia_as_text.

This example shows the use of a standard Java and Maven packaging technique: building a JAR file that contains resource files in addition to compiled Java code. The example code reads the required data resources from the JAR file (or from the temporary target directory during development). This makes the JAR file self-contained when we use this example library in later chapters.
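
The following is a minimal sketch (not part of the example library) showing the class-loader technique the library relies on: a file placed under src/main/resources is copied to target/classes during development and packaged into the JAR by mvn install, and in both cases it can be opened by name through the class loader. The resource name used here is one of the real data files listed later in this chapter.

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

public class ResourceReadSketch {
  public static void main(String[] args) throws Exception {
    // Open a bundled resource by name; this works both when running from
    // target/classes during development and from the packaged JAR file.
    InputStream in =
        ClassLoader.getSystemResourceAsStream("CountryNamesDbpedia.txt");
    if (in == null) {
      System.err.println("Resource not found on the classpath");
      return;
    }
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      System.out.println("first line: " + reader.readLine());
    }
  }
}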

DBPedia Entities

DBPedia is the structured RDF database that is automatically created from Wikipedia info boxes. We will go into some detail on RDF data in the later chapter Semantic Web. The raw data for these entity-to-URI mappings is found in the directory ner_dbpedia/dbpedia_as_text. The files have the following format (for people, in this case):

1 Al Stewart      <http://dbpedia.org/resource/Al_Stewart>
2 Alan Watts      <http://dbpedia.org/resource/Alan_Watts>

If you visit any of these URIs using a web browser, for example http://dbpedia.org/page/Al_Stewart, you will see the DBPedia data for the entity formatted for human reading. To be clear, though, the primary purpose of the information in DBPedia is for use by software, not humans.
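
DBPedia also serves machine-readable representations of each resource. As an illustration (the /data/ URL pattern and JSON serialization used here are assumptions about DBPedia's web service, not something the example library depends on), the following sketch fetches the data for Al Stewart with the standard Java HTTP client:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DbpediaFetchSketch {
  public static void main(String[] args) throws Exception {
    // Fetch a machine-readable (JSON) description of a DBPedia resource.
    // The URL pattern is an assumption; check the DBPedia documentation.
    HttpClient client = HttpClient.newHttpClient();
    HttpRequest request = HttpRequest.newBuilder(
        URI.create("http://dbpedia.org/data/Al_Stewart.json")).build();
    HttpResponse<String> response =
        client.send(request, HttpResponse.BodyHandlers.ofString());
    // Print just the start of the (long) JSON response:
    String body = response.body();
    System.out.println(body.substring(0, Math.min(300, body.length())));
  }
}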

There are 58,953 entities defined with their DBPedia URIs. The following listing shows the number of entities of each type, obtained by counting the lines in each resource file:

 1 ner_dbpedia: $ wc -l ./src/main/resources/*.txt
 2      108 ./src/main/resources/BroadcastNetworkNamesDbPedia.txt
 3     2580 ./src/main/resources/CityNamesDbpedia.txt
 4     1786 ./src/main/resources/CompanyNamesDbPedia.txt
 5      167 ./src/main/resources/CountryNamesDbpedia.txt
 6    14315 ./src/main/resources/MusicGroupNamesDbPedia.txt
 7    35606 ./src/main/resources/PeopleDbPedia.txt
 8      555 ./src/main/resources/PoliticalPartyNamesDbPedia.txt
 9      351 ./src/main/resources/TradeUnionNamesDbPedia.txt
10     3485 ./src/main/resources/UniversityNamesDbPedia.txt
11    58953 total

The URI for each entity serves as a unique identifier for a real-world entity or concept.

Library Implementation

The following UML class diagram shows the public methods and fields of the two classes in the package com.markwatson.ner_dbpedia used in this example, NerMaps and TextToDbpediaUris:

Overview of Java Class UML Diagram for this Example

As you see in the following figure showing the IntelliJ Community Edition project for this example, there are nine text files, one for each entity type, in the directory src/main/resources. Later we will look at the code required to read these files in two cases:

  • During development these files are read from target/classes.
  • During client application use of the JAR file (created using mvn install) these files are read as resources from the Java class loader.
IDE View of Project

The class com.markwatson.ner_dbpedia.NerMaps is a utility for reading the raw entity mapping data files and creating hash tables for these mappings:

 1 package com.markwatson.ner_dbpedia;
 2 
 3 import java.io.*;
 4 import java.nio.file.Files;
 5 import java.nio.file.Paths;
 6 import java.util.ArrayList;
 7 import java.util.HashMap;
 8 import java.util.List;
 9 import java.util.Map;
10 import java.util.stream.Stream;
11 
12 /**
13  * Copyright Mark Watson 2020. Apache 2 license.
14  */
15 public class NerMaps {
16 
17   private static String enforceAngleBrackets(String s) {
18     if (s.startsWith("<")) return s;
19     return "<" + s + ">";
20   }
21   private static Map<String, String> textFileToMap(String nerFileName) {
22     Map<String, String> ret = new HashMap<String, String>();
23     try {
24       InputStream in = ClassLoader.getSystemResourceAsStream(nerFileName);
25       BufferedReader reader = new BufferedReader(new InputStreamReader(in));
26       List<String> lines = new ArrayList<String>();
27       String line2;
28       while((line2 = reader.readLine()) != null) {
29         lines.add(line2);
30       }
31       reader.close();
32       lines.forEach(line -> {
33         String[] tokens = line.split("\t");
34         if (tokens.length > 1) {
35           ret.put(tokens[0], enforceAngleBrackets(tokens[1]));
36         }
37       });
38     } catch (Exception ex) {
39       ex.printStackTrace();
40     }
41     return ret;
42   }
43 
44   static public final Map<String, String> broadcastNetworks = 
45     textFileToMap("BroadcastNetworkNamesDbPedia.txt");
46   static public final Map<String, String> cityNames = 
47     textFileToMap("CityNamesDbpedia.txt");
48   static public final Map<String, String> companyNames = 
49     textFileToMap("CompanyNamesDbPedia.txt");
50   static public final Map<String, String> countryNames = 
51     textFileToMap("CountryNamesDbpedia.txt");
52   static public final Map<String, String> musicGroupNames = 
53     textFileToMap("MusicGroupNamesDbPedia.txt");
54   static public final Map<String, String> personNames = 
55     textFileToMap("PeopleDbPedia.txt");
56   static public final Map<String, String> politicalPartyNames = 
57     textFileToMap("PoliticalPartyNamesDbPedia.txt");
58   static public final Map<String, String> tradeUnionNames = 
59     textFileToMap("TradeUnionNamesDbPedia.txt");
60   static public final Map<String, String> universityNames = 
61     textFileToMap("UniversityNamesDbPedia.txt");
62 }
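
Because the maps are public static final fields, client code can use them directly for lookups once the class is loaded. Here is a minimal usage sketch (the entity names are taken from the data files; lookups for unknown names return null):

import com.markwatson.ner_dbpedia.NerMaps;

public class NerMapsUsage {
  public static void main(String[] args) {
    // Keys are entity names exactly as they appear in the resource files;
    // values are DBPedia URIs wrapped in angle brackets.
    System.out.println(NerMaps.personNames.get("Bill Clinton"));
    // prints: <http://dbpedia.org/resource/Bill_Clinton>
    System.out.println(NerMaps.countryNames.get("Guatemala"));
    // Lookups for names not in the data files simply return null:
    System.out.println(NerMaps.cityNames.get("No Such City"));
  }
}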

The class com.markwatson.ner_dbpedia.TextToDbpediaUris processes an input string and exposes the found entity names and matching DBPedia URIs as public fields. We will use this code later in the chapter Automatically Generating Data for Knowledge Graphs.

The code in the class TextToDbpediaUris is simple and repeats the same two patterns (an n-gram map lookup followed by a call to the method log) for each entity type. We will look at some of the code here.

 1 package com.markwatson.ner_dbpedia;
 2 
 3 import java.util.ArrayList;
 4 import java.util.List;
 5 
 6 public class TextToDbpediaUris {
 7   private TextToDbpediaUris() {
 8   }
 9 
10   public List<String> personUris = new ArrayList<String>();
11   public List<String> personNames = new ArrayList<String>();
12   public List<String> companyUris = new ArrayList<String>();
13   public List<String> companyNames = new ArrayList<>();

The no-argument constructor is private since it makes no sense to create an instance of TextToDbpediaUris without text input. The code supports nine entity types; here we show the definitions of the public output fields for just two of them (people and companies).

As a matter of programming style I generally no longer use getter and setter methods, preferring a more concise coding style. I usually give output fields package (default) visibility (i.e., no private or public modifier, so the fields are visible within the same package but not from other packages). Here I make them public because the package com.markwatson.ner_dbpedia developed here is meant to be used by other packages. If you prefer getter and setter methods, modern IDEs like IntelliJ and Eclipse can generate them for you for the example code in this book.

We handle entity names that are one, two, or three words long, checking for longer word sequences before shorter ones:

 1   public TextToDbpediaUris(String text) {
 2     String[] tokens = tokenize(text + " . . .");
 3     String uri = "";
 4     for (int i = 0, size = tokens.length - 2; i < size; i++) {
 5       String n2gram = tokens[i] + " " + tokens[i + 1];
 6       String n3gram = n2gram + " " + tokens[i + 2];
 7       // check for 3grams:
 8       if ((uri = NerMaps.personNames.get(n3gram)) != null) {
 9         log("person", i, i + 2, n3gram, uri);
10         i += 2;
11         continue;
12       }

The class NerMaps, listed earlier, converts the text files of entity-to-DBPedia-URI mappings into Java hash maps. The method log does two things:

  • Prints the entity type, the word indices in the original tokenized text, the entity name as a single string (the tokens for the entity joined together), and the DBPedia URI.
  • Saves the entity mapping in the public fields personUris, personNames, etc.

After we check for three-word entity names, we process two-word names and then one-word names. Here is an example:

1       // check for 2grams:
2       if ((uri = NerMaps.personNames.get(n2gram)) != null) {
3         log("person", i, i + 1, n2gram, uri);
4         i += 1;
5         continue;
6       }

The following listing shows the log method that writes descriptive output and saves entity mappings. We show only the code for the entity type person:

 1   public void log(String nerType, int index1, int index2, String ngram, String uri) {
 2     System.out.println(nerType + "\t" + index1 + "\t" + index2 + "\t" + 
 3                        ngram + "\t" + uri);
 4     if (!uri.startsWith("<")) uri = "<" + uri + ">";
 5     if (nerType.equals("person")) {
 6       if (!personUris.contains(uri)) {
 7         personUris.add(uri);
 8         personNames.add(ngram);
 9       }
10     }

For some NLP applications I will use a standard tokenizer like the OpenNLP tokenizer that we used in two previous chapters. Here, I simply add spaces around punctuation characters and use the Java String split method:

1   private String[] tokenize(String s) {
2     return s.replaceAll("\\.", " \\. ").
3              replaceAll(",", " , ").
4              replaceAll("\\?", " ? ").
5              replaceAll("\n", " ").
6              replaceAll(";", " ; ").split(" ");
7   }
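
To see what this simple tokenizer produces, here is a stand-alone sketch that applies the same replacement chain to a sample sentence (tokenize itself is private to the class, so its logic is just repeated here for illustration):

import java.util.Arrays;

public class TokenizeSketch {
  public static void main(String[] args) {
    String s = "Bill Clinton went to Guatemala.";
    // Same substitutions as the private tokenize method: add spaces around
    // a few punctuation characters, then split on single spaces.
    String[] tokens = s.replaceAll("\\.", " \\. ").
                        replaceAll(",", " , ").
                        replaceAll("\\?", " ? ").
                        replaceAll("\n", " ").
                        replaceAll(";", " ; ").split(" ");
    System.out.println(Arrays.toString(tokens));
    // prints: [Bill, Clinton, went, to, Guatemala, .]
  }
}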

The following listing shows a code snippet from the unit test class TextToDbpediaUrisTest that calls the TextToDbpediaUris constructor with a text sample (JUnit boilerplate code is not shown):

 1 package com.markwatson.ner_dbpedia;
 2  
 3   ...
 4 
 5   /**
 6    * Test that is just for side effect printouts:
 7    */
 8   public void test1() throws Exception {
 9     String s = "PTL Satellite Network covered President Bill Clinton going to "   
10       + " Guatemala and visiting the Coca Cola Company.";
11     TextToDbpediaUris test = new TextToDbpediaUris(s);
12   }
13 }

The object test created on line 11 contains public fields for accessing the entity names and corresponding URIs. We will use these fields in the later chapters Automatically Generating Data for Knowledge Graphs and Knowledge Graph Navigator.

Here is the output from running the unit test code:

1 broadcastNetwork 0 2 PTL Satellite Network <http://dbpedia.org/resource/PTL_Satellite_Network>
2 person   5   6   Bill Clinton   <http://dbpedia.org/resource/Bill_Clinton>
3 country  9 10  Guatemala     <http://dbpedia.org/resource/Guatemala>
4 company 13   14  Coca Cola  <http://dbpedia.org/resource/Coca-Cola>
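
Client code does not need to rely on the printed output; after the constructor runs, the collected names and URIs are available in the public list fields. A minimal sketch (the variable names here are only illustrative):

import com.markwatson.ner_dbpedia.TextToDbpediaUris;

public class EntityFieldsSketch {
  public static void main(String[] args) {
    String text = "PTL Satellite Network covered President Bill Clinton going to "
                + " Guatemala and visiting the Coca Cola Company.";
    TextToDbpediaUris entities = new TextToDbpediaUris(text);
    // The constructor fills the public list fields; names and URIs for a
    // given entity type are stored at matching indices.
    for (int i = 0; i < entities.personNames.size(); i++) {
      System.out.println(entities.personNames.get(i) + " -> " +
                         entities.personUris.get(i));
    }
  }
}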

Wrap-up for Resolving Entity Names to DBPedia References

The idea behind this example is simple but it is useful for information processing applications that take raw text input. We will use this library later in two semantic web examples.