Automatically Generating Data for Knowledge Graphs

In this chapter we develop a complete application using the package developed in the earlier chapter Resolve Entity Names to DBPedia References. The Knowledge Graph Creator (KGcreator) is a tool for automating the generation of data for Knowledge Graphs from raw text data; here it generates RDF data. You might also be interested in the Knowledge Graph Creator implementation in my Common Lisp book, which generates data for the Neo4J open source graph database in addition to RDF data.

KGcreator generates data as RDF triples suitable for loading into any linked data/semantic web data store.

This example application works by identifying entities in text. Example entity types are people, companies, country names, city names, broadcast network names, political party names, and university names. We saw code for detecting entities earlier, in the chapter on resolving named entities to DBPedia URIs, and we will reuse that code here.

I originally wrote KGCreator as two research prototypes, one in Common Lisp (see my Common Lisp book) and one in Haskell. The example in this chapter is a port of these systems to Java.

Implementation Notes

The implementation is contained in a single Java class, KGC, and the JUnit test class KgcTest is used to process the test files included with this example.

As can be seen in the following figure, I have defined static strings for each entity type URI. For example, personTypeUri has the value <http://www.w3.org/2000/01/rdf-schema#person>.

Overview of Java Class UML Diagram for the Knowledge Graph Creator

The following figure shows a screen shot of this example project in the free Community Edition of IntelliJ.

IDE View of Project

Notice in this screen shot that there are several test files in the directory test_data. The files with the file extension .meta contain a single line which is the URI for the source of the text in the matching text file. For example, the meta file test1.meta provides the URI for the source of the text in the file test1.txt.
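
The naming convention for paired files can be sketched in a few lines of Java. The class name MetaFilePairing and the method metaPathFor are hypothetical names introduced only for this illustration; they are not part of the KGC class, which derives the meta file path with a substring call instead.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative sketch only: MetaFilePairing and metaPathFor are hypothetical
// names, not part of the example project.
class MetaFilePairing {

    // Maps a text file such as test_data/test1.txt to its companion
    // meta file test_data/test1.meta in the same directory.
    static Path metaPathFor(Path textPath) {
        String name = textPath.getFileName().toString();
        if (!name.endsWith(".txt")) {
            throw new IllegalArgumentException("Expected a .txt file: " + textPath);
        }
        String base = name.substring(0, name.length() - ".txt".length());
        return textPath.resolveSibling(base + ".meta");
    }

    public static void main(String[] args) {
        Path text = Paths.get("test_data", "test1.txt");
        System.out.println(text + " is paired with " + metaPathFor(text));
    }
}
```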

Generating RDF Data

RDF data consists of triples, where each triple has three parts: a subject, a predicate, and an object. In the data generated here, subjects and predicates are URIs, and objects are either literal values or URIs. Here are two triples written by this example application:

<http://dbpedia.org/resource/The_Wall_Street_Journal>
  <http://knowledgebooks.com/schema/aboutCompanyName>
  "Wall Street Journal" .
<https://newsshop.com/june/z902.html>
  <http://knowledgebooks.com/schema/containsCountryDbPediaLink>
  <http://dbpedia.org/resource/Canada> .
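
As a minimal sketch of how triples like these can be written as text, the following helper formats one triple per line in N-Triples style. The class TripleFormatter and its methods are hypothetical names used only for illustration. The escaping of quotes, backslashes, and line breaks in literal values is an addition that the KGC listing in this chapter does not perform; it only matters if an entity name contains such characters.

```java
// Illustrative sketch only: TripleFormatter is a hypothetical helper,
// not part of the example project.
class TripleFormatter {

    // Escapes the characters that may not appear unescaped inside an
    // N-Triples string literal (tab is escaped as well for readability).
    static String escapeLiteral(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '"':  sb.append("\\\""); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    // A triple whose object is a URI.
    static String uriTriple(String subject, String predicate, String object) {
        return "<" + subject + "> <" + predicate + "> <" + object + "> .";
    }

    // A triple whose object is a plain string literal.
    static String literalTriple(String subject, String predicate, String value) {
        return "<" + subject + "> <" + predicate + "> \"" + escapeLiteral(value) + "\" .";
    }

    public static void main(String[] args) {
        System.out.println(literalTriple(
            "http://dbpedia.org/resource/The_Wall_Street_Journal",
            "http://knowledgebooks.com/schema/aboutCompanyName",
            "Wall Street Journal"));
        System.out.println(uriTriple(
            "https://newsshop.com/june/z902.html",
            "http://knowledgebooks.com/schema/containsCountryDbPediaLink",
            "http://dbpedia.org/resource/Canada"));
    }
}
```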

The following listing of the file KGC.java contains the implementation of the main Java class for generating RDF data. Code for the different entity types is similar, so the listing only shows the code for handling people and companies. The listing is reformatted to fit the page width:

package com.knowledgegraphcreator;

import com.markwatson.ner_dbpedia.TextToDbpediaUris;

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/**
 * Java implementation of Knowledge Graph Creator.
 *
 * Copyright 2020 Mark Watson. All Rights Reserved. Apache 2 license.
 *
 * For documentation see my book "Practical Artificial Intelligence Programming
 * With Java", chapter "Automatically Generating Data for Knowledge Graphs"
 * at https://leanpub.com/javaai that can be read free online.
 *
 */

public class KGC  {

    // Predicate and type URIs, already wrapped in angle brackets for output:
    static String subjectUri = 
      "<http://www.w3.org/1999/02/22-rdf-syntax-ns#/subject>";
    static String labelUri = 
      "<http://www.w3.org/1999/02/22-rdf-syntax-ns#/label>";
    static String personTypeUri = 
      "<http://www.w3.org/2000/01/rdf-schema#person>";
    static String companyTypeUri = 
      "<http://www.w3.org/2000/01/rdf-schema#company>";

    static String typeOfUri = 
      "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>";

    private KGC() { }
    
    public KGC(String directoryPath, String outputRdfPath) throws IOException {
        System.out.println("KGN");
        PrintStream out = new PrintStream(outputRdfPath);
        File dir = new File(directoryPath);
        File[] directoryListing = dir.listFiles();
        if (directoryListing != null) {
            for (File child : directoryListing) {
                System.out.println("child:" + child);
                if (child.toString().endsWith(".txt")) {
                    // open the meta file with the same base name
                    // (replace the ".txt" extension with ".meta"):
                    String metaAbsolutePath = child.getAbsolutePath();
                    File meta = 
                      new File(metaAbsolutePath.substring(0,
                                                      metaAbsolutePath.length() - 4)
                                                       + ".meta");
                    System.out.println("meta:" + meta);
                    String [] text_and_meta = 
                      readData(child.getAbsolutePath(), meta.getAbsolutePath());
                    // the meta file holds the source URI of the text:
                    String metaData = "<" + text_and_meta[1].strip() + ">";
                    TextToDbpediaUris kt = 
                      new TextToDbpediaUris(text_and_meta[0]);
                    // three triples per person: source link, label, and type
                    for (int i=0; i<kt.personNames.size(); i++) {
                        out.println(metaData + " " + subjectUri + " " + 
                                    kt.personUris.get(i) + " .");
                        out.println(kt.personUris.get(i) + " " + labelUri + 
                                    " \"" + kt.personNames.get(i) + "\" .");
                        out.println(kt.personUris.get(i) + " " + typeOfUri + 
                                    " " + personTypeUri + " .");
                    }
                    // three triples per company: source link, label, and type
                    for (int i=0; i<kt.companyNames.size(); i++) {
                        out.println(metaData + " " + 
                                    subjectUri + " " + 
                                    kt.companyUris.get(i) + " .");
                        out.println(kt.companyUris.get(i) + " " +
                                   labelUri + " \"" +
                                   kt.companyNames.get(i) + "\" .");
                        out.println(kt.companyUris.get(i) + " " + typeOfUri + 
                                    " " + companyTypeUri + " .");
                    }
                }
            }
        }
        out.close();
    }

    // returns a two element array: { text file contents, meta file contents }
    private String [] readData(String textPath, String metaPath) throws IOException {
        String text = Files.readString(Paths.get(textPath), StandardCharsets.UTF_8);
        String meta = Files.readString(Paths.get(metaPath), StandardCharsets.UTF_8);
        System.out.println("\n\n** text:\n\n" + text);
        return new String[] { text, meta };
    }
}

This code works on a directory of paired files: a text file and the meta data for that text file. As an example, if there is an input text file test123.txt then there must be a matching meta file test123.meta that contains the source of the data in the file test123.txt. This data source will be a URI on the web or a local file URI. The class constructor for KGC takes the path of the directory containing the paired text and meta files and an output file path for writing the generated RDF data.
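
Because every text file needs a companion meta file, it can be handy to check an input directory before running the generator. The following is a minimal sketch of such a check, assuming the .txt/.meta naming convention described above. The class InputDirectoryCheck and the method textFilesMissingMeta are hypothetical names for illustration and are not part of the KGC class.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only: InputDirectoryCheck is a hypothetical helper,
// not part of the example project.
class InputDirectoryCheck {

    // Returns the .txt files in dir that have no matching .meta file,
    // sorted by path.
    static List<Path> textFilesMissingMeta(Path dir) throws IOException {
        List<Path> missing = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.txt")) {
            for (Path textFile : stream) {
                String name = textFile.getFileName().toString();
                String base = name.substring(0, name.length() - ".txt".length());
                Path metaFile = textFile.resolveSibling(base + ".meta");
                if (!Files.isRegularFile(metaFile)) {
                    missing.add(textFile);
                }
            }
        }
        Collections.sort(missing);
        return missing;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : "test_data");
        if (!Files.isDirectory(dir)) {
            System.out.println("Not a directory: " + dir);
            return;
        }
        for (Path p : textFilesMissingMeta(dir)) {
            System.out.println("No meta file for: " + p);
        }
    }
}
```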

The junit test class KgcTest will process the local directory test_data and generate an RDF output file:

package com.knowledgegraphcreator;

import junit.framework.Test;
import junit.framework.TestCase;
import junit.framework.TestSuite;

public class KgcTest extends TestCase {

  public KgcTest(String testName) {
    super(testName);
  }

  public static Test suite() {
    return new TestSuite(KgcTest.class);
  }

  public void testKGC() throws Exception {
    assertTrue(true);
    KGC client = new KGC("test_data/", "output_with_duplicates.rdf");
  }
  private static void pause() {
    try { Thread.sleep(2000);
    } catch (Exception ignore) { }
  }
}

If the same entity names occur in multiple input files then some duplicated RDF statements will be generated. The simplest way to deal with this is to add a one-line call to the awk utility to efficiently remove duplicate lines in the RDF output file. Here is a listing of the Makefile for this example:

create_data_and_remove_duplicates:
	mvn test
	echo "Removing duplicate RDF statements"
	awk '!visited[$$0]++' output_with_duplicates.rdf > output.rdf
	rm -f output_with_duplicates.rdf

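The awk expression !visited[$0]++ prints a line only the first time it is seen, and the doubled $$ in the Makefile is how make passes a literal $ through to the shell. If you are not familiar with awk and want to learn the basics then I recommend this short tutorial.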
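
If you would rather stay in Java than call awk, the same order-preserving duplicate removal can be sketched with a LinkedHashSet. This is an alternative to the Makefile step, not what the example project does. The class name DuplicateLineRemover is hypothetical, and the sketch reads the whole file into memory, which is assumed to be acceptable for test data of this size.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Illustrative sketch only: DuplicateLineRemover is a hypothetical helper,
// not part of the example project.
class DuplicateLineRemover {

    // Keeps the first occurrence of each line and preserves the original order.
    static List<String> removeDuplicates(List<String> lines) {
        return new ArrayList<>(new LinkedHashSet<>(lines));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("usage: DuplicateLineRemover <input.rdf> <output.rdf>");
            return;
        }
        List<String> lines =
            Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        Files.write(Paths.get(args[1]), removeDuplicates(lines),
                    StandardCharsets.UTF_8);
    }
}
```

With the file names used in the Makefile, the arguments would be output_with_duplicates.rdf and output.rdf.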

KGCreator Wrap Up

When developing applications or systems using Knowledge Graphs it is useful to be able to quickly generate test data, which is the primary purpose of KGCreator. A secondary use is to generate Knowledge Graphs for production use from text data sources. In this second use case you will want to manually inspect the generated data to verify its correctness and usefulness for your application.
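
As a starting point for that kind of inspection, the following sketch counts how many triples use each predicate. It assumes the output format written by the KGC class in this chapter, that is, one triple per line with the subject, predicate, and object separated by spaces. The class name PredicateCounts is hypothetical, and this is a rough summary tool rather than a real RDF parser.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: PredicateCounts is a hypothetical helper,
// not part of the example project.
class PredicateCounts {

    // Counts triples per predicate, assuming one triple per line in the
    // form: subject predicate object .
    static Map<String, Integer> countByPredicate(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+", 3);
            if (parts.length < 3) {
                continue; // skip blank or malformed lines
            }
            counts.merge(parts[1], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "<http://example.com/doc1> <http://example.com/mentions> <http://example.com/a> .",
            "<http://example.com/a> <http://example.com/label> \"Entity A\" .",
            "<http://example.com/doc1> <http://example.com/mentions> <http://example.com/b> .");
        countByPredicate(sample).forEach(
            (predicate, n) -> System.out.println(n + "\t" + predicate));
    }
}
```

The example.com URIs in the sample data are placeholders. To summarize real output, read the lines of the generated RDF file and pass them to countByPredicate.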