Knowledge Graph Creator

The large project described here processes raw text inputs and generates knowledge graph data in two formats: Cypher import data for the Neo4J graph database and RDF for semantic web and linked data applications.

This application works by identifying entities in text. Example entity types are people, companies, country names, city names, broadcast network names, political party names, and university names. We saw code for detecting entities earlier, in the chapter on natural language processing (NLP), and we will reuse that code here. Later we will discuss three strategies for reusing code across different projects.

The following figure shows part of a Neo4J Knowledge Graph created with the example code. The displayed nodes have shortened labels, but Neo4J offers a web browser-based console that lets you interactively explore Knowledge Graphs. We don’t cover setting up Neo4J here, so please refer to the Neo4J documentation. As an introduction to RDF data, the semantic web, and linked data, you can get free copies of my two books Practical Semantic Web and Linked Data Applications, Common Lisp Edition and Practical Semantic Web and Linked Data Applications, Java, Scala, Clojure, and JRuby Edition.

Figure 3. Part of a Knowledge Graph shown in Neo4J web application console

There are two versions of this project that handle duplicate generated data in different ways:

  • As either Neo4J Cypher data or RDF triples data are created, store generated data in a SQLite embedded database. Check this database before writing new output data.
  • Ignore the problem of generating duplicate data and filter out duplicates in the outer processing pipeline that uses the Knowledge Graph Creator as one processing step.

For my own work I chose the second method since filtering duplicates is as easy as a few Makefile targets (the following listing is in the file Makefile in the directory haskell_tutorial_cookbook_examples/knowledge_graph_creator_pure):

all: gendata rdf cypher

gendata:
    stack build --fast --exec Dev-exe

rdf:
    echo "Removing duplicate RDF statements"
    awk '!visited[$$0]++' out.n3 > output.n3
    rm -f out.n3

cypher:
    echo "Removing duplicate Cypher statements"
    awk '!visited[$$0]++' out.cypher > output.cypher
    rm -f out.cypher

The Haskell KGCreator application we develop here writes the output files out.n3 (N3 is an RDF data format) and out.cypher (Cypher is the data import format and query language for the Neo4J open source and commercial graph database). The awk commands remove duplicate lines and write the de-duplicated data to output.n3 and output.cypher.

We will use this second approach, but the next section provides sufficient information (and a pointer to alternative code) in case you are interested in using SQLite to prevent duplicate data generation.

Notes for Using SQLite to Avoid Duplicates (Optional Material)

We saw two methods for avoiding duplicates in generated data in the last section. Implementing the first method is left as an exercise, but here are some notes to get you started: modify the example code to use the utility module BlackBoard.hs in the directory knowledge_graph_creator_pure/src/fileutils and implement the logic sketched below to check whether newly generated data is already in the SQLite database. This exercise is also a good example of wrapping the embedded SQLite library in the IO monad. If you are not interested, skip to the next section.

Before you write either an RDF statement or a Neo4J Cypher data import statement, check to see if the statement has already been written using something like:

  check <- blackboard_check_key new_data_uri
  if check
    then return ()   -- duplicate: skip writing this statement
    else ...         -- write the statement as usual

and after writing a RDF statement or a Neo4J Cypher data import statement, write it to the temporary SQLite database using something like:

  blackboard_write newStatementString
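For completeness, here is a minimal sketch of what such a module could look like, using the sqlite-simple package. The actual BlackBoard.hs in the example code may differ; the module name, database file name, table name, and exact function signatures below are my assumptions:

{-# LANGUAGE OverloadedStrings #-}
module BlackBoardSketch
  ( blackboard_check_key
  , blackboard_write
  ) where

import Database.SQLite.Simple

-- Hypothetical database file used to record already-written statements:
dbFile :: FilePath
dbFile = "blackboard.db"

-- Create the statements table if it does not already exist.
ensureTable :: Connection -> IO ()
ensureTable conn =
  execute_ conn
    "CREATE TABLE IF NOT EXISTS statements (statement TEXT PRIMARY KEY)"

-- Return True if a statement (or URI) has already been recorded.
blackboard_check_key :: String -> IO Bool
blackboard_check_key key = do
  conn <- open dbFile
  ensureTable conn
  rows <- query conn
            "SELECT statement FROM statements WHERE statement = ?"
            (Only key) :: IO [Only String]
  close conn
  return (not (null rows))

-- Record a newly written statement so it will not be written again.
blackboard_write :: String -> IO ()
blackboard_write statement = do
  conn <- open dbFile
  ensureTable conn
  execute conn
    "INSERT OR IGNORE INTO statements (statement) VALUES (?)"
    (Only statement)
  close conn

Opening and closing the database for every statement is slow; in a production system you would open one connection at a higher level (for example in Apis.hs) and pass it down to the generation code.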

For the rest of the chapter we will use the approach of not keeping track of generated data in SQLite and instead remove duplicates during post-processing using the standard awk command line utility.

This section is optional. In the rest of this chapter we use the example code in knowledge_graph_creator_pure.

Code Layout for the KGCreator Project and Strategies for Sharing Haskell Code Between Projects

We will reuse the code for finding entities that we studied in an earlier chapter. There are several ways to reuse code from multiple local Haskell projects:

  • In a project’s cabal file, use relative paths to the source code of other projects. This is my preferred way to work, but it has the drawback that the stack sdist command for making a distribution tarball does not work with relative paths. If this is a problem for you, then create relative symbolic file links to the source directories in the other projects.
  • In your project’s stack.yaml file, add the other project’s name and path as an extra-deps entry.
  • For library projects, define a package definition and install the library globally on your system.

I almost always use the first method for projects that depend on other local projects, and that is also the approach we use here. The relevant lines in the file KGCreator.cabal are:

 1 library
 2   exposed-modules:
 3       CorefWebClient
 4       NlpWebClient
 5       ClassificationWebClient
 6       DirUtils
 7       FileUtils
 8       BlackBoard
 9       GenTriples
10       GenNeo4jCypher
11       Apis
12       Categorize
13       NlpUtils
14       Summarize
15       Entities
16   other-modules:
17       Paths_KGCreator
18       BroadcastNetworkNamesDbPedia
19       Category1Gram
20       Category2Gram
21       CityNamesDbpedia
22       CompanyNamesDbpedia
23       CountryNamesDbpedia
24       PeopleDbPedia
25       PoliticalPartyNamesDbPedia
26       Sentence
27       Stemmer
28       TradeUnionNamesDbPedia
29       UniversityNamesDbPedia
30 
31   hs-source-dirs:
32       src
33       src/webclients
34       src/fileutils
35       src/sw
36       src/toplevel
37       ../NlpTool/src/nlp
38       ../NlpTool/src/nlp/data

This is a standard looking cabal file except for lines 37 and 38, where the source paths reference the example code for the NlpTool application developed in a previous chapter. The exposed module BlackBoard (line 8) is not used, but I leave it in the cabal file in case you want to experiment with recording generated data in SQLite to avoid data duplication. You are also likely to want BlackBoard if you modify this example to continuously process incoming data in a production system. This is left as an exercise.

Before going into too much detail on the implementation let’s look at the layout of the project code:

 1 src/fileutils:
 2 BlackBoard.hs   DirUtils.hs FileUtils.hs
 3 
 4 ../NlpTool/src/nlp:
 5 Categorize.hs   Entities.hs NlpUtils.hs Sentence.hs Stemmer.hs  Summarize.hs    data
 6 
 7 ../NlpTool/src/nlp/data:
 8 BroadcastNetworkNamesDbPedia.hs CompanyNamesDbpedia.hs      TradeUnionNamesDbPedia.hs
 9 Category1Gram.hs        CountryNamesDbpedia.hs      UniversityNamesDbPedia.hs
10 Category2Gram.hs        PeopleDbPedia.hs
11 CityNamesDbpedia.hs     PoliticalPartyNamesDbPedia.hs
12 
13 src/sw:
14 GenNeo4jCypher.hs   GenTriples.hs
15 
16 src/toplevel:
17 Apis.hs

As mentioned before, we are using the Haskell source files in the relative path ../NlpTool/src/… and in the local src directory. We discuss this code in the next few sections.

The Main Event: Detecting Entities in Text

A primary task in KGCreator is identifying entities (people, places, etc.) in text. We then create RDF and Neo4J Cypher data statements using these entities, the known origin of the text data, and general relationships between entities.

We will use the top level code that we developed earlier that is located in the directory ../NlpTool/src/nlp (please see the chapter Natural Language Processing Tools for more detail):

  • Categorize.hs - categorizes text into categories like news, religion, business, politics, science, etc.
  • Entities.hs - identifies entities like people, companies, places, broadcast networks, labor unions, etc. in text
  • Summarize.hs - creates an extractive summary of text

The KGCreator Haskell application looks in a specified directory for text files to process. For each file with a .txt extension there should be a matching file with the extension .meta that contains a single line: the URI of the web location where the corresponding text was found. The reason we need this is that we want to create graph knowledge data from information found in text sources and the original location of the data is important to preserve. In other words, we want to know where the data elements in our knowledge graph came from.
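For example (the file names here are hypothetical), an input directory might contain a file article1.txt with the raw text to process and a matching file article1.meta whose single line is the source URI, such as https://newsshop.com/june/z902.html (the URI used in the example data later in this chapter).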

We have not yet looked at an example of using command line arguments, so let’s go into some detail on how to do this. Previously, when we defined an output target executable in our .cabal file (in this case KGCreator-exe), we could use stack to build the executable and run it with:

1 stack build --fast --exec KGCreator-exe

Now, we have an executable that requires two arguments: a source input directory and the file root for generated RDF and Cypher output files. We can pass command line arguments using this notation:

1 stack build --fast --exec "KGCreator-exe test_data outtest"

The two command line arguments are:

  • test_data which is the file path of a local directory containing the input files
  • outtest which is the root file name for generated Neo4J Cypher and RDF output files
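With these arguments the program writes the generated data to the files outtest.n3 and outtest.cypher; as we will see in the Main.hs listing below, the extensions .n3 and .cypher are appended to the output file root.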

If you are using KGCreator in production, then you will want to copy the compiled and linked executable file KGCreator-exe to somewhere on your PATH like /usr/local/bin.
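One way to do this (a sketch, assuming default stack settings) is:

stack install        # copies KGCreator-exe to ~/.local/bin
# or copy the binary to /usr/local/bin manually:
sudo cp "$(stack path --local-install-root)/bin/KGCreator-exe" /usr/local/bin/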

The following listing shows the file app/Main.hs, the main program for this example that handles command line arguments and calls two top level functions in src/toplevel/Apis.hs:

 1 module Main where
 2 
 3 import System.Environment (getArgs)
 4 import Apis (processFilesToRdf, processFilesToNeo4j)
 5 
 6 main :: IO ()
 7 main = do
 8   args <- getArgs
 9   case args of
10     [] -> error "must supply an input directory containing text and meta files"
11     [_] -> error "in addition to an input directory, also specify a root file name f\
12 or the generated RDF and Cypher files"
13     [inputDir, outputFileRoot] -> do
14         processFilesToRdf   inputDir $ outputFileRoot ++ ".n3"
15         processFilesToNeo4j inputDir $ outputFileRoot ++ ".cypher"
16     _ -> error "too many arguments"

Here we use getArgs in line 8 to fetch the list of command line arguments and verify that exactly two arguments have been provided. We will look at the functions processFilesToRdf and processFilesToNeo4j, and the functions they call, in the next three sections.

Utility Code for Generating RDF

The code for generating RDF and for generating Neo4J Cypher data is similar. We start with the code to generate RDF triples. Before we look at the code, let’s start with a few lines of generated RDF:

<http://dbpedia.org/resource/The_Wall_Street_Journal> 
  <http://knowledgebooks.com/schema/aboutCompanyName> 
  "Wall Street Journal" .
<https://newsshop.com/june/z902.html>
  <http://knowledgebooks.com/schema/containsCountryDbPediaLink>
  <http://dbpedia.org/resource/Canada> .

The next listing shows the file src/sw/GenTriples.hs that finds entities like broadcast network names, city names, company names, people’s names, political party names, and university names in text and generates RDF triple data. If you need to add more entity types for your own applications, then use the following steps:

  • Look at the format of entity data for the NlpTool example and add names for the new entity type you are adding.
  • Add a utility function to find instances of the new entity type to NlpTools. For example, if you are adding a new entity type “park names”, then copy the code for companyNames to parkNames, modify as necessary, and export parkNames.
  • In the following code, add new code for the new entity type after lines 10, 97, 151, and 261. Use the code for companyNames as an example (see the sketch after this list).
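For example, if we were adding the hypothetical entity type “park names” (parkNames does not exist in the NlpTool code, and the predicate URIs below are also my invention), the additions to textToTriples would mirror the existing companyNames code:

  let parks = parkNames word_tokens
  let park_triples1 =
        concat
          [ generate_triple
              (uri meta_data)
              "<http://knowledgebooks.com/schema/containsParkDbPediaLink>"
              (snd pair)
          | pair <- parks
          ]
  let park_triples2 =
        concat
          [ generate_triple
              (snd pair)
              "<http://knowledgebooks.com/schema/aboutParkName>"
              (make_literal (fst pair))
          | pair <- parks
          ]

park_triples1 and park_triples2 would then be added to the list in the final concat, and parkNames would be added to the Entities import list.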

The map category_to_uri_map created in lines 36 to 84 maps a topic name to a linked data URI that describes the topic. For example, we would not refer to an information source as being about the topic “economics”, but would instead refer to a linked data URI like http://knowledgebooks.com/schema/topic/economics. The utility function uri_from_category takes a text description of a topic like “economy” and converts it to an appropriate URI using the map category_to_uri_map.

The utility function textToTriples takes the path to a text input file and the path to its meta file, calculates the string containing the generated triples for the input text, and returns the result wrapped in the IO monad.

  1 module GenTriples
  2   ( textToTriples
  3   , category_to_uri_map
  4   ) where
  5 
  6 import Categorize (bestCategories)
  7 import Entities
  8   ( broadcastNetworkNames
  9   , cityNames
 10   , companyNames
 11   , countryNames
 12   , peopleNames
 13   , politicalPartyNames
 14   , tradeUnionNames
 15   , universityNames
 16   )
 17 import FileUtils
 18   ( MyMeta
 19   , filePathToString
 20   , filePathToWordTokens
 21   , readMetaFile
 22   , uri
 23   )
 24 import Summarize (summarize, summarizeS)
 25 
 26 import qualified Data.Map as M
 27 import Data.Maybe (fromMaybe)
 28 
 29 generate_triple :: [Char] -> [Char] -> [Char] -> [Char]
 30 generate_triple s p o = s ++ "  " ++ p ++ "  " ++ o ++ " .\n"
 31 
 32 make_literal :: [Char] -> [Char]
 33 make_literal s = "\"" ++ s ++ "\""
 34 
 35 category_to_uri_map :: M.Map [Char] [Char]
 36 category_to_uri_map =
 37   M.fromList
 38     [ ("news_weather", "<http://knowledgebooks.com/schema/topic/weather>")
 39     , ("news_war", "<http://knowledgebooks.com/schema/topic/war>")
 40     , ("economics", "<http://knowledgebooks.com/schema/topic/economics>")
 41     , ("news_economy", "<http://knowledgebooks.com/schema/topic/economics>")
 42     , ("news_politics", "<http://knowledgebooks.com/schema/topic/politics>")
 43     , ("religion", "<http://knowledgebooks.com/schema/topic/religion>")
 44     , ( "religion_buddhism"
 45       , "<http://knowledgebooks.com/schema/topic/religion/buddhism>")
 46     , ( "religion_islam"
 47       , "<http://knowledgebooks.com/schema/topic/religion/islam>")
 48     , ( "religion_christianity"
 49       , "<http://knowledgebooks.com/schema/topic/religion/christianity>")
 50     , ( "religion_hinduism"
 51       , "<http://knowledgebooks.com/schema/topic/religion/hinduism>")
 52     , ( "religion_judaism"
 53       , "<http://knowledgebooks.com/schema/topic/religion/judaism>")
 54     , ("chemistry", "<http://knowledgebooks.com/schema/topic/chemistry>")
 55     , ("computers", "<http://knowledgebooks.com/schema/topic/computers>")
 56     , ("computers_ai", "<http://knowledgebooks.com/schema/topic/computers/ai>")
 57     , ( "computers_ai_datamining"
 58       , "<http://knowledgebooks.com/schema/topic/computers/ai/datamining>")
 59     , ( "computers_ai_learning"
 60       , "<http://knowledgebooks.com/schema/topic/computers/ai/learning>")
 61     , ( "computers_ai_nlp"
 62       , "<http://knowledgebooks.com/schema/topic/computers/ai/nlp>")
 63     , ( "computers_ai_search"
 64       , "<http://knowledgebooks.com/schema/topic/computers/ai/search>")
 65     , ( "computers_ai_textmining"
 66       , "<http://knowledgebooks.com/schema/topic/computers/ai/textmining>")
 67     , ( "computers/programming"
 68       , "<http://knowledgebooks.com/schema/topic/computers/programming>")
 69     , ( "computers_microsoft"
 70       , "<http://knowledgebooks.com/schema/topic/computers/microsoft>")
 71     , ( "computers/programming/ruby"
 72       , "<http://knowledgebooks.com/schema/topic/computers/programming/ruby>")
 73     , ( "computers/programming/lisp"
 74       , "<http://knowledgebooks.com/schema/topic/computers/programming/lisp>")
 75     , ("health", "<http://knowledgebooks.com/schema/topic/health>")
 76     , ( "health_exercise"
 77       , "<http://knowledgebooks.com/schema/topic/health/exercise>")
 78     , ( "health_nutrition"
 79       , "<http://knowledgebooks.com/schema/topic/health/nutrition>")
 80     , ("mathematics", "<http://knowledgebooks.com/schema/topic/mathematics>")
 81     , ("news_music", "<http://knowledgebooks.com/schema/topic/music>")
 82     , ("news_physics", "<http://knowledgebooks.com/schema/topic/physics>")
 83     , ("news_sports", "<http://knowledgebooks.com/schema/topic/sports>")
 84     ]
 85 
 86 uri_from_category :: [Char] -> [Char]
 87 uri_from_category key =
 88   fromMaybe ("\"" ++ key ++ "\"") $ M.lookup key category_to_uri_map
 89 
 90 textToTriples :: FilePath -> [Char] -> IO [Char]
 91 textToTriples file_path meta_file_path = do
 92   word_tokens <- filePathToWordTokens file_path
 93   contents <- filePathToString file_path
 94   putStrLn $ "** contents:\n" ++ contents ++ "\n"
 95   meta_data <- readMetaFile meta_file_path
 96   let people = peopleNames word_tokens
 97   let companies = companyNames word_tokens
 98   let countries = countryNames word_tokens
 99   let cities = cityNames word_tokens
100   let broadcast_networks = broadcastNetworkNames word_tokens
101   let political_parties = politicalPartyNames word_tokens
102   let trade_unions = tradeUnionNames word_tokens
103   let universities = universityNames word_tokens
104   let a_summary = summarizeS contents
105   let the_categories = bestCategories word_tokens
106   let filtered_categories =
107         map (uri_from_category . fst) $
108         filter (\(name, value) -> value > 0.3) the_categories
109   putStrLn "\nfiltered_categories:"
110   print filtered_categories
111   --putStrLn "a_summary:"
112   --print a_summary
113   --print $ summarize contents
114 
115   let summary_triples =
116         generate_triple
117           (uri meta_data)
118           "<http://knowledgebooks.com/schema/summaryOf>" $
119         "\"" ++ a_summary ++ "\""
120   let category_triples =
121         concat
122           [ generate_triple
123             (uri meta_data)
124             "<http://knowledgebooks.com/schema/news/category/>"
125             cat
126           | cat <- filtered_categories
127           ]
128   let people_triples1 =
129         concat
130           [ generate_triple
131             (uri meta_data)
132             "<http://knowledgebooks.com/schema/containsPersonDbPediaLink>"
133             (snd pair)
134           | pair <- people
135           ]
136   let people_triples2 =
137         concat
138           [ generate_triple
139             (snd pair)
140             "<http://knowledgebooks.com/schema/aboutPersonName>"
141             (make_literal (fst pair))
142           | pair <- people
143           ]
144   let company_triples1 =
145         concat
146           [ generate_triple
147             (uri meta_data)
148             "<http://knowledgebooks.com/schema/containsCompanyDbPediaLink>"
149             (snd pair)
150           | pair <- companies
151           ]
152   let company_triples2 =
153         concat
154           [ generate_triple
155             (snd pair)
156             "<http://knowledgebooks.com/schema/aboutCompanyName>"
157             (make_literal (fst pair))
158           | pair <- companies
159           ]
160   let country_triples1 =
161         concat
162           [ generate_triple
163             (uri meta_data)
164             "<http://knowledgebooks.com/schema/containsCountryDbPediaLink>"
165             (snd pair)
166           | pair <- countries
167           ]
168   let country_triples2 =
169         concat
170           [ generate_triple
171             (snd pair)
172             "<http://knowledgebooks.com/schema/aboutCountryName>"
173             (make_literal (fst pair))
174           | pair <- countries
175           ]
176   let city_triples1 =
177         concat
178           [ generate_triple
179             (uri meta_data)
180             "<http://knowledgebooks.com/schema/containsCityDbPediaLink>"
181             (snd pair)
182           | pair <- cities
183           ]
184   let city_triples2 =
185         concat
186           [ generate_triple
187             (snd pair)
188             "<http://knowledgebooks.com/schema/aboutCityName>"
189             (make_literal (fst pair))
190           | pair <- cities
191           ]
192   let bnetworks_triples1 =
193         concat
194           [ generate_triple
195             (uri meta_data)
196             "<http://knowledgebooks.com/schema/containsBroadCastDbPediaLink>"
197             (snd pair)
198           | pair <- broadcast_networks
199           ]
200   let bnetworks_triples2 =
201         concat
202           [ generate_triple
203             (snd pair)
204             "<http://knowledgebooks.com/schema/aboutBroadCastName>"
205             (make_literal (fst pair))
206           | pair <- broadcast_networks
207           ]
208   let pparties_triples1 =
209         concat
210           [ generate_triple
211             (uri meta_data)
212             "<http://knowledgebooks.com/schema/containsPoliticalPartyDbPediaLink>"
213             (snd pair)
214           | pair <- political_parties
215           ]
216   let pparties_triples2 =
217         concat
218           [ generate_triple
219             (snd pair)
220             "<http://knowledgebooks.com/schema/aboutPoliticalPartyName>"
221             (make_literal (fst pair))
222           | pair <- political_parties
223           ]
224   let unions_triples1 =
225         concat
226           [ generate_triple
227             (uri meta_data)
228             "<http://knowledgebooks.com/schema/containsTradeUnionDbPediaLink>"
229             (snd pair)
230           | pair <- trade_unions
231           ]
232   let unions_triples2 =
233         concat
234           [ generate_triple
235             (snd pair)
236             "<http://knowledgebooks.com/schema/aboutTradeUnionName>"
237             (make_literal (fst pair))
238           | pair <- trade_unions
239           ]
240   let universities_triples1 =
241         concat
242           [ generate_triple
243             (uri meta_data)
244             "<http://knowledgebooks.com/schema/containsUniversityDbPediaLink>"
245             (snd pair)
246           | pair <- universities
247           ]
248   let universities_triples2 =
249         concat
250           [ generate_triple
251             (snd pair)
252             "<http://knowledgebooks.com/schema/aboutUniversityName>"
253             (make_literal (fst pair))
254           | pair <- universities
255           ]
256   return $
257     concat
258       [ people_triples1
259       , people_triples2
260       , company_triples1
261       , company_triples2
262       , country_triples1
263       , country_triples2
264       , city_triples1
265       , city_triples2
266       , bnetworks_triples1
267       , bnetworks_triples2
268       , pparties_triples1
269       , pparties_triples2
270       , unions_triples1
271       , unions_triples2
272       , universities_triples1
273       , universities_triples2
274       , category_triples
275       , summary_triples
276       ]

The code in this file could be shortened, but having repetitive code for each entity type hopefully makes it easier for you to understand how it works.

This code processes text from a given file and generates RDF triples (subject-predicate-object statements) based on the extracted information.

Key Functionality

  1. category_to_uri_map: A map defining the correspondence between categories and their URIs.
  2. uri_from_category: Retrieves the URI associated with a category, or returns the category itself in quotes if not found in the map.
  3. textToTriples:
    • Takes file paths for the text and metadata files.
    • Extracts various entities (people, companies, countries, etc.) and categories from the text.
    • Generates RDF triples representing:
      • Summary of the text
      • Categories associated with the text
      • Links between the text’s URI and identified entities (people, companies, etc.)
      • Additional information about each identified entity (e.g., name)
    • Returns a concatenated string of all generated triples.

Pattern

The code repeatedly follows this pattern for different entity types (a factored-out version is sketched after the list):

  1. Identify entities of a certain type (e.g., peopleNames).
  2. Generate triples linking the text’s URI to the entity’s URI.
  3. Generate triples providing additional information about the entity itself.
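Since the pattern is completely regular, it could be factored into a single helper. The following is a minimal sketch (entityTriples is a hypothetical name that does not appear in the example code) that, if added to GenTriples.hs, generates both kinds of triples for one entity type using the generate_triple and make_literal helpers from the listing above:

entityTriples :: [Char] -> [Char] -> [Char] -> [([Char], [Char])] -> [Char]
entityTriples source_uri containsPredicate aboutPredicate entities =
  concat
    [ generate_triple source_uri containsPredicate (snd pair) ++
      generate_triple (snd pair) aboutPredicate (make_literal (fst pair))
    | pair <- entities
    ]

For example, entityTriples (uri meta_data) "<http://knowledgebooks.com/schema/containsPersonDbPediaLink>" "<http://knowledgebooks.com/schema/aboutPersonName>" people would produce the same triples as people_triples1 ++ people_triples2, just in a slightly different order.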

Purpose

This code is designed for knowledge extraction and representation. It aims to transform unstructured text into structured RDF data, making it suitable for semantic web applications or knowledge graphs.

Note:

  • The code relies on external modules (Categorize, Entities, FileUtils, Summarize) for specific functionalities like categorization, entity recognition, file handling, and summarization.
  • The quality of the generated triples will depend on the accuracy of these external modules.
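Before moving on to Cypher generation: if you want to experiment with textToTriples interactively, something like the following should work in stack ghci (the input file names here are hypothetical; use any matching .txt/.meta pair in your test data directory):

$ stack ghci
ghci> :module + GenTriples
ghci> triples <- textToTriples "test_data/test1.txt" "test_data/test1.meta"
ghci> putStr triples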

Utility Code for Generating Cypher Input Data for Neo4J

Now we will generate Neo4J Cypher data. In order to keep the implementation simple, both the RDF and Cypher generation code starts with raw text and performs the NLP analysis to find entities. This example could be refactored to perform the NLP analysis just once, but in practice you will likely work with either RDF or Neo4J, so you will probably extract just the code you need from this example (i.e., either the RDF or the Cypher generation code).

Before we look at the code, let’s start with a few lines of generated Neo4J Cypher import data:

 1 CREATE (newsshop_com_june_z902_html_news)-[:ContainsCompanyDbPediaLink]->(Wall_Stree\
 2 t_Journal)
 3 CREATE (Canada:Entity {name:"Canada", uri:"<http://dbpedia.org/resource/Canada>"})
 4 CREATE (newsshop_com_june_z902_html_news)-[:ContainsCountryDbPediaLink]->(Canada)
 5 CREATE (summary_of_abcnews_go_com_US_violent_long_lasting_tornadoes_threaten_oklahom\
 6 a_texas_storyid63146361:Summary {name:"summary_of_abcnews_go_com_US_violent_long_las
 7 ting_tornadoes_threaten_oklahoma_texas_storyid63146361", uri:"<https://abcnews.go.co
 8 m/US/violent-long-lasting-tornadoes-threaten-oklahoma-texas/story?id=63146361>", sum
 9 mary:"Part of the system that delivered severe weather to the central U.S. over the 
10 weekend is moving into the Northeast today, producing strong to severe storms -- dam
11 aging winds, hail or isolated tornadoes can't be ruled out. Severe weather is foreca
12 st to continue on Tuesday, with the western storm moving east into the Midwest and p
13 arts of the mid-Mississippi Valley."})

The following listing shows the file src/sw/GenNeo4jCypher.hs. This code is very similar to the code for generating RDF in the last section. The notes in the last section for adding your own new entity types are also relevant here.

Notice that in line 29 we import the map category_to_uri_map that was defined in the last section. The function neo4j_category_node_defs defined in lines 39 to 47 creates category graph nodes for each category in the map category_to_uri_map. These nodes will be referenced by graph nodes created in the functions create_neo4j_node, create_neo4j_link, create_summary_node, and create_entity_node. The top level function is textToCypher, which is similar to the function textToTriples in the last section.

  1 {-# LANGUAGE OverloadedStrings #-}
  2 
  3 module GenNeo4jCypher
  4   ( textToCypher
  5   , neo4j_category_node_defs
  6   ) where
  7 
  8 import Categorize (bestCategories)
  9 import Data.List (isInfixOf)
 10 import Data.Char (toLower)
 11 import Data.String.Utils (replace)
 12 import Entities
 13   ( broadcastNetworkNames
 14   , cityNames
 15   , companyNames
 16   , countryNames
 17   , peopleNames
 18   , politicalPartyNames
 19   , tradeUnionNames
 20   , universityNames
 21   )
 22 import FileUtils
 23   ( MyMeta
 24   , filePathToString
 25   , filePathToWordTokens
 26   , readMetaFile
 27   , uri
 28   )
 29 import GenTriples (category_to_uri_map)
 30 import Summarize (summarize, summarizeS)
 31 
 32 import qualified Data.Map as M
 33 import Data.Maybe (fromMaybe)
 34 import Database.SQLite.Simple
 35 
 36 -- for debug:
 37 import Data.Typeable (typeOf)
 38 
 39 neo4j_category_node_defs :: [Char]
 40 neo4j_category_node_defs =
 41   replace
 42     "/"
 43     "_"
 44     $ concat
 45     [ "CREATE (" ++ c ++ ":CategoryType {name:\"" ++ c ++ "\"})\n"
 46     | c <- M.keys category_to_uri_map
 47     ]
 48 
 49 uri_from_category :: p -> p
 50 uri_from_category s = s -- might want the full version from GenTriples
 51 
 52 repl :: Char -> Char
 53 repl '-' = '_'
 54 repl '/' = '_'
 55 repl '.' = '_'
 56 repl c = c
 57 
 58 filterChars :: [Char] -> [Char]
 59 filterChars = filter (\c -> c /= '?' && c /= '=' && c /= '<' && c /= '>')
 60 
 61 create_neo4j_node :: [Char] -> ([Char], [Char])
 62 create_neo4j_node uri =
 63   let name =
 64         (map repl (filterChars
 65                     (replace "https://" "" (replace "http://" "" uri)))) ++
 66                     "_" ++
 67                     (map toLower node_type)
 68       node_type =
 69         if isInfixOf "dbpedia" uri
 70           then "DbPedia"
 71           else "News"
 72       new_node =
 73         "CREATE (" ++
 74         name ++ ":" ++
 75         node_type ++ " {name:\"" ++ (replace " " "_" name) ++
 76         "\", uri:\"" ++ uri ++ "\"})\n"
 77    in (name, new_node)
 78 
 79 create_neo4j_link :: [Char] -> [Char] -> [Char] -> [Char]
 80 create_neo4j_link node1 linkName node2 =
 81   "CREATE (" ++ node1 ++ ")-[:" ++ linkName ++ "]->(" ++ node2 ++ ")\n"
 82 
 83 create_summary_node :: [Char] -> [Char] -> [Char]
 84 create_summary_node uri summary =
 85   let name =
 86         "summary_of_" ++
 87         (map repl $
 88          filterChars (replace "https://" "" (replace "http://" "" uri)))
 89       s1 = "CREATE (" ++ name ++ ":Summary {name:\"" ++ name ++ "\", uri:\""
 90       s2 = uri ++ "\", summary:\"" ++ summary ++ "\"})\n"
 91    in s1 ++ s2
 92 
 93 create_entity_node :: ([Char], [Char]) -> [Char]
 94 create_entity_node entity_pair = 
 95   "CREATE (" ++ (replace " " "_" (fst entity_pair)) ++ 
 96   ":Entity {name:\"" ++ (fst entity_pair) ++ "\", uri:\"" ++
 97   (snd entity_pair) ++ "\"})\n"
 98 
 99 create_contains_entity :: [Char] -> [Char] -> ([Char], [Char]) -> [Char]
100 create_contains_entity relation_name source_uri entity_pair =
101   let new_person_node = create_entity_node entity_pair
102       new_link = create_neo4j_link source_uri
103                    relation_name
104                    (replace " " "_" (fst entity_pair))
105   in
106     (new_person_node ++ new_link)
107 
108 entity_node_helper :: [Char] -> [Char] -> [([Char], [Char])] -> [Char]
109 entity_node_helper relation_name node_name entity_list =
110   concat [create_contains_entity
111            relation_name node_name entity | entity <- entity_list]
112 
113 textToCypher :: FilePath -> [Char] -> IO [Char]
114 textToCypher file_path meta_file_path = do
115   let prelude_nodes = neo4j_category_node_defs
116   putStrLn "+++++++++++++++++ prelude node defs:"
117   print prelude_nodes
118   word_tokens <- filePathToWordTokens file_path
119   contents <- filePathToString file_path
120   putStrLn $ "** contents:\n" ++ contents ++ "\n"
121   meta_data <- readMetaFile meta_file_path
122   putStrLn "++ meta_data:"
123   print meta_data
124   let people = peopleNames word_tokens
125   let companies = companyNames word_tokens
126   putStrLn "^^^^ companies:"
127   print companies
128   let countries = countryNames word_tokens
129   let cities = cityNames word_tokens
130   let broadcast_networks = broadcastNetworkNames word_tokens
131   let political_parties = politicalPartyNames word_tokens
132   let trade_unions = tradeUnionNames word_tokens
133   let universities = universityNames word_tokens
134   let a_summary = summarizeS contents
135   let the_categories = bestCategories word_tokens
136   let filtered_categories =
137         map (uri_from_category . fst) $
138         filter (\(name, value) -> value > 0.3) the_categories
139   putStrLn "\nfiltered_categories:"
140   print filtered_categories
141   let (node1_name, node1) = create_neo4j_node (uri meta_data)
142   let summary1 = create_summary_node (uri meta_data) a_summary
143   let category1 =
144         concat
145           [ create_neo4j_link node1_name "Category" cat
146           | cat <- filtered_categories
147           ]
148   let pp = entity_node_helper "ContainsPersonDbPediaLink" node1_name people
149   let cmpny = entity_node_helper "ContainsCompanyDbPediaLink" node1_name companies
150   let cntry = entity_node_helper "ContainsCountryDbPediaLink" node1_name countries
151   let citys = entity_node_helper "ContainsCityDbPediaLink" node1_name cities
152   let bnet = entity_node_helper "ContainsBroadcastNetworkDbPediaLink"
153                                 node1_name broadcast_networks
154   let ppart = entity_node_helper "ContainsPoliticalPartyDbPediaLink"
155                                 node1_name political_parties
156   let tunion = entity_node_helper "ContainsTradeUnionDbPediaLink"
157                                   node1_name trade_unions
158   let uni = entity_node_helper "ContainsUniversityDbPediaLink"
159                                node1_name universities
160   return $ concat [node1, summary1, category1, pp, cmpny, cntry, citys, bnet,
161                    ppart, tunion, uni]

This code generates Cypher queries to create nodes and relationships in a Neo4j graph database based on extracted information from text.

Core Functionality:

  • neo4j_category_node_defs: Defines Cypher statements to create nodes for predefined categories.
  • uri_from_category: Placeholder, potentially for full URI mapping (not used in this code).
  • create_neo4j_node: Creates a Cypher statement to create a node representing either a DbPedia entity or a News article, based on the URI (a worked example follows this list).
  • create_neo4j_link: Creates a Cypher statement to create a relationship between two nodes.
  • create_summary_node: Creates a Cypher statement to create a node representing a summary of the text.
  • create_entity_node: Creates a Cypher statement to create a node representing an entity.
  • create_contains_entity: Creates Cypher statements to create an entity node and link it to a source node with a specified relationship.
  • entity_node_helper: Generates Cypher statements for creating entity nodes and relationships for a list of entities.
  • textToCypher:
    • Processes text from a file and its metadata.
    • Extracts various entities and categories from the text.
    • Generates Cypher statements to:
      • Create nodes for the text itself, its summary, and identified categories.
      • Create nodes and relationships for entities (people, companies, etc.) mentioned in the text.
    • Returns a concatenated string of all generated Cypher statements.
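As a worked example of the node naming in create_neo4j_node, consider the source URI used in the generated Cypher data shown earlier (assuming, as in that example, that the meta file stores the URI with surrounding angle brackets):

create_neo4j_node "<https://newsshop.com/june/z902.html>"

The function strips the "https://" prefix, drops the characters < > ? =, replaces '-', '/', and '.' with '_', and appends the lower-cased node type ("news", because the URI does not contain "dbpedia"), producing the node name newsshop_com_june_z902_html_news and the statement:

CREATE (newsshop_com_june_z902_html_news:News {name:"newsshop_com_june_z902_html_news", uri:"<https://newsshop.com/june/z902.html>"})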

Purpose:

This code is designed to transform text into a structured representation within a Neo4j graph database. This allows for querying and analyzing relationships between entities and categories extracted from the text.

Because the top level function textToCypher returns a string wrapped in the IO monad, it is possible to add “debug” print statements inside textToCypher. I left many such debug statements in the example code to help you understand the data that is being operated on. I leave it as an exercise to remove these print statements if you use this code in your own projects and no longer need the debug output.

Top Level API Code for Handling Knowledge Graph Data Generation

So far we have looked at processing command line arguments and processing individual input files. Now we look at higher level utility APIs for processing an entire directory of input files. The following listing shows the file src/toplevel/Apis.hs, which contains the two top level helper functions we saw in app/Main.hs.

The functions processFilesToRdf and processFilesToNeo4j both have the type signature FilePath -> FilePath -> IO () and are very similar, differing only in the helper functions they call to generate RDF triples or Cypher input graph data:

 1 module Apis
 2   ( processFilesToRdf
 3   , processFilesToNeo4j
 4   ) where
 5 
 6 import FileUtils
 7 import GenNeo4jCypher
 8 import GenTriples (textToTriples)
 9 
10 import qualified Database.SQLite.Simple as SQL
11 
12 import Control.Monad (mapM)
13 import Data.String.Utils (replace)
14 import System.Directory (getDirectoryContents)
15 
16 import Data.Typeable (typeOf)
17 
18 processFilesToRdf :: FilePath -> FilePath -> IO ()
19 processFilesToRdf dirPath outputRdfFilePath = do
20   files <- getDirectoryContents dirPath :: IO [FilePath]
21   let filtered_files = filter isTextFile files
22   let full_paths = [dirPath ++ "/" ++ fn | fn <- filtered_files]
23   putStrLn "full_paths:"
24   print full_paths
25   let r =
26         [textToTriples fp1 (replace ".txt" ".meta" fp1)
27         |
28         fp1 <- full_paths] :: [IO [Char]]
29   tripleL <-
30     mapM (\fp -> textToTriples fp (replace ".txt" ".meta" fp)) full_paths
31   let tripleS = concat tripleL
32   putStrLn tripleS
33   writeFile outputRdfFilePath tripleS
34 
35 processFilesToNeo4j :: FilePath -> FilePath -> IO ()
36 processFilesToNeo4j dirPath outputRdfFilePath = do
37   files <- getDirectoryContents dirPath :: IO [FilePath]
38   let filtered_files = filter isTextFile files
39   let full_paths = [dirPath ++ "/" ++ fn | fn <- filtered_files]
40   putStrLn "full_paths:"
41   print full_paths
42   let prelude_node_defs = neo4j_category_node_defs
43   putStrLn
44     ("+++++  type of prelude_node_defs is: " ++
45      (show (typeOf prelude_node_defs)))
46   print prelude_node_defs
47   cypher_dataL <-
48     mapM (\fp -> textToCypher fp (replace ".txt" ".meta" fp)) full_paths
49   let cypher_dataS = concat cypher_dataL
50   putStrLn cypher_dataS
51   writeFile outputRdfFilePath $ prelude_node_defs ++ cypher_dataS

Since both of these functions run in the IO monad, I was able to add “debug” print statements that should be helpful in understanding the data being operated on.

The code defines two functions for processing text files in a directory:

  • processFilesToRdf: Processes text files and their corresponding metadata files (with .meta extension) in a given directory. It converts the content into RDF triples using textToTriples and writes the concatenated triples to an output RDF file.

  • processFilesToNeo4j: Processes text files and metadata files to generate Cypher statements for Neo4j. It uses textToCypher to create Cypher data from file content, combines it with predefined Neo4j category node definitions, and writes the result to an output file.

Key Points

  • File Handling: It utilizes getDirectoryContents for file listing, filter for selecting text files, and writeFile for output.

  • Data Transformation: textToTriples and textToCypher are functions that convert text content into RDF triples and Cypher statements, respectively.

  • Metadata Handling: It expects metadata files with the same base name as the text files but with a .meta extension.

  • Output: The generated RDF triples or Cypher statements are written to specified output files.

  • neo4j_category_node_defs: A variable holding predefined Cypher node definitions for Neo4j categories.

This code relies on external modules like FileUtils, GenNeo4jCypher, GenTriples, and Database.SQLite.Simple for specific functionalities.

Wrap Up for Automating the Creation of Knowledge Graphs

The code in this chapter provides a good starting point both for creating test knowledge graphs and for generating data for production use. In practice, generated data should be reviewed before use and additional data added manually as needed. It is good practice to document any required manual changes because this documentation can feed into the requirements for updating the code in this chapter to more closely match your knowledge graph requirements.