Knowledge Graph Creator
The large project described here processes raw text inputs and generates knowledge graph data in two formats: Cypher import data for the Neo4J graph database, and RDF for semantic web and linked data applications.
This application works by identifying entities in text. Example entity types are people, companies, country names, city names, broadcast network names, political party names, and university names. We saw code for detecting entities in the earlier chapter on natural language processing (NLP) and we will reuse that code here. Later we will discuss three strategies for reusing code from different projects.

The following figure shows part of a Neo4J Knowledge Graph created with the example code. This graph has shortened labels in displayed nodes but Neo4J offers a web browser-based console that lets you interactively explore Knowledge Graphs. We don’t cover setting up Neo4J here so please use the Neo4J documentation. As an introduction to RDF data, the semantic web, and linked data you can get free copies of my two books Practical Semantic Web and Linked Data Applications, Common Lisp Edition and Practical Semantic Web and Linked Data Applications, Java, Scala, Clojure, and JRuby Edition.

There are two versions of this project, and they handle duplicate generated data in different ways:
- As either Neo4J Cypher data or RDF triples data are created, store generated data in a SQLite embedded database. Check this database before writing new output data.
- Ignore the problem of generating duplicate data and filter out duplicates in the outer processing pipeline that uses the Knowledge Graph Creator as one processing step.
For my own work I choose the second method since filtering duplicates is as easy as a few Makefile targets (the following listing is in the file Makefile in the directory haskell_book/source-code/knowledge_graph_creator_pure):
all: gendata rdf cypher
gendata:
	stack build --fast --exec Dev-exe
rdf:
	echo "Removing duplicate RDF statements"
	awk '!visited[$$0]++' out.n3 > output.n3
	rm -f out.n3
cypher:
	echo "Removing duplicate Cypher statements"
	awk '!visited[$$0]++' out.cypher > output.cypher
	rm -f out.cypher
The Haskell KGCreator application we develop here writes the output files out.n3 (N3 is an RDF data format) and out.cypher (Cypher is the import format and query language for the Neo4J open source and commercial graph database). The awk commands keep the first occurrence of each line, drop later duplicates, and write the de-duplicated data to output.n3 and output.cypher.
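If you would rather stay inside Haskell than shell out to awk, the same order-preserving de-duplication can be sketched in a few lines. This is only an illustration and is not part of the example project; the names dedupeLines and dedupeText are hypothetical.

```haskell
import qualified Data.Set as Set

-- Keep the first occurrence of each line and drop later repeats,
-- preserving the original order (the same effect as the awk one-liner).
dedupeLines :: [String] -> [String]
dedupeLines = go Set.empty
  where
    go _ [] = []
    go seen (l:ls)
      | l `Set.member` seen = go seen ls
      | otherwise = l : go (Set.insert l seen) ls

-- Apply dedupeLines to the full text of a generated output file.
dedupeText :: String -> String
dedupeText = unlines . dedupeLines . lines
```

This works line by line, so it relies on each generated statement occupying a single line, which is also what the awk version assumes.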
We will use this second approach but the next section provides sufficient information and a link to alternative code in case you are interested in using SQLite to prevent duplicate data generation.
Notes for Using SQLite to Avoid Duplicates (Optional Material)
We saw two methods of avoiding duplicates in generated data in the last section. If you want to use the first method, here are some notes to get you started; the full implementation is left as an exercise. Modify the example code to use the utility module BlackBoard.hs in the directory knowledge_graph_creator_pure/src/fileutils and implement the logic shown below, which checks whether newly generated data is already in the SQLite database. This first method is also a good example of wrapping the embedded SQLite library in the IO monad. If this does not interest you, skip this section.
Before you write either an RDF statement or a Neo4J Cypher data import statement, check to see if the statement has already been written using something like:
check <- blackboard_check_key new_data_uri
if check
....
and after writing an RDF statement or a Neo4J Cypher data import statement, write it to the temporary SQLite database using something like:
blackboard_write newStatementString
For the rest of the chapter we will not keep track of generated data in SQLite. Instead we remove duplicates during post-processing using the standard awk command line utility, and we use the example code in knowledge_graph_creator_pure.
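To make the check-then-write logic concrete without depending on SQLite, here is a minimal sketch that uses an in-memory set held in an IORef as a stand-in for the blackboard. The names Seen, newSeen, and writeIfNew are hypothetical and are not the API of the BlackBoard module.

```haskell
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)
import qualified Data.Set as Set

-- Hypothetical in-memory stand-in for the SQLite-backed blackboard.
type Seen = IORef (Set.Set String)

newSeen :: IO Seen
newSeen = newIORef Set.empty

-- Run the write action only if the statement has not been written before.
-- Returns True when the statement was actually written.
writeIfNew :: Seen -> (String -> IO ()) -> String -> IO Bool
writeIfNew seen write statement = do
  alreadyWritten <- Set.member statement <$> readIORef seen
  if alreadyWritten
    then return False
    else do
      write statement
      modifyIORef' seen (Set.insert statement)
      return True
```

A SQLite-backed version would replace the set lookup and insert with the blackboard check and write calls, which also lets the record of written statements survive between runs.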
Code Layout for the KGCreator Project and Strategies for Sharing Haskell Code Between Projects
We will reuse the code for finding entities that we studied in an earlier chapter. There are several ways to reuse code from multiple local Haskell projects:
- In a project’s cabal file, use relative paths to the source code for other projects. This is my preferred way to work but has the drawback that the stack command sdist to make a distribution tarball will not work with relative paths. If this is a problem for you then create relative symbolic file links to the source directories in other projects.
- In your project’s stack.yaml file, add the other project’s name and path under extra-deps.
- In library projects, define a packages definition and install the library globally on your system.
I almost always use the first method on my projects with dependencies on other local projects I work on and this is also the approach we use here. The relevant lines in the file KGCreator.cabal are:
1 library
2   exposed-modules:
3       CorefWebClient
4       NlpWebClient
5       ClassificationWebClient
6       DirUtils
7       FileUtils
8       BlackBoard
9       GenTriples
10       GenNeo4jCypher
11       Apis
12       Categorize
13       NlpUtils
14       Summarize
15       Entities
16   other-modules:
17       Paths_KGCreator
18       BroadcastNetworkNamesDbPedia
19       Category1Gram
20       Category2Gram
21       CityNamesDbpedia
22       CompanyNamesDbpedia
23       CountryNamesDbpedia
24       PeopleDbPedia
25       PoliticalPartyNamesDbPedia
26       Sentence
27       Stemmer
28       TradeUnionNamesDbPedia
29       UniversityNamesDbPedia
30
31   hs-source-dirs:
32       src
33       src/webclients
34       src/fileutils
35       src/sw
36       src/toplevel
37       ../NlpTool/src/nlp
38       ../NlpTool/src/nlp/data
This is a standard-looking cabal file except for lines 37 and 38, where the source paths reference the example code for the NlpTool application developed in a previous chapter. The exposed module BlackBoard (line 8) is not used, but I leave it in the cabal file in case you want to experiment with recording generated data in SQLite to avoid data duplication. You are also likely to want BlackBoard if you modify this example to continuously process incoming data in a production system. This is left as an exercise.
Before going into too much detail on the implementation let’s look at the layout of the project code:
src/fileutils:
BlackBoard.hs  DirUtils.hs  FileUtils.hs

../NlpTool/src/nlp:
Categorize.hs  Entities.hs  NlpUtils.hs  Sentence.hs  Stemmer.hs  Summarize.hs  data

../NlpTool/src/nlp/data:
BroadcastNetworkNamesDbPedia.hs  CompanyNamesDbpedia.hs  TradeUnionNamesDbPedia.hs
Category1Gram.hs  CountryNamesDbpedia.hs  UniversityNamesDbPedia.hs
Category2Gram.hs  PeopleDbPedia.hs
CityNamesDbpedia.hs  PoliticalPartyNamesDbPedia.hs

src/sw:
GenNeo4jCypher.hs  GenTriples.hs

src/toplevel:
Apis.hs
As mentioned before, we are using the Haskell source files in the relative path ../NlpTool/src/… and in the local src directory. We discuss this code in the next few sections.
The Main Event: Detecting Entities in Text
A primary task in KGCreator is to identify entities (people, places, etc.) in text. We then create RDF and Neo4J Cypher data statements from these entities, from knowledge of where the text came from, and from general relationships between entities.
We will use the top level code that we developed earlier that is located in the directory ../NlpTool/src/nlp (please see the chapter Natural Language Processing Tools for more detail):
- Categorize.hs - categorizes text into categories like news, religion, business, politics, science, etc.
- Entities.hs - identifies entities like people, companies, places, broadcast networks, labor unions, etc. in text
- Summarize.hs - creates an extractive summary of text
The KGCreator Haskell application looks in a specified directory for text files to process. For each file with a .txt extension there should be a matching file with the extension .meta that contains a single line: the URI of the web location where the corresponding text was found. The reason we need this is that we want to create graph knowledge data from information found in text sources and the original location of the data is important to preserve. In other words, we want to know where the data elements in our knowledge graph came from.
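The pairing convention between .txt and .meta files can be sketched as a pure function over a directory listing. This is a hypothetical helper, not the code in DirUtils.hs or FileUtils.hs; it only assumes the naming rule described above and uses the standard filepath package.

```haskell
import System.FilePath (replaceExtension, takeExtension)

-- Pair every .txt file with the .meta file expected to sit next to it.
-- Files with other extensions are ignored.
textMetaPairs :: [FilePath] -> [(FilePath, FilePath)]
textMetaPairs paths =
  [ (p, replaceExtension p ".meta")
  | p <- paths
  , takeExtension p == ".txt"
  ]

-- Text files whose .meta companion is absent from the listing.
missingMeta :: [FilePath] -> [FilePath]
missingMeta paths = [t | (t, m) <- textMetaPairs paths, m `notElem` paths]
```

Checking missingMeta before processing gives a clear error for a text file with no recorded source URI, rather than a failure part way through a batch.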
We have not looked at an example of using command line arguments yet so let’s go into some detail on how we do this. Previously when we have defined an output target executable in our .cabal file, in this case KGCreator-exe, we could use stack to build the executable and run it with:
stack build --fast --exec KGCreator-exe
Now, we have an executable that requires two arguments: a source input directory and the file root for generated RDF and Cypher output files. We can pass command line arguments using this notation:
stack build --fast --exec "KGCreator-exe test_data outtest"
The two command line arguments are:
- test_data which is the file path of a local directory containing the input files
- outtest which is the root file name for generated Neo4J Cypher and RDF output files
If you are using KGCreator in production, then you will want to copy the compiled and linked executable file KGCreator-exe to somewhere on your PATH like /usr/local/bin.
The following listing shows the file app/Main.hs, the main program for this example that handles command line arguments and calls two top level functions in src/toplevel/Apis.hs:
1 -- Entry point: parses command-line args and runs batch processing
2 -- Usage: run with an input directory and an output file root, e.g.
3 --   `kgcreator ./test_data out` generates `out.n3` and `out.cypher`
4 module Main where
5
6 import System.Environment (getArgs)
7 import Apis (processFilesToRdf, processFilesToNeo4j)
8
9 main :: IO ()
10 main
11   -- Minimal argument handling: expect 2 args (input dir, output root)
12  = do
13   args <- getArgs
14   case args of
15     [] -> error "must supply an input directory containing text and meta files"
16     [_] -> error "also specify a root file name for the generated RDF and Cypher files"
17     [inputDir, outputFileRoot] -> do
18       -- Generate RDF triples (.n3) from input text/meta files
19       processFilesToRdf inputDir $ outputFileRoot ++ ".n3"
20       -- Generate Neo4j Cypher (.cypher) from the same input
21       processFilesToNeo4j inputDir $ outputFileRoot ++ ".cypher"
22     _ -> error "too many arguments"
Here we use getArgs in line 13 to fetch a list of command line arguments and verify that exactly two arguments have been provided. Then we call the functions processFilesToRdf and processFilesToNeo4j. We look at these two functions, and the functions they call, in the next three sections.
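One possible variation, shown here only as a sketch, is to move the argument check into a pure function that returns Either, which makes it testable without running the program. The names parseArgs and outputPaths are hypothetical; the error messages are the ones used in app/Main.hs.

```haskell
-- Validate the argument list without performing any IO.
parseArgs :: [String] -> Either String (FilePath, FilePath)
parseArgs [] =
  Left "must supply an input directory containing text and meta files"
parseArgs [_] =
  Left "also specify a root file name for the generated RDF and Cypher files"
parseArgs [inputDir, outputFileRoot] = Right (inputDir, outputFileRoot)
parseArgs _ = Left "too many arguments"

-- Derive the RDF and Cypher output file names from the output root.
outputPaths :: FilePath -> (FilePath, FilePath)
outputPaths root = (root ++ ".n3", root ++ ".cypher")
```

With this split, main would call getArgs, pass the result to parseArgs, and either report the Left message or run both processing functions with the paths from outputPaths.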
Utility Code for Generating RDF
The code for generating RDF and for generating Neo4J Cypher data is similar. We start with the code to generate RDF triples. Before we look at the code, let’s start with a few lines of generated RDF:
<http://dbpedia.org/resource/The_Wall_Street_Journal>
<http://knowledgebooks.com/schema/aboutCompanyName>
"Wall Street Journal" .
<https://newsshop.com/june/z902.html>
<http://knowledgebooks.com/schema/containsCountryDbPediaLink>
<http://dbpedia.org/resource/Canada> .
The next listing shows the file src/sw/GenTriples.hs that finds entities like broadcast network names, city names, company names, people’s names, political party names, and university names in text and generates RDF triple data. If you need to add more entity types for your own applications, then use the following steps:
- Look at the format of entity data for the NlpTool example and add names for the new entity type you are adding.
- Add a utility function to find instances of the new entity type to NlpTool. For example, if you are adding a new entity type “park names”, then copy the code for companyNames to parkNames, modify as necessary, and export parkNames.
- In the following code, add new code for the new entity helper function in four places: the Entities import list (lines 9 to 16), the let bindings that call the entity finders (lines 101 to 108), the triple definitions (lines 133 to 260), and the final concat list (lines 263 to 278). Use the code for companyNames as an example.
The map category_to_uri_map created in lines 39 to 88 maps a topic name to a linked data URI that describes the topic. For example, we would not refer to an information source as being about the topic “economics”, but would instead refer to a linked data URI like http://knowledgebooks.com/schema/topic/economics. The utility function uri_from_category takes a topic key like “economics” and converts it to the appropriate URI using the map category_to_uri_map; a key that is not in the map is returned as a quoted string literal.
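The lookup-with-fallback behavior and the score threshold used later in textToTriples can be tried in isolation with a small sketch. The two-entry map is an excerpt for illustration, and the assumption that categories arrive as (name, score) pairs with Double scores is mine; the names sampleCategoryMap, uriFromCategory, and categoryUris are hypothetical.

```haskell
import qualified Data.Map as M
import Data.Maybe (fromMaybe)

-- A two-entry excerpt of the chapter's category map, for illustration only.
sampleCategoryMap :: M.Map String String
sampleCategoryMap =
  M.fromList
    [ ("economics", "<http://knowledgebooks.com/schema/topic/economics>")
    , ("news_politics", "<http://knowledgebooks.com/schema/topic/politics>")
    ]

-- Known categories become URIs; unknown ones fall back to a quoted literal.
uriFromCategory :: String -> String
uriFromCategory key =
  fromMaybe ("\"" ++ key ++ "\"") (M.lookup key sampleCategoryMap)

-- Keep categories scoring above the threshold and convert them to URIs.
-- The (name, score) shape with Double scores is an assumption.
categoryUris :: Double -> [(String, Double)] -> [String]
categoryUris threshold scored =
  [uriFromCategory name | (name, score) <- scored, score > threshold]
```

The listing below uses a fixed threshold of 0.3; passing it as a parameter here just makes it easy to experiment with stricter or looser filtering.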
The utility function textToTriples takes a file path to a text input file and a file path to the matching meta file, calculates the text string representing the generated triples for the input text file, and returns the result wrapped in the IO monad.
1 -- Builds RDF triples from input text and `.meta` data
2 module GenTriples
3   ( textToTriples
4   , category_to_uri_map
5   ) where
6
7 import Categorize (bestCategories)
8 import Entities
9   ( broadcastNetworkNames
10   , cityNames
11   , companyNames
12   , countryNames
13   , peopleNames
14   , politicalPartyNames
15   , tradeUnionNames
16   , universityNames
17   )
18 import FileUtils
19   ( MyMeta
20   , filePathToString
21   , filePathToWordTokens
22   , readMetaFile
23   , uri
24   )
25 import Summarize (summarize, summarizeS)
26
27 import qualified Data.Map as M
28 import Data.Maybe (fromMaybe)
29
30 -- Helper to format an RDF triple line: subject predicate object .
31 generate_triple :: [Char] -> [Char] -> [Char] -> [Char]
32 generate_triple s p o = s ++ " " ++ p ++ " " ++ o ++ " .\n"
33
34 -- Wrap a string as an RDF literal
35 make_literal :: [Char] -> [Char]
36 make_literal s = "\"" ++ s ++ "\""
37
38 -- Maps classifier category keys to schema URIs used in triples
39 category_to_uri_map :: M.Map [Char] [Char]
40 category_to_uri_map =
41   M.fromList
42     [ ("news_weather", "<http://knowledgebooks.com/schema/topic/weather>")
43     , ("news_war", "<http://knowledgebooks.com/schema/topic/war>")
44     , ("economics", "<http://knowledgebooks.com/schema/topic/economics>")
45     , ("news_economy", "<http://knowledgebooks.com/schema/topic/economics>")
46     , ("news_politics", "<http://knowledgebooks.com/schema/topic/politics>")
47     , ("religion", "<http://knowledgebooks.com/schema/topic/religion>")
48     , ( "religion_buddhism"
49       , "<http://knowledgebooks.com/schema/topic/religion/buddhism>")
50     , ( "religion_islam"
51       , "<http://knowledgebooks.com/schema/topic/religion/islam>")
52     , ( "religion_christianity"
53       , "<http://knowledgebooks.com/schema/topic/religion/christianity>")
54     , ( "religion_hinduism"
55       , "<http://knowledgebooks.com/schema/topic/religion/hinduism>")
56     , ( "religion_judaism"
57       , "<http://knowledgebooks.com/schema/topic/religion/judaism>")
58     , ("chemistry", "<http://knowledgebooks.com/schema/topic/chemistry>")
59     , ("computers", "<http://knowledgebooks.com/schema/topic/computers>")
60     , ("computers_ai", "<http://knowledgebooks.com/schema/topic/computers/ai>")
61     , ( "computers_ai_datamining"
62       , "<http://knowledgebooks.com/schema/topic/computers/ai/datamining>")
63     , ( "computers_ai_learning"
64       , "<http://knowledgebooks.com/schema/topic/computers/ai/learning>")
65     , ( "computers_ai_nlp"
66       , "<http://knowledgebooks.com/schema/topic/computers/ai/nlp>")
67     , ( "computers_ai_search"
68       , "<http://knowledgebooks.com/schema/topic/computers/ai/search>")
69     , ( "computers_ai_textmining"
70       , "<http://knowledgebooks.com/schema/topic/computers/ai/textmining>")
71     , ( "computers/programming"
72       , "<http://knowledgebooks.com/schema/topic/computers/programming>")
73     , ( "computers_microsoft"
74       , "<http://knowledgebooks.com/schema/topic/computers/microsoft>")
75     , ( "computers/programming/ruby"
76       , "<http://knowledgebooks.com/schema/topic/computers/programming/ruby>")
77     , ( "computers/programming/lisp"
78       , "<http://knowledgebooks.com/schema/topic/computers/programming/lisp>")
79     , ("health", "<http://knowledgebooks.com/schema/topic/health>")
80     , ( "health_exercise"
81       , "<http://knowledgebooks.com/schema/topic/health/exercise>")
82     , ( "health_nutrition"
83       , "<http://knowledgebooks.com/schema/topic/health/nutrition>")
84     , ("mathematics", "<http://knowledgebooks.com/schema/topic/mathematics>")
85     , ("news_music", "<http://knowledgebooks.com/schema/topic/music>")
86     , ("news_physics", "<http://knowledgebooks.com/schema/topic/physics>")
87     , ("news_sports", "<http://knowledgebooks.com/schema/topic/sports>")
88     ]
89
90 uri_from_category :: [Char] -> [Char]
91 uri_from_category key =
92   fromMaybe ("\"" ++ key ++ "\"") $ M.lookup key category_to_uri_map
93
94 -- Read a `.txt` and matching `.meta` file; emit RDF describing entities, categories, and summary.
95 textToTriples :: FilePath -> [Char] -> IO [Char]
96 textToTriples file_path meta_file_path = do
97   word_tokens <- filePathToWordTokens file_path
98   contents <- filePathToString file_path
99   putStrLn $ "** contents:\n" ++ contents ++ "\n"
100   meta_data <- readMetaFile meta_file_path
101   let people = peopleNames word_tokens
102   let companies = companyNames word_tokens
103   let countries = countryNames word_tokens
104   let cities = cityNames word_tokens
105   let broadcast_networks = broadcastNetworkNames word_tokens
106   let political_parties = politicalPartyNames word_tokens
107   let trade_unions = tradeUnionNames word_tokens
108   let universities = universityNames word_tokens
109   let a_summary = summarizeS contents
110   let the_categories = bestCategories word_tokens
111   let filtered_categories =
112         map (uri_from_category . fst) $
113         filter (\(name, value) -> value > 0.3) the_categories
114   putStrLn "\nfiltered_categories:"
115   print filtered_categories
116   --putStrLn "a_summary:"
117   --print a_summary
118   --print $ summarize contents
119
120   let summary_triples =
121         generate_triple
122           (uri meta_data)
123           "<http://knowledgebooks.com/schema/summaryOf>" $
124         "\"" ++ a_summary ++ "\""
125   let category_triples =
126         concat
127           [ generate_triple
128               (uri meta_data)
129               "<http://knowledgebooks.com/schema/news/category/>"
130               cat
131           | cat <- filtered_categories
132           ]
133   let people_triples1 =
134         concat
135           [ generate_triple
136               (uri meta_data)
137               "<http://knowledgebooks.com/schema/containsPersonDbPediaLink>"
138               (snd pair)
139           | pair <- people
140           ]
141   let people_triples2 =
142         concat
143           [ generate_triple
144               (snd pair)
145               "<http://knowledgebooks.com/schema/aboutPersonName>"
146               (make_literal (fst pair))
147           | pair <- people
148           ]
149   let company_triples1 =
150         concat
151           [ generate_triple
152               (uri meta_data)
153               "<http://knowledgebooks.com/schema/containsCompanyDbPediaLink>"
154               (snd pair)
155           | pair <- companies
156           ]
157   let company_triples2 =
158         concat
159           [ generate_triple
160               (snd pair)
161               "<http://knowledgebooks.com/schema/aboutCompanyName>"
162               (make_literal (fst pair))
163           | pair <- companies
164           ]
165   let country_triples1 =
166         concat
167           [ generate_triple
168               (uri meta_data)
169               "<http://knowledgebooks.com/schema/containsCountryDbPediaLink>"
170               (snd pair)
171           | pair <- countries
172           ]
173   let country_triples2 =
174         concat
175           [ generate_triple
176               (snd pair)
177               "<http://knowledgebooks.com/schema/aboutCountryName>"
178               (make_literal (fst pair))
179           | pair <- countries
180           ]
181   let city_triples1 =
182         concat
183           [ generate_triple
184               (uri meta_data)
185               "<http://knowledgebooks.com/schema/containsCityDbPediaLink>"
186               (snd pair)
187           | pair <- cities
188           ]
189   let city_triples2 =
190         concat
191           [ generate_triple
192               (snd pair)
193               "<http://knowledgebooks.com/schema/aboutCityName>"
194               (make_literal (fst pair))
195           | pair <- cities
196           ]
197   let bnetworks_triples1 =
198         concat
199           [ generate_triple
200               (uri meta_data)
201               "<http://knowledgebooks.com/schema/containsBroadCastDbPediaLink>"
202               (snd pair)
203           | pair <- broadcast_networks
204           ]
205   let bnetworks_triples2 =
206         concat
207           [ generate_triple
208               (snd pair)
209               "<http://knowledgebooks.com/schema/aboutBroadCastName>"
210               (make_literal (fst pair))
211           | pair <- broadcast_networks
212           ]
213   let pparties_triples1 =
214         concat
215           [ generate_triple
216               (uri meta_data)
217               "<http://knowledgebooks.com/schema/containsPoliticalPartyDbPediaLink>"
218               (snd pair)
219           | pair <- political_parties
220           ]
221   let pparties_triples2 =
222         concat
223           [ generate_triple
224               (snd pair)
225               "<http://knowledgebooks.com/schema/aboutPoliticalPartyName>"
226               (make_literal (fst pair))
227           | pair <- political_parties
228           ]
229   let unions_triples1 =
230         concat
231           [ generate_triple
232               (uri meta_data)
233               "<http://knowledgebooks.com/schema/containsTradeUnionDbPediaLink>"
234               (snd pair)
235           | pair <- trade_unions
236           ]
237   let unions_triples2 =
238         concat
239           [ generate_triple
240               (snd pair)
241               "<http://knowledgebooks.com/schema/aboutTradeUnionName>"
242               (make_literal (fst pair))
243           | pair <- trade_unions
244           ]
245   let universities_triples1 =
246         concat
247           [ generate_triple
248               (uri meta_data)
249               "<http://knowledgebooks.com/schema/containsUniversityDbPediaLink>"
250               (snd pair)
251           | pair <- universities
252           ]
253   let universities_triples2 =
254         concat
255           [ generate_triple
256               (snd pair)
257               "<http://knowledgebooks.com/schema/aboutUniversityName>"
258               (make_literal (fst pair))
259           | pair <- universities
260           ]
261   return $
262     concat
263       [ people_triples1
264       , people_triples2
265       , company_triples1
266       , company_triples2
267       , country_triples1
268       , country_triples2
269       , city_triples1
270       , city_triples2
271       , bnetworks_triples1
272       , bnetworks_triples2
273       , pparties_triples1
274       , pparties_triples2
275       , unions_triples1
276       , unions_triples2
277       , universities_triples1
278       , universities_triples2
279       , category_triples
280       , summary_triples
281       ]
The code in this file could be shortened but having repetitive code for each entity type hopefully makes it easier for you to understand how it works:
This code processes text from a given file and generates RDF triples (subject-predicate-object statements) based on the extracted information.
Key Functionality
- category_to_uri_map: A map defining the correspondence between categories and their URIs.
- uri_from_category: Retrieves the URI associated with a category, or returns the category itself in quotes if not found in the map.
- textToTriples:
  - Takes file paths for the text and metadata files.
  - Extracts various entities (people, companies, countries, etc.) and categories from the text.
  - Generates RDF triples representing:
    - Summary of the text
    - Categories associated with the text
    - Links between the text’s URI and identified entities (people, companies, etc.)
    - Additional information about each identified entity (e.g., name)
  - Returns a concatenated string of all generated triples.
Pattern
The code repeatedly follows this pattern for different entity types:
- Identify entities of a certain type (e.g., peopleNames).
- Generate triples linking the text’s URI to the entity’s URI.
- Generate triples providing additional information about the entity itself.
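As a sketch of how this repeated pattern could be captured once, the helper below builds both kinds of triples for any entity type. It is not the code in GenTriples.hs: the names generateTriple and entityTriples are hypothetical, and it emits the two triples for each entity together rather than all link triples followed by all name triples, which makes no difference once duplicates are filtered and the data is loaded.

```haskell
-- Format one RDF triple line: subject predicate object .
generateTriple :: String -> String -> String -> String
generateTriple s p o = s ++ " " ++ p ++ " " ++ o ++ " .\n"

-- For one entity type, emit both kinds of triples per (name, URI) pair:
--   source URI  contains<Type>DbPediaLink  entity URI
--   entity URI  about<Type>Name            "entity name"
-- typeName is spliced into the predicate URIs, e.g. "Company".
entityTriples :: String -> String -> [(String, String)] -> String
entityTriples sourceUri typeName entities =
  concat
    [ generateTriple
        sourceUri
        (schema ("contains" ++ typeName ++ "DbPediaLink"))
        entityUri ++
      generateTriple
        entityUri
        (schema ("about" ++ typeName ++ "Name"))
        (quote name)
    | (name, entityUri) <- entities
    ]
  where
    schema local = "<http://knowledgebooks.com/schema/" ++ local ++ ">"
    quote s = "\"" ++ s ++ "\""
```

With a helper like this, each entity type in textToTriples would reduce to a single call such as entityTriples (uri meta_data) "Company" companies, at the cost of hiding the predicate names that the longer listing spells out.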
Purpose
This code is designed for knowledge extraction and representation. It aims to transform unstructured text into structured RDF data, making it suitable for semantic web applications or knowledge graphs.
Note:
- The code relies on external modules (Categorize, Entities, FileUtils, Summarize) for specific functionalities like categorization, entity recognition, file handling, and summarization.
- The quality of the generated triples will depend on the accuracy of these external modules.
Utility Code for Generating Cypher Input Data for Neo4J
Now we will generate Neo4J Cypher data. In order to keep the implementation simple, both the RDF and Cypher generation code start with raw text and perform the NLP analysis to find entities. This example could be refactored to perform the NLP analysis just one time, but in practice you will likely be working with either RDF or Neo4J and so you will probably extract just the code you need from this example (i.e., either the RDF or the Cypher generation code).
Before we look at the code, let’s start with a few lines of generated Neo4J Cypher import data:
CREATE (newsshop_com_june_z902_html_news)-[:ContainsCompanyDbPediaLink]->(Wall_Street_Journal)
CREATE (Canada:Entity {name:"Canada", uri:"<http://dbpedia.org/resource/Canada>"})
CREATE (newsshop_com_june_z902_html_news)-[:ContainsCountryDbPediaLink]->(Canada)
CREATE (summary_of_abcnews_go_com_US_violent_long_lasting_tornadoes_threaten_oklahoma_texas_storyid63146361:Summary {name:"summary_of_abcnews_go_com_US_violent_long_lasting_tornadoes_threaten_oklahoma_texas_storyid63146361", uri:"<https://abcnews.go.com/US/violent-long-lasting-tornadoes-threaten-oklahoma-texas/story?id=63146361>", summary:"Part of the system that delivered severe weather to the central U.S. over the weekend is moving into the Northeast today, producing strong to severe storms -- damaging winds, hail or isolated tornadoes can't be ruled out. Severe weather is forecast to continue on Tuesday, with the western storm moving east into the Midwest and parts of the mid-Mississippi Valley."})
The following listing shows the file src/sw/GenNeo4jCypher.hs. This code is very similar to the code for generating RDF in the last section. The notes in the last section on adding your own new entity types are also relevant here.
Notice that in line 30 we import the map category_to_uri_map that was defined in the last section. The function neo4j_category_node_defs defined in lines 37 to 45 creates category graph nodes for each category in the map category_to_uri_map. These nodes will be referenced by graph nodes created in the functions create_neo4j_node, create_neo4j_link, create_summary_node, and create_entity_node. The top level function is textToCypher, which is similar to the function textToTriples in the last section.
1 {-# LANGUAGE OverloadedStrings #-}
2
3 -- Builds Neo4j Cypher statements (nodes and relationships) from text/meta
4 module GenNeo4jCypher
5   ( textToCypher
6   , neo4j_category_node_defs
7   ) where
8
9 import Categorize (bestCategories)
10 import Data.List (isInfixOf)
11 import Data.Char (toLower)
12 import Data.String.Utils (replace)
13 import Entities
14   ( broadcastNetworkNames
15   , cityNames
16   , companyNames
17   , countryNames
18   , peopleNames
19   , politicalPartyNames
20   , tradeUnionNames
21   , universityNames
22   )
23 import FileUtils
24   ( MyMeta
25   , filePathToString
26   , filePathToWordTokens
27   , readMetaFile
28   , uri
29   )
30 import GenTriples (category_to_uri_map)
31 import Summarize (summarize, summarizeS)
32
33 import qualified Data.Map as M
34 import Data.Maybe (fromMaybe)
35
36 -- Pre-create CategoryType nodes for all known categories
37 neo4j_category_node_defs :: [Char]
38 neo4j_category_node_defs =
39   replace
40     "/"
41     "_"
42     $ concat
43     [ "CREATE (" ++ c ++ ":CategoryType {name:\"" ++ c ++ "\"})\n"
44     | c <- M.keys category_to_uri_map
45     ]
46
47 uri_from_category :: p -> p
48 uri_from_category s = s -- might want the full version from GenTriples
49
50 repl :: Char -> Char
51 repl '-' = '_'
52 repl '/' = '_'
53 repl '.' = '_'
54 repl c = c
55
56 filterChars :: [Char] -> [Char]
57 filterChars = filter (\c -> c /= '?' && c /= '=' && c /= '<' && c /= '>')
58
59 -- Create a Neo4j node name and Cypher for a source URI (DbPedia or News)
60 create_neo4j_node :: [Char] -> ([Char], [Char])
61 create_neo4j_node uri =
62   let name =
63         (map repl (filterChars
64           (replace "https://" "" (replace "http://" "" uri)))) ++
65         "_" ++
66         (map toLower node_type)
67       node_type =
68         if isInfixOf "dbpedia" uri
69           then "DbPedia"
70           else "News"
71       new_node =
72         "CREATE (" ++
73         name ++ ":" ++
74         node_type ++ " {name:\"" ++ (replace " " "_" name) ++
75         "\", uri:\"" ++ uri ++ "\"})\n"
76   in (name, new_node)
77
78 create_neo4j_link :: [Char] -> [Char] -> [Char] -> [Char]
79 create_neo4j_link node1 linkName node2 =
80   "CREATE (" ++ node1 ++ ")-[:" ++ linkName ++ "]->(" ++ node2 ++ ")\n"
81
82 -- Create a Summary node attached to the source URI
83 create_summary_node :: [Char] -> [Char] -> [Char]
84 create_summary_node uri summary =
85   let name =
86         "summary_of_" ++
87         (map repl $
88          filterChars (replace "https://" "" (replace "http://" "" uri)))
89       s1 = "CREATE (" ++ name ++ ":Summary {name:\"" ++ name ++ "\", uri:\""
90       s2 = uri ++ "\", summary:\"" ++ summary ++ "\"})\n"
91   in s1 ++ s2
92
93 create_entity_node :: ([Char], [Char]) -> [Char]
94 create_entity_node entity_pair =
95   "CREATE (" ++ (replace " " "_" (fst entity_pair)) ++
96   ":Entity {name:\"" ++ (fst entity_pair) ++ "\", uri:\"" ++
97   (snd entity_pair) ++ "\"})\n"
98
99 create_contains_entity :: [Char] -> [Char] -> ([Char], [Char]) -> [Char]
100 create_contains_entity relation_name source_uri entity_pair =
101   let new_person_node = create_entity_node entity_pair
102       new_link = create_neo4j_link source_uri
103                    relation_name
104                    (replace " " "_" (fst entity_pair))
105   in
106     (new_person_node ++ new_link)
107
108 entity_node_helper :: [Char] -> [Char] -> [([Char], [Char])] -> [Char]
109 entity_node_helper relation_name node_name entity_list =
110   concat [create_contains_entity
111            relation_name node_name entity | entity <- entity_list]
112
113 -- Read `.txt` + `.meta`, build Cypher to describe entities, categories, and summary
114 textToCypher :: FilePath -> [Char] -> IO [Char]
115 textToCypher file_path meta_file_path = do
116   let prelude_nodes = neo4j_category_node_defs
117   putStrLn "+++++++++++++++++ prelude node defs:"
118   print prelude_nodes
119   word_tokens <- filePathToWordTokens file_path
120   contents <- filePathToString file_path
121   putStrLn $ "** contents:\n" ++ contents ++ "\n"
122   meta_data <- readMetaFile meta_file_path
123   putStrLn "++ meta_data:"
124   print meta_data
125   let people = peopleNames word_tokens
126   let companies = companyNames word_tokens
127   putStrLn "^^^^ companies:"
128   print companies
129   let countries = countryNames word_tokens
130   let cities = cityNames word_tokens
131   let broadcast_networks = broadcastNetworkNames word_tokens
132   let political_parties = politicalPartyNames word_tokens
133   let trade_unions = tradeUnionNames word_tokens
134   let universities = universityNames word_tokens
135   let a_summary = summarizeS contents
136   let the_categories = bestCategories word_tokens
137   let filtered_categories =
138         map (uri_from_category . fst) $
139         filter (\(name, value) -> value > 0.3) the_categories
140   putStrLn "\nfiltered_categories:"
141   print filtered_categories
142   let (node1_name, node1) = create_neo4j_node (uri meta_data)
143   let summary1 = create_summary_node (uri meta_data) a_summary
144   let category1 =
145         concat
146           [ create_neo4j_link node1_name "Category" cat
147           | cat <- filtered_categories
148           ]
149   let pp = entity_node_helper "ContainsPersonDbPediaLink" node1_name people
150   let cmpny = entity_node_helper "ContainsCompanyDbPediaLink" node1_name companies
151   let cntry = entity_node_helper "ContainsCountryDbPediaLink" node1_name countries
152   let citys = entity_node_helper "ContainsCityDbPediaLink" node1_name cities
153   let bnet = entity_node_helper "ContainsBroadcastNetworkDbPediaLink"
154                node1_name broadcast_networks
155   let ppart = entity_node_helper "ContainsPoliticalPartyDbPediaLink"
156                 node1_name political_parties
157   let tunion = entity_node_helper "ContainsTradeUnionDbPediaLink"
158                  node1_name trade_unions
159   let uni = entity_node_helper "ContainsUniversityDbPediaLink"
160               node1_name universities
161   return $ concat [node1, summary1, category1, pp, cmpny, cntry, citys, bnet,
162                    ppart, tunion, uni]
This code generates Cypher queries to create nodes and relationships in a Neo4j graph database based on extracted information from text.
Core Functionality:
- neo4j_category_node_defs: Defines Cypher statements to create nodes for predefined categories.
- uri_from_category: An identity placeholder that returns the category key unchanged; substitute the full version from GenTriples if you want URI mapping.
- create_neo4j_node: Creates a Cypher statement to create a node representing either a DbPedia entity or a News article, based on the URI.
- create_neo4j_link: Creates a Cypher statement to create a relationship between two nodes.
- create_summary_node: Creates a Cypher statement to create a node representing a summary of the text.
- create_entity_node: Creates a Cypher statement to create a node representing an entity.
- create_contains_entity: Creates Cypher statements to create an entity node and link it to a source node with a specified relationship.
- entity_node_helper: Generates Cypher statements for creating entity nodes and relationships for a list of entities.
- textToCypher:
  - Processes text from a file and its metadata.
  - Extracts various entities and categories from the text.
  - Generates Cypher statements to:
    - Create nodes for the text itself and its summary, and links from the text node to its categories.
    - Create nodes and relationships for entities (people, companies, etc.) mentioned in the text.
  - Returns a concatenated string of all generated Cypher statements.
Purpose:
This code is designed to transform text into a structured representation within a Neo4j graph database. This allows for querying and analyzing relationships between entities and categories extracted from the text.
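To make the pattern behind entity_node_helper concrete, here is a minimal sketch of how a list of entities can be turned into Cypher CREATE statements and linked to a source node. Everything in it is an assumption made for illustration: the names Entity, nodeVar, quote, entityNode, link, and entityStatements are hypothetical, and the Cypher text it emits is not necessarily the format produced by the functions in GenNeo4jCypher.hs.

```haskell
import Data.Char (isAlphaNum)

-- Hypothetical, simplified stand-ins; not the definitions in GenNeo4jCypher.hs.

-- An entity as a (display name, URI) pair. This representation is an assumption.
type Entity = (String, String)

-- Build a Cypher variable name by keeping only letters and digits.
nodeVar :: String -> String
nodeVar name = "n_" ++ filter isAlphaNum name

-- Wrap a string in double quotes, escaping backslashes and double quotes.
quote :: String -> String
quote s = "\"" ++ concatMap esc s ++ "\""
  where
    esc '"'  = "\\\""
    esc '\\' = "\\\\"
    esc c    = [c]

-- One CREATE statement for an entity node.
entityNode :: Entity -> String
entityNode (name, uri) =
  "CREATE (" ++ nodeVar name ++ ":Entity {name:" ++ quote name
    ++ ", uri:" ++ quote uri ++ "})\n"

-- One CREATE statement for a relationship between two node variables.
link :: String -> String -> String -> String
link relName from to =
  "CREATE (" ++ from ++ ")-[:" ++ relName ++ "]->(" ++ to ++ ")\n"

-- For each entity, emit its node followed by a link from the source node.
entityStatements :: String -> String -> [Entity] -> String
entityStatements relName sourceVar entities =
  concat
    [ entityNode e ++ link relName sourceVar (nodeVar (fst e))
    | e <- entities
    ]
```

Because each helper returns a plain String, the results for people, companies, and the other entity types can be combined with concat, which is the same shape as the final return expression in textToCypher.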
Because the top level function textToCypher returns a string wrapped in the IO monad, it is possible to add “debug” print statements in textToCypher. I left many such debug statements in the example code to help you understand the data that is being operated on. I leave it as an exercise to remove these print statements if you use this code in your own projects and no longer need to see the debug output.
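One way to keep such debug output while making it easy to turn off is to route it through a single helper. The following is only a sketch: debugEnabled and debugShow are hypothetical names that do not appear in the example code, and the "+++++" prefix simply imitates the style of the existing print statements.

```haskell
import Control.Monad (when)

-- Hypothetical switch; set to False to silence all debug output.
debugEnabled :: Bool
debugEnabled = True

-- Print a labeled value when debugging is enabled, then return it unchanged.
debugShow :: Show a => String -> a -> IO a
debugShow label x = do
  when debugEnabled $ putStrLn ("+++++ " ++ label ++ ": " ++ show x)
  return x
```

Inside any do block that runs in IO, a line such as `_ <- debugShow "people" people` prints the value without changing the surrounding logic, and flipping debugEnabled removes the output without deleting any lines.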
Top Level API Code for Handling Knowledge Graph Data Generation
So far we have looked at processing command line arguments and processing individual input files. Now we look at higher level utility APIs for processing an entire directory of input files. The following listing shows the file Apis.hs that contains the two top level helper functions we saw in app/Main.hs.
The functions processFilesToRdf and processFilesToNeo4j both have the type signature FilePath -> FilePath -> IO () and are very similar except for calling different helper functions to generate RDF triples or Cypher input graph data:
-- High-level API: batch processes a directory of text/meta files
-- Exposes two entry points to produce RDF triples and Neo4j Cypher.
module Apis
  ( processFilesToRdf
  , processFilesToNeo4j
  ) where

import FileUtils
import GenNeo4jCypher
import GenTriples (textToTriples)

import qualified Database.SQLite.Simple as SQL

import Control.Monad (mapM)
import Data.String.Utils (replace)
import System.Directory (getDirectoryContents)

import Data.Typeable (typeOf)

-- Given `dirPath` with `.txt` and matching `.meta` files, write all RDF triples to `outputRdfFilePath`.
processFilesToRdf :: FilePath -> FilePath -> IO ()
processFilesToRdf dirPath outputRdfFilePath = do
  files <- getDirectoryContents dirPath :: IO [FilePath]
  let filtered_files = filter isTextFile files
  let full_paths = [dirPath ++ "/" ++ fn | fn <- filtered_files]
  putStrLn "full_paths:"
  print full_paths
  let r =
        [ textToTriples fp1 (replace ".txt" ".meta" fp1)
        | fp1 <- full_paths
        ] :: [IO [Char]]
  tripleL <-
    mapM (\fp -> textToTriples fp (replace ".txt" ".meta" fp)) full_paths
  let tripleS = concat tripleL
  putStrLn tripleS
  writeFile outputRdfFilePath tripleS

-- Given `dirPath`, write Neo4j Cypher nodes/relationships to `outputRdfFilePath`.
processFilesToNeo4j :: FilePath -> FilePath -> IO ()
processFilesToNeo4j dirPath outputRdfFilePath = do
  files <- getDirectoryContents dirPath :: IO [FilePath]
  let filtered_files = filter isTextFile files
  let full_paths = [dirPath ++ "/" ++ fn | fn <- filtered_files]
  putStrLn "full_paths:"
  print full_paths
  let prelude_node_defs = neo4j_category_node_defs
  putStrLn
    ("+++++ type of prelude_node_defs is: " ++
     (show (typeOf prelude_node_defs)))
  print prelude_node_defs
  cypher_dataL <-
    mapM (\fp -> textToCypher fp (replace ".txt" ".meta" fp)) full_paths
  let cypher_dataS = concat cypher_dataL
  putStrLn cypher_dataS
  writeFile outputRdfFilePath $ prelude_node_defs ++ cypher_dataS
Since both of these functions run in the IO monad, I could add “debug” print statements that should be helpful in understanding the data being operated on.
The code defines two functions for processing text files in a directory:
- processFilesToRdf: Processes text files and their corresponding metadata files (with a .meta extension) in a given directory. It converts the content into RDF triples using textToTriples and writes the concatenated triples to an output RDF file.
- processFilesToNeo4j: Processes text files and metadata files to generate Cypher statements for Neo4j. It uses textToCypher to create Cypher data from file content, combines it with predefined Neo4j category node definitions, and writes the result to an output file.
Key Points
- File Handling: It uses getDirectoryContents for file listing, filter for selecting text files, and writeFile for output.
- Data Transformation: textToTriples and textToCypher are functions that convert text content into RDF triples and Cypher statements, respectively.
- Metadata Handling: It expects metadata files with the same base name as the text files but with a .meta extension.
- Output: The generated RDF triples or Cypher statements are written to specified output files.
- neo4j_category_node_defs: A variable holding predefined Cypher node definitions for Neo4j categories.
This code relies on the modules FileUtils, GenNeo4jCypher, and GenTriples, and on the replace function from Data.String.Utils. It also imports Database.SQLite.Simple, which this listing does not use.
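The two functions in the listing share everything except the converter they call and the text written ahead of the generated data, so the common steps can be factored into one higher-order function. The sketch below is a possible refactoring, not code from the project: processFilesWith, metaPathFor, and isTxt are hypothetical names, and isTxt assumes that a text file is any name ending in ".txt" (the real isTextFile lives in FileUtils). It also uses System.FilePath instead of string replacement, because replace ".txt" ".meta" rewrites every occurrence of ".txt" in a path, including one inside a directory name.

```haskell
import Data.List (isSuffixOf)
import System.Directory (getDirectoryContents)
import System.FilePath ((</>), replaceExtension)

-- Assumption: a "text file" is any name ending in ".txt".
isTxt :: FilePath -> Bool
isTxt fn = ".txt" `isSuffixOf` fn

-- Swap only the final extension, leaving directory names untouched.
metaPathFor :: FilePath -> FilePath
metaPathFor fp = replaceExtension fp "meta"

-- Hypothetical generalization of processFilesToRdf and processFilesToNeo4j.
processFilesWith
  :: String                               -- text written before the generated data
  -> (FilePath -> FilePath -> IO String)  -- converter for one text/meta file pair
  -> FilePath                             -- input directory
  -> FilePath                             -- output file
  -> IO ()
processFilesWith header convert dirPath outputPath = do
  files <- getDirectoryContents dirPath
  let fullPaths = [dirPath </> fn | fn <- files, isTxt fn]
  chunks <- mapM (\fp -> convert fp (metaPathFor fp)) fullPaths
  writeFile outputPath (header ++ concat chunks)
```

Under these assumptions, processFilesToRdf would correspond to `processFilesWith "" textToTriples` and processFilesToNeo4j to `processFilesWith neo4j_category_node_defs textToCypher`, without the debug output.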
Wrap Up for Automating the Creation of Knowledge Graphs
The code in this chapter will provide you with a good start for creating both test knowledge graphs and for generating data for production. In practice, generated data should be reviewed before use and additional data manually generated as needed. It is good practice to document required manual changes because this documentation can be used in the requirements for updating the code in this chapter to more closely match your knowledge graph requirements.