Knowledge Graph Creator
The large project described here processes raw text inputs and generates knowledge graph data in two formats: import data for the Neo4J graph database and RDF for semantic web and linked data applications.
This application works by identifying entities in text. Example entity types are people, companies, country names, city names, broadcast network names, political party names, and university names. We saw code for detecting entities earlier, in the chapter on natural language processing (NLP), and we will reuse that code here. Later we will discuss three strategies for reusing code from different projects.
The following figure shows part of a Neo4J Knowledge Graph created with the example code. The displayed nodes have shortened labels, but Neo4J provides a web browser-based console that lets you interactively explore Knowledge Graphs. We don’t cover setting up Neo4J here, so please refer to the Neo4J documentation. As an introduction to RDF data, the semantic web, and linked data you can get free copies of my two books Practical Semantic Web and Linked Data Applications, Common Lisp Edition and Practical Semantic Web and Linked Data Applications, Java, Scala, Clojure, and JRuby Edition.

There are two versions of this project that handle duplicate generated data in different ways:
- As either Neo4J Cypher data or RDF triples are created, store the generated data in an embedded SQLite database and check this database before writing new output data.
- Ignore the problem of generating duplicate data and filter out duplicates in the outer processing pipeline that uses the Knowledge Graph Creator as one processing step.
For my own work I choose the second method since filtering duplicates is as easy as a few Makefile targets (the following listing is in the file Makefile in the directory haskell_tutorial_cookbook_examples/knowledge_graph_creator_pure):
all: gendata rdf cypher
gendata:
stack build --fast --exec Dev-exe
rdf:
echo "Removing duplicate RDF statements"
awk '!visited[$$0]++' out.n3 > output.n3
rm -f out.n3
cypher:
echo "Removing duplicate Cypher statements"
awk '!visited[$$0]++' out.cypher > output.cypher
rm -f out.cypher
The Haskell KGCreator application we develop here writes the output files out.n3 (N3 is an RDF data format) and out.cypher (Cypher is the data import format and query language for the open source and commercial Neo4J graph database). The awk commands remove duplicate lines and write the de-duplicated data to output.n3 and output.cypher.
We will use this second approach, but the next section provides enough information to get you started in case you are interested in using SQLite to prevent generating duplicate data.
Notes for Using SQLite to Avoid Duplicates (Optional Material)
We saw two methods of avoiding duplicates in generated data in the last section. Implementing the first method is left as an exercise, but here are some notes to get you started: modify the example code to use the utility module BlackBoard.hs in the directory knowledge_graph_creator_pure/src/fileutils and implement the logic sketched below for checking whether newly generated data is already in the SQLite database. This first method is also a good example of wrapping the embedded SQLite library in the IO monad. If you are not interested in this exercise you can skip this section.
Before you write either an RDF statement or a Neo4J Cypher data import statement, check to see if the statement has already been written using something like:
check <- blackboard_check_key new_data_uri
if check
  then return ()   -- the statement was already generated, so skip it
  else ....        -- write the new RDF or Cypher statement
and after writing a RDF statement or a Neo4J Cypher data import statement, write it to the temporary SQLite database using something like:
blackboard_write newStatementString
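The BlackBoard.hs module in the example repository is not listed in this chapter, so the two calls above are only meant to suggest its API. If you want to try this exercise, here is a minimal sketch of my own using the sqlite-simple package; unlike the one-argument calls shown above it threads a Connection value explicitly, and the database file name and table name generated_statements are invented details:

{-# LANGUAGE OverloadedStrings #-}
module BlackBoardSketch
  ( blackboard_open
  , blackboard_check_key
  , blackboard_write
  ) where

import Database.SQLite.Simple

-- Open (or create) the scratch database and make sure the table exists.
blackboard_open :: IO Connection
blackboard_open = do
  conn <- open "blackboard.db"
  execute_ conn
    "CREATE TABLE IF NOT EXISTS generated_statements (statement TEXT PRIMARY KEY)"
  return conn

-- True if the statement has already been generated and recorded.
blackboard_check_key :: Connection -> String -> IO Bool
blackboard_check_key conn statement = do
  rows <- query conn
            "SELECT statement FROM generated_statements WHERE statement = ?"
            (Only statement) :: IO [Only String]
  return (not (null rows))

-- Record a newly written statement so later checks see it.
blackboard_write :: Connection -> String -> IO ()
blackboard_write conn statement =
  execute conn
    "INSERT OR IGNORE INTO generated_statements (statement) VALUES (?)"
    (Only statement)

You would open the connection once at program start and pass it down to the triple and Cypher generation code.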
For the rest of the chapter we will not keep track of generated data in SQLite; instead we remove duplicates during post-processing using the standard awk command line utility, working with the example code in the directory knowledge_graph_creator_pure.
Code Layout for the KGCreator Project and Strategies for Sharing Haskell Code Between Projects
We will reuse the code for finding entities that we studied in an earlier chapter. There are several ways to reuse code from multiple local Haskell projects:
- In a project’s cabal file, use relative paths to the source code for other projects. This is my preferred way to work but has the drawback that the stack command sdist to make a distribution tarball will not work with relative paths. If this is a problem for you then create relative symbolic file links to the source directories in other projects.
- In your project’s stack.yaml file, add the other project’s name and path as an extra-deps entry.
- For library projects, create a proper package definition and install the library globally on your system.
I almost always use the first method for my projects that depend on other local projects, and this is also the approach we use here. The relevant lines in the file KGCreator.cabal are:
1 library
2 exposed-modules:
3 CorefWebClient
4 NlpWebClient
5 ClassificationWebClient
6 DirUtils
7 FileUtils
8 BlackBoard
9 GenTriples
10 GenNeo4jCypher
11 Apis
12 Categorize
13 NlpUtils
14 Summarize
15 Entities
16 other-modules:
17 Paths_KGCreator
18 BroadcastNetworkNamesDbPedia
19 Category1Gram
20 Category2Gram
21 CityNamesDbpedia
22 CompanyNamesDbpedia
23 CountryNamesDbpedia
24 PeopleDbPedia
25 PoliticalPartyNamesDbPedia
26 Sentence
27 Stemmer
28 TradeUnionNamesDbPedia
29 UniversityNamesDbPedia
30
31 hs-source-dirs:
32 src
33 src/webclients
34 src/fileutils
35 src/sw
36 src/toplevel
37 ../NlpTool/src/nlp
38 ../NlpTool/src/nlp/data
This is a standard-looking cabal file except for lines 37 and 38, where the source paths reference the example code for the NlpTool application developed in a previous chapter. The exposed module BlackBoard (line 8) is not used but I leave it in the cabal file in case you want to experiment with recording generated data in SQLite to avoid data duplication. You are also likely to want BlackBoard if you modify this example to continuously process incoming data in a production system. This is left as an exercise.
Before going into too much detail on the implementation let’s look at the layout of the project code:
src/fileutils:
BlackBoard.hs DirUtils.hs FileUtils.hs

../NlpTool/src/nlp:
Categorize.hs Entities.hs NlpUtils.hs Sentence.hs Stemmer.hs Summarize.hs data

../NlpTool/src/nlp/data:
BroadcastNetworkNamesDbPedia.hs CompanyNamesDbpedia.hs TradeUnionNamesDbPedia.hs
Category1Gram.hs CountryNamesDbpedia.hs UniversityNamesDbPedia.hs
Category2Gram.hs PeopleDbPedia.hs
CityNamesDbpedia.hs PoliticalPartyNamesDbPedia.hs

src/sw:
GenNeo4jCypher.hs GenTriples.hs

src/toplevel:
Apis.hs
As mentioned before, we are using the Haskell source files from the relative path ../NlpTool/src/… as well as the local src directory. We discuss this code in the next few sections.
The Main Event: Detecting Entities in Text
A primary task in KGCreator is identifying entities (people, places, etc.) in text. We then create RDF and Neo4J Cypher data statements from these entities, the known origin of each text source, and general relationships between entities.
We will use the top level code that we developed earlier, located in the directory ../NlpTool/src/nlp (please see the chapter Natural Language Processing Tools for more detail):
- Categorize.hs - categorizes text into categories like news, religion, business, politics, science, etc.
- Entities.hs - identifies entities like people, companies, places, news broadcast networks, labor unions, etc. in text
- Summarize.hs - creates an extractive summary of text
The KGCreator Haskell application looks in a specified directory for text files to process. For each file with a .txt extension there should be a matching file with the extension .meta that contains a single line: the URI of the web location where the corresponding text was found. We need this because we are creating knowledge graph data from information found in text sources, and it is important to preserve the original location of that data. In other words, we want to know where the data elements in our knowledge graph came from.
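Each .txt/.meta pair is read with helpers from the FileUtils module, which is used later in this chapter but not listed here. As a rough idea of what the meta handling could look like, here is a hedged sketch of my own; in particular, the angle-bracket wrapping of the URI is an assumption based on the sample RDF output later in the chapter, and the real src/fileutils/FileUtils.hs may differ:

-- Hypothetical sketch only; see src/fileutils/FileUtils.hs for the real code.
data MyMeta = MyMeta { uri :: String } deriving (Show)

readMetaFile :: FilePath -> IO MyMeta
readMetaFile metaPath = do
  contents <- readFile metaPath
  case lines contents of
    -- the .meta file holds a single line: the URI of the original web source
    (sourceUri:_) -> return (MyMeta { uri = "<" ++ sourceUri ++ ">" })
    []            -> error ("empty .meta file: " ++ metaPath)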
We have not yet looked at an example of handling command line arguments, so let’s go into some detail on how we do this. Previously, when we defined an executable target in our .cabal file (in this case KGCreator-exe), we could use stack to build the executable and run it with:
stack build --fast --exec KGCreator-exe
Now, we have an executable that requires two arguments: a source input directory and the file root for generated RDF and Cypher output files. We can pass command line arguments using this notation:
stack build --fast --exec "KGCreator-exe test_data outtest"
The two command line arguments are:
- test_data which is the file path of a local directory containing the input files
- outtest which is the root file name for generated Neo4J Cypher and RDF output files
If you are using KGCreator in production, then you will want to copy the compiled and linked executable file KGCreator-exe to somewhere on your PATH like /usr/local/bin.
The following listing shows the file app/Main.hs, the main program for this example that handles command line arguments and calls two top level functions in src/toplevel/Apis.hs:
1 module Main where
2
3 import System.Environment (getArgs)
4 import Apis (processFilesToRdf, processFilesToNeo4j)
5
6 main :: IO ()
7 main = do
8 args <- getArgs
9 case args of
10 [] -> error "must supply an input directory containing text and meta files"
11 [_] -> error "in addition to an input directory, also specify a root file name f\
12 or the generated RDF and Cypher files"
13 [inputDir, outputFileRoot] -> do
14 processFilesToRdf inputDir $ outputFileRoot ++ ".n3"
15 processFilesToNeo4j inputDir $ outputFileRoot ++ ".cypher"
16 _ -> error "too many arguments"
Here we use getArgs (line 8) to fetch the list of command line arguments and verify that exactly two arguments have been provided. We will look at the functions processFilesToRdf and processFilesToNeo4j, and the functions they call, in the next three sections.
Utility Code for Generating RDF
The code for generating RDF and for generating Neo4J Cypher data is similar. We start with the code to generate RDF triples. Before we look at the code, let’s start with a few lines of generated RDF:
<http://dbpedia.org/resource/The_Wall_Street_Journal>
<http://knowledgebooks.com/schema/aboutCompanyName>
"Wall Street Journal" .
<https://newsshop.com/june/z902.html>
<http://knowledgebooks.com/schema/containsCountryDbPediaLink>
<http://dbpedia.org/resource/Canada> .
The next listing shows the file src/sw/GenTriples.hs that finds entities like broadcast network names, city names, company names, people’s names, political party names, and university names in text and generates RDF triple data. If you need to add more entity types for your own applications, then use the following steps:
- Look at the format of entity data for the NlpTool example and add names for the new entity type you are adding.
- Add a utility function to the NlpTool project to find instances of the new entity type. For example, if you are adding a new entity type “park names”, then copy the code for companyNames to parkNames, modify it as necessary, and export parkNames.
- In the following code, add code for the new entity type after lines 10, 97, 151, and 261. Use the code for companyNames as an example.
The map category_to_uri_map created in lines 36 to 84 maps a topic name to a linked data URI that describes the topic. For example, we would not refer to an information source as being about the topic “economics”, but would instead refer to a linked data URI like http://knowledgebooks.com/schema/topic/economics. The utility function uri_from_category takes a text description of a topic like “economy” and converts it to an appropriate URI using the map category_to_uri_map.
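For example, evaluating uri_from_category (defined in the listing below) on a mapped key and on a made-up key that is not in the map gives:

uri_from_category "news_economy"
-- ==> "<http://knowledgebooks.com/schema/topic/economics>"

uri_from_category "gardening"
-- ==> "\"gardening\""    (unknown categories fall back to a quoted literal)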
The utility function textToTriples takes the path to a text input file and the path to the corresponding meta file, calculates the text string representing the generated triples for the input text file, and returns the result in the IO monad.
1 module GenTriples
2 ( textToTriples
3 , category_to_uri_map
4 ) where
5
6 import Categorize (bestCategories)
7 import Entities
8 ( broadcastNetworkNames
9 , cityNames
10 , companyNames
11 , countryNames
12 , peopleNames
13 , politicalPartyNames
14 , tradeUnionNames
15 , universityNames
16 )
17 import FileUtils
18 ( MyMeta
19 , filePathToString
20 , filePathToWordTokens
21 , readMetaFile
22 , uri
23 )
24 import Summarize (summarize, summarizeS)
25
26 import qualified Data.Map as M
27 import Data.Maybe (fromMaybe)
28
29 generate_triple :: [Char] -> [Char] -> [Char] -> [Char]
30 generate_triple s p o = s ++ " " ++ p ++ " " ++ o ++ " .\n"
31
32 make_literal :: [Char] -> [Char]
33 make_literal s = "\"" ++ s ++ "\""
34
35 category_to_uri_map :: M.Map [Char] [Char]
36 category_to_uri_map =
37 M.fromList
38 [ ("news_weather", "<http://knowledgebooks.com/schema/topic/weather>")
39 , ("news_war", "<http://knowledgebooks.com/schema/topic/war>")
40 , ("economics", "<http://knowledgebooks.com/schema/topic/economics>")
41 , ("news_economy", "<http://knowledgebooks.com/schema/topic/economics>")
42 , ("news_politics", "<http://knowledgebooks.com/schema/topic/politics>")
43 , ("religion", "<http://knowledgebooks.com/schema/topic/religion>")
44 , ( "religion_buddhism"
45 , "<http://knowledgebooks.com/schema/topic/religion/buddhism>")
46 , ( "religion_islam"
47 , "<http://knowledgebooks.com/schema/topic/religion/islam>")
48 , ( "religion_christianity"
49 , "<http://knowledgebooks.com/schema/topic/religion/christianity>")
50 , ( "religion_hinduism"
51 , "<http://knowledgebooks.com/schema/topic/religion/hinduism>")
52 , ( "religion_judaism"
53 , "<http://knowledgebooks.com/schema/topic/religion/judaism>")
54 , ("chemistry", "<http://knowledgebooks.com/schema/topic/chemistry>")
55 , ("computers", "<http://knowledgebooks.com/schema/topic/computers>")
56 , ("computers_ai", "<http://knowledgebooks.com/schema/topic/computers/ai>")
57 , ( "computers_ai_datamining"
58 , "<http://knowledgebooks.com/schema/topic/computers/ai/datamining>")
59 , ( "computers_ai_learning"
60 , "<http://knowledgebooks.com/schema/topic/computers/ai/learning>")
61 , ( "computers_ai_nlp"
62 , "<http://knowledgebooks.com/schema/topic/computers/ai/nlp>")
63 , ( "computers_ai_search"
64 , "<http://knowledgebooks.com/schema/topic/computers/ai/search>")
65 , ( "computers_ai_textmining"
66 , "<http://knowledgebooks.com/schema/topic/computers/ai/textmining>")
67 , ( "computers/programming"
68 , "<http://knowledgebooks.com/schema/topic/computers/programming>")
69 , ( "computers_microsoft"
70 , "<http://knowledgebooks.com/schema/topic/computers/microsoft>")
71 , ( "computers/programming/ruby"
72 , "<http://knowledgebooks.com/schema/topic/computers/programming/ruby>")
73 , ( "computers/programming/lisp"
74 , "<http://knowledgebooks.com/schema/topic/computers/programming/lisp>")
75 , ("health", "<http://knowledgebooks.com/schema/topic/health>")
76 , ( "health_exercise"
77 , "<http://knowledgebooks.com/schema/topic/health/exercise>")
78 , ( "health_nutrition"
79 , "<http://knowledgebooks.com/schema/topic/health/nutrition>")
80 , ("mathematics", "<http://knowledgebooks.com/schema/topic/mathematics>")
81 , ("news_music", "<http://knowledgebooks.com/schema/topic/music>")
82 , ("news_physics", "<http://knowledgebooks.com/schema/topic/physics>")
83 , ("news_sports", "<http://knowledgebooks.com/schema/topic/sports>")
84 ]
85
86 uri_from_category :: [Char] -> [Char]
87 uri_from_category key =
88 fromMaybe ("\"" ++ key ++ "\"") $ M.lookup key category_to_uri_map
89
90 textToTriples :: FilePath -> [Char] -> IO [Char]
91 textToTriples file_path meta_file_path = do
92 word_tokens <- filePathToWordTokens file_path
93 contents <- filePathToString file_path
94 putStrLn $ "** contents:\n" ++ contents ++ "\n"
95 meta_data <- readMetaFile meta_file_path
96 let people = peopleNames word_tokens
97 let companies = companyNames word_tokens
98 let countries = countryNames word_tokens
99 let cities = cityNames word_tokens
100 let broadcast_networks = broadcastNetworkNames word_tokens
101 let political_parties = politicalPartyNames word_tokens
102 let trade_unions = tradeUnionNames word_tokens
103 let universities = universityNames word_tokens
104 let a_summary = summarizeS contents
105 let the_categories = bestCategories word_tokens
106 let filtered_categories =
107 map (uri_from_category . fst) $
108 filter (\(name, value) -> value > 0.3) the_categories
109 putStrLn "\nfiltered_categories:"
110 print filtered_categories
111 --putStrLn "a_summary:"
112 --print a_summary
113 --print $ summarize contents
114
115 let summary_triples =
116 generate_triple
117 (uri meta_data)
118 "<http://knowledgebooks.com/schema/summaryOf>" $
119 "\"" ++ a_summary ++ "\""
120 let category_triples =
121 concat
122 [ generate_triple
123 (uri meta_data)
124 "<http://knowledgebooks.com/schema/news/category/>"
125 cat
126 | cat <- filtered_categories
127 ]
128 let people_triples1 =
129 concat
130 [ generate_triple
131 (uri meta_data)
132 "<http://knowledgebooks.com/schema/containsPersonDbPediaLink>"
133 (snd pair)
134 | pair <- people
135 ]
136 let people_triples2 =
137 concat
138 [ generate_triple
139 (snd pair)
140 "<http://knowledgebooks.com/schema/aboutPersonName>"
141 (make_literal (fst pair))
142 | pair <- people
143 ]
144 let company_triples1 =
145 concat
146 [ generate_triple
147 (uri meta_data)
148 "<http://knowledgebooks.com/schema/containsCompanyDbPediaLink>"
149 (snd pair)
150 | pair <- companies
151 ]
152 let company_triples2 =
153 concat
154 [ generate_triple
155 (snd pair)
156 "<http://knowledgebooks.com/schema/aboutCompanyName>"
157 (make_literal (fst pair))
158 | pair <- companies
159 ]
160 let country_triples1 =
161 concat
162 [ generate_triple
163 (uri meta_data)
164 "<http://knowledgebooks.com/schema/containsCountryDbPediaLink>"
165 (snd pair)
166 | pair <- countries
167 ]
168 let country_triples2 =
169 concat
170 [ generate_triple
171 (snd pair)
172 "<http://knowledgebooks.com/schema/aboutCountryName>"
173 (make_literal (fst pair))
174 | pair <- countries
175 ]
176 let city_triples1 =
177 concat
178 [ generate_triple
179 (uri meta_data)
180 "<http://knowledgebooks.com/schema/containsCityDbPediaLink>"
181 (snd pair)
182 | pair <- cities
183 ]
184 let city_triples2 =
185 concat
186 [ generate_triple
187 (snd pair)
188 "<http://knowledgebooks.com/schema/aboutCityName>"
189 (make_literal (fst pair))
190 | pair <- cities
191 ]
192 let bnetworks_triples1 =
193 concat
194 [ generate_triple
195 (uri meta_data)
196 "<http://knowledgebooks.com/schema/containsBroadCastDbPediaLink>"
197 (snd pair)
198 | pair <- broadcast_networks
199 ]
200 let bnetworks_triples2 =
201 concat
202 [ generate_triple
203 (snd pair)
204 "<http://knowledgebooks.com/schema/aboutBroadCastName>"
205 (make_literal (fst pair))
206 | pair <- broadcast_networks
207 ]
208 let pparties_triples1 =
209 concat
210 [ generate_triple
211 (uri meta_data)
212 "<http://knowledgebooks.com/schema/containsPoliticalPartyDbPediaLink>"
213 (snd pair)
214 | pair <- political_parties
215 ]
216 let pparties_triples2 =
217 concat
218 [ generate_triple
219 (snd pair)
220 "<http://knowledgebooks.com/schema/aboutPoliticalPartyName>"
221 (make_literal (fst pair))
222 | pair <- political_parties
223 ]
224 let unions_triples1 =
225 concat
226 [ generate_triple
227 (uri meta_data)
228 "<http://knowledgebooks.com/schema/containsTradeUnionDbPediaLink>"
229 (snd pair)
230 | pair <- trade_unions
231 ]
232 let unions_triples2 =
233 concat
234 [ generate_triple
235 (snd pair)
236 "<http://knowledgebooks.com/schema/aboutTradeUnionName>"
237 (make_literal (fst pair))
238 | pair <- trade_unions
239 ]
240 let universities_triples1 =
241 concat
242 [ generate_triple
243 (uri meta_data)
244 "<http://knowledgebooks.com/schema/containsUniversityDbPediaLink>"
245 (snd pair)
246 | pair <- universities
247 ]
248 let universities_triples2 =
249 concat
250 [ generate_triple
251 (snd pair)
252 "<http://knowledgebooks.com/schema/aboutTradeUnionName>"
253 (make_literal (fst pair))
254 | pair <- universities
255 ]
256 return $
257 concat
258 [ people_triples1
259 , people_triples2
260 , company_triples1
261 , company_triples2
262 , country_triples1
263 , country_triples2
264 , city_triples1
265 , city_triples2
266 , bnetworks_triples1
267 , bnetworks_triples2
268 , pparties_triples1
269 , pparties_triples2
270 , unions_triples1
271 , unions_triples2
272 , universities_triples1
273 , universities_triples2
274 , category_triples
275 , summary_triples
276 ]
The code in this file could be shortened but having repetitive code for each entity type hopefully makes it easier for you to understand how it works:
This code processes text from a given file and generates RDF triples (subject-predicate-object statements) based on the extracted information.
Key Functionality
- category_to_uri_map: A map defining the correspondence between categories and their URIs.
- uri_from_category: Retrieves the URI associated with a category, or returns the category itself in quotes if not found in the map.
- textToTriples:
  - Takes file paths for the text and metadata files.
  - Extracts various entities (people, companies, countries, etc.) and categories from the text.
  - Generates RDF triples representing:
    - Summary of the text
    - Categories associated with the text
    - Links between the text’s URI and identified entities (people, companies, etc.)
    - Additional information about each identified entity (e.g., name)
  - Returns a concatenated string of all generated triples.
Pattern
The code repeatedly follows this pattern for different entity types:
- Identify entities of a certain type (e.g., peopleNames).
- Generate triples linking the text’s URI to the entity’s URI.
- Generate triples providing additional information about the entity itself.
Purpose
This code is designed for knowledge extraction and representation. It aims to transform unstructured text into structured RDF data, making it suitable for semantic web applications or knowledge graphs.
Note:
- The code relies on external modules (Categorize, Entities, FileUtils, Summarize) for specific functionalities like categorization, entity recognition, file handling, and summarization.
- The quality of the generated triples will depend on the accuracy of these external modules.
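As an aside, if you want to reduce the repetition in textToTriples, one possible refactoring (a sketch of my own, not code from the example repository) is a single helper that takes the two predicate URIs for an entity type and produces both groups of triples:

-- Hypothetical helper: generates both triple groups for any entity type.
entity_triples :: [Char] -> [Char] -> [Char] -> [([Char], [Char])] -> [Char]
entity_triples containsPredicate aboutPredicate sourceUri pairs =
  concat
    [ generate_triple sourceUri containsPredicate (snd pair) ++
      generate_triple (snd pair) aboutPredicate (make_literal (fst pair))
    | pair <- pairs
    ]

With this helper, each pair of bindings like company_triples1 and company_triples2 collapses to a single call:

let company_triples =
      entity_triples
        "<http://knowledgebooks.com/schema/containsCompanyDbPediaLink>"
        "<http://knowledgebooks.com/schema/aboutCompanyName>"
        (uri meta_data)
        companies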
Utility Code for Generating Cypher Input Data for Neo4J
Now we will generate Neo4J Cypher data. In order to keep the implementation simple, both the RDF and Cypher generation code start with raw text and perform the NLP analysis to find entities. This example could be refactored to perform the NLP analysis just one time, but in practice you will likely be working with either RDF or Neo4J and so will probably extract just the code you need from this example (i.e., either the RDF or the Cypher generation code).
Before we look at the code, let’s start with a few lines of generated Neo4J Cypher import data:
CREATE (newsshop_com_june_z902_html_news)-[:ContainsCompanyDbPediaLink]->(Wall_Street_Journal)
CREATE (Canada:Entity {name:"Canada", uri:"<http://dbpedia.org/resource/Canada>"})
CREATE (newsshop_com_june_z902_html_news)-[:ContainsCountryDbPediaLink]->(Canada)
CREATE (summary_of_abcnews_go_com_US_violent_long_lasting_tornadoes_threaten_oklahoma_texas_storyid63146361:Summary {name:"summary_of_abcnews_go_com_US_violent_long_lasting_tornadoes_threaten_oklahoma_texas_storyid63146361", uri:"<https://abcnews.go.com/US/violent-long-lasting-tornadoes-threaten-oklahoma-texas/story?id=63146361>", summary:"Part of the system that delivered severe weather to the central U.S. over the weekend is moving into the Northeast today, producing strong to severe storms -- damaging winds, hail or isolated tornadoes can't be ruled out. Severe weather is forecast to continue on Tuesday, with the western storm moving east into the Midwest and parts of the mid-Mississippi Valley."})
The following listing shows the file src/sw/GenNeo4jCypher.hs. This code is very similar to the code for generating RDF in the last section. The notes in the last section on adding your own new entity types are also relevant here.
Notice that in line 29 we import the map category_to_uri_map that was defined in the last section. The function neo4j_category_node_defs defined in lines 39 to 47 creates a category graph node for each category in the map category_to_uri_map. These nodes are referenced by graph nodes created in the functions create_neo4j_node, create_neo4j_link, create_summary_node, and create_entity_node. The top level function is textToCypher, which is similar to the function textToTriples in the last section.
1 {-# LANGUAGE OverloadedStrings #-}
2
3 module GenNeo4jCypher
4 ( textToCypher
5 , neo4j_category_node_defs
6 ) where
7
8 import Categorize (bestCategories)
9 import Data.List (isInfixOf)
10 import Data.Char (toLower)
11 import Data.String.Utils (replace)
12 import Entities
13 ( broadcastNetworkNames
14 , cityNames
15 , companyNames
16 , countryNames
17 , peopleNames
18 , politicalPartyNames
19 , tradeUnionNames
20 , universityNames
21 )
22 import FileUtils
23 ( MyMeta
24 , filePathToString
25 , filePathToWordTokens
26 , readMetaFile
27 , uri
28 )
29 import GenTriples (category_to_uri_map)
30 import Summarize (summarize, summarizeS)
31
32 import qualified Data.Map as M
33 import Data.Maybe (fromMaybe)
34 import Database.SQLite.Simple
35
36 -- for debug:
37 import Data.Typeable (typeOf)
38
39 neo4j_category_node_defs :: [Char]
40 neo4j_category_node_defs =
41 replace
42 "/"
43 "_"
44 $ concat
45 [ "CREATE (" ++ c ++ ":CategoryType {name:\"" ++ c ++ "\"})\n"
46 | c <- M.keys category_to_uri_map
47 ]
48
49 uri_from_category :: p -> p
50 uri_from_category s = s -- might want the full version from GenTriples
51
52 repl :: Char -> Char
53 repl '-' = '_'
54 repl '/' = '_'
55 repl '.' = '_'
56 repl c = c
57
58 filterChars :: [Char] -> [Char]
59 filterChars = filter (\c -> c /= '?' && c /= '=' && c /= '<' && c /= '>')
60
61 create_neo4j_node :: [Char] -> ([Char], [Char])
62 create_neo4j_node uri =
63 let name =
64 (map repl (filterChars
65 (replace "https://" "" (replace "http://" "" uri)))) ++
66 "_" ++
67 (map toLower node_type)
68 node_type =
69 if isInfixOf "dbpedia" uri
70 then "DbPedia"
71 else "News"
72 new_node =
73 "CREATE (" ++
74 name ++ ":" ++
75 node_type ++ " {name:\"" ++ (replace " " "_" name) ++
76 "\", uri:\"" ++ uri ++ "\"})\n"
77 in (name, new_node)
78
79 create_neo4j_link :: [Char] -> [Char] -> [Char] -> [Char]
80 create_neo4j_link node1 linkName node2 =
81 "CREATE (" ++ node1 ++ ")-[:" ++ linkName ++ "]->(" ++ node2 ++ ")\n"
82
83 create_summary_node :: [Char] -> [Char] -> [Char]
84 create_summary_node uri summary =
85 let name =
86 "summary_of_" ++
87 (map repl $
88 filterChars (replace "https://" "" (replace "http://" "" uri)))
89 s1 = "CREATE (" ++ name ++ ":Summary {name:\"" ++ name ++ "\", uri:\""
90 s2 = uri ++ "\", summary:\"" ++ summary ++ "\"})\n"
91 in s1 ++ s2
92
93 create_entity_node :: ([Char], [Char]) -> [Char]
94 create_entity_node entity_pair =
95 "CREATE (" ++ (replace " " "_" (fst entity_pair)) ++
96 ":Entity {name:\"" ++ (fst entity_pair) ++ "\", uri:\"" ++
97 (snd entity_pair) ++ "\"})\n"
98
99 create_contains_entity :: [Char] -> [Char] -> ([Char], [Char]) -> [Char]
100 create_contains_entity relation_name source_uri entity_pair =
101 let new_person_node = create_entity_node entity_pair
102 new_link = create_neo4j_link source_uri
103 relation_name
104 (replace " " "_" (fst entity_pair))
105 in
106 (new_person_node ++ new_link)
107
108 entity_node_helper :: [Char] -> [Char] -> [([Char], [Char])] -> [Char]
109 entity_node_helper relation_name node_name entity_list =
110 concat [create_contains_entity
111 relation_name node_name entity | entity <- entity_list]
112
113 textToCypher :: FilePath -> [Char] -> IO [Char]
114 textToCypher file_path meta_file_path = do
115 let prelude_nodes = neo4j_category_node_defs
116 putStrLn "+++++++++++++++++ prelude node defs:"
117 print prelude_nodes
118 word_tokens <- filePathToWordTokens file_path
119 contents <- filePathToString file_path
120 putStrLn $ "** contents:\n" ++ contents ++ "\n"
121 meta_data <- readMetaFile meta_file_path
122 putStrLn "++ meta_data:"
123 print meta_data
124 let people = peopleNames word_tokens
125 let companies = companyNames word_tokens
126 putStrLn "^^^^ companies:"
127 print companies
128 let countries = countryNames word_tokens
129 let cities = cityNames word_tokens
130 let broadcast_networks = broadcastNetworkNames word_tokens
131 let political_parties = politicalPartyNames word_tokens
132 let trade_unions = tradeUnionNames word_tokens
133 let universities = universityNames word_tokens
134 let a_summary = summarizeS contents
135 let the_categories = bestCategories word_tokens
136 let filtered_categories =
137 map (uri_from_category . fst) $
138 filter (\(name, value) -> value > 0.3) the_categories
139 putStrLn "\nfiltered_categories:"
140 print filtered_categories
141 let (node1_name, node1) = create_neo4j_node (uri meta_data)
142 let summary1 = create_summary_node (uri meta_data) a_summary
143 let category1 =
144 concat
145 [ create_neo4j_link node1_name "Category" cat
146 | cat <- filtered_categories
147 ]
148 let pp = entity_node_helper "ContainsPersonDbPediaLink" node1_name people
149 let cmpny = entity_node_helper "ContainsCompanyDbPediaLink" node1_name companies
150 let cntry = entity_node_helper "ContainsCountryDbPediaLink" node1_name countries
151 let citys = entity_node_helper "ContainsCityDbPediaLink" node1_name cities
152 let bnet = entity_node_helper "ContainsBroadcastNetworkDbPediaLink"
153 node1_name broadcast_networks
154 let ppart = entity_node_helper "ContainsPoliticalPartyDbPediaLink"
155 node1_name political_parties
156 let tunion = entity_node_helper "ContainsTradeUnionDbPediaLink"
157 node1_name trade_unions
158 let uni = entity_node_helper "ContainsUniversityDbPediaLink"
159 node1_name universities
160 return $ concat [node1, summary1, category1, pp, cmpny, cntry, citys, bnet,
161 ppart, tunion, uni]
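To see what neo4j_category_node_defs produces, remember that it emits one CREATE statement per key in category_to_uri_map and that the replace call turns the “/” in keys such as computers/programming into “_”. Working this out from the code above, the generated string contains lines like:

neo4j_category_node_defs
-- contains one line per category key, for example:
--   CREATE (chemistry:CategoryType {name:"chemistry"})
--   CREATE (news_politics:CategoryType {name:"news_politics"})
--   CREATE (computers_programming:CategoryType {name:"computers_programming"})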
This code generates Cypher queries to create nodes and relationships in a Neo4j graph database based on extracted information from text.
Core Functionality:
- neo4j_category_node_defs: Defines Cypher statements to create nodes for predefined categories.
- uri_from_category: Placeholder, potentially for full URI mapping (not used in this code).
- create_neo4j_node: Creates a Cypher statement to create a node representing either a DbPedia entity or a News article, based on the URI.
- create_neo4j_link: Creates a Cypher statement to create a relationship between two nodes.
- create_summary_node: Creates a Cypher statement to create a node representing a summary of the text.
- create_entity_node: Creates a Cypher statement to create a node representing an entity.
- create_contains_entity: Creates Cypher statements to create an entity node and link it to a source node with a specified relationship.
- entity_node_helper: Generates Cypher statements for creating entity nodes and relationships for a list of entities.
- textToCypher:
  - Processes text from a file and its metadata.
  - Extracts various entities and categories from the text.
  - Generates Cypher statements to:
    - Create nodes for the text itself, its summary, and identified categories.
    - Create nodes and relationships for entities (people, companies, etc.) mentioned in the text.
  - Returns a concatenated string of all generated Cypher statements.
Purpose:
This code is designed to transform text into a structured representation within a Neo4j graph database. This allows for querying and analyzing relationships between entities and categories extracted from the text.
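It can also help to trace the name mangling in create_neo4j_node by hand. Assuming uri meta_data carries the source URI wrapped in angle brackets (as the sample data earlier in this section suggests), the evaluation below is worked out from the code above and matches the node name in that sample output:

create_neo4j_node "<https://newsshop.com/june/z902.html>"
-- evaluates to the pair:
--   ( "newsshop_com_june_z902_html_news"
--   , "CREATE (newsshop_com_june_z902_html_news:News {name:\"newsshop_com_june_z902_html_news\", uri:\"<https://newsshop.com/june/z902.html>\"})\n" )
-- filterChars drops the angle brackets from the node name, repl maps '.', '/',
-- and '-' to '_', and the dbpedia test selects the node type News.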
Because the top level function textToCypher returns a string wrapped in the IO monad, it is possible to add “debug” print statements in textToCypher. I left many such debug statements in the example code to help you understand the data that is being operated on. I leave it as an exercise to remove these print statements if you use this code in your own projects and no longer need to see the debug output.
Top Level API Code for Handling Knowledge Graph Data Generation
So far we have looked at processing command line arguments and processing individual input files. Now we look at higher level utility APIs for processing an entire directory of input files. The following listing shows the file src/toplevel/Apis.hs that contains the two top level helper functions we saw called from app/Main.hs.
The functions processFilesToRdf and processFilesToNeo4j both have the type signature FilePath -> FilePath -> IO () and are very similar, except that they call different helper functions to generate RDF triples or Cypher input graph data:
1 module Apis
2 ( processFilesToRdf
3 , processFilesToNeo4j
4 ) where
5
6 import FileUtils
7 import GenNeo4jCypher
8 import GenTriples (textToTriples)
9
10 import qualified Database.SQLite.Simple as SQL
11
12 import Control.Monad (mapM)
13 import Data.String.Utils (replace)
14 import System.Directory (getDirectoryContents)
15
16 import Data.Typeable (typeOf)
17
18 processFilesToRdf :: FilePath -> FilePath -> IO ()
19 processFilesToRdf dirPath outputRdfFilePath = do
20 files <- getDirectoryContents dirPath :: IO [FilePath]
21 let filtered_files = filter isTextFile files
22 let full_paths = [dirPath ++ "/" ++ fn | fn <- filtered_files]
23 putStrLn "full_paths:"
24 print full_paths
25 let r =
26 [textToTriples fp1 (replace ".txt" ".meta" fp1)
27 |
28 fp1 <- full_paths] :: [IO [Char]]
29 tripleL <-
30 mapM (\fp -> textToTriples fp (replace ".txt" ".meta" fp)) full_paths
31 let tripleS = concat tripleL
32 putStrLn tripleS
33 writeFile outputRdfFilePath tripleS
34
35 processFilesToNeo4j :: FilePath -> FilePath -> IO ()
36 processFilesToNeo4j dirPath outputRdfFilePath = do
37 files <- getDirectoryContents dirPath :: IO [FilePath]
38 let filtered_files = filter isTextFile files
39 let full_paths = [dirPath ++ "/" ++ fn | fn <- filtered_files]
40 putStrLn "full_paths:"
41 print full_paths
42 let prelude_node_defs = neo4j_category_node_defs
43 putStrLn
44 ("+++++ type of prelude_node_defs is: " ++
45 (show (typeOf prelude_node_defs)))
46 print prelude_node_defs
47 cypher_dataL <-
48 mapM (\fp -> textToCypher fp (replace ".txt" ".meta" fp)) full_paths
49 let cypher_dataS = concat cypher_dataL
50 putStrLn cypher_dataS
51 writeFile outputRdfFilePath $ prelude_node_defs ++ cypher_dataS
Since both of these functions return values in the IO monad, I could add “debug” print statements that are helpful for understanding the data being operated on.
The code defines two functions for processing text files in a directory:
- processFilesToRdf: Processes text files and their corresponding metadata files (with a .meta extension) in a given directory. It converts the content into RDF triples using textToTriples and writes the concatenated triples to an output RDF file.
- processFilesToNeo4j: Processes text files and metadata files to generate Cypher statements for Neo4J. It uses textToCypher to create Cypher data from file content, combines it with the predefined Neo4J category node definitions, and writes the result to an output file.
Key Points
- File handling: It uses getDirectoryContents for file listing, filter for selecting text files, and writeFile for output.
- Data transformation: textToTriples and textToCypher are the functions that convert text content into RDF triples and Cypher statements, respectively.
- Metadata handling: It expects metadata files with the same base name as the text files but with a .meta extension.
- Output: The generated RDF triples or Cypher statements are written to the specified output files.
- neo4j_category_node_defs: A variable holding predefined Cypher node definitions for Neo4J categories.
This code relies on external modules like FileUtils, GenNeo4jCypher, GenTriples, and Database.SQLite.Simple for specific functionalities.
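One small helper used above but not listed in this chapter is isTextFile from the FileUtils module. Judging from how it is used, it simply selects files with a .txt extension; a minimal sketch (my assumption, not the repository code):

import Data.List (isSuffixOf)

isTextFile :: FilePath -> Bool
isTextFile fileName = ".txt" `isSuffixOf` fileName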
Wrap Up for Automating the Creation of Knowledge Graphs
The code in this chapter provides a good start both for creating test knowledge graphs and for generating data for production. In practice, generated data should be reviewed before use and additional data manually added as needed. It is good practice to document any required manual changes, since that documentation can feed into the requirements for updating the code in this chapter to more closely match your knowledge graph needs.