Getting Set Up to Use Graph and Relational Databases
I use several types of data stores in my work, but for the purposes of this book we can explore knowledge representation using two key platforms:
- Public SPARQL endpoints (Wikidata and DBPedia) for graph-based knowledge representation using RDF and the SPARQL query language.
- SQLite for relational knowledge representation using the SQL query language.
The next chapter covers RDF and the SPARQL query language in more detail.
The examples for this chapter are in the directory source-code/knowledge_representation.
In technical terms, knowledge representation with graph and relational databases means using graph structures and relational data models to represent and organize knowledge in a structured, computationally efficient, and easily accessible way.
A graph structure is a collection of nodes (also known as vertices) and edges (also known as arcs) that connect the nodes. Each node and edge in a graph can have properties, such as labels and attributes, which provide information about the entities they represent. Graphs can represent knowledge in a variety of ways, for example as semantic networks, or with ontologies that define terms, classes, and types.
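As a minimal illustration of these ideas (not tied to any particular graph database; the node and edge names here are made up for the example), a labeled property graph can be sketched with plain Python data structures:

```python
# A tiny labeled property graph sketched with plain Python data
# structures. Nodes carry properties; edges carry a relation label.

nodes = {
    "einstein": {"label": "Person", "name": "Albert Einstein"},
    "ulm": {"label": "City", "name": "Ulm"},
}

# Each edge connects a source node to a target node via a relation.
edges = [
    ("einstein", "born_in", "ulm"),
]

def related(source, relation):
    """Return the targets of all edges with the given source and relation."""
    return [t for (s, r, t) in edges if s == source and r == relation]

for target in related("einstein", "born_in"):
    print(nodes[target]["name"])   # prints: Ulm
```

Real graph databases add persistence, indexing, and a query language on top of exactly this node/edge/property model.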
Relational databases, on the other hand, use a tabular data model to represent knowledge. The basic building block of a relational database is the table, which is a collection of rows (also known as tuples) and columns (also known as attributes). Each row represents an instance of an entity, and the columns provide information about the properties of that entity. Relationships between entities can also be represented by foreign keys, which link one table to another.
Combining these two technologies, knowledge can be represented as a graph of interconnected entities, where each entity is stored in a relational database table and connected to other entities through relationships represented by edges in the graph. This allows for efficient querying and manipulation of knowledge, as well as the ability to integrate and reason over large amounts of information.
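The combination described above can be sketched in a few lines of SQLite: entities live in a relational table, while graph edges are stored as rows in a relationship table. The table and relation names here are illustrative, not from the book's example code:

```python
# Sketch: entities stored relationally, graph edges stored as rows.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE entities (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""CREATE TABLE edges (
                   source INTEGER REFERENCES entities(id),
                   relation TEXT,
                   target INTEGER REFERENCES entities(id))""")
cur.executemany("INSERT INTO entities VALUES (?, ?)",
                [(1, "Albert Einstein"), (2, "Ulm")])
cur.execute("INSERT INTO edges VALUES (1, 'born_in', 2)")

# Traversing an edge is a JOIN from the source entity to the target.
cur.execute("""SELECT e2.name FROM edges
               JOIN entities e1 ON edges.source = e1.id
               JOIN entities e2 ON edges.target = e2.id
               WHERE e1.name = 'Albert Einstein'
                 AND edges.relation = 'born_in'""")
print(cur.fetchone()[0])   # prints: Ulm
```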
Querying Wikidata with SPARQL and Python
Wikidata is a free, open knowledge base maintained by the Wikimedia Foundation. It contains structured data about millions of entities — people, places, organizations, scientific concepts, and more — all accessible through a public SPARQL endpoint. Unlike DBPedia, which extracts structured data from Wikipedia infoboxes, Wikidata is a curated knowledge base where the data is entered and maintained directly.
The Python SPARQLWrapper library makes it straightforward to query any SPARQL endpoint, including Wikidata:
```bash
uv pip install sparqlwrapper
```
Finding Information About a Person
Let’s query Wikidata for information about a specific person. Wikidata uses numeric entity identifiers (like Q937 for Albert Einstein) and property identifiers (like P19 for “place of birth”):
```python
# wikidata_person.py - Query Wikidata for information about a person

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.addCustomHttpHeader("User-Agent", "PythonAIBook/1.0")

queryString = """
SELECT ?personLabel ?birthPlaceLabel ?birthDate ?occupationLabel
WHERE {
  ?person wdt:P31 wd:Q5 .                   # instance of human
  ?person rdfs:label "Albert Einstein"@en .
  OPTIONAL { ?person wdt:P19 ?birthPlace . }
  OPTIONAL { ?person wdt:P569 ?birthDate . }
  OPTIONAL { ?person wdt:P106 ?occupation . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 10
"""

sparql.setQuery(queryString)
sparql.setReturnFormat(JSON)
results = sparql.queryAndConvert()

for r in results["results"]["bindings"]:
    print(f"Name: {r['personLabel']['value']}")
    if 'birthPlaceLabel' in r:
        print(f"Born: {r['birthPlaceLabel']['value']}")
    if 'birthDate' in r:
        print(f"Date: {r['birthDate']['value'][:10]}")
    if 'occupationLabel' in r:
        print(f"Occupation: {r['occupationLabel']['value']}")
    print()
```
The output shows Wikidata returning multiple results — one per occupation — for the same person:
```
$ uv run wikidata_person.py
Name: Albert Einstein
Born: Ulm
Date: 1879-03-14
Occupation: scientist

Name: Albert Einstein
Born: Ulm
Date: 1879-03-14
Occupation: physicist

Name: Albert Einstein
Born: Ulm
Date: 1879-03-14
Occupation: mathematician

Name: Albert Einstein
Born: Ulm
Date: 1879-03-14
Occupation: inventor
...
```
Key things to notice about Wikidata’s SPARQL:
- wdt: properties represent direct claims (e.g., wdt:P19 is “place of birth”)
- wd: entities are Wikidata items (e.g., wd:Q5 is “human”)
- The SERVICE wikibase:label clause automatically resolves entity IDs to human-readable labels
- OPTIONAL prevents the query from failing when a property is missing
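These conventions are regular enough that query text can be assembled programmatically. The helper below is hypothetical (not part of SPARQLWrapper or any Wikidata library); it simply builds a query string from an entity ID and a list of property IDs, using the wd:/wdt: prefixes and label service described above:

```python
# Hypothetical helper that assembles a Wikidata SPARQL query string
# from an entity ID (wd:) and a list of property IDs (wdt:).

def build_wikidata_query(entity_id, property_ids):
    """Return a SPARQL query fetching the given properties of an entity."""
    lines = ["SELECT * WHERE {"]
    for i, pid in enumerate(property_ids):
        # OPTIONAL so a missing property does not eliminate the row.
        lines.append(f"  OPTIONAL {{ wd:{entity_id} wdt:{pid} ?v{i} . }}")
    lines.append('  SERVICE wikibase:label '
                 '{ bd:serviceParam wikibase:language "en" . }')
    lines.append("}")
    return "\n".join(lines)

# Q937 is Albert Einstein; P19 is place of birth, P569 is date of birth.
print(build_wikidata_query("Q937", ["P19", "P569"]))
```

The resulting string can be passed to sparql.setQuery() exactly as in the previous example.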
Querying Cities and Their Properties from DBPedia
DBPedia mirrors much of Wikipedia’s structured content as RDF triples. It uses different ontology conventions than Wikidata but is equally useful for knowledge representation tasks. Here we query DBPedia’s public SPARQL endpoint for cities and their populations:
```python
# dbpedia_cities.py - Query DBPedia for city data

from SPARQLWrapper import SPARQLWrapper, JSON

queryString = """
SELECT ?city_uri ?dbpedia_label ?population ?country_label
WHERE {
  ?city_uri
      <http://dbpedia.org/ontology/type>
      <http://dbpedia.org/resource/City> .
  ?city_uri
      <http://dbpedia.org/property/populationEst>
      ?population .
  ?city_uri
      <http://www.w3.org/2000/01/rdf-schema#label>
      ?dbpedia_label FILTER (lang(?dbpedia_label) = 'en') .
  OPTIONAL {
    ?city_uri <http://dbpedia.org/ontology/country> ?country .
    ?country <http://www.w3.org/2000/01/rdf-schema#label>
        ?country_label FILTER (lang(?country_label) = 'en') .
  }
}
ORDER BY DESC(?population)
LIMIT 10
"""

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery(queryString)
sparql.setReturnFormat(JSON)
results = sparql.queryAndConvert()

for r in results["results"]["bindings"]:
    city = r['dbpedia_label']['value']
    pop = int(r['population']['value'])
    country = r.get('country_label', {}).get('value', 'unknown')
    print(f"{city} ({country}): population {pop:,}")
```
The output (results may vary as DBPedia data is updated):
```
$ uv run dbpedia_cities.py
Fort Worth, Texas (United States): population 1,008,106
Charlotte, North Carolina (unknown): population 911,311
Detroit (unknown): population 645,705
Gombe, Nigeria (Nigeria): population 446,800
Ilesa (Nigeria): population 416,000
Pittsburgh (unknown): population 307,668
Durham, North Carolina (United States): population 296,186
Toledo, Ohio (United States): population 265,638
Winston-Salem, North Carolina (United States): population 252,975
Huntsville, Alabama (unknown): population 249,102
```
When I use RDF data from public SPARQL endpoints like DBPedia or Wikidata in applications, I start with the web-based SPARQL clients for these services: I find useful entities, inspect the properties defined for those entities, and then write custom SPARQL queries to fetch the data I need. The web-based query editors at query.wikidata.org and dbpedia.org/sparql are invaluable for this exploratory process.
We will use more SPARQL queries in the next chapter.
The SQLite Relational Database for Knowledge Representation
The SQLite database is included in the standard Python distribution, making it the zero-setup option for persistent data storage. While graph databases naturally express relationships between entities, relational databases can also serve as effective knowledge representations when the schema is designed to capture entity types, attributes, and relationships.
A Reusable SQLite Library
We start with a simple reusable library for SQLite using the standard library sqlite3:
```python
# sqlite_lib.py - Reusable SQLite helper functions

from sqlite3 import connect

def create_db(db_file_path):
    """Create a database file and return the connection."""
    return connect(db_file_path)

def connection(db_file_path):
    """Create and return a database connection."""
    return connect(db_file_path)

def query(conn, sql, variable_bindings=None):
    """Execute a SQL statement and return all results."""
    cur = conn.cursor()
    if variable_bindings:
        cur.execute(sql, variable_bindings)
    else:
        cur.execute(sql)
    conn.commit()
    return cur.fetchall()
```
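A quick usage sketch of the query helper follows. The helper is redefined inline so the example is self-contained; in a real project you would `from sqlite_lib import query` instead. The planets table and its values are made up for the demonstration:

```python
# Usage sketch for the query helper in sqlite_lib.py (redefined here
# so this example runs on its own).
from sqlite3 import connect

def query(conn, sql, variable_bindings=None):
    cur = conn.cursor()
    if variable_bindings:
        cur.execute(sql, variable_bindings)
    else:
        cur.execute(sql)
    conn.commit()
    return cur.fetchall()

conn = connect(":memory:")   # in-memory database for the demo
query(conn, "CREATE TABLE planets (name TEXT, moons INTEGER)")
query(conn, "INSERT INTO planets VALUES (?, ?)", ("Mars", 2))
query(conn, "INSERT INTO planets VALUES (?, ?)", ("Earth", 1))

# Variable bindings keep user-supplied values out of the SQL string,
# which avoids SQL injection and quoting bugs.
rows = query(conn, "SELECT name FROM planets WHERE moons > ?", (1,))
print(rows)   # prints: [('Mars',)]
```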
Modeling a Knowledge Graph in SQLite
Relational databases become knowledge representation tools when we design tables to capture entities, their types, their attributes, and the relationships between them. Here is an example that builds a simple knowledge base about scientists, their fields, and their discoveries:
```python
# sqlite_knowledge.py - Knowledge representation with SQLite

import sqlite3

def build_knowledge_base():
    """Build a relational knowledge base about scientists and their work."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Entity tables: each table represents a type of entity
    cur.execute("""
        CREATE TABLE scientists (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            birth_year INTEGER,
            nationality TEXT
        )
    """)

    cur.execute("""
        CREATE TABLE fields (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            description TEXT
        )
    """)

    cur.execute("""
        CREATE TABLE discoveries (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            year INTEGER,
            description TEXT
        )
    """)

    # Relationship tables: capture how entities are connected
    cur.execute("""
        CREATE TABLE scientist_field (
            scientist_id INTEGER REFERENCES scientists(id),
            field_id INTEGER REFERENCES fields(id),
            PRIMARY KEY (scientist_id, field_id)
        )
    """)

    cur.execute("""
        CREATE TABLE scientist_discovery (
            scientist_id INTEGER REFERENCES scientists(id),
            discovery_id INTEGER REFERENCES discoveries(id),
            PRIMARY KEY (scientist_id, discovery_id)
        )
    """)

    # Populate with knowledge
    cur.executemany("INSERT INTO scientists VALUES (?, ?, ?, ?)", [
        (1, "Albert Einstein", 1879, "German"),
        (2, "Marie Curie", 1867, "Polish"),
        (3, "Richard Feynman", 1918, "American"),
    ])

    cur.executemany("INSERT INTO fields VALUES (?, ?, ?)", [
        (1, "Physics", "Study of matter, energy, and their interactions"),
        (2, "Chemistry", "Study of the composition and properties of matter"),
        (3, "Quantum Mechanics", "Physics of atomic and subatomic systems"),
    ])

    cur.executemany("INSERT INTO discoveries VALUES (?, ?, ?, ?)", [
        (1, "Special Relativity", 1905, "Time and space are relative"),
        (2, "Radioactivity", 1898, "Discovery of radium and polonium"),
        (3, "Quantum Electrodynamics", 1948, "Quantum theory of light and matter"),
    ])

    cur.executemany("INSERT INTO scientist_field VALUES (?, ?)", [
        (1, 1), (1, 3),  # Einstein: Physics, Quantum Mechanics
        (2, 1), (2, 2),  # Curie: Physics, Chemistry
        (3, 1), (3, 3),  # Feynman: Physics, Quantum Mechanics
    ])

    cur.executemany("INSERT INTO scientist_discovery VALUES (?, ?)", [
        (1, 1),  # Einstein -> Special Relativity
        (2, 2),  # Curie -> Radioactivity
        (3, 3),  # Feynman -> QED
    ])

    conn.commit()
    return conn


def query_knowledge_base(conn):
    """Demonstrate knowledge queries against the relational schema."""
    cur = conn.cursor()

    # Query 1: Who works in Quantum Mechanics?
    print("Scientists in Quantum Mechanics:")
    cur.execute("""
        SELECT s.name, s.nationality
        FROM scientists s
        JOIN scientist_field sf ON s.id = sf.scientist_id
        JOIN fields f ON sf.field_id = f.id
        WHERE f.name = 'Quantum Mechanics'
    """)
    for row in cur.fetchall():
        print(f"  {row[0]} ({row[1]})")

    # Query 2: What did each scientist discover?
    print("\nDiscoveries by scientist:")
    cur.execute("""
        SELECT s.name, d.name, d.year, d.description
        FROM scientists s
        JOIN scientist_discovery sd ON s.id = sd.scientist_id
        JOIN discoveries d ON sd.discovery_id = d.id
        ORDER BY d.year
    """)
    for row in cur.fetchall():
        print(f"  {row[0]}: {row[1]} ({row[2]}) — {row[3]}")

    # Query 3: Which fields overlap between scientists?
    print("\nScientists who share a field:")
    cur.execute("""
        SELECT s1.name, s2.name, f.name
        FROM scientist_field sf1
        JOIN scientist_field sf2 ON sf1.field_id = sf2.field_id
            AND sf1.scientist_id < sf2.scientist_id
        JOIN scientists s1 ON sf1.scientist_id = s1.id
        JOIN scientists s2 ON sf2.scientist_id = s2.id
        JOIN fields f ON sf1.field_id = f.id
    """)
    for row in cur.fetchall():
        print(f"  {row[0]} & {row[1]}: {row[2]}")


conn = build_knowledge_base()
query_knowledge_base(conn)
conn.close()
```
The output shows how SQL JOIN queries traverse the relationships between entities, much like following edges in a graph:
```
Scientists in Quantum Mechanics:
  Albert Einstein (German)
  Richard Feynman (American)

Discoveries by scientist:
  Marie Curie: Radioactivity (1898) — Discovery of radium and polonium
  Albert Einstein: Special Relativity (1905) — Time and space are relative
  Richard Feynman: Quantum Electrodynamics (1948) — Quantum theory of light and matter

Scientists who share a field:
  Albert Einstein & Marie Curie: Physics
  Albert Einstein & Richard Feynman: Physics
  Albert Einstein & Richard Feynman: Quantum Mechanics
  Marie Curie & Richard Feynman: Physics
```
The key insight is that the relationship tables (scientist_field, scientist_discovery) transform a flat relational database into a knowledge representation. Each relationship table captures a specific type of connection between entity types, and SQL JOINs let you traverse these connections to answer knowledge queries. While not as natural as a graph database for highly connected data, this pattern works well for structured knowledge with well-defined entity types and relationships.
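One way to push this pattern to its limit is a single RDF-style triples table, where every fact is one subject, predicate, object row. This is a sketch under my own naming choices, not part of the chapter's example code, but it foreshadows the RDF data model covered in the next chapter:

```python
# Sketch: an RDF-style triples table in SQLite. Every fact is one
# (subject, predicate, object) row, so new relationship types need
# no schema changes. Names and data here are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
cur.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("Einstein", "field", "Physics"),
    ("Feynman", "field", "Physics"),
    ("Einstein", "discovered", "Special Relativity"),
])

# The 'shared field' query becomes a self-join on the triples table.
cur.execute("""
    SELECT t1.subject, t2.subject, t1.object
    FROM triples t1
    JOIN triples t2 ON t1.object = t2.object
        AND t1.predicate = 'field' AND t2.predicate = 'field'
        AND t1.subject < t2.subject
""")
print(cur.fetchall())   # prints: [('Einstein', 'Feynman', 'Physics')]
```

The trade-off is that a triples table gives up column types and constraints in exchange for schema flexibility; the typed entity tables in sqlite_knowledge.py are usually the better choice when the entity types are known up front.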
We will combine the use of SQLite, RDF, SPARQL, and deep learning Natural Language Processing (NLP) libraries later in the book.
If you want to deepen your understanding of the standards behind the SPARQL queries we used in this chapter, the next chapter provides optional reference material on RDF data formats, RDFS sub-property hierarchies, the SPARQL query language in detail, and OWL reasoning. That background will help you write more sophisticated queries against Wikidata, DBPedia, and any other SPARQL endpoint.