Getting Setup To Use Graph and Relational Databases

I use several types of data stores in my work but for the purposes of this book we can explore knowledge representation using two key platforms:

RDF data via Python’s rdflib library and public SPARQL endpoints like Wikidata and DBPedia for graph-based knowledge representation.
SQLite for relational knowledge representation using the SQL query language.

The next chapter covers RDF and the SPARQL query language in more detail.

The examples for this chapter are in the directory source-code/knowledge_representation.

In technical terms, knowledge representation using graph and relational databases involves the use of graph structures and relational data models to represent and organize knowledge in a structured, computationally efficient, and easily accessible way.

A graph structure is a collection of nodes (also known as vertices) and edges (also known as arcs) that connect the nodes. Each node and edge in a graph can have properties, such as labels and attributes which provide information about the entities they represent. Graphs can be used to represent knowledge in a variety of ways, such as through semantic networks and using ontologies to define terms, classes, types, etc.

Relational databases, on the other hand, use a tabular data model to represent knowledge. The basic building block of a relational database is the table, which is a collection of rows (also known as tuples) and columns (also known as attributes). Each row represents an instance of an entity, and the columns provide information about the properties of that entity. Relationships between entities can also be represented by foreign keys, which link one table to another.

Combining these two technologies, knowledge can be represented as a graph of interconnected entities, where each entity is stored in a relational database table and connected to other entities through relationships represented by edges in the graph. This allows for efficient querying and manipulation of knowledge, as well as the ability to integrate and reason over large amounts of information.

Querying Wikidata with SPARQL and Python

Wikidata is a free, open knowledge base maintained by the Wikimedia Foundation. It contains structured data about millions of entities — people, places, organizations, scientific concepts, and more — all accessible through a public SPARQL endpoint. Unlike DBPedia, which extracts structured data from Wikipedia infoboxes, Wikidata is a curated knowledge base where the data is entered and maintained directly.

The Python SPARQLWrapper library makes it straightforward to query any SPARQL endpoint, including Wikidata:

1 uv pip install sparqlwrapper

Finding Information About a Person

Let’s query Wikidata for information about a specific person. Wikidata uses numeric entity identifiers (like Q937 for Albert Einstein) and property identifiers (like P19 for “place of birth”):

 1 # wikidata_person.py - Query Wikidata for information about a person
 2 
 3 from SPARQLWrapper import SPARQLWrapper, JSON
 4 from pprint import pprint
 5 
 6 sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
 7 sparql.addCustomHttpHeader("User-Agent", "PythonAIBook/1.0")
 8 
 9 queryString = """
10 SELECT ?personLabel ?birthPlaceLabel ?birthDate ?occupationLabel
11 WHERE {
12     ?person wdt:P31 wd:Q5 .            # instance of human
13     ?person rdfs:label "Albert Einstein"@en .
14     OPTIONAL { ?person wdt:P19 ?birthPlace . }
15     OPTIONAL { ?person wdt:P569 ?birthDate . }
16     OPTIONAL { ?person wdt:P106 ?occupation . }
17     SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
18 }
19 LIMIT 10
20 """
21 
22 sparql.setQuery(queryString)
23 sparql.setReturnFormat(JSON)
24 results = sparql.queryAndConvert()
25 
26 for r in results["results"]["bindings"]:
27     print(f"  Name: {r['personLabel']['value']}")
28     if 'birthPlaceLabel' in r:
29         print(f"  Born: {r['birthPlaceLabel']['value']}")
30     if 'birthDate' in r:
31         print(f"  Date: {r['birthDate']['value'][:10]}")
32     if 'occupationLabel' in r:
33         print(f"  Occupation: {r['occupationLabel']['value']}")
34     print()

The output shows Wikidata returning multiple results — one per occupation — for the same person:

 1 $ uv run wikidata_person.py
 2   Name: Albert Einstein
 3   Born: Ulm
 4   Date: 1879-03-14
 5   Occupation: scientist
 6 
 7   Name: Albert Einstein
 8   Born: Ulm
 9   Date: 1879-03-14
10   Occupation: physicist
11 
12   Name: Albert Einstein
13   Born: Ulm
14   Date: 1879-03-14
15   Occupation: mathematician
16 
17   Name: Albert Einstein
18   Born: Ulm
19   Date: 1879-03-14
20   Occupation: inventor
21   ...

Key things to notice about Wikidata’s SPARQL:

wdt: properties represent direct claims (e.g., wdt:P19 is “place of birth”)
wd: entities are Wikidata items (e.g., wd:Q5 is “human”)
The SERVICE wikibase:label clause automatically resolves entity IDs to human-readable labels
OPTIONAL prevents the query from failing when a property is missing

Querying Cities and Their Properties from DBPedia

DBPedia mirrors much of Wikipedia’s structured content as RDF triples. It uses different ontology conventions than Wikidata but is equally useful for knowledge representation tasks. Here we query DBPedia’s public SPARQL endpoint for cities and their populations:

 1 # dbpedia_cities.py - Query DBPedia for city data
 2 
 3 from SPARQLWrapper import SPARQLWrapper, JSON
 4 from pprint import pprint
 5 
 6 queryString = """
 7 SELECT ?city_uri ?dbpedia_label ?population ?country_label
 8 WHERE {
 9     ?city_uri
10         <http://dbpedia.org/ontology/type>
11         <http://dbpedia.org/resource/City> .
12     ?city_uri
13         <http://dbpedia.org/property/populationEst>
14         ?population .
15     ?city_uri
16          <http://www.w3.org/2000/01/rdf-schema#label>
17          ?dbpedia_label FILTER (lang(?dbpedia_label) = 'en') .
18     OPTIONAL {
19         ?city_uri <http://dbpedia.org/ontology/country> ?country .
20         ?country <http://www.w3.org/2000/01/rdf-schema#label>
21                  ?country_label FILTER (lang(?country_label) = 'en') .
22     }
23 }
24 ORDER BY DESC(?population)
25 LIMIT 10
26 """
27 
28 sparql = SPARQLWrapper("http://dbpedia.org/sparql")
29 sparql.setQuery(queryString)
30 sparql.setReturnFormat(JSON)
31 results = sparql.queryAndConvert()
32 
33 for r in results["results"]["bindings"]:
34     city = r['dbpedia_label']['value']
35     pop = int(r['population']['value'])
36     country = r.get('country_label', {}).get('value', 'unknown')
37     print(f"  {city} ({country}): population {pop:,}")

The output (results may vary as DBPedia data is updated):

 1 $ uv run dbpedia_cities.py
 2   Fort Worth, Texas (United States): population 1,008,106
 3   Charlotte, North Carolina (unknown): population 911,311
 4   Detroit (unknown): population 645,705
 5   Gombe, Nigeria (Nigeria): population 446,800
 6   Ilesa (Nigeria): population 416,000
 7   Pittsburgh (unknown): population 307,668
 8   Durham, North Carolina (United States): population 296,186
 9   Toledo, Ohio (United States): population 265,638
10   Winston-Salem, North Carolina (United States): population 252,975
11   Huntsville, Alabama (unknown): population 249,102

When I use RDF data from public SPARQL endpoints like DBPedia or Wikidata in applications, I start by using the web-based SPARQL clients for these services, find useful entities, manually look to see what properties are defined for those entities, and then write custom SPARQL queries to fetch the data I need. The web-based query editors at query.wikidata.org and dbpedia.org/sparql are invaluable for this exploratory process.

We will use more SPARQL queries in the next chapter.

The SQLite Relational Database for Knowledge Representation

The SQLite database is included in the standard Python distribution, making it the zero-setup option for persistent data storage. While graph databases naturally express relationships between entities, relational databases can also serve as effective knowledge representations when the schema is designed to capture entity types, attributes, and relationships.

A Reusable SQLite Library

We start with a simple reusable library for SQLite using the standard library sqlite3:

 1 # sqlite_lib.py - Reusable SQLite helper functions
 2 
 3 from sqlite3 import connect, version
 4 
 5 def create_db(db_file_path):
 6     """Create a database and return the connection."""
 7     conn = connect(db_file_path)
 8     return conn
 9 
10 def connection(db_file_path):
11     """Create and return a database connection."""
12     return connect(db_file_path)
13 
14 def query(conn, sql, variable_bindings=None):
15     """Execute a SQL query and return all results."""
16     cur = conn.cursor()
17     if variable_bindings:
18         cur.execute(sql, variable_bindings)
19     else:
20         cur.execute(sql)
21     conn.commit()
22     return cur.fetchall()

Modeling a Knowledge Graph in SQLite

Relational databases become knowledge representation tools when we design tables to capture entities, their types, their attributes, and the relationships between them. Here is an example that builds a simple knowledge base about scientists, their fields, and their discoveries:

  1 # sqlite_knowledge.py - Knowledge representation with SQLite
  2 
  3 import sqlite3
  4 
  5 def build_knowledge_base():
  6     """Build a relational knowledge base about scientists and their work."""
  7     conn = sqlite3.connect(":memory:")
  8     cur = conn.cursor()
  9 
 10     # Entity tables: each table represents a type of entity
 11     cur.execute("""
 12         CREATE TABLE scientists (
 13             id INTEGER PRIMARY KEY,
 14             name TEXT NOT NULL,
 15             birth_year INTEGER,
 16             nationality TEXT
 17         )
 18     """)
 19 
 20     cur.execute("""
 21         CREATE TABLE fields (
 22             id INTEGER PRIMARY KEY,
 23             name TEXT NOT NULL,
 24             description TEXT
 25         )
 26     """)
 27 
 28     cur.execute("""
 29         CREATE TABLE discoveries (
 30             id INTEGER PRIMARY KEY,
 31             name TEXT NOT NULL,
 32             year INTEGER,
 33             description TEXT
 34         )
 35     """)
 36 
 37     # Relationship tables: capture how entities are connected
 38     cur.execute("""
 39         CREATE TABLE scientist_field (
 40             scientist_id INTEGER REFERENCES scientists(id),
 41             field_id INTEGER REFERENCES fields(id),
 42             PRIMARY KEY (scientist_id, field_id)
 43         )
 44     """)
 45 
 46     cur.execute("""
 47         CREATE TABLE scientist_discovery (
 48             scientist_id INTEGER REFERENCES scientists(id),
 49             discovery_id INTEGER REFERENCES discoveries(id),
 50             PRIMARY KEY (scientist_id, discovery_id)
 51         )
 52     """)
 53 
 54     # Populate with knowledge
 55     cur.executemany("INSERT INTO scientists VALUES (?, ?, ?, ?)", [
 56         (1, "Albert Einstein", 1879, "German"),
 57         (2, "Marie Curie", 1867, "Polish"),
 58         (3, "Richard Feynman", 1918, "American"),
 59     ])
 60 
 61     cur.executemany("INSERT INTO fields VALUES (?, ?, ?)", [
 62         (1, "Physics", "Study of matter, energy, and their interactions"),
 63         (2, "Chemistry", "Study of the composition and properties of matter"),
 64         (3, "Quantum Mechanics", "Physics of atomic and subatomic systems"),
 65     ])
 66 
 67     cur.executemany("INSERT INTO discoveries VALUES (?, ?, ?, ?)", [
 68         (1, "Special Relativity", 1905, "Time and space are relative"),
 69         (2, "Radioactivity", 1898, "Discovery of radium and polonium"),
 70         (3, "Quantum Electrodynamics", 1948, "Quantum theory of light and matter"),
 71     ])
 72 
 73     cur.executemany("INSERT INTO scientist_field VALUES (?, ?)", [
 74         (1, 1), (1, 3),  # Einstein: Physics, Quantum Mechanics
 75         (2, 1), (2, 2),  # Curie: Physics, Chemistry
 76         (3, 1), (3, 3),  # Feynman: Physics, Quantum Mechanics
 77     ])
 78 
 79     cur.executemany("INSERT INTO scientist_discovery VALUES (?, ?)", [
 80         (1, 1),  # Einstein -> Special Relativity
 81         (2, 2),  # Curie -> Radioactivity
 82         (3, 3),  # Feynman -> QED
 83     ])
 84 
 85     conn.commit()
 86     return conn
 87 
 88 
 89 def query_knowledge_base(conn):
 90     """Demonstrate knowledge queries against the relational schema."""
 91     cur = conn.cursor()
 92 
 93     # Query 1: Who works in Quantum Mechanics?
 94     print("Scientists in Quantum Mechanics:")
 95     cur.execute("""
 96         SELECT s.name, s.nationality
 97         FROM scientists s
 98         JOIN scientist_field sf ON s.id = sf.scientist_id
 99         JOIN fields f ON sf.field_id = f.id
100         WHERE f.name = 'Quantum Mechanics'
101     """)
102     for row in cur.fetchall():
103         print(f"  {row[0]} ({row[1]})")
104 
105     # Query 2: What did each scientist discover?
106     print("\nDiscoveries by scientist:")
107     cur.execute("""
108         SELECT s.name, d.name, d.year, d.description
109         FROM scientists s
110         JOIN scientist_discovery sd ON s.id = sd.scientist_id
111         JOIN discoveries d ON sd.discovery_id = d.id
112         ORDER BY d.year
113     """)
114     for row in cur.fetchall():
115         print(f"  {row[0]}: {row[1]} ({row[2]}) — {row[3]}")
116 
117     # Query 3: Which fields overlap between scientists?
118     print("\nScientists who share a field:")
119     cur.execute("""
120         SELECT s1.name, s2.name, f.name
121         FROM scientist_field sf1
122         JOIN scientist_field sf2 ON sf1.field_id = sf2.field_id
123                                 AND sf1.scientist_id < sf2.scientist_id
124         JOIN scientists s1 ON sf1.scientist_id = s1.id
125         JOIN scientists s2 ON sf2.scientist_id = s2.id
126         JOIN fields f ON sf1.field_id = f.id
127     """)
128     for row in cur.fetchall():
129         print(f"  {row[0]} & {row[1]}: {row[2]}")
130 
131 
132 conn = build_knowledge_base()
133 query_knowledge_base(conn)
134 conn.close()

The output shows how SQL JOIN queries traverse the relationships between entities, much like following edges in a graph:

 1 Scientists in Quantum Mechanics:
 2   Albert Einstein (German)
 3   Richard Feynman (American)
 4 
 5 Discoveries by scientist:
 6   Marie Curie: Radioactivity (1898) — Discovery of radium and polonium
 7   Albert Einstein: Special Relativity (1905) — Time and space are relative
 8   Richard Feynman: Quantum Electrodynamics (1948) — Quantum theory of light and matter
 9 
10 Scientists who share a field:
11   Albert Einstein & Marie Curie: Physics
12   Albert Einstein & Richard Feynman: Physics
13   Albert Einstein & Richard Feynman: Quantum Mechanics
14   Marie Curie & Richard Feynman: Physics

The key insight is that the relationship tables (scientist_field, scientist_discovery) transform a flat relational database into a knowledge representation. Each relationship table captures a specific type of connection between entity types, and SQL JOINs let you traverse these connections to answer knowledge queries. While not as natural as a graph database for highly connected data, this pattern works well for structured knowledge with well-defined entity types and relationships.

We will combine the use of SQLite, RDF, SPARQL, and deep learning Natural Language Processing (NLP) libraries later in the book.

If you want to deepen your understanding of the standards behind the SPARQL queries we used in this chapter, the next chapter provides optional reference material on RDF data formats, RDFS sub-property hierarchies, the SPARQL query language in detail, and OWL reasoning. That background will help you write more sophisticated queries against Wikidata, DBPedia, and any other SPARQL endpoint.

Up next

Optional Material: A Deeper Dive Into Semantic Web and Linked Data