Relating to the Database
Relating to the Database
Buy on Leanpub

About this book

This book is intended as a first introduction to relational databases or for those who have a little experience to review the fundamentals of data modeling and SQL. It may be suitable as a textbook for a university course. In particular, it may be a good supplement for one of those overpriced, theory-heavy textbooks that universities like to assign. Due to its low price and its focus on practical data modeling patterns, students may use this book for additional practice and self-study.

Also, as far as I know, it’s the only such textbook to use PostgreSQL as the database of choice for examples. PostgreSQL or “Postgres” is an increasingly popular, free and open source database that runs on all the major platforms (Windows, Mac OS, Linux), making it a great choice for the classroom or independent learner.

Chapter 1: Introducing databases and PostgreSQL

How databases fit in

Imagine that you work in a small direct-response mail order company that takes orders from customers by phone. Each agent in the call center downstairs has a stack of paper order forms on his desk, and when he receives an order he writes down the product name(s), quantity ordered, and the customer’s address and payment information. He uses a calculator or computer to sum up the order total, and tells the customer how much they’ll be charged.

Periodically, a data entry worker visits the call center and picks up stacks of filled-in order forms. She enters each order’s details into a file on her desktop computer, perhaps a big Excel spreadsheet that she’s designed herself for this task. At the end of the day, when all the orders are entered, she sends the complete file to two other departments: fulfillment, which processes the customer’s credit card and packs and ships the orders, and accounting, which calculates each salesperson’s commission.

This is the kind of process that a small business might develop when it’s first getting started, and in fact, it’s exactly the process that I encountered when I was first learning about databases at a small company in Maine. Unfortunately, simple processes like this tend to get complicated as the company gets bigger, and can become impossible to maintain. Just a few of the challenges this company might face are:

  • When the business grows to the point that multiple data entry workers are needed, they must coordinate their work somehow. Perhaps each worker creates her own file, and they must combine them at the end of the day. There are many opportunities for errors to enter the system.
  • If a payment is declined, or if a customer returns an order, one of the old spreadsheets must be updated, but which one? The data is kept in the order it was entered, not alphabetized by customer, and if there are now multiple spreadsheets for each day of business, searching for the old order is a big challenge.
  • As a result of the update made by fulfillment or customer service, multiple versions of the data exist. Accounting still has the old data, and they need to be notified of the change so they can correct the salesperson’s commission. Even if they’re sent the modified file, how do they know which row(s) are changed? And as more changes are made over time, who is responsible for keeping the official record?
  • As the business grows, the order entry data may change. For example, product numbers or names may change, discounts or coupon codes may be introduced, or some kind of membership number may be offered to frequent customers. As the paper order forms change, the spreadsheets must also be modified. Consequently, today’s data files look different from last month’s and even more different from last year’s. As people change jobs and leave the company, can their replacements even read the older data?

As this small business becomes a medium-sized enterprise, the seeming convenience of using a spreadsheet can become a nightmare of data management. Before long, the data entry workers, accountants, and other departments may find they spend more time managing spreadsheets than doing their main jobs. And sooner or later, management may wish to use the historical order data for a new purpose, such as customer relationship management (CRM) or business intelligence (BI). What they’ll find is that the data is spread out over a huge number of files, with multiple versions of each file existing in different places, and older files having a different structure and meaning from newer files. What a mess!

A database

Imagine instead if there was a black box in the office into which all those order forms were fed. At any time, a person could ask the black box to retrieve any order record (or list of records) by time and date, by customer name, by product, or another attribute. Fulfillment could ask for the payments waiting to be processed, and it would get a printout of exactly that. Shipping could request invoices and mailing labels for orders ready to ship. Accounting could ask for the sum of order totals taken by each salesperson for a given time period. Moreover, changes could be filed in the black box and all subsequent requests would include the up-to-date, corrected information. The black box serves as the manager of, and the official system of record for, all the company’s order data.

That’s the big idea of a database. Instead of having every person or department or program keep its own copy of the data, a database serves as a system of record, a “single source of truth” that can always be accessed by everyone who needs it for their different purposes. A database stores some knowledge about the data’s structure and meaning, or metadata, so diverse users can know what they’re looking at. And most importantly, a database offers flexible but easy-to-use query methods so that users can request just the data they want, whether it’s a single record, a collection of data, or an aggregation into averages, counts, or sums.

A variety of databases

Databases come in many shapes and sizes and they serve a lot of different purposes. Most often, a database acts as a server, that is, a software program that is always on, waiting for requests and responding to them. It’s necessary for databases to always be on and available if people and other software systems are going to depend on them to store data. (Otherwise, those people and programs will have to store their own data locally, which defeats the purpose of a database – independent and shared data management.)

A database administrator (DBA) is therefore responsible for a key component of a business’s IT infrastructure. If the database goes down, a lot of other programs go down, so the DBA ought to learn best practices for managing security, keeping backups and recovering from disasters. Maintaining and upgrading a database has been compared to maintaining a sailing ship when it is out to sea. The DBA can’t just take the database out of service to work on it however he wants; the business must stay afloat.

Server-based databases vary widely in scale and scope. Some databases support a single application, such as a dynamic website, and these may run on the same physical computer as the application’s code. A larger database might run on a dedicated machine that multiple users access over the network; for example, a company may keep a database of customer relationship information which can be accessed by sales, marketing, and customer service systems.

Larger still are enterprise-scale databases that integrate a wide variety of subject areas. This category includes enterprise resource planning (ERP) systems that integrate several areas of business operations, and enterprise data warehouses (EDW) used for analysis and reporting of business performance. Enterprise-scale databases may run on mainframes or may be distributed over large clusters of dozens or hundreds of computers.

What you’ll learn from this book

This book will introduce you to relational databases, with data modeling and the SQL language first and foremost. It covers the scope of a typical first database course in an information systems or analytics program at the university level. It can be used as a textbook for an instructor-led course—instructors, please contact the author for an instructor’s guide, slides and materials—or used for self-guided study with or without the video lectures produced by the author (coming soon via Leanpub).

My main goal in creating this book is to make data modeling and SQL understandable to the reader, so it may serve as a good low-cost supplement for students who are struggling with a theory-heavy textbook and having a hard time getting the point. Instead of starting with loads of theory up-front, I’ll take a more pragmatic approach based on relational data modeling “patterns” and examples.

This book will also provide a lot of practice using SQL, the structured query language common to all relational databases, because the most frequent feedback I’ve heard from my undergraduate students (and from companies that hire them) is that they need more practice with SQL.

The database of choice for this book is PostgreSQL (often nicknamed “Postgres”), an open-source database that has become quite popular with developers in recent years. Compared to some other popular databases like SQL Server and MySQL, there aren’t many good books about Postgres, so I hope this book will be valuable if only for its examples. Postgres makes a good choice for teaching because it is free software (both “free as in free speech” and “free as in free beer”), and because it runs on all the major platforms—Windows, Mac OS, and Linux—so you can follow this book no matter what kind of computer you have handy.

Rest assured that the lessons of this book are transferable to other relational databases. Each of the major brands has its own quirks and special features, but this book mainly covers the fundamentals that apply everywhere. As currently planned, one chapter (Chapter 7) will exhibit some of PostgreSQL’s distinctive features and capabilities.

Lab 1: your first PostgreSQL database

Up and running with Postgres

PostgreSQL is available for free at www.postgresql.org and is extremely well documented there. Installation instructions will vary depending on your platform, and should be pretty straightforward. You can probably accept all the default configuration options. Be sure to remember the password you set during installation. You’re up and running when you can enter the command psql at your system’s command line to log in to the local PostgreSQL server. You will recognize this by a readout of the psql version number, a statement about how to get help, and a change to the command prompt. At the time of this writing, it looks like this for me (on my Mac):

$ psql
psql (9.6.4)
Type "help" for help.

joeclark=#

To quit psql, enter \q at the prompt. If the above doesn’t make any sense to you, don’t worry. Appendix A to this book includes my own tested instructions for installing Postgres on Windows, Mac, and Linux. You can also refer to the online documentation. Come back to this page when you’re up and running.

Relations are tables

Databases can be classified according to the types of abstractions they allow you to model your data with. In a relational database, data is modeled as a set of tables with structured rows and columns. Other data models are possible. In a document-oriented database such as MongoDB, data is modeled as documents with tree-like structures. In a graph database like Neo4J, data is structured as a network (a mathematical graph) with nodes and edges. Compared to those newer forms, the relational model is far more commonly seen and better understood, and is the most versatile. Relational databases have been tried and tested in business for nearly four decades, and are probably the best tool for the job in all but a few specialized cases.

The tables you find in a relational database are properly called relations. A relation is not just any table; it is a construct found in set theory and is defined by the following characteristics:

  • Every row has the same columns.
  • Column names must all be different.
  • Each column is defined to contain just one specific type of data.
  • Each row must be unique; usually we enforce this by adding a machine-generated ID number to each row, known as the primary key column.
  • No inherent ordering of rows or columns, or other information about how to display the data, is stored in the table.

Consequently, a relation is a simpler and less flexible structure than a table you might create in a spreadsheet program like Microsoft Excel. Spreadsheets allow you to mix data types, to have rows with different numbers of columns, and to decorate your data with display logic like fonts, colors, and sizing. Figures 1-1 and 1-2 illustrate the comparison with an example of sales data that might be recorded by a small outdoor sports mail-order business.

Figure 1-1: A typical spreadsheet data table
Figure 1-1: A typical spreadsheet data table

In a spreadsheet the user can be lax in data entry, for example omitting the state “TN” when we all know where Nashville is, or entering a quotation mark (meaning “ditto”) instead of spelling something out. Data types unanticipated at the time the table was designed could be inserted freely; for example, a three-letter Canadian province abbreviation could be inserted into a column meant for two-letter US states. Although these are convenient for data entry, they may lead to problems for computer systems that want to use the data (for example, to print mailing labels). The spreadsheet user can also decorate the data with fonts, styles, sizes and colors in order to make it more readable, and he can add extra information like a “grand total” row.

As seen in Figure 1-2, a database table (or relation) is much more strictly defined. Data types must be specified in advance for each column, guaranteeing uniformity. That means special cases must be anticipated before they occur. In this example, the database designer specified that state abbreviations must be exactly two characters, and that the price may be numeric (allowing fractions) rather than integer. In order to guarantee that each row is unique, and therefore can be looked up, the database user has added a primary key column and populated it with an auto-generated ID number.

Figure 1-2: The same table as it would exist in a relational database
Figure 1-2: The same table as it would exist in a relational database

No other information is found in the rows of a relation except the data itself: not fonts and styles, and not even the sort order. Totals, averages, and the like wouldn’t be stored in the table either, because rows correspond only to individual data “records”. Aggregated values like totals and averages could be calculated in a query or perhaps stored in additional tables created specifically for the purpose.

Creating a PostgreSQL database

Let’s fire up PostgreSQL and create a table. (I probably won’t use the word “relation” much after this, except for a bit of theory in Chapter 2. Where I write “table”, you should be able to figure out what I mean.)

First, a note about the term “database”. As I have described it above, a database is a system that organizes and stores data and, importantly, makes it available to people who need to search or retrieve it. Others more precise than I will distinguish between the database, which is the organized data store, and the database management system (DBMS) which is a program like PostgreSQL that creates the data store and grants us access to it. When we call Postgres (or Oracle, or SQL Server, etc.) a database, we are using the term more generally to include both the data store and the DBMS, since they go together.

To understand how we interact with PostgreSQL, though, you need a third definition of the term. In PostgreSQL, a database is a logical subdivision of the data store. You may create any number of tables grouped into databases on the same server. (For the purposes of this book’s labs, your personal computer is acting as a PostgreSQL server.) Table names must be unique within a database, but not within a server. If several examples in this book include a table called “customers”, you can avoid a conflict by creating a new database for each lab.

What we’ll do in this first lab, then, is:

  1. Create a PostgreSQL database called “lab1”
  2. Log in to that database with psql
  3. Create a table of Purchases
  4. Query the single-table database with SQL
Database creation

You can create a database from your operating system’s command line (i.e., before logging in to PostgreSQL with psql or another front-end tool), by using the command createdb. The basic structure of this command is createdb [OPTIONS] [DBNAME], and you can learn more by typing createdb --help at the command prompt. Thus, the following command creates a database called “lab1”:

$ createdb lab1

If necessary, you can also drop (i.e., delete) the new database from the commmand line, with:

$ dropdb lab1

For future reference, you can also create and delete databases using SQL once you’re logged in to psql: the CREATE DATABASE and DROP DATABASE commands, respectively. One way or another, create that database, which will be home to your first table.

Introducing psql

The command-line client for PostgreSQL is psql, and you launch it by entering psql on the command line. This will not open a new window, but rather you will see a brief welcome and the command prompt will be different from the operating system’s default prompt.

$ psql
psql (9.6.4)
Type "help" for help.

joeclark=#

The change in the command prompt means you’re in a different environment. Here, you can enter SQL queries or some commands specific to psql. The first thing I’d recommend you do is type help, which introduces you to a few of the latter. Most psql commands begin with a backslash (\) and you can get a full listing of them by entering the command \?. If you need to quit, \q is the command for that. If you want to, take some time now to explore the lists of SQL queries and psql commands possible.

joeclark=# help
You are using psql, the command-line interface to PostgreSQL.
Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help with psql commands
       \g or terminate with semicolon to execute query
       \q to quit

By default, when you start psql you’re connecting to the database with the same name as your PostgreSQL username. For those of you on Windows, if you followed my installation instructions (in Appendix A), that will be “postgres”. For Mac and Linux users, the default database will be named the same as your computer login name, hence my username “joeclark” appears in the code samples above. Any SQL commands you enter at the prompt will be executed on the default database’s tables, and that isn’t what we want. To switch over to the new database you created, use the \c command:

joeclark=# \c lab1
You are now connected to database "lab1" as user "joeclark".
lab1=#

Notice that the prompt changes to tell you which database you’re working in.

Creating a table

To create a table in the “lab1” database, we use the aptly-named SQL CREATE TABLE command. I will expand on its usage in Chapter 2, but the basic form of it is as follows:

CREATE TABLE table_name (
   column_name    data_type   [OPTIONS],
   ...
   );

As I mentioned, one of the special characteristics of a relation is that each column allows data only of a specified type. PostgreSQL offers a number of built-in data types, such as numeric, text, date, and more. I will discuss the choice of data type more in Chapters 2 and 8, but it need not delay an introductory example. There are many optional clauses available in the CREATE TABLE statement which can be discussed later or looked up in the documentation; the only one we need now is PRIMARY KEY, a flag which indicates that a particular column is going to contain unique values that may be used to look up specific rows later.

The command to create our table of Purchases is as follows. You may type this in at the psql prompt, even if it spans several lines. The code won’t execute until the semicolon (;) is reached. Mind the cases: in PostgreSQL the SQL keywords (i.e. CREATE TABLE, PRIMARY KEY, and the data types) may be uppercase or lowercase, but you should only use lower case letters and underscores (_) for the table and column names. PostgreSQL isn’t very sensitive to whitespace, so you can enter this code all on one line, or spread out over several lines, with indentation and tabs if you want them.

CREATE TABLE purchases (
order_id integer PRIMARY KEY,
city text,
state char(2),
product text,
category text,
price numeric );

If the command succeeded, you’ll see “CREATE TABLE” in the output. If there’s an error message instead, don’t worry, just try again. The most likely causes of errors are typos in the data types, the wrong number of commas, and uppercase letters in the table or column names. If the command worked but you defined the table incorrectly, the easiest solution is to start over by issuing the command DROP TABLE purchases; and creating the table anew.

You can confirm that the table exists with the psql command \dt, which displays a table of all the tables in the currently selected database:

lab1=# \dt
           List of relations
 Schema |   Name    | Type  |  Owner   
---------------------------------------
 public | purchases | table | joeclark
 (1 row)

That’s all there is to defining a table, at least an empty one. In order for us to demonstrate some SQL queries, though, we’ll need to store some data in the table with the SQL INSERT command. We’ll use the simplest form of this command, adding only one row at a time to the table, for example:

INSERT INTO purchases VALUES
(1001, 'Nashville', 'TN', 'Sea Kayak', 'Boating', 449);

Take note that text data must be wrapped in quotation marks ('), and numbers must not.

Writing INSERT commands by hand will quickly become tiresome, and is not the usual mode of entering data into a real database. Typically the database will support software (such as a web app, or an enterprise system) that generates data insertion and update commands automatically. Another way we might load a lot of data quickly is to read in a file containing (presumably machine-generated) INSERT commands. In psql you can execute SQL commands from a file using the \i command.

I have provided a script file containing 100 lines of purchase data on the GitHub repository that supports this book. You may find the file purchases.txt at https://github.com/joeclark-phd/databasebook-postgres, in the “psql_scripts” folder. If I have downloaded this file to a Windows laptop and saved it in the directory C:/psql_scripts, the command looks like this:

lab1=# \i c:/psql_scripts/purchases.sql
INSERT 0 1
...

Be sure to use the correct file path for your operating system and the location where you downloaded the script.

Regardless of how you insert data into the table, please add at least several records so that you can try some meaningful queries in the next section. If you are experiencing errors with INSERT commands, check that the number of values you’re inserting matches the number of columns in the table, that they’re in the right order, and that the order_id number is unique for each inserted row. If you run into problems, you can empty the table by entering the command DELETE FROM purchases; and start again.

Querying your data with SQL

SQL is the structured query language more or less common to all relational databases, and it really shines for its ability to extract just the data you want from a table or group of tables. What kinds of queries might you want to make of this data? You might want specific subsets of the data, such as all the orders for a particular product or in a particular state. Or you might want to aggregate the data, that is, sum or count or average them, perhaps in groups. Even with one simple table, there are quite a few ways to query it.

Let’s start with the basics. Your first query is the simplest: it just requests all the data.

SELECT * FROM purchases;

That’s quite a lot of rows, so I’ll give you a trick to shorten the results. Affix “LIMIT <number>” to the end of the query to get only the first several rows:

SELECT * FROM purchases
LIMIT 10;

The meaning of the “*” is “all columns”. It’s possible to request only certain columns, for example, let’s say you only want to know the cities and states that your customers are ordering from. Specify the desired columns in the “SELECT” clause:

SELECT city, state
FROM purchases
LIMIT 10;

Most of the time you don’t want every row, but want to select a subset of the data. This is accomplished with the “WHERE” clause of a query. You may request a single row by its primary key, for example:

SELECT *
FROM purchases
WHERE order_id = 1011;

Or you may give criteria that qualify more than one row, if you want to see a specific subset. For example:

SELECT *
FROM purchases
WHERE state='ME';

The criteria don’t have to be “equality” conditions, by the way. We can also use numerical inequalities; any row for which the inequality is “true” will be returned:

SELECT city, state, product
FROM purchases
WHERE price > 500;

Another condition you might use, for a primitive text search, is “LIKE”. The “%” character is a wildcard that matches any text. Thus, the following code returns all data where the product name ends in “Kayak” but would not return data where there was additional text after that word.

SELECT city, state, product
FROM purchases
WHERE product LIKE '%Kayak';
Aggregate queries

The queries above allow you to carve out subsets of the data by requesting only certain columns, certain rows, or both. In every example, though, the rows you get in the result are rows from the original table. Aggregate queries are those that generate data by combining the original rows via an aggregation function, usually SUM, COUNT, or AVG. Obviously the sum of two rows is one row, and is not identical to either of the original rows. The following query gives you the total dollar amount of all purchases in the table:

SELECT SUM(price)
FROM purchases;

No matter how many rows were in the original table, the query above returns just one row. Say… how many rows are in the original table?

SELECT COUNT(order_id)
FROM purchases;

The COUNT function actually counts the number of non-null values in the specified column, not the number of unique values. If you used COUNT(state) instead of COUNT(order_id) you’d get the same result. Even if you have five hundred purchases in 50 states, the COUNT would be 500, not 50. If you have set up the table to allow nulls (empty or missing values) in the specified column, then you might get a result of less than 500. If you want to be sure to get exactly the total number of rows, simply use COUNT(*).

A grand total (or count, or average) is interesting, but a lot of the time what we want to do are compare subtotals (or counts, or averages) for various groupings of the data. To do this, we introduce a “GROUP BY” clause. If we want to know how many purchases were made in each of several categories, we can group by the “category” column and count up the rows in each group:

SELECT category, COUNT(*)
FROM purchases
GROUP BY category;

If you want to know which products account for the largest portions of your revenue, you might group by product and sum the order prices. There are a lot of products, though, so it’s helpful to sort the results with an “ORDER BY” clause. Since the sum is the 2nd column of the result data, that’ll be “ORDER BY 2”.

SELECT product, SUM(price)
FROM purchases
GROUP BY product
ORDER BY 2;

To get just the top ten, here’s a trick: sort the data in descending (“DESC”) order, and “LIMIT” the results to just the first ten rows.

SELECT product, SUM(price)
FROM purchases
GROUP BY product
ORDER BY 2 DESC
LIMIT 10;

In Chapter 2 and beyond you’ll learn a lot more SQL, such as how to create queries that “join” multiple tables, and how to write queries that employ nested sub-queries. Even in this example, though, you’ve seen several of the main parts of a SELECT query, including the “WHERE”, “GROUP BY”, and “ORDER BY” clauses, and aggregate queries. You have begun to see that even a simple one-table database may be queried in several different ways, and that doing this with short SQL queries may be much easier than trying to wrangle the data in Excel.

I recommend that you attempt the exercises and challenges at the end of this chapter to get more practice with the basics of relational databases, SQL, and PostgreSQL specifically.

Summary

A database serves as a respository for data and a system to manage it, a “single source of truth” that can mitigate problems organizations face when working with information—coordination, version control, and information security, to name a few. A database becomes a key piece of business infrastructure when numerous people and systems depend on it. Accordingly, the database typically operates as a server, a system that is always on, always ready to respond to requests from other systems. Database administrators are the professionals who engineer and maintain databases. There are many types of databases; this book focuses on the most popular and versatile type—the relational database. PostgreSQL is an increasingly popular, open-source, and cross-platform relational database that should serve us well as a platform for learning relational modeling and the SQL query language. These concepts are easy to transfer to other relational databases once you learn them

Definitions

black box
a term sometimes used to hypothesize a device by describing what it would do without addressing how it might actually work
database
(in common usage) a system that organizes and stores data and, importantly, makes it available to people and systems that need to search or retrieve it
(more precisely) the structured data stored by such a system, which is created and accessed using a database management system (DBMS)
(in PostgreSQL) a logical grouping of tables within a PostgreSQL server
database administrator (DBA)
a professional responsible for developing, securing, and maintaining the database(s) that an organization depends on
database management system (DBMS)
a software system used to create, manage, and query a database
database server
a database that operates as a server; PostgreSQL acts as a server even when installed on a personal computer
display logic
information or code that specifies how data is to be presented to a human user; unlike spreadsheets, database tables do not convey their own display logic
document-oriented database
a non-relational database type in which data is logically modeled as collections of “documents” with tree-like structures
graph database
a non-relational database type in which data is logically modeled as a network diagram (a mathematical graph) with nodes and edges
metadata
information about data’s structure and meaning; examples include column names and data types
Postgres
nickname for PostgreSQL
PostgreSQL
an advanced open-source RDBMS increasingly popular with software developers; freely available at www.postgresql.org
primary key
a column in a data table that is guaranteed to have a unique value for each row, and therefore allow us to retrieve a specific piece of data
query
(v) to request specified data from a database
(n) a specification in code (e.g. SQL) of a request for data from a database
relation
a data “table” that conforms to a few criteria mentioned in this chapter and further detailed in Chapter 2
relational database
a database in which data is logically structured as a collection of relations (i.e. tables) and which conforms (more or less) to the principles proposed in E.F. Codd’s 1970 paper “A Relational Model of Data for Large Shared Data Banks”
relational database management system (RDBMS)
a DBMS for a relational database; notable examples include PostgreSQL, MySQL, Oracle, SQL Server, DB2, and Access
server
a software or hardware system that is always on, waiting to respond to requests from users or other systems
SQL
the “structured query language”, a declarative language for querying an RDBMS; mostly standardized, there are minor differences in the SQL “dialect” used by different database brands
table
the commonly term for a relation in a database

New psql commands

help
display a list of the most basic psql commands
\?
list all psql commands
\h
list all SQL commands supported by PostgreSQL
\c <database>
switch to working in the specified database
\dt
list all tables in the current database

New SQL syntax

TBD

Challenges

  1. Look up and read an encyclopedia entry on databases. Did you learn anything that you think ought to have been included in this chapter? Email the author or leave feedback on Leanpub!
  2. Find and read the official PostgreSQL documentation on the “CREATE TABLE” command. What are some of the other rules or constraints you can place on a column? Name three, and speculate on when they would be useful.
  3. Look up the documentation on data types supported by PostgreSQL. There are a lot of them. Make a short list (or a cheat sheet) of the five or six data types you think you’d use most often.
  4. Try to find out on your own how to use the “ALTER TABLE” command to add a new column to an existing table. Specifically, add a “date” column to the “purchases” table used as an example in this chapter.
  5. PostgreSQL offers a graphical user interface (GUI) called pgAdmin, which may have been installed with it. Find this program on your computer or install it, and figure out how to log in to your PostgreSQL database(s) with it.
  6. On your own, figure out how to use pgAdmin to create a new table in the “lab1” database. Make sure it has a primary key column. Can you add some sample data to the table without having to use SQL “INSERT” commands?
  7. See if you can find out how to open a window within pgAdmin to execute arbitrary SQL commands. Run some of the example SQL queries from this chapter through that interface.
  8. Other graphical clients for PostgreSQL (besides pgAdmin) are available, some free (or free to students) and some commercial. Find and examine one or two of these alternatives. What do you think of them?

Exercises

TBD

Think About It

  1. Not all databases are relational. How might you structure a database without using familiar tables, rows, and columns? What would be the advantages and disadvantages of such a data modeling approach?

Chapter 2: The Relational Model

There are other data models

For a couple of decades (roughly 1985-2005), the relational data model was the only game in town, you had to learn it, and there was no reason for a textbook to argue the point. Today, data engineers have a lot of other options. Document-oriented databases are booming in popularity with app developers for their ease of use; graph databases have captured the imagination of researchers and tinkerers because of their natural fit with social network applications; and the analytics world has found performance advantages to be gained with dimensional databases, column-family databases, and cluster-based Big Data platforms like Hadoop. These are meaningful advantages, so before we take it for granted that the relational model is the most important one to learn, it is necessary to remind ourselves what it is and what its unique advantages are.

A data model is a means of describing the data in a database without regard to the way it is actually physically stored. It provides abstractions that humans can work with—e.g. tables, documents, dimensions—instead of implementation artifacts like bytes, pointers, and disk sectors. This abstraction is quite important when you consider the growth and evolution of a database over time, and the number of applications that might come to depend on it. As databases are used, they grow larger, the types of data stored may grow or change, and it may be necessary for database administrators to optimize their performance by upgrading the technology in various ways. If the developers of applications that depend on these databases had written their code to interact with the data as it was stored on disk, their applications would break, and code would need to be rewritten, every time such a change was made. But because we have data models, this is not necessary. Application developers interact with the database using a “data language” that is independent of physical implementation; they query tables, rows, and columns for instance, instead of locations on disk.

The relational model’s particular strength is its ability to efficiently answer queries that were not foreseen at the time the database was developed. Before E.F. Codd’s landmark paper introducing the relational model, the leading approach to data modeling was a hierarchy or network in which you would follow links from one data point to another. Consider, for example, a hierarchical database of movies listed by director:

Figure 2-1. A hierarchical database of movies
Figure 2-1. A hierarchical database of movies

In this example, it would be very easy to query the database for a list of movies directed by Christopher Nolan; just start from his name and follow the pointers. By contrast, it would be quite difficult (computationally) to query for a list of movies in which Matthew McConaughey had appeared. The database engine would have to essentially read through the entire database, the vast majority of which is not relevant to the query, following all paths from left to right in the diagram to make sure it had found all of the paths that end with his name. In a really huge database, such a query could be prohibitively expensive.

The problem is, as databases grow, you will always find that you want to make queries you didn’t anticipate at the time you created the database. Codd, a researcher at IBM, had this problem in mind when in a 1970 he published “A Relational Model of Data Storage”.

Figure 2-2. Edgar F. Codd
Figure 2-2. Edgar F. Codd

In the relational model, the database consists of a set of tables, one for each entity (or noun) described by data. Each table can be queried on its own, or related tables can be combined into one query, but there is no “parent” table and no network or hierarchy that must be traversed. You have already seen a glimpse of this in Chapter 1; now, let’s define the terms a bit more precisely.

Theoretical roots

Codd’s conception of a relation comes from set theory:

The term relation is used here in its accepted mathematical sense. Given sets S1, S2, …, Sn (not necessarily distinct), R is a relation on these n sets if it is a set of n-tuples each of which has its first element from S1, its second element from S2, and so on. We shall refer to Sj as the j th domain of R.

It is possible to visualize this as a table, hence the commonly-used language of tables, rows, and columns. In Figure 2-3, we see a relation defined by four sets—a set of movie titles, a set of movie studios, the set of integer years, and the set of director names. These indicate the possible values that may appear in any given data record; it’s not necessarily the case that all of the years of history (for example) will be found in the data. This is an important point: the relation (table) is “defined” by specifying the domains (columns) and not by the rows. You’ll see this in SQL’s CREATE TABLE command, which specifies column names and data types only.

Figure 2-3. A relation with four domains, three tuples
Figure 2-3. A relation with four domains, three tuples

The term tuple (derived from “double”, “triple”, “quadruple”, and so on, up to “n-tuple”) in mathematics refers to an ordered list of values. In this context, a tuple fits in to a relation if it contains one value for each domain, in the same ordering as the domains. So in the above example, a tuple must contain a movie title, a studio name, an integer, and a director’s name, in that order. Each such tuple is a “row” of the table, or a record of the data—in this case, it’s a movie that we’re interested in.

Because a relation is a set of tuples so defined, a number of constraints apply—some of which will be relaxed in practical implementations. These include:

  1. Each row must be unique.
  2. The order of rows is immaterial.
  3. The order of columns is significant.
  4. Each row must include a value for each column.

The first rule is easy to accidentally violate, for example in an order-taking system, where the same customer may purchase the same product on more than one occasion. It is common practice therefore to add a primary key (PK) column that contains a value guaranteed to be unique in the relation, such as an identification number.

The second rule implies that we must query the database without regard to the order in which data was entered, or any other order. One cannot query for the “next” value and reliably predict what result will be given. Attention must be paid to the WHERE clause of a SQL query to specify exactly what we want.

In practice, the third rule is ignored. Codd informs us that the mathematical term for a relation with no specific domain ordering is a relationship. Instead of referencing an element of a row by its position in the sequence, in practice we use column names. Database developers should give sensible names to their columns, particularly when there may be one or more with the same domain of potential values. For example, a table of customer records may have two or more “address” columns, one for billing and one for shipping. It would be wise to give these columns names that indicate both the domain and the role, such as “address_billing” and “address_shipping”.

The fourth rule may be relaxed as well. Allowing missing values (called nulls) in certain columns gives database developers the flexibility to include optional attributes, or to add a data record in a step-by-step way instead of all at once. Obviously nulls cannot be allowed in every column, or the first rule may be violated. Therefore most databases don’t allow nulls in the primary key column.

In a query result, which otherwise resembles a relation, even the first two rules may be violated. This can be demonstrated by the following database interaction:

ch2=# select studio from movies;
 year
---------
 Paramount
 Paramount
 Warner Bros
(3 rows)

ch2=# select movie, year from movies order by year;
 movie         | year
---------------+------
 Sahara        | 2005
 Inception     | 2010
 Interstellar  | 2014
(3 rows)

The first result contains non-unique rows. There’s a mathematical term for a set of n-tuples that admits duplicates: it’s called a bag. The second example result orders the rows in a meaningful sequence.

A relational database is a database of relations

We can therefore think about a relational database as a collection of tables (technically relations (even more technically, relationships)) that, together, describe the important facts in whatever context we’re interested in: a business, an application, a research project, or whatever. In some cases, a table may be useful on its own, but in many tasks you will want to query two or more related tables together. Yes, tables are related: customers are assigned to salespeople, employees belong to departments, products have bills of materials that go into their manufacture.

Two aspects of the context must be represented in the database design: entities and relationships. Entities are the nouns that matter: people, things, places, events, and concepts. Relationships are connections between entities which can usually be described by verbs. The most useful tool in developing a data model is an entity-relationship diagram (aka E-R diagram or ERD). See Figure 2-4. The boxes are entity types (although we usually just say “entities”), which correspond to tables. The lines indicate the relationship types (or “relationships”), which indicate how the rows of each table relate to the rows of the other tables.

figure 2-4. A sample ERD with 1:1, 1:M, and M:M relationships
figure 2-4. A sample ERD with 1:1, 1:M, and M:M relationships

Three types of relationships that differ in cardinality are depicted in Figure 2-4, a relational data model for a database of movies (like IMDB) diagrammed in “crow’s foot” notation. These are a “one-to-many” (or 1:M) relationship between Studio and Movie, a “one-to-one” (1:1) relationship between Script and Movie, and a “many-to-many” (M:M) relationship between Movie and Actor. Out of these three basic types, complex data models can be designed.

One-to-many

The most common type of relationship between tables is one-to-many (1:M), as seen in Figure 2-5. The rectangles in the diagram stand for entity types, which correspond to tables, and mean that any number of entity instances of each type may be in the database. We can assume that there are, or will eventually be, a large number of movies in the Movie table and large number of studios in the Studio table. The line with a crow’s foot at one end and a single tick mark at the other end indicates the type of relationship that may exist between rows of the Movie table and rows of the Studio table. Simply put, it says that each movie is related to just one studio, but that any given studio may be related to more than one movie. It is also common practice to add text to the diagram to explain why, or how, the two entity types might be related—in this case, studios produce movies (and movies are produced by studios).

Figure 2-5. A one-to-many relationship
Figure 2-5. A one-to-many relationship

When it comes time to create the actual tables, a one-to-many relationship is implemented by a simple and intuitive mechanism called a foreign key (FK). Remember that each table has a primary key, a column of data whose values are guaranteed to be unique—typically an ID number or code. To implement a 1:M relationship we add a special column to the table on the “many” side (in Figure 2-5, “Movie”) that holds references to primary key values in the other table. This could be achieved with the following SQL (note the REFERENCES clause):

CREATE TABLE studio (
  id  integer  PRIMARY KEY,
  name text );

CREATE TABLE movie (
  id  integer  PRIMARY KEY,
  title text,
  studio_id integer REFERENCES studio(id) );

If we view the tables with a few sample rows of data (Figure 2-6), the purpose of the foreign key column studio_id should be clear. At a glance you can see that Sahara and Interstellar were produced by Paramount (studio #1) and that Inception was produced by Warner Bros. (studio #2).

Figure 2-6. Sample tables in a 1:M relationship (emphasis on foreign key column)
Figure 2-6. Sample tables in a 1:M relationship (emphasis on foreign key column)

Notice that there is no foreign key in the “Studio” table. If you pencil in a “movie_id” column on Figure 2-6, or simply imagine one, the reason should be obvious. If a studio were assigned a “movie_id”, then a studio could only be related to one movie, and that clearly doesn’t mesh with reality or with our E-R diagram.

Many-to-many

The next most common kind of relationship between two entity types is many-to-many (M:M), as in the relationship between Movies and Actors: each movie involves more than one actor, and each actor can be in more than one movie. See figure 2-7.

Figure 2-7. A many-to-many relationship
Figure 2-7. A many-to-many relationship

Now, an M:M relationship is easy to draw on a diagram, but it’s a little more complicated to implement as real tables in the database. Take a minute to think about how you might do it, before reading about it below.

Did you come up with a solution?

Well, here’s the solution that I teach my students: to represent the many-to-many connection, you need a third table. This is the only kind of case where a table represents a relationship rather than an entity. This new table, called an associative relation, might have two columns only: each a foreign key to one of the tables in the relationship. Figure 2-8 illustrates some sample data for the simplest kind of associative table.

Figure 2-8. Movie_Actor creates the relationship between Movie and Actor
Figure 2-8. Movie_Actor creates the relationship between Movie and Actor

A question of style concerns whether you should depict the associative entity on your E-R diagram—essentially as a table with two 1:M relationships to the entities of interest—or simply use the double-crow’s-foot notation as in Figure 2-7.

To answer this, consider the opportunity afforded by the existence of an associative table. It need not be limited to two foreign key columns only, because a table can have any number of columns. You can therefore use columns of the associative table to give meaningful characteristics about the relationship. Such a relationship may in fact be a “thing” of its own, perhaps an intangible one. An M:M relationship between a Buyer and a Seller might be an entity called a Contract. An M:M relationship between a social media User and a Channel he follows could be called a Subscription. And in the case of the Actor-Movie relationship, we might call it a Role.

The answer to the style question is that, if the associative table can be construed as an associative entity, a thing or concept (or noun) meaningful to the business, it makes sense to add it to the diagram. If the associative entity is meaningless to anyone except the guy writing database code, you would in most cases leave it off the diagram.

See Figure 2-9 for an example diagram including the associative entity. Note that instead of an M:M relationship you now see two 1:M relationships. In this example, the rectangles representing entities also show the names of some columns that would belong to each table; this is also a common style of E-R diagramming.

Figure 2-9. Many-to-many relationship with meaningful associative entity.
Figure 2-9. Many-to-many relationship with meaningful associative entity.

An implementation in SQL could look like this:

CREATE TABLE movie (
  id integer PRIMARY KEY,
  title text,
  year integer );

CREATE TABLE actor (
  id integer PRIMARY KEY,
  name text );

CREATE TABLE role (
  id integer PRIMARY KEY,
  movie_id integer REFERENCES movie(id),
  actor_id integer REFERENCES actor(id),
  character_name text );

Notice that there are two foreign keys in the role table.

One-to-one

One-to-one (1:1) relationships are much less common than 1:M and M:M relationships, simply because if two types of entities are related this way it’s often easier to put them together in one table. For example, a person has only one Social Security number, and a Social Security number has only one person, so you would typically have a table for People and include a Social Security number column rather than having two tables.

One reason you might store the data in separate tables is if some of the data needs to be stored in a different (physical) space, or stored with different security settings. Physical database optimization is beyond the scope of this chapter (although we’ll come back to it in Chapter 8) but I’ll give you and example. In many web applications, each user has a profile picture. Picture files (JPEG, PNG, or whatever) are quite large on disk, and to include them in the Users table of the database would make that table larger (perhaps by orders of magnitude) hence making queries slower. To give your website users a quicker response time, you might store the pictures in a separate ProfilePicture table, and relate them by foreign key to the corresponding rows of the Users table.

Another use case for a 1:1 relationship is when one of the tables contains “optional” data. For example, let’s say our movie database contains 10,000 rows, and we have scripts for 1,000 of those movies. Rather than have a script column in the Movie table, with 9,000 NULL entries making that table bigger and slower, we could have a separate Script table with only the rows it needs. In this case, each row of Script relates to exactly one row of Movie, but each row of Movie may relate to zero or one row of Script. It’s not exactly a parnership among equals: Movie is the main table in the relationship, and Script is a “dependent”.

Figure 2-10. One-to-one relationship between Script (optional) and Movie (required)
Figure 2-10. One-to-one relationship between Script (optional) and Movie (required)

Figure 2-10 uses a version of the crow’s foot notation that indicates the difference. The double tick mark next to Movie means that, for any row of Script, there must be at least one and no more than one row of Movie. The circle (or zero) with a tick mark next to Script means that for any row of Movie, there may be zero or one rows of Script. (The simpler notation was seen earlier, in Figure 2-4.)

To implement this in SQL, even though you could get away with putting the foreign key on either side of this relationship, or on both, the best practice is to put a foreign key in the “dependent” table. In this case, that table is Script. Essentially a 1:1 relationship is a special case of 1:M, where the foreign key is simply constrained to be unique. Since primary keys are unique, the most practical way to do this is for the rows of the dependent table to have the same primary key values found in the “independent” table, and for its primary key to double as a foreign key, like so:

CREATE TABLE movie (
  id integer PRIMARY KEY,
  title text,
  year integer );

CREATE TABLE script (
  movie_id integer PRIMARY KEY REFERENCES movie(id),
  author text,
  body text );

A selection of the data, as in Figure 2-11, shows that the Script table borrows its primary key values from Movie but does not have a row for every value.

Figure 2-11. Sample data from tables in a 1:1 relationship
Figure 2-11. Sample data from tables in a 1:1 relationship

Unary relationships

All of the relationships in the E-R diagrams seen so far have been binary relationships, meaning they include two tables, and are said to have degree of two. Other degrees are possible. Frequently we also see unary relationships, where rows of one table are related to other rows of the same table. Unary relationships can be of any cardinality we’ve seen so far: 1:M, M:M, or 1:1. Remember that it is not the tables that are related, per se, but the rows.

An example of a unary 1:M relationship is posed by Figure 2-12, which depicts a movie studio ownership relationship. Studios are often owned by other studios, as for example Disney owns Lucasfilm and Sony owns Columbia. This can be expressed as a foreign key in the Studio table referencing an “owner”. (Note that it can’t be done with a foreign key referencing the “owned” studio; the FK belongs on the “dependent” side of a 1:M relationship.)

Figure 2-12. A unary 1:M relationship
Figure 2-12. A unary 1:M relationship

Here it is in SQL. The NULL keyword indicates that owner is an optional field; it may contain NULL if a particular studio has no owner recorded in our database.

CREATE TABLE studio (
  id  integer  PRIMARY KEY,
  name text,
  owner integer NULL REFERENCES studio(id) );

As in other cases, sample data may be helpful to illustrate how the implementation works.

Figure 2-13. Sample data for a unary 1:M relationship
Figure 2-13. Sample data for a unary 1:M relationship

A unary many-to-many relationship could easily represent ties of affiliation between individuals or organizations. Hollywood being the way it is, we could define a M:M relationship between Actors called “was married to”. Like binary M:M relationships, this would require an associative table (which may or may not be diagrammed). See Figure 2-14.

Figure 2-14. Two ways to diagram Hollywood marriages as a unary M:M relationship.
Figure 2-14. Two ways to diagram Hollywood marriages as a unary M:M relationship.

An implementation in SQL could look like this:

CREATE TABLE actor (
  id integer PRIMARY KEY,
  name text );
        
CREATE TABLE marriage (
  id integer PRIMARY KEY,
  husband integer REFERENCES actor(id),
  wife integer REFERENCES actor(id) );

And some sample data illustrates how it works:

ch2=# select * from actor;
  id  |        name
------+--------------------
 9001 | Brad Pitt
 9002 | Angelina Jolie
 9003 | Jennifer Aniston
 9004 | Billy Bob Thornton
(4 rows)

ch2=# select * from marriage;
 id | husband | wife
----+---------+------
  1 |    9001 | 9003
  2 |    9004 | 9002
  3 |    9001 | 9002
(3 rows)

Finally, an example of a unary 1:1 relationship that might be found in our movie database as a reference from a sequel to its predecessor. This would be implemented as an optional column in the Movie table called “sequel_to”, referencing another row of the Movie table, like so:

ch2=# select * from movie where title like 'Rocky%';
 id |   title   | sequel_to
----+-----------+-----------
  1 | Rocky     |
  2 | Rocky II  |         1
  3 | Rocky III |         2
  4 | Rocky IV  |         3
  5 | Rocky V   |         4
(5 rows)

Ternary and n-ary relationships

Virtually all of the relationships you’ll find in most E-R diagrams are binary (connecting two tables) or unary (connecting rows within one table). It is nevertheless possible, however uncommon, to imagine a relationship connecting three tables (“ternary”), four (“quarternary”), or any arbitrary number of tables (“n-ary”). This simply means that one of each entity type is required to create a case of the relationship.

In the context of movies, we might consider a distribution deal to be one such relationship. A movie distribution deal is an agreement between a producer (Studio) and a third party (Distributor) to distribute a movie (Movie) in a particular territory (Territory). Figure 2-15 illustrates this as a quarternary many-to-many-to-many-to-many (M:M:M:M) relationship.

Figure 2-15. An example of a quarternary M:M:M:M relationship.
Figure 2-15. An example of a quarternary M:M:M:M relationship.

It is rare to see this kind of relationship diagrammed, though. The only way to implement a relationship of degree > 2 is to use an associative table with foreign key columns for all the participants. Therefore, in most cases you will simply model a new associative entity—for example “Deal” or “Contract”—which contains details as well as 1:M relationships with each of the related tables.

Lab 2: Creating and querying a relational database

In Lab 1 you created a one-table database and were introduced to some of the basic things you can do with SQL queries. Now that you have learned the basics of how multiple tables can be related to one another within a relational database, you will want to see some richer and more realistic examples. Moreover, you’ll need to familiarize yourself with the main way that tables are queried together: the SQL join.

What we’ll do in this lab is:

  1. Create a new database called “lab2”.
  2. Define several tables to learn the features of the CREATE TABLE command.
  3. Learn how to code different types of JOIN queries

In order to get started, first create a new empty Postgres database called “lab2”, log in to your Postgres server using psql, and switch your context to the new database—much as we did in Lab 1:

$ createdb lab2
$ psql
psql (9.6.1)
Type "help" for help.

joeclark=# \c lab2
You are now connected to database "lab2" as user "joeclark".
lab2=#

The CREATE TABLE command

SQL is divided into two main types of commands, called data definition language (DDL) and data manipulation language (DML) respectively. DDL commands are those used to design and structure the tables that constitute the database, and the chief among them is CREATE TABLE. (Two others you may frequently encounter are ALTER TABLE and DROP TABLE.) The basic form of a CREATE TABLE command in Postgres is as follows:

CREATE TABLE table_name (
  column_name data_type [constraints/options],
  column_name data_type [constraints/options],
  ...,
  [constraints]
);

I’ve kept it simple here in order to make a clear introduction, and will reveal more options as we move on. The complete specification of CREATE TABLE can be found in the PostgreSQL online documentation, which is excellent. You have already seen several examples of CREATE TABLE, one of which I’ll reproduce here:

CREATE TABLE studio (
  studio_id integer PRIMARY KEY,
  name text 
);

In this example you saw only one of PostgreSQL’s optional constraints: PRIMARY KEY. Yes, a “PK” is a constraint on the data. By that we mean that it sets up a rule that will lead to errors if we try to insert bad data—here’s one of the advantages of databases as opposed to spreadsheets: they tell us when we make mistakes. In this case, the rule is the entity integrity rule, which dictates that the primary key column must contain a unique (not null) value for every row. Look what happens if I try to add two movie studios with the same “studio_id”:

lab2=# insert into studio (studio_id,name) values (1,'Disney');
INSERT 0 1
lab2=# insert into studio (studio_id,name) values (1,'Warner Bros');
ERROR:  duplicate key value violates unique constraint "studio_pkey"
DETAIL:  Key (studio_id)=(1) already exists.

Three other constraints you’ll see me use in this lab are NOT NULL, which means a particular column can’t be empty, REFERENCES which sets up a foreign key relationship, and CHECK which allows for arbitrary validation of the data.

Developing your database iteratively

As you work through this and other labs, you’ll soon find that you’ve made mistakes or come up with better ideas, and want to start over. Typing those CREATE TABLE commands into psql gets tedious and is error-prone. As in most other types of programming, the solution is to write your SQL code in a text file (you may call it a script) that you can save, modify, and re-use. You can tell Postgres to run the whole script at once with a one-line command. This allows you to iterate toward a design that works.

To continue with this lab, create a text file using any text editor you like, such as Notepad++ on Windows, TextWrangler on the Mac, or Vim for you Linux geeks. One trick that I find handy is to precede my CREATE TABLE code with corresponding DROP TABLE commands. That’s because I expect to run this script over and over again, tweaking it until I get it right, and you have to delete a table before you can (re)create one with the same name.

File lab2.sql:

DROP TABLE studio;
CREATE TABLE studio (
  studio_id integer PRIMARY KEY,
  name text
);

There are two ways to tell Postgres to run this script. From within the psql environment, use the \i command and the location of the script:

lab2=# \i c:/psql_scripts/lab2.sql
DROP TABLE
CREATE TABLE

Or if you’re not logged in to psql, you can use your operating system’s command line, specifying the database name (after “-d”) and the script file (after “-f”):

$ psql -d lab2 -f lab2.sql
DROP TABLE
CREATE TABLE

If you made any mistakes, don’t worry about it! Correct your code and run the script again, as many times as you need until it works.

Data types in PostgreSQL

Recall that each column in a relation is defined by a domain. When defining a table, we constrain the domain of values that may be stored in a column by specifying a data type. In the “studio” table, you saw two data types used: the studio’s ID number is defined as an integer and its name is text. If you tried to insert the wrong type of data into either column (such as a decimal number, perhaps) you’d get an error. In order to design your database well, you should become familiar with the main data types available in Postgres.

When it comes to numbers, there is a trade-off between precision, accuracy, and database performance. The integer data type stores whole numbers in the range -2147483648 to +2147483647 with absolute precision but doesn’t allow fractions. The real data type can hold a decimal number of any magnitude with about 6 significant digits of precision, but small inaccuracies can result from the necessity of rounding. Both require four bytes of storage on disk. A third type, numeric, can be made precise and accurate up to an arbitrary number of digits before and after a decimal point, but it is much larger on disk and hence slower to process. The numeric type might be useful for storing currency amounts because businesses can’t afford any kind of inaccuracy when counting money.

A variation on integer that will become very useful is the serial type; this specifies an integer column that will be automatically filled in, when a new row is created, with the next whole number in sequence: 1, 2, 3, and so on. It can be very handy for a primary key column, where you don’t care about the actual value except that it must be unique.

Text in a computer is stored as a sequence or string of characters—mostly letters, numbers, spaces, and punctuation—and the data types for text differ in whether you want to constrain the length of the text. In standard SQL, the two main types are char(n) and varchar(n). The char(n) type specifies text that has exactly “n” characters. Like the bed of Procrustes, the database will cut off the end of a string that is too long, or stretch one out that is too short (by adding spaces to the end of it). You might use char(2) to store a state abbreviation in an address. The varchar(n) type holds text of any length up to “n”, so varchar(20) is probably adequate to store last names, and varchar(64) might be enough to accommodate titles of books. You might use varchar if you need to limit the length of the data, for example to make sure it will look good on a computer screen or print on a shipping label.

In most databases, char offers a performance advantage over varchar when the data is predictably the same length: for each row of data, char(10) would require 10 bytes while varchar(10) would require about 2 bytes to say how long the text is plus 0-10 bytes to store the text. When the text is of widely varying length, varchar would have an advantage because it doesn’t pad the shorter values with spaces. However, in Postgres, due to clever engineering there is actually no such performance trade-off. In fact, Postgres offers a third type, simply called text, which allows character strings of unlimited length and is no slower than the others. So unless you have a reason to limit the length of a text value, or feel strongly about sticking to standard SQL, use Postgres’s text type.

Some of the other data types you will often find useful are boolean (a true/false value), date, time, and timestamp (the latter combines a date and a time of day). Postgres is also known for offering some highly unorthodox (to the SQL world) data types, such as Arrays, XML documents, and JSON, but those are beyond the scope of this chapter.

I’ve added the following to lab2.txt:

DROP TABLE person;
CREATE TABLE person (
  person_id serial PRIMARY KEY,
  first_name text,
  last_name text NOT NULL,
  sex char(1) CHECK (sex='M' or sex='F'),
  birthdate date
);

You can see that I’ve used a few more features of the DDL here. I’ve added a NOT NULL constraint to the “last_name” column, so it can’t be empty. (Any of the other fields will accept nulls, effectively making them “optional”.) The CHECK constraint will validate data in its column according to any arbitrary logical test. In this case we’re checking that “sex” is either “M” or “F”; the table will throw an error message if you try to enter anything else. It does accept a null value, though. You also see examples in this code of the serial, char, and date data types which we haven’t used before.

Referential integrity

Previously we saw the entity integrity rule in action—that every row must have a unique value for its PK. Another vitally important integrity constraint for relational databases is the referential integrity rule, which states that a foreign key value can only match a valid primary key value from the referenced table. (Nulls may be allowed by the database designer, but values that don’t match existing PKs cannot be.) It also happens that you cannot define a foreign key column if the referenced table doesn’t exist, so this rule has an effect on the sequence in which we must create the database and insert data.

Case in point: although we are building a database of movies, we could not create the “movie” table first, because we know it’s going to reference certain other tables such as “studio”. Now that “studio” exists, we can define “movie” like so:

DROP TABLE movie;
CREATE TABLE movie (
  movie_id serial PRIMARY KEY,
  title text,
  year integer CHECK (year>1900 and year<2100),
  studio_id integer REFERENCES studio(studio_id),
  director_id integer REFERENCES person(person_id)
);

If you try to create the “movie” table before the “studio” and “person” tables exist, PostgreSQL will refuse to do it, and give you an error message, because the foreign key constraints on the “studio_id” and “director_id” columns won’t make sense to it. What you might not have guessed is that if you try to drop the “studio” or “person” table after creating “movie”, you’ll also get an error message. When dropping tables, you must drop the referencing tables before you drop the referenced tables. In order for our script to work, we have to re-arrange it so that the command DROP TABLE movie; comes before the other DROP commands. My script now looks like this (with column definitions omitted):

DROP TABLE movie;
DROP TABLE studio;
DROP TABLE person;

CREATE TABLE studio ( ... );
CREATE TABLE person ( ... );
CREATE TABLE movie ( ... );

The last table created is the first table deleted.

Sequencing of INSERT commands is also important. We cannot add a movie before its studio exists, because “studio_id” in the “movie” table must match a real “studio_id” in the “studio” table. Ditto for directors. (We can create a movie before its actors have been added, because there’s no direct FK relationship to actors.)

By the way, there are two common ways to write the PostgreSQL INSERT command: single-row and multi-row insertions. In either case, you first specify the columns to add data to, and then provide the values for the new row(s). Single-row insertions look like this:

INSERT INTO studio (studio_id, name) VALUES (1,'Disney');
INSERT INTO studio (studio_id, name) VALUES (2,'Paramount');
INSERT INTO studio (studio_id, name) VALUES (3,'Warner Bros');

INSERT INTO person (first_name, last_name, sex, birthdate)
 VALUES ('Christopher','Nolan','M','1970-07-30');
INSERT INTO person (first_name, last_name, sex, birthdate)
 VALUES ('Breck','Eisner','M','1970-12-24');
INSERT INTO person (first_name, last_name, sex, birthdate)
 VALUES ('Brad','Bird','M','1957-09-24');

Notice that when inserting to the “person” table, we didn’t specify a “person_id”. We could have if we’d wanted to, but because the PK is a serial data type, it will automatically number the new rows for us. You can check the numbers with a simple SELECT query in psql. Your numbers might be different from mine if you have created and deleted other data previously, so be sure to check:

lab2=# select * from person;
 person_id | first_name  | last_name | sex | birthdate
-----------+-------------+-----------+-----+------------
         1 | Christopher | Nolan     | M   | 1970-07-30
         2 | Breck       | Eisner    | M   | 1970-12-24
         3 | Brad        | Bird      | M   | 1957-09-24

Multi-row insertions look like the following. Make sure you check the director’s “person_id” PKs and use the right ones in your code:

INSERT INTO movie (title,year,rating,studio_id,director_id) VALUES
 ('Sahara',2005,'PG-13',2,2),
 ('Interstellar',2014,'PG-13',2,1),
 ('Inception',2010,'PG-13',3,1),
 ('The Incredibles',2004,'PG',1,3),
 ('Ratatouille',2007,'G',1,3);

Commas separate each row’s values, and a semicolon ends the command.

To complete the database for our lab, let’s create the one-to-one and many-to-many relationships. The “script” table is in a 1:1 relationship with “movie”: there is zero or one script per movie. We will not include the full screenplay in this example database, just the screenwriter’s name.

CREATE TABLE script (
  movie_id integer PRIMARY KEY REFERENCES movie(movie_id),
  screenwriter text NOT NULL
);
INSERT INTO script (movie_id,screenwriter) VALUES (1,'Donnelly');

Actors are related to movies in this database via an M:M relationship: each actor may be in multiple movies and each movie may include multiple actors. As diagrammed in Figure 2-9, we implement this by creating an associative table called “role” which has foreign keys to both “person” and “movie”.

CREATE TABLE role (
  role_id serial PRIMARY KEY,
  movie_id integer REFERENCES movie(movie_id),
  actor_id integer REFERENCES person(person_id),
  character_name text
);

INSERT INTO person (first_name, last_name, sex, birthdate) VALUES
 ('Leonardo','DiCaprio','M','1974-11-11'),
 ('Joseph','Gordon-Levitt','M','1981-02-17'),
 ('Matthew','McConaughey','M','1969-11-04'),
 ('Anne','Hathaway','F','1982-11-12'),
 ('Penelope','Cruz','F','1974-04-28'),
 ('Lou','Romano','M','1972-04-15');

INSERT INTO role (movie_id, actor_id, character_name) VALUES
 (1,6,'Dirk'), (1,8,'Eva'), (2,6,'Coop'), (2,7,'Brand'),
 (3,4,'Cobb'), (3,2,'Arthur'), (4,9,'Bernie'), (5,9,'Linguini');

Feel free to add more movies and actors if you like. This example is just a tiny prototype of what you might find behind the scenes of IMDB.com. The complete code for my lab2.txt is available on this book’s GitHub repo. Next, we’ll start writing queries that join tables.

Queries that JOIN tables

What makes a relational database more than just a collection of single-table databases is the capability to join tables and query them together. We can combine the “studio” and “movie” tables with a query like this one, which I’ll explain below:

SELECT title, year, rating, studio.name AS studio
FROM studio NATURAL JOIN movie;

Tables are joined using the FROM clause. Instead of identifying one table, we can list two (or more) separated by commas. What the database does when you query multiple tables is first take the Cartesian product of the two. Essentially what this means is that it combines each row of the first table with each row of the second. If the “studio” table has three rows of data and “movie” has five, the Cartesian product has fifteen. Observe:

lab2=# SELECT name FROM studio;
    name
-------------
 Disney
 Paramount
 Warner Bros
(3 rows)

lab2=# SELECT title FROM movie;
      title
-----------------
 Sahara
 Interstellar
 Inception
 The Incredibles
 Ratatouille
(5 rows)

lab2=# SELECT name, title FROM studio, movie;
    name     |      title
-------------+-----------------
 Disney      | Sahara
 Disney      | Interstellar
 Disney      | Inception
 Disney      | The Incredibles
 Disney      | Ratatouille
 Paramount   | Sahara
 Paramount   | Interstellar
 Paramount   | Inception
 Paramount   | The Incredibles
 Paramount   | Ratatouille
 Warner Bros | Sahara
 Warner Bros | Interstellar
 Warner Bros | Inception
 Warner Bros | The Incredibles
 Warner Bros | Ratatouille
(15 rows)

Clearly, of course, most of these combinations don’t make any sense. The Incredibles is a Disney picture, so there’s no circumstance where you’d want a row matching it up with Paramount or Warner Bros. Recall from Chapter 1’s lab that the WHERE clause allows us to filter the rows of a result. What you’d want to do now is to keep only those matchups where the PK “studio_id” of the “studio” data equals the FK “studio_id” of the “movie” table, like so:

lab2=# SELECT name, title FROM studio, movie
lab2-# WHERE studio.studio_id = movie.studio_id;
    name     |      title
-------------+-----------------
 Paramount   | Sahara
 Paramount   | Interstellar
 Warner Bros | Inception
 Disney      | The Incredibles
 Disney      | Ratatouille
(5 rows)

That’s more like it! In my WHERE clause I identified the columns of interest by the combination of table name and column name, e.g., “studio.studio_id”. This is only necessary where the column name alone would be ambiguous otherwise, that is when two or more tables have column names in common; it’s an option available to you in other cases.

SQL also offers a JOIN keyword that you can use to make it more explicit what you’re doing. The following two SQL statements do exactly the same thing:

SELECT * FROM studio, movie WHERE studio.studio_id = movie.movie_id;
SELECT * FROM studio JOIN movie ON studio.studio_id = movie.movie_id;

This situation—a join on a one-to-many relationship where the FK and PK columns have exactly the same names—is so common that it’s known to SQL users as a natural join, and in fact, PostgreSQL offers a keyword for it so you don’t have to do the tedious typing of the equality condition. The following command does the same thing as the two above:

SELECT * FROM studio NATURAL JOIN movie;

The result of this query keeps the same column names from the two joined tables, and “name” becomes a bit ambiguous. We can use the AS keyword to rename a result column. Being more specific about what we want as the result brings us back to the first example above:

SELECT title, year, rating, studio.name AS studio
FROM studio NATURAL JOIN movie;

While we’re at it, why not pull in the director’s name, too? It’s also derived from a 1:M relationship, but the PK “person_id” and FK “director_id” aren’t the same, so we can’t use a natural join.

lab2=# SELECT title, year, rating, name AS studio, last_name AS director
lab2-# FROM studio NATURAL JOIN movie 
lab2-# JOIN person ON person_id = director_id;
      title      | year | rating |   studio    | director
-----------------+------+--------+-------------+----------
 Sahara          | 2005 | PG-13  | Paramount   | Eisner
 Interstellar    | 2014 | PG-13  | Paramount   | Nolan
 Inception       | 2010 | PG-13  | Warner Bros | Nolan
 The Incredibles | 2004 | PG     | Disney      | Bird
 Ratatouille     | 2007 | G      | Disney      | Bird

A many-to-many relationship, as we’ve seen, is effectively implemented as two one-to-many relationships between the principal tables and an associative table. To get all appearances of each actor in a movie’s cast, we use the “role” table as the associative one. Join “movie” to “role” and then to “person” (or do it the other way around).

SELECT first_name, last_name, title, character_name
FROM movie, role, person
WHERE movie.movie_id = role.movie_id
AND role.actor_id = person.person_id;

At the start of this chapter, I said that in an old-fashioned hierarchical database such as pictured in Figure 2-1, it would be very easy to query the database for a list of movies directed by Christopher Nolan, but quite difficult (computationally) to query for a list of movies in which Matthew McConaughey had appeared. Now we see that by breaking up the data so there’s one table for each entity, and utilizing the power of joins, either of those queries can be done in a brief SQL snippet:

lab2=# SELECT title
lab2-# FROM movie, role, person
lab2-# WHERE movie.movie_id = role.movie_id
lab2-# AND role.actor_id = person.person_id
lab2-# AND last_name = 'McConaughey';
    title
--------------
 Sahara
 Interstellar
(2 rows)

lab2=# SELECT title
lab2-# FROM movie JOIN person
lab2-# ON movie.director_id = person.person_id
lab2-# WHERE last_name = 'Nolan';
    title
--------------
 Interstellar
 Inception
(2 rows)

Inner and outer joins

You’ll see a lot more SQL tricks involving joins throughout this book. One last twist I’d like to add in this chapter is the concept of inner and outer joins. Natural joins, and indeed all of the joins so far, are inner joins because they only return those rows of the original tables that participate in the relationship. To illustrate this, look what happens when you join “movie” with “script”, keeping in mind that we only created one row of “script”:

lab2=# SELECT title, year, rating, screenwriter
lab2-# FROM movie NATURAL JOIN script;
 title  | year | rating | screenwriter
--------+------+--------+--------------
 Sahara | 2005 | PG-13  | Donnelly
(1 row)

The four movies in our database that don’t have matching rows in “script” do not appear in the result. In any case where the relationship is optional, you can imagine a sort of Venn diagram: there may be several rows of Table A and several of Table B but only a few combinations where an “A” is related to a “B”. Sometimes, though, we want the complete set of rows of one of our tables. For example, we might want the list of all movies, showing the screenwriter’s name if we know it. That’s called an outer join and you can think of it as taking the entirety of one of the circles in the Venn diagram. An outer join is a LEFT OUTER JOIN if the first table (the one mentioned before “JOIN”) is the one you want to include all of. In this way, we can get the full list of movies with screenwriters’ names.

lab2=# SELECT title, year, rating, screenwriter
lab2-# FROM movie LEFT OUTER JOIN script
lab2-# ON script.movie_id = movie.movie_id;
      title      | year | rating | screenwriter
-----------------+------+--------+--------------
 Sahara          | 2005 | PG-13  | Donnelly
 Interstellar    | 2014 | PG-13  |
 Ratatouille     | 2007 | G      |
 The Incredibles | 2004 | PG     |
 Inception       | 2010 | PG-13  |
(5 rows)

There are also RIGHT OUTER JOINs and FULL OUTER JOINs. For more on joins, see the PostgreSQL documentation. Have you imagined how you could code a self-join which joins a table to itself?

Summary

The relational data model is one of several data modeling paradigms used by databases today, albeit the most popular one. At the heart of the relational model is a construct called a relation, which we typically call a “table”, although it has some strict constraints such that not any table will qualify. The particular strength of the relational model is that it enables database users to efficiently execute queries that were not anticipated at the time the database was designed. This flexibility is obtained by breaking down the data into numerous tables, one for each “entity” or meaningful noun concept in the domain, and by relating these tables to each other by means of foreign key columns. The relationships that may be defined differ in cardinality and degree. Cardinality refers to how many rows of each table participate in a relationship, and the possibilities are one-to-many (1:M), many-to-many (M:M), and one-to-one (1:1). Degree refers to how many tables are related. Binary (two table) and unary (one table) relationships are the most common ones you will encounter. A relational database model can be visualized as an entity-relationship (E-R) diagram, and you are encouraged to become familiar with at least one E-R diagram notation, such as the “crow’s foot” notation summarized in Figure 2-16.

Figure 2-16. Symbols in the crow's foot notation for E-R diagrams
Figure 2-16. Symbols in the crow’s foot notation for E-R diagrams

Summary

TBD

Definitions

TBD

New psql commands

TBD

New SQL syntax

TBD

Challenges

  1. In the lab we constrained the “sex” column of the “person” table to be an uppercase “M” or “F”. If a lowercase value were provided, we would see an error. Can you rewrite the constraint such that it would accept either case?
  2. Try to use ALTER TABLE to add an ownership relationship to the “studio” table in our lab, as diagrammed in Figure 2-12. This is tricky because you already have some data in the table, so you need to think about how to avoid a referential integrity error from those pre-existing rows that don’t have FKs.

Exercises

TBD

Think about it

TBD

Chapter 3: Querying the data

A declarative query language

You have already seen some of the Structured Query Language (SQL) which is used to express queries in Postgres (and every other relational database that I know of) and you’re going to see a lot more in this book’s chapters. You have “programmed” several queries but here’s one thing you may not know: SQL is not a programming language. A computer program written in a language like Python, Java, or C++ is imperative—it gives a computer a sequence of instructions to carry out until it finished. SQL, by contrast, is a declarative language. In SQL queries, you describe the result that you want, not how the computer should obtain it. That turned out to be a genius move by the creators of the first relational databases.

Inside a DBMS like Postgres is a special function called the query optimizer which processes a SQL query and generates an execution plan for how best to obtain the desired result. In a complex query that incorporates multiple tables, there may be several steps in the plan, some slow and some quicker. These operations may include full table scans (reading an entire table from disk; slow), index scans (much quicker), and different types of join operations (see Table 3-1). Doing them in a certain order may be faster than doing them in another order, and this can make a big difference in a database with millions or billions of rows. Because SQL is declarative, the query optimizer has the freedom to choose the most efficient sequence.

Table 3-1: Sample of primitive operations in query execution
Operation Meaning
Full table scan Read every row in the table and find the one(s) specified by the query.
Index scan (aka “seek”) Search an index to quickly find locations of the rows specified by the query. A database index is conceptually like the index in the back of a book; it makes finding the right “page” much quicker, more so when the book is longer.
Table access Go directly to the location of the specified row(s) and read the data.
Hash join A two-phase algorithm to quickly join two tables based on an equality condition.
Nested loop join A slower join algorithm that accommodates inequality conditions and other unusual joins.

Here is a key point that I’ll come back to repeatedly: you should take advantage of the work that the database developers have already done. Yes, you could write your own execution plan, or your own program for processing the data, but it would take a lot of time and you might not get it right. Database engines are designed by some of the smartest computer scientists in the world and honed by practical experience for years, and they have very likely anticipated queries like yours. Give the database the freedom to optimize, and it will generally do an excellent job.

Relational operations

Relations (remember, this is the mathematical term for what we’re calling “tables”) are sets of tuples, as discussed in Chapter 2. There are a number of mathematical operations that can be performed on them, with the interesting property of closure: the result of each relational operation is itself a relation. The clauses of a SQL query can be interpreted as a specification of relational operations to be performed on the specified tables. Interestingly, just as you might simplify a complicated equation in high school algebra before solving it, the query optimizer might use relational algebra to build its execution plan—choosing which operations to perform first in order to reduce the amount of computation it will have to do to finish the job.

The key relational operations identified by E. F. Codd and derived from set theory are the projection, selection, and Cartesian product operations, but to this database developers have added several more very useful operations, particularly extended projection, aggregation, grouping, and sorting. See Table 3-2.

Table 3-2: Important relational operations in SQL queries
Operation SQL clause Symbol Meaning
Projection SELECT \(\sigma\) Return only the specified columns
Selection WHERE \(\Pi\) Return only the rows that match specified criteria
Cartesian product CROSS JOIN \(\times\) Return every combination of a row from table 1 with a row from table 2
Natural join NATURAL JOIN \(\Join\) Return all combinations of rows in specified tables that are equal on their common column
Extended projection SELECT \(\sigma\) Generate new columns in the resulting table, such as the results of calculations or logical tests
Aliasing optional AS \(\rho\) Assign a (new) name to a column in the resulting table
Aggregation SUM, COUNT, AVG, etc. \(G_{f(x)}\) Replace original rows with a single row containing the computed result
Grouping GROUP BY \(_xG\) In combination with aggregation, split the original data into subsets to yield subtotals, subaverages, or whatever
Sorting ORDER BY n/a Re-arrange the rows in a specific order

Basic relational operations from set theory

Projection is the operation of reducing a table to a subset of its columns, and in SQL it is expressed as a list of columns following the SELECT keyword, for example:

SELECT name, age
FROM players;

Selection is the operation of reducing a table to a subset of its rows, and in SQL it is expressed as a logical test (for equality or inequality) follwing the WHERE keyword. Multiple conditions may be combined into one with the AND and OR keywords if needed. For example:

SELECT *
FROM players
WHERE team='Patriots' AND position='QB';

These are certainly the most common operations, and most queries will employ both. Consider the query

SELECT name, age
FROM players
WHERE team='Patriots' AND position='QB';

This query may be expressed in relational algebra as \(\Pi_{name,age}(\sigma_{team=Patriots \land position=QB}(players))\). This formulation implies that the selection operation should be computed first, and then the projection operation. But because it is an algebra, and because the outcome of every operation is another relation, we could just as easily flip it around, i.e.: \(\sigma_{team=Patriots \land position=QB}(\Pi_{name,age}(players))\). This kind of flexibility gives the query opimizer room to make choices that speed up the query.

The third of the “original” relational operations is the Cartesian product operation which joins every row of one table with every row of a second table. The Cartesian product is expressed in PostgreSQL as CROSS JOIN and one way it sometimes comes in handy is to generate a cross-tabulation of the rows of two tables. For example, if you want a report to yield some statistics about every football team in every year (perhaps to build a line graph?), the core of the query might be:

SELECT * FROM teams CROSS JOIN seasons;

Or in relational algebra, \(teams\times seasons\). The more common type of join, as discussed in Chapter 2, is a natural join, where each row of one table is joined with only the rows of the other table that have matching values of a specific column (i.e., a foreign key - primary key relationship). In Postgres there is actually a NATURAL JOIN keyword that works when the columns literally have the same name. If they have different names (for example, if a “players” table has a FK called “team_id” but in the “teams” table it’s simply called “id”), you can use either a JOIN clause or a WHERE condition to effect the join. These are three ways you might perform a natural join on two tables in SQL:

SELECT * FROM players NATURAL JOIN teams;
SELECT * FROM players JOIN teams ON player.team_id=team.id;
SELECT * FROM players, teams WHERE player.team_id=team.id;

In relational algebra, the natural join is expressed as \(players\Join _{team\_id=id} teams\); the subscript expressing the join condition can be omitted if the FK-PK relationship is obvious. You could perform a natural join by first taking the Cartesian product and then selecting the rows where the FK matches the PK, Ã la \(\sigma _{team\_id=id} (players \times teams)\), and in theory this is what the database engine is doing. In practice, the query optimizer will use an algorithm like a hash join to perform an equality join much more quickly.

Inequality joins are also possible. If you want to join each player with teams he is not on, in order to perform some kind of comparison, you might do the following:

SELECT * FROM players JOIN teams ON players.team_id != teams.id;

In relational algebra notation this is \(players\Join _{team\_id \neq id} teams\). Such a join is generally going to be quite expensive in computational terms because the database engine must perform a nested loop: for each row of the “players” table it must loop through the entire “teams” table to find relevant rows.

Extensions to the relational toolkit

Although relational modeling and relational algebra originate in set theory, database developers and users have made numerous pragmatic extensions to the original theory-derived set of methods we can apply. After all, a database isn’t an academic exercise, but a practical business tool.

The idea behind extended projection is that a query can give us not only a selection of columns from the original table(s), but can also produce new columns as a result of calculations or logical tests. For example:

SELECT running_yards + passing_yards FROM game_results;

A closely associated idea is that of aliasing (also known as the “rename” operation), an operation that changes the name of a column or assigns a name to a column that doesn’t have one (such as the calculated column above). In Postgres, you can use the optional AS keyword, or simply provide an alias after specifying the column:

SELECT running_yards + passing yards AS total_yards FROM game_results;
SELECT first_name || ' ' || last_name AS full_name FROM players;
SELECT age > 35 oldguy FROM players;

The last query above is an example that contains a logical test, “age > 35”, and the result will be a column called “oldguy” that contains Boolean values: “true” and “false”. The AS keyword is omitted; it is optional, but may make your queries easier to read.

Another powerful extension to relational algebra is aggregation; this allows us to generate new rows that do not come from the original tables, but instead are the result of calculations over some or all of the original rows. The aggregations you will use most frequently are SUM and COUNT, but a number of other aggregate functions in Postgres are available, particularly for statistics such as MIN, MAX, AVG and so on.

Aggregation works together with the operation of grouping, which identifies the set(s) of rows to be aggregated together. If no grouping condition is set by a GROUP BY clause in the SQL, all rows are aggregated into one result. To count the number of teams, for example, we could simply do:

SELECT COUNT(*) FROM teams;

If we want to compute aggregates for subsets of the data, we use GROUP BY to generate a group for each distinct value of a particular column, for example:

SELECT team, AVG(salary) FROM players GROUP BY team;

In relational algebra notation, the grouping and aggregation operations are denoted by a capital “G”; the grouping column in a preceding subscript and the aggregation function in the following subscript. The above example would be expressed \(_{team} G _{AVG(salary)} (players)\).

A final important operation is sorting. Although in theory the order of rows in a relation is meaningless (it’s a set), in practice we want to sort the rows into some kind of meaningful order. There is no standard notation for this operation in relational algebra, but in SQL it’s expressed in the ORDER BY clause of a query:

SELECT * FROM players ORDER BY last_name, first_name;

The relational operations listed here are the key components from which you’ll build most of your queries, and they are common to virtually all relational databases. In Chapter 5, we’ll introduce you to some additional relational patterns that are useful in special cases, and in Chapter 7 we’ll explore some of the special features of PostgreSQL that distinguish it from other relational databases.

Queries within queries

SQL queries can be much more complex than you have seen so far, when the answer to one query depends on the answer to others. In such cases, a complete query called a subquery can be nested within another. Subqueries are most often found in the WHERE, FROM, and SELECT clauses. For example, this query names all players who earn a salary greater than the average:

SELECT last_name, first_name, salary FROM players
WHERE salary >
  (SELECT AVG(salary) FROM players);

The inner query, “SELECT AVG(salary) FROM players” is evaluated first and yields a single numerical result. Then the outer query is evaluated to finish processing the query. Just like in a mathematical expression, parentheses set the subquery apart from the clauses of the surrounding query. (I also indented the subquery, but this is not necessary; PostgreSQL like most databases is indifferent to whitespace like spaces, tabs, and line breaks.)

Subqueries in the FROM clause, when evaluated, produce result sets that act like tables in the outer query. To identify all the quarterbacks in the AFC East, we could join the “players” table with the result of a subquery that selects all teams in that division:

SELECT last_name
FROM players JOIN 
  (SELECT * FROM teams WHERE conference='AFC' AND division='East') AS afceast_te\
ams
  ON players.team=afceast_teams.team_id
WHERE position='QB';

In this example you saw that the AS keyword can be used to assign a name to a table, just as previously we saw it used to assign a name to a column. When using a subquery as a table in the FROM clause, it must be given a name. The AS keyword is optional, though I think it makes the query easier to read.

Another place you will often see a subquery is the SELECT clause. Such a query will almost always yield a single numeric result, because in the SELECT clause it generates a value for a single column of query output. Here is a simple example that lists the total salary budget for each team. (Note that this could also be accomplished using a join and a GROUP BY… there are more ways than one to solve many SQL problems.)

SELECT city, team_name,
  (SELECT SUM(salary) FROM players WHERE players.team=teams.team_id) AS total_pa\
yroll
FROM teams;

This is a case of a correlated subquery, meaning that this subquery depends on some value from the outer query (namely teams.team_id). Therefore, the subquery will be executed numerous times, once for each row of the “teams” table. The queries in the previous two examples are noncorrelated (or uncorrelated) subqueries which need only be executed once. Correlated subqueries are potentially much slower, so it could be preferable to solve this problem with a join instead of a subquery, but that’s not possible in every case. My philosophy is to trust the query optimizer (and its developers) to find the fastest way to execute the query, and not over-think the SQL: if it ain’t broke, don’t fix it.

Finally, subqueries can be nested within subqueries. Indeed, there may be several levels of such nesting, making for some pretty complicated queries. As you solved complicated equations in high school by first “simplifying the expression”, the query optimizer may have many ways to simplify and sequence things behind the scenes so that the query’s result can be obtained as efficiently as possible.

Lab

TODO:

This time psql directly to the lab3 database.

re-introduce term DML. you have already seen INSERT and SELECT.

Sports database of matches and wins.

Find average number of points scored per team per game.

Find highest-scoring player on every team.

Do an inequality join to identify coach based on start date (arbitrary).

Show teams’ ranks with window function.

Join with generated calendar to show wins/losses per month.

Include some compound logical tests with AND and OR.

Show how to explain a query, impose your own dumb query plan, and to time a query.

Normalizing a data model

coming soon

More relational patterns

coming soon

Logic in the database

coming soon

Advanced PostgreSQL tricks

coming soon

Just enough physical database design

coming soon

Dimensional modeling for analytics

coming soon

Databases in the cloud

coming soon

Appendix A: Installing PostgreSQL and accessing it from the command line

In this chapter, I’ll walk you through installing the latest version of PostgreSQL on Windows 10, Mac OS X, and select Linux distributions. As stated in Chapter 1, the goal is to get to the point that you can enter the command psql at your system’s command line and connect to a PostgreSQL server running in your local environment.

Why type commands the command line? Although there are point-and-click graphical interfaces available for PostgreSQL, it is difficult for me or any teacher to give clear instructions about how to point and where to click. These interfaces may also be different for users on different platforms. The command line interface (CLI) to PostgreSQL is the same for all platforms and allows me to give precise instructions. It will also ensure that you get lots of practice writing SQL. As a bonus feature, non-technical people who see you doing this will think you’re a super-hacker.

Setting up PostgreSQL in Windows 10

Obtain and install PostgreSQL

First, visit www.postgresql.org for the latest information about PostgreSQL. Take note of the latest stable version number, which at the time of this writing appears under the heading “LATEST RELEASES”. There will probably also be an announcement about an unfinished future version, in alpha or beta test stage, but I recommend you avoid that one. From the homepage, follow the “Download” link(s) until you find a Windows installer for the latest version. At the time of this writing, the first option listed on the website is an interactive installer provided by EnterpriseDB, a company that promotes PostgreSQL.

Unless your computer is very old, it should be a 64-bit system, so you’ll want to choose the “Win x86-64” installer. Again, make sure you’re not downloading the unstable alpha or beta, but are getting the newest mature release. The download may take some time if you have a slow connection. When you have downloaded the installer, run it.

Figure A-1: The PostgreSQL installer
Figure A-1: The PostgreSQL installer

The installer has several screens that ask you how to configure PostgreSQL, but for the most part you can accept the defaults. On about the fourth screen, you’ll be asked to create a password for the database superuser (known as “postgres”). This is the user account that can create and delete databases, and other users. Since you’re just practicing on (presumably) your personal computer, it’s best to pick something easy to remember and that you don’t mind other people seeing. By all means, though, don’t forget what password you entered.

Figure A-2: Setting a superuser password
Figure A-2: Setting a superuser password

You should be able to click “next” through all of the remaining screens, and watch the installer do its job. After installing the software, the Setup Wizard will give you the option to launch something called “Stack Builder”, to augment PostgreSQL with some optional add-ons. You can uncheck the box and skip that step. If you do find yourself in Stack Builder, you can get out of it by clicking “Cancel” if you wish.

Figure A-3: Installation finished! You can skip Stack Builder.
Figure A-3: Installation finished! You can skip Stack Builder.

Access psql from the Windows command prompt

To get to the Windows command line, search for the program called “cmd.exe”. In Windows 10 you can do this by clicking the “Windows” icon to launch the main menu, and then typing “cmd”. This should work for other recent versions of Windows, too. I recommend pinning the icon to your taskbar, so you can access it quickly in the future.

Figure A-4: The Windows command prompt, cmd.exe
Figure A-4: The Windows command prompt, cmd.exe

Upon opening this application, you’ll be presented with a text-only window into which you can enter commands one at a time. Type help and hit Enter for a list of basic commands you can use to navigate this interface. Most likely you’ll find cd (change directory) and dir (list the files in a directory) useful for navigating folders and getting to your work.

Test your PostgreSQL installation by typing psql -V. If the output of this command is the PostgreSQL version number, try then entering psql to log in to the database server. If psql opens without an error message, you’re done with installation, and may go back to Chapter 1 to begin the lab.

More likely, though, you’ll get an error message saying that psql isn’t recognized as command or program. Failing that, you’ll get an error message saying that your password is wrong. In either case, this means there’s one more step to do: we need to set two important environment variables.

Setting environment variables in Windows 10

Environment variables are data that are accessible to all programs running on your system, and they’re usually used for things that multiple programs need to know about (such as the location of your database) but which may be different from one computer to the next. By using an environment variable to store this information, you don’t have to include it in code. Therefore a program could be developed on one computer and deployed to another that has things set up differently, and the original code wouldn’t need to be altered. In the code, instead of saying “connect to the database located in directory C:/X/Y/Z”, you’d say “connect to the database at environment variable DATABASE_LOCATION”.

The first environment variable we need to get right is the PATH. When you type the name of a program, like PostgreSQL’s command line interface psql, the operating system looks in a number of directories to find a program of that name. If it doesn’t find one, you’ll see an error message saying “psql is not recognized as an internal or external command”. The PATH is the environment variable that stores the list of directories to search in. To configure it correctly, first find out where psql lives on your system, and then add that directory to the PATH.

Figure A-5: This error means my PATH is not configured.
Figure A-5: This error means my PATH is not configured.

Finding psql is the easy part. It’s most likely in the directory C:\Program Files\PostgreSQL\9.6\bin (change “9.6” if necessary, depending on the version you installed). If not, use your system’s search tools or poke around until you find it. By the way, the true name of this file is “psql.exe”, with the “.exe” suffix signifying to Windows that it is an executable program. Depending on your folder settings, you may or may not see the suffix.

To alter the PATH, navigate to the “System” control panel. Windows 10 makes this a little more complicated than in earlier versions of Windows. You can either do a search for “Control Panel” and then click to “System”, or you can go to the Settings panel, click “System”, then scroll down and click “About”, then scroll down and click “System Info”. Either way you’ll find yourself on a page that displays basic information about your computer, its processor, memory, and operating system version among other details. On the left, click “Advanced system settings”. A new window will appear. Near the bottom of that window, click the button for “Environment Variables…”.

Figure A-6: Windows environment variables settings
Figure A-6: Windows environment variables settings

In the environment variable control panel (see Figure A-6), you’ll see a list of variables specific to your user account, and a list of variables that apply to the whole system (all possible users). You can alter the PATH (or “Path”, “path”, “pAtH”, etc.) in either of these lists, or create it if it doesn’t already exist. If one already exists, select it and click “Edit…”. Add a new listing to the end of the PATH for the directory where psql lives. Do not delete existing entries! They are probably important for other programs you use.

Figure A-7: Adding a new entry to the PATH; in previous versions of Windows this is a long, messy text box but it serves the same purpose.
Figure A-7: Adding a new entry to the PATH; in previous versions of Windows this is a long, messy text box but it serves the same purpose.

Click ‘OK’. You may now open a new command prompt and try psql again, and you’ll most likely get the confusing error illustrated in figure A-8. The program will ask you for the password–the same password you dutifully wrote down during installation (see above)–and then will tell you that “authentication failed for user YourName”. What this means is not that you got the password wrong, but that Postgres got the username wrong. It assumed your database username was the same as your Windows username. Glancing back up at figure A-2, you’ll see that the database user is simply named “postgres”.

Figure A-8: This error means that PostgreSQL guessed the username wrong.
Figure A-8: This error means that PostgreSQL guessed the username wrong.

There’s a workaround for this problem: you can append “-U postgres” to all database commands to specify the username, e.g.:

psql -U postgres

However, that’s a bit tedious, and there’s a better solution. When psql runs, it checks for a username stored in an environment variable called PGUSER before guessing that you want to use your Windows username.

To set this environment variable, click “New…” and enter the name PGUSER with the value postgres. Then click “OK”.

Figure A-9: Setting the PGUSER environment variable.
Figure A-9: Setting the PGUSER environment variable.

After configuring the PATH and PGUSER environment variables, you must click “OK” to close the Environment Variables control panel. Then open a new command prompt and try psql. If you successfully log in to the database server (as in Figure A-10), you’re up and running!

Figure A-10: Up and running with Postgres on Windows 10
Figure A-10: Up and running with Postgres on Windows 10

Setting up PostgreSQL on a Mac

Obtain and install PostgreSQL

The home of PostgreSQL is www.postgresql.org and, although it’s not necessary to go there to obtain Postgres for the Mac,I recommend that you make a visit to that website as your first step anyway. There you can learn about the latest version, locate documentation and other resources, and find the latest information about installation, if this book is out of date by the time you read it. Take note of the latest stable version number, which at the time of this writing appears under the heading “LATEST RELEASES”. There will probably also be an announcement about an unfinished future version, in alpha or beta test stage, but I recommend you avoid that one.

For Mac users, there are actually quite a few ways to obtain Postgres, including an installer program provided by EnterpriseDB, an “app”, and the Homebrew package manager. I’ve found the Homebrew installation quite painless, and it looks like Homebrew may have other handy uses (such as starting and stopping the database server), so that’s the approach I recommend for you.

Figure A-11: Homewbrew's website http://brew.sh/
Figure A-11: Homewbrew’s website http://brew.sh/

If you don’t have Homebrew yet, visit their website at brew.sh for instructions. The easiest way to install Homebrew is to open up a Terminal window and paste the code provided on the web page. This downloads an installation script and runs it on your computer.

Installing Postgres with Homebrew couldn’t be easier. Assuming you have installed Homebrew, go to a command line and type

brew install postgresql
Figure A-12: Installing PosgreSQL with Homebrew
Figure A-12: Installing PosgreSQL with Homebrew

Upon your hitting ENTER, Homebrew should proceed to download Postgres, install it, and configure it for use. You will see a number of messages scrolling by (see for example, Figure A-13) and a new prompt will appear when installation is complete.

Figure A-13: A successful installation using Homebrew
Figure A-13: A successful installation using Homebrew

Starting the server and creating a database

You can test that you installed the software by entering psql -V at the command line. The system should respond by telling you the version of Postgres that you installed. (Tip: if this fails, try closing the Terminal window and opening a new one, then re-enter the command.)

Figure A-14: `psql` works, so the software was installed right.
Figure A-14: psql works, so the software was installed right.

On the Mac, though, there are a couple of additional things to do. Remember that psql is just a utility for talking to the database; for it to be useful we must first start the database. As you might have seen in the messages output by the installation, we can start the server with a Homebrew command like this:

brew services start postgresql

The PostgreSQL server is now running in the background on your computer, and it will keep running even if you restart the computer later. To stop this background server, you can enter the “stop” command at any time:

brew services stop postgresql

There’s one last thing to do now. By default, if you simply enter psql at the command line, Postgres tries to open a database with the same name as your Mac OS username. Since that database doesn’t exist yet, it will fail. The simple solution is to use createdb <username> (see Chapter 1) to create the database it’s looking for. Figure A-15 illustrates the error messages you may encounter, and the steps needed to successfully start the server and create the default database.

Figure A-15: Starting the server and creating the default database
Figure A-15: Starting the server and creating the default database

Now your Mac should be ready to play with all the examples in this book.

Setting up PostgreSQL in Linux

(to do)

About the Author

Joseph W. Clark, Ph.D. has researched and taught information systems and data analytics topics since 2006, most recently at the University of Maine. He was one of the first generation of Web developers in the 1990s, and has been fascinated with databases and data modeling since he first learned how relational databases could power dynamic websites, around 1999. His academic interests lately have been at the intersection of data analytics and entrepreneurship, and new types of workflows such as Agile, Lean, and Design Thinking. His most ambitious project yet is raising four children with his beautiful wife, Xiaofang.