Table of Contents
- About this book
- Using Docker to install and run PostgreSQL
- Chapter 1: Introducing databases and PostgreSQL
- Chapter 2: The Relational Model
- Chapter 3: Querying the data
- Normalizing a data model
- More relational patterns
- Logic in the database
- Advanced PostgreSQL tricks
- Just enough physical database design
- Dimensional modeling for analytics
- Databases in the cloud
- Appendix A: Installing PostgreSQL and accessing it from the command line
- About the Author
About this book
This book is intended as a first introduction to relational databases, or as a review of the fundamentals of data modeling and SQL for those with a little experience. It may be suitable as a textbook for a university course; in particular, it may be a good supplement to one of those overpriced, theory-heavy textbooks that universities like to assign. Due to its low price and its focus on practical data modeling patterns, students may use this book for additional practice and self-study.
Also, as far as I know, it’s the only such textbook to use PostgreSQL as the database of choice for its examples. PostgreSQL, or “Postgres”, is an increasingly popular, free and open-source database that runs on all the major platforms (Windows, Mac OS, Linux), making it a great choice for the classroom or the independent learner.
An open-source book
As of February 2020, I am re-starting this book project as a kind-of, sort-of open-source project. The manuscript, along with its graphics and sample code, can be viewed in its GitHub repository. Although I retain my copyright to the work, I welcome contributions, typo corrections, and improvements to the sample code. I’ll add more details about how you can contribute to the book as the project develops.
Companion videos
I am producing video lessons as I write the chapters of this book. You can view them at your own pace on YouTube – the playlist is at https://www.youtube.com/playlist?list=PL9hjQ6vJTCde9_qagxk82jRVILkzJRy6Y.
Structure of chapters
Ten chapters are currently planned, although as the book is unfinished the plan may change. Here they are:
- Introducing databases and PostgreSQL. We start by putting databases into their proper context as one of the fundamental business information technologies, and by getting you oriented to PostgreSQL. Examples will have you creating and querying single-table databases (which, frankly, may be all you need for some simple business applications!).
- The relational model. We’ll explain the theory behind the design of relational databases and discuss their advantages, then introduce the building blocks of multi-table databases. You’ll learn how tables are connected by keys, work through several examples of entity-relationship diagrams, and then query these databases with all kinds of SQL JOINs.
- Querying the data. This chapter introduces the basic operations of relational algebra, and how they can be assembled together to create SQL queries ranging from simple to complex. New skills will include subqueries, grouping, and functions for working with text and dates.
- Normalizing a data model. Now that you’ve got the fundamentals down, we’ll take a look at the kinds of errors and inefficiencies that can creep into production databases, and learn a set of tricks data modelers call “normalization” that can prevent them.
- More relational patterns. Here we’ll consider some common situations that real-world databases must model, and reveal some frequently-used database design patterns that fit each situation. This chapter may be seen as a cookbook of best practices from the field.
- Logic in the database. Beyond one-off queries, we can enhance our databases with re-usable programs such as views, triggers, and stored procedures. Chapter 6 will discuss several reasons why that might be a good idea, and will teach you how to do it.
- Advanced PostgreSQL tricks. Every database has some special features that distinguish it from others in the same category. In this chapter we’ll take Postgres through its paces, examining signature features such as array variables, JSON data types, and recursive queries.
- Just enough physical design. To get the best performance out of your database, it helps to know something about the physical plan of how the data is written to disk. We’ll learn how to profile database performance and improve it with the right data types and indexes. Database security and some other administrative issues will also be discussed.
- Dimensional modeling for analytics. Analyzing large amounts of data is a very different problem from efficiently carrying out transactions, and has motivated the design of a different way to model data: the star schema. We’ll see why this is a valid case for breaking all the rules of normalization that we learned earlier, and implement a basic star schema with an ETL routine in Postgres.
- Database-driven web applications with Python. Databases don’t exist in a vacuum; they’re usually part of an application. This chapter shows you the whole picture of how a web application might be programmed in Python to link to a Postgres backend and make its data available dynamically to end users. Then we’ll upload it to the cloud. At the end of it all, you’ll have a plan that you can re-use to go out and make your own apps.
- Appendix A: Installation. Setting up PostgreSQL is the first hurdle to jump, unless you have a computer lab administrator who has done it all for you. In this appendix I’ve collected a few sets of instructions for installing PostgreSQL and setting it up for this course.
Case studies
Along the way, I’ll take the reader through a number of case studies. These will be relatively complete databases for realistic businesses: an e-commerce app, a social network, a manufacturing (MRP) system, and a learning management system. We’ll build them up in chunks, a little bit each chapter. How you’ll experience each case study will vary.
Two cases, an auction e-commerce app called AUCTagon and a social networking app called FrienduLATER, will be built up by the author within the book’s chapters. These are demonstration cases that will allow us to give concrete examples of each chapter’s concepts, and the code to reproduce them will be available on the book’s GitHub repo.
Two other cases will be described and offered as homework.
LESSONbinder will be a learning management system (LMS) of the type that most colleges use today. It will be up to you, the reader, to implement its database. My “answers” to this homework case study will be available only in the instructor manual for faculty teaching with this book. Instructors, go ahead and e-mail me to request a copy.
PalletTABLE will be a more challenging exercise: a database to support a manufacturing planning system (a type of system commonly called manufacturing resource planning, or MRP). My implementation of a solution to this exercise will be available in a separate online resource that solo learners can request; it’s kept separate so that you’ll be encouraged to work through the problem yourself before looking “in the back of the book”.
How to use the book
Chapters 1 through 4 cover the core knowledge areas that anyone working with databases should know, and they give lots of examples of SQL queries. A typical college course on databases will thoroughly drill these concepts, so if you’re a student in one of those classes, be sure you know them well. At the end of each chapter you’ll find a glossary of new terms and a review of new SQL syntax that was covered, as well as problems and challenges for review. Answers to some of the questions will be available in a separate online resource.
Chapters 5 and 6 build on the fundamentals to provide you with a repertoire of smart solutions to common problems. Turn to these if you think you know all about database design and SQL but are still struggling to figure out how to use a database in the real world. Chapter 7 adds to this some of the special things that Postgres can do differently than other relational databases such as MySQL and SQL Server.
Chapters 8 and 9 are probably not covered by your college course (unless it’s using Relating To The Database as the text!), but both are intended to enhance your understanding of databases by considering some specialized perspectives. It’s safe to skip over them or come back to them later when you’re ready to broaden your mind.
Chapter 10 is meant to put it all together into a working dynamic website. If you think you’ve got enough database skill but are stumped as to how to assemble a complete tech stack for your big idea, chapter 10’s examples using Python (one of the easiest and most popular languages for new developers) will give you a starting point. Feel free to jump to this chapter at any point once you’ve mastered the basics covered in the first few chapters.
Instructors
Instructors, I will be developing resources for you including instructional videos, slide decks, test banks, and answer keys. When these are available, you’ll be able to contact me to request them. My goal is to make it easy for you to put together a basic course in databases with PostgreSQL. You can use as much or as little of the material as you like.
Using Docker to install and run PostgreSQL
Most business databases are meant to run on servers maintained by system administrators, which means that installing and setting them up can be complicated and can require a fair amount of tweaking. (Desktop databases like Microsoft Access and SQLite are exceptions.) After experimenting with several ways of installing Postgres on Windows, Mac, and Linux machines, and getting frustrated with issues like version updates, I’ve come to the conclusion that it’s better to let someone else do the work for you. That suggests two options: spin up a database “in the cloud”, or run it in a “container”. If your professor takes the first approach, he’ll have databases set up before the class begins and will give you the code to connect to your database. The second approach requires Docker, a neat tool that lets developers “containerize” applications with all the configuration they need, and lets your computer download and run those containers without any fuss.
Using Docker to spin up a database on your computer is easy. Go to https://docker.com and download Docker Desktop for Mac or Windows. Once it’s up and running, open up a command line (i.e., “PowerShell” on Windows or “Terminal” on the Mac) and type the following:
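Something like the command below will do. The password value is just a placeholder of mine; substitute any password you like. (Because we don’t pass a `--name` option, Docker will make up a name for the container.)

```
docker container run -e POSTGRES_PASSWORD=mysecretpassword -d postgres
```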
As you watch, Docker downloads the image or pattern for a Postgres container and then launches it. You can list running containers by entering `docker ps` at the prompt. You will see that Docker has given the container a randomly generated Container ID and Name. In my case the container ID is “1c9a696e4023” and the name is “lucid_yalow”. You can log in to your running container by using the name, the ID, or a portion of the ID. Here’s the command you need to get to the `psql` prompt:
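With the randomly generated container name from my machine, the command looks like this:

```
docker exec -it lucid_yalow psql -U postgres
```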
Obviously, substitute your container name where I have “lucid_yalow”. (I also could have used part of my container ID, such as “1c9”.) Now you’re ready to execute Postgres commands or enter SQL. When you’re done, enter `\q` to exit `psql`.
You can leave the container running in the background, and it will persist even after the system restarts, but if you find it slows down your computer you can stop it like so:
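Using my container’s name as the example again (substitute your own):

```
docker stop lucid_yalow
```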
And you can re-start it later with:
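Again with my example container name:

```
docker start lucid_yalow
```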
The container should keep the tables, data, etc., even after it is stopped and started, but if you ever want a clean, new database with nothing in it, just repeat the `docker container run` command above to start a new container from the same image. There’s a lot more that you can do with Docker containers and databases, but this should be enough to get you started and avoid having to struggle with configuration.
If you don’t want to use Docker but would prefer to install the PostgreSQL server using a more conventional installer, see the appendix “Installation” at the end of this e-book for instructions for Mac, Windows, and Linux, or if my instructions are out of date, search for help online.
Chapter 1: Introducing databases and PostgreSQL
How databases fit in
Imagine that you work in a small direct-response mail order company that takes orders from customers by phone. Each agent in the call center downstairs has a stack of paper order forms on his desk, and when he receives an order he writes down the product name(s), quantity ordered, and the customer’s address and payment information. He uses a calculator or computer to sum up the order total, and tells the customer how much they’ll be charged.
Periodically, a data entry worker visits the call center and picks up stacks of filled-in order forms. She enters each order’s details into a file on her desktop computer, perhaps a big Excel spreadsheet that she’s designed herself for this task. At the end of the day, when all the orders are entered, she sends the complete file to two other departments: fulfillment, which processes the customer’s credit card and packs and ships the orders, and accounting, which calculates each salesperson’s commission.
This is the kind of process that a small business might develop when it’s first getting started, and in fact, it’s exactly the process that I encountered when I was first learning about databases at a small company in Maine. Unfortunately, simple processes like this tend to get complicated as the company gets bigger, and can become impossible to maintain. Just a few of the challenges this company might face are:
- When the business grows to the point that multiple data entry workers are needed, they must coordinate their work somehow. Perhaps each worker creates her own file, and they must combine them at the end of the day. There are many opportunities for errors to enter the system.
- If a payment is declined, or if a customer returns an order, one of the old spreadsheets must be updated, but which one? The data is kept in the order it was entered, not alphabetized by customer, and if there are now multiple spreadsheets for each day of business, searching for the old order is a big challenge.
- As a result of the update made by fulfillment or customer service, multiple versions of the data exist. Accounting still has the old data, and they need to be notified of the change so they can correct the salesperson’s commission. Even if they’re sent the modified file, how do they know which row(s) are changed? And as more changes are made over time, who is responsible for keeping the official record?
- As the business grows, the order entry data may change. For example, product numbers or names may change, discounts or coupon codes may be introduced, or some kind of membership number may be offered to frequent customers. As the paper order forms change, the spreadsheets must also be modified. Consequently, today’s data files look different from last month’s and even more different from last year’s. As people change jobs and leave the company, can their replacements even read the older data?
As this small business becomes a medium-sized enterprise, the seeming convenience of using a spreadsheet can become a nightmare of data management. Before long, the data entry workers, accountants, and other departments may find they spend more time managing spreadsheets than doing their main jobs. And sooner or later, management may wish to use the historical order data for a new purpose, such as customer relationship management (CRM) or business intelligence (BI). What they’ll find is that the data is spread out over a huge number of files, with multiple versions of each file existing in different places, and older files having a different structure and meaning from newer files. What a mess!
A database
Imagine instead if there was a black box in the office into which all those order forms were fed. At any time, a person could ask the black box to retrieve any order record (or list of records) by time and date, by customer name, by product, or another attribute. Fulfillment could ask for the payments waiting to be processed, and it would get a printout of exactly that. Shipping could request invoices and mailing labels for orders ready to ship. Accounting could ask for the sum of order totals taken by each salesperson for a given time period. Moreover, changes could be filed in the black box and all subsequent requests would include the up-to-date, corrected information. The black box serves as the manager of, and the official system of record for, all the company’s order data.
That’s the big idea of a database. Instead of having every person or department or program keep its own copy of the data, a database serves as a system of record, a “single source of truth” that can always be accessed by everyone who needs it for their different purposes. A database stores some knowledge about the data’s structure and meaning, or metadata, so diverse users can know what they’re looking at. And most importantly, a database offers flexible but easy-to-use query methods so that users can request just the data they want, whether it’s a single record, a collection of data, or an aggregation into averages, counts, or sums.
A variety of databases
Databases come in many shapes and sizes and they serve a lot of different purposes. Most often, a database acts as a server, that is, a software program that is always on, waiting for requests and responding to them. It’s necessary for databases to always be on and available if people and other software systems are going to depend on them to store data. (Otherwise, those people and programs will have to store their own data locally, which defeats the purpose of a database – independent and shared data management.)
A database administrator (DBA) is therefore responsible for a key component of a business’s IT infrastructure. If the database goes down, a lot of other programs go down, so the DBA ought to learn best practices for managing security, keeping backups and recovering from disasters. Maintaining and upgrading a database has been compared to maintaining a sailing ship when it is out to sea. The DBA can’t just take the database out of service to work on it however he wants; the business must stay afloat.
Server-based databases vary widely in scale and scope. Some databases support a single application, such as a dynamic website, and these may run on the same physical computer as the application’s code. A larger database might run on a dedicated machine that multiple users access over the network; for example, a company may keep a database of customer relationship information which can be accessed by sales, marketing, and customer service systems.
Larger still are enterprise-scale databases that integrate a wide variety of subject areas. This category includes enterprise resource planning (ERP) systems that integrate several areas of business operations, and enterprise data warehouses (EDW) used for analysis and reporting of business performance. Enterprise-scale databases may run on mainframes or may be distributed over large clusters of dozens or hundreds of computers.
What you’ll learn from this book
This book will introduce you to relational databases, with data modeling and the SQL language first and foremost. It covers the scope of a typical first database course in an information systems or analytics program at the university level. It can be used as a textbook for an instructor-led course—instructors, please contact the author for an instructor’s guide, slides and materials—or used for self-guided study with or without the video lectures produced by the author (coming soon on YouTube and Leanpub).
My main goal in creating this book is to make data modeling and SQL understandable to the reader, so it may serve as a good low-cost supplement for students who are struggling with a theory-heavy textbook and having a hard time getting the point. Instead of starting with loads of theory up-front, I’ll take a more pragmatic approach based on relational data modeling “patterns” and examples.
This book will also provide a lot of practice using SQL, the structured query language common to all relational databases, because the most frequent feedback I’ve heard from my undergraduate students (and from companies that hire them) is that they need more practice with SQL.
The database of choice for this book is PostgreSQL (often nicknamed “Postgres”), an open-source database that has become quite popular with developers in recent years. Compared to some other popular databases like SQL Server and MySQL, there aren’t many good books about Postgres, so I hope this book will be valuable if only for its examples. Postgres makes a good choice for teaching because it is free software (both “free as in free speech” and “free as in free beer”), and because it runs on all the major platforms—Windows, Mac OS, and Linux—so you can follow this book no matter what kind of computer you have handy.
Rest assured that the lessons of this book are transferable to other relational databases. Each of the major brands has its own quirks and special features, but this book mainly covers the fundamentals that apply everywhere. As currently planned, one chapter (Chapter 7) will exhibit some of PostgreSQL’s distinctive features and capabilities.
Lab 1: your first PostgreSQL database
Up and running with Postgres
PostgreSQL is available for free at www.postgresql.org and is extremely well documented there. Installation instructions will vary depending on your platform, and should be pretty straightforward. You can probably accept all the default configuration options. Be sure to remember the password you set during installation. You’re up and running when you can enter the command `psql` at your system’s command line to log in to the local PostgreSQL server. You will recognize this by a readout of the `psql` version number, a statement about how to get help, and a change to the command prompt. At the time of this writing, it looks like this for me (on my Mac):
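The version number, username, and prompt below are from my setup and will differ on your machine; expect something resembling:

```
$ psql
psql (12.2)
Type "help" for help.

joeclark=#
```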
To quit `psql`, enter `\q` at the prompt. If the above doesn’t make any sense to you, don’t worry. Appendix A to this book includes my own tested instructions for installing Postgres on Windows, Mac, and Linux. You can also refer to the online documentation. Come back to this page when you’re up and running.
Relations are tables
Databases can be classified according to the types of abstractions they allow you to model your data with. In a relational database, data is modeled as a set of tables with structured rows and columns. Other data models are possible. In a document-oriented database such as MongoDB, data is modeled as documents with tree-like structures. In a graph database like Neo4J, data is structured as a network (a mathematical graph) with nodes and edges. Compared to those newer forms, the relational model is far more commonly seen and better understood, and is the most versatile. Relational databases have been tried and tested in business for nearly four decades, and are probably the best tool for the job in all but a few specialized cases.
The tables you find in a relational database are properly called relations. A relation is not just any table; it is a construct derived from set theory and adheres to the following constraints:
- Every row has the same columns.
- Column names must all be different.
- Each column is defined to contain just one specific type of data.
- Each row must be unique; usually we enforce this by adding a machine-generated ID number to each row, known as the primary key column.
- No inherent ordering of rows or columns, or other information about how to display the data, is stored in the table.
Consequently, a relation is a simpler and less flexible structure than a table you might create in a spreadsheet program like Microsoft Excel. Spreadsheets allow you to mix data types, to have rows with different numbers of columns, and to decorate your data with display logic like fonts, colors, and sizing. Figures 1-1 and 1-2 illustrate the comparison with an example of sales data that might be recorded by a small outdoor sports mail-order business.
In a spreadsheet the user can be lax in data entry, for example omitting the state “TN” when we all know where Nashville is, or entering a quotation mark (meaning “ditto”) instead of spelling something out. Data types unanticipated at the time the table was designed could be inserted freely; for example, a three-letter Canadian province abbreviation could be inserted into a column meant for two-letter US states. Although these are convenient for data entry, they may lead to problems for computer systems that want to use the data (for example, to print mailing labels). The spreadsheet user can also decorate the data with fonts, styles, sizes and colors in order to make it more readable, and he can add extra information like a “grand total” row.
As seen in Figure 1-2, a database table (or relation) is much more strictly defined. Data types must be specified in advance for each column, guaranteeing uniformity. That means special cases must be anticipated before they occur. In this example, the database designer specified that state abbreviations must be exactly two characters, and that the price may be `numeric` (allowing fractions) rather than `integer`. In order to guarantee that each row is unique, and therefore can be looked up, the database user has added a primary key column and populated it with an auto-generated ID number.
No other information is found in the rows of a relation except the data itself: not fonts and styles, and not even the sort order. Totals, averages, and the like wouldn’t be stored in the table either, because rows correspond only to individual data “records”. Aggregated values like totals and averages could be calculated in a query or perhaps stored in additional tables created specifically for the purpose.
Creating a PostgreSQL database
Let’s fire up PostgreSQL and create a table. (I probably won’t use the word “relation” much after this, except for a bit of theory in Chapter 2. Where I write “table”, you should be able to figure out what I mean.)
First, a note about the term “database”. As I have described it above, a database is a system that organizes and stores data and, importantly, makes it available to people who need to search or retrieve it. Others more precise than I will distinguish between the database, which is the organized data store, and the database management system (DBMS) which is a program like PostgreSQL that creates the data store and grants us access to it. When we call Postgres (or Oracle, or SQL Server, etc.) a database, we are using the term more generally to include both the data store and the DBMS, since they go together.
To understand how we interact with PostgreSQL, though, you need a third definition of the term. In PostgreSQL, a database is a logical subdivision of the data store. You may create any number of tables grouped into databases on the same server. (For the purposes of this book’s labs, your personal computer is acting as a PostgreSQL server.) Table names must be unique within a database, but not within a server. If several examples in this book include a table called “customers”, you can avoid a conflict by creating a new database for each lab.
What we’ll do in this first lab, then, is:
- Create a PostgreSQL database called “auctagon”
- Log in to that database with `psql`
- Create a table of Purchases
- Query the single-table database with SQL
Database creation
You can create a database from your operating system’s command line (i.e., before logging in to PostgreSQL with `psql` or another front-end tool) by using the command `createdb`. The basic structure of this command is `createdb [OPTIONS] [DBNAME]`, and you can learn more by typing `createdb --help` at the command prompt. Thus, the following command creates a database called “auctagon”:
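```
createdb auctagon
```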
If necessary, you can also drop (i.e., delete) the new database from the command line, with:
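```
dropdb auctagon
```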
For future reference, you can also create and delete databases using SQL once you’re logged in to `psql`: the CREATE DATABASE and DROP DATABASE commands, respectively. One way or another, create that database, which will be home to your first table.
Introducing psql
The command-line client for PostgreSQL is `psql`, and you launch it by entering `psql` on the command line. This will not open a new window; rather, you will see a brief welcome and the command prompt will be different from the operating system’s default prompt.
The change in the command prompt means you’re in a different environment. Here, you can enter SQL queries or commands specific to `psql`. The first thing I’d recommend you do is type `help`, which introduces you to a few of the latter. Most `psql` commands begin with a backslash (`\`) and you can get a full listing of them by entering the command `\?`. If you need to quit, `\q` is the command for that. If you want to, take some time now to explore the SQL queries and `psql` commands available.
By default, when you start `psql` you’re connecting to the database with the same name as your PostgreSQL username. If you installed Postgres using Docker, or used an installer on Windows, that will probably be “postgres”. If you installed it on a Mac or Linux machine, the default database may have the same name as your computer login name. Any SQL commands you enter at the prompt will be executed on the default database’s tables, and that isn’t what we want. To switch over to the new database you created, use the `\c` command:
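```
\c auctagon
```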
Notice that the prompt changes to tell you which database you’re working in.
Creating a table
To create a table in the “auctagon” database, we use the aptly-named SQL `CREATE TABLE` command. I will expand on its usage in Chapter 2, but the basic form of it is as follows:
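```
CREATE TABLE table_name (
    column_name data_type [options],
    column_name data_type [options],
    ...
);
```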
As I mentioned, one of the special characteristics of a relation is that each column allows data only of a specified type. PostgreSQL offers a number of built-in data types, such as `numeric`, `text`, `date`, and more. I will discuss the choice of data type more in Chapters 2 and 8, but it need not delay an introductory example. There are many optional clauses available in the `CREATE TABLE` statement which can be discussed later or looked up in the documentation; the only one we need now is `PRIMARY KEY`, a flag which indicates that a particular column is going to contain unique values that may be used to look up specific rows later.
The command to create our table of Purchases is as follows. You may type this in at the `psql` prompt, even if it spans several lines; the code won’t execute until the semicolon (`;`) is reached. Mind the case: in PostgreSQL the SQL keywords (i.e., `CREATE TABLE`, `PRIMARY KEY`, and the data types) may be uppercase or lowercase, but you should use only lowercase letters and underscores (`_`) for the table and column names. PostgreSQL isn’t very sensitive to whitespace, so you can enter this code all on one line, or spread out over several lines, with indentation and tabs if you want them.
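A definition along these lines will work. The exact columns are my guess at the lab’s sample schema—an `order_id` primary key, a two-character state abbreviation, and a fractional price, as described above—so adjust the names to match whatever sample data you use:

```
CREATE TABLE purchases (
    order_id  integer PRIMARY KEY,
    customer  text,
    state     char(2),
    product   text,
    quantity  integer,
    price     numeric
);
```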
If the command succeeded, you’ll see “`CREATE TABLE`” in the output. If there’s an error message instead, don’t worry, just try again. The most likely causes of errors are typos in the data types, the wrong number of commas, and uppercase letters in the table or column names. If the command worked but you defined the table incorrectly, the easiest solution is to start over by issuing the command `DROP TABLE purchases;` and creating the table anew.
You can confirm that the table exists with the `psql` command `\dt`, which displays a list of all the tables in the currently selected database:
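The output will look something like this (the “Owner” column will show your own username):

```
auctagon=# \dt
         List of relations
 Schema |   Name    | Type  |  Owner
--------+-----------+-------+----------
 public | purchases | table | postgres
(1 row)
```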
That’s all there is to defining a table, at least an empty one. In order for us to demonstrate some SQL queries, though, we’ll need to store some data in the table with the SQL `INSERT` command. We’ll use the simplest form of this command, adding only one row at a time to the table, for example:
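With a table like the one described above (the column names here are illustrative), a single-row insert looks like this:

```
INSERT INTO purchases (order_id, customer, state, product, quantity, price)
VALUES (1, 'Ann Adams', 'TN', 'kayak', 1, 359.99);
```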
Take note that text data must be wrapped in single quotation marks ('), and numbers must not be.
Writing INSERT commands by hand will quickly become tiresome, and is not the usual mode of entering data into a real database. Typically the database will support software (such as a web app, or an enterprise system) that generates data insertion and update commands automatically. Another way we might load a lot of data quickly is to read in a file containing (presumably machine-generated) INSERT commands. In psql you can execute SQL commands from a file using the \i command.
I have provided a script file containing 300 lines of purchase data in the GitHub repository that supports this book. You may find the file AUCTagon1.sql at https://github.com/joeclark-phd/databasebook-postgres, in the “psql_scripts” folder. If I have downloaded this file to a Windows laptop and saved it in the directory C:/psql_scripts, the command looks like this:
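Given that assumed download location, the psql meta-command would look like this:

```
\i C:/psql_scripts/AUCTagon1.sql
```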
Be sure to use the correct file path for your operating system and the location where you downloaded the script.
Regardless of how you insert data into the table, please add at least several records so that you can try some meaningful queries in the next section. If you are experiencing errors with INSERT commands, check that the number of values you’re inserting matches the number of columns in the table, that they’re in the right order, and that the order_id number is unique for each inserted row. If you run into problems, you can empty the table by entering the command DELETE FROM purchases; and start again.
Querying your data with SQL
SQL is the structured query language more or less common to all relational databases, and it really shines for its ability to extract just the data you want from a table or group of tables. What kinds of queries might you want to make of this data? You might want specific subsets of the data, such as all the orders for a particular product or in a particular state. Or you might want to aggregate the data, that is, sum or count or average them, perhaps in groups. Even with one simple table, there are quite a few ways to query it.
Let’s start with the basics. Your first query is the simplest: it just requests all the data.
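In SQL, that request is:

```sql
SELECT * FROM purchases;
```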
That’s quite a lot of rows, so I’ll give you a trick to shorten the results. Append “LIMIT <number>” to the end of the query to get only the first several rows:
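For example, to see just the first ten rows:

```sql
SELECT * FROM purchases LIMIT 10;
```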
The meaning of the “*” is “all columns”. It’s possible to request only certain columns; for example, let’s say you only want to know the title and starting price of each auction listing. Specify the desired columns in the “SELECT” clause:
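Assuming those two columns are named title and starting_price (illustrative names, since the text doesn’t fix them):

```sql
SELECT title, starting_price FROM purchases;
```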
An optional addition to the query is the “ORDER BY” clause, which (as you might guess) determines the sort order of the results. Since we are limiting the results to only 10 rows, this query would give you the ten lowest-priced auctions:
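A sketch with the same assumed column names:

```sql
SELECT title, starting_price
FROM purchases
ORDER BY starting_price
LIMIT 10;
```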
Most of the time you don’t want every row, but want to select a subset of the data. This is accomplished with the “WHERE” clause, certainly one of the most important clauses in a query. You may request a single row by its primary key, for example:
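For instance, to fetch the single row whose primary key order_id is 101 (the ID value here is arbitrary):

```sql
SELECT * FROM purchases WHERE order_id = 101;
```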
Or you may give criteria that qualify more than one row, if you want to see a specific subset. For example:
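A sketch, requesting every listing by one seller (the seller name is invented for illustration):

```sql
SELECT * FROM purchases WHERE seller_name = 'jclark';
```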
The criteria don’t have to be “equality” conditions, by the way. We can also use numerical inequalities; any row for which the inequality is “true” will be returned:
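For example, all auctions with a starting price under twenty dollars (column name assumed):

```sql
SELECT * FROM purchases WHERE starting_price < 20.00;
```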
You can stitch together multiple conditions with “AND” and “OR” to make a more complex query. For example, if you want to find all auctions ending within a particular date range, you can use something like this:
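A sketch, assuming the auction’s closing date is stored in a date column called end_date:

```sql
SELECT * FROM purchases
WHERE end_date >= '2020-03-01' AND end_date <= '2020-03-31';
```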
Another condition you might use, for a primitive text search, is “LIKE”. The “%” character is a wildcard that matches any text. Thus, the following code returns all auctions whose titles end with “Handplane”:
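Assuming the listing’s title is stored in a column called title:

```sql
SELECT * FROM purchases WHERE title LIKE '%Handplane';
```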
Substitute “ILIKE” for a case-insensitive match.
Aggregate queries
The queries above allow you to carve out subsets of the data by requesting only certain columns, certain rows, or both. In every example, though, the rows you get in the result are rows from the original table. Aggregate queries are those that generate data by combining the original rows via an aggregation function, usually SUM, COUNT, or AVG. Obviously the sum of two rows is one row, and is not identical to either of the original rows. The following query gives you the total dollar amount of all purchases in the table:
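A sketch, assuming the dollar amounts live in a column called starting_price:

```sql
SELECT SUM(starting_price) FROM purchases;
```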
No matter how many rows were in the original table, the query above returns just one row. Say… how many rows are in the original table?
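One way to ask, counting values of the primary key column (which are never null):

```sql
SELECT COUNT(order_id) FROM purchases;
```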
The COUNT function actually counts the number of non-null values in the specified column, not the number of unique values. If you used COUNT(seller_name) instead of COUNT(order_id) you’d get the same result. Even if you have five hundred listings from 25 sellers, the COUNT would be 500, not 25. If you have set up the table to allow nulls (empty or missing values) in the specified column, then you might get a result of less than 500. If you want to be sure to get exactly the total number of rows, simply use COUNT(*).
A grand total (or count, or average) is interesting, but a lot of the time what we want to do are compare subtotals (or counts, or averages) for various groupings of the data. To do this, we introduce a “GROUP BY” clause. If we want to know how many listings were made by each seller, we can group by the “seller_name” column and count up the rows in each group:
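In SQL:

```sql
SELECT seller_name, COUNT(*)
FROM purchases
GROUP BY seller_name;
```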
If you want to know which sellers are listing the biggest-ticket items, you might group by seller and take an average of the starting prices. There are a lot of sellers, though, so it’d be helpful to sort the results with an “ORDER BY” clause. Since the average is the unnamed second column of the result set, that’ll be “ORDER BY 2”.
To get just the top ten, here’s a trick: sort the data in descending (“DESC”) order, and “LIMIT” the results to just the first ten rows.
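Putting those pieces together (the starting-price column name is assumed):

```sql
SELECT seller_name, AVG(starting_price)
FROM purchases
GROUP BY seller_name
ORDER BY 2 DESC
LIMIT 10;
```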
In Chapter 2 and beyond you’ll learn a lot more SQL, such as how to create queries that “join” multiple tables, and how to write queries that employ nested sub-queries. Even in this example, though, you’ve seen several of the main parts of a SELECT query, including the “WHERE”, “GROUP BY”, and “ORDER BY” clauses, and aggregate queries. You have begun to see that even a simple one-table database may be queried in several different ways, and that doing this with short SQL queries may be much easier than trying to wrangle the data in Excel.
I recommend that you attempt the exercises and challenges at the end of this chapter to get more practice with the basics of relational databases, SQL, and PostgreSQL specifically.
Summary
A database serves as a repository for data and a system to manage it, a “single source of truth” that can mitigate problems organizations face when working with information—coordination, version control, and information security, to name a few. A database becomes a key piece of business infrastructure when numerous people and systems depend on it. Accordingly, the database typically operates as a server, a system that is always on, always ready to respond to requests from other systems. Database administrators are the professionals who engineer and maintain databases. There are many types of databases; this book focuses on the most popular and versatile type—the relational database. PostgreSQL is an increasingly popular, open-source, and cross-platform relational database that should serve us well as a platform for learning relational modeling and the SQL query language. These concepts are easy to transfer to other relational databases once you learn them.
Definitions
- black box
- a term sometimes used for a hypothetical device described by what it does, without addressing how it actually works
- database
- (in common usage) a system that organizes and stores data and, importantly, makes it available to people and systems that need to search or retrieve it
- (more precisely) the structured data stored by such a system, which is created and accessed using a database management system (DBMS)
- (in PostgreSQL) a logical grouping of tables within a PostgreSQL server
- database administrator (DBA)
- a professional responsible for developing, securing, and maintaining the database(s) that an organization depends on
- database management system (DBMS)
- a software system used to create, manage, and query a database
- database server
- a database that operates as a server; PostgreSQL acts as a server even when installed on a personal computer
- display logic
- information or code that specifies how data is to be presented to a human user; unlike spreadsheets, database tables do not convey their own display logic
- document-oriented database
- a non-relational database type in which data is logically modeled as collections of “documents” with tree-like structures
- graph database
- a non-relational database type in which data is logically modeled as a network diagram (a mathematical graph) with nodes and edges
- metadata
- information about data’s structure and meaning; examples include column names and data types
- Postgres
- nickname for PostgreSQL
- PostgreSQL
- an advanced open-source RDBMS increasingly popular with software developers; freely available at www.postgresql.org
- primary key
- a column in a data table that is guaranteed to have a unique value for each row, therefore allowing us to retrieve a specific piece of data
- query
- (v) to request specified data from a database
- (n) a specification in code (e.g. SQL) of a request for data from a database
- relation
- a data “table” that conforms to a few criteria mentioned in this chapter and further detailed in Chapter 2
- relational database
- a database in which data is logically structured as a collection of relations (i.e. tables) and which conforms (more or less) to the principles proposed in E.F. Codd’s 1970 paper “A Relational Model of Data for Large Shared Data Banks”
- relational database management system (RDBMS)
- a DBMS for a relational database; notable examples include PostgreSQL, MySQL, Oracle, SQL Server, DB2, and Access
- server
- a software or hardware system that is always on, waiting to respond to requests from users or other systems
- SQL
- the “structured query language”, a declarative language for querying an RDBMS; mostly standardized, there are minor differences in the SQL “dialect” used by different database brands
- table
- the common term for a relation in a database
New psql commands
- help
- display a list of the most basic psql commands
- \?
- list all psql commands
- \h
- list all SQL commands supported by PostgreSQL
- \c <database>
- switch to working in the specified database
- \dt
- list all tables in the current database
New SQL syntax
TBD
Challenges
- Look up and read an encyclopedia entry on databases. Did you learn anything that you think ought to have been included in this chapter? Email the author or leave feedback on Leanpub!
- Find and read the official PostgreSQL documentation on the “CREATE TABLE” command. What are some of the other rules or constraints you can place on a column? Name three, and speculate on when they would be useful.
- Look up the documentation on data types supported by PostgreSQL. There are a lot of them. Make a short list (or a cheat sheet) of the five or six data types you think you’d use most often.
- Try to find out on your own how to use the “ALTER TABLE” command to add a new column to an existing table. Specifically, add a “date” column to the “purchases” table used as an example in this chapter.
- PostgreSQL offers a graphical user interface (GUI) called pgAdmin, which may have been installed with it. Find this program on your computer or install it, and figure out how to log in to your PostgreSQL database(s) with it.
- On your own, figure out how to use pgAdmin to create a new table in the “auctagon” database. Make sure it has a primary key column (i.e., a unique ID number or code). Can you add some sample data to the table without having to use SQL “INSERT” commands?
- See if you can find out how to open a window within pgAdmin to execute arbitrary SQL commands. Run some of the example SQL queries from this chapter through that interface.
- Other graphical clients for PostgreSQL (besides pgAdmin) are available, some free (or free to students) and some commercial. Find and examine one or two of these alternatives. What do you think of them?
Exercises
TODO
Supplemental Lab 1: Creating FrienduLATER
A second case study that runs through this textbook is a database for a social networking application called FrienduLATER. In this lab, we’re starting with a single table that stores messages posted to a user’s “wall” or message board. You’ll want to be able to query for messages posted since the last time you logged in, in reverse chronological order, to query for messages from a specific subset of users, and perhaps to find out the number of unread messages and the last time a user was active. This lab and a video to accompany it are coming soon.
TODO
Think About It
- Not all databases are relational. How might you structure a database without using familiar tables, rows, and columns? What would be the advantages and disadvantages of such a data modeling approach?
Chapter 2: The Relational Model
There are other data models
For a couple of decades (roughly 1985-2005), the relational data model was the only game in town, you had to learn it, and there was no reason for a textbook to argue the point. Today, data engineers have a lot of other options. Document-oriented databases are booming in popularity with app developers for their ease of use; graph databases have captured the imagination of researchers and tinkerers because of their natural fit with social network applications; and the analytics world has found performance advantages to be gained with dimensional databases, column-family databases, and cluster-based Big Data platforms like Hadoop. These are meaningful advantages, so before we take it for granted that the relational model is the most important one to learn, it is necessary to remind ourselves what it is and what its unique advantages are.
A data model is a means of describing the data in a database without regard to the way it is actually physically stored. It provides abstractions that humans can work with—e.g. tables, documents, dimensions—instead of implementation artifacts like bytes, pointers, and disk sectors. This abstraction is quite important when you consider the growth and evolution of a database over time, and the number of applications that might come to depend on it. As databases are used, they grow larger, the types of data stored may grow or change, and it may be necessary for database administrators to optimize their performance by upgrading the technology in various ways. If the developers of applications that depend on these databases had written their code to interact with the data as it was stored on disk, their applications would break, and code would need to be rewritten, every time such a change was made. But because we have data models, this is not necessary. Application developers interact with the database using a “data language” that is independent of physical implementation; they query tables, rows, and columns for instance, instead of locations on disk.
The relational model’s particular strength is its ability to efficiently answer queries that were not foreseen at the time the database was developed. Before E.F. Codd’s landmark paper introducing the relational model, the leading approach to data modeling was a hierarchy or network in which you would follow links from one data point to another. Consider, for example, a hierarchical database of movies listed by director:
In this example, it would be very easy to query the database for a list of movies directed by Christopher Nolan; just start from his name and follow the pointers. By contrast, it would be quite difficult (computationally) to query for a list of movies in which Matthew McConaughey had appeared. The database engine would have to essentially read through the entire database, the vast majority of which is not relevant to the query, following all paths from left to right in the diagram to make sure it had found all of the paths that end with his name. In a really huge database, such a query could be prohibitively expensive.
The problem is, as databases grow, you will always find that you want to make queries you didn’t anticipate at the time you created the database. Codd, a researcher at IBM, had this problem in mind when, in 1970, he published “A Relational Model of Data for Large Shared Data Banks”.
In the relational model, the database consists of a set of tables, one for each entity (or noun) described by data. Each table can be queried on its own, or related tables can be combined into one query, but there is no “parent” table and no network or hierarchy that must be traversed. You have already seen a glimpse of this in Chapter 1; now, let’s define the terms a bit more precisely.
Theoretical roots
Codd’s conception of a relation comes from set theory:
The term relation is used here in its accepted mathematical sense. Given sets S1, S2, …, Sn (not necessarily distinct), R is a relation on these n sets if it is a set of n-tuples each of which has its first element from S1, its second element from S2, and so on. We shall refer to Sj as the jth domain of R.
It is possible to visualize this as a table, hence the commonly-used language of tables, rows, and columns. In Figure 2-3, we see a relation defined by four sets—a set of movie titles, a set of movie studios, the set of integer years, and the set of director names. These indicate the possible values that may appear in any given data record; it’s not necessarily the case that all of the years of history (for example) will be found in the data. This is an important point: the relation (table) is “defined” by specifying the domains (columns) and not by the rows. You’ll see this in SQL’s CREATE TABLE command, which specifies column names and data types only.
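For instance, the relation of Figure 2-3 could be declared by naming its four domains; the column names here are assumptions based on the figure’s description:

```sql
CREATE TABLE movie (
    title    text,     -- domain: movie titles
    studio   text,     -- domain: movie studios
    year     integer,  -- domain: integer years
    director text      -- domain: director names
);
```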
The term tuple (derived from “double”, “triple”, “quadruple”, and so on, up to “n-tuple”) in mathematics refers to an ordered list of values. In this context, a tuple fits in to a relation if it contains one value for each domain, in the same ordering as the domains. So in the above example, a tuple must contain a movie title, a studio name, an integer, and a director’s name, in that order. Each such tuple is a “row” of the table, or a record of the data—in this case, it’s a movie that we’re interested in.
Because a relation is a set of tuples so defined, a number of constraints apply—some of which will be relaxed in practical implementations. These include:
- Each row must be unique.
- The order of rows is immaterial.
- The order of columns is significant.
- Each row must include a value for each column.
The first rule is easy to accidentally violate, for example in an order-taking system, where the same customer may purchase the same product on more than one occasion. It is common practice therefore to add a primary key (PK) column that contains a value guaranteed to be unique in the relation, such as an identification number.
The second rule implies that we must query the database without regard to the order in which data was entered, or any other order. One cannot query for the “next” value and reliably predict what result will be given. Attention must be paid to the WHERE clause of a SQL query to specify exactly what we want.
In practice, the third rule is ignored. Codd informs us that the mathematical term for a relation with no specific domain ordering is a relationship. Instead of referencing an element of a row by its position in the sequence, in practice we use column names. Database developers should give sensible names to their columns, particularly when there may be one or more with the same domain of potential values. For example, a table of customer records may have two or more “address” columns, one for billing and one for shipping. It would be wise to give these columns names that indicate both the domain and the role, such as “address_billing” and “address_shipping”.
The fourth rule may be relaxed as well. Allowing missing values (called nulls) in certain columns gives database developers the flexibility to include optional attributes, or to add a data record in a step-by-step way instead of all at once. Obviously nulls cannot be allowed in every column, or the first rule may be violated. Therefore most databases don’t allow nulls in the primary key column.
In a query result, which otherwise resembles a relation, even the first two rules may be violated. This can be demonstrated by the following database interaction:
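A sketch of such an interaction, using the movie table from earlier in the chapter (column names assumed):

```sql
-- may return duplicate rows: several movies can share a director,
-- so the result is a "bag", not a set
SELECT director FROM movie;

-- returns rows in a meaningful sequence, unlike a true relation
SELECT title, year FROM movie ORDER BY year;
```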
The first result contains non-unique rows. There’s a mathematical term for a set of n-tuples that admits duplicates: it’s called a bag. The second example result orders the rows in a meaningful sequence.
A relational database is a database of relations
We can therefore think about a relational database as a collection of tables (technically relations (even more technically, relationships)) that, together, describe the important facts in whatever context we’re interested in: a business, an application, a research project, or whatever. In some cases, a table may be useful on its own, but in many tasks you will want to query two or more related tables together. Yes, tables are related: customers are assigned to salespeople, employees belong to departments, products have bills of materials that go into their manufacture.
Two aspects of the context must be represented in the database design: entities and relationships. Entities are the nouns that matter: people, things, places, events, and concepts. Relationships are connections between entities which can usually be described by verbs. The most useful tool in developing a data model is an entity-relationship diagram (aka E-R diagram or ERD). See Figure 2-4. The boxes are entity types (although we usually just say “entities”), which correspond to tables. The lines indicate the relationship types (or “relationships”), which indicate how the rows of each table relate to the rows of the other tables.
Three types of relationships that differ in cardinality are depicted in Figure 2-4, a relational data model for a database of movies (like IMDB) diagrammed in “crow’s foot” notation. These are a “one-to-many” (or 1:M) relationship between Studio and Movie, a “one-to-one” (1:1) relationship between Script and Movie, and a “many-to-many” (M:M) relationship between Movie and Actor. Out of these three basic types, complex data models can be designed.
One-to-many
The most common type of relationship between tables is one-to-many (1:M), as seen in Figure 2-5. The rectangles in the diagram stand for entity types, which correspond to tables, and mean that any number of entity instances of each type may be in the database. We can assume that there are, or will eventually be, a large number of movies in the Movie table and large number of studios in the Studio table. The line with a crow’s foot at one end and a single tick mark at the other end indicates the type of relationship that may exist between rows of the Movie table and rows of the Studio table. Simply put, it says that each movie is related to just one studio, but that any given studio may be related to more than one movie. It is also common practice to add text to the diagram to explain why, or how, the two entity types might be related—in this case, studios produce movies (and movies are produced by studios).
When it comes time to create the actual tables, a one-to-many relationship is implemented by a simple and intuitive mechanism called a foreign key (FK). Remember that each table has a primary key, a column of data whose values are guaranteed to be unique—typically an ID number or code. To implement a 1:M relationship we add a special column to the table on the “many” side (in Figure 2-5, “Movie”) that holds references to primary key values in the other table. This could be achieved with the following SQL (note the REFERENCES clause):
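A sketch of that SQL; studio_id matches the column named in the text, while the other table and column details are illustrative:

```sql
CREATE TABLE studio (
    studio_id integer PRIMARY KEY,
    name      text
);

CREATE TABLE movie (
    movie_id  integer PRIMARY KEY,
    title     text,
    -- the foreign key implementing the 1:M relationship:
    studio_id integer REFERENCES studio (studio_id)
);
```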
If we view the tables with a few sample rows of data (Figure 2-6), the purpose of the foreign key column studio_id should be clear. At a glance you can see that Sahara and Interstellar were produced by Paramount (studio #1) and that Inception was produced by Warner Bros. (studio #2).
Notice that there is no foreign key in the “Studio” table. If you pencil in a “movie_id” column on Figure 2-6, or simply imagine one, the reason should be obvious. If a studio were assigned a “movie_id”, then a studio could only be related to one movie, and that clearly doesn’t mesh with reality or with our E-R diagram.
Many-to-many
The next most common kind of relationship between two entity types is many-to-many (M:M), as in the relationship between Movies and Actors: each movie involves more than one actor, and each actor can be in more than one movie. See figure 2-7.
Now, an M:M relationship is easy to draw on a diagram, but it’s a little more complicated to implement as real tables in the database. Take a minute to think about how you might do it, before reading about it below.
…
Did you come up with a solution?
Well, here’s the solution that I teach my students: to represent the many-to-many connection, you need a third table. This is the only kind of case where a table represents a relationship rather than an entity. This new table, called an associative relation, might have two columns only: each a foreign key to one of the tables in the relationship. Figure 2-8 illustrates some sample data for the simplest kind of associative table.
A question of style concerns whether you should depict the associative entity on your E-R diagram—essentially as a table with two 1:M relationships to the entities of interest—or simply use the double-crow’s-foot notation as in Figure 2-7.
To answer this, consider the opportunity afforded by the existence of an associative table. It need not be limited to two foreign key columns only, because a table can have any number of columns. You can therefore use columns of the associative table to give meaningful characteristics about the relationship. Such a relationship may in fact be a “thing” of its own, perhaps an intangible one. An M:M relationship between a Buyer and a Seller might be an entity called a Contract. An M:M relationship between a social media User and a Channel he follows could be called a Subscription. And in the case of the Actor-Movie relationship, we might call it a Role.
The answer to the style question is that, if the associative table can be construed as an associative entity, a thing or concept (or noun) meaningful to the business, it makes sense to add it to the diagram. If the associative entity is meaningless to anyone except the guy writing database code, you would in most cases leave it off the diagram.
See Figure 2-9 for an example diagram including the associative entity. Note that instead of an M:M relationship you now see two 1:M relationships. In this example, the rectangles representing entities also show the names of some columns that would belong to each table; this is also a common style of E-R diagramming.
An implementation in SQL could look like this:
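One possible version, assuming movie and actor tables with integer primary keys; the character_name column is an illustrative extra attribute of the relationship:

```sql
CREATE TABLE role (
    movie_id       integer REFERENCES movie (movie_id),
    actor_id       integer REFERENCES actor (actor_id),
    character_name text,
    -- the two foreign keys together form a compound primary key
    PRIMARY KEY (movie_id, actor_id)
);
```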
Notice that there are two foreign keys in the role table.
One-to-one
One-to-one (1:1) relationships are much less common than 1:M and M:M relationships, simply because if two types of entities are related this way it’s often easier to put them together in one table. For example, a person has only one Social Security number, and a Social Security number has only one person, so you would typically have a table for People and include a Social Security number column rather than having two tables.
One reason you might store the data in separate tables is if some of the data needs to be stored in a different (physical) space, or stored with different security settings. Physical database optimization is beyond the scope of this chapter (although we’ll come back to it in Chapter 8) but I’ll give you an example. In many web applications, each user has a profile picture. Picture files (JPEG, PNG, or whatever) are quite large on disk, and to include them in the Users table of the database would make that table larger (perhaps by orders of magnitude) hence making queries slower. To give your website users a quicker response time, you might store the pictures in a separate ProfilePicture table, and relate them by foreign key to the corresponding rows of the Users table.
Another use case for a 1:1 relationship is when one of the tables contains “optional” data. For example, let’s say our movie database contains 10,000 rows, and we have scripts for 1,000 of those movies. Rather than have a script column in the Movie table, with 9,000 NULL entries making that table bigger and slower, we could have a separate Script table with only the rows it needs. In this case, each row of Script relates to exactly one row of Movie, but each row of Movie may relate to zero or one row of Script. It’s not exactly a partnership among equals: Movie is the main table in the relationship, and Script is a “dependent”.
Figure 2-10 uses a version of the crow’s foot notation that indicates the difference. The double tick mark next to Movie means that, for any row of Script, there must be at least one and no more than one row of Movie. The circle (or zero) with a tick mark next to Script means that for any row of Movie, there may be zero or one rows of Script. (The simpler notation was seen earlier, in Figure 2-4.)
To implement this in SQL, even though you could get away with putting the foreign key on either side of this relationship, or on both, the best practice is to put a foreign key in the “dependent” table. In this case, that table is Script. Essentially a 1:1 relationship is a special case of 1:M, where the foreign key is simply constrained to be unique. Since primary keys are unique, the most practical way to do this is for the rows of the dependent table to have the same primary key values found in the “independent” table, and for its primary key to double as a foreign key, like so:
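A sketch, assuming a script_text column holds the script itself:

```sql
CREATE TABLE script (
    -- the primary key doubles as a foreign key to movie
    movie_id    integer PRIMARY KEY REFERENCES movie (movie_id),
    script_text text
);
```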
A selection of the data, as in Figure 2-11, shows that the Script table borrows its primary key values from Movie but does not have a row for every value.
Unary relationships
All of the relationships in the E-R diagrams seen so far have been binary relationships, meaning they include two tables, and are said to have degree of two. Other degrees are possible. Frequently we also see unary relationships, where rows of one table are related to other rows of the same table. Unary relationships can be of any cardinality we’ve seen so far: 1:M, M:M, or 1:1. Remember that it is not the tables that are related, per se, but the rows.
An example of a unary 1:M relationship is posed by Figure 2-12, which depicts a movie studio ownership relationship. Studios are often owned by other studios, as for example Disney owns Lucasfilm and Sony owns Columbia. This can be expressed as a foreign key in the Studio table referencing an “owner”. (Note that it can’t be done with a foreign key referencing the “owned” studio; the FK belongs on the “dependent” side of a 1:M relationship.)
Here it is in SQL. The NULL keyword indicates that owner is an optional field; it may contain NULL if a particular studio has no owner recorded in our database.
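A sketch; owner is the self-referencing foreign key named in the text, and the name column is illustrative:

```sql
CREATE TABLE studio (
    studio_id integer PRIMARY KEY,
    name      text,
    -- NULL allowed: a studio with no recorded owner
    owner     integer NULL REFERENCES studio (studio_id)
);
```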
As in other cases, sample data may be helpful to illustrate how the implementation works.
A unary many-to-many relationship could easily represent ties of affiliation between individuals or organizations. Hollywood being the way it is, we could define a M:M relationship between Actors called “was married to”. Like binary M:M relationships, this would require an associative table (which may or may not be diagrammed). See Figure 2-14.
An implementation in SQL could look like this:
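One possible implementation; the table name and the actor table’s primary key are assumptions:

```sql
CREATE TABLE was_married_to (
    actor_id  integer REFERENCES actor (actor_id),
    spouse_id integer REFERENCES actor (actor_id),
    PRIMARY KEY (actor_id, spouse_id)
);
```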
And some sample data illustrates how it works:
Finally, an example of a unary 1:1 relationship that might be found in our movie database is a reference from a sequel to its predecessor. This would be implemented as an optional column in the Movie table called “sequel_to”, referencing another row of the Movie table, like so:
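A sketch, adding the column to an existing movie table:

```sql
ALTER TABLE movie
    ADD COLUMN sequel_to integer NULL REFERENCES movie (movie_id);
```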
Ternary and n-ary relationships
Virtually all of the relationships you’ll find in E-R diagrams are binary (connecting two tables) or unary (connecting rows within one table). It is nevertheless possible, however uncommon, to imagine a relationship connecting three tables (“ternary”), four (“quaternary”), or any arbitrary number of tables (“n-ary”). This simply means that one of each entity type is required to create an instance of the relationship.
In the context of movies, we might consider a distribution deal to be one such relationship. A movie distribution deal is an agreement between a producer (Studio) and a third party (Distributor) to distribute a movie (Movie) in a particular territory (Territory). Figure 2-15 illustrates this as a quaternary many-to-many-to-many-to-many (M:M:M:M) relationship.
It is rare to see this kind of relationship diagrammed, though. The only way to implement a relationship of degree > 2 is to use an associative table with foreign key columns for all the participants. Therefore, in most cases you will simply model a new associative entity—for example “Deal” or “Contract”—which contains details as well as 1:M relationships with each of the related tables.
Lab 2: Creating and querying a relational database
In Lab 1 you created a one-table database and were introduced to some of the basic things you can do with SQL queries. Now that you have learned the basics of how multiple tables can be related to one another within a relational database, you will want to see some richer and more realistic examples. Moreover, you’ll need to familiarize yourself with the main way that tables are queried together: the SQL join.
What we’ll do in this lab is:
- Create a new database called “lab2”.
- Define several tables to learn the features of the CREATE TABLE command.
- Learn how to code different types of JOIN queries
In order to get started, first create a new empty Postgres database called “lab2”, log in to your Postgres server using psql, and switch your context to the new database—much as we did in Lab 1:
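From within psql, that might look like this (Appendix A covers other ways to connect):

```sql
CREATE DATABASE lab2;
-- \c is the psql meta-command to connect to (switch to) a database
\c lab2
```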
The CREATE TABLE command
SQL is divided into two main types of commands, called data definition language (DDL) and data manipulation language (DML) respectively. DDL commands are those used to design and structure the tables that constitute the database, and the chief among them is CREATE TABLE. (Two others you may frequently encounter are ALTER TABLE and DROP TABLE.) The basic form of a CREATE TABLE command in Postgres is as follows:
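In outline, the simplified form looks something like this (square brackets mark optional parts; see the PostgreSQL documentation for the full grammar):

```sql
CREATE TABLE table_name (
    column_name data_type [constraints],
    column_name data_type [constraints]
    -- ... more columns as needed
);
```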
I’ve kept it simple here in order to make a clear introduction, and will reveal more options as we move on. The complete specification of CREATE TABLE can be found in the PostgreSQL online documentation, which is excellent. You have already seen several examples of CREATE TABLE, one of which I’ll reproduce here:
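The “studio” table, for instance, with an integer ID and a text name as described below:

```sql
CREATE TABLE studio (
    studio_id integer PRIMARY KEY,
    name      text
);
```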
In this example you saw only one of PostgreSQL’s optional constraints: PRIMARY KEY. Yes, a “PK” is a constraint on the data. By that we mean that it sets up a rule that will lead to errors if we try to insert bad data—here’s one of the advantages of databases as opposed to spreadsheets: they tell us when we make mistakes. In this case, the rule is the entity integrity rule, which dictates that the primary key column must contain a unique (not null) value for every row. Look what happens if I try to add two movie studios with the same “studio_id”:
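Assuming the “studio” table above, the second of these insertions should fail with an error along these lines:

```sql
INSERT INTO studio (studio_id, name) VALUES (1, 'Disney');
INSERT INTO studio (studio_id, name) VALUES (1, 'Paramount');
-- ERROR:  duplicate key value violates unique constraint "studio_pkey"
```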
Three other constraints you’ll see me use in this lab are NOT NULL, which means a particular column can’t be empty; REFERENCES, which sets up a foreign key relationship; and CHECK, which allows for arbitrary validation of the data.
Developing your database iteratively
As you work through this and other labs, you’ll soon find that you’ve made mistakes or come up with better ideas, and want to start over. Typing those CREATE TABLE commands into psql gets tedious and is error-prone. As in most other types of programming, the solution is to write your SQL code in a text file (you may call it a script) that you can save, modify, and re-use. You can tell Postgres to run the whole script at once with a one-line command. This allows you to iterate toward a design that works.
To continue with this lab, create a text file using any text editor you like, such as Notepad++ on Windows, TextWrangler on the Mac, or Vim for you Linux geeks. One trick that I find handy is to precede my CREATE TABLE code with corresponding DROP TABLE commands. That’s because I expect to run this script over and over again, tweaking it until I get it right, and you have to delete a table before you can (re)create one with the same name.
File lab2.sql:
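A first version of the script might look like this; the IF EXISTS clause keeps the DROP from failing the very first time you run it, before the table exists:

```sql
DROP TABLE IF EXISTS studio;

CREATE TABLE studio (
    studio_id integer PRIMARY KEY,
    name      text
);
```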
There are two ways to tell Postgres to run this script. From within the psql environment, use the \i command and the location of the script:
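For example, if the file is in the directory from which you started psql:

```sql
\i lab2.sql
```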
Or if you’re not logged in to psql, you can use your operating system’s command line, specifying the database name (after “-d”) and the script file (after “-f”):
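Assuming a default local installation, the command might be:

```shell
psql -d lab2 -f lab2.sql
```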
If you made any mistakes, don’t worry about it! Correct your code and run the script again, as many times as you need until it works.
Data types in PostgreSQL
Recall that each column in a relation is defined by a domain. When defining a table, we constrain the domain of values that may be stored in a column by specifying a data type. In the “studio” table, you saw two data types used: the studio’s ID number is defined as an integer and its name is text. If you tried to insert the wrong type of data into either column (such as a decimal number, perhaps) you’d get an error. In order to design your database well, you should become familiar with the main data types available in Postgres.
When it comes to numbers, there is a trade-off between precision, accuracy, and database performance. The integer data type stores whole numbers in the range -2147483648 to +2147483647 with absolute precision but doesn’t allow fractions. The real data type can hold a decimal number of any magnitude with about 6 significant digits of precision, but small inaccuracies can result from the necessity of rounding. Both require four bytes of storage on disk. A third type, numeric, can be made precise and accurate up to an arbitrary number of digits before and after a decimal point, but it is much larger on disk and hence slower to process. The numeric type might be useful for storing currency amounts because businesses can’t afford any kind of inaccuracy when counting money.
A variation on integer that will become very useful is the serial type; this specifies an integer column that will be automatically filled in, when a new row is created, with the next whole number in sequence: 1, 2, 3, and so on. It can be very handy for a primary key column, where you don’t care about the actual value except that it must be unique.
Text in a computer is stored as a sequence or string of characters—mostly letters, numbers, spaces, and punctuation—and the data types for text differ in whether you want to constrain the length of the text. In standard SQL, the two main types are char(n) and varchar(n). The char(n) type specifies text that has exactly “n” characters. Like the bed of Procrustes, the database will cut off the end of a string that is too long, or stretch one out that is too short (by adding spaces to the end of it). You might use char(2) to store a state abbreviation in an address. The varchar(n) type holds text of any length up to “n”, so varchar(20) is probably adequate to store last names, and varchar(64) might be enough to accommodate titles of books. You might use varchar if you need to limit the length of the data, for example to make sure it will look good on a computer screen or print on a shipping label.
In most databases, char offers a performance advantage over varchar when the data is predictably the same length: for each row of data, char(10) would require 10 bytes while varchar(10) would require about 2 bytes to say how long the text is plus 0-10 bytes to store the text. When the text is of widely varying length, varchar would have an advantage because it doesn’t pad the shorter values with spaces. However, in Postgres, due to clever engineering there is actually no such performance trade-off. In fact, Postgres offers a third type, simply called text, which allows character strings of unlimited length and is no slower than the others. So unless you have a reason to limit the length of a text value, or feel strongly about sticking to standard SQL, use Postgres’s text type.
Some of the other data types you will often find useful are boolean (a true/false value), date, time, and timestamp (the latter combines a date and a time of day). Postgres is also known for offering some highly unorthodox (to the SQL world) data types, such as arrays, XML documents, and JSON, but those are beyond the scope of this chapter.
I’ve added the following to lab2.sql:
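A version of the “person” table along these lines (the exact column names are my assumption) might be:

```sql
DROP TABLE IF EXISTS person;

CREATE TABLE person (
    person_id  serial PRIMARY KEY,          -- auto-numbered
    last_name  text NOT NULL,               -- must not be empty
    first_name text,
    sex        char(1) CHECK (sex IN ('M', 'F')),
    born       date
);
```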
You can see that I’ve used a few more features of the DDL here. I’ve added a NOT NULL constraint to the “last_name” column, so it can’t be empty. (Any of the other fields will accept nulls, effectively making them “optional”.) The CHECK constraint will validate data in its column according to any arbitrary logical test. In this case we’re checking that “sex” is either “M” or “F”; the table will throw an error message if you try to enter anything else. It does accept a null value, though. You also see examples in this code of the serial, char, and date data types which we haven’t used before.
Referential integrity
Previously we saw the entity integrity rule in action—that every row must have a unique value for its PK. Another vitally important integrity constraint for relational databases is the referential integrity rule, which states that a foreign key value can only match a valid primary key value from the referenced table. (Nulls may be allowed by the database designer, but values that don’t match existing PKs cannot be.) It also happens that you cannot define a foreign key column if the referenced table doesn’t exist, so this rule has an effect on the sequence in which we must create the database and insert data.
Case in point: although we are building a database of movies, we could not create the “movie” table first, because we know it’s going to reference certain other tables such as “studio”. Now that “studio” exists, we can define “movie” like so:
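A sketch of the definition, with some columns assumed for illustration, might be:

```sql
CREATE TABLE movie (
    movie_id    integer PRIMARY KEY,
    title       text NOT NULL,
    year        integer,
    studio_id   integer REFERENCES studio (studio_id),
    director_id integer REFERENCES person (person_id)
);
```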
If you try to create the “movie” table before the “studio” and “person” tables exist, PostgreSQL will refuse to do it, and give you an error message, because the foreign key constraints on the “studio_id” and “director_id” columns won’t make sense to it. What you might not have guessed is that if you try to drop the “studio” or “person” table after creating “movie”, you’ll also get an error message. When dropping tables, you must drop the referencing tables before you drop the referenced tables. In order for our script to work, we have to re-arrange it so that the command DROP TABLE movie; comes before the other DROP commands. My script now looks like this (with column definitions omitted):
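In outline:

```sql
DROP TABLE IF EXISTS movie;   -- references studio and person, so dropped first
DROP TABLE IF EXISTS person;
DROP TABLE IF EXISTS studio;

CREATE TABLE studio ( ... );
CREATE TABLE person ( ... );
CREATE TABLE movie  ( ... );  -- references studio and person, so created last
```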
The last table created is the first table deleted.
Sequencing of INSERT commands is also important. We cannot add a movie before its studio exists, because “studio_id” in the “movie” table must match a real “studio_id” in the “studio” table. Ditto for directors. (We can create a movie before its actors have been added, because there’s no direct FK relationship to actors.)
By the way, there are two common ways to write the PostgreSQL INSERT command: single-row and multi-row insertions. In either case, you first specify the columns to add data to, and then provide the values for the new row(s). Single-row insertions look like this:
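For example (sample values chosen for illustration):

```sql
INSERT INTO studio (studio_id, name) VALUES (1, 'Disney');

-- no person_id given; the serial column numbers the row automatically
INSERT INTO person (last_name, first_name, sex, born)
VALUES ('Nolan', 'Christopher', 'M', '1970-07-30');
```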
Notice that when inserting to the “person” table, we didn’t specify a “person_id”. We could have if we’d wanted to, but because the PK is a serial data type, it will automatically number the new rows for us. You can check the numbers with a simple SELECT query in psql. Your numbers might be different from mine if you have created and deleted other data previously, so be sure to check:
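For example:

```sql
SELECT person_id, last_name, first_name FROM person;
```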
Multi-row insertions look like the following. Make sure you check the director’s “person_id” PKs and use the right ones in your code:
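Something like the following, where the “studio_id” and “director_id” values (invented here) must match rows you actually inserted:

```sql
INSERT INTO movie (movie_id, title, year, studio_id, director_id)
VALUES (1, 'Inception',    2010, 1, 1),
       (2, 'Interstellar', 2014, 1, 1);
```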
Commas separate each row’s values, and a semicolon ends the command.
To complete the database for our lab, let’s create the one-to-one and many-to-many relationships. The “script” table is in a 1:1 relationship with “movie”: there is zero or one script per movie. We will not include the full screenplay in this example database, just the screenwriter’s name.
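One way to sketch it: “movie_id” serves as both the primary key and a foreign key, which is what enforces at most one script per movie (the “screenwriter” column name is my assumption):

```sql
CREATE TABLE script (
    movie_id     integer PRIMARY KEY REFERENCES movie (movie_id),
    screenwriter text
);
```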
Actors are related to movies in this database via an M:M relationship: each actor may be in multiple movies and each movie may include multiple actors. As diagrammed in Figure 2-9, we implement this by creating an associative table called “role” which has foreign keys to both “person” and “movie”.
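A sketch with column names assumed; the composite primary key means each (movie, person) pairing is recorded at most once:

```sql
CREATE TABLE role (
    movie_id       integer REFERENCES movie (movie_id),
    person_id      integer REFERENCES person (person_id),
    character_name text,
    PRIMARY KEY (movie_id, person_id)
);
```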
Feel free to add more movies and actors if you like. This example is just a tiny prototype of what you might find behind the scenes of IMDB.com. The complete code for my lab2.sql is available on this book’s GitHub repo. Next, we’ll start writing queries that join tables.
Queries that JOIN tables
What makes a relational database more than just a collection of single-table databases is the capability to join tables and query them together. We can combine the “studio” and “movie” tables with a query like this one, which I’ll explain below:
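The query might look like this (the column list is my choice for illustration):

```sql
SELECT movie.title, studio.name AS studio
FROM studio, movie
WHERE studio.studio_id = movie.studio_id;
```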
Tables are joined using the FROM clause. Instead of identifying one table, we can list two (or more) separated by commas. What the database does when you query multiple tables is first take the Cartesian product of the two. Essentially what this means is that it combines each row of the first table with each row of the second. If the “studio” table has three rows of data and “movie” has five, the Cartesian product has fifteen. Observe:
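For example, this query returns all fifteen combinations:

```sql
SELECT *
FROM studio, movie;
```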
Clearly, of course, most of these combinations don’t make any sense. The Incredibles is a Disney picture, so there’s no circumstance where you’d want a row matching it up with Paramount or Warner Bros. Recall from Chapter 1’s lab that the WHERE clause allows us to filter the rows of a result. What you’d want to do now is to keep only those matchups where the PK “studio_id” of the “studio” table equals the FK “studio_id” of the “movie” table, like so:
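For example:

```sql
SELECT *
FROM studio, movie
WHERE studio.studio_id = movie.studio_id;
```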
That’s more like it! In my WHERE clause I identified the columns of interest by the combination of table name and column name, e.g., “studio.studio_id”. This is only necessary when the column name alone would be ambiguous, that is, when two or more tables have column names in common; in other cases it’s simply an option available to you.
SQL also offers a JOIN keyword that you can use to make it more explicit what you’re doing. The following two SQL statements do exactly the same thing:
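For example:

```sql
SELECT *
FROM studio, movie
WHERE studio.studio_id = movie.studio_id;

SELECT *
FROM studio JOIN movie ON studio.studio_id = movie.studio_id;
```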
This situation—a join on a one-to-many relationship where the FK and PK columns have exactly the same names—is so common that it’s known to SQL users as a natural join, and in fact, PostgreSQL offers a keyword for it so you don’t have to do the tedious typing of the equality condition. The following command does the same thing as the two above:
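For example:

```sql
SELECT *
FROM studio NATURAL JOIN movie;
```

Note that NATURAL JOIN matches on every column name the two tables share, so it only behaves as intended when “studio_id” is the sole common column.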
The result of this query keeps the same column names from the two joined tables, and “name” becomes a bit ambiguous. We can use the AS keyword to rename a result column. Being more specific about what we want as the result brings us back to the first example above:
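For instance:

```sql
SELECT movie.title, studio.name AS studio
FROM studio NATURAL JOIN movie;
```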
While we’re at it, why not pull in the director’s name, too? It’s also derived from a 1:M relationship, but the PK “person_id” and FK “director_id” aren’t the same, so we can’t use a natural join.
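A version of this three-table join might be:

```sql
SELECT movie.title, studio.name AS studio, person.last_name AS director
FROM movie
JOIN studio ON movie.studio_id   = studio.studio_id
JOIN person ON movie.director_id = person.person_id;
```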
A many-to-many relationship, as we’ve seen, is effectively implemented as two one-to-many relationships between the principal tables and an associative table. To get all appearances of each actor in a movie’s cast, we use the “role” table as the associative one. Join “movie” to “role” and then to “person” (or do it the other way around).
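For example (the “character_name” column is my assumption):

```sql
SELECT movie.title, person.last_name, role.character_name
FROM movie
JOIN role   ON movie.movie_id = role.movie_id
JOIN person ON role.person_id = person.person_id;
```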
At the start of this chapter, I said that in an old-fashioned hierarchical database such as pictured in Figure 2-1, it would be very easy to query the database for a list of movies directed by Christopher Nolan, but quite difficult (computationally) to query for a list of movies in which Matthew McConaughey had appeared. Now we see that by breaking up the data so there’s one table for each entity, and utilizing the power of joins, either of those queries can be done in a brief SQL snippet:
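For example, both queries come out short and symmetrical:

```sql
-- Movies directed by Christopher Nolan
SELECT movie.title
FROM movie
JOIN person ON movie.director_id = person.person_id
WHERE person.last_name = 'Nolan';

-- Movies in which Matthew McConaughey appeared
SELECT movie.title
FROM movie
JOIN role   ON movie.movie_id = role.movie_id
JOIN person ON role.person_id = person.person_id
WHERE person.last_name = 'McConaughey';
```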
Inner and outer joins
You’ll see a lot more SQL tricks involving joins throughout this book. One last twist I’d like to add in this chapter is the concept of inner and outer joins. Natural joins, and indeed all of the joins so far, are inner joins because they only return those rows of the original tables that participate in the relationship. To illustrate this, look what happens when you join “movie” with “script”, keeping in mind that we only created one row of “script”:
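For example, this inner join should return only the single matching row:

```sql
SELECT movie.title, script.screenwriter
FROM movie
JOIN script ON movie.movie_id = script.movie_id;
```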
The four movies in our database that don’t have matching rows in “script” do not appear in the result. In any case where the relationship is optional, you can imagine a sort of Venn diagram: there may be several rows of Table A and several of Table B but only a few combinations where an “A” is related to a “B”. Sometimes, though, we want the complete set of rows of one of our tables. For example, we might want the list of all movies, showing the screenwriter’s name if we know it. That’s called an outer join and you can think of it as taking the entirety of one of the circles in the Venn diagram. An outer join is a LEFT OUTER JOIN if the first table (the one mentioned before “JOIN”) is the one you want to include all of. In this way, we can get the full list of movies with screenwriters’ names.
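For example; movies without a matching “script” row appear with NULL for the screenwriter:

```sql
SELECT movie.title, script.screenwriter
FROM movie
LEFT OUTER JOIN script ON movie.movie_id = script.movie_id;
```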
There are also RIGHT OUTER JOINs and FULL OUTER JOINs. For more on joins, see the PostgreSQL documentation. Can you imagine how you would code a self-join, which joins a table to itself?
Summary
The relational data model is one of several data modeling paradigms used by databases today, albeit the most popular one. At the heart of the relational model is a construct called a relation, which we typically call a “table”, although it has some strict constraints such that not any table will qualify. The particular strength of the relational model is that it enables database users to efficiently execute queries that were not anticipated at the time the database was designed. This flexibility is obtained by breaking down the data into numerous tables, one for each “entity” or meaningful noun concept in the domain, and by relating these tables to each other by means of foreign key columns. The relationships that may be defined differ in cardinality and degree. Cardinality refers to how many rows of each table participate in a relationship, and the possibilities are one-to-many (1:M), many-to-many (M:M), and one-to-one (1:1). Degree refers to how many tables are related. Binary (two table) and unary (one table) relationships are the most common ones you will encounter. A relational database model can be visualized as an entity-relationship (E-R) diagram, and you are encouraged to become familiar with at least one E-R diagram notation, such as the “crow’s foot” notation summarized in Figure 2-16.
Definitions
TBD
New psql commands
TBD
New SQL syntax
TBD
Challenges
- In the lab we constrained the “sex” column of the “person” table to be an uppercase “M” or “F”. If a lowercase value were provided, we would see an error. Can you rewrite the constraint such that it would accept either case?
- Try to use ALTER TABLE to add an ownership relationship to the “studio” table in our lab, as diagrammed in Figure 2-12. This is tricky because you already have some data in the table, so you need to think about how to avoid a referential integrity error from those pre-existing rows that don’t have FKs.
Exercises
TBD
Think about it
TBD
Chapter 3: Querying the data
A declarative query language
You have already seen some of the Structured Query Language (SQL) which is used to express queries in Postgres (and every other relational database that I know of) and you’re going to see a lot more in this book’s chapters. You have “programmed” several queries but here’s one thing you may not know: SQL is not a programming language. A computer program written in a language like Python, Java, or C++ is imperative—it gives the computer a sequence of instructions to carry out until it finishes. SQL, by contrast, is a declarative language. In SQL queries, you describe the result that you want, not how the computer should obtain it. That turned out to be a genius move by the creators of the first relational databases.
Inside a DBMS like Postgres is a special function called the query optimizer which processes a SQL query and generates an execution plan for how best to obtain the desired result. In a complex query that incorporates multiple tables, there may be several steps in the plan, some slow and some quicker. These operations may include full table scans (reading an entire table from disk; slow), index scans (much quicker), and different types of join operations (see Table 3-1). Doing them in a certain order may be faster than doing them in another order, and this can make a big difference in a database with millions or billions of rows. Because SQL is declarative, the query optimizer has the freedom to choose the most efficient sequence.
Operation | Meaning |
---|---|
Full table scan | Read every row in the table and find the one(s) specified by the query. |
Index scan (aka “seek”) | Search an index to quickly find locations of the rows specified by the query. A database index is conceptually like the index in the back of a book; it makes finding the right “page” much quicker, more so when the book is longer. |
Table access | Go directly to the location of the specified row(s) and read the data. |
Hash join | A two-phase algorithm to quickly join two tables based on an equality condition. |
Nested loop join | A slower join algorithm that accommodates inequality conditions and other unusual joins. |
Here is a key point that I’ll come back to repeatedly: you should take advantage of the work that the database developers have already done. Yes, you could write your own execution plan, or your own program for processing the data, but it would take a lot of time and you might not get it right. Database engines are designed by some of the smartest computer scientists in the world and honed by practical experience for years, and they have very likely anticipated queries like yours. Give the database the freedom to optimize, and it will generally do an excellent job.
Relational operations
Relations (remember, this is the mathematical term for what we’re calling “tables”) are sets of tuples, as discussed in Chapter 2. There are a number of mathematical operations that can be performed on them, with the interesting property of closure: the result of each relational operation is itself a relation. The clauses of a SQL query can be interpreted as a specification of relational operations to be performed on the specified tables. Interestingly, just as you might simplify a complicated equation in high school algebra before solving it, the query optimizer might use relational algebra to build its execution plan—choosing which operations to perform first in order to reduce the amount of computation it will have to do to finish the job.
The key relational operations identified by E. F. Codd and derived from set theory are the projection, selection, and Cartesian product operations, but to this database developers have added several more very useful operations, particularly extended projection, aggregation, grouping, and sorting. See Table 3-2.
Operation | SQL clause | Symbol | Meaning |
---|---|---|---|
Projection | SELECT | \(\Pi\) | Return only the specified columns |
Selection | WHERE | \(\sigma\) | Return only the rows that match specified criteria |
Cartesian product | CROSS JOIN | \(\times\) | Return every combination of a row from table 1 with a row from table 2 |
Natural join | NATURAL JOIN | \(\Join\) | Return all combinations of rows in specified tables that are equal on their common column |
Extended projection | SELECT | \(\Pi\) | Generate new columns in the resulting table, such as the results of calculations or logical tests |
Aliasing | optional AS | \(\rho\) | Assign a (new) name to a column in the resulting table |
Aggregation | SUM, COUNT, AVG, etc. | \(G_{f(x)}\) | Replace original rows with a single row containing the computed result |
Grouping | GROUP BY | \(_xG\) | In combination with aggregation, split the original data into subsets to yield subtotals, subaverages, or whatever |
Sorting | ORDER BY | n/a | Re-arrange the rows in a specific order |
Basic relational operations from set theory
Projection is the operation of reducing a table to a subset of its columns, and in SQL it is expressed as a list of columns following the SELECT keyword, for example:
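For instance, assuming a “players” table:

```sql
SELECT name, position
FROM players;
```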
Selection is the operation of reducing a table to a subset of its rows, and in SQL it is expressed as a logical test (for equality or inequality) following the WHERE keyword. Multiple conditions may be combined into one with the AND and OR keywords if needed. For example:
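For instance:

```sql
SELECT *
FROM players
WHERE team = 'Patriots' AND position = 'QB';
```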
These are certainly the most common operations, and most queries will employ both. Consider the query
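For example:

```sql
SELECT name, age
FROM players
WHERE team = 'Patriots' AND position = 'QB';
```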
This query may be expressed in relational algebra as \(\Pi_{name,age}(\sigma_{team=Patriots \land position=QB}(players))\). This formulation implies that the selection operation should be computed first, and then the projection operation. But because it is an algebra, and because the outcome of every operation is another relation, we could just as easily flip it around, i.e.: \(\sigma_{team=Patriots \land position=QB}(\Pi_{name,age}(players))\) (strictly speaking, the flipped form only works if the projection retains the columns the selection tests, so “team” and “position” would have to be carried along). This kind of flexibility gives the query optimizer room to make choices that speed up the query.
The third of the “original” relational operations is the Cartesian product operation, which joins every row of one table with every row of a second table. The Cartesian product is expressed in PostgreSQL as CROSS JOIN, and one way it sometimes comes in handy is to generate a cross-tabulation of the rows of two tables. For example, if you want a report to yield some statistics about every football team in every year (perhaps to build a line graph?), the core of the query might be:
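For example, assuming “teams” and “seasons” tables:

```sql
SELECT *
FROM teams CROSS JOIN seasons;
```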
Or in relational algebra, \(teams\times seasons\). The more common type of join, as discussed in Chapter 2, is a natural join, where each row of one table is joined with only the rows of the other table that have matching values of a specific column (i.e., a foreign key - primary key relationship). In Postgres there is actually a NATURAL JOIN keyword that works when the columns literally have the same name. If they have different names (for example, if a “players” table has a FK called “team_id” but in the “teams” table it’s simply called “id”), you can use either a JOIN clause or a WHERE condition to effect the join. These are three ways you might perform a natural join on two tables in SQL:
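A sketch of the three formulations (table and column names assumed):

```sql
-- When the join columns share a name, the keyword works directly:
SELECT * FROM players NATURAL JOIN teams;

-- Otherwise, spell out the condition in a JOIN clause...
SELECT * FROM players JOIN teams ON players.team_id = teams.id;

-- ...or in a WHERE clause:
SELECT * FROM players, teams WHERE players.team_id = teams.id;
```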
In relational algebra, the natural join is expressed as \(players\Join _{team\_id=id} teams\); the subscript expressing the join condition can be omitted if the FK-PK relationship is obvious. You could perform a natural join by first taking the Cartesian product and then selecting the rows where the FK matches the PK, à la \(\sigma _{team\_id=id} (players \times teams)\), and in theory this is what the database engine is doing. In practice, the query optimizer will use an algorithm like a hash join to perform an equality join much more quickly.
Inequality joins are also possible. If you want to join each player with teams he is not on, in order to perform some kind of comparison, you might do the following:
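For example:

```sql
SELECT players.name, teams.name
FROM players
JOIN teams ON players.team_id <> teams.id;
```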
In relational algebra notation this is \(players\Join _{team\_id \neq id} teams\). Such a join is generally going to be quite expensive in computational terms because the database engine must perform a nested loop: for each row of the “players” table it must loop through the entire “teams” table to find relevant rows.
Extensions to the relational toolkit
Although relational modeling and relational algebra originate in set theory, database developers and users have made numerous pragmatic extensions to the original theory-derived set of methods we can apply. After all, a database isn’t an academic exercise, but a practical business tool.
The idea behind extended projection is that a query can give us not only a selection of columns from the original table(s), but can also produce new columns as a result of calculations or logical tests. For example:
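For example, computing a new column from an existing one (the 5% raise is invented for illustration):

```sql
SELECT name, salary * 1.05
FROM players;
```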
A closely associated idea is that of aliasing (also known as the “rename” operation), an operation that changes the name of a column or assigns a name to a column that doesn’t have one (such as the calculated column above). In Postgres, you can use the optional AS keyword, or simply provide an alias after specifying the column:
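For instance:

```sql
SELECT name, salary * 1.05 AS raise
FROM players;

SELECT name, age > 35 oldguy
FROM players;
```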
The last query above is an example that contains a logical test, “age > 35”, and the result will be a column called “oldguy” that contains Boolean values: “true” and “false”. The AS keyword is omitted; it is optional, but may make your queries easier to read.
Another powerful extension to relational algebra is aggregation; this allows us to generate new rows that do not come from the original tables, but instead are the result of calculations over some or all of the original rows. The aggregations you will use most frequently are SUM and COUNT, but a number of other aggregate functions in Postgres are available, particularly for statistics, such as MIN, MAX, AVG, and so on.
Aggregation works together with the operation of grouping, which identifies the set(s) of rows to be aggregated together. If no grouping condition is set by a GROUP BY clause in the SQL, all rows are aggregated into one result. To count the number of teams, for example, we could simply do:
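For example:

```sql
SELECT COUNT(*) FROM teams;
```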
If we want to compute aggregates for subsets of the data, we use GROUP BY to generate a group for each distinct value of a particular column, for example:
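For example, the average salary on each team:

```sql
SELECT team, AVG(salary)
FROM players
GROUP BY team;
```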
In relational algebra notation, the grouping and aggregation operations are denoted by a capital “G”; the grouping column in a preceding subscript and the aggregation function in the following subscript. The above example would be expressed \(_{team} G _{AVG(salary)} (players)\).
A final important operation is sorting. Although in theory the order of rows in a relation is meaningless (it’s a set), in practice we want to sort the rows into some kind of meaningful order. There is no standard notation for this operation in relational algebra, but in SQL it’s expressed in the ORDER BY clause of a query:
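For example, highest-paid players first:

```sql
SELECT name, salary
FROM players
ORDER BY salary DESC;
```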
The relational operations listed here are the key components from which you’ll build most of your queries, and they are common to virtually all relational databases. In Chapter 5, we’ll introduce you to some additional relational patterns that are useful in special cases, and in Chapter 7 we’ll explore some of the special features of PostgreSQL that distinguish it from other relational databases.
Queries within queries
SQL queries can be much more complex than you have seen so far, when the answer to one query depends on the answer to others. In such cases, a complete query called a subquery can be nested within another. Subqueries are most often found in the WHERE, FROM, and SELECT clauses. For example, this query names all players who earn a salary greater than the average:
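For example:

```sql
SELECT name
FROM players
WHERE salary > (SELECT AVG(salary)
                FROM players);
```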
The inner query, “SELECT AVG(salary) FROM players”, is evaluated first and yields a single numerical result. Then the outer query is evaluated to finish processing the query. Just like in a mathematical expression, parentheses set the subquery apart from the clauses of the surrounding query. (I also indented the subquery, but this is not necessary; PostgreSQL, like most databases, is indifferent to whitespace like spaces, tabs, and line breaks.)
Subqueries in the FROM clause, when evaluated, produce result sets that act like tables in the outer query. To identify all the quarterbacks in the AFC East, we could join the “players” table with the result of a subquery that selects all teams in that division:
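A sketch (the “division” column and its values are my assumption):

```sql
SELECT players.name
FROM players
JOIN (SELECT * FROM teams WHERE division = 'AFC East') AS afceast
  ON players.team_id = afceast.id
WHERE players.position = 'QB';
```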
In this example you saw that the `AS` keyword can be used to assign a name to a table, just as previously we saw it used to assign a name to a column. A subquery used as a table in the `FROM` clause must be given a name. The `AS` keyword is optional, though I think it makes the query easier to read.
Another place you will often see a subquery is the `SELECT` clause. Such a query will almost always yield a single numeric result, because in the `SELECT` clause it generates a value for a single column of query output. Here is a simple example that lists the total salary budget for each team. (Note that this could also be accomplished using a join and a `GROUP BY`; there are more ways than one to solve many SQL problems.)
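A sketch of such a query (assuming a "team_name" column in "teams"):

```sql
SELECT team_name,
       (SELECT SUM(salary)
        FROM players
        WHERE players.team_id = teams.team_id) AS total_salary
FROM teams;
```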
This is a case of a correlated subquery, meaning that the subquery depends on some value from the outer query (namely `teams.team_id`). Therefore, the subquery will be executed numerous times, once for each row of the "teams" table. The queries in the previous two examples are noncorrelated (or uncorrelated) subqueries, which need only be executed once. Correlated subqueries are potentially much slower, so it could be preferable to solve this problem with a join instead of a subquery, but that's not possible in every case. My philosophy is to trust the query optimizer (and its developers) to find the fastest way to execute the query, and not to over-think the SQL: if it ain't broke, don't fix it.
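For comparison, the salary-budget query described above could be rewritten with a join and `GROUP BY` (column names again assumed; the `LEFT JOIN` keeps teams that have no players):

```sql
SELECT t.team_name, SUM(p.salary) AS total_salary
FROM teams AS t
LEFT JOIN players AS p ON p.team_id = t.team_id
GROUP BY t.team_name;
```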
Finally, subqueries can be nested within subqueries. Indeed, there may be several levels of such nesting, making for some pretty complicated queries. Just as you solved complicated equations in high school by first "simplifying the expression", the query optimizer may have many ways to simplify and sequence things behind the scenes so that the query's result can be obtained as efficiently as possible.
Lab
TODO:
- This time, psql directly into the lab3 database.
- Re-introduce the term DML; you have already seen INSERT and SELECT.
- Sports database of matches and wins.
- Find the average number of points scored per team per game.
- Find the highest-scoring player on every team.
- Do an inequality join to identify the coach based on start date (arbitrary).
- Show teams' ranks with a window function.
- Join with a generated calendar to show wins/losses per month.
- Include some compound logical tests with AND and OR.
- Show how to explain a query, impose your own dumb query plan, and time a query.
Normalizing a data model
coming soon
More relational patterns
coming soon
- how to implement hierarchical categories
- an alternative: how to implement “tags”: separate table or in-row array?
Logic in the database
coming soon
Advanced PostgreSQL tricks
coming soon
Just enough physical database design
coming soon
Dimensional modeling for analytics
coming soon
Databases in the cloud
coming soon
Appendix A: Installing PostgreSQL and accessing it from the command line
In this chapter, I'll walk you through installing the latest version of PostgreSQL on Windows 10, Mac OS X, and select Linux distributions. As stated in Chapter 1, the goal is to get to the point that you can enter the command `psql` at your system's command line and connect to a PostgreSQL server running in your local environment.
Why type commands at the command line? Although there are point-and-click graphical interfaces available for PostgreSQL, it is difficult for me or any teacher to give clear instructions about how to point and where to click. These interfaces may also differ for users on different platforms. The command line interface (CLI) to PostgreSQL is the same on all platforms and allows me to give precise instructions. It will also ensure that you get lots of practice writing SQL. As a bonus feature, non-technical people who see you doing this will think you're a super-hacker.
Setting up PostgreSQL in Windows 10
Obtain and install PostgreSQL
First, visit www.postgresql.org for the latest information about PostgreSQL. Take note of the latest stable version number, which at the time of this writing appears under the heading “LATEST RELEASES”. There will probably also be an announcement about an unfinished future version, in alpha or beta test stage, but I recommend you avoid that one. From the homepage, follow the “Download” link(s) until you find a Windows installer for the latest version. At the time of this writing, the first option listed on the website is an interactive installer provided by EnterpriseDB, a company that promotes PostgreSQL.
Unless your computer is very old, it should be a 64-bit system, so you’ll want to choose the “Win x86-64” installer. Again, make sure you’re not downloading the unstable alpha or beta, but are getting the newest mature release. The download may take some time if you have a slow connection. When you have downloaded the installer, run it.
The installer has several screens that ask you how to configure PostgreSQL, but for the most part you can accept the defaults. On about the fourth screen, you’ll be asked to create a password for the database superuser (known as “postgres”). This is the user account that can create and delete databases, and other users. Since you’re just practicing on (presumably) your personal computer, it’s best to pick something easy to remember and that you don’t mind other people seeing. By all means, though, don’t forget what password you entered.
You should be able to click “next” through all of the remaining screens, and watch the installer do its job. After installing the software, the Setup Wizard will give you the option to launch something called “Stack Builder”, to augment PostgreSQL with some optional add-ons. You can uncheck the box and skip that step. If you do find yourself in Stack Builder, you can get out of it by clicking “Cancel” if you wish.
Access `psql` from the Windows command prompt
To get to the Windows command line, search for the program called “cmd.exe”. In Windows 10 you can do this by clicking the “Windows” icon to launch the main menu, and then typing “cmd”. This should work for other recent versions of Windows, too. I recommend pinning the icon to your taskbar, so you can access it quickly in the future.
Upon opening this application, you'll be presented with a text-only window into which you can enter commands one at a time. Type `help` and hit Enter for a list of basic commands you can use to navigate this interface. Most likely you'll find `cd` (change directory) and `dir` (list the files in a directory) useful for navigating folders and getting to your work.
Test your PostgreSQL installation by typing `psql -V`. If the output of this command is the PostgreSQL version number, try then entering `psql` to log in to the database server. If `psql` opens without an error message, you're done with installation, and may go back to Chapter 1 to begin the lab.
More likely, though, you'll get an error message saying that `psql` isn't recognized as a command or program. Failing that, you'll get an error message saying that your password is wrong. In either case, there's one more step to do: we need to set two important environment variables.
Setting environment variables in Windows 10
Environment variables are data that are accessible to all programs running on your system, and they’re usually used for things that multiple programs need to know about (such as the location of your database) but which may be different from one computer to the next. By using an environment variable to store this information, you don’t have to include it in code. Therefore a program could be developed on one computer and deployed to another that has things set up differently, and the original code wouldn’t need to be altered. In the code, instead of saying “connect to the database located in directory C:/X/Y/Z”, you’d say “connect to the database at environment variable DATABASE_LOCATION”.
The first environment variable we need to get right is the PATH. When you type the name of a program, like PostgreSQL's command line interface `psql`, the operating system looks in a number of directories to find a program of that name. If it doesn't find one, you'll see an error message saying "psql is not recognized as an internal or external command". The PATH is the environment variable that stores the list of directories to search in. To configure it correctly, first find out where `psql` lives on your system, and then add that directory to the PATH.
Finding `psql` is the easy part. It's most likely in the directory `C:\Program Files\PostgreSQL\9.6\bin` (change "9.6" if necessary, depending on the version you installed). If not, use your system's search tools or poke around until you find it. By the way, the true name of this file is "psql.exe", with the ".exe" suffix signifying to Windows that it is an executable program. Depending on your folder settings, you may or may not see the suffix.
To alter the PATH, navigate to the “System” control panel. Windows 10 makes this a little more complicated than in earlier versions of Windows. You can either do a search for “Control Panel” and then click to “System”, or you can go to the Settings panel, click “System”, then scroll down and click “About”, then scroll down and click “System Info”. Either way you’ll find yourself on a page that displays basic information about your computer, its processor, memory, and operating system version among other details. On the left, click “Advanced system settings”. A new window will appear. Near the bottom of that window, click the button for “Environment Variables…”.
In the environment variable control panel (see Figure A-6), you'll see a list of variables specific to your user account, and a list of variables that apply to the whole system (all possible users). You can alter the PATH (or "Path", "path", "pAtH", etc.) in either of these lists, or create it if it doesn't already exist. If one already exists, select it and click "Edit…". Add a new listing to the end of the PATH for the directory where `psql` lives. Do not delete existing entries! They are probably important for other programs you use.
Click "OK". You may now open a new command prompt and try `psql` again, and you'll most likely get the confusing error illustrated in Figure A-8. The program will ask you for the password (the same password you dutifully wrote down during installation; see above) and then will tell you that "authentication failed for user YourName". What this means is not that you got the password wrong, but that Postgres got the username wrong. It assumed your database username was the same as your Windows username. Glancing back up at Figure A-2, you'll see that the database user is simply named "postgres". There's a workaround for this problem: you can append `-U postgres` to all database commands to specify the username, e.g.:
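For example, logging in as the superuser looks like this:

```
psql -U postgres
```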
However, that's a bit tedious, and there's a better solution. When `psql` runs, it checks for a username stored in an environment variable called PGUSER before guessing that you want to use your Windows username.

To set this environment variable, click "New…" and enter the name `PGUSER` with the value `postgres`. Then click "OK".
After configuring the PATH and PGUSER environment variables, you must click "OK" to close the Environment Variables control panel. Then open a new command prompt and try `psql`. If you successfully log in to the database server (as in Figure A-10), you're up and running!
Setting up PostgreSQL on a Mac
Obtain and install PostgreSQL
The home of PostgreSQL is www.postgresql.org and, although it's not necessary to go there to obtain Postgres for the Mac, I recommend that you make a visit to that website as your first step anyway. There you can learn about the latest version, locate documentation and other resources, and find the latest information about installation, in case this book is out of date by the time you read it. Take note of the latest stable version number, which at the time of this writing appears under the heading "LATEST RELEASES". There will probably also be an announcement about an unfinished future version, in alpha or beta test stage, but I recommend you avoid that one.
For Mac users, there are actually quite a few ways to obtain Postgres, including an installer program provided by EnterpriseDB, an “app”, and the Homebrew package manager. I’ve found the Homebrew installation quite painless, and it looks like Homebrew may have other handy uses (such as starting and stopping the database server), so that’s the approach I recommend for you.
If you don’t have Homebrew yet, visit their website at brew.sh for instructions. The easiest way to install Homebrew is to open up a Terminal window and paste the code provided on the web page. This downloads an installation script and runs it on your computer.
Installing Postgres with Homebrew couldn't be easier. Assuming you have installed Homebrew, go to a command line and type:
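Assuming the Homebrew formula is still named "postgresql" (the name at the time of this writing), the command is:

```
brew install postgresql
```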
Upon your hitting ENTER, Homebrew should proceed to download Postgres, install it, and configure it for use. You will see a number of messages scrolling by (see for example, Figure A-13) and a new prompt will appear when installation is complete.
Starting the server and creating a database
You can test that you installed the software by entering `psql -V` at the command line. The system should respond by telling you the version of Postgres that you installed. (Tip: if this fails, try closing the Terminal window and opening a new one, then re-enter the command.)
On the Mac, though, there are a couple of additional things to do. Remember that `psql` is just a utility for talking to the database; for it to be useful we must first start the database server. As you might have seen in the messages output by the installation, we can start the server with a Homebrew command like this:
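Again assuming the formula name "postgresql":

```
brew services start postgresql
```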
The PostgreSQL server is now running in the background on your computer, and it will keep running even if you restart the computer later. To stop this background server, you can enter the “stop” command at any time:
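The stop command takes the same form:

```
brew services stop postgresql
```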
There's one last thing to do now. By default, if you simply enter `psql` at the command line, Postgres tries to open a database with the same name as your Mac OS username. Since that database doesn't exist yet, it will fail. The simple solution is to use `createdb <username>` (see Chapter 1) to create the database it's looking for. Figure A-15 illustrates the error messages you may encounter, and the steps needed to successfully start the server and create the default database.
Now your Mac should be ready to play with all the examples in this book.
Setting up PostgreSQL in Linux
(to do)
About the Author
Joseph W. Clark, Ph.D. has researched and taught information systems and data analytics topics since 2006, most recently at the University of Maine. He was one of the first generation of Web developers in the 1990s, and has been fascinated with databases and data modeling since he first learned how relational databases could power dynamic websites, around 1999. His academic interests have been at the intersection of data analytics and entrepreneurship, and new types of workflows such as Agile, Lean, and Design Thinking. His most ambitious project yet is raising four children with his beautiful wife, Xiaofang.