11. Understanding Joins

11.1 What is a Join?

A join operation allows you to retrieve data from multiple tables in a single SELECT query. Two tables can be joined by a single join operator, but the result can be joined again with other tables. There must exist a same or similar column between the tables being joined.

When you design an entire database system using good design principles like normalization, we often require the use of joins to give a complete picture to a user’s query. For example, we split our programming languages table into two - one holding the author details and the other holding information about the languages itself. To show a report listing authors and which programming language they created, we would have to use a join.

Figure: authors_tbl contents
author_id author language_id
1 Colmerauer 1
2 Wall 2
3 Ousterhout 4
4 Iverson 3
5 Kemeny 5
6 Kurtz 5
Figure: newlang_tbl contents
id language year standard
1 Prolog 1972 ISO
2 Perl 1987 (null)
3 APL 1964 ANSI
4 Tcl 1988 (null)
5 BASIC 1964 ANSI

We now form a query to show our desired output - the list of all authors with the corresponding language they developed. We choose our join column as the language_id field from the authors table. This corresponds to the id field in the languages table.

Listing: running a join operation on our two tables
1 SELECT author, language FROM authors_tbl, newlang_tbl
2 WHERE language_id = id;
Figure: result of our join query
author language
Colmerauer Prolog
Wall Perl
Iverson APL
Ousterhout Tcl
Kemeny BASIC
Kurtz BASIC

The output of our query combines a column from both tables giving us a better report. The language_id = id is called the join condition. Since the operator used in the join condition is an equality operator (=), this join is called as an equijoin. Another important thing to note is that the columns participating in the join condition are not the ones we choose to be in the result of the query.

11.2 Alternative Join Syntax

You would have noticed that we formed our join query without much special syntax, using our regular FROM/WHERE combination. The SQL-92 standard introduced the JOIN keyword to allow us to form join queries. Since it was introduced earlier, the FROM/WHERE syntax is more common. But now that the majority of database vendors have implemented most of the SQL-92 standard, the JOIN syntax is also in widespread use. Below is the JOIN syntax equivalent of the query we just wrote to display which author created which programming language.

Listing: Rewriting our query using the JOIN(SQL-92) syntax
1 SELECT author, language FROM authors_tbl JOIN newlang_tbl
2 ON language_id = id;

Notice that instead separating the two tables using a comma (thereby making it a list), we use the JOIN keyword. The columns which participate in the join condition are preceded by the ON keyword. The WHERE clause can then be used after the join condition specification (ON clause) to specify any further conditions if needed.

11.3 Resolving ambiguity in join columns

In our example the join condition fields had distinct names - id and language_id. But what if in our languages table (newlang_tbl) we kept the key field’s name as language_id. This would create an ambiguity in the join condition, which would become the confusing language_id = language_id. To resolve this, we need to qualify the column by prepending it by the table name it belongs to and a .(period).

Listing: Resolving the naming ambiguity by qualifying the columns
1 SELECT author, language FROM authors_tbl JOIN newlang_tbl
2 ON authors_tbl.language_id = newlang_tbl.language_id;

Another way to solve such ambiguity is to qualify the columns using table aliases. The concept is to give a short name to a table and then use this to qualify the columns instead of a long, unwieldy table name.

Listing: using table aliases
1 SELECT author, language FROM authors_tbl a JOIN newlang_tbl l
2 ON a.language_id = l.id;

Here the authors table is given the alias a and the languages table is given the alias l. It is generally considered a good practice to qualify column names of a join condition regardless of whether there is a name ambiguity or not.

11.4 Cross Joins

You might think what would happen if we left out the join condition from our query. Well what happens in the background of running a join query is that first all possible combinations of rows are made from the tables participating in the join. Then the rows which satisfy the join condition are chosen for the output (or further processing). If we leave out the join condition, we get as the output all possible combinations of records. This is called a Cross Join** or Cartesian Product of the tables usually denoted by the sign X.

Listing: query for showing the cartesian product of our tables
1 SELECT author, language FROM authors_tbl, newlang_tbl;
Figure: the cartesian product of our tables
author language
Kemeny BASIC
Kurtz BASIC
Colmerauer BASIC
Wall BASIC
Ousterhout BASIC
Iverson BASIC
Kemeny Prolog
Kurtz Prolog
Colmerauer Prolog
Wall Prolog
Ousterhout Prolog
Iverson Prolog
Kemeny Perl
Kurtz Perl
Colmerauer Perl
Wall Perl
Ousterhout Perl

Another way to rewrite this query is to actually use the JOIN keyword with a preceding argument CROSS as shown below.

Listing: rewriting the query using CROSS JOIN
1 SELECT author, language FROM authors_tbl CROSS JOIN newlang_tbl;

11.5 Self Joins

Sometimes a table within its own columns has meaningful data but one (or more) of its fields refer to another field in the same table. For example if we have a table in which we capture programming languages which influenced other programming languages and denote the influence relationship by the language id, to show the resolved output we would have to join the table with itself. This is also called a SELF JOIN. Consider the table created below and pay close attention to the data being inserted.

Listing: creating and populating our language influence table
 1 CREATE TABLE inflang_tbl (id            INTEGER     PRIMARY KEY,
 2                           language      VARCHAR(20) NOT NULL,
 3                           influenced_by INTEGER);
 4 
 5 INSERT INTO inflang_tbl (id, language)
 6  VALUES (1, 'Fortran');
 7 
 8 INSERT INTO inflang_tbl (id, language, influenced_by)
 9 VALUES (2, 'Pascal', 3);
10 
11 INSERT INTO inflang_tbl (id, language, influenced_by)
12 VALUES (3, 'Algol', 1);
Figure: contents of inflang_tbl
id language influenced_by
1 Fortran (null)
2 Pascal 3
3 Algol 1

Our goal is to now write a self join query to display which language influenced which one, i.e. resolve the influenced_by column.

Listing: running a self join query
1 SELECT l1.language, l2.language AS influenced
2 FROM inflang_tbl l1, inflang_tbl l2
3 WHERE l1.id = l2.influenced_by;

Notice the use of table aliases to qualify the join condition columns as separate and the use of the AS keyword which renames the column in the output.

Figure: result of our self join query
language influenced
Algol Pascal
Fortran Algol