11. Understanding Joins
11.1 What is a Join?
A join operation allows you to retrieve data from multiple tables in a single SELECT query. Two tables can be joined by a single join operator, but the result can be joined again with other tables. There must exist a same or similar column between the tables being joined.
When you design an entire database system using good design principles like normalization, we often require the use of joins to give a complete picture to a user’s query. For example, we split our programming languages table into two - one holding the author details and the other holding information about the languages itself. To show a report listing authors and which programming language they created, we would have to use a join.
| author_id | author | language_id |
|---|---|---|
| 1 | Colmerauer | 1 |
| 2 | Wall | 2 |
| 3 | Ousterhout | 4 |
| 4 | Iverson | 3 |
| 5 | Kemeny | 5 |
| 6 | Kurtz | 5 |
| id | language | year | standard |
|---|---|---|---|
| 1 | Prolog | 1972 | ISO |
| 2 | Perl | 1987 | (null) |
| 3 | APL | 1964 | ANSI |
| 4 | Tcl | 1988 | (null) |
| 5 | BASIC | 1964 | ANSI |
We now form a query to show our desired output - the list of all authors with the corresponding language they developed. We choose our join column as the language_id field from the authors table. This corresponds to the id field in the languages table.
1 SELECT author, language FROM authors_tbl, newlang_tbl
2 WHERE language_id = id;
| author | language |
|---|---|
| Colmerauer | Prolog |
| Wall | Perl |
| Iverson | APL |
| Ousterhout | Tcl |
| Kemeny | BASIC |
| Kurtz | BASIC |
The output of our query combines a column from both tables giving us a better report. The language_id = id is called the join condition. Since the operator used in the join condition is an equality operator (=), this join is called as an equijoin. Another important thing to note is that the columns participating in the join condition are not the ones we choose to be in the result of the query.
11.2 Alternative Join Syntax
You would have noticed that we formed our join query without much special syntax, using our regular FROM/WHERE combination. The SQL-92 standard introduced the JOIN keyword to allow us to form join queries. Since it was introduced earlier, the FROM/WHERE syntax is more common. But now that the majority of database vendors have implemented most of the SQL-92 standard, the JOIN syntax is also in widespread use. Below is the JOIN syntax equivalent of the query we just wrote to display which author created which programming language.
1 SELECT author, language FROM authors_tbl JOIN newlang_tbl
2 ON language_id = id;
Notice that instead separating the two tables using a comma (thereby making it a list), we use the JOIN keyword. The columns which participate in the join condition are preceded by the ON keyword. The WHERE clause can then be used after the join condition specification (ON clause) to specify any further conditions if needed.
11.3 Resolving ambiguity in join columns
In our example the join condition fields had distinct names - id and language_id. But what if in our languages table (newlang_tbl) we kept the key field’s name as language_id. This would create an ambiguity in the join condition, which would become the confusing language_id = language_id. To resolve this, we need to qualify the column by prepending it by the table name it belongs to and a .(period).
1 SELECT author, language FROM authors_tbl JOIN newlang_tbl
2 ON authors_tbl.language_id = newlang_tbl.language_id;
Another way to solve such ambiguity is to qualify the columns using table aliases. The concept is to give a short name to a table and then use this to qualify the columns instead of a long, unwieldy table name.
1 SELECT author, language FROM authors_tbl a JOIN newlang_tbl l
2 ON a.language_id = l.id;
Here the authors table is given the alias a and the languages table is given the alias l. It is generally considered a good practice to qualify column names of a join condition regardless of whether there is a name ambiguity or not.
11.4 Cross Joins
You might think what would happen if we left out the join condition from our query. Well what happens in the background of running a join query is that first all possible combinations of rows are made from the tables participating in the join. Then the rows which satisfy the join condition are chosen for the output (or further processing). If we leave out the join condition, we get as the output all possible combinations of records. This is called a Cross Join** or Cartesian Product of the tables usually denoted by the sign X.
1 SELECT author, language FROM authors_tbl, newlang_tbl;
| author | language |
|---|---|
| Kemeny | BASIC |
| Kurtz | BASIC |
| Colmerauer | BASIC |
| Wall | BASIC |
| Ousterhout | BASIC |
| Iverson | BASIC |
| Kemeny | Prolog |
| Kurtz | Prolog |
| Colmerauer | Prolog |
| Wall | Prolog |
| Ousterhout | Prolog |
| Iverson | Prolog |
| Kemeny | Perl |
| Kurtz | Perl |
| Colmerauer | Perl |
| Wall | Perl |
| Ousterhout | Perl |
| … | … |
Another way to rewrite this query is to actually use the JOIN keyword with a preceding argument CROSS as shown below.
1 SELECT author, language FROM authors_tbl CROSS JOIN newlang_tbl;
11.5 Self Joins
Sometimes a table within its own columns has meaningful data but one (or more) of its fields refer to another field in the same table. For example if we have a table in which we capture programming languages which influenced other programming languages and denote the influence relationship by the language id, to show the resolved output we would have to join the table with itself. This is also called a SELF JOIN. Consider the table created below and pay close attention to the data being inserted.
1 CREATE TABLE inflang_tbl (id INTEGER PRIMARY KEY,
2 language VARCHAR(20) NOT NULL,
3 influenced_by INTEGER);
4
5 INSERT INTO inflang_tbl (id, language)
6 VALUES (1, 'Fortran');
7
8 INSERT INTO inflang_tbl (id, language, influenced_by)
9 VALUES (2, 'Pascal', 3);
10
11 INSERT INTO inflang_tbl (id, language, influenced_by)
12 VALUES (3, 'Algol', 1);
| id | language | influenced_by |
|---|---|---|
| 1 | Fortran | (null) |
| 2 | Pascal | 3 |
| 3 | Algol | 1 |
Our goal is to now write a self join query to display which language influenced which one, i.e. resolve the influenced_by column.
1 SELECT l1.language, l2.language AS influenced
2 FROM inflang_tbl l1, inflang_tbl l2
3 WHERE l1.id = l2.influenced_by;
Notice the use of table aliases to qualify the join condition columns as separate and the use of the AS keyword which renames the column in the output.
| language | influenced |
|---|---|
| Algol | Pascal |
| Fortran | Algol |