Leanpub: Publish Early, Publish Often

13. Working with Sets

Set theory is a branch of discrete mathematics that deals with a collection of objects. There is a lot of conceptual overlap between set theory and relational database concepts. It is no wonder that the output of a query is frequently called a resultset. Primitive set theoretic operations like union, intersection and difference are increasingly supported in various SQL implementations. We will now explore the theory behind these operations and how to use them in SQL.

13.1 Union

The union is an operation that combines elements of two sets. Let’s say we have the following two sets:

1 set1 = { 1, 3, 5}
2 
3 set2 = { 1, 2, 3}

The resulting union set will be a set consisting of all of these elements repeated exactly once, i.e. no duplicates are allowed. Note that the order of a set is unimportant. Think of it as a bag of elements rather than an ordered collection.

1 set1 UNION set2 = { 1, 3, 5, 2 }

Let’s now look at how to use simulate the UNION operation in SQL. Consider the table and its data below.

id	language	author	year	standard
1	Prolog	Colmerauer	1972	ISO
2	Perl	Wall	1987	null
3	APL	Iverson	1964	ANSI
4	RPG	IBM	1964	ISO

If we wanted to get the list of creation years of languages standardized by either ANSI or ISO, we could use a UNION keyword to achieve this.

Listing: using a UNION operator

1 SELECT year FROM proglang_tbl WHERE standard='ANSI'
2 UNION
3 SELECT year FROM proglang_tbl WHERE standard='ISO'
4 
5 > year
6   ----
7   1964
8   1972

The result is expectedly two rows because 1964 is a repeated year being the creation year of two languages which were standardized. We can get three distinct rows with the year 1964 being repeated twice by using the UNION ALL operator instead of simply UNION. The rest of the query remains exactly the same.

13.2 Intersection

The intersection operation that outputs only the common elements in the input sets. If we apply an intersection to the two sets in the previous section we get a resulting set of two elements.

1 set1 INTERSECTION set2 = { 1, 3 }

Translating this to SQL is pretty simple; instead of using UNION we use the keyword INTERSECT to get common elements.

Listing: using the INTERSECT

1 SELECT year FROM proglang_tbl WHERE standard='ANSI'
2 INTERSECT
3 SELECT year FROM proglang_tbl WHERE standard='ISO'
4 
5 > year
6   ----
7   1964

Note that while the ANSI SQL standard does provision for an INTERSECT ALL operator, I’m yet to come across a database management system that implements it.

13.3 Difference

The difference operation between sets, written as set1 - set2 is a list of all elements in set1 that do not occur in set2. If an element is only in set2, it will not be captured by the plain difference operation.

1 set1 DIFFERENCE set2 = { 5 }
2 set2 DIFFERENCE set1 = { 2 }

Let’s try and write a SQL statement to emulate this logic with our familiar IN and NOT IN operators. Suppose we wish to list out the years of creation of languages which were standardized by ISO but not the ANSI. From our source table we find that two languages were standardized with years 1972 and 1964. But since in 1964, APL was created which was eventually standardized by ANSI, we should ideally be left with the answer 1972.

Listing: trying to write set difference with IN

1 SELECT year FROM proglang_tbl 
2   WHERE standard IN ('ISO') 
3     AND standard NOT IN ('ANSI');
4 
5 > year
6   ----
7   1972
8   1964

Whoa, what sorcery is this!?! We thought 1964 would be ineligible because of ANSI standardization. But clearly this is not the case. What has happened actually is that first there was a scan of ISO rows - giving us two values. Then ANSI rows were discounted but not necessarily from the first result but the table as a whole. So while the APL 1964 was left off, the RPG 1964 still remained, effectively making our second condition worthless. The correct way to achieve this is using the set difference operator EXCEPT as below.

Listing: set difference with EXCEPT

1 SELECT year FROM proglang_tbl WHERE standard IN ('ISO')
2 EXCEPT
3 SELECT year FROM proglang_tbl WHERE standard IN ('ANSI')
4 
5 > year
6   ----
7   1972

Voila, this seems to yield the correct answer! When we write more than a single SELECT as a part of a single query and join them using an set theoretic operator, such statements are called compound queries.

Up next