Leanpub: Publish Early, Publish Often

10. Aggregation and Grouping

10.1 Aggregate Functions

An aggregate function is used to compute summarization information from a table or tables. We have already seen the COUNT aggregate function which counts the records matched. Similarly there are other aggregation functions in SQL like AVG for calculating averages, SUM for computing totals and MAX, MIN for finding out maxima and minima values respectively.

10.2 Using DISTINCT with COUNT

We have already seen the COUNT function, but we can further control its output using the optional argument DISTINCT. This allows us to count only non-duplicate values of the input specified. To illustrate this concept, we will now insert some rows into our proglang_tbl table.

Listing: Inserting some new rows in our programming languages table

1 INSERT INTO proglang_tbl (id, language, author, year, standard) VALUES (5, 'Fort\
2 ran', 'Backus', 1957, 'ANSI');
3 
4 INSERT INTO proglang_tbl (id, language, author, year, standard) VALUES (6, 'PL/I\
5 ', 'IBM', 1964, 'ECMA');

Note the new data choice that we are populating. With Fortran we are adding a new programming language that has a standard by the ANSI. With PL/I we now have a third distinctive standards organisation - ECMA. PL/I also shares the same birth year as APL (1964) giving us a duplicate year field. Now let us run a query to check how many distinct year and standard values we have.

Listing: Counting distinct year values

1 SELECT COUNT (DISTINCT year) FROM proglang_tbl;
2 
3 > 5

Listing: Counting distinct standard values

1 SELECT COUNT (DISTINCT standard) FROM proglang_tbl;
2 
3 > 3

The first query result is straightforward. We have 6 rows but two of them share a common year value, thus giving us the result 5. In the second result, out of 6 rows only 4 of them have values. Two rows have a NULL value in them meaning those languages have no standard. Among the 4 rows, two of them share a common value, giving us the result - 3. Note that the DISTINCT clause did not count NULL values as truly distinct values.

10.3 Using MIN to find minimum values

The MIN function is fairly straightforward. It looks at a particular set of rows and finds the minimum value of the column which is provided as an argument to it. For example, in our example table we wish to find out from which year do we have records of programming languages. Analyzing the problem at hand, we see that if we apply the aggregate function MIN to the field year in our table, we should get the desired output.

Listing: finding out the earliest year value in our table

1 SELECT MIN(year) from proglang_tbl;
2 
3 > 1957

The MAX function similarly finds the largest value in the column provided to it as an argument.

Listing: finding out the latest year value in our table and the programming language associated with it

1 select language, MAX(year) year from proglang_tbl;

language	year
1988	Tcl

10.4 Grouping Data

The GROUP BY clause of a SELECT query is used to group records based upon their field values. This clause is placed after the WHERE conditional. For example, in our sample table we can group data by which committee decided on their standard.

Listing: Grouping records by its fields

1 SELECT language, standard FROM proglang_tbl
2 WHERE standard IS NOT NULL
3 GROUP BY standard, language;

Figure: output for grouping records
language	standard
APL	ANSI
Fortran	ANSI
PL/I	ECMA
Prolog	ISO

The interesting thing to note here is the rule that the columns listed in the SELECT clause must be present in the GROUP BY clause. This leads us to the following two corollaries.

You cannot group by a column which is not present in the SELECT list.
You must specify all the columns in the grouping clause which are present in the SELECT list.

Another useful way to use grouping is to combine the operation with an aggregate function. Supposing we wish to count how many standards a particular organization has in our table. This can be achieved by combining the GROUP BY clause with the COUNT aggregate function as given below.

Listing: using GROUP BY with aggregate functions

1 SELECT standard, count(*) FROM proglang_tbl GROUP BY standard;

Figure: query output showing the count of standard organizations in our table
standard	col2
ANSI	2
ECMA	1
ISO	1
(null)	2

10.5 The HAVING Clause

Like a WHERE clause places conditions on the fields of a query, the HAVING clause places conditions on the groups created by GROUP BY. It must be placed immediately after the GROUP BY but before the ORDER BY clause.

Listing: demonstration of the HAVING clause

1 SELECT language, standard, year FROM proglang_tbl
2 GROUP BY standard, year, language HAVING year < 1980;

Figure: output of the HAVING clause demonstration query
language	standard	year
APL	ANSI	1964
Fortran	ANSI	1957
PL/I	ECMA	1964
Prolog	ISO	1972

From the output we can clearly see that the records for Perl and Tcl are left out since they do not satisfy the HAVING conditional of being created before 1980.

Up next

11. Understanding Joins