10. Aggregation and Grouping
10.1 Aggregate Functions
An aggregate function is used to compute summarization information from a table or tables. We have already seen the COUNT aggregate function which counts the records matched. Similarly there are other aggregation functions in SQL like AVG for calculating averages, SUM for computing totals and MAX, MIN for finding out maxima and minima values respectively.
10.2 Using DISTINCT with COUNT
We have already seen the COUNT function, but we can further control its output using the optional argument DISTINCT. This allows us to count only non-duplicate values of the input specified. To illustrate this concept, we will now insert some rows into our proglang_tbl table.
1 INSERT INTO proglang_tbl (id, language, author, year, standard) VALUES (5, 'Fort\
2 ran', 'Backus', 1957, 'ANSI');
3
4 INSERT INTO proglang_tbl (id, language, author, year, standard) VALUES (6, 'PL/I\
5 ', 'IBM', 1964, 'ECMA');
Note the new data choice that we are populating. With Fortran we are adding a new programming language that has a standard by the ANSI. With PL/I we now have a third distinctive standards organisation - ECMA. PL/I also shares the same birth year as APL (1964) giving us a duplicate year field. Now let us run a query to check how many distinct year and standard values we have.
1 SELECT COUNT (DISTINCT year) FROM proglang_tbl;
2
3 > 5
1 SELECT COUNT (DISTINCT standard) FROM proglang_tbl;
2
3 > 3
The first query result is straightforward. We have 6 rows but two of them share a common year value, thus giving us the result 5. In the second result, out of 6 rows only 4 of them have values. Two rows have a NULL value in them meaning those languages have no standard. Among the 4 rows, two of them share a common value, giving us the result - 3. Note that the DISTINCT clause did not count NULL values as truly distinct values.
10.3 Using MIN to find minimum values
The MIN function is fairly straightforward. It looks at a particular set of rows and finds the minimum value of the column which is provided as an argument to it. For example, in our example table we wish to find out from which year do we have records of programming languages. Analyzing the problem at hand, we see that if we apply the aggregate function MIN to the field year in our table, we should get the desired output.
1 SELECT MIN(year) from proglang_tbl;
2
3 > 1957
The MAX function similarly finds the largest value in the column provided to it as an argument.
1 select language, MAX(year) year from proglang_tbl;
| language | year |
|---|---|
| 1988 | Tcl |
10.4 Grouping Data
The GROUP BY clause of a SELECT query is used to group records based upon their field values. This clause is placed after the WHERE conditional. For example, in our sample table we can group data by which committee decided on their standard.
1 SELECT language, standard FROM proglang_tbl
2 WHERE standard IS NOT NULL
3 GROUP BY standard, language;
| language | standard |
|---|---|
| APL | ANSI |
| Fortran | ANSI |
| PL/I | ECMA |
| Prolog | ISO |
The interesting thing to note here is the rule that the columns listed in the SELECT clause must be present in the GROUP BY clause. This leads us to the following two corollaries.
- You cannot group by a column which is not present in the SELECT list.
- You must specify all the columns in the grouping clause which are present in the SELECT list.
Another useful way to use grouping is to combine the operation with an aggregate function. Supposing we wish to count how many standards a particular organization has in our table. This can be achieved by combining the GROUP BY clause with the COUNT aggregate function as given below.
1 SELECT standard, count(*) FROM proglang_tbl GROUP BY standard;
| standard | col2 |
|---|---|
| ANSI | 2 |
| ECMA | 1 |
| ISO | 1 |
| (null) | 2 |
10.5 The HAVING Clause
Like a WHERE clause places conditions on the fields of a query, the HAVING clause places conditions on the groups created by GROUP BY. It must be placed immediately after the GROUP BY but before the ORDER BY clause.
1 SELECT language, standard, year FROM proglang_tbl
2 GROUP BY standard, year, language HAVING year < 1980;
| language | standard | year |
|---|---|---|
| APL | ANSI | 1964 |
| Fortran | ANSI | 1957 |
| PL/I | ECMA | 1964 |
| Prolog | ISO | 1972 |
From the output we can clearly see that the records for Perl and Tcl are left out since they do not satisfy the HAVING conditional of being created before 1980.