C112: Aggregation and Computation in the SELECT clause

Computing new values

In the SELECT clause we can compute new values from existing data. As a reminder we’ve seen that the SELECT clause determines which column we return in the query result. The columns come from the table that we build using the FROM and WHERE clauses. More specifically, in a query

The FROM clause specifies the table we will use
The WHERE clause filters rows from that table by keeping only ones that meet certain conditions
The SELECT clause is used to specify which columns should be returned

The third step above is worded very generally. We’ve seen that we can specify columns by name (or by the * wildcard) to return in the result set, but we can also make new columns in the SELECT clause using existing columns. We can create these new columns using arithmetical operations like +, -, *, and / for add, subtract, multiply, and divide, respectively. Let’s imagine in the Sailors table that the best sailors are the youngest ones with a higher rating, and we want to quantify that by dividing the rating by the age (because dividing the same rating by the smaller age will yield a bigger rating than dividing by the age of an older sailor). We can do by

SELECT rating/age
FROM Sailors

and if we would like to name that column, we can do so as well:

SELECT rating/age AS SailorYears
FROM Sailors

That’s pretty much it for customizing in the SELECT clause. We can mix and match new columns with existing one by separating with commas, and we can renames columns one by one with AS:

SELECT rating/age AS SailorYears, age, sname AS SailorAge
FROM Sailors
WHERE age < 40

Aggregation Queries

To aggregate is to combine many things into one thing. In databases, there are tools known as aggregate functions or aggregation functions that aggregate a column of data; they combine many rows into one row for a particular column or group of columns. We’ve seen before that a column of data is interchangeable with a list or tuple of values, and that correspondence gives us a good way to give examples of aggregation. If we have a column of grades with values (80.0, 90.0, 95.0, 100.0), there are several ways to aggregate those many values into one value. We can find the largest value in the list (100.0) or the smallest (80.0). We could also add up the values (365.0), count them (4), or compute their average (365.0/4.0 = 91.25). These are all examples of common aggregation functions we may use in this course.

Aggregate Function
COUNT(column) or COUNT(*)
SUM(column)
MIN(column)
MAX(column)
AVG(column)

These are the five aggregation functions we’re responsible for memorizing in this course

Although we only deal with these five to learn the concept of aggregation, there are many other available ones in a typical DBMS. Aggregation queries are queries that use aggregation functions to roll up a table of data into a single (aggregate) row of data, so it’s a way to simplify information or compute new information that depends on 1 or more rows in a table. All semester long we’ve seen that the FROM and WHERE clause build a table that gets passed to the SELECT clause. Aggregate functions appear in the SELECT clause and condense the information in that table into a single row.

For instance, in the Sailors Reserving Boats database, each sailor has a rating, but together ALL sailors have an average rating. The average rating depends on all the rows in the sailor table, so it’s an aggregate quantity. If we use the Sailors table in the FROM clause but omit the WHERE clause, then the SELECT clause will receive all the rows in the Sailors table. To compute it we write in SQL

SELECT AVG(rating)
FROM Sailors

and see a single number (6.6 as a 1 x 1 table) returned to us with the answer. Note that we passed the name of a single column to the aggregate function AVG. All 5 accept a single column name as their argument. The only one that accepts anything else is COUNT() which also accepts the * wildcard. More on this later.

We can use aggregation with all the SQL bells and whistles we’re used to: “Find the average rating of sailors older than 20 years old.” In this example, since we’ll need a WHERE clause, the SELECT clause does not receive the entire Sailors table. It only receives certain rows. The SELECT clause computes the average of whatever data it receives, so when it receives only some ratings instead of all ratings, we should expect a different average value from the query.

SELECT AVG(rating)
FROM Sailors
WHERE age < 20

Indeed in this case the average rating is returned in a 1×1 table as 10.0, which is greater than the 6.6 average of all sailors’ rating.

One thing to be careful of with aggregation is that you can’t write a query like

SELECT sname, AVG(rating)
FROM Sailors

ever because there are many sname values in the sailors table, but only one average rating. You can’t make a table in SQL that has many rows in the first (sname) column and a single row in the next column. It’s not shaped like a table!

Information Need: “Find the number of sailors.”

SELECT COUNT(*)
FROM Sailors

We just care how many rows are in the table because this table holds records for Sailors. Each row is a sailor. We don’t even need to care about columns. We can thus use * with COUNT instead of the name of any particular column. The interpretation of COUNT(*) is “count rows” without worrying about the values in a particular column. Because we do not have to extract those values and work with them, COUNT(*) can be faster or more efficient, but really it’s just a good practice to use COUNT(*) whenever you know that it’s indeed the rows you’re counting and not particular values.

Information Need: “Find the number of sailors’ names.”

For this one, we need to be a little careful, for we have to worry about the duplicate snames like ‘Horatio’. If we just COUNT(*) we would count Horatio twice, and if we COUNT(sname), we count Horatio twice. We can add the keyword DISTINCT before the column name to remove duplicates:

SELECT COUNT(DISTINCT sname)
FROM Sailors

This one is straightforward

Information Need: “Find the highest rating for any sailor.”

SELECT MAX(rating)
FROM Sailors

Aggregation queries return a single row, guaranteed. If we have a single column in that single row, it’s a 1×1 table, so we can just use rating = and then put a single-column aggregation query in the () to generate a single value!

Notice I was careful above to say that aggregation returns a single row. We can make multiple-column aggregation queries easily: “Find the average, highest, and lowest ratings for all sailors under 20 years old.”

SELECT AVG(rating) AS avgRating, MAX(rating) as highRating, MIN(rating) AS lowRating
FROM Sailors
WHERE age < 20