Inner Joins – Kimmer's Course Notes

The Sailors table we’ve been using is from the “Sailors Reserve Boats” database which is used in examples in the textbook Database Management Systems by Ramakrishnan and Gehrke. That database stores information about sailors at a boating club who take different boats out sailing at different times. Here is the Boats table where each row is a record storing information about the available boats.

mysql> SELECT *
    -> FROM Boats;
+-----+-----------+-------+
| bid | bname     | color |
+-----+-----------+-------+
| 101 | Interlake | blue  |
| 102 | Interlake | red   |
| 103 | Clipper   | green |
| 104 | Marine    | red   |
+-----+-----------+-------+
4 rows in set (0.00 sec)

If we try to keep track of which sailor reserved or used which boat at which time, we don’t have enough information between the Sailors and Boats tables. We also need to record information of the form (sailor, boat, time). In other words, a sailor (row from the Sailors table), a boat (row from the Boats table), and the time define a reservation of a boat by a sailor. This reservation is a third record that the database must manage. Records are stored in tables, so we end up with a third table representing reservations. The table is named Reserves instead of Reservations, and we’ll see why that decision was made a little bit later.

mysql> SELECT * 
    -> FROM Reserves;
+-----+-----+------------+
| sid | bid | date       |
+-----+-----+------------+
|  22 | 101 | 2012-11-29 |
|  22 | 102 | 2012-11-29 |
|  22 | 103 | 2012-11-29 |
|  22 | 104 | 2012-11-29 |
|  31 | 102 | 2012-11-29 |
|  31 | 103 | 2012-11-29 |
|  31 | 104 | 2012-11-29 |
|  64 | 101 | 2012-11-29 |
|  64 | 102 | 2012-11-29 |
|  74 | 103 | 2012-11-29 |
+-----+-----+------------+
10 rows in set (0.00 sec)

Primary Keys

Looking at the Reserves table, we can always deduce which sailor reserved which boat because the reserve table uses ids (sid for Sailors and bid for Boats) to refer to each. Looking at the Sailors and Boats tables, the value of the id is unique in each one. In other words, there is only one boat that can have a bid of 101. There is only one sailor that can have a particular sid value. The sid value uniquely identifies sailors, and the bid value uniquely identifies boats. A column in a table whose value uniquely identifies each row is called the primary key of that table. One rule we will see relational databases follow is that every table must have a primary key. sid is the primary key of the Sailors table, and bid is the primary key of the Boats table.

If we look back at the Reserves table, we cannot identify a single column that uniquely identifies a reservation. The sid column won’t do it because the same sid value appears more than once, reflecting the notion that a Sailor isn’t limited to only one reservation. They can go sailing over and over. Likewise the bid column doesn’t uniquely identify a row because the same boat may be sailed many times. The day column will not work, either, because more than reservation per day is allowed. If one column can’t uniquely identify the row, then we must consider a combination of columns. We could consider a combination of two columns, and we could consider all three columns. For any combination, we could ask if the values in this combination are unique in the table, or can the same values appear in a different row? The possibilities are

(sid, bid) – The same sailor can reserve the same boat on different days, so this combination will not be unique.
(sid, day) – The same sailor can reserve different boats on the same day, so this combination will not be unique.
(bid, day) – The same boat can be reserved by different sailors on the same day, so this combination will not be unique.
(sid,bid,day) – There is a rule in the club that a sailor can use a boat at most once per day, so this combination is unique.

When multiple columns are the primary key for a table, it is called a composite primary key. Note that we had to know the “rules” for the sailing club to determine which combination might be the key. We could imagine a different version of the sailing club where a sailor could use the same boat more than once per day. In that case, we would have to store more information in the Reserves table and determine its primary key anew.

Schema Diagrams

Now that we know the primary keys for each table, we can draw a schema diagram that shows the tables and keys in the database. In a schema diagram, each tables nam’e is given and below that each column name in the table is listed. The primary key for each table is indicated by underling the column(s) included in the primary key. Every table must have a primary key, so every table in the diagram should have at least one column underlined. The arrows in the diagram indicate ways in which two different rows can be “glued together” to merge data. As described above, a reservation needs to be matched up with a sailor via the sid, and a reservations matches a boat via bid. The arrows from those two columns in Reserves to Sailors and Boats indicate that those rows may be matched up. Moreover, the direction of the arrow indicates in some sense that sid belongs to Sailors and bid belongs to Boats and the Reserves table is just copying those values to represent or reference a sailor or a boat.

Schema diagram for the Sailors Reserve Boats database

These three tables are the complete Sailors Reserve Boats database. We could write queries against either table individually as we’ve seen how to do already with the Sailors table. Some information needs are more complex and require information from more than one table, and now we need to learn how to transform these information needs into SQL.

What are the names of Sailors that have reserved boats?

The Sailors table is clearly needed for this query, for it is the only table with information about sailors names. A quick look at the data above should convince you that not every sailor has reserved a boat yet, so some sailors shouldn’t appear in the answer. The only way to tell whether a sailor has reserved a boat is to see if that sailor appears in a row of the reserves table. So we have to use two tables in this query. We can ask whether we need the third table, Boats, in this query, and we can hopefully convince ourselves we don’t required any information specific to the Boats table (name & color). We also don’t even care which boat (which bid) the sailor reserved. Just whether or not they reserved at least one boat.

The process the database uses to build a result table was described previously as:

Specify a single table (FROM clause)
Filter rows (WHERE clause)
Return columns (SELECT clause)

Now that we have more than one table, we can be more general and accurate about the process:

Build the table needed for the query results (FROM clause)
Filter rows (WHERE clause)
Return columns (SELECT clause)

The thought process for using more than one table in a query is that we use the separate, existing tables in the database to build a new table we need that will contain our query results. We may have to remove some of the rows from that new table, and we may have to remove columns from that table to generate our results. The main way to build a new table from two existing tables is to INNER JOIN those two tables. An INNER JOIN is an operation that starts with two tables and matches up rows between the two tables. A join condition is a test that is used to determine which rows can be matched up. The third table that is the result of the INNER JOIN will contain all the columns from both tables.

In this case, we need to INNER JOIN Sailors and Reserves, and our join condition needs to ensure that a row for a sailor from the Sailors table is matched up against a reservation made by that sailor and not a reservation made by a different sailor. The sid column of each of the two tables gives us values that we can test to ensure this criterium is met. We want the same sid value to appear from each table. Here is a row from each table that satisfies the join condition because the two sid’s are equal:

mysql> SELECT * 
    -> FROM Sailors
    -> WHERE sid = 74;
mysql> SELECT * FROM Sailors WHERE sid = 74;
+-----+---------+--------+------+
| sid | sname   | rating | age  |
+-----+---------+--------+------+
|  74 | Horatio |      9 |   35 |
+-----+---------+--------+------+
1 row in set (0.00 sec)


mysql> SELECT *
    -> FROM Reserves
    -> WHERE sid = 74;
+-----+-----+------------+
| sid | bid | date       |
+-----+-----+------------+
|  74 | 103 | 2012-11-29 |
+-----+-----+------------+
1 row in set (0.00 sec)

The two rows above can be matched up meaningfully to tell us that Horatio, age 35, reserved boat 103 on Nov. 29, 2011. If we wanted more information about this reservation, we would also match up the row from Reserves with the row for Boat 103 from the Boats table to find out the boat name and color.

Looking at the Sailors and Reserves tables above, we could create all the matches by hand, and we would come up with the table that results from an INNER JOIN of Sailors and Reserves with the same sid appearing in each table’s row. Note that the same Sailor row can appear more than once in this table, indicating that that sailor has made multiple reservations. Here is that data from the database if we do make all the matches between Sailor and Reserves rows. Note that the sid column appears twice in the results, once for each table that it belongs to. Note also that the total number of columns is the number in the Sailors table plus the number in the Reserves table. That is, all the possible columns are retained when gluing together or matching up rows in this way.

+-----+---------+--------+------+-----+-----+------------+
| sid | sname   | rating | age  | sid | bid | date       |
+-----+---------+--------+------+-----+-----+------------+
|  22 | Dustin  |      7 |   45 |  22 | 101 | 2012-11-29 |
|  22 | Dustin  |      7 |   45 |  22 | 102 | 2012-11-29 |
|  22 | Dustin  |      7 |   45 |  22 | 103 | 2012-11-29 |
|  22 | Dustin  |      7 |   45 |  22 | 104 | 2012-11-29 |
|  31 | Lubber  |      8 | 55.5 |  31 | 102 | 2012-11-29 |
|  31 | Lubber  |      8 | 55.5 |  31 | 103 | 2012-11-29 |
|  31 | Lubber  |      8 | 55.5 |  31 | 104 | 2012-11-29 |
|  64 | Horatio |      7 |   35 |  64 | 101 | 2012-11-29 |
|  64 | Horatio |      7 |   35 |  64 | 102 | 2012-11-29 |
|  74 | Horatio |      9 |   35 |  74 | 103 | 2012-11-29 |
+-----+---------+--------+------+-----+-----+------------+
10 rows in set (0.00 sec)

Going back to the original information need, we note that we have built a table containing Sailors with reservations. We have created a table with the rows we need, and we don’t need to filter the rows any more. The WHERE clause can be left out of our query. We only need to return names, though, so we need to make sure we only specify that column in the SELECT clause. The only new syntax we need to learn is the FROM clause where the INNER JOIN happens (where the new table is built to use in the query).

SELECT sname 
FROM Sailors INNER JOIN Reserves ON Sailors.sid = Reserves.sid

Because INNER JOIN requires two tables (it’s what we early have called a binary operator), we write one table name on the left side and one on the right side. After the right-hand table, we add the keyword ON and then follow that with the join condition. The join condition is a test that is TRUE or FALSE similar to what we have seen for the WHERE clause. Here we want the same sailor from each table, and since both column names are called sid, we have to say which table each sid comes from. We do that by putting the name of the table followed by a dot. We can always specify exactly which table a column comes from

SELECT Sailors.sname
FROM Sailors INNER JOIN Reserves ON Sailors.sid = Reserves.sid

Some SQL programmers always do this, and some never do. In I308, it’s up to you. It’s required to state which table a column is in when the column name is ambiguous, i.e. it could be in more than one table. When it’s unique/unambiguous like sname, it’s optional to specify the table it came from.

Typing long table names repeatedly in a query can get tedious, and longer queries are generally harder to read and think about than shorter one. It’s possible to rename a table in the FROM clause with an alias and use that alias to refer to the table every else. The way to create the alias is to place it right after the table name in FROM as shown in the first example below. It’s also common to use the AS keyword to indicate “here comes an alias”, but AS is optional. I tend to not use it in my own SQL; the choice is yours.

SELECT S.sname
FROM Sailors S INNER JOIN Reserves R ON S.sid = R.sid;

SELECT S.sname
FROM Sailors AS S INNER JOIN Reserves AS R ON S.sid = R.sid;

Referring back to our schema diagram above, we can notice that the join condition for Sailors INNER JOIN Reserves involves those two sid columns that have the arrow between them in the diagram. As a reminder the arrow indicates that Reserves.sid “references” (points to) Sailors.sid. Sailors.sid is the primary key, and copying that value lets us always point to a unique, specific sailor. Such a column that references a specific row in another table is called a foreign key. So Reserves.sid is a foreign key that references the primary key Sailors.sid. The two together are called a key pairing or a primary-foreign key pair. 99.9% of the time, INNER JOINs will take place on a key pairing. This knowledge is invaluable. If you know the key pairs and figure out which tables are needed in a query, it is easy to write INNER JOINs. If you neglect this concept, then every INNER JOIN will be a struggle because you won’t have the correct thought process for solving the mini-problem that each query presents.

Joining more than two tables

Information need: What are the sailor names and boat names for each reservation?

We need the Sailors table for sname. We need the Boats table for bname. At this point, it’s tempting to try FROM Sailors INNER JOIN Boats, but that won’t work because we don’t have any columns in common to use for the join condition. Said another way, there’s no arrow between Sailors and Boats in the schema diagram, so we don’t have a key pairing to use for the join. We have to use the Reserves table to match up sailors with their reservation, and we need to use the Reserves table to match up boats with their reservation. Sailors INNER JOIN Reserves involves the Reserves.sid/Sailors.sid key pairing. Boats INNER JOIN Reserves will involve a Reserves.bid/Boats.big key pairing. It’s tempting to think we’ll have to use the Reserves table twice, once for each join, but that’s not how it works. We’ve already seen that Sailors INNER JOIN Reserves will give us a table that has all the columns from Reserves, so it will have a bid column in it.

Sailors INNER JOIN Reserves is a table.
That table has a bid column in it.
INNER JOIN requires two tables. The two tables don’t have to be single tables in our database. They can be tables we have built in the query.
So we build a table using INNER JOIN and then INNER JOIN that one with Boats–(Sailors INNER JOIN Reserves) INNER JOIN Boats

The () in #4 above are meant to suggest the way we write arithmetic sometimes. 3 + 4 + 5 can be written as (3 + 4) + 5 to emphasize that adding three numbers consists of two steps: 1. add two numbers and 2. add the third number to the intermediate result. INNER JOIN is the same way. We can join two tables and then join the third table to the intermediate result. The syntax is

SELECT S.sname, B.bname
FROM Sailors S INNER JOIN Reserves R ON S.sid = R.sid 
   INNER JOIN Boats B ON R.bid = B.bid;

The original tables in the first INNER JOIN still keep their identity in the query, so we can still refer to R and S separately even though they are always combined as “S INNER JOIN R”; this is just a convenience that SQL gives us. To aid in readability, I put the second INNER JOIN on a new line and indented it, but this choice is only cosmetic. It also emphasizes that the first INNER JOIN builds one table, and then that newly-built table is joined with the third table in the query.

This section concludes by showing the full table resulting from joining all three tables together and keeping all the columns.

mysql> SELECT * 
    ->FROM Sailors S INNER JOIN Reserves R ON S.sid = R.sid 
    ->  INNER JOIN Boats B ON R.bid = B.bid;
+-----+---------+--------+------+-----+-----+------------+-----+-----------+-------+
| sid | sname   | rating | age  | sid | bid | date       | bid | bname     | color |
+-----+---------+--------+------+-----+-----+------------+-----+-----------+-------+
|  22 | Dustin  |      7 |   45 |  22 | 101 | 2012-11-29 | 101 | Interlake | blue  |
|  22 | Dustin  |      7 |   45 |  22 | 102 | 2012-11-29 | 102 | Interlake | red   |
|  22 | Dustin  |      7 |   45 |  22 | 103 | 2012-11-29 | 103 | Clipper   | green |
|  22 | Dustin  |      7 |   45 |  22 | 104 | 2012-11-29 | 104 | Marine    | red   |
|  31 | Lubber  |      8 | 55.5 |  31 | 102 | 2012-11-29 | 102 | Interlake | red   |
|  31 | Lubber  |      8 | 55.5 |  31 | 103 | 2012-11-29 | 103 | Clipper   | green |
|  31 | Lubber  |      8 | 55.5 |  31 | 104 | 2012-11-29 | 104 | Marine    | red   |
|  64 | Horatio |      7 |   35 |  64 | 101 | 2012-11-29 | 101 | Interlake | blue  |
|  64 | Horatio |      7 |   35 |  64 | 102 | 2012-11-29 | 102 | Interlake | red   |
|  74 | Horatio |      9 |   35 |  74 | 103 | 2012-11-29 | 103 | Clipper   | green |
+-----+---------+--------+------+-----+-----+------------+-----+-----------+-------+
10 rows in set (0.00 sec)

Exercises

What colors boats have been reserved?

SELECT color
FROM Boats B INNER JOIN Reserves R ON B.bid = R.bid

What are the names of sailors that have reserved red boats?

SELECT sname
FROM Sailors S INNER JOIN Reserves R ON R.sid = S.sid INNER JOIN Boats B ON R.bid = B.bid
WHERE color = 'red'

What are the id’s of Boats that have been reserved?

Be careful with this one! You don’t need the Boats table here because the bid is already in the Reserves table. So this is a single-table query instead of an INNER JOIN one.

SELECT bid
FROM Reserves