The main focus of I308 is understanding the relational data model so we can create and query the most common kind of databases, relational databases. We’ve already had an introduction to them by focusing on tables, primary keys, single-table queries, foreign keys, and inner joins. Given all we’ve seen there’s actually not too much more to cover. Mainly we need a couple of more precise definitions and a little motivation for why the relational model exists (i.e. what problems it solves).

The vocabulary list below indicates to us the big-picture view of a relational database. It’s one where each table has a primary key. The records that each table stores may be related to other records in the database, and if that is the case, there will be foreign keys, too.

TermDefinition
Primary KeyOne or more columns in a table that can be used to uniquely identify a row. In other words, the value(s) of the primary key column(s) cannot be duplicated in that table.
Foreign KeyOne or more columns in a table that can be used to refer to another row because those values in the other table uniquely identify a row. The foreign key usually references the primary key of a table, but there are some exceptions.
Composite KeyA key that contains more than one column.
Candidate KeyOne or more columns that could be a primary key for a table. Out of all the possible candidate keys for a table, one of them is picked and called the primary key.
RelationA table where each row is unique. That is, a relation is a table that has a primary key.
Relational Data ModelA logical data model where the database is a set of relations. Each relation stores one kind of record. Typically there will be foreign keys indicating relationship between the records.
Some vocabulary describing the relational data model in terms of its tables and keys

Now here’s another set of vocab. words that gets a little more specific about the make-up of each table.

TermDefinition
DomainThe allowed values for a column in a table. For example, they must be whole numbers (integers) or 10-character words.
ConstraintAdditional rules the data must obey. A primary key constraint indicates the primary key columns must be unique in a relation. A foreign key constraint indicates that a row referenced by a foreign key must exist. Other constraints could limit the domain. For example, a column storing grades may only contain whole numbers between 0 and 100, inclusive.
AttributeA column in a relation. it has a name, a domain, and possibly constraints.
TupleA row in a table. A collection of attributes for a table. It’s pronounced to rhyme with couple and not with a u like in Lego Duplo.
SchemaA description of the structure of a database. For a relational database, the schema includes all the table names and attribute names along with primary and foreign keys. It may or may not include domains and constraints depending on how specific it is.
InstanceA particular example of a database schema that has data in it. The same schema can have many possible different instances. We’ve been using one instance of the Sailors database. We could make another with all the sailors and boats but no reservations yet. 2 of the 3 tables would be the same, but the Reserves table would be empty (until reservations have been made).
Additional relational data model vocabulary

Tuples are technically the correct name for the records stored in a relation in a database. Relation is technically the correct name for a table in a relational database. Attribute is technically correct for columns, I guess. Lots of database folks use these terms a lot, so you have to know them and understand what is meant when I say them. The same people who use these terms a lot will also use “table”, “column”, and “row”, but when they do, they mean tables with unique rows (a relation). Informatics students are also likely to encounter tuple in other situations, like in programming languages. Moreover, there are formal theory of databases where tuple is used as a technical term, and relations are as well. These frameworks are called relational algebra and relational calculus, and you may run across them in our SQL textbooks.

The first three terms relate to columns. Every column in a database needs to have a domain or type associated with it. This concept relates to the logical and physical views of data, too. As informatics or HIM students, we thing at the logical level where a particular column is a whole number or a decimal number, for instance. The DBMS operates at the physical level where different types are stored differently by the computer. Floating point numbers are stored differently than integer numbers in a computer’s memory or in the CPU, and text or string values may be stored in any one of several encodings (Unicode, ASCII, …) that describe how the bits in the computer map to letters, numbers, or symbols in a character set.

Schema Diagrams

It’s important to be able to communicate the structure of a relational database to someone else. You may have a database design for a new app that other people running that app need to create; or you may be publishing a new open source medical records systems and need to let people know your data model. In either case, a typical way to communicate the structure is by sharing the schema with them. There are two common ways to share schemas. One is informal and is called a schema diagram. The other is more formal and involves sharing SQL. If you share the SQL, someone can execute it in the DBMS and then have a working version of the database. If you share just the schema diagram, the would have to write the SQL statements; they may end up with an equivalent but not necessarily identical database depending on how much information you put in the schema diagram.

The informal version of schema diagrams we’ll use in I308 shows the names of tables and each column in a table. It also indicates the primary of each table by underlining the column name(s). Foreign keys are also indicated by drawing arrows from the foreign key to the primary key. Below is the schema diagram for the sailorsReserveBoats database.

The rules for our “house style” of schema diagram in I308 are:

  1. Show the name of the table
  2. Show the names of the table’s columns in a box below the table name
  3. Underline each table’s primary key column(s)
  4. Indicate foreign key column(s) by drawing an arrow from the foreign key to the column(s) it references. The foreign key is the tail of the arrow; the column(s) referenced are the tip of the arrow.

Schemas vs Instances

For a particular schema, there can be many different instances of that schema, because not all databases with the same tables and keys need to have the same values in all their rows. Here are two different possible instances of the Sailors table. They have the same schema (table name, column headings, domain for each column, constraints), but they differ in the records (each row is a record for a Sailor) they hold.

sidsnameratingage
22Dustin745.0
29Brutus133.0
31Lubber855.5
One instance of the Sailors table
sidsnameratingage
1Chris0115.0
2Popeye133.0
3Captain Nemo855.5
And another instance!

Reasons for the relational model

To justify why the relational model is the way it is, let’s imagine a database where we put all the information into one big table. We’ll end up with something that looks like

bidsidsnameratingagedatebnamecolor
10122Dustin745.0Nov. 29, 2012Interlakeblue
10222Dustin745.0Nov. 29, 2012Interlakered
10322Dustin745.0Nov. 29, 2012Clippergreen
10422Dustin745.0Nov. 29, 2012Marinered
10231Lubber855.5Nov. 29, 2012Interlakered
10331Lubber855.5Nov. 29, 2012Clippergreen
10431Lubber855.5Nov. 29, 2012Marinered
10164Horatio735.0Nov. 29, 2012Interlakeblue
10264Horatio735.0Nov. 29, 2012Interlakered
10374Horatio935.0Nov. 29, 2012Clippergreen
bidsidsnameratingagedatebnamecolor
The Sailors Reserve Boats supertable

Such a table is called a supertable. We could make it ourselves from the sailorsReservesBoats database by joining all three tables together, so it’s possible to get this information from the relational model. But here with the supertable, we have all the information right in front of us. Is that better or worse than having it split up into separate tables with foreign keys that lets us match the rows up back together?

It turns out it’s worse because it leads to redundancy. The same information appears in more than one place in the supertable. For instance every time Lubber reserves a boat, we see the full set of sailor information (sid, sname, age, rating) for Lubber. It’s stored redundantly, appearing 3 times when maybe 1 would suffice. Likewise whenever the same boat like Clipper is reserved more than once, all the boat information (bid, bname, color) appears over and over. At a minimum it’s inefficient to use this extra storage space to duplicate data. Real database tables often have thousands, millions, or billions of rows, taking up gigabytes or more of storage. Using unnecessary space can be costly, slow, and limit the amount of data that can be stored.

Some more subtle problems of redundancy were grappled with by Codd, Boyce, and others. They considered the consequences of unintended situations that can arise if redundancy is not carefully managed. These situations are called anomalies and classified them as

  • Insertion anomaly – problems when adding data
  • Update anomaly – problems when changing data
  • Deletion anomaly – problems when removing data

To understand these anomalies, imagine we want to add a new sailor to the supertable. Since a new sailor wouldn’t have yet reserved any boats, we would have to leave much of the row empty. This is actually allowed, but it seems odd to store a sailor with no reservations in a table where each row is a reservation. The same situation would arise if we were to add a new boat to the database. Update anomalies are more serious. If we need to change Dustin’s record to update his rating, we would have to change it in multiple places. It’s very likely if humans were updating the data that mistakes would be made, and then the different Dustin rows may show different ratings. Which row would then represent the truth? A deletion anomaly could arise in a similar way if we delete the reservation for the Horatio sailor with a rating of 9. That sailor appears in only one row, so if we delete that row, then the sailor vanishes, too! Deleting a reservation should not also delete the sailor.

The relational model addresses these anomalies because there is only one row for each sailor, boat, or reservation. Deleting a reservation simply involves removing a row from the Reserves table. Adding a sailor or boat simply involves adding a row to the single table without having to worry about associated reservations. Changing a sailor or boat’s values simply involves editing the one row in the table corresponding to the record that needs to be changed. The relational model solves the issues of redundancy by decomposing the supertable into separate relations. The decomposition is called a lossless decomposition because foreign key constraints allow us to rebuild the supertable whenever we need to.

We will address these anomalies again in more detail a little bit later in the course , for the relational model does not completely eliminate them. We will discuss a topic called schema refinement and learn about normal forms our databases can adhere to that guarantee we have protected the data against these anomalies.

The SQL DDL

The SQL DDL is the SQL Data Definition Language. It’s purpose is to define the schema or to change the schema. The main statement, and the only one emphasized in I308, is the CREATE TABLE statement. The basic idea behind the CREATE TABLE statement is that you immediately provide the name of the table being created. After the name of the table, the columns, their domains, and any constraints are defined within a set of (). In programming, any ( must have a matching ), so be sure to not forget the closing ). In general, each piece of information in the () is separated by commas.

CREATE TABLE Sailors (
  sid INT PRIMARY KEY,
  sname CHAR(20),
  age FLOAT,
  rating FLOAT
);

Note that the last column, rating, does not have a trailing comma. “Separated by commas” means commas go in between elements but not after the last one. Note that each column is defined by providing its name followed by its domain. The ALL CAPS SQL convention helps in this case because you can easily tell which parts are SQL and which parts are names we provided for the schema.

Single-column PRIMARY KEYs are “special” in the we can just add PRIMARY KEY after the domain to tell the DBMS that this column is the PRIMARY KEY of the table. So for the Boats table we can do something similar

CREATE TABLE Boats (
  bid INT PRIMARY KEY,
  bname CHAR(20),
  color CHAR(20)
);

The textual data in these two tables are declared as CHAR (short for character) types. CHAR types are different from the numerical types because we have to specify their width, the maximum number of character they are allowed to hold. That’s what the number in () indicates; the width is not optional. There is no simple color CHAR line that would work. We have to add the () with a number inside.

Here is a summary of the possible domains we’ll deal with in I308. There are other ones used in databases, but once you’ve gotten the hang of a few of them, it’s easy to learn the rest. We just focus on a simple set of them here to learn how to think about databases instead of simply memorizing the names of types or domains.

Domain or Type NameNotes
INTWhole numbers
FLOATDecimal numbers
CHAR(x)Words. Providing a whole number width x is mandatory.
VARCHAR(x)Like CHAR(x) except stored differently
DATE
See here for a list of MySQL data types
The I308 list of useful domains or data types in SQL

The Reserves table is a different matter for two reasons. It contains foreign keys, so we must declare those constraints. It also has a composite primary key, so we can’t just put “PRIMARY KEY” after each column. We’ll have to declare that constraint separately. The general syntax for doing so is PRIMARY KEY (column1, column2, …). So we can start off with

CREATE TABLE Reserves (
  sid INT,
  bid INT,
  day DATE,
  PRIMARY KEY (sid, bid, day)
);

This gets us the composite primary key, but now we must also add the two foreign key constraints. The general syntax for this is FOREIGN KEY (column1, column2, …) REFERENCES table(column1,column2, ….). So remembering that a foreign key means columns in a row in THIS table refer to other rows, we first provide the column(s) name(s) in this table (the Reserves table here), and then we provide the name of the other table that gets referenced followed by its column(s). Or in terms of the schema diagram, the first list of columns is the tail of the arrow and the second list is the tip of the arrow.

CREATE TABLE Reserves (
  sid INT,
  bid INT,
  day DATE,
  PRIMARY KEY (sid, bid, day),
  FOREIGN KEY (sid) REFERENCES Sailors(sid),
  FOREIGN KEY (bid) REFERENCES Boats(bid)
);

Getting this FOREIGN KEY syntax down is important. Both sets of () are required, and in the () put a comma-separated list of columns for a composite key or a single column name.

Referential Integrity

If we want to create an instance of the sailors reserves boats database from scratch, we will need to write three create table statements. If we use the last one above for the Reserves table first, we’ll see this error message:

mysql> CREATE TABLE Reserves (
    ->   sid INT,
    ->   bid INT,
    ->   day DATE,
    ->   PRIMARY KEY (sid, bid, day),
    ->   FOREIGN KEY (sid) REFERENCES Sailors(sid),
    ->   FOREIGN KEY (bid) REFERENCES Boats(bid)
    -> );
ERROR 1215 (HY000): Cannot add foreign key constraint

The foreign keys cannot be added, because the Sailors and Boats tables do not exist. The foreign keys need those tables to exist so that each table can have an id column that they can reference. The idea that something being referenced (pointed to) in a database must exist is called referential integrity. In order to create the Reserves table, the Sailors and Boats tables must exist. Because those two tables don’t reference any other tables (they do not have any foreign keys), we can create them in any order. To obey referential integrity, we must create the Reserves table third when making a sailorsReservesBoats database.