Now that we have seen the relational data model and know about relational database schemas we need to learn a good process for creating them. This database design process begins with understanding the information needs in order to determine the records required and the relationships between them. If we do that part well, then we can follow a set of steps that will allow us to have a relational database schema.

  1. Research the information needs
  2. Document the information needs in a specification
  3. Create an Entity Relationship Diagram (ERD) from the spec.
  4. Convert the ERD into a relational database schema
  5. Normalize the schema

This process of designing a database schema is often called data modeling, and some version of it usually happens whenever an organization needs to create a significant new database. If you were a data modeler carrying out this process for a living, you may begin by interviewing the people in the organization in order to understand their information needs or what kind of information they need to keep track of or are already keeping track of. Once that research has been carried out, it will often be condensed into a set of requirements or a specification. In I308, the specification will be our starting point for data modeling; it will be given to us.

Once the spec. exists, we create a drawing either by hand or on the computer called an Entity Relationship Diagram (ERD). There is more than one standard for ERDs that data modelers use. We’ll learn one called Chen Notation which is one of the older forms introduced by Peter Chen in a 1976 paper. Other commonly used ERD notations are Crows foot notation and the Unified Modeling Language (UML). Many organizations will prefer one style or standard over another and require their employees follow it. In I308, our required style is Chen notation; so little to no credit will be given when grading for using a different notation!

We’ve seen entity as a vocabulary word before. It essentially means the same thing as a record. An entity is a collection of related attributes. Said another way an entity is some thing that we can see as separate from other things (other entities). In a class, we may recognize Teacher and Student entities because they are distinct concepts. Similarly, Player and Coach entities may be recognized in sports, Cat and Mouse entities in cartoon violence, and so on. We’ll see defining an entity is somewhat subjective. We try to describe data that we store in terms of distinct groups of information–distinct entities.

An entity set is a set of entities; a set of records of the same type. Relationships may exist between entities. A relationship is a connection between two entities. Drawing an ERD amounts to identifying the entities and relationships in a problem as stated by spec. It is equivalent to identify the records and the relationships between them since entities are synonymous with records.

The best sentences are essentially noun-verb-noun (“The cat ate a Buick.”) The best ERDs can be read like a sentence (“Sailors reserve boats.”). In that sentence the nouns, sailors and boats, are entities while the verb reserve is a connection between the entities. The verb is a relationship. A spec. for the sailing club may be

  • Sailors are identified by an id and have a name, age, and rating.
  • Boats have a unique id along with a name and a color.
  • Sailors reserve boats on a particular date.

Sometimes the spec won’t be a bulleted list. It just depends. The same distillation of an information need could appear as

Sailors in the club have an id number, but their name, age, and rating is also recorded. Boats have an id number, too, while the name and color of the boat is tracked. Sailors reserve boats on a particular date. A sailor can reserve the same boat more than once, and a boat can be reserved by many different sailors. Sailors can also reserve many different boats.

These two versions are equivalent, although the second is more verbose. In both of them we see a notion that a sailor is a collection of things–id number, name, age, and rating. Similarly a boat is a collection of id number, name, and color. Each of these things in the collection is a noun, and the name of the collection itself, sailor or boat, is a noun. Out of all these two groups of nouns

  • sailor, id, name, age, rating
  • boat, id, name, color

Sailor and boat are different from the other nouns in their group because they describe what the collection represent. They’re all related because they’re in a group together, but sailor and boat are bigger because they represent the rest of the nouns. Sailor and boat are the entities and the other are attributes. Id, name, age, and rating are attributes of the sailor entity. Id, name, and color are attributes of the boat entity. Once we’ve identified an entity and its attributes we can draw it in Chen notation. The rules for drawing an entity is that its name goes inside a rectangle while the names of attributes go inside ovals connected to the rectangle. If the entity has a key, an attribute or group of attributes whose value is unique in the entity set, we underline the attribute name(s).

The two entities drawn using Chen notation

Note that there is no notion of a primary key in ERDs. ERDs are different, separate concepts than a relational database. Relational database tables have primary keys. Entities just have keys. Also note that we draw a single entity in the diagram, but it represents a set of records. The sailor rectangle represents all the sailors that we would need to keep track of. It’s a similar abstraction to a schema diagram where we just include the table name and column headings without filling in the rows in the instance. The ERD is general enough to represent any instance of the actual data.

Now that we have Sailor and Boat entities, we’re not finished. The two entities are connected because a sailor can reserve a boat. We draw this relationship connecting them in a diamond, and we give the relationship a name so that the ERD reads like a sentence either from left to right or top to bottom. In this case we have a nice phrase “sailor reserves boat”. It’s also a common convention to read it in a passive voice when reading from right to left or bottom to top: “boat is reserved by sailor”.

Adding the relationship completes the sentence.

I followed a convention of using singular nouns for the entity names. If I use one singular noun, I need to make sure the rest are singular, too, so that the sentences will read correctly. Another convention is to use plural nouns, and then we would end up with “sailors reserve boats”. Note that all nouns are plural and that the verb/relationship name has changed accordingly so that it reads well. Either choice is good as long as its used consistently; it is not considered a good practice to create ERDs with mixtures that don’t read well.

The ERD is not complete, because we haven’t incorporated the fact that we also track the date on which a sailor reserves a boat. It’s a noun, but not a collection of other related nouns, so we expect it to be an attribute. Where does it belong? If we add it to the Sailor entity, it doesn’t make sense since having one date associated with one sailor seems to limit that sailor to just one reservation. Likewise if we add it to Boat, it seems to limit a boat to just one reservation. The solution is that it belongs to the connection between the two, because it’s the sailor and a boat together (the relationship) that happens on a particular date. The visual syntax for indicating this is to once again draw the attribute in an oval but make sure to connect the oval to the diamond signifying the relationship.

Adding the attribute to the relationship completes this ERD.

Top-level ERDs

As we keep working with ERDs, the specs will get more complicated and our diagrams will get “busier” with many more entities and relationships covering our paper. As these numbers increase, it becomes more difficult to fit them all on the page in a readable, comprehensible way. It’s very common to have to sketch a draft to work out the placement of all the items and then make a final copy. It’s also common to have to erase part of one and re-do it in order to make it all work well.

As the number of items increases, fitting all the attributes in becomes a big challenge. There may not be room on the paper, or all the little ovals just may make the diagram unreadable. Attributes are important because they describe the actual data being held, but focusing on them in diagrams sometimes covers up something else equally important: the relationships between records. A top-level ERD is a form of ERD that emphasizes records and relationships but not attributes. They’re easy to draw; just leave out all the attributes and only include entities and relationships. For the sailor reserves boat ERD a top-level entity relationship diagram would look like

A top-level ERD
A top-level ERD is one without attributes

Participation Constraints

The visual syntax of Chen notation we’ve seen so far lets us draw ERDs showing the records, their attributes, and the relationships between them. It does not let us fully describe the constraints on those relationships. When we looked at our first logical data models, the flat, hierarchical, and network models, we characterized them based on the kind of relationship between records. The flat model had no relationship between records. The hierarchical model had the parent-child relationship that we called “one-to-many” (a record has at most one parent; a parent can have many children). The network model had the most general relationships that we called “many-to-many”. A record can be related to many other records. Remember that in that context we described the word “many” as simply meaning “more than one”.

In ERDs, the relationships between records can be characterized by their constraints in the same way. They can be many-to-one or many-to-many, which means we can describe hierarchies and networks or records using ERDs. When characterizing the relationships between records, it always helps to have a picture in mind so you can visualize the relationship. Each rectangle in the diagram represents and entity set, a set of records. So we should visualize each Sailor rectangle as standing in for a group of sailors and each Boat rectangle as representing more than one boat.

Relationships are more complicated than entities because they are more abstract. An entity is pretty easy to visualize since it’s just a collection of values where we’ve labeled each value with a name. A relationship is different because it’s primarily a connection between two entities. it may have values associated with it, too, as we’ve seen, but those are optional. The connection is the main part about a relationship.

Once we have these things in mind, we can then visualize the entity sets and the connections between the records in each one. Sailor and Boat have a relationship (Reserves) between them. One sailor reserves one boat at a particular time. That’s just a single relationship between two entities, but the diamond in the ERD represents a relationship set, a group of connections between individual members of the entity set. So the Sailor Reserves Boat ERD represents something like all these records with all these connections between them. Incidentally, a relationship between two different entities like this is called a binary relationship. In class, we’ll also see that there can be unary relationships, ternary relationships, and beyond.

ERDs represent entity sets and relationship sets. In this ERD, we can see that a sailor can reserve many (more than one boat), and a boat can be reserved many times (by more than one sailor). This relationship is many-to-many.

Relationships are described based on how many times an entity participates in the relationship. Dustin participates twice above while Lubber doesn’t participate. On the other side of the relationship, boat 101 doesn’t participate although 102 participates 3 times and 103 twice.

The most exact way of describing how many times is to provide upper and lower bounds for an entity’s participation. If a student was required to take 4 classes every semester and a class could have anywhere from 0 to 20 students, we would describe these participation limits as a lower bound of 4 for Students (a Student would be connected to at least 4 Classes) and a lower bound of 0 for Classes (a Class may not be connected to any students). The upper bounds would be 4 for Students (a student would be connected to at most 4 Classes) and 20 for Classes (a Class could be connected to at most 20 Students).

We’ll see that in Chen notation we are not this precise in how we describe the upper and lower bounds. The upper limit on how many times an entity participates is called cardinality while the lower one is called ordinality. Each is described in turn below.

Cardinality

Now that we have this picture, we can start visualizing what the constraints on the relationships look like. For each entity participating in the relationship, we need to know the maximum number of times it participates (its upper bound). Remember an entity in the entity set “participates” in a relationship every time it’s connected to another entity.

  • What’s the maximum number of times can a sailor reserve a boat?
  • What’s the maximum number of times a boat be reserved?

Note that the answers to these questions are independent of each other. We can imagine a sailing club where a sailor can reserve more than one boat but never the same one twice. Given this constraint, we can imagine that boats may only be reserved by one sailor (single-use boat) or that boats can be reserved by different sailors (multi-use boat). All of this is to emphasize that for a particular relationship between two entities, there are many different constraints that relationship may have to obey depending on the specific situation or information need being recorded in the data. For the two entities participating in the relationship, we have to ask and answer these questions for each entity; we can’t just focus on one of the entities.

As we saw earlier, all of these questions are about upper limits. The upper bound on how many times an entity can participate in a relationship is called its cardinality, and the most precise answer might be a number like the max. of 20 students in a class above. In Chen notation, though, a less detailed view of cardinality is taken. Instead of any number, there are only two possible answers: “many” or “one”. Many means “more than one” and one means “one”. So Chen notation is not completely precise with regards to cardinality, but at least there’s only two possibilities per entity to worry about. So the full spectrum of possibilities for a relationship between two entities in terms of these upper bounds is

LEFT Entity (Sailor)RIGHT ENTITY (Boat)Cardinality
ManyManyMany to many (M to N)
ManyOneMany to one (M to 1)
OneManyOne to many (1 to M)
OneOneOne to one (1 to 1)
The 4 kinds of Cardinality in I308’s Chen notation. Many to one and one to many represent the same concept since we can always re-draw an ERD by changing which side of the diamond a particular rectangle is on.

We will soon decorate our ERDs with symbols indicating the cardinality of a relationship. When we read the ERDs as sentences, we will use the verbose version of the cardinality column above, adding the words “many” or “one” to the sentence as appropriate. Since there are two entities with each having an upper bound of participation, there are two sentences needed to fully describe the cardinality of the relationship. These two sentences describe a many-to-many relationship.

  • A sailor reserves many boats.
  • A boat may be reserved by many sailors.

When we decorate the ERD, we use the short form from the cardinality column and decorate our ERD with upper case letters M, N, P, Q, … or 1’s. Note that O (big o) was left out so that it’s not confused with the number 0. We put the letter or 1 for an entity on the opposite side of the relationship’s diamond. This is often confusing to student’s about Chen notation, but it’s the rule. The best way to remember this rule in my experience is that it goes in the same spot it does in the sentence. This ERD shows that “A sailor reserves many boats.”

While this one shows “A boat may be reserved by many sailors”. Remember the ERD in this case is being read from right to left.

Once we have worked out where each of the symbols goes, we can draw the fully-decorated diagram:

One to many and many to one relationships

The reservations in the sailing club this database keeps track of represent a many-to-many relationship as drawn above in the ERD. Let’s consider a different spec.

NBA Players have a name, uniform number, and salary. NBA Teams have a unique name, location, and mascot. Teams consist of 12 players while a player must play on only one team.

Hopefully it’s getting easier to spot the two entities and their attributes. Note that the NBA Players entity doesn’t have a key in its description but the Teams entity does. This is not a many-to-many relationship. Its rule is many (12) players play on only one team, and a player can’t play on more then one team. It’s many-to-one or one-to-many depending on which entity we read first (many players one team or one team many players). To indicate this in an ERD, we have to decorate the relationship diamond with a 1 and an M (or another upper-case letter). We put the decorations just as we were reading the sentence, NBA Players (entity/rectangle) play on (diamond/relationsip) at most one (1 decoration) NBA Team (entity/rectangle). This is the ERD:

NBA Players Play on at most 1 NBA Team

Reading the other direction, we have NBA Teams (entity/rectangle) are played on (diamond/relationsip) by many (M/decoration) NBA players (entity/rectangle), leading to this ERD:

Team have (are played on by) many (M) NBA Players.

Combining the two yields

The full ERD showing the many-to-one/one-to-many relationship.

One to one relationships

One to one relationships are less complicated than many-to-one because they’re simpler. Each participating entity gets to match up with only one other. We simply decorate each side of the relationship with ones.

NBA Players have a name, uniform number, and salary. NBA Teams have a unique name, location, and mascot. Teams consist of 12 players while a player must play on only one team. Head Coaches have a name and age. Each team has one head coach, and a coach coaches just one team.

We end up with the many-to-one ERD from above and add on the 1-to-1 relationship between Teams and Coaches.

ERD showing a 1-to-1 relationship

Ordinality

Cardinality is about the upper limit, and ordinality is about the lower one. With Students and Classes we saw example lower bounds of 4 (Students had to take at least four classes) and 0 (a Class may not participate if no Students are signed up for it). In Chen notation, ordinality is describe somewhat imprecisely just like cardinality. There are only two possibilities: zero or one. A lower bound of zero means and entity doesn’t have to participate; it’s optional for a member of the entity set to participate. Since not every entity participates, this is often called partial participation. A lower bound of one means every entity will participate at least once, so participation is mandatory. This situation is often called total participation.

Lower BoundDescriptionNotation
0Optional
Partial Participation
Light line
Single line
1Mandatory
Total Participationg
Strong/Bold Line
Double line
The two ordinality possibilities in Chen notation.

Instead of writing 0’s and 1’s in the diagram as with cardinality, a different way of decorating the diagram is used. A single or lightweight line is used to indicate partial participation and a bold or double line is used to indicate partial participation. The two different choices for partial participation are useful because it’s not always easy to indicate a bold line in a hand-drawn ERD. When drawing by hand, prefer double lines to indicate total participation. If you’re creating the ERD on the computer, then either a bold or a double line can be easily created and easily recognized by the viewer, so use either one.

Students have a name, unique id, and year. Courses have a name and unique number. Every student must take at least 3 courses. Courses may have between 0 and 20 students.

The phrase “every student” tells us that student participation in the relationship is mandatory. The possible enrollment of 0 tells us that course participation is optional. The two upper bounds (at least 3 and 20) tells us both cardinalities are many. That leads to the ERD below. Note that I’m drawing top-level ERDs (no attributes) to focus on the new aspects of Chen notation. Always include the attributes in I308 unless a top-level ERD is specifically indicated!

The final notation issue to note with ordinality is that the line closest to the participating entity indicates that entity’s ordinality in the relationship. The cardinality for that entity is on the far side of the diamond, but the ordinality is on the near side. This is confusing for many people about Chen notation; you have to keep it straight so you don’t mess up the diagrams.

What if the spec. doesn’t indicate cardinality or ordinality?

Ordinality and/or cardinality may not always be indicated in a spec. Perhaps the participation limits haven’t been worked out yet, or perhaps they just aren’t considered important yet. It’s possible to see a spec describe a relationship like “Students take courses. A student can take up to 5 courses and must take at least 4. A course will contain between 0 and 20 students.” In this case we can spot the cardinality for each entity and the ordinality for each.

With a spec like “Students take courses. A student can take up to 5 courses and must take at least 4” we can determine cardinality and ordinality for Students but not for courses. We must make an assumption. The rule of thumb for making these assumptions is to choose the most general (least constraining) possibility in the absence of information. For ordinality the most general assumption is partial participation because it does not exclude every entity from participating. For cardinality, many is more general than one because an upper bound of many doesn’t exclude each entity from only participating at most one time.

With the least constrained spec “Students take courses” we have to assume cardinality and ordinality for both entities. We choose a cardinality of many and an ordinality of partial participation for both.

Finally, when the information is not known, some practitioners prefer to not decorate the diagram. The single line/partial participation is the default, so an ordinality of 0 is still assumed. But they only decorate the diagram when cardinality is known, and they leave it blank when it is not known.

Weak Entities

All the entities we’ve seen so far are called strong entities. A strong entity is one that “owns” its key–the attribute or attributes making up its key belong to it. A strong entity is indicated in an ERD the way we’ve always drawn them; it goes in a simple rectangle. There is another kind of entity called a weak entity. A weak entity is one that does not contain all of its key. It depends on one or more attributes in another entity for at least part of its key. Here’s an example.

Employees have a name, unique id, and a salary. Dependents have a name and age and are identified by their name along with an employee’s id.

This employee-dependent example is a typical one in the database world. Companies only need to record information about dependents when they cover them as part of their employee’s insurance. The dependents are not the company’s primary focus. A dependent’s name alone is not necessarily unique so another attribute needs to be part of the key, too. In the case of an entity that exists only because of another entity, it’s common to get part of the key from that other entity. In this case, every dependent is linked to one employee. This is an important statement about weak entities. Since keys are unique and every weak entity in a weak entity set needs one, then every dependent would participate (total participation), and each would be related to at most one employee in this way. So although an employee may have many dependents, each and every dependent is connected to the one employee that it is related to.

The Chen notation for a weak entity lets us capture all these aspects. We draw the binary relationship as we usually do, and then we indicated the weak entity by double-boxing or using a heavy line for its rectangle. When drawing the ERD by hand, always prefer to double box. This identifies the weak entity, but not we need to know where to find the rest of its key. Since an entity may participate in many relationships in general, we won’t know which relationship leads us to the rest of the key. We need to double-box (when drawing) or use a heavy line to indicate the diamond in the relationship, too. Then since every weak entity must participate in the relationship, we also need a double or heavy line from the weak entity to the special relationship. Finally, since every weak entity participates at most once, we add the 1 on the opposite side of the relationship from the weak entity.

Weak entities are not as common as strong entities, but they do arise in certain situations. We have to recognize the tell-tale description in the spec. when part of a key already belongs to another entity, and then we have to indicate all the facets correctly in Chen notion: highlight the weak entity, highlight the relationship, indicate its mandatory, and indicate the upper bound of 1. Note that we didn’t indicate any cardinality for the upper bound of how many dependents an employee may have or the ordinality of whether an employee must have dependents. The spec didn’t say, so those are left blank. In general, it’s possible to have the strong entity in the relationship participate totally or partially with an upper bound of many or 1. All the possibilities can be realized in certain situations.