Clearing a Path through the 3NF Join Jungle

Rob Gravelle explores the pros and cons of normalization with a particular focus on the ramifications of third normal form normalization on data extraction by SELECT query.

Extracting data from a normalized database structure requires special care. This holds especially true where tables share a many-to-many relationship. In this article, we’ll be looking at the pros and cons of normalization with a particular focus on the ramifications of third normal form normalization on data extraction by SELECT query.

A Crash Course on Normalization

Among the most important tenets of database design, as influenced by the work of E.F. Codd, is that a database should at least fulfill the conditions of third normal form (3NF).

EF Codd
E. F. Codd

Normalization is the process of applying a series of rules to ensure that your database achieves optimal structure. Each normal form is a progression of these rules as successive normal forms achieve a better database design than the previous ones did. Although there are several levels of normal forms, it is generally sufficient to apply only the first three levels of normal forms. They are described in the following paragraphs.

 

First Normal Form

To achieve first normal form, all columns in a table must be atomic. This means, for example, that you cannot store first name and last name in the same field. The reason for this rule is that data becomes very difficult to manipulate and retrieve if multiple values are stored in a single field.

Another requirement for first normal form is that the table must not contain repeating values. An example of repeating values is a scenario in which Item1, Quantity1, Item2, Quantity2, Item3, and Quantity3 fields are all found within an Orders table.

Second Normal Form

To achieve second normal form, all non-key columns must be fully dependent on the primary key. In other words, each table must store data about only one subject.

Third Normal Form

To attain third normal form, a table must meet all the requirements for first and second normal form, and all non-key columns must be mutually independent. This means that you must eliminate any calculations, and you must break out data into lookup tables.

The Good, the Bad, and the Ugly

First, the good. Normalization does provide many benefits:

  • It provides indexing in the form of clustered indexes.
    In most cases, the creation of a clustered index will speed up data access and may increase insert, update, and delete performance.
  • Minimizes modification anomalies.
    Modification anomalies arise when data is inserted, updated, or deleted, and information is lost in unexpected ways.
  • Reduces table/row size.
    By removing duplicate data, we conserve disk space and increase the amount of row space available for other fields.
  • Enforces referential integrity.
    Referential integrity problems usually manifest themselves as an inability to extract important data or relate information from one table with data in another table because there is nothing on which to join the two tables.

And now for the bad and the ugly:

  • The more optimized a database is, the more effort and time it takes to retrieve data.
    You see, database normalization is aimed at storing data in the most optimal manner – preserving referential integrity, as well as removing data redundancy, thus making inserts, updates and deletes as efficient as possible. But this is just looking at one part of the picture. The tradeoff is this:

        Optimizing a database for storage automatically makes it unoptimized for retrieval.

  • Numerous and complex table joins.
    Database normalization comes with a price. A drawback to having a highly normalization database structure is that you may need a large number of joins to pull back the records the application needs to function. The fact that the database is usually an application performance bottleneck only aggravates the problem. This is especially apparent in environments in which many concurrent users access the database.

A Practical Example

The following UML diagram, which appears on Jeff Atwood’s Coding Horror site, represents a user profile on a typical social networking site:

UML diagram

This is the kind of information you see on the average profile on Facebook. With the above design, it takes six SQL join operations to access and display the information about a single user. This makes rendering the profile page a fairly database intensive operation. This issue is compounded by the fact that profile pages are the most popular pages on social networking sites.

select * from Users u
inner join UserPhoneNumbers upn on u.user_id = upn.user_id
inner join UserScreenNames  usn on u.user_id = usn.user_id
inner join UserAffiliations ua  on u.user_id = ua.user_id
inner join Affiliations a on a.affiliation_id = ua.affiliation_id
inner join UserWorkHistory uwh on u.user_id = uwh.user_id
inner join Affiliations wa on uwh.affiliation_id = wa.affiliation_id

The Three-way Join

The joining of tables that share a many-to-many relationship is the most complex. That’s because of the third junction table that links the two related tables. To extract the data from these tables requires something called a three way join. Here’s a three way query that retrieves data from the employees, shops, and junction shops_employees tables where employees can work for several shops at once and shops can have many employees working there:

select * 
from employees as e,
shops_employees as se,
shops as s 
where e.id = se.employee_id 
and   se.shop_id = s.shop_id;

Notice how the employees table links to the shops_employees table through the employee id and the shops_employees table in turn links to the shops table through the shared shop_id field.

Handling Optional Foreign Keys

There is some debate as to whether foreign keys should ever be null. Conceptually, there is nothing to inherently prevent the existence of null foreign keys. From the standpoint of a data-driven application, which interacts with a database, the determining factor is whether or not the related object, that is the owner of the foreign key, is optional or not. For instance, one form might contain information about individuals associated with various organized crime organizations. The same individual can be a part of several organizations, or none at all, in which case, there will be no record for that client in the organized crime organization or junction tables.

To perform a lookup on clients that displays relevant organized crime group associations, we have to take special care to not filter out clients who have no such association because

   RECORDS WITH NULL FOREIGN KEYS WILL NOT BE RETURNED BY THE QUERY USING REGULAR JOINS

Here are some modified employees and shops tables that I have been referring to in my last several articles. I changed their relationship from one-to-many to many-to-many by allowing employees to work for more than one shop. This change would better record the temporal data of an employee that changed shops over the course of their employment.

The employees table

id

gender

name

salary

1

m

Jon Simpson

4500

2

f

Barbara Breitenmoser

(NULL)

3

f

Kirsten Ruegg

5600

4

m

Ralph Teller

5100

5

m

Peter Jonson

5200

6

m

Allan Smithie

4900

The shops table

shop_id

shop

1

Zurich

2

New York

3

London

The shops_employees table

shops_employees_id

employee_id

shop_id

1

1

2

2

1

1

3

2

2

4

3

3

5

3

1

6

5

3

Using a straight join does not display all of the employees because Ralph Teller and Peter Johnson are not associated with any shop. Perhaps they are sales people who travel as part of their jobs.

select name,
       gender,
       shop 
from employees as e,
shops_employees as se,
shops as s 
where e.id       = se.employee_id 
and   se.shop_id = s.shop_id;

Notice that Ralph Teller and Peter Johnson are missing from the results:

name

gender

shop

Jon Simpson

m

New York

Jon Simpson

m

Zurich

Barbara Breitenmoser

f

New York

Kirsten Ruegg

f

London

Kirsten Ruegg

f

Zurich

Peter Jonson

m

London

To retrieve all of the employees, whether or not they are associated with any shops necessitates the use of Left Joins between the employees and shops_employees as well as the shops_employees and shops tables:

select emp.name,
       emp.gender,
       sh.shop
from employees AS emp 
     left JOIN shops_employees as se ON (emp.id     = se.employee_id)
     left JOIN shops           as sh ON (sh.shop_id = se.shop_id)
order by emp.name,
         sh.shop;

Now we get all of the employee information, including those with null shop_ids:

name

gender

shop

Allan Smithie

m

(NULL)

Barbara Breitenmoser

f

New York

Jon Simpson

m

New York

Jon Simpson

m

Zurich

Kirsten Ruegg

f

London

Kirsten Ruegg

f

Zurich

Peter Jonson

m

London

Ralph Teller

m

(NULL)

There is a whole lot more to querying normalized database tables than we covered here today. Executing more complex queries and handling larger data sets call for more sophisticated strategies. You might be appalled to learn that one such strategy involves…gasp…denormalization! That would make one fine Halloween topic.

» See All Articles by Columnist Rob Gravelle

Robert Gravelle
Robert Gravelle
Rob Gravelle resides in Ottawa, Canada, and has been an IT guru for over 20 years. In that time, Rob has built systems for intelligence-related organizations such as Canada Border Services and various commercial businesses. In his spare time, Rob has become an accomplished music artist with several CDs and digital releases to his credit.

Get the Free Newsletter!

Subscribe to Cloud Insider for top news, trends & analysis

Latest Articles