If you’re a DBA who has never managed the data itself, or a new database architect, you’ll need to know the ins and outs of normalization. For those who aren’t comfortable with the concept, NORMALIZATION is the process of structuring tables for maximum efficiency of joins, while assuring minimum duplication of data and maximum data integrity.
Why Normalize? Yes, building a database that isn’t normalized is quicker. If you’re the one in charge of maintaining it, though, you’ll pay later: explosion in the size of your database, convoluted coding, duplication of data, and even contradictory data isn’t uncommon. There are often good reasons NOT to normalize, too: If your database is largely a reporting platform, you may want to persist duplicated data to avoid unnecessary joins. Quite frankly, OLAP (on-line analytical processing) is just a big excuse not to normalize. For the purposes of this article, let’s assume you’re designing a production OLTP (on-line transactional processing) system, with users inserting, updating, and deleting records willy-nilly.
Okay, you may have heard of the “normal forms”. The forms are a set of standards that state how well a database is normalized. They aren’t rules, nobody will come to your office and slap you if you violate them. Different developers go to different lengths to implement them–most folks I know don’t go above 3rd normal form on a regular basis. Here are the quick and dirty definitions in English.
1st Normal Form (1NF): All 1NF says is that you don’t repeat groups of data as separate fields. An example of this is having a customer table with three phone number fields (home, work, and fax). To avoid violating 1NF, you’d need a separate phone numbers table. Once again, sometimes it makes sense to ignore this rule, even in an OLTP scenario.
2nd Normal Form (2NF): Okay, now each column in the table must be dependant on the ENTIRE key. Using the example above, you wouldn’t want to keep the customer name in the phone numbers table; since there would be 0-3 records for each customers, you could say that the number itself is dependant on the key (i.e., there is just one record for each key/phone combination) but the customer name is NOT dependant (the key/name combination has the possibility of duplication).
Basically, the customer name applies to lots of different kinds of records (invoices, addresses, etc.) and therefore should be stored separately and related to each different record using a key. Now, if you add “Phone Type” to the key, you’re not in violation.
3rd Normal Form (3NF): This one can be confusing, but its result makes sense. Each column must depend DIRECTLY on the key. You could think of this as a logical extension of 2NF; while the example above fits 2NF, it doesn’t fit 3NF because the key in the phone numbers table doesn’t identify the customer in every instance (such as if you were looking for customer orders). To avoid violating 3NF, you’d need to put customer name in the Customers table and then key to the phone numbers table based on CustomerID.
4th and 5th Normal Form: Pretend as if they don’t exist. They’re really only useful for late night discussions when you’ve had too many tequila shooters and cheese fries. I have personally had nightmares about them. Repeat this mantra: “High order Normalization is a tool of the Devil”.
As a general rule, you should use non-meaningful key values to avoid having to change them when the data in the row changes.
Comments and suggestions for future articles are gratefully accepted. I will also post corrections if someone points out a mistake.