Sample Database Design
February 7, 2001
I decided to start making notes as I designed a database to hold the articles that I am writing. When I started writing, it was easy to keep track of what I submitted. Not that I am writing for more than one publisher and they have different editorial cycles (some as long as 4-5 months) it is becoming increasingly difficult to keep track of what is submitted where and when it will be published. Actually, it is impossible given that I have a full-time job elsewhere and a busy family life.
So, as I design the database and build a front end to keep track of it, I will be writing notes and including code here for people to use if they would like. Since this is a work in progress, the writing may be a little rough. I will not be spending too much time editing my notes as I implement the application.
Before I go any further, I have to give credit to Jerry Pournelle for giving me the idea for this project. I am a long time reader of his columns in Byte (www.byte.com) and occassionally follow the notes he keeps at Chaos Manor (www.jerrypournelle.com).
As a writer, I have a substantial number of articles in the works, completed, but not published, as well as completed and published. Recently I also was asked to write an article on NT whereas I have been primarily concerned with writing about SQL Server.
Over the last few weeks, I have spent nearly as much time trying to figure out what articles to submit for publication as I have writing articles each week. As with most applications, I decided to build a quick prototype to organize my thoughts.
Now the first prototype was not using Access, SQL Server, VB, or anything else. I started with a quick notepad document. Since I am also trying to keep my ASP skills sharp, I entered the data in notepad as an ASP document which I could view in a web page. This also allowed me to quickly format the data and reference it from any computer without having to transport a file around. This was my substitute for a paper prototype since my goal is a mostly paperless office.
The initial prototype resulted in the following information for each article:
Before creating and tables, I need to examine the list of information above. >From experience, I know that I have not thought of everything, and I know the design will evolve as I learn more about what I need to track and start to request reports from the application. Therefore, I am going to start with the list above and see where this takes me.
When I start to design a database, I look for the places where I can make the design flexible. I try to anticipate changes that might be asked by the end user. I also try to then "break" my design by testing some changes against it to see if it can "handle" them without modification.
How do I then convert the data above into a schema? First I will start at the top and design my "Product" table. I could call this table "Articles", but my products are articles and I am trying to get into the book writing business and I would also argue my book reviews are not really articles, so I will stick with Products. I also may one day decide to speak at some event, these speeches could also be stored in this table.
I will go through the fields one by one and explain how I look at this
information. First, Article Title is the main identifier that differentiates
each article. Now I tend to like identity fields even though a number of
database designers feel this is a poor design. My feelings are that
the identity represents a hash to me and whether I store
the data in a table and update it each time, get the max(), or use an
identity, the result is the same. The identity is also smaller (less storage) joins quicker,
and is easy to manage (IMHO).So while Article Title may be unique and
will probably be my PK, I will also add an ID field to my first table. This gives me
Product ----------- ProductID ProductName (I like this better than title)
The next piece of information is the synopsis. This is a description of the
item. My first instinct is to consider whether there will be multiple
descriptions for each article. If there are then, this would likely by put
into a child table. Since I do not expect my internal description to be a many
for each article, I will keep it in the Product table. There may be a
different description for each submission, however, so I will start another table
for submissions that includes a description as well. Now we have:
Product ----------- ProductID ProductName (I like this better than title) Submissions ------------ ProductID Description
Next we have the "Submitted To" data element. This is definitely not a part of the product. First I have submitted the same article to two different sources (at different times, of course), so this is definitely part of a child table. Second, an element (A) that is a multiple of another element (B) (many submissions per article) as well as having the the other element (B) be a multiple of the original element (A) (the same company may get multiple submissions) is a great candidate for its own table. The "submitted to" is in reality a publisher or a company. An example of this is "Swynk.com" where this article is published. I write many articles that are all submitted to Swynk.
Another way to look at "Submitted To" is to consider this as a lookup
table. When I go to submit an article, I will really want a lookup against
past companies as well as the chance to add a new company. The result is
that this element results in a new table and a new field in the "Submissions"
table. Now I have:
Product ----------- ProductID ProductName (I like this better than title) Submissions ------------ ProductID Description CompanyID Companies ------------ CompanyID CompanyNameMy middle table has become a many-many join table between essentially two lookup tables. What about "submitted date"? This is a one-one for each submission. Therefore we will add this to the Submissions table as another field. Dates are generally not good candidates for lookups or their own tables.
Next we have Published Title which is most likely a one-one for each
submission as well. It is possible, however, for a publisher to re-release
the article under another title at a later date. I will choose not to
implement the many relationship here. My rational is that if an Article
were to be re-released, I would add it as another Submission for the
same article. I realize that this is a limitation of the design, but
I understand the implications and since this will not likely be a design
that serves thousands of clients (in my dreams), having a few rows
of (mostly) duplicate data is acceptable. If this were something that
happened for most of my articles, then I might feel differently. However
it will be a relatively rare event, so it goes into Submissions. The same
logic goes for the Publication Date. Now we have:
Product ----------- ProductID ProductName (I like this better than title) Submissions ------------ ProductID Description CompanyID PublicationName PublicationDate Companies ------------ CompanyID CompanyNameThe Copyright Date is a one-one date for the article. It is essentially the date that I assign to the article as being completed. Therefore this goes into the "Product" table.
Next we have category. This is a classification attribute of the product.
Whenever I see something that is used to classify a data element I
immediately think "lookup table". Since this is the first classification,
I will create a new table to hold this and add a field to the product
table. There may be the need to include this field in submissions, but I
am not sure, so I will leave it out for now.
Product ----------- ProductID ProductName (I like this better than title) CopyrightDate CategoryID Submissions ------------ ProductID Description CompanyID PublicationName PublicationDate Companies ------------ CompanyID CompanyName Category ----------- CategoryID CategoryName
Article Type is the next data element that I have to consider. This is
also a classifier for the product. Now, I have to consider the following
things. One, If I add this as another category, I have two fields
that have FKs to the category table. If I perform a join, I need to be
sure to include category twice in the join so I get both items. This
can be confusing since I would need something like:
select * from product p, category c1, categoryc2 where p.categoryid = c1.categoryid and p.articletype = c2.articletypeI prefer to make another lookup table to hold article types. This makes for quite a few lookup tables, but I have not had a problem managing thirty or so lookup tables. As with Article above, I changed the name to ProductType.
The last field is payment. Since I get payed by submission, not article,
I will include this in the submissions table. I suppose the possibility exists
that there could be multiple payments for a submission (not that I
have seen this one), but I do not want to track this for now. The final
design looks like this:
Product ----------- ProductID ProductName (I like this better than title) CopyrightDate CategoryID ProductTypeID Submissions ------------ ProductID Description CompanyID PublicationName PublicationDate Payment Companies ------------ CompanyID CompanyName Category ----------- CategoryID CategoryName ProductType ------------- ProductTypeID ProductTypeName
This has gone on long enough. I will tackle the design next time and include some images of the ERD that I build.
This is the start of a multi-part series that I will be releasing over the next few months. In reality, the entire project will take only a few days, but I need some time to organize the short, quick notes I take as I build this into something easier to read.
Now I realize that quite a few of my regular readers may ask "Where are the abbreviations?" They are coming! This is the analysis phase of this project and an initial table design. As I start to draw this up and build the tables, I will build a table of abbreviations and include them in the actual DDL code. But for the business presentation and analysis, I prefer to keep the longer names. It prevents confusion when explaining to non-technical people. Plus, I hate when non-technical people actually know the names in the DB. They do not use them enough to be familiar and quickly confuse everyone. This is one of the arguements to use long names all the time (see the alternative viewpoint in Abbreviations), but since I really hate typing and business rules and nomenclature can change, I will stick with my abbreviations for code.
As always, I welcome feedback and hope that this series will help some of the more junior designers understand how I approach a database design. This same approach would apply for enhancements to a schema as well.Steve Jones