Introduction
I decided to start making notes as I designed a database to hold the
articles that I am writing. When I started writing, it was easy to
keep track of what I submitted. Not that I am writing for more than one publisher
and they have different editorial cycles (some as long as 4-5 months)
it is becoming increasingly difficult to keep track of what is submitted where
and when it will be published. Actually, it is impossible given that I have
a full-time job elsewhere and a busy family life.
So, as I design the database and build a front end to keep track of it, I will
be writing notes and including code here for people to use if they would
like. Since this is a work in progress, the writing may be a little
rough. I will not be spending too much time editing my notes as I
implement the application.
Before I go any further, I have to give credit to Jerry Pournelle for
giving me the idea for this project. I am a long time reader of his
columns in Byte (www.byte.com) and occassionally follow the notes he
keeps at Chaos Manor (www.jerrypournelle.com).
The Problem
As a writer, I have a substantial number of articles in the works,
completed, but not published, as well as completed and published. Recently I
also was asked to write an article on NT whereas I have been primarily
concerned with writing about SQL Server.
Over the last few weeks, I have spent nearly as much time trying to
figure out what articles to submit for publication as I have writing
articles each week. As with most applications, I decided to build a
quick prototype to organize my thoughts.
Now the first prototype was not using Access, SQL Server, VB, or anything
else. I started with a quick notepad document. Since I am also trying to
keep my ASP skills sharp, I entered the data in notepad as an ASP document
which I could view in a web page. This also allowed me to quickly format the
data and reference it from any computer without having to transport
a file around. This was my substitute for a paper prototype since my goal
is a mostly paperless office.
The initial prototype resulted in the following information for each article:
- Article Title – Title of the article (working)
- Synopsis – Brief description
- Submitted to – published I submitted it to
- Submitted Date – date I submitted the article
- Published Title – Actual title the article is published under
- Publication Date – Date the article was published
- Copyright Date – Date I mark the article as completed
- Category – To classify each article (DTS, SoapBox, etc.)
- Article Type – This was added as I write about NT as well as SQL Server
- Payment – Not a big deal, but I was curious how I am doing.
Of course, doing this in ASP/HTML results in a really long list with
embedded lists inside for other information. It did, however, give me a
starting point to begin designing a database.
Analysis
Before creating and tables, I need to examine the list of information above.
>From experience, I know that I have not thought of everything,
and I know the design will evolve as I learn more about what I need to track
and start to request reports from the application. Therefore, I am going to
start with the list above and see where this takes me.
When I start to design a database, I look for the places where I can
make the design flexible. I try to anticipate changes that might be asked
by the end user. I also try to then “break” my design by testing some
changes against it to see if it can “handle” them without modification.
How do I then convert the data above into a schema? First I will start
at the top and design my “Product” table. I could call this table “Articles”,
but my products are articles and I am trying to get into the book
writing business and I would also argue my book reviews are not really articles,
so I will stick with Products. I also may one day decide to speak at
some event, these speeches could also be stored in this table.
I will go through the fields one by one and explain how I look at this
information. First, Article Title is the main identifier that differentiates
each article. Now I tend to like identity fields even though a number of
database designers feel this is a poor design. My feelings are that
the identity represents a hash to me and whether I store
the data in a table and update it each time, get the max(), or use an
identity, the result is the same. The identity is also smaller (less storage) joins quicker,
and is easy to manage (IMHO).So while Article Title may be unique and
will probably be my PK, I will also add an ID field to my first table. This gives me
Product ----------- ProductID ProductName (I like this better than title)
The next piece of information is the synopsis. This is a description of the
item. My first instinct is to consider whether there will be multiple
descriptions for each article. If there are then, this would likely by put
into a child table. Since I do not expect my internal description to be a many
for each article, I will keep it in the Product table. There may be a
different description for each submission, however, so I will start another table
for submissions that includes a description as well. Now we have:
Product ----------- ProductID ProductName (I like this better than title) Submissions ------------ ProductID Description
Next we have the “Submitted To” data element. This is definitely not a part of
the product. First I have submitted the same article to two different
sources (at different times, of course), so this is definitely part of a
child table. Second, an element (A) that is a multiple of another element
(B) (many submissions per article) as well as having the the other
element (B) be a multiple of the original element (A) (the same company
may get multiple submissions) is a great candidate for its own table.
The “submitted to” is in reality a publisher or a company. An example of
this is “Swynk.com” where this article is published. I write many articles that
are all submitted to Swynk.
Another way to look at “Submitted To” is to consider this as a lookup
table. When I go to submit an article, I will really want a lookup against
past companies as well as the chance to add a new company. The result is
that this element results in a new table and a new field in the “Submissions”
table. Now I have:
Product ----------- ProductID ProductName (I like this better than title) Submissions ------------ ProductID Description CompanyID Companies ------------ CompanyID CompanyName
My middle table has become a many-many join table between essentially two
lookup tables. What about “submitted date”? This is a one-one for each
submission. Therefore we will add this to the Submissions table as another
field. Dates are generally not good candidates for lookups or their own tables.
Next we have Published Title which is most likely a one-one for each
submission as well. It is possible, however, for a publisher to re-release
the article under another title at a later date. I will choose not to
implement the many relationship here. My rational is that if an Article
were to be re-released, I would add it as another Submission for the
same article. I realize that this is a limitation of the design, but
I understand the implications and since this will not likely be a design
that serves thousands of clients (in my dreams), having a few rows
of (mostly) duplicate data is acceptable. If this were something that
happened for most of my articles, then I might feel differently. However
it will be a relatively rare event, so it goes into Submissions. The same
logic goes for the Publication Date. Now we have:
Product ----------- ProductID ProductName (I like this better than title) Submissions ------------ ProductID Description CompanyID PublicationName PublicationDate Companies ------------ CompanyID CompanyName
The Copyright Date is a one-one date for the article. It is essentially the
date that I assign to the article as being completed. Therefore this goes
into the “Product” table.
Next we have category. This is a classification attribute of the product.
Whenever I see something that is used to classify a data element I
immediately think “lookup table”. Since this is the first classification,
I will create a new table to hold this and add a field to the product
table. There may be the need to include this field in submissions, but I
am not sure, so I will leave it out for now.
Product ----------- ProductID ProductName (I like this better than title) CopyrightDate CategoryID Submissions ------------ ProductID Description CompanyID PublicationName PublicationDate Companies ------------ CompanyID CompanyName Category ----------- CategoryID CategoryName
Article Type is the next data element that I have to consider. This is
also a classifier for the product. Now, I have to consider the following
things. One, If I add this as another category, I have two fields
that have FKs to the category table. If I perform a join, I need to be
sure to include category twice in the join so I get both items. This
can be confusing since I would need something like:
select * from product p, category c1, categoryc2 where p.categoryid = c1.categoryid and p.articletype = c2.articletype
I prefer to make another lookup table to hold article types. This makes for quite
a few lookup tables, but I have not had a problem managing thirty or
so lookup tables. As with Article above, I changed the name to ProductType.
The last field is payment. Since I get payed by submission, not article,
I will include this in the submissions table. I suppose the possibility exists
that there could be multiple payments for a submission (not that I
have seen this one), but I do not want to track this for now. The final
design looks like this:
Product ----------- ProductID ProductName (I like this better than title) CopyrightDate CategoryID ProductTypeID Submissions ------------ ProductID Description CompanyID PublicationName PublicationDate Payment Companies ------------ CompanyID CompanyName Category ----------- CategoryID CategoryName ProductType ------------- ProductTypeID ProductTypeName
The Design
This has gone on long enough. I will tackle the design next time and include
some images of the ERD that I build.
Conclusions
This is the start of a multi-part series that I will be releasing over the
next few months. In reality, the entire project will take only a few
days, but I need some time to organize the short, quick notes I take
as I build this into something easier to read.
Now I realize that quite a few of my regular readers may ask “Where are the
abbreviations?” They are coming! This is the analysis phase of this
project and an initial table design. As I start to draw this up and build
the tables, I will build a table of abbreviations and include them in the
actual DDL code. But for the business presentation and analysis, I prefer to keep
the longer names. It prevents confusion when explaining to non-technical people.
Plus, I hate when non-technical people actually know the names in
the DB. They do not use them enough to be familiar and quickly confuse
everyone. This is one of the arguements to use long names all the
time (see the alternative viewpoint in Abbreviations),
but since I really hate typing and business rules and nomenclature can
change, I will stick with my abbreviations for code.
As always, I welcome feedback and hope that this series will help some
of the more junior designers understand how I approach a database design.
This same approach would apply for enhancements to a schema as well.
Steve Jones
January 2001