Exploring Relational Database Theory and Practice
August 25, 2010
Microsoft Access 2010 has a collection of wizards to lead you step-by-step through each process involved in developing and using a production-grade database application. ' Exploring Relational Database Theory and Practice ' is extracted from ' Microsoft Access 2010 In Depth', published by Que.
Moving from Spreadsheets to Databases
Word processing and spreadsheet applications were the engines that drove the fledgling personal computer market. In the early PC days, WordPerfect and Lotus 1-2-3 dominated the productivity software business. Today, most office workers use Microsoft Word and Excel on a daily basis. Its probably a safe bet that more data is stored in Excel spreadsheets than in all the worlds databases. Its an equally good wager that most new Access users have at least intermediate-level spreadsheet skills, and many qualify as Excel power users.
Excel 2010s Data ribbon offers elementary database features, such as sorting, filtering, validation, and data entry forms. You can quickly import and export data in a variety of formats, including those of database management applications, such as Access. Excels limitations become apparent as your needs for entering, manipulating, and reporting data grow beyond the spreadsheets basic row-column metaphor. Basically, spreadsheets are list managers; its easy to generate a simple name and address list with Excel. If your needs expand to contact management and integrating the contact data with other information generated by your organization, a spreadsheet isnt the optimal approach.
The first problem arises when your contacts list needs additional rows for multiple persons from a single company. You must copy or retype all the company information, which generates redundant data. If the company moves, you must search and replace every entry for your contacts at the firm with the new address. If you want to record a history of dealings with a particular individual, you add pairs of date and text columns for each important contact with the person. Eventually, you find yourself spending more time navigating the spreadsheets rows and columns than using the data they contain.
Contact lists are only one example of problems that arise when attempting to make spreadsheets do the work of databases. Tracking medical or biological research data, managing consulting time and billings, organizing concert tours, booking artist engagements, and myriad other complex processes are far better suited to database than spreadsheet applications.
Moving to a relational database management system (RDBMS), such as Access, solves data redundancy and navigation problems and greatly simplifies updating existing information. After you understand the basic rules of relational database design, Access makes creating highly efficient databases quick and easy. Access 2010 has a collection of wizards to lead you step-by-step through each process involved in developing and using a production-grade database application. Unfortunately, no Relational Wizard exists to design the underlying database structure for you, but youll find a wealth of pre-built database templates in the Backstage pages New tab. (Click the ribbons File tab to open the new Backstage page.)
Reliving Database History
Databases form the foundation of world commerce and knowledge distribution. Without databases, there would be no World Wide Web, automatic teller machines, credit/debit cards, or online airline reservation systems. Newsgathering organizations, research institutions, universities, and libraries would be unable to categorize and selectively disseminate their vast store of current and historical information. Its difficult to imagine today a world without a network of enormous databases, many of which probably contain a substantial amount of your personal data that you dont want to be easily available to others.
Jim Grays article, Data Management: Past, Present, and Future, which is available as a Microsoft Word document at http://research.microsoft.com/~gray/DB_History.doc, offers a more detailed history of data processing systems. Dr. Gray was a senior researcher and the manager of Microsofts Bay Area Research Center (BARC) until early 2007, when he became lost at sea while sailing off the California coast
The Early History of Databases
The forerunner of todays databases consisted of stacks of machine-readable punched cards, which Herman Hollerith used to record the 1890 U.S. census. Hollerith formed the Computing-Tabulating-Recording Company, which later became International Business Machines. From 1900 to the mid-1950s, punched cards were the primary form of business data storage and retrieval, and IBM was the primary supplier of equipment to combine and sort (collate) punched cards, and print reports based on punched-card data.
The development of large computer-maintained databasesoriginally called databanksis a postWorld War II phenomenon. Mainframes replaced punched cards with high-capacity magnetic tape drives to store large amounts of data. The first databases were built on the hierarchical and network models, which were well suited to the mainframe computers of the 1950s. Hierarchical databases use parent-child relationships to define data structures, whose diagrams resemble business organization charts or an inverted tree with its root at the top of the hierarchy. Network databases allow relaxation of the rules of hierarchical data structures by defining additional relationships between data items. Hierarchical and network databases ordinarily are self-contained and arent easy to link with other external databases over a network.
Early databases used batch processing for data entry and retrieval. Keypunch operators typed data from documents, such as incoming orders. At night, other operators collated the days batch of punched cards, updated the information stored on magnetic tape, and produced reports. Many smaller merchants continue to use batch processing of customers credit-card purchases, despite the availability of terminals that permit almost instantaneous processing of credit- and debit-card transactions.
Hierarchical databases remain alive and well in the twenty-first century. For example, data storage for Windows 2000s Active Directory and Microsoft Exchange Server is derived from the hierarchical version of Accesss original relational Jet databases. The name Jet comes from the original Access database engine called Joint Engine Technology.
The Internets Domain Name System (DNS) is a collection of hierarchical databases for translating character-based Internet domain names into numerical Internet Protocol (IP) addresses. The DNS database is called a distributed database, because its data is held by a global network of thousands of computers.
The Relational Database Model
Dr. E. F. Codd, an employee of IBM Corporation, published A Relational Model of Data for Large Shared Databanks in a journal of the Association for Computing Machinery (ACM) in June 1970. A partial copy of the paper is available at http://www.acm.org/classics/nov95/. Dr. Codds specialty was a branch of mathematics called set theory, which includes the concept of relations. He defined a relation as a named set of tuples (records or rows) that have attributes (fields or columns). One of the attributes must contain a unique value to identify each tuple. The common term for relation is a table whose presentation to the user is similar to that of a spreadsheet.
Relational databases solve a serious problem associated with earlier database types. Hierarchical and network databases define sets of data and explicit links between each data set as parent-child and owner-member, respectively. To extract information from these databases, programmers had to know the structure of the entire database. Complex programs in COBOL or other mainframe computer languages are needed to navigate through the hierarchy or network and extract information into a format understandable by users.
Dr. Codds objective was to simplify the process of extracting formatted information and make adding or altering data easier by eliminating complex navigational programming. During the 1970s, Dr. Codd and others developed a comparatively simple language, Structured Query Language (SQL), for creating, manipulating, and retrieving relational data. With a few hours of training, ordinary database users could write SQL statements to define simple information needs and bypass the delays inherent in the database programming process. SQL, which was first standardized in 1985, now is the lingua franca of database programming, and all commercial database products support SQL.
Client/Server and Desktop RDBMSs
In the early database era, the most common presentation of data took the form of lengthy reports processed by centralized, high-speed impact printers on fan-folded paper. The next step was to present data to the user on green-screen video terminals, often having small printers attached, which were connected to mainframe databases. As use of personal computers gained momentum, terminal emulator cards enabled PCs to substitute for mainframe terminals. Mainframe-scale relational databases, such as IBMs DB2, began to supplement and later replace hierarchical and network databases, but terminals continued to be the primary means of data entry and retrieval.
The most widely used SQL standard, SQL-92, was published by the American National Standards Institute (ANSI) in 1992. Few, if any, commercial relational database management systems (RDBMSs) today fully conform to the entire SQL-92 standard. The later SQL-99 (also called SQL3) and SQL-200n specifications add new features that arent germane to Access databases.
RDBMS competitors have erected an SQL Tower of Babel by adding nonstandard extensions to the language. For example, Microsofts Transact-SQL (T-SQL) for SQL Server, which is the subject of Chapter 27, Moving from Access Queries to Transact-SQL, has many proprietary keywords and features. Oracle Corporations Oracle:SQL and PL/SQL dialects also have proprietary SQL extensions.
Oracle, Ingres, Informix, Sybase, and other software firms developed relational databases for lower-cost minicomputers, most of which ran various flavors of the UNIX operating system. Terminals continued to be the primary data entry and display systems for multiuser UNIX databases.
The next step was the advent of early PC-based flat-file managers and relational database management systems. Early flat-file database managers, typified by Jim Buttons PCFile for DOS (1981) and Claris FileMaker for Macintosh (1988) and Windows (1992), used a single table to store data and offered few advantages over storing data in a spreadsheet. The early desktop RDBMSssuch as dBASE, Clipper, FoxBase, and Paradoxran under DOS and didnt support SQL. These products later became available in multiuser versions, adopted SQL features, and eventually migrated to Windows. Access 1.0, which Microsoft introduced in November 1992, rapidly eclipsed its DOS and Windows competitors by virtue of Accesss combination of graphical SQL support, versatility, and overall ease of use.
PC-based desktop RDBMSs are classified as shared-file systems because they store their data in conventional files that multiple users can share on a network. One of Accesss initial attractions for users and developers was its capability to store all application objectsforms, reports, and programming codeand tables for a database application in a single file, which used the earlier .mdb extension.. FoxPro, dBASE, Clipper, and Paradox require a multitude of individual files to store application and data objects. Today, almost every multiuser Access application is divided (split) into a front-end .accdb file, which contains application objects and links to a back-end database .accdb file that holds the data. Each user has a copy of the front-end .accdb file and shares connections to a single back-end .accdb file on a peer Windows workstation or server.
Prior to Access 2000, Jet was Accesss standard database engine, so the terms Access database and Jet database were interchangeable. Microsoft considered SQL Server to be its strategic RDBMS for Access 2000 and 2003. Strategic means that SQL Server gets continuing development funds and Jet doesnt. Jet 4.0, which was included with Access 2003 and is a part of the Windows XP and later operating systems, is the final version and is headed toward retirement.
Microsofts Access team decided to enhance Jet 4.0 with the new features described in Chapter 1, Access 2010 for Access 2007 Users: Whats New, change the file extension from .mdb to .accdb, and drop all references to Jet. To reflect this change, this edition uses the terms Access database and SQL Server database. Unless otherwise noted, SQL Server refers to all SQL Server 2005 editions except the Compact and Mobile editions.
Client/server RDBMSs have an architecture similar to Accesss front-end/back-end shared-file multiuser configuration. What differentiates client/server from shared-file architecture is that the RDBMS on the server handles most of the data-processing activity. The client front end provides a graphical user interface (GUI) for data entry, display, and reporting. Only SQL statements and the specific data requested by the user pass over the network. Client/server databases traditionally run on network operating systems, such as Windows and UNIX, and are much more robust than shared-file databases, especially for applications in which many users make simultaneous additions, changes, and deletions to the database. All commercial data-driven Web applications use client/server databases.
This book uses the terms field and record when referring to tables, and columns and rows when discussing data derived from tables, such as the views and query result sets described later in this chapter.
Since version 1.0, Access has had the capability to connect to client/server databases by linking their tables to an Access database. Linking lets you treat client/server tables almost as if they were native Access tables. Linking uses Microsofts widely accepted Open Database Connectivity (ODBC) standard, and Access 2010 includes an ODBC driver for SQL Server and Oracle databases. You can purchase licenses for ODBC drivers that support other UNIX or Windows RDBMSs, such as Sybase or Informix, from the database supplier or third parties. Chapter 19, Linking Access Front Ends to Access and Client/Server Databases, describes the process of linking Access and Microsoft SQL Server 2008 databases. Although Chapter 19 uses SQL Server for its examples, the linking procedure is the same foror at least similar toother client/server RDBMSs.
Access data projects (ADP) and the Microsoft SQL Server 2005 Express Edition combine to make Access 2010 a versatile tool for designing and testing client/server databases and creating advanced data entry and reporting applications. You can start with a conventional Access database and later use Accesss Upsizing Wizard to convert the .mdb file(s) to an .adp file that holds application objects and an SQL Server 2005 back-end database. Access 2010s Upsizing Wizard has incorporated many improvements to the Access 2000 and earlier wizard versions, but Access 2010s Wizard is the same as 2007s. Despite the upgraded wizardry, youre likely to need to make changes to queries to accommodate differences between Access and SQL Servers SQL dialects.
Defining the Structure of Relational Databases
Relational databases consist of a collection of self-contained, related tables. Tables typically represent classes of physical objects, such as customers, sales orders, invoices, checks, products for sale, or employees. Each member object, such as an invoice, has its own record in the invoices table. For invoices, the field that uniquely identifies a record, called a primary key[field], is a serial invoice number.
Figure 4.1 shows Accesss Datasheet view of an Invoices table, which is based on the Northwind.mdb sample databases Orders table. The InvoiceNo field is the primary key. Values in the OrderID, CustomerID, EmployeeID, and ShipperID fields relate to primary key values in Northwinds Orders, Customers, Employees, and Shippers tables. A field that contains values equal to those of primary key values in other tables is called a foreign key [field].
This simple Invoices table was created from the Northwind Orders table and doesnt take advantage of Accesss extended properties, such as the field captions, lookup fields, and subdatasheets in the Datasheet view of the Orders table.
If you need information about a particular invoice or set of invoices, open the Invoices table and search for the invoice(s) by number (InvoiceNo) or another attribute, such as a customer code (CustomerID), date (ShippedDate), or range of dates. Unlike earlier database models, the user can access the Invoices table independently of its related tables. No database navigation programming is needed. A simple, intuitive SQL statement, SELECT * FROM Invoices, returns all the data in the table. The asterisk (*) represents a request to display the contents of all fields of the table.
Removing Data Redundancy with Relationships
The Invoices table of Figure 4.1 is similar to a spreadsheet containing customer billing information. Whats missing is the customer name and address information. A five-character customer code (CustomerID) identifies each customer to whom the invoice is directed. The CustomerID values in the Invoices table match CustomerID values in a modified version of Northwinds Customers table (see Figure 4.2). Matching a foreign key with a primary key value often is called a lookup operation. Using a key-based lookup operation eliminates the need to repeatedly enter name, address, and other customer-specific data in the Invoices table. In addition, if you change the customers address, the change applies to all past and future invoices.
Foreign key values in the Invoices table must match primary key values in the Customers table.
The Invoices table also connects with other tables, which contain information on orders, sales department employees, and the products ordered. Connections between fields of related tables having common values are called relationships (not relations). Figure 4.3 shows Accesss Relationships window displaying the relationships between the Invoices table and the other tables of the Northwind sample database.
Accesss Relationships window displays the relationships between the tables of the Northwind sample database, plus the added Invoices table. Every relationship between these tables is one-to-many. The many-to-many relationship between Products and Orders is an indirect relationship.
Another method of generating unique keys is by use of Globally Unique Identifiers (GUIDs), which also are called Universally Unique Identifiers (UUIDs). GUIDs are 16-byte computed binary numbers that are guaranteed to be unique locally and universally; no other computer in the world will duplicate a GUID. SQL Servers uniqueidentifier data type is a GUID. Because GUIDs cant represent a property of an object, such as a check number, GUID keys are called surrogate keys. You cant select a GUID data type in Accesss Table Design mode.
Relationships come in the following three flavors:
There are many other indirect relationships between the tables shown in Figure 4.3. For example, a many-to-many relationship exists between the Suppliers and Orders tables. In this case, Products and Order Details act as linking tables between the Suppliers and Orders tables.
The Relationships window displays the names of primary key fields in a boldface font. Notice in Figure 4.3 that the OrderID and ProductID field names are preceded by a key symbol. The OrderID and ProductID fields compose a composite primary key, which uniquely identifies an order line item. You cant repeat the same combination of OrderID and ProductID; this precaution makes sense for products that have only one stock-keeping unit (SKU), such as for Aniseed Syrup, which comes only in a carton of 12 550ml bottles.
Access 2010s multivalue field feature automatically generates a hidden linking table under the covers. Access 2007 introduced the multivalued field for compatibility with SharePoint lists.
The Oakmont.accdb sample database file in the \2010Samples\Oakmont folder of the downloadable code has a structure that differs from that of Northwind.accdb, but the design principles of the two databases are similar. OakmontSQL.mdf is an SQL Server 2008 database for use with ADP. ADP uses a special set of toolscalled the project designer or da Vinci toolset in this bookfor designing and managing SQL Server databases. The Oakmont files are course enrollment databases for a college. Figure 4.4 shows the Database Diagram window for the OakmontSQL database. The SQL Server Diagram window is similar to the Relationships window for Accesss traditional Access databases. The key and infinity symbols at the ends of each line represent the one and many sides, respectively, of the one-to-many relationships between the tables. Access and SQL Server databases store information on table relationships as an object within the database file.
The SQL Server Database Diagram window for the OakmontSQL database shows one-to-many relationships between primary key fields (identified by key symbols) and foreign key fields (infinity symbols).
This book uses the Access 2010 and SQL Server 2008 R2 versions of the Northwind and Oakmont sample databases in almost all examples. The tables of the Oakmont database have many more records than the Northwind tables. The large number of records in the Oakmont database makes it better suited than Northwind for predicting the performance of production Access and SQL Server database applications.
The one-product-entry-per-order restriction prevents shared use of the Order Details table as an invoice line items table. If you short-ship an order item on one invoice, you cant add another record to the Order Details table when you ship the remaining quantity of the item. Microsoft didnt add an Invoices table for Northwind Traders, probably because of the complexity of dealing with backorders and drop-shipments.