Build a Web Site Traffic Analysis Cube: Part I - Page 2
July 21, 2003
Obtaining the Traffic Data
Before we can begin generating our Site Traffic Analysis Cube in Analysis Services, we need a data source that has, in essence, the characteristics of a star schema, or at least those of a fact table. A common source of information about traffic and activity on virtually any web site is the Server Access Log, which we discuss in the next section.
Introducing the Server Access Log
While there are many options for storing site access information, the most common measure of these statistics, file accesses (popularly known as "hits"), is typically recorded in a file called an access log. It is beyond the scope of this lesson to weigh the well-known debate over the appropriateness of "hits" as the sole measure of site traffic; many of us are aware that alternatives exist that can more accurately distinguish actual visits from mere file accesses. For the purposes of this article, we will use the access log as the sole source of data for our cube, primarily to keep the focus on the cube design and construction itself. Suffice it to say that many more sophisticated approaches to traffic data capture have been devised, and that our approach within the scope of this lesson is not a recommendation of "best practices" within this specialized field.
The Server Access Log is central, for our purposes, to learning about the visits to our site, as well as the visitors themselves. At a high level, any visit to our site means a corresponding request for a file from the site. A request for a file results in a corresponding entry in the server access log, which acts as a cumulative history of every attempt (whether successful or unsuccessful) to retrieve information from the site. The information in the individual entries is easily extracted and loaded into a data store that is more readily adapted to the support of a multidimensional cube. The typical log contains entries in the common log file format, which includes the following fields:

- remotehost: the IP address or host name of the requesting client
- rfc931: the remote identity of the user (rarely populated; usually "-")
- authuser: the user name supplied for authentication, if any
- date: the date and time of the request, including the time zone offset
- request: the request line received from the client (method, resource, and protocol)
- status: the HTTP status code returned to the client
- bytes: the number of bytes transferred in the response
Example entries in a simple server access log appear in Illustration 1.
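To make the structure of such entries concrete, the following Python sketch parses one common-log-format line into its constituent fields. The article itself performs the real import with DTS; this is only an illustration, and the sample log line is invented in the spirit of Illustration 1:

```python
import re

# Regex for the Common Log Format:
# remotehost rfc931 authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_clf_line(line):
    """Return a dict of the seven CLF fields, or None if the line is malformed."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # "-" in the bytes field means no content was sent (e.g. a 304 response)
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    entry["status"] = int(entry["status"])
    return entry

# Hypothetical sample entry for demonstration
sample = ('192.168.1.10 - jsmith [21/Jul/2003:10:15:32 -0500] '
          '"GET /index.html HTTP/1.0" 200 4523')
print(parse_clf_line(sample))
```

Malformed lines simply return None, which a real extraction step would count or log rather than silently discard.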
Other log formats exist in addition to the above, and logs are often combined to add information to the entries shown above. For the purposes of our lesson, we will work with a sample log file similar to that shown in Illustration 1. Keep in mind, throughout the procedures that follow, that the steps we take to entrain the data into our star schema are similar for any log, with modifications obviously required to adapt to differing scenarios. Again, we will keep the extraction process simple so that we can focus on our primary objectives.
As virtually all of us know, many options exist for importing a log file. We want not only to import the information from the file, however, but also to load it into a data source that Analysis Services can easily access to build a cube. To achieve both objectives simultaneously, we will use Data Transformation Services (DTS), an Extraction, Transformation and Loading (ETL) tool that accompanies Microsoft SQL Server 2000.
Populating a Cube Data Source using DTS
Data Transformation Services (DTS), which comes with the typical installation of MSSQL Server 2000, is an excellent tool for developing, automating, and managing data extraction, transformation, and loading. In this lesson, our first major step will be to entrain the Server Access Log data, described in the previous section, into a structure useful to our cube design. While our excursion into DTS will involve only its simplest functions, we will still gain an appreciation for the flexibility of the tool and the number of things we can accomplish from a single graphical user interface. In our example, we will use DTS to accomplish the lion's share of building the data source structure, including the creation of the destination database and its member table; in practice the process would likely differ, in that the star schema would most likely have been designed and put in place well before the ETL process began.
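The shape of that load step, create a destination table, insert the extracted rows, and aggregate a measure, can be sketched outside of DTS as well. The following Python sketch uses the standard-library sqlite3 module purely as a stand-in for the MSSQL Server destination database; the table name, column names, and sample rows are all invented for illustration:

```python
import sqlite3

# In-memory database stands in for the MSSQL Server destination
conn = sqlite3.connect(":memory:")

# Hypothetical fact table for access-log entries (names are illustrative)
conn.execute("""
    CREATE TABLE access_log_fact (
        host      TEXT,
        authuser  TEXT,
        log_date  TEXT,
        request   TEXT,
        status    INTEGER,
        bytes     INTEGER
    )
""")

# Rows as they might arrive from the extraction step
rows = [
    ("192.168.1.10", "jsmith", "21/Jul/2003:10:15:32 -0500",
     "GET /index.html HTTP/1.0", 200, 4523),
    ("192.168.1.11", "-", "21/Jul/2003:10:16:01 -0500",
     "GET /images/logo.gif HTTP/1.0", 404, 0),
]
conn.executemany(
    "INSERT INTO access_log_fact VALUES (?, ?, ?, ?, ?, ?)", rows)

# A simple measure a cube might aggregate: hits per HTTP status code
for status, hits in conn.execute(
        "SELECT status, COUNT(*) FROM access_log_fact "
        "GROUP BY status ORDER BY status"):
    print(status, hits)
# prints:
# 200 1
# 404 1
```

In the article's actual workflow, DTS creates the destination table and performs the inserts through the Package Designer rather than through code.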
DTS makes use of OLE DB to acquire data. While the details lie outside our present scope, it is useful to understand that OLE DB goes beyond ODBC's limitations of access to only relational data. OLE DB defines a set of COM interfaces that let you access and manipulate any data type, enabling Microsoft's much-touted Universal Data Access (UDA).
In its role as an OLE DB consumer, DTS extracts data from any data source that acts as an OLE DB provider (that is, provides a native OLE DB interface), as well as from any ODBC data source. As we will see, we establish an OLE DB connection, using the DTS Package Designer, to our data source (the Server Access Log) in order to extract the data we need and to load it to our MSSQL Server destination database. Acting in its transformation role, DTS will map the access log data fields to the respective destination data fields via the data source connections, enabling the conversion of data types as necessary.
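One kind of conversion DTS performs in its transformation role is turning a log field's text into a proper destination data type. As a hedged illustration, the Python sketch below converts a common-log-format timestamp string into an ISO 8601 datetime, the sort of mapping a source date column might undergo on its way into a datetime destination column:

```python
from datetime import datetime

# CLF timestamps look like "21/Jul/2003:10:15:32 -0500"
CLF_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"

def to_iso(clf_timestamp):
    """Convert a CLF timestamp string to ISO 8601, as a column mapping might."""
    return datetime.strptime(clf_timestamp, CLF_TIME_FORMAT).isoformat()

print(to_iso("21/Jul/2003:10:15:32 -0500"))
# prints: 2003-07-21T10:15:32-05:00
```

A transformation like this is also where a star schema would typically derive date-dimension attributes (year, month, day of week) from the raw timestamp.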