In our recent articles published on this forum, we have been discussing Business Intelligence based features incorporated into SQL Server 2008 R2 Integration Services. Continuing this approach, we will now turn our attention to two closely related Data Flow components – Row Sampling Transformation and Percentage Sampling Transformation, which provide data sampling functionality. This functionality is useful in a number of Analysis Services and Data Mining-based scenarios (although it is important to note that their applicability extends to other areas, wherever selecting a random portion of an existing data set is needed).
The primary purpose of both of these transformations is to facilitate the creation of a group of records of arbitrarily chosen size that constitutes a fairly accurate representation of an underlying data source. In the first case, this size is set to be equal to a provided integer number, while the second one calculates it according to the specified percentage. The pseudo-random nature of each sample is accomplished through an algorithm utilizing a seed value, which can be either obtained from the current tick count (number of milliseconds since the operating system startup) or assigned explicitly (via the graphical interface or programming methods) to the corresponding component property. The former produces a different outcome during each package execution while the latter allows you to yield the same record selection as many times as needed.
These characteristics are well suited for development and testing environments, where the ability to evaluate performance or data trends is desired but processing the entire content of data source is not feasible or practical. They also come in handy when training and evaluating Data Mining models.
Since our primary objective is demonstrating the functionality of both transformations, we will (for the sake of simplicity) leverage a predefined data set representing the business operations of AdventureWorks – a fictitious company manufacturing and selling bicycles (with sample databases downloadable from the CodePlex web site, where you will also find a detailed procedure describing their deployment). The resulting installation includes AdventureWorks database, containing [Sales].[vSalesPerson] view, which will serve as our data source.
Once the database is set up and online, launch Business Intelligence Development Studio and create a new project based on the Integration Services template. Switch to the Data Flow tab and activate the hyperlink displayed in the center of the main window stating No Data Flow task has been added to this package. Click here to add a new Data Flow task.
With the Data Flow task in place and active, drag an OLE DB Source component from the Toolbox onto the Designer interface. Display its Editor window by selecting the Edit… item from its context-sensitive menu. In the OLE DB Connection Manager section, click on the New… command button and point to the target instance of the AdventureWorks database. Verify that the operation has been successful by taking advantage of the Test Connection command button. After you have defined the new OLE DB connection manager, ensure that the Table or View entry appears in the Data access mode listbox, and select [Sales].[vSalesPerson] as the Name of the table or the view entry. Click on the Preview… command button to examine query results and close the OLE DB Source Editor.
Next, drag the Row Sampling icon from the Toolbox onto the Designer window, position it directly underneath the OLE DB Source component, and extend the Data Flow Path so the two of them become connected. Display the Editor window of the newly created transformation. In its Sampling section, you will find a spin box controlling the sample size (labeled Number of rows), two textboxes allowing you to assign names to sample and unselected outputs (referred to as Sample output name and Unselected output name with the default names of Sampling Selected Output and Sampling Unselected Output, respectively), as well as the combination of a checkbox and associated spin box that activate the use of random seed and set its value. As mentioned earlier, by employing this option and setting the seed to some arbitrary but constant value, you have the ability to ensure that each consecutive package execution will yield identical output (which might be useful if you are performing repetitive tests, which require consistent input). The other section, titled Columns allows you to exclude individual columns from both outputs.
Since our primary intention is to demonstrate characteristics of Row Sampling transformation, we will employ Flat File Destination to capture both of its outputs (although, when dealing with Business Intelligence processing, you would likely redirect them to a Data Mining model for testing or training). Drag its icon from the Toolbox, position it underneath the Row Sampling transformation, and extend the Data Flow Path from the latter until they are connected. When prompted for Input Output Selection, pick Sampling Selected Output in the output listbox, leaving Flat File Destination Input in the Input listbox unchanged. Display the Editor window of the Flat File Destination (via its context sensitive menu entry) and point to an arbitrarily chosen file by creating Flat File Connection Manager. Repeat the same sequence of steps to create another Flat File Destination and its corresponding Connection Manager to provide storage for the Sampling Unselected Output. Once you execute the package, you should be able to confirm that the transformation is functioning as intended (either by examining the number of rows being displayed for each of the data flow paths within the Designer window or by reviewing the content of the text files where output rows are stored).
An equivalent setup can be created using Percentage Sampling transformation. Most of the properties exposed in its Editor window are identical (both facilitate assigning an arbitrary seed or utilize one calculated based on the tick count, as well as generate two complementing outputs), although, as expected, in this case, Number of rows spinbox is replaced by Percentage of rows.