Database Journal
MS SQL Oracle DB2 Access MySQL PostgreSQL Sybase PHP SQL Etc SQL Scripts & Samples Tips Database Forum Rss Feed

» Database Journal Home
» Database Articles
» Database Tutorials
MS Access
SQL Scripts & Samples
» Database Forum
» Slideshows
Free Newsletters:

News Via RSS Feed

Rss Feed

Database Journal |DBA Support |SQLCourse |SQLCourse2

Featured Database Articles


Posted Jul 24, 2007

DB2 9 and Microsoft Excel 2007 Part 2: Working with the DB2 Data... - Page 2

By Paul Zikopoulos

Data validation

In the previous figures, you can see that there is quite a range of values for the SALARY column and we’ve been able to easily categorize or identify them using various graphical, icon, or color formatting. What happens if you are working with a very large amount of data, however, and you want to exclude certain ranges? One option would be to generate a query such that the data returned to Excel 2007 was not returned to the worksheet. Excel 2007 offers an alternative if you don’t have the luxury of receiving data that matches the exact ranges you want to investigate. (You can also use the data validation feature to force entered data to be chosen from a drop-down list of values you want to specify.)

For example, let’s assume you received data but didn’t want to analyze salary data that was less than $25,000 or more than $90,000 since these ranges are considered outliner data and under the discretion of the Human Resources department.

The first step in data validation is to set the rules to exclude outlying data. To configure a set of data validation rules in Excel 2007, perform the following steps:

1.  Select the Salary column and click Data>Data Validation:

2.  The Data Validation window opens. Use this window to configure the outlying data rules, or value-check entered values, and so on. For this example, set Allow to Decimal, Minimum to 25,000, and Maximum to 90,000:

Note: If the values you want to use as a lower or upper boundary are part of the data set, you can use the cell selector () to select the value directly from the worksheet.

The Input Message and Error Alert tabs are used for parameter entry validation; however, this capability is outside the scope of this article.

After you define validation rules for your data, nothing happens to your data until you specifically call the data validation function that will interact with your rules and validate the data.

3.  Highlight the data you want validated by clicking Data>Data Validation>Circle Invalid Data. When you select this option, you can see that Excel 2007 applies the data validation rules set up in Step 2 to the data in the worksheet, as shown below:

Notice in the previous figure that there are two red circles (you can configure this color) around the outlying data as defined by our rules.

4.  Clear the data validation rules by selecting Data>Data Validation>Clear Validation Circles.

Removing duplicate data

Sometimes your data may contain duplicate values and you don’t want these to be used as part of your data analysis. You can address this programmatically at the database tier using SQL (shown below); however, doing so requires using a database query from within Excel or working with the database administrator (DBA) to can a query as a view or something similar.

In previous versions of Excel, removing duplicate data once the data was inserted into your Excel worksheet was a very manual task. In fact, depending on how the data was delivered to staff workstations, a DBA had to sometimes write custom scripts to clean up the data.

In Excel 2007, things get a lot easier: If you want to remove duplicate values for the DEPT column in the STAFF table, perform the following steps:

1.  Select the column from which you want to remove duplicates (in this case, DEPT):

Tip: You don’t have to select the column to remove duplicate values from since you can specify the column directly in Step 2; however, I recommend that clients use this approach because it keeps them visually focused on the task at hand and what data they are going to remove. After all, this data is likely to be used in some sort of analysis process or decision so you want to minimize the chance of data errors and this is a good way to do that.

2.  Select the Design ribbon (it’s under the Table Tools menu that appears when the table object has focus in Excel) and click Remove Duplicates:

When the Remove Duplicates window opens, select only the columns from which you want to remove the duplicate values. In this simple example, I just want to ensure that only distinct values are returned for the STAFF table as a function of the department. In other words, there can only be one row of data for each department; therefore, in the previous figure, I selected DEPT.

You can also see in the previous figure that I left the My Data has headers check box select which is the default. This is the reason why my table’s column headings are shown in the Columns section of the Remove Duplicates window. If you don’t select this check box, you’ll have to work with your data using generic headings, as shown below, which greatly increases the chances of error; for this reason, I recommend always using this option:

3.  Click OK. Any rows that have repeating DEPT values are removed from the data set and a pop-up window lets you know just how many rows were removed and how many remain. For our example, the data set now looks like this:

If you select more than one column in the Remove Duplicates window, Excel 2007 will treat them as a composite duplication key. Quite simply, this means that it’s the combination of the selected columns that determine if the value is unique or not. For example, a composite duplication key of the NAME and DEPT columns wouldn’t remove any rows in our example since there are no two employees with the same name in the same department:

However, if you selected DEPT and JOB as the composite duplication key, only the first employee in the data set that had the same job in the same department would remain in the data set, as shown below:

One thing I don’t like with this feature is that you can’t select multiple columns that aren’t adjacent in the data set. For example, if I highlighted the DEPT and the YEARS columns and wanted to use both as a composite duplication key, I would receive the following error message:

The workaround is simple: reorder the columns. The best way to reorder columns in Excel so that you can create a composite duplication key is to copy and paste one of the columns that will be used for the composite duplication key and move it to its new location, and then delete the original column. Alternatively, you could write an SQL statement in a database query and make that your data server’s connection object, or base the connection on a view that you create that projects (reorders) the columns in the required order.

Wrapping it up...

In this article, I showed you some fancy features that are new in Excel 2007. What’s more, you can use these features to seamlessly work with your DB2 data. In my next article, I'll show you how you can use a wizard to build a query that retrieves a subset of data from your connection object. This may be an easier approach to remove duplicate data our outliner values. Again, it all depends on how the data is delivered to you, so I want to give you a number of tools that you can use to solve issues as they arise.

» See All Articles by Columnist Paul C. Zikopoulos

About the Author

Paul C. Zikopoulos, BA, MBA, is an award-winning writer and speaker with the IBM Database Competitive Technology team. He has more than ten years of experience with DB2 and has written more than one hundred articles and several books about it. Paul has co-authored the books: DB2 9 New Features, Information on Demand, IBM DB2 Version 9: New Features, DB2 Version 8: The Official Guide, DB2: The Complete Reference, DB2 Fundamentals Certification for Dummies, DB2 for Dummies, and A DBA's Guide to Databases on Linux. Paul is a DB2 Certified Advanced Technical Expert (DRDA and Cluster/EEE) and a DB2 Certified Solutions Expert (Business Intelligence and Database Administration). In his spare time, he enjoys all sorts of sporting activities, running with his dog Chachi, and trying to figure out the world according to Chloë – his new daughter. You can reach him at: mailto:paulz_ibm@msn.com.


IBM and DB2 are registered trademarks of International Business Machines Corporation in the United States, other countries, or both.

Microsoft is a trademark of Microsoft Corporation in the United States, other countries, or both.

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product, or service names may be trademarks or service marks of others.

Copyright International Business Machines Corporation, 2007. All rights reserved.


The opinions, solutions, and advice in this article are from the author’s experiences and are not intended to represent official communication from IBM or an endorsement of any products listed within. Neither the author nor IBM is liable for any of the contents in this article. The accuracy of the information in this article is based on the author’s knowledge at the time of writing.

DB2 Archives