Data validation
In the previous figures, you can see that there is quite a
range of values for the SALARY column and weve been able to easily categorize
or identify them using various graphical, icon, or color formatting. What
happens if you are working with a very large amount of data, however, and you
want to exclude certain ranges? One option would be to generate a query such
that the data returned to Excel 2007 was not returned to the worksheet. Excel
2007 offers an alternative if you dont have the luxury of receiving data that
matches the exact ranges you want to investigate. (You can also use the data
validation feature to force entered data to be chosen from a drop-down list of
values you want to specify.)
For example, lets assume you received data but didnt want
to analyze salary data that was less than $25,000 or more than $90,000 since
these ranges are considered outliner data and under the discretion of
the Human Resources department.
The first step in data validation is to set the rules to exclude
outlying data. To configure a set of data validation rules in Excel 2007,
perform the following steps:
1. Select
the Salary column and click Data>Data
Validation:
2. The
Data Validation window opens. Use this window to configure the outlying data
rules, or value-check entered values, and so on. For this example, set Allow
to Decimal, Minimum to 25,000, and Maximum to 90,000:
Note: If the values
you want to use as a lower or upper boundary are part of the data set, you can
use the cell selector (
) to
select the value directly from the worksheet.
The Input Message and Error
Alert tabs are used for parameter entry validation; however, this
capability is outside the scope of this article.
After you define validation rules
for your data, nothing happens to your data until you specifically call the
data validation function that will interact with your rules and validate the
data.
3. Highlight
the data you want validated by clicking Data>Data
Validation>Circle Invalid
Data. When you select this option, you can see that Excel 2007 applies the
data validation rules set up in Step 2 to the data in the worksheet, as shown
below:
Notice in the previous figure that
there are two red circles (you can configure this color) around the outlying
data as defined by our rules.
4. Clear
the data validation rules by selecting Data>Data Validation>Clear
Validation Circles.
Removing duplicate data
Sometimes your data may contain duplicate values and you
dont want these to be used as part of your data analysis. You can address this
programmatically at the database tier using SQL (shown below); however, doing
so requires using a database query from within Excel or working with the
database administrator (DBA) to can a query as a view or something similar.
In previous versions of Excel, removing duplicate data once
the data was inserted into your Excel worksheet was a very manual task. In
fact, depending on how the data was delivered to staff workstations, a DBA had
to sometimes write custom scripts to clean up the data.
In Excel 2007, things get a lot easier: If you want to
remove duplicate values for the DEPT column in the STAFF table, perform the
following steps:
1. Select
the column from which you want to remove duplicates (in this case, DEPT):
Tip: You dont have
to select the column to remove duplicate values from since you can specify the
column directly in Step 2; however, I recommend that clients use this approach
because it keeps them visually focused on the task at hand and what data they
are going to remove. After all, this data is likely to be used in some sort of
analysis process or decision so you want to minimize the chance of data errors
and this is a good way to do that.
2. Select
the Design ribbon (its under the Table Tools menu that appears
when the table object has focus in Excel) and click Remove Duplicates:
When the Remove Duplicates window
opens, select only the columns from which you want to remove the duplicate
values. In this simple example, I just want to ensure that only distinct values
are returned for the STAFF table as a function of the department. In other
words, there can only be one row of data for each department; therefore, in the
previous figure, I selected DEPT.
You can also see in the previous
figure that I left the My Data has headers check box select which is
the default. This is the reason why my tables column headings are shown in the
Columns section of the Remove Duplicates window. If you dont select
this check box, youll have to work with your data using generic headings, as
shown below, which greatly increases the chances of error; for this reason, I
recommend always using this option:
3. Click
OK. Any rows that have repeating DEPT values are removed from the data
set and a pop-up window lets you know just how many rows were removed and how
many remain. For our example, the data set now looks like this:
If you select more than one column
in the Remove Duplicates window, Excel 2007 will treat them as a composite duplication
key. Quite simply, this means that its the combination of the selected columns
that determine if the value is unique or not. For example, a composite
duplication key of the NAME and DEPT columns wouldnt remove any rows in our
example since there are no two employees with the same name in the same
department:
However, if you selected DEPT and JOB
as the composite duplication key, only the first employee in the data set that had
the same job in the same department would remain in the data set, as shown
below:
One thing I dont like with this
feature is that you cant select multiple columns that arent adjacent in the
data set. For example, if I highlighted the DEPT and the YEARS columns and
wanted to use both as a composite duplication key, I would receive the
following error message:
The workaround is simple: reorder
the columns. The best way to reorder columns in Excel so that you can create a
composite duplication key is to copy and paste one of the columns that will be
used for the composite duplication key and move it to its new location, and then
delete the original column. Alternatively, you could write an SQL statement in
a database query and make that your data servers connection object, or base
the connection on a view that you create that projects (reorders) the columns
in the required order.
Wrapping it up...
In this article, I showed you some fancy features that are
new in Excel 2007. Whats more, you can use these features to seamlessly work with
your DB2 data. In my next article, I'll show you how you can use a wizard to build a query that retrieves a subset of
data from your connection object. This may be an easier approach to remove duplicate data our outliner values.
Again, it all depends on how the data is delivered to you, so I want to give you a number of tools that you can use to solve issues as they arise.
»
See All Articles by Columnist Paul C. Zikopoulos
About the Author
Paul C. Zikopoulos, BA, MBA, is an
award-winning writer and speaker with the IBM Database Competitive Technology
team. He has more than ten years of experience with DB2 and has written more
than one hundred articles and several books about it. Paul has co-authored the
books: DB2 9 New Features, Information on Demand, IBM DB2 Version 9: New
Features, DB2 Version 8: The Official Guide, DB2: The Complete Reference, DB2
Fundamentals Certification for Dummies, DB2 for Dummies, and A DBA's Guide to
Databases on Linux. Paul is a DB2 Certified Advanced Technical Expert (DRDA and
Cluster/EEE) and a DB2 Certified Solutions Expert (Business Intelligence and
Database Administration). In his spare time, he enjoys all sorts of sporting
activities, running with his dog Chachi, and trying to figure out the world
according to Chloë his new daughter. You can reach him at: mailto:paulz_ibm@msn.com.
Trademarks
IBM and DB2 are registered trademarks of International
Business Machines Corporation in the United States, other countries, or both.
Microsoft is a trademark of Microsoft Corporation in the United States, other countries, or both.
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product, or service names may be trademarks
or service marks of others.
Copyright International Business Machines Corporation, 2007.
All rights reserved.
Disclaimer
The opinions, solutions, and advice in this article are from
the authors experiences and are not intended to represent official
communication from IBM or an endorsement of any products listed within. Neither
the author nor IBM is liable for any of the contents in this article. The
accuracy of the information in this article is based on the authors knowledge
at the time of writing.