One of Oracle’s most
powerful but somewhat under the radar tool is Warehouse Builder. Like several
of Oracle’s non-RDBMS products, it didn’t get off to exactly the best start
possible with respect to being user friendly, bug free, easy to use/install,
documented, and well, you get the idea. However, in later versions, Warehouse
Builder has evolved into an extremely feature rich, highly functional
application which enables users to do some pretty amazing things without having
to get wrapped up in all of the gory details.
This article assumes you
already have OWB installed and are trying to figure out how to build an
automated ETL process. What are the step-by-step details to make this happen?
Background
Warehouse Builder,
frequently referred to as OWB, can automate the loading of flat files into a
database. Many DBAs are intimately familiar with using a combination of
SQL*Loader and shell scripting, plus having some cron action tossed in here and
there. OWB does that (and much more) as well, by abstracting this process via a
wizard driven, GUI interface with lots of point and click functionality.
Through its Design Center and Control Center interfaces, a user can design and deploy the ETL
process (and we’ll just focus on the loading part, that is, what it takes to
get the contents of a delimited flat file loaded into a table, no changes to
data along the way). And the deployment is not limited to the server you’re
working on at the time. OWB enables you to design a process on one server and
then deploy the steps to another server. Or, to more than one server if you so
desire.
What is the procedural
concept behind this operation? Enumerating the steps to make this happen helps
to provide a framework.
1.
Identify the source file,
including the location and the nature of the data within.
2.
Create an external table which
serves as the data dictionary container.
3.
Identify, and if necessary,
create the “real” table in the database.
4.
Make all of this happen on a
scheduled basis, or as a one time deal.
OWB’s approach to this
process is to use the metadata about these objects and tie them together via a
mapping and a workflow. The workflow, in fact, can be created as a visual
artifact, that is, OWB will produce a workflow diagram, which, interestingly
enough, is exactly what OWB uses behind the scenes: Oracle Workflow.
Think of each piece in this
process as being an object: the file, the location of the file, the external
table, the table, the mapping of the flat file to the external table, the job
to run this at the end, and so on. Everything is an object, and objects can be
tied together via hierarchies and dependencies. Every object type lives in a
module. Since this tool is Java-based, the object oriented design makes sense.
Each object is an instance or instantiation of some method, generally speaking.
Scenario
A typical ETL scenario could
involve uploading a flat file on a recurring basis (of which, you’ll also be
able to execute just the mapping piece by itself). If you break down the
process into several steps involving different parts of the project tree, the
overall task is easy to understand. As a point of reference, we’ll start at the
bottom and work our way up the hierarchy, with one exception. Within a project,
to get started, you will need to have a module created under Oracle.
As a tip, remember that
almost every category involves the same two steps: create a module and import
metadata. Also the examples are not always based on the same table (something
related to customers and countries were used).
An expanded project tree is
shown below.
The areas of interest in the
project tree, in the order in which we want to build the ETL process are:
- Files
- External Tables
- Tables
- Mappings
Once the mapping step is
complete, you need to drop down to Process Flows and Schedules.
Create a new module under
Files and identify the location of the file.
Finish the Create Module step
and launch the Import Metadata wizard. This is where you tell OWB about the
contents of the flat file, which in turn launches the Flat File Sample wizard.
Don’t forget to add a date mask for date fields. After the file has been
identified and sampled, you’re ready to create the metadata about an external
table. Step 1 of the Flat File Sample wizard is shown below.
When identifying the field
delimiter, you can manually enter something other than what the drop down shows
(e.g., a pipe). Once the external table definition is complete, you could
deploy the external table right now and have that table created in the target
schema.
In the next step, we need to
create a table. If the table has to be created from scratch, open the Data
Object Editor and design it. Ensure the column definitions match those of the
corresponding external table. As a tip, deploy the external table first, and
then in SQL*Plus, create the table via CTAS from the external table (just the
table definition, no data).
The next step is to map the
contents of the external table into the real table. Create a new mapping and
map the columns, as the picture below suggests.
Deploy the mapping and if
successful, it is at this point where you could manually run (Start) the load
of data from the external table into the real table. This is verified by
entering the Control Center and looking at jobs run.
To put this into an
automated workflow, a new Process Module is what’s needed. Create the process
flow module, package and process flow, and launch into the Process Editor. Add
in the mapping and end states, with a result like this:
Create a new schedule once
the process flow diagram is finished. The schedule will be generic, that is, it
is not tied to anything. You have to go back to the mapping and associate the
mapping to the schedule. Once scheduled, the loading of the stage table from a
flat file is automated.
In Closing
Although this was a somewhat
higher level overview of what it takes to automate a load, it did cover each
and every project item you need to address for this to work. For a one time
file, and even for the same file or set of files day in and day out, it may be
easier to shell script this work flow and schedule it via a cron job. On the
other hand, once this process is configured, it can be deployed from a
development to QA or production environment. And don’t let the “warehouse” part
of this tool lead you to believe this is only for data warehouses. If you need
to map, design and schedule an ETL load in a different type of database,
Warehouse Builder will work just fine there as well.