Datawarehousing ( ETL) Enhancements in Oracle 9i
August 20, 2001
Datawarehouse environments face the challenge of exchanging, integrating and consolidating huge amounts of data over many systems to provide a unified information base for Enterprise Intelligence. This process in general is refererred to as ETL ( Extract, Transform and Load ). Extract process is very often involved with the extraction of the data from many source systems. For example a content network like internet.com would be required to consolidate the logs that it generates across hundreds of sites that area part of its network to create intelligence that can be used for managing the visitor information. The most time consuming part of this process is the Transformation and Loading process where multiple filters are applied and the data is validated against an set of known information and loaded into the current enterprise warehouse system. This has to be done in a very quick and scaleable manner without affecting the source and Warehouse systems.
Most ETL processes today are done outside the database using 3rd party software that can provide specific capabilities to access disparate source systems before the data is transformed. Once the data is transformed, current Oracle Warehouses use parallel insertion or loading techniques. The primary role of the database at this point are to manage the rows, indexes and constraints. Some ETL processes use a serial approach of Extract Load and Transform, thereby using more of the database capabilities to perform the transformation of the data. In this approach the data is extracted and loaded into a staging area using 3rd party tools and then PL/SQL or Java is used within the oracle database system to transform the data before its posted to the final tables. The inherent disadvantages with this approach are many including the inability of the ETL software to scale, No control over recovery and restartability in case of a failed ETL process and the ever present nightmare of managing home grown processes.
Oracle 9i introduces the new paradigm of transform-while-loading replacing the age old serial transform-then-load and load-then-transform processes. In the new approach the database becomes an integral part of the ETL process and some of the necessary steps become obsolete while others can be remodeled to enhance the process. The database now offers a toolkit that enables the most appropriate ETL process flow for a specific customer need and not dictate or constrain it from a technical perspective. 9i implements the following new functionality to help the process be faster and more efficient.
Oracle Change Data Capture Framework
can be used to optimize the extraction portion of ETL process and build the basic maintenance framework for the repetitive scheduled execution of the complete ETL process. OCDC has the capability to capture changed data from oracle data sources. Using the proven Replication Framework and the Logminer technologies oracle can perform either synchronous or asynchronous change data capture when the source systems are using Oracle. An extensible API is provided for 3rd party vendors to integrate external data drivers for non-Oracle sources into the Framework.
feature allows external data sources like flat files to be exposed to users like any other data residing in a regular table. The only limitation is that this table is read only and acts as a virtual table that can be used in SQL, PL/SQL and Java operations without having to first load the data into the database. This enables the merging of the loading process with the transformation process by reducing any interruption of data streaming and staging the data inside the database.
Multi Table Insert
is a cool new feature that every dba and developer can appreciate. Until Oracle 8i, you had to run throught the data multiple times or write procedural code to perform inserts into multiple tables in one pass. 9i's Multi Table Insert feature provides a feature to insert data into multiple tables in one pass.
Other notable features are Upsert, which is a sql extension that provides the functionality to conditionally update or insert a row into the database. More enhancements, allows a database to have different block sizes when transporting tablespaces between databases. Starting with 9i, the source and target systems do not have to be the same block size.
With the addition of these new features Oracle enhances its capability as a data-processing platform in a data warehousing environment.