SQL’s Optimized Hierarchical Data Processing Driven by its Data Structure


Learn how hierarchical data processing can be optimally driven automatically by its hierarchical structure semantics. When applied to SQL hierarchical data processing, this structured semantics offers optimization beyond current relational processing. In addition, this hierarchical optimization works with SQL dynamic processing and offers new relational capabilities.

My previous Database Journal article covered the ANSI SQL Hierarchical Data
Processing Basics
. This hierarchical data
processing
article will show how hierarchical processing can be optimally
driven automatically by its hierarchical structure semantics. When applied to
SQL hierarchical data processing, this structured semantics offers optimization
beyond current relational processing. In addition, this hierarchical
optimization works with SQL dynamic processing and offers new relational capabilities
that will be covered in this article.

Hierarchical Structures and Processing

The Left Join operation preserves its left argument and not its
right argument for unmatched joins. This causes it to operate hierarchically
allowing the left parent to exist even with no matching right child argument,
but the reverse is not possible; children cannot exist without their parent. As
explained in the previous hierarchical structure processing basic articles, the hierarchical
structure in Figure 1, where tables are represented as hierarchical nodes, can
be defined hierarchically by a series of LEFT Outer Joins. These LEFT Joins
model hierarchical structures using their hierarchical syntax to model the
structure and specify its associated hierarchical semantics, which defines its parent/child
data preservation semantics. This allows hierarchical views’ data modeling
syntax and associated semantics to be supplied directly to the relational
engine when its view is processed at query invocation. This automatically turns
your ANSI SQL processor into a powerful full hierarchical processor.

Hierarchical Data Processing - Global Structure

Figure 1: Global Structure

Notice in the hierarchical view definition above in Figure 1
the use of the newer outer join ON clause to separately specify the link points
between hierarchical pairs of nodes. This offers an exact and unambiguous
hierarchical mapping that can support multiple paths. The ANSI SQL LEFT Outer Join
hierarchical structure definition doubles as structure metadata to specify how
the data is hierarchically processed. This is possible because hierarchical data
structure processing must follow hierarchical data preservation principles inherent
in the data structure. These LEFT Outer Join definitions of hierarchical
structures can be automatically generated from the information in hierarchical metadata
sources such as XML schemas or even IBM’s IMS and VSAM database definitions.
This is also useful because all forms of hierarchical structures, contiguous or
linked and physical or logical structures, have the same hierarchical
processing principles and can be seamlessly processed heterogeneously with standard
hierarchical processing. All that is necessary are the addition of specific
access modules. These access modules will be different from similar access
modules used today because they will be preserving the semantics of the
hierarchical structures on retrieval because the data will be processed
hierarchically.

Hierarchical Optimization

Because of a hierarchical structure’s natural data
preservation, a very powerful optimization can be easily applied at query invocation.
This optimization is based on the data referenced in nodes and specifies that
only the referenced nodes at query execution and any nodes on the path to
referenced nodes are necessary for processing. All other nodes can be
eliminated entirely from processing. It is obvious that referenced nodes are
required. The reason that unreferenced nodes on the path to referenced nodes
are needed is that the paths to referenced nodes are needed to preserve the
semantics of the access path to reach the desired data. The data structure below
in Figure 2 is optimized based on the query shown referencing only nodes A, C
and F.

The arrows in Figure 2 indicate LEFT Outer Joins for
specific nodes that can be removed or ignored for the active query shown without
changing the semantics or results of the query. This optimization does preserve
the semantics of the query, but also has a beneficial side effect that can
reduce unneeded data replications caused by joins with unneeded tables or
nodes. This can actually help reduce problems caused by these additional
extraneous data replications. Hierarchical structures designed specifically for
hierarchical data processing always remove all data replications. It is helpful
to know that replicated data are extraneous while duplicate data are valid and meaningful.
They should be processed differently.

As shown above in Figure 1, the newer more flexible ON
clause replaces the single WHERE clause by using an ON clause at every join operation.
This causes a reference to every node, but this is not taken into consideration
for this optimization because it is a self-reference and if nodes are not
necessary, their self-reference is removed, as is the unneeded node. This is
supported by hierarchical processing’s hierarchical data preservation. This demonstrates,
in Figure 2, how easily the LEFT Outer Join hierarchical view can remove
unneeded nodes for the dynamic rebuilding of the run time query.

Hierarchical Data Processing - Optimized Access

Figure 2: Optimized Access

The hierarchical optimization, in Figure 2 above, would not
normally be recognized by standard relational processing, which has no idea it
is operating hierarchically. As previously mentioned, hierarchical optimization
can be performed immediately when the query is invoked at run time. This is an
extremely powerful semantic optimization, but does not replace standard
relational optimization, which will be performed naturally after the
hierarchical optimization is performed. This previously performed hierarchical
optimization can have a further positive effect on the following standard
relational optimization. It can enable standard relational optimization to
perform more effectively by simplifying the run time query in ways it could not
detect making relational optimizations more clear and detectable.

Optimization Enables Global Views

A significant benefit of hierarchical optimization is that it
enables global hierarchical views to be used without any additional overhead. Normally
views that have unnecessary tables at execution cause a great deal of overhead.
The previous solution for this has been for the user to code many sub views,
each tailored specifically to the required result. Besides having to create many
different sub views for the same data structure and constantly adding views,
this is also difficult for the user to be aware of all the different views.
Hierarchical optimization allows global views with all nodes specified so that
it can be dynamically capable of processing any number of different sub views
without overhead. This is because each query is dynamically optimized to match
the needs of the active query. The query above in Figure 2 is an example of a
global view being automatically utilized.

Schema-free Processing Enabled by Global Views

Another benefit of hierarchical optimization is that it also
enables schema-free processing supported naturally by SQL navigationless processing,
which was covered in my previous article. Schema-free processing allows queries
to be specified without user navigation when used for XML data. This also means
the user does not need to know the data structure and this enables non-technical
users to access XML data. Since users do not need to know the structure for
these powerful schema-free queries or where the data is located in the
structure, they need to be free to specify data anywhere in the structure and
this requires a global view. The query in Figure 2 above is also an example of
a schema-free query allowing its use without the user’s need to navigate or
know the data structure.

SELECT List Controls Processing Dynamically

It is important to point out that the SQL SELECT clause and
its variable SELECT list supports full dynamic processing. This dynamically controls
optimization, global views, and schema-free processing. The SELECT list
dynamically specifies the output data and this directly affects the processing
that is required. Since SQL is nonprocedural, there is no user coding
necessarily for processing. In this case, the processing is established
dynamically and is optimized only to cover the required results, which are
inferred by the current dynamically specified output. In other words, the variable
SELECT list dynamically controls the required processing as shown in the
optimized easily rewritten Left Outer Join string in Figure 3 below. It limits
the internal processing for the query and you can notice that it does not
contain any processing for nodes D, E and G.

Hierarchical Data Processing - Optimized Processing

Figure 3: Optimized Processing

SELECT List More Powerful Than Realized

This dynamic optimized processing may not seem that
impressive because with SQL it is easily applied automatically, but when
compared to XQuery it will become clear how much more automatic and powerful SQL
is. The processing of the SQL query above in Figure 3 is truly nonprocedural. This
process is inferred from the hierarchical data structure defined in the LEFT
Outer Join view and its natural processing principles. It is dynamically controlled
from query to query by the variable SELECT list. Most books on XQuery mention
that it contains the three key processing elements of SQL. These are SELECT,
FROM and WHERE. XQuery has a WHERE clause data filter and has something similar
to the FROM clause, but it does not have anything that approaches the variable
and automatic power of SQL’s dynamically variable SELECT list.

SQL’s powerful variable SELECT list with its dynamic
processing capability is not found in XQuery. XQuery through the use of its FOR
syntax requires the user to set up the proper hierarchical nesting looping
structure to match the desired hierarchical processing required. The data items
required to generate the output values must be placed accurately at their correct
nested looping levels. This is procedural coding and is not performed
automatically or nonprocedurally. Any changes to the query represented in this fixed
code must be procedurally adjusted and tested by the user. In SQL, this could
be achieved simply by altering the SELECT list. This is real nonprocedural and
declarative power.

Conclusion

An important test for proper query optimization is that it
does not negatively affect the result or its external operation. Interestingly,
the powerful hierarchical optimization described in this article also introduced
two beneficial side effects. These were the reduction of superfluous relational
data replication and then there was the automatic creation of the global view
and the further capabilities it enabled. These were all seamless beneficial additions
and stayed within the tenets of proper optimization.

Additional Resources

Creating Hierarchical Data Structure Mashups

Navigationless Database XML: Hierarchical Data Processing

»


See All Articles by Columnist

Michael M. David

Michael M. David
Michael M. David
Michael M. David is the founder of Advanced Data Access Technologies, Inc. Previously, he was the lead XML architect for NCR/Teradata, and served as their representative to the ANSI SQLX Group. Before that he was a staff scientist for Teradata and designed high level multi-featured SQL utilities. He has more than 25 years of experience researching and designing commercial nonprocedural heterogeneous database hierarchical query processing products using flat, relational and hierarchical data. From this experience, he authored the book Advanced ANSI SQL Data Modeling and Structure Processing, as well as numerous papers and articles on this subject.

Latest Articles