Database Hierarchical Structure Introduction
The SQL-92 standard inadvertently introduced the capability to perform full
hierarchical data processing when it introduced the LEFT Outer Join operation.
The reasons this powerful capability has gone unused were covered in my previous
Database Journal article, Ten Problems with XQuery and the SQL/XML Standard.
This article explains that natural hierarchical processing capability. Current
built-in hierarchical processing support in relational databases is proprietary
and very limited. Externally added hierarchical processing is not automatic; it
is user driven through functions. Database-level hierarchical support for XML
and for legacy hierarchical databases such as IBM’s IMS and Structured VSAM
requires support for both multiple node types and multiple data occurrences.
This level of hierarchical processing is not practical when performed
procedurally and externally; it must be performed internally and automatically.
This article covers that as well.
Hierarchical Structure Basics
Hierarchical structure principles involve hierarchical data preservation based
on parent and child relationships. Parents can exist in the database without
children, but children cannot exist without a parent. If a parent with children
is deleted from the database, the related children and all of their descendants
are removed as well. A delete command against a virtual structure in memory
would typically work the same way, but would probably not be persisted. When a
virtual hierarchical structure is retrieved and built in memory by selecting
only the desired nodes, the unreferenced nodes are sliced out of the resulting
structure without causing the removal of lower level nodes. This is known
hierarchically as node promotion. It can cause node collection to occur
automatically, where a higher level node gains additional pathways directly
under it. This process changes the structure, but still preserves its basic
structure semantics and integrity. SQL’s SELECT operation works the same way:
its variable SELECT list skips over intervening data types. This is known
relationally as projection.
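The parent and child rules above have a familiar relational counterpart in foreign keys declared with ON DELETE CASCADE. The following is a minimal sketch of that analogy, not a description of how any hierarchical database implements it. It uses Python's sqlite3 module, and the tables A, B and C and their column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- Hypothetical three-level structure: A is the parent of B, B of C.
    CREATE TABLE A (a_id INTEGER PRIMARY KEY);
    CREATE TABLE B (b_id INTEGER PRIMARY KEY,
                    a_id INTEGER NOT NULL REFERENCES A(a_id) ON DELETE CASCADE);
    CREATE TABLE C (c_id INTEGER PRIMARY KEY,
                    b_id INTEGER NOT NULL REFERENCES B(b_id) ON DELETE CASCADE);
    INSERT INTO A VALUES (1), (2);   -- parent 2 has no children
    INSERT INTO B VALUES (10, 1);
    INSERT INTO C VALUES (100, 10);
""")

def counts():
    """Return the row counts of A, B and C as a tuple."""
    return tuple(
        conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        for table in ("A", "B", "C")
    )

before = counts()
# Deleting parent 1 removes its child in B and its grandchild in C.
conn.execute("DELETE FROM A WHERE a_id = 1")
after = counts()

# A child cannot exist without a parent: there is no A row with a_id 99.
try:
    conn.execute("INSERT INTO B VALUES (11, 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(before, after, orphan_rejected)  # (2, 1, 1) (1, 0, 0) True
```

Parent 2 survives with no children, while the orphan child is rejected, which mirrors the preservation rule described above.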
Hierarchical structures acquired a reputation for being fixed and inflexible
when relational databases became popular. That reputation did not fully take
into account logical hierarchical structures, such as the virtual structures
mentioned above. Logical hierarchical structures have unlimited flexibility and
even more capability for dynamic changes to the structure than relational
structures. They can be used to model relational tables as a logical
hierarchical structure. Flexible logical hierarchical structures also follow
precisely the same hierarchical processing principles as fixed physical
hierarchical structures. This capability can be used heavily in extending SQL’s
hierarchical processing to XML and legacy databases.
Divide and Conquer the Hierarchical Processing Problem
The “M” in XML stands for Markup, and markup is what XML was designed to handle.
The additional use of XML for database data was an afterthought. Database data
has a more fixed use than markup data. Markup data requires user navigation to
get its full, flexible use. Database data, being more fixed and having a more
specific use, does not need to be navigated by the user; it can be accessed
transparently with no user navigation required. This is known as navigationless
access. Unfortunately, this difference in use has not yet been recognized and
utilized. Of even more concern, database data processed as markup data can
produce incorrect results. Making this important distinction and limiting XML’s
database processing to fixed hierarchical processing allows that processing to
operate navigationlessly. It also enables many advanced full hierarchical
processing capabilities that are not possible or practical with user navigation.
SQL Hierarchical Data Structure Modeling
In the SQL Structure Definition in Figure 1, the Left Join operation preserves
the left side argument over the right side. This means node A can exist even if
there are no matching node B items, but node B cannot exist without a matching
node A item. The SQL processing continues left to right, defining the
hierarchical structure. The ON clause specifies the link data points between
nodes; this is why node C is linked to node B and not to node A. Notice in the
SQL Structure Definition that nodes B and D are both linked back to node A, so
node A starts two separate pathways. Each ON clause operates only locally, at
its specific use point and downward, because the left argument side is always
preserved with Left Outer Joins. The SQL WHERE clause does not have this narrow,
local range of effect; its global effect allows it to filter the entire
hierarchical structure. SQL WHERE clause filtering can be added to the end of
the SQL Structure Definition. ON clause filtering corresponds to XPath
filtering, and the SQL WHERE clause corresponds to the XQuery WHERE filtering
operation.
The SQL Structure Definition shown in Figure 1 can be specified in an SQL query
to be processed directly by the SQL processor. The hierarchical structure is
defined by the Left Join syntax, and its hierarchical operation is specified by
the associated Left Join semantics. The SQL Structure Definition can also be
named and defined in an SQL view for easy use and reuse. This is shown later in
Figure 4.
Figure 1: Hierarchical Data Modeling Produces Hierarchical Rowset
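Since the figure itself is not reproduced here, the following sketch approximates the kind of structure definition it describes. It is an illustration under assumptions, not the article's original listing: the tables A through E follow the node names in the text, but the column names and sample data are hypothetical, and SQLite (through Python's sqlite3 module) stands in for the SQL processor.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables for nodes A..E; key and link columns are assumptions.
    CREATE TABLE A (a_id INTEGER PRIMARY KEY, a_val TEXT);
    CREATE TABLE B (b_id INTEGER PRIMARY KEY, a_id INTEGER, b_val TEXT);
    CREATE TABLE C (c_id INTEGER PRIMARY KEY, b_id INTEGER, c_val TEXT);
    CREATE TABLE D (d_id INTEGER PRIMARY KEY, a_id INTEGER, d_val TEXT);
    CREATE TABLE E (e_id INTEGER PRIMARY KEY, d_id INTEGER, e_val TEXT);

    INSERT INTO A VALUES (1, 'a1'), (2, 'a2');
    INSERT INTO B VALUES (10, 1, 'b1');          -- a2 has no B leg
    INSERT INTO C VALUES (100, 10, 'c1');
    INSERT INTO D VALUES (20, 1, 'd1'), (21, 2, 'd2');
    INSERT INTO E VALUES (201, 21, 'e2');        -- d1 has no E below it
""")

# A is the root; B and D both link back to A, so A starts two pathways.
# C links to B and E links to D because their ON clauses say so.
rows = conn.execute("""
    SELECT A.a_val, B.b_val, C.c_val, D.d_val, E.e_val
    FROM A
    LEFT JOIN B ON B.a_id = A.a_id
    LEFT JOIN C ON C.b_id = B.b_id
    LEFT JOIN D ON D.a_id = A.a_id
    LEFT JOIN E ON E.d_id = D.d_id
    ORDER BY A.a_id
""").fetchall()

for row in rows:
    print(row)
# ('a1', 'b1', 'c1', 'd1', None)
# ('a2', None, None, 'd2', 'e2')
```

Node a2 is preserved even though it has no B occurrence, and each leg that ends early is padded with nulls rather than dropped from the rowset.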
Interpreting Hierarchical Rowsets
Figure 1 also demonstrates the most basic hierarchical processing attributes
produced in ANSI SQL: hierarchical data preservation from the left join, and
variable length path creation, with nulls inserted in the rowset to keep the
paths properly aligned. This is shown in the Structured Rowset above, where the
two sibling legs BC and DE are represented independently at different lengths.
The shortened, null-terminated legs show that hierarchical preservation is also
working; without it, those legs would be missing entirely.
Mapping SQL to Hierarchical Processing at a Conceptual Level
In mapping SQL to and from hierarchical processing, relational tables and XML
elements are treated as nodes. Relational columns, XML element strings and XML
element attributes are treated as fields in data nodes. Figure 2 below
demonstrates SQL’s high level hierarchical processing using SQL’s SELECT, FROM,
WHERE syntax and their intuitive hierarchically structured operations. The FROM
clause specifies the input data and its SQL-modeled hierarchical structure. The
SELECT clause specifies the output data, indicating which hierarchically related
nodes are returned along with their data and order. The WHERE clause specifies
the data filtering, which filters the data hierarchically following the modeled
structure. Used together, these operations hierarchically process the input data
and automatically produce hierarchically structured XML that correctly
represents the processed results.
Figure 2: SQL Hierarchical Query Specification and Operation Overview
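To make the mapping concrete, here is a minimal sketch of turning a hierarchical rowset into nested XML, with tables as nodes and columns as fields. It is one possible approach under stated assumptions and not the method used by the author's processor. The tables, columns and the `node` helper are hypothetical. Because sibling legs B and D are joined independently to A, plain SQL returns every B and D combination for a parent, so the sketch uses the key columns to emit each node occurrence only once.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical structure: A is the parent of sibling nodes B and D.
    CREATE TABLE A (a_id INTEGER PRIMARY KEY, a_val TEXT);
    CREATE TABLE B (b_id INTEGER PRIMARY KEY, a_id INTEGER, b_val TEXT);
    CREATE TABLE D (d_id INTEGER PRIMARY KEY, a_id INTEGER, d_val TEXT);
    INSERT INTO A VALUES (1, 'a1'), (2, 'a2');
    INSERT INTO B VALUES (10, 1, 'b1'), (11, 1, 'b2');
    INSERT INTO D VALUES (20, 1, 'd1'), (21, 1, 'd2');
""")

rows = conn.execute("""
    SELECT A.a_id, A.a_val, B.b_id, B.b_val, D.d_id, D.d_val
    FROM A
    LEFT JOIN B ON B.a_id = A.a_id
    LEFT JOIN D ON D.a_id = A.a_id
    ORDER BY A.a_id, B.b_id, D.d_id
""").fetchall()

root = ET.Element("result")
seen = {}  # (node name, key) -> element already created for that occurrence

def node(parent, tag, key, val):
    """Create the element for (tag, key) under parent the first time it is seen."""
    if key is None:  # null padding: no occurrence of this node in this row
        return None
    if (tag, key) not in seen:
        seen[(tag, key)] = ET.SubElement(parent, tag, id=str(key), val=val)
    return seen[(tag, key)]

for a_id, a_val, b_id, b_val, d_id, d_val in rows:
    a = node(root, "A", a_id, a_val)
    node(a, "B", b_id, b_val)
    node(a, "D", d_id, d_val)

print(len(rows))                             # 5 rows in the flat rowset
print(ET.tostring(root, encoding="unicode"))
```

The flat rowset holds five rows, yet the XML contains two A elements, with two B and two D children under the first and none under the second.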
Figure 2 shows that the SQL user can continue to use the same SQL, while
visualizing the data structures as hierarchical and the operation as being
performed hierarchically, at a high conceptual level, to produce hierarchical
structures. Each result is automatically converted to its hierarchical XML
structure unless overridden. The Query Result Structure in Figure 2 demonstrates
the hierarchical operation of node promotion. Notice that the unselected nodes
are removed from the query result.
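Node promotion can be seen in plain SQL as ordinary projection. The sketch below, using hypothetical tables and data, selects only nodes A, C and E from an A, B, C, D, E structure. Plain SQL simply drops the B and D columns; reading the result as C and E now sitting directly under A is the hierarchical interpretation described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical structure: A -> B -> C on one path, A -> D -> E on the other.
    CREATE TABLE A (a_id INTEGER PRIMARY KEY, a_val TEXT);
    CREATE TABLE B (b_id INTEGER PRIMARY KEY, a_id INTEGER, b_val TEXT);
    CREATE TABLE C (c_id INTEGER PRIMARY KEY, b_id INTEGER, c_val TEXT);
    CREATE TABLE D (d_id INTEGER PRIMARY KEY, a_id INTEGER, d_val TEXT);
    CREATE TABLE E (e_id INTEGER PRIMARY KEY, d_id INTEGER, e_val TEXT);
    INSERT INTO A VALUES (1, 'a1');
    INSERT INTO B VALUES (10, 1, 'b1');
    INSERT INTO C VALUES (100, 10, 'c1');
    INSERT INTO D VALUES (20, 1, 'd1');
    INSERT INTO E VALUES (200, 20, 'e1');
""")

# The same structure definition is reused with two different SELECT lists.
STRUCTURE = """
    FROM A
    LEFT JOIN B ON B.a_id = A.a_id
    LEFT JOIN C ON C.b_id = B.b_id
    LEFT JOIN D ON D.a_id = A.a_id
    LEFT JOIN E ON E.d_id = D.d_id
"""

full = conn.execute(
    "SELECT A.a_val, B.b_val, C.c_val, D.d_val, E.e_val" + STRUCTURE
).fetchall()

# B and D are still needed to link the structure, but they are not selected.
promoted = conn.execute(
    "SELECT A.a_val, C.c_val, E.e_val" + STRUCTURE
).fetchall()

print(full)      # [('a1', 'b1', 'c1', 'd1', 'e1')]
print(promoted)  # [('a1', 'c1', 'e1')]
```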
Querying Multipath Hierarchical Data Structures
The examples so far have used multipath processing, where queries reference
multiple pathways in the accessed hierarchical structure. Processing these
multipath queries is not generally possible today because they require special
hierarchical processing to produce correct hierarchical results.
The ON clause covered earlier is a linear, single path qualification. It is used
during structure definition and creation, where it can control only the path it
is on. WHERE clause data filtering and qualification is logically applied after
the entire structure has been created and can affect the entire structure, so
data qualification based on one path can affect data qualification on another
path. This is because every node in a hierarchical structure is related to every
other node. This full nonlinear hierarchical qualification is not designed or
made up; it is standard processing for nonlinear hierarchical structures, based
on naturally occurring hierarchical principles. These principles have been
utilized as far back as the original hierarchical databases. That relational
databases now perform them automatically when the data is modeled hierarchically
indicates that the underlying technology is sound. XML database products today
lack this level of principled hierarchical processing.
This hierarchical data filtering is a new capability for the XML industry.
Figure 3 indicates how it works, showing how the relational rowset and its
associated hierarchical structure are related. In this example, the WHERE clause
points to node C in the Data Qualification Flow diagram to demonstrate how data
filtering spreads out from node C in different directions. If an occurrence of
this node is qualified for output, all of its associated node occurrences are
also qualified: up, down, and around, as shown by the arrows in the Data
Qualification Flow. This happens automatically because the entire row is
qualified. Notice that the unqualified row is not output. The darker cells are
qualified and output.
Figure 3: WHERE Clause Data Multipath Qualification Flow
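The difference between local ON filtering and global WHERE filtering can be checked with a small experiment. This is a sketch with hypothetical tables and data, not the query from the figure. The same condition on node C is applied first in the ON clause and then in the WHERE clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical structure: A -> B -> C on one path and A -> D on another.
    CREATE TABLE A (a_id INTEGER PRIMARY KEY, a_val TEXT);
    CREATE TABLE B (b_id INTEGER PRIMARY KEY, a_id INTEGER, b_val TEXT);
    CREATE TABLE C (c_id INTEGER PRIMARY KEY, b_id INTEGER, c_val TEXT);
    CREATE TABLE D (d_id INTEGER PRIMARY KEY, a_id INTEGER, d_val TEXT);
    INSERT INTO A VALUES (1, 'a1'), (2, 'a2');
    INSERT INTO B VALUES (10, 1, 'b1'), (11, 2, 'b2');
    INSERT INTO C VALUES (100, 10, 'x'), (101, 11, 'y');
    INSERT INTO D VALUES (20, 1, 'd1'), (21, 2, 'd2');
""")

STRUCTURE = """
    SELECT A.a_val, B.b_val, C.c_val, D.d_val
    FROM A
    LEFT JOIN B ON B.a_id = A.a_id
    LEFT JOIN C ON C.b_id = B.b_id {on_filter}
    LEFT JOIN D ON D.a_id = A.a_id
    {where_filter}
    ORDER BY A.a_id
"""

# Filter in the ON clause: only node C on its own path is affected.
on_rows = conn.execute(
    STRUCTURE.format(on_filter="AND C.c_val = 'x'", where_filter="")
).fetchall()

# Same filter in the WHERE clause: the whole row is qualified or rejected,
# so node D on the other path is filtered as well.
where_rows = conn.execute(
    STRUCTURE.format(on_filter="", where_filter="WHERE C.c_val = 'x'")
).fetchall()

print(on_rows)     # [('a1', 'b1', 'x', 'd1'), ('a2', 'b2', None, 'd2')]
print(where_rows)  # [('a1', 'b1', 'x', 'd1')]
```

With the ON filter, a2 keeps its B and D occurrences and only loses C. With the WHERE filter, the unqualified a2 row disappears entirely, including d2 on the sibling path.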
Joining Structures Hierarchically
Figure 4 below represents the processing of a query that dynamically joins two
hierarchical structures represented by their view names, RDBV and XMLV. Just as
the hierarchical data modeling in Figure 1 was performed a node at a time using
Left Outer Joins to define a hierarchical structure, hierarchical structures
defined in SQL views can also be joined by Left Outer Joins, producing a
hierarchical superstructure. In this example, the left relational structure is
joined over the right XML structure, linked by the ON clause join criteria at a
high conceptual level, as demonstrated in Figure 4. The ON clause takes on added
importance by unambiguously specifying the link points in each structure being
hierarchically joined. This is shown by the diagram of the joined structures to
the right of the SQL Processor box in Figure 4. The X node box in this diagram
is dashed, indicating that it is not selected for output and will be removed
from the output.
The ON clause was introduced in SQL-92 to replace the WHERE clause’s join use
with more precise join control, geared toward local control at the join point.
The ON clause operates during structure creation, while the WHERE clause
operates after structure creation, affecting the entire structure hierarchically
as a whole. This significant difference between the WHERE and ON operations is
not generally realized, and it enables much of SQL’s hierarchical capability.
The example in Figure 4 also demonstrates a heterogeneous join of structures by
hierarchically joining a relational structure over an XML structure.
Nonrelational structures, such as XML and legacy hierarchical structures like
IBM’s IMS, require their own view definition to capture the specific metadata
needed for their processing. This is indicated by the XML Definition Processing
box in this example. This information is used in the Service Request for XML box
when the SQL Processor needs XML access. The XML Definition Process also
automatically generates a standard SQL hierarchical view that specifies the
nonrelational structure as a standard SQL hierarchical structure, as shown. This
allows the heterogeneous structures to be hierarchically joined seamlessly using
a standard SQL sequence of Left Outer Joins.
Figure 4: SQL Multi Structure Heterogeneous Hierarchical Join
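The view-level join can be sketched in plain SQL as follows. The view names RDBV and XMLV come from the text, but everything else is an assumption made for illustration: the node tables R1, R2, X and Y, their columns, and the link between R2 and X are hypothetical, and XMLV is simulated here as an ordinary SQL view over tables because SQLite has no native XML access.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical relational structure: R1 -> R2.
    CREATE TABLE R1 (r1_id INTEGER PRIMARY KEY, r1_val TEXT);
    CREATE TABLE R2 (r2_id INTEGER PRIMARY KEY, r1_id INTEGER, r2_val TEXT);
    -- Stand-in tables for the XML structure: X -> Y.
    CREATE TABLE X (x_id INTEGER PRIMARY KEY, x_val TEXT);
    CREATE TABLE Y (y_id INTEGER PRIMARY KEY, x_id INTEGER, y_val TEXT);

    INSERT INTO R1 VALUES (1, 'r1a'), (2, 'r1b');
    INSERT INTO R2 VALUES (10, 1, 'r2a');
    INSERT INTO X VALUES (10, 'xa');
    INSERT INTO Y VALUES (100, 10, 'ya');

    -- Each view names one hierarchical structure for reuse.
    CREATE VIEW RDBV AS
        SELECT R1.r1_id AS r1_id, R1.r1_val AS r1_val,
               R2.r2_id AS r2_id, R2.r2_val AS r2_val
        FROM R1 LEFT JOIN R2 ON R2.r1_id = R1.r1_id;

    CREATE VIEW XMLV AS
        SELECT X.x_id AS x_id, X.x_val AS x_val,
               Y.y_id AS y_id, Y.y_val AS y_val
        FROM X LEFT JOIN Y ON Y.x_id = X.x_id;
""")

# Join the relational structure over the XML stand-in. The ON clause names
# the link point in each structure (R2 on the left, X on the right).
# X's own data is not selected, so it drops out of the output.
rows = conn.execute("""
    SELECT RDBV.r1_val, RDBV.r2_val, XMLV.y_val
    FROM RDBV
    LEFT JOIN XMLV ON XMLV.x_id = RDBV.r2_id
    ORDER BY RDBV.r1_id
""").fetchall()

print(rows)  # [('r1a', 'r2a', 'ya'), ('r1b', None, None)]
```

The second row shows the left structure being preserved: r1b has no R2 occurrence, so nothing from the right structure can attach beneath it.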
In this example the relational view is a logical view and the XML view is a
physical view. The relational logical view is constructed, while the XML
physical view already exists and is merely defined by recording its structure
information. Joining and processing physical and logical structures together
does not present an impedance mismatch problem because both follow the same set
of hierarchical processing principles and rules.
Conclusion
In the implementation of the prototype that proved the capability of the design
modeled in Figure 4, the SQL processor’s hierarchical rowset results were
converted to an XML hierarchical structure as shown. This design kept the ANSI
SQL processor unmodified, proving that it could indeed operate fully
hierarchically, beyond the hierarchical capabilities of current XML processors.
The efficiency of this design could be improved if the external processing
additions shown were integrated directly into the SQL processor. The ANSI SQL
Hierarchical XML Processor prototype can be tested at www.adatinc.com/demo.html.