My previous article, SQL’s
Optimized Hierarchical Data Processing Driven by its Data Structure, covered
global views, schema–free navigationless processing and the powerful SQL SELECT
clause, which optimally and dynamically controls full multipath processing.
This article continues describing advanced capabilities of hierarchical
multipath query processing by describing the little known Lowest Common
Ancestor (LCA) processing that enables multipath hierarchical query processing
to always produce correct meaningful results automatically. It will explain how
multipath processing dynamically increases the data value of the data by accurately
querying it from multiple pathways.
Hierarchical Structures Create More Value than is Captured
Hierarchical structures are very unique not only
in how well they organize data, but also in how they naturally capture more
meaning than is stored with the data. Even more impressive is their ability to
dynamically process this natural goldmine of meaning in unlimited ways that
further increases the value of the stored data. With the increased use of
hierarchical XML data structures today, there is an incredible amount of unused
data value potential available.
Hierarchical structures are easily built and
expanded naturally over time. Suppose we start building a hierarchical
structure by creating and populating node A over node B. Then at some point we
add node C also directly under node A as shown in Figure 1. This is an example
of reuse by reusing the already existing node A and its data. We also know what
nodes the A node has control over and how nodes B and C are related to each
other. With the addition of each new node, the data value will increase in a
nonlinear fashion. At this point, this multipath nonlinear hierarchical
structure can look like the one in Figure 2.
Figure 1: More value than captured
Multipath Queries Dynamically Increase Data Value
The interesting question in Figure 1 is how nodes
B and C are related. How does a query language handle this? Can node C be
selected for output based on a data value in node B? Can node A be
selected based on data values in multiple nodes B and C? Multipath
queries dynamically increase data value by utilizing the additional semantic
structure information that exists naturally between the pathways. The very
nature of having to use additional semantic structure information in a query
increases the value of the data used in the query and in the result. An example
of such a query is selecting data from one path of the structure based on data
in another path. For example using Figure 1: SELECT B.b WHERE C.c= 1. This also
allows an unlimited number of queries to be asked of the same data view. To
explore this further, Figure 1 needs to be expanded over time into Figure 2.
Figure 2: Example multipath structure and its SQL view
Figure 2 demonstrates how SQL can model hierarchical
structures in a view. The SQL-92 standard introduced the LEFT Outer Join, which
is hierarchical in nature allowing the left data argument to be preserved when
there is no matching right data argument. This places the left argument
hierarchically above the right argument as shown in the structure in Figure 2.
Additionally with the SQL-92 standard the ON clause was added to replace the
WHERE clause for specifying the join criteria. The ON clause is significantly
different from the WHERE clause because it is specified at each join point and
has a more controlled linear operation. This allows the entire hierarchical
definition in Figure 2 to be precisely defined by SQL’s LEFT Outer Join syntax.
In addition, the associated Left Outer Join’s semantics allows the syntax to be
processed directly by the SQL processor allowing it to perform full
hierarchical processing automatically. This has been covered in more detail in a
previous Database Journal article in this series listed at the end of this
article.
Multipath Queries Require LCA Processing
In a multipath hierarchical data structure, nodes or points
on two paths are related by their Lowest Common Ancestor known as the LCA node.
In Figure 2, nodes B and C have node A as the LCA. In the case of nodes F and
G, node C is the LCA. And in the case of nodes B and F, node A is also the LCA.
The purpose of the LCA node is to control or limit the domain range of the
operation applied across the two related nodes. In Figure 2, nodes D and F have
a larger domain than nodes F and G.
Hierarchical databases that handle XML and other large hierarchical
databases like IBM’s IMS or Structured VSAM used with COBOL’s hierarchical
processing require multiple node types and the support of multiple data occurrences.
This is shown in the example of hierarchical data structure in Figure 3. Except
for the A root node, each node type was populated with two data occurrences
whose value shown indicates their node type and data value for use in examples
to follow.
Figure 3: Data structure of view ABC
Data occurrences play an important part in LCA processing. Using
nodes F and G in Figure 3 and their derived LCA C node controls the range of nodes
under the LCA C node. With LCA nodes the level of containment is at the LCA
data occurrence level. This means in this F and G node example that only the F1,
F2 G1 and G2 data occurrences combination or the F3, F4, G3 and G4 combination can
be used together because they occur under different LCA data occurrences of C1
or C2. For example, nodes F1 and G3 cannot be processed together in order for
the LCA logic to keep the processing meaningful. In a different example using
nodes B and F the LCA node is A, its data occurrence A1 is shown. This allows all
of the data shown to be processed together. This allows for example node D2 to
be used with E3 if necessary. Data from a not shown A2 data occurrence node
could not be combined with the data shown. How the range of logic applied
between the two related nodes is applied depends on the operation being
performed, which is explained in the following section.
SQL LCA Used in Two Operations
In the previous article in this series, hierarchical
processing was explained as being controlled by the SQL-92 LEFT Outer Join operation,
which is hierarchical in nature. It was shown how the LEFT Join syntax can be
used to hierarchically model full multipath hierarchical structures. It was
then explained that the associated LEFT Outer Join syntax could be directly
processed by the SQL processor and the associated hierarchical semantics would
make it operate hierarchically. This allows SQL to naturally perform
hierarchical data preservation operations. The LCA operation was not covered at
that time because the LCA operation was not needed for the operations covered
in that article. The LCA operation is used in SQL processing in two important
ways. These are the SELECT LCA operation and the WHERE LCA operation, which are
handled differently. The LCA operations are needed to keep the query result
meaningful for multipath queries. It does this by limiting the processing
between hierarchical pathways to the point that both related paths are still
strongly related by their Lowest Common Ancestor (LCA).
The WHERE LCA Usage
WHERE clauses that reference multiple hierarchical paths
require LCA logic. For example, using the structure in Figure 3 let’s take the
WHERE clause: WHERE F.f=1 AND G.g=2. Nodes F and G produce the LCA node
as C. This means that all of the combinations of F1, F2, G1, and G2 under the
LCA occurrence C1 or all of the combinations of F3, F4, G3 and G4 under LCA
occurrence C2 are meaningful. Comparing F1 to G3 would not be meaningful. If we
change the WHERE clause to: WHERE D.d=1 AND G.g=3, then the LCA node is
node A and all the D and G node data occurrences under the current LCA node
data occurrence A1 are tested. You will notice that in this case that the
domain of the LCA logic is larger since the tested nodes were further apart
hierarchically. The LCA and its logic take that into consideration
automatically.
The SELECT LCA Usage
Selecting data from one pathway based on data from another
pathway requires a different form of LCA processing. An example using Figure 3
is: SELECT F.f FROM ABC WHERE G.g=2. In this case, the F node and G
node produce an LCA C node. This means that if any occurrences of G.g under the
current LCA C node tests true, then all F.f data occurrences under the LCA C1
node occurrence are selected for output. In this case, the output would be F1
and F2. An example where the hierarchical distance is increased is: SELECT
D.d FROM ABC WHERE G.g=2. In This case, the D node and G node produce an
LCA A node. This means that if any occurrence of G.g (G1, G2, G3, or G4) under
the current LCA A1 data occurrence node tests true, then all D.d data
occurrences under the LCA A node occurrence (F1, F2, F3, and F4) are selected
for output. In this case, the domain for this example is larger and spans
across parents.
The SELECT clause can specify multiple data type items for
output. This means that these different data types can be located on different paths,
which can change the LCA node for the different data type items in the SELECT
clause. An example using Figure 3 is: SELECT D.d, F.f FROM ABC WHERE G=1.
This is actually a combination of the previous two SELECT examples. In this
case, the selected data type items, D.d and F.f are separately processed for its
LCA node, which could be different. Each separate data type item has the same WHERE
clause G node. This means that the selected D.d node had an LCA node of A,
while the F.f data node had an LCA node of C. In addition to this multiple LCA
node SELECT processing, the WHERE clause can also require its own LCA because
it can also specify multiple tests on different pathways as in: SELECT D.d,
F.f FROM ABC WHERE F.f=1 AND G.g=2. These SELECT and WHERE LCAs can all naturally
nest. What this also means is that the complexity of LCA query processing
requires automatic processing.
Where is LCA Processing Occurring in SQL
Since multipath nonlinear processing requires automatic
processing, and XQuery does not support automatic LCA processing, it cannot
support multipath processing well, if at all, which is true for all XML query
processing today. In addition, LCA processing is not practical using user navigation,
which is required today for XML access in query languages. There is current
academic research today attempting to add LCA processing to XQuery using external
LCA functions. From the complexity of LCA of unrestricted multipath access
shown in this article, it should be clear that a complete LCA solution will
require automatic internal LCA support.
Built in LCA processing used in the past with the original hierarchical
databases and their hierarchical query processing used a tremendous amount of
hierarchical tree walking up and down the data structure. There is also the
problem of determining LCA nodes. Even today, there are still papers being published
with more efficient algorithms for locating LCA nodes based on two points in a
hierarchical structure. SQL was designed with no LCA processing or LCA locator
logic and concept of hierarchical tree walking. So how is SQL automatically
supporting LCA processing?
We know that the basic hierarchical data modeling and data
preservation processing is performed as a direct result of the hierarchical
processing from the LEFT Outer Join data modeling. We also know that LCA
processing is occurring naturally in SQL with its inherently correct multipath
hierarchical processing support, which is enabled by correctly modeling
hierarchical structures using LEFT Outer Joins. However, where the LCA
processing was occurring in SQL is not that clear. We found the inherent LCA
processing hiding in plain sight in SQL’s relational Cartesian product
operation. It automatically and strategically introduces data replications in
order to get around the limitation that with relational processing the data
remains flat even with one-to-many join operations applied.
How is LCA Processing Working Automatically in SQL
The data replication produced by the Cartesian product allows
relational processing to perform most of its processing a row at a time without
having to loop back through the rows for operations such as testing all
combinations of data. We found that this same process enables LCA processing to
occur naturally because of the LEFT Outer Join data modeling, which causes data
replication combinations to occur naturally around the join points, which act
as LCA points for hierarchical processing. When processed hierarchically with
the LEFT Outer Join, this simulates the hierarchical LCA processing. This
demonstrates that full multipath hierarchical processing is a natural subset of
relational processing and the way that LCA processing operates is a natural
principled operation naturally included in relational processing.
Figure 4 below is the flat relational Cartesian product representation
of the hierarchical data in Figure 3. The first data row of the top box contains
the first data occurrences for each node placed in left-to-right hierarchical level
order starting with the root node. Then starting with this initial data row a
Cartesian product algorithm is applied right-to-left moving the rest of the hierarchical
data from figure 3 into the flat relational result set. Looking at the starting
locations on the right side of the boxes you can notice how the data is
replicated around the hierarchical node join points. This means the
hierarchical structure is reflected in the rowset result. The structure is more
obvious on the right side of the rows, which vary faster. Now work through the previous
SELECT LCA and WHERE LCA query examples using the flat rowset data in Figure 4
a single row at a time instead of using the hierarchical structure in Figure 3.
You will see that the result is the same as when using the hierarchical data structure
proving that the inherent built in row-at-a-time LCA logic worked. For example
you will find no single row that has (F1 and G3) or (D1 and E3) because these
comparisons are not meaningful. This should also demonstrate why the Cartesian
product is so important to SQL.
Top | Bottom | ||||
---|---|---|---|---|---|
|
|
Figure 4: Cartesian product applied to hierarchical data in Figure 3
The Cartesian product flat rowset result in Figure 4 was
produced as described above from the hierarchical data structure result in
Figure 3. This result would also be produced directly from a relational
processor processing the Left Outer Join view in Figure 2. This means that the
relational join can affect the Cartesian product result, which can affect the
processing. So SQL performing hierarchically even at a multipath level with LCA
processing is operating naturally as would be expected. There is no reason for
SQL vendors not to internally utilize and build on this powerful and natural full
hierarchical processing capability.
The Cartesian product processing has taken the 21 data
values in the hierarchical data view shown in Figure 3 and blown them up to
over 400 values in figure 4 above. This is a significant overhead not present
in physical hierarchical data structures. However, using the entire view’s
complete data range is usually not necessary and only a small portion is needed
for any view being queried. Each view invocation can scale down the data range
with a WHERE clause data filter before the data is blown up in a Cartesian
product. The previous article in this hierarchical processing series of articles
covered a new powerful hierarchical semantic optimization that can be used in
SQL hierarchical processing.
As mentioned above, the hierarchical structure is reflected
in the Cartesian product. This makes it possible to convert the Cartesian
product back into a hierarchical structure with all data replications removed.
This can be used for many capabilities such as automatic XML structured
hierarchical output. This will be covered in a future article in this series of
hierarchical articles.
Conclusion
SQL’s multipath hierarchical query processing implies the
capability to unrestrictedly querying a full hierarchical multipath structure
by only listing the data items to be output and the data filtering to be
applied. This can allow the specification of complex hierarchical processing
not expressible with GUIs. The multipath hierarchical data structure is of no
concern to the query user and no navigation is necessary. Unlike XQuery, no
processing looping logic is necessary to specify, the required complex
multipath processing is automatically determined from the specified desired
output and the structure of the data to be processed. An extreme example query
is: SELECT C.c, F.f, D.d FROM ABC WHERE D.d.=G.g AND C.c=D.d. This query
can be processed in a single row-by-row processing pass through the generated Cartesian
product rowset in Figure 4. This article has shown what some of this automatic
complex multipath internal logic requires.
Additional Resources
Wikipedia Lowest common ancestor