The Power Behind SQL's Inherent Multipath LCA Hierarchical Processing
May 20, 2010
My previous article, SQL's Optimized Hierarchical Data Processing Driven by its Data Structure, covered global views, schemafree navigationless processing and the powerful SQL SELECT clause, which optimally and dynamically controls full multipath processing. This article continues describing advanced capabilities of hierarchical multipath query processing by describing the little known Lowest Common Ancestor (LCA) processing that enables multipath hierarchical query processing to always produce correct meaningful results automatically. It will explain how multipath processing dynamically increases the data value of the data by accurately querying it from multiple pathways.
Hierarchical Structures Create More Value than is Captured
Hierarchical structures are very unique not only in how well they organize data, but also in how they naturally capture more meaning than is stored with the data. Even more impressive is their ability to dynamically process this natural goldmine of meaning in unlimited ways that further increases the value of the stored data. With the increased use of hierarchical XML data structures today, there is an incredible amount of unused data value potential available.
Hierarchical structures are easily built and expanded naturally over time. Suppose we start building a hierarchical structure by creating and populating node A over node B. Then at some point we add node C also directly under node A as shown in Figure 1. This is an example of reuse by reusing the already existing node A and its data. We also know what nodes the A node has control over and how nodes B and C are related to each other. With the addition of each new node, the data value will increase in a nonlinear fashion. At this point, this multipath nonlinear hierarchical structure can look like the one in Figure 2.
Multipath Queries Dynamically Increase Data Value
The interesting question in Figure 1 is how nodes B and C are related. How does a query language handle this? Can node C be selected for output based on a data value in node B? Can node A be selected based on data values in multiple nodes B and C? Multipath queries dynamically increase data value by utilizing the additional semantic structure information that exists naturally between the pathways. The very nature of having to use additional semantic structure information in a query increases the value of the data used in the query and in the result. An example of such a query is selecting data from one path of the structure based on data in another path. For example using Figure 1: SELECT B.b WHERE C.c= 1. This also allows an unlimited number of queries to be asked of the same data view. To explore this further, Figure 1 needs to be expanded over time into Figure 2.
Figure 2 demonstrates how SQL can model hierarchical structures in a view. The SQL-92 standard introduced the LEFT Outer Join, which is hierarchical in nature allowing the left data argument to be preserved when there is no matching right data argument. This places the left argument hierarchically above the right argument as shown in the structure in Figure 2. Additionally with the SQL-92 standard the ON clause was added to replace the WHERE clause for specifying the join criteria. The ON clause is significantly different from the WHERE clause because it is specified at each join point and has a more controlled linear operation. This allows the entire hierarchical definition in Figure 2 to be precisely defined by SQL's LEFT Outer Join syntax. In addition, the associated Left Outer Join's semantics allows the syntax to be processed directly by the SQL processor allowing it to perform full hierarchical processing automatically. This has been covered in more detail in a previous Database Journal article in this series listed at the end of this article.
Multipath Queries Require LCA Processing
In a multipath hierarchical data structure, nodes or points on two paths are related by their Lowest Common Ancestor known as the LCA node. In Figure 2, nodes B and C have node A as the LCA. In the case of nodes F and G, node C is the LCA. And in the case of nodes B and F, node A is also the LCA. The purpose of the LCA node is to control or limit the domain range of the operation applied across the two related nodes. In Figure 2, nodes D and F have a larger domain than nodes F and G.
Hierarchical databases that handle XML and other large hierarchical databases like IBM's IMS or Structured VSAM used with COBOL's hierarchical processing require multiple node types and the support of multiple data occurrences. This is shown in the example of hierarchical data structure in Figure 3. Except for the A root node, each node type was populated with two data occurrences whose value shown indicates their node type and data value for use in examples to follow.
Data occurrences play an important part in LCA processing. Using nodes F and G in Figure 3 and their derived LCA C node controls the range of nodes under the LCA C node. With LCA nodes the level of containment is at the LCA data occurrence level. This means in this F and G node example that only the F1, F2 G1 and G2 data occurrences combination or the F3, F4, G3 and G4 combination can be used together because they occur under different LCA data occurrences of C1 or C2. For example, nodes F1 and G3 cannot be processed together in order for the LCA logic to keep the processing meaningful. In a different example using nodes B and F the LCA node is A, its data occurrence A1 is shown. This allows all of the data shown to be processed together. This allows for example node D2 to be used with E3 if necessary. Data from a not shown A2 data occurrence node could not be combined with the data shown. How the range of logic applied between the two related nodes is applied depends on the operation being performed, which is explained in the following section.
SQL LCA Used in Two Operations
In the previous article in this series, hierarchical processing was explained as being controlled by the SQL-92 LEFT Outer Join operation, which is hierarchical in nature. It was shown how the LEFT Join syntax can be used to hierarchically model full multipath hierarchical structures. It was then explained that the associated LEFT Outer Join syntax could be directly processed by the SQL processor and the associated hierarchical semantics would make it operate hierarchically. This allows SQL to naturally perform hierarchical data preservation operations. The LCA operation was not covered at that time because the LCA operation was not needed for the operations covered in that article. The LCA operation is used in SQL processing in two important ways. These are the SELECT LCA operation and the WHERE LCA operation, which are handled differently. The LCA operations are needed to keep the query result meaningful for multipath queries. It does this by limiting the processing between hierarchical pathways to the point that both related paths are still strongly related by their Lowest Common Ancestor (LCA).
The WHERE LCA Usage
WHERE clauses that reference multiple hierarchical paths require LCA logic. For example, using the structure in Figure 3 let's take the WHERE clause: WHERE F.f=1 AND G.g=2. Nodes F and G produce the LCA node as C. This means that all of the combinations of F1, F2, G1, and G2 under the LCA occurrence C1 or all of the combinations of F3, F4, G3 and G4 under LCA occurrence C2 are meaningful. Comparing F1 to G3 would not be meaningful. If we change the WHERE clause to: WHERE D.d=1 AND G.g=3, then the LCA node is node A and all the D and G node data occurrences under the current LCA node data occurrence A1 are tested. You will notice that in this case that the domain of the LCA logic is larger since the tested nodes were further apart hierarchically. The LCA and its logic take that into consideration automatically.
The SELECT LCA Usage
Selecting data from one pathway based on data from another pathway requires a different form of LCA processing. An example using Figure 3 is: SELECT F.f FROM ABC WHERE G.g=2. In this case, the F node and G node produce an LCA C node. This means that if any occurrences of G.g under the current LCA C node tests true, then all F.f data occurrences under the LCA C1 node occurrence are selected for output. In this case, the output would be F1 and F2. An example where the hierarchical distance is increased is: SELECT D.d FROM ABC WHERE G.g=2. In This case, the D node and G node produce an LCA A node. This means that if any occurrence of G.g (G1, G2, G3, or G4) under the current LCA A1 data occurrence node tests true, then all D.d data occurrences under the LCA A node occurrence (F1, F2, F3, and F4) are selected for output. In this case, the domain for this example is larger and spans across parents.
The SELECT clause can specify multiple data type items for output. This means that these different data types can be located on different paths, which can change the LCA node for the different data type items in the SELECT clause. An example using Figure 3 is: SELECT D.d, F.f FROM ABC WHERE G=1. This is actually a combination of the previous two SELECT examples. In this case, the selected data type items, D.d and F.f are separately processed for its LCA node, which could be different. Each separate data type item has the same WHERE clause G node. This means that the selected D.d node had an LCA node of A, while the F.f data node had an LCA node of C. In addition to this multiple LCA node SELECT processing, the WHERE clause can also require its own LCA because it can also specify multiple tests on different pathways as in: SELECT D.d, F.f FROM ABC WHERE F.f=1 AND G.g=2. These SELECT and WHERE LCAs can all naturally nest. What this also means is that the complexity of LCA query processing requires automatic processing.
Where is LCA Processing Occurring in SQL
Since multipath nonlinear processing requires automatic processing, and XQuery does not support automatic LCA processing, it cannot support multipath processing well, if at all, which is true for all XML query processing today. In addition, LCA processing is not practical using user navigation, which is required today for XML access in query languages. There is current academic research today attempting to add LCA processing to XQuery using external LCA functions. From the complexity of LCA of unrestricted multipath access shown in this article, it should be clear that a complete LCA solution will require automatic internal LCA support.
Built in LCA processing used in the past with the original hierarchical databases and their hierarchical query processing used a tremendous amount of hierarchical tree walking up and down the data structure. There is also the problem of determining LCA nodes. Even today, there are still papers being published with more efficient algorithms for locating LCA nodes based on two points in a hierarchical structure. SQL was designed with no LCA processing or LCA locator logic and concept of hierarchical tree walking. So how is SQL automatically supporting LCA processing?
We know that the basic hierarchical data modeling and data preservation processing is performed as a direct result of the hierarchical processing from the LEFT Outer Join data modeling. We also know that LCA processing is occurring naturally in SQL with its inherently correct multipath hierarchical processing support, which is enabled by correctly modeling hierarchical structures using LEFT Outer Joins. However, where the LCA processing was occurring in SQL is not that clear. We found the inherent LCA processing hiding in plain sight in SQL's relational Cartesian product operation. It automatically and strategically introduces data replications in order to get around the limitation that with relational processing the data remains flat even with one-to-many join operations applied.
How is LCA Processing Working Automatically in SQL
The data replication produced by the Cartesian product allows relational processing to perform most of its processing a row at a time without having to loop back through the rows for operations such as testing all combinations of data. We found that this same process enables LCA processing to occur naturally because of the LEFT Outer Join data modeling, which causes data replication combinations to occur naturally around the join points, which act as LCA points for hierarchical processing. When processed hierarchically with the LEFT Outer Join, this simulates the hierarchical LCA processing. This demonstrates that full multipath hierarchical processing is a natural subset of relational processing and the way that LCA processing operates is a natural principled operation naturally included in relational processing.
Figure 4 below is the flat relational Cartesian product representation of the hierarchical data in Figure 3. The first data row of the top box contains the first data occurrences for each node placed in left-to-right hierarchical level order starting with the root node. Then starting with this initial data row a Cartesian product algorithm is applied right-to-left moving the rest of the hierarchical data from figure 3 into the flat relational result set. Looking at the starting locations on the right side of the boxes you can notice how the data is replicated around the hierarchical node join points. This means the hierarchical structure is reflected in the rowset result. The structure is more obvious on the right side of the rows, which vary faster. Now work through the previous SELECT LCA and WHERE LCA query examples using the flat rowset data in Figure 4 a single row at a time instead of using the hierarchical structure in Figure 3. You will see that the result is the same as when using the hierarchical data structure proving that the inherent built in row-at-a-time LCA logic worked. For example you will find no single row that has (F1 and G3) or (D1 and E3) because these comparisons are not meaningful. This should also demonstrate why the Cartesian product is so important to SQL.
Figure 4: Cartesian product applied to hierarchical data in Figure 3
The Cartesian product flat rowset result in Figure 4 was produced as described above from the hierarchical data structure result in Figure 3. This result would also be produced directly from a relational processor processing the Left Outer Join view in Figure 2. This means that the relational join can affect the Cartesian product result, which can affect the processing. So SQL performing hierarchically even at a multipath level with LCA processing is operating naturally as would be expected. There is no reason for SQL vendors not to internally utilize and build on this powerful and natural full hierarchical processing capability.
The Cartesian product processing has taken the 21 data values in the hierarchical data view shown in Figure 3 and blown them up to over 400 values in figure 4 above. This is a significant overhead not present in physical hierarchical data structures. However, using the entire view's complete data range is usually not necessary and only a small portion is needed for any view being queried. Each view invocation can scale down the data range with a WHERE clause data filter before the data is blown up in a Cartesian product. The previous article in this hierarchical processing series of articles covered a new powerful hierarchical semantic optimization that can be used in SQL hierarchical processing.
As mentioned above, the hierarchical structure is reflected in the Cartesian product. This makes it possible to convert the Cartesian product back into a hierarchical structure with all data replications removed. This can be used for many capabilities such as automatic XML structured hierarchical output. This will be covered in a future article in this series of hierarchical articles.
SQL's multipath hierarchical query processing implies the capability to unrestrictedly querying a full hierarchical multipath structure by only listing the data items to be output and the data filtering to be applied. This can allow the specification of complex hierarchical processing not expressible with GUIs. The multipath hierarchical data structure is of no concern to the query user and no navigation is necessary. Unlike XQuery, no processing looping logic is necessary to specify, the required complex multipath processing is automatically determined from the specified desired output and the structure of the data to be processed. An extreme example query is: SELECT C.c, F.f, D.d FROM ABC WHERE D.d.=G.g AND C.c=D.d. This query can be processed in a single row-by-row processing pass through the generated Cartesian product rowset in Figure 4. This article has shown what some of this automatic complex multipath internal logic requires.
Wikipedia Lowest common ancestor