Top 10 SQL Hierarchical Data Processing Capabilities
December 8, 2010
Michael M. David reviews the top 10 new capabilities presented by ANSI SQL Transparent XML Hierarchical Processing, including Inherent hierarchical processing in SQL, Multipath Lowest Common Ancestor (LCA) processing occurring naturally in SQL, New hierarchical data modeling capabilities, and the solving of how these new capabilities work naturally, which lead to how they can be controlled.
I have been writing on the topic of ANSI SQL Transparent XML Hierarchical Processing for Database Journal for almost a year now. This topic has presented a number of new and powerful capabilities that I will review in this Top 10 article. This includes a number of discoveries in SQL operation that were recently made that support new capabilities described in my articles and recapped in this article. These SQL discoveries include Inherent hierarchical processing in SQL, Multipath Lowest Common Ancestor (LCA) processing occurring naturally in SQL, New hierarchical data modeling capabilities, and the solving of how these new capabilities work naturally, which lead to how they can be controlled. The order of the capabilities made possible by these SQL discoveries are described in this article in a building block order because the capabilities build upon each other.
1: SQL Transparent Hierarchical Processing
ANSI SQL Transparent Hierarchical Processing is naturally supported in SQL-92 by the SQL LEFT Outer Join operation. The LEFT Outer Join syntax performs full multipath data modeling and its associated semantics supports the hierarchical processing when processed directly by the SQL hierarchical processor. What makes this work reliably is the new Outer Join ON clause used at each join point to replace the WHERE clause single join usage insuring the SQL request is unambiguous. This natural hierarchical processing finally solves the relational/hierarchical integration problem with no data loss while preserving and utilizing the naturally existing hierarchical semantics automatically. This is used to seamlessly support SQL: hierarchical data filtering; navigationless processing; and automatic structured XML output. This is described in more detail in the article ANSI SQL Hierarchical Data Processing Basics.
2: Inherent LCA Multipath Processing
The initial computerized database technology four decades ago operated on hierarchical data structures and had query languages that supported full multipath navigationless queries. Multipath queries dynamically increase the customer's data value. This multipath hierarchical processing automatically utilizes the natural hierarchical semantics across pathways. This dynamically increases the data value by utilizing the additional semantic data and number of pathways used in deriving the multipath hierarchical result.
There is one more piece of logic that is required for multipath hierarchical processing. It introduced a richer level of semantics and meaning that must be taken into consideration to keep the query meaningful for processing multipath hierarchical structures. This logic uses Lowest Common Ancestor (LCA) logic to keep the hierarchical processing meaningful across pathways. During processing, whenever two points on different pathways in the structure require being associated in anyway, the LCA determines the shortest connection between the two points to keep the hierarchical result meaningful. This is a form of hierarchical proximity. This multipath LCA processing occurs naturally in SQL but is not performed by XQuery. This capability is described in more detail in the article The Power Behind SQL's Inherent Multipath LCA Hierarchical Processing.
3: Structure-Aware Processing
A number of new and advanced capabilities are made possible for SQL hierarchical processing by the capability to automatically and dynamically determine the hierarchical structure being processed. This is known as structure-aware processing. This is possible through interpreting the semantics in the hierarchical data modeling LEFT Outer Join string driving the query. The first and primary capability possible from this structure-aware processing that is required for other advanced capabilities is hierarchical optimization. Also tightly involved with these hierarchical optimization capabilities is SQL's powerful variable SELECT list.
3A: Hierarchical Optimization
SQL hierarchical optimization is controlled by SQL's variable SELECT list. It dynamically specifies the output result. This enables the output result to change from use to use for the same view. The output infers the required processing, which dynamically controls the hierarchical optimization. This optimization determines which nodes do not require access and rewrites the expanded LEFT Outer Join string eliminating unnecessary LEFT Outer Join entries. This automatically tailors the associated processing required to produce the desired result. This is easy for SQL to support because it is an interpretive language driven by the optimized LEFT Outer Join string. XQuery does not support a variable SELECT list type of operation so it does not automatically support this dynamic data access and processing optimization.
3B: Navigationless Access
Having structure knowledge of the hierarchical structure being accessed and the fact that SQL structured data and not semistructured data is being accessed, enables the database access to be performed automatically and navigationlessly. Structured and semistructured data require different processing. A choice has to be made between supporting accurate, unambiguous, and navigationless SQL structured processing or semistructured fuzzy processing that requires navigational processing. This navigationless access is also known as schema-free access. Most if not all XML and SQL/XML processors perform semistructured navigational processing. Our solution only processes SQL structured data in order to produce correct hierarchical structured results consistently and automatically.
3C: Global Views
The hierarchical optimization applied automatically to views enables any view to be a global view. Global views can contain very large structures with no overhead. This means that only the current required data in the view is accessed and no additional overhead occurs because of any additional unneeded data in the view. This means that fewer views are necessary and since no structure navigation is necessary, query use by non-technical users becomes even simpler and more powerful with seamless Global Views.
3D: Dynamically Formatted Output
Dynamically formatted output is controlled by the SQL variable SELECT list. This means that the output data for the same view can change from use to use. Another important feature of the automatic output is that it is dynamically aggregated by the automatic use of node promotion and node collection to slice out the unreferenced nodes and keep the pathways seamlessly connected. In SQL, this is called projection and is controlled by the SELECT list operator. The Global View capability also makes the dynamically formatted output easier to specify.
XQuery does not support structure-aware processing because it can freely use any type of structure. This flexibility comes with a tradeoff that hierarchical structures cannot be monitored or utilized fully. This Structure-aware processing and its capabilities are described in more detail in the article: SQL's Optimized Hierarchical Data Processing Driven by its Data Structure.
4: Dynamic Structure Generation
There are five ways that data structures can be generated. Four are covered in this Section; transformations are covered later in Section 6. Transformations are already existing structures that are modified at execution. This makes them different from standard structure generation and they contain their own multiple choices of different transformation capabilities.
4A: Basic Data Modeling
Basic data modeling is accomplished using the LEFT Outer Join to specify the structure a node at a time building the structure top (root) down. This is usually specified in a view and can be reused. This is used to define relational hierarchically modeled structures and to define physical structures like XML. With the newer SQL ON clause used at each join point any structure can be modeled more easily and unambiguously than trying to use a single WHERE clause for the entire structure. In addition, the ON and WHERE clause work differently. The WHERE clause filters the entire structure a row at a time, while the ON clause only affects its area of the structure and a given portion of the row. This is the same as hierarchical data preservation, which is necessary for correct hierarchical processing.
4B: Joining Hierarchical Data Structures
Joining hierarchical data structures is accomplished using the LEFT Outer Join operation. These hierarchical joins can consist of logical relational and/or physical XML structure views. This is because they both use the same LEFT Outer Join operation and can dynamically support heterogeneous hierarchical joins seamlessly to create a virtual hierarchical view. Standard hierarchical joining is applied from anywhere in the upper structure to the root of the lower structure. Linking anywhere under the root of the lower structure has presented hierarchical semantic problems for interpreting the appropriate resulting hierarchical semantics or structure. Section 5 does present a solution for linking below the root of the lower level structure.
4C: Dynamically Structured Output
Dynamically structured output is controlled by the SQL variable SELECT list. This means that the output structure for the same view can change from use to use controlled by the variable SELECT list. Another important feature of the automatic structured output is that it is dynamically smoothed by the automatic hierarchical node promotion and node collection (same as relational projection) to seamlessly splice out the missing nodes and data while keeping the pathways connected. The Global View capability also makes the dynamic structured output much easier to specify. This description is basically the same as what was described for dynamically formatted output in Section 3D but is listed here to complete the list for dynamic structure generation because it can be used for this purpose too.
4D: Data Driven Variable Structures
Data driven variable structure creation is controlled by the newer ON clause. The ON clause is not limited to specifying join criteria. It can also specify data filtering criteria. This allows the specifying of view structures and the applying of data filtering criteria on the ON clause to control whether the hierarchical view join is performed or not depending on the data value in the current data occurrence. The data value can be located up the active structure path or down the structure anywhere in the reachable lower structure view being joined. This ability to test data that is in the lower structure being joined is very powerful; it is actually a look-ahead capability described further in Section 5. These data driven variable structure processing capabilities described in more detail in the article: Dynamic Data Driven Variable Hierarchical Structures in SQL.
5: Extended Hierarchical Modeling for Data Mashups
While experimenting with our ANSI SQL transparent hierarchical XML processor, we determined that it was possible to join two structures and have the lower level linked to anywhere below the root node. By letting the ANSI SQL hierarchical processor perform its natural multipath hierarchical joining allowing linking below the lower level structure we determined that the result always made semantic sense for the query. This makes sense since SQL hierarchical processing is automatic, correct, and has no limitations. To apply this capability to our SQL hierarchical processor, which operates utilizing its structure-aware capability, we had to derive the resulting combined hierarchical structure.
We determined the resulting structure when linking below the lower level root structure was always as if the lower level structure's root was linked to. This is because the lower level structure is fully materialized before joining. This happens because of the natural view nesting controlled by the ending ON clause signals that the right view can now be joined to the left view as in: SELECT * FROM ABC LEFT OUTER JOIN XYZ ON A.a=X.x. This is very easy to process and naturally supports a look-ahead capability. This linking below the lower level root allows for a very powerful data mashup capability with no limitations for hierarchical joining. For more information on linking below the lower level structure's root and why it works see the article: Extending Hierarchical Data Modeling Demonstrated in SQL.
6: Data Structure Transforms and Data Virtualization
While researching data transformations with our ANSI SQL transparent XML hierarchical processor it was determined that there were two basic types of data transformations, which we classified. The first is Restructure transformations, which utilize data relationships in the data. It takes the structure apart and reassembles it utilizing different relationships in the data. The other transformation is Reshaping, which operates by using only the natural semantics of the data structure, which can allow any-to-any structure transformations. These two transformations have different uses for changing the structure versus changing the semantics, but can combine their different capabilities. The Reshaping transform is also useful because it can always be used when there are no data relationships available.
Data transformations normally operate on a single structure while data virtualization collects fragments from multiple structures. In order to support data virtualization operations, using a transformation that utilizes and combines fragments from multiple structures is used. Restructure transformations can utilize their natural multiple structure access ability to support data virtualization. These transformations utilize the natural capability of relational rowsets containing variable hierarchical structures. These relational rowsets also have a conceptual hierarchical fragment manipulation capability needed to perform flexible transformations. For more information on data structure transforms and virtualization see the article: Hierarchical Data Structure Virtualization in SQL.
7: Enhanced Multipath Processing Capabilities
As indicated in this article, most relational operations are transparently and seamlessly elevated to hierarchical operations. But some operations like data ordering require a different approach that more naturally takes advantage of multipath hierarchical processing that does not automatically fit into the existing SQL hierarchical processing. For example, the current Order By operation is linear. In order for it to be extended seamlessly to multipath hierarchical processing, it needed to support multiple path ordering. This is where data ordering can be independently occurring on separate pathways. This becomes very useful and powerful. This was supported in our ANSI SQL transparent XML hierarchical processor prototype with the current Order By syntax, which was semantically applied differently to support multiple path ordering. This type of operation will be extended to the prototype's data summaries and aggregates also. A hierarchical multipath Order BY operation is shown and explained in the article: Extending SQL's Inherent Hierarchical Processing Operation.
8: Automatic Data Renormalization
Automatic data renormalization requires structure-aware processing. It also deserves its own Top 10 entry because of its uniqueness and value. Operations like the relational Cartesian product and inverting 1-to-M relationships to M-to-1 and back to 1-to-M, require a process to remove the introduced data replications to produce the correct results. This is called data renormalization. Renormalization is a difficult operation to perform. Some processors like XQuery offer functions that can be used to perform renormalization when required. In this case, the user must be aware that this processing is necessary and specify it correctly. My company's ANSI SQL transparent XML hierarchical processor prototype supports automatic renormalization, which is a tremendous help because it is hard to keep track of replicated data. Automatic renormalization requires knowledge of the data structure and additional information added automatically to the input data. This data also can include information used to preserve the natural data order of XML data input. Renormalization is covered in more detail in the article: Extending SQL's Inherent Hierarchical Processing Operation.
9: Hierarchical Distributed Processing
The hierarchical oriented LEFT Outer Join can model the most complex heterogeneous virtual data structures. These structures can be easily optimized removing unnecessary pathways and fragments that are not necessary to the processing of the active query. This is possible because the SQL-92 Outer Join has a very recombinant capability. Hierarchical views ABC and XYZ can be easily joined hierarchically in the SQL statement: SELECT * FROM ABC LEFT OUTER JOIN XYZ ON A.a=X.x . This same SQL join statement will work automatically and seamlessly in a distributed computing environment. This is because the ABC and XYZ hierarchical views can also be automatically distributed to different SQL processors where they will naturally perform hierarchical processing. Their separate hierarchical results will be routed back and hierarchically joined together at the originating SQL processor containing the above LEFT Outer Join statement. This flexible hierarchical data modeling is shown in the article: Extended Hierarchical Data Modeling Demonstrated in SQL.
10: Future Growth Capabilities
Two powerful future capabilities that offer unlimited power and growth opportunities down the road are the replacement of the relational engine with a true hierarchical engine, and the support of automatic parallel processing for multicore processing.
10A: Hierarchical Engine Usage
It is possible to transparently swap out the relational engine operating hierarchically for a more powerful, efficient, and flexible hierarchical engine. Operating truly hierarchically this offers the most efficient method and avoids the relational problems of expensive joins and replicated data explosions.
10B: Parallel Processing and Multicore Support
Hierarchical processing of hierarchical structures offers the opportunity to fully parallel process hierarchical queries and support multicore processors. Hierarchical data structures with their separate pathways can be naturally parallel processed to a very significant extent. Their hierarchical structure naturally supplies a roadmap to enable fully automatic parallel processing. This is described in the article: Automatic Full Parallel Processing of Hierarchical SQL Queries.
The new hierarchical processing presented in this article is a new level of hierarchical processing driven by ANSI SQL. The now proven fact that SQL can support hierarchical processing means there is an overlap of the common hierarchical and relational processing. This point of common hierarchical and relational processing solves all current relational/hierarchical integration problems. It enables advanced concepts like joining hierarchical structures; mashups of hierarchical structures; transforms of hierarchical structures with any-to-any reshaping; and generating virtualized data structures. It also introduces new hierarchical concepts such as multipath ordering and performing summaries along multiple pathways. To test the capabilities described in this article, use our online interactive ANSI SQL Transparent XML Hierarchical Processor prototype at: www.adatinc.com/demo.html
Navigationless Database XML: Hierarchical Data Processing