Hierarchical Data Structure Transformation in SQL
August 19, 2010
The two basic types of data structure transformation terms, Restructuring and Reshaping, are used interchangeably with data structure transformations. This article categorizes and defines these two types of data structure transformations and uses SQL to demonstrate the transformations, using SQL's natural hierarchical data structure processing capability.
I have found that there are two basic types of data structure transformations. I have also noticed that the transformation terms Restructuring and Reshaping are used interchangeably with data structure transformations. These two terms bring to mind different types of transformations I have associated with the two basic transformations. In this article, I will categorize and define these two types of data structure transformations. I will be using SQL to demonstrate these types of data structure transformations using SQL's natural hierarchical data structure processing capability.
Data structures are built to a great extent by utilizing the related data values in the data to make the linkages that can bring the data together in different ways. Restructuring is performed by using these data relationships in the data as linkages in order to change the structure to model these new relationships. This is done by taking the structure apart as needed and applying new data relationships not currently used or reusing them differently. This introduces new meaning and semantics to the resulting structure. In addition, these data relationships can involve comparisons and formulations to make the new relationships linkages.
Changing the topic of Restructuring to Reshaping, reshaping brings to mind a molding process by shifting pieces of the structure around as in molding a piece of clay. This means that that there are no limitations to what the resulting structure can be, allowing any structure to be transformed into any other structure producing any-to-any structure transformations. The trick with performing this type of transform is that the semantics of the structure must drive how the transform is performed in order to preserve the basic semantics that are applied to the creation of the new structure. This means the naturally implied data relationships from the physically or logically juxtaposition of the data in the structure flows with the structure as it is molded. This produces the desired new structure while taking into consideration the current semantics represented by the data.
Restructuring is used when new semantics are needed and data relationships are available to support this. The resulting structure is not a concern. Reshaping is used when a specific result structure is necessary as in hierarchical data modeling. The resulting semantics are not a primary concern but are expected to be hierarchically derived from the source structure. Restructuring requires that the necessary data relationships are available while Reshaping does not require any external or internal requirements so it is always available to use.
The following SQL Restructure operation in Figure 1 slices out the Invoice node shown below, and makes it the new root in the resulting structure. The data relationship used in making this new structure may be the same or a previously unused data relationship. If this data structure was retrieved from a contiguous data structure like XML there may not have been any physical data relationships to use. But on the other hand, once any type of data structure (physical or logical, contiguous or linked) is retrieved into a relational rowset it can be freely taken apart. Reassembling will depend on the data relationships that can be made.
The Alias feature of SQL beside its renaming use allows making multiple, separately named copies of a structure in the rowset, which can be used in taking apart the structure above and hierarchically reassembling it. The separately identified data from the two named copies of the identified structure in the FROM statement above is identified by the prefixes of X and Y associated with their created data copies in the FROM clause. The LEFT Outer join in Figure 1 models the restructured data as Y over X since Y is preserved when X is not. This is why InvID is identified in the SELECT statement with a Y prefix and EmpID, DndID and AddrID are identified with an X prefix in the SELECT statement since InvID is desired at a higher hierarchical level than the others. The higher level Y.InvID is used in the SELECT List and is used in the ON clause to reflect its proper position in the new hierarchy.
Restructuring Example Additional Level
This is the same as the previous Restructuring example except in addition to moving the Invoice to the top of the new structure, the Addr node is repositioned under the Dpnd node. This is done by adding another level to the structure by using another LEFT Join to attach the Addr node under the Dpnd node.
Notice in the above SQL in Figure 2 compared to the previous SQL in Figure 1 that the additionally added LEFT Join creates another copy of the structure to reference using the prefix of W. The AddrId field in the SELECT list now uses the W prefix so it can be referenced in the new LEFT join. This LEFT join is used to move the Addr node under the Dpnd node.
The following examples will demonstrate a number of data structure reshaping examples that cover linear to linear, linear to nonlinear, nonlinear to linear and nonlinear to nonlinear structure transforms. The previous restructuring used physical data relationships in the data to perform the transform. This can not be used in reshaping and a technique similar to the restructuring is used. It uses the same processing of using multiple copies of the structure, but the coordination between the copies of the structures is different in that the relationship of the structures is now made between the same named node types of the two structure copies that are being moved in the order that they are being moved.
Inverting a 1 to M Linear Three Level Structure
This is a linear structure inversion using nodes Cust over Emp over Dpnd (a series of 1 to M relationships) in Figure 3 below. This will demonstrate that the alignment joining starts at the bottom of the structures, and drives the inversion upward at each node as the selected inverted node (in bold) from each level is selected: SELECT first: X.Dpnd, second: Y.Emp, third: Z.Dpnd. Only these three nodes represented by their data are selected once each for output so they are squeezed together through natural node promotion to keep their structure which is naturally inverted and now has M to 1 relationships.
Linear to Nonlinear Reshaping Logic
Linear structures can be reshaped into nonlinear structures. In this example in Figure 4 below, the linear structure Cust over Emp over Dpnd can be used to generate the structure Emp directly over the siblings Dpnd and Cust. Two copies of the input structure are necessary and are joined by their common Emp node since it is the starting node to building the new structure. Only two copies of the structure are necessary because the first copy can be used to move two nodes, Emp over Dpnd, which are already related as required. This places Emp over the Dpnd and Cust siblings because Emp is linked to Cust. This creates the nonlinear structure desired and these same node values (in the SELECT list) will be selected to create the nonlinear structure desired. By placing Emp over Cust, Cust is naturally and correctly replicated.
The previous linear examples and this nonlinear example have not lost the semantics of the input structure in the new structure because the semantics have been kept the same or inverted. This means the nodes have basically kept attached to the same nodes as shown below in the derived result structure. Emp over Dpnd remains the same while Emp over Cust has been inverted.
Nonlinear to Linear Reshaping
Nonlinear structures as input can also be used to build linear and nonlinear structures. In fact, nonlinear structures offer more flexibility in how they are utilized because their multiple paths offer more opportunity to find the correct reshaping being sought. This means less input copies need to be used.
The nonlinear input structure in Figure 5 below can be duplicated to create a linear structure reshaping of itself using the SQL below. Since we are starting with the Inv node as the root, this will be the first matching link. Cust becomes available in the second level structure, which is valid since it is related to Inv the related link point. Emp can also be selected from the same lower structure copy since it is already located under the Cust node.
Nonlinear to Nonlinear Reshaping
This nonlinear to nonlinear example, in figure 6 below, is very similar to the previous nonlinear to linear SQL example in Figure 5, where the linear structure Inv over Cust over Emp was produced easily. The current example below demonstrates that a nonlinear structure can transform into a different nonlinear structure. This example copies the previous example but places Emp not under Cust but under Inv making Cust and Emp siblings for this structure. This requires a third copy of the input structure also matched to Inv because that is where Emp is being attached to. The Emp node is accessed indirectly from Inv up to Cust then back down to Emp, a powerful related semantic reshape. Basically, the third structure and additional join in this example was necessary to move Emp from under Cust to under the Inv node.
A polymorphic reshaping does not rely on the structure of the input structure. This means that the input structure could be in any shape. The reshaping capability shown in Figure 3 and 6 do support polymorphic reshaping because only one node is moved per join. This means any shape structure with the same nodes can be transformed into the desired resulting structure. For example, the different input structure in Figure 7 would produce the same results from the SQL in Figure 6. The examples in Figure 4 and 5 are not polymorphic because it relies on a specific shape of the input structure by moving two related nodes at one time. The choice of using reduced steps or a polymorphic solution is up to the user. The examples in Figures 4 and 5 could have been written to be performed in three steps moving a single node at a time making them polymorphic when it is useful.
The SQL Reshaping and Restructuring shown do take a little bit of procedural programming. But a high level nonprocedural transform language could be designed for reshaping based only on the desired structure and the SQL can be generated or internally performed.
It may have occurred to you that the need to create multiple copies of the structure in the examples, which could be costly in memory use. Most SQL processors optimize the multiple copies of the structure to use one physical copy.
The different operations of Reshaping and Restructuring can be combined as needed. The transformation examples shown are described in more detail and can be tested at http://www.adatinc.com/demo.