Pivot and Unpivot

October 14, 2010

SQL has many flaws, and the fact that its syntax poorly reflects the duality of pivot and unpivot queries is probably somewhere at the bottom of the list. To be fair, relational algebra doesn’t do any better. What are pivot and unpivot queries in QBQL?

Consider the following matrix example:

\begin{bmatrix} 3 & 1 \\ 1 & 0 \end{bmatrix}

This 2×2 matrix can be represented relationally in two ways:
• as a ternary relation with two subscript attributes and one entry attribute:

Au=[i j s]
    1 1 3
    1 2 1
    2 1 1
    2 2 0
;

• as a rectangular array with dimension j pivoted into the columns j1 and j2:

Ap=[i j1 j2]
    1 3  1
    2 1  0
;

Expressing Ap in terms of Au is the pivot operation, while expressing Au in terms of Ap is unpivot.

Au =
((Ap /^ [j1=s]) /^ [j]1)
v
((Ap /^ [j2=s]) /^ [j]2)
. 

Ap =
((Au /^ [j1=s]) /^ [j]1)
^
((Au /^ [j2=s]) /^ [j]2)
.

Notice that both expressions are almost identical, so the duality of pivot and unpivot is exposed as the formal relational lattice duality of join and union.
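
For comparison, here is roughly how the same two transformations read in Oracle’s SQL dialect (the 11g PIVOT/UNPIVOT syntax; the relation names Au and Ap are reused from the QBQL example above):

-- pivot: Au(i, j, s) into Ap(i, j1, j2)
select * from Au
pivot ( max(s) for j in (1 as j1, 2 as j2) );

-- unpivot: Ap(i, j1, j2) back into Au(i, j, s)
select * from Ap
unpivot ( s for j in (j1 as 1, j2 as 2) );

Nothing in this syntax hints that the two operations are dual to each other.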

Finally

November 18, 2008

This is an answer to Troels Arvin’s comment on my earlier post comparing different hierarchical methods.

The predicate limiting the size of the integers relation fits more naturally in the outer query block:

with integers(num) as (
    select 1 num from dual
    union all
    select num+1 from integers
)
select * from integers where num < 10;

Pushing the predicate inside the recursive query definition doesn’t work, however (the same as in competitor products).

When comparing different hierarchical SQL query methods, many people cite the inefficiency of a tree structure organized as an adjacency list. They claim inferior performance of the methods that the big three vendors supply to handle such a structure: in particular, the Oracle proprietary CONNECT BY extension, and the DB2 & SQL Server recursive view. I usually dismiss blanket statements like this, because it takes a lot of SQL optimization knowledge to substantiate them. The devil is (as always) in the details.

Chapter 6 of the “SQL Design Patterns” book summarized the different methods in the following feature matrix (abbreviated below: AR = Adjacency Relation, NS = Nested Sets, MP = Materialized Path, NI = Nested Intervals via Matrix encoding):

• Query language: AR (tree edges; standalone, or combined with the tree nodes) has to use proprietary SQL extensions for finding ancestors and descendants, although the queries are efficient; NS, MP, and NI need only standard SQL.
• Finding descendants: relatively efficient in AR (i.e. proportional to the size of the subtree); easy and relatively efficient in NS and MP; NI is the same as MP.
• Finding ancestors: efficient in AR; easy but inefficient in NS; tricky but efficient in MP; NI is the same as MP.
• Finding a node’s children: trivial in AR; inefficient in NS, where the children are all the descendants restricted to the next level (e.g. consider the root node); likewise inefficient in MP; NI is the same as AR: trivial.
• Finding a node’s parent: trivial in AR; inefficient in NS, where the parent is the ancestor on the previous level, due to the inefficiency of the ancestors search; efficient in MP; NI is the same as AR: trivial.
• Aggregate queries: relatively efficient in AR, MP, and NI (i.e. proportional to the size of the subtree); relatively efficient in NS as well, except counting, which is super fast!
• Tree reorganization: very simple in AR; hard in NS; easy in MP; easy in NI (but not as simple as in AR).
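
To make the first two items concrete, here is what a descendants query looks like in each camp, assuming the Personnel table from the examples below, with lft and rgt columns for NS and a hypothetical boss (parent) column for AR:

-- AR: Oracle's proprietary CONNECT BY extension
select emp from Personnel
start with emp = 'Chuck'
connect by boss = prior emp;

-- NS: standard SQL self-join on the lft/rgt interval
select p2.emp from Personnel p1, Personnel p2
where p2.lft between p1.lft and p1.rgt
and p1.emp = 'Chuck';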

After the book was published I found some wrong assertions in this matrix as well. For example, the claim that the nested sets ancestors query is slow is not entirely correct. It is true that searching for all the ancestors with a simple self-join query

select p2.emp from Personnel p1, Personnel p2
where p1.lft between p2.lft and p2.rgt
and p1.emp = 'Chuck'

is not efficient, and indexes on the lft and rgt columns can’t be leveraged. However, a while ago I came across a paper that devised a clever (although more complicated) query which is efficient. Unfortunately, I’m not able to locate the exact reference.

The feature matrix gives the impression that the adjacency list is superior to the rest of the methods. One important type of query, however, which was brought to my attention by Stefan Gower from Oracle, is missing from it.

Consider searching for all the descendants matching some criteria, say a job code. The number of descendants of a node close to the top of a big hierarchy may be large, while the selection criteria can be narrow, so that the expected result set is small. In the adjacency list model there seems to be no way to push the job code equality predicate (which is quite selective) inside the hierarchical query. In other words, the query has to fetch the large set of all the descendants and only then apply the filter criteria, as the sketch below illustrates. We may try to rewrite the query, of course: fetch all the nodes matching the job code criteria first, then check with a hierarchical query whether they are connected to the given ancestor node. Clearly, a solution where one has a differently structured query for each data distribution scenario is inferior.
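
Here is the first formulation in Oracle’s CONNECT BY dialect (a sketch, again assuming the hypothetical boss and job_code columns). In a hierarchical query the WHERE predicate is applied only after the whole subtree has been traversed:

select emp from Personnel
where job_code = 'ENG'   -- filtered after the hierarchy is built
start with emp = 'Chuck'
connect by boss = prior emp;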

Compare this to the nested sets/intervals/materialized path models, where we just add the job code predicate to the query, and the same query would be executed differently for different data/query parameters. If the subtree of all the node’s descendants is small, then an execution with an index range scan of all the nodes within the given interval, followed by filtering by job code, is optimal. In the case where the job code is selective, the index range scan by that criterion should be done first. The nested sets query amended with the job code predicate is quite simple, so the optimizer is unlikely to give us any surprises.
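
The amended nested sets query might look like this (a sketch, with the same hypothetical job_code column); the optimizer is free to drive it either from an index on lft or from an index on job_code:

select p2.emp from Personnel p1, Personnel p2
where p2.lft between p1.lft and p1.rgt
and p1.emp = 'Chuck'
and p2.job_code = 'ENG';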

I just noticed an exchange, Help Implementing Nested Invervals using Maxtrix Encoding, which never showed up in Google groups and, therefore, was left unanswered. There Steven Johnston pointed out that the multiplication of 32 bit integers (which is used when checking whether one tree node is a descendant of the other) would normally produce a 64 bit integer. Then, what is the point of using the slightly more efficient Matrix encoding/Nested Intervals with continued fractions if integer multiplication overflow seems to render the whole method invalid? Well, the fix is straightforward, although not pretty. To reliably evaluate whether

descendant.a11*node.a21-descendant.a11*node.a22 <=
node.a11*descendant.a21-node.a12*descendant.a21

one has to introduce a user-defined function:

function isDescendant( da11 integer, na21 integer, na22 integer, .... )

Within the function implementation, chop each 32 bit function argument into two 16 bit chunks. When multiplying the numbers, use the formula:

(A*2^16 + B)*(C*2^16 + D) = A*C*2^32 + (A*D + B*C)*2^16 + B*D
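
A minimal PL/SQL sketch of the chunked multiplication follows (my illustration, not code from the exchange; the name mul32x32 and the 'hi:lo' return format are made up). Oracle’s NUMBER type would not overflow anyway, so this merely spells out the steps that a host language with 32 bit integers has to take for non-negative inputs:

create or replace function mul32x32( x integer, y integer )
return varchar2
is
    a integer := trunc(x / 65536);  -- high 16 bits of x
    b integer := mod(x, 65536);     -- low 16 bits of x
    c integer := trunc(y / 65536);  -- high 16 bits of y
    d integer := mod(y, 65536);     -- low 16 bits of y
    lo integer;
    mid integer;
    hi integer;
begin
    lo  := b * d;           -- fits in 32 bits
    mid := a * d + b * c;   -- a*d and b*c each fit in 32 bits; the sum may carry into bit 33
    hi  := a * c;
    -- fold the middle term into the low and high 32 bit words
    lo := lo + mod(mid, 65536) * 65536;
    hi := hi + trunc(mid / 65536) + trunc(lo / 4294967296);
    lo := mod(lo, 4294967296);
    return hi || ':' || lo;
end;
/

Comparing two such products then reduces to comparing the high words first and the low words second.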

I understand the frustration of people who have to spend their time working around snags like this, but this is not really an interesting issue. A more exciting question may be: what if we change the ordinary addition & multiplication algebra (+,*) into the tropical semiring (min,+)? There is no multiplication, so no matter what calculations we do, overflow is impossible. Unfortunately, this idea doesn’t work :-(

Once in a while an interesting problem pops up on the Oracle OTN forums
http://forums.oracle.com/forums/thread.jspa?threadID=530584&tstart=0
where Joe Fuda published an inspiring solution leveraging Regular Expressions.

The fact that regular expressions are powerful enough to generate a very elegant solution is not surprising at all, because they are based on deep and fruitful Programming Language Theory. Furthermore, there is a certain similarity between database and language theory operators. Both have a union operator with identical properties, and both have a join, which differs slightly between the two. This partially explains why the regular expressions feature fits so snugly into SQL.

Let’s see how regular expressions can be leveraged for a standard SQL pattern, String Decomposition. Given a table of strings


create table sources (
num integer,
src varchar2(1000)
);
insert into sources values(1, '123,45,6,7,8');
insert into sources values(2, '8,9');
commit;
select * from sources;

NUM   SRC
1     123,45,6,7,8
2     8,9

the query should output a relation with tuples containing the individual substrings that were “separated by commas” in the original table:

NUM   SRC            DECOMPOSED STRING
1     123,45,6,7,8   8
1     123,45,6,7,8   123
1     123,45,6,7,8   7
1     123,45,6,7,8   6
1     123,45,6,7,8   45
2     8,9            8
2     8,9            9

When we decompose a string into a set of substrings, we change concatenation into union. In our example, concatenation is expressed via the comma character, so what we want to do is replace the comma with the regular expression disjunction operator. Here is the query:


select distinct num, src,
       regexp_substr( src, replace(src, ',', '|'), 1, level )
from sources, dual
connect by regexp_substr( src, replace(src, ',', '|'), 1, level ) is not null

An alternative solution, leveraging the regular expression negation operator,


select distinct src, regexp_substr(src,'[^,]+',1,level)
from sources
connect by regexp_substr(src,'[^,]+',1,level) is not null

is slightly shorter, but I fail to notice much theory behind it.

What makes the method less practical, however, is the fact that each string fragment is derived multiple times, an artifact of CONNECT BY execution (hence the distinct in both queries). An alternative solution with the MODEL operator is much more verbose, but it also gives a programmer much more control over how the execution is driven.
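
Here is a sketch of one possible MODEL-based decomposition (an illustration of the idea, not a canonical solution); each token is computed exactly once, in the iteration that creates its row:

select num, src, token
from (
    select num, src, token
    from sources
    model
        return updated rows
        partition by (num)
        dimension by (0 as i)
        measures (src, cast(null as varchar2(1000)) as token)
        rules iterate (1000) until (token[iteration_number] is null)
        (
            token[iteration_number] = regexp_substr(src[0], '[^,]+', 1, iteration_number + 1),
            src[iteration_number]   = src[0]  -- carry the source string onto each new row
        )
)
where token is not null;

The iteration stops as soon as regexp_substr runs out of tokens and returns null; the outer filter discards that final null row.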