One more Nested Intervals vs. Adjacency List comparison
August 9, 2008
When comparing different hierarchical SQL query methods many people cite inefficiency of tree structure organized into adjacency list. They claim inferior performance of the methods that the big three vendors supplied to handle such structure; in particular, oracle proprietary connect by extension, and DB2 & SQL Server recursive view. I usually dismiss blanket statements like this, because it requires a lot of SQL optimization knowledge to substantiate it. The devil is (as always) in the details.
Chapter 6 of “SQL Design Patterns” book summarized different methods in the following feature matrix:
|Adjacency relation (tree edges; standalone, or combined with the tree nodes)||Nested Sets||Materialized Path||Nested Intervals via Matrix encoding|
|Have to use proprietory SQL extensions for finding ancestors and descendants; although the queries are efficient||Standard SQL||Standard SQL||Standard SQL|
|Finding descendants is relatively efficient (i.e. proportional to the size of the subtree)||Finding descendants is easy and relatively efficient (i.e. proportional to the size of the subtree)||Finding descendants is easy and relatively efficient (i.e. proportional to the size of the subtree)||Same as MP: Finding descendants is easy and relatively efficient|
|Finding ancestors is efficient||Finding ancestors is easy but inefficient||Finding ancestors is tricky but efficient||Same as MP: Finding ancestors is tricky but efficient|
|Finding node’s children is trivial||Finding node’s children as all the descendants restricted to the next level is inefficient (e.g. consider root node)||Finding node’s children as descendants on next level is inefficient||Same as AR: Finding node’s children is trivial|
|Finding node’s parent is trivial||Finding node’s parent as ancestor on the previous level is inefficient due to inefficiency of ancestors search||Finding node’s parent as ancestor on the previous level is efficient||Same as AR: Finding node’s parent is trivial|
|Aggregate queries are relatively efficient (i.e. proportional to the size of the subtree)||Aggregate queries are relatively efficient (except counting, which is super fast)!||Aggregate queries are relatively efficient (i.e. proportional to the size of the subtree)||Aggregate queries are relatively efficient (i.e. proportional to the size of the subtree)|
|tree reorganization is very simple||tree reorganization is hard||tree reorganization is easy||tree reorganization is easy (but not as simple as in AR)|
After the book were published I found some wrong assertions in this matrix as well. For example, the claim that nested set ancestors query is slow is not entirely correct. It is true that searching all the ancestors with simple self-join query
select p2.emp from Personnel p1, Personnel p2
where p1.lft between p2.lft and p2.rgt
and p1.emp = 'Chuck'
is not efficient, and indexes on lft and rgt columns can’t be leveraged. However, a while ago I came across a paper that devised some clever (although more complicated) query which is efficient. Unfortunately, I’m not able to locate the exact reference.
The feature matrix gives an impression that adjacency list is superior to the rest of the methods. One important type of query, which was brought to my attention by Stefan Gower from Oracle, is missing from it.
Consider searching for all the descendants by some criteria, say job code. The number of descendants of a node which is close to the top of a big hierarchy may be large, but the selection criteria can be narrow so that the expected result set is small. In adjacency list model there seem to be no way to push the job code equality predicate (which is quite selective) inside hierarchical query. In other words, the query have to fetch a large set of all the descendants and then apply the filter criteria. We may try to rewrite the query, of course: fetch all the nodes by the job code criteria first, then try to find if they are connected to the given ancestor node with a hierarchical query. Clearly, a solution where one have a differently structured query for different data distribution scenario is inferior.
Compare it to Nested sets/intervals/materialized path model where we just add job code predicate to the query and which would be executed differently with different data/query parameters. If the subtree of all node’s descendants is small, then an execution with index range scan of all the nodes which are in the given interval followed by filtering by job code is optimal. In the case when job code is selective, the index range scan by that criteria should be done first. The nested sets query amended with job code predicate is quite simple so that optimizer is unlikely to give us any surprises.