When comparing different hierarchical SQL query methods many people cite inefficiency of tree structure organized into adjacency list. They claim inferior performance of the methods that the big three vendors supplied to handle such structure; in particular, oracle proprietary connect by extension, and DB2 & SQL Server recursive view. I usually dismiss blanket statements like this, because it requires a lot of SQL optimization knowledge to substantiate it. The devil is (as always) in the details.
Chapter 6 of “SQL Design Patterns” book summarized different methods in the following feature matrix:
Adjacency relation (tree edges; standalone, or combined with the tree nodes) | Nested Sets | Materialized Path | Nested Intervals via Matrix encoding |
---|---|---|---|
Have to use proprietory SQL extensions for finding ancestors and descendants; although the queries are efficient | Standard SQL | Standard SQL | Standard SQL |
Finding descendants is relatively efficient (i.e. proportional to the size of the subtree) | Finding descendants is easy and relatively efficient (i.e. proportional to the size of the subtree) | Finding descendants is easy and relatively efficient (i.e. proportional to the size of the subtree) | Same as MP: Finding descendants is easy and relatively efficient |
Finding ancestors is efficient | Finding ancestors is easy but inefficient | Finding ancestors is tricky but efficient | Same as MP: Finding ancestors is tricky but efficient |
Finding node’s children is trivial | Finding node’s children as all the descendants restricted to the next level is inefficient (e.g. consider root node) | Finding node’s children as descendants on next level is inefficient | Same as AR: Finding node’s children is trivial |
Finding node’s parent is trivial | Finding node’s parent as ancestor on the previous level is inefficient due to inefficiency of ancestors search | Finding node’s parent as ancestor on the previous level is efficient | Same as AR: Finding node’s parent is trivial |
Aggregate queries are relatively efficient (i.e. proportional to the size of the subtree) | Aggregate queries are relatively efficient (except counting, which is super fast)! | Aggregate queries are relatively efficient (i.e. proportional to the size of the subtree) | Aggregate queries are relatively efficient (i.e. proportional to the size of the subtree) |
tree reorganization is very simple | tree reorganization is hard | tree reorganization is easy | tree reorganization is easy (but not as simple as in AR) |
After the book were published I found some wrong assertions in this matrix as well. For example, the claim that nested set ancestors query is slow is not entirely correct. It is true that searching all the ancestors with simple self-join query
select p2.emp from Personnel p1, Personnel p2
where p1.lft between p2.lft and p2.rgt
and p1.emp = 'Chuck'
is not efficient, and indexes on lft and rgt columns can’t be leveraged. However, a while ago I came across a paper that devised some clever (although more complicated) query which is efficient. Unfortunately, I’m not able to locate the exact reference.
The feature matrix gives an impression that adjacency list is superior to the rest of the methods. One important type of query, which was brought to my attention by Stefan Gower from Oracle, is missing from it.
Consider searching for all the descendants by some criteria, say job code. The number of descendants of a node which is close to the top of a big hierarchy may be large, but the selection criteria can be narrow so that the expected result set is small. In adjacency list model there seem to be no way to push the job code equality predicate (which is quite selective) inside hierarchical query. In other words, the query have to fetch a large set of all the descendants and then apply the filter criteria. We may try to rewrite the query, of course: fetch all the nodes by the job code criteria first, then try to find if they are connected to the given ancestor node with a hierarchical query. Clearly, a solution where one have a differently structured query for different data distribution scenario is inferior.
Compare it to Nested sets/intervals/materialized path model where we just add job code predicate to the query and which would be executed differently with different data/query parameters. If the subtree of all node’s descendants is small, then an execution with index range scan of all the nodes which are in the given interval followed by filtering by job code is optimal. In the case when job code is selective, the index range scan by that criteria should be done first. The nested sets query amended with job code predicate is quite simple so that optimizer is unlikely to give us any surprises.
Recursive SQL has been standardized since 1999 and it is implemented by two of the “big three” by now. I find that recursive SQL can hardly be described as “proprietary” any more.
[…] 18, 2008 This is an answer to Troels Arvin comment to my earlier post about comparing different hierarchical […]
Perhaps “recursive SQL” has been standardized.
I don’t know about that. But reviewers of the various algorithms showed that they had to traverse the entire tree and cache it in main memory to be effective.
For very large hierarchies (think 10 million and more), this starts to be less and less efficient.
So the adjective “recursive” really isn’t a sign of good performance in this case. It is a sign of a problem.
On the other hand, I don’t claim to have a solution.
I have concocted, coded and tested code that can
incrementally maintain massive hierarchies in SQL.
It is possible to maintain an index such that every single descendant is within the numeric range of all of its ancestors. Such an index can be exploited by SQL. That is, one can readily find all the nodes in a subtree under any arbitrary node. That is, if the range of a node in a hierarchy is 10000-20000, find all the nodes that are within this range. This identifies all the nodes in any node’s subtree. And this is done without any recursion. Instead, a real persistent index is used.
On the other hand, it is still not clear to me if the selectivity of a range-based indexing method (not cited above) is clearly superior to the other methods. The only thing certain, to my mind, is that there is no accepted standard for defining efficient hierarchical indices in standard relational database.s
Dear Vadim,
having read your papers (and Dan Hazel’s) on Nested Intervals via Matrix encoding –
is there any library or project out there actually implementing this.
I would imagine something the PostgreSQL extension ltree (http://www.postgresql.org/docs/current/static/ltree.html)
I would like to participate in such project.
Kind Regards
Peter
I’ve linked to your comparison from Stackoverflow – I would be flattered if you weighed in on the question I’ve posted about the various options for representing hierarchical data:
http://stackoverflow.com/questions/4048151/what-are-the-options-for-storing-hierarchical-data-in-a-relational-database
@Peter Buchmann https://github.com/clyfe/acts_as_nested_interval