SQL Developer code tips

September 28, 2009

Few code tips for SQL Developer 2.1 (EA is available on OTN).

Create query with few keystrokes

1. s then pause, or hit ctrl-space

in dropdown choose select * from

2. select * from then space

in dropdown choose multiple table names while holding the ctrl key. Suppose you have chosen dept and emp. The table/view names are auto aliased by default.

3. select * from dept d, emp e where then space

in dropdown choose the join condition d.depno=e.empno.

Multiple equality join conditions could be selected while holding ctrl key.

4. Return to the select clause, delete wildcard, and fire code insight. The column list allows multiple selections as well.

Automatic Group By

When assembling aggregation with group by query, don’t rush writing group by clause — it would be appended automatically. For example in the query

select deptno, count(1) from dept

as soon as you finished the aggregate function the group by clause would appear. If you edit the column list, then the group by list is adjusted automatically. (However, if you edit group by clause, then the column list is left unchanged).

There is a growing interest in Domain Specific Languages (DSL), and feeling that this technology would eventually render XML obsolete. Consider the following example of “2+2=4″ written in XML:

snap1

Aside from being insulting to reader’s intelligence, can you spot what is wrong with it? Where do these tree node annotations are coming from? Consider a grammar for expression parser:

constraint : expression ‘<‘ expression
| expression ‘=‘ expression

expression : expression ‘+’ expression
| number

In formal languages and parser theory world we have clear separation of the data (that is “2+2=4″), on one hand, and the schema describing the structure of the data (that is the expression grammar), on the other. In XML world the data and the schema are bundled together into some caricature of the parse tree. Yes, skipping parsing as the hard part of the job is somewhat handy, but what happened to the famous programming principle “Separate content from presentation”?

A tree structure is typically visualized as tree that “grows horizontally”, Windows file manager being a ubiquitous example. SQL execution plan is a tree which, until recently, has been rendered exclusively horizontal. Several client tools, however, broke up with this tradition and offered vertical display, while some articles cheer up this new option. Here is my analysis of pros and cons of each alternative.

1. Vertical display takes more space. The plan tree width is constrained by the amount of horizontal space, divided by the width of the average plan node. For big plans you have to scroll both horizontally and vertically.

2. Standard database textbooks present execution plan in vertical display and, evidently, the entire CS community is preferring to render trees this way. However, textbook examples are rarely chosen to illustrate large trees, which are common in practice.

3. One can argue that it is easier to see the order of execution on tree planted vertically. For parent nodes that have two children it is not easy to see the children if indentation is small, and child nodes are expanded.

The last one seems to be legitimate argument, but my question is: does it matter? Unless you are a professional optimizer developer, do you have time to analyze each and every node of the plan? The question is why the particular SQL query is slow, and how to tune it. Studying the execution flow of rather large plan contributes little to understanding the problem. A performance analyst would get much better return on investment of his time by searching for performance bottlenecks.

Therefore, in addition to plan structure we need some numbers as a basis for quantitative analysis of the performance problem. Horizontal tree display provides some advantage into this respect, because it naturally integrates with the grid. Have a look into this example

dba_tablespace_usage_metrics

and notice additional columns on the right side displaying cost, buffer gets, elapsed time, and rowsource start counter. First, notice the cost entirely mismatching buffer gets and execution time. Therefore, the plan is effectively random. Second, please note that the full scan of the X$KTFBHC has been started 139 times. There is no nested loops of filter node in the ancestor chain, so how is this possible? The answer is that the DBA_TABLESPACE_USAGE_METRICS view definition

CREATE OR REPLACE VIEW "SYS"."DBA_TABLESPACE_USAGE_METRICS" AS
 SELECT t.name,
       sum(f.allocated_space) - (
            select nvl(sum( rb.space), 0)
                   from sys.recyclebin$ rb
                        where t.ts# = rb.ts# ) +  (
            select sum(fc.ktfbhcsz - fc.ktfbhcusz)
                   from x$ktfbhc fc
                    where t.ts# = fc.ktfbhctsn),
...

includes scalar subqueries, which are executed with tuple iteration semantics.

The last question about this execution statistics is why buffer gets and execution tree don’t accumulate in the ancestor chain. Sure the time at any node should be at least as big as the sum of the children? The answer is that this is a partial execution statistics. For most long running queries there is no point waiting them to complete. Canceling after, say, 5 min would still provide pretty good picture where the bottleneck is.

To conclude, SQL Optimization is well known to be a difficult problem. Statistics information is incomplete, robust cost metrics is elusive, and the search space is explosive. The optimization goals are often conflicting. The very first idea that every SQL performance analyst discovers: “The optimization is only as good as its cost estimates”. In the grand scheme of things plan display orientation has minuscule value.

Finally, if you are not convinced, please vote for your favorite plan display orientation here.

Variable number of joins

March 25, 2009

Iggy Fernandezs posted a challenge which looks like routine application of recursive query technique:

with njoins (face_value, probability, joins) as (
SELECT d1.face_value , d1.probability, 1
            FROM die d1
union all
SELECT d1.face_value + d2.face_value AS SUM,
                 d1.probability * d2.probability AS probability, joins+1
            FROM njoins d1, die d2 where d1.joins < 10
) select * from njoins

This is expected to work on Oracle 11.2, DB2, SQL Server (the later two not tested). Filtering out the result with joins = :n aggregating and grouping is left as an exercise to the reader.

Finally

November 18, 2008

This is an answer to Troels Arvin comment to my earlier post comparing different hierarchical methods.

version

output

The predicate limiting the size of the Integers relation would fit more naturally to the outer query block:

with integers(num) as (
select 1 num from dual
union all
select num+1 from integers
) select * from integers where num < 10 ;

Pushing predicate inside recursive query definition doesn’t work, however (same as in competitor products).

Humor challenged audience?

November 17, 2008

There were a couple of inspiring presentations at the latest PDC conference: the technical DSL session, and the introduction to the Oslo Modeling platform. Watch out for an amusing moment closer to the end of the second show (around 45 min), when somebody in the audience asked how this tool is compared to XML. Don replied: “Well, it is not as beautiful as XML…” taking a long pause after it. A barely audible laugh (or two) in the audience. Perhaps he failed to deliver the subtitle for humor impaired: “XML” stands for “eXtremely Messy Language”.

When comparing different hierarchical SQL query methods many people cite inefficiency of tree structure organized into adjacency list. They claim inferior performance of the methods that the big three vendors supplied to handle such structure; in particular, oracle proprietary connect by extension, and DB2 & SQL Server recursive view. I usually dismiss blanket statements like this, because it requires a lot of SQL optimization knowledge to substantiate it. The devil is (as always) in the details.

Chapter 6 of “SQL Design Patterns” book summarized different methods in the following feature matrix:

Adjacency relation (tree edges; standalone, or combined with the tree nodes) Nested Sets Materialized Path Nested Intervals via Matrix encoding
Have to use proprietory SQL extensions for finding ancestors and descendants; although the queries are efficient Standard SQL Standard SQL Standard SQL
Finding descendants is relatively efficient (i.e. proportional to the size of the subtree) Finding descendants is easy and relatively efficient (i.e. proportional to the size of the subtree) Finding descendants is easy and relatively efficient (i.e. proportional to the size of the subtree) Same as MP: Finding descendants is easy and relatively efficient
Finding ancestors is efficient Finding ancestors is easy but inefficient Finding ancestors is tricky but efficient Same as MP: Finding ancestors is tricky but efficient
Finding node’s children is trivial Finding node’s children as all the descendants restricted to the next level is inefficient (e.g. consider root node) Finding node’s children as descendants on next level is inefficient Same as AR: Finding node’s children is trivial
Finding node’s parent is trivial Finding node’s parent as ancestor on the previous level is inefficient due to inefficiency of ancestors search Finding node’s parent as ancestor on the previous level is efficient Same as AR: Finding node’s parent is trivial
Aggregate queries are relatively efficient (i.e. proportional to the size of the subtree) Aggregate queries are relatively efficient (except counting, which is super fast)! Aggregate queries are relatively efficient (i.e. proportional to the size of the subtree) Aggregate queries are relatively efficient (i.e. proportional to the size of the subtree)
tree reorganization is very simple tree reorganization is hard tree reorganization is easy tree reorganization is easy (but not as simple as in AR)

After the book were published I found some wrong assertions in this matrix as well. For example, the claim that nested set ancestors query is slow is not entirely correct. It is true that searching all the ancestors with simple self-join query

select p2.emp from Personnel p1, Personnel p2
where p1.lft between p2.lft and p2.rgt
and p1.emp = 'Chuck'

is not efficient, and indexes on lft and rgt columns can’t be leveraged. However, a while ago I came across a paper that devised some clever (although more complicated) query which is efficient. Unfortunately, I’m not able to locate the exact reference.

The feature matrix gives an impression that adjacency list is superior to the rest of the methods. One important type of query, which was brought to my attention by Stefan Gower from Oracle, is missing from it.

Consider searching for all the descendants by some criteria, say job code. The number of descendants of a node which is close to the top of a big hierarchy may be large, but the selection criteria can be narrow so that the expected result set is small. In adjacency list model there seem to be no way to push the job code equality predicate (which is quite selective) inside hierarchical query. In other words, the query have to fetch a large set of all the descendants and then apply the filter criteria. We may try to rewrite the query, of course: fetch all the nodes by the job code criteria first, then try to find if they are connected to the given ancestor node with a hierarchical query. Clearly, a solution where one have a differently structured query for different data distribution scenario is inferior.

Compare it to Nested sets/intervals/materialized path model where we just add job code predicate to the query and which would be executed differently with different data/query parameters. If the subtree of all node’s descendants is small, then an execution with index range scan of all the nodes which are in the given interval followed by filtering by job code is optimal. In the case when job code is selective, the index range scan by that criteria should be done first. The nested sets query amended with job code predicate is quite simple so that optimizer is unlikely to give us any surprises.

Matrix encoding growth

July 12, 2008

Limitation on computer integer representation put a limit on the size of a tree that Nested intervals/Matrix method can accommodate. For example it is reasonable to ask if the method would be able to deal with a tree with 200 levels with 2000 children at each node. It is easy to see that it can’t. Each node corresponds to some unique 2×2 matrix. Assuming 64 bit integers these matrices can’t possibly represent more than (2^64)^4 nodes. This value is significantly less than 2000^200. On the other hand, no practical tree can possibly have 2000 children at each node. Therefore,for making any realistic size estimation it is important to know the shape of the tree.

Here is an interesting finding related to the tree size estimation. Let scope problem to less ambitious goal of watching the encoding growth of the sequence of the nodes: 1, 1.1, 1.1.1, 1.1.1.1,… It is easy to see that for atomic matrices of the kind

[n 1]
[1 0]

we just have to look into matrix powers

[1 1]
[1 0]

[1 1][1 1]
[1 0][1 0]

[1 1][1 1][1 1]
[1 0][1 0][1 0]

which have elements growing as a Fibonacci sequence. This sequence grows as fast as powers of golden ratio or approximately 1.6. This is one of the features that make Matrix/Farey encoding more attractive than dyadic fractions (which grew as powers of 2 at each level, and moving to the older siblings).

Given that matrix encoding of the various kind have so many in common, I assumed that the growth rate of power 1.6 applies universally. It was surprising to see this is not the case! Consider powers of

[2 -1]
[1 0]

(this is the encoding studied in “SQL design patterns” book). They grow linearly! Fairly quickly (at level 6) we have:

[1 1]^6 [13 8]
[1 0] = [8 5]

compared to

[2 -1]^6 [7 -6]
[1 0] = [6 -5]

Dan Hazel developed one more version of continued fraction tree encoding. It is equivalent to matrix based approach with atomic matrices of the kind

[n + 1 n]
[ ]
[ 1 1]

but should be appealing to the readers who find continued fractions more intuitive compared to matrices.

Here is matrix perspective to his work. The atomic matrix in the foundation of Dan’s approach is

[n + 1 n]
[ ]
[ 1 1]

It has determinant equal to one, therefore general path determinants are not alternating, and nested interval orientation does not change when we ascend to the next tree level. Each Dan’s atomic matrix is a product of the familiar atomic matrix

[n 1]
[ ]
[1 1]

with an auxiliary matrix

[1 1]
[ ]
[1 0]

This is why Dan’s continued fractions contain interleaving sequences of 1s.

This issue is becoming a recurring theme of mine. This time it has been triggered by OTN exchange. To my surprise, people have difficulty accepting this seemingly obvious idea.

In SQL there is syntactic difference between a predicate and a table. However, let’s look beyond what you may have learned from syntax diagrams in Oracle SQL Reference (or other vendor for that matter). For our purposes let’s create a simple table Dept10:

create table Dept10 (
deptno integer;
);
insert into Dept10 values (10);

Here is the question: is there a difference between this table and the predicate deptno = 10?

Sure there is: we can examine the content of Dept10 with a simple query:

select * from Dept10

and we have no means to “examine the content” of a predicate! However, let’s include the predicate deptno = 10 and the Dept10 table into a bigger query:

select dname from dept where deptno = 10

select dname from Dept natural join Dept10;

Now, there is no difference between the two! In fact, you can construct many more queries: for every query involving Dept10 we may compose equivalent query without Dept10 and vice versa, with the only exception being the early mentioned single table query:

select * from Dept10

Therefore, I propose to extend SQL to allow queries like this

select * where deptno = 10

One consequence of this amendment is that the famous Dual table is no longer required. The other implication of equating tables with predicates is that physical query optimization can be achieved in a simpler optimizer framework.

Some people suggested to go even further and just write

select 10

how this proposal fits? Let’s analyze this expression step by step. The legitimate SQL query which achieves this is

select 10 from Dual

where the Dual could be any table or view containing one row. Next, what is this constant 10 in the select clause? I’m asking because in relational algebra you can’t just take a projection operator and substitute a column with an arbitrary expression. We may think of this expression being evaluated for each row, but a better idea is that we invoke some function. In our case it happen to be a function taking an empty argument list and returning a constant

function return10 return integer is
BEGIN
RETURN 10;
END return10;

Now the query transforms to

select return10() from dual

Next, every function is a relation (or predicate) that matches its input with output. Which predicate the function return10 corresponds to? Well, it must be deptno=10! Certainly function output is unnamed, so the predicate argument deptno can be anything. With the query output column being aliased

select return10() as deptno from dual

however, we can interpret row-wise function invocation as a join:

select deptno from dual natural join Dept10

Removing the redundant Dual we get

select deptno from Dept10

Once again, the Dept10 can be thought as a table which materializes the deptno = 10 predicate, and we established that there is unambiguous interpretation of the queries

select 10

and

select 10 as deptno