Counting repetitive values

February 8, 2007

The counting method in the SQL Design Patterns book bursted into the following finale:

How about counting rows that are not necessarily distinct? This is where our method breaks. It is challenging to distinguish duplicate rows by purely logical means, so that various less “pure” counting methods were devised. They all, however, require extending the SQL syntactically, which was the beginning of slipping along the ever increasing language complexity slope.

Then, the discussion slips into SQL analytics extensions.

Yet, it is possible to adapt this counting method to general case. If we apply it as it is the the result looks somewhat puzzling:


select t.sal, count(*) from emp t, emp tt
where tt.sal <= t.sal
group by t.sal

 SAL   COUNT(*) 
 5000   18 
 3000   34 
 2975   15 
 2850   14 
 2450   13 
 1600   12 
 1500   11 
 1300   10 
 1250   18 
 1100   7 
 950   6 
 800   5 
 5   16 

How can there be 34 values less than 3000, when there are only 18 values less than 5000? The issue is clearly the duplicate value 3000, so my immediate reaction was to suggest switching to analytic query.

On afterthought, if salaries in the emp table contain duplicates, perhaps, we just fix that in the inline view? Here is the amended query:

with e as (
  select sal, count(1) dupl
  from emp
  group by sal
)
select t.sal, t.dupl, count(*)
from e t, e tt
where tt.sal <= t.sal
group by t.sal, t.dupl

This query, however, might not be what the original poster had in mind. It ask for “How many salaries are below any given salary“. This is little different from “How many employees have their salary below any given salary“. However, it is easy to adjust our query to this slight change in the requirement:

with e as (
  select sal, count(1) dupl
  from emp
  group by sal
)
select t.sal, t.dupl, count(*)
from e t, emp tt
where tt.sal <= t.sal
group by t.sal, t.dupl

In a word, the relational style query proved to be quite flexible. Arguably, it involves little thinking, whereas in SQL analytics world you just mindlessly plug in some magic function — if you are lucky to have one.

Next, there is a recurring performance argument. Analytic query iseems to be faster, but is it always the case?

Consider salary rank query of all the people in the department 10, or a single person in a big corporation. The self-join query is easily adjusted to the new requirements:

SELECT E1.ENAME, E1.SAL, COUNT(*) RNK
FROM EMP E1, EMP E2
WHERE E2.SAL <= E1.SAL
AND E1.ename = 'BLAKE'
GROUP BY E1.ENAME, E1.SAL

Here a reasonable execution strategy would be fetching employee called Blake by unique index scan, then finding all matching rows from table aliased E2, where full table scan is unavoidable. This has the same or better performance than analytics query, where the execution plan remains the same — scanning the full table, then filtering the data related to employee called Blake.

Furthermore, what if EMP is not actually a table but a view, how can we be sure that optimizer would be able to find all the optimizations that it can do in simple select-project-join-group by operation space? Moreover, we can easily accommodate the above query to the problem like

“What is the salary rank of Blake among all the employees in the department 40″

This query is legitimate even though Blake doesn’t belong to department 40, while in analytics case you would be forced to artificially add Blake to the department 40 before firing analytics query.

In a word, join is the most powerful operation in the relational algebra. It appears everywhere in the database and programming fields, even in places which you wouldn’t suspect to. You certainly know that set intersection and Cartesian product are joins and, probably, that the selection is join also; but are you aware that function invocation is join, and function composition is join as well? Recall that when relational model appeared many people were skeptical of the join operation performance. And then, couple decades later, some people insist that analytic solution solution which removed joins somehow “rocks”?

Leave a Reply