Proving that the standard relational algebra operations are independent is considered an exercise in a standard database theory course. Here is an extract from Prof. Kolaitis' slides:

Theorem: Each of the five basic relational algebra operations is
independent of the other four, that is, it cannot be expressed by a
relational algebra expression that involves only the other four.

Proof Sketch: (projection and Cartesian product only)

Property of projection: it is the only operation whose output may have
arity smaller than that of its input. Show, by induction, that the output
of every relational algebra expression in the other four basic relational
algebra operations is of arity at least as big as the maximum arity of
its arguments.

Property of Cartesian product: it is the only operation whose output has
arity bigger than that of its inputs. Show, by induction, that the output
of every relational algebra expression in the other four basic relational
algebra operations is of arity at most as big as the maximum arity of
its arguments.
Exercise: Complete this proof.

This seems convincing until we carry this question over to the relational bilattice. Then it is evident that \{ \wedge,\vee,\lhd NOT \rhd, \lhd INV \rhd \} — that is, natural join, generalized union, negation, and inversion — are independent. Indeed, natural join and generalized union are monotonic in both attributes and tuples, so the proof sketch from relational algebra can be employed here as well. Next, negation is a domain-dependent operation which can produce tuples not present in the original relation. Likewise, inversion is an “attribute”-dependent operation which can add attributes not present in the original relation.

What about the second lattice structure: are its operations independent of these four? They are not — outer union can be expressed via De Morgan’s law:
x \lhd OR \rhd y = \lhd NOT \rhd (\lhd NOT \rhd x \wedge \lhd NOT \rhd y )
and likewise its lattice dual:
\lhd NOT \rhd (\lhd NOT \rhd x \vee \lhd NOT \rhd y )

Next, there are 0-ary operations, that is, constants: R00 (aka TABLE_DUM), R01 (aka TABLE_DEE), R10, and R11 (not to be confused with the universal relation). The latter three can be expressed in terms of the first one:

R10 = \lhd INV \rhd R00
R01 = \lhd NOT \rhd R00
R11 = \lhd INV \rhd \lhd NOT \rhd R00

so the question extends to whether the set \{ \wedge,\vee,\lhd NOT \rhd, \lhd INV \rhd, R00 \} is independent. It is instructive to compare our conjecture with Boolean algebra. There are many independent sets of operations in BA, the most ubiquitous being \{ \wedge,\vee,\lhd NOT \rhd \}. The Boolean constants are then expressible in terms of those, e.g.:
0 = \lhd NOT \rhd x \wedge x
Here is similar relational lattice expression
R00 = \lhd NOT \rhd(x \vee \lhd NOT \rhd(x \vee \lhd INV \rhd x))
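
Before concluding, these identities are easy to sanity-check computationally. Here is a minimal Python sketch — not QBQL, just an illustration — assuming the usual relational lattice semantics over a small hypothetical universe of two binary attributes (all names and domains are illustrative):

from itertools import product

ATTRS = {'a': (0, 1), 'b': (0, 1)}   # hypothetical finite attribute domains

def tuples_over(header):
    hs = sorted(header)
    return frozenset(frozenset(zip(hs, vals))
                     for vals in product(*(ATTRS[a] for a in hs)))

def proj(t, header):
    return frozenset((a, v) for a, v in t if a in header)

def meet(x, y):                      # natural join, the lattice ^
    (hx, bx), (hy, by) = x, y
    c = hx & hy
    return (hx | hy, frozenset(s | t for s in bx for t in by
                               if proj(s, c) == proj(t, c)))

def join(x, y):                      # inner union, the lattice v
    (hx, bx), (hy, by) = x, y
    c = hx & hy
    return (c, frozenset(proj(t, c) for t in bx | by))

def NOT(x):                          # <NOT>: complement the set of tuples
    h, b = x
    return (h, tuples_over(h) - b)

def INV(x):                          # <INV>: complement the set of attributes
    h, b = x
    hc = frozenset(ATTRS) - h
    return (hc, tuples_over(hc) if b else frozenset())

def OR(x, y):                        # outer union via the De Morgan identity
    return NOT(meet(NOT(x), NOT(y)))

R00 = (frozenset(), frozenset())                               # TABLE_DUM

# The three remaining constants expressed in terms of R00
assert INV(R00) == (frozenset(ATTRS), frozenset())             # R10
assert NOT(R00) == (frozenset(), frozenset([frozenset()]))     # R01 aka TABLE_DEE
assert INV(NOT(R00)) == (frozenset(ATTRS), tuples_over(ATTRS)) # R11

# <OR> agrees with its direct definition: a tuple over the union header
# belongs to x <OR> y iff its projection lands in x or in y
x = (frozenset('a'), frozenset([frozenset([('a', 0)])]))
y = (frozenset('ab'), frozenset(t for t in tuples_over('ab') if ('b', 1) in t))
direct = (frozenset('ab'),
          frozenset(t for t in tuples_over('ab')
                    if proj(t, x[0]) in x[1] or proj(t, y[0]) in y[1]))
assert OR(x, y) == direct

# R00 itself is expressible from an arbitrary relation x
assert NOT(join(x, NOT(join(x, INV(x))))) == R00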

To conclude, relational algebra, cast in genuine algebraic shape, contains a set of four independent operations.

A long long proof

May 7, 2012

Decomposition of a relation into a join of projections serves as the motivation for database normalization theory. In relational lattice terms, relation x is projected onto the sets of attributes (that is, empty relations) s and t:

x = (x v s) ^ (x v t)

Let's investigate the dual perspective and switch the roles of join and inner union:

x = (x ^ s) v (x ^ t)

One particular instance of this identity is known as the fundamental decomposition identity (FDI):

x = (x ^ R00) v (x ^ R11)

which informally asserts that a relation is the inner union of its header (i.e., its set of attributes) and its content (its set of tuples). The fundamental decomposition identity can be generalized into

x = (x ^ y`) v (x ^ y')

where the ' is the relation complement and ` is the attribute inversion. Both operations are domain-dependent (which might explain the reluctance of some researchers to adopt them).
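
Before reaching for a theorem prover, the identity can be tested empirically. The following Python sketch — assuming, as in the earlier sketch, the usual relational lattice semantics over a hypothetical universe of two binary attributes — checks FDI and the generalized decomposition identity exhaustively, over every pair of relations in that small universe:

from itertools import product, combinations

ATTRS = {'a': (0, 1), 'b': (0, 1)}   # hypothetical finite universe

def tuples_over(header):
    hs = sorted(header)
    return frozenset(frozenset(zip(hs, vals))
                     for vals in product(*(ATTRS[a] for a in hs)))

def proj(t, header):
    return frozenset((a, v) for a, v in t if a in header)

def meet(x, y):                      # natural join ^
    (hx, bx), (hy, by) = x, y
    c = hx & hy
    return (hx | hy, frozenset(s | t for s in bx for t in by
                               if proj(s, c) == proj(t, c)))

def join(x, y):                      # inner union v
    (hx, bx), (hy, by) = x, y
    c = hx & hy
    return (c, frozenset(proj(t, c) for t in bx | by))

def NOT(x):                          # postfix ' : complement the tuples
    h, b = x
    return (h, tuples_over(h) - b)

def INV(x):                          # postfix ` : complement the header
    h, b = x
    hc = frozenset(ATTRS) - h
    return (hc, tuples_over(hc) if b else frozenset())

def relations():                     # every relation over the universe
    for k in range(len(ATTRS) + 1):
        for hs in combinations(sorted(ATTRS), k):
            h = frozenset(hs)
            ts = sorted(tuples_over(h), key=sorted)
            for n in range(len(ts) + 1):
                for body in combinations(ts, n):
                    yield (h, frozenset(body))

R00 = (frozenset(), frozenset())
R11 = (frozenset(ATTRS), tuples_over(ATTRS))

for x in relations():
    assert join(meet(x, R00), meet(x, R11)) == x               # FDI
    for y in relations():
        assert join(meet(x, INV(y)), meet(x, NOT(y))) == x     # generalized FDI

A finite check is, of course, no proof. Automated proof of the generalized decomposition identity uses the following Prover9 input: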

% Language Options

op(300, postfix, "`" ).
op(300, postfix, "'" ).

formulas(assumptions).

% Standard lattice axioms
x ^ y = y ^ x.
(x ^ y) ^ z = x ^ (y ^ z).
x ^ (x v y) = x.
x v y = y v x.
(x v y) v z = x v (y v z).
x v (x ^ y) = x.

% Litak et al.
x ^ (y v z) = (x ^ (z v (R00 ^ y))) v (x ^ (y v (R00 ^ z))).
(x ^ y) v (x ^ z) = x ^ ( ((x v z) ^ y) v ((x v y) ^ z) ).
(R00 ^ (x ^ (y v z))) v (y ^ z) = ((R00 ^ (x ^ y)) v z) ^ ((R00 ^ (x ^ z)) v y).

% Unary postfix negation operation ' aka <NOT>
x' ^ x = x ^ R00.
x' v x = x v R11.

% Unary postfix inversion operation ` aka <INV>
x` ^ x = R11 ^ x.
x` v x = R00 v x.

% FDI
x = (x ^ R00) v (x ^ R11).
% Distributivity of join over outer union aka <OR>
x ^ (y' ^ z')' = ((x ^ y)' ^ (x ^ z)')'.

end_of_list.

formulas(goals).
x = (x ^ y') v (x ^ y`).

end_of_list.

The proof search turned out to be quite CPU-consuming:

% -------- Comments from original proof --------
% Proof 1 at 4894.53 (+ 50.37) seconds.
% Length of proof is 867.
% Level of proof is 47.
% Maximum clause weight is 61.
% Given clauses 10875.

How fast is your system?

Shadows of the Truth by Alexandre Borovik offers an interesting perspective on the human learning experience. The book abounds with examples of concepts that are challenging at an early age but clarified later (sometimes much later) in life. One of my stumbling blocks was the definition of implicit and explicit dependencies. Now, with the modest help of relational thinking (that is, basic dependency theory), I can report some progress.

The intuition behind implicit and explicit dependencies is clear. For example, given the two functions

y(x,t) = x^2 + t
x(t) = t^2

then, in the formula for y we notice two variables x and t , which suggests that y explicitly depends on t . Compare it with

y(x,t) = x^2
x(t) = t^2

where the formula for y involves the variable x only. Since the second line expresses x in terms of t , y still depends on t , but the dependence is implicit.

The concept of implicit and explicit dependencies surfaces in many places, for example Partial Differential Equations and Noether's conservation theorems, both part of the undergraduate math and physics curriculum. Nevertheless, most textbooks take this concept for granted, perhaps implying that a mathematically mature reader should have no problem understanding it. Wikipedia offers a couple of dedicated articles: Dependent and Independent Variables, giving an overview and intuition, and the more ambitious Time-Invariant System, with an attempt at a formal definition.

The concept of a time-invariant system belongs to physics, the idea being that if we shift the time variable t , this doesn't affect the system's behavior. This is illustrated by the “formal example” in the middle of the page where, by comparing values of y(x,t) at the two arguments t vs. t+\delta , they suggest that y(x,t) = t x(t) is time-dependent, while y(x,t) = 10 x(t) is not. Compared to math, physics standards of rigor are lax, so it takes little effort to find a flaw: take x(t) = t ; then y(x,t) = t x(t) = {x(t)}^2 , which mentions no explicit t , so y(x,t) is time-invariant with a proper choice of x(t) !

Can relational dependency theory suggest any insight? Glancing over the standard definition of functional dependency:

\forall y_{1} \forall y_{2} \forall x : S(x, y_{1}) \wedge S(x, y_{2}) \implies y_{1} = y_{2}

and independence:

\forall y_{1} \forall y_{2} \forall x_{1} \forall x_{2} : S(x_{1}, y_{1}) \wedge S(x_{2}, y_{2}) \implies y_{1} = y_{2}

it becomes obvious that dependency concepts hinge upon equality/inequality (amended with some quantifiers, perhaps) and not upon the algebraic structure of the domain (time shifts). Let's examine two examples closely:

y(x,t) = t-x
x(t) = t^2-2 t

vs.

y(x,t) = x^2-3 x
x(t) = t^2-2 t

Tabulating their values at the points t=0,1,2,3 , we get the relations

R=[t   x   y]
   0   0   0
   1  -1   2
   2   0   2
   3   3   0
;

and

S=[t   x   y]
   0   0   0
   1  -1   4
   2   0   0
   3   3   0
;

respectively. The second relation S honors the FDs t->x and x->y (and, by Armstrong transitivity, t->y), while the first one R honors only t->x and t->y. Therefore, variable y being [explicitly] dependent on t is formally equivalent to the absence of the functional dependency x->y — if not counterintuitive, then terminologically confusing to say the least!
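
The observation is easy to machine-check. Here is a small Python sketch (the holds helper is hypothetical, written for this post, not taken from any library) that tabulates both function pairs and tests the functional dependencies:

def holds(rows, lhs, rhs):
    """Does the functional dependency lhs -> rhs hold in rows?"""
    image = {}
    for r in rows:
        key = tuple(r[a] for a in lhs)
        val = tuple(r[a] for a in rhs)
        if image.setdefault(key, val) != val:
            return False
    return True

# y(x,t) = t - x with x(t) = t^2 - 2t, tabulated at t = 0..3
R = [{'t': t, 'x': t*t - 2*t, 'y': t - (t*t - 2*t)} for t in range(4)]
# y(x,t) = x^2 - 3x with the same x(t)
S = [{'t': t, 'x': t*t - 2*t, 'y': (t*t - 2*t)**2 - 3*(t*t - 2*t)} for t in range(4)]

assert holds(R, ['t'], ['x']) and holds(R, ['t'], ['y'])
assert not holds(R, ['x'], ['y'])    # y depends on t explicitly
assert holds(S, ['x'], ['y'])        # y depends on t only through x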

The 20th century was highlighted by two notable insights into the nature of information. In 1948, Claude Shannon published the classic paper “A Mathematical Theory of Communication”, suggesting a quantitative information measure — entropy — as the average number of bits needed to store or communicate one symbol in a message. The paper had tremendous impact: it launched a whole new discipline — Information Theory. Two decades later, Edgar Codd invented the Relational Model, which equipped the previously vague concept of “pieces of information” with formal structure.

Since then, there has been at least one direct attempt to marry the two perspectives, when Jürg Kohlas proposed Information Algebra. Other, less ambitious connections between the two disciplines were discovered within relational database dependency theory. Squeezed into one sentence, this approach introduces quantitative measures of attribute dependency and proceeds to express known dependencies (functional, multivalued, etc.) in their terms. This blog posting pursues an even more modest goal: first, describing a well-known characterization of functional dependencies in terms of lattice partitions; then calculating partition entropy and the Gini index; and, finally, proving that all three orders are consistent.

Let’s start with an example relation:

Classes=[Prof    Course  Time]
         Libkin  DB101   Tue200
         Libkin  DB101   Thu500
         Gromov  Math    Tue200
         Gromov  Math    Thu500
         Vianu   DB101   Tue200
;

A binary partition of the relation attributes, say \{Prof\}|\{Course,Time\} , partitions the Classes tuples into 3 sets, shown below as the QBQL output.

The QBQL relational programming system calculates partitions via a binary operation; in our running example it is

Classes#[Prof];

which outputs

Classes#[Prof] = <Libkin,DB101,Tue200> <Libkin,DB101,Thu500> | 
                 <Gromov,Math,Tue200> <Gromov,Math,Thu500> | 
                 <Vianu,DB101,Tue200>;

Partitions can be compared pairwise: informally, partition x is greater than partition y if it is coarser. Our next step is proving that this ordering is consistent with two numerical partition measures.

Partition entropy is defined via the Shannon information measure formula

\displaystyle -\sum\limits_{i=1}^n p_i\ln(p_i)

where the summation ranges over the blocks of the partition and p_i is the probability of selecting a tuple from block i . Let's calculate the entropy of the Classes#[Prof] partition in our example. Again, we automate this task with QBQL, and since later on we would like to calculate the entropy of other partitions, we'll provide a generic definition. Let's emphasize that this theoretically humble idea is one of the major practical benefits of QBQL over SQL and Datalog. First, we define 3 empty relations:

CntRelHdr = [cnt];
CardRelHdr = [card];
plpHdr = [plp];

This is just an implementation artifact/bug, as QBQL should be able to inline these relvars into any query. Then we progressively define partition probability <Prob> and entropy <Entropy> as binary operations:

x <Prob> y = ((x |v| (y ^ CntRelHdr)) ^ (x |v| CardRelHdr)) /^ "cnt / card = prob".
x <Entropy> y = ((((x <Prob> y) ^ "ln(prob)=lp") /^ "prob*lp=plp") v plpHdr) /= "result += plp".

The definitions are generic, because you can plug in any two relations in place of x and y. For example, if you are interested in calculating the entropy of Classes#[Prof], you just type:

Classes <Entropy> [Prof];

Let's walk through the <Prob> implementation. The expression y ^ CntRelHdr is a join of relation y with the previously defined empty relation CntRelHdr; since their attribute sets are disjoint, it is a Cartesian product. In other words, we just want the set of attributes of relation y amended by one more attribute, cnt.

At the next step — x |v| (y ^ CntRelHdr) — we see what appears to be an unfamiliar binary infix operator |v| applied to our previous result and relation x. The v symbol is reminiscent of inner union (which is a generalization of union and projection), amended with vertical bars, which in standard math notation usually denote cardinality. It is the QBQL counterpart of SQL group by with counting: the grouping is done over the set of common attributes of the two relations x and y ^ CntRelHdr, while the count column header is the set of attributes of y ^ CntRelHdr that are not in x. In a typical application, such as Classes <Entropy> [Prof], the set of attributes of x is a subset of that of y; therefore, the name of the counting attribute is cnt.

On the other branch of the expression tree we compute x |v| CardRelHdr. Here the relation attributes are disjoint, so this is just counting with no grouping (which is equivalent to counting with grouping over the empty set). The two previous results — x |v| (y ^ CntRelHdr) and x |v| CardRelHdr — again have disjoint sets of attributes, so joining them is essentially a Cartesian product. The final operation — relational composition with "cnt / card = prob" (which is the user-defined ternary predicate Times in disguise) — calculates the probability of each partition block. Again, let's not be distracted by the fact that this query is elementary in SQL:

select Prof, count(*)/card prob
from Classes, (select count(*) card from Classes)
group by Prof, card

First, it is not generic (it would have to be rewritten for every other target relation and for different subsets of its attributes); second, it uses ad-hoc tuple-level expression evaluation (the division operator). In our example, once the generic binary entropy operation is defined, calculating entropy values for all possible splits of the relation attributes is as succinct as it can possibly be:

Classes <Entropy> [Prof]; 
Classes <Entropy> [Course]; 
Classes <Entropy> [Time]; 
Classes <Entropy> [Prof Course]; 
Classes <Entropy> [Course Time]; 
Classes <Entropy> [Prof Time]; 
Classes <Entropy> [Prof Course Time];
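
For readers without QBQL at hand, here is a rough Python counterpart of the same calculation — a sketch only: the relation and attribute names follow the running example, everything else is illustrative:

from math import log
from itertools import combinations

attrs = ('Prof', 'Course', 'Time')
classes = [
    ('Libkin', 'DB101', 'Tue200'),
    ('Libkin', 'DB101', 'Thu500'),
    ('Gromov', 'Math',  'Tue200'),
    ('Gromov', 'Math',  'Thu500'),
    ('Vianu',  'DB101', 'Tue200'),
]

def blocks(rows, subset):
    """Group rows by their values on the chosen attributes."""
    part = {}
    for row in rows:
        key = tuple(row[attrs.index(a)] for a in subset)
        part.setdefault(key, []).append(row)
    return part.values()

def entropy(rows, subset):
    n = len(rows)
    return -sum(len(b) / n * log(len(b) / n) for b in blocks(rows, subset))

def gini(rows, subset):
    n = len(rows)
    return sum(len(b) / n * (1 - len(b) / n) for b in blocks(rows, subset))

for k in range(1, len(attrs) + 1):
    for subset in combinations(attrs, k):
        print(subset, round(entropy(classes, subset), 3),
              round(gini(classes, subset), 3))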

The numerical order of the entropy values is consistent with the partition order, and this is not a coincidence. Consider a single block \{a_1,a_2,...,a_n\} split into the two blocks \{a_1,...,a_i\} | \{a_{i+1},...,a_n\} , and assume that both are part of a bigger partition of a set of N elements. Then the corresponding contribution to the (un-negated) sum \sum p_i\ln(p_i) changes from

\frac{n}{N} \ln(\frac{n}{N}) + (terms of the other blocks)

to

\frac{i}{N} \ln(\frac{i}{N}) + \frac{n-i}{N} \ln(\frac{n-i}{N}) + (the same terms of the other blocks)

Since the first term of the former sum can be rewritten as

\frac{i}{N} \ln(\frac{i+n-i}{N}) + \frac{n-i}{N} \ln(\frac{n-i+i}{N})

it is greater than the latter ( \ln is monotone, and n > i , n > n-i ). The leading minus sign flips the inequality; therefore, we have established that entropy monotonically increases with partition granularity. The Shannon entropy function is not unique in that respect. A similar proposition is valid for the Gini index:

\displaystyle\sum\limits_{i=1}^n p_i(1-p_i)
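
The same splitting argument goes through for the Gini index (a sketch, in the notation above): replacing the single-block term

\frac{n}{N}(1-\frac{n}{N})

with

\frac{i}{N}(1-\frac{i}{N}) + \frac{n-i}{N}(1-\frac{n-i}{N})

changes the sum by

\frac{n^2 - i^2 - (n-i)^2}{N^2} = \frac{2i(n-i)}{N^2} > 0

so the Gini index, too, monotonically increases with partition granularity.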

Documentation Code Snippets

October 11, 2011

Code templates/snippets are one of the modern IDE facilities accelerating programming. Some time ago I used to complain about the cumbersome main() function signature in Java, which escaped my limited memorization abilities, until I learned that there is a convenient code template offered by code assist, so the syntax doesn't matter. Sometimes the answer is just Ctrl-Space away!

The SQL Developer 3.1 preview release is out and offers improvements in the snippet/template arena. First, the old user-programmed templates and snippets are demoted from code assist/insight; you can still invoke them the traditional way. The SQL and PL/SQL fragments shown in code insight are fetched from the documentation. That's right: Oracle documentation contains hundreds if not thousands of code snippets, so why insist on entering them manually?

Let’s go through some examples. Starting an anonymous PL/SQL block prompts code insight like this:

The first entry here is an individual keyword suggestion. The next three listings are code fragments captured from SQL history. Then there is "..." hinting that the list can be expanded, followed by three entries from the documentation. Selecting the first one outputs:

Code insight is syntax-driven; a code snippet is suggested only in the appropriate context. For example, a standalone "exc" prefix won't bring up any exception snippets, but within our anonymous PL/SQL block it would:

If you are wondering where a particular snippet comes from, you can press the button with the library icon. It will open a new browser window pointing to the relevant documentation page:


This feature may compete with Google for the fastest way to a forgotten corner of the Oracle documentation. Try it: the REGEXP signature, recursive WITH, every other clause of the humongous SQL syntax is right there at your keystroke.

Select from Insert

September 22, 2011

People's imagination has no limits. Where does it lead SQL? Here is some bizarre syntax:

SELECT column1
  FROM FINAL TABLE
  (INSERT INTO t1 VALUES(1,'John Smith') )

Wow, a whole semicolon has been spared — at the expense of syntax that confuses a relation with an operator!

You would think the DB2 designers would become the laughing stock of the SQL community? Wrong. Be prepared for some competition:

WITH sal1 AS
(
   SELECT * FROM NEW
   (
     INSERT INTO EMPLOYEES(EMPLOYEE_ID, SALARY) VALUES(207, 1000)
   )
)
, sal2 AS
(
   SELECT * FROM NEW
   (
     UPDATE EMPLOYEES SET SALARY = SALARY*1.1
   )
)
SELECT * FROM EMPLOYEES;

Then why not INSERT as SELECT FROM (INSERT as SELECT), and so on?

TABLE_DUM divided by Relation

September 1, 2011

Here is a curious identity:

TABLE_DUM divided by a relation evaluates to the relation’s complement.

Formally in QBQL:

TABLE_DUM /= x = <NOT> x.

What division am I talking about? Googling “relational division” brings up a list of the usual suspects — articles by Date, Celko, etc. — with some confusion leading to “Todd's division”, a ternary (!) division operation, and so on. QBQL follows the much cleaner definitions of set equality join, set containment join, and the like, readily found in academic literature. In the above identity, “/=” is the binary set equality join operation, informally known as relational division.

The dual identity is also interesting. First, the dual of TABLE_DUM, aka R00, is the universal relation R11. The dual of the unary complement operation is inversion. Duality among the multiple division-like operations is not evident, but the reader might verify that set intersection join — analogous to composition in the algebra of binary relations — fits the bill:

R11 /^ x = <INV> x.
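
Both identities are easy to test computationally. Below is a Python sketch over a tiny hypothetical universe of two binary attributes, assuming a domain-dependent reading of the division-like joins: in x /= y (set equality join) a result tuple pairs a tuple over x's own attributes with a tuple over y's own attributes whose images over the common attributes coincide, with both parts ranging over all tuples of the universe; x /^ y (set intersection join) merely requires the images to intersect.

from itertools import product

ATTRS = {'a': (0, 1), 'b': (0, 1)}   # hypothetical finite universe

def tuples_over(header):
    hs = sorted(header)
    return frozenset(frozenset(zip(hs, vals))
                     for vals in product(*(ATTRS[a] for a in hs)))

def proj(t, header):
    return frozenset((a, v) for a, v in t if a in header)

def image(rel, t, common):
    """Common-attribute tuples co-occurring with t in rel."""
    h, b = rel
    own = h - common
    return frozenset(proj(s, common) for s in b if proj(s, own) == t)

def divide(x, y, agree):             # shared shape of the division-like joins
    (hx, _), (hy, _) = x, y
    c = hx & hy
    return ((hx - c) | (hy - c),
            frozenset(tx | ty
                      for tx in tuples_over(hx - c)
                      for ty in tuples_over(hy - c)
                      if agree(image(x, tx, c), image(y, ty, c))))

def set_eq_join(x, y):               # /= : the images must coincide
    return divide(x, y, lambda u, v: u == v)

def set_ix_join(x, y):               # /^ : the images must intersect
    return divide(x, y, lambda u, v: bool(u & v))

def NOT(x):                          # <NOT>: complement the set of tuples
    h, b = x
    return (h, tuples_over(h) - b)

def INV(x):                          # <INV>: complement the set of attributes
    h, b = x
    hc = frozenset(ATTRS) - h
    return (hc, tuples_over(hc) if b else frozenset())

R00 = (frozenset(), frozenset())                 # TABLE_DUM
R11 = (frozenset(ATTRS), tuples_over(ATTRS))     # the universal relation

x = (frozenset('a'), frozenset([frozenset([('a', 1)])]))
assert set_eq_join(R00, x) == NOT(x)
assert set_ix_join(R11, x) == INV(x)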

Empty String vs NULL

July 29, 2011

How weird is the language you are programming in? A user survey on Stack Overflow is the second most voted thread there. Somewhere on page 2 we encounter the following SQL code snippet:

create table wtf (key number primary key, animal varchar2(10));
insert into wtf values (1,'dog');
insert into wtf values (2,'');
insert into wtf values (3,'cat');
select * from wtf where animal <> 'cat';

(The second row is not returned: in Oracle the empty string is treated as NULL, and NULL <> 'cat' evaluates to unknown.) I don't agree with the sentiment of that post, though. How exactly would returning an animal (or any other thing, for that matter) named with an empty string of characters make the user happy? For all practical purposes an empty string of data is useless, and equating it with NULL just reduces complexity.

Oracle 11g introduced a “syntax-aware” code search facility — PL/Scope. One can look up identifier definitions, usages, and exceptions, which is a much more sophisticated and targeted search compared to the old way of finding all occurrences of a string in ALL_SOURCE. Certainly, few developers are expected to use PL/Scope à la carte; IDEs, such as SQL Developer, accommodate it within their [object] search. Here is a screenshot illustrating a search for the formal parameter P3:

So, with PL/Scope available, is the old way — finding all occurrences of a string in ALL_SOURCE — obsolete? Yes, it has been effectively deprecated, which turned out to be an ill-considered decision. Apparently, some people like the old way better, especially since PL/Scope offers no way to search comments. A fix is expected in the next SQL Developer release, but for now here is a workaround. Among other things, it demonstrates that many features are nearly duplicated, and certainly a search can be accommodated by SQL Developer reports. In fact, there is an off-the-shelf report, “Search Source Code”, already, although many would find the owner = USER predicate impeding their goal. With little effort, however, one can customize it to approximate what the Search Panel does (for RDBMS versions below 11.1). One can even prettify the output with HTML formatting; here are the steps:

1. Go to the Reports tab -> User Defined Reports and select “Add Report” from the context menu.
2. Name the report, e.g. “Fancy all_source”, and copy and paste the following prettified query:

select
    owner "Owner",
    name  "PL/SQL Object Name",
    type  "Type",
    line  "Line",
    '<html><font bgcolor=#dddddd>'
    ||substr(text,1,instr(upper(text),upper(:TEXT_STRING))-1)||'</font>'
    ||'<font bgcolor=#bbffbb>'
    ||substr(text,instr(upper(text),upper(:TEXT_STRING)),length(:TEXT_STRING))
    ||'</font>'
    ||'<html><font bgcolor=#dddddd>'
    ||substr(text,instr(upper(text),upper(:TEXT_STRING))+length(:TEXT_STRING))
          "Text",
    owner  sdev_link_owner,
    name   sdev_link_name,
    type   sdev_link_type,
    line   sdev_link_line 
from sys.all_source
where (:TEXT_STRING is null or
           instr(upper(text),upper(:TEXT_STRING)) > 0)
    and length(text)<3000  --not like '%wrapped%'
    and name not like 'BIN$%'
    order by owner, name, type, line

3. Use the Go To option from the context menu on a PL/SQL object name to navigate to the PL/SQL compilation unit.

Here is a screenshot of the report output:

ANSI join syntax

March 2, 2011

Here is an unconventional way to write a query in ANSI SQL join syntax:

SELECT 1
   FROM DEPARTMENTS C
        JOIN EMPLOYEES A
             JOIN JOBS B
     ON C.DEPARTMENT_ID = A.DEPARTMENT_ID
     ON A.JOB_ID = B.JOB_ID

It is valid syntax according to ANSI SQL-92. Indeed, the fragment

EMPLOYEES A
             JOIN JOBS B
     ON C.DEPARTMENT_ID = A.DEPARTMENT_ID

is a qualified join, which is a joined table, which is a table reference.

Therefore, one can just take a conventional restrict-project-Cartesian-product query, replace all the commas in the FROM clause with the JOIN keyword, and then replace the WHERE and AND keywords with ON.

An argument in favor of the ANSI style is the allegedly cleaner textbook example:

SELECT 1
   FROM DEPARTMENTS C
        JOIN EMPLOYEES A  ON C.DEPARTMENT_ID = A.DEPARTMENT_ID
        JOIN JOBS B  ON A.JOB_ID = B.JOB_ID

although, as we have witnessed, the ANSI grammar doesn't enforce it. A developer is free to mix the first and the second variants into an unbelievably messy query!

My feelings are reflected in a quote by Anthony Molinaro (author of “SQL Cookbook”):

“Old style is short and sweet and perfect. ANSI dumbed it down, and for people who’ve been developing for sometime, it’s wholly unnecessary”.
