## Relational Algebra syntax in QBQL

### September 3, 2013

When it comes to teaching query languages within introductory database class, a system that supports relational algebra is a must. The first step is ASCII-matizing relational algebra syntax. One approach is to be faithful to RA notation by adopting LATEX syntax. Here is an example from popular Stanford class2go (registration required):

\project_{pizza} \select_{20 <= age and gender='female'} (Eats \join Person);

When implementing support of Relational Algebra in QBQL, we have decided not to emphasize query rendering in its “native” relational algebra expression form in favor of more readable ASCII syntax. Here is the same query adapted to QBQL syntax:

project pizza select "20 <= age" ^ "gender='female'" (Eats join Person); 

It is instructive to compare it to native QBQL syntax, where one can leverage the full power of Relative Composition (aka Set Intersection Join):
 Eats /^ Person /^ "20 <= age" /^ "gender='female'"; 

With the following input (QBQL syntax):

Eats = [name, pizza] Amy mushroom Amy pepperoni Ben cheese Ben pepperoni Cal supreme Dan cheese Dan mushroom Dan pepperoni Dan sausage Dan supreme Eli cheese Eli supreme Fay mushroom Gus cheese Gus mushroom Gus supreme Hil cheese Hil supreme Ian pepperoni Ian supreme ;

 

Person = [name, age, gender] Amy 16 female Ben 21 male Cal 33 male Dan 13 male Eli 45 male Fay 21 female Gus 24 male Hil 30 female Ian 18 male ; 

all three queries output

[pizza] cheese mushroom supreme 

## Paramodulation vs. Resolution

### February 2, 2013

Equations are ubiquitous in mathematics, physics, and computer science. In database field an optimizer rewriting a query to standard restrict-project-join leverages relational algebra identities. The craft of those reasoning abilities is culminated in Automated Theorem Proving systems.

Modern systems, such as Prover9, employs many sophisticated techniques, but as far as equality is concerned, paramodulation rules them all. In a seminal paper Robinson & Wos give an example of a problem which solution takes 47 steps with paramodulation vs. 136 steps with resolution.

The difference can be even more dramatic. Consider lattice idempotence law derived from the two absorption axioms:

formulas(assumptions). x ^ (x v y) = x. x v (x ^ y) = x. end_of_list.

 

formulas(goals). x ^ x = x. end_of_list.

It takes Prover9 just a single paramodulation step and 0.00 (+ 0.03) seconds. To compare it with resolution, we have to explicitly write down equality identities fit to the two lattice operations:

x = x. x != y | y = x. x != y | y != z | x = z. x != y | x ^ z = y ^ z. x != y | z ^ x = z ^ y. x != y | x v z = y v z. x != y | z v x = z v y. 

This explosion of axioms alone hints that the proof length must increase. It is the magnitude of the increase that caught me by surprise: 98 steps and 101 sec.

## NOT EXISTS in Relational Algebra and QBQL

### September 24, 2012

Many people criticize SQL for bloated and inconsistent syntax. However, one must admit that some of its features give Relational Algebra run for the money. Consider recent stack overflow question “Relational Algebra equivalent of SQL “NOT IN””

Is there a relational algebra equivalent of the SQL expression NOT IN? For example, if I have the relation:
 A1 | A2 ---------- x | y a | b y | x 
I want to remove all tuples in the relation for which A1 is in A2. In SQL I might query:
 SELECT * FROM R WHERE A1 NOT IN( SELECT A2 FROM R ); 
What is really stumping me is how to subquery inside the relational algebra selection operator, is this possible?:

$\sigma_{some subquery here}R$

The answer, which was accepted, suggested the query:

$R - \rho_{A1,A2} ( \pi_{a_{11},a_{22}} \sigma_{a_{11}=a_{22}} (\rho_{a_{11},a_{12}} R \bowtie \rho_{a_{21},a_{22}} R) )$

The length and complexity of this expression is puzzling: one can’t convincingly argue of inferiority of SQL to “truly relational” languages without being able to crack it.

On close inspection it becomes evident that the author used more renaming than its necessary, so we have:

$R - ( \pi_{A1,A2} \sigma_{A1=A2} (\rho_{A1,a_{12}} R \bowtie \rho_{a_{21},A2} R) )$

It is still longer and less intuitive than SQL query. Can Relational Algebra do better than this? Also, if SQL is allowed to use extra operations which are equivalent to Relational Algebra antijoin, then perhaps the competition ground is not fair?

There is one operation — relational division, initially included by Codd into the list of base operation, but excluded later when it has been discovered that it can be expressed in terms of the others. We’ll demonstrate that with division-like operators at hand we can achieve much shorter query.

Modern database theory treats relational division as set-containment-join. Set containment join is easy to describe if we temporarily step outside of first normal form relations. Consider the following relations

Certified=[name  skill]
claire  java
max  java
max  sql
peter  html
peter  java
;
Requirements=[job  skill]
appl  java
appl  sql
teaching  java
web  html
;

We can nest their common attribute skill into the set

Certified=[name  skillSet]
claire  {java}
max  {java,sql}
peter  {html,java}
;
Requirements=[job  skillSet]
appl  {java,sql}
teaching  {java}
web  {html}
;

Next, for each pair of applicant and job we check if their skills matches job requirements, that is if one is subset of the other, and keep those pairs which meet this criteria.

This is classic set containment join, but mathematically inclined person would be tempted to do something about empty sets. What about the other applicants and jobs in their respective domains, shouldn’t they be included into both relations as well? Suppose there is a person named “Joe” who doesn’t have any qualification, and let the “entry” job assume no requirements at all:

Certified=[name  skillSet]
claire  {java}
max  {java,sql}
peter  {html,java}
joe {}
;
Requirements=[job  skillSet]
appl  {java,sql}
teaching  {java}
web  {html}
entry {}
;

Then, the amended set containment join would output additional tuple of Joe qualified for entry level job.

Certainly, the amended set join is no longer domain independent, but one may argue that the benefits overweight the drawbacks. For example, compared to classic set containment join, the amended operation expressed in terms of the basic ones is much more concise. We’ll show relational algebra with amended set containment join would greatly simplify our puzzle query as well.

In general, to sharpen intuition around problems like this it is useful to leverage relational algebra query engine. As usual, we use QBQL. Most of QBQL operations generalize Relational Algebra, and, therefore, are easily translatable into the later.

Here is QBQL query step by step. The most surprising is the very first step where we divide the input relation R over the empty relation with attribute A2:

R=
[A1   A2]
x     y
a     b
y     x
;
R /< [A2];


which outputs
 [A1] b 
Here an explanation may be warranted. In order to evaluate set containment join we have to partition sets of attributes of R and [A2] into exclusive and common attributes. The attribute set of the result is common attributes that is symmetric difference of the attributes of both inputs. In our example the output header is, as expected, {A1}. Now, let’s evaluate the output content. For R the exclusive/common attributes partition is easy: the A1 is exclusive attribute, while A2 is common. In “de-first-normalized” form the table R looks like this:
 R= [A1 A2] x {y} y {x} a {b} b {} 
For the relation [A2] the common attribute is again A2, while the exclusive one is empty. How we set-join such a relation? In de-normalized form the table [A2] looks like this:
 [A2 {}] {} R01 
That’s right, sets are relations with single attribute, and when we are challenged to exibit a nested relation with no attributes, then we have only two choices: R01 and R00 (i.e. TABLE_DEE and TABLE_DUM). After set containment join we obtain
 [A1 {}] b R01 
where the fictitious empty set column should be excluded from the normalized result.

After obtaining the relation
 [A1] b 
the next steps are easy: rename the attribute to A2 and join with R. In QBQL renaming is technically set-intersection join (/^) with binary equality relation (although set-equality /^ and set-containment would produce the same result), so the query becomes:

R ^ ("A1=A2" /^ (R /< [A2]));


and it outputs
 [A1 A2] a b 
as expected.

## Independence of Relational Operations

### September 4, 2012

Proving that standard relational algebra operations are independent is considered an exercise in standard database theory course. Here is extract from prof. Kolaitis slides:

Theorem: Each of the five basic relational algebra operations is
independent of the other four, that is, it cannot be expressed by a
relational algebra expression that involves only the other four.

Proof Sketch: (projection and cartesian product only)
 Property of projection:
 It is the only operation whose output may have arity smaller than its input.
 Show, by induction, that the output of every relational algebra expression
in the other four basic relational algebra is of arity at least as big as the
maximum arity of its arguments.
 Property of cartesian product:
 It is the only operation whose output has arity bigger than its input.
 Show, by induction, that the output of every relational algebra expression
in the other four basic relational algebra is of arity at most as big as the
maximum arity of its arguments.
Exercise: Complete this proof.

This seems convincing until we carry over this question to relational bilattice. Then it is evident that $\{ \wedge,\vee,\lhd NOT \rhd, \lhd INV \rhd \}$ , that is natural join, generalized union, negation, and inversion are independent. Indeed, natural join and generalized union are monothonic on both attributes and tuples, so he proof sketch from relational algebra can be employed here as well. Next, negation is domain-dependent operations which can produce tuples not present in the original relation. Likewise, inversion is “attribute”-dependent operation which can add attributes not present in the original relation.

What about the second lattice structure, are those operations independent of these four? Outer union can be expressed via De Morgan’s law:
$x \lhd OR \rhd y = \lhd NOT \rhd (\lhd NOT \rhd x \wedge \lhd NOT \rhd y )$
and likewise its lattice dual:
$\lhd NOT \rhd (\lhd NOT \rhd x \vee \lhd NOT \rhd y )$

Next, there are 0-ary operations, that is constants R00 (aka TABLE_DUM), R01 (aka TABLE_DEE), R10, and R11 (not to be confused with universal relation). The later three can be expressed in therms of the first one

$R10 = \lhd INV \rhd R00$
$R01 = \lhd NOT \rhd R00$
$R11 = \lhd INV \rhd \lhd NOT \rhd R00$

so the question extends if the set $\{ \wedge,\vee,\lhd NOT \rhd, \lhd INV \rhd, R00 \}$ is independent. It is instructive to compare our conjecture with Boolean Algebra. There are many independent sets of operations in BA, the most ubiquitous being $\{ \wedge,\vee,\lhd NOT \rhd \}$. Then, boolean constants are expressible in terms of those, e.g.:
$0 = \lhd NOT \rhd x \wedge x$
Here is similar relational lattice expression
$R00 = \lhd NOT \rhd(x \vee \lhd NOT \rhd(x \vee \lhd INV \rhd x))$

To conclude, relational algebra cast in genuine algebraic shape contains a set of four independent operations.

## A long long proof

### May 7, 2012

Decomposition of a relation into join of projections serves as motivation for database normalization theory. In relational lattice terms relation x projected into sets of attributes (that is empty relations) s and t:

x = (x v s) ^ (x v t)

Lets investigate dual perspective and switch the roles of join and inner union:

x = (x ^ s) v (x ^ t)

One particular instance of this identity is known as fundamental decomposition identity

x = (x ^ R00) v (x ^ R11)

which informally asserts that a relation is an inner union of the relation header (i.e. set of attributes) and content (set of tuples). Fundamental decomposition identity can be generalized into

x = (x ^ y) v (x ^ y')

where the ' is relation complement and  is attribute inversion. Both operations are domain dependent (which might explain a reluctance of some researchers adopting them). Automated proof of generalized decomposition identity

% Language Options

op(300, postfix, "" ).

formulas(assumptions).

% Standard lattice axioms
x ^ y = y ^ x.
(x ^ y) ^ z = x ^ (y ^ z).
x ^ (x v y) = x.
x v y = y v x.
(x v y) v z = x v (y v z).
x v (x ^ y) = x.

% Litak et.al.
x ^ (y v z) = (x ^ (z v (R00 ^ y))) v (x ^ (y v (R00 ^ z))).
(x ^ y) v (x ^ z) = x ^ ( ((x v z) ^ y) v ((x v y) ^ z) ).
(R00 ^ (x ^ (y v z))) v (y ^ z) = ((R00 ^ (x ^ y)) v z) ^ ((R00 ^ (x ^ z)) v y).

% Unary postfix negation operation ' aka <NOT>
x' ^ x = x ^ R00.
x' v x = x v R11.

% Unary postfix inversion operation  aka <INV>
x ^ x = R11 ^ x.
x v x = R00 v x.

% FDI
x = (x ^ R00) v (x ^ R11).
% Distributivity of join over outer union aka <OR>
x ^ (y' ^ z')' = ((x ^ y)' ^ (x ^ z)')'.

end_of_list.

formulas(goals).
x = (x ^ y') v (x ^ y).

end_of_list.

turned out to be quite CPU consuming

% -------- Comments from original proof -------- % Proof 1 at 4894.53 (+ 50.37) seconds. % Length of proof is 867. % Level of proof is 47. % Maximum clause weight is 61. % Given clauses 10875.

## Implicit and Explicit dependencies

### February 18, 2012

Shadows of the Truth by Alexandre Borovik offers an interesting perspective upon human learning experience. The book is abundant with examples of concepts being challenging at the early age, but clarified later (sometimes, much later) in life. One of my stumbling blocks was the definition of implicit and explicit dependencies. Now, with modest help of relational thinking (that is, basic dependency theory) I can report some progress.

Intuition behind implicit and explicit dependencies is clear. For example, given the two functions

$y(x,t) = x^2 + t$
$x(t) = t^2$

then, in the formula for $y$ we notice two variables $x$ and $t$, which suggests that $y$ explicitly depends on $t$. Compare it with

$y(x,t) = x^2$
$x(t) = t^2$

where formula for $y$ involves variable $x$ only. Since, at the second line we have $x$ expressed in terms of $t$, $y$ still depends on $t$, but the dependence is implicit.

The concept of implicit and explicit dependencies surfaces in many places, for example Partial Derivative Equations and Noether conservation theorems, which both are part of undergraduate math and physics curriculum. Nevertheless, most textbooks take this concept for granted, perhaps implying that mathematically mature reader should have no problems understanding it. Wikipedia offers couple dedicated articles: Dependent and Independent Variables giving overview and intuition, and more ambitious Time-Invariant System with an attempt to formal definition.

The concept of time-invariant system belongs to physics, the idea being that if we shift the time variable $t$, then it doesn’t affect time-invariant system behavior. This is illustrated by “formal example” in the middle of the page, where by comparing values of $y(x,t)$ with two arguments $t$ vs. $t+\delta$ they suggest that $y(x,t) = t x(t)$ is time-dependent, while $y(x,t) = 10 x(t)$ is not. Compared to math, physics standards of rigor are lax, so it takes little effort to find a flaw. Take $x(t) = t$, then $y(x,t) = t x(t) = {x(t)}^2$ so $y(x,t)$ is time-invariant with proper choice of $x(t)$!

Can relational dependency theory suggest any insight? By glancing over standard definitions of functional dependency:

$\forall y_{1} \forall y_{2} \forall x : S(x, y_{1}) \wedge S(x, y_{2}) \implies y_{1} = y_{2}$

and independence:

$\forall y_{1} \forall y_{2} \forall x_{1} \forall x_{2} : S(x_{1}, y_{1}) \wedge S(x_{2}, y_{2}) \implies y_{1} = y_{2}$

it becomes obvious that dependency concepts hinge upon equality/inequality (amended with some quantifiers, perhaps), and not upon domain algebraic structure (time-shifts). Let’s examine closely two examples:

$y(x,t) = t-x$
$x(t) = t^2-2 t$

vs.

$y(x,t) = x^2-3 x$
$x(t) = t^2-2 t$

Tabulating values at points $t=0,1,2,3$ we’ll get relations

R=[t  x  y]
0  0  0
1  -1  2
2  0  2
3  3  0
;


and

S=[t  x  y]
0  0  0
1  -1  4
2  0  0
3  3  0
;


correspondingly. The second relation S honors FDs t->x and x->y (and by Armstrong transitivity t->y), while the first one R does only t->x and t->y. Therefore, the formal definition of variable y being not [explicitly] dependent of t is equivalent to the absence of functional dependency x->y — if not counter intuitive, then terminologically confusing to say the least!

## Caclulating Entropy and Gini Index for a partitioned table

### January 26, 2012

20th century has been highlighted by the two notable insights onto the nature of information. In 1948, Claude Shannon published classic paper “A Mathematical Theory of Communication” suggesting quantitative information measure — entropy — as average number of bits needed to store or communicate one symbol in a message. The paper has tremendous impact: it launched the whole new discipline — the Information Theory. Two decades later Edgar Codd invented the Relational Model, which equipped previously vague concept of “information pieces” with formal structure.

Since then there was at least one direct attempt to marry the two perspectives when Jürg Kohlas proposed Information Algebra. Other, less ambitious connections between the two disciplines were discovered within relational database dependency theory. Squeezed into one sentence this approach introduces quantitative measure of attribute dependency and proceeds expressing known dependencies (functional, multivalued, etc) in their terms. This blog posting pursues even more modest goal: first, describing a well known characterization of functional dependencies in terms of lattice partitions, then calculating partition lattice entropy and Gini index and, finally, proving that all three orders are consistent.

Classes=[Prof    Course  Time]
Libkin  DB101   Tue200
Libkin  DB101   Thu500
Gromov  Math    Tue200
Gromov  Math    Thu500
Vianu   DB101   Tue200
;


A binary partition of relation attributes, say $\{Prof\}|\{Course,Time\}$ partitions the Classes tuples into 3 sets:

QBQL Relational programming system calculates partitions via binary operation; in our running example

Classes#[Prof];


which outputs

Classes#[Prof] = <Libkin,DB101,Tue200> <Libkin,DB101,Thu500> |
<Gromov,Math,Tue200> <Gromov,Math,Thu500> |
<Vianu,DB101,Tue200>;


Partitions can be compared pairwise: informally, partition $x$ is greater than partition $y$ if it is more coarse. Our next step is proving that this ordering is consistent with two numerical partition measures.

Partition entropy is defined via Shannon information measure formula

$\displaystyle\sum\limits_{i=0}^n p_i\ln(p_i)$

where summation ranges over partitions and $p_i$ is probability of selecting a tuple from partition $i$. Let’s calculate entropy of the Classes#[Prof] partition in our example. Again, we automate this task with QBQL, and since later on we would like to calculate entropy of other partitions, we’ll provide generic definition. Let’s emphasize that this theoretically humble idea is one of the major practical benefits of QBQL over SQL and Datalog. First, we define 3 empty relations

CntRelHdr = [cnt];
CardRelHdr = [card];
plpHdr = [plp];


This is just implementation artifact/bug as QBQL should be able to inline these relvars into any query. Then we progresively define partition probabilities <Prob> and entropy <Entropy> as binary operations:

x <Prob> y = ((x |v| (y ^ CntRelHdr)) ^ (x |v| CardRelHdr)) /^ "cnt / card = prob".
x <Entropy> y = ((((x <Prob> y) ^ "ln(prob)=lp") /^ "prob*lp=plp") v plpHdr) /= "result += plp".


The definitions are generic, because you can plug in any two relations in place of x and y. For example, if you are interested in calculating the entropy of Classes#[Prof], you just type:

Classes <Entropy> [Prof];


Let’s walk trough <Prob> implementation. The expression y ^ CntRelHdr is a join of relation y with previously defined empty relation CntRelHdr, and since their attribute sets are disjoint it is a Cartesian product. In other words, we just want a set of attributes from relation y amended by one more attribute cnt. At the next step — x |v| (y ^ CntRelHdr) — we see something that appears as unfamiliar binary infix operator |v| applied to our previous result and relation x. The v symbols is reminiscent of inner union (which is generalization of union and projection) amended with the vertical bars, which in standard math notation usually denotes cardinality. It is SQL group by with counting a counterpart in QBQL notation; the grouping is done over the set of common arguments of the two relations x and y ^ CntRelHdr, while the count function column header is set of attributes of y ^ CntRelHdr which is not in x. In a typical application, such as Classes <Entropy> [Prof], the set of attributes of x is subset of that of y. Therefore, the name of the counting attribute is cnt. At the other branch, of the expression tree we compute x |v| CardRelHdr. Here relation attributes are disjoint, therefore this is just counting with no grouping (which is equivalent to counting with grouping over empty set). The two previous results (x |v| (y ^ CntRelHdr) and x |v| CardRelHdr again have disjoint set of attributes, so joining them is essentially a cartesian product. The final operation — relational composition with the "cnt / card = prob" (which is user-defined ternary predicate Times in disguise) — calculates probabilities for each table partition. Again, let’s not be distracted by the fact that this query is elementary in SQL:

select Prof, count(*)/card prob
from Classes, (select count(*) card from Classes)
group by Prof


First, it is not generic (as it should be rewritten for every other target relation and different subsets of its attributes); second, it uses ad-hock tuple-level expression evaluation (division operator). In our example, after we have defined generic binary entropy operation, then calculating entropy values for all possible splits of relation attributes is as succinct as it can possibly be:

Classes <Entropy> [Prof];
Classes <Entropy> [Course];
Classes <Entropy> [Time];
Classes <Entropy> [Prof Course];
Classes <Entropy> [Course Time];
Classes <Entropy> [Prof Time];
Classes <Entropy> [Prof Course Time];
`

The entropy values numerical order is consistent with partitioning order and this is not a coincidence. Consider a single partition $\{a_1,a_2,...,a_n\}$ split into the two $\{a_1,a_2,...,a_i\} | \{a_i,...,a_n\}$ and assume that both were the part of bigger partition of the set of $N$ elements. Then, the entropy changes from

$\frac{n}{N} \ln(\frac{n}{N}) +$ entropy of other pieces

to

$\frac{i}{N} \ln(\frac{i}{N}) + \frac{n-i}{N} \ln(\frac{n-i}{N}) +$ the same entropy of the rest of the pieces

Since the first term of the former sum can be rewritten as

$\frac{i}{N} \ln(\frac{i+n-i}{N}) + \frac{n-i}{N} \ln(\frac{n-i+i}{N})$

it is greater than the later. Therefore, we have established that entropy is monotonically increasing along with partition granularity. Shannon Entropy function is not unique in that respect. The similar proposition is valid for Gini Index:

$\displaystyle\sum\limits_{i=0}^n p_i(1-p_i)$