## Distributivity of union over join

### August 19, 2014

Recently, Litak & Mikulas & Hidders published the extended version of their earlier work. However, in a typical mathematical tradition following the famous saying “when fine building is revealed to the public eye, the scaffolding should be removed”, reading it requires some detective work. In particular, lets investigate the motivation behind their axiom system:

x ^ (y v z) = (x ^ (z v (R00 ^ y))) v (x ^ (y v (R00 ^ z))). % AxRH1

```
```(R00 ^ (x ^ (y v z))) v (y ^ z) =

= ((R00 ^ (x ^ y)) v z) ^ ((R00 ^ (x ^ z)) v y). % AxRH2

`(x ^ y) v (x ^ z) = x ^ ( ((x v z) ^ y) v ((x v y) ^ z) ). % AxRL1`

Let’s start with well known lattice inequality

x v (y ^ z) > (x v y) ^ (x v z).

Here we abused notation a little and assume the order symbol > to be reflexive relation. If the LHS of inequality is greater (or equal) than the RHS, then perhaps one can hope to substitute `y`

and `z`

with greater values at RHS, or with lower values at LHS so that inequality becomes an identity? This is the technique employed by Sergey Polin when refuting McKenzie’s conjecture. Specifically, he introduced a non-decreasing chain of values

y1 = y

z1 = z

```
```y2 = y v (x ^ z1).

z2 = z v (x ^ y1).

`y3 = y v (x ^ z2).`

z3 = z v (x ^ y2).

...

together with weakened distributivity law

x v (y ^ z) = (x v yN) ^ (x v zN).

Unfortunately, literal Polin-style weakened distributivity law fails in relational lattices for n=1,…,6, at least. However, Polin’s technique can be naturally generalized when an algebra offers constant(s). The axiom `AxRH1`

is weakened distributivity law with `y`

elevated to `y v (R00 ^ z)`

and `z`

elevated to `y v (R00 ^ z)`

.

In a similar venue, the axiom `AxRL1`

is Polin-style weakening applied to LHS of distributivity law. (Litak et.al. attribute the axiom to Padmanabhan work.)

Genesis of `AxRH2`

is probably various earlier discovered conditional distributivity identities involving the header relation `R00`

. For professional logician converting implication into equality is a walk in the park.

How about the dual distributivity law of union over join? The stricter conditional distributivity of union over join established in an earlier paper hints that the axiom might be more complicated. One might have to lower/elevate both sides of inequality. After some trial permutations of variables and the header constant `R00`

, it revealed the following identity:

(x v (y ^ (x v z))) ^ (x v (z ^ (x v y))) = x v ((y v (x ^ R00)) ^ (z v (x ^ R00))).

This can be translated into relational calculus expression

(( exists p1 exists p2 exists p6(x(p1,p3,p5,p7) | (y(p2,p3,p6,p7)& exists p1 exists p3 exists p4 exists p6(x(p1,p3,p5,p7) | z(p4,p5,p6,p7))))& exists p1 exists p4 exists p6(x(p1,p3,p5,p7) | (z(p4,p5,p6,p7)& exists p1 exists p5 exists p2 exists p6(x(p1,p3,p5,p7) | y(p2,p3,p6,p7)))))

<->

exists p1(x(p1,p3,p5,p7) | ( exists p2 exists p6 exists p1 exists p5(y(p2,p3,p6,p7) | (x(p1,p3,p5,p7)&$F))& exists p4 exists p6 exists p1 exists p3(z(p4,p5,p6,p7) | (x(p1,p3,p5,p7)&$F))))).

which validity is easily verified with Prover9.

This axiom doesn’t seem to be independent from the `{AxRH1,AxRH2,AxRL1}`

with Mace4 models lookup up to cardinality 26 at least, although Prover9 fails to confirm it after 3 hours running time. (This is not entirely surprising, with union over join distributivity law being often enforced by join over union).

## Automatic database links in SQLDeveloper

### April 7, 2014

People invent really complicated ways to achieve simple tasks such as copying a table from one database to another. With database link (which is arguably RDBMS’s the most under appreciated feature) it is just one SQL command:

create table emp as select * from scott.emp@dblink1;

But who has time to create database links? In SQLDeveloper 4.0 you don’t have to. Assuming that you have the right connection the script

set autodblink on create table emp as select * from scott.emp@connection1; -- assuming there is connection "connection1"

would create temporary db link, then run the statement and, finally, drop the link.

Another example, which benefits from this functionality is comparing a table in two different databases:

set autodblink on (select * from employees@connection1 minus select * from employees) union (select * from employees minus select * from employees@connection1);

Watch out for pitfalls: if connection info doesn’t store a password, you won’t be asked for it — the fix is scheduled for 4.0.2 patch.

Relational Lattice is simplified “attribute-free” Relational Algebra. It allows to axiomatize Relational Algebra in the same spirit as Relation Algebra. Here is axiom system by Litak&Hidders&Mikulas:

`x ^ y = y ^ x.`

(x ^ y) ^ z = x ^ (y ^ z).

x ^ (x v y) = x.

x v y = y v x.

(x v y) v z = x v (y v z).

x v (x ^ y) = x.

```
```

`x ^ (y v z) = (x ^ (z v (R00 ^ y))) v (x ^ (y v (R00 ^ z))).`

(x ^ y) v (x ^ z) = x ^ ( ((x v z) ^ y) v ((x v y) ^ z) ).

(R00 ^ (x ^ (y v z))) v (y ^ z) = ((R00 ^ (x ^ y)) v z) ^ ((R00 ^ (x ^ z)) v y).

The authors hint the proof of each individual axiom “by direct calculation”. Let’s prove them by converting each equality into equivalent Predicate Calculus formula.

The first snag is that Relational Lattice expressions don’t refer to [relation] attributes. We have to equip each individual relation variable with attributes. This is inherently ambiguous step, although one observation might help.

Consider an expression of two variables `x`

and `y`

, e.g. `x ^ (x v y)`

. There might be more than one common attribute for `x`

and `y`

, but when calculating a result of either operation *join* `^`

or *meld* `v`

, two common attributes match the same way as a single *composite attribute*. For example, for all practical purposes joining by two conditions matching first and last name is the same as matching by their concatenation — full name. Therefore, considering expressions with two variables we equip `x`

with two attributes: `p1`

, which is in `x`

but not in `y`

, together with `p2`

, which is both in `x`

and `y`

. Likewise, we equip `y`

with two attributes: `p2`

, which is both in `x`

and `y`

, together with `p3`

, which is in `y`

but not in `x`

. This canonical attribute introduction generalizes to any number of relvars.

Now that all relvars are equipped with attributes we can translate complex Relational Lattice expressions into Predicate Calculus as well. Join operation is straightforward: it is Boolean conjunction. The set of free variables in the resulting formula is a union of free variables of the operands. Meld operation is Boolean disjunction, however we must add existential quantification over arguments which are in the first operand but are absent from the second one, and arguments which are in the second operand but not the first one (symmetric difference). For example,

`x v y `

translates to

`exists p1 exists p3 x(p1,p2) | y(p2,p3)`

Translation procedure is entirely mechanical and begs for automation. In QBQL you just write down an axiom into the input file `qbql/lang/parse/RL.axiom`

, then run `qbql.lang.parse.Translate.main()`

and, finally, witness the translated expression in the output. The result then can be fed directly into Prover9.

Let’s try this algorithm for several axioms. The absorption law

`x v (x ^ y) = x.`

translates into predicate equivalence

`exists p2(x(p1,p3) | (x(p1,p3)&y(p2,p3)))`

<->

x(p1,p3).

The Prover9 proof of the later is remarkable:

`1 (exists p2 (x(p1,p3) | x(p1,p3) & y(p2,p3))) x(p1,p3) # label(non_clause) # label(goal). [goal].`

2 $F. [deny(1)].

It is certainly shorter than the proof in one of the earlier paper, and warrants an explanation, to say the least. Prover9 succinct output is a result of several steps done internally, such as Scolemization, rewriting to Conjunctive Normal Form, etc. In this case, Scolemization followed by CNF transformation produced the false value, thus proving the theorem.

Let’s investigate more relational operations. Relational complement translates into Boolean negation, while inversion is a little bit more sophisticated. Here is a theorem, which can be considered definition for `R00`

(better known in D&D community as `DUM`

):

`(y' v y`)'=R00.`

It translates into

`-( exists p1(-(y(p1)) | ( exists p1 y(p1))))`

<->

$F.

where one don’t need help of any automatic proof system to be convinced of its validity. There we see yet again that `R00`

is analog of nullary predicate “false”.

Now that we established conversion of Relational Lattice expressions into Predicate Calculus it is naturally to ask if we can translate in the opposite direction. Consider arbitrary calculus formula and let’s drill down into its structure. There are several possibilities of what syntactic constructions we encounter on each level.

If we come across existential quantifiers, then we introduce projection, i.e. meld with an empty relation. Universal quantifier is dual to existential via De Morgan’s formula. There we have to introduce negation, which is domain dependent. If we encounter Boolean conjunction, then we introduce [natural] join (leaving all free variables unaffected). The only step which is not straightforward is handling disjunction. We translate it into the *outer union* AKA `<OR>`

. Outer union is domain-dependent operation, but as you see ignoring domain dependence rewarded us with very succinct proof of Codd’s theorem. Free variables flow unobstructed through outer union, similar to natural join; giving another reason why this transformation step was so easy.

The last step is rewriting outer union in terms of standard lattice operations using one of the following equivalent definitions:

`x <OR> y = (x ^ R11) v (y ^ (x v R11)).`

x <OR> y = (R11 ^ y) v (x ^ y) v (x ^ R11).

x <OR> y = (x ^ (y v R11)) v (y ^ (x v R11)).

x <OR> y = (x' ^ y')'.

x <OR> y = ((R00 ^ x) ^ y) v (R11 ^ y) v (x ^ R11).

## Inclusion Dependencies in SQL

### March 5, 2014

Constraints have always been under appreciated in database practice. Today we’ll focus on *inclusion dependencies*, which in SQL community are better known as *foreign keys*. Let’s start with an example borrowed from David Spivak’s work on categorical foundation of databases — the familiar `Employees`

table

Employees = [empno mgr dept] 1 3 10 2 2 20 3 4 10 4 4 10 5 4 10

together with constraint

*“For every employee e, the manager of e works in the same department
that e works in.”*

One may argue that this is not very realistic constraint. If an employee and his manager have to be in the same department, then an employee and his manager’s manager have to be in the same department as well. Then, by transitivity it follows that the entire hierarchy tree of employees must be in the same department. If hierarchy graph is disjoint (i.e. a forest of trees), then the department number is just a label distinguishing a connected component of hierarchy.

Yet, a business rule is just that: an [arbitrary] rule, which is silly to argue with. Let see if we can formally enforce it. For each pair `(mgr,dept)`

there should exist a pair `(emp,dept)`

. That sounds like inclusion dependency, so one may be tempted to enforce it as a foreign key:

CREATE TABLE employees ( empno INT, mgr INT, dept INT, PRIMARY KEY (empno,dept) , CONSTRAINT fk FOREIGN KEY fk (mgr,dept) REFERENCES employees (empno,dept) ON DELETE NO ACTION ON UPDATE NO ACTION );

So far, so good. However, the initial enthusiasm of discovering foreign keys referencing the same table hits the first snag:

insert into emps (empno,mgr,dpt) values(3,4,10);

`SQL Error: Cannot add or update a child row: a foreign key constraint fails (`test`.`employees`, CONSTRAINT `fk` FOREIGN KEY (`mgr`, `dept`) REFERENCES `employees` (`empno`, `dept`) ON DELETE NO ACTION ON UPDATE NO ACTION`

)

A bug? Nope. Apparently, the constraint doesn’t allow to insert an employee before manager. The correct table update sequence should be like this:

insert into emps (empno,mgr,dpt) values(4,4,10); insert into emps (empno,mgr,dpt) values(3,4,10);

## Union via Join

### February 12, 2014

Here is really bizarre technique: one can perform union via join! Consider:

`(([p] 0 <OR> [q] a b) <AND> ([p] 1 <OR> [q] c)) v [q];`

which outputs

`[q]`

a

b

c

(We assumed domain `p`

to include `{0,1}`

and `q`

to include `{a,b,c}`

, at least). You may object that the expression involves [domain dependent] union operation `<OR>`

, but it is applied for two relations with disjoint attributes, so it is nothing like classic union.

This observation leads to more general assertion:

`(z <OR> x) ^ (<NOT>z <OR> y) = (<NOT>z ^ x) <OR> (z ^ y).`

which by leveraging relational lattice De Morgan’s law can be rewritten as

`(x' ^ z')' ^ (y' ^ z)' = ((z' ^ x)' ^ (z ^ y)')'.`

Here we have switched notation a little from D&D to Prover9Mace4, otherwise, it is just too many negation operators to struggle with. Loading this assertion into automatic prover demonstrates that it is not trivial. It doesn’t follow from semi-lattice axioms amended with distributivity of `<OR>`

over join:

`x ^ x = x.`

x ^ y = y ^ x.

(x ^ y) ^ z = x ^ (y ^ z).

x ^ (y' ^ z')' = ((x ^ y)' ^ (x ^ z)')'.

## Analytic View of Functional Dependency

### January 3, 2014

A recurring theme of this blog is that there is more to a relation than being merely a set of tuples. First, a relation is an object which obeys certain rules. The structure of this object is less important as long as it satisfies relational algebra axioms. Second, in previous post we have learned that relations are systems of constraints. The idea that a relation can be viewed as a set of constraints has been promoted in the 90s by Paris Kanellakis. Our intuition, however, is that by narrowing scope to equalities we can leverage powerful machinery of algebraic geometry.

Let’s examine the example that we ended up last time. The relation

[x y z] 1 1 1 2 1 1 3 2 1 3 2 2

corresponds to the variety

which is convenient to rewrite with some equations factored

The first and the last equations are univariate. They assert that values x belong to the set {1,2,3}, and values z are restricted to the set {1,2}. The third equation asserts that either x=3 or z=1. What is the second constraint?

Before answering that question, though, let’s comment that these equation are akin to “columnar” constraints. This is refreshing but somewhat perverse perspective to what “column-store” database formally is.

Returning back to the second equation we see that y is limited in power to 1. Therefore, it can be rewritten as

In other words, y is a function of x. This is just an example of general phenomenon: given a functional dependency `x->y`

, one can always apply polynomial interpolation, thus explicitly expressing y as a function of x. That is right, in algebraic geometry view functional dependency holds whenever we can exhibit an analytic function.

Next, Heath’s theorem declares that given a relation Q such that `Q ^ [] = [x y z]`

, and functional dependency `x->y`

, then Q can be decomposed into join of projections:

`Q = (Q v [x y]) ^ (Q v [x z])`

What is Heath’s theorem in algebraic geometry terms?

First, let’s focus on join of two relations. If both relations have the same set attributes, then join is set intersection of tuples. Relation corresponds to finite variety, and we already know how to perform intersection of varieties: just combine both sets of defining equations. If two varieties have different sets of attributes, then we can expand them into larger space spanning the common set of attributes. This procedure doesn’t affect the set of defining equations. For example, if we consider a variety defined by a single equation

in , then it is also a variety in — there is simply no constraints onto the other variable. Likewise, the variety defined by the system

is actually two varieties (at least): one defined in space of variables {x,y}, and the other in space of variables {x,y,z}. In a similar venue, the system

defines one variety in {x,z}, and the other in {x,y,z}.

What about going the other way: can we take a variety in space of the two variables x and y and project it into x?

Unfortunately, in general, a projection of a variety is not a variety, as witnessed by the following counterexample:

Elimination of variables is the proper way to project a variety. Fortunately, in our example, variety in space

is projection of variety defined by the same equations in space : the variable z being not present in the system makes it elimination trivial.

Let’s summarize what we have achieved so far. We have split 4 equations defining a variety in into two parts:

and

We can view both systems either in the same space , or project the first one to {x,y}, and the second one into {x,z}. If we look onto this pair of varieties in , then the result is geometrically intersection of varieties. If we consider them as residing in different spaces (i.e. {x,y} and {x,z}), then it is a new operation: [natural] join of varieties. These “projected” varieties correspond to relations

[x y] 1 1 2 1 3 2

[x z] 1 1 2 1 3 1 3 2

The first one is immediate, because the roots of the first cubic univariate equation in x are the domain of three values {1,2,3}. The second equation is a function unambiguously defining each (x,y) value pair.

For the second system its correspondence to the relation is little less obvious. A brute force method to check it is Wolfram Alpha/Mathematica command

`Solve[(x - 3)(x - 2)(x - 1) == 0 && (x - 3)(z - 1) == 0 && (z - 2)(z - 1) == 0, {x, y, z}]`

Checking it manually is not hard either. The last constraint assets that z is either 1 or 2. If it is 1, then both the second and the third constraint are satisfied, so x can be either of the three values {1,2,3} from the first constraint. If z = 2, then x has to be 3 in order to satisfy the second constraint.

To summarize, the Heath’s theorem in algebraic geometry setting is obvious. Most of the effort here has been spent establishing correspondence between join of relations and intersection/join of varieties. This is typical: learning curve for a new language is steep, but (hopefully) would get amortized over the benefits.

There is one last detail, that we have missed. In general case, what if we calculated Groebner basis, but still have the two constraints: one in all three variables {x,y,z}, and the other one in {y,z} — sure we can’t decompose the system anymore? Well, the functional dependency would still be there; then we express y as an explicit function of x, and eliminate it in both equations by substitution.

## Relations as Finite Varieties

### January 2, 2014

It is common wisdom that database field is firmly grounded in the two math disciplines: predicate logic and set theory. However, neither logic nor set theory are dominant math subjects. Just counting tags at mathoverflow.net can give rough idea of the importance of an individual math topic, and it is evident that *algebraic geometry* governs the math world. This field is so rich that it spawned new sub-fields, such as category theory. Furthermore, *mathoverflow.net* hints that category theory is more popular than either predicate logic or set theory! Some category theorists even proposed to rebuild the entire math foundation with category theory toolkit…

Does algebraic geometry offer any insights to database theory? First, we recall what basic mathematical objects of algebraic geometry are. The basic geometric object is *affine variety* — a system of polynomial equations. Affine variety is a set of tuples in , and our focus on database applications prompts that we must narrow our scope to finite varieties, when polynomial system has finite number of roots.

OK, this abstract motivation is fine, but given a [database] relation, how do we construct a variety out of it? This is accomplished after we learn the two fundamental operations over varieties.

1. Set intersection of two varieties and is a set which is, again, a variety. The defining set of polynomials for is a union of the polynomial sets defining the and .

2. Set union of two varieties and is a set which is, again, a variety. The defining set of polynomials for is a set of all pairwise products of the polynomials from and .

Union and intersection of varieties is exactly what is needed for being able to exhibit a variety corresponding to a relation. Consider an unary relations with a single tuple

[x] 1

Geometrically, it corresponds to a single point on the x axis, so it becomes immediately obvious what equation defines it:

Likewise, the relation

[y] 1

is defines by

Next, we construct join of the two relations

[x y] 1 1

Join is set intersection, and intersection of varieties gives us the system of defining equations

Let’s expand our example and add one more attribute:

[x y z] 1 1 1

Our system of equations grows with one more constraint:

Now, knowing how to construct “single tuple varieties”, we are ready to move onto relations with more than one tuple. This is accomplished via union. Consider a relation

[x y z] 1 1 1 2 1 1

which is a union of already familiar relation

[x y z] 1 1 1

with

[x y z] 2 1 1

To build the union of varieties we need a polynomial system defining the second variety

The polynomial system for the union is

Wait a minute, 9(!) equations for a relation with just couple of tuples — this doesn’t look promising… The critical observation is that not all of these equations are independent. A well known method to greatly reduce polynomial system is Groebner basis. Executing the following command at Wolfram Alpha:

`GroebnerBasis[{(x-1)(x-2), (y-1)(x-2), (z-1)(x-2),`

(x-1)(y-1), (y-1)(y-1), (z-1)(y-1),

(x-1)(z-1), (y-1)(z-1), (z-1)(z-1)} ,{z, y, x}]

we obtain the system

As an afterthought, this result is obvious. The first equation constraints `x`

to being either 1 or 2, the second equation asserts that `y`

is equal to 1, while the third one asserts `z=1`

. It is good to learn the general method, though.

With this technique we can proceed and find a variety corresponding to the relation

[x y z] 1 1 1 2 1 1 3 2 1 3 2 2

An inquisitive reader might have already noticed functional dependency `x -> y`

lurking in plain view. In the next installment we’ll investigate what functional dependency is from algebraic geometry perspective.