## Inclusion Dependencies in SQL

### March 5, 2014

Constraints have always been under appreciated in database practice. Today we’ll focus on inclusion dependencies, which in SQL community are better known as foreign keys. Let’s start with an example borrowed from David Spivak’s work on categorical foundation of databases — the familiar Employees table

Employees = [empno mgr dept]
1    3   10
2    2   20
3    4   10
4    4   10
5    4   10


together with constraint

“For every employee e, the manager of e works in the same department
that e works in.”

One may argue that this is not very realistic constraint. If an employee and his manager have to be in the same department, then an employee and his manager’s manager have to be in the same department as well. Then, by transitivity it follows that the entire hierarchy tree of employees must be in the same department. If hierarchy graph is disjoint (i.e. a forest of trees), then the department number is just a label distinguishing a connected component of hierarchy.

Yet, a business rule is just that: an [arbitrary] rule, which is silly to argue with. Let see if we can formally enforce it. For each pair (mgr,dept) there should exist a pair (emp,dept). That sounds like inclusion dependency, so one may be tempted to enforce it as a foreign key:

CREATE TABLE employees (
empno INT,
mgr INT,
dept INT,
PRIMARY KEY (empno,dept) ,
CONSTRAINT fk FOREIGN KEY fk (mgr,dept)
REFERENCES employees (empno,dept)
ON DELETE NO ACTION
ON UPDATE NO ACTION
);


So far, so good. However, the initial enthusiasm of discovering foreign keys referencing the same table hits the first snag:

insert into emps (empno,mgr,dpt) values(3,4,10);


SQL Error: Cannot add or update a child row: a foreign key constraint fails (test.employees, CONSTRAINT fk FOREIGN KEY (mgr, dept) REFERENCES employees (empno, dept) ON DELETE NO ACTION ON UPDATE NO ACTION)

A bug? Nope. Apparently, the constraint doesn’t allow to insert an employee before manager. The correct table update sequence should be like this:

insert into emps (empno,mgr,dpt) values(4,4,10);
insert into emps (empno,mgr,dpt) values(3,4,10);


## Union via Join

### February 12, 2014

Here is really bizarre technique: one can perform union via join! Consider:

(([p] 0 <OR> [q] a b) <AND> ([p] 1 <OR> [q] c)) v [q];

which outputs

[q] a b c

(We assumed domain p to include {0,1} and q to include {a,b,c}, at least). You may object that the expression involves [domain dependent] union operation <OR>, but it is applied for two relations with disjoint attributes, so it is nothing like classic union.

This observation leads to more general assertion:
(z <OR> x) ^ (<NOT>z <OR> y) = (<NOT>z ^ x) <OR> (z ^ y).
which by leveraging relational lattice De Morgan’s law can be rewritten as

(x' ^ z')' ^ (y' ^ z)' = ((z' ^ x)' ^ (z ^ y)')'.

Here we have switched notation a little from D&D to Prover9Mace4, otherwise, it is just too many negation operators to struggle with. Loading this assertion into automatic prover demonstrates that it is not trivial. It doesn’t follow from semi-lattice axioms amended with distributivity of <OR> over join:

x ^ x = x. x ^ y = y ^ x. (x ^ y) ^ z = x ^ (y ^ z). x ^ (y' ^ z')' = ((x ^ y)' ^ (x ^ z)')'. 

## Analytic View of Functional Dependency

### January 3, 2014

A recurring theme of this blog is that there is more to a relation than being merely a set of tuples. First, a relation is an object which obeys certain rules. The structure of this object is less important as long as it satisfies relational algebra axioms. Second, in previous post we have learned that relations are systems of constraints. The idea that a relation can be viewed as a set of constraints has been promoted in the 90s by Paris Kanellakis. Our intuition, however, is that by narrowing scope to equalities we can leverage powerful machinery of algebraic geometry.

Let’s examine the example that we ended up last time. The relation

[x y z]
1 1 1
2 1 1
3 2 1
3 2 2


corresponds to the variety

$-6 + 11 x - 6 x^2 + x^3 = 0$
$-4 + 3 x - x^2 + 2 y = 0$
$3 - x - 3 z + x z = 0$
$2 - 3 z + z^2 = 0$

which is convenient to rewrite with some equations factored

$(x-3)(x-2)(x-1) = 0$
$-4 + 3 x - x^2 + 2 y = 0$
$(x-3)(z-1) = 0$
$(z-2)(z-1) = 0$

The first and the last equations are univariate. They assert that values x belong to the set {1,2,3}, and values z are restricted to the set {1,2}. The third equation asserts that either x=3 or z=1. What is the second constraint?

Before answering that question, though, let’s comment that these equation are akin to “columnar” constraints. This is refreshing but somewhat perverse perspective to what “column-store” database formally is.

Returning back to the second equation we see that y is limited in power to 1. Therefore, it can be rewritten as

$y = (-4 + 3 x - x^2)/2$

In other words, y is a function of x. This is just an example of general phenomenon: given a functional dependency x->y, one can always apply polynomial interpolation, thus explicitly expressing y as a function of x. That is right, in algebraic geometry view functional dependency holds whenever we can exhibit an analytic function.

Next, Heath’s theorem declares that given a relation Q such that Q ^ [] = [x y z], and functional dependency x->y, then Q can be decomposed into join of projections:

Q = (Q v [x y]) ^ (Q v [x z])

What is Heath’s theorem in algebraic geometry terms?

First, let’s focus on join of two relations. If both relations have the same set attributes, then join is set intersection of tuples. Relation corresponds to finite variety, and we already know how to perform intersection of varieties: just combine both sets of defining equations. If two varieties have different sets of attributes, then we can expand them into larger space spanning the common set of attributes. This procedure doesn’t affect the set of defining equations. For example, if we consider a variety defined by a single equation

$(x-3)(x-2)(x-1) = 0$

in $R^1$, then it is also a variety in $R^2$ — there is simply no constraints onto the other variable. Likewise, the variety defined by the system

$(x-3)(x-2)(x-1) = 0$
$-4 + 3 x - x^2 + 2 y = 0$

is actually two varieties (at least): one defined in space $R^2$ of variables {x,y}, and the other in space $R^3$ of variables {x,y,z}. In a similar venue, the system

$(x-3)(x-2)(x-1) = 0$
$(x-3)(z-1) = 0$
$(z-2)(z-1) = 0$

defines one variety in {x,z}, and the other in {x,y,z}.

What about going the other way: can we take a variety in space of the two variables x and y and project it into x?

Unfortunately, in general, a projection of a variety is not a variety, as witnessed by the following counterexample:

$x^2 + y^2 - 1 = 0$

Elimination of variables is the proper way to project a variety. Fortunately, in our example, variety in space $R^2$

$(x-3)(x-2)(x-1) = 0$
$-4 + 3 x - x^2 + 2 y = 0$

is projection of variety defined by the same equations in space $R^3$ : the variable z being not present in the system makes it elimination trivial.

Let’s summarize what we have achieved so far. We have split 4 equations defining a variety in $R^3$ into two parts:

$(x-3)(x-2)(x-1) = 0$
$-4 + 3 x - x^2 + 2 y = 0$

and

$(x-3)(x-2)(x-1) = 0$
$(x-3)(z-1) = 0$
$(z-2)(z-1) = 0$

We can view both systems either in the same space $R^3$, or project the first one to {x,y}, and the second one into {x,z}. If we look onto this pair of varieties in $R^3$, then the result is geometrically intersection of varieties. If we consider them as residing in different $R^2$ spaces (i.e. {x,y} and {x,z}), then it is a new operation: [natural] join of varieties. These “projected” varieties correspond to relations

[x y]
1 1
2 1
3 2

[x z]
1 1
2 1
3 1
3 2


The first one is immediate, because the roots of the first cubic univariate equation in x are the domain of three values {1,2,3}. The second equation is a function unambiguously defining each (x,y) value pair.

For the second system its correspondence to the relation is little less obvious. A brute force method to check it is Wolfram Alpha/Mathematica command

Solve[(x - 3)(x - 2)(x - 1) == 0 && (x - 3)(z - 1) == 0 && (z - 2)(z - 1) == 0, {x, y, z}]

Checking it manually is not hard either. The last constraint assets that z is either 1 or 2. If it is 1, then both the second and the third constraint are satisfied, so x can be either of the three values {1,2,3} from the first constraint. If z = 2, then x has to be 3 in order to satisfy the second constraint.

There is one last detail, that we have missed. In general case, what if we calculated Groebner basis, but still have the two constraints: one in all three variables {x,y,z}, and the other one in {y,z} — sure we can’t decompose the system anymore? Well, the functional dependency would still be there; then we express y as an explicit function of x, and eliminate it in both equations by substitution.

To summarize, the interpretation of Heath’s theorem in algebraic geometry setting is quite intuitive. Most of the effort here has been spent establishing correspondence between join of relations and intersection/join of varieties. Heath’s theorem itself can be eloquently described in few steps:

• Write down the system of constraints defining the relation
• Partition all constraints into 3 groups:
1.  Unary [polynomial] constraint defining the values of functional dependency’s domain. In our example its $(x-3)(x-2)(x-1) = 0$
2.  Functional dependency itself, e.g. $y = (-4 + 3 x - x^2)/2$
3.  The rest of the system
• Eliminate the dependent variable in 3 by substitution.
• We have required decomposition of the constraint system into {1,2} and {1,3}.

## Relations as Finite Varieties

### January 2, 2014

It is common wisdom that database field is firmly grounded in the two math disciplines: predicate logic and set theory. However, neither logic nor set theory are dominant math subjects. Just counting tags at mathoverflow.net can give rough idea of the importance of an individual math topic, and it is evident that algebraic geometry governs the math world. This field is so rich that it spawned new sub-fields, such as category theory. Furthermore, mathoverflow.net hints that category theory is more popular than either predicate logic or set theory! Some category theorists even proposed to rebuild the entire math foundation with category theory toolkit…

Does algebraic geometry offer any insights to database theory? First, we recall what basic mathematical objects of algebraic geometry are. The basic geometric object is affine variety — a system of polynomial equations. Affine variety is a set of tuples in $R^n$, and our focus on database applications prompts that we must narrow our scope to finite varieties, when polynomial system has finite number of roots.

OK, this abstract motivation is fine, but given a [database] relation, how do we construct a variety out of it? This is accomplished after we learn the two fundamental operations over varieties.

1. Set intersection of two varieties $V$ and $W$ is a set which is, again, a variety. The defining set of polynomials for $V \cap W$ is a union of the polynomial sets defining the $V$ and $U$.
2. Set union of two varieties $V$ and $W$ is a set which is, again, a variety. The defining set of polynomials for $V \cup W$ is a set of all pairwise products of the polynomials from $V$ and $W$.

Union and intersection of varieties is exactly what is needed for being able to exhibit a variety corresponding to a relation. Consider an unary relations with a single tuple

[x]
1


Geometrically, it corresponds to a single point on the x axis, so it becomes immediately obvious what equation defines it:

$x - 1 = 0$

Likewise, the relation

[y]
1


is defines by

$y - 1 = 0$

Next, we construct join of the two relations

[x y]
1 1


Join is set intersection, and intersection of varieties gives us the system of defining equations

$x - 1 = 0$
$y - 1 = 0$

Let’s expand our example and add one more attribute:

[x y z]
1 1 1


Our system of equations grows with one more constraint:

$x - 1 = 0$
$y - 1 = 0$
$z - 1 = 0$

Now, knowing how to construct “single tuple varieties”, we are ready to move onto relations with more than one tuple. This is accomplished via union. Consider a relation

[x y z]
1 1 1
2 1 1


which is a union of already familiar relation

[x y z]
1 1 1


with

[x y z]
2 1 1


To build the union of varieties we need a polynomial system defining the second variety

$x - 2 = 0$
$y - 1 = 0$
$z - 1 = 0$

The polynomial system for the union is

$(x-1)(x-2) = 0$
$(y-1)(x-2) = 0$
$(z-1)(x-2) = 0$
$(x-1)(y-1) = 0$
$(y-1)(y-1) = 0$
$(z-1)(y-1) = 0$
$(x-1)(z-1) = 0$
$(y-1)(z-1) = 0$
$(z-1)(z-1) = 0$

Wait a minute, 9(!) equations for a relation with just couple of tuples — this doesn’t look promising… The critical observation is that not all of these equations are independent. A well known method to greatly reduce polynomial system is Groebner basis. Executing the following command at Wolfram Alpha:

GroebnerBasis[{(x-1)(x-2), (y-1)(x-2), (z-1)(x-2), (x-1)(y-1), (y-1)(y-1), (z-1)(y-1), (x-1)(z-1), (y-1)(z-1), (z-1)(z-1)} ,{z, y, x}] 
we obtain the system

$x^2-3 x+2 = 0$
$y-1 = 0$
$z-1 = 0$

As an afterthought, this result is obvious. The first equation constraints x to being either 1 or 2, the second equation asserts that y is equal to 1, while the third one asserts z=1. It is good to learn the general method, though.

With this technique we can proceed and find a variety corresponding to the relation

[x y z]
1 1 1
2 1 1
3 2 1
3 2 2


$-6 + 11 x - 6 x^2 + x^3 = 0$
$-4 + 3 x - x^2 + 2 y = 0$
$3 - x - 3 z + x z = 0$
$2 - 3 z + z^2 = 0$

An inquisitive reader might have already noticed functional dependency x -> y lurking in plain view. In the next installment we’ll investigate what functional dependency is from algebraic geometry perspective.

## Relational Algebra syntax in QBQL

### September 3, 2013

When it comes to teaching query languages within introductory database class, a system that supports relational algebra is a must. The first step is ASCII-matizing relational algebra syntax. One approach is to be faithful to RA notation by adopting LATEX syntax. Here is an example from popular Stanford class2go (registration required):

\project_{pizza} \select_{20 <= age and gender='female'} (Eats \join Person);

When implementing support of Relational Algebra in QBQL, we have decided not to emphasize query rendering in its “native” relational algebra expression form in favor of more readable ASCII syntax. Here is the same query adapted to QBQL syntax:

project pizza select "20 <= age" ^ "gender='female'" (Eats join Person); 

It is instructive to compare it to native QBQL syntax, where one can leverage the full power of Relative Composition (aka Set Intersection Join):
 Eats /^ Person /^ "20 <= age" /^ "gender='female'"; 

With the following input (QBQL syntax):

Eats = [name, pizza] Amy mushroom Amy pepperoni Ben cheese Ben pepperoni Cal supreme Dan cheese Dan mushroom Dan pepperoni Dan sausage Dan supreme Eli cheese Eli supreme Fay mushroom Gus cheese Gus mushroom Gus supreme Hil cheese Hil supreme Ian pepperoni Ian supreme ;

 

Person = [name, age, gender] Amy 16 female Ben 21 male Cal 33 male Dan 13 male Eli 45 male Fay 21 female Gus 24 male Hil 30 female Ian 18 male ; 

all three queries output

[pizza] cheese mushroom supreme 

## Paramodulation vs. Resolution

### February 2, 2013

Equations are ubiquitous in mathematics, physics, and computer science. In database field an optimizer rewriting a query to standard restrict-project-join leverages relational algebra identities. The craft of those reasoning abilities is culminated in Automated Theorem Proving systems.

Modern systems, such as Prover9, employs many sophisticated techniques, but as far as equality is concerned, paramodulation rules them all. In a seminal paper Robinson & Wos give an example of a problem which solution takes 47 steps with paramodulation vs. 136 steps with resolution.

The difference can be even more dramatic. Consider lattice idempotence law derived from the two absorption axioms:

formulas(assumptions). x ^ (x v y) = x. x v (x ^ y) = x. end_of_list.

 

formulas(goals). x ^ x = x. end_of_list.

It takes Prover9 just a single paramodulation step and 0.00 (+ 0.03) seconds. To compare it with resolution, we have to explicitly write down equality identities fit to the two lattice operations:

x = x. x != y | y = x. x != y | y != z | x = z. x != y | x ^ z = y ^ z. x != y | z ^ x = z ^ y. x != y | x v z = y v z. x != y | z v x = z v y. 

This explosion of axioms alone hints that the proof length must increase. It is the magnitude of the increase that caught me by surprise: 98 steps and 101 sec.

## NOT EXISTS in Relational Algebra and QBQL

### September 24, 2012

Many people criticize SQL for bloated and inconsistent syntax. However, one must admit that some of its features give Relational Algebra run for the money. Consider recent stack overflow question “Relational Algebra equivalent of SQL “NOT IN””

Is there a relational algebra equivalent of the SQL expression NOT IN? For example, if I have the relation:
 A1 | A2 ---------- x | y a | b y | x 
I want to remove all tuples in the relation for which A1 is in A2. In SQL I might query:
 SELECT * FROM R WHERE A1 NOT IN( SELECT A2 FROM R ); 
What is really stumping me is how to subquery inside the relational algebra selection operator, is this possible?:

$\sigma_{some subquery here}R$

The answer, which was accepted, suggested the query:

$R - \rho_{A1,A2} ( \pi_{a_{11},a_{22}} \sigma_{a_{11}=a_{22}} (\rho_{a_{11},a_{12}} R \bowtie \rho_{a_{21},a_{22}} R) )$

The length and complexity of this expression is puzzling: one can’t convincingly argue of inferiority of SQL to “truly relational” languages without being able to crack it.

On close inspection it becomes evident that the author used more renaming than its necessary, so we have:

$R - ( \pi_{A1,A2} \sigma_{A1=A2} (\rho_{A1,a_{12}} R \bowtie \rho_{a_{21},A2} R) )$

It is still longer and less intuitive than SQL query. Can Relational Algebra do better than this? Also, if SQL is allowed to use extra operations which are equivalent to Relational Algebra antijoin, then perhaps the competition ground is not fair?

There is one operation — relational division, initially included by Codd into the list of base operation, but excluded later when it has been discovered that it can be expressed in terms of the others. We’ll demonstrate that with division-like operators at hand we can achieve much shorter query.

Modern database theory treats relational division as set-containment-join. Set containment join is easy to describe if we temporarily step outside of first normal form relations. Consider the following relations

Certified=[name  skill]
claire  java
max  java
max  sql
peter  html
peter  java
;
Requirements=[job  skill]
appl  java
appl  sql
teaching  java
web  html
;

We can nest their common attribute skill into the set

Certified=[name  skillSet]
claire  {java}
max  {java,sql}
peter  {html,java}
;
Requirements=[job  skillSet]
appl  {java,sql}
teaching  {java}
web  {html}
;

Next, for each pair of applicant and job we check if their skills matches job requirements, that is if one is subset of the other, and keep those pairs which meet this criteria.

This is classic set containment join, but mathematically inclined person would be tempted to do something about empty sets. What about the other applicants and jobs in their respective domains, shouldn’t they be included into both relations as well? Suppose there is a person named “Joe” who doesn’t have any qualification, and let the “entry” job assume no requirements at all:

Certified=[name  skillSet]
claire  {java}
max  {java,sql}
peter  {html,java}
joe {}
;
Requirements=[job  skillSet]
appl  {java,sql}
teaching  {java}
web  {html}
entry {}
;

Then, the amended set containment join would output additional tuple of Joe qualified for entry level job.

Certainly, the amended set join is no longer domain independent, but one may argue that the benefits overweight the drawbacks. For example, compared to classic set containment join, the amended operation expressed in terms of the basic ones is much more concise. We’ll show relational algebra with amended set containment join would greatly simplify our puzzle query as well.

In general, to sharpen intuition around problems like this it is useful to leverage relational algebra query engine. As usual, we use QBQL. Most of QBQL operations generalize Relational Algebra, and, therefore, are easily translatable into the later.

Here is QBQL query step by step. The most surprising is the very first step where we divide the input relation R over the empty relation with attribute A2:

R=
[A1   A2]
x     y
a     b
y     x
;
R /< [A2];


which outputs
 [A1] b 
Here an explanation may be warranted. In order to evaluate set containment join we have to partition sets of attributes of R and [A2] into exclusive and common attributes. The attribute set of the result is common attributes that is symmetric difference of the attributes of both inputs. In our example the output header is, as expected, {A1}. Now, let’s evaluate the output content. For R the exclusive/common attributes partition is easy: the A1 is exclusive attribute, while A2 is common. In “de-first-normalized” form the table R looks like this:
 R= [A1 A2] x {y} y {x} a {b} b {} 
For the relation [A2] the common attribute is again A2, while the exclusive one is empty. How we set-join such a relation? In de-normalized form the table [A2] looks like this:
 [A2 {}] {} R01 
That’s right, sets are relations with single attribute, and when we are challenged to exibit a nested relation with no attributes, then we have only two choices: R01 and R00 (i.e. TABLE_DEE and TABLE_DUM). After set containment join we obtain
 [A1 {}] b R01 
where the fictitious empty set column should be excluded from the normalized result.

After obtaining the relation
 [A1] b 
the next steps are easy: rename the attribute to A2 and join with R. In QBQL renaming is technically set-intersection join (/^) with binary equality relation (although set-equality /^ and set-containment would produce the same result), so the query becomes:

R ^ ("A1=A2" /^ (R /< [A2]));


and it outputs
 [A1 A2] a b 
as expected.