## Set Joins and Aggregation

Introduction

This is the third article about QBQL relational programming language. In previous pieces [1,2] we walked through QBQL foundation: natural join and inner union operations, together with an interface to a procedural language (Java) which allows defining relations with potentially infinite content — predicates. Then, we glimpsed to set joins as a means of introducing renaming operation. Here we’ll investigate the matter in more detail, and demonstrate that set equality join is a carrier for aggregation with grouping, so that no dedicated aggregation/grouping operator is necessary.

Relational Division and Set Joins

Relational division is one of eight Codd’s original relational algebra operations. Soon after introduction it has been discovered that the division can be expressed in terms of the other operations. This, together with the fact that division is quite uncommon, negatively affected its popularity, and in contemporary database textbooks the division is presented mostly as afterthought.

Classic relational division requires the set of attributes of the divisor being the subset of that of the dividend. It is natural to lift this restriction and make division applicable to any pair of operands. This leads to a family of division-like operations known in database research literature as set joins.

Let’s get to definition. Given two relations `x` and `y` we partition all the attributes of `x` and `y` into three sets:

• Attributes which are exclusively in `x`
• Attributes which are exclusively in `y`
• Common `x` and `y` attributes

Set join operations always produce a relation with attributes which were exclusively in `x` and in `y` or, in other words, symmetric difference of the relation headers. For example, a set join applied to the relations

```Certified=[name  skill]
claire  java
max  java
max  sql
peter  html
peter  java
;
Requirements=[job  skill]
appl  java
appl  sql
teaching  java
web  html
;```

would output a relation with attributes `{name,job}`.

In order to define the output relation content it is convenient to step outside of 1NF. We have already partitioned the attributes of `x` and `y` into exclusive and common attributes. The values in the exclusive attributes are interpreted as set identifiers, and the values in the common attributes are grouped into the sets. Perhaps our running example would convey this idea better:

```Certified=[name  skillSet]
claire  {java}
max  {java,sql}
peter  {html,java}
;
Requirements=[job  skillSet]
appl  {java,sql}
teaching  {java}
web  {html}
;```

Next, there is rather subtle issue whether these sets are allowed to be empty. Mathematically inclined person would find it hard to justify why skillSet is never empty, therefore would be tempted to generalize the above relations into something like that

```Certified=[name  skillSet]
claire  {java}
max  {java,sql}
peter  {html,java}
joe {}
;
Requirements=[job  skillSet]
appl  {java,sql}
teaching  {java}
web  {html}
entry {}
;```

where we assumed two more values `joe` and `entry` in the domains `name` and `job`.

Finally, we join the two relations as if the common attribute values were scalars rather than sets. Like with ordinary relational algebra join the result is determined by what condition we choose to compare the sets. There are at least 5 possibilities:

1. Sets required to be disjoint. For example, Claire skill set — `{java}` — doesn’t match requirements for the Web job, that is `{html}`, therefore the tuple `<name=claire,job=web>` makes its way into the result. The notation for disjoint set join is `/0` (the division symbol, followed by the set intersection cardinality constrained to `0`). The example:

```Certified /0 Requirements;
-- QBQL output:
Certified/0Requirements=[job  name]
appl  joe
entry  claire
entry  joe
entry  max
entry  peter
teaching  joe
web  claire
web  joe
web  max
;```

Please note that because of our treatment of empty sets the output of set disjoint join is domain dependent.

2. Sets intersection required to have exactly one element. For example, Peter skill set — `{html,java}` — have exactly one element in common with the `appl={java,sql}` set, therefore the tuple `<name=peter,job=appl>` appears in the result. The notation for this almost disjoint set join is `/1` (the division symbol, followed by the set intersection cardinality being 1). The example:

```Certified /1 Requirements;
-- QBQL output:
Certified/1Requirements=[job  name]
appl  claire
appl  peter
teaching  claire
teaching  max
teaching  peter
web  peter
;```

3. Sets intersection required to be not empty. In previous article we have already leveraged set intersection join to simulate renaming. The notation for set intersection join is `/^`. The example:

```Certified /^ Requirements;
-- QBQL output:
Certified/^Requirements=[job  name]
appl  claire
appl  max
appl  peter
teaching  claire
teaching  max
teaching  peter
web  peter
;```

We could have invented many more conditions constraining set intersection. For example, a dual constraint to disjoint sets would require set intersection to be “big”. Slightly more exotic possibility would be requiring one set cardinality to be at least half of the other. This last case is not entirely made up: Language Proof and Logic textbook introduces unconventional quantifiers such as “most of”, and set joins are quantifier-like operations in relational algebra. We’ll not explore this venue in any depth, and focus on the two most important set joins.

4. One set contained in the other. For example, Claire skill set — `{java}` — is contained in `appl={java,sql}` set, therefore the tuple `<name=claire,job=appl>` appears in the result. Unlike previous cases this operation is not symmetric, therefore the notation for set containment join bifurcates into `/<` and `/>`. The example:

```Certified /> Requirements;
-- QBQL output:
Certified/>Requirements=[job  name]
appl  max
entry  claire
entry  joe
entry  max
entry  peter
teaching  claire
teaching  max
teaching  peter
web  peter
;```

Both operations `/<` and `/>` are naturally combined into one:

5. Sets are equal. For example, Max skill set — `{java,sql}` — matches the requirements `{java,sql}` for `apps` exactly, therefore the result contains the tuple `<name=max,job=appl>`. The notation for set equality join is `/=`. The example:

```Certified /= Requirements;
-- QBQL output:
Certified/=Requirements=[job  name]
appl  max
entry  joe
teaching  claire
;```

In the beginning of the section we have mentioned that relational division is not a fundamental operation, because it can be expressed via standard RA operations. This idea carries over onto set joins (with an exception of almost disjoint set join). Compared to classic relational division, which is expressed in terms classic RA operations, set join expressions in relational lattice terms are much more succinct and symmetric, which is attributed to two factors: mathematical effectiveness of relational lattice, and amendment for the case of empty sets. Our focus, however, is QBQL programming, this is why we don’t pursue this venue here. All that we need to know is that QBQL somehow implements all the set join operations for relation operands. More interesting, however, is the fact that QBQL also knows how to perform set equality join of relation with predicate!

Aggregation

Counting (and, more generally, summation) is a prototypical aggregation operation. Let’s introduce a predicate that informally asserts that “the `summands` values add up to `result`“:

```public class Sum {
public static NamedTuple summands_result(
Relation summands
) {
int ret = 0;
for( Tuple t: summands.content ) {
ret += (Integer)t.data[0];
}
String[] columns = new String[]{"result"};
Object[] data = new Object[]{ret};
return new NamedTuple(columns,data);
}
}
;```

The reader might have noticed that we had to amend our relational-to-java binding and introduce methods that take a whole relation as an argument. By any means this is not a new idea: RDBMS engines had table functions for ages.

How do we apply the `Sum` predicate to add values? We would have to match the value set against the `summands`, so set equality join seems to be the right candidate. A usual, an example is much more eloquent than words:

```Sum /= [summands] 1 2 3 4;
-- QBQL output:
Sum/=[summands]1234=[result]
10
;```

How about aggregation with grouping? “List all the departments with total salary of their employees?” Almost as easy as previous example:

```Emp = [EMPNO ENAME JOB MGR SAL DEPTNO]
7369 SMITH  CLERK 7902 800 20
7499 ALLEN SALESMAN 7698 1600 30
7521 WARD SALESMAN 7698 1250 30
7566 JONES MANAGER 7839 2975  20
7654 MARTIN SALESMAN 7698 1250 30
7698 BLAKE MANAGER 7839 2850  30
7782 CLARK MANAGER 7839 2450  10
7788 SCOTT ANALYST 7566 3000  20
7839 KING PRESIDENT 7839 5000  10
7844 TURNER SALESMAN 7698 1500 30
7876 ADAMS CLERK 7788 1100  20
7900 JAMES CLERK 7698 950  30
7902 FORD ANALYST 7566 3000  20
7934 MILLER CLERK 7782 1300  10
;
((Emp v [DEPTNO SAL]) /^ [SAL=summands]) /= Sum;
-- QBQL output:
((Empv[DEPTNOSAL])/^[SAL=summands])/=Sum=[DEPTNO  result]
10  8750
20  7875
30  8150
;```

Duplicates

It is encouraging that there is nothing special about aggregation in QBQL. Generally, the more features language has, the more bloated and incoherent the whole programming system becomes. However, there is a little problem. What if there are employees with the same salary; wouldn’t projection onto `{DEPTNO,SAL}` collapse these salaries? One may argue that, say, SQL and Tutorial D which both have dedicated operators for aggregation and grouping don’t have this problem. Is it really so?

Note that the problem appears at the moment when a naive user projects the results into the two columns, and his logic is perfectly sound: “When querying totals of department salaries only the `DEPTNO` and `SAL` matter, so why not to project everything else away?” Actually, SQL and Tutorial D differ at this point with Tutorial D merging the duplicates, and SQL producing a multiset. Here is our decision point: do we side with relational purists and dismiss multisets, or adopt SQL treatment of duplicates?

Certainly, we don’t want to introduce multisets: this would put the algebraic foundation onto a shaky ground. Moreover, SQL `distinct` keyword looks like is a throwback to COBOL era. On the other hand, there is something rotten in the relation land. Consider relation:

```Person = [name age]
max  33
peter 33
;```

Projecting it to the `age` column would collapse it into a single tuple, but is it really aligned with user’s expectations? The time value is not an integer! If the `age` values are disturbed slightly, say, like this

```Person = [name age]
max  33.1
peter 32.87
;```

then there is no duplicates, and projection is no longer counterintuitive. So if we have floating type values, they can always be considered unequal; futhermore, not using equality comparisons between float values is actually well respected programming rule of thumb.

Therefore, we have two ends of the spectrum. On one hand, we have string literals that identify things like person names; on the other hand, we have real numbers as results of measurements subject to measurement error. What about gray area between them: how would we classify some integer column? It could be an identity, for example, social security number, or it could be some rounded real number. We suggest that the answer should be derived from the usage. If an attribute is a part of foreign or unique key, that is certainly an identity; whereas an attribute used in an expression involving aggregation should be treated as a measurement.

### 8 Responses to “Set Joins and Aggregation”

1. AntC Says:

Hi Vadim, this lattice stuff is hurting my head, but it seems much ‘cleaner’ than the Relational Algebra operators.

I don’t like “to step outside 1NF” with these multivalued attributes.

I wonder if there is another way?
On the page ‘Perspective into Functional Dependencies’ you use set partition to make sets of sets of tuples.
Could this work for your family of division-like operators? Like this:
1. Partition each operand by its exclusive attributes.
2. Apply the set-based join between the member sets.
3. ‘Flatten’ each set of sets back to a set of tuples.

Presumably your summands=result could also work on a partitioning basis for aggregates(?)

I guess this wouldn’t work for the “subtle issue whether these set[-valued attribute]s are allowed to be empty”. But I really don’t like your examples with empty sets.

I notice that the partitioning approach means you can use axiomatic reasoning. But there’s no reasoning on this page re set joins(?)

Another parallel: the Haskell language has recently introduced an experimental extension to support SQL-like aggregates (GROUP BY, HAVING, ORDER BY, SUM(), COUNT()) for its list comprehension syntax. (Lists are the usual way to simulate sets.) This also works on a partitioning approach: comprehend the list into a list-of-lists, then ‘flatten’ back to a list.

AntC

• AntC Says:

P.S. Perhaps that could also work round the problem you discuss re duplicates?

First rename SAL to summand (singular).
Then partition on DEPTNO.
Set-join on Sum to extend each tuple with result. — beware!
Then ‘flatten’.
Then project [DEPTNO result].

Beware! that the set-join is ‘inside’ the set-of-sets, and extends each tuple with the Sum result for its grouping by DEPT. (So it’s not very division-like.)

It would be nicer to put the rename and project ‘inside’ as well — but then they wouldn’t be the same operators as work at relation level.

• AntC Says:

P.P.S. The difficulty with duplicates is really going to bite with things like COUNT(*) or COUNT(DISTINCT JOB).
The approach would need to extend (compose) each tuple to 1, then sum the 1’s. But what could the Sum relation constant look like? It can’t have a set-valued attribute (summands) with the 1’s in, because you’d get duplicate elements in the set.

• There are two issues here: set join operations, and aggregation with grouping. For set joins there is little mystery: they are algebraic analogs of quantifiers in calculus. Basic set joins — subset-of, subset-equality, nonempty subset intersection — are well known to be expressible via primitive relational algebra (or lattice) operations. Only when we broaden our scope to algebraic analogs of generalized quantifiers (e.g. “more than half”), the expressiveness fails.

As for aggregation, the matter becomes more challenging. For one thing, I’m not entirely convinced either with my own method advocated in this article, or with alternative formalizations found in the literature.

You mention partitioning, and this might be the key: this week I’m under deep impression of Ellerman’s Logic of Partitions:
http://www.ellerman.org/Davids-Stuff/Maths/Logic-of-Partitions-Reprint.pdf

• AntC Says:

Thanks Vadim, I agree that set join operations are separate to aggregation/grouping.

The treatment on this page uses set joins to achieve aggregation. And it actually produces the wrong answer in the Emp DEPTNO example. (What that
example achieves is SUM(DISTINCT SAL) — is that valid SQL? Are there any applications in which it’s useful?)

The question re set joins: are they to be primitive within the algebra? Or to be defined in terms of more basic operations? And if so, what operations?

Yes, set joins are usually explained in terms of quantification over tuples, or relational algebra operations over the operands. So those are staying
within the ontology of sets of tuples. That’s why I don’t like explaining in terms of multivalued attributes: it’s stepping outside the relational model
(neither is it lattice-y ;-).

I’m saying: it’s better to explain set join operations in terms of partitions of the operands by their exclusive attributes; then joining/comparing the sets-within-partitions.

So with partitions already introduced into the semantics, it’s easy to extend the explanation to aggregation/grouping. (I’m saying: let’s minimise the
concepts needed; apply Occam’s razor to the ontology.)

And as an added bonus: the partition approach to grouping avoids those problems with duplicates!

2. Anthony: To clarify that all set-join operations discussed in this article are expressible in terms of primitive lattice operations here are the identities:
```x /< y = ((y' ^ x) v (R00 ^ (y v x))`)'. x /^ y = (R00 ^ (y v x))` v (y ^ x). x /= y = (x /< y) ^ (y /< x).```

• AntC Says:

Hmm, I’m not sure what you’re trying to persuade me to.

Let’s get compose (/^) out of the way first:
You define it as good old-fashioned Natural Join, with project out of the attributes in common. (So this is the same as D&D’s = with .) So (/^) is not a set join. So we don’t need set join for Rename (nor for Extend).

Next Aggregation with Grouping:
Set joins just plain give the wrong answer, in general.

Next set equality join (/=):
That’s a cute definition, but the double-negations (absolute complements in the definition for /<) lead into the "subtle issue" of empty sets. So (/=) is not the same as Codd's relational division. So I think it's no good for the examples on your page "… so uncommon", or "Nested Sets".
(To be precise, in those examples you don't show the full domain of the attributes. But (/=) will give a wrong answer assuming that in general there are values in the domain that don't appear in the relations. Like the values name=joe and job=entry in the examples on this page.)

I almost wonder if you've introduced the empty sets explanation just so you can give cute definitions. That's putting the cart before the horse.

Relational division can be defined in terms of more primitive operations, but the definition is not so cute as for (/=). OTOH it can use relative complement of body and attributes (heading), so no domain safety issues.

• Why do you reject `/^` as set intersection join (set join with condition that subset intersection is empty)? One way to convince yourself is to notice that it is algebraic counterpart of existential quantifier.

The reason why I prefer unsafe set join definition is that special treatment of empty set would obscure algebra. One can easily recover Codd’s division/safe set join by taking unsafe join and restricting the resulting relation (via joining it with the original arguments).