## Caclulating Entropy and Gini Index for a partitioned table

### January 26, 2012

20th century has been highlighted by the two notable insights onto the nature of information. In 1948, Claude Shannon published classic paper “A Mathematical Theory of Communication” suggesting quantitative information measure — entropy — as average number of bits needed to store or communicate one symbol in a message. The paper has tremendous impact: it launched the whole new discipline — the Information Theory. Two decades later Edgar Codd invented the Relational Model, which equipped previously vague concept of “information pieces” with formal structure.

Since then there was at least one direct attempt to marry the two perspectives when Jürg Kohlas proposed Information Algebra. Other, less ambitious connections between the two disciplines were discovered within relational database dependency theory. Squeezed into one sentence this approach introduces quantitative measure of attribute dependency and proceeds expressing known dependencies (functional, multivalued, etc) in their terms. This blog posting pursues even more modest goal: first, describing a well known characterization of functional dependencies in terms of lattice partitions, then calculating partition lattice entropy and Gini index and, finally, proving that all three orders are consistent.

Classes=[Prof    Course  Time]
Libkin  DB101   Tue200
Libkin  DB101   Thu500
Gromov  Math    Tue200
Gromov  Math    Thu500
Vianu   DB101   Tue200
;

A binary partition of relation attributes, say $\{Prof\}|\{Course,Time\}$ partitions the Classes tuples into 3 sets:

QBQL Relational programming system calculates partitions via binary operation; in our running example

Classes#[Prof];

which outputs

Classes#[Prof] = <Libkin,DB101,Tue200> <Libkin,DB101,Thu500> |
<Gromov,Math,Tue200> <Gromov,Math,Thu500> |
<Vianu,DB101,Tue200>;

Partitions can be compared pairwise: informally, partition $x$ is greater than partition $y$ if it is more coarse. Our next step is proving that this ordering is consistent with two numerical partition measures.

Partition entropy is defined via Shannon information measure formula

$\displaystyle\sum\limits_{i=0}^n p_i\ln(p_i)$

where summation ranges over partitions and $p_i$ is probability of selecting a tuple from partition $i$. Let’s calculate entropy of the Classes#[Prof] partition in our example. Again, we automate this task with QBQL, and since later on we would like to calculate entropy of other partitions, we’ll provide generic definition. Let’s emphasize that this theoretically humble idea is one of the major practical benefits of QBQL over SQL and Datalog. First, we define 3 empty relations

CntRelHdr = [cnt];
CardRelHdr = [card];
plpHdr = [plp];

This is just implementation artifact/bug as QBQL should be able to inline these relvars into any query. Then we progresively define partition probabilities <Prob> and entropy <Entropy> as binary operations:

x <Prob> y = ((x |v| (y ^ CntRelHdr)) ^ (x |v| CardRelHdr)) /^ "cnt / card = prob".
x <Entropy> y = ((((x <Prob> y) ^ "ln(prob)=lp") /^ "prob*lp=plp") v plpHdr) /= "result += plp".

The definitions are generic, because you can plug in any two relations in place of x and y. For example, if you are interested in calculating the entropy of Classes#[Prof], you just type:

Classes <Entropy> [Prof];

Let’s walk trough <Prob> implementation. The expression y ^ CntRelHdr is a join of relation y with previously defined empty relation CntRelHdr, and since their attribute sets are disjoint it is a Cartesian product. In other words, we just want a set of attributes from relation y amended by one more attribute cnt. At the next step — x |v| (y ^ CntRelHdr) — we see something that appears as unfamiliar binary infix operator |v| applied to our previous result and relation x. The v symbols is reminiscent of inner union (which is generalization of union and projection) amended with the vertical bars, which in standard math notation usually denotes cardinality. It is SQL group by with counting a counterpart in QBQL notation; the grouping is done over the set of common arguments of the two relations x and y ^ CntRelHdr, while the count function column header is set of attributes of y ^ CntRelHdr which is not in x. In a typical application, such as Classes <Entropy> [Prof], the set of attributes of x is subset of that of y. Therefore, the name of the counting attribute is cnt. At the other branch, of the expression tree we compute x |v| CardRelHdr. Here relation attributes are disjoint, therefore this is just counting with no grouping (which is equivalent to counting with grouping over empty set). The two previous results (x |v| (y ^ CntRelHdr) and x |v| CardRelHdr again have disjoint set of attributes, so joining them is essentially a cartesian product. The final operation — relational composition with the "cnt / card = prob" (which is user-defined ternary predicate Times in disguise) — calculates probabilities for each table partition. Again, let’s not be distracted by the fact that this query is elementary in SQL:

select Prof, count(*)/card prob
from Classes, (select count(*) card from Classes)
group by Prof

First, it is not generic (as it should be rewritten for every other target relation and different subsets of its attributes); second, it uses ad-hock tuple-level expression evaluation (division operator). In our example, after we have defined generic binary entropy operation, then calculating entropy values for all possible splits of relation attributes is as succinct as it can possibly be:

Classes <Entropy> [Prof];
Classes <Entropy> [Course];
Classes <Entropy> [Time];
Classes <Entropy> [Prof Course];
Classes <Entropy> [Course Time];
Classes <Entropy> [Prof Time];
Classes <Entropy> [Prof Course Time];

The entropy values numerical order is consistent with partitioning order and this is not a coincidence. Consider a single partition $\{a_1,a_2,...,a_n\}$ split into the two $\{a_1,a_2,...,a_i\} | \{a_i,...,a_n\}$ and assume that both were the part of bigger partition of the set of $N$ elements. Then, the entropy changes from

$\frac{n}{N} \ln(\frac{n}{N}) +$ entropy of other pieces

to

$\frac{i}{N} \ln(\frac{i}{N}) + \frac{n-i}{N} \ln(\frac{n-i}{N}) +$ the same entropy of the rest of the pieces

Since the first term of the former sum can be rewritten as

$\frac{i}{N} \ln(\frac{i+n-i}{N}) + \frac{n-i}{N} \ln(\frac{n-i+i}{N})$

it is greater than the later. Therefore, we have established that entropy is monotonically increasing along with partition granularity. Shannon Entropy function is not unique in that respect. The similar proposition is valid for Gini Index:

$\displaystyle\sum\limits_{i=0}^n p_i(1-p_i)$

### 2 Responses to “Caclulating Entropy and Gini Index for a partitioned table”

1. william pearson Says: