Querying Parse Trees
October 12, 2016
In previous blog entry we have glimpsed into new SQL Developer 4.2 feature — parse trees. Again, this is something that development tools don’t readily expose with the most likely reason being the lack of foresight what users can do with this information. Blunt answer proposed here is: “query it”.
In general, querying is a formal and meaningful way to ask questions about something. In context of parsing the structure of the objects which we ask those questions about is well defined. In other words, the input for a query is a parse tree. Then, what is the structure of query answer?
There are at least two possibilities. The most natural is perhaps to have the output of the query to have the same structure as the input, that is the tree. However, I yet to see a compelling query language over trees, so perhaps choosing something more familiar to database developer – a relation (table) – would be a better choice?
A [query] language which takes an input in one form (a tree), but produces the results in another form (a relation/table) might seem counter intuitive, yet we’ll demonstrate that it is practical. The first objection might be that such query language, which input is a structure of one kind, but the output is something else is “not closed”. This is the matter of wrong appearance. Trees are binary relations; therefore, it is just that our proposed query language makes no effort to keep the query structures confined to this narrow kind of relations.
As we are going to talk in depth about this new query language, it would become awkward without naming it. So let’s give it a name — Arbori. (The name credit goes to Brian Jeffries from SQL Dev team). With this prelude, let’s give a taste of Arbori query.
Code quality metrics, or code guidelines can be viewed as queries or constraints over a code base. For example, Trivadis coding guideline #33 reads: “Always close locally opened cursors“. Reformulating it as a query, it would sound like this: “Find all cursors which are not closed in the enclosing procedure“.
When querying parse trees it is always insightful to focus on a concrete code sample and figure out what parse tree nodes we are after.
CREATE PACKAGE BODY "Trivadis code guideline #33" AS --bad PROCEDURE not_close_cursor ( out_count OUT INTEGER ) AS CURSOR c1 IS SELECT COUNT(*) FROM all_users; BEGIN out_count := 0; OPEN c1; FETCH c1 INTO out_count; END not_close_cursor; -- Good PROCEDURE close_cursor ( out_count OUT INTEGER ) AS CURSOR c1 IS CURSOR c1 IS SELECT COUNT(*) FROM all_users; BEGIN out_count := 0; OPEN c1; FETCH c1 INTO out_count; CLOSE c1; END close_cursor; END;
The parse tree for this code is
The cursor declaration is a node with payload “cursor_d” (1). It has two children, the node labeled with keyword “CURSOR”, followed by the identifier “C1”.
Next, we identify the parse node where the cursor is closed (2). It is a node labeled as “close_statement” and we have to make sure that its second child matches the name of the cursor that we have identified on previous step.
Finally, the cursor declaration and closing nodes that we just have identified have to be in the same scope, that is there has to be an ancestor node corresponding to the procedure that encloses both (3).
This is all the thought that is needed. The further development is just the formal translation of those 3 insights into formal Arbori query.
However, a proposition of mastering new query language is something that was hard to convince SQL Dev feature decision makers. This is why Arbori query is hidden and could only be enabled by an auxiliary command:
set hidden param arboriEditor = true;
Executing this command results in appearance of an additional button on the Code Outline panel:
Clicking it invokes Arbori dockable query panel. For our purposes it is convenient to organize all three panels side-by-side:
If a reader feels intimidated by a perspective of learning yet another query language, then this perception is without merit. In fact, anybody proficient in SQL should have no problem learning it. In SQL terms, each Arbori query accesses a single input object – the parse tree, which allows to greatly simplify query syntax: there is no need for the FROM clause. Also the attributes restriction is implicit from the predicates (or is achieved by other means), so that there is no SELECT clause. The entire query is reduced to a single WHERE clause.
Unlike SQL, each query is named. Here we started writing a query appropriately named “leaking_cursors”. There is no substance in query itself yet (it is invalid Arbori syntax).
The major Arbori query components are unary predicates. Basically, we want to say that the node payload is such and such. Before we expressed our interest finding a nodes labeled “cursor_d”. This is how it is formally written in Arbori syntax:
We need to introduce an attribute name — “cursor_decl” (1) — and specify that a parse node named “cursor_decl” is required to have a payload “cursor_d” (2). If this syntax feels alien to the reader, written in SQL this would look something like this:
select cursor_decl from parse_tree where payload(cursor_decl).includes('cursor_d')
Admittedly, coupling square and round parenthesis in Arbori might be too much for some readers. Please don’t hesitate to propose a better syntax!
Executing this simple query produces the list of cursor declarations (3). Changing list selection updates node references on the parse tree and, as I described in the previous article, tree node selection is consistent with selection in the original PL/SQL or SQL code editor. The numbers in the odd brackets/parenthesis [13,15) mean semi-open interval/segment of integers: at position 13 of the input PL/SQL sample code there is a token “CURSOR”, while at the next position 14 there is a token “C1”.
This is some progress towards desired query, so let’s find all the nodes labeled “subprg_body”. We know how to do that already, so we just add one more condition. More importantly, we need a condition which requires procedure node to be an ancestor of the cursor declaration node. This is formally specified as a binary ordering relation enclosed_proc < cursor_decl:
Informally, ancestor date of birth precedes that of descendant. A reader familiar with concept of nested sets would find this notation quite intuitive, for example the node [49,51) “cursor_decl” is nested inside the [41,80) “enclosed_proc” node.
The next step, is finding the nodes corresponding to cursor closures. This is easy:
We can also add a requirement for “cursor_close_stmt” to be nested inside “enclosed_proc”. However, we also need to express new requirement:
- The nodes “cursor_decl” and “cursor_close_stmt” refer to the same cursor name
This warrants additional Arbori syntax, and slight tactical maneuver. First, let’s forget about “cursor_decl” and “cursor_close_stmt” and shift our attention onto their children. Specifically, we are interested in their children which convey the cursor name, that is the ones labeled “identifier” (among other things). Let’s call these modified attributes “cursor_decl_id” and “cursor_close_stmt_id”. Since we have conditions onto their parents, it is handy to just have a convenient way to refer to them. That Arbori syntax is a caret after an attribute name. For example
postulates that a parent node of “cursor_close_stmt_id” is labeled “close_stmt”. There is one more additional syntax: specifically we want “cursor_decl_id” and “cursor_close_stmt_id” to refer to the same identifier:
?cursor_close_stmt_id = ?cursor_decl_id
Now we have all the ingredients to finalize our query. Realizing, however, that we found not quite yet what we were after, lets rename it into “well_behaved_cursors”:
We are just one step away from finishing the task. One more query is wanted, named appropriately as “all_cursors”, which is the variant of “well_behaved_cursors” that ignores cursor closing statements. Then, we need to subtract “well_behaved_cursors” from “all_cursors”.
Apart from exotic syntax, little of what I have just described would raise brows of a reader with database background. Query gradually evolving for performing more and more complex tasks is something database developers and DBAs routinely do. Querying the code is quite easy!