Parsers Made Easy

April 17, 2018

This is the second in the series of articles explaining parser technology available as a part of open-source dbtools-commons library. In previous post we have described ready-made parser for the combined oracle SQL and PL/SQL grammar. Today, we’ll extend the scope and illustrate how to make a parser for any language.

To reiterate, our parser engine leverages Earley parsing algorithm, which admits any context-free grammar. Therefore, if you want a parser for your language, you just feed the grammar definition into the Earley parser engine. This is straightforward in theory, but let see how many steps does it take to make a java parser.

Our intent is to amend the parsing functionality implemented in Earley class:

public class JavaEarley extends Earley {

The Earley constructor is provided with a set of grammar rules, so we keep constructor signature unchanged:

    public JavaEarley( Set originalRules ) {
        super(originalRules);
        initKeywords();
    }

Earley parsing algorithm works even with inherent ambiguity between a keyword and an identifier. Yet, Java language specification contains a modest list of keywords, which can never be mistaken for identifiers. Although, again, formally not required, resolving ambiguity between keyword and identifier results in faster parse times:

    static public String[] keywords = {
        "'abstract'",
        "'assert'",
        "'boolean'",
        ...
        "'volatile'",
        "'while'",   
    };
    private Set _keywords = new TreeSet(); // pure Keywords
    void initKeywords() {
        for( String k : keywords )
            _keywords.add(getSymbol(k));
    }

Because of the parameter, the constructor itself is not very useful. We’ll supply a factory method to construct JavaEarley singleton:

    private static JavaEarley instance = null;
    public static JavaEarley getInstance() throws Exception {
        if( instance != null )
            return instance;
        
        Set rules = new TreeSet();
        
        String input = Service.readFile(JavaEarley.class, "java8.grammar");
        List src = LexerToken.parse(input, false, LexerToken.QuotedStrings); 
        ParseNode root = Grammar.parseGrammarFile(src, input);
        Grammar.grammar(root, src, rules);
        RuleTransforms.eliminateEmptyProductions(rules);

        instance = new JavaEarley(rules);
        instance.isCaseSensitive = true;
        return instance;
    }

In this java method we read the java8.grammar file, parse it, and translate into a set of rules. The Grammar is yet another Earley-based specialized parser dedicated to parsing EBNF files.

To complete the parser we have to override the two methods from the base class. This may seem cumbersome to the reader, but we need to somehow glue the Earley scan method with the output of lexical analysis. In other words, we need to inform JavaEarley what identifiers, string and numeric literals are.

    @Override
    protected boolean isIdentifier( int y, List src, int symbol, Integer suspect ) {
        if( symbol != identifier )
            return false;
        if( suspect != null && _keywords.contains(suspect) )
            return false;
        
        LexerToken token = src.get(y);
        if( token.type != Token.IDENTIFIER )
            return false;
        if( suspect == null )
            return true;
            
        return true;
    }
    
    @Override
    protected boolean isScannedSymbol( int y, List src, int pos, Tuple t, Integer suspect ) {
        int symbol = t.content(pos);
        LexerToken token = src.get(y);
        if( symbol == digits && token.type == Token.DIGITS )
            return true;
        if( token.type == Token.QUOTED_STRING && symbol == symbolIndexes.get("character_literal") )
            return true;
        if( symbol == string_literal && token.type == Token.DQUOTED_STRING )
            return true;
        return suspect != null && suspect == symbol 
            || isIdentifier(y,src, symbol, suspect) && (suspect==null||notConfusedAsId(suspect,t.head,pos))
        ;
    }

That was all! Now we invoke the parser by the familiar routine:

        String input =  Service.readFile(".../javaparser.testinput");
        LexerToken.switchLineCommentSymbol("/");
        List src = LexerToken.parse(input,"&");       	
        JavaEarley earley = JavaEarley.getInstance();
        Matrix matrix = new Matrix(earley);
        earley.parse(src, matrix); 
        ParseNode root = earley.forest(src, matrix);
        root.printTree();

It outputs a parse tree to the console like this:

[0,83)   compilationUnit
  [0,7)   "compilationUnit[2,4)"  packageDeclaration
    [0,1)   'package'
    [1,2)   identifier
    [2,6)   "packageDeclaration[6,11)"
    [6,7)   ';'
  [7,83)   classDeclaration  normalClassDeclaration  typeDeclaration
    [7,8)   "normalClassDeclaration[2,4)"  'public'  classModifier
    [8,9)   'class'
    [9,10)   identifier
    [10,83)   classBody
      [10,11)   '{'
      [11,82)   "classBody[3,5)"
        [11,62)   classBodyDeclaration  classMemberDeclaration methodDeclaration
          [11,13)   "methodDeclaration[2,4)"
            [11,12)   "methodDeclaration[2,4)"  'public'  methodModifier
            [12,13)   'static'  methodModifier
          [13,19)   methodHeader
            [13,14)   'boolean'  result  unannPrimitiveType  unannType
            [14,19)   methodDeclarator
              [14,15)   identifier
              [15,16)   '('
...

Here is the source code for JavaEarley.java and java8.grammar.

Advertisements

Earley Parser Engine

April 13, 2018

Finally, parsing technologies that I have been involved with during the last decade become opensource. It includes many features, but in this blog I’ll focus on Earley Parser Engine and Arbori Semantic Analysis. We’ll cover applications, such as Oracle SQL&PL/SQL parser, formatter, SQL code injection analysis, and, hopefully, many more.

We start with the basics. Given an input SQL code, how do I get the parse tree? SqlEarley.main() contains the sample code, which, essentially, reduces to the following.

First, we read an input file:

    final String input = 
        //"select 1 from dual;"
        Service.readFile(SqlEarley.class, "test.sql")
        ;

Next, we perform Lexical analysis:

    List src =  LexerToken.parse(input);
    //LexerToken.print(src);

A list of tokens is fed to the parser, but we have to instantiate the parser object, first:

    SqlEarley earley = SqlEarley.getInstance();

which is followed by initialization of Earley matrix

    Matrix matrix = new Matrix(earley);

Earley parsing is actually a recognition (which combines scan, predict, and complete into a single method)

    earley.parse(src, matrix); 

Finally, we can build and output parsing tree

    ParseNode root = earley.forest(src, matrix);
    root.printTree();

The print statement should output parse tree to java console like this:

[0,7)   library_unit  sql_statement  sql_statements ...
  [0,6)   query_block  select  ...  sql_stmt  subquery
    [0,4)   select_clause
      [0,1)   'SELECT'
      [1,4)   compound_expression  expr select_list  select_term
        [1,2)   digits  expr literal  numeric_literal  simple_expression
        [2,3)   '+'  
        [3,4)   digits  expr  literal  numeric_literal  simple_expression
    [4,6)   from_clause
      [4,5)   'FROM'
      [5,6)   identifier  query_table_expression  table_reference  ...
  [6,7)   ';'

If you feel that more detailed explanation is warranted, then please bear with me little longer.

Introduction

Parsing or, more formally, syntactic analysis is ubiquitous document processing stage. It takes seemingly disorganized string of characters and produces distinctly structured output – a tree. Ordinary documents feature multiple levels of hierarchy: characters, words, sentences, paragraphs, chapters, volumes…

From parsing perspective only the two bottom levels are significant. Document processing at the character level is called lexical analysis. Lexical analyzer, which is often called scanner, outputs a stream of words better termed as tokens. On the level above lexical analysis the stream of tokens is digested by a parser which organizes them into a parse tree. These two document processing phases differ greatly in their complexity, yet here our concern is usage, which in both cases reduces to easy java API call.

Scanner

Most programming languages share basically the same lexical structure. The conventional wisdom is that inventing specialized exotic lexical form for a new language has dubious value. We will, therefore, adopt “consolidated” C, C++, Java, SQL, PL/SQL lexical structure, which consists of white spaces, digits, single- and double-quoted strings, single- and multi-line comments, and operation symbols. These token types are enumerated in Token.java.

Again, the scanner is run with a single call:

List out = LexerToken.parse(input);

For example, given an input string

String input = "select emp from empno";

the scanner outputs the following list of LexerTokens:

0 [0,6) select IDENTIFIER
1 [7,10) emp IDENTIFIER
2 [11,15) from IDENTIFIER
3 [16,21) empno IDENTIFIER

Here, the first column contains numbers, which are token position in the list.

The semi-open segment in the second column delineates beginning and end character positions of the token – its location within the input string. This notation is explained in more depth in the following section — Semi-Open Segments.

The third column lists the token content.

Finally, in the last column we find the token type.

Technically, this printout has been generated with java call

LexerToken.print(out);

This is the essence of LexerToken.main(), which reader is advised to experiment with by modifying and running it.

Semi-Open Segments

Partitioning linearly ordered elements into semi-open segments has became fairly standard software practice. Java text processing utilities specify intervals of characters with beginning position included, and the end position excluded. For example,

"select * from emp".substring(7,8);

cuts out wildcard “*” character fragment from the original string. We adopt this idea with somewhat unconventional notation where the beginning of the segment is denoted with square bracket and the end is parenthesis.

Figure 1: Semi-open segments. A sequence of characters at positions 0,1,2,3,4,5,6,7,8,9 is partitioned into segments [0,6), [6,7), [7,8), [8,9), and [9,...).

Beyond lexical analysis, we leverage semi-open segments to label nodes in the parse tree. However, parse tree segment positions are in token units, not character ones. For example, the parse node capturing the wildcard in the above example would be labeled with the interval [1,2). Technically, the wildcard is the third token produced by the scanner; however all whitespace tokens are ignored by parser.

In parser applications it is frequently required to convert from character units into token offsets, and vice versa; hence the LexerToken.scanner2parserOffset utility method.

Parser

Parser consumes list of tokens produced by lexer. It combines tokens according to grammar rules, and arranges them together with non-terminal grammar symbols into a parse tree. The exact method how is it accomplished constitutes rather involved computer science subject. It is the failure of the later to communicate parsing science to a programmer in the trenches, that caused proliferation of crippled technologies such as XML.

Parsing methods which are popular and practical allow only restricted types of grammar. On contrary, methods which allow arbitrary context free grammars are slow. In the first case, language developer has to master arcane hacking tricks allowing him to navigate in the murky water of rule restrictions. He has to either massage the grammar to the form that particular parser generator likes, or pollute the grammar with side effects, which carefully guide the parser engine through rule selection.

No matter how cumbersome legacy parsing technologies are, they enjoy advantage of nearly instantaneous parse time, so that for users spoiled by that level of performance slow parse time of any “superior” parse method would be unacceptable. Here we claim to solve this performance vs. ability dilemma and offer a parsing engine which is both reasonably fast, and generic in terms of allowed grammars.

You can use our parser without being required to master parsing technology. The steps are quite simple. First, you instantiate parser object, and feed grammar into it. For SQL&PL/SQL there is a cooked singleton already; you just get a handle to its instance as:

SqlEarley earley = SqlEarley.getInstance();

The name Earley refers to the parsing method that we use, but you can get away with not knowing any details. The parsing itself happens with the following calls:

List src = ...; // from lexical analysis
Matrix matrix = new Matrix(earley);
earley.parse(src, matrix);
ParseNode root = earley.forest(src, matrix);

Admittedly, there is a certain amount of boilerplate code, because the reader dones’t have to know anything about intermediate Matrix object. The parse output is a tree, which can be printed to the standard output with:

root.printTree();

For example, the input "select emp from empno;" produces:

  
  [0,4)   query_block  select  simple_set_expr  ...
    [0,2)   select_clause
      [0,1)   'SELECT'
      [1,2)   column  expr  identifier  select_list ...
    [2,4)   from_clause
      [2,3)   'FROM'
      [3,4)   cartesian_product  identifier  query_table_expression ...
  [4,5)   ';'

To programmatically explore and navigate parse tree you have to familiarize yourself with the ParseNode object. It contains all the necessary methods, such as

  • “List all node’s children”,
  • “Give me the parent”,
  • “List the node grammar payload (a set of grammar symbols)”,
  • “Does the node payload contains select_clause?”

Navigating parse trees is intuitive, but in future articles we’ll describe far more advanced method — Arbori Semantic Analysis.

Further reading

To learn more about Earley Parser Engine, visualization and performance optimizations of the algorithm, here is the paper.

Formula for Formatting

September 28, 2017

SQL Developer release 17.3 is out with formatter amendments warranting an explanation. The fundamentals didn’t change: the formatting mechanism is still based on parse tree analysis, which is implemented via Arbori — parse tree query language. New for 17.3 is more intuitive rule structure and better coordination of rules with formatting options. Still, rather than describing 17.3 implementation increments, I’d try presenting a self-contained description of Arbori formatting engine.

We start with indentation, arguably, the most important formatting feature. Consider the following exampleindents.png

The first indent (at level = 0) is applied to the list of attributes in the select clause and the list of tables in the from clause in the main query block. The second offset (at level = 1) is applied to the inner view in parenthesis. The third offset is applied to the select term and the table in the inner view.

How is this specified formally?  We have to examine parse tree:

parseTreeIndents.png

The parse tree nodes, which we want to indent, have grammar symbols (such as the select_list) encircled in green. Unfortunately, there is another, nested, select_list occurrence, marked in red, which labels the parse tree node that we don’t want to indent. We just have to refine our condition that we are after the parse tree nodes labeled with select_list whose parent payload is select_clause!

Formally, all indentation formatting conditions are gathered in the rule named simpleIndentConditions:

simpleIndentConditions.pngThe operations “|” and “&” are logical disjunction and conjunction, correspondingly, so that the entire simpleIndentCondition rule is a monumental DNF. The conjunctions are listed alphabetically; therefore, we find the condition of interest for the nodes labeled with the select_clause buried somewhere in the middle:

selectListConditions.png

The condition consists of the 3 parts:

  • Nodes, labeled with the select_list
  • Parent node (formally “node^“) labeled with select_clause
  • Boolean bind variable :breaksAfterSelectFromWhere

The value of this bind variable is carried over from the Advanced Format preference page:

advancedFormat.png

Bind variables provide a vehicle for this “wall of options” to be seamlessly integrated into the formatter. Flipping the Line Breaks->SELECT/FROM/WHERE option off would have set :breaksAfterSelectFromWhere to false, while flipping it back again would have set :breaksAfterSelectFromWhere = true.  Consequently, with the condition :breaksAfterSelectFromWhere = true, the conjunction would reduce to just

|  [node) select_list & [node^) select_clause

while :breaksAfterSelectFromWhere = false would have effectively eliminating this conjunction from the simpleIndentCondition rule.

Let’s do this experiment, flipping the Line Breaks->SELECT/FROM/WHERE option off. People who requested this option are usually not fond of formatting style with excessive breaks. While at it, let’s switch off breaks on subqueries as well. The result

alignmentsExample.png

warrants an explanation. The first snag is the failure to align the select terms “2” and “3” to the predecessor. Somehow a space has been lost… The other alignments look OK. However, the alignments bring us beyond simple indentation model that I have described earlier. Once again, :breaksAfterSelectFromWhere is set to false, so it have effectively nullified the  condition to indent the select term in the simpleIndentCondition rule. How did it format the select terms (and other items)?

There we have two new rules for line breaks and alignments. The rule for line breaks consists of the two parts, covering conditions when to insert line break before a parse node:

extraBrkBefore.png

and after. Here we see that if a user have chosen an option to have line breaks before the comma, then the line break will be inserted before a comma, which is followed by a select_term.  This is not the condition that inserted line breaks before the select terms 2, and 3 in our example. The following one did:

extraBrkAfter.png

Being able to control how to inserting line breaks solves half of the problem. We also need to specify alignments:

pairwiseAlignments.png

The alignment rule is little more sophisticated than what we have considered before. It involves two variables: which node we want to align, and the target parse tree node, named “predecessor” which we want to align it to. There are several ways the “predecessor” node is related to the aligned node, hence the alignment rule has been split in several parts. The first rule, called pairwiseAlignments1, defines the predecessor node as the younger sibling of the aligned node, formally: predecessor=node-1. This is exactly how from_clause is aligned to the select_clause. The second rule, called pairwiseAlignments2, defines the predecessor node as the younger sibling of the younger sibling of the aligned node, formally: predecessor=node-1-1. This is how individual select terms are aligned. Here is the parse tree to convince the reader how in our example the select term “2” is aligned with the select list “1”:

pairwiseAlignments3.png

Likewise, the select term “3” is aligned with the select list “1,2”:

pairwiseAlignments2.png

 

 

Arbori; the missing manuals

February 11, 2017

Number of lines of code is well known to be poor metric for virtually any purpose. One thing is indisputable though: the total volume of code is growing. Today, developers are frequently challenged with mining huge codebases. Therefore, querying parse trees is not just an academic exercise, but something of ever increasing practical value.

Blog post, though, is hardly a proper venue to introduce a new language. Here are two references to supplement it:

  1. Arbori Starter Manual by Eugene Perkov
  2. Little more detailed Semantic Analysis with Arbori

Formatting is one of the software development features that always steers controversy. To put it bluntly, everybody wants to format the code their own way. In previous SQL Dev versions the answer to that challenge was having elaborate tree-table widget spiced with several dozens of various options.

Arguably, the combined SQL and PL/SQL grammar is the most cumbersome programming language in existence, with total number of grammar productions exceeding 20000. From that perspective, the list of “Advanced Format” options offered by SQL Dev feels inadequate. To address this mismatch, the latest release introduced “Custom Format”. It offers to a developer means to specify formal conditions which SQL and PL/SQL grammatical constructs to indent,  align, pad, add extra line breaks and so on.

As usual, the best way to demonstrate a new feature is an example. Consider the following enhancement request by Stefan Poschenrieder to remove the indent of the NOT NULL ENABLE inline constraint in the following code

CREATE TABLE dim_aspect (
    dim_aspectid       NUMBER(11,0)
        NOT NULL ENABLE,
    source             VARCHAR2(50),
    hier1_desc         VARCHAR2(50),
    hierarchy1         VARCHAR2(50),
    dwh_inserteddate   DATE
);

The first step is identifying the target grammar symbol in the Code Outline panel. Here is screen snapshot illustrating it:
code outline.png
As illustrated there, positioning a cursor over the NOT keyword highlights the corresponding node in the parse tree. Therefore, the grammar symbol of interest is probably “inline_constraint“. Clicking onto the node labeled inline_constraint to double check this guess witnesses it highlighting “NOT NULL ENABLE” in the code editor. (Please be aware that clicking onto a hyperlinked looking grammar symbol has side effect of opening documentation page featuring that symbol railroad diagram definition).

As soon as the grammar symbol of interest is known, lets find it at Custom Format preferences panel. Admittedly, glaring omission there is the search widget. Yet, grammar symbols are arranged in lexicographic order, so that a reader would have no trouble spotting inline_constraint in the following formatting rule:

...
| [node) index_subpartition_clause[69,119)# 
| [node) inline_constraint & ![node) inline_constraint[14,67)
| [node) inline_ref_constraint
...

This fragment of rather convoluted disjunctive condition instructs the formatter to indent the nodes with payload “index_subpartition_clause[69,119)#“, or the nodes labeled with “inline_constraint” (but not those that also labeled as “inline_constraint[14,67)“), or the nodes labeled “inline_ref_constraint“.  Therefore, commenting out the middle conjunction

...
| [node) index_subpartition_clause[69,119)# 
--| [node) inline_constraint & ![node) inline_constraint[14,67)
| [node) inline_ref_constraint
...

is likely a solution to the problem. After making this change, a reader is advised to test what effect do modified formatting rules have on a SQL or PL/SQL code sample in the preview editor. This action is performed with the bottom-left “Run” button. The two neighbors “Import…” and “Export…” are for saving and retrieving your work. By default the modified “format.arbori” file is kept in the SQLDev product preferences directory.

Let’s reinforce this custom formatting idea with two more examples, both from Steven Feuerstein. In both cases we’ll learn some additional syntax of that formal formatting rules specification. The first example is just a bug witnessed by the following test case:

CREATE OR REPLACE PACKAGE BODY plch_pkg IS BEGIN
   FOR indx IN 1..100000 LOOP
      g_strings(    
         indx       --<--- ???
      ) := 'String ' || indx;
   END LOOP;
END;
/

The rule responsible for that unwanted indentation of the “indx” parameter is

| [node) pls_expr & [node-1) '(' & ![node) numeric_literal

It requires the parse node to be labeled with “pls_expr” but not “numeric_literal“. Also, it stipulates that the prior sibling node (formally, “node-1“) is labeled with open parenthesis. Adding the condition for the node successor not to be the closing parenthesis

| [node) pls_expr & [node-1) '(' & ![node) numeric_literal & ![node+1) ')' 

is all it takes to fix this snag.

In the second example

CREATE PACKAGE fmt AS
    PROCEDURE create_checklist (
        user_id_in       IN INTEGER,
        question_id_in   IN INTEGER
    );
    --<-- wanted extra line break 
    PROCEDURE remove_checklist (
        checklist_id_in   IN INTEGER
    );

    PROCEDURE remove_checklist (
        user_id_in       IN INTEGER,
        question_id_in   IN INTEGER
    );

END fmt;

the formatted code have only single line breaks after each package member. The formatting rule responsible for this functionality is “extraLines“. It formally a disjunction of the two rules “sql_stmts” and “significant_statements“. In the later rule we see the following conjunctive condition:

| [node) basic_decl_item & [node+20 < node)

The syntax of first part is familiar: it requires the node to be labeled as “basic_decl_item“.

The second part is little more cumbersome. The “[ node” and “node )” refer to the node beginning and ending position, correspondingly. Each node in the parse tree recognizes a sequence of tokens. In our example nodes labeled as “basic_decl_item” recognizes a sequence of tokens beginning with the PROCEDURE keyword and ending with semicolon. The beginning position is the offset of the PROCEDURE keyword, while the ending position is the offset of the semicolon. The condition  “[node+20 < node)” requires more than 20 tokens between the node beginning and ending position, or informally the package member to be of “sufficient length”.

Please note that package data members are recognized as “basic_decl_item” as well. Therefore, the proper fix depends on if a user wants to inset double line breaks after data members or not. The length condition was a naive attempt on distinguishing data and procedures, but more elaborate condition, distinguishing data and procedures is possible too.

Let’s walk trough yet another example demonstrating how flexible it is. Consider the following example adapted from ToadWorld ER:

Can we vertically align THEN values of CASE statement ? Current Format:

SELECT CASE
         WHEN a.col5 = b.col5 THEN 'COL5'
         WHEN a.col55555555555555 = b.col55 THEN 'COL55555'
         WHEN a.col555555 = b.col55555 THEN 'COL555555'
       END AS VAL
FROM   table1;

Desired Format:

SELECT CASE
         WHEN a.col5 = b.col5                THEN 'COL5'
         WHEN a.col55555555555555 = b.col55  THEN 'COL55555'
         WHEN a.col555555   = b.col5555      THEN 'COL555555'
       END AS VAL
FROM   table1;

For the reference, the default formatting in SQLDeveloper looks less than desirable:

SELECT
        CASE WHEN
            a.col5 = b.col5
        THEN
            'COL5'
        WHEN
            a.col55555555555555 = b.col55
        THEN
            'COL55555'
        WHEN
            a.col555555 = b.col55555
        THEN
            'COL555555'
        END
    AS val
FROM
    table1

However, to satisfy this request it requires exactly 4 changes of the formatting program.

First, before taking on alignments, we should fix unwanted indentation. How about not indenting conditions a.col5 = b.col5 and some such? By examining the parse tree, we notice that the code fragment a.col5 = b.col5 is formally a condition. Therefore, it must be that the line

| [node) condition  &  [node-1) 'WHEN'

of the formatting program that is responsible for the effect. Comment (or delete it), and rerun the formatter to confirm this hypothesis:

SELECT
        CASE WHEN a.col5 = b.col5 THEN
            'COL5'
        WHEN a.col55555555555555 = b.col55 THEN
            'COL55555'
        WHEN a.col555555 = b.col55555 THEN
            'COL555555'
        END
    AS val
FROM
    table1;

What about not indenting expressions, like string literal ‘COL5’? Just comment out the following rule:

--| [node) expr  &  [node-1) 'THEN'

The effect of this change, however, is somewhat discouraging:

SELECT
        CASE WHEN a.col5 = b.col5 THEN 'COL5' WHEN a.col55555555555555 = b.col5 THEN 'COL55555' WHEN a.col555555 = b.col5 THEN 'COL555555' WHEN a.col55555555
 = b.col5 THEN 'COL5555555' END
    AS val
FROM
    table1;

Here we need to look onto parse tree yet again to identify what grammar symbol would indent the entire WHEN-THEN fragments. The symbol in question is the  searched_case_expression#, so that the missing indentation rule alternative is just

| [node) searched_case_expression#

With this amendments the sample code formats to:

SELECT
        CASE
            WHEN a.col5 = b.col5 THEN 'COL5'
            WHEN a.col55555555555555 = b.col55 THEN 'COL55555'
            WHEN a.col555555 = b.col5555 THEN 'COL555555'
        END
    AS val
FROM
    table1;

Please note that even though there is a set options for CASE expressions (which, again, is a concession to traditional format design), flipping those options won’t produce format as clean as we just did.

So far we made 3 custom formatting program changes . Now we are ready to take on alignments. Let’s first investigate how current  formatting rules align the equalities,  because reverse engineering similar functionality is the easiest way to master new skills. Here is the code:

/**
 * All alignments (paddings) ...
**/
paddedIdsInScope: (
-- types
  [id) identifier & [id+1) datatype & [scope) relational_properties
| [id) decl_id & ([id+1) prm_spec_unconstrained_type | [id+1) 'IN' | [id+1) 'OUT' ) & [scope) fml_part
| [id) decl_id & ([id+1) constrained_type | [id+1) object_d_rhs) & [scope) adt_definition
| [id) decl_id & ([id+1) constrained_type | [id+1) object_d_rhs) & [scope) decl_list
-- =>
| [id) sim_expr & [id+1) '=' & [id+1+1) '>' & [scope) paren_expr_list
-- :=
| [id) name & [id+1) ':' & [id+1+1) '=' & [scope) seq_of_stmts
-- =
| [id) column & [id+1) '=' & [id+1+1) expr & [scope) where_clause
| [id) column & [id+1) '=' & [id+1+1) expr & [scope) on_using_condition
) & scope < id ->

The first highlighted condition specify that we are interested in columns, followed by equality, followed by some expression. They would be chosen together with the containing where_clause. Therefore, if we choose columns within the CASE scope

| [id) column & [id+1) '=' & [id+1+1) expr & [scope) case_expression

then it would align the equalities (but not THEN expressions):

SELECT
        CASE
            WHEN a.col5                = b.col5 THEN 'COL5'
            WHEN a.col55555555555555   = b.col55 THEN 'COL55555'
            WHEN a.col555555           = b.col55555 THEN 'COL555555'
        END
    AS val
FROM
    table1;

The condition we are after is a slight variation of thereof:

| [id) condition & [id+1) 'THEN' & [id+1+1) expr & [scope) case_expression

which produces:

SELECT
        CASE
            WHEN a.col5 = b.col5               THEN 'COL5'
            WHEN a.col55555555555555 = b.col55 THEN 'COL55555'
            WHEN a.col555555 = b.col55555      THEN 'COL555555'
        END
    AS val
FROM
    table1;

Querying Parse Trees

October 12, 2016

In previous blog entry we have glimpsed into new SQL Developer 4.2 feature — parse trees. Again, this is something that development tools don’t readily expose with the most likely reason being the lack of foresight what users can do with this information. Blunt answer proposed here is: “query it”.

In general, querying  is a formal and meaningful way to ask questions about something. In context of parsing the structure of the objects which we ask those questions about is well defined. In other words, the input for a query is a parse tree. Then, what is the structure of query answer?

There are at least two possibilities. The most natural is perhaps to have the output of the query to have the same structure as the input, that is the tree. However, I yet to see a compelling query language over trees, so perhaps choosing something more familiar to database developer – a relation (table) – would be a better choice?

A [query] language which takes an input in one form (a tree), but produces the results in another form (a relation/table) might seem counter intuitive, yet we’ll demonstrate that it is practical. The first objection might be that such query language, which input is a structure of one kind, but the output is something else is “not closed”. This is the matter of wrong appearance. Trees are binary relations; therefore, it is just that our proposed query language makes no effort to keep the query structures confined to this narrow kind of relations.

As we are going to talk in depth about this new query language, it would become awkward without naming it. So let’s give it a name — Arbori. (The name credit goes to Brian Jeffries from SQL Dev team). With this prelude, let’s give a taste of Arbori query.

Code quality metrics, or code guidelines can be viewed as queries or constraints over a code base. For example, Trivadis coding guideline #33 reads: “Always close locally opened cursors“. Reformulating it as a query, it would sound like this: “Find all cursors which are not closed in the enclosing procedure“.

When querying parse trees it is always insightful to focus on a concrete code sample and figure out what parse tree nodes we are after.

CREATE PACKAGE BODY "Trivadis code guideline #33" AS
--bad
    PROCEDURE not_close_cursor (
        out_count   OUT INTEGER
    ) AS
        CURSOR c1 IS
            SELECT COUNT(*) FROM all_users;

    BEGIN
        out_count := 0;
        OPEN c1;
        FETCH c1 INTO out_count;
    END not_close_cursor;

-- Good

    PROCEDURE close_cursor (
        out_count   OUT INTEGER
    ) AS
        CURSOR c1 IS
        CURSOR c1 IS
            SELECT COUNT(*) FROM all_users;

    BEGIN
        out_count := 0;
        OPEN c1;
        FETCH c1 INTO out_count;
        CLOSE c1;
    END close_cursor;

END;

The parse tree for this code is

QueryingParseTrees1.png

The cursor declaration is a node with payload “cursor_d” (1). It has two children, the node labeled with keyword “CURSOR”, followed by the identifier “C1”.

Next, we identify the parse node where the cursor is closed (2). It is a node labeled as “close_statement” and we have to make sure that its second child matches the name of the cursor that we have identified on previous step.

Finally, the cursor declaration and closing nodes that we just have identified have to be in the same scope, that is there has to be an ancestor node corresponding to the procedure that encloses both (3).

This is all the thought that is needed. The further development is just the formal translation of those 3 insights into formal Arbori query.

However,  a proposition of mastering new query language was not something easy to sell to SQL Dev feature decision makers. This is why Arbori query panel is hidden and could only be enabled by typing “arbori” in the search box. (This is true in version 17.2; it was more cumbersome in previous release). This action brings in Arbori dockable query panel. For our purposes it is convenient to organize all three panels side-by-side:

QueryingParseTrees3.png

If a reader feels intimidated by a perspective of learning yet another query language, then this perception is without merit. In fact, anybody proficient in SQL should have no problem learning it. In SQL terms, each Arbori query accesses a single input object – the parse tree, which allows to greatly simplify query syntax: there is no need for the FROM clause. Also the attributes restriction is implicit from the predicates (or is achieved by other means), so that there is no SELECT clause. The entire query is reduced to a single WHERE clause.

Unlike SQL, each query is named. Here we started writing a query appropriately named “leaking_cursors”. There is no substance in query itself yet (it is invalid Arbori syntax).

The major Arbori query components are unary predicates. Basically, we want to say that the node payload is such and such. Before we expressed our interest finding a nodes labeled “cursor_d”. This is how it is formally written in Arbori syntax:

QueryingParseTrees4.png

We need to introduce an attribute name — “cursor_decl” (1) — and specify that a parse node named “cursor_decl” is required to have a payload “cursor_d” (2). If this syntax feels alien to the reader, written in SQL this would look something like this:

select cursor_decl
from parse_tree
where payload(cursor_decl).includes('cursor_d')

Admittedly, coupling square and round parenthesis in Arbori might be too much for some readers. Please don’t hesitate to propose a better syntax!

Executing this simple query produces the list of cursor declarations (3). Changing list selection updates node references on the parse tree and, as I described in the previous article, tree node selection is consistent with selection in the original PL/SQL or SQL code editor. The numbers in the odd brackets/parenthesis [13,15) mean semi-open interval/segment of integers: at position 13 of the input PL/SQL sample code there is a token “CURSOR”, while at the next position 14 there is a token “C1”.

This is some progress towards desired query, so let’s find all the nodes labeled “subprg_body”. We know how to do that already, so we just add one more condition. More importantly, we need a condition which requires procedure node to be an ancestor of the cursor declaration node. This is formally specified as a binary ordering relation enclosed_proc < cursor_decl:

QueryingParseTrees5.png

Informally, ancestor date of birth precedes that of descendant. A reader familiar with concept of nested sets  would find this notation quite intuitive, for example the node [49,51) “cursor_decl” is nested inside the [41,80) “enclosed_proc” node.

The next step, is finding the nodes corresponding to cursor closures. This is easy:

[cursor_close_stmt) close_stmt 

We can also add a requirement for “cursor_close_stmt” to be nested inside “enclosed_proc”. However, we also need to express new requirement:

  • The nodes “cursor_decl” and “cursor_close_stmt” refer to the same cursor name

This warrants additional Arbori syntax, and slight tactical maneuver. First, let’s forget about “cursor_decl” and “cursor_close_stmt” and shift our attention onto their children. Specifically, we are interested in their children which convey the cursor name, that is the ones labeled “identifier” (among other things). Let’s call these modified attributes “cursor_decl_id” and “cursor_close_stmt_id”. Since we have conditions onto their parents, it is handy to just have a convenient way to refer to them. That Arbori syntax is a caret after an attribute name. For example

[cursor_close_stmt_id^) close_stmt 

postulates that a parent node of “cursor_close_stmt_id” is labeled “close_stmt”. There is one more additional syntax: specifically we want “cursor_decl_id” and “cursor_close_stmt_id” to refer to the same identifier:

?cursor_close_stmt_id = ?cursor_decl_id

Now we have all the ingredients to finalize our query. Realizing, however, that we found not quite yet what we were after, lets rename it into “well_behaved_cursors”:

QueryingParseTrees6.png

We are just one step away from finishing the task. One more query is wanted, named appropriately as “all_cursors”, which is the variant of “well_behaved_cursors” that ignores cursor closing statements. Then, we need to subtract “well_behaved_cursors” from “all_cursors”.

Apart from exotic syntax, little of what I have just described would raise brows of a reader with database background. Query gradually evolving for performing more and more complex tasks is something database developers and DBAs routinely do. Querying the code is quite easy!

Reshaping Parse Trees

September 22, 2016

Introduction

Parsing and internal parse structures — trees — are not something that development tools readily expose. It is assumed that most users are comfortable with numerous features that depend on parsing, such as formatting, code block collapsing/expanding, code advisors and so on, but aren’t really interested learning the formal code structure. In SQL Developer 4.2 we have attempted to encourage users to experiment with parse trees.

The location, where the new feature is introduced, is somewhat obscure — it is the former 4.2 “Outline” panel, which is was accessible via “Quick Outline” context menu item. Minor name change reflects the amended functionality

reshapingparsetrees1

Perhaps some explanation of what you see at the left panel is warranted. Oracle PL/SQL language is ADA derivative, which formal grammar can be found here. In the above example, the top most node labeled subprg_spec has been recognized as matching the first production for the rule

subprg_spec	: PROCEDURE_ IDENTIFIER .fml_part.
		| FUNCTION_  designator .fml_part. RETURN_ ty_mk
		;

Consequently, it has 3 children labeled PROCEDURE, TEST (which is identifier) and , fml_part.

For some obscure reason, there is no trace of this genuine formal grammar in oracle PL/SQL documentation, and “railroad” syntax diagrams are different. The good news, however, is that oracle documentation SQL grammar matches almost verbatim to the parse trees that SQL Developer exposes.

Playing with parse trees

Selecting a node on the tree highlights the recognized portion of the code in the editorreshapingparsetrees2

Some nodes, such as those labeled with fml_part in the example above, are collapsed. This is because parse trees for even moderately long code become overwhelming. Expanding tree branch manually is quite laborious; this is why there are 2 more effective ways to accomplish it.

The first alternative is to click “expand branch” context menu item
reshapingparsetrees3

Even though the root node is not collapsed, this context menu item is still there, and this action will expand all the collapsed descendant branches.

Selectively collapsing/expanding tree branches

The second alternative is little more sophisticated. Suppose, in our working example you would like to have all the nodes with payload fml_part expanded. Then, you click onto the second button on the toolbar (it is marked with plus symbol and “Collapse nodes with payload” tooltip).

reshapingparsetrees4

If you check the fml_part menu item, then all the nodes with this payload become expanded.

A sharp eyed reader might have noticed that, by default, select is expanded, while seq_of_stmts is collapsed. This is because code outline is working not only for PL/SQL editor but also SQL Worksheet.  In SQL worksheet queries — select statements — are the main laborers, this is why their structure is expanded. However, if you work mostly in PL/SQL editor, then those select statements are less interesting. For example, a select query which is a part of a cursor declaration can be collapsed, unless a user decides to investigate it. As for seq_of_stmts, again, PL/SQL package bodies are frequently very large, so having collapsed sequence of statements — the bulk of procedure or function — is handy.

Selectively hiding the nodes

Another way to bring a parse tree to manageable size is hiding the nodes. In general, if node disappears, something must be done about this node children. Naturally, orphans become attached to node’s parent. For example, deleting node B, reattaches grand children C and D to node A.

reshapingparsetrees5

Which nodes are not so much interesting? Consider a typical SQL query

reshapingparsetrees6

A chain of tree nodes marked with select_list faithfully reflects SQL grammar, but a user might want to see more concise picture.  The 3rd (“Hide nodes with payload”) button on toolbar serves this purpose

reshapingparsetrees7

Unchecking the select_list menu item (it is unchecked by default) reshapes the tree like this

reshapingparsetrees8

Conclusion

There is only so much that tree reshaping can do. For large trees it is laborious to search for interesting information. In the next installment we’ll describe an approach which is more satisfying for a reader with database background — querying.