A recurring theme of this blog is the extensibility of SQL Developer, and of Integrated Development Environments in general. Regular readers have probably already guessed the answer. But what is the problem?

Consider the Code Templates functionality, which stirred some confusion on the SQL Dev OTN Forum. What are code templates? They are functions that generate code: a function returning a fixed string literal, a function that generates the current date string, or a function that generates the current schema name. Those functions can be defined either on the server side in PL/SQL, or on the client tier in JavaScript. As you might have already guessed, here we'll promote an Arbori-based solution.

There are several facets to the problem:

  • Where do we keep those javascript functions?
  • How do we prompt those, which are relevant in the context?
  • How do we act according to the user’s choice?

We'll declare the functions in an Arbori file, like this:

include "std.arbori"

templateDefinitions: runOnce -> {
   function selectFromDual() {
       return 'select * from dual;'
   }
   function insertInto() {
       return 'insert into '
   }
   ...
}

Next, let’s get a handle to the editor window, and provide a function that will do the insertion in the editor panel:

   var panel = struct.get('outlinePanel');
   var editor = panel.getEditor().getFocusedEditorPane()
      
   function insertAtCursor( text ) {
      var doc = editor.getDocument()
      var pos = editor.getCaretPosition()
      doc.insertString(pos, text, null)
   }   

Finally, we’ll build the context menu that prompts those templates:

   var menu = new javax.swing.JPopupMenu()
   
   var item1 = new javax.swing.JMenuItem("<html><b>"+selectFromDual());
   var listener1 = new java.awt.event.ActionListener(
      function () {
         insertAtCursor(selectFromDual())
      }   
   );
   item1.addActionListener(listener1);
   menu.add(item1)
   
   var item2 = new javax.swing.JMenuItem(insertInto());   
   var listener2 = new java.awt.event.ActionListener(
      function () {
         insertAtCursor(insertInto())
      }   
   );
   item2.addActionListener(listener2);
   menu.add(item2)

   editor.setComponentPopupMenu(menu)

Here is the full code:

include "std.arbori"

codeTemplates: runOnce -> {
   function selectFromDual() {
       return 'select * from dual;'
   }
   function insertInto() {
       return 'insert into'
   }
   var panel = struct.get('outlinePanel');
   var editor = panel.getEditor().getFocusedEditorPane()
      
   function insertAtCursor( text ) {
      var doc = editor.getDocument()
      var pos = editor.getCaretPosition()
      doc.insertString(pos, text, null)
   }   
   
   var menu = new javax.swing.JPopupMenu()
   
   var item1 = new javax.swing.JMenuItem("<html><b>"+selectFromDual());
   var listener1 = new java.awt.event.ActionListener(
      function () {
         insertAtCursor(selectFromDual())
      }   
   );
   item1.addActionListener(listener1);
   menu.add(item1)
   
   var item2 = new javax.swing.JMenuItem(insertInto());   
   var listener2 = new java.awt.event.ActionListener(
      function () {
         insertAtCursor(insertInto())
      }   
   );
   item2.addActionListener(listener2);
   menu.add(item2)

   editor.setComponentPopupMenu(menu)
}

and the screenshot:

Certainly, this is a very basic implementation, which can be greatly improved. For example, the context menu could be aware of the text surrounding the cursor and filter the item list accordingly. However, to keep the article concise, I leave it to the reader's imagination in what other directions this idea can be extended.
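To make the filtering idea concrete, here is a minimal sketch in plain JavaScript: the word immediately before the cursor is matched against the template texts. The function and the matching rule are illustrative only, not SQL Developer API.

```javascript
// Hypothetical context filter: keep only the templates whose generated
// code starts with the word the user has typed just before the cursor.
function filterTemplates(textBeforeCursor, templates) {
    // The last run of word characters before the cursor; '' if none.
    var prefix = /(\w*)$/.exec(textBeforeCursor)[1].toLowerCase();
    return templates.filter(function (t) {
        return t.toLowerCase().indexOf(prefix) === 0;
    });
}
```

With the two templates above, filterTemplates('begin ins', ['select * from dual;', 'insert into ']) keeps only the insert template, while an empty prefix keeps both.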

Too many tabs

June 26, 2020

This feature is credited to Surya Gupta from Eugene Perkov's team. The Arbori execution results panel instantiates a tab with a grid for every named predicate, which quickly becomes overwhelming for a complex program. In version 20.2 this problem is a little less significant, because you can modularize your program leveraging the include directive. The included predicates are not displayed, and if you want them to be, you can always define their clones in the main module.

Here is the direct method to get rid of the unwanted tabs:

include "std.arbori"

Prelude: runOnce -> {
    /**
     * Arbori program with numerous auxiliary predicates often results
     * in overwhelming number of tabs at SQLDev Arbori execution panel
     */
    function removeTabsExcept( tabArray ) {
        var execOrder = program.querySequence();
        var copy = execOrder.clone();
        for( tabNo = execOrder.length-1; 0 <= tabNo; tabNo=tabNo-1 ) {
            var exclude = true;
            for( tIndex in tabArray ) {
                if( tabArray[tIndex] == execOrder[tabNo] )
                    exclude = false;
            }
            if( exclude ) {
                execOrder.remove(tabNo);
            }
        }
        var Thread = Java.type('java.lang.Thread');
        new Thread(function () {
             Thread.sleep(1000);
             execOrder.clear();
             execOrder.addAll(copy);
        }).start();
    }
}

UnwantedTab1: runOnce;

OKTab1: runOnce;

UnwantedTab3: runOnce;

OKTab4: runOnce;

Finale: runOnce -> {
    removeTabsExcept(['OKTab1','OKTab4']);
}

This program works since version 19.4. The function removeTabsExcept, which does all the work, should have been refactored into the standard library, but it is not there in 20.2 (it will be in 20.3).
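Stripped of the Java list and threading machinery, the filtering at the heart of removeTabsExcept can be sketched in plain JavaScript, with an array standing in for the querySequence list (a model of the logic, not the SQL Developer API):

```javascript
// Sketch of the tab-filtering logic: walk the execution order backwards
// and drop every predicate name that is not in the keep-list. Iterating
// from the end keeps the remaining indices valid after each removal.
function filterTabs(execOrder, keepList) {
    for (var tabNo = execOrder.length - 1; 0 <= tabNo; tabNo = tabNo - 1) {
        var exclude = true;
        for (var i = 0; i < keepList.length; i++) {
            if (keepList[i] == execOrder[tabNo])
                exclude = false;
        }
        if (exclude)
            execOrder.splice(tabNo, 1); // analogue of Java List.remove(int)
    }
    return execOrder;
}
```

Iterating backwards is the same trick the original function uses: removing an element never shifts the indices of the elements still to be visited.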

A more intuitive method to remove tabs would be obtaining a handle to the results panel and removing the child widgets. That handle will be available in version 20.3.

Arbori 20.2

June 19, 2020

SQL Developer release 20.2 features several Arbori amendments.

Include directive

Suppose you have defined Arbori predicates and JavaScript functions in some .arbori file; how do you reuse them? The include directive was there in previous versions, but it failed to work for user-defined Arbori libraries. In version 20.2 this has been remedied, with the following ways to include your library:

Absolute path
include "C:\Users\dim\Documents\Arbori\collapse non table nodes.arbori"

finale: runOnce -> {
   print('finished')
}

This snippet executes the collapse non table nodes.arbori program first, and then adds a log record to the Java console. The runOnce predicate is defined in the standard library std.arbori, which has been included into the collapse non table nodes.arbori program. What else is in this program is a separate topic, which we'll focus on later.

Relative path

The default SQL Developer directory is the one it was started from, which is something like

C:\sqldeveloper-20.2\sqldeveloper\bin

If you create a folder ArboriLib there, then you can reference its content like this:

include "ArboriLib/collapse non table nodes.arbori"

finale: runOnce -> {
   print('finished')
}
Java environment setting

Edit sqldeveloper.conf to define the dbtools.arbori.home Java system property like this:

AddVMOption -Ddbtools.arbori.home=/Users/dim/Documents/Arbori

Then, the include directive will look up files at paths relative to this directory.
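For illustration only, here is how the three lookup mechanisms above might compose, in plain JavaScript. The precedence (absolute path first, then dbtools.arbori.home, then the bin directory) is an assumption made for the sketch, not documented SQL Developer behavior:

```javascript
// Hypothetical include resolution: absolute paths are taken as-is;
// otherwise prefer dbtools.arbori.home when it is set, falling back
// to the directory SQL Developer was started from.
function resolveInclude(path, binDir, arboriHome) {
    var absolute = /^([A-Za-z]:[\\\/]|\/)/.test(path); // C:\... or /...
    if (absolute) return path;
    var base = arboriHome || binDir;
    return base.replace(/[\\\/]$/, '') + '/' + path;
}
```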

Arbori Editor

Arbori code is now a first-class citizen in SQL Developer. You can open .arbori files via the Open File menu item:

The editor has the expected functionality: syntax coloring, undo-redo, Ctrl-F, edit history tab, etc.

UI Widget Handles

Extending SQL Developer functionality via Nashorn JavaScript actions has become a recurrent theme. To add widgets, or to control the existing ones, however, one needs a handle to SQL Developer UI objects. In this section we'll walk through two examples of how to do it.

First, let's add a button to the Code Outline panel. Consider the following Arbori-Nashorn code:

include "std.arbori"

Prelude: runOnce -> {
   var panel = struct.get("outlinePanel");
   var button = new javax.swing.JButton("<html><b>My button!");
   var listener = new java.awt.event.ActionListener(
      function () {
         print('OK');
      }   
   );
   button.addActionListener(listener);
   panel.getToolbar().add(button);
   panel.getToolbar().revalidate();
}

At line #4 we obtain the handle to the Code Outline panel. Then we instantiate a button and add a listener to it. The final two steps are adding the button to the panel's toolbar and refreshing the GUI to render it.

The second example is a little more sophisticated. Displaying the full parse tree is problematic, because it has a leaf node for every lexical token. The total number of parse tree nodes can be estimated as the number of tokens multiplied by the average tree depth, which is roughly the logarithm of the number of tokens. Given that each line of the code editor usually fits several tokens, it is evident that the parse tree fails to provide a bird's-eye view of a large file. If this argument doesn't convince you, just compare the sizes of the scrollbars for a moderately sized input at the SQL Worksheet and at the Code Outline panel.
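To put rough numbers on this estimate, assume a 10,000-token file and take the average depth as log2 of the token count (illustrative figures only):

```javascript
// Back-of-the-envelope parse tree size: tokens times average depth,
// with the depth taken as log2 of the token count for illustration.
var tokens = 10000;
var depth = Math.log(tokens) / Math.LN2;   // about 13.3
var nodes = Math.round(tokens * depth);    // about 133000
```

So a 10,000-token script, which occupies a few thousand editor lines, unfolds into a tree on the order of a hundred thousand rows, which is why the outline scroller dwarfs the worksheet one.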

Can we collapse parse tree nodes selectively? The Code Outline toolbar features a built-in menu button to collapse the nodes with certain payloads. What if we want to collapse nodes by an ad hoc criterion? For example, given a SQL query with nested subqueries/inline views, can we have all the clauses collapsed?

Here is an Arbori program that does it:

include "std.arbori"

Prelude: runOnce -> {
   var panel = struct.get('outlinePanel');
   var tree = panel.tree;
   print('before: tree.getRowCount()='+tree.getRowCount());
   for( i = 1; i < tree.getRowCount(); i++ ) 
        tree.expandRow(i);    
}

"To collapse": [node) select_clause
        | [node) from_clause
        | [node) where_clause
        | [node) group_by_clause
        | [node) order_by_clause
-> {
    var node = tuple.get("node");
    var i;
    for( i = 1; i < tree.getRowCount(); i++ ) {
        var path = tree.getPathForRow(i);
        var n = path.getLastPathComponent().getUserObject();
        if( n.from == node.from && n.to == node.to  )
            tree.collapseRow(i);
    }
}

Finale: runOnce -> {
    print("after: tree.getRowCount()="+tree.getRowCount());    
}

There are three steps:

  1. Prelude, where we get a handle to the outline tree, log the number of rows, and expand all of them
  2. “To Collapse”, where we identify the nodes of interest, and collapse them
  3. Finale, where we log the size of the collapsed tree

Here is an unwieldy SQL query scrolling over dozens of pages, whose parse tree is collapsed to quite a succinct overview:

JavaScript Conditions

May 29, 2020

Arbori is a diminutive language, with a grammar fitting into a couple of pages. It turns out that it still features some redundancies. Consider the following input

SELECT Ord.SalesOrderID, Ord.OrderDate,
    (SELECT *
     FROM Sales.SalesOrderDetail 
     WHERE Ord.SalesOrderID = OrdDet.SalesOrderID) 
From Sales.SalesOrderHeader;

and the program

Matches: ?node = 'FROM'
;

What is the expected behavior of matching a node's source text: is it case sensitive? In particular, would there be a single match in the above example, or two?

I was never satisfied with the Arbori question mark syntax, and suggest a better alternative: a JavaScript condition. It is a function returning a boolean, declared in some Nashorn block, which can then be used as an Arbori condition. Let's define a string matching condition for the above example.

include "std.arbori"

Prelude: runOnce -> {
    function nodeContent(nodeVar) {
        var node = tuple.get(nodeVar);
        return target.input.substring(
            target.src[node.from].begin,
            target.src[node.to-1].end
        );
    }
    function stringMatch() {
        print(tuple)
        return 'FROM'== nodeContent('node')
    }
}

MatchesFrom: .stringMatch
      & [node < node) 
;

There we have defined two JavaScript functions: nodeContent, which cuts a text fragment from the input under a recognized parse node, and stringMatch, which uses it to perform the actual comparison. This is the second article where I invoke the nodeContent function, which hints that this function had better be included into the standard library std.arbori.
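To see what nodeContent computes, here is a self-contained mock in plain JavaScript. The target object below imitates the real one with an array of tokens carrying begin/end character offsets; the real function receives an attribute name and reads the node from the current tuple, while here the node is passed directly:

```javascript
// Mock of the nodeContent computation: cut out the input text spanned
// by a parse node, whose semi-open [from, to) interval indexes into
// the token array.
var target = {
    input: 'SELECT * FROM dual',
    // tokens:  SELECT            *                 FROM              dual
    src: [ {begin: 0, end: 6}, {begin: 7, end: 8},
           {begin: 9, end: 13}, {begin: 14, end: 18} ]
};

function nodeContent(node) {
    return target.input.substring(
        target.src[node.from].begin,
        target.src[node.to - 1].end
    );
}
```

Here nodeContent({from: 2, to: 3}) yields 'FROM', and nodeContent({from: 0, to: 2}) yields 'SELECT *'.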

The purpose of the Prelude predicate is just the declaration of stringMatch. This function, returning a boolean, can be used as a condition in the following predicate, MatchesFrom. It is tempting to define this predicate simply as

MatchesFrom: .stringMatch
;

but this wouldn't work. This query would have no attributes, which is not a problem in itself. Unlike SQL, Arbori admits relations with an empty set of attributes. However, even if this query somehow succeeded, it would be impossible to extract any attribute from the tuples, because there aren't any! The trick here is just adding an attribute named node. Please also note that the JavaScript implementation of stringMatch actually extracts the value of the attribute node from every tuple. If there were no such attribute in the MatchesFrom Arbori query, it would raise a runtime error!

In Arbori there is no analog of the Relational Algebra project operator, AKA SQL select. The query attributes are inferred implicitly from the conditions. Therefore, we could amend the query with a condition like

[node) identifier

but this condition would filter out all the keywords. Hence the bogus condition

 [node < node) 

which asserts that the beginning position of the node interval is less than the node’s end. Let’s execute the program:

This screenshot highlights two points. First, witness only one match in the input. Second, the print statement inside the stringMatch implementation is useful for debugging purposes. The Java console panel on the left demonstrates that the stringMatch condition has been evaluated at every parse tree node.

A contemporary IDE's strength is defined not only by advanced and flawless functionality, but also by its openness. How open is SQL Developer? Yes, you can write extensions in some cumbersome XML syntax, or even plug in custom Java classes, but you can do better via Arbori-Nashorn scripting. To illustrate this point I'm planning to write a series of essays demonstrating how to solve problems. I'll start with a problem whose solution works in SQL Developer 19.4. The later essays will focus on more impressive customizations supported by the soon to be released 20.2.

Suppose we are given a PL/SQL anonymous block with bind variables

begin
:ret := plus(:arg1,:arg2);
end;

how can we execute it? Certainly, this task looks trivial: click Run on the plus stored procedure in the navigator tree. Well, this is not exactly a solution, because you wouldn't be executing the given PL/SQL block; you would be running a generated wrapper block. And if you try to run this block from the SQL Worksheet, then you have to figure out how to supply and retrieve the bind variables.

You may argue that there is a print command in SQL*Plus, so that you can channel the function return value through it. Yes, that is a solution of sorts, but it is not entirely satisfactory for several reasons. First, SQL*Plus is just one RDBMS client among many, with fading ubiquity over time. Count the number of questions for "jdbc" vs. "sqlplus" on Stack Overflow, if you are not convinced. Second, "run via sqlplus" gives you very little flexibility in how you can run this anonymous PL/SQL block, or how you would scale up your solution in more challenging situations.

Let's run it in Arbori. The first step is referring to the standard library, where you'll find many useful predicates and JavaScript functions:

include "std.arbori"

For example, it defines the predicate runOnce, such that it has only one tuple. Consequently, the Nashorn block that follows the next Arbori rule, prelude, is run once, thereby performing initialization:

prelude: runOnce -> {
    var binds = [];
    function nodeContent(nodeVar) {
        var node = tuple.get(nodeVar);
        return target.input.substring(
            target.src[node.from].begin,
            target.src[node.to-1].end
        );
    }
}

There we allocate an array that will collect all the bind variables found by an Arbori query, and define a utility function that cuts out the fragment of the text recognized under a parse tree node.

In the next section we run an Arbori query that searches for all bind variables and collects them in the previously defined binds array:

binds: [node) bind_var -> {
    binds.push(nodeContent('node'));
}

As usual, to know which grammar symbol to look for (bind_var in this example), you need to examine the parse tree:

After execution, it is instructive to iterate through the result set to confirm that the binds query indeed found all the relevant parse tree nodes.
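The effect of this query on the sample block can be imitated in plain JavaScript. The sketch below scans a flat token list for anything shaped like a bind variable; this is a simplification, since Arbori matches parse tree nodes labeled bind_var rather than raw token text:

```javascript
// Simplified stand-in for the binds query: collect ':name' tokens.
function collectBinds(tokens) {
    var binds = [];
    for (var i = 0; i < tokens.length; i++) {
        if (/^:\w+$/.test(tokens[i]))
            binds.push(tokens[i]);
    }
    return binds;
}
```

On the tokens of the sample block this yields [':ret', ':arg1', ':arg2']; note that the assignment operator ':=' is correctly skipped.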

Now that we know all the bind variables, we are ready to execute the anonymous PL/SQL block:

run: [input) block_stmt  -> {
    var  sql = nodeContent('input');    
    print('executing: '+sql);
    
    var ConnectionResolver = Java.type('oracle.dbtools.db.ConnectionResolver');  
    var connectionList = ConnectionResolver.getConnectionNames();
    var conn = ConnectionResolver.getConnection('IdeConnections%23local_hr');
    var cs = conn.prepareCall(sql);
    cs.registerOutParameter(1,java.sql.Types.NUMERIC);
  
    var JOptionPane = Java.type('javax.swing.JOptionPane');  
    for( i = 1; i < binds.length; i++ ) {
        var bindVar=JOptionPane.showInputDialog(binds[i],0);
        var intBind = java.lang.Integer.parseInt(bindVar);
        print(binds[i]+'='+intBind)
        cs.setString(i+1, bindVar);    
    }
    cs.executeUpdate();
    JOptionPane.showConfirmDialog(null,'result='+cs.getInt(1));
    cs.close();
}

Before walking through the last code section, let's witness it working. In the screenshot below the user is prompted for the values of the input variables:

The result will also be shown in a dialog, but numerous other options to present it are certainly possible.

Here is the line-by-line explanation of the run query. At line #1 there is a basic Arbori query searching for the parse tree nodes labeled block_stmt. This is where the entire begin...end text fragment is recognized, with the Nashorn action fired at every found tuple. The Nashorn action starting at line #2 constitutes the bulk of the code. There we extract the PL/SQL block as text, and print it to the console at the next line for debugging purposes.

At lines #5-7 we leverage the SQL Developer ConnectionResolver, so that we can get a connection from the list of connections you already have. Obtaining a connection via the standard JDBC API is also possible, although a little more verbose, with hard-coded connection credentials.

At lines #8-9 we prepare the call and register the output bind variable. Next comes the loop at lines #12-17, where we iterate through the array of bind variables, ask for their values as captured in the previous screenshot, and set them in the prepared statement.

After setting all the input bind variables, we are ready to execute the statement at line #18, and fire the dialog at the next line showing the function output.

To many readers this may look like a cumbersome reinvention of the wheel. What is the point of reimplementing existing functionality? Am I implying that the user has to write a couple of pages of Arbori code in order to do some basic task? Not necessarily, as you may perfectly well reuse code written by somebody else. The .arbori file extension will be recognized in version 20.2, so all you need is a folder/library of already written Arbori programs, and you open them just as naturally as SQL scripts.

Arbori Semantic Actions

August 1, 2019

Arbori is a tidy query language with relatively limited expressiveness. This constrains its applications, such as the SQL Formatter. Arithmetic in Arbori is virtually non-existent; is it possible, then, for example, to break a procedure/function argument list after every 5th argument?

One approach would have been extending the Arbori language with extra capabilities. This direction, however, is less than desirable, because requiring users to learn a new language, even one with concise syntax, is a challenging proposition by itself. A new language with extensive, cumbersome syntax is a nonstarter.

Alternatively, one can piggyback onto some popular general-purpose programming language, such as JavaScript. This is the direction followed by SQL Developer version 19.2. The "Output queries" section at page 21 of the Arbori reference guide describes how this newly introduced JS action is attached to an Arbori query.
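To see what this buys, recall the question from the opening paragraph: breaking an argument list after every 5th argument. The string manipulation itself is a few lines of JavaScript (a sketch of the output shape only; wiring it into the formatter is a separate exercise):

```javascript
// Break a comma-separated argument list into lines of groupSize arguments.
function breakArgs(args, groupSize) {
    var lines = [];
    for (var i = 0; i < args.length; i += groupSize)
        lines.push(args.slice(i, i + groupSize).join(', '));
    return lines.join(',\n');
}
```

For seven arguments and a group size of 5 this produces two lines: five arguments on the first, two on the second.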

In this post we'll provide one more example: custom syntax coloring of the TO_DATE built-in SQL function called with a wrong number of arguments. Applying the Arbori query from the forum thread takes 5 steps:

1. Append the Arbori query

WrongToDateNumOfArgs:
( [node^) function_call
| [node^) datetime_literal
| [node^) function_expression
) & ?node = 'TO_DATE'
& ([arg) expr
| [arg) string_literal
| [arg) pls_expr )
&  arg^-1 = node
& [arg+1) ')'
& [arg-1) '('
;

at the bottom of the editor pane at the Preferences -> Code Editor -> PL/SQL Syntax Colors -> PL/SQL Custom Syntax Rules page

2. Add JS semantic action:

WrongToDateNumOfArgs:
( [node^) function_call
| [node^) datetime_literal
| [node^) function_expression
) & ?node = 'TO_DATE'
& ([arg) expr
| [arg) string_literal
| [arg) pls_expr )
&  arg^-1 = node
& [arg+1) ')'
& [arg-1) '('
->{
   var node = tuple.get("node");
   struct.addStyle(target, node, "WrongToDateNumOfArgs");
};

3. Confirm your edit with OK, restart SQLDev

4. At the parent preference panel Preferences -> Code Editor -> PL/SQL Syntax Colors you should be able to see the newly introduced style:

5. Witness it working:

Custom Syntax Coloring

October 10, 2018

SQL Developer 18.3 features one more application of Semantic Analysis, based upon the Arbori parse tree query language. It provides a much easier avenue to mastering Arbori, compared to formatting (which is ubiquitous, but grew to almost 800 conditions aggregated into more than 30 rules).

First, there are 3 additional PL/SQL color styles, PlsqlCustom1-3:

styles.png

There I have changed the default gray color for PlsqlCustom1 to a more eye-catching pinkish highlight. This color is appropriate for what we'll do in the very first exercise: highlighting all the EXCEPTION keywords in PL/SQL code. The template is already there; all you need to do is activate the commented PlsqlCustom1 rule:

custom.png

Let’s examine the rule in detail:

PlSqlCustom1:          
   [node) 'EXCEPTION'  
->                  
; 

The rule header, PlSqlCustom1, matches the name of the syntax style from the parent preference page, Code Editor: PL/SQL Syntax Colors.

Consequently, PL/SQL code fragments identified by this rule are highlighted according to the chosen style.

The colon separates the rule header from the rule body. In general, a rule body specifies some logical criteria for PL/SQL code selection. The code selection method is parse driven: as soon as some nodes of interest are identified in the parse tree, the text covered by those nodes is subject to the rule's action. In our example, the condition is

[node) 'EXCEPTION'

and it chooses all the nodes in the parse tree whose payload is the EXCEPTION keyword. The "node" is the rule's attribute name, and the rule action expects a variable with that name verbatim. The rule attribute names are wrapped with an opening bracket on the left and a closing parenthesis on the right. This syntax is warranted by the convention of identifying parse tree nodes with semi-open intervals specifying the text fragment boundaries.

The arrow at the next line indicates that this rule triggers some action, namely the custom syntax highlighting. This is opposed to auxiliary rules, which can be leveraged in other rules in a spirit similar to how views and CTEs are used in SQL.

The final semicolon demarcates the end of the rule.
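The bracket-and-parenthesis convention mirrors how the nodes themselves are represented: semi-open intervals over token positions. A tiny plain JavaScript illustration with mock {from, to} objects shows how natural containment and adjacency checks become under this convention:

```javascript
// Parse nodes modeled as semi-open token intervals [from, to).
function contains(outer, inner) {   // outer covers inner
    return outer.from <= inner.from && inner.to <= outer.to;
}
function precedes(a, b) {           // a ends exactly where b starts
    return a.to === b.from;
}
```

For a clause spanning tokens [0, 4) and its keyword at [0, 1), contains returns true; two adjacent tokens [0, 1) and [1, 2) satisfy precedes without any off-by-one adjustment.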

The second rule – PlSqlCustom2 – is enabled by default, and its logic is documented at the help page.

Let's try something a little more ambitious: highlighting all the variables which are not defined. We can either add the PlSqlCustom3 rule, or start from scratch and repurpose PlSqlCustom1. Here is the first attempt:

PlSqlCustom1: 
[node) identifier 
& ![node)decl_id 
& ![node-1) '.'
& ![node+1) '.'
& ![node^) pkg_body
& ![node^) subprg_spec
& ![node^) subprg_body
& ![node^) procedure_call
->
;

identifiers.png

Without the negative conditions many more identifiers would have been highlighted, as indicated by the red arrows. How would you know which grammar symbol to target? This can only be accomplished by examining the parse tree. For example, the p_param1_todo declaration would have been highlighted because it is formally an identifier. However, it is also a decl_id. The reader is encouraged to try commenting out some of those negative conditions to see the effect in the PL/SQL code sample panel.

Why has raise been highlighted? It is certainly not an identifier. This is a bug: the combined SQL and PL/SQL grammar is highly ambiguous, and the parser recognized the statement as a function call.

The second step is highlighting all variable declarations. We still want to keep the previous rule for later; this is why it is renamed and its semantic action is removed:

decl.png

The final step is excluding those declarations from all the identifier entries. However, it is a little more subtle than that: we can't formally apply set difference, because the identifiers and their declarations are always different nodes, and we want to match them by name. For that purpose, we rename the node attribute to decl and introduce the two-attribute predicate matches(node, decl). Then, the final query is a set difference of all the ids minus matches (subtracting relations of different signatures is legitimate in Arbori):

final.png
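The set difference in that final query can be modeled in plain JavaScript, with arrays of mock tuples standing in for the Arbori relations (an analogy, not the Arbori evaluator): compute the name matches, then keep only the ids without one.

```javascript
// ids minus matches: keep the identifier tuples whose name has no
// matching declaration tuple.
function undeclared(ids, decls) {
    return ids.filter(function (id) {
        return !decls.some(function (d) { return d.name === id.name; });
    });
}
```

Given identifiers l_cnt and p_param, and a declaration of p_param only, the function reports l_cnt as undeclared.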

A little more refined version of the Arbori query for undefined variables:

ids:    
     [node) identifier
   & ([node) name | [node) arg) 
   & ![node+1) '.'
   & ![node^) procedure_call
   & (![node+1) '=' | ![node+2) '>')
   & ! ?node='true'
   & ! ?node='false'
   & ! ?node='sqlerrm'
; 

decls:     
     [decl) decl_id 
    & ([decl^) basic_d | [decl^) prm_spec | [decl^) subprg_spec)
|  [decl) identifier & [decl^) cursor_d 
; 

matches: ?ids.node = ?decls.decl
; 

PlSqlCustom1: ids - matches
->
;

This solution is only approximate, because no effort has been made to identify the variable scope. The ancestor-descendant relationship between parse nodes is expressible in Arbori, and you can find numerous examples of its usage at the Custom Format preference page. However, unlike formatting, syntax highlighting is triggered by every user keystroke, and Arbori queries involving ancestor-descendant conditions are slightly less efficient than a simple traversal of the parse tree; use them with caution.

Version 19.2 amendments

Since version 19.2 you are no longer limited to 3 rules and the fixed rule naming scheme.

Earley Parser Engine

April 13, 2018

Parsing is a mature technology. The field flourished with ideas in the 60s. At that time computer power was quite limited, which is why the practical emphasis was on efficient parsing methods. LL(1) and LALR quickly became standard, both having linear parsing time complexity. Still, many grammars, such as that of C, escape the relatively narrow class of grammars for which there is a standard linear time parser. The crux of the problem is the ambiguity of certain language constructs, such as if-then-else. The SQL and PL/SQL grammars are even more ambiguous, due to the fact that keywords are allowed as identifiers. There are various hacks, which are usually presented as satisfactory solutions from the end-user perspective, but our general philosophy is that manually transforming the grammar into LL(1) or LALR form is inferior to leveraging a more powerful (but slower) parsing method.

Parser engines already evolve in this direction. The popular bison parser has been extended to implement the GLR method, which handles arbitrary context-free grammars. The Microsoft OSLO framework for building Domain Specific Languages (DSL) is based upon the GLR parsing method as well. Likewise, the Antlr parser engine evolved to accommodate more general grammars. In version 4 it leverages LL(*) parsing, which admits some grammar ambiguity. Yet LL(*) parsing is weaker than GLR, and in our opinion a parsing engine design must leverage an algorithm powerful enough to handle arbitrary context-free grammars, rather than piggyback on a weaker method.

Both methods, GLR and LL(*), extend algorithms which initially had linear parsing time. Presumably, this won't significantly affect performance in practical cases where there is little ambiguity. In the worst case, however, such an approach can result in exponential time.

There are at least two more methods that can handle arbitrary context-free grammars: CYK and Earley. The CYK parsing algorithm was invented as a proof that an arbitrary CFG can be parsed in time cubic in the size of the input. It is a remarkably simple idea, as the core of the algorithm essentially contains only 4 nested loops. There is no automata theory, no states, no backtracking, no code generation. The method provides a clean prescription for how to handle parse failure.
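To make the four-nested-loops claim tangible, here is a minimal CYK recognizer in plain JavaScript for a grammar already in Chomsky normal form (a toy sketch to show the shape of the algorithm, unrelated to the SQL Developer parser classes):

```javascript
// Minimal CYK recognizer for a grammar in Chomsky normal form.
// unary:  nonterminal -> list of terminals      (A -> 'a')
// binary: nonterminal -> list of [B, C] pairs   (A -> B C)
function cykRecognize(tokens, unary, binary, start) {
    var n = tokens.length, table = [], i, len, k, nt, r;
    // table[i][len] = set of nonterminals deriving tokens[i .. i+len)
    for (i = 0; i < n; i++) {
        table[i] = [];
        for (len = 0; len <= n; len++) table[i][len] = {};
    }
    // Base case: which nonterminals derive each single token?
    for (i = 0; i < n; i++)
        for (nt in unary)
            if (unary[nt].indexOf(tokens[i]) >= 0) table[i][1][nt] = true;
    // The nested loops: span length, span start, split point, rules.
    for (len = 2; len <= n; len++)
        for (i = 0; i + len <= n; i++)
            for (k = 1; k < len; k++)
                for (nt in binary)
                    for (r = 0; r < binary[nt].length; r++) {
                        var B = binary[nt][r][0], C = binary[nt][r][1];
                        if (table[i][k][B] && table[i + k][len - k][C])
                            table[i][len][nt] = true;
                    }
    return !!table[0][n][start];
}
```

For the language of a's followed by the same number of b's, take the CNF grammar with unary rules A -> 'a', B -> 'b' and binary rules S -> A B | A T, T -> S B; then the recognizer accepts 'aabb' and rejects 'aab'.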

The Earley algorithm shares many features with CYK; in fact, "Principles and Implementation of Deductive Parsing" by Shieber, Schabes, and Pereira describes a unified framework for both CYK and Earley. Consequently, we were able to implement Earley in the spirit of the CYK method, which allowed (among other things) meaningful performance comparisons between the two.

In this article, however, we won't dig into the algorithm implementation details. For all practical purposes it is a black box that can be invoked via Nashorn.

Let's start with the basics. Given input SQL code, how do I get the parse tree? Here is the Arbori program, which we'll walk through line by line:

include "std.arbori"

prelude: runOnce -> {
    var input = target.input;

    var LexerToken = Java.type('oracle.dbtools.parser.LexerToken');  
    var tokens  =  LexerToken.parse(input);
    LexerToken.print(tokens);
    
    var SqlEarley = Java.type('oracle.dbtools.parser.plsql.SqlEarley');  
    var earley  =  SqlEarley.getInstance();
    var Matrix = Java.type('oracle.dbtools.parser.Matrix');  
    var matrix = new Matrix(earley);
    
    earley.parse(tokens, matrix);
    var rootNode = earley.forest(tokens, matrix);
    rootNode.printTree();
}

First, we extract the input from the active editor:

var input = target.input;

var LexerToken = Java.type('oracle.dbtools.parser.LexerToken');  
...

Alternatively, we can read it from a file, but please keep in mind that the Arbori program runs against the active editor, so reading the input from some other place just adds to the confusion.

Next, we perform Lexical analysis:

var input = target.input;

var LexerToken = Java.type('oracle.dbtools.parser.LexerToken');  
var tokens  =  LexerToken.parse(input);
LexerToken.print(tokens);

var SqlEarley = Java.type('oracle.dbtools.parser.plsql.SqlEarley');  
...

A list of tokens is fed to the parser, but we have to instantiate the parser object first:

...
LexerToken.print(tokens);

var SqlEarley = Java.type('oracle.dbtools.parser.plsql.SqlEarley');  
var earley  =  SqlEarley.getInstance();
var Matrix = Java.type('oracle.dbtools.parser.Matrix');  
var matrix = new Matrix(earley);
...

which is followed by the initialization of the Earley matrix:

...

var SqlEarley = Java.type('oracle.dbtools.parser.plsql.SqlEarley');  
var earley  =  SqlEarley.getInstance();
var Matrix = Java.type('oracle.dbtools.parser.Matrix');  
var matrix = new Matrix(earley);

earley.parse(tokens, matrix);
var rootNode = earley.forest(tokens, matrix);
rootNode.printTree();

Earley parsing is actually recognition (which combines scan, predict, and complete into a single method):

...

var Matrix = Java.type('oracle.dbtools.parser.Matrix');  
var matrix = new Matrix(earley);

earley.parse(tokens, matrix);
var rootNode = earley.forest(tokens, matrix);
rootNode.printTree();

Finally, we can build and print the parse tree:

...
var Matrix = Java.type('oracle.dbtools.parser.Matrix');  
var matrix = new Matrix(earley);

earley.parse(tokens, matrix);
var rootNode = earley.forest(tokens, matrix);
rootNode.printTree();

The print statement should output the parse tree to the Java console like this:

[0,4)   query_block  select  simple_set_expr  subquery
  [0,2)   select_clause
    [0,1)   'SELECT'
    [1,2)   '*'  select_list
  [2,4)   from_clause
    [2,3)   'FROM'
    [3,4)   cartesian_product  identifier  query_table_expression 

If you feel that a more detailed explanation is warranted, please bear with me a little longer.

Introduction

Parsing, or, more formally, syntactic analysis, is a ubiquitous document-processing stage. It takes a seemingly disorganized string of characters and produces distinctly structured output: a tree. Ordinary documents feature multiple levels of hierarchy: characters, words, sentences, paragraphs, chapters, volumes…

From the parsing perspective only the two bottom levels are significant. Document processing at the character level is called lexical analysis. The lexical analyzer, often called a scanner, outputs a stream of words better termed tokens. At the level above lexical analysis, the stream of tokens is digested by a parser, which organizes them into a parse tree. These two processing phases differ greatly in complexity, yet here our concern is usage, which in both cases reduces to an easy Java API call.

Scanner

Most programming languages share basically the same lexical structure. The conventional wisdom is that inventing a specialized, exotic lexical form for a new language has dubious value. We will, therefore, adopt the “consolidated” C, C++, Java, SQL, and PL/SQL lexical structure, which consists of white spaces, digits, single- and double-quoted strings, single- and multi-line comments, and operation symbols. These token types are enumerated in Token.java.

Again, the scanner is run with a single call:

   var out = LexerToken.parse(input);

For example, given an input string

     var input = "select emp from empno";

the scanner outputs the following list of LexerTokens:

0 [0,6) select IDENTIFIER
1 [7,10) emp IDENTIFIER
2 [11,15) from IDENTIFIER
3 [16,21) empno IDENTIFIER

Here, the first column contains the token’s position in the list.

The semi-open segment in the second column delineates the token’s beginning and end character positions, that is, its location within the input string. This notation is explained in more depth in the following section, Semi-Open Segments.

The third column lists the token content.

Finally, in the last column we find the token type.

Technically, this printout has been generated with the Java call

LexerToken.print(out);
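To make the printout format concrete, here is a minimal, self-contained sketch of a scanner in the same spirit. This is illustrative JavaScript, not Oracle’s LexerToken, and the token types are reduced to just identifiers and operation symbols:

```javascript
// Toy scanner: splits input into IDENTIFIER and OPERATION tokens,
// recording each token's semi-open [begin,end) character segment.
function tokenize(input) {
    var tokens = [];
    var re = /[A-Za-z_][A-Za-z0-9_$#]*|\S/g;
    var m;
    while ((m = re.exec(input)) !== null) {
        tokens.push({
            begin: m.index,
            end: m.index + m[0].length,
            content: m[0],
            type: /^[A-Za-z_]/.test(m[0]) ? 'IDENTIFIER' : 'OPERATION'
        });
    }
    return tokens;
}

// Print in the same "pos [begin,end) content TYPE" shape as LexerToken.print
function printTokens(tokens) {
    tokens.forEach(function (t, i) {
        console.log(i + ' [' + t.begin + ',' + t.end + ') ' + t.content + ' ' + t.type);
    });
}

printTokens(tokenize('select emp from empno'));
// Prints:
// 0 [0,6) select IDENTIFIER
// 1 [7,10) emp IDENTIFIER
// 2 [11,15) from IDENTIFIER
// 3 [16,21) empno IDENTIFIER
```

The whitespace is simply skipped by the regular expression, which is why the segments are not contiguous.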

Semi-Open Segments

Partitioning linearly ordered elements into semi-open segments is a ubiquitous software practice. Java text-processing utilities specify intervals of characters with the beginning position included and the end position excluded. For example,

"select * from emp".substring(7,8);

cuts out the wildcard “*” character from the original string. We adopt this idea with a somewhat unconventional notation where the beginning of the segment is denoted with a square bracket and the end with a parenthesis.

Figure 1: Semi-open segments. A sequence of characters at positions 0,1,2,3,4,5,6,7,8,9 is partitioned into segments [0,6), [6,7), [7,8), [8,9), and [9,...).

Beyond lexical analysis, we leverage semi-open segments to label nodes in the parse tree. However, parse tree segment positions are in token units, not character ones. For example, the parse node capturing the wildcard in the above example would be labeled with the interval [1,2). Technically, the wildcard is the third token produced by the scanner; however, all whitespace tokens are ignored by the parser.

In parser applications it is frequently required to convert from character units into token offsets, and vice versa; hence the LexerToken.scanner2parserOffset utility method.
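The conversion itself is easy to picture. Here is a hypothetical stand-in (the real signature of scanner2parserOffset may differ) that, given tokens carrying [begin,end) character segments, finds the token covering a character position:

```javascript
// Given tokens with semi-open [begin,end) character segments (whitespace
// already dropped), find the index of the token covering charPos.
// Illustrative stand-in for LexerToken.scanner2parserOffset.
function charToTokenOffset(tokens, charPos) {
    for (var i = 0; i < tokens.length; i++) {
        if (tokens[i].begin <= charPos && charPos < tokens[i].end) {
            return i;
        }
    }
    return -1; // charPos falls on whitespace or outside the input
}

// Tokens of "select * from emp" with whitespace dropped:
var tokens = [
    { begin: 0,  end: 6,  content: 'select' },
    { begin: 7,  end: 8,  content: '*'      },
    { begin: 9,  end: 13, content: 'from'   },
    { begin: 14, end: 17, content: 'emp'    }
];

console.log(charToTokenOffset(tokens, 7)); // → 1: the wildcard is parse segment [1,2)
```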

Parser

The parser consumes the list of tokens produced by the lexer. It combines tokens according to grammar rules and arranges them, together with non-terminal grammar symbols, into a parse tree. Exactly how this is accomplished is a rather involved computer-science subject. It is the failure of the latter to communicate parsing science to the programmer in the trenches that caused the proliferation of crippled language-processing technologies such as XML.

Parsing methods that are popular and practical allow only restricted types of grammar. Conversely, methods that allow arbitrary context-free grammars are slow. In the first case, the language developer has to master arcane hacking tricks to navigate the murky waters of rule restrictions: he has to either massage the grammar into the form a particular parser generator likes, or pollute it with side effects that carefully guide the parser engine through rule selection.

No matter how cumbersome legacy parsing technologies are, they enjoy the advantage of nearly instantaneous parse time, so for users spoiled by that level of performance, the slow parse time of any “superior” method would be unacceptable. Here we claim to solve this performance-versus-generality dilemma and offer a parsing engine that is both reasonably fast and generic in terms of allowed grammars.

You can use our parser without being required to master parsing technology. The steps are quite simple. First, you instantiate a parser object and feed the grammar into it. For SQL and PL/SQL there is a ready-made singleton; you just get a handle to its instance:

var earley = SqlEarley.getInstance();

The name Earley refers to the parsing method that we use, but you can get away with not knowing any details. The parsing itself happens with the following calls:

var tokens = ...; // from lexical analysis
var matrix = new Matrix(earley);
earley.parse(tokens, matrix);
var root = earley.forest(tokens, matrix);

Admittedly, there is a certain amount of boilerplate here, since the reader doesn’t really need to know anything about the intermediate Matrix object. The parse output is a tree, which can be printed to the standard output with:

root.printTree();

For example, the input "select emp from empno;" produces:

[0,4)   query_block  select  simple_set_expr  ...
  [0,2)   select_clause
    [0,1)   'SELECT'
    [1,2)   column  expr  identifier  select_list ...
  [2,4)   from_clause
    [2,3)   'FROM'
    [3,4)   cartesian_product  identifier  query_table_expression ...
[4,5)   ';'
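If you are curious what happens inside the parse call, a minimal Earley recognizer fits in a few dozen lines. The sketch below is illustrative JavaScript over a toy grammar, not SqlEarley’s implementation; like the parse call, it only answers yes or no (recognition), before any tree is built:

```javascript
// Minimal Earley recognizer illustrating the scan/predict/complete steps.
// Grammar rules are {lhs, rhs}; a symbol is a nonterminal iff it is some lhs.
function recognize(grammar, start, tokens) {
    var nonterminals = {};
    grammar.forEach(function (r) { nonterminals[r.lhs] = true; });

    // chart[i] holds Earley items {rule, dot, origin}
    var chart = [];
    for (var i = 0; i <= tokens.length; i++) chart.push([]);

    function add(set, item) {
        var dup = set.some(function (x) {
            return x.rule === item.rule && x.dot === item.dot && x.origin === item.origin;
        });
        if (!dup) set.push(item);
    }

    grammar.forEach(function (r) {
        if (r.lhs === start) add(chart[0], { rule: r, dot: 0, origin: 0 });
    });

    for (var pos = 0; pos <= tokens.length; pos++) {
        for (var j = 0; j < chart[pos].length; j++) { // chart[pos] grows as we scan it
            var item = chart[pos][j];
            var next = item.rule.rhs[item.dot];
            if (next === undefined) {
                // COMPLETE: advance the dot over item.rule.lhs in the origin set
                chart[item.origin].forEach(function (cand) {
                    if (cand.rule.rhs[cand.dot] === item.rule.lhs) {
                        add(chart[pos], { rule: cand.rule, dot: cand.dot + 1, origin: cand.origin });
                    }
                });
            } else if (nonterminals[next]) {
                // PREDICT: expand the nonterminal after the dot
                grammar.forEach(function (r) {
                    if (r.lhs === next) add(chart[pos], { rule: r, dot: 0, origin: pos });
                });
            } else if (pos < tokens.length && tokens[pos] === next) {
                // SCAN: the terminal after the dot matches the current token
                add(chart[pos + 1], { rule: item.rule, dot: item.dot + 1, origin: item.origin });
            }
        }
    }

    // Accept iff a completed start rule spans the whole input
    return chart[tokens.length].some(function (it) {
        return it.rule.lhs === start && it.dot === it.rule.rhs.length && it.origin === 0;
    });
}

// A toy grammar for "select <identifier> from <identifier>"
var grammar = [
    { lhs: 'query',         rhs: ['select_clause', 'from_clause'] },
    { lhs: 'select_clause', rhs: ['select', 'id'] },
    { lhs: 'from_clause',   rhs: ['from', 'id'] },
    { lhs: 'id',            rhs: ['emp'] },
    { lhs: 'id',            rhs: ['empno'] }
];

console.log(recognize(grammar, 'query', ['select', 'emp', 'from', 'empno'])); // → true
```

The production engine additionally records enough bookkeeping in the Matrix to let forest reconstruct the tree afterwards.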

To programmatically explore and navigate the parse tree you have to familiarize yourself with the ParseNode object. It contains all the necessary methods, such as:

  • “List all node’s children”,
  • “Give me the parent”,
  • “List the node grammar payload (a set of grammar symbols)”,
  • “Does the node payload contain select_clause?”
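As an illustration of the kind of queries those methods enable, here is a sketch over a plain JavaScript stand-in for ParseNode (the real method names in the oracle.dbtools packages may differ). It collects every node whose payload contains a given grammar symbol:

```javascript
// Plain stand-in for ParseNode: a [from,to) segment in token units,
// a payload of grammar symbols, and a list of children.
function node(from, to, payload, children) {
    return { from: from, to: to, payload: payload, children: children || [] };
}

// Collect all nodes (including the root) whose payload contains symbol
function findNodes(root, symbol) {
    var hits = [];
    (function walk(n) {
        if (n.payload.indexOf(symbol) >= 0) hits.push(n);
        n.children.forEach(walk);
    })(root);
    return hits;
}

// The tree for "select emp from empno", abbreviated from the printout above
var tree = node(0, 4, ['query_block'], [
    node(0, 2, ['select_clause'], [
        node(0, 1, ["'SELECT'"]),
        node(1, 2, ['identifier', 'select_list'])
    ]),
    node(2, 4, ['from_clause'], [
        node(2, 3, ["'FROM'"]),
        node(3, 4, ['identifier', 'query_table_expression'])
    ])
]);

var lists = findNodes(tree, 'select_list');
console.log(lists.length, '[' + lists[0].from + ',' + lists[0].to + ')'); // → 1 [1,2)
```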

Further reading

To learn more about the Earley parser engine, visualization, and performance optimizations of the algorithm, here is the paper.

Formula for Formatting

September 28, 2017

SQL Developer release 17.3 is out with formatter amendments warranting an explanation. The fundamentals didn’t change: the formatting mechanism is still based on parse tree analysis, which is implemented via Arbori, the parse tree query language. New for 17.3 is a more intuitive rule structure and better coordination of rules with formatting options. Still, rather than describing the 17.3 implementation increments, I’ll try to present a self-contained description of the Arbori formatting engine.

We start with indentation, arguably the most important formatting feature. Consider the following example:

indents.png

The first indent (at level = 0) is applied to the list of attributes in the select clause and the list of tables in the from clause in the main query block. The second offset (at level = 1) is applied to the inner view in parenthesis. The third offset is applied to the select term and the table in the inner view.

How is this specified formally?  We have to examine parse tree:

parseTreeIndents.png

The parse tree nodes, which we want to indent, have grammar symbols (such as the select_list) encircled in green. Unfortunately, there is another, nested, select_list occurrence, marked in red, which labels the parse tree node that we don’t want to indent. We just have to refine our condition that we are after the parse tree nodes labeled with select_list whose parent payload is select_clause!

Formally, all indentation formatting conditions are gathered in the rule named simpleIndentConditions:

simpleIndentConditions.png

The operations “|” and “&” are logical disjunction and conjunction, respectively, so the entire simpleIndentConditions rule is a monumental DNF. The conjunctions are listed alphabetically; therefore, we find the condition of interest for the nodes labeled with the select_clause buried somewhere in the middle:

selectListConditions.png

The condition consists of the 3 parts:

  • Nodes, labeled with the select_list
  • Parent node (formally “node^“) labeled with select_clause
  • Boolean bind variable :breaksAfterSelectFromWhere

The value of this bind variable is carried over from the Advanced Format preference page:

advancedFormat.png

Bind variables provide a vehicle for this “wall of options” to be seamlessly integrated into the formatter. Flipping the Line Breaks->SELECT/FROM/WHERE option off sets :breaksAfterSelectFromWhere to false, while flipping it back on sets it to true. Consequently, with :breaksAfterSelectFromWhere = true, the conjunction reduces to just

|  [node) select_list & [node^) select_clause

while :breaksAfterSelectFromWhere = false effectively eliminates this conjunction from the simpleIndentConditions rule.
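To see how such a conjunction behaves mechanically, consider this illustrative JavaScript predicate (not Arbori itself) mirroring the reduced condition above: a node qualifies when its payload contains select_list, its parent’s payload contains select_clause, and the bind variable is on:

```javascript
// Illustrative stand-in for the conjunction
//   [node) select_list & [node^) select_clause & :breaksAfterSelectFromWhere
// Nodes are plain objects with a payload array and a parent pointer.
function indentSelectList(node, binds) {
    return binds.breaksAfterSelectFromWhere
        && node.payload.indexOf('select_list') >= 0
        && node.parent !== null
        && node.parent.payload.indexOf('select_clause') >= 0;
}

var selectClause = { payload: ['select_clause'], parent: null };
var goodList = { payload: ['select_list'], parent: selectClause }; // the "green" node
var nestedList = { payload: ['select_list'],                       // the "red" node
                   parent: { payload: ['expr'], parent: null } };

console.log(indentSelectList(goodList,   { breaksAfterSelectFromWhere: true }));  // → true
console.log(indentSelectList(nestedList, { breaksAfterSelectFromWhere: true }));  // → false
console.log(indentSelectList(goodList,   { breaksAfterSelectFromWhere: false })); // → false
```

With the bind variable off, the whole conjunction evaluates to false for every node, which is precisely what eliminating it from the DNF means.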

Let’s do this experiment, flipping the Line Breaks->SELECT/FROM/WHERE option off. People who requested this option are usually not fond of a formatting style with excessive breaks. While we are at it, let’s switch off breaks on subqueries as well. The result

alignmentsExample.png

warrants an explanation. The first snag is the failure to align the select terms “2” and “3” to their predecessor; somehow a space has been lost… The other alignments look OK. However, the alignments take us beyond the simple indentation model described earlier. Once again, :breaksAfterSelectFromWhere is set to false, so it has effectively nullified the condition to indent the select term in the simpleIndentConditions rule. How, then, did the formatter lay out the select terms (and the other items)?

Here we have two new rules, for line breaks and for alignments. The rule for line breaks consists of two parts, covering the conditions for inserting a line break before a parse node:

extraBrkBefore.png

and after. Here we see that if a user has chosen the option to have line breaks before commas, then a line break is inserted before any comma that is followed by a select_term. This is not the condition that inserted the line breaks before the select terms “2” and “3” in our example. The following one did:

extraBrkAfter.png

Being able to control where line breaks are inserted solves half of the problem. We also need to specify alignments:

pairwiseAlignments.png

The alignment rule is a little more sophisticated than what we have considered before. It involves two variables: the node we want to align, and the target parse tree node, named “predecessor”, which we want to align it to. There are several ways the “predecessor” node can be related to the aligned node; hence the alignment rule has been split into several parts. The first rule, called pairwiseAlignments1, defines the predecessor node as the younger sibling of the aligned node, formally predecessor=node-1. This is exactly how the from_clause is aligned to the select_clause. The second rule, called pairwiseAlignments2, defines the predecessor node as the younger sibling of the younger sibling of the aligned node, formally predecessor=node-1-1. This is how individual select terms are aligned. Here is the parse tree to convince the reader how, in our example, the select term “2” is aligned with the select list “1”:

pairwiseAlignments3.png

Likewise, the select term “3” is aligned with the select list “1,2”:

pairwiseAlignments2.png
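The predecessor arithmetic can be pictured with a small sketch (illustrative JavaScript, not the Arbori engine): node-1 is the previous sibling, and node-1-1 is the previous sibling of that sibling:

```javascript
// Illustrative sketch of Arbori's predecessor arithmetic: node-1 is the
// previous sibling, node-1-1 the previous sibling of the previous sibling.
function prevSibling(node) {
    if (!node.parent) return null;
    var i = node.parent.children.indexOf(node);
    return i > 0 ? node.parent.children[i - 1] : null;
}

// select_list "1,2" -> [ select_list "1", ",", select_term "2" ]
var parent = { children: [] };
var one   = { label: 'select_list "1"', parent: parent };
var comma = { label: '","',             parent: parent };
var two   = { label: 'select_term "2"', parent: parent };
parent.children.push(one, comma, two);

console.log(prevSibling(two).label);              // node-1   → '","'
console.log(prevSibling(prevSibling(two)).label); // node-1-1 → 'select_list "1"'
```

This is why the select term is aligned two siblings back: the comma sits between it and the select list it should line up with.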

Arbori; the missing manuals

February 11, 2017

The number of lines of code is well known to be a poor metric for virtually any purpose. One thing is indisputable, though: the total volume of code is growing. Today, developers are frequently challenged with mining huge codebases. Therefore, querying parse trees is not just an academic exercise, but something of ever-increasing practical value.

A blog post, though, is hardly a proper venue for introducing a new language. Here are two references to supplement it:

  1. Arbori Starter Manual by Eugene Perkov
  2. The somewhat more detailed Semantic Analysis with Arbori