The idea of Domain Specific Languages has been around for a while. For example, the Oracle Graal/Truffle framework provides a runtime environment for many popular languages, but it also makes an effort to simplify the development of new languages. The starting point for Graal/Truffle is the language semantics captured in the Abstract Syntax Tree (AST). However, the first part — how to get from the language grammar, defined as a CFG, to the AST — is left open.
In this article we'll explore the full path, from the CFG to the language implementation runtime. In particular, we'll investigate the minimal amount of effort the language developer needs to accomplish it, and whether the concept of an AST is necessary at all.
Running Example
For illustrative purposes, we'll adopt the example from the Wikipedia page on the AST:
a := 30
b := 47 - 5
while b != 0
    --print a
    --print b
    if a > b
        a := a - b
        --print a
    else
        b := b - a
        --print b
    end
end
print a
print b
The minor amendments to the original syntax include the addition of the end delimiter for the while and if statements; without it, a sequence of statements in the grammar would be ambiguous. Also, a print statement is added, which is helpful for debugging purposes. The lines prefixed with -- are comments: the lexer introduced below defaults to SQL comment syntax, so the commented-out print statements do not affect parsing. This code snippet will be referenced in the following sections by the file name example.expr.
Grammar
Writing the grammar for our tiny language is certainly a required step. The alternative — a manually written procedural recursive descent parser — is more laborious. Here is the BNF:
program:
stmt
| program stmt
;
stmt:
assignment
| while
| if
| print
;
assignment: identifier ':' '=' expr
;
expr:
digits
| identifier
| expr '-' expr
;
boolean:
expr '>' expr
| expr '!' '=' expr
;
while: 'while' boolean program 'end'
;
if: 'if' boolean program 'else' program 'end'
;
print: 'print' expr
;
Arguably, plain BNF is the simplest language possible, so little comment is warranted.
For future reference, this grammar will be referred to by the file name expr.grammar.
Parsing
The prerequisite for parsing is the choice of a parser engine. Our choice here is one based on the Earley parsing algorithm. The implementation is in Java and is provided in dbtools-common.jar, the support library for popular SQL tools such as SQLDeveloper, SQLcl, and ORDS. If the reader is uncomfortable with a code base that is not open source, here is the GitHub link to an implementation which doesn't include Oracle artifacts.
With the parser engine support in place, we instantiate the parser and run it as follows:
package ast;

import java.io.IOException;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

import oracle.dbtools.parser.Cell;
import oracle.dbtools.parser.Earley;
import oracle.dbtools.parser.Grammar;
import oracle.dbtools.parser.LexerToken;
import oracle.dbtools.parser.Matrix;
import oracle.dbtools.parser.ParseNode;
import oracle.dbtools.parser.RuleTuple;
import oracle.dbtools.parser.Visual;
import oracle.dbtools.parser.plsql.SyntaxError;
import oracle.dbtools.util.Service;

public class ExprEarley extends Earley {
    public static void main( String[] args ) throws Exception {
        String input = Service.readFile(ExprEarley.class, "example.expr");
        Earley earley = new ExprEarley();
        List<LexerToken> src = LexerToken.parse(input);
        if( src.size() < 100 )
            LexerToken.print(src);
        else
            System.out.println("src.size()="+src.size());
        Matrix matrix = new Matrix(earley);
        /*if( src.size() < 400 ) {
            Visual.pixels = 900; // Mac scaling for UHD; default for Win
            matrix.visual = new Visual(src, earley);
        }*/
        //earley.skipRanges = false; //<-- disable optimization
        earley.parse(src, matrix);
        if( matrix.visual != null )
            matrix.visual.draw(matrix);
        // get the bottom-right corner of the matrix
        // this is "top" in the "bottom-to-the-top" parser terminology
        Cell top = matrix.get(0, src.size());
        if( top != null ) // if input is syntactically recognized
            System.out.println(src.size()+ " tokens, top = " + top.toString());
        else {
            System.out.println(SyntaxError.checkSyntax(input, new String[]{"program"}, src, earley, matrix));
            return;
        }
        ParseNode root = earley.forest(src, matrix, true, input);
        root.printTree();
    }

    public ExprEarley() throws Exception {
        super(getRules());
        isCaseSensitive = true;
    }

    private static Set<RuleTuple> getRules() throws IOException {
        Set<RuleTuple> ret = new TreeSet<>();
        String rulesTxt = Service.readFile(ExprEarley.class,"expr.grammar");
        List<LexerToken> src = LexerToken.parse(rulesTxt);
        ParseNode root = Grammar.parseGrammarFile(src, rulesTxt);
        Grammar.grammar(root, src, ret);
        return ret;
    }
}
The bulk of the code is the ExprEarley.main() method, which a user not interested in the bells and whistles of the Earley algorithm can skip. Moreover, in many use cases, using the Parsed wrapper class is less verbose than the boilerplate above that performs parsing in explicit steps, and that is actually the approach taken in the next section. The parser for expr.grammar itself, therefore, amounts to only a dozen or so lines at the bottom: the constructor and the grammar extraction method.
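For reference, here is a minimal sketch of that more compact route, mirroring the Evaluate class shown in the next section (the class name ParseOnly is made up for illustration):
package ast;

import oracle.dbtools.parser.Parsed;
import oracle.dbtools.util.Service;

public class ParseOnly {
    public static void main( String[] args ) throws Exception {
        // The Parsed wrapper drives lexing and parsing in one step, given the
        // input text, the parser instance, and the top grammar symbol;
        // an Arbori program later sees the resulting parse tree as target.root.
        String input = Service.readFile(ExprEarley.class, "example.expr");
        Parsed parsed = new Parsed(input, new ExprEarley(), "program");
    }
}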
For an interested reader, however, here is a walkthrough of the main() method:
- The call to Service.readFile() reads the example.expr input. The first argument is the class used to resolve the resource, so the file is expected in the directory collocated with ExprEarley.class. Alternatively, the reader can use any of the dozens of other file-reading methods available in Java.
- LexerToken.parse() performs lexical analysis, producing a list of LexerTokens. The lexical analyzer in the dbtools-common library is essentially a sugarcoated java.util.StringTokenizer, amended to cover a large variety of practical languages. Language specifics, such as the comment style, are controlled via extra arguments to the call, but in our case we have chosen the default, which is SQL; this is why the --print lines in example.expr are treated as comments.
- The Matrix object captures the crux of the Earley parsing algorithm. The commented-out code visualizes the parsing process as a two-dimensional grid with the recognized grammar rules in the grid cells. The cubic complexity of the parse method is evident from this visualization: the square factor comes from the two-dimensional nature of the grid, and one extra multiple comes from the Earley complete step. An interested reader is referred to this document for details.
- The commented-out skipRanges assignment is an option to disable the ad-hoc optimization which differentiates our implementation from the canonical Earley algorithm. The user can also toggle the optimization on the Matrix visualization panel and witness grid areas colored in blue where analysis was skipped.
- Finally, earley.forest() produces the parse tree, where each node is labeled with a half-open token interval [from,to) and the grammar symbols recognized over that range; in our case root.printTree() outputs
[0,38) program
  [0,36) program
    [0,34) program
      [0,10) program
        [0,4) assignment program stmt
          [0,1) identifier
          [1,2) ':'
          [2,3) '='
          [3,4) digits expr
        [4,10) assignment stmt
          [4,5) identifier
          [5,6) ':'
          [6,7) '='
          [7,10) expr
            [7,8) digits expr
            [8,9) '-'
            [9,10) digits expr
      [10,34) stmt while
        [10,11) 'while'
        [11,15) boolean
          [11,12) expr identifier
          [12,13) '!'
          [13,14) '='
          [14,15) digits expr
        [15,33) if program stmt
          [15,16) 'if'
          [16,19) boolean
            [16,17) expr identifier
            [17,18) '>'
            [18,19) expr identifier
          [19,25) assignment program stmt
            [19,20) identifier
            [20,21) ':'
            [21,22) '='
            [22,25) expr
              [22,23) expr identifier
              [23,24) '-'
              [24,25) expr identifier
          [25,26) 'else'
          [26,32) assignment program stmt
            [26,27) identifier
            [27,28) ':'
            [28,29) '='
            [29,32) expr
              [29,30) expr identifier
              [30,31) '-'
              [31,32) expr identifier
          [32,33) 'end'
        [33,34) 'end'
    [34,36) print stmt
      [34,35) 'print'
      [35,36) expr identifier
  [36,38) print stmt
    [36,37) 'print'
    [37,38) expr identifier
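Before moving on, note that the same tree can also be traversed programmatically. Here is a minimal sketch, assuming only what the evaluation script below also relies on: that ParseNode exposes its token interval as the public fields from and to, and its children via children(); the walk() helper and the defensive cast are illustration only.
// Depth-first walk over the parse tree; indentation tracks the node depth,
// as in the printTree() output above (symbol labels are omitted here).
static void walk( ParseNode node, int depth ) {
    StringBuilder indent = new StringBuilder();
    for( int i = 0; i < depth; i++ )
        indent.append("  ");
    System.out.println(indent + "[" + node.from + "," + node.to + ")");
    for( Object child : node.children() )   // cast defensively in case the
        walk((ParseNode) child, depth + 1);  // returned collection is not generic
}
Calling walk(root, 0) right after root.printTree() in ExprEarley.main() would print just the intervals in the same shape.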
Program Evaluation
The common wisdom of program execution is to transform the parse tree into an AST. Again, we investigate whether this step can be omitted, or possibly replaced by something else.
For the implementation we have chosen the JavaScript engine available in Java, because JavaScript is a dynamic language with functions as first-class citizens. A pure Java solution would be a little more verbose, with boilerplate wrapper objects.
There are two alternative ways to run the scripts:
- as an Arbori program, or
- directly via the JavaScript engine.
We don't really leverage any of the available Semantic Analysis facilities, and the only advantage that Arbori provides is creating an environment with the parsed object available as a global variable. In the second approach we would have to put it there explicitly, as the sketch below illustrates.
The Java code which runs the Arbori program is the following:
package ast;

import java.util.Map;

import oracle.dbtools.arbori.MaterializedPredicate;
import oracle.dbtools.arbori.Program;
import oracle.dbtools.parser.Earley;
import oracle.dbtools.parser.Parsed;
import oracle.dbtools.util.Service;

public class Evaluate {
    public static void main( String[] args ) throws Exception {
        String input = Service.readFile(ExprEarley.class, "example.expr");
        Earley earley = new ExprEarley();
        Parsed parsed = new Parsed(input,earley,"program");
        Program program = new Program(earley, "JS") {
        };
        String prg = Service.readFile(ExprEarley.class, "evaluate.arbori");
        program.compile(prg);
        Map<String,MaterializedPredicate> output = program.eval(parsed);
    }
}
As promised in the previous section, the parsing is performed with a single call, namely the Parsed constructor. Its third argument, the "program" literal, is the grammar's top symbol.
The evaluate.arbori code read by the Evaluate class is essentially the following JavaScript:
{
    var Service = Java.type('oracle.dbtools.util.Service');
    target.root.printTree();
    var debug = true;
    var scope = new java.util.HashMap();

    var ruleHandlers = {
        "assignment: identifier ':' '=' expr;"
        : function Assignment( node, children ) {
            scope.put( children[0].content(target.src),eval(children[3]) );
        }
        , "expr: expr '-' expr;"
        : function Minus( node, children ) {
            return eval(children[0])-eval(children[2]);
        }
        , "expr: digits;"
        : function Digits( node, children ) {
            return java.lang.Integer.parseInt(node.content(target.src));
        }
        , "if: 'if' boolean program 'else' program 'end';"
        : function If( node, children ) {
            var condition = eval(children[1]);
            if( condition )
                return eval(children[2]);
            else
                return eval(children[4]);
        }
        , "boolean: expr '>' expr;"
        : function Gt( node, children ) {
            return eval(children[0]) > eval(children[2]);
        }
        , "expr: identifier;"
        : function Id( node, children ) {
            return scope.get(node.content(target.src));
        }
        , "while: 'while' boolean program 'end';"
        : function While( node, children ) {
            var i = 0; // to avoid accidental infinite loop
                       // removed in the production code
            while( eval(children[1]) && ++i<5 )
                eval(children[2]);
        }
        , "boolean: expr '!' '=' expr;"
        : function Ne( node, children ) {
            return eval(children[0]) != eval(children[3]);
        }
        , "print: 'print' expr;"
        : function Print( node, children ) {
            print( children[1].content(target.src)+"="+eval(children[1]) );
        }
    }

    function matchingProduction( node ) {
        var children = node.children().toArray();
        var rules = target.getEarley().rules;
        for( ir in rules ) {
            var rule = rules[ir];
            if( !node.contains(rule.head) )
                continue;
            if( node.from+1 == node.to ) {
                if( rule.head != rule.rhs[0] && node.contains(rule.rhs[0]) )
                    return rule;
                else
                    continue;
            }
            var matched = true;
            for( ik in children ) {
                if( rule.rhs[ik] == null ) {
                    matched = false;
                    break;
                }
                if( !children[ik].contains(rule.rhs[ik]) ) {
                    matched = false;
                    break;
                }
            }
            if( matched )
                return rule.toString();
        }
    }

    function eval(node) {
        if(debug) java.lang.System.out.print(Service.padln(node,40)+" ");
        var children = node.children().toArray();
        var rule = matchingProduction( node );
        if(debug) print(rule);
        if( rule == null )
            return "N/A";
        var handler = ruleHandlers[rule];
        //print('handler='+handler);
        if( handler != null) {
            var ret = handler(node, children);
            if( debug && ret != null )
                print("EVAL "+node + " TO " + ret);
            return ret;
        } else if( children != null ) {
            for( ik in children ) {
                eval(children[ik]);
            }
        } else
            return "?";
    }

    eval(target.root);
    print(scope);
}
It looks a little more verbose than it actually is, because only the ruleHandlers map is code that the language developer has to write. The matchingProduction() and eval() functions are universal for any language and thus can be hidden in a library. The purpose of the matchingProduction() function is to find the grammar rule disjunct that has actually been recognized at a parse node. The implementation itself is far from efficient, but that is not our concern here. Again, our focus is how much language-specific supporting code the developer has to write.
In the ruleHandlers map we have mapped the string representations of the rule productions to the functions that perform the evaluation at the nodes where such a production has been recognized. For the language developer it might be helpful to run the following Arbori query, which reports the matching production at each node:
matchingProductions: [node < node)
-> {
    var node = tuple.get('node')
    print(Service.padln(node,40)+" "+matchingProduction(node));
}
Consequently, the string representations of the relevant rules are copied into the ruleHandlers map.
For example, at the node
[15,33) if program stmt
the query reports the recognized rule production
if: 'if' boolean program 'else' program 'end';
There is only one production recognized at each node, so we can leverage this knowledge by attaching an evaluation function to it in the ruleHandlers map. In our example, the function If( node, children ) is called as soon as the above production is recognized. In this function we process the parse tree node and then descend to the node's children.
Conclusion
There seems to be an interesting conceptual distinction between the Semantic Analysis of a program and its execution. At present, it seems that you can't just write an Arbori query that executes the program, which is somewhat puzzling. Semantic Analysis inspects the program code globally, for example, capturing relations between ancestor and descendant parse tree nodes. The program execution mechanism, on the other hand, is mostly concerned with parent-child relations. It is the leverage of recursion that perhaps makes up for this shortcoming of the execution mechanism.