The Actipro Blog

All the latest UI control development news from Actipro

SyntaxEditor grammar/AST framework part 3: Creating a grammar for the Simple language

In the previous post we gave a detailed introduction to symbols, EBNF terms, and how you can translate a language’s EBNF specification to our grammar framework.

In today’s post we’re going to write a C#-based grammar using our new framework for a language called Simple.  The Simple language is basically a small subset of a Javascript-like language.  When we’re done, we’ll load it up into a SyntaxEditor and will look at the AST results that are generated for us based on some input code.

Finding the language specification

The first thing to do when building a Grammar for a language is to locate the language specification for the language.  Nearly all programming languages have some sort of formal EBNF specification that tell parsers and compilers how to interpret their code. 

Here’s the EBNF we’ll use for the Simple language as we build our Grammar:

   1: CompilationUnit:
   2:     FunctionDeclarationList?
   3:  
   4: FunctionDeclarationList:
   5:     FunctionDeclaration
   6:     FunctionDeclarationList FunctionDeclaration
   7:  
   8: FunctionDeclaration:
   9:     "function" "identifier" "(" FunctionParameterList? ")" Block
  10:  
  11: FunctionParameterList:
  12:     "identifier"
  13:     FunctionParameterList "," "identifier"
  14:  
  15: StatementList:
  16:     Statement
  17:     StatementList Statement
  18:  
  19: Statement:
  20:     Block
  21:     EmptyStatement
  22:     VariableDeclarationStatement
  23:     AssignmentStatement
  24:     ReturnStatement
  25:     
  26: Block:
  27:     "{" StatementList? "}"
  28:  
  29: EmptyStatement:
  30:     ";"
  31:     
  32: VariableDeclarationStatement:
  33:     "var" "identifier" ";"
  34:  
  35: AssignmentStatement:
  36:     "identifier" "=" Expression ";"
  37:     
  38: ReturnStatement:
  39:     "return" Expression ";"
  40:     
  41: Expression: 
  42:     EqualityExpression
  43:  
  44: EqualityExpression:
  45:     AdditiveExpression
  46:     AdditiveExpression "==" EqualityExpression
  47:     AdditiveExpression "!=" EqualityExpression
  48:     
  49: AdditiveExpression:
  50:     MultiplicativeExpression
  51:     MultiplicativeExpression "+" AdditiveExpression
  52:     MultiplicativeExpression "-" AdditiveExpression
  53:  
  54: MultiplicativeExpression:
  55:     PrimaryExpression
  56:     PrimaryExpression "*" MultiplicativeExpression
  57:     PrimaryExpression "/" MultiplicativeExpression
  58:     
  59: PrimaryExpression:
  60:     NumberExpression
  61:     SimpleName
  62:     FunctionAccessExpression
  63:     ParenthesizedExpression
  64:  
  65: NumberExpression:
  66:     "number"
  67:  
  68: SimpleName:
  69:     "identifier"
  70:  
  71: FunctionAccessExpression:
  72:     "identifier" "(" FunctionArgumentList? ")"
  73:  
  74: FunctionArgumentList:
  75:     FunctionArgument
  76:     FunctionArgumentList "," FunctionArgument
  77:  
  78: FunctionArgument:
  79:     Expression
  80:  
  81: ParenthesizedExpression:
  82:     "(" Expression ")"

Creating a Grammar class

Next we need to create a new class that inherits Grammar (from our framework).  The class constructor will do four tasks:

  1. Create terminals
  2. Create non-terminals
  3. Configure non-terminal productions
  4. Configure non-terminal can-match callbacks

In the previous post we explained these steps and showed some quick examples.  Now let’s implement them for the entire language.

Create the SimpleGrammar class

First, let’s create our class:

   1: /// <summary>
   2: /// Represents a <see cref="Grammar"/> for the <c>Simple</c> language.
   3: /// </summary>
   4: public class SimpleGrammar : Grammar {
   5:         
   6:     /// <summary>
   7:     /// Initializes a new instance of the <c>SimpleGrammar</c> class.
   8:     /// </summary>
   9:     public SimpleGrammar() : base("Simple") {
  10:  
  11:     }
  12: }

Create terminals

In the EBNF specification above, the terminals are quoted.  We’ll now insert Terminal variable declarations for each of the terminals in our language into the constructor for our SimpleGrammar class.

   1: var @addition = new Terminal(SimpleTokenId.Addition, "Addition")
   2:     { ErrorAlias = "'+'" };
   3: var @assignment = new Terminal(SimpleTokenId.Assignment, "Assignment")
   4:     { ErrorAlias = "'='" };
   5: var @closeCurlyBrace = new Terminal(SimpleTokenId.CloseCurlyBrace, "CloseCurlyBrace")
   6:     { ErrorAlias = "'}'" };
   7: var @closeParenthesis = new Terminal(SimpleTokenId.CloseParenthesis, "CloseParenthesis")
   8:     { ErrorAlias = "')'" };
   9: var @comma = new Terminal(SimpleTokenId.Comma, "Comma")
  10:     { ErrorAlias = "','" };
  11: var @division = new Terminal(SimpleTokenId.Division, "Division")
  12:     { ErrorAlias = "'/'" };
  13: var @equality = new Terminal(SimpleTokenId.Equality, "Equality")
  14:     { ErrorAlias = "'=='" };
  15: var @function = new Terminal(SimpleTokenId.Function, "Function")
  16:     { ErrorAlias = "'function'" };
  17: var @identifier = new Terminal(SimpleTokenId.Identifier, "Identifier");
  18: var @inequality = new Terminal(SimpleTokenId.Inequality, "Inequality")
  19:     { ErrorAlias = "'!='" };
  20: var @multiplication = new Terminal(SimpleTokenId.Multiplication, "Multiplication")
  21:     { ErrorAlias = "'*'" };
  22: var @number = new Terminal(SimpleTokenId.Number, "Number");
  23: var @openCurlyBrace = new Terminal(SimpleTokenId.OpenCurlyBrace, "OpenCurlyBrace")
  24:     { ErrorAlias = "'{'" };
  25: var @openParenthesis = new Terminal(SimpleTokenId.OpenParenthesis, "OpenParenthesis")
  26:     { ErrorAlias = "'('" };
  27: var @return = new Terminal(SimpleTokenId.Return, "Return")
  28:     { ErrorAlias = "'return'" };
  29: var @semiColon = new Terminal(SimpleTokenId.SemiColon, "SemiColon")
  30:     { ErrorAlias = "';'" };
  31: var @subtraction = new Terminal(SimpleTokenId.Subtraction, "Subtraction")
  32:     { ErrorAlias = "'-'" };
  33: var @var = new Terminal(SimpleTokenId.Var, "Var")
  34:     { ErrorAlias = "'var'" };

As mentioned in the previous post, we like to prefix our terminal variables with a @ character so they can be distinguished from non-terminals in their productions.

The SimpleTokenId class was generated by our Language Designer application and provides const fields for each token ID produced by our Simple language’s lexer.  The second parameter to the Terminal constructor is the terminal key, which is used to identify it. 

When errors occur, the key is used in the error message unless an error alias is used.  If an error alias is provided, it is used instead.  For instance telling the user ‘==’ expected is a much nicer error message than Equality expected.

Create non-terminals

Now we’ll declare NonTerminal variables for each non-terminal in the grammar.

   1: //
   2: // Create non-terminals
   3: //
   4:  
   5: this.Root = new NonTerminal("CompilationUnit");
   6: var functionDeclaration = new NonTerminal("FunctionDeclaration");
   7: var functionParameterList = new NonTerminal("FunctionParameterList");
   8: var statement = new NonTerminal("Statement");
   9: var block = new NonTerminal("Block");
  10: var emptyStatement = new NonTerminal("EmptyStatement");
  11: var variableDeclarationStatement = new NonTerminal("VariableDeclarationStatement");
  12: var assignmentStatement = new NonTerminal("AssignmentStatement");
  13: var returnStatement = new NonTerminal("ReturnStatement");
  14: var expression = new NonTerminal("Expression");
  15: var equalityExpression = new NonTerminal("EqualityExpression");
  16: var additiveExpression = new NonTerminal("AdditiveExpression");
  17: var multiplicativeExpression = new NonTerminal("MultiplicativeExpression");
  18: var primaryExpression = new NonTerminal("PrimaryExpression");
  19: var numberExpression = new NonTerminal("NumberExpression");
  20: var simpleName = new NonTerminal("SimpleName");
  21: var functionAccessExpression = new NonTerminal("FunctionAccessExpression");
  22: var functionArgumentList = new NonTerminal("FunctionArgumentList");
  23: var parenthesizedExpression = new NonTerminal("ParenthesizedExpression");

Note how the grammar’s Root property is set to the root non-terminal.

Create non-terminal productions

Next is the more interesting part.  We’ll configure the production rules for each non-terminal.  As explained in the previous post, our framework makes heavy use of operator overloads, methods, indexers, and implicit casts so that production rules can be created with minimal C#/VB code.

Here’s the productions:

   1: this.Root.Production = functionDeclaration.ZeroOrMore();
   2:  
   3: functionDeclaration.Production = @function + @identifier + @openParenthesis + 
   4:     functionParameterList.Optional() + @closeParenthesis + block;
   5:  
   6: functionParameterList.Production = @identifier + (@comma + @identifier).ZeroOrMore();
   7:  
   8: statement.Production = block
   9:     | emptyStatement
  10:     | variableDeclarationStatement
  11:     | assignmentStatement
  12:     | returnStatement;
  13:  
  14: block.Production = @openCurlyBrace + statement.ZeroOrMore() + @closeCurlyBrace;
  15:         
  16: emptyStatement.Production = semiColon.ToTerm().ToProduction();
  17:  
  18: variableDeclarationStatement.Production = @var + @identifier + @semiColon;
  19:  
  20: assignmentStatement.Production = @identifier + @assignment + expression + @semiColon;
  21:  
  22: returnStatement.Production = @return + expression + @semiColon;
  23:  
  24: expression.Production = equalityExpression.ToTerm().ToProduction();
  25:  
  26: equalityExpression.Production = additiveExpression + 
  27:     ((@equality | @inequality) + equalityExpression).Optional();
  28:  
  29: additiveExpression.Production = multiplicativeExpression + 
  30:     ((@addition | @subtraction) + additiveExpression).Optional();
  31:  
  32: multiplicativeExpression.Production = primaryExpression + 
  33:     ((@multiplication | @division) + multiplicativeExpression).Optional();
  34:  
  35: primaryExpression.Production = numberExpression
  36:     | simpleName
  37:     | functionAccessExpression
  38:     | parenthesizedExpression;
  39:  
  40: numberExpression.Production = @number.ToTerm().ToProduction();
  41:  
  42: simpleName.Production = @identifier.ToTerm().ToProduction();
  43:  
  44: functionAccessExpression.Production = @identifier + @openParenthesis + 
  45:     functionArgumentList.Optional() + @closeParenthesis;
  46:  
  47: functionArgumentList.Production = expression + (@comma + expression).ZeroOrMore();
  48:  
  49: parenthesizedExpression.Production = @openParenthesis + expression + @closeParenthesis;

Not so hard, is it?  Let’s review some details about what we just did during our translation from the EBNF language specification to these production rules.

Elimination of left recursion

The FunctionParameterList, EqualityExpression, AdditiveExpression, MultiplicativeExpression, and FunctionArgumentList all have left recursion in the EBNF specification.  Since our grammar is LL(*)-based, that is not permitted.

It’s easy to rearrange our rules to handle this though.  First look at how we changed FunctionParameterList.  Its EBNF specification says:

   1: FunctionParameterList:
   2:     "identifier"
   3:     FunctionParameterList "," "identifier"

We wrote that in our grammar as:

   1: functionParameterList.Production = @identifier + (@comma + @identifier).ZeroOrMore();

Next let’s look at EqualityExpression:

   1: EqualityExpression:
   2:     AdditiveExpression
   3:     AdditiveExpression "==" EqualityExpression
   4:     AdditiveExpression "!=" EqualityExpression

We wrote that in our grammar as:

   1: equalityExpression.Production = additiveExpression + 
   2:     ((@equality | @inequality) + equalityExpression).Optional();

The remaining left recursion instances were handled in similar fashion.

Elimination of ambiguity

Now let’s try running our grammar within SyntaxEditor.  When we do so, we get an exception:

Multiple productions within a NonTerminal 'PrimaryExpression' alternation start with the terminal 'Identifier' either directly or indirectly.  The second production that contains a reference to the terminal is: FunctionAccessDeclaration

PrimaryExpression’s EBNF specification says:

   1: PrimaryExpression:
   2:     NumberExpression
   3:     SimpleName
   4:     FunctionAccessExpression
   5:     ParenthesizedExpression

The productions of both SimpleName and FunctionAccessExpression start with an Identifier terminal.  That means they are ambiguous and the grammar framework detected it for us.

This is where we need to add a can-match callback for one of the non-terminals.  In this scenario SimpleName only matches an Identifier while FunctionAccessExpression matches an Identifier followed by an OpenParenthesis.  Therefore we’ll put our can-match callback on the FunctionAccessExpression

First though, let’s move the FunctionAccessExpression in the alternation before the SimpleName.  This ensures that its can-match callback will occur before the ambiguity.

   1: primaryExpression.Production = numberExpression
   2:     | functionAccessExpression  // This was moved up
   3:     | simpleName
   4:     | parenthesizedExpression;

Configure non-terminal can-match callbacks

Let’s make the can-match callback for the FunctionAccessExpression:

   1: /// <summary>
   2: /// Returns whether the <c>FunctionAccessExpression</c> non-terminal can match.
   3: /// </summary>
   4: /// <param name="state">A <see cref="IParserState"/> that provides information
   5: /// about the parser's current state.</param>
   6: /// <returns>
   7: /// <c>true</c> if the <see cref="NonTerminal"/> can match with the current state; 
   8: /// otherwise, <c>false</c>.
   9: /// </returns>
  10: private bool CanMatchFunctionAccessExpression(IParserState state) {
  11:     return (state.TokenReader.LookAheadToken.Id == SimpleTokenId.Identifier) && 
  12:         (state.TokenReader.GetLookAheadToken(2).Id == SimpleTokenId.OpenParenthesis);
  13: }

Can-match callbacks are passed in an IParserState.  The state gives access to everything from the look-ahead tokens to AST match data and even custom data.  The callback returns true if the non-terminal can match with the current state.

In our situation we want to indicate FunctionAccessExpression can match if the next two tokens are Identifier and OpenParenthesis.  The above code returns whether that condition is met.

We then wire up the can-match callback to the non-terminal like this:

   1: functionAccessExpression.CanMatchCallback = CanMatchFunctionAccessExpression;

Now we’re done with the grammar for this post’s sample.

Looking at the grammar’s output

The grammar framework will automatically generate an AST for you, just based on what we’ve written so far.  It will contain AST nodes for each terminal, non-terminal, and quantifier.

But let’s first see what output we get as-is.  The input we’ll type in our SyntaxEditor is:

   1: function IncrementAndMultiply(x, y) {
   2:     var result;
   3:     result = Increment(x);
   4:     return result * y;
   5: }

Here is the text representation of the AST that was generated:

   1: CompilationUnit[
   2:     [
   3:         FunctionDeclaration[
   4:             "function"
   5:             "IncrementAndMultiply"
   6:             "("
   7:             FunctionParameterList[
   8:                 "x"
   9:                 [
  10:                     ","
  11:                     "y"
  12:                 ]
  13:             ]
  14:             ")"
  15:             Block[
  16:                 "{"
  17:                 [
  18:                     Statement[
  19:                         VariableDeclarationStatement[
  20:                             "var"
  21:                             "result"
  22:                             ";"
  23:                         ]
  24:                     ]
  25:                     Statement[
  26:                         AssignmentStatement[
  27:                             "result"
  28:                             "="
  29:                             Expression[
  30:                                 EqualityExpression[
  31:                                     AdditiveExpression[
  32:                                         MultiplicativeExpression[
  33:                                             PrimaryExpression[
  34:                                                 FunctionAccessExpression[
  35:                                                     "Increment"
  36:                                                     "("
  37:                                                     FunctionArgumentList[
  38:                                                         Expression[
  39:                                                             EqualityExpression[
  40:                                                                 AdditiveExpression[
  41:                                                                     MultiplicativeExpression[
  42:                                                                         PrimaryExpression[
  43:                                                                             SimpleName[
  44:                                                                                 "x"
  45:                                                                             ]
  46:                                                                         ]
  47:                                                                         []
  48:                                                                     ]
  49:                                                                     []
  50:                                                                 ]
  51:                                                                 []
  52:                                                             ]
  53:                                                         ]
  54:                                                         []
  55:                                                     ]
  56:                                                     ")"
  57:                                                 ]
  58:                                             ]
  59:                                             []
  60:                                         ]
  61:                                         []
  62:                                     ]
  63:                                     []
  64:                                 ]
  65:                             ]
  66:                             ";"
  67:                         ]
  68:                     ]
  69:                     Statement[
  70:                         ReturnStatement[
  71:                             "return"
  72:                             Expression[
  73:                                 EqualityExpression[
  74:                                     AdditiveExpression[
  75:                                         MultiplicativeExpression[
  76:                                             PrimaryExpression[
  77:                                                 SimpleName[
  78:                                                     "result"
  79:                                                 ]
  80:                                             ]
  81:                                             [
  82:                                                 "*"
  83:                                             ]
  84:                                             MultiplicativeExpression[
  85:                                                 PrimaryExpression[
  86:                                                     SimpleName[
  87:                                                         "y"
  88:                                                     ]
  89:                                                 ]
  90:                                                 []
  91:                                             ]
  92:                                         ]
  93:                                         []
  94:                                     ]
  95:                                     []
  96:                                 ]
  97:                             ]
  98:                             ";"
  99:                         ]
 100:                     ]
 101:                 ]
 102:                 "}"
 103:             ]
 104:         ]
 105:     ]
 106: ]
Obviously this is much more data than is useful and in the next post we’ll show grammar framework features for creating a concise AST that only contains the useful data and in a meaningful structure.

Comments (1) -

August 9, 2010 at 16:19  

Ken Brown United States

I really appreciate you guys taking the time to release this information early. It really make me feel like I will be able to make the most of the grammar functionality when its released. This detailed documentation is a great example of Actipro’s commitment to providing excellent product support and detailed documentation.

Thanks Bill,
Ken Brown –CSMicro Systems

Pingbacks and trackbacks (2)+

Comments are closed