In the previous post, we gave an overview about what we’ll be covering in this multi-part walkthrough of the next generation grammar/AST framework we’re developing for SyntaxEditor, our code editor control.
Today’s post will dive right into a grammar and the building blocks needed to create one.
Grammar
A grammar is a set of production rules for parsing text in a formal language, where there is a single starting symbol from which parsing should begin. All code languages should have some sort of EBNF grammar specification available for reference on the Internet, so that compilers and other products know how to read the code and interpret it.
Symbols
A grammar is comprised of symbols, which are either terminals or non-terminals.
Terminals are the lowest-level fixed-input lexical element used in a grammar production rule. In the case of SyntaxEditor’s framework, terminals equate to tokens that are read in from a lexer.
NonTerminals each have a production rule, that is comprised of some combination of terminal and non-terminal symbol references.
EBNF sample
Take this EBNF for a Simple language FunctionDeclaration non-terminal:
1: FunctionDeclaration:
2: "function" "identifier" "(" FunctionParameterList? ")" Block
That is saying the non-terminal’s name is FunctionDeclaration and a function declaration is built by a concatenation of a function terminal, followed by an identifier terminal, followed by an open parenthesis terminal, followed by an optional FunctionParameterList non-terminal, followed by a close parenthesis terminal, and ending with a Block non-terminal.
EBNF terms
In the Actipro grammar framework, we have the concept of EBNF terms, all of which inherit an EbnfTerm base class. There are multiple classes use to represent various types of EBNF terms, each of which can be used within a production rule for a non-terminal:
- EbnfProduction
- EbnfTerminal
- EbnfNonTerminal
- EbnfQuantifier
- EbnfConcatenation
- EbnfAlternation
EbnfProduction
An EbnfProduction represents a production rule. Each NonTerminal has exactly one EbnfProduction associated with it that defines the EBNF for that non-terminal.
EbnfProduction objects have one child EbnfTerm that provides the EBNF pattern to match for the production rule. Productions can also optionally be assigned custom tree constructors, but we’ll get into that more in several posts from now.
EbnfTerminal
An EbnfTerminal represents the usage of a pre-defined Terminal instance within a production rule.
For instance, a Terminal for an Identifier is likely to be used many places in a grammar. The Terminal for an Identifier is defined once at the top of the grammar declaration and each usage of it within a production is made via separate EbnfTerminal instances that reference the Terminal.
EbnfNonTerminal
An EbnfNonTerminal is the same as an EbnfTerminal except that instead of referencing a Terminal, it is referencing a NonTerminal.
EbnfQuantifier
An EbnfQuantifier wraps a child EbnfTerm and allows you to specify the minimum and maximum number of matches that can be made of the child term.
There are pre-defined helpers for standard Kleene operators such as optional (?), zero-or-more (*), and one-or-more (+).
EbnfConcatenation
An EbnfConcatenation is a sequence of other EbnfTerms that are matched at the same scope level when performing tree construction.
EbnfAlternation
An EbnfAlternation is a way for one of multiple supplied EbnfTerm options to match. In standard EBNF notation, alternations are typically represented with a vertical bar (|).
First look at building a non-terminal
Now that we’ve covered the basics, let’s dig in and see how it works. We’ll write some grammar creation code that builds the FunctionDeclaration non-terminal example above.
Our grammar framework has been designed so that you code the grammar right in C#/VB and no code generation is necessary via an external tool. A grammar must inherit our Grammar class.
A grammar is generally initialized in its constructor with four steps:
- Create terminals
- Create non-terminals
- Configure non-terminal can-match callbacks
- Configure non-terminal productions
Create terminals
The first step is where we simply declare Terminal variables, one for each token that will be supplied to us by a lexer and used in our non-terminal productions.
For the above example EBNF, we have four terminals so let’s declare them:
1: var @closeParenthesis = new Terminal(SimpleTokenId.CloseParenthesis, "CloseParenthesis")
2: { ErrorAlias = "')'" };
3: var @function = new Terminal(SimpleTokenId.Function, "Function")
4: { ErrorAlias = "'function'" };
5: var @identifier = new Terminal(SimpleTokenId.Identifier, "Identifier");
6: var @openParenthesis = new Terminal(SimpleTokenId.OpenParenthesis, "OpenParenthesis")
7: { ErrorAlias = "'('" };
You’ll notice that each variable has a @ in front of it. In C#, the @ character is simply an escape character in case the identifier is a keyword. In our case, we’re going to use it just so later in our non-terminal productions, we can easily tell which variables are terminals and which are non-terminals.
The Terminal constructor takes a token ID, and the string key that identifies the terminal. Note that the SimpleTokenId class contains const token ID values and was generated for us by the Actipro Language Designer application (a new feature for the next version). Token IDs are how the grammar parser matches terminals with tokens that are being read. The tokens come from ILexer instances that you already have defined if you use SyntaxEditor.
Some of the Terminals are also assigned an ErrorAlias. When parsing, if a terminal is expected, but the current token doesn’t match, a parse error will be reported that will use the error alias if it is specified. If not specified, it will fall back to using the terminal’s key.
Now that our four terminals are defined, let’s create our non-terminals.
Create non-terminals
We now declare NonTerminal variables, one for each non-terminal in our grammar.
For our blog post, we will just declare one non-terminal variable even though a real grammar would have a lot more and we are even referencing other non-terminals for FunctionParameterList and Block in our production below.
1: var functionDeclaration = new NonTerminal("FunctionDeclaration");
Configure non-terminal can-match callbacks
Some grammars have more than one non-terminal that start with the same terminal and since our grammar is LL(*)-based, it is not allowed to have two or more non-terminals starting with the same terminal in an alternation together. If we did allow that, then the parser would never have the possibility of matching anything but the first non-terminal in the alternation that started with that terminal. This is called ambiguity and we automatically detect it for you and warn you about it when it is found.
We allow you to easily resolve ambiguity by setting a can-match callback on non-terminals that require it. In that case, we execute your custom callback code to determine if a non-terminal can match rather than the normal way that is ambiguous.
This is a confusing concept but the next post will show an example of why and when to do this.
Configure non-terminal productions
Finally we come to the meat of the grammar. All terminals and non-terminal variables have been declared and we’re ready to get into some EBNF-like notation for our non-terminal production rules that consists of various EbnfTerm instances.
This is where the real magic of our grammar framework happens. We’ve made tons of operator overloads, methods, indexers, and implicit casts so that you can define production rules with a minimal amount of C#/VB code.
Let’s define our example non-terminal production rule now:
1: functionDeclaration.Production = @function + @identifier + @openParenthesis +
2: functionParameterList.Optional() + @closeParenthesis + block;
It is our convention to have @ characters at the start of our terminal variables so they stand out differently than non-terminal variables. The + operator means concatenation. Even though we are referencing Terminal and NonTerminal-based variables here, operator overloading is actually converting each Terminal reference to an EbnfTerminal and each NonTerminal reference to an EbnfNonTerminal behind the scenes. Finally note the Optional method call which wraps the EbnfNonTerminal for FunctionParameterList with an EbnfQuantifier that has a minimum of 0 and a maximum of 1.
Putting it all together
Wasn’t that easy? Of course a real grammar would have a lot more terminals, non-terminals, and production rules but you just follow the same process for them too.
Before we end this post, let’s assume we have the following input to the parser for this grammar:
The default tree constructors will build a full AST for you without you needing to do anything. Of course it includes much more information than is practical, which is why we’ll show in a future post how to optimize tree construction. Anyhow, here is a text representation of the AST node output for that function using the default tree constructor:
1: FunctionDeclaration[
2: "function"
3: "Add"
4: "("
5: []
6: ")"
7: Block[
8: "{"
9: []
10: "}"
11: ]
12: ]
Each AST node in the output automatically stores the proper
start/end offsets for you too making it possible to query which node contains an offset later on.
In our next post in this series we’ll show a complete grammar example based on what we’ve learned so far. Then in posts after that, we’ll show how to customize tree construction and how to process errors.