We get asked all the time when our popular .NET Languages Add-on and Web Languages Add-on for the WinForms SyntaxEditor will be ported over to WPF and Silverlight. The missing feature area that we need before we can do that is our own grammar/AST framework.
What is a grammar/AST framework?
The WPF and Silverlight SyntaxEditor controls already have an IParser interface in place along with the ability to automatically call a language’s registered parser in a worker thread after any text change. Since we use interfaces for our design, any first- or third-party parser can be easily used with our framework. The IParser’s purpose is to parse through document text and produce some parse data results, which could be an AST (abstract syntax tree), syntax error list, etc.
A grammar framework is the ability to define a parser, generally via some form of EBNF. In some grammar implementations the parser is code-generated and in other implementations a parser is more like a regex engine and interprets input based on dynamic grammar data.
The most common output from a parser is an AST, which is a hierarchy of syntax tree nodes that describe the contents of the document.
Use third-party parsers?
We could make our language add-ons rely on third-party parser products like ANTLR (we already have a free add-on that enables integration with ANTLR parsers), however many of our customers require that code they purchase from us is not reliant on third-party functionality. Also when we have our own custom grammar/AST framework, we can add more customized editor-related functionality than what is possible with a third-party parser framework.
SyntaxEditor for WinForms’ grammar/AST model
The WinForms SyntaxEditor’s grammar/AST model is what is used by the language add-ons there. It is a recursive descent parser with infinite look-ahead capabilities. While it works great and has many features, there are several drawbacks that we are looking to improve upon with our newer design that will be prototyped out in the WPF/Silverlight SyntaxEditors:
- The grammar uses a combination of XML, EBNF, and output language code blocks, making it somewhat spaghetti-like and tricky to read.
- AST nodes output by the parser are concrete classes that can be generated by the grammar, but still they take a bit of work to get configured.
- A lot of output language code block code is needed to ensure AST nodes are constructed and have the correct offset ranges.
- Since the parsers are code-generated, there is no way for our language add-on customers to modify how the language add-on parsers behave without purchasing the add-on source code and regenerating the parsers with changes made to the grammar files.
What we’re prototyping out
We’ve had a good look at a lot of grammars and parser generators over the years and we’ve been considering ideas on how to improve our own grammar/AST framework moving forward. We’ve already started prototyping out a number of ideas to see if they pan out well.
Our plan is to stick with a recursive descent parser with infinite look-ahead capabilities like in WinForms. Here’s some preliminary ideas of how we’re trying to improve upon the WinForms’ grammar/AST framework and address each of the drawbacks above.
Grammar is completely written in your output language (C#/VB)
One thing we want to achieve if possible is allowing you to write your grammar using EBNF-like notation but in your actual language you write code in. We can actually achieve a pretty nice, compact syntax that is EBNF-like and uses language features like implicit type casting, operator overloading, etc. Some of this is already working in our prototype.
Here’s a grammar snippet written in C# that parses an XML start tag:
1: startTag.Production = tStartTagStartDelimiter + tStartTagName["tagName"] +
2: (
3: tStartTagAttribute["attrName"] +
4: tStartTagAttributeStringValueStartDelimiter +
5: tStartTagAttributeStringValueText.Optional() +
6: tStartTagAttributeStringValueEndDelimiter
7: > AstFrom("attrName")
8: ).ZeroOrMore().SetLabel("attrs") +
9: tStartTagEndDelimiter
10: > Ast("StartTag", AstFrom("tagName"), Ast("Attributes", AstValuesFrom("attrs")));
Any variable that has a t at the start of its name is a Terminal, and will match a token. Terminal and NonTerminal objects used in the grammar have already been defined in code above this snippet. Terminals and NonTerminals implicitly type convert to EbnfTerm-based objects when used in a production rule like above. Any EbnfTerm has methods on it such as ZeroOrMore, OneOrMore, and Optional that let you implement Kleene operators. The + operator is used for concatination of EbnfTerms and the | operator is used to achieve alternation. Thus we have achieved EBNF-like syntax but in C# and VB.
AST nodes will be completely generic
In this prototype design, we’ve kept AST nodes truly generic. Basically they have a Key value and a list of child nodes, that’s it. There will likely end up being a couple more properties on them (such as text range) but our goal is to keep the AST nodes as simple as possible.
We have some ideas to allow concrete classes to be built that will “wrap” the generic AST nodes and provide a more consumable object model that is specific to a language.
AST node construction is easy
You’ll notice at the end of the code snippet above we have this:
1: > Ast("StartTag", AstFrom("tagName"), Ast("Attributes", AstValuesFrom("attrs")));
That portion of the production rule is saying how to perform AST node construction. If you don’t specify a construction rule then an AST node hierarchy will automatically be built for you based on what is matched in the non-terminal. However in most cases, the default hierarchy that is created contains more information than what you really need or care about. Therefore most productions should have customized tree construction rules associated with them.
In this snippet’s scenario, we are saying we want this start tag non-terminal to build an AST node with value StartTag and whose first child node comes from the match made by the EbnfTerm that was given a label tagName. Since in the production rule a Terminal was used to match the tag name, the terminal’s value (the actual tag name) will be the value of the AST node that is created under the StartTag node. Next we say that we wish to have an Attributes AST node as the second child and its children will be all the results pulled from the EbnfTerm that was labeled attrs.
Given this input:
1: <Block Attr="Value" Attr2="Value2">
We get this string representation of AST node output:
1: StartTag[
2: "Block"
3: Attributes[
4: "Attr"
5: "Attr2"
6: ]
7: ]
This sort of AST node construction makes it super easy to create meaningful output from your parser in a minimal amount of time.
Code injection anywhere
You may be thinking: Hey this works great with one token look-ahead. But what about more complex scenarios where we need to peek ahead multiple tokens to determine how to process something?
What we’ve done is added code injection points that can be inserted on any EbnfTerm. For instance you can inject a delegate that executes when any EbnfTerm is about to be matched, or when it succeeds or fails. Since they are delegates, you can write them inline if you wish and can even reuse them multiple places in the grammar.
You can use code injection to tweak the AST node output too.
How you can help
As you can see we have a lot of ideas for the prototype we’re building. We want to hear your ideas too. What do you like and not like about these ideas? Do you have suggestions for other features you’d like to see?
Please post or email your comments to us. Now is the time to get them in, while we’re still in a design phase.