In the previous post, we optimized the tree construction output of our Simple language to be very concise. The next step in building a grammar is to make sure that it properly handles errors. This grammar framework is intended to be used with SyntaxEditor, our code editor control, so we have to assume that most of the time the document’s code passed to our grammar parser will be in an invalid state, because the user is continuously typing and modifying it.
In today’s post we will look at the various callbacks that are available to you, probably the most important of which are error handling callbacks. We’ll also dig into error handling options.
What is a callback?
As we’ve seen in the previous posts in this series, our entire grammar is built directly in C# or VB code. We do not do code generation like a lot of other parser generators do. A benefit of this is that you can interact directly with objects in the grammar.
One way to interact with objects is to assign callbacks to them. All EbnfTerm-based objects support four callbacks:
- Initialize
- Success
- Error
- Complete
And as shown in earlier posts, NonTerminal objects can be assigned a can-match callback.
Callbacks are simply delegates that get called when certain events occur. You can point them to methods you declare, or supply lambda expressions instead.
Let’s look at each of the five callbacks.
EbnfTerm callbacks
Initialize and Complete callbacks
The Initialize callback is called right before an EbnfTerm is about to be parsed. The Complete callback is called right after an EbnfTerm has been parsed. Thus they are always paired.
It is important to note that Complete is called regardless of whether the term’s parsing succeeded or failed.
Success and Error callbacks
The Success callback is called right after an EbnfTerm is parsed successfully. If the EbnfTerm was not parsed successfully, the Error callback is called instead. Thus every term the parser attempts will fire exactly one of its Success or Error callbacks.
The Success and Error callbacks occur immediately before the Complete callback does.
Summary of EbnfTerm callbacks
A term that is successfully parsed will offer this sequence of callbacks:
- Initialize
- (parsing attempt here)
- Success
- Complete
A term that is not successfully parsed will offer this sequence of callbacks:
- Initialize
- (parsing attempt here)
- Error
- Complete
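The two sequences above can be sketched with a small toy model. This is only an illustration of the ordering: TermModel and CallbackOrderDemo are hypothetical stand-ins written for this post, not types from the grammar framework.

```csharp
using System;
using System.Collections.Generic;

// Toy stand-in for a term with four callbacks; not the framework's EbnfTerm.
public sealed class TermModel {
    public Action Initialize = () => { };
    public Action Success = () => { };
    public Action Error = () => { };
    public Action Complete = () => { };

    // Runs the callbacks around a parsing attempt in the order described above.
    public bool Parse(Func<bool> attempt) {
        Initialize();
        bool matched = attempt();
        if (matched) Success(); else Error();
        Complete(); // fires whether the attempt succeeded or failed
        return matched;
    }
}

public static class CallbackOrderDemo {
    // Returns the names of the callbacks in the order they fired.
    public static List<string> Trace(bool attemptSucceeds) {
        var log = new List<string>();
        var term = new TermModel {
            Initialize = () => log.Add("Initialize"),
            Success = () => log.Add("Success"),
            Error = () => log.Add("Error"),
            Complete = () => log.Add("Complete")
        };
        term.Parse(() => attemptSucceeds);
        return log;
    }
}
```

Calling CallbackOrderDemo.Trace(true) yields Initialize, Success, Complete, while Trace(false) yields Initialize, Error, Complete.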
Definitions
The Initialize, Success, and Complete callbacks have this definition:
public delegate void ParserCallback(IParserState state);
The callback is passed an IParserState that gives you access to look-ahead tokens, custom data, and the matches that have been made at the current scope level. In any of these callbacks you can update the custom data, or even modify the matches collection if you wish.
Custom data can be anything you wish. Perhaps as you traverse through certain non-terminals, you want to maintain a stack of which ones you’ve visited. Your custom data could contain such a stack. In the Initialize callback for the non-terminals you wish to track, you could push an item on the stack. In the Complete callback for the non-terminals you wish to track, you could pop an item off the stack.
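A minimal sketch of that stack idea follows. StateModel is a hypothetical stand-in that models only a custom data slot, and the CustomData property name is an assumption made for illustration, so check the IParserState documentation for the real member. Enter is what an Initialize callback would do and Leave is what a Complete callback would do.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the parser state; only a custom data slot is modeled.
public sealed class StateModel {
    public object CustomData { get; set; }
}

public static class NonTerminalTracker {
    // Returns the stack kept in custom data, creating it on first use.
    private static Stack<string> GetStack(StateModel state) {
        var stack = state.CustomData as Stack<string>;
        if (stack == null) {
            stack = new Stack<string>();
            state.CustomData = stack;
        }
        return stack;
    }

    // Body for an Initialize callback: record that a non-terminal was entered.
    public static void Enter(StateModel state, string name) {
        GetStack(state).Push(name);
    }

    // Body for a Complete callback: leave the non-terminal, matched or not.
    public static void Leave(StateModel state) {
        GetStack(state).Pop();
    }

    // Innermost non-terminal currently being parsed, or null at the top level.
    public static string Current(StateModel state) {
        var stack = GetStack(state);
        return stack.Count > 0 ? stack.Peek() : null;
    }
}
```

Popping in Complete rather than Success keeps the stack balanced, because Complete fires after both successful and failed parsing attempts.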
The Error callback has this definition:
public delegate IParserErrorResult ParserErrorCallback(IParserState state);
The Error callback also gets an IParserState passed to it. However, it differs from the others in that it must return an IParserErrorResult object. Since the Error callback is called when an error occurs, this result tells the parser how to proceed. There are options for preventing any errors from being reported and options for whether to continue on as if no error occurred.
The standard set of options is provided by the ParserErrorResults object via static properties:
- Default – Potentially report errors and return a match failure.
- Continue – Potentially report errors but continue on.
- Ignore – Never report errors and continue on.
- NoReport – Never report errors and return a match failure.
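One way to read the four options is as two independent yes/no choices: may an error be reported, and does parsing continue. The sketch below encodes that reading. ErrorResultModel and ErrorResultsModel are hypothetical types for illustration only, and the real IParserErrorResult may expose this information differently.

```csharp
using System;

// Toy model of an error result as two independent choices; not the framework's type.
public sealed class ErrorResultModel {
    public bool MayReportError { get; }
    public bool ContinueParsing { get; }

    public ErrorResultModel(bool mayReportError, bool continueParsing) {
        MayReportError = mayReportError;
        ContinueParsing = continueParsing;
    }
}

public static class ErrorResultsModel {
    // Potentially report errors and return a match failure.
    public static readonly ErrorResultModel Default = new ErrorResultModel(true, false);
    // Potentially report errors but continue on.
    public static readonly ErrorResultModel Continue = new ErrorResultModel(true, true);
    // Never report errors and continue on.
    public static readonly ErrorResultModel Ignore = new ErrorResultModel(false, true);
    // Never report errors and return a match failure.
    public static readonly ErrorResultModel NoReport = new ErrorResultModel(false, false);
}
```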
Sample callback
Callbacks can be assigned with the OnInitialize, OnSuccess, OnError, and OnComplete methods.
This root production shows how an Error callback can be assigned by calling the OnError method and passing it the delegate to use. In this case the method that will be called is AdvanceToDefaultState.
this.Root.Production = functionDeclaration.OnError(AdvanceToDefaultState)
    .ZeroOrMore().SetLabel("decl")
    > Ast("CompilationUnit", AstChildrenFrom("decl"));
What happens is that if an error occurs while parsing FunctionDeclaration, the AdvanceToDefaultState method is called, which does this:
/// <summary>
/// Advances the token reader to the next 'function' token from where parsing
/// can resume.
/// </summary>
/// <param name="state">A <see cref="IParserState"/> that provides information
/// about the parser's current state.</param>
/// <returns>An <see cref="IParserErrorResult"/> value indicating a result.</returns>
private IParserErrorResult AdvanceToDefaultState(IParserState state) {
    state.TokenReader.AdvanceTo(SimpleTokenId.Function);
    return ParserErrorResults.Continue;
}
You can see how it tells the token reader to advance to the next Function token. We have skipped over any potentially “bad” tokens and have gone right to the next token that we know will successfully start a FunctionDeclaration.
The callback returns ParserErrorResults.Continue, which means potentially report an error, but continue on instead of breaking out of the ZeroOrMore quantifier that contains the FunctionDeclaration non-terminal.
Built-in error callbacks
There are also some built-in Error callbacks that you can assign. They don’t do anything other than return the various related ParserErrorResults values:
- OnErrorContinue – Returns ParserErrorResults.Continue.
- OnErrorIgnore – Returns ParserErrorResults.Ignore.
- OnErrorNoReport – Returns ParserErrorResults.NoReport.
This example shows the use of OnErrorContinue, where we will report an error if the semi-colon isn’t matched, but we’ll continue on with parsing as if it were there:
variableDeclarationStatement.Production = @var + @identifier["name"] +
    @semiColon.OnErrorContinue()
    > Ast("VariableDeclarationStatement", AstFrom("name"));
Advanced error handling
Sometimes an error occurs at a reference to a non-terminal that can start with several different terminals. In that case, the parser doesn’t report an error by default since it doesn’t know what it should say. Here’s a perfect example:
returnStatement.Production = @return + expression["exp"].OnErrorContinue() +
    @semiColon.OnErrorContinue()
    > Ast("ReturnStatement", AstFrom("exp"));
Say the input for this production was “return return”. That is invalid because the second return keyword doesn’t fit into an expression. Because Expression can start with numerous terminals, the match fails but nothing is added to the parse errors collection, since the parser doesn’t know what to tell the UI.
We have two options for handling this scenario.
Option 1 – Use an error alias
When a NonTerminal is assigned an ErrorAlias, it will report an error by default if it fails to match. We only want to set error aliases on higher-level non-terminals such as Expression or Statement non-terminals. We can do so like this:
var expression = new NonTerminal("Expression") { ErrorAlias = "Expression" };
That will tell the parser to automatically report an “Expression expected” parse error if Expression fails to match. This is the easiest way to handle this scenario.
Option 2 – Custom error callback
The second option, for when we need more control over the error message, is to use an error callback:
returnStatement.Production = @return + expression["exp"].OnError(ExpressionExpected) +
    @semiColon.OnErrorContinue()
    > Ast("ReturnStatement", AstFrom("exp"));
The error callback can be implemented like this:
/// <summary>
/// Occurs when an expression is expected but not found.
/// </summary>
/// <param name="state">A <see cref="IParserState"/> that provides information
/// about the parser's current state.</param>
/// <returns>An <see cref="IParserErrorResult"/> value indicating a result.</returns>
private IParserErrorResult ExpressionExpected(IParserState state) {
    // Report a custom error, and return a value telling the parser to not report
    // errors and continue on
    state.ReportError(ParseErrorLevel.Error, "Expression should have been here.");
    return ParserErrorResults.Ignore;
}
Note that here we report the error “Expression should have been here.” instead of the “Expression expected” message that comes from option #1. We also return ParserErrorResults.Ignore, which ensures that no other error message is reported and tells the parser to continue on.
Error reporting notes
We’ve now seen how both terminals and non-terminals are capable of reporting parse errors that can be displayed in the user interface. In some scenarios, multiple errors may be reported for a given text offset, which can really confuse the end user. The grammar framework has built-in functionality so that only the first parse error for a given offset is reported, since that is the most important one.
The parse error collection returned in the parse data result back to the document will also be sorted by each error’s location in the document.
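The framework does this filtering and sorting for you. Purely to illustrate the behavior, here is a rough sketch of the same idea using LINQ. ParseErrorModel and ErrorFilter are hypothetical names and are not part of the framework.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy stand-in for a reported parse error.
public sealed class ParseErrorModel {
    public int Offset { get; }
    public string Message { get; }

    public ParseErrorModel(int offset, string message) {
        Offset = offset;
        Message = message;
    }
}

public static class ErrorFilter {
    // Keeps the first error reported at each offset, then orders by offset.
    public static List<ParseErrorModel> FirstPerOffsetSorted(IEnumerable<ParseErrorModel> reported) {
        return reported
            .GroupBy(e => e.Offset)    // each group keeps its errors in reported order
            .Select(g => g.First())    // so First() is the earliest report at that offset
            .OrderBy(e => e.Offset)
            .ToList();
    }
}
```

For example, errors reported at offsets 10, 4, and 10 (in that order) reduce to two errors: the one at offset 4, followed by the first of the two at offset 10.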
NonTerminal can-match callbacks
Can-match callbacks can optionally be assigned to any NonTerminal. Since our grammar is LL(*)-based, each NonTerminal maintains a set of terminals that it knows are able to start it. This is called the “first set”. For instance, a Simple language FunctionDeclaration production always starts with a function terminal. Thus FunctionDeclaration’s first set consists of a single function terminal.
Sometimes you may have an alternation EBNF term with two or more non-terminal references that have intersecting first sets. We see this in the Simple language where both the SimpleName and FunctionAccessExpression non-terminal productions start with Identifier terminals, and the PrimaryExpression non-terminal production is an alternation that contains both of them. This situation is called ambiguity and the grammar will warn you when it detects the scenario so that you can fix it.
When a can-match callback is specified, it effectively overrides the “first set” of the non-terminal. Thus in the Simple language where the ambiguity occurred, the ambiguity is resolved by applying a can-match callback to one of the ambiguous non-terminals.
Definition
A can-match callback has this definition:
public delegate bool ParserCanMatchCallback(IParserState state);
The callback is passed an IParserState and returns a boolean value indicating whether the non-terminal can match with the current state. Logic in the callback generally examines the next several look-ahead tokens. You are able to look ahead all the way to the end of the document if you wish, which is why our grammar is LL(*): the * means unlimited look-ahead.
Sample callback
The Simple language grammar’s FunctionAccessExpression has a can-match callback. This code can be used to assign the callback:
functionAccessExpression.CanMatchCallback = CanMatchFunctionAccessExpression;
And here is the callback implementation:
/// <summary>
/// Returns whether the <c>FunctionAccessExpression</c> non-terminal can match.
/// </summary>
/// <param name="state">A <see cref="IParserState"/> that provides information
/// about the parser's current state.</param>
/// <returns>
/// <c>true</c> if the <see cref="NonTerminal"/> can match with the current state;
/// otherwise, <c>false</c>.
/// </returns>
private bool CanMatchFunctionAccessExpression(IParserState state) {
    return (state.TokenReader.LookAheadToken.Id == SimpleTokenId.Identifier) &&
        (state.TokenReader.GetLookAheadToken(2).Id == SimpleTokenId.OpenParenthesis);
}
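To see why two tokens of look-ahead are enough to separate SimpleName from FunctionAccessExpression, here is a toy model of the decision over a plain token list. TokenKind and LookAheadModel are hypothetical names, and the 1-based look-ahead distance (1 means the next token) is an assumption chosen to mirror the GetLookAheadToken(2) call above.

```csharp
using System;
using System.Collections.Generic;

public enum TokenKind { Identifier, OpenParenthesis, CloseParenthesis, Plus, EndOfDocument }

public static class LookAheadModel {
    // Returns the token 'distance' positions ahead (1 = next token),
    // or EndOfDocument when looking past the end of the list.
    public static TokenKind Peek(IReadOnlyList<TokenKind> tokens, int position, int distance) {
        int index = position + distance - 1;
        return index < tokens.Count ? tokens[index] : TokenKind.EndOfDocument;
    }

    // Same test as the can-match callback: an identifier followed by '('.
    public static bool CanMatchFunctionAccess(IReadOnlyList<TokenKind> tokens, int position) {
        return Peek(tokens, position, 1) == TokenKind.Identifier
            && Peek(tokens, position, 2) == TokenKind.OpenParenthesis;
    }

    // Picks between the two alternatives that both start with an identifier.
    public static string ChooseAlternative(IReadOnlyList<TokenKind> tokens, int position) {
        if (CanMatchFunctionAccess(tokens, position)) return "FunctionAccessExpression";
        if (Peek(tokens, position, 1) == TokenKind.Identifier) return "SimpleName";
        return "NoMatch";
    }
}
```

With the tokens Identifier, OpenParenthesis the model picks FunctionAccessExpression, and with Identifier, Plus it falls through to SimpleName.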
CanAlwaysMatch callback
There is a built-in method called Grammar.CanAlwaysMatch, a can-match callback that always returns true. The next section describes where this callback is useful.
Proper design of iterative productions
Very often a root compilation unit has some other set of non-terminals that repeat within it. In this case we set up a non-terminal EBNF term with an error handler and place it in a ZeroOrMore quantifier like this:
this.Root.Production = functionDeclaration.OnError(AdvanceToDefaultState)
    .ZeroOrMore().SetLabel("decl")
    > Ast("CompilationUnit", AstChildrenFrom("decl"));
What happens here is that if an error occurs in FunctionDeclaration it will advance to the next function token (per above) and will continue on with the next FunctionDeclaration match.
But what happens if at the start of the document we have an invalid token instead, such as an Identifier? FunctionDeclaration doesn’t start with an Identifier terminal. It only starts with a Function terminal. Thus the entire ZeroOrMore quantifier will never be entered and your CompilationUnit AST node output will be empty, even if there are a lot of valid function declarations after that initial Identifier.
We can easily handle this scenario by using the CanAlwaysMatch callback on FunctionDeclaration:
// Make sure FunctionDeclaration will always be examined,
// even if the next token is not 'function'
functionDeclaration.CanMatchCallback = CanAlwaysMatch;
Thus we have forced the “first set” of FunctionDeclaration to be overridden and even tokens like Identifier will cause us to enter FunctionDeclaration. In that scenario, Identifier won’t match with the Function terminal and an error will be reported that indicates ‘function’ expected. This scenario is now properly handled.
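The effect of the override can be sketched with a toy loop that stands in for the ZeroOrMore quantifier. DeclToken and RepetitionModel are hypothetical, and each function token stands for one complete declaration, so this models only the entry decision and the skip-ahead recovery, not real parsing.

```csharp
using System;
using System.Collections.Generic;

public enum DeclToken { Function, Identifier }

public static class RepetitionModel {
    // Counts the declarations found by a ZeroOrMore-style loop.
    public static int CountDeclarations(IReadOnlyList<DeclToken> tokens, bool canAlwaysMatch) {
        int position = 0;
        int count = 0;
        while (position < tokens.Count) {
            bool startsDeclaration = tokens[position] == DeclToken.Function;

            // Without the override, the loop is entered only on a first-set token.
            if (!startsDeclaration && !canAlwaysMatch) break;

            if (startsDeclaration) {
                count++;
                position++;
            } else {
                // Error path: skip ahead to the next 'function' token and continue.
                while (position < tokens.Count && tokens[position] != DeclToken.Function) {
                    position++;
                }
            }
        }
        return count;
    }
}
```

For the tokens Identifier, Function, Function the model finds no declarations when canAlwaysMatch is false and both declarations when it is true.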
Advanced implementations
What about languages such as C# where you could have a using statement, namespace, or type declaration at the root compilation unit level? We’ll apply the same concepts.
Make a new non-terminal called CompilationUnitContent that has an alternation between those other non-terminals. Make the root production call CompilationUnitContent in the same way FunctionDeclaration was called above, with an error handler. Then likewise, we set the CanAlwaysMatch callback on the CompilationUnitContent non-terminal.
The AdvanceToDefaultState method that we use needs to be designed to advance to the next Using, Namespace, etc. token. This is easy as the AdvanceTo method we provide on ITokenReader can accept any number of token ID values.
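As a rough picture of what advancing to any one of several token IDs means, here is a toy version over a list of integer IDs. TokenSkipModel and the ID constants are hypothetical. Whether the real ITokenReader.AdvanceTo also tests the current token is not covered here, so this sketch simply assumes that it does.

```csharp
using System;
using System.Collections.Generic;

public static class TokenSkipModel {
    // Hypothetical token IDs for illustration only.
    public const int Other = 0;
    public const int Using = 1;
    public const int Namespace = 2;
    public const int Class = 3;

    // Returns the index of the first token at or after 'position' whose ID is
    // one of 'targetIds', or tokens.Count when no such token remains.
    public static int AdvanceTo(IReadOnlyList<int> tokens, int position, params int[] targetIds) {
        var targets = new HashSet<int>(targetIds);
        while (position < tokens.Count && !targets.Contains(tokens[position])) {
            position++;
        }
        return position;
    }
}
```

Given the IDs Other, Other, Namespace, Other, Using, a call such as TokenSkipModel.AdvanceTo(tokens, 0, TokenSkipModel.Using, TokenSkipModel.Namespace) stops at index 2, the Namespace token.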
Finally we can use one of the two options listed in the Advanced error handling section above to report a helpful parse error to the end user.
Next steps
Today we’ve covered a lot of ground on callbacks and error handling. You can see the power that our grammar framework has in these areas since grammars are written natively in C# and VB.
In the next post, we’ll apply these techniques to our Simple language grammar so that it always gives us as complete an AST as possible, even when parse errors are present, as will often be the case when editing documents in a code editor control such as SyntaxEditor.