
Parsing a programming language in Rust

I recently rewrote the parsing stage of Oxlip. In fact, this is the third significant refactoring of the parser, and I must admit it took me quite some time to land on a solution I feel comfortable with. Although all the libraries discussed here are viable options for parsing programming languages, each of them had limitations that were not obvious to me at the beginning. My ambition is not to present a holistic view of the ecosystem of parsing libraries in Rust, or a tutorial on writing compilers, but simply to share a journey in the world of parser generators.

Parsing expression grammars

My first experience with PEG parsing dates back to 2014, at a time when I was writing a SQL parser in Ruby with Treetop. Coming from Bison and Flex, the promises of PEG are appealing. The complete grammar is expressed as a composition of regex-like parsing expressions, and the lexical and syntax analysis stages are handled by a single tool and domain-specific language. This means less boilerplate and a much tighter feedback loop when prototyping new programming languages. That sounded like a good enough reason to start Oxlip with a PEG parser.

Pest

Pest does not provide an internal DSL like Treetop does. The grammar is written in a separate file, imported and compiled via a Rust macro. The grammar syntax is straightforward, easy to learn, and has good IDE support (IDEA, VSCode).
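
For context, this is roughly how a grammar file is wired into the Rust code with pest_derive. The file name and the start rule below are hypothetical, but the derive macro and the grammar attribute are the standard Pest API:

use pest::Parser;
use pest_derive::Parser;

// The derive macro reads the grammar at compile time and generates a `Rule`
// enum with one variant per production rule.
#[derive(Parser)]
#[grammar = "oxlip.pest"] // hypothetical path, relative to src/
pub struct OxlipParser;

fn parse_program(input: &str) {
    // `program` stands in for whatever the grammar's start rule is called.
    match OxlipParser::parse(Rule::program, input) {
        Ok(pairs) => println!("{pairs:#?}"),
        Err(error) => eprintln!("{error}"),
    }
}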

Examples of Pest grammar expressions, a.k.a. rules:

literal_num = @{ ASCII_DIGIT+ }

literal_type = { http_status_range | literal_num | literal_str }

object_type = { "{" ~ ( expr_type ~ ( "," ~ expr_type )* )? ~ "}" }

block_comment = _{ "/*" ~ ( block_comment | !"*/" ~ ANY )* ~ "*/" }

Apart from better readability, another benefit of maintaining the grammar outside of the Rust source code is improved debugging. The IDE plugin makes it possible to test the grammar interactively, by submitting input text and checking which rule matches. It definitely helps when prototyping a new language from scratch.

The output of a successful parsing phase is a tree of text spans, i.e. pointers to portions of the original input text. Each span is tagged with the rule that matched that given piece of text. Within Pest, a rule-tagged text span is called a Pair. To build a syntax tree, one has to scan through the tree of Pairs and create the corresponding syntax nodes.

Example of mapping from a Pair to the abstract syntax node representing an Oxlip statement:

impl<T: AsExpr> FromPair for Statement<T> {
    fn from_pair(p: Pair) -> Self {
        let p = p.into_inner().next().unwrap();
        match p.as_rule() {
            Rule::decl => Statement::Decl(p.into_expr()),
            Rule::res => Statement::Res(p.into_expr()),
            Rule::ann => Statement::Ann(p.into_expr()),
            Rule::import => Statement::Use(p.into_expr()),
            _ => unreachable!(),
        }
    }
}

Compared to internal parser generators (e.g. parser combinators), this approach is more verbose and error-prone. The grammar is effectively described in two places that must be kept in sync: the grammar file and the Rust code scanning the parse tree. Any oversight is likely to lead to a runtime panic, e.g. the unreachable branch in the example above. That was not a deal breaker for me at the time. Pest is a solid PEG parser generator implementation that delivers on its promises.

The main reason to look for an alternative was the lack of an error recovery mechanism. Pest does report detailed errors when parsing goes wrong, but there is no way to ignore invalid tokens and resume parsing further. A typical example is to skip tokens until a known separator is found, like a closing parenthesis or a semi-colon. Lack of error recovery is not a problem in many cases and is sometimes even preferred, e.g. when parsing a network protocol. My ambition for the Oxlip compiler is to offer seamless IDE integration without maintaining two separate implementations: a command line interface and a language server. With that objective in mind, smart error recovery is a must.

It is hard to generalize recovery strategies independently of the language grammar and still offer a great developer experience, which explains why error recovery is not a common feature among parser generators. The Chumsky Rust library is one of the rare exceptions. Rewriting the parser using Chumsky was the first major refactoring of the Oxlip source code.

Chumsky

Chumsky is a parser combinator library for parsing PEGs. Parser combinator libraries rely on function composition to combine basic production rules into more complex parsers, and finally the complete grammar. They are rather popular in functional programming languages, like Haskell. A combinator is a function taking one or more parsers as inputs and producing another parser as output. For example, the choice combinator takes a list of alternative parsers and returns a parser that tries them in sequence until a match is found:

let control_token = choice::<_, Simple<char>>([
    text::keyword("if").to(Token::If),
    text::keyword("for").to(Token::For),
    text::keyword("while").to(Token::While),
]);

Points of error recovery can be specified with dedicated combinators, e.g. skip_until or nested_delimiters. As their names suggest, these error recovery combinators skip input characters until a given character or matching delimiter is found.
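
As an illustration, here is a minimal sketch unrelated to the Oxlip grammar, assuming the chumsky 0.9 prelude: a toy expression parser that recovers from garbage inside parentheses by skipping to the matching delimiter and emitting a placeholder node.

use chumsky::prelude::*;

#[derive(Clone, Debug)]
enum Expr {
    Num(f64),
    List(Vec<Expr>),
    // Placeholder node produced when error recovery kicks in.
    Error,
}

fn expr_parser() -> impl Parser<char, Expr, Error = Simple<char>> {
    recursive(|expr| {
        let num = text::int(10)
            .map(|s: String| Expr::Num(s.parse().unwrap()))
            .padded();

        let list = expr
            .separated_by(just(','))
            .delimited_by(just('('), just(')'))
            .map(Expr::List)
            // On failure, skip until the matching ')' and yield Expr::Error
            // instead of aborting the whole parse.
            .recover_with(nested_delimiters('(', ')', [], |_| Expr::Error));

        num.or(list).padded()
    })
}

The parser still reports the error, but downstream stages receive a tree with an Expr::Error node where the broken input was, which is exactly what an IDE needs.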

As mentioned before, one of the benefits of using a PEG parser is the ability to merge lexical and syntax analysis into one grammar. Other parsing approaches first transform a stream of characters into a stream of tokens (e.g. keywords, identifiers, literals, etc.), then construct an abstract syntax tree (AST). PEG makes it easy to build an AST from a stream of characters in one single pass. Similarly, the design philosophy of Chumsky is optimized towards parsing streams of characters, even if the input of a parser can be of any Rust type.

Concrete syntax tree

For IDE integration, a parser must enable source code formatting and refactoring. It is not enough to parse and extract the semantic layer of the source code into an AST. Doing so would discard control characters, whitespace and comments that are relevant to the IDE when manipulating the original source code. The architecture of the syntax analysis in Oxlip is inspired by rust-analyzer. Instead of an abstract syntax tree, the main data structure is a concrete syntax tree (CST). The CST retains all tokens expressed by the language grammar, whereas an AST only retains essential nodes, i.e. the nodes strictly required for evaluation. For instance, parentheses ((, )), brackets ([, ]) and semi-colons (;) are not relevant to an AST but are maintained in a CST.
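
To make the distinction concrete, here is a deliberately simplified sketch of a CST shape (illustrative names only, not the actual Oxlip or rust-analyzer types). Every child, token or sub-node, is kept in source order, so the original text can always be reconstructed from the leaves:

#[derive(Debug, Clone, Copy)]
enum SyntaxKind {
    Object,
    PropertyList,
    Whitespace,
    Comment,
    // ... one variant per rule and token kind
}

#[derive(Debug)]
enum Element {
    Node(Node),
    // Leaves keep their exact source text, including the punctuation and
    // trivia an AST would normally drop.
    Token { kind: SyntaxKind, text: String },
}

#[derive(Debug)]
struct Node {
    kind: SyntaxKind,
    children: Vec<Element>,
}

impl Node {
    // Concatenating the leaves yields the original source verbatim,
    // which is what formatters and refactoring tools rely on.
    fn text(&self) -> String {
        self.children
            .iter()
            .map(|child| match child {
                Element::Node(node) => node.text(),
                Element::Token { text, .. } => text.clone(),
            })
            .collect()
    }
}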

Whitespace and comments are not discarded either, but recognized as trivial nodes, i.e. nodes without meaning. Trivial nodes do not influence the semantics of the program but must be retained for source code manipulation purposes. Unfortunately, PEG parsers struggle with grammars where it is valid for whitespace and comments to appear between any two other tokens. This is the case for C, C++, Java, Rust and most general purpose languages. Pest actually provides two special rules, WHITESPACE and COMMENT, to specify the parsing expressions to be ignored between any two rules. Without those special rules, writing the parsing expressions by hand would make the grammar unreadable.

Here is the relevant extract from the Pest documentation:

If both WHITESPACE and COMMENT are defined, this grammar:
a = { b ~ c }
is effectively transformed into this one behind the scenes:
a = { b ~ WHITESPACE* ~ (COMMENT ~ WHITESPACE*)* ~ c }

Chumsky does not offer any special support for matching whitespace and comments. Therefore, we have to forget about expressing our language with a single grammar at the level of a stream of characters. Whitespace and comments must first be recognized as special tokens during lexical analysis, then put aside before syntax analysis. Writing two PEG parsers, one feeding into the other, technically works. Actually, there is an example in the Chumsky repository for a simplified Rust-like language.

What soon became intractable with this approach was the requirement to keep every token and generate a CST instead of an AST. This is not to say that parser combinators cannot generate CSTs, but the combinators built into Chumsky are strongly biased towards ASTs, i.e. towards discarding non-essential tokens. That is understandable, as most use cases for parser generators relate to AST generation; our requirement is more the exception than the rule. On top of that, Rust is not Haskell. As a parser combinator library implemented in Rust, Chumsky has to deal with a significant amount of complexity (not too dissimilar to the early days of C++ template metaprogramming) that tends to leak into the client code as soon as we go beyond the built-in combinators. The further I moved away from the canonical use case for Chumsky, the more I was fighting the abstraction rather than building on top of it.

DIY parsing

As a PEG parser combinator library, Chumsky produces a recursive descent parser. We have reached a state where we must maintain two grammars, one for the lexical analysis and another one for the syntax analysis. The lexical analysis parser generates the stream of tokens from the input characters, but a recursive descent parser is overkill for this purpose: a lexer based on a deterministic finite-state machine would be easier to write and more efficient. The syntax analysis parser generates the CST from the stream of tokens but, as explained in the previous section, it is overly complex and hard to maintain. It is time to rethink the parser architecture once again, and this time leave PEG behind.

Logos

The first step was to replace the lexical analysis based on Chumsky with a more fit-for-purpose implementation. Logos is a Rust lexer library. It takes an enumeration of token kinds with their associated patterns, given either as an exact match or a regular expression, and generates (at compile time) an optimized deterministic finite-state machine capable of turning a stream of characters into a stream of tokens.

Here is an extract of the lexer specification for Oxlip:

#[derive(Logos, Debug, PartialEq, Eq, Hash, Clone, Copy)]
#[logos(subpattern ident = r"[0-9a-zA-Z$_-]")]
pub enum TokenKind {
    #[regex(r"[ \t\r\n]+")]
    Space,
    #[regex(r"//[^\r\n]*[\r\n]*")]
    CommentLine,
    #[regex(r"/\*([^*]|\*[^/])*\*/")]
    CommentBlock,
    ...
    #[token("get")]
    MethodGet,
    #[token("put")]
    MethodPut,
    ...
    #[regex("[a-zA-Z_](?&ident)*")]
    IdentifierValue,
    ...
    #[token("::")]
    OperatorDoubleColon,
    #[token("->")]
    OperatorArrow,
    #[regex(r"#[^\r\n]*[\r\n]*")]
    AnnotationLine,
    #[regex("`[^`]*`")]
    AnnotationInline,
}

The library is extremely easy to use and provides just the right abstraction for what a lexer is supposed to do. The integration of string interning is also much simpler. The stable revision of Chumsky is stateless, meaning string interning requires a second pass over the stream of tokens. There is no such limitation with Logos. It is trivial to iterate over the stream of tokens, parsing literals and interning strings on the fly.

Here is how we iterate over the stream of tokens in the Oxlip lexer source code:

pub fn tokenize(loc: Locator, input: &str) -> (Option<TokenList<Token>>, Vec<ParserError>) {
    let lexer = TokenKind::lexer(input).spanned();
    let mut list = TokenList::new(loc.clone());
    let mut errors = Vec::new();

    for (result, range) in lexer {
        match result {
            Ok(kind) => {
                let slice = &input[range.clone()];
                let value = match kind {
                    TokenKind::LiteralNumber => TokenValue::Number(parse_number(slice)),
                    TokenKind::LiteralString => {
                        TokenValue::Symbol(list.register(parse_quoted_string(slice)))
                    }
                    TokenKind::LiteralHttpStatus => {
                        TokenValue::HttpStatus(parse_http_status(slice))
                    }
                    ...
                    _ => TokenValue::None,
                };
                let token = Token(kind, value);
                list.push(token, range);
            }
            Err(_) => {
                let span = Span::new(loc.clone(), range);
                errors.push(ParserError::new(span));
            }
        }
    }

    (Some(list), errors)
}

As a matter of fact, rewriting the lexer from Chumsky to Logos was a straightforward experience. It boils down to an enumeration type with fairly readable annotations. It is simple to extend and test.

Recursive descent parser

The next significant step was getting rid of Chumsky altogether. This is not the place to go into too much detail about writing a recursive descent parser by hand. The important takeaway is that it is likely easier than you think. Once the structure of the grammar is clear, it is a very systematic process of mapping production rules to functions. Some computer science problems require surprisingly little boilerplate, and writing a recursive descent parser is one of them. Not having to constrain ourselves to a parsing framework imposed by an external library actually leads to a more fit-for-purpose implementation and more readable code. Strategies for error recovery can be carefully inserted at the right places to fit the language semantics.

Here are three production rules expressed as bare Rust functions (comma, property list and object):

pub fn parse_comma<T: Core>(c: &mut Context<T>, s: Cursor) -> ParserResult {
    parse_token(c, s, TokenKind::ControlComma)
}

pub fn parse_property_list<T: Core>(c: &mut Context<T>, s: Cursor) -> ParserResult {
    let ns = &mut Vec::new();
    let s = repeat(c, s, ns, &[parse_expression, parse_comma]);
    Ok((s, c.compose(SyntaxKind::PropertyList, ns)))
}

pub fn parse_object<T: Core>(c: &mut Context<T>, s: Cursor) -> ParserResult {
    let (s, n0) = parse_token(c, s, TokenKind::ControlBraceLeft)?;
    let (s, n1) = parse_property_list(c, s)?;
    let (s, n2) = parse_token(c, s, TokenKind::ControlBraceRight)?;
    Ok((s, c.compose(SyntaxKind::Object, &[n0, n1, n2])))
}

The last important aspect of the parser refactoring journey has to do with exponential backtracking. Backtracking means going back to a previous position in the input stream after a production rule fails, in order to try a different alternative. The ability to backtrack is a key aspect of top-down parsers like recursive descent and is not specific to handwritten parsers. Depending on how the grammar is written, operator precedence and nested expressions can lead to exponential backtracking, i.e. parsing the same input tokens an exponential number of times relative to the input length. It was not apparent at first, but even our previous implementation based on Chumsky exhibited exponential behavior. The common solutions are precedence climbing, Pratt parsing, or memoization (Packrat). There was unfortunately no support for those techniques in the stable release of Chumsky at that time. As far as I know, support is planned for the next major release (1.0).
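
To illustrate the idea behind precedence climbing, here is a self-contained toy evaluator (not part of the Oxlip code base): operators are assigned binding powers and a single loop folds them left to right, so no portion of the input is ever parsed twice.

#[derive(Clone, Copy)]
enum Tok {
    Num(i64),
    Plus,
    Star,
    Eof,
}

// Binding power of an infix operator, or None if the token is not an operator.
fn binding_power(t: Tok) -> Option<u8> {
    match t {
        Tok::Plus => Some(1),
        Tok::Star => Some(2),
        _ => None,
    }
}

// Parse (and evaluate) an expression whose operators bind at least as tightly as `min_bp`.
fn parse_expr(toks: &[Tok], pos: &mut usize, min_bp: u8) -> i64 {
    // Primary expression: only number literals in this toy grammar.
    let mut lhs = match toks[*pos] {
        Tok::Num(n) => { *pos += 1; n }
        _ => panic!("expected a number"),
    };
    while let Some(bp) = binding_power(toks[*pos]) {
        if bp < min_bp {
            break;
        }
        let op = toks[*pos];
        *pos += 1;
        // The right-hand side only accepts tighter operators, which encodes
        // both precedence and left associativity without any backtracking.
        let rhs = parse_expr(toks, pos, bp + 1);
        lhs = match op {
            Tok::Plus => lhs + rhs,
            Tok::Star => lhs * rhs,
            _ => unreachable!(),
        };
    }
    lhs
}

For instance, parse_expr(&[Tok::Num(1), Tok::Plus, Tok::Num(2), Tok::Star, Tok::Num(3), Tok::Eof], &mut 0, 0) evaluates to 7, with each token visited exactly once.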

Memoization

The purpose of memoization is to avoid trying a production rule more than once at the same position. To do so, the result of applying a production rule (success or failure) is simply kept in memory. A Packrat parser therefore consumes more memory than a plain recursive descent parser, but guarantees linear time parsing. It is worth mentioning that only mutually recursive rules must be memoized, which limits the memory consumption in practice.

When writing a recursive descent parser by hand, adding support for memoization is trivial. As illustrated in the Oxlip source code, one only needs to maintain a cache of parsing results (ParserResult<G>) indexed by the position in the input stream (Cursor) and the production rule identifier (G::Tag):

pub struct Context<T: Core, G: Grammar> {
    ...
    cache: HashMap<(Cursor, G::Tag), ParserResult<G>>,
}

Mutually recursive production rules must be guarded by a function (memoize) that checks for a known parser result in the cache before proceeding further:

pub fn parse_expression<T: Core>(c: &mut Context<T>, s: Cursor) -> ParserResult {
    memoize(ParserTag::Expression, c, s, |c, s| {
        parse_recursion(c, s).or_else(|_| parse_relation_kind(c, s))
    })
}
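
The memoize guard itself is not shown above. A rough sketch of what it can look like, with the generic parameters simplified to match the parse functions, assuming Cursor and ParserTag are Copy and hashable, ParserResult is cheap to clone, and the cache from the Context shown earlier is reachable from the helper:

pub fn memoize<T: Core, F>(tag: ParserTag, c: &mut Context<T>, s: Cursor, rule: F) -> ParserResult
where
    F: FnOnce(&mut Context<T>, Cursor) -> ParserResult,
{
    // If this rule has already been tried at this position, reuse the outcome.
    if let Some(cached) = c.cache.get(&(s, tag)) {
        return cached.clone();
    }
    // Otherwise run the rule once and remember its result, success or failure alike.
    let result = rule(c, s);
    c.cache.insert((s, tag), result.clone());
    result
}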

Conclusion

Some abstractions promise to make our life easier, and PEG parser generators and parser combinators are good examples. PEG parser generators are easy to learn and help to iterate quickly on a single grammar based on regex-like expressions. Parser combinators capture the boilerplate of writing top-down parsers while keeping the flexibility of the host language. When deciding on a parsing technique, the properties of the language grammar are important, but we should not forget the requirements coming from the broader user experience:

  • in what context is the compiler going to be used? Are we targeting a command line interface or an IDE?
  • should we reject invalid input as soon as possible, or is error recovery important?
  • besides pure evaluation, what information must be captured and maintained in the syntax tree to assist with interactive source code manipulation?

A 2021 survey showed that the majority of compilers for the top-10 general purpose languages (e.g. GCC, Clang, V8, TypeScript, Java, Go) were using a handwritten recursive descent parser. As software engineers, we sometimes go shopping for external libraries too quickly. As far as parsing is concerned, the simple handwritten solution is often the most appropriate.

This post is licensed under CC BY 4.0 by the author.