Building a Parser

Reading Time: 7 minutes

Screen Shot 2019-10-02 at 10.27.29 AM

I recently picked up Crafting Interpreters. The project guides programmers through building their own interpreters for the Lox programming language. I’m writing this blog series as I go through the project. You can see all the posts so far right here.

In Chapters 5 and 6 of Crafting Interpreters, we pick up from the scanner that turned our code into a series of tokens, and we start writing the parser.

A parser takes in a list of tokens from our source code and produces an abstract syntax tree like the ones we talked about when we drew illustrations of intermediate representation strategies.

Usually the AST sits somewhere between the source language and an intermediate representation,(IR), though it’s significantly closer in form to an IR than our source code is. This is where it becomes important for us to adjudicate the grammar of our language: which tokens belong to which expressions. We haven’t gotten to adjudicating meaning yet, but the way we parse the grammar plays a critical role in that.

How would I imagine a parser to be built?

Here is the representation from Crafting Interpreters for Lox’s grammar:

Screen Shot 2019-10-19 at 12.25.27 PM.png

This representation is meant to concisely portray precedence in the grammar: which parts of an expression are evaluated first. For example, in a piece of code like 2 + 3 < 5 + 1, the sums on the left and right are calculated before the comparison < is calculated because addition expressions have precedence over comparison expressions.

The tokens on the left in the above representation list the different types of expressions. The tokens on the right show how each of those expression types nest inside each other. When we inline each expression definition, check out the shape of the resulting diagram:

Parser Tree

It’s a tree.

Back when I wrote the scanner, before I started coding, I asked myself: given what I know about what scanners do, how would I imagine such a thing to be built?

When I do the same for parsing, I think of this tree. I have done some tree traversal in my day for visualizing method calls, querying a graph database, and building an in-memory database with transaction support. Each case employed recursion.

And that’s what the parser does, too. It assumes that each expression is of the type of lowest precedence, and when it finds an indication of higher precedence, it drops into another method for that higher precedence expression. Think of each expression as water coming out of the top of a tiered fountain, and the bowls of the fountain representing each expression type.

Tiered Fountain

An indication of higher precedence (say, a + sign in a comparison expression) causes that part of the expression to spill over from the comparison() method to an addition() method below it. Like so:


private Expr comparison() {
Expr expr = addition();
while (consuming(GREATER, GREATER_EQUAL, LESS, LESS_EQUAL)) {
Token operator = previousToken();
Expr right = addition();
expr = new Expr.Binary(expr, operator, right);
}
return expr;
}
private Expr addition() {
Expr expr = multiplication();
while (consuming(MINUS, PLUS)) {
Token operator = previousToken();
Expr right = multiplication();
expr = new Expr.Binary(expr, operator, right);
}
return expr;
}

view raw

Parser.java

hosted with ❤ by GitHub

So on down the fountain all the way to the expression type with the highest precedence, primary().

Notice, though, that our tree has ‘expression’ both on the bottom and the top. This is the recursive element: larger groupings of expressions can contain smaller groupings of expressions, resulting in a parse tree in which methods call themselves.

Precedence vs. Associativity

Precedence refers to which operator is evaluated first in an expression that composes multiple different operators. Associativity refers to which operator is evaluated first in an expression that composes multiple of the same operator.

In the expression 2 / 1 + 1 precedence determines whether the answer is 1 or 3.

In the expression 2 / 1 / 4, associativity determines whether the answer is one half or 8.

Different programming languages have different precedence and associativity rules, and you can look up precedence and associativity tables for them if you don’t feel like figuring it out by typing in expressions and seeing what comes back. Here’s a table for C and C++ (expressions are listed in order of precedence):

Screen Shot 2019-11-03 at 9.04.28 AM.png

We’re going to talk more about precedence and associativity in the next post, but it makes sense to introduce associativity alongside precedence because the contrast between them helps cement in the memory what each one is doing.

Visualizing Order of Operations

The parser takes in a list of tokens and spits out an expression, which can have other expressions that will be evaluated first nested inside it. We could visualize this several different ways. I implemented the two discussed in Crafting Interpreters. 

The first prints the AST as a string that looks a lot like the Scheme programming language. In Scheme, each expression appears in its own set of parentheses, with the operator first and the operands (arguments) afterward. That is, we don’t have to parse Scheme into a data representation: it is itself a data representation. This, in addition to its use as the exemplar in SICP, contributes to its popularity as a teaching language for compiler design.

Here’s an example of our Scheme-like string:

Screen Shot 2019-11-03 at 9.21.04 AM.png

Here, I’m typing in expressions and printing both the token list and the abstract syntax tree representation. If you don’t write Scheme, here’s the skinny so you can understand this image: the operator comes first, followed by the operands. Narrowest set of parentheses is evaluated first.

In Lox, mathematical operations are left-associative. So when we do 4 + 3 + 2 + 1, Lox evaluates 4 + 3 first, then adds the result to 2, then adds the sum of all those to 1. When we do 9 / 8 / 7 / 6, 9 is divided by 8, the quotient of that by 7, and the quotient of that by 6.

In Lox, 2 / 1 / 4 is one half.

Here are the same operations with the AST represented in Reverse Polish Notation, with the operator at the end of the expression. This constituted one of the exercises at the end of Chapter 5, I suspect to encourage folks to practice a programming pattern that we used to implement the Scheme-like AST representation.

Screen Shot 2019-11-03 at 9.31.06 AM.png

I thought about making a third representation using GrahViz because I like pictures. I decided against this, though, because my goal in this project is to learn about how to write interpreters, and futzing with the GraphViz dependency to do the same pattern a third time that I had already done twice doesn’t contribute to that goal. As we’ve discussed at length, I like to dig holes, and using projects to focus my learning helps me recognize when I’m getting out the shovel so I can force myself to put it away.

So, rather than digging that hole, in the next post we’ll move on to talk some more about abstract syntax trees.

If you liked this post, you might also like:

The Listening Series (or maybe you’ll hate it, but I’m proud of it anyway)

The series about reducing job interview anxiety (especially for folks with a little experience)

This start to a new series about what you can do as an individual to align your engineering career with your values

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.