Compilers 4: Exploring a Parser

Reading Time: 10 minutes

In May I took Dave Beazley’s week-long course on compilers. This is the fourth post in a series about that course. Here’s where you can see all the posts so far.

In the last post, we debugged an interpreter together. So far we’ve covered the back end of compilation, which is why all of our exercises began with an abstract syntax tree already in hand. Now we talk about the front end of compilation: generating that syntax tree from the source code.

We have two steps here: tokenization and parsing.

I wrote about building a parser once before while going through the first part of Crafting Interpreters.

Bob’s approach to tokenization shares a similarity with Dave’s: it takes in the source code as a stream of characters and then shoves it through a massive conditional ladder (or, in Crafting Interpreters, a switch statement) to produce a new stream of tokens for building the tree.

Seriously, here’s how we tokenize the text in the compilers class:

import re

# Identifier patterns, defined here so the listing runs standalone: a name
# starts with a letter or underscore, then continues with letters, digits,
# or underscores.
name_start = re.compile(r'[A-Za-z_]')
name_pattern = re.compile(r'[A-Za-z0-9_]')

def tokenize(text):
    n = 0
    line = 1
    while n < len(text):
        # look for patterns/tokens
        if text[n] == "'":
            start = n
            n += 1
            while n < len(text) and text[n] != "'":
                if text[n] == "\\":
                    n += 1
                n += 1
            n += 1
            yield Token('CHAR', text[start:n], line)
            continue
        if text[n] in " \t":
            n += 1
            continue #forget about it
        if text[n] == "\n":
            line += 1
            n += 1
            continue
        if text[n:n+2] == '//':
            start = n
            while n < len(text):
                if text[n] == "\n":
                    break
                n += 1
            # yield Token('COMMENT', text[start:n], line) #no need to parse comment
            continue
        if text[n:n+2] == '/*':
            start = n
            start_line = line
            while n < len(text):
                if text[n:n+2] == "*/":
                    n += 2
                    break
                if text[n] == "\n":
                    line += 1  # keep line numbers accurate inside block comments
                n += 1
            # yield Token('COMMENT', text[start:n], start_line) #no need to parse comment
            continue
        if text[n:n+8] == 'continue':
            yield Token('CONTINUE', 'continue', line)
            n += 8
            continue
        if text[n:n+6] == 'return':
            yield Token('RETURN', 'return', line)
            n += 6
            continue
        if text[n:n+5] == 'break':
            yield Token('BREAK', 'break', line)
            n += 5
            continue
        if text[n:n+5] == 'print':
            yield Token('PRINT', 'print', line)
            n += 5
            continue
        if text[n:n+5] == 'const':
            yield Token('CONST', 'const', line)
            n += 5
            continue
        if text[n:n+5] == 'float':
            yield Token('TYPE', 'float', line)
            n += 5
            continue
        if text[n:n+5] == 'while':
            yield Token('WHILE', 'while', line)
            n += 5
            continue
        if text[n:n+5] == 'false':
            yield Token('FALSE', 'false', line)
            n += 5
            continue
        if text[n:n+4] == 'true':
            yield Token('TRUE', 'true', line)
            n += 4
            continue
        if text[n:n+4] == 'else':
            yield Token('ELSE', 'else', line)
            n += 4
            continue
        if text[n:n+4] == 'bool':
            yield Token('TYPE', 'bool', line)
            n += 4
            continue
        if text[n:n+3] == 'var':
            yield Token('VAR', 'var', line)
            n += 3
            continue
        if text[n:n+3] == 'int':
            yield Token('TYPE', 'int', line)
            n += 3
            continue
        if text[n:n+2] == 'if':
            yield Token('IF', 'if', line)
            n += 2
            continue
        if text[n:n+2] == '<=':
            yield Token('LE', '<=', line)
            n += 2
            continue
        if text[n:n+2] == '>=':
            yield Token('GE', '>=', line)
            n += 2
            continue
        if text[n:n+2] == '==':
            yield Token('EQ', '==', line)
            n += 2
            continue
        if text[n:n+2] == '!=':
            yield Token('NEQ', '!=', line)
            n += 2
            continue
        if text[n] == '<':
            yield Token('LT', '<', line)
            n += 1
            continue
        if text[n] == '>':
            yield Token('GT', '>', line)
            n += 1
            continue
        if text[n] == '+':
            yield Token('PLUS', '+', line)
            n += 1
            continue
        if text[n] == '*':
            yield Token('MULT', '*', line)
            n += 1
            continue
        if text[n] == '/':
            yield Token('DIV', '/', line)
            n += 1
            continue
        if text[n] == '-':
            yield Token('MINUS', '-', line)
            n += 1
            continue
        if text[n] == '=':
            yield Token('ASSIGN', '=', line)
            n += 1
            continue
        if text[n] == '{':
            yield Token('LEFTBRAQUETTE', '{', line)
            n += 1
            continue
        if text[n] == '}':
            yield Token('RIGHTBRAQUETTE', '}', line)
            n += 1
            continue
        if text[n] == '(':
            yield Token('LEFTPAREN', '(', line)
            n += 1
            continue
        if text[n] == ')':
            yield Token('RIGHTPAREN', ')', line)
            n += 1
            continue
        if text[n] == ';':
            yield Token('SEMICOLON', ';', line)
            n += 1
            continue
        if text[n] in '0123456789.':
            start = n
            while n < len(text):
                if text[n] not in '0123456789.':
                    break
                n += 1
            if "." in text[start:n]:
                yield Token('FLOAT', text[start:n], line)
            else:
                yield Token('INT', text[start:n], line)
            continue
        else:
            if name_start.fullmatch(text[n]):
                start = n
                while n < len(text):
                    if not name_pattern.fullmatch(text[n]):
                        break
                    n += 1
                yield Token('NAME', text[start:n], line)
                continue
            else:
                raise RuntimeError(f"{text[n]} not parsed at line {line}")
    yield Token('EOF', 'eof', line)
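
The listing takes the Token class itself for granted. It doesn’t need to be anything fancier than a labeled triple; a named tuple along these lines would do (a sketch, not necessarily what the course code uses):

from typing import NamedTuple

class Token(NamedTuple):
    type: str    # token category, e.g. 'VAR', 'NAME', 'SEMICOLON'
    value: str   # the matched source text
    line: int    # source line number, kept around for error messages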

The tokenizer in CPython (the implementation of Python that you almost certainly use) follows a similar structure, albeit longer and a little more complicated.

The unique, interesting thing I want to point out here is the use of a generator. This function isn’t returning the tokens wholesale: rather, it’s yielding them such that the parsing step can next through them one at a time without having to handle the whole collection at once.

When is that useful? It’s useful when we get to the parsing step.
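
Before we get there, you can watch the laziness in action by pulling tokens yourself with next(). A quick sketch, assuming the tokenize and Token above:

tokens = tokenize("var a int = 2;")
print(next(tokens))   # Token(type='VAR', value='var', line=1)
print(next(tokens))   # Token(type='NAME', value='a', line=1)
# The rest of the source stays unscanned until we ask for more tokens.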

Suppose we have this Wabbit code:

var a int = 2;
var b int = 3;
if a < b {
   print a;
} else {
   print b;
}

The tokenizer gives us a token stream something like this*:

Token('VAR', 'var', 2)
Token('NAME', 'a', 2)
Token('TYPE', 'int', 2)
Token('ASSIGN', '=', 2)
Token('INT', '2', 2)
Token('SEMICOLON', ';', 2)
Token('VAR', 'var', 3)
Token('NAME', 'b', 3)
Token('TYPE', 'int', 3)
Token('ASSIGN', '=', 3)
Token('INT', '3', 3)
Token('SEMICOLON', ';', 3)
Token('IF', 'if', 4)
Token('NAME', 'a', 4)
Token('LT', '<', 4)
Token('NAME', 'b', 4)
Token('LEFTBRAQUETTE', '{', 4)
Token('PRINT', 'print', 5)
Token('NAME', 'a', 5)
Token('SEMICOLON', ';', 5)
Token('RIGHTBRAQUETTE', '}', 6)
Token('ELSE', 'else', 6)
Token('LEFTBRAQUETTE', '{', 6)
Token('PRINT', 'print', 7)
Token('NAME', 'b', 7)
Token('SEMICOLON', ';', 7)
Token('RIGHTBRAQUETTE', '}', 8)
Token('EOF', 'eof', 9)

*They’re called “braquette” instead of “bracket” because this is my compiler that I am writing for my skill development and if I want to be ridiculous then that’s my prerogative.

Now we have to turn that into an abstract syntax tree that we can interpret, something like what we fed to our interpreter in the last post:

Block(
    VariableAssignment(name=a, value=Integer(value=2)),
    VariableAssignment(name=b, value=Integer(value=3)),
    Conditional(if_condition=Comparison(<, a, b), if_block=Block(
        Print(a)
    ), else_block=Block(
        Print(b)
    ))
)
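
Those node types come from the data model we built up in earlier posts. As a rough sketch, with field names assumed from the printed form above, plain dataclasses are enough:

from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Integer:
    value: int

@dataclass
class Comparison:
    op: str           # '<', '>', '==', ...
    left: Any
    right: Any

@dataclass
class Print:
    expression: Any

@dataclass
class VariableAssignment:
    name: str
    value: Any

@dataclass
class Block:
    statements: list

@dataclass
class Conditional:
    if_condition: Any
    if_block: Block
    else_block: Optional[Block] = None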

In order to create those structures, we have to be able to consume the tokens in chunks like this:


Token('VAR', 'var', 2) Token('NAME', 'a', 2) Token('TYPE', 'int', 2) Token('ASSIGN', '=', 2) Token('INT', '2', 2) Token('SEMICOLON', ';', 2)
Token('VAR', 'var', 3) Token('NAME', 'b', 3) Token('TYPE', 'int', 3) Token('ASSIGN', '=', 3) Token('INT', '3', 3) Token('SEMICOLON', ';', 3)
Token('IF', 'if', 4) Token('NAME', 'a', 4) Token('LT', '<', 4) Token('NAME', 'b', 4) Token('LEFTBRAQUETTE', '{', 4) Token('PRINT', 'print', 5) Token('NAME', 'a', 5) Token('SEMICOLON', ';', 5) Token('RIGHTBRAQUETTE', '}', 6) Token('ELSE', 'else', 6) Token('LEFTBRAQUETTE', '{', 6) Token('PRINT', 'print', 7) Token('NAME', 'b', 7) Token('SEMICOLON', ';', 7) Token('RIGHTBRAQUETTE', '}', 8)
Token('EOF', 'eof', 9)

To accomplish that, the parser can look for a beginning token and then next through the tokens until it finds an ending token that marks the end of the given block or statement. In Wabbit, assignments start with var and end with a semicolon, so the parser can look for that*. (Note that Wabbit also allows code to declare a variable without assigning it, as in var a;, so this parsing function accounts for that condition too):

def parse_variable_assignment(tokens):
    tokens.expect('VAR')
    loc = parse_location(tokens)
    typ = tokens.accept('TYPE')

    assignment = tokens.accept('ASSIGN')
    if assignment:
        expr = parse_expression(tokens)
        tokens.expect('SEMICOLON')
        return VariableAssignment(name=loc, value=expr)
    else:
        tokens.expect('SEMICOLON')
        return VariableDeclaration(name=loc, type=typ)

*Note also that Python does not share this property with Wabbit. a = 3 is legal in Python, and lines don’t end in semicolons. Python handles the difference between variables and constants with conventions rather than compiler-level enforcement, and rather than a semicolon to end a statement it looks for a newline (or reads past one after the explicit line continuation \).
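
One piece I haven’t shown is the tokens object those parsing functions receive. It wraps the generator and keeps exactly one token of lookahead, exposing peek and accept (plus expect, which appears later in this post). Here’s a minimal sketch of how such a wrapper might work; the class’s actual TokenStream may differ in details:

class TokenStream:
    '''Wraps the token generator, holding one token of lookahead.'''
    def __init__(self, generator):
        self.generator = generator
        self.lookahead = next(self.generator)   # prime the stream

    def peek(self, toktype):
        '''Report whether the next token is toktype, consuming nothing.'''
        return self.lookahead.type == toktype

    def accept(self, toktype):
        '''Consume and return the next token if it matches toktype, else None.'''
        if self.lookahead.type != toktype:
            return None
        tok = self.lookahead
        self.lookahead = next(self.generator)   # the EOF token keeps us from running dry mid-parse
        return tok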

Parsing a Wabbit conditional statement takes a little more doing than variable assignment for two reasons:

  1. There are multiple options here on structure. Wabbit allows if by itself as well as if...else.
  2. The code within a conditional branch can comprise its own block of multiple statements.

Nevertheless, Wabbit still has clear beginning and ending tokens. The conditional encloses each block in a LEFTBRAQUETTE and a RIGHTBRAQUETTE. So the whole thing starts with an if, the condition comes next, then the block. After the block it could be nothing, or an else.

def parse_if_statement(token_stream: TokenStream):
    '''
    if test { consequence }
    if test { consequence } else { alternative }
    '''
    token_stream.expect('IF')
    condition = parse_expression(token_stream)
    token_stream.expect('LEFTBRAQUETTE')
    if_block = parse_statements(token_stream)
    token_stream.expect('RIGHTBRAQUETTE')
    if token_stream.peek('ELSE'):    # Only 1 token of lookahead.
        # Parsing algorithms are classified by this: in LL(1), LALR(1), LR(2), the number is the lookahead.
        token_stream.expect('ELSE')
        token_stream.expect('LEFTBRAQUETTE')
        else_block = parse_statements(token_stream)
        token_stream.expect('RIGHTBRAQUETTE')
    else:
        else_block = None
    return Conditional(condition, if_block, else_block)

Each structure possesses its own parsing function that detects all the pieces and sets them in place on a data model object like Conditional or VariableDeclaration or VariableAssignment.
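
At the top sits an entry point that wires the two stages together; it’s the parse_source you’ll see in the tracebacks below. The shape is roughly this sketch, where parse_print is a hypothetical helper along the same lines as the functions above:

def parse_source(source):
    tokens = TokenStream(tokenize(source))
    return parse_statements(tokens)

def parse_statements(token_stream):
    '''Parse statements until the end of the file or the end of a block.'''
    statements = []
    while not (token_stream.peek('EOF') or token_stream.peek('RIGHTBRAQUETTE')):
        statements.append(parse_statement(token_stream))
    return Block(statements)

def parse_statement(token_stream):
    '''Dispatch on the lookahead token to the right parsing function.'''
    if token_stream.peek('VAR'):
        return parse_variable_assignment(token_stream)
    if token_stream.peek('IF'):
        return parse_if_statement(token_stream)
    if token_stream.peek('PRINT'):
        return parse_print(token_stream)   # hypothetical helper, not shown in this post
    raise SyntaxError(f'Unexpected token {token_stream.lookahead}')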

This code looks straightforward, and the reason it looks that way, honestly, is that the machinery for helping a client developer figure out the problem when they write invalid Wabbit isn’t that sophisticated. Almost all of it amounts to this one raise line in the expect function that the parsing functions use to consume tokens:

    def expect(self, toktype):
        '''
        Require the next token to be toktype or else raise a SyntaxError.
        Consumes and returns the matching token if successful.
        '''
        tok = self.accept(toktype)
        if tok is None:
            raise SyntaxError(f'Expected {toktype}. Got {self.lookahead}')
        return tok

Why keep it that simple? Well, first of all, because this is an educational project and not a production-quality tool. But second of all, because that one line catches a surprisingly large number of things. If I remove a semicolon from the first line, it catches my error:

Traceback (most recent call last):
  File "/Users/chelseatroy/workspace/learning/compilers_2021_05/try_parser.py", line 85, in <module>
    parsed = parse_source(source4)
  File "/Users/chelseatroy/workspace/learning/compilers_2021_05/wabbit/parse.py", line 206, in parse_source
    model = parse_statements(tokens)     # You need to implement this part
  File "/Users/chelseatroy/workspace/learning/compilers_2021_05/wabbit/parse.py", line 331, in parse_statements
    statement = parse_statement(tokens)
  File "/Users/chelseatroy/workspace/learning/compilers_2021_05/wabbit/parse.py", line 300, in parse_statement
    return parse_variable_assignment(token_stream)
  File "/Users/chelseatroy/workspace/learning/compilers_2021_05/wabbit/parse.py", line 223, in parse_variable_assignment
    tokens.expect('SEMICOLON')
  File "/Users/chelseatroy/workspace/learning/compilers_2021_05/wabbit/parse.py", line 199, in expect
    raise SyntaxError(f'Expected {toktype}. Got {self.lookahead}')
SyntaxError: Expected SEMICOLON. Got Token('VAR', 'var', 3)

And if I remove a bracket from the if block, it catches that too:

Traceback (most recent call last):
  File "/Users/chelseatroy/workspace/learning/compilers_2021_05/try_parser.py", line 85, in <module>
    parsed = parse_source(source4)
  File "/Users/chelseatroy/workspace/learning/compilers_2021_05/wabbit/parse.py", line 206, in parse_source
    model = parse_statements(tokens)     # You need to implement this part
  File "/Users/chelseatroy/workspace/learning/compilers_2021_05/wabbit/parse.py", line 331, in parse_statements
    statement = parse_statement(tokens)
  File "/Users/chelseatroy/workspace/learning/compilers_2021_05/wabbit/parse.py", line 306, in parse_statement
    return parse_if_statement(token_stream)
  File "/Users/chelseatroy/workspace/learning/compilers_2021_05/wabbit/parse.py", line 279, in parse_if_statement
    token_stream.expect('LEFTBRAQUETTE')
  File "/Users/chelseatroy/workspace/learning/compilers_2021_05/wabbit/parse.py", line 199, in expect
    raise SyntaxError(f'Expected {toktype}. Got {self.lookahead}')
SyntaxError: Expected LEFTBRAQUETTE. Got Token('PRINT', 'print', 5)

This implementation has another advantage as a teaching and learning tool: it’s completely synchronous. That makes it possible for a programmer with the source code to place a breakpoint at the beginning, run the code in debug mode on a test case, and then step over each line one by one to understand exactly what’s happening. I’m a huge advocate of this learning method. Please don’t wait until your code stops working to plod through it with the debugger. There’s so much to learn from stepping through working code.
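
If you want to try that with this parser, dropping a breakpoint() call (built into Python since 3.7) above a test case is all it takes:

source = "var a int = 2;"
breakpoint()              # opens pdb: 'n' steps over, 's' steps in, 'p expr' prints, 'q' quits
parsed = parse_source(source)
print(parsed)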

OK, I think we’ll stop there on parsers for now. In the upcoming final post on the compilers course, I’ll talk about code generation (you might have heard folks call it “code gen”, I have to assume, to sound cool). See you then.

If you liked this piece, you might also like:

The rest of the compilers series, of course!

This post about engineering for edge cases, which I think about a lot in the context of compiler design

This piece on documentation, which is scarcely related to compilers but a piece I think might help you nonetheless
