Compilers 3: Let’s Debug an Interpreter Together!

Reading Time: 10 minutes

In May I took Dave Beazley’s week-long course on compilers. This is the third post in a series about that course. Here’s where you can see all the posts so far.

In the last post, I demonstrated implementing a code printer. We started from the abstract syntax tree and worked backwards in the compilation process to the source code that would compile to such a tree. I then made you a promise:

We could construct a rudimentary interpreter with the exact same format as this source code printer. It would use the same data model, and it could even use the same sort of giant conditional, as to_source. The difference: instead of arranging stringifications of the data model inputs, it would execute the Python version of the commands encoded in each data model input. If we treat Python as the target for the instruction set, then the interpreter code as described would constitute the right part of the mountain diagram shown above.

– That’s my story, and I’m stickin’ to it.

Does this really work? Can we take the structure of the code we used to travel backwards through the compilation process, change out a few pieces, and then travel forwards through the compilation process to the actual execution of the code?

Five rabbis discuss the implementation of Jewish ritual and ethics described in the Talmud. Hopefully our source code is sufficiently unambiguous that we can write a program to interpret it rather than assembling five rabbis.
Five rabbis discuss the implementation of Jewish ritual and ethics described in the Talmud. Hopefully our source code is sufficiently unambiguous that we can write a program to interpret it rather than assembling five rabbis.

If we could, then we might take an abstract syntax tree like this one:

some_arithmetic = Block(statements=[
        Print(value=BinaryOperator('*', BinaryOperator('+', Integer(2), Integer(3)), Negative(value=Integer(4)))),
        Print(value=BinaryOperator('/', BinaryOperator('-', Float(2), Float(3)),     Negative(value=Float(4.0))))
    ])

interpret_program(some_arithmetic)

and expect it to interpret the results of those arithmetic operations, then print the results like so:

>>> -20
>>> 0.25

For reference, that same tree run through our printer from the last post produces this source code, which is what interprets to the two results above:

print 2 + 3 * -4;
print 2.0 - 3.0 / -4.0;

What would have to change about our printer code design to make this work?

The result ends up being, very little. Similar to the to_source function, the interpret_program function can operate by switching on each of the different types, expressions, and statements in our data model. The difference is that, for each one, instead of generating a string to express it in Wabbit source code, we…do the thing in Python that this piece of Wabbit is trying to do.

Here’s my version in full for your gasping convenience (you can also see it in this gist, if you prefer to have syntax highlighting):

# Top level function that interprets an entire program. It creates the
# initial environment that's used for storing variables.

def interpret_program(model):
    # Make the initial environment (a dict).  The environment is
    # where you will create and store variables.
    env = {'constants':{}, 'variables':{}}
    for structure in model.statements:
        interpret(structure, env)

# Internal function to interpret a node in the environment.  You need
# to expand to cover all of the classes in the model.py file.

def interpret(node, env):
    # Expand to check for different node types
    if isinstance(node, Integer):
        return int(node.value)
    elif isinstance(node, Float):
        return node.value
    elif isinstance(node, BinaryOperator):
        if node.op == '*':
            return interpret(node.left, env) * interpret(node.right, env)
        elif node.op == '/':
            return interpret(node.left, env) / interpret(node.right, env)
        elif node.op == '+':
            return interpret(node.left, env) + interpret(node.right, env)
        elif node.op == '-':
            return interpret(node.left, env) - interpret(node.right, env)
    elif isinstance(node, Comparison):
        if node.op == '>':
            return interpret(node.left, env) > interpret(node.right, env)
        elif node.op == '<':
            return interpret(node.left, env) < interpret(node.right, env)
        elif node.op == '==':
            return interpret(node.left, env) == interpret(node.right, env)
        elif node.op == '>=':
            return interpret(node.left, env) >= interpret(node.right, env)
        elif node.op == '<=':
            return interpret(node.left, env) <= interpret(node.right, env)
        elif node.op == '!=':
            return interpret(node.left, env) != interpret(node.right, env)
    elif isinstance(node, Print):
        print(interpret(node.value, env))
    elif isinstance(node, VariableAssignment):
        env['variables'][str(node.name)] = interpret(node.value, env)
        return Unit()
    elif isinstance(node, ConstantAssignment):
        env['constants'][str(node.name)] = interpret(node.value, env)
        return Unit()
    elif isinstance(node, VariableDeclaration):
        env['variables'][str(node.name)] = None
        return Unit()
    elif isinstance(node, VariableReAssignment):
        if str(node.name) in env['variables'].keys():
            env['variables'][str(node.name)] = interpret(node.value, env)
        elif str(node.name) in env['constants'].keys():
            raise RuntimeError("Cannot reassign the value of a constant")
        else:
            raise RuntimeError(f"{node.name} not found for reassignment")
        return Unit()
    elif isinstance(node, Name):
        if node.name in env['variables'].keys():
            return env['variables'].get(node.name)
        elif node.name in env['constants'].keys():
            return env['constants'].get(str(node.name))
        else:
            raise RuntimeError(f"'{node.name}' not defined")
    elif isinstance(node, Negative):
        # gotta check if its  type that can be negative
        return interpret(node.value, env) * -1
    elif isinstance(node, Conditional):
        if interpret(node.if_condition, env):
            result = interpret(node.if_block, env)
            if isinstance(result, Break):
                return Break()
        else:
            if node.else_block:
                result = interpret(node.else_block, env)
                if isinstance(result, Break):
                    return Break()
    elif isinstance(node, WhileLoop):
        if interpret(node.while_condition, env):
            result = interpret(node.while_block, env)
            if not isinstance(result, Break):
                interpret(node, env)
            else:
                return Unit()
    elif isinstance(node, Block):
        for statement in node.statements:
            maybe_flow_interruptor = interpret(statement, env)
            if isinstance(maybe_flow_interruptor, Continue):
                return Unit()
            elif isinstance(maybe_flow_interruptor, Break):
                return Break()
        return Unit()
    elif isinstance(node, CompoundExpression):
        final_statement = None
        for statement in node.statements:
            final_statement = interpret(statement, env)
        return final_statement
    elif isinstance(node, Break):
        return Break()
    elif isinstance(node, Continue):
        return Continue()
    elif isinstance(node, WabbitType):
        return Unit()
    elif isinstance(node, WabbitBoolean):
        return node.value
    elif isinstance(node, Character):
        return str(node.value)
    else:
        raise RuntimeError(f"Can't interpret {node}")

So, I held out on you a little. The example above strategically shows a syntax tree that only does scripting: there are no variable gymnastics and no state for the program to manage. This is a great time for us to transition into a conversation about that.

Suppose Wabbit needs to manage constants and variables:

exemplifying_state = Block(statements=[
    ConstantAssignment(Name('pi'), Float(3.14159)),
    VariableDeclaration(name=Name('tau'), type=WabbitType('float')),
    VariableReAssignment(
        name=Name('tau'),
        value=BinaryOperator('*', Float(2.0), Name('pi'))),
    Print(value=BinaryOperator('+', Negative(value=Name('tau')), Float(10.0)))
])
interpret_program(exemplifying_state)

What you see here is the syntax tree for some code that:

  1. Assigns a constant called ‘pi’ the float type and a value of 3.14159
  2. Declares a variable called ‘tau’ of the float type
  3. Assigns ‘tau’ a value of 2 * pi
  4. Prints the sun of 10 and the negative value of tau

The tree requires storing, referencing, and manipulating two different kinds of names: constants and variables.

When run on this tree, interpret_program prints out 3.7168200000000002—as expected.

But some of its behavior is definitely…weird.

Scroll up and take a close look at my interpret function (or check out the gist right here). I’ve left the house haunted so you can chase a poltergeist.

Weirdness: Variable Storage

If I declare a variable called version and then later a constant called version, and then try to access the version value, what do you expect will happen?

versioning_model = Block(statements=[
    VariableAssignment(name=Name('version'), type=WabbitType('float'), value=Float(2.0)),
    ConstantAssignment(Name('version'), Float(5.0)),
    Print(value=BinaryOperator('+', Name('version'), Float(1.0)))
])
interpret_program(versioning_model)

Folks typically guess that the printed value will be 6.0, since the constant version got assigned after the variable version. The implementation in the interpreter code above doesn’t work that way. Instead, interpreting that syntax tree will print out 3.0.

What, respectfully, the hell have I done here? And maybe more importantly, why have I done it?

I can wait while you think.

Done? OK.

In order to handle state, our interpreter needs a place to store it. That’s called the environment. Most compilers implement state with some kind of key-value store, and this interpreter does that too with a dictionary initialized for each interpretation run:

def interpret_program(model):
    # Make the initial environment (a dict).  The environment is
    # where you will create and store variables.
    env = {'constants':{}, 'variables':{}}
    for structure in model.statements:
        interpret(structure, env)

This specific implementation has two separate inner stores for constants (which cannot be reassigned after initial assignment) and variables (which can be reassigned after initial assignment). After I implemented it this way, I realized the problem: this means that, without extra checking logic, a constant and a variable can have the same name (that’s right, even a goddess like myself makes mistakes when attempting to implement a compiler in a week, or in this case, an interpreter in 45 minutes 😉).

When the source code references a name…

    elif isinstance(node, Name):
        if node.name in env['variables'].keys():
            return env['variables'].get(node.name)
        elif node.name in env['constants'].keys():
            return env['constants'].get(str(node.name))
        else:
            raise RuntimeError(f"'{node.name}' not defined")

…the interpreter first checks for the presence of the name in the variable store. If it cannot find it there, it checks the constant store. If it still cannot find the name, it raises a RuntimeError. But this means that, if the name does exist in the variable store, the constant by the same name becomes unreachable. The interpreter won’t ever use it!

Why have these two separate sections in the environment at all? At the time, I wanted a clean way to quickly recognize source code that attempted to reassign a constant and raised a helpful error message:

    elif isinstance(node, VariableReAssignment):
        if str(node.name) in env['variables'].keys():
            env['variables'][str(node.name)] = interpret(node.value, env)
        elif str(node.name) in env['constants'].keys():
            raise RuntimeError("Cannot reassign the value of a constant")
        else:
            raise RuntimeError(f"{node.name} not found for reassignment")
        return Unit()

If I want to keep the error message without allowing constants and variables to occupy the same names (or allowing constants to become unreachable because of it), I might need to store all of the names in one top level environment, then have each one point to a value plus that value’s mutability.

And I did it, but before I show you, why don’t you give it a try? You have the interpreter code to change, and you have the versioning_model example above to run on your changes. This challenge will help familiarize you with how this interpreter works because:

  • It’s a clear, well-scoped change
  • It requires changing how the interpreter references the environment in several places
  • You know how to rapidly determine if it is working yet or not.

Once you’re done (or if you’re tired of it or want to skip this and see the answer), here’s a gist of the interpreter code with the change made. And here’s a diff of the exact changes so you can see exactly what I changed and where.

I didn’t bother with a class for the new state value storage: I used dictionaries. I didn’t use named tuples because I didn’t want to reassign an immutable data type to represent reassigning the source code’s mutable variables. I’d rather my interpreter create a boundary on constants than erase one on variables.

So far, we’ve stuck to the easy part of compilation.

For both printing and interpretation, we have started with an abstract syntax tree represented with a valid data model. Somehow, we have to get to that point from some source code expressed in Wabbit. That’s the big, thorny job—parsing. And in the next post of the series, we’ll finally explore that topic.

If you liked this piece, you might also like:

The Crafting Interpreters series, which covers what I’m learning from Bob Nystrom’s book

The Raft series, of course! Learn to implement a distributed system from the comfort of your home, complete with gifs and smack talk!

The time I talked about what counts as a programming language in front of an audience of programming language designers. I strategically saved this talk for a pandemic-era Zoom conference to eliminate the possibility that they’d actually throw tomatoes.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.