In May I took Dave Beazley’s week-long course on compilers. This is the fifth post in a series about that course. Here’s where you can see all the posts so far.
In the last post, we talked about parsing: getting from the front end of a programming language to its abstract syntax tree. Now we head backwards again to talk about code generation: the step where we translate the operations in the abstract syntax tree into instructions for a specific machine or virtual machine to run.

Recall the structure of the Wabbit interpreter from this prior piece. Our code generation example will follow that same structure except that, instead of executing commands in Python, it will generate commands for an intermediate representation in LLVM.
LLVM stands for “low level virtual machine,” and it’s a library specifically designed to construct and optimize intermediate and/or binary machine code. It allows us to focus on lexing and parsing because it does some of the heavy lifting for translating the resulting instructions for common computer architectures. So instead of targeting a specific machine, we can target LLVM (which hands us some of the common machines), and if we also wanted to run our code on various other machines we could also use LLVM tools to translate its intermediate representation for those machines.
We won’t do that today. Instead, we’ll focus on targeting LLVM IR (intermediate representation) from a Wabbit frontend. The plan is:
- Import a Python library that gives us an API for generating individual commands in the LLVM intermediate representation.
from llvmlite import ir
Dope, that was easy.
2. Translate our Wabbit abstract syntax tree nodes into llvmlite commands for the LLVM IR. That sounds fancy, but we’ve already done this same family of thing twice. First, we translated our AST into printed Wabbit code. Then, we translated our AST into executed Python code. By now, you’re getting used to seeing this pattern:
...
from model import *
def generate(node, env):
if isinstance(node, Integer):
return f"ir.Constant(int_type, {node.value})"
elif isinstance(node, Float):
return f"ir.Constant(float_type, node.value)"
elif isinstance(node, BinaryOperator):
# YOU KNOW WHAT, FOR NOW LET'S JUST ASSUME EVERYTHING IS AN INTEGER
if node.op == '*':
return f"builder.mul({generate(node.left, env)}, {generate(node.right, env)})"
elif node.op == '/':
return f"builder.sdiv({generate(node.left, env)}, {generate(node.right, env)})"
elif node.op == '+':
return f"builder.add({generate(node.left, env)}, {generate(node.right, env)})"
elif node.op == '-':
return f"builder.sub({generate(node.left, env)}, {generate(node.right, env)})"
elif isinstance(node, Comparison):
return f"builder.icmp_signed(node.op, generate(node.left, env), generate(node.right, env))"
elif isinstance(node, Print):
return f"builder.call(_print_int, [{generate(node.value, env)}]) "
... and so on and so forth, through all of the types of tree node in the Wabbit AST
Note that what this does is take in a node and return a string. That string, if evaluated by llvmlite, would execute an LLVM command. This part of the exercise involves a lot of consulting the llvmlite documentation.
Note, also, the comment in the BinaryOperator
block. The LLVM command is different depending on the type of number in an arithmetic operation. A sophisticated version of this code generator would differentiate between, for example, doubles, 8 bit integers, and 32 bit integers. That process, by the way, is called monomorphization—taking a function that is polymorphic (operates on multiple different types of inputs), and breaking out a separate implementation for each of the types that it can take in.
There’s technically another way to address this in compiler design, called boxing—putting a primitive type into an object with an interface that guarantees it can perform certain operations, and then having the compiler “not care” what type is passed in since it’s guaranteed at runtime to respond to all the right operations (though it might be a little bit slow to do so). I can’t do this topic justice in this blog post, but if you click here and then search the page for the section titled “the basic ideas,” you’ll find a really good explanation.
I didn’t do either one here—I just assume my polymorphic function receives an integer, and if it doesn’t, code generation fails. I left it like this because I ran out of time (Dave had us writing a compiler in a week, people), and also I don’t think I need that level of detail to exemplify the idea of what we’re doing here for you.
Onward.
3. Concatenate or llvmlite command strings together in a file that we can then ask llvmlite to ingest and compile.
starting_crapola ="""
from llvmlite import ir
mod = ir.Module('hello')
int_type = ir.IntType(32)
void_type = ir.VoidType()
float_type = ir.DoubleType()
bool_type = ir.IntType(1)
char_type = ir.IntType(8)
builder = ir.IRBuilder()\n
"""
def generate_program(model):
program = ""
for structure in model.statements:
program += generate(structure, env)
f = open("wabbit_to_llvm.py", "w")
f.write(starting_crapola)
f.write(program)
f.close()
We start with a multi-line string of boilerplate that names the LLVM IR types our program might use and then creates an IRBuilder, which the docs call “the workhorse of LLVM Intermediate representation (IR) generation.” Notice that the generate
function spits out strings that reference that builder.
Once we’ve run our code generator and created a file wabbit_to_llvm.py
that represents our Wabbit program, we run the following script to execute it:
python3 codegen.py
python3 wabbit_to_llvm.py > codegen.ll
clang main.c runtime.c codegen.ll
./a.out
I’ll be honest: working with llvmlite was not the easiest thing. The API is well-written and pretty comprehensively documented, so that wasn’t the hard part. But I hit a lot of snags with respect to getting the thing to run. This error or that error would happen, the script result wasn’t idempotent, and I had to change various things on my computer. Ideally, you follow the instructions to get a library to run, and it runs. So, two stars out of five on developer experience here. I know LLVM in and of itself doesn’t have to be such a bear to get working because I use it also for my work on the Roc programming language, and that experience was a dreamy “download and done”.
So you’ve seen the example. A bone to pick before I go.
I get annoyed when programmers shorten “code generation” to “code gen” to sound cool. Why: because then they take that one step further and write it like “codegen” because they’ve seen it like that before. And that’s when the colloquialism gets straight up misleading. So, lemme disambiguate here:
code generation – the language-agnostic step in a compilation process of converting an abstract syntax tree into a set of instructions, typically for a low-level machine, but in the case of a transpiler (a compiler that translates one high-level PL to another one), possibly a high-level programming language like JavaScript or Java.
codegen – a product name used by a variety of tech teams, including GraphQL and Swagger, to describe tools they have built for their specific products—not the general use case.
Please don’t make me guess whether you’re talking about a general compilation step and a specific tool you used. This would be sort of like if some company came up with a proprietary, nerfed-down adventure experience product that they called ‘EverestClimb,’ and then anytime someone talked about doing an “Everest Climb” in conversation you couldn’t tell if they actually climbed Mt. Everest or did this adventure experience.
Where do we go from here?
Well, if you enjoy the way I write about compilers, you’re in luck. First of all, I have an ongoing series about Bob Nystrom’s book Crafting Interpreters. I put it down for a while to take care of some other stuff, but it’s on deck for a revisit now that this course has given me some more depth.
Second, if you read the last third of this blog post closely, you might have noticed a little easter egg about a project I’m working on. Expect more soon ;).
If you liked this piece, you might also like:
The rest of this series on the compiler course
The Crafting Interpreters series
An Applied Programming Instructor Commentates an Algorithms Class