Building a Scanner

Reading Time: 12 minutes

I recently picked up Crafting Interpreters. The project guides programmers through building their own interpreters for the Lox programming language.

I want to grasp this subject as quickly and deeply as I can. As I do, I find myself using many of the techniques we discussed in the leveling up series. In Chapter 4 of Crafting Interpreters, we be begin writing the front end of our Lox compiler: the scanner. So in this post, I document my process to show you a specific example of how I write code when I’m trying to learn.

First, what is a scanner? A scanner takes in the text of a program and outputs tokens—the parts of the program that we read to execute what the program is asking.

Before I began reading or writing, I asked myself:

Given what I know about what scanners do, how would I imagine such a thing to be built?

I would expect it to take in characters one by one and probably run them through some kind of extensive conditional. If one character looks like the start of a multi-character special token, then we have to hang onto it, and all the characters that follow, either until we have the full token, or until the string we’re looking at no longer matches the token in question.

As it happens, this is exactly what our scanner does.

Forging an Iterative Path to a Working Scanner

Crafting Interpreters does not take an iterative approach to writing a scanner, per se. Instead, it more or less builds a finished product line by line.

For example, the book shows us a method that uses three helper methods, and then it shows us the text of the helper methods. In an iterative approach, the first method would contain the bodies of those helper methods until those bodies needed to be duplicated (or, if we’re considering risk estimates for refactoring, triplicated or more). At that point, we would extract a helper method.

Were we to copy and paste code into our implementation from the book as it appears, we would not fully understand what it’s going to be for yet. If we were to paste the top level method and not paste the helper methods below it, the program would not compile.

That’s fine with me. I have seen books attempt a verbatim, step-by-step imitation of iterative programming—most notably Kent Beck’s Test-Driven Development by Example and Martin Fowler’s Refactoring to Patterns. These books’ code samples are unmistakably clear and take agonizingly tiny steps. It takes a lot of work to build a book like this, and some folks really love it.

I find it excruciatingly boring and ineffective, and I like both of those books in spite of the verbatim style, not because of it. This is not because I’m a genius; it’s because I’m an impatient screwup who learns best by barging ahead and screwing up. If I want to do something right the first time, I seek (and follow) clear instructions. This is the case with Ikea furniture and other crap I only want to have to do once.

If I instead want to carry new skills forward into other projects, I ignore the instructions and figure things out myself. The style in which Crafting Interpreters presents code creates an opportunity for me to do that.

Instead of copying and pasting the implementations in the book, I built my scanner up alongside the book’s, but with my own approach. I’ll divide this approach into three categories: code behind, code ahead, and refactoring.

Code Behind

Chapter 4 of Crafting Interpreters introduces a method in our scanner called advance(). This method both modifies the class (consumes a character) and returns a value (the next character in the source code). That return value only gets used in one of the three or four cases in the scanner where the method gets called.

I get nervous when methods both modify their class and return something because programmers can see the method in use to assign a value and miss the fact that it also modifies the class*. So I inlined advancing the cursor in all three cases and assigned that value separately in the one place where it was needed. This is, notably, not as engineered as the solution in the code samples. So my level of engineering is behind the one in the example text, hence the name.

As I continue working, the scanner might evolve such that it makes sense to extract the advance() method again. But I want to see that for myself, so I can understand the exact issues that the presence of that method allows us to avoid.

*Later in my implementation, I break the return-or-modify rule. I extract a method called isAComment() that checks if a token started by a slash is a standalone slash or a comment, consumes the comment if it’s a comment, and then returns a boolean that we use to decide whether to add a slash token.

Why: consuming the comment without adding a token modifies the cursor we’re at, but it very deliberately doesn’t modify our token list. That’s the whole point. Since the location of the cursor exists purely as a means to help us get to the token list, I decided that the advantage of having a clean, legible scanToken() method outweighed the risk here of a programmer misinterpreting the modifications inside isAComment().

Code Ahead

The chapter often shows a method that uses some helper methods. Afterward it shows the text of the helper methods. So I would put the first method into my implementation and then try to create the helper methods without looking at the ones in the chapter.

When I finished, I would scroll down and compare my implementation to the examples. Those helper methods served as “fill-in-the-blank” exercises that allowed me to feel like I was building a scanner rather than copying one over. So at points, my implementation was more engineered than the solution in the book, hence the name “code ahead.”

Refactoring

I engaged with the example code by making it my own.

Take naming things, for example. I force myself to understand the roles of different pieces of code by attempting to give them illustrative names. My scanner’s current variable became cursorIndex. The match() method became consumeIfNext() (yes, I know, regex uses match).

The peek() method from the code example shows us the next character after the cursor without consuming that character, but we only use it to check if the next character matches a certain character. So I changed the name to isNextChar(), gave it a parameter, and had it return a boolean rather than a string. I did this right before we got to number literals, at which point we did start using the value of peek(). Rather than reinstate peek(), I left the implementation inline. I think source.charAt(cursorIndex) is clearer than peek() as to what is happening (albeit less cute).

I played with my implementation in a REPL throughout to make sure my science experiments were working. I didn’t have to write code that made sense, because I only wanted to test whether the scanner would detect the tokens I wanted it to detect.

parsing a line of nonsense code

Once the scanner implementation worked, I took time to consider how we had structured its behavior.

The scanToken() method had a giant case statement, as expected. The use of a case statement has performance and structural implications. For scanning, a case statement is fast: it facilitates differentiation between several different types of token at once, rather than falling through several different conditionals. The structural implication: the moment we’re dealing with tokens that aren’t detected by exact character match (for example, identifiers and numbers), a case statement can’t handle it. So we shove it in default at the bottom and fall back on conditionals.


private void scanToken() {
char c = advance();
switch (c) {
case '(': addToken(LEFT_PAREN); break;
case ')': addToken(RIGHT_PAREN); break;
case '{': addToken(LEFT_BRACE); break;
case '}': addToken(RIGHT_BRACE); break;
case ',': addToken(COMMA); break;
case '.': addToken(DOT); break;
case '-': addToken(MINUS); break;
case '+': addToken(PLUS); break;
case ';': addToken(SEMICOLON); break;
case '*': addToken(STAR); break; // [slash]
//> two-char-tokens
case '!': addToken(match('=') ? BANG_EQUAL : BANG); break;
case '=': addToken(match('=') ? EQUAL_EQUAL : EQUAL); break;
case '<': addToken(match('=') ? LESS_EQUAL : LESS); break;
case '>': addToken(match('=') ? GREATER_EQUAL : GREATER); break;
//< two-char-tokens
//> slash
case '/':
if (match('/')) {
// A comment goes until the end of the line.
while (peek() != '\n' && !isAtEnd()) advance();
} else {
addToken(SLASH);
}
break;
//< slash
//> whitespace
case ' ':
case '\r':
case '\t':
// Ignore whitespace.
break;
case '\n':
line++;
break;
//< whitespace
//> string-start
case '"': string(); break;
//< string-start
//> char-error
default:
/* Scanning char-error < Scanning digit-start
Lox.error(line, "Unexpected character.");
*/
//> digit-start
if (isDigit(c)) {
number();
//> identifier-start
} else if (isAlpha(c)) {
identifier();
//< identifier-start
} else {
Lox.error(line, "Unexpected character.");
}
//< digit-start
break;
//< char-error
}
}

view raw

Scanner.java

hosted with ❤ by GitHub

This works, but the shape of it does not accurately represent the importance of each token. If I’m trying to find number parsing, the first place I’m looking isn’t default. The case statement represents a choice made for the sake of run speed, and its use makes the scanner code harder for programmers to immediately understand.

Now, run speed is really important in a compiler. Developers rely on compilers for real-time feedback as they write code. So when I say “prioritize discoverability and maintainability over run speed in almost all cases,” compilers are one of the reasons for that “almost.”

But I’m not writing a compiler for production use right now. I’m writing a compiler to learn. So to learn, I need to tinker with the compiler.

I decided to do an imaginative exercise: I pretended that I lived 25 years in the future.

Computing resources are so ample and powerful that run speed affects the utility of a compiler as little as it affected user-facing app development in 2019. When I get to prioritize discoverability and maintainability over run speed, how would I build my scanner?

I decided that, first of all, the shape of the code should match the significance of each piece. A case statement wasn’t going to work for this. I switched to boring conditionals and defined clearly named helper methods so that scanToken() has a section for single character tokens, a section for comparison operators, a section for comments and other slashes, a section for ignored characters, a section for strings, a section for digits, and a section for identifiers. They are all at the same level of importance, same level of indentation, and roughly the same size, so a programmer can easily scan and find the code representing the group of tokens containing the one they are looking for.

Each type of token, if found, returns early from the method. If nothing is found, at the very end of the method, we produce an error indicating that the scanner has found an unexpected character. This approach mimics a legion of guard clauses. The difference: instead of a bunch of guard clauses exiting early for invalid inputs and finishing with the “happy path” behavior, these clauses all exit early for valid inputs, with the output for an invalid input at the end. Our Lox.error... line can be interpreted as “do this if all else fails.”


private void scanToken(char character) {
if (singleCharacterToken(character) != null) {
addToken(singleCharacterToken(character));
return;
}
if (isComparisonOperator(character)) {
addComparisonOperatorTokenFor(character);
return;
}
if (character == '/') {
if (!isAComment()) {
addToken(SLASH);
}
return;
}
switch (character) {
case ' ': //God, I hate this about Java
case '\r':
case '\t':
// Ignore whitespace.
return;
case '\n':
line++;
return;
}
if (character == '"') {
addStringToken();
return;
}
if (isDigit(character)) {
addNumberToken();
return;
}
if (isAlpha(character)) {
addIdentifierToken();
return;
}
Lox.error(line, "Unexpected character.");
}

view raw

Scanner.java

hosted with ❤ by GitHub

In the process, I discovered some interesting things. For example, if I had a number followed by a semicolon, my scanner would crash because we weren’t checking for a semicolon on the end of numbers. Similarly, a semicolon at the end of an identifier would get included in the identifier. I added checks for both of these cases.

This brought me to the point where I would consider factoring out some of these checks into clearly named check methods. In the end, I refrained. See, the checks follow a pattern:

  • determine something about the current character
  • ensure that the next character isn’t a semicolon
  • ensure that we’re not at the end of the file

However, the “something” in the first list item differs from place to place, so I’m not seeing a discoverable abstraction emerge here…yet.

You can check out all the changes I have described, along with the helper methods I extracted, in this commit. I should have made incremental commits, but truth be told I was watching InkMasters on the side and forgot. (I love Ryan Ashley).

Ryan agrees with my predilection for taking ownership of my programming exercises.

Conclusion

I’m an impatient screwup who learns best by barging ahead and screwing up. The style in which Crafting Interpreters presents code creates an opportunity for me to do that.

If I didn’t understand why the abstractions in the Chapter 4 code examples, I moved forward without making those abstractions until some quality of my code’s structure or function showed me why they were valuable.

When the chapter showed a top-level method with three helper methods, I would try to implement those helper methods before I looked at their implementations, using the code examples as fill-in-the-blank exercises to test my understanding of what we wanted to do. Then I would scroll down and compare my helper methods to the ones in the chapter.

Perhaps most importantly, I took ownership of the code and made it my own. Though I started each subsection with copying and pasting, I came to understand the material by:

  • considering each piece of code’s role and choosing my own names to describe those roles
  • testing things out in a REPL and looking for edge cases
  • adjusting the top-level structure of the code and fixing the issues that fell out of those changes

Because I did this, I feel confident that I could answer questions about my finished scanner.

If you liked this piece, you might also like:

The rest of the posts about Crafting Interpreters

This other time we talked about performance (examples in Ruby)

Reading too far into git commit messages

One comment

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.