Lexing, or lexical analysis, is the process of turning a sequence of characters into tokens, a more structured representation than “just the bytes”. As a reference, you can think of turning the bytes encoding the text “10” into a (type, bytes) pair (NUMBER, b"10") or a (type, span) pair (NUMBER, [0, 1]). The result pairs what was found in the input with a value representing what kind of syntax it is.
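
For example, in Python-flavoured pseudocode the two representations could look like the following sketch (the names and plain tuples are just for illustration):

```python
# A minimal sketch of the two token representations mentioned above:
# the kind paired either with the matched bytes or with a span into the input.
text = b"10"

token_with_bytes = ("NUMBER", b"10")   # (kind, matched bytes)
token_with_span = ("NUMBER", (0, 1))   # (kind, inclusive start/end offsets)

start, end = token_with_span[1]
assert text[start:end + 1] == token_with_bytes[1]
```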

If you do this repeatedly on the rest of the input you get a sequence of tokens that you can use to reconstruct the structure encoded in the text.

Again as an example, we can turn the arithmetic expression “10 + (2 ^ 5)” into the sequence (NUMBER, b"10"), (OP, b"+"), (OPEN, b"("), (NUMBER, b"2"), (OP, b"^"), (NUMBER, b"5"), (CLOSE, b")").
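
To make this concrete, here is a minimal sketch of such a lexer in Python, assuming the tiny token set above (NUMBER, OP, OPEN, CLOSE), single-character operators, and one regular-expression branch per token kind; it is an illustration, not a definitive implementation.

```python
import re

# One named group per token kind; leading whitespace is skipped.
TOKEN_RE = re.compile(rb"\s*(?:(?P<NUMBER>\d+)|(?P<OP>[+\-*/^])|(?P<OPEN>\()|(?P<CLOSE>\)))")

def lex(data: bytes):
    data = data.rstrip()  # ignore trailing whitespace for simplicity
    pos = 0
    while pos < len(data):
        match = TOKEN_RE.match(data, pos)
        if match is None:
            raise SyntaxError(f"unexpected byte at offset {pos}")
        kind = match.lastgroup           # the name of the group that matched
        yield (kind, match.group(kind))  # a (type, bytes) pair
        pos = match.end()

print(list(lex(b"10 + (2 ^ 5)")))
# [('NUMBER', b'10'), ('OP', b'+'), ('OPEN', b'('), ('NUMBER', b'2'),
#  ('OP', b'^'), ('NUMBER', b'5'), ('CLOSE', b')')]
```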

This is a classic phase of compilers, but it shows up in many other tasks. Basically every time a structure has been flattened into bytes you will do some kind of lexing to help you reconstruct it, either explicitly by writing a lexer yourself or implicitly by using something like parser combinators.

And now for the important selling point of lexing, and why I want this separation between lexing and parsing despite the increase in complexity: lexing is linear in the size of the input and you can use it to find syntactic errors. The errors you get out of this simple analysis go a long way and are basically free. Even if you don’t catch every kind of syntactic error, lexing is worthwhile as long as you can always catch some kind of syntactic error, e.g. unbalanced parentheses.

Again, in our example lexing already gives us a lot of information about the structure we will reconstruct: the expression is balanced, each open parenthesis has a matching close parenthesis, and each binary operator is applied to exactly two operands. I can compute these two properties for free by lexing and implementing a pushdown automaton.
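
To show how cheap these checks are, here is a minimal sketch of that single pass over the token stream: a counter stands in for the stack since parentheses are the only bracket kind here, and a flag tracks whether an operand or an operator is expected next. The `check` name and the error messages are just placeholders.

```python
def check(tokens) -> None:
    depth = 0
    expect_operand = True  # an expression must start with a number or '('
    for kind, value in tokens:
        if expect_operand:
            if kind == "NUMBER":
                expect_operand = False
            elif kind == "OPEN":
                depth += 1
            else:
                raise SyntaxError(f"expected an operand, got {value!r}")
        else:
            if kind == "OP":
                expect_operand = True
            elif kind == "CLOSE":
                depth -= 1
                if depth < 0:
                    raise SyntaxError("unmatched ')'")
            else:
                raise SyntaxError(f"expected an operator or ')', got {value!r}")
    if depth != 0:
        raise SyntaxError("unmatched '('")
    if expect_operand:
        raise SyntaxError("expression is missing a final operand")

tokens = [("NUMBER", b"10"), ("OP", b"+"), ("OPEN", b"("),
          ("NUMBER", b"2"), ("OP", b"^"), ("NUMBER", b"5"), ("CLOSE", b")")]
check(tokens)  # passes silently
```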

By lexing alone I can ensure the input is free of these syntax errors before I parse it. This also means that I can finally remove from my parser all the code that handles them. This is a win-win situation in my book: for very little effort I get fast and reasonable syntax errors, plus I simplify my parser significantly by removing a lot of error handling that does not really belong there.

This becomes even more important to me as I get into hostile territory. As a compiler author you can be reasonably sure that the input is trusted, but as a developer reading input from the network you cannot, and this is where the problems lie.

If I parse untrusted JSON data from the network I will wait for the whole input to show up before parsing¹. And when I have the whole input, it’s much better to discard it up front than to start parsing, discover that it has a billion levels of nested structures, and find that (1) it goes over the recursion limit I tolerate and (2) it has made me allocate a lot of resources which I now need to clean up.
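
As an illustration of that up-front rejection, here is a minimal sketch for JSON, assuming a hypothetical `reject_if_too_deep` helper and an arbitrary limit of 64 levels; it scans the raw bytes once, skips string contents so brackets inside strings don’t count, and never builds the parsed structure.

```python
MAX_DEPTH = 64  # arbitrary limit chosen for this sketch

def reject_if_too_deep(data: bytes, limit: int = MAX_DEPTH) -> None:
    depth = 0
    in_string = False
    escaped = False
    for byte in data:
        if in_string:
            if escaped:
                escaped = False          # the byte after a backslash is literal
            elif byte == ord("\\"):
                escaped = True
            elif byte == ord('"'):
                in_string = False
        elif byte == ord('"'):
            in_string = True
        elif byte in (ord("{"), ord("[")):
            depth += 1
            if depth > limit:
                raise ValueError(f"nesting deeper than {limit} levels")
        elif byte in (ord("}"), ord("]")):
            depth = max(depth - 1, 0)

reject_if_too_deep(b'{"a": [1, 2, {"b": 3}]}')  # fine
# reject_if_too_deep(b"[" * 1_000_000)          # raises ValueError
```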

I’m not giving up on parser combinators, but I really like lexing now. It is reliable because it’s simple, and it gives me great joy by making my parser simpler.

  1. Streaming parsers are not my forte anyway.