Operator Precedence in Top-Down Parsing: Pratt Parsers
David 2020-05-28
Last Edited: 2020-08-18
Arguably the easiest way to hand-implement a parser is to write a recursive-descent parser. The gist of a recursive descent parser is that you associate a function with each grammar rule in the language. It's fairly easy to construct a grammar for which this is possible when that grammar doesn't contain "expressions"—usually these are grammars that are LL(1). Once you get into grammars with "expressions" in them, though, they tend to be a bit more difficult to deal with because it's hard to come up with LL(1) rules that make sense and that allow the operators to have the correct precedence and associativity. It turns out that there is a way to make this exact sort of thing easier by "enhancing" recursive descent parsers and turning them into what is now known as a Pratt Parser. They're named after Vaughan Pratt, the author of the paper that described them (also available here (PDF)).
A Pratt parser is an extension of recursive descent parsing that associates metadata with tokens in addition to syntax rules. This allows for much easier parsing of expressions that have operator precedence than you'd normally get if, for example, you were writing your own parser and designing an LL(1) grammar (which you often have to do if you're writing your own parser because LR grammars tend to be difficult to parse by hand). Rather than go through the trouble of contorting your grammar to fit something that is easy to parse, Pratt parsers assign metadata to tokens to aid in the parsing of expressions that involve operator precedence.
Pratt's paper, while not terribly confusing, is more difficult to understand than I'd like. The concepts are actually easier to grasp if you just attempt to implement the parser described in section 3 (Implementation) of the paper. It's a real "seeing is believing" type of situation.
The first thing that we learn is that Pratt parsers associate some "semantic code" (as in programming code) to each token. The semantic code contains information about the token (and practically, parsing code). Keep that in mind as we look at the main diagram of the paper.
The parser described has three states:
     ( )
      |
      | nud
      v          led
     (c)<-----------\
      |             |
      | rbp<lbp/    |
      v             |
    (left)----------/
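Read literally, the diagram is a small loop: call nud once to get the first value, then, while rbp < lbp, fold that value into a led and loop back. As a sanity check, here is my own minimal sketch of that loop in JavaScript (the helper names `peek`/`next` and the binding-power values are mine, not the paper's):

```javascript
// A minimal sketch of the state machine above: parse() starts at ( ),
// nud() produces the first value (c), and the loop keeps folding in
// operators via led() for as long as rbp < lbp.
function evaluate(src) {
  const tokens = src.split('');
  let pos = 0;
  const peek = () => tokens[pos];
  const next = () => tokens[pos++];
  const lbps = { '+': 10, '*': 20 }; // illustrative binding powers

  function nud(tok) {
    return parseInt(tok, 10);        // only digits have a nud here
  }
  function led(tok, left) {
    // each operator's rbp equals its lbp, giving left-associativity
    const right = parse(lbps[tok]);
    return tok === '+' ? left + right : left * right;
  }
  function parse(rbp) {
    let left = nud(next());          // state (c)
    while (pos < tokens.length && rbp < lbps[peek()]) {
      left = led(next(), left);      // loop back into (c)
    }
    return left;                     // state (left)
  }
  return parse(0);
}

console.log(evaluate('1+2*3')); // 7
```

This is only the skeleton; the full implementation in the paper (and below) hangs quite a bit more machinery off of nud and led.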
Pratt uses some unusual terminology. The "nud" is a short-form for "null
denotation" which means "the code denoted for a token which has no preceding
expression" which, even more simply, is a "prefix operator". For example, the
unary minus ("-") is a token with a "nud" whereas the infix operator for
division (the "/") does not. The "led" by contrast is the "left denotation"
which is the "code denoted for a token with a preceding expression." That is,
anything that is not a prefix operator—i.e. an infix operator or a postfix
operator. Postfix operators are comparatively rare, but a well-known example
of one is the factorial operator ("!") which operates on the preceding number:
4! = 1 × 2 × 3 × 4 = 24. The "rbp" and "lbp" are "right binding power" and
"left binding power," respectively. These are, as far as I can tell, the same
thing as "operator precedence." For example the multiplication operator "*"
has a higher binding power than the addition operator "+". The "right" binding
power is a parameter/argument of the parse operation (it has a default value of
"the minimum" when calling parse on the top of the tree) but the left binding
power is a property of the operator token under consideration during the parse
at the point of checking rbp<lbp/. I believe the "/" in the notation is
used to mean "if the preceding is true."
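To make the terminology concrete, here is a hypothetical token-metadata table (the tokens and binding-power values are illustrative, not taken from the paper). Note that a token like "-" can carry both a nud (unary negation) and a led (binary subtraction):

```javascript
// Illustrative metadata: each token may carry a nud, a led, and an lbp.
// The string values stand in for the "semantic code" a real parser
// would attach to each token.
const tokens = {
  '4': { nud: 'literal 4' },                         // literal: nud only
  '-': { lbp: 10, nud: 'negate', led: 'subtract' },  // prefix AND infix
  '/': { lbp: 20, led: 'divide' },                   // infix only: no nud
  '!': { lbp: 30, led: 'factorial' }                 // postfix is a led too
};
```

The parser picks nud or led purely by position: a token with no expression to its left runs its nud, otherwise its led.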
Example
In order to better understand the paper's implementation of the parser, I tried to develop one that would follow, as closely as possible, the algorithm given in the paper, no matter how ugly it got. Using just this diagram (and a fair amount of trial-and-error) I came up with the following example in JavaScript (which you can run in Node.js or Deno):
const parse = function(s) {
  // binding powers ("precedence") for each operator token
  const bps = {
    '+': 1,
    '-': 2,
    '*': 3,
    '/': 4,
    '^': 5
  };

  function expression(rbp) {
    let c = nud();
    return expression_tail(c, rbp);
  }

  function nud() {
    let tok = s[0];
    if (tok.match(/\d/)) {
      return function() {
        return parseInt(tok, 10);
      };
    }
    throw new Error('Token has no null denotation: ' + tok);
  }

  function expression_tail(c, rbp) {
    advance();
    let left = c();
    if (s.length === 0) {
      return left;
    }
    let lbp = bp();
    if (rbp < lbp) {
      return expression_tail(led(left), rbp);
    }
    return left;
  }

  function advance() {
    s = s.substring(1);
  }

  function bp() {
    let tok = s[0];
    if (tok in bps) {
      return bps[tok];
    }
    throw new Error('Token has no binding power: ' + tok);
  }

  function led(lhs) {
    const leds = {
      '+': plus,
      '*': multiply,
      '-': minus,
      '/': divide,
      '^': power
    };
    let tok = s[0];
    if (tok in leds) {
      return leds[tok](lhs);
    }
    throw new Error('Invalid operator: ' + tok);
  }

  function plus(lhs) {
    return infix(lhs, '+', (x, y) => x + y);
  }

  // note: the exec callback is unused, because this parser emits a
  // JavaScript source string (eval'd by the tests below) rather than
  // computing a value directly
  function infix(lhs, op, exec) {
    return function() {
      let rbp = bps[op];
      let rhs = expression(rbp);
      return '(' + lhs + ' ' + op + ' ' + rhs + ')';
    };
  }

  function minus(lhs) {
    return infix(lhs, '-', (x, y) => x - y);
  }

  function multiply(lhs) {
    return infix(lhs, '*', (x, y) => x * y);
  }

  function divide(lhs) {
    return infix(lhs, '/', (x, y) => x / y);
  }

  function power(lhs) {
    return function() {
      let rbp = bps['^'];
      // -1 because '^' is right-associative
      let rhs = expression(rbp - 1);
      return '(Math.pow(' + lhs + ',' + rhs + '))';
    };
  }

  return expression(0);
};
function test(str, expected) {
  let parsed = parse(str);
  let actual = eval(parsed);
  let status = actual === expected ? 'OK' : 'FAIL';
  console.log(status + ': ' + str + ' | ' + actual + ' | ' + expected + ' | ' + parsed);
}
test('1', 1);
test('1+2', 3);
test('1+2*3', 7);
test('1+2*3+4', 11);
test('1+2*3+4*5', 27);
test('1+2*3+4-5', 6);
test('1+6/3-2*2', -1);
test('2^2*3', 12);
test('2^3^2', 512);
test('2^2^3', 256);
test('3^3+4*5-2^2^2', 31);
This implementation isn't terribly efficient and it heavily uses closures, but I think it's quite true to the implementation described in the original paper. Most of the blog posts about Pratt parsers that I read—of which this one is my favourite, but this one is also very good—use objects rather than functions and so are harder for me to map to the original paper.
There is one weird bit in the exponentiation function where we subtract 1
from the rbp value before recursing. We do that because the exponentiation
operator is right-associative. This means that 2^3^2 should equal (2^(3^2))
and not ((2^3)^2). In order to achieve this, we subtract 1 from the rbp of
the exponentiation operator before using it, because the exponentiation
operator binds "more tightly" on the left than on the right.
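To isolate that trick, here is a separate little sketch of mine (not from the paper) that parses only digits and "^", with a flag choosing whether the recursive call receives bp or bp - 1 as its rbp:

```javascript
// Sketch: parsing only digits and '^'. Passing bp (left-associative)
// versus bp - 1 (right-associative) as the rbp of the recursive call
// is the only difference between the two groupings.
function parsePow(src, rightAssoc) {
  const tokens = src.split('');
  let pos = 0;
  const bp = 5; // binding power of '^' (value is arbitrary)
  function parse(rbp) {
    let left = tokens[pos++];          // a digit's "nud"
    while (pos < tokens.length && rbp < bp) {
      pos++;                           // consume '^'
      const rhs = parse(rightAssoc ? bp - 1 : bp);
      left = '(' + left + '^' + rhs + ')';
    }
    return left;
  }
  return parse(0);
}

console.log(parsePow('2^3^2', true));  // (2^(3^2))
console.log(parsePow('2^3^2', false)); // ((2^3)^2)
```

With bp - 1, the recursive call still accepts another "^" of equal precedence, so the middle operand attaches to the rightmost operator; with bp, the strict rbp < lbp check stops it.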
You should be able to see from the above that it's near-trivial to add new operators; in fact, I added everything but addition and multiplication after the fact in the above example.
How/Why it Works
Pratt Parsers work on much the same principle as the
Shunting-Yard Algorithm
in that we build up a stack of values (implicitly, using the program stack)
and, through assigning binding-power/precedence to operators, perform
operations on the values on that stack in order to evaluate the expression.
Effectively, it builds up the correct "expression tree" from the flattened
program based on the rules about the binding power of operators. An operator
with a higher "binding power" will end up "closer" in the final tree to its
arguments than would an operator with lower binding power. For example, the
string a + b * c + d
ends up looking like:
      +
     / \
    +   d
   / \
  a   *
     / \
    b   c
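To watch the binding powers produce exactly that shape, here is a small sketch of mine (not from the paper) that builds nested arrays instead of strings or values:

```javascript
// Sketch: a tiny Pratt parser that builds nested arrays so the tree
// shape is visible. The binding-power values are illustrative.
function toTree(src) {
  const tokens = src.split('');
  let pos = 0;
  const lbps = { '+': 10, '*': 20 };
  function parse(rbp) {
    let left = tokens[pos++];                 // operand (its "nud")
    while (pos < tokens.length && rbp < lbps[tokens[pos]]) {
      const op = tokens[pos++];               // operator (its "led")
      left = [left, op, parse(lbps[op])];
    }
    return left;
  }
  return parse(0);
}

console.log(JSON.stringify(toTree('a+b*c+d')));
// [["a","+",["b","*","c"]],"+","d"]
```

The higher-powered "*" ends up nested deepest, next to its arguments, exactly as in the drawing above.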
Pratt parsers are really cool because they make it almost-trivial to add complex operator precedence rules to hand-written recursive descent parsers!