Precedence and Associativity in Recursive-Descent Parsing
David Jackson 2022-05-01
I like writing parsers. I probably like writing parsers too much. I think a lot of that is influenced by the use of "little languages" like awk in UNIX to tailor the language used to solve the problem to the nature of the problem. My favourite method for creating parsers is to write them from scratch as recursive-descent parsers. These have a problem, though: it's not obvious how to deal with the precedence or associativity of operators in expressions.
Precedence is how we decide the "order of operations." You may remember the mnemonic "BEDMAS" (or "PEDMAS") from elementary school mathematics meaning: "Parentheses/Brackets, Exponents, Division, Multiplication, Addition, Subtraction." This is the order of operations for mathematics. To see what this means concretely:
20 - 3 * 2 ^ 2 + 1 = 20 - (3 * (2 ^ 2)) + 1 = 9
Associativity describes how an expression is evaluated. In mathematics, for
example, subtraction is left-associative—i.e. 9 - 5 - 1 = (9 - 5) - 1
—but
exponentiation is right-associative—i.e. 2 ^ 3 ^ 2 = 2 ^ (3 ^ 2) = 512
.
A grammar for mathematical expressions, in EBNF is the following:
expression = expression, operator, expression | "(", expression, ")" ;
operator = "+", "-", "*", "/", "^" ;
This grammar is highly ambiguous because it tells us nothing about precedence
or associativity. It might parse 2 + 3 * 5
as (2 + 3) * 5
which would
give us an incorrect result. We can fix this by providing distinct grammar
rules for each "level" of precedence. We have the following levels, ordered
lowest-precedence to highest:
- Addition/Subtraction
- Multiplication/Division
- Exponentiation
- Parenthetical/Numbers
We can translate that into a grammar that encodes precedence rules. To do this, each rule must "contain" all the rules of higher precedence. For example:
expression = factor, ("+", "-"), factor | factor ;
factor = exponent, ("*", "/"), exponent | exponent ;
exponent = primary, "^", primary | primary ;
primary = number | "(", expression, ")" ;
This grammar still presents a problem for recursive-descent because it contains a FIRST/FIRST conflict. We can eliminate that as follows:
expression = factor, { ("+", "-"), factor } ;
factor = exponent, { ("*", "/"), exponent } ;
exponent = primary, { "^", exponent } ;
primary = number | "(", expression, ")" ;
We now have a grammar that correctly encodes the precedence of our expressions. We don't yet have a our associativity rules encoded, though. To deal with associativity rules, we will construct our "abstract syntax tree" of the expression such that it encodes the associativity rules.
// Left-Associative:
function expression() {
var expr = factor();
while (match("+", "-")) {
var op = consume();
var right = factor();
expr = new BinaryOperation(exp, op, right);
}
return expr;
}
...
// Right-Associative:
function exponent() {
var left = primary();
var exp = left;
if (match("^")) {
var op = consume();
var right = exponent();
expr = new BinaryOperation(left, op, right);
}
return expr;
}
In the left-associative case, we weight the tree to the leftmost child so that it is correct. In the right-associative case, we take advantage of the recursive nature of the exponentiation rule to build up a right-associative tree. The recursion rule is right-recursive which naturally lends itself to right-associativity. In fact, left-associativity is "the difficult case" when we are doing recursive-descent parsing. Unlike the right-associative case, left-associativity requires us to, implicitly or explicitly, "pass along" the left part of the tree. If we had used recursion rather than a loop, that might look like the following:
function expression(left) {
if (left == nil) {
left = factor();
}
if (match("+", "-")) {
var op = consume();
var right = expression(left);
return new BinaryOperation(left, op, right);
}
return left;
}
To sum up: we encode precedence by creating a grammar rule for each "level" of precedence and we accomplish associativity by constructing our expression tree such that the shape of the tree properly encodes the associativity.