ADR 0052: Recursive Descent Parser with Pratt Parsing for Binary Operators

Status

Accepted | Implemented (2026-03-05)

Context

Problem Statement

Beamtalk's compiler is designed as a language service first — the parser must support IDE operations (completions, diagnostics, formatting, go-to-definition) with sub-100ms response times (Principle 12). This imposes hard constraints on parser design:

Binary Operator Precedence

ADR 0039 established that Beamtalk departs from Smalltalk's pure left-to-right binary evaluation in favour of standard math precedence (* binds tighter than +). This requires the parser to implement a full precedence hierarchy rather than a flat left-to-right rule.

The initial implementation handled binary precedence with three dedicated methods:

Each method was ~20 LOC, largely duplicated. Adding a new operator required creating a new method, threading it into the precedence call chain, and updating match arms — a ~20 LOC change across 2 files.

Research Spike

BT-109 evaluated parser alternatives. The full findings are in docs/internal/parser-architecture.md.

Decision

Keep the hand-rolled recursive descent parser, enhanced with Pratt parsing for binary operator precedence.

Recursive Descent for Structure

The top-level parser structure remains hand-rolled recursive descent. One function per grammar production, straightforward control flow, error recovery via synchronisation at statement boundaries. This approach is well understood, easy to step through in a debugger, and gives full control over error messages and trivia attachment.

Pratt Parsing for Binary Operators

Binary operator precedence is handled with a Pratt (top-down operator precedence) parser. A single binding power table maps operator tokens to numeric precedence levels. One recursive function — parse_binary_with_pratt — handles all binary operators:

/// Binding power table — the ONLY place to add or change binary operator precedence.
fn binary_binding_power(op: &str) -> Option<BindingPower> {
    match op {
        ">>"                         => Some(BindingPower::left_assoc(5)),  // Method lookup
        "==" | "/=" | "=:=" | "=/=" | "=" => Some(BindingPower::left_assoc(10)), // Equality
        "<"  | ">"  | "<="  | ">="  => Some(BindingPower::left_assoc(20)), // Comparison
        "+"  | "-"  | "++"           => Some(BindingPower::left_assoc(30)), // Additive
        "*"  | "/"  | "%"            => Some(BindingPower::left_assoc(40)), // Multiplicative
        "**"                         => Some(BindingPower::right_assoc(50)), // Exponentiation
        _ => None, // Unknown operator ends the binary expression
    }
}

Adding a new operator (e.g., bitwise &) is a single line:

"&" => Some(BindingPower::left_assoc(25)),

What This Looks Like in Practice

Beamtalk programmers do not interact with the parser directly. The effects are visible through correct expression evaluation and accurate IDE diagnostics:

// Pratt parsing ensures these evaluate as expected:
2 + 3 * 4        // => 14   (not 20 — * binds tighter than +)
10 - 2 - 3       // => 5    (left-associative: (10 - 2) - 3)
2 ** 3 ** 2      // => 512  (right-associative: 2 ** (3 ** 2))
x > 0 =:= y > 0 // => true if both positive (comparison before equality)

// Method lookup has lowest binary precedence:
Counter >> #increment  // => CompiledMethod object

// Errors point to the right token:
2 + * 4          // Error: expected expression, found `*` (column 5)

The IDE can report diagnostics in-flight while typing — the parser does not abort on the first error.

Prior Art

Recursive Descent: The IDE-First Standard

Hand-rolled recursive descent is the parser architecture of choice in production language tooling:

These projects share a common finding: the control and debuggability of hand-rolled parsers outweigh the abstraction benefits of combinators or generators when IDE responsiveness is a first-class requirement.

Pratt Parsing: Well-Established Algorithm

Pratt parsing originates with Vaughan Pratt (1973) and has become standard for expression parsing:

Traditional Smalltalk: No Precedence

Traditional Smalltalk (Pharo, Squeak, Smalltalk-80) evaluates binary messages strictly left-to-right with no numeric precedence hierarchy. 2 + 3 * 4 evaluates to 20, not 14. Beamtalk departs from this — see ADR 0039 for the full rationale. The Pratt implementation is what makes this departure possible at all.

User Impact

Newcomer (Python/JS/Rust background): Operators behave as expected. 2 + 3 * 4 is 14. Errors point to the right token. IDE completions work mid-expression. The parser is invisible when it's working correctly, and helpful when it isn't.

Smalltalk developer: The expression evaluation order differs from Pharo/Squeak — see ADR 0039. The message-passing syntax and three-tier message hierarchy (unary > binary > keyword) is preserved. Only the intra-binary precedence ordering changes.

Erlang/BEAM developer: No impact. The parser is a Rust crate producing an AST; it does not affect generated BEAM code or runtime behaviour.

Tooling developer (LSP, debugger, formatter): The AST preserves trivia (whitespace, comments) and source spans on every node. Error nodes appear in-place rather than halting the parse. This makes formatting, completion, and diagnostics straightforward to implement without special-casing partially-written code.

Production operator: No direct impact. Parse errors are compiler-time, not runtime. Faster parsing means faster REPL iteration in production use.

Steelman Analysis

Option A: Parser Combinators (Chumsky)

Why rejected anyway: Chumsky uses Pratt parsing internally for its pratt() combinator — adopting it adds a dependency to get the same algorithm. Chumsky's generic types make error messages and stack traces harder to read. Error recovery is combinator-configured rather than explicit synchronisation points, which is harder to tune for IDE responsiveness. Chumsky was 1.0.0-alpha at evaluation time.

Option B: Parser Generators (lalrpop, pest)

Why rejected anyway: Parser generators (LALR, PEG) require all input to be valid-enough to reduce. Error recovery requires significant extra work to customise. Trivia handling needs a preprocessing layer. Smalltalk-style messages (context-sensitive keyword message arity, statement separators) don't map naturally to LR/PEG grammars without substantial encoding complexity.

Option C: rowan / ungrammar (Lossless CST)

Why rejected anyway: Path-dependent — the working parser existed before rowan was evaluated. Adopting rowan would require rewriting the AST representation across ~35+ files. At the current codebase scale, the ceremony of typed accessor wrappers over an untyped CST is not yet justified by the grammar's complexity. If the grammar grows to the point where hand-rolled disambiguation becomes a maintenance burden (signals: ADR 0047, ADR 0048), rowan is the natural migration target.

Option D: tree-sitter

Why rejected anyway: tree-sitter produces a CST that must be lowered into an AST for compilation. This creates a dual-tree problem: two representations of the same source, two transformation steps, and divergence risk between the tree used for IDE queries and the tree used for codegen. Beamtalk's compiler-as-language-service principle depends on a single AST serving both purposes.

Tension Points

Tooling developers would find tree-sitter compelling — it provides the strongest incremental-parsing story and free editor integration. The tension is between single-tree simplicity (hand-rolled: compiler AST serves IDE and codegen) and ecosystem reach (tree-sitter: immediate Neovim/Helix support from a grammar file). This project weights single-tree simplicity and IDE diagnostic quality higher. A tree-sitter grammar for syntax highlighting remains a viable future addition outside the compiler pipeline.

Alternatives Considered

Chumsky (Parser Combinator)

Chumsky is a Rust parser combinator library with error recovery support and a pratt() combinator for operator precedence.

Why not adopted:

FactorChumskyHand-rolled + Pratt
Precedence handlingUses Pratt internallyUses Pratt directly
Error recoveryCombinator-based (harder to tune)Explicit sync points
Trivia handlingRequires token adapter layerIntegrated
DebuggingComplex generic typesSimple call stack
DependenciesAdds external crateNone
API stability1.0.0-alpha at evaluationStable

The core finding: Chumsky wraps the same algorithm with more indirection. There is no algorithmic advantage — only complexity.

lalrpop / pest (Parser Generators)

LALR (lalrpop) and PEG (pest) generators produce parsers from grammar files.

Why not adopted:

rowan / ungrammar (Lossless CST)

rust-analyzer — the ADR's cited inspiration — does not use a hand-rolled AST in isolation. It uses rowan, a red-green tree library that provides a lossless concrete syntax tree preserving all trivia. A typed AST layer is generated from an ungrammar specification as a zero-cost view over the CST, not a separate tree. This eliminates the "dual-tree problem" cited against tree-sitter while retaining the benefits of a formal grammar description.

Why not adopted:

Honest assessment: rowan is the strongest alternative not adopted. If Beamtalk's parser were being designed today from scratch, a rowan-based lossless CST would be a serious contender. The decision to keep the hand-rolled AST is partly path-dependent — the working parser existed before this alternative was evaluated. If future grammar complexity (see ADR 0047, ADR 0048) makes the hand-rolled approach untenable, rowan is the most likely migration target.

tree-sitter

tree-sitter is a GLR parser generator designed specifically for editor tooling. It generates fast, incremental, error-recovering parsers from a grammar file, and is the basis for syntax highlighting in Neovim, Helix, GitHub code views, and many other tools.

tree-sitter is arguably the strongest IDE-focused alternative and warrants detailed evaluation.

What tree-sitter provides:

Why not adopted:

Factortree-sitterHand-rolled + Pratt
ArchitectureDual-tree (CST + AST transformation)Single AST for IDE and codegen
LanguageGrammar file → C library (FFI from Rust)Pure Rust
DiagnosticsStructural error nodes; semantic errors need separate passFull semantic diagnostics in one pass
Operator precedenceSupported in grammarPratt table
TriviaFull CST (all tokens)Selective trivia attachment
DebuggingGenerated C code, harder to step throughPlain Rust call stack
DependenciesC FFI (tree-sitter crate)None

The decisive factor is the dual-tree problem. tree-sitter produces a CST — every token including whitespace. For an IDE syntax highlighter this is ideal. For a compiler that also does type checking and code generation, the CST must be lowered into a separate AST. This means maintaining two tree representations, a transformation step, and two sources of truth for source positions. Beamtalk's compiler-as-language-service architecture (Principle 12) depends on the compiler's own AST serving IDE queries directly — adding a separate tree-sitter pass would either duplicate work or introduce a divergence risk between the two representations.

A tree-sitter grammar for Beamtalk may still be worth developing independently for editor syntax highlighting — that would not conflict with this decision. But it would be editor tooling, not the compiler pipeline.

Consequences

Positive

Negative

Neutral

Implementation

Completed (BT-110)

The implementation is fully in place:

The three original methods (parse_comparison, parse_additive, parse_multiplicative) were replaced by parse_binary_with_pratt with all existing tests passing unchanged.

Extending the Parser

To add a new binary operator:

  1. Add one entry to binary_binding_power in parser/mod.rs
  2. Add a test to verify the new precedence level
  3. Update docs/beamtalk-language-features.md if the operator is user-visible

No other files need to change.

References