ADR 0018: Document Tree Code Generation (Wadler-Lindig Pretty Printer)

Status

Implemented (2026-02-15)

Context

Beamtalk generates Core Erlang text (ADR 0003) via direct string emission — 1,100+ write!/writeln! macro calls across 28 files in crates/beamtalk-core/src/codegen/core_erlang/. The current architecture uses a single String output buffer with manual indentation tracking:

// Current approach: imperative string building
pub(super) struct CoreErlangGenerator {
    output: String,       // Direct string buffer
    indent: usize,        // Manual indentation counter
    // ...
}

// Example: generating a method table
writeln!(self.output, "'method_table'/0 = fun () ->")?;
self.indent += 1;
self.write_indent()?;
write!(self.output, "~{{")?;
for (i, (name, arity)) in methods.iter().enumerate() {
    if i > 0 { write!(self.output, ", ")?; }
    write!(self.output, "'{name}' => {arity}")?;
}
writeln!(self.output, "}}~")?;
self.indent -= 1;

Problems with Direct String Emission

1. Indentation is fragile. The self.indent counter must be manually incremented/decremented in matched pairs. Forgetting a decrement or nesting incorrectly produces malformed Core Erlang that erlc rejects — but the Rust code compiles fine. This class of bug is invisible at compile time.

2. No composability. Code fragments can't be built independently and combined. Every write! call mutates a shared String buffer, so you can't build a function body, inspect it, test it in isolation, or compose it with other fragments. Adding new codegen features requires understanding the mutation flow across multiple files. In practice, AI agents working on codegen issues struggle with the dense write!/writeln! + manual indent patterns — the code is hard to read and reason through, increasing the risk of subtle indentation errors that compile as valid Rust but produce invalid Core Erlang.

3. Testing requires full string comparison. The codegen subsystem has 196 snapshot tests and 170 unit tests. There's no way to unit-test individual code generation fragments (e.g., "does this method table generate correctly?") without running the full pipeline.

4. The codebase is large and growing. 1,100+ write!/writeln! calls across 28 files, with the heaviest files being:

| File | write! calls | Purpose |
|---|---|---|
| value_type_codegen.rs | 204 | Value type modules |
| primitive_implementations.rs | 136 | Intrinsic method bodies |
| counted_loops.rs | 126 | Loop codegen |
| gen_server/methods.rs | 98 | Method dispatch |
| intrinsics.rs | 80 | Intrinsic dispatch |

As the language grows (pattern matching, exception handling, type annotations), this approach will become increasingly difficult to maintain.

Note: Some specific refactoring tasks (e.g., module renaming) can be addressed independently by extending existing abstractions like ModuleName in erlang_types.rs. The document tree is a long-term architectural improvement for the codegen subsystem as a whole.

What Other Rust-Based Compilers Do

Gleam (Rust → Erlang/JS) uses a Document algebraic data type based on Wadler-Lindig's "Strictly Pretty" algorithm. Code generation builds a tree of document nodes, then renders once:

// Gleam's approach — declarative document composition
let module = docvec![
    header,
    "-compile([no_auto_import, nowarn_unused_vars]).",
    line(),
    exports,
    join(statements, lines(2)),
];
module.to_pretty_string(80)

Gleam's Document enum has variants: Str, Line, Nest, Group, Vec, Break. The docvec! macro provides ergonomic composition. Gleam uses this for both Erlang and JavaScript backends — same Document type, different rendering. Gleam's implementation is ~875 lines including tests.

Other compilers follow the same approach; see the Prior Art table below.

Pattern: Modern Rust compilers overwhelmingly use document trees for text code generation, not direct string emission.

Decision

Replace direct write!/writeln! string emission with a Wadler-Lindig document tree for Core Erlang code generation.

Introduce a Document enum (or use the pretty crate) and a docvec! macro for composing Core Erlang output. The code generator builds a tree of Document nodes, which is rendered to a string in a final pass.

What This Looks Like

Before (current):

fn generate_method_table(&mut self, methods: &[(String, usize)]) -> Result<()> {
    writeln!(self.output, "'method_table'/0 = fun () ->")?;
    self.indent += 1;
    self.write_indent()?;
    write!(self.output, "~{{")?;
    for (i, (name, arity)) in methods.iter().enumerate() {
        if i > 0 { write!(self.output, ", ")?; }
        write!(self.output, "'{name}' => {arity}")?;
    }
    writeln!(self.output, "}}~")?;
    self.indent -= 1;
    Ok(())
}

After (document tree):

fn generate_method_table(&self, methods: &[(String, usize)]) -> Document {
    let entries = join(
        methods.iter().map(|(name, arity)| docvec!["'", name, "' => ", arity]),
        ", "
    );
    docvec![
        "'method_table'/0 = fun () ->",
        nest(INDENT, docvec![line(), "~{", entries, "}~"]),
    ]
}

Key differences:

Note on mutable generator state: CoreErlangGenerator holds mutable state beyond the output buffer — var_context (variable scoping/fresh names) and state_threading (field assignment state variables). Functions that use these will still require &mut self even after adopting Document return types. The migration decouples output construction from state mutation but does not eliminate all &mut self methods. The write_document() bridge (Phase 0) accommodates this: Document-returning functions can still take &mut self when they need to generate fresh variable names.
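The shape described above can be sketched in miniature. This is illustrative only: `VarContext`, `fresh`, and `generate_let` are assumed names, and a plain `String` stands in for the real `Document` type. The point is that the method returns a composable fragment yet still takes `&mut self` because it allocates a fresh variable name.

```rust
struct VarContext {
    counter: usize,
}

impl VarContext {
    // Fresh names require mutation: V1, V2, ...
    fn fresh(&mut self, prefix: &str) -> String {
        self.counter += 1;
        format!("{}{}", prefix, self.counter)
    }
}

// Stand-in for the real Document type: fragments as plain owned strings.
type Document = String;

struct Generator {
    var_context: VarContext,
}

impl Generator {
    // Returns a fragment value, but still takes &mut self because it
    // allocates a fresh variable name from the shared context.
    fn generate_let(&mut self, value: &str, body: &str) -> Document {
        let var = self.var_context.fresh("V");
        format!("let <{var}> = {value} in {body}")
    }
}
```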

Document Type

A Document enum for Core Erlang generation (based on Gleam's approach, ~250 lines):

pub enum Document<'a> {
    /// A string literal
    Str(&'a str),
    /// An owned string
    String(String),
    /// A newline followed by current indentation
    Line,
    /// Increase indentation for nested content
    Nest(isize, Box<Document<'a>>),
    /// A sequence of documents
    Vec(Vec<Document<'a>>),
    /// A group that can be rendered flat or broken across lines
    Group(Box<Document<'a>>),
    /// A break point — rendered as a space when flat, newline when broken
    Break(&'a str),
    /// Empty document
    Nil,
}

While Core Erlang mostly has fixed formatting, Group and Break are included from the start because pattern matching compilation (planned soon) will generate deeply nested case expressions where readable line-breaking is needed. Including these variants now (~50 extra lines) avoids a disruptive retrofit later and follows Gleam's proven design.
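To make the rendering model concrete, here is a simplified, self-contained sketch of a renderer for these variants. It is not the real algorithm: full Wadler-Lindig tracks the current column and remaining line width, whereas this version measures each group against the whole line width. It uses `&'static str` only, for brevity.

```rust
enum Doc {
    Str(&'static str),
    Line,
    Nest(isize, Box<Doc>),
    Vec(Vec<Doc>),
    Group(Box<Doc>),
    Break(&'static str),
    Nil,
}

// Width of a document rendered entirely flat; None if it contains a hard
// Line and therefore can never be flat.
fn flat_width(d: &Doc) -> Option<usize> {
    match d {
        Doc::Str(s) | Doc::Break(s) => Some(s.len()),
        Doc::Line => None,
        Doc::Nest(_, inner) | Doc::Group(inner) => flat_width(inner),
        Doc::Vec(ds) => ds.iter().map(flat_width).sum(),
        Doc::Nil => Some(0),
    }
}

fn render(d: &Doc, width: usize, indent: isize, broken: bool, out: &mut String) {
    match d {
        Doc::Str(s) => out.push_str(s),
        Doc::Nil => {}
        Doc::Line => {
            out.push('\n');
            for _ in 0..indent {
                out.push(' ');
            }
        }
        // A Break is its separator string when flat, a newline plus
        // indentation when the enclosing group is broken.
        Doc::Break(s) => {
            if broken {
                out.push('\n');
                for _ in 0..indent {
                    out.push(' ');
                }
            } else {
                out.push_str(s);
            }
        }
        Doc::Nest(n, inner) => render(inner, width, indent + n, broken, out),
        Doc::Vec(ds) => {
            for child in ds {
                render(child, width, indent, broken, out);
            }
        }
        // A Group renders flat if its flat width fits the line width.
        Doc::Group(inner) => {
            let fits = flat_width(inner).map_or(false, |w| w <= width);
            render(inner, width, indent, !fits, out);
        }
    }
}
```

The same document thus produces single-line output when it fits and multi-line output when it does not, with no call site tracking indentation by hand.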

Approach: Roll Our Own vs. Use a Crate

Recommended: Roll a focused implementation (~250 lines), following Gleam's proven design.

| Option | Pros | Cons |
|---|---|---|
| pretty crate | Full Wadler-Lindig, well-tested | Adds dependency; includes features we don't need (ForceBroken, FlexBreak) |
| Gleam-style focused | What we need, including Group/Break; easy to understand | Must write ~250 lines |
| Full Wadler-Lindig custom | Future-proof for formatting tools | Unnecessary complexity |

We need: Str, String, Line, Nest, Vec, Group, Break, Nil, and a docvec! macro.
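A minimal docvec! could look like the following sketch. This is an assumption about the eventual shape, not the final macro: each argument is converted through From conversions into a Document, and the results are collected into Document::Vec. The real implementation would cover more source types and would also accept existing Document values.

```rust
enum Document {
    String(String),
    Vec(Vec<Document>),
    Nil,
}

impl From<&str> for Document {
    fn from(s: &str) -> Self {
        Document::String(s.to_string())
    }
}

impl From<String> for Document {
    fn from(s: String) -> Self {
        Document::String(s)
    }
}

impl From<usize> for Document {
    fn from(n: usize) -> Self {
        Document::String(n.to_string())
    }
}

// docvec!-style macro: convert each argument into a Document, collect into
// a Vec node. An empty invocation yields the empty document.
macro_rules! docvec {
    () => { Document::Nil };
    ($($part:expr),+ $(,)?) => {
        Document::Vec(vec![$(Document::from($part)),+])
    };
}

impl Document {
    // Flat rendering, sufficient for this sketch (no Line/Nest handling).
    fn render(&self) -> String {
        match self {
            Document::String(s) => s.clone(),
            Document::Vec(ds) => ds.iter().map(Document::render).collect(),
            Document::Nil => String::new(),
        }
    }
}
```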

Prior Art

| Compiler | Language | Approach | Notes |
|---|---|---|---|
| Gleam | Rust → Erlang/JS | Custom Document tree (~875 lines) | Wadler-Lindig with docvec! macro; shared across Erlang + JS backends |
| rustc | Rust → LLVM | rustc_ast_pretty module | Document tree for AST pretty-printing |
| prettyplease | Rust syn → Rust | Wadler-style documents | Formats generated Rust code |
| Elm compiler | Haskell → JS | Wadler pretty-printer | Standard in Haskell ecosystem |
| PureScript | Haskell → JS | Doc type with render | Composable document fragments |
| OCaml compiler | OCaml → native | Format module | Built-in pretty-printing with boxes |

Universal pattern: Compilers that emit text-based output use document trees. Direct string concatenation is the exception, not the norm.

Gleam's evolution: Gleam started with simpler codegen and grew into the Document approach as complexity increased. Beamtalk is at a similar inflection point — 1,100+ write calls across 28 files.

Why not a full Core Erlang IR? The prior art compilers listed above all use document trees rather than typed target-language IRs for their text backends. A typed Core Erlang IR (Alternative 2, below) would only become valuable if beamtalk needed to transform or optimize the generated Core Erlang before emission — which it currently doesn't, since erlc handles all optimization passes.

User Impact

This is a purely internal refactoring — it changes how the compiler generates Core Erlang, not what it generates. No user-facing behavior changes.

| Persona | Impact |
|---|---|
| Newcomer | None — same REPL, same error messages, same compiled output |
| Smalltalk developer | None — language semantics and syntax unchanged |
| Erlang/BEAM developer | None — generated Core Erlang is byte-for-byte identical |
| Operator | None — no runtime impact; fewer codegen bugs mean fewer bad BEAM files in production |
| Compiler contributor | Significant improvement — codegen is easier to read, write, test, and refactor |

Contributor Experience (Primary Beneficiary)

Before: Adding a new codegen feature requires:

  1. Understanding the self.output mutation flow across multiple files
  2. Manually tracking indentation state
  3. Writing snapshot tests that compare entire generated files
  4. Risk of indentation bugs that produce valid Rust but invalid Core Erlang

After: Adding a new codegen feature requires:

  1. Writing a function that returns a Document
  2. Composing it with existing document fragments using docvec!
  3. Unit testing the fragment in isolation
  4. Letting the renderer handle indentation declaratively
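Step 3 above can be sketched concretely. The names (`method_table_entries`, `join`, `text`) and the minimal Document stand-in here are assumptions for illustration; the point is that a fragment is a plain value, so a unit test builds it and asserts on its rendering with no generator state or full-pipeline snapshot involved.

```rust
enum Document {
    Text(String),
    Vec(Vec<Document>),
}

impl Document {
    fn render(&self) -> String {
        match self {
            Document::Text(s) => s.clone(),
            Document::Vec(ds) => ds.iter().map(Document::render).collect(),
        }
    }
}

fn text(s: impl Into<String>) -> Document {
    Document::Text(s.into())
}

// Interleave a separator between fragments, as a join() helper might.
fn join(docs: Vec<Document>, sep: &str) -> Document {
    let mut parts = Vec::new();
    for (i, d) in docs.into_iter().enumerate() {
        if i > 0 {
            parts.push(text(sep));
        }
        parts.push(d);
    }
    Document::Vec(parts)
}

// The fragment under test: method-table entries as a document value.
fn method_table_entries(methods: &[(&str, usize)]) -> Document {
    join(
        methods
            .iter()
            .map(|(name, arity)| text(format!("'{name}' => {arity}")))
            .collect(),
        ", ",
    )
}
```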

Steelman Analysis

The Strongest Argument Against This ADR

This is a pure refactoring of the largest subsystem in a ~40k-line compiler that has multiple active epics of unimplemented language features. The refactoring produces zero user-visible value. Every hour spent migrating write! calls is an hour not spent on pattern matching, type inference, or the features that will determine whether Beamtalk has users.

This is a legitimate concern. The ADR proceeds despite it because:

  1. The migration is designed to be organic, not dedicated — new code uses Document, old code migrates opportunistically during feature work
  2. Codegen is the subsystem that every language feature touches — improving its architecture reduces the cost of all future features
  3. The Document type itself is ~250 lines of net-new code; the migration cost is spread across feature PRs, not front-loaded

Option A: Document Tree (Recommended)

Option B: Keep write! (Status Quo)

Option C: Helper Methods Only (No Document Tree)

Tension Points

Alternatives Considered

1. Helper Methods on CoreErlangGenerator (80/20 Solution)

Extract helper methods that encapsulate common patterns without introducing a new intermediate representation:

impl CoreErlangGenerator {
    fn indented(&mut self, body: impl FnOnce(&mut Self) -> Result<()>) -> Result<()> {
        self.indent += 1;
        body(self)?;
        self.indent -= 1;
        Ok(())
    }

    fn emit_call(&mut self, module: &str, function: &str, args: &[&str]) -> Result<()> { ... }
    fn emit_let(&mut self, var: &str, body: impl FnOnce(&mut Self) -> Result<()>) -> Result<()> { ... }
}

Partially adopted: indented() and targeted helpers should be introduced regardless — they provide immediate value at zero risk and can be adopted one call site at a time. However, helpers alone don't solve composability (fragments can't be returned, stored, or tested independently) or the fundamental problem of interleaved mutation. Helper methods are a stepping stone, not the destination.
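A self-contained usage sketch of the indented() idea (the `Gen` struct and `line` helper here are simplified stand-ins for CoreErlangGenerator): because the closure scopes the indentation change, the increment and decrement cannot be mismatched, which is exactly the Problem #1 failure mode.

```rust
use std::fmt::Write;

struct Gen {
    output: String,
    indent: usize,
}

impl Gen {
    // Run body at one deeper indentation level; the decrement is guaranteed
    // to pair with the increment.
    fn indented(&mut self, body: impl FnOnce(&mut Self) -> std::fmt::Result) -> std::fmt::Result {
        self.indent += 1;
        body(self)?;
        self.indent -= 1;
        Ok(())
    }

    // Emit one line at the current indentation.
    fn line(&mut self, text: &str) -> std::fmt::Result {
        writeln!(self.output, "{}{}", "    ".repeat(self.indent), text)
    }
}
```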

2. Template Engine (Tera, Askama)

Use a template engine with Core Erlang templates containing placeholders.

// hypothetical template
module '{{ module_name }}' [{{ exports }}]
  attributes [{{ attributes }}]

{% for function in functions %}
'{{ function.name }}'/{{ function.arity }} = fun ({{ function.params }}) ->
    {{ function.body }}
{% endfor %}

Rejected because:

3. Typed Core Erlang IR

Build a full typed intermediate representation of Core Erlang:

enum CoreExpr {
    Let { var: String, value: Box<CoreExpr>, body: Box<CoreExpr> },
    Apply { fun: Box<CoreExpr>, args: Vec<CoreExpr> },
    Case { expr: Box<CoreExpr>, clauses: Vec<CoreClause> },
    Map { pairs: Vec<(CoreExpr, CoreExpr)> },
    Literal(CoreLiteral),
    // ...
}

Rejected because:

A typed IR may become valuable later (for optimization passes, multiple backends), but it's premature now. The document tree doesn't preclude adding an IR later — they serve different purposes.

4. Use the pretty Crate

Use the existing pretty crate from crates.io instead of rolling our own.

Not adopted because:

5. Incremental Adoption — write! + Document Hybrid

Keep write! for existing code, use Document only for new code.

Partially adopted: The migration strategy (see Implementation) is incremental. But the end goal is full migration — a permanent hybrid would be confusing for contributors who must learn both patterns.

6. Builder Pattern

Use a builder API that wraps String construction with composable methods:

CoreErlangBuilder::new()
    .line("'method_table'/0 = fun () ->")
    .indent(|b| {
        b.text("~{")
         .join(methods.iter(), ", ", |(name, arity)| format!("'{name}' => {arity}"))
         .text("}~")
    })
    .build()  // → String

Not adopted because:

What it does better than write!: Eliminates manual indent += 1 / indent -= 1 pairs, so it solves Problem #1 (fragile indentation). If composability and fragment testing are not priorities, a builder is a pragmatic middle ground.

Consequences

Positive

Negative

Neutral

Implementation

Adoption Strategy: New Code First, Organic Migration

Rather than a dedicated multi-phase migration project that competes with feature work, adopt the document tree organically:

Phase 0: Foundation (~S — single PR)

  1. Add Document enum (with Group/Break variants) and docvec! macro in crates/beamtalk-core/src/codegen/core_erlang/document.rs
  2. Implement to_string() / render() for the document type, including group/break rendering logic
  3. Add unit tests for the document primitives
  4. Add a write_document() bridge method to CoreErlangGenerator that renders a Document to self.output — enabling gradual per-function migration
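The bridge in item 4 could be sketched as follows. Names, the Document shape, and the four-space indent unit are assumptions, not the final API; the point is that a Document fragment is rendered at the generator's current indent and appended to the legacy buffer, so call sites can migrate one function at a time.

```rust
enum Document {
    Str(&'static str),
    Line,
    Vec(Vec<Document>),
}

impl Document {
    // Render with a fixed indent applied after each Line.
    fn render(&self, indent: usize, out: &mut String) {
        match self {
            Document::Str(s) => out.push_str(s),
            Document::Line => {
                out.push('\n');
                out.push_str(&"    ".repeat(indent));
            }
            Document::Vec(ds) => {
                for d in ds {
                    d.render(indent, out);
                }
            }
        }
    }
}

struct CoreErlangGenerator {
    output: String,
    indent: usize,
}

impl CoreErlangGenerator {
    // Bridge: render a Document into self.output at the current indent level,
    // alongside existing write!-based code.
    fn write_document(&mut self, doc: &Document) {
        doc.render(self.indent, &mut self.output);
    }
}
```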

Phase 1: New Code Convention

Phase 2: Opportunistic Migration (Ongoing)

Phase 3: Cleanup (When Migration Naturally Completes)

Approximate migration order (based on which subsystems feature work will touch first):

| Subsystem | Files | write! calls | Likely Feature Trigger |
|---|---|---|---|
| Control flow | 5 files | 249 | Block semantics (BT-204) |
| Gen server | 5 files | 293 | Actor runtime (BT-207) |
| Module generation | 5 files | 367 | Metaclasses (BT-319) |
| Expressions + intrinsics | 2 files | 146 | Stdlib (BT-205) |
| Leaf functions | 5 files | 45 | Various |

Verification: Run just ci after every migration — snapshot tests ensure output is identical.

Key advantage: No dedicated migration phases compete with feature work. The document tree earns its keep by making each feature PR's codegen changes cleaner and more testable.

Migration Health Checks

To prevent the organic migration from stalling indefinitely:

Migration Path

Not applicable — this is an internal refactoring with no user-facing changes.

References