ADR 0011: Robustness Testing — Layered Fuzzing and Error Quality

Status

Implemented (2026-02-15)

Context

Beamtalk's parser is designed for error recovery — it must produce a partial AST with diagnostics even when input is malformed (critical for IDE support). However, the current test suite has almost no coverage of this capability:

The parser has good infrastructure for recovery (synchronization at ., ], ), }, ;, ^) and the (Module, Vec<Diagnostic>) return type enforces that errors don't prevent AST construction. But none of this is stress-tested.
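The synchronization-point recovery described above can be illustrated with a toy sketch: on error, skip tokens until one of the sync characters, then resume. This is a hypothetical simplification (tokens as chars, a standalone `synchronize` function), not beamtalk_core's real token representation or recovery code.

```rust
// Toy sketch of panic-mode recovery: on error, skip tokens until a
// synchronization point, then resume parsing there. The char-based token
// stream is illustrative only.
fn synchronize(tokens: &[char], mut pos: usize) -> usize {
    const SYNC: &[char] = &['.', ']', ')', '}', ';', '^'];
    while pos < tokens.len() && !SYNC.contains(&tokens[pos]) {
        pos += 1;
    }
    pos // caller resumes at (or just past) the sync token
}

fn main() {
    let toks: Vec<char> = "x + @ garbage . y".chars().collect();
    let resume = synchronize(&toks, 4); // error noticed at '@'
    assert_eq!(toks[resume], '.');
    println!("resumed at index {resume}");
}
```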

Risk

A single panic on malformed input would crash the shared REPL daemon, and with it every active session.

Since Beamtalk is interactive-first, users type incomplete and malformed syntax constantly during live coding. Parser robustness is not optional — it's a core UX requirement.

Current State

Area                     | Coverage     | Gap
Parser crash safety      | None         | Any random input could panic
Parser recovery quality  | 2 unit tests | Recovery produces AST but quality untested
Diagnostic span validity | None         | Spans could point outside input
Error message quality    | 9 E2E cases  | Near-miss syntax, cascading errors untested
REPL error round-trip    | 9 E2E cases  | Error formatting, hint generation untested
Unicode handling         | None         | Emoji, multi-byte, invalid UTF-8 untested

Decision

Adopt a three-layer robustness testing strategy, where each layer catches a different class of bugs. Layers are independent and can be implemented incrementally.

Layer 1: cargo-fuzz — Crash Safety

Coverage-guided fuzzing feeds random bytes to the parser. The parser must never panic on any input.

Fuzz target:

// fuzz/fuzz_targets/parse_arbitrary.rs
#![no_main]
use libfuzzer_sys::fuzz_target;
use beamtalk_core::source_analysis::{lex_with_eof, parse};

fuzz_target!(|data: &[u8]| {
    // Only test valid UTF-8 (parser expects strings)
    if let Ok(source) = std::str::from_utf8(data) {
        let tokens = lex_with_eof(source);
        let (_module, _diagnostics) = parse(tokens);
        // Success = no panic. That's the only assertion.
    }
});

Corpus seeding: All .bt files from examples/ and tests/e2e/cases/ provide realistic starting points for mutation.

CI integration: Run nightly (not per-PR — too slow). Store corpus in fuzz/corpus/parse_arbitrary/.
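Assuming the target name above, local setup, corpus seeding, and a time-bounded run might look like the following. These commands are a sketch, not a pinned CI configuration; the 600-second budget is one of the options weighed in the open questions below.

```shell
# One-time setup (sketch; assumes the parse_arbitrary target above)
cargo install cargo-fuzz

# Seed the corpus from real Beamtalk sources
mkdir -p fuzz/corpus/parse_arbitrary
cp examples/*.bt tests/e2e/cases/*.bt fuzz/corpus/parse_arbitrary/

# Bounded run: libFuzzer's -max_total_time caps wall-clock seconds
cargo fuzz run parse_arbitrary -- -max_total_time=600
```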

What it catches:

Layer 2: proptest — Grammar-Aware Properties

Generate syntactically-structured (but sometimes invalid) Beamtalk and verify parser invariants.

Key properties:

use proptest::prelude::*;
use beamtalk_core::source_analysis::{lex_with_eof, parse};

// Property 1: Parser never panics on any string
proptest! {
    #[test]
    fn parser_never_panics(input in "\\PC*") {
        let tokens = lex_with_eof(&input);
        let (_module, _diagnostics) = parse(tokens);
    }
}

// Property 2: Diagnostics always have valid spans
proptest! {
    #[test]
    fn diagnostic_spans_within_input(input in gen_near_valid_beamtalk()) {
        let tokens = lex_with_eof(&input);
        let (_module, diagnostics) = parse(tokens);
        for d in &diagnostics {
            prop_assert!(d.span.end() as usize <= input.len(),
                "Span {:?} exceeds input length {}", d.span, input.len());
        }
    }
}

// Property 3: Errors always produce non-empty diagnostics
proptest! {
    #[test]
    fn errors_produce_diagnostics(input in gen_invalid_beamtalk()) {
        let tokens = lex_with_eof(&input);
        let (_module, diagnostics) = parse(tokens);
        prop_assert!(
            !diagnostics.is_empty(),
            "Invalid input did not produce any diagnostics"
        );
    }
}

// Property 4: Error messages are non-empty and don't contain internal names
proptest! {
    #[test]
    fn error_messages_are_user_facing(input in gen_near_valid_beamtalk()) {
        let tokens = lex_with_eof(&input);
        let (_module, diagnostics) = parse(tokens);
        for d in &diagnostics {
            prop_assert!(!d.message.is_empty(), "Empty error message");
            prop_assert!(!d.message.contains("TokenKind"),
                "Internal type leaked in error: {}", d.message);
            prop_assert!(!d.message.contains("unwrap"),
                "Debug text leaked in error: {}", d.message);
        }
    }
}
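The span invariant from Property 2 can be checked in isolation. A self-contained sketch follows; `span_is_valid` is a hypothetical helper, not beamtalk_core API, and it is slightly stricter than the property above in that it also rejects spans that split a multi-byte character.

```rust
// Sketch of the span-validity invariant: a diagnostic span must lie
// within the input and land on UTF-8 character boundaries.
fn span_is_valid(input: &str, start: usize, end: usize) -> bool {
    start <= end
        && end <= input.len()
        && input.is_char_boundary(start)
        && input.is_char_boundary(end)
}

fn main() {
    let src = "counter🚀 := 0"; // the rocket emoji occupies bytes 7..11
    assert!(span_is_valid(src, 0, 7));   // up to the emoji: fine
    assert!(!span_is_valid(src, 0, 8));  // inside the emoji: invalid
    assert!(!span_is_valid(src, 0, 99)); // past end of input: invalid
    println!("all span checks passed");
}
```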

Grammar-aware generators produce structured-but-mutated Beamtalk:

fn gen_near_valid_beamtalk() -> impl Strategy<Value = String> {
    prop_oneof![
        // Valid expressions with random mutations
        gen_valid_expr().prop_map(|e| mutate_random_char(e)),
        // Mismatched brackets
        gen_block().prop_map(|b| b.replace(']', ')')),
        // Missing keyword colons
        gen_keyword_msg().prop_map(|m| m.replace(':', "")),
        // Truncated input (character-based to avoid invalid UTF-8 slices)
        gen_valid_expr().prop_map(|e| {
            let mid = e.chars().count() / 2;
            e.chars().take(mid).collect()
        }),
        // Duplicated operators
        gen_binary_expr().prop_map(|e| e.replace("+", "+ +")),
    ]
}
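The char-based truncation mutator above sidesteps a real trap: slicing a string at an arbitrary byte index panics when the index falls inside a multi-byte character, which would crash the generator itself rather than exercise the parser. A self-contained demonstration:

```rust
// Why the truncation mutator counts chars instead of bytes: an arbitrary
// byte index can land mid-character, where slicing panics.
fn truncate_half(e: &str) -> String {
    let mid = e.chars().count() / 2;
    e.chars().take(mid).collect()
}

fn main() {
    let s = "x → y"; // 7 bytes but 5 chars; '→' occupies bytes 2..5
    assert!(!s.is_char_boundary(s.len() / 2)); // so &s[..3] would panic
    assert_eq!(truncate_half(s), "x ");        // char-based is always safe
    println!("{:?}", truncate_half(s));
}
```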

CI integration: Run with cargo test (fast — seconds, not hours). Part of just test-rust.

What it catches:

Layer 3: Curated Error E2E Suite — UX Regression

Hand-written test cases covering common user mistakes, with pinned expected error messages.

Expand tests/e2e/cases/errors.bt (or split into focused files):

// === Near-miss syntax ===

// Missing colon in keyword message
Counter subclass Foo
// => ERROR: expected expression

// Wrong bracket type
[:x | x + 1)
// => ERROR: Expected ']' to close block

// Duplicate parameter names
[:x :x | x + 1]
// => ERROR:

// Extra closing bracket
Counter spawn]
// => ERROR:

// === Invalid literals ===

// Unclosed string
"hello
// => ERROR:

// Invalid number
3.14.15
// => ERROR:

// === Cascading errors ===

// Multiple errors in one expression
[:x | ] + ]
// => ERROR:

// === Unicode edge cases ===

// Emoji in identifier
counter🚀 := 0
// => ERROR:

// === REPL-specific ===

// Empty input (whitespace only)

// => _

// Very deeply nested
((((((((((1))))))))))
// => 1

Snapshot testing for error messages: Use insta to pin exact error message text. Any change requires explicit cargo insta accept.
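The mechanism can be sketched without the insta crate: compare formatted output against a pinned string, so any rewording fails the test until the snapshot is deliberately updated. `format_diagnostic` here is a hypothetical stand-in, not beamtalk's real formatter, and the pinned text is illustrative.

```rust
// Hand-rolled illustration of snapshot pinning. insta automates the
// pinned-string bookkeeping; cargo insta accept rewrites it on change.
fn format_diagnostic(msg: &str, line: usize, col: usize) -> String {
    format!("ERROR (line {line}, col {col}): {msg}")
}

fn main() {
    let pinned = "ERROR (line 1, col 12): Expected ']' to close block";
    let actual = format_diagnostic("Expected ']' to close block", 1, 12);
    // Any rewording of the message breaks this comparison, forcing an
    // explicit snapshot update: exactly the regression gate we want.
    assert_eq!(actual, pinned);
    println!("snapshot matches");
}
```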

CI integration: Run with just test-e2e (existing infrastructure).

What it catches:

Prior Art

Rust Compiler (rustc)

Gleam Compiler

TypeScript Compiler

Common Pattern

All mature language implementations use both automated fuzzing (crash safety) and curated error tests (UX quality). Neither alone is sufficient.

User Impact

🧑‍💻 Newcomer

🎩 Smalltalk Developer

⚙️ BEAM Developer

🏭 Operator

Steelman Analysis

Best argument for "proptest-only" (rejected)

"Grammar-aware generation is the sweet spot — it finds deeper bugs than random bytes, and you get shrinking for free. cargo-fuzz mostly finds trivial panics that a few unit tests could catch. Skip the fuzzing infrastructure overhead."

Rebuttal: cargo-fuzz's coverage-guided mutation finds crash paths that structured generators miss — it explores token sequences no grammar generator would produce. The two are complementary, not competitive.

Best argument for "cargo-fuzz + E2E only" (rejected)

"proptest grammar generators are expensive to build and maintain — they essentially re-implement the grammar. For the same effort, you could write 200 curated test cases that cover more real-world scenarios."

Rebuttal: Grammar generators test invariants (spans valid, diagnostics present, no internal leaks) that hold across all inputs. Hand-written tests check specific cases but miss the combinatorial explosion. And generators don't need to cover the full grammar — even simple mutators find real bugs.

Best argument for "do nothing yet" (rejected)

"With only 9 error cases but no reported parser crashes, is this solving a real problem? Wait until users report crash bugs, then add targeted tests."

Rebuttal: The REPL daemon is shared infrastructure — a single crash affects all sessions. The cost of a crash in production (lost REPL state, killed sessions) far exceeds the cost of prevention. And fuzz testing is cheapest to add early, before the parser grows more complex.

Tension points

Alternatives Considered

Alternative: Grammar-Based Fuzzing (e.g., Grammarinator)

Generate inputs from a formal grammar definition, ensuring syntactic structure.

Rejected because:

Alternative: Mutation-Based Testing (e.g., AFL)

Use AFL's mutation strategies instead of libFuzzer.

Rejected because:

Alternative: Error Message Approval Tests Only

Just pin error messages with snapshot tests, no fuzzing.

Rejected because:

Consequences

Positive

Negative

Neutral

Implementation

Phase 1: cargo-fuzz Setup (S)

Phase 2: proptest Parser Properties (M)

Phase 3: Curated Error E2E Suite (M)

Phase 4: REPL Round-Trip Properties (S)

Affected Components

Component            | Change                            | Phase
crates/beamtalk-core | proptest properties, fuzz targets | 1, 2
tests/e2e/cases/     | Expanded error test suite         | 3
fuzz/                | New directory for cargo-fuzz      | 1
.github/workflows/   | Nightly fuzz CI job               | 1
crates/beamtalk-cli  | REPL round-trip property tests    | 4

Open Questions

  1. Lexer isolation fuzzing — The current fuzz target chains lex_with_eof() into parse(). Should we also fuzz the lexer in isolation? It handles raw bytes first and could have its own crash paths independent of parser recovery.

  2. Nightly fuzz budget — How long should the CI fuzz job run per night? 10 minutes (cheap, catches easy crashes), 1 hour (good coverage), or 8 hours (thorough but expensive)? Longer runs find rarer bugs but cost more compute.

  3. Error E2E file organization — Keep all error cases in one errors.bt, or split by category (errors_syntax.bt, errors_runtime.bt, errors_unicode.bt)? Single file is simpler; multiple files allow parallel test development.

  4. Snapshot pinning granularity — Should error message snapshots pin the exact full text (brittle — any rewording breaks CI, but catches all regressions) or just key substrings (flexible — allows message improvements, but misses subtle UX degradation)?

  5. Crash triage policy — When cargo-fuzz finds a parser crash, what's the severity? Release-blocker? Or file-and-fix-later? This determines whether nightly fuzz failures page someone or just create Linear issues.

  6. proptest generator depth — How much grammar do we model? Full recursive expression tree generators (expensive to build and maintain as syntax evolves) or simple string mutators like truncation/bracket-swap (ships fast, less coverage)?

  7. Erlang runtime fuzzing — This ADR covers the Rust-side parser and REPL protocol. Should we also fuzz the Erlang eval path (beamtalk_repl_eval)? That requires different tooling (PropEr or EQC) and is a separate implementation effort — possibly a follow-up ADR.

References