ADR 0014: Beamtalk Test Framework — Native Unit Tests and CLI Integration Tests

Status

Implemented (2026-02-15)

Context

Problem Statement

Beamtalk currently has 40 E2E test files (~3,400 lines) in tests/e2e/cases/ that test language features. These tests work by:

  1. Starting a REPL daemon (Erlang node + TCP server)
  2. Sending each expression over TCP to the REPL
  3. Reading back the result and comparing against // => expected comments
  4. Running serially, one expression at a time

This approach has served us well for validating the full compilation pipeline, but it has serious limitations:

Performance: The full E2E suite takes ~90 seconds. Each expression requires a TCP round-trip through the REPL. As the language grows, this will become a bottleneck.

Misclassification: Most E2E tests are actually unit tests for language features (arithmetic, string operations, block semantics) — they don't need the REPL at all. They test that 1 + 2 equals 3, not that the REPL can evaluate 1 + 2.

Missing real E2E coverage: We have no tests for actual end-to-end workflows: beamtalk build, beamtalk repl session management, beamtalk test, workspace lifecycle, CLI argument parsing.

No native test framework: The language features doc (line 1049) describes a TestCase subclass: pattern, but there is no implementation. Users can't write tests in Beamtalk itself.

Current State

| Layer | Tool | Speed | What it tests |
| --- | --- | --- | --- |
| Rust unit tests | cargo test | Fast (~5s) | Parser, AST, codegen |
| Erlang unit tests | rebar3 eunit | Fast (~3s) | Runtime, primitives, object system |
| Compiler snapshot tests | cargo test | Fast (~2s) | 51 codegen snapshots |
| E2E tests | just test-e2e | Slow (~90s) | Language features via REPL |

The testing pyramid is inverted for language feature testing — everything goes through E2E when most could be fast compiled tests.

Constraints

  1. Tests must compile to BEAM bytecode (same pipeline as user code)
  2. Must work with EUnit (Erlang's standard test framework) for CI integration
  3. Must not require a running REPL daemon for pure language tests
  4. Must preserve the existing E2E suite for REPL/workspace integration testing
  5. Should feel idiomatic to Smalltalk developers (SUnit heritage)

Decision

Implement a native test framework, delivered in two phases:

Phase 1: Compiled Expression Tests (beamtalk test)

Reuse the existing // => assertion format but compile test files directly to EUnit modules — no REPL needed.

Test file format (unchanged from current E2E):

// test/integer_test.bt

// Basic arithmetic
1 + 2
// => 3

5 negated
// => -5

// String operations
'hello' size
// => 5

'hello' , ' world'
// => 'hello world'

What changes: The compiler parses // => comments as test assertions at compile time. The beamtalk test command then generates a thin EUnit wrapper in Erlang source that calls the compiled BEAM modules:

%% Generated EUnit wrapper for test/integer_test.bt
-module(integer_test_tests).
-include_lib("eunit/include/eunit.hrl").

line_3_test() ->
    ?assertEqual(3, beamtalk_integer:dispatch('+', [2], 1)).

line_6_test() ->
    ?assertEqual(-5, beamtalk_integer:dispatch(negated, [], 5)).

line_10_test() ->
    ?assertEqual(5, beamtalk_string:dispatch(size, [], <<"hello">>)).

line_13_test() ->
    ?assertEqual(<<"hello world">>, beamtalk_string:dispatch(',', [<<" world">>], <<"hello">>)).

Note: The wrapper is generated Erlang source (.erl), not Core Erlang. This lets us use EUnit's ?assertEqual macros directly. The actual Beamtalk expressions compile through the normal Core Erlang pipeline; the wrapper just calls the compiled dispatch functions.
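
The pairing of expressions with // => comments can be sketched as below. This is a minimal illustration over raw source lines; the real assertion parser works on the AST, and `extract_assertions` is a hypothetical name.

```rust
/// Sketch: pair each expression line with the `// => expected` comment that
/// follows it, returning (1-based line number, expression, expected value).
fn extract_assertions(source: &str) -> Vec<(usize, String, String)> {
    let mut pending: Option<(usize, String)> = None; // (line index, expression)
    let mut out = Vec::new();
    for (i, line) in source.lines().enumerate() {
        let trimmed = line.trim();
        if let Some(expected) = trimmed.strip_prefix("// =>") {
            // An assertion closes the most recent pending expression.
            if let Some((idx, expr)) = pending.take() {
                out.push((idx + 1, expr, expected.trim().to_string()));
            }
        } else if trimmed.is_empty() || trimmed.starts_with("//") {
            // Blank lines and ordinary comments carry no expression.
        } else {
            pending = Some((i, trimmed.to_string()));
        }
    }
    out
}
```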

CLI command:

$ beamtalk test
Compiling 40 test files...
Running tests...

  integer_test: 12 tests, 12 passed ✓
  string_test: 8 tests, 8 passed ✓
  block_test: 15 tests, 15 passed ✓
  ...

40 files, 287 tests, 287 passed, 0 failed (1.2s)

Key properties:

Stateful tests: Tests that use variables across expressions (e.g., counter := Counter spawn followed by counter increment) compile to a single EUnit test function with sequential statements, preserving variable bindings. Each such file becomes one EUnit test with internal assertions, which matches EUnit's fixture pattern.
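
The stateful/stateless decision can be sketched as a classifier. This is a heuristic illustration with hypothetical names; the real compiler would decide from the AST rather than string matching.

```rust
#[derive(Debug, PartialEq)]
enum TestShape {
    /// One EUnit function per assertion (independent expressions).
    PerAssertion,
    /// One EUnit function running all statements in order, so bindings
    /// like `counter := ...` stay visible to later assertions.
    SingleSequential,
}

/// Sketch: any `:=` assignment means later expressions may depend on
/// earlier state, so fold the whole file into one sequential test.
fn classify_file(expressions: &[&str]) -> TestShape {
    if expressions.iter().any(|e| e.contains(":=")) {
        TestShape::SingleSequential
    } else {
        TestShape::PerAssertion
    }
}
```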

REPL session:

> :test test/integer_test.bt
Compiling 1 test file...
12 tests, 12 passed ✓ (0.1s)

Error output when a test fails:

FAIL test/integer_test.bt:3
  Expression: 1 + 2
  Expected:   4
  Actual:     3

Phase 2: SUnit-style TestCase Classes

Add a TestCase base class enabling idiomatic Smalltalk-style test classes:

// test/counter_test.bt
@load test/fixtures/counter.bt

Object subclass: CounterTest

  setUp =>
    self.counter := Counter spawn

  testIncrement =>
    self.counter increment await
    self assert: (self.counter getValue await) equals: 1

  testMultipleIncrements =>
    3 timesRepeat: [self.counter increment await]
    self assert: (self.counter getValue await) equals: 3

  testInitialValue =>
    self assert: (self.counter getValue await) equals: 0

Key properties:

REPL session:

> :load test/counter_test.bt
> CounterTest runAll
3 tests, 3 passed ✓ (0.3s)

> CounterTest run: #testIncrement
1 test, 1 passed ✓ (0.1s)

Error output:

FAIL CounterTest >> testMultipleIncrements
  assert:equals: failed
  Expected: 4
  Actual:   3
  Location: test/counter_test.bt:12

E2E Tests: Kept for Integration

The existing tests/e2e/ suite continues to test REPL-specific behavior:

// tests/e2e/cases/workspace_bindings.btscript
// These NEED the REPL because they test workspace state

Transcript show: 'Hello from E2E'
// => nil

Transcript class
// => TranscriptStream

Criteria for E2E vs compiled test:

| Test needs... | Use |
| --- | --- |
| Pure language features (arithmetic, strings, blocks) | beamtalk test (compiled) |
| Actor spawning + messaging | beamtalk test with @load |
| Workspace bindings (Transcript, Beamtalk) | E2E (needs REPL) |
| REPL commands (:load, :quit, :help) | E2E (needs REPL) |
| CLI commands (beamtalk build, beamtalk test) | E2E / integration test |

Bootstrap Constraint: Why Both Phases Are Needed

SUnit-style TestCase classes (Phase 2) cannot be used to test the stdlib primitives they depend on — this creates a circular dependency:

TestCase (stdlib class)
  └── depends on: Object, Integer (comparisons), String (error messages)

Integer tests
  └── depends on: TestCase
  └── ...which depends on: Integer  ← circular!

In Pharo/Squeak this doesn't matter because everything lives in the image with no compilation order. But Beamtalk has separate compilation with OTP app dependencies, so the build order matters.

Resolution: Phase 1 expression tests (// =>) have zero framework dependencies — they compile directly to EUnit with no Beamtalk class imports. This makes them the correct tool for testing stdlib primitives. Phase 2 TestCase classes are for user-level code (actors, application logic) where setUp/tearDown and rich assertions justify the dependency.

| Layer | Test with | Why |
| --- | --- | --- |
| Primitives (Integer, String, Float, Boolean) | Phase 1 (// =>) | No framework dependency — avoids bootstrap circularity |
| Stdlib classes (Array, Block, Dictionary) | Phase 1 (// =>) | Same reason — these are below TestCase in the dependency chain |
| User classes and actors | Phase 2 (TestCase) | setUp/tearDown, fixtures, rich assertions |
| Complex integration scenarios | Phase 2 (TestCase) | State management, test grouping |

This split is not a limitation; each phase is the right tool for its job. Stdlib tests are mostly "does 1 + 2 equal 3?", which is exactly what expression tests excel at.

Prior Art

SUnit (Pharo/Squeak Smalltalk)

The original xUnit framework. Tests are classes inheriting from TestCase with methods prefixed test. Supports setUp/tearDown, assert:equals:, should:raise:.

What we adopt: Class structure, naming conventions, assertion API, skip protocol (skip/skip:). What we adapt: No poolDictionaries or classVariableNames. BEAM process isolation replaces Smalltalk image-level isolation.

EUnit (Erlang)

Erlang's built-in test framework. Test functions end in _test or _test_. Supports ?assertEqual, ?assertMatch macros, test generators, fixtures.

What we adopt: EUnit as the compilation target (Phase 1 and 2 both generate EUnit modules). What we adapt: Beamtalk syntax instead of Erlang macros.

ExUnit (Elixir)

Compile-time test generation via macros. test "name" do ... end blocks, assert macro, setup/setup_all callbacks. Data-driven test generation with for.

What we adopt: The idea of compile-time test generation (our compiler does what Elixir's macro system does). What we adapt: Message-passing assertions instead of macro-based assertions.

Gleam Testing

gleam test compiles test modules, runs each test in its own BEAM process. Simple should.equal(actual, expected) assertions. Process isolation for crash safety.

What we adopt: The beamtalk test CLI pattern, process isolation. What we adapt: SUnit-style assertions instead of function-call assertions.

User Impact

Newcomer (from Python/JS)

Phase 1 is immediately accessible — the // => format is like doctest in Python or inline assertions in tutorials. No new concepts to learn. Phase 2 introduces test classes, familiar from any xUnit framework.

Smalltalk Developer

Phase 2 is exactly what they expect — SUnit is the canonical Smalltalk test framework. TestCase subclass: MyTest with assert:equals: is home territory. Phase 1 is a nice bonus for quick checks.

Erlang/Elixir Developer

Both phases compile to EUnit, which they already know. beamtalk test works like rebar3 eunit or mix test. No new test infrastructure to learn at the BEAM level.

Production Operator

Tests run fast (1-2s compiled vs 90s E2E). CI pipelines are faster. EUnit output integrates with existing CI tools. BEAM process isolation means crashed tests don't take down the suite.

Steelman Analysis

Alternative: Keep Pure E2E (Current Approach)

| Cohort | Their strongest argument |
| --- | --- |
| 🧑‍💻 Newcomer | "The current format is dead simple — I write an expression and the expected result. No classes, no imports, no boilerplate." |
| 🎩 Smalltalk purist | "In Smalltalk, we test in the live image. The REPL is the test environment. Compiling tests separately breaks the interactive-first promise." |
| ⚙️ BEAM veteran | "EUnit already works fine for Erlang. Why add another layer? Just keep testing through the REPL." |
| 🏭 Operator | "90 seconds is acceptable for a CI pipeline. Don't add complexity for marginal speed gains." |
| 🎨 Language designer | "The test format IS the language tutorial format. Keeping them the same means examples are always tested." |

Rebuttal: The speed problem will get worse as the language grows. 40 files at 90s means 200 files will take 7+ minutes. And we genuinely lack REPL/CLI integration tests because everything goes through the same slow path.

Alternative: Phase 2 Only (SUnit-style from the Start)

| Cohort | Their strongest argument |
| --- | --- |
| 🧑‍💻 Newcomer | "One test framework to learn, not two. TestCase classes are universal." |
| 🎩 Smalltalk purist | "SUnit IS the Smalltalk way. Skipping it for a simpler format is a disservice to the language's heritage." |
| ⚙️ BEAM veteran | "Class-based tests map directly to EUnit fixtures. setUp/tearDown = EUnit setup. Clean." |
| 🏭 Operator | "One framework, one test command, one CI step. Simplicity." |
| 🎨 Language designer | "TestCase classes use the class system — testing the language WITH the language is dogfooding at its best." |

Rebuttal: Phase 2 requires the class system to be more mature (class-side methods, instantiation protocol). Phase 1 works today with zero language additions. Delivering Phase 1 first gives us fast tests immediately while Phase 2 develops.

Tension Points

Alternatives Considered

Alternative A: Optimize Current E2E Runner (Parallel Execution)

Run E2E tests in parallel REPL sessions to reduce wall-clock time.

Why rejected: Treats the symptom (speed) not the cause (architectural mismatch). Most tests don't need a REPL. Parallel REPL sessions add complexity and race condition risks.

Alternative B: Pragma-Based Test Methods (@test)

@test 'addition'
(1 + 2) assertEquals: 3

Why rejected: New pragma syntax for something that can be achieved with naming conventions (Phase 2) or existing comment syntax (Phase 1). Adds parser complexity without clear benefit over either phase.

Alternative C: Property-Based Testing First

Focus on property-based testing (QuickCheck/PropEr style) instead of unit tests.

Why rejected: Property-based testing is valuable but requires a more mature type system and standard library. Better as a Phase 3 addition built on top of TestCase.

Consequences

Positive

Negative

Neutral

Implementation

Phase 1: Compiled Expression Tests

Effort: M (Medium) — ~250-350 lines across 5-7 files

| Component | Location | Description |
| --- | --- | --- |
| Assertion parser | crates/beamtalk-core/src/source_analysis/parser/ | Parse // => as TestAssertion AST nodes |
| EUnit codegen | crates/beamtalk-core/src/codegen/core_erlang/ | Generate EUnit test functions from assertion pairs |
| beamtalk test CLI | crates/beamtalk-cli/src/commands/test.rs | Scan dir → compile → run EUnit → format output |
| Test classifier | crates/beamtalk-cli/src/commands/test.rs | Detect workspace binding usage → route to E2E; compile all other tests including @load |
| Output formatter | crates/beamtalk-cli/src/commands/test.rs | Parse EUnit output → user-friendly format |

Affected layers: Parser (Rust), Codegen (Rust), CLI (Rust), minimal Erlang glue.
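
The test classifier's routing rule can be sketched as below. This is a heuristic string scan with hypothetical names; the real classifier would inspect resolved identifiers, not raw text.

```rust
/// Sketch: a test file that touches workspace bindings such as `Transcript`
/// or `Beamtalk` needs a live REPL and is routed to the E2E suite; every
/// other file (including those using `@load`) is compiled directly.
fn route_to_e2e(source: &str) -> bool {
    const WORKSPACE_BINDINGS: [&str; 2] = ["Transcript", "Beamtalk"];
    source
        .lines()
        .filter(|l| !l.trim_start().starts_with("//")) // ignore comment lines
        .any(|l| WORKSPACE_BINDINGS.iter().any(|b| l.contains(b)))
}
```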

Phase 2: SUnit-style TestCase

Effort: XL (Extra Large) — ~650-750 lines across 8-12 files

| Component | Location | Description |
| --- | --- | --- |
| TestCase class | stdlib/src/TestCase.bt | Assertion methods, lifecycle hooks |
| TestCase runtime | runtime/apps/beamtalk_runtime/src/beamtalk_test_case.erl | Assertion primitives |
| Test discovery | crates/beamtalk-cli/src/commands/test.rs | Find TestCase subclasses, extract test* methods |
| EUnit bridge | crates/beamtalk-core/src/codegen/core_erlang/ | Generate EUnit wrappers from TestCase methods |
| TestResult class | stdlib/src/TestResult.bt (optional) | Collect and report test results |

Depends on: ADR 0013 (class instantiation protocol — new for value objects), method introspection.

Skip Protocol (BT-1149)

TestCase provides skip and skip: for platform-conditional and environment-conditional tests.

API:

testUnixOnlyFeature =>
  System osFamily = 'unix' ifFalse: [^self skip: 'Unix only']
  // ... test body

Protocol:

  1. self skip: reason calls the skip: primitive which throws {bunit_skip, Reason}
  2. run_test_method/4 catches throw:{bunit_skip, Reason} and returns {skip, MethodName, Reason}
  3. structure_results/3 and format_results/2 count skips separately from passes and failures
  4. TestResult gains a skipped field accessible via result skipped
  5. hasPassed returns true when failed = 0, regardless of skip count
  6. Summary: N tests, P passed, S skipped, F failed — skipped only shown when S > 0

Follows SUnit: TestCase>>skip: in Pharo signals TestSkipped exception, same mechanism.
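
The counting and display rules above can be sketched in a few lines. The function names are hypothetical; only the rules come from the protocol (skipped shown only when non-zero, pass/fail determined solely by the failure count).

```rust
/// Sketch of the summary line: `N tests, P passed, S skipped, F failed`,
/// with the skipped segment omitted when S == 0.
fn summary(passed: u32, skipped: u32, failed: u32) -> String {
    let total = passed + skipped + failed;
    let mut s = format!("{total} tests, {passed} passed");
    if skipped > 0 {
        s.push_str(&format!(", {skipped} skipped"));
    }
    s.push_str(&format!(", {failed} failed"));
    s
}

/// A run passes when nothing failed, regardless of how many tests skipped.
fn has_passed(failed: u32) -> bool {
    failed == 0
}
```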

Phase 3: Future Enhancements (Out of Scope)

Migration Path

Moving E2E Tests to Compiled Tests

  1. Classify each tests/e2e/cases/*.btscript file:

    • No @load, no workspace bindings → move to test/ (compiled)
    • Uses @load but no workspace bindings → move to test/ with @load support
    • Uses workspace bindings → keep in tests/e2e/ (needs REPL)
  2. Gradual migration — move files one at a time, verify tests still pass

  3. Expected split:

    • ~30 files → compiled tests (pure language features)
    • ~10 files → remain E2E (REPL/workspace integration)
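
The three classification rules above can be sketched as a one-time triage function. This is an illustrative heuristic with hypothetical names, not the actual migration tooling.

```rust
#[derive(Debug, PartialEq)]
enum Destination {
    Compiled,         // no @load, no workspace bindings -> test/
    CompiledWithLoad, // uses @load only -> test/ with @load support
    E2e,              // uses workspace bindings -> stays in tests/e2e/
}

/// Sketch: apply the migration rules to one .btscript file's source.
fn classify_for_migration(source: &str) -> Destination {
    let uses_workspace = ["Transcript", "Beamtalk"].iter().any(|b| source.contains(b));
    if uses_workspace {
        Destination::E2e
    } else if source.lines().any(|l| l.trim_start().starts_with("@load")) {
        Destination::CompiledWithLoad
    } else {
        Destination::Compiled
    }
}
```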

Test Directory Convention

test/                    # Compiled Beamtalk tests (Phase 1 + 2)
├── integer_test.bt      # Phase 1: expression tests
├── string_test.bt       # Phase 1: expression tests
├── counter_test.bt      # Phase 2: TestCase class
└── fixtures/
    └── counter.bt       # Shared test fixtures

tests/e2e/               # REPL integration tests (keep existing)
├── cases/
│   ├── workspace_bindings.btscript
│   └── repl_commands.btscript
└── fixtures/
    └── counter.bt

References