ADR 0069: Actor Observability and Tracing

Status

Accepted | Implemented (2026-03-21)

Context

Problem

Beamtalk actors (gen_server processes) have no built-in performance instrumentation. When debugging timeouts, bottlenecks, or failures, developers must use ad-hoc Erlang tools (sys, recon, dbg) manually — tools that require knowing internal module names and produce unstructured output. AI agents working via MCP have no way to understand actor performance characteristics at all.

The core workflow we need to enable:

enable tracing → run tests → get structured traces → find bottlenecks → debug

This must work for both human developers in the REPL and AI agents via MCP tools.

Current State

ADR 0064 (Runtime Logging Control) established Beamtalk enableDebug: for per-class/per-actor log filtering — visibility into what's happening. This ADR addresses the complementary concern: how fast is it happening — timing, call counts, error rates, bottleneck detection. This was explicitly deferred as BT-1429 during ADR 0064's design.

Existing instrumentation in beamtalk_actor.erl:

OTP provides the building blocks but they're not exposed:

No Beamtalk-level API exists — there is no tracing class, no profile: method, nothing discoverable from the REPL.

Constraints

Decision

Class Design: Tracing as a Standalone Observability Facade

Tracing is exposed via a dedicated Tracing class — a sealed, class-only facade following the System/Logger pattern.

Why a standalone class, not methods on Beamtalk?

ADR 0064 placed logging configuration on Beamtalk, arguing that runtime configuration belongs on the system facade (DDD boundary: Logger is stdlib, logging config is Runtime Context). That decision was sound for a small surface (~5 methods) that naturally extended Beamtalk's existing system introspection role.

Tracing is different in scale: ~15 methods spanning control, queries, analysis, and live health. Adding these to Beamtalk would double its method count and blur its identity from "system introspection" to "system introspection + performance telemetry." The Smalltalk precedent confirms this separation — Pharo's MessageTally and VisualWorks' PerformanceProfiler are standalone classes, not methods on the Smalltalk global. Profiling has always been a distinct tool in Smalltalk environments.

Why Tracing, not Tracer?

The gerund form (Tracing) names the subsystem, not the agent. This establishes a naming convention for observability facades: Tracing for performance telemetry, with a future Logging class planned to absorb the logging configuration methods currently on Beamtalk (logLevel:, enableDebug:, debugTargets, etc.). The pair Logging / Tracing reads as two sibling observability subsystems, each owning its full surface. Logger remains the emit interface (Logger info:, Logger error:), consistent with ADR 0064's emit/config separation.

This gives us optionality — the Logging migration is a separate future ADR, not a prerequisite for this work. But naming Tracing now signals the direction and avoids a rename later.

Architecture: telemetry event bus + lock-free storage

Use the Erlang telemetry library as the event bus, with lock-free storage for aggregation and direct ETS writes for trace capture. The beamtalk_trace_store gen_server owns the tables and handles queries/lifecycle but is not in the write path — no serialization bottleneck on the hot path.

Beamtalk API (Tracing class — stdlib/src/Tracing.bt)
    ↓ delegates to
Erlang shim (beamtalk_tracing.erl — runtime)
    ↓ queries
Storage:
  counters (lock-free aggregates, ~50ns writes)
  ETS index ({Pid, Selector} → counter ref + slot)
  ETS ordered_set (trace events, direct insert)
    ↑ all writes happen in the CALLING process (no gen_server in write path)
beamtalk_trace_store gen_server:
  - owns tables, handles enable/disable/clear/queries
  - periodic sweep for ring buffer eviction
  - NOT in the write path
    ↑ events emitted by
Instrumented send wrappers (beamtalk_actor.erl)
    ↑ using
telemetry:execute/3 (standard BEAM event bus)
    ↑ VM measurements via
telemetry_poller (periodic VM stats for systemHealth)

Design principle: no bottlenecks on the hot path. Aggregate updates use the counters module (truly lock-free, ~50ns). Trace event inserts go directly to ETS from the calling process. The gen_server only handles control operations (enable/disable/clear) and periodic maintenance (ring buffer eviction). With ETS {heir, SupervisorPid, HeirData} ownership transfer and counter refs in persistent_term, a gen_server crash does not lose accumulated data — only a full VM crash (SIGKILL, OOM) does. The gen_server is not in the write path, so its crash does not affect actor dispatch.

Why telemetry + telemetry_poller?

Why not OpenTelemetry directly?

Why not telemetry_metrics?

Telemetry Events

Beamtalk emits telemetry events from the actor send wrappers in beamtalk_actor.erl:

%% Event: [beamtalk, actor, dispatch, start]
%% Measurements: #{system_time => erlang:system_time()}
%% Metadata: #{pid => Pid, class => Class, selector => Selector, mode => sync | async | cast}

%% Event: [beamtalk, actor, dispatch, stop]
%% Measurements: #{duration => NativeTime}
%% Metadata: #{pid => Pid, class => Class, selector => Selector, mode => sync | async | cast, outcome => ok | error | timeout}

%% Event: [beamtalk, actor, dispatch, exception]
%% Measurements: #{duration => NativeTime}
%% Metadata: #{pid => Pid, class => Class, selector => Selector, mode => sync | async | cast, kind => Kind, reason => Reason, stacktrace => Stack}

These follow the telemetry:span/3 naming convention ([prefix, ..., start | stop | exception]), making them compatible with standard BEAM tooling.

Instrumentation Point: Send Wrappers

Instrument sync_send/3, async_send/4, and cast_send/3 in beamtalk_actor.erl — the three entry points for all actor message sends:

sync_send(ActorPid, Selector, Args) ->
    Class = lookup_class(ActorPid),
    Metadata = #{pid => ActorPid, class => Class, selector => Selector, mode => sync},
    telemetry:span([beamtalk, actor, dispatch], Metadata, fun() ->
        Result = do_sync_send(ActorPid, Selector, Args),
        {Result, Metadata#{outcome => ok}}
    end).

telemetry:span/3 automatically emits start, stop, and exception events with timing. The stop event includes duration in native time units.

Class name resolution: The actor's class name is not accessible from the caller's process (it lives in the gen_server's state map). The beamtalk_object_instances ETS table already maintains a {Class, Pid} registry populated at actor spawn time — a reverse lookup (ets:match on the bag table by Pid) provides the class name. This is a single ETS read (~50-100ns), acceptable within the overhead budget. If the Pid is not found (e.g., Erlang process, not a Beamtalk actor), the class defaults to unknown.

What the timing measures: Send-wrapper instrumentation captures different things depending on the send mode:

To distinguish queueing delay from execution time on sync calls, use Tracing healthFor: which reports message queue depth. A high round-trip time with a deep queue indicates queueing; a high round-trip with an empty queue indicates slow method execution.

telemetry:span/3 and error handling: The existing sync_send/3 wraps gen_server:call in a try/catch that converts exit:{timeout, _} and exit:{noproc, _} to structured #beamtalk_error{} records. telemetry:span/3 catches exceptions to emit the exception event and reraises via erlang:raise/3, preserving the original exception class and reason. Phase 0 (validation spike) will verify this composition works correctly before committing to the architecture.

Lifecycle methods (isAlive, stop, pid, monitor, onExit:, demonitor) are NOT instrumented — they are handled locally in the send wrappers without entering the gen_server message loop.

Codegen-level instrumentation (wrapping safe_dispatch in generated code) is deferred as a future enhancement. Send-wrapper instrumentation captures the caller-perspective round-trip time, which is the actionable metric for bottleneck detection. Codegen-level instrumentation would add precise method-body timing if needed later.

Storage: Lock-Free Aggregates + ETS Trace Buffer

beamtalk_trace_store.erl — a gen_server that owns tables/counters and handles queries/lifecycle, but is not in the write path.

Aggregates: counters module + ETS index (always-on)

%% ETS index table: beamtalk_agg_index (set)
%% Key: {Pid, Selector}
%% Value: {Key, CounterRef, SlotBase}
%%   — SlotBase is the base offset into the counters array
%%   — Each key uses 6 slots: Calls, OkCount, ErrorCount, TimeoutCount, TotalDuration, reserved

%% counters reference: created by counters:new(InitialSize, [write_concurrency])
%%   — Truly lock-free writes via counters:add/3
%%   — ~50ns per increment (vs ~100-200ns for ets:update_counter)

Trace events: ETS ordered_set (opt-in, direct insert)

%% Table: beamtalk_trace_events (ordered_set, public, {write_concurrency, true})
%% Key: {erlang:monotonic_time(nanosecond), erlang:unique_integer([monotonic])}
%%   — composite key ensures uniqueness across concurrent schedulers
%% Row: {Key, Pid, Class, Selector, Mode, Duration_ns, Outcome, Metadata}
%% Outcome: ok | error | timeout | cast
%% Metadata: #{error => Term} for failures, #{} otherwise

beamtalk_trace_store gen_server responsibilities:

The gen_server is the owner and lifecycle manager, NOT the write path:

ResponsibilityHot path?Notes
Own ETS tables + counter refsNoCreated in init/1
enable / disable / clearNoControl operations, infrequent
Periodic ring buffer evictionNoTimer-based sweep, async
Query: traces, stats, slowMethods:NoRead-only, called from REPL/MCP
Crash recoveryNoETS tables survive via heir option (ownership transfers to supervisor); counters refs stored in persistent_term survive. Gen_server restart inherits existing data.

Overhead budget:

OperationCostWhen
telemetry handler dispatch~10-50nsAlways
counters:add/3 (aggregate increment)~50nsAlways
persistent_term check (trace enabled?)~5nsAlways
Class lookup (beamtalk_object_instances)~50-100nsAlways
ETS insert (trace event)~300-500nsOnly when tracing enabled
Total per-call~150ns (tracing off) / ~500ns (tracing on)

Beamtalk API: Tracing Class

sealed Object subclass: Tracing

  // === Tracing lifecycle ===

  class sealed enable -> Nil
  // Start capturing trace events (aggregates are always on)
  // Tracing enable

  class sealed disable -> Nil
  // Stop capturing trace events
  // Tracing disable

  class sealed isEnabled -> Boolean
  // Check whether trace capture is active
  // Tracing isEnabled   // => true

  class sealed clear -> Nil
  // Clear all trace events and aggregate stats
  // Tracing clear

  // === Trace event queries ===

  class sealed traces -> List
  // All captured trace events (up to buffer limit), newest first
  // Tracing traces

  class sealed tracesFor: actor :: Object -> List
  // Trace events for a specific actor
  // Tracing tracesFor: myCounter

  class sealed tracesFor: actor :: Object selector: sel :: Symbol -> List
  // Trace events for a specific actor + method
  // Tracing tracesFor: myCounter selector: #increment

  // === Aggregate stats ===

  class sealed stats -> Dictionary
  // Per-actor, per-method aggregate stats (always available, even without tracing)
  // Tracing stats

  class sealed statsFor: actor :: Object -> Dictionary
  // Aggregate stats for a specific actor
  // Tracing statsFor: myCounter

  // === Analysis (computed from aggregates) ===

  class sealed slowMethods: limit :: Integer -> List
  // Top N methods by average duration (descending)
  // Tracing slowMethods: 10

  class sealed hotMethods: limit :: Integer -> List
  // Top N methods by call count (descending)
  // Tracing hotMethods: 10

  class sealed errorMethods: limit :: Integer -> List
  // Top N methods by error + timeout rate (descending)
  // Tracing errorMethods: 5

  class sealed bottlenecks: limit :: Integer -> List
  // Top N actors by message queue length — live snapshot via erlang:process_info/2
  // Tracing bottlenecks: 5

  // === Live process health ===

  class sealed healthFor: actor :: Object -> Dictionary
  // Live process info: queue depth, memory, reductions, status, uptime
  // Tracing healthFor: myCounter

  class sealed systemHealth -> Dictionary
  // VM overview: scheduler count, memory breakdown, process count, run queue lengths
  // Tracing systemHealth

  // === Configuration ===

  class sealed maxEvents -> Integer
  // Current ring buffer capacity
  // Tracing maxEvents   // => 100000

  class sealed maxEvents: size :: Integer -> Nil
  // Set ring buffer capacity (clears existing events)
  // Tracing maxEvents: 50000

REPL Session

// Always-on aggregates — no setup needed
c := Counter spawn.
1 to: 100 do: [:i | c increment].
Tracing stats
// => {"<0.456.0>" => {"class" => "Counter", "methods" => {"increment" => {"calls" => 100, "ok" => 100, "errors" => 0, "avg_us" => 38, "min_us" => 12, "max_us" => 523}}}}

// Enable trace capture for detailed events
Tracing enable.
c increment.
c increment.
c increment.
Tracing traces
// => [{"timestamp_us" => 1234567890, "actor" => "<0.456.0>", "class" => "Counter", "selector" => "increment", "duration_us" => 42, "outcome" => "ok"}, ...]

// Find bottlenecks
Tracing slowMethods: 5
// => [{"class" => "DatabasePool", "selector" => "query:", "avg_us" => 15023, "calls" => 47}, ...]

Tracing bottlenecks: 3
// => [{"actor" => "<0.789.0>", "class" => "WorkQueue", "queue_depth" => 142, "memory_kb" => 256}, ...]

// Live health check
Tracing healthFor: c
// => {"pid" => "<0.456.0>", "class" => "Counter", "status" => "waiting", "queue_depth" => 0, "memory_kb" => 4, "reductions" => 1523, "uptime_ms" => 45000}

// Clean up
Tracing disable.
Tracing clear.

Error Behavior

// Querying traces when not enabled — returns empty list (not an error)
Tracing traces   // => []

// Stats for unknown actor — returns empty dictionary
Tracing statsFor: somethingInvalid   // => {}

// Health for dead actor
Tracing healthFor: deadActor
// => {"pid" => "<0.456.0>", "class" => "Counter", "status" => "dead", "error" => "process not alive"}

MCP Tools: Lean Surface + evaluate

The existing MCP evaluate tool can execute any Beamtalk expression, including Tracing methods. Rather than 10 dedicated MCP tools that are thin wrappers around evaluate("Tracing ..."), we start with a minimal dedicated tool surface and let agents discover the rest via evaluate + Beamtalk help: Tracing:

Dedicated MCP tools (v1):

MCP ToolBeamtalk APIWhy dedicated?
enable-tracingTracing enableCommon workflow entry point, high discoverability value
get-tracesTracing traces / tracesFor:Structured JSON output with filtering params (actor, selector, limit)
actor-statsTracing stats / statsFor:Structured JSON output, the main query after test runs

Everything else via evaluate:

evaluate("Tracing slowMethods: 10")
evaluate("Tracing bottlenecks: 5")
evaluate("Tracing healthFor: myCounter")
evaluate("Tracing systemHealth")
evaluate("Tracing disable")
evaluate("Tracing clear")
evaluate("Beamtalk help: Tracing")   // agent discovers available methods

This approach reduces implementation effort (3 tools vs 10), avoids API surface bloat in the MCP layer, and tests whether agents can discover the full Tracing API organically. If monitoring shows agents struggle to find specific methods, we can promote them to dedicated tools in a follow-up — additive, not a redesign.

Agent Workflow

1. agent calls enable-tracing          → dedicated tool
2. agent calls test (existing tool)    → runs BUnit suite, actors spawn and die
3. agent calls get-traces              → dedicated tool (structured JSON)
4. agent calls actor-stats             → dedicated tool (structured JSON)
5. agent calls evaluate("Tracing slowMethods: 10")   → via evaluate
6. agent calls evaluate("Tracing healthFor: ...")     → via evaluate
7. agent calls evaluate("Tracing disable")            → via evaluate
8. agent calls evaluate("Tracing clear")              → via evaluate

Cleanup and Lifecycle

Future: Block-Scoped Profiling

The Smalltalk MessageTally spyOn: [block] pattern is compelling for targeted profiling. A future enhancement could add:

Tracing profile: [
  c := Counter spawn.
  1 to: 1000 do: [:i | c increment].
]
// => {"traces" => [...], "stats" => {...}, "duration_us" => 45230}

This would automatically enable tracing, execute the block, disable tracing, and return the captured results — eliminating the enable/disable lifecycle for the common case. The global mode (Tracing enable) remains necessary for the MCP agent workflow (enable → run tests in separate tool call → query results).

Propagated Context Across Actor Boundaries

Actor messages carry a propagated context map as a third element: {Selector, Args, PropagatedCtx}. This context flows automatically through every sync_send, async_send, and cast_send — the user never sees it.

%% In beamtalk_actor.erl — send side
get_propagated_ctx() ->
    #{otel => get_otel_ctx()}.   %% extensible map — future keys added here

get_otel_ctx() ->
    case erlang:function_exported(otel_ctx, get_current, 0) of
        true -> otel_ctx:get_current();
        false -> undefined
    end.

%% In generated handle_call — actor side
handle_call({Selector, Args, PropCtx}, _From, State) ->
    restore_propagated_ctx(PropCtx),
    dispatch(Selector, Args, State).

restore_propagated_ctx(#{otel := Ctx}) when Ctx =/= undefined ->
    otel_ctx:attach(Ctx);   %% only called if otel_ctx module is loaded
restore_propagated_ctx(_) -> ok.

No compile-time dependency on OpenTelemetry. erlang:function_exported/3 is cached by the VM (~10ns). When OTel isn't loaded, the context map contains #{otel => undefined} — restoration is a no-op pattern match.

Cost:

OperationCostWhen
function_exported check~10nsAlways
Context map construction~10nsAlways
Extra bytes in message copy~25 bytesAlways
otel_ctx:attach/1~20nsOnly when OTel is loaded
Total~20ns (no OTel) / ~40ns (with OTel)Negligible vs gen_server call (~5-10μs)

User experience: Add opentelemetry + opentelemetry_telemetry to your project's deps → correlated parent/child traces across actor calls work immediately. No Beamtalk upgrade, no recompile. The telemetry bridge auto-subscribes to our events, and the propagated context provides the parent/child span correlation.

Extensible context map — future uses:

The PropagatedCtx map is designed to carry more than OTel context. Future extensions (each a separate ADR):

KeyWhat it carriesUse case
otelOTel trace contextDistributed tracing (this ADR)
request_idCorrelation ID (string)Log correlation — Logger auto-includes the ID in all logs from downstream actor calls, without the user threading it manually
deadlineMonotonic time of expiryTimeout propagation — like Go's context.WithTimeout. If A→B has a 5s timeout and B takes 3s, B→C automatically gets 2s
causalityLogical clock / vector clockDebugging message ordering in concurrent actor systems

The map starts with just otel in v1. Adding keys is additive — no message format change, no codegen change, no recompile.

Why telemetry matters for this path: Without telemetry, exporting spans requires coupling beamtalk_actor.erl directly to the OTel SDK. With telemetry, the OTel bridge is a user-side dependency that auto-subscribes to our events — zero changes to Beamtalk. This is the primary justification for adopting telemetry now rather than deferring it.

Prior Art

Erlang/OTP — sys module

Every OTP behaviour supports sys:trace(Pid, true) for per-process debug output and sys:statistics(Pid, true) for basic counters (reductions, messages_in, messages_out). The output is unstructured text to stdout — useful for quick debugging but not queryable or aggregatable. The dbg module provides lower-level function tracing with pattern matching. Observer adds a GUI. recon (third-party) adds production-safe rate-limited tracing and process ranking by message queue length, reductions, or memory.

What we adopt: The concept of always-available aggregate stats (like sys:statistics) and opt-in detailed tracing (like sys:trace). What we improve: Structured output (queryable dictionaries, not stdout text), Beamtalk-level API (no need to know Erlang module names), and MCP exposure for agent consumption.

Elixir — telemetry ecosystem

Elixir's telemetry library is the de facto BEAM standard for instrumented observability. Libraries emit events via telemetry:execute/3; applications attach handlers. Phoenix LiveDashboard provides a web UI showing real-time metrics. GenServer itself does not auto-emit telemetry events — libraries add instrumentation explicitly.

What we adopt: The telemetry library as our event bus, following the same event naming conventions ([prefix, object, action, phase]). What we adapt: Beamtalk instruments at the runtime level (all actor sends emit events automatically) rather than requiring library authors to add instrumentation manually. Our Tracing class provides the query API that the telemetry ecosystem leaves to third-party dashboards.

Akka — Cinnamon/Insights

Akka Insights provides per-actor-class instrumentation via configuration (no code changes). Four key metrics: mailbox size, mailbox time (queue wait), processing time, and stash size. Configuration supports per-class, per-instance, and wildcard selection with threshold alerts.

What we adopt: The four-metric model (our aggregates track call count, duration, errors, timeouts — analogous). Always-on aggregates with opt-in detailed tracing matches Akka's approach. What we differ on: Akka's instrumentation is configuration-driven and commercial. Ours is API-driven (Tracing enable) and built-in.

Pharo Smalltalk — MessageTally

Pharo's MessageTally spyOn: [block] profiles a block expression, producing a hierarchical tree of methods with execution time percentages. Research by Bergel et al. showed that counting message sends (not sampling) produces perfectly stable profiles. Deeply integrated into the IDE — right-click to profile, visual hierarchy browser. VisualWorks' PerformanceProfiler follows a similar pattern as a standalone class-side singleton.

What we adopt: The standalone profiler class pattern — both Pharo (MessageTally) and VisualWorks (PerformanceProfiler) keep profiling as a dedicated tool, not methods on the Smalltalk global. This validates our Tracing class design. What we defer: The block-scoped profile: [block] invocation pattern is planned as a future enhancement (see Decision section). Our v1 uses global enable/disable because the MCP agent workflow requires tracing to span multiple tool calls.

Pony — Flight Recorder

Pony's runtime supports a flight recorder mode: a circular in-memory buffer that captures trace events continuously with low overhead and dumps to stderr on crash (SIGILL/SIGSEGV/SIGBUS). This provides post-mortem debugging data without the overhead of continuous export.

What we learn: Our ring buffer design is essentially a flight recorder. The Pony model of "always capturing, only materializing on demand" aligns with our approach of always-on aggregates + opt-in trace capture.

BEAM Telemetry Ecosystem — In-Memory Storage

The pattern of "telemetry handler → ETS store → query API" is standard on BEAM — every application that needs local metrics does it, typically in ~20-30 lines. No one packages it as a general-purpose library because the query API is always application-specific. Several projects demonstrate the pattern:

What we learn: The storage pattern is well-established on BEAM. What's specific to Beamtalk is the data model (actor-scoped traces + per-method aggregates), the query API (slowMethods:, bottlenecks:), and the Beamtalk class wrapping it for REPL + MCP consumption. peep's success with atomics/counters validates our lock-free aggregate design.

User Impact

Newcomer (from Python/JS/Ruby)

Smalltalk Developer

Erlang/BEAM Developer

Production Operator

Steelman Analysis

Alternative A: Custom ETS-Only (no telemetry dependency)

CohortBest argument
Newcomer"Zero dependencies means nothing can break. Tracing enable just works — no library version to worry about."
Smalltalk purist"ETS and counters are already part of OTP — the runtime we ship with. Adding telemetry means core observability depends on a third-party library's release cycle. OTP primitives won't break on a hex version bump."
BEAM veteran"You've already built the hard part — context propagation across gen_server boundaries. That's the real work. The telemetry event bus is just a dispatch layer I could replace with a direct function call. If a user wants OTel export, they can write a 20-line handler on your ETS tables. telemetry is a convenience, not a necessity."
Operator"Fewer moving parts in production. I can inspect the ETS tables directly with ets:tab2list/1 — no abstraction hiding the data."
Language designer"The persistent_term toggle pattern is simpler and faster than telemetry's handler dispatch. We're optimizing for one specific use case, not building a general-purpose event bus."

Alternative B: OpenTelemetry-native

CohortBest argument
Newcomer"OpenTelemetry is the industry standard I learned in my distributed systems class. I can export Beamtalk traces to Jaeger and see them in the same dashboard as my Go microservices."
Smalltalk purist"If you're going to add an external dependency anyway, pick the one that gives you distributed tracing. telemetry is a half-measure — a BEAM-only event bus that still needs a custom store. OTel is the complete package."
BEAM veteran"If we're going to add observability, do it once with the standard. telemetry is BEAM-only — OpenTelemetry gives us cross-service distributed tracing from day one."
Operator"I already have OTel collectors running in production. One more service that speaks OTLP is zero ops effort. A custom ETS store is one more bespoke thing I have to learn and monitor."
Language designer"OTel's span model gives you parent/child tracing across actor calls for free — if Counter.increment calls DatabasePool.query:, the span hierarchy shows the causal chain. telemetry events are flat; you'd have to build correlation yourself."

Alternative C: Tracing methods on Beamtalk (ADR 0064 pattern)

CohortBest argument
Newcomer"One place for system tools. I already know Beamtalk logLevel: and Beamtalk enableDebug:Beamtalk enableTracing is the obvious next thing I'd try."
Smalltalk purist"Smalltalk's Smalltalk global is the ONE place you go for system tools. That's the beauty — one entry point, everything discoverable via help:. Splitting across classes means I have to know Tracing exists."
BEAM veteran"DDD consistency. ADR 0064 said runtime configuration goes on Beamtalk. Tracing enable/disable is runtime configuration. Splitting it to a new class contradicts the precedent."
Operator"One object to query from MCP. Beamtalk is already exposed — I don't need to discover that Tracing exists as a separate entry point."
Language designer"Fewer top-level names in the namespace. Beamtalk is already known; Tracing is a new name to discover. Discoverability is better when tools are co-located."

Decided: telemetry + Tracing class (hybrid)

CohortBest argument
Newcomer"The Beamtalk API is all I see — Tracing enable, Tracing stats. The telemetry library is invisible to me."
Smalltalk purist"MessageTally is a standalone class in Pharo; Tracing is a standalone class here. This is the Smalltalk way — profiling tools are their own thing, not bolted onto the system dictionary."
BEAM veteran"telemetry events mean I can telemetry:attach/4 my own handlers from the Erlang shell — export to Prometheus, StatsD, whatever. Context propagation across actor boundaries means opentelemetry_telemetry gives me correlated distributed traces just by adding the dep. No other BEAM framework does this out of the box."
Operator"Best of both worlds: local ETS queries for REPL debugging, composable handlers for production export, and an OTel upgrade path."
Language designer"The Tracing / Logging pair is a cleaner taxonomy than everything on Beamtalk. Each subsystem owns its full surface. This scales — a future Metrics class has an obvious home."

Tension Points

Alternatives Considered

Alternative: Custom ETS-Only (no telemetry library)

Direct ets:insert/2 and ets:update_counter/3 from instrumented send wrappers, with a persistent_term toggle for trace capture.

Rejected because: Loses composability. Users cannot attach their own handlers without modifying Beamtalk internals. No bridge to OpenTelemetry. The ~50ns additional overhead from telemetry dispatch (vs direct ETS write) is negligible, while the extensibility benefit is significant. We would be reimplementing the handler dispatch that telemetry already provides as a battle-tested, zero-dependency library.

Alternative: OpenTelemetry SDK

Use opentelemetry-erlang as the primary instrumentation layer with spans, metrics, and OTLP export.

Rejected because: 6-10+ transitive dependencies including a gRPC stack. ~10-50μs per span (10x the overhead of our budget). Designed for distributed export, not local REPL querying — fighting the abstraction for our primary use case. Erlang SDK documentation lags behind Go/Java/Python. Available as an add-on via the opentelemetry_telemetry bridge when distributed tracing is actually needed.

Alternative: sys module integration

Use OTP's built-in sys:trace/2 and sys:statistics/2 for per-actor tracing.

Rejected because: sys:trace produces unstructured text to stdout — not queryable, not aggregatable, not exposable via MCP. sys:statistics only tracks reductions, messages_in, messages_out — no per-method breakdown, no timing, no error classification. Would require significant wrapping to produce the structured output needed by the Tracing API and MCP tools.

Alternative: Tracing in codegen (safe_dispatch)

Instrument the generated safe_dispatch function in each actor's Core Erlang code, measuring method body execution time.

Rejected for now: Adds codegen complexity and requires recompilation to enable/disable. Send-wrapper instrumentation captures the caller-perspective round-trip time (including message queue wait), which is the more actionable metric for bottleneck detection. Message queue depth from Tracing healthFor: disambiguates queueing vs execution time. Codegen-level instrumentation remains a future enhancement if method-body timing is needed.

Alternative: Tracing control on Beamtalk (ADR 0064 DDD pattern)

Place enableTracing / disableTracing / tracingEnabled on Beamtalk, with only query methods on a separate class.

Rejected because: Adds 3 more methods to an already-large BeamtalkInterface (currently ~20 methods). More importantly, it splits the tracing API across two objects — users must know that control is on Beamtalk but queries are on Tracing. A standalone class with the full surface is simpler to discover and document. The ADR 0064 precedent was sound for logging config (~5 methods extending an existing introspection role), but tracing is a large enough domain (~15 methods) to justify its own namespace. The planned future Logging class will retroactively make both subsystems consistent as standalone facades.

Consequences

Positive

Negative

Neutral

Implementation

Phase 0: Validation Spike (S)

Affected components: Runtime (Erlang), rebar.config

Prove the core assumptions before committing to the full architecture:

  1. Add telemetry and telemetry_poller dependencies to runtime/rebar.config
  2. Write a minimal EUnit test that wraps a gen_server:call with telemetry:span/3 and verifies:
    • The stop event fires with correct duration
    • The exception event fires on timeout and preserves the exit:{timeout, _} shape (critical: existing sync_send/3 error handling depends on this)
    • An attached handler can write to ETS / increment counters from the calling process
  3. Verify beamtalk_object_instances reverse lookup (Pid → Class) works and measure overhead
  4. Microbenchmark: telemetry:span/3 + counters:add/3 under the actual scheduler count — validate the ~150ns budget
  5. Verify telemetry_poller can emit periodic VM stats (scheduler utilization, memory) as telemetry events

This spike should be a single branch with ~100 lines of test code. If the telemetry/try-catch composition breaks, fall back to direct ETS writes (the Custom ETS-Only alternative).

Phase 1: Core Infrastructure (M)

Affected components: Runtime (Erlang)

  1. Create beamtalk_trace_store.erl — gen_server owning ETS tables + counters refs, telemetry handler callbacks, query API, periodic eviction sweep
  2. Add beamtalk_trace_store to beamtalk_runtime_sup supervision tree
  3. Configure telemetry_poller for periodic VM measurements (feeds Tracing systemHealth)
  4. EUnit tests for trace store: direct insert, query, ring buffer eviction sweep, counters-based aggregates, clear, composite key uniqueness, gen_server crash recovery

Phase 2: Actor Instrumentation + Context Propagation (M)

Affected components: Runtime (beamtalk_actor.erl), Codegen (callbacks.rs)

  1. Instrument sync_send/3, async_send/4, cast_send/3 with telemetry:span/3
  2. Wire aggregate updates (always on) and trace event capture (when enabled)
  3. Add propagated context: get_propagated_ctx() in send wrappers, restore_propagated_ctx/1 in runtime
  4. Update codegen to generate 3-tuple handle_call({Selector, Args, PropCtx}, ...) with context restoration
  5. EUnit tests: verify telemetry events, verify overhead budget, verify context propagation (with and without OTel module present)

Phase 3: Beamtalk API (M)

Affected components: Stdlib (Tracing.bt), Runtime (beamtalk_tracing.erl)

  1. Create beamtalk_tracing.erl — Erlang shim that the Tracing class delegates to
  2. Create Tracing.bt — sealed class-only facade
  3. BUnit tests for Tracing API: enable/disable, traces, stats, slowMethods, healthFor, systemHealth

Phase 4: MCP Tools (S)

Affected components: Runtime (beamtalk_repl_ops_perf.erl), Rust MCP server (server.rs)

  1. Create beamtalk_repl_ops_perf.erl — REPL op handlers for enable-tracing, get-traces, actor-stats
  2. Wire ops into beamtalk_repl_server.erl
  3. Add 3 MCP tools to server.rs (following existing pattern)
  4. Integration tests: MCP workflow (enable-tracing → test → get-traces → actor-stats)
  5. Monitor agent usage — promote additional methods to dedicated tools if discoverability is an issue

Phase 5: Documentation and Polish (S)

  1. Update docs/beamtalk-language-features.md with Tracing class documentation
  2. Add examples to the example corpus for MCP search
  3. Update ADR 0064 to cross-reference this ADR

Implementation Tracking

Epic: BT-1600 Issues:

Status: Planned Recommended start: BT-1601 (validation spike, no dependencies)

References