ADR 0062: Bundled Example Corpus and Search Tool for MCP Server

Status

Implemented (2026-03-15)

Context

The Problem

Agents working via the Beamtalk MCP server have no way to verify Beamtalk syntax before writing .bt code. The CLAUDE.md rule — "grep for 3+ existing codebase examples of the pattern" — only works when the agent has repo access. An external agent connected over MCP stdio has no repo, no test files, and no examples. This leads to hallucinated syntax and incorrect patterns.

The MCP server already exposes evaluate, complete, docs, lint, and other tools. An agent can evaluate code and get completions, but it cannot ask "show me how closures work in Beamtalk" or "what's the syntax for pattern matching?". The gap is between knowing-what (API docs via the docs tool) and knowing-how (working examples showing idiomatic patterns).

What We Already Have

Beamtalk has rich content that could serve as an example corpus:

Source	Count	Description
`stdlib/test/fixtures/*.bt`	~97 files	Complete class definitions — actors, value classes, NLR, coordination, error handling
`tests/repl-protocol/fixtures/*.bt`	~52 files	Real-world patterns — bank accounts, chat rooms, supervision, hot reload
`docs/learning/fixtures/*.bt`	~50 files	Pedagogical examples — typed classes, destructuring, FFI, supervision
`stdlib/test/*.bt`	~151 files	BUnit test cases exercising the fixtures
`docs/learning/*.md`	27 files	Structured learning modules with prose explanations
`examples/*/.bt`	6 projects	Getting-started, bank, chat-room, GoF patterns, OTP tree, SICP
`docs/beamtalk-language-features.md`	1 file	Curated language reference
`crates/beamtalk-core/src/unparse/fixtures/*.bt`	7 files	Parser round-trip fixtures showing valid syntax patterns
`stdlib/bootstrap-test/*.btscript`	~20 files	Bootstrap primitive tests (too low-level for agents)

The test fixtures are the most valuable source — they are complete, self-contained, tested class definitions that directly answer "how do I write X in Beamtalk?". For example, stdlib/test/fixtures/counter.bt is a 15-line actor with state, mutation, and accessors; tests/repl-protocol/fixtures/bank_account.bt shows error handling, early returns, and string interpolation in a realistic domain.

These are already maintained, tested, and current. The problem is making them accessible to agents that don't have the repo.

Prior Art: Production MCP Servers

Two production MCP servers already implement documentation-search patterns:

Context7 (Upstash) — indexes 9,000+ libraries into a searchable corpus. Two-tool pattern: resolve-library-id resolves a library name to an ID, then query-docs searches that library's docs. The indirection handles multi-library disambiguation but adds a round-trip. For a single-language server like ours, the indirection is unnecessary overhead.

Atlassian Forge MCP — bundles how-to guides and code snippets as named MCP tools (search-forge-docs, forge-ui-kit-developer-guide). Each guide is a curated, pre-chunked document. Works well for a fixed set of topics but doesn't scale to ad-hoc queries.

Constraints

The MCP server ships as a single binary via cargo-dist. Sidecar files add packaging complexity.
The corpus must work offline — no network calls at query time.
Agent context windows are limited. Results must be concise (bounded token count).
The corpus will drift from source files unless there's a regeneration mechanism.

Decision

Single `search_examples` Tool

Add one MCP tool to the beamtalk-mcp server:

search_examples(query: string, limit?: integer) -> CallToolResult

Parameters:

Parameter	Type	Required	Default	Description
`query`	string	yes	—	Natural-language or keyword query (e.g. "closures", "pattern matching actors")
`limit`	integer	no	5	Maximum number of results to return (capped at 20)

Returns: A list of matching examples, each containing:

{
  "id": "collections-array-do",
  "title": "Iterating an Array with do:",
  "category": "collections",
  "tags": ["Array", "do:", "closures", "iteration"],
  "source": "// Array iteration\nfruits := #('apple' 'banana' 'cherry').\nfruits do: [:fruit | Transcript show: fruit].",
  "explanation": "The do: message sends a block to each element. The block receives one argument — the current element."
}

This is a single-tool pattern (not Context7's two-tool pattern) because we have exactly one corpus for one language. No disambiguation step needed.

Corpus Format and Storage

The corpus is a JSON file checked into the repo at crates/beamtalk-examples/corpus.json and embedded at compile time via include_bytes!. It is deserialized once at startup using std::sync::LazyLock.

Why checked-in JSON (not build.rs, not bincode):

serde_json is already a dependency — no new crates
A checked-in file is reviewable in PRs and diffable
CI can verify freshness: just build-corpus && git diff --exit-code crates/beamtalk-examples/corpus.json
No build.rs latency on every cargo build
JSON is human-readable for manual inspection and editing

Corpus Entry Schema

#[derive(Debug, serde::Deserialize, serde::Serialize)]
pub struct CorpusEntry {
    /// Unique identifier (e.g. "closures-value-capture").
    pub id: String,
    /// Human-readable title.
    pub title: String,
    /// Top-level category for grouping.
    pub category: String,
    /// Searchable tags — class names, selector names, concepts.
    pub tags: Vec<String>,
    /// Beamtalk source code (the example itself).
    pub source: String,
    /// Brief explanation of what the example demonstrates.
    pub explanation: String,
}

Content Sources and Chunking Strategy

Not all source content is equally useful as a corpus entry. The strategy is curated extraction, not bulk ingestion. Test fixtures are the primary source — they are complete, self-contained class definitions that directly answer "how do I write X?".

Source	Strategy	Granularity	Value
`stdlib/test/fixtures/*.bt`	Include as whole-file examples	One entry per fixture file	High — complete class defs: actors, value classes, NLR, coordination, HOM
`tests/repl-protocol/fixtures/*.bt`	Include as whole-file examples	One entry per fixture file	High — real-world patterns: bank accounts, chat rooms, supervision, hot reload
`docs/learning/fixtures/*.bt`	Include as whole-file examples	One entry per fixture file	High — pedagogical, tied to learning modules, typed classes, destructuring, FFI
`stdlib/test/*.bt`	Extract individual test methods	One entry per test method (with class context)	Medium — shows how to use patterns, but test scaffolding adds noise
`examples/*/.bt`	Extract key patterns from each project	One entry per notable pattern	Medium — multi-file projects, harder to chunk into standalone entries
`docs/learning/*.md`	Extract fenced `.bt` code blocks only if they pass parser validation	One entry per validated code block with surrounding prose	Medium — explanatory context, but snippets may be illustrative/non-working; only include blocks that parse successfully via `beamtalk-core`
`docs/beamtalk-language-features.md`	Extract each feature section's examples	One entry per language feature	Medium — reference-style, terse
`crates/beamtalk-core/src/unparse/fixtures/*.bt`	Include selectively	One entry per file	Low — compiler internals, but shows valid syntax patterns
`stdlib/bootstrap-test/*.btscript`	Skip	—	Low — too low-level for agent consumption

Estimated corpus size: ~300-400 entries, ~1-1.5MB as JSON.

Search Mechanism

Keyword-based scoring with weighted fields. No external dependencies.

score = (title_matches * 10) + (tag_matches * 8) + (category_matches * 5)
      + (explanation_matches * 2) + (source_matches * 1)

The query is tokenized into keywords (split on whitespace, lowercased). Each keyword is matched against each field. Results are sorted by score descending and truncated to limit.

Why not semantic/vector search: The corpus is small (~300 entries) and domain-specific. Agents are prompted with CLAUDE.md which names Beamtalk constructs explicitly, so queries are keyword-rich by design. Adding a vector search model (even a small ONNX one) would add ~30MB to the binary and a new dependency category. If keyword search proves insufficient in practice, semantic search can be added later without changing the tool interface.

Corpus Generation

A just build-corpus task runs a binary crate that:

Walks the content source directories
Parses .bt files using beamtalk-core to extract class names, selectors, and method boundaries
Parses .md files to extract fenced .bt code blocks with surrounding context — only includes blocks that pass beamtalk-core parser validation (markdown snippets can be illustrative/non-working; the parser gate prevents false examples from entering the corpus)
Writes crates/beamtalk-examples/corpus.json

The generator lives at crates/beamtalk-examples/build-corpus/ as a small binary crate. It depends on beamtalk-core for parsing .bt files — this is the key reason the generator is a crate rather than a standalone script. A script cannot access workspace dependencies, so it would have to use fragile regex heuristics instead of the real parser for method boundary detection.

Deterministic output: The CI freshness check (git diff --exit-code) requires identical output across runs. The generator must use sorted file traversal (not OS-dependent directory order), stable entry ordering (sorted by source path, then byte offset), and sorted tag arrays within each entry. All maps serialized to JSON must use sorted keys. This prevents flaky CI failures from non-deterministic ordering.

Freshness check in CI: just ci runs just build-corpus and asserts no diff. This catches corpus drift when test files or examples change.

Crate Structure

A new beamtalk-examples library crate owns the corpus types, storage, and search logic. The corpus generator is a sub-binary crate within it.

crates/beamtalk-examples/
├── Cargo.toml           # lib crate — depends on serde, serde_json
├── corpus.json          # checked-in generated corpus
├── src/
│   ├── lib.rs           # re-exports
│   ├── corpus.rs        # CorpusEntry, Corpus, LazyLock deserialization
│   └── search.rs        # weighted keyword search scoring
└── build-corpus/
    ├── Cargo.toml       # binary crate — depends on beamtalk-examples + beamtalk-core
    └── src/
        └── main.rs      # corpus generator: parses .bt/.md files, writes corpus.json

Why a separate crate (not modules in beamtalk-mcp):

The corpus generator needs two dependencies: beamtalk-core (to parse .bt files) and the CorpusEntry type (to serialize the corpus). If CorpusEntry lives in beamtalk-mcp, the generator would depend on the entire MCP server — REPL client, MCP transport, etc. — just for one struct. A shared beamtalk-examples crate gives a clean dependency graph:

beamtalk-mcp        → beamtalk-examples  (search at runtime)
beamtalk-examples   → serde, serde_json  (no other deps)
build-corpus        → beamtalk-examples + beamtalk-core  (generate corpus)

No circular dependencies, no pulling MCP machinery into the generator. And if the LSP later wants example lookups, it depends on beamtalk-examples directly — no extraction refactor needed.

The MCP server's server.rs adds a thin search_examples tool handler that delegates to beamtalk_examples::search().

Tool Registration

The search_examples tool is registered alongside existing tools in the #[tool_router] impl block. Unlike evaluate or complete, it does not use the REPL client — it queries the embedded corpus directly. This means it works even when the REPL is disconnected.

#[derive(Debug, serde::Deserialize, schemars::JsonSchema)]
pub struct SearchExamplesParams {
    /// Search query — keywords or natural language describing what you're looking for.
    #[schemars(description = "Keywords or natural language query (e.g. 'closures', 'actor state', 'pattern matching')")]
    pub query: String,
    /// Maximum number of results (default 5, max 20).
    #[schemars(description = "Maximum results to return. Default 5, max 20.")]
    pub limit: Option<usize>,
}

Search Telemetry

The claim that keyword search with synonym tags is sufficient for ~300 entries needs to be falsifiable. Without observability, search quality degrades silently — we won't know that agents are searching for "for loop" and getting zero results until someone manually investigates.

What to log (structured, to the MCP server's log output):

Field	Purpose
`query_hash`	SHA-256 hash of the query (default; for counting unique queries without exposing content)
`query`	The raw query string (only at `DEBUG` level — opt-in, since queries may contain proprietary code snippets or PII from agent prompts)
`result_count`	Number of results returned
`top_score`	Score of the highest-ranked result (0 = no matches)
`duration_us`	Search latency in microseconds

This is local-only logging via tracing (already a dependency of beamtalk-mcp), not a phone-home service. The MCP server runs on the developer's machine; logs go to stderr or a log file. At the default INFO level, raw queries are never logged — only the hash and numeric metrics. Set RUST_LOG=beamtalk_mcp::search=debug to include raw queries for local eval sessions.

What this enables:

Zero-result queries reveal vocabulary gaps — queries that agents ask but the corpus can't answer. These directly inform synonym tag additions (Phase 3 curation).
Low-score queries reveal weak matches — the corpus has something relevant but the scoring didn't surface it well. These inform weight tuning.
Query distribution reveals which topics agents ask about most, guiding corpus expansion priorities.
Duration tracking establishes a baseline for deciding whether search performance matters (it probably doesn't at ~300 entries, but the data confirms it).

Eval workflow: Periodically review logs (or pipe to a script) to extract zero-result and low-score queries. Each one becomes either a synonym tag addition, a new corpus entry, or evidence that keyword search is hitting its ceiling and semantic search should be reconsidered.

A just search-eval task can parse structured logs and produce a summary report — top failing queries, score distribution, result count histogram.

Versioning and Staleness

The corpus is regenerated by just build-corpus and checked in
CI enforces that the checked-in corpus matches what the generator produces
The corpus carries no version field — it's as current as the last just build-corpus run
The embedded binary ships whatever corpus was checked in at build time

Prior Art

System	Corpus Source	Search	Bundled?	Tool Count
Context7	9,000+ libraries from npm/PyPI/docs sites	Keyword + embeddings	No (cloud service)	2 (`resolve-library-id`, `query-docs`)
Forge MCP	Curated Atlassian guides	Named guide lookup	Yes (in server)	Multiple named tools
Pharo	Class/method comments in the image	`Finder` tool, string/example search	Yes (in image)	N/A (not MCP)
Elixir HexDocs	Package documentation	Full-text search	No (web service)	N/A
This ADR	Test files, examples, learning docs	Weighted keyword scoring	Yes (`include_bytes!`)	1 (`search_examples`)

What we adopt:

Context7's corpus-and-search pattern — a searchable index of examples, not just API docs. Context7 proved that agents benefit enormously from example code alongside documentation.
Forge MCP's bundled approach — embedding content in the server binary for zero-config, offline operation. No cloud service, no API keys, no network dependency.

What we adapt:

Context7's two-tool pattern → single tool. Context7 needs resolve-library-id because it indexes 9,000+ libraries and must disambiguate. We have one language with one corpus. The indirection adds latency and complexity for no benefit.
Forge MCP's named guides → keyword search. Forge uses fixed tool names per guide (forge-ui-kit-developer-guide). This doesn't scale to ad-hoc queries. We use a single searchable tool with keyword matching instead.

What we reject:

Pharo's live Finder — searches the running image for examples by evaluating expressions. Requires a live system with all classes loaded. Our corpus must work offline, without a REPL connection.
Full semantic search — vector embeddings, ONNX models, etc. Overkill for ~300 entries in a single-language domain where queries are keyword-rich.

User Impact

Agent Developer (primary consumer)

Before: Agent must hallucinate Beamtalk syntax or ask the user for examples. No programmatic way to look up "how do I iterate an array?"
After: search_examples("array iteration") returns working, tested code. The agent can verify patterns before generating .bt files.
Caveat: Keyword search requires knowing approximate terminology. An agent asking "for loop" won't find do: unless tags include "loop" as a synonym. Tag quality is critical.

Language Developer (corpus maintainer)

New workflow: When adding tests or examples, run just build-corpus to regenerate the corpus. CI catches forgotten regeneration.
Caveat: Manual curation of tags and explanations adds maintenance burden. Mitigated by generating reasonable defaults from test method names and file paths.

Production Operator

Binary size increase: ~1MB for the embedded corpus. The beamtalk-mcp binary grows from ~20MB to ~21MB.
Memory: Corpus is deserialized once into ~1MB of heap. Negligible for a dev tool.
No runtime impact on existing tools: search_examples is independent of the REPL connection.

Newcomer

Discoverability: An agent using Beamtalk MCP can now self-serve examples. "Show me how closures work" returns working code, not a hallucination.

Steelman Analysis

Alternative: Separate file instead of embedded binary

Cohort	Strongest argument
Operator	"A 1-3MB corpus baked into the binary inflates every deploy, even when no agent ever calls `search_examples`. Ship it as `corpus.json` next to the binary — users who don't use MCP agents pay zero cost. Also lets users swap in a custom corpus without recompiling."
Language dev	"During corpus taxonomy development, I need dozens of regenerate→test→inspect cycles. Each cycle currently requires recompiling `beamtalk-examples` and relinking `beamtalk-mcp` — 30+ seconds of Rust compilation for a content change. A sidecar file means `just build-corpus && restart` with zero Rust compilation. That's the difference between a 2-second and a 30-second feedback loop during content curation."
Contributor	"If the corpus is a separate JSON file, I can edit it by hand to add a quick example without learning the generator, without building the Rust workspace, without even having Rust installed. Lower contribution barrier for an open-source project."

Counter: The operator argument is weaker than it appears: the corpus is embedded in beamtalk-examples, which is only a dependency of beamtalk-mcp — not the main beamtalk CLI or the runtime. Every user running beamtalk-mcp is, by definition, using MCP agents, so the "users who don't use agents pay zero cost" scenario doesn't apply. Beyond that, distribution complexity is a real cost: cargo-dist ships a single binary today. A sidecar file means packaging changes, install instructions, and a runtime "file not found" failure mode. The ~1MB cost is negligible for a dev tool binary that's already ~20MB.

Alternative: Live REPL search (no static corpus)

Cohort	Strongest argument
Smalltalk purist	"The system should be self-describing. If an agent wants examples, it should ask the running system — `ExamplesFinder search: 'blocks'`. This is live, always current, and the corpus is exactly what's loaded. You're building a dead snapshot of content that drifts from the real system. The freshness problem you're solving with CI checks simply doesn't exist with a live approach."
Pragmatist	"The MCP server already has a running REPL connection. Once Beamtalk has modules, you emit an `ExamplesFinder` module as part of the dev-mode stdlib — opt-in for development, stripped from production builds. No Rust crate, no JSON file, no build step. The module hot-reloads with the stdlib, so the corpus is always current. The bloat argument doesn't apply — production binaries never see it."
BEAM developer	"On the BEAM, introspection is a first-class capability — `module_info`, `beam_lib`, even decompilation. Building a static JSON index of what the runtime already knows is working against the platform. An Erlang dev would build a `gen_server` that walks loaded modules and exposes a search function over the running system. You're solving a BEAM problem with a non-BEAM tool."

Counter: The dev-only module approach is viable and addresses the bloat concern. The remaining arguments against it are operational:

REPL coupling. A REPL-based approach ties search availability to REPL connectivity. The static corpus works during startup, reconnection, and error states — exactly when an agent most needs to look up examples (e.g., diagnosing why a REPL connection failed). The MCP server's search_examples tool works even with "repl_connected": false.
Search evolution. Keyword search in Erlang is straightforward, but if we later want synonym expansion, fuzzy matching, or (eventually) semantic search, maintaining that in Erlang means a parallel implementation. The Rust-side approach keeps search logic in one place, co-located with the MCP server that consumes it.
Content isolation. The corpus is ~300 entries drawn from test fixtures, e2e cases, and learning docs. A stdlib class or runtime module would load all that tutorial content into the BEAM VM for every REPL session — dead weight for users who never search. The Rust-side approach keeps the corpus in the beamtalk-mcp binary only, where the sole consumer (MCP agents) lives. The runtime stays lean.

Of these, (1) and (3) are the strongest. The live approach is compelling philosophically but forces a choice between always-loaded bloat and conditional loading complexity, while the Rust-side corpus avoids both.

Alternative: Semantic/vector search

Cohort Strongest argument

AI tooling advocate "Agents don't always use the right keywords. 'How do I loop over elements' should find do: and collect:, but keyword search requires the agent to already know those names. Semantic search closes that vocabulary gap. The corpus is small enough that a lightweight model (e.g., ONNX MiniLM) adds ~30MB, not 200MB. You're building a search tool for AI agents — the one user who would most benefit from semantic understanding — and giving them grep."

Cohort	Strongest argument
AI tooling advocate	"Agents don't always use the right keywords. 'How do I loop over elements' should find `do:` and `collect:`, but keyword search requires the agent to already know those names. Semantic search closes that vocabulary gap. The corpus is small enough that a lightweight model (e.g., ONNX MiniLM) adds ~30MB, not 200MB. You're building a search tool for AI agents — the one user who would most benefit from semantic understanding — and giving them grep."

Counter: The vocabulary gap is real but addressable without a model. Synonym tags (e.g., "loop" → do:, "lambda" → blocks, "for each" → do:) close the most common mismatches at negligible cost. The corpus is ~300 entries — small enough that well-chosen tags cover the search space. Adding an ONNX model introduces a new dependency category (native ML runtime) and complicates cross-compilation. Pure-Rust alternatives exist — Tantivy for BM25/stemming (~2MB), candle for embeddings (~22MB) — both cross-platform with zero native deps. Crucially, structured telemetry on every search call (query_hash, result_count, top_score) makes this a data-driven decision: if zero-result queries exceed ~15% or cluster around vocabulary mismatches that tags can't cover, that's concrete evidence to upgrade the search backend. (Raw query is DEBUG-only opt-in.) The tool interface (search_examples(query, limit)) is deliberately stable — the backend can be upgraded from keyword→Tantivy→candle without changing agent integration. See Phase 5 for the tiered upgrade path.

Alternative: Context7-style two-tool pattern

Cohort	Strongest argument
MCP protocol purist	"A `list_categories` or `browse_examples(category)` tool lets agents discover the corpus structure before searching. Without it, the agent is shooting blind — it doesn't know whether to search 'actors', 'processes', 'concurrency', or 'message passing' for the same concept. Context7's resolve step isn't just disambiguation — it's discoverability."
Agent developer	"Single-tool search has a cold start problem. The agent's first query is a guess. If it guesses wrong, it doesn't know why — was the query bad, or does the corpus not cover that topic? A `list_categories` tool gives the agent a table of contents on the first call. Every subsequent search is informed. This isn't about two tools — it's about giving agents a map before asking them to navigate."

Counter: The discoverability argument is real but doesn't justify a second tool. A single search_examples call with a broad query (e.g., "concurrency") already returns results with category fields that reveal the corpus structure. We could also add a categories field to the tool's schema description listing available categories. The two-tool pattern doubles the integration surface (agents must learn two tools, handle the round-trip, deal with empty resolve results) for a discoverability benefit achievable within one tool.

Alternative: LSP bundling instead of MCP

Cohort Strongest argument

IDE developer "The LSP already runs in every editor session. If the corpus lives there, VSCode completions and hover docs can show examples inline — not just MCP agents. One corpus, two consumers."

Newcomer "Docstrings show examples for the method you're already looking at. But the real problem is earlier: I don't know which class or method to use. When I type Array and see 20 completions, I need to know which collection method fits my use case. Hover docs on do: vs collect: vs inject:into: don't help if I don't know which one to hover on. Inline corpus search in the LSP — triggered by a command or completion prefix — bridges that gap."

Cohort	Strongest argument
IDE developer	"The LSP already runs in every editor session. If the corpus lives there, VSCode completions and hover docs can show examples inline — not just MCP agents. One corpus, two consumers."
Newcomer	"Docstrings show examples for the method you're already looking at. But the real problem is earlier: I don't know which class or method to use. When I type `Array` and see 20 completions, I need to know which collection method fits my use case. Hover docs on `do:` vs `collect:` vs `inject:into:` don't help if I don't know which one to hover on. Inline corpus search in the LSP — triggered by a command or completion prefix — bridges that gap."

Counter: The LSP runs in the user's workspace, not the Beamtalk repo — it has no access to stdlib test files or example source code to grep. However, the LSP already serves key examples via docstrings in hover documentation, which covers the primary "show me how to use this" use case for IDE users. The corpus is designed for MCP agents that lack any repo access; LSP users already have a richer interaction model (completions, hover, go-to-definition) that docstring examples integrate into naturally. Bundling the full corpus into the LSP adds binary bloat for marginal benefit over existing docstrings. That said, the beamtalk-examples crate architecture makes LSP integration trivial if desired later — just add the dependency.

Tension Points

Freshness vs simplicity: A checked-in JSON file is simple but can drift. A build.rs generator is always fresh but adds build complexity. CI freshness checks are the compromise.
Curation vs automation: Hand-curated tags and explanations produce better search results but don't scale. Auto-generated entries from test method names are lower quality but zero-maintenance. The recommendation is auto-generate with manual refinement.
Crate boundary: A separate beamtalk-examples crate adds workspace complexity but is justified by the generator's dependency needs (it needs beamtalk-core + CorpusEntry without pulling in MCP server deps).

Alternatives Considered

Alternative A: Context7-Style Two-Tool Pattern

Expose resolve_topic(query) -> topic_id and get_examples(topic_id, limit) -> examples as separate tools.

Rejected because: We have one language, one corpus. The indirection adds a mandatory round-trip that wastes agent turns. Context7 needs disambiguation across 9,000+ libraries; we don't. A single search_examples tool serves the same purpose with less friction.

Alternative B: Forge-Style Named Guide Tools

Expose each content category as a separate named tool: closures_guide, collections_guide, actors_guide, etc.

Rejected because: Fixed tool names don't support ad-hoc queries. An agent looking for "how to send a message to an actor" has to guess which guide name matches. Adding new topics requires adding new tools (code changes, not just corpus updates). Search is strictly more flexible.

Alternative C: `build.rs` Corpus Generation

Generate the corpus at compile time via build.rs in beamtalk-mcp, parsing test files and emitting a serialized corpus to OUT_DIR.

Rejected because: Runs on every cargo build, adding latency to every compile cycle. Creates a build-time dependency on beamtalk-core's parser (already a dependency, but using it in build.rs is a different compilation unit). The corpus doesn't change on every build — it changes when content source files change. A manual just build-corpus with CI freshness checks is more appropriate.

Alternative D: No Corpus — Enhance Existing `docs` Tool

Extend the existing docs MCP tool to return examples alongside API documentation, pulling from /// doc comments that contain code blocks.

Rejected because: /// doc comments are API-level documentation, not tutorial-style examples. They document what a method does, not how to use the language feature it exemplifies. The gap is "how do I use closures?" not "what does Array >> #do: return?". Different content, different tool.

Alternative E: Keep Everything in `beamtalk-mcp` (No Separate Crate)

Keep CorpusEntry, Corpus, and search logic as modules inside beamtalk-mcp rather than creating a separate beamtalk-examples crate. Only one consumer exists today.

Rejected because: The corpus generator binary needs both beamtalk-core (to parse .bt files) and the CorpusEntry type (to serialize the corpus). If CorpusEntry lives in beamtalk-mcp, the generator depends on the entire MCP server — REPL client, MCP transport, etc. — just for one struct. A separate beamtalk-examples crate with minimal dependencies (serde, serde_json) gives a clean dependency graph and avoids pulling MCP machinery into the generator. The separate crate is justified by the generator's needs today, not hypothetical future consumers.

Consequences

Positive

Agents can verify syntax before writing code — the primary goal. search_examples("closures") returns tested, working Beamtalk code.
Zero-config, offline operation — no API keys, no network, no sidecar files. The corpus ships inside the binary.
No new dependencies — uses serde_json (already present) for deserialization, no vector DB or embedding model.
Stable tool interface — the search_examples(query, limit) signature supports upgrading the search backend (keyword → semantic) without changing the agent integration.
Data-driven search evolution — structured telemetry on every search call makes keyword search quality falsifiable. Zero-result and low-score queries directly inform tag curation and provide evidence for or against upgrading to semantic search.
Content reuse — leverages existing test files and learning docs rather than creating new content.

Negative

Maintenance burden — the corpus must be regenerated when content sources change. CI freshness checks catch drift but just build-corpus is a manual step.
Keyword search limitations — vocabulary mismatch (e.g., "for loop" vs do:) requires careful tagging. Quality depends on tag curation.
Binary size increase — ~1MB added to the beamtalk-mcp binary.
Staleness risk — the shipped corpus is as current as the last just build-corpus run. Agents on older binary versions get older examples.

Neutral

No impact on existing MCP tools — search_examples is additive. All existing tools (evaluate, docs, complete, etc.) are unchanged.
No REPL dependency — unlike most MCP tools, search_examples works without a REPL connection.
Corpus format is internal — the JSON schema is a build artifact, not a public API. It can change freely between versions.

Implementation

Phase 1: `beamtalk-examples` Crate and Corpus Generator (S)

Affected components: New beamtalk-examples crate, build tooling

Create crates/beamtalk-examples/ library crate with CorpusEntry, Corpus, LazyLock deserialization, and search.rs
Create crates/beamtalk-examples/build-corpus/ binary crate (depends on beamtalk-examples + beamtalk-core)
Implement extractors for each content source:
- .bt fixture files (stdlib/test/fixtures/, tests/repl-protocol/fixtures/, docs/learning/fixtures/): include as whole-file entries with auto-generated tags from class names, selectors, and file path
- .bt test files (stdlib/test/*.bt): extract individual test methods with class context
- .bt example files (examples/**/*.bt): extract notable patterns
- .md learning modules: extract fenced .bt code blocks that pass parser validation, with surrounding prose (skip illustrative/non-working snippets)
- beamtalk-language-features.md: extract per-feature examples
Generate crates/beamtalk-examples/corpus.json
Add just build-corpus task
Add CI freshness check to just ci
Add unit tests for search scoring and edge cases (empty query, no matches, limit capping)

Phase 2: MCP Tool Integration (S)

Affected components: beamtalk-mcp

Add beamtalk-examples as a dependency of beamtalk-mcp
Add SearchExamplesParams and search_examples tool handler in server.rs — delegates to beamtalk_examples::search()
Add structured tracing telemetry — log query_hash, result_count, top_score, duration_us on every search call (raw query at DEBUG only)
Update server instructions text to mention the new tool

Phase 3: Corpus Curation (M)

Affected components: Content

Review auto-generated corpus entries for quality
Add synonym tags for common vocabulary mismatches (e.g., "loop" → do:, "lambda" → blocks)
Write explanations for entries that lack sufficient context
Validate that representative queries return useful results

Phase 4: Search Eval Tooling (S)

Affected components: Build tooling

Add just search-eval task that parses structured logs and produces a summary: zero-result queries, low-score queries, score distribution, query frequency
Use eval results to drive synonym tag additions and corpus expansion
Establish a baseline for search quality — if zero-result rate exceeds a threshold (e.g., >15% of queries), escalate to search backend upgrade (Phase 5)

Phase 5: Search Backend Upgrade (Future — post-0.1.0)

Affected components: beamtalk-examples internals only — no API changes

The search_examples tool interface is deliberately stable: (query, limit) → results. The search backend can be upgraded without changing the MCP tool contract, agent prompts, or corpus format. Three tiers, in order of increasing capability:

Tier 1: Tantivy + manual synonym tags (~2MB binary cost)

Replace the weighted keyword scorer with Tantivy, a pure-Rust full-text search engine. Adds BM25 scoring, stemming ("iterating" → "iterate"), and fuzzy matching. Pure Rust, no native dependencies, no cross-platform issues. Combined with the synonym tags from Phase 3, this closes the morphology gap without any ML. Lowest effort, highest confidence.

Tier 2: Tantivy + LLM-generated tag expansion (~2MB binary cost)

At build time (just build-corpus), use an LLM API to generate rich synonym/concept tags for each corpus entry — not just hand-curated synonyms but exhaustive expansions (e.g., "loop, iterate, for each, traverse, walk, map over, apply to each element" for do:). Store expanded tags in corpus.json. At runtime, Tantivy searches over these tags with stemming and fuzzy matching. This pushes the "smart" part to build time (where API calls are acceptable) and keeps runtime pure Rust with zero model overhead.

Tier 3: candle + MiniLM embeddings (~22MB binary cost)

Pre-compute embeddings for all corpus entries at build time and store vectors in corpus.json. At query time, embed the query using candle (Hugging Face's pure-Rust ML framework, no C++ deps, no ONNX runtime) with a MiniLM-L6 model (~22MB). Cosine similarity against stored vectors. Fully closes the vocabulary gap — "for loop" finds do: without any tag engineering. Cross-platform (compiles anywhere Rust compiles). Query embedding takes ~10-50ms on CPU.

Tier	Binary cost	Cross-platform	Vocab gap	Complexity
Current (keyword + manual tags)	0	Perfect	Partial	Low
Tier 1 (Tantivy + manual tags)	~2MB	Perfect	Better (stemming)	Low
Tier 2 (Tantivy + LLM tags)	~2MB	Perfect	Good	Low-medium
Tier 3 (candle + MiniLM)	~22MB	Perfect	Full	Medium

The decision of which tier to pursue should be data-driven, informed by Phase 4 eval results. If zero-result queries cluster around morphological mismatches ("iterating" vs "iterate"), Tier 1 suffices. If they cluster around vocabulary gaps ("for loop" → do:), Tier 2 is the sweet spot. If novel phrasings dominate, Tier 3 is warranted.

References

Related issues: BT-1347
Related ADRs: ADR 0033 (Runtime-embedded documentation — prior art for embedding content in the runtime)
Context7: https://github.com/upstash/context7
Forge MCP: https://developer.atlassian.com/platform/forge/forge-mcp/
MCP specification (tools): https://modelcontextprotocol.io/specification/2025-06-18/server/tools
Existing MCP server: crates/beamtalk-mcp/src/server.rs