ADR 0062: Bundled Example Corpus and Search Tool for MCP Server

Status

Accepted (2026-03-15)

Context

The Problem

Agents working via the Beamtalk MCP server have no way to verify Beamtalk syntax before writing .bt code. The CLAUDE.md rule — "grep for 3+ existing codebase examples of the pattern" — only works when the agent has repo access. An external agent connected over MCP stdio has no repo, no test files, and no examples. This leads to hallucinated syntax and incorrect patterns.

The MCP server already exposes evaluate, complete, docs, lint, and other tools. An agent can evaluate code and get completions, but it cannot ask "show me how closures work in Beamtalk" or "what's the syntax for pattern matching?". The gap is between knowing-what (API docs via the docs tool) and knowing-how (working examples showing idiomatic patterns).

What We Already Have

Beamtalk has rich content that could serve as an example corpus:

| Source | Count | Description |
| --- | --- | --- |
| stdlib/test/fixtures/*.bt | ~97 files | Complete class definitions — actors, value classes, NLR, coordination, error handling |
| tests/e2e/fixtures/*.bt | ~52 files | Real-world patterns — bank accounts, chat rooms, supervision, hot reload |
| docs/learning/fixtures/*.bt | ~50 files | Pedagogical examples — typed classes, destructuring, FFI, supervision |
| stdlib/test/*.bt | ~151 files | BUnit test cases exercising the fixtures |
| docs/learning/*.md | 27 files | Structured learning modules with prose explanations |
| examples/**/*.bt | 6 projects | Getting-started, bank, chat-room, GoF patterns, OTP tree, SICP |
| docs/beamtalk-language-features.md | 1 file | Curated language reference |
| crates/beamtalk-core/src/unparse/fixtures/*.bt | 7 files | Parser round-trip fixtures showing valid syntax patterns |
| stdlib/bootstrap-test/*.btscript | ~20 files | Bootstrap primitive tests (too low-level for agents) |

The test fixtures are the most valuable source — they are complete, self-contained, tested class definitions that directly answer "how do I write X in Beamtalk?". For example, stdlib/test/fixtures/counter.bt is a 15-line actor with state, mutation, and accessors; tests/e2e/fixtures/bank_account.bt shows error handling, early returns, and string interpolation in a realistic domain.

These are already maintained, tested, and current. The problem is making them accessible to agents that don't have the repo.

Prior Art: Production MCP Servers

Two production MCP servers already implement documentation-search patterns:

Context7 (Upstash) — indexes 9,000+ libraries into a searchable corpus. Two-tool pattern: resolve-library-id resolves a library name to an ID, then query-docs searches that library's docs. The indirection handles multi-library disambiguation but adds a round-trip. For a single-language server like ours, the indirection is unnecessary overhead.

Atlassian Forge MCP — bundles how-to guides and code snippets as named MCP tools (search-forge-docs, forge-ui-kit-developer-guide). Each guide is a curated, pre-chunked document. Works well for a fixed set of topics but doesn't scale to ad-hoc queries.

Constraints

Decision

Single search_examples Tool

Add one MCP tool to the beamtalk-mcp server:

search_examples(query: string, limit?: integer) -> CallToolResult

Parameters:

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| query | string | yes | — | Natural-language or keyword query (e.g. "closures", "pattern matching actors") |
| limit | integer | no | 5 | Maximum number of results to return (capped at 20) |

Returns: A list of matching examples, each containing:

```json
{
  "id": "collections-array-do",
  "title": "Iterating an Array with do:",
  "category": "collections",
  "tags": ["Array", "do:", "closures", "iteration"],
  "source": "// Array iteration\nfruits := #('apple' 'banana' 'cherry').\nfruits do: [:fruit | Transcript show: fruit].",
  "explanation": "The do: message sends a block to each element. The block receives one argument — the current element."
}
```

This is a single-tool pattern (not Context7's two-tool pattern) because we have exactly one corpus for one language. No disambiguation step needed.

Corpus Format and Storage

The corpus is a JSON file checked into the repo at crates/beamtalk-examples/corpus.json and embedded at compile time via include_bytes!. It is deserialized once, on first access, via std::sync::LazyLock.
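A minimal sketch of that embed-once, parse-once pattern. The real crate embeds corpus.json with include_bytes! and deserializes it with serde_json; the toy "id:title" line format below is an assumption made purely to keep the sketch free of non-std dependencies.

```rust
use std::sync::LazyLock;

// Stand-in for include_bytes!("corpus.json"). A toy "id:title" line format
// keeps this sketch stdlib-only; the real corpus is JSON parsed via serde_json.
static CORPUS_BYTES: &[u8] = b"counter:Counter actor\nbank-account:Bank account";

// Parsed exactly once, on first access; later accesses reuse the same Vec.
static CORPUS: LazyLock<Vec<(String, String)>> = LazyLock::new(|| {
    let text = std::str::from_utf8(CORPUS_BYTES).expect("corpus is UTF-8");
    text.lines()
        .filter_map(|line| line.split_once(':'))
        .map(|(id, title)| (id.to_string(), title.to_string()))
        .collect()
});

fn main() {
    // First access triggers the parse.
    println!("{} entries, first id = {}", CORPUS.len(), CORPUS[0].0);
}
```

Because the bytes are embedded at compile time, a corrupt corpus.json fails at the first search call rather than at build time; the CI freshness check described below is what keeps the file valid.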

Why checked-in JSON (not build.rs, not bincode):

Corpus Entry Schema

```rust
#[derive(Debug, serde::Deserialize, serde::Serialize)]
pub struct CorpusEntry {
    /// Unique identifier (e.g. "closures-value-capture").
    pub id: String,
    /// Human-readable title.
    pub title: String,
    /// Top-level category for grouping.
    pub category: String,
    /// Searchable tags — class names, selector names, concepts.
    pub tags: Vec<String>,
    /// Beamtalk source code (the example itself).
    pub source: String,
    /// Brief explanation of what the example demonstrates.
    pub explanation: String,
}
```

Content Sources and Chunking Strategy

Not all source content is equally useful as a corpus entry. The strategy is curated extraction, not bulk ingestion. Test fixtures are the primary source — they are complete, self-contained class definitions that directly answer "how do I write X?".

| Source | Strategy | Granularity | Value |
| --- | --- | --- | --- |
| stdlib/test/fixtures/*.bt | Include as whole-file examples | One entry per fixture file | High — complete class defs: actors, value classes, NLR, coordination, HOM |
| tests/e2e/fixtures/*.bt | Include as whole-file examples | One entry per fixture file | High — real-world patterns: bank accounts, chat rooms, supervision, hot reload |
| docs/learning/fixtures/*.bt | Include as whole-file examples | One entry per fixture file | High — pedagogical, tied to learning modules, typed classes, destructuring, FFI |
| stdlib/test/*.bt | Extract individual test methods | One entry per test method (with class context) | Medium — shows how to use patterns, but test scaffolding adds noise |
| examples/**/*.bt | Extract key patterns from each project | One entry per notable pattern | Medium — multi-file projects, harder to chunk into standalone entries |
| docs/learning/*.md | Extract fenced .bt code blocks only if they pass parser validation | One entry per validated code block with surrounding prose | Medium — explanatory context, but snippets may be illustrative/non-working; only include blocks that parse successfully via beamtalk-core |
| docs/beamtalk-language-features.md | Extract each feature section's examples | One entry per language feature | Medium — reference-style, terse |
| crates/beamtalk-core/src/unparse/fixtures/*.bt | Include selectively | One entry per file | Low — compiler internals, but shows valid syntax patterns |
| stdlib/bootstrap-test/*.btscript | Skip | — | Low — too low-level for agent consumption |

Estimated corpus size: ~300-400 entries, ~1-1.5MB as JSON.

Search Mechanism

Keyword-based scoring with weighted fields. No external dependencies.

score = (title_matches * 10) + (tag_matches * 8) + (category_matches * 5)
      + (explanation_matches * 2) + (source_matches * 1)

The query is tokenized into keywords (split on whitespace, lowercased). Each keyword is matched against each field. Results are sorted by score descending and truncated to limit.
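A sketch of this scorer in plain Rust. Entry is a trimmed stand-in for CorpusEntry, and hits/score are illustrative names, not the shipped API:

```rust
// Trimmed stand-in for CorpusEntry, holding only the searched fields.
struct Entry {
    title: String,
    category: String,
    tags: Vec<String>,
    explanation: String,
    source: String,
}

/// Number of query keywords that appear (case-insensitively) in `field`.
fn hits(field: &str, keywords: &[String]) -> u32 {
    let hay = field.to_lowercase();
    keywords.iter().filter(|k| hay.contains(k.as_str())).count() as u32
}

/// Weighted field score: title 10, tags 8, category 5, explanation 2, source 1.
fn score(entry: &Entry, query: &str) -> u32 {
    let keywords: Vec<String> =
        query.split_whitespace().map(str::to_lowercase).collect();
    hits(&entry.title, &keywords) * 10
        + entry.tags.iter().map(|t| hits(t, &keywords)).sum::<u32>() * 8
        + hits(&entry.category, &keywords) * 5
        + hits(&entry.explanation, &keywords) * 2
        + hits(&entry.source, &keywords)
}

fn main() {
    let entry = Entry {
        title: "Closures".to_string(),
        category: "blocks".to_string(),
        tags: vec!["closures".to_string(), "blocks".to_string()],
        explanation: "A block captures its lexical scope.".to_string(),
        source: "inc := [:n | n + 1].".to_string(),
    };
    // "closures" matches the title (10) and one tag (8): score 18.
    println!("score = {}", score(&entry, "closures"));
}
```

Ranking is then a sort of all entries by this score, descending, truncated to the requested limit.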

Why not semantic/vector search: The corpus is small (~300 entries) and domain-specific. Agents are prompted with CLAUDE.md which names Beamtalk constructs explicitly, so queries are keyword-rich by design. Adding a vector search model (even a small ONNX one) would add ~30MB to the binary and a new dependency category. If keyword search proves insufficient in practice, semantic search can be added later without changing the tool interface.

Corpus Generation

A just build-corpus task runs a binary crate that:

  1. Walks the content source directories
  2. Parses .bt files using beamtalk-core to extract class names, selectors, and method boundaries
  3. Parses .md files to extract fenced .bt code blocks with surrounding context — only includes blocks that pass beamtalk-core parser validation (markdown snippets can be illustrative/non-working; the parser gate prevents false examples from entering the corpus)
  4. Writes crates/beamtalk-examples/corpus.json

The generator lives at crates/beamtalk-examples/build-corpus/ as a small binary crate. It depends on beamtalk-core for parsing .bt files — this is the key reason the generator is a crate rather than a standalone script. A script cannot access workspace dependencies, so it would have to use fragile regex heuristics instead of the real parser for method boundary detection.
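Step 3's fence scanning needs only the standard library; here is a sketch, where extract_bt_blocks is a hypothetical name and the beamtalk-core validation gate on each extracted block is elided:

```rust
/// Pull the bodies of fenced ```bt code blocks out of a markdown document.
/// (Each extracted block would then be gated through the beamtalk-core
/// parser; that validation step is not shown here.)
fn extract_bt_blocks(md: &str) -> Vec<String> {
    let mut blocks = Vec::new();
    let mut in_block = false;
    let mut body: Vec<&str> = Vec::new();
    for line in md.lines() {
        let trimmed = line.trim_start();
        if in_block {
            if trimmed.starts_with("```") {
                // Closing fence: emit the collected block body.
                blocks.push(body.join("\n"));
                body.clear();
                in_block = false;
            } else {
                body.push(line);
            }
        } else if trimmed.starts_with("```bt") {
            in_block = true;
        }
    }
    blocks
}

fn main() {
    let md = "Intro prose.\n```bt\nx := 1.\n```\n```text\nnot beamtalk\n```\n";
    // Only the ```bt block is extracted; the ```text block is ignored.
    println!("{:?}", extract_bt_blocks(md));
}
```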

Deterministic output: The CI freshness check (git diff --exit-code) requires identical output across runs. The generator must use sorted file traversal (not OS-dependent directory order), stable entry ordering (sorted by source path, then byte offset), and sorted tag arrays within each entry. All maps serialized to JSON must use sorted keys. This prevents flaky CI failures from non-deterministic ordering.
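The traversal and tag rules can be sketched as follows (collect_bt_files and normalize_tags are hypothetical helper names, not the generator's real API):

```rust
use std::fs;
use std::io;
use std::path::PathBuf;

/// Collect .bt files under `dir` in a stable order, independent of the
/// OS directory iteration order.
fn collect_bt_files(dir: &str) -> io::Result<Vec<PathBuf>> {
    let mut paths: Vec<PathBuf> = fs::read_dir(dir)?
        .filter_map(|entry| entry.ok().map(|e| e.path()))
        .filter(|p| p.extension().is_some_and(|ext| ext == "bt"))
        .collect();
    paths.sort(); // byte-wise path order: identical on every platform
    Ok(paths)
}

/// Sort and de-duplicate an entry's tags so serialized output is stable.
fn normalize_tags(mut tags: Vec<String>) -> Vec<String> {
    tags.sort();
    tags.dedup();
    tags
}

fn main() {
    let tags = normalize_tags(vec!["do:".into(), "Array".into(), "do:".into()]);
    println!("{tags:?}");
}
```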

Freshness check in CI: just ci runs just build-corpus and asserts no diff. This catches corpus drift when test files or examples change.

Crate Structure

A new beamtalk-examples library crate owns the corpus types, storage, and search logic. The corpus generator is a sub-binary crate within it.

```text
crates/beamtalk-examples/
├── Cargo.toml           # lib crate — depends on serde, serde_json
├── corpus.json          # checked-in generated corpus
├── src/
│   ├── lib.rs           # re-exports
│   ├── corpus.rs        # CorpusEntry, Corpus, LazyLock deserialization
│   └── search.rs        # weighted keyword search scoring
└── build-corpus/
    ├── Cargo.toml       # binary crate — depends on beamtalk-examples + beamtalk-core
    └── src/
        └── main.rs      # corpus generator: parses .bt/.md files, writes corpus.json
```

Why a separate crate (not modules in beamtalk-mcp):

The corpus generator needs two dependencies: beamtalk-core (to parse .bt files) and the CorpusEntry type (to serialize the corpus). If CorpusEntry lives in beamtalk-mcp, the generator would depend on the entire MCP server — REPL client, MCP transport, etc. — just for one struct. A shared beamtalk-examples crate gives a clean dependency graph:

beamtalk-mcp        → beamtalk-examples  (search at runtime)
beamtalk-examples   → serde, serde_json  (no other deps)
build-corpus        → beamtalk-examples + beamtalk-core  (generate corpus)

No circular dependencies, no pulling MCP machinery into the generator. And if the LSP later wants example lookups, it depends on beamtalk-examples directly — no extraction refactor needed.

The MCP server's server.rs adds a thin search_examples tool handler that delegates to beamtalk_examples::search().

Tool Registration

The search_examples tool is registered alongside existing tools in the #[tool_router] impl block. Unlike evaluate or complete, it does not use the REPL client — it queries the embedded corpus directly. This means it works even when the REPL is disconnected.

```rust
#[derive(Debug, serde::Deserialize, schemars::JsonSchema)]
pub struct SearchExamplesParams {
    /// Search query — keywords or natural language describing what you're looking for.
    #[schemars(description = "Keywords or natural language query (e.g. 'closures', 'actor state', 'pattern matching')")]
    pub query: String,
    /// Maximum number of results (default 5, max 20).
    #[schemars(description = "Maximum results to return. Default 5, max 20.")]
    pub limit: Option<usize>,
}
```

Search Telemetry

The claim that keyword search with synonym tags is sufficient for ~300 entries needs to be falsifiable. Without observability, search quality degrades silently — we won't know that agents are searching for "for loop" and getting zero results until someone manually investigates.

What to log (structured, to the MCP server's log output):

| Field | Purpose |
| --- | --- |
| query_hash | SHA-256 hash of the query (default; for counting unique queries without exposing content) |
| query | The raw query string (only at DEBUG level — opt-in, since queries may contain proprietary code snippets or PII from agent prompts) |
| result_count | Number of results returned |
| top_score | Score of the highest-ranked result (0 = no matches) |
| duration_us | Search latency in microseconds |

This is local-only logging via tracing (already a dependency of beamtalk-mcp), not a phone-home service. The MCP server runs on the developer's machine; logs go to stderr or a log file. At the default INFO level, raw queries are never logged — only the hash and numeric metrics. Set RUST_LOG=beamtalk_mcp::search=debug to include raw queries for local eval sessions.
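A sketch of the per-call record. DefaultHasher stands in for the SHA-256 digest specified above, purely to keep the sketch dependency-free; the real server would hash with an actual SHA-256 implementation and emit the record as a structured tracing event.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::time::Instant;

/// One telemetry record per search_examples call.
#[derive(Debug)]
struct SearchTelemetry {
    query_hash: u64,
    result_count: usize,
    top_score: u32,
    duration_us: u128,
}

/// NOTE: stand-in for the SHA-256 digest the ADR specifies; DefaultHasher
/// is used here only so the sketch compiles with the standard library alone.
fn query_hash(query: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    query.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let started = Instant::now();
    // ... run the search; fake result scores stand in here ...
    let scores = [18u32, 9, 2];
    let record = SearchTelemetry {
        query_hash: query_hash("actor state"),
        result_count: scores.len(),
        top_score: scores.first().copied().unwrap_or(0),
        duration_us: started.elapsed().as_micros(),
    };
    // In the real server this becomes a `tracing` event at INFO level,
    // with the raw query added only at DEBUG.
    println!("{record:?}");
}
```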

What this enables:

Eval workflow: Periodically review logs (or pipe to a script) to extract zero-result and low-score queries. Each one becomes either a synonym tag addition, a new corpus entry, or evidence that keyword search is hitting its ceiling and semantic search should be reconsidered.

A just search-eval task can parse structured logs and produce a summary report — top failing queries, score distribution, result count histogram.

Versioning and Staleness

Prior Art

| System | Corpus Source | Search | Bundled? | Tool Count |
| --- | --- | --- | --- | --- |
| Context7 | 9,000+ libraries from npm/PyPI/docs sites | Keyword + embeddings | No (cloud service) | 2 (resolve-library-id, query-docs) |
| Forge MCP | Curated Atlassian guides | Named guide lookup | Yes (in server) | Multiple named tools |
| Pharo | Class/method comments in the image | Finder tool, string/example search | Yes (in image) | N/A (not MCP) |
| Elixir HexDocs | Package documentation | Full-text search | No (web service) | N/A |
| This ADR | Test files, examples, learning docs | Weighted keyword scoring | Yes (include_bytes!) | 1 (search_examples) |

What we adopt:

What we adapt:

What we reject:

User Impact

Agent Developer (primary consumer)

Language Developer (corpus maintainer)

Production Operator

Newcomer

Steelman Analysis

Alternative: Separate file instead of embedded binary

| Cohort | Strongest argument |
| --- | --- |
| Operator | "A 1-3MB corpus baked into the binary inflates every deploy, even when no agent ever calls search_examples. Ship it as corpus.json next to the binary — users who don't use MCP agents pay zero cost. Also lets users swap in a custom corpus without recompiling." |
| Language dev | "During corpus taxonomy development, I need dozens of regenerate→test→inspect cycles. Each cycle currently requires recompiling beamtalk-examples and relinking beamtalk-mcp — 30+ seconds of Rust compilation for a content change. A sidecar file means just build-corpus && restart with zero Rust compilation. That's the difference between a 2-second and a 30-second feedback loop during content curation." |
| Contributor | "If the corpus is a separate JSON file, I can edit it by hand to add a quick example without learning the generator, without building the Rust workspace, without even having Rust installed. Lower contribution barrier for an open-source project." |

Counter: The operator argument is weaker than it appears: the corpus is embedded in beamtalk-examples, which is only a dependency of beamtalk-mcp — not the main beamtalk CLI or the runtime. Every user running beamtalk-mcp is, by definition, using MCP agents, so the "users who don't use agents pay zero cost" scenario doesn't apply. Beyond that, distribution complexity is a real cost: cargo-dist ships a single binary today. A sidecar file means packaging changes, install instructions, and a runtime "file not found" failure mode. The ~1MB cost is negligible for a dev tool binary that's already ~20MB.

Alternative: Live REPL search (no static corpus)

| Cohort | Strongest argument |
| --- | --- |
| Smalltalk purist | "The system should be self-describing. If an agent wants examples, it should ask the running system — ExamplesFinder search: 'blocks'. This is live, always current, and the corpus is exactly what's loaded. You're building a dead snapshot of content that drifts from the real system. The freshness problem you're solving with CI checks simply doesn't exist with a live approach." |
| Pragmatist | "The MCP server already has a running REPL connection. Once Beamtalk has modules, you emit an ExamplesFinder module as part of the dev-mode stdlib — opt-in for development, stripped from production builds. No Rust crate, no JSON file, no build step. The module hot-reloads with the stdlib, so the corpus is always current. The bloat argument doesn't apply — production binaries never see it." |
| BEAM developer | "On the BEAM, introspection is a first-class capability — module_info, beam_lib, even decompilation. Building a static JSON index of what the runtime already knows is working against the platform. An Erlang dev would build a gen_server that walks loaded modules and exposes a search function over the running system. You're solving a BEAM problem with a non-BEAM tool." |

Counter: The dev-only module approach is viable and addresses the bloat concern. The remaining arguments against it are operational:

  1. REPL coupling. A REPL-based approach ties search availability to REPL connectivity. The static corpus works during startup, reconnection, and error states — exactly when an agent most needs to look up examples (e.g., diagnosing why a REPL connection failed). The MCP server's search_examples tool works even with "repl_connected": false.
  2. Search evolution. Keyword search in Erlang is straightforward, but if we later want synonym expansion, fuzzy matching, or (eventually) semantic search, maintaining that in Erlang means a parallel implementation. The Rust-side approach keeps search logic in one place, co-located with the MCP server that consumes it.
  3. Content isolation. The corpus is ~300 entries drawn from test fixtures, e2e cases, and learning docs. A stdlib class or runtime module would load all that tutorial content into the BEAM VM for every REPL session — dead weight for users who never search. The Rust-side approach keeps the corpus in the beamtalk-mcp binary only, where the sole consumer (MCP agents) lives. The runtime stays lean.

Of these, (1) and (3) are the strongest. The live approach is compelling philosophically but forces a choice between always-loaded bloat and conditional loading complexity, while the Rust-side corpus avoids both.

Alternative: Semantic/vector search

| Cohort | Strongest argument |
| --- | --- |
| AI tooling advocate | "Agents don't always use the right keywords. 'How do I loop over elements' should find do: and collect:, but keyword search requires the agent to already know those names. Semantic search closes that vocabulary gap. The corpus is small enough that a lightweight model (e.g., ONNX MiniLM) adds ~30MB, not 200MB. You're building a search tool for AI agents — the one user who would most benefit from semantic understanding — and giving them grep." |

Counter: The vocabulary gap is real but addressable without a model. Synonym tags (e.g., "loop" → do:, "lambda" → blocks, "for each" → do:) close the most common mismatches at negligible cost. The corpus is ~300 entries — small enough that well-chosen tags cover the search space. Adding an ONNX model introduces a new dependency category (native ML runtime) and complicates cross-compilation. Pure-Rust alternatives exist — Tantivy for BM25/stemming (~2MB), candle for embeddings (~22MB) — both cross-platform with zero native deps. Crucially, structured telemetry on every search call (query_hash, result_count, top_score) makes this a data-driven decision: if zero-result queries exceed ~15% or cluster around vocabulary mismatches that tags can't cover, that's concrete evidence to upgrade the search backend. (Raw query is DEBUG-only opt-in.) The tool interface (search_examples(query, limit)) is deliberately stable — the backend can be upgraded from keyword→Tantivy→candle without changing agent integration. See Phase 5 for the tiered upgrade path.

Alternative: Context7-style two-tool pattern

| Cohort | Strongest argument |
| --- | --- |
| MCP protocol purist | "A list_categories or browse_examples(category) tool lets agents discover the corpus structure before searching. Without it, the agent is shooting blind — it doesn't know whether to search 'actors', 'processes', 'concurrency', or 'message passing' for the same concept. Context7's resolve step isn't just disambiguation — it's discoverability." |
| Agent developer | "Single-tool search has a cold start problem. The agent's first query is a guess. If it guesses wrong, it doesn't know why — was the query bad, or does the corpus not cover that topic? A list_categories tool gives the agent a table of contents on the first call. Every subsequent search is informed. This isn't about two tools — it's about giving agents a map before asking them to navigate." |

Counter: The discoverability argument is real but doesn't justify a second tool. A single search_examples call with a broad query (e.g., "concurrency") already returns results with category fields that reveal the corpus structure. We could also add a categories field to the tool's schema description listing available categories. The two-tool pattern doubles the integration surface (agents must learn two tools, handle the round-trip, deal with empty resolve results) for a discoverability benefit achievable within one tool.

Alternative: LSP bundling instead of MCP

| Cohort | Strongest argument |
| --- | --- |
| IDE developer | "The LSP already runs in every editor session. If the corpus lives there, VSCode completions and hover docs can show examples inline — not just MCP agents. One corpus, two consumers." |
| Newcomer | "Docstrings show examples for the method you're already looking at. But the real problem is earlier: I don't know which class or method to use. When I type Array and see 20 completions, I need to know which collection method fits my use case. Hover docs on do: vs collect: vs inject:into: don't help if I don't know which one to hover on. Inline corpus search in the LSP — triggered by a command or completion prefix — bridges that gap." |

Counter: The LSP runs in the user's workspace, not the Beamtalk repo — it has no access to stdlib test files or example source code to grep. However, the LSP already serves key examples via docstrings in hover documentation, which covers the primary "show me how to use this" use case for IDE users. The corpus is designed for MCP agents that lack any repo access; LSP users already have a richer interaction model (completions, hover, go-to-definition) that docstring examples integrate into naturally. Bundling the full corpus into the LSP adds binary bloat for marginal benefit over existing docstrings. That said, the beamtalk-examples crate architecture makes LSP integration trivial if desired later — just add the dependency.

Tension Points

Alternatives Considered

Alternative A: Context7-Style Two-Tool Pattern

Expose resolve_topic(query) -> topic_id and get_examples(topic_id, limit) -> examples as separate tools.

Rejected because: We have one language, one corpus. The indirection adds a mandatory round-trip that wastes agent turns. Context7 needs disambiguation across 9,000+ libraries; we don't. A single search_examples tool serves the same purpose with less friction.

Alternative B: Forge-Style Named Guide Tools

Expose each content category as a separate named tool: closures_guide, collections_guide, actors_guide, etc.

Rejected because: Fixed tool names don't support ad-hoc queries. An agent looking for "how to send a message to an actor" has to guess which guide name matches. Adding new topics requires adding new tools (code changes, not just corpus updates). Search is strictly more flexible.

Alternative C: build.rs Corpus Generation

Generate the corpus at compile time via build.rs in beamtalk-mcp, parsing test files and emitting a serialized corpus to OUT_DIR.

Rejected because: Runs on every cargo build, adding latency to every compile cycle. Creates a build-time dependency on beamtalk-core's parser (already a dependency, but using it in build.rs is a different compilation unit). The corpus doesn't change on every build — it changes when content source files change. A manual just build-corpus with CI freshness checks is more appropriate.

Alternative D: No Corpus — Enhance Existing docs Tool

Extend the existing docs MCP tool to return examples alongside API documentation, pulling from /// doc comments that contain code blocks.

Rejected because: /// doc comments are API-level documentation, not tutorial-style examples. They document what a method does, not how to use the language feature it exemplifies. The gap is "how do I use closures?" not "what does Array >> #do: return?". Different content, different tool.

Alternative E: Keep Everything in beamtalk-mcp (No Separate Crate)

Keep CorpusEntry, Corpus, and search logic as modules inside beamtalk-mcp rather than creating a separate beamtalk-examples crate. Only one consumer exists today.

Rejected because: The corpus generator binary needs both beamtalk-core (to parse .bt files) and the CorpusEntry type (to serialize the corpus). If CorpusEntry lives in beamtalk-mcp, the generator depends on the entire MCP server — REPL client, MCP transport, etc. — just for one struct. A separate beamtalk-examples crate with minimal dependencies (serde, serde_json) gives a clean dependency graph and avoids pulling MCP machinery into the generator. The separate crate is justified by the generator's needs today, not hypothetical future consumers.

Consequences

Positive

Negative

Neutral

Implementation

Phase 1: beamtalk-examples Crate and Corpus Generator (S)

Affected components: New beamtalk-examples crate, build tooling

  1. Create crates/beamtalk-examples/ library crate with CorpusEntry, Corpus, LazyLock deserialization, and search.rs
  2. Create crates/beamtalk-examples/build-corpus/ binary crate (depends on beamtalk-examples + beamtalk-core)
  3. Implement extractors for each content source:
    • .bt fixture files (stdlib/test/fixtures/, tests/e2e/fixtures/, docs/learning/fixtures/): include as whole-file entries with auto-generated tags from class names, selectors, and file path
    • .bt test files (stdlib/test/*.bt): extract individual test methods with class context
    • .bt example files (examples/**/*.bt): extract notable patterns
    • .md learning modules: extract fenced .bt code blocks that pass parser validation, with surrounding prose (skip illustrative/non-working snippets)
    • beamtalk-language-features.md: extract per-feature examples
  4. Generate crates/beamtalk-examples/corpus.json
  5. Add just build-corpus task
  6. Add CI freshness check to just ci
  7. Add unit tests for search scoring and edge cases (empty query, no matches, limit capping)

Phase 2: MCP Tool Integration (S)

Affected components: beamtalk-mcp

  1. Add beamtalk-examples as a dependency of beamtalk-mcp
  2. Add SearchExamplesParams and search_examples tool handler in server.rs — delegates to beamtalk_examples::search()
  3. Add structured tracing telemetry — log query_hash, result_count, top_score, duration_us on every search call (raw query at DEBUG only)
  4. Update server instructions text to mention the new tool

Phase 3: Corpus Curation (M)

Affected components: Content

  1. Review auto-generated corpus entries for quality
  2. Add synonym tags for common vocabulary mismatches (e.g., "loop" → do:, "lambda" → blocks)
  3. Write explanations for entries that lack sufficient context
  4. Validate that representative queries return useful results

Phase 4: Search Eval Tooling (S)

Affected components: Build tooling

  1. Add just search-eval task that parses structured logs and produces a summary: zero-result queries, low-score queries, score distribution, query frequency
  2. Use eval results to drive synonym tag additions and corpus expansion
  3. Establish a baseline for search quality — if zero-result rate exceeds a threshold (e.g., >15% of queries), escalate to search backend upgrade (Phase 5)

Phase 5: Search Backend Upgrade (Future — post-0.1.0)

Affected components: beamtalk-examples internals only — no API changes

The search_examples tool interface is deliberately stable: (query, limit) → results. The search backend can be upgraded without changing the MCP tool contract, agent prompts, or corpus format. Three tiers, in order of increasing capability:

Tier 1: Tantivy + manual synonym tags (~2MB binary cost)

Replace the weighted keyword scorer with Tantivy, a pure-Rust full-text search engine. Adds BM25 scoring, stemming ("iterating" → "iterate"), and fuzzy matching. Pure Rust, no native dependencies, no cross-platform issues. Combined with the synonym tags from Phase 3, this closes the morphology gap without any ML. Lowest effort, highest confidence.

Tier 2: Tantivy + LLM-generated tag expansion (~2MB binary cost)

At build time (just build-corpus), use an LLM API to generate rich synonym/concept tags for each corpus entry — not just hand-curated synonyms but exhaustive expansions (e.g., "loop, iterate, for each, traverse, walk, map over, apply to each element" for do:). Store expanded tags in corpus.json. At runtime, Tantivy searches over these tags with stemming and fuzzy matching. This pushes the "smart" part to build time (where API calls are acceptable) and keeps runtime pure Rust with zero model overhead.

Tier 3: candle + MiniLM embeddings (~22MB binary cost)

Pre-compute embeddings for all corpus entries at build time and store vectors in corpus.json. At query time, embed the query using candle (Hugging Face's pure-Rust ML framework, no C++ deps, no ONNX runtime) with a MiniLM-L6 model (~22MB). Cosine similarity against stored vectors. Fully closes the vocabulary gap — "for loop" finds do: without any tag engineering. Cross-platform (compiles anywhere Rust compiles). Query embedding takes ~10-50ms on CPU.
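Query-time ranking in Tier 3 reduces to cosine similarity between the query embedding and each stored entry vector; a dependency-free sketch (the embedding model itself is not shown):

```rust
/// Cosine similarity between a query embedding and a stored entry embedding.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm = |v: &[f32]| v.iter().map(|x| x * x).sum::<f32>().sqrt();
    let (na, nb) = (norm(a), norm(b));
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

fn main() {
    // Toy 3-d embeddings; real MiniLM-L6 vectors are 384-d.
    let query = [1.0, 0.0, 0.0];
    let entry_do = [0.9, 0.1, 0.0]; // close to the query
    let entry_ffi = [0.0, 0.0, 1.0]; // unrelated
    println!("do:  {:.3}", cosine(&query, &entry_do));
    println!("ffi: {:.3}", cosine(&query, &entry_ffi));
}
```

Ranking sorts entries by this similarity, so the tool's (query, limit) contract is unchanged from the keyword backend.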

| Tier | Binary cost | Cross-platform | Vocab gap | Complexity |
| --- | --- | --- | --- | --- |
| Current (keyword + manual tags) | 0 | Perfect | Partial | Low |
| Tier 1 (Tantivy + manual tags) | ~2MB | Perfect | Better (stemming) | Low |
| Tier 2 (Tantivy + LLM tags) | ~2MB | Perfect | Good | Low-medium |
| Tier 3 (candle + MiniLM) | ~22MB | Perfect | Full | Medium |

The decision of which tier to pursue should be data-driven, informed by Phase 4 eval results. If zero-result queries cluster around morphological mismatches ("iterating" vs "iterate"), Tier 1 suffices. If they cluster around vocabulary gaps ("for loop" → do:), Tier 2 is the sweet spot. If novel phrasings dominate, Tier 3 is warranted.

References