# Soft Rollout: TypeAgent Collision Detection

> Companion docs: [`collision-analysis.md`](./collision-analysis.md) is
> the user guide to the data + analysis tooling (`@collision corpus *`,
> `@collision neighborhoods preview`, the interactive HTML reports). The
> [Action Collision Detection section of the dispatcher package README](../../packages/dispatcher/dispatcher/README.md#action-collision-detection)
> is the runtime detection-point reference.

## Context

The `dev/robgruen/action_collision` branch ships the full collision-detection
system (four detection points, four resolution strategies, telemetry ring
buffer, NFA-product-construction static scanner with concrete witnesses) but
**every detection point defaults to `detect: false`**. In a stock session the
runtime path is byte-identical to legacy behavior — the infrastructure is
present but never exercised. We can't tell whether collisions are rare or
endemic, whether `score-rank` differs from `first-match` in practice, or
whether `user-clarify` would help or annoy users.

This plan describes a soft rollout to the **TypeAgent dev team (3–10
people)** who **manually opt in per detection point**. Defaults stay OFF;
tester behavior is the experiment surface. Each experiment ships with
explicit success/abort criteria and a one-step rollback (config-only, no
rebuild). Telemetry feeds into the **existing DocumentDB pipeline** plus a
local JSONL fallback for offline triage.

## Current state (one-paragraph snapshot)

- **Detection points:** 3 of 4 wired and complete (`static`, `grammarMatch`,
  `llmSelect`); `fuzzy` is wired statically but the runtime hook is absent
  and the only shipped scorer is `PlaceholderScorer` which returns 0.
- **Strategies:** all 4 implemented (`first-match`, `score-rank`, `priority`,
  `user-clarify`); `pause-and-prompt` for `MultipleAction` auto-degrades.
- **Static NFA scanner:** `@grammar collisions [--json <path>]` and the
  standalone `analyze-grammar-collisions` CLI both ship. Latest scan: **103
  cross-agent collisions** across 27 schemas.
- **Config persistence:** `~/.typeagent/profiles/<profile>/sessions/<name>/data.json`
  → `settings.collision`. One-time read at session load. Hand-edit + restart
  works; no hot reload. `session.updateSettings(...)` writes back and
  reapplies in-memory.
- **Local telemetry today:** in-memory 50-event ring buffer in
  `CommandHandlerContext.collisionEvents` + `DEBUG=typeagent:dispatcher:collision`
  log lines. Not durable across session exit; no shell command surfaces it.
- **Remote telemetry already exists:** `packages/telemetry/src/logger/cosmosDBLoggerSink.ts`
  uploads to Cosmos `telemetrydb / dispatcherlogs` via `LogEvent` blobs
  (generic `eventName` + `eventData`). Gated by `@config log db on|off`,
  default OFF, env var `COSMOSDB_CONNECTION_STRING`. Auto-disables on auth
  errors. **The collision system does not yet emit into this pipeline** —
  but the extension point is one call to `logger.logEvent("collision", ...)`
  inside `emitCollisionEvent`.
- **Runtime config flip path today:** none from the shell (no
  `@config collision`). Edit JSON, restart. M1 below removes that constraint.

## Tooling milestones (gate Phase 1)

These ship before any user-facing experiment runs.

- [ ] **M1. `@config collision <point> [detect|strategy] <value>`** — runtime
      flip via `session.updateSettings`, which already persists to
      `data.json` and re-applies in-memory. Mirrors existing `@config agent`
      / `@config log db` patterns. Covers all 4 detection points + 4
      strategies + `priorityOrder` + `telemetry.emit`. Without this, every
      tester opt-in requires hand-editing JSON + restart.
      _Touches:_ [`configCommandHandlers.ts`](ts/packages/dispatcher/dispatcher/src/context/system/handlers/configCommandHandlers.ts),
      [`session.ts`](ts/packages/dispatcher/dispatcher/src/context/session.ts).

- [ ] **M2. Enrich `CollisionEvent` shape.** The existing event captures
      `kind` (detection point), `strategy`, `candidates`, `chosen`,
      `elapsedMs`, `request`, `note`, `timestamp`. For experiment analysis
      we need to add the following once, before logging lands, so we don't
      schema-migrate later:

      - `firstMatchCandidate?: CollisionCandidate`: the candidate
        `first-match` would have picked. Lets every Cosmos query answer
        "did the experiment strategy pick differently than legacy?" without
        re-running anything offline.
      - `classifier?: "distinctActions" | "tiedHeuristics" | undefined`:
        only meaningful for `kind="grammarMatch"`; records which classifier
        flagged the collision.
      - `candidates[].matchedCount?`, `nonOptionalCount?`,
        `wildcardCharCount?`, `priorityRank?`: heuristic counters per
        candidate so offline analysis can recompute alternative rankings
        without replay. (Today only an optional `score` is captured.)
      - `requestId?: string`: correlation key tying multiple events from
        the same user request (e.g. a `grammarMatch` collision followed by
        a `user-clarify` follow-up event).
      - `experimentId?: string`: copy of `collision.telemetry.experimentId`
        from session config (new field). Lets testers tag a window of
        events (`E2.1-2026-05-12`) for clean attribution; defaults unset.
      - `sessionId: string`: copy of the dispatcher session name so
        per-tester analysis can filter on it without joining other tables.

      _Touches:_ [`collisionTelemetry.ts`](ts/packages/dispatcher/dispatcher/src/context/collisionTelemetry.ts)
      (type + emit), [`session.ts`](ts/packages/dispatcher/dispatcher/src/context/session.ts)
      (add `experimentId` to `CollisionConfig.telemetry`), and the four
      detection-point call sites that build `candidates` (need to pass the
      heuristic counters and `firstMatchCandidate`):
      [`matchCollision.ts`](ts/packages/dispatcher/dispatcher/src/translation/matchCollision.ts),
      [`translateRequest.ts`](ts/packages/dispatcher/dispatcher/src/translation/translateRequest.ts),
      [`appAgentManager.ts`](ts/packages/dispatcher/dispatcher/src/context/appAgentManager.ts),
      [`fuzzyCollision.ts`](ts/packages/dispatcher/dispatcher/src/translation/fuzzyCollision.ts).

- [ ] **M3. Hook `emitCollisionEvent` into the existing telemetry logger.**
      Add one call in
      [`collisionTelemetry.ts`](ts/packages/dispatcher/dispatcher/src/context/collisionTelemetry.ts):
      `logger.logEvent("collision", stamped)` alongside the existing ring-
      buffer append. Reuses the `Logger` already plumbed via
      [`commandHandlerContext.ts`](ts/packages/dispatcher/dispatcher/src/context/commandHandlerContext.ts)
      lines 369–406 (Cosmos / Mongo dual-sink, auto-fallback, batch upload).
      Gated by **the existing `dblogging` flag** (`@config log db on`)
      AND `collision.telemetry.emit`. No new database schema; events land
      in `dispatcherlogs` with `eventName: "collision"` and the enriched
      payload from M2.
      _Touches:_ [`collisionTelemetry.ts`](ts/packages/dispatcher/dispatcher/src/context/collisionTelemetry.ts).

- [ ] **M4. Per-session local JSONL append.** Same `emitCollisionEvent`
      call site also writes to
      `~/.typeagent/profiles/<profile>/sessions/<name>/collision-events.jsonl`
      so testers without DB credentials still capture data, and so we have
      a fallback if Cosmos is misconfigured. One-line append per event;
      gated by `collision.telemetry.emit`. (We don't gate on `dblogging` —
      JSONL is the always-on local record; DB upload is the optional
      uploaded copy.)
      _Touches:_ [`collisionTelemetry.ts`](ts/packages/dispatcher/dispatcher/src/context/collisionTelemetry.ts).

- [ ] **M5. `@collision events [--limit N] [--kind <point>]`** — read recent
      events from the current session's JSONL or the in-memory ring buffer.
      Lets a tester confirm in-flight that detection is firing without
      shelling out to a file. `--kind` filters to one detection point so
      you can isolate the experiment in progress.
      _Touches:_ new handler in
      [`grammarCommandHandlers.ts`](ts/packages/dispatcher/dispatcher/src/context/system/handlers/grammarCommandHandlers.ts)
      or a sibling `collisionCommandHandlers.ts`.
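
Taken together, M2 through M4 suggest an emit path along the following lines. This is a minimal sketch, not the shipped `emitCollisionEvent`: the field names come from the M2 list, but the optionality of the pre-existing fields, the `EmitSinks` container, and the injected `logEvent` / `appendJsonl` callbacks (stand-ins for `logger.logEvent` and the file append) are assumptions. The sketch also assumes the ring-buffer append stays unconditional.

```typescript
// Sketch only. Field names follow M2; anything marked "assumed" is illustrative.
interface CollisionCandidate {
    schemaName: string;
    actionName: string;
    score?: number;
    // M2 heuristic counters
    matchedCount?: number;
    nonOptionalCount?: number;
    wildcardCharCount?: number;
    priorityRank?: number;
}

interface CollisionEvent {
    kind: "static" | "grammarMatch" | "llmSelect" | "fuzzy";
    strategy: string;
    candidates: CollisionCandidate[];
    chosen?: CollisionCandidate;
    elapsedMs?: number; // optionality of the existing fields is assumed
    request?: string;
    note?: string;
    timestamp: string;
    // M2 additions
    firstMatchCandidate?: CollisionCandidate;
    classifier?: "distinctActions" | "tiedHeuristics";
    requestId?: string;
    experimentId?: string;
    sessionId: string;
}

// Hypothetical container for the three destinations and their gates.
interface EmitSinks {
    ring: CollisionEvent[]; // in-memory ring buffer
    ringLimit: number; // 50 today
    telemetryEmit: boolean; // collision.telemetry.emit
    dbLogging: boolean; // the existing dblogging flag
    logEvent: (eventName: string, data: unknown) => void;
    appendJsonl: (line: string) => void;
}

function emitCollisionEventSketch(event: CollisionEvent, sinks: EmitSinks): void {
    // Ring buffer: drop the oldest entries once the limit is exceeded.
    sinks.ring.push(event);
    while (sinks.ring.length > sinks.ringLimit) {
        sinks.ring.shift();
    }
    if (!sinks.telemetryEmit) {
        return;
    }
    // M4: the local JSONL record is gated only by telemetry.emit.
    sinks.appendJsonl(JSON.stringify(event));
    // M3: the DB upload additionally requires dblogging.
    if (sinks.dbLogging) {
        sinks.logEvent("collision", event);
    }
}
```

Injecting the two callbacks keeps the gating logic testable without a database or a file system.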

**Optional but recommended (not blocking):**

- An `analyze-collision-events` CLI in
  `packages/actionGrammar/src/generation/` that aggregates per-tester
  JSONL files for offline analysis. Mirrors the `analyze-grammar-collisions`
  pattern. Useful when DB access is awkward.
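
A first version of such an aggregator could be little more than a per-key counter over the JSONL text. The sketch below assumes each line is one JSON-encoded event carrying the `kind` and `strategy` fields. The function name and the choice to skip unparseable lines (for example a partially written last line) are illustrative, and reading the files is left to the caller.

```typescript
// Counts events per "kind/strategy" key across the concatenated JSONL text
// of one or more testers. Hypothetical helper, not an existing CLI.
function aggregateCollisionJsonl(jsonlText: string): Map<string, number> {
    const counts = new Map<string, number>();
    for (const line of jsonlText.split("\n")) {
        const trimmed = line.trim();
        if (trimmed === "") {
            continue;
        }
        let parsed: unknown;
        try {
            parsed = JSON.parse(trimmed);
        } catch {
            continue; // tolerate a truncated or corrupt line
        }
        if (typeof parsed !== "object" || parsed === null) {
            continue;
        }
        const event = parsed as { kind?: unknown; strategy?: unknown };
        const key = `${String(event.kind)}/${String(event.strategy)}`;
        counts.set(key, (counts.get(key) ?? 0) + 1);
    }
    return counts;
}
```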

## Tester opt-in protocol

Defaults stay OFF. Each tester runs:

```
@config log db on                       # start uploading telemetry
@config collision telemetry emit on     # start recording collision events
@config collision <point> detect on     # opt in to one detection point
                                        # (typically `grammarMatch` or `llmSelect`)
```

To roll back at any point:

```
@config collision <point> detect off    # stop the active experiment
```

Each tester opts into **one experiment at a time** so attribution stays
clean. We don't run multiple strategy experiments concurrently on the same
tester.
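
The command forms used above map naturally onto dotted settings paths. The following sketch shows one way an M1 handler could translate its arguments into a patch before handing it to `session.updateSettings`. `parseCollisionArgs`, the `SettingsPatch` type, and the dotted-path convention are hypothetical; only the three wired detection points are handled, and the `warn` value that E1.1 uses for `static` is not modelled.

```typescript
// Hypothetical argument parser for `@config collision ...`; not the shipped handler.
type SettingsPatch = { path: string; value: boolean | string };

const POINTS = ["static", "grammarMatch", "llmSelect"];
const STRATEGIES = ["first-match", "score-rank", "priority", "user-clarify"];

function parseOnOff(value: string): boolean {
    if (value === "on") return true;
    if (value === "off") return false;
    throw new Error(`expected on|off, got '${value}'`);
}

function parseCollisionArgs(args: string[]): SettingsPatch {
    const [target, field, value] = args;
    if (target === undefined || field === undefined || value === undefined) {
        throw new Error("usage: @config collision <point> detect|strategy <value>");
    }
    // `@config collision telemetry emit on|off`
    if (target === "telemetry" && field === "emit") {
        return { path: "collision.telemetry.emit", value: parseOnOff(value) };
    }
    if (!POINTS.includes(target)) {
        throw new Error(`unknown detection point '${target}'`);
    }
    // `@config collision <point> detect on|off`
    if (field === "detect") {
        return { path: `collision.${target}.detect`, value: parseOnOff(value) };
    }
    // `@config collision <point> strategy <name>`
    if (field === "strategy" && STRATEGIES.includes(value)) {
        return { path: `collision.${target}.strategy`, value };
    }
    throw new Error(`unsupported setting '${field} ${value}'`);
}
```

For example, `parseCollisionArgs("grammarMatch detect on".split(" "))` yields the path `collision.grammarMatch.detect` with the value `true`.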

## Phase 1 — Observability (no behavior change)

For each detection point, set `detect: true` and keep `strategy: "first-match"`.
Outcomes are byte-identical to legacy behavior; only telemetry changes.
Goal: establish baseline collision rates per detection point in real
traffic.

| ID   | Experiment                              | Config diff                                                                               | Status  | Started | Notes         |
| ---- | --------------------------------------- | ----------------------------------------------------------------------------------------- | ------- | ------- | ------------- |
| E1.1 | `static` detection, warn-only           | `static.detect=true`, `static.strategy="warn"`, `telemetry.emit=true`, `dblogging=true`   | planned |         |               |
| E1.2 | `grammarMatch` detection, no re-routing | `grammarMatch.detect=true`, `grammarMatch.strategy="first-match"`                         | planned |         |               |
| E1.3 | `llmSelect` detection, no re-routing    | `llmSelect.detect=true`, `llmSelect.strategy="first-match"`                               | planned |         |               |
| E1.4 | `fuzzy.staticEnabled` baseline          | `fuzzy.staticEnabled=true` (still `PlaceholderScorer` → vacuous; deferred until F1 lands) | blocked |         | blocked on F1 |

**Run cadence:** sequence E1.1 → E1.2 → E1.3, ~1 week each, multiple
testers in parallel. E1.4 parked until Phase 3 / F1.

**Measurements (per experiment, queried from Cosmos):**

- Events per N user requests (collision rate per detection point)
- p50 / p99 detection-call-site latency
- Distinct schema-pairs that show up — does the runtime set match the
  static scanner's 103?
- Any user-visible failures (sanity check that `first-match` truly preserves
  legacy behavior; if not, we have a deeper bug)

**Success criteria to ship to Phase 2:** baseline rate measured, p99
latency overhead < 5ms on the cache path, zero user-visible regressions.

**Abort criteria:** any user-visible regression, telemetry overhead > 10%
of request time, Cosmos sink hitting auth-error auto-disable consistently.

**Rollback:** `@config collision <point> detect off`. Effective immediately.
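
Since the p50 / p99 figures are computed in the analysis layer rather than in the database, the offline script needs its own percentile rollup. A minimal sketch using the nearest-rank method follows; the function names are illustrative, and the 5 ms default is the Phase 1 success threshold quoted above.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of the
// samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
    if (samples.length === 0) {
        throw new Error("percentile of an empty sample set is undefined");
    }
    if (p <= 0 || p > 100) {
        throw new Error("p must be in (0, 100]");
    }
    const sorted = [...samples].sort((a, b) => a - b);
    const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
    const value = sorted[rank - 1];
    if (value === undefined) {
        throw new Error("rank out of range");
    }
    return value;
}

// Phase 1 latency check: p99 of the recorded elapsedMs values under the budget.
function meetsLatencyBudget(elapsedMs: number[], budgetMs = 5): boolean {
    return percentile(elapsedMs, 99) < budgetMs;
}
```

With only a handful of samples, nearest-rank p99 is simply the maximum, so the check is strict until enough events accumulate.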

## Phase 2 — Strategy A/B (behavior changes)

One detection point at a time, conservative strategy first. Each tester
pins to one strategy for the experiment week. Telemetry records the
candidates so post-hoc analysis can answer "how often did the new strategy
pick differently than `first-match` would have?" by comparing the
heuristically-first candidate against `chosen`.

| ID   | Experiment                      | Config diff                                             | Risk   | Status  |
| ---- | ------------------------------- | ------------------------------------------------------- | ------ | ------- |
| E2.1 | `grammarMatch` → `score-rank`   | `grammarMatch.strategy="score-rank"`                    | low    | planned |
| E2.2 | `llmSelect` → `score-rank`      | `llmSelect.strategy="score-rank"`                       | low    | planned |
| E2.3 | `grammarMatch` → `priority`     | `grammarMatch.strategy="priority"`, set `priorityOrder` | medium | planned |
| E2.4 | `llmSelect` → `priority`        | `llmSelect.strategy="priority"`                         | medium | planned |
| E2.5 | `grammarMatch` → `user-clarify` | `grammarMatch.strategy="user-clarify"`                  | high   | planned |
| E2.6 | `llmSelect` → `user-clarify`    | `llmSelect.strategy="user-clarify"`                     | high   | planned |

**Risk reasoning:**

- `score-rank` is deterministic re-ranking using existing match metadata;
  no UX change beyond the picked candidate.
- `priority` requires the operator to pick a sensible ordering; surprising
  results possible if the order is wrong, but no UX disruption.
- `user-clarify` synthesizes a clarification action — visible UX, can loop
  if the user keeps selecting an ambiguous candidate (see _Cross-cutting_).

**Success criteria per experiment:** divergence from `first-match` is
measurable (≥ 5% of collision events resolve differently) AND either
(a) categorically reduces user-visible misroutes (eyeball + tester reports),
or (b) for `user-clarify`, the user reaches the right action in ≤2 round-
trips ≥80% of the time.

**Abort criteria:** misroute rate increases, clarify-loop entered (same
collision repeats within 3 round-trips), tester drops out citing friction.

**Rollback:** `@config collision <point> strategy first-match`.
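
The divergence half of the success criterion reduces to a simple ratio once events carry M2's `firstMatchCandidate`. A sketch of that computation, assuming the events have already been fetched into memory and trimmed to the fields shown (the `DivergenceInput` type and the function names are illustrative):

```typescript
interface ActionRef {
    schemaName: string;
    actionName: string;
}

// Only the fields needed for the divergence check.
interface DivergenceInput {
    strategy: string;
    chosen?: ActionRef;
    firstMatchCandidate?: ActionRef;
}

// Fraction of comparable events where the strategy picked something other
// than what first-match would have picked. Returns undefined when no event
// is comparable (baseline strategy, or a missing candidate on either side).
function divergenceRate(events: DivergenceInput[]): number | undefined {
    let total = 0;
    let diverged = 0;
    for (const e of events) {
        if (e.strategy === "first-match" || !e.chosen || !e.firstMatchCandidate) {
            continue;
        }
        total++;
        const same =
            e.chosen.schemaName === e.firstMatchCandidate.schemaName &&
            e.chosen.actionName === e.firstMatchCandidate.actionName;
        if (!same) {
            diverged++;
        }
    }
    return total === 0 ? undefined : diverged / total;
}

// The ">= 5% of collision events resolve differently" criterion.
function divergenceIsMeasurable(events: DivergenceInput[], minRate = 0.05): boolean {
    const rate = divergenceRate(events);
    return rate !== undefined && rate >= minRate;
}
```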

## Phase 3 — Fuzzy (blocked on scorer)

Sequential; code lands before experiments run.

- [ ] **F1.** Real `ActionEmbeddingScorer` — wraps the multi-vector
      similarity engine built in **Phase 5 / S1** (see below) so the
      fuzzy detection point can call into it directly. Replaces
      `PlaceholderScorer` as the default when `scorer: "actionEmbedding"`
      is set. Blocked on S1.
      _Touches:_ [`fuzzyCollision.ts`](ts/packages/dispatcher/dispatcher/src/translation/fuzzyCollision.ts).
- [ ] **F2.** Wire runtime fuzzy hook — `isFuzzyCollisionForMatch()` exists
      but has zero call sites. Add the post-resolver call site in
      [`matchCollision.ts`](ts/packages/dispatcher/dispatcher/src/translation/matchCollision.ts)
      gated on `fuzzy.runtimeEnabled`.
- [ ] **F3.** Threshold calibration — current `0.85` is a placeholder. Use
      a labeled pair set of known-similar and known-dissimilar agent-action
      pairs. Sweep the threshold and measure precision/recall.
- [ ] **F4.** On-disk fuzzy matrix cache — once F1 lands, the static
      pairwise scan is non-trivial; cache the matrix in the agent cache
      directory keyed by agent-action set hash so it doesn't re-run on
      every dispatcher boot.
- [ ] **F5.** Re-run Phase 1 (E1.4) and Phase 2 ladders for `fuzzy`.
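
The F3 calibration step amounts to a threshold sweep over labeled pairs. The sketch below is one way to express it; the `LabeledPair` shape and the rule that a pair is flagged when its score is greater than or equal to the threshold are assumptions, since the comparison used by the real scorer is not specified here.

```typescript
interface LabeledPair {
    score: number; // similarity score from the scorer under test
    similar: boolean; // human label: true for a known-similar pair
}

interface SweepRow {
    threshold: number;
    precision: number | null; // null when nothing is flagged
    recall: number | null; // null when the set has no similar pairs
}

function sweepThresholds(pairs: LabeledPair[], thresholds: number[]): SweepRow[] {
    return thresholds.map((threshold) => {
        let tp = 0;
        let fp = 0;
        let fn = 0;
        for (const pair of pairs) {
            const flagged = pair.score >= threshold; // assumed comparison
            if (flagged && pair.similar) tp++;
            else if (flagged && !pair.similar) fp++;
            else if (!flagged && pair.similar) fn++;
        }
        return {
            threshold,
            precision: tp + fp === 0 ? null : tp / (tp + fp),
            recall: tp + fn === 0 ? null : tp / (tp + fn),
        };
    });
}
```

Sweeping a grid such as 0.50 to 0.95 and inspecting where precision and recall cross gives a data-backed replacement for the `0.85` placeholder.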

## Phase 4 — Static NFA collision triage (parallel track)

Independent of the runtime experiments. Uses the JSON output of the static
scanner.

- [ ] **T1.** Generate baseline:
      `analyze-grammar-collisions --dir packages/agents --out collisions-baseline.json`.
      Commit to repo as the reference set.
- [ ] **T2.** Categorize the 103 collisions:
  - **Tier 1 (real bugs, fix first):** short witness + concrete (no
    placeholders) + matched actions diverge in a way the user would notice.
  - **Tier 2 (likely false positives):** witness contains placeholder
    tokens — type-only overlap, may not actually surface at runtime.
  - **Tier 3 (deliberate / by-design):** vampire test agent, etc.
- [ ] **T3.** Tune `.agr` files for Tier 1 collisions (add disambiguating
      prefixes, narrow wildcards, etc.). Re-scan, diff JSON, confirm
      reduction.
- [ ] **T4.** CI gate: run `analyze-grammar-collisions` and `jq`-check that
      Tier 1 collision count doesn't regress past `collisions-baseline.json`.

## Phase 5 — Semantic action collision discovery (parallel track)

Surfaces collisions that the grammar / NFA path can't see — actions that
are semantically the same kind of operation (`browser.openWebPage` ⟷
`desktop.openFile` ⟷ `archives.expand`) even when their `.agr` patterns
don't overlap. Output of this phase becomes the engine for the F1
milestone above; until S1 lands, fuzzy detection is inert.

Sequential — each milestone uses the artifacts of the previous one.

- [x] **S1. `@collision similar` — multi-vector cross-schema similarity (semantic neighborhoods).** _Demoted from "find dispatch collisions" — see findings below._ Embeds each loaded action under multiple independent vectors (desc / params / nameShape / agentContext / agentAndAction), runs pairwise scoring across cross-schema pairs under one of six named strategies, and clusters via complete-linkage agglomeration. HTML cluster view, `--json` export, score-distribution histogram.
      _Status:_ shipped (S1 → S1.2 → `@collision probe`).
      _What it answers:_ "Which actions are the same kind of operation, regardless of agent?" — a **semantic-neighborhoods** scanner.
      _What it does NOT answer:_ "Which actions actually compete at the dispatcher's routing path?" Validated empirically against the toggle clusters: 12 hand-crafted probes ran through `@collision probe`; 11 of 12 routed to the expected target as top-1 — the cross-agent embedding cluster was a semantic neighborhood, not a dispatch collision. The competitors that matter are within-agent siblings, which `@collision similar` skips by design (cross-schema-only).
      _Useful for:_ surfacing naming inconsistencies, finding duplicate-purpose actions across the agent set, action-tuning candidates. Keep as is, but stop framing it as the rollout's primary collision tool.

- [ ] **S1b. Within-schema sibling analysis.** Add `--within-schema`
      mode to `@collision similar` that runs the same multi-vector
      analysis on action pairs _within_ each agent. Per the S1 findings
      above, runtime ambiguity comes from sibling pairs like
      (`ConnectWifi`, `EnableWifi`, `DisconnectWifi`) or
      (`EnableFilterKeys`, `EnableStickyKeys`). Same engine, different
      filter; small change.

- [x] **S2. LLM-synthesized phrase corpus per action — multi-model.**
      _Done._ See findings below.

      Original spec follows for reference:
      For each loaded action, prompt **every available chat model** (via
      `aiclient.getChatModelNames()`) to generate **3 phrases each**
      using a **diversity prompt** ("one short imperative, one
      conversational/polite, one casual/abbreviated").  Multiple models
      add phrasing variance — different LLMs converge on different
      defaults; merging across models broadens the surface.  Output:
      one corpus JSON with per-phrase source attribution
      (`{schemaName, actionName, phrases: [{text, model}]}`), deduped
      by lowercased text.
      Cache key: `(modelName, actionShapeHash)` — only regenerate when
      either the model list or the action's shape changes.  Cache lives
      under the dispatcher's instance dir alongside other agent caches.
      Implementation: a `corpus-runner.mjs` script (sibling to
      `probe-runner.mjs`) that spins up a read-only dispatcher and
      drives the model calls with concurrency 8 — at ~1000 actions × 4
      models × 3 phrases, full run is ~40 min wall-clock.
      Sample first (one or two agents end-to-end), then scale.
      _Touches:_ new `packages/cli/scripts/corpus-runner.mjs`, uses
      existing model client + cache directory.

- [x] **S3. Replay the corpus through the semantic ranker.**
      _Done._ Implemented as
      [`packages/cli/scripts/probe-corpus-runner.mjs`](ts/packages/cli/scripts/probe-corpus-runner.mjs)
      (calls `agents.semanticSearchActionSchema` directly rather than going
      through `@collision probe`'s HTML output). Reanalyzed with a
      prefix-aware matcher in
      [`reanalyze-probe-results.mjs`](ts/packages/cli/scripts/reanalyze-probe-results.mjs)
      to fold out type-name-vs-enum-name false misroutes.

      **Future work:** ship a "feed events to the JSONL/Cosmos pipeline
      with `kind: "fuzzy"`" mode so probe-corpus runs surface in
      `@collision events` and the Phase 1 Cosmos queries.  Today the
      script just writes a local JSON.

## S2/S3 calibration findings (baseline before any corrective action)

**Corpus** ([`corpus-runner.mjs`](ts/packages/cli/scripts/corpus-runner.mjs)
output, run 2026-05-07): 489 actions across 65 schemas (1 schema —
`mcpfilesystem` — failed to load and was skipped). Three working
OpenAI-family models (`GPT_4_1`, `GPT_5`, `GPT_5_NANO`). Each (action,
model) call asks for 3 phrases in distinct styles (imperative,
conversational, casual). **4392 raw model outputs → 4258 unique
phrases (96.9% dedup keep-rate)** = high stylistic variance. Run time
~25 min at concurrency 8.
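
The dedup step behind these numbers is small enough to sketch in full. The version below assumes the `{text, model}` phrase shape from the S2 spec; the whitespace trim and the first-occurrence-wins policy are assumptions, not confirmed behavior of `corpus-runner.mjs`.

```typescript
interface Phrase {
    text: string;
    model: string; // source attribution
}

// Dedupe by lowercased text, keeping the first phrase seen for each key.
function dedupePhrases(raw: Phrase[]): Phrase[] {
    const seen = new Set<string>();
    const unique: Phrase[] = [];
    for (const phrase of raw) {
        const key = phrase.text.trim().toLowerCase();
        if (key === "" || seen.has(key)) {
            continue;
        }
        seen.add(key);
        unique.push(phrase);
    }
    return unique;
}

// Share of raw outputs that survive dedup, as a fraction in [0, 1].
function keepRate(rawCount: number, uniqueCount: number): number {
    return rawCount === 0 ? 0 : uniqueCount / rawCount;
}
```

Applied to the run above, `keepRate(4392, 4258)` is about 0.969, which matches the reported 96.9%.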

`GPT_4_O` and `GPT_4_O_MINI` are broken in this checkout (stale API
version pin / wrong API key) and would add more variance once fixed.

**Probe replay** ([`probe-corpus-runner.mjs`](ts/packages/cli/scripts/probe-corpus-runner.mjs)
on the corpus, delta=0.05, top=5):

| Verdict  | Count | %     | Meaning                                             |
| -------- | ----- | ----- | --------------------------------------------------- |
| CLEAN    | 419   | 9.8%  | top-1 correct AND Δ to #2 ≥ 0.05                    |
| TIGHT    | 1983  | 46.6% | top-1 correct but Δ < 0.05 (`llmSelect` would flag) |
| MISROUTE | 1856  | 43.6% | top-1 wrong                                         |

**Misroute split: 55% cross-agent, 45% within-agent.** Both buckets
are big enough to matter.
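
The verdict definitions in the table translate directly into code. The sketch below illustrates those definitions and is not the logic in `probe-corpus-runner.mjs`; the `RankedAction` shape and the handling of a single-candidate result (treated as CLEAN) are assumptions.

```typescript
type Verdict = "CLEAN" | "TIGHT" | "MISROUTE";

interface RankedAction {
    schemaName: string;
    actionName: string;
    score: number;
}

// `ranked` is the ranker output in descending score order.
function classifyProbe(
    expected: { schemaName: string; actionName: string },
    ranked: RankedAction[],
    delta = 0.05,
): Verdict {
    const top = ranked[0];
    if (
        top === undefined ||
        top.schemaName !== expected.schemaName ||
        top.actionName !== expected.actionName
    ) {
        return "MISROUTE"; // top-1 wrong (or nothing returned)
    }
    const runnerUp = ranked[1];
    if (runnerUp === undefined) {
        return "CLEAN"; // assumption: no competitor means no tightness
    }
    // top-1 correct: CLEAN only if the gap to #2 is at least delta
    return top.score - runnerUp.score >= delta ? "CLEAN" : "TIGHT";
}
```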

**Per-style:** casual, abbreviated phrasing wrecks routing.

| Style          | CLEAN | TIGHT | MISROUTE |
| -------------- | ----- | ----- | -------- |
| imperative     | 16.5% | 49.0% | 33.7%    |
| conversational | 9.6%  | 53.7% | 36.6%    |
| casual         | 5.5%  | 35.0% | 59.5%    |

**Per-source-model:** GPT_5_NANO produces the most-routable phrasings.
Probably because nano outputs are shorter and more imperative on
average — closer to the action description voice — which the ranker
can pin down.

| Model      | CLEAN | TIGHT | MISROUTE |
| ---------- | ----- | ----- | -------- |
| GPT_5_NANO | 13.5% | 49.0% | 38.0%    |
| GPT_4_1    | 9.4%  | 45.6% | 45.5%    |
| GPT_5      | 8.9%  | 46.3% | 45.6%    |

### Five misroute patterns the data exposes

The signal isn't uniform — it concentrates in five distinct categories,
each calling for a different fix.

- **A. Cross-agent semantic hubs.** One generic action absorbs phrases
  meant for siblings across multiple agents. The exemplar:
  `desktop.SetVolumeAction` is the universal sink for any volume-related
  phrase generated for `player.setVolume`, `player.setMaxVolume`,
  `player.changeVolume`, `localPlayer.setVolume`, `localPlayer.changeVolume`,
  and even `desktop.AdjustVolume` (5 of the top-30 misroute edges).
  This is the canonical "open bbc" case at scale.
  _Fix candidate:_ tighten descriptions ("set system audio volume" not
  just "set volume"), or add `priorityOrder` so music agents win when
  both match.

- **B. Within-agent disambiguation hubs.** One action absorbs phrases
  meant for siblings _inside the same agent_. `player.PlayArtistAction`
  steals from `playTrack` (9×), `playGenre` (8×), `addSongsToPlaylist`
  (7×); `list.GetListAction` steals from `createList` (7×) and others.
  These don't show up in `@collision similar`'s output at all because
  it filters cross-schema-only — exactly the gap S1b was reframed for.
  _Fix candidate:_ tighten the hub action's description to be more
  specific; add example utterances to siblings' descriptions.

- **C. Near-duplicate agents.** `localPlayer` (local file player) and
  `player` (Spotify) cover the same conceptual surface; phrases
  generated for one routinely route to the other:
  `localPlayer.shuffle → player.ShuffleAction` (8×),
  `localPlayer.mute → desktop.MuteVolumeAction` (9×), several volume edges.
  _Fix candidate:_ document the agent boundary explicitly in agent
  descriptions ("local file" vs "Spotify"); or accept the collision and
  use `priorityOrder` to bias the more common case.

- **D. Engineered collisions firing as designed.**
  `vampire.createCalendarEvent → calendar.ScheduleEventAction` (7×),
  `vampire.revive → player.PlayArtistAction` (7×). Vampire is doing what
  it was designed to do: these aren't bugs, they're the test fixtures
  detecting collisions correctly.

- **E. Naming-hygiene noise.** TypeScript type names sometimes carry
  prefixes the action description doesn't (`code.code-editor.saveAllFiles`
  → `EditorActionSaveAllFiles` ×9; similar for other `EditorAction*`
  types). Routing is _correct_ — the embedder just doesn't realize
  it because the type-name and the description don't share vocabulary.
  _Fix candidate:_ rename the types to drop the `EditorAction` prefix.
  Pure refactor, doesn't change runtime behavior.

### Cleanest actions (calibration anchors)

These set the bar for what good disambiguation looks like. Common
pattern: distinctive vocabulary + no within-agent siblings competing.

```
desktop.Debug                                       9 CLEAN
desktop.desktop-taskbar.DisplayTaskbarOnAllMonitors 8 CLEAN
desktop.desktop-taskbar.ShowBadgesOnTaskbar         8 CLEAN
code.code-display.showOutputPanel                   8 CLEAN
desktop.desktop-personalization.ApplyColorToTitleBar 7 CLEAN
desktop.desktop-taskbar.DisplaySecondsInSystrayClock 7 CLEAN
onboarding.listIntegrations                         7 CLEAN
desktop.ListThemes                                  7 CLEAN
```

### What this means for Phase 1 / Phase 2 baseline

The 9.8% / 46.6% / 43.6% split is the **before-corrective-action
baseline** for the rollout's empirical questions. Expectations:

- **Phase 1 / E1.3 (`llmSelect.detect=on`)**: with the current
  scoreDeltaThreshold of 0.05, 90.2% of phrases would emit a collision
  event (TIGHT 46.6% + MISROUTE 43.6%). That's a high event rate; if the
  experiment ratio holds for real traffic, the JSONL/Cosmos pipeline
  will see ~1 collision per ~1.1 user requests. Plan capacity
  accordingly.
- **Phase 2 / E2.x (strategy A/B)**: non-`first-match` strategies have
  something to act on for up to 90.2% of phrases. If Phase 2 measures a
  divergence rate (chosen ≠ `firstMatchCandidate`) near zero, the
  strategies aren't triggering.

- [ ] **S4. Cross-pollinate with real-world telemetry.** The collision
      events JSONL accumulating from Phase 1 has actual user requests in
      its `request` field. Feed those through the same probe path as
      S3 to extend the synthetic corpus with real phrasing. The
      synthetic-vs-real divergence is itself a calibration signal.

- [ ] **S5. Wire S1 as the `actionEmbedding` scorer (= F1).** Once S1's
      similarity scores are stable, replace `PlaceholderScorer` in
      [`fuzzyCollision.ts`](ts/packages/dispatcher/dispatcher/src/translation/fuzzyCollision.ts)
      with a thin adapter calling into the S1 engine. This unblocks
      Phase 1 / E1.4 and Phase 3 / F2-F4.

## Cross-cutting items (track but don't block phases)

- [ ] **Clarify-loop bias** — when user picks an agent in response to
      `ClarifyMultipleAgentMatches`, the same collision can repeat next
      round-trip. Mitigation noted in dispatcher README. Address before
      E2.5/E2.6 if it bites.
- [ ] **`pause-and-prompt` for `MultipleAction`** — auto-degrades today.
      Address only if Phase 2 shows users frequently hit `MultipleAction` +
      `user-clarify`.
- [ ] **`@grammar collisions runtime`** — surface the `lastStaticCollisions`
      snapshot from `commandHandlerContext`. M5 (`@collision events`)
      covers the more useful per-event ring buffer.

## Cosmos query reference

Events upload to `telemetrydb` / `dispatcherlogs` with `eventName: "collision"`.
Sample queries to drive experiment analysis:

```sql
-- 1. Collision rate per (detection point, strategy) — last 7 days
SELECT c.event.kind, c.event.strategy, COUNT(1) AS events
FROM   c
WHERE  c.eventName = "collision"
  AND  c.timestamp > DateTimeAdd("dd", -7, GetCurrentDateTime())
GROUP  BY c.event.kind, c.event.strategy

-- 2. Latency distribution per detection point
SELECT c.event.kind, c.event.elapsedMs
FROM   c
WHERE  c.eventName = "collision"
  AND  c.event.kind = "grammarMatch"
  AND  c.timestamp > DateTimeAdd("dd", -7, GetCurrentDateTime())
-- Roll up p50/p99 in the analysis layer (Cosmos lacks PERCENTILE_CONT).

-- 3. Strategy divergence — how often did the chosen candidate differ from
--    what first-match would have picked?  M2's `firstMatchCandidate` field
--    makes this a one-row check per event.
SELECT
    c.event.kind,
    c.event.strategy,
    c.event.experimentId,
    SUM(
      IIF(
        c.event.chosen.schemaName = c.event.firstMatchCandidate.schemaName
          AND c.event.chosen.actionName = c.event.firstMatchCandidate.actionName,
        0, 1
      )
    ) AS diverged,
    COUNT(1) AS total
FROM c
WHERE c.eventName = "collision"
  AND c.event.strategy != "first-match"
  AND c.event.chosen != null
  AND c.timestamp > DateTimeAdd("dd", -7, GetCurrentDateTime())
GROUP BY c.event.kind, c.event.strategy, c.event.experimentId

-- 4. Distinct schema-pairs surfacing in the runtime — compare against
--    analyze-grammar-collisions --out collisions-baseline.json
SELECT DISTINCT
    c.event.candidates[0].schemaName AS schemaA,
    c.event.candidates[1].schemaName AS schemaB
FROM c
WHERE c.eventName = "collision"
  AND ARRAY_LENGTH(c.event.candidates) >= 2

-- 5. Per-tester / per-experiment summary — useful for our 3–10 dev-team
--    rollout where each tester opts in via @config and may pin a different
--    strategy for the experiment week.
SELECT
    c.event.sessionId,
    c.event.experimentId,
    c.event.kind,
    c.event.strategy,
    COUNT(1) AS events
FROM c
WHERE c.eventName = "collision"
  AND c.timestamp > DateTimeAdd("dd", -14, GetCurrentDateTime())
GROUP BY c.event.sessionId, c.event.experimentId, c.event.kind, c.event.strategy

-- 6. Classifier breakdown for grammarMatch — which classifier flagged it?
--    Helps decide whether to default to distinctActions or tiedHeuristics.
SELECT c.event.classifier, COUNT(1) AS events
FROM   c
WHERE  c.eventName = "collision"
  AND  c.event.kind = "grammarMatch"
GROUP  BY c.event.classifier
```

These run in the Azure portal Cosmos Data Explorer or via the Cosmos SDK
in a small offline analysis script. No in-repo query/dashboard tooling
exists today; if cross-experiment dashboards become a recurring need,
that's a follow-up CLI.
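
Query 4 returns ordered `(schemaA, schemaB)` rows, so the same pair can surface in both orders. Before comparing against the static scanner's output, an offline script would likely normalize the pairs first. A sketch with hypothetical names, assuming both sides have already been reduced to plain schema-name pairs:

```typescript
type SchemaPair = { schemaA: string; schemaB: string };

// Order-independent key so (a, b) and (b, a) collapse to one entry.
function pairKey(pair: SchemaPair): string {
    return [pair.schemaA, pair.schemaB].sort().join(" | ");
}

// Which pairs show up only at runtime, and which only in the static scan?
function compareSchemaPairs(
    runtime: SchemaPair[],
    staticScan: SchemaPair[],
): { runtimeOnly: string[]; staticOnly: string[] } {
    const runtimeKeys = new Set(runtime.map(pairKey));
    const staticKeys = new Set(staticScan.map(pairKey));
    return {
        runtimeOnly: Array.from(runtimeKeys).filter((k) => !staticKeys.has(k)).sort(),
        staticOnly: Array.from(staticKeys).filter((k) => !runtimeKeys.has(k)).sort(),
    };
}
```

A non-empty `runtimeOnly` list is the interesting case for Phase 1: it names pairs that collide in real traffic but that the static scan did not predict.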

## Experiment card template

Each experiment row above expands into a detailed card when activated. Add
the card inline to this doc under the matching row.

```
### E1.2 — grammarMatch detection, no re-routing

Hypothesis: cache-path collisions occur on >1% of natural-language requests
once detection is on; the runtime set is a subset of the agent pairs
surfaced by the static NFA scanner.

Config diff (delta from defaults):
  collision.grammarMatch.detect = true
  collision.grammarMatch.strategy = "first-match"   # no behavior change
  collision.telemetry.emit = true
  dblogging = true

What we measure (Cosmos query #1, #2):
  - count(events where kind="grammarMatch") / count(total user requests)
  - per-event: schemaA, schemaB, request, elapsedMs
  - p99 elapsedMs at the detection call site

Success criteria: rate measured, p99 < 5ms; can promote to E2.1.
Abort criteria:    user-visible regression, p99 > 50ms.

Rollback: @config collision grammarMatch detect off

Status: planned | running | complete | aborted
Started: <date>
Ended:   <date>
Result:  <one-line summary, link to Cosmos query result or JSONL>
Notes:   <surprises, follow-ups, links to events of interest>
```
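
The measurable part of the E1.2 criteria can be checked mechanically once the `elapsedMs` samples and the total request count are exported. The following is a minimal sketch, not part of the repo: `percentile` and `evaluateE12` are hypothetical helpers, the nearest-rank percentile method is an assumption, and the `"inconclusive"` verdict for a p99 between 5 ms and 50 ms is an illustrative label for the range the card leaves undefined. "User-visible regression" is a human judgment and is not modeled.

```typescript
type Verdict = "success" | "abort" | "inconclusive";

// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
    if (samples.length === 0) throw new Error("no samples");
    const sorted = [...samples].sort((a, b) => a - b);
    const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
    return sorted[rank - 1];
}

// elapsedMs: one sample per grammarMatch collision event.
// totalRequests: count of user requests in the same window.
function evaluateE12(elapsedMs: number[], totalRequests: number) {
    if (totalRequests <= 0) throw new Error("totalRequests must be positive");
    const rate = elapsedMs.length / totalRequests;
    const p99 = percentile(elapsedMs, 99);
    let verdict: Verdict = "inconclusive"; // assumption: 5..50 ms needs a human call
    if (p99 > 50) verdict = "abort"; // abort criterion from the card
    else if (p99 < 5) verdict = "success"; // success criterion from the card
    return { rate, p99, verdict };
}

// Made-up samples for illustration.
const e12 = evaluateE12([0.4, 0.6, 0.5, 1.2, 0.9], 200);
console.log(e12);
```

The returned `rate` can be compared directly against the card's ">1% of requests" hypothesis.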

## Verification (how we know each milestone landed)

- **M1 (`@config collision`):** `@config collision grammarMatch detect on`
  in a fresh shell session; verify `data.json` mutated; restart; verify
  setting persisted; verify `@config` echoes the current value back.
- **M2 (enriched event shape):** unit-test that `emitCollisionEvent`
  produces an event with all of `kind`, `strategy`, `firstMatchCandidate`,
  `classifier` (when grammarMatch), per-candidate heuristic counters,
  `requestId`, `sessionId`, and (when set) `experimentId`. Update
  `collisionTelemetry.spec.ts` to cover the new fields. Trigger from each
  of the four detection-point call sites and assert the call site
  populates the right counters.
- **M3 (DocumentDB upload):** trigger a known collision via the vampire
  test agent + a colliding utterance; verify a document with
  `eventName: "collision"` and the enriched payload appears in the
  `dispatcherlogs` collection within ~2 seconds (sink batches at 1s).
- **M4 (local JSONL):** same trigger; verify
  `collision-events.jsonl` exists in the session dir and contains a single
  line with all enriched fields populated.
- **M5 (`@collision events`):** after M3/M4, run `@collision events --limit 5`
  in the shell; verify it surfaces the events.
- **Phase 1 readiness:** all M1–M5 verified; unit tests in
  `collisionMatch.spec.ts` / `collisionTelemetry.spec.ts` still green.
- **Per-experiment:** experiment is "complete" when its card has Started,
  Ended, Result populated and the JSONL or Cosmos evidence linked.
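
The M2 check lends itself to a small field validator that a unit test can call. The sketch below is an assumption about the enriched shape, built only from the field list in M2: the `EnrichedCollisionEvent` and `CandidateInfo` types, the `schemaName` and `heuristics` property names, and the `missingFields` helper are all hypothetical, and the real `CollisionEvent` type may differ.

```typescript
type DetectionKind = "static" | "grammarMatch" | "llmSelect" | "fuzzy";

// Hypothetical per-candidate payload; property names are assumptions.
interface CandidateInfo {
    schemaName: string;
    heuristics: Record<string, number>; // per-candidate heuristic counters
}

// Hypothetical enriched event shape derived from the M2 field list.
interface EnrichedCollisionEvent {
    kind: DetectionKind;
    strategy: string;
    firstMatchCandidate: string;
    classifier?: string; // expected when kind === "grammarMatch"
    candidates: CandidateInfo[];
    requestId: string;
    sessionId: string;
    experimentId?: string; // present only when an experiment is pinned
}

// Return the names of fields M2 expects but the event lacks.
function missingFields(event: Partial<EnrichedCollisionEvent>): string[] {
    const required: (keyof EnrichedCollisionEvent)[] = [
        "kind",
        "strategy",
        "firstMatchCandidate",
        "candidates",
        "requestId",
        "sessionId",
    ];
    const missing: string[] = required.filter((f) => event[f] === undefined);
    if (event.kind === "grammarMatch" && event.classifier === undefined) {
        missing.push("classifier");
    }
    return missing;
}

// Made-up event: a grammarMatch collision that forgot its classifier.
const incomplete = missingFields({
    kind: "grammarMatch",
    strategy: "first-match",
    firstMatchCandidate: "agentA",
    candidates: [],
    requestId: "r1",
    sessionId: "s1",
});
console.log(incomplete);
```

A spec can assert that the list is empty for events produced at each of the four detection-point call sites.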

## Critical files reference (for execution)

- Engine: [`grammarCollisionScanner.ts`](ts/packages/actionGrammar/src/grammarCollisionScanner.ts),
  [`nfaIntersection.ts`](ts/packages/actionGrammar/src/nfaIntersection.ts)
- Detection wiring: [`appAgentManager.ts`](ts/packages/dispatcher/dispatcher/src/context/appAgentManager.ts),
  [`matchCollision.ts`](ts/packages/dispatcher/dispatcher/src/translation/matchCollision.ts),
  [`translateRequest.ts`](ts/packages/dispatcher/dispatcher/src/translation/translateRequest.ts),
  [`fuzzyCollision.ts`](ts/packages/dispatcher/dispatcher/src/translation/fuzzyCollision.ts)
- Config + persistence: [`session.ts`](ts/packages/dispatcher/dispatcher/src/context/session.ts)
- Local telemetry: [`collisionTelemetry.ts`](ts/packages/dispatcher/dispatcher/src/context/collisionTelemetry.ts)
- Remote telemetry pipeline:
  [`telemetry/src/logger/cosmosDBLoggerSink.ts`](ts/packages/telemetry/src/logger/cosmosDBLoggerSink.ts),
  [`telemetry/src/logger/databaseLoggerSink.ts`](ts/packages/telemetry/src/logger/databaseLoggerSink.ts),
  [`telemetry/src/logger/logger.ts`](ts/packages/telemetry/src/logger/logger.ts);
  wired in
  [`commandHandlerContext.ts`](ts/packages/dispatcher/dispatcher/src/context/commandHandlerContext.ts)
  lines 369–406.
- Existing handlers (M1, M4 patterns): [`configCommandHandlers.ts`](ts/packages/dispatcher/dispatcher/src/context/system/handlers/configCommandHandlers.ts)
  (`@config log db` lives at lines 1590–1596),
  [`grammarCommandHandlers.ts`](ts/packages/dispatcher/dispatcher/src/context/system/handlers/grammarCommandHandlers.ts).
- Tests: [`collisionMatch.spec.ts`](ts/packages/dispatcher/dispatcher/test/collisionMatch.spec.ts),
  [`collisionFuzzy.spec.ts`](ts/packages/dispatcher/dispatcher/test/collisionFuzzy.spec.ts),
  [`collisionTelemetry.spec.ts`](ts/packages/dispatcher/dispatcher/test/collisionTelemetry.spec.ts),
  [`nfaIntersection.spec.ts`](ts/packages/actionGrammar/test/nfaIntersection.spec.ts).

## Phase 1 kick-off (immediate actions)

The plan moves into the repo as the canonical record (so testers can read
the same document, and PRs can reference experiment IDs). Proposed path:
**`ts/docs/architecture/collision-rollout.md`** — alongside the existing
`dispatcher.md` it cross-references.

First execution-mode steps after this plan is approved:

1. **Check in the plan.** Copy
   `~/.claude/plans/let-s-develop-a-plan-soft-robin.md` →
   `ts/docs/architecture/collision-rollout.md`. Cross-link from the
   "Action Collision Detection" section of the dispatcher README and the
   architecture doc's TODO bullets so it's discoverable. Commit on
   `dev/robgruen/action_collision`.
2. **M1 — `@config collision`.** Add the handler in
   `configCommandHandlers.ts` mirroring `@config log db` and
   `@config agent` patterns. Subcommands:
   - `@config collision <point> detect <on|off>`
   - `@config collision <point> strategy <name>`
   - `@config collision priority <comma,separated,agents>`
   - `@config collision telemetry [emit|debugLog|experimentId] <value>`
   - `@config collision` (no args → echo current config)

   Add unit coverage to the existing dispatcher test suite.
3. **M2 — Enrich `CollisionEvent`.** Update the type and the four
   detection-point call sites to populate `firstMatchCandidate`,
   `classifier`, per-candidate heuristic counters, `requestId`,
   `experimentId`, `sessionId`. Update `collisionTelemetry.spec.ts` to
   cover the new fields.
4. **M3 — Hook the logger.** One-line addition in `emitCollisionEvent` to
   call `logger.logEvent("collision", stamped)`; gate on `dblogging`
   AND `collision.telemetry.emit`.
5. **M4 — JSONL export.** Append every emitted event to
   `<sessionDir>/collision-events.jsonl`.
6. **M5 — `@collision events`.** New handler reading recent events from
   the ring buffer (or JSONL if buffer is empty).
7. **Validate end-to-end.** Enable vampire agent +
   `@config collision grammarMatch detect on` + `@config log db on`;
   trigger a known
   colliding utterance; confirm the event lands in (a) the ring buffer
   via `@collision events`, (b) the local JSONL, and (c) the
   `dispatcherlogs` Cosmos collection.
8. **E1.1.** Recruit first tester (likely the author); run E1.1 (`static`
   warn-only) for one week; record results in the experiment card.

Each numbered step is a small commit. After step 7, the platform is ready
to onboard testers and Phase 1 experiments E1.2 / E1.3 can run in
parallel across the dev team.
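
Steps 4 to 6 together describe one emit path with three sinks. A minimal sketch of how they might fit is below. Everything here is a stand-in: `Sinks`, `emitCollisionEventSketch`, and `recentEvents` are hypothetical names, the arrays replace the real ring buffer and file append, and `logEvent` stands in for the logger call named in step 4. It assumes the function is invoked only for events that are actually emitted, and that only the remote upload is gated on `dblogging` AND `collision.telemetry.emit`.

```typescript
interface TelemetryConfig {
    dblogging: boolean;
    emit: boolean; // collision.telemetry.emit
}

// In-memory stand-ins for the three sinks described in steps 4-6.
interface Sinks {
    ring: object[]; // ring buffer of recent events
    ringCapacity: number;
    jsonlLines: string[]; // lines that would be appended to collision-events.jsonl
    logEvent: (name: string, payload: object) => void; // remote logger stand-in
}

function emitCollisionEventSketch(event: object, config: TelemetryConfig, sinks: Sinks): void {
    const stamped = { ...event, timestamp: new Date().toISOString() };
    sinks.ring.push(stamped);
    if (sinks.ring.length > sinks.ringCapacity) sinks.ring.shift(); // drop the oldest
    sinks.jsonlLines.push(JSON.stringify(stamped)); // M4: one line per event
    if (config.dblogging && config.emit) {
        sinks.logEvent("collision", stamped); // M3: gated remote upload
    }
}

// M5: prefer the ring buffer, fall back to JSONL when the buffer is empty.
function recentEvents(sinks: Sinks, limit: number): object[] {
    const source =
        sinks.ring.length > 0
            ? sinks.ring
            : sinks.jsonlLines.map((line) => JSON.parse(line) as object);
    return source.slice(-limit);
}

// Made-up run: three events with remote upload disabled, one with it enabled.
const uploaded: object[] = [];
const demoSinks: Sinks = {
    ring: [],
    ringCapacity: 2,
    jsonlLines: [],
    logEvent: (_name, payload) => { uploaded.push(payload); },
};
for (let i = 0; i < 3; i++) {
    emitCollisionEventSketch({ kind: "grammarMatch", n: i }, { dblogging: true, emit: false }, demoSinks);
}
emitCollisionEventSketch({ kind: "grammarMatch", n: 3 }, { dblogging: true, emit: true }, demoSinks);
console.log(recentEvents(demoSinks, 5).length, demoSinks.jsonlLines.length, uploaded.length);
```

Keeping the JSONL append independent of the remote gate is what makes the local file usable as an offline fallback when `dblogging` is off.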

## Update protocol

This document is the canonical record of the rollout. As experiments run:

- Flip status in the table from `planned` → `running` → `complete` /
  `aborted`.
- Add the expanded experiment card under the row when activated; fill in
  Started, Ended, Result, Notes as it runs.
- Capture surprises in **Notes** even if the experiment "succeeds" —
  unexpected agent pairs, latency spikes, telemetry gaps. These feed the
  next experiment's hypothesis.
- If a phase produces evidence that an item in _Cross-cutting_ is biting
  (e.g. clarify-loop), promote it to a numbered experiment in the next
  phase.
- The plan is mutable. Reorder, add, drop experiments based on data.