---
title: Evaluations
description: 'Architecture overview and contributor guide for Vally evaluation specs'
author: HVE Core Team
ms.date: 2026-05-14
---

This directory contains [Vally](https://www.npmjs.com/package/@microsoft/vally-cli) evaluation specs for hve-core.

## Architecture

```text
evals/
├── skill-quality/       copilot-sdk evals testing skill behavior
├── agent-behavior/      copilot-sdk evals testing agent responses
└── script-validation/   copilot-sdk evals testing deterministic scripts
```

## Executors

| Suite               | Executor      | Purpose                                                                            |
|---------------------|---------------|------------------------------------------------------------------------------------|
| `skill-quality`     | `copilot-sdk` | Tests that skills provide accurate guidance via real agent conversation            |
| `agent-behavior`    | `copilot-sdk` | Tests that agents respond correctly to domain prompts                              |
| `script-validation` | `copilot-sdk` | Tests agent reasoning about validation rules (will migrate to mock when available) |

## Running Evals

```bash
# Lint all eval specs (no execution, fast)
npm run eval:lint

# Run all evals
npx vally eval

# Run a specific suite
npx vally eval --suite skill-quality
npx vally eval --suite script-validation

# Compare results against baseline
npx vally compare
```

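As a rough sketch of how the commands above could be chained, the helper below lints first and only runs evals if linting passes. The function name `run_evals` is hypothetical and not part of hve-core or Vally; it uses only the `npm run eval:lint` and `npx vally eval [--suite <name>]` invocations shown above.

```bash
# Hypothetical helper (not part of hve-core): lint, then run evals.
# With no arguments it runs every eval; otherwise it runs each named suite in order.
run_evals() {
  # Stop early if the specs do not lint cleanly.
  npm run eval:lint || return 1

  if [ "$#" -eq 0 ]; then
    npx vally eval
    return
  fi

  local suite
  for suite in "$@"; do
    # Abort on the first failing suite so the failure is easy to spot.
    npx vally eval --suite "$suite" || return 1
  done
}
```

For example, `run_evals skill-quality script-validation` would lint once and then run those two suites in order.
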
## Adding New Evals

1. Create a directory under `evals/` with an `eval.yaml`.
2. Choose the executor:
   * `copilot-sdk` for testing skill/agent behavior (non-deterministic, use `runs: 3` or higher).
   * `mock` for testing scripts/validators with fixture files (deterministic, use `runs: 1`). Not yet available; use `copilot-sdk` until the mock executor plugin ships.
3. Write per-stimulus graders (one stimulus per test case).
4. Run `npm run eval:lint` to validate the spec.
5. Tag each stimulus with a `category` that matches a suite filter in `.vally.yaml`.

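Step 1 can be scripted. The sketch below is a hypothetical scaffold helper, not something shipped with hve-core: it creates the suite directory and a placeholder `eval.yaml` containing only comment reminders drawn from this guide. The Vally spec schema is not shown in this README, so the placeholder deliberately contains no real keys; fill it in from the Vally documentation before linting.

```bash
# Hypothetical helper (not part of hve-core): create evals/<name>/eval.yaml
# as a comment-only placeholder. Refuses to overwrite an existing spec.
scaffold_eval() {
  local name="${1:-}" root="${2:-evals}"
  if [ -z "$name" ]; then
    echo "usage: scaffold_eval <suite-name> [root-dir]" >&2
    return 2
  fi

  local dir="$root/$name"
  mkdir -p "$dir"
  if [ -e "$dir/eval.yaml" ]; then
    echo "already exists: $dir/eval.yaml" >&2
    return 1
  fi

  # Comment-only checklist; no schema keys are assumed here.
  cat > "$dir/eval.yaml" <<'EOF'
# Placeholder spec. Fill in using the Vally spec schema.
# Reminders from evals/README.md:
#   - Use the copilot-sdk executor (mock is not yet available).
#   - Use runs: 3 or higher for copilot-sdk evals.
#   - Write one stimulus per test case, each with its own grader.
#   - Tag each stimulus with a category matching a suite filter in .vally.yaml.
EOF

  # Print the path so callers can open or lint the new file.
  echo "$dir/eval.yaml"
}
```

After filling in the spec, run `npm run eval:lint` as described in step 4.
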
## Anti-Patterns

* Don't use `runs: 1` for `copilot-sdk` evals (non-deterministic output needs multiple runs).
* Don't set the timeout below `120s` for `copilot-sdk` evals.
* Don't use `output-contains` as the sole grader for qualitative agent output.
* Don't bundle multiple test cases into one stimulus with an aggregate grader.
* Don't pin models in eval specs.

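Because every suite currently uses `copilot-sdk`, a quick heuristic for the first anti-pattern is to search for specs that declare `runs: 1`. The sketch below assumes `runs` appears as a plain `runs: N` key on its own line in `eval.yaml`; the function name is hypothetical. It is a text search rather than a YAML parser, so treat hits as candidates to review.

```bash
# Hypothetical check (not part of hve-core): list eval.yaml files under a
# directory (default: evals) that contain a line of the form `runs: 1`.
find_single_run_specs() {
  local root="${1:-evals}"
  # -r: recurse, -l: print matching file names only, -E: extended regex.
  # The pattern allows indentation and a trailing comment, and does not
  # match values such as `runs: 10`.
  # `|| true` keeps a "no matches" result from being reported as a failure.
  grep -rlE --include='eval.yaml' \
    '^[[:space:]]*runs:[[:space:]]*1[[:space:]]*(#.*)?$' "$root" || true
}
```

Once the `mock` executor ships, `runs: 1` becomes legitimate for deterministic specs, so a check like this would need to take the executor into account.
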
🤖 Crafted with precision by ✨Copilot following brilliant human instruction, then carefully refined by our team of discerning human reviewers.