microsoft/hve-core

Mirrored from https://github.com/microsoft/hve-core

evals/README.md

---
title: Evaluations
description: 'Architecture overview and contributor guide for Vally evaluation specs'
author: HVE Core Team
ms.date: 2026-05-14
---

This directory contains [Vally](https://www.npmjs.com/package/@microsoft/vally-cli) evaluation specs for hve-core.

## Architecture

```text
evals/
├── skill-quality/       copilot-sdk evals testing skill behavior
├── agent-behavior/      copilot-sdk evals testing agent responses
└── script-validation/   copilot-sdk evals testing deterministic scripts
```

## Executors

| Suite               | Executor      | Purpose                                                                            |
|---------------------|---------------|------------------------------------------------------------------------------------|
| `skill-quality`     | `copilot-sdk` | Tests that skills provide accurate guidance via real agent conversation            |
| `agent-behavior`    | `copilot-sdk` | Tests that agents respond correctly to domain prompts                              |
| `script-validation` | `copilot-sdk` | Tests agent reasoning about validation rules (will migrate to mock when available) |

## Running Evals

```bash
# Lint all eval specs (no execution, fast)
npm run eval:lint

# Run all evals
npx vally eval

# Run a specific suite
npx vally eval --suite skill-quality
npx vally eval --suite script-validation

# Compare results against baseline
npx vally compare
```

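The commands above can be chained so that a lint failure stops the run before any agent conversation starts. The following is a minimal sketch and is not part of the repository: `run_step`, `eval_pipeline`, and the `EVAL_DRY_RUN` variable are hypothetical names, and only the three commands already listed are assumed to exist.

```bash
#!/usr/bin/env bash
# Sketch only: run_step, eval_pipeline and EVAL_DRY_RUN are illustrative
# names, not part of Vally or hve-core.

run_step() {
  # Echo the command, then execute it unless EVAL_DRY_RUN=1 is set.
  echo "+ $*"
  if [ "${EVAL_DRY_RUN:-0}" != "1" ]; then
    "$@"
  fi
}

eval_pipeline() {
  # $1: suite name, for example skill-quality.
  local suite="$1"
  # Each step runs only if the previous one succeeded.
  run_step npm run eval:lint &&
    run_step npx vally eval --suite "$suite" &&
    run_step npx vally compare
}
```

For example, `EVAL_DRY_RUN=1 eval_pipeline skill-quality` prints the three commands without executing them, which is a cheap way to confirm the order before spending time on a real run.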
## Adding New Evals

1. Create a directory under `evals/` with an `eval.yaml`.
2. Choose the executor:
   * `copilot-sdk` for testing skill/agent behavior (non-deterministic, use `runs: 3` or higher).
   * `mock` for testing scripts/validators with fixture files (deterministic, use `runs: 1`). Not yet available; use `copilot-sdk` until the mock executor plugin ships.
3. Write per-stimulus graders (one stimulus per test case).
4. Run `npm run eval:lint` to validate the spec.
5. Tag stimuli with `category` matching a suite filter in `.vally.yaml`.

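Step 1 can be checked before linting. The sketch below lists every immediate subdirectory of `evals/` that lacks an `eval.yaml`. `missing_specs` is a hypothetical helper, and the assumption that each suite keeps its spec directly inside its own directory is taken from step 1, not from the Vally documentation.

```bash
#!/usr/bin/env bash
# Sketch only: missing_specs is an illustrative helper, not a Vally command.

missing_specs() {
  # $1: root directory holding the suites (defaults to "evals").
  local root="${1:-evals}" dir
  for dir in "$root"/*/; do
    # Skip the unexpanded glob when the root has no subdirectories.
    [ -d "$dir" ] || continue
    # Print the directory (without trailing slash) if it has no eval.yaml.
    [ -f "${dir}eval.yaml" ] || echo "${dir%/}"
  done
}
```

Running `missing_specs evals` from the repository root prints nothing when every suite directory has a spec.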
## Anti-Patterns

* Don't use `runs: 1` for copilot-sdk evals (non-deterministic output needs multiple runs).
* Don't set timeout below `120s` for copilot-sdk evals.
* Don't use `output-contains` as the sole grader for qualitative agent output.
* Don't bundle multiple test cases into one stimulus with an aggregate grader.
* Don't pin models in eval specs.

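The first anti-pattern can be spotted with a quick search. This is a rough sketch, assuming `runs` appears as a key on its own line in `eval.yaml`, which this README implies but does not show. `find_single_run_specs` is a hypothetical name. Treat hits as candidates for review, not as errors, since `runs: 1` is the intended setting for the future `mock` executor.

```bash
#!/usr/bin/env bash
# Sketch only: find_single_run_specs is an illustrative helper, and the
# "runs: 1 on its own line" layout is an assumption about the spec format.

find_single_run_specs() {
  # $1: root directory to search (defaults to "evals").
  local root="${1:-evals}"
  # -r: recurse, -l: print matching file names only, -E: extended regex.
  # The pattern matches "runs: 1" exactly, so "runs: 10" is not reported.
  grep -rlE --include='eval.yaml' \
    '^[[:space:]]*runs:[[:space:]]*1[[:space:]]*$' "$root" | sort
}
```
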
🤖 Crafted with precision by ✨Copilot following brilliant human instruction, then carefully refined by our team of discerning human reviewers.