---
title: Evaluations
description: 'Architecture overview and contributor guide for Vally evaluation specs'
author: HVE Core Team
ms.date: 2026-05-14
---

This directory contains [Vally](https://www.npmjs.com/package/@microsoft/vally-cli) evaluation specs for hve-core.

## Architecture

```text
evals/
├── skill-quality/       copilot-sdk evals testing skill behavior
├── agent-behavior/      copilot-sdk evals testing agent responses
└── script-validation/   copilot-sdk evals testing deterministic scripts
```

## Executors

| Suite               | Executor      | Purpose                                                                            |
|---------------------|---------------|------------------------------------------------------------------------------------|
| `skill-quality`     | `copilot-sdk` | Tests that skills provide accurate guidance via real agent conversation            |
| `agent-behavior`    | `copilot-sdk` | Tests that agents respond correctly to domain prompts                              |
| `script-validation` | `copilot-sdk` | Tests agent reasoning about validation rules (will migrate to mock when available) |

## Running Evals

```bash
# Lint all eval specs (no execution, fast)
npm run eval:lint

# Run all evals
npx vally eval

# Run a specific suite
npx vally eval --suite skill-quality
npx vally eval --suite script-validation

# Compare results against baseline
npx vally compare
```
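
The commands above can be chained into a single local check. The following is a minimal sketch and is not part of hve-core: the `run` and `run_all_evals` helper names and the `DRY_RUN` switch are hypothetical, and it assumes `agent-behavior` is accepted by `--suite` in the same way as the two suites shown above.

```bash
# Sketch only: lint first, then run each suite, then compare against the baseline.
# Set DRY_RUN=1 to print the commands instead of executing them (hypothetical switch).

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

run_all_evals() {
  # Linting does not execute anything, so malformed specs fail before any agent run.
  run npm run eval:lint || return 1

  # Suite names come from the Executors table; adjust them to match .vally.yaml.
  for suite in skill-quality agent-behavior script-validation; do
    run npx vally eval --suite "$suite" || return 1
  done

  run npx vally compare
}
```

After sourcing the snippet, `DRY_RUN=1 run_all_evals` prints the five commands in order without starting any agent conversation. Stopping at the first failure keeps a lint error from being buried under later suite output.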

## Adding New Evals

1. Create a directory under `evals/` with an `eval.yaml`.
2. Choose the executor:
   * `copilot-sdk` for testing skill/agent behavior (non-deterministic, use `runs: 3`+).
   * `mock` for testing scripts/validators with fixture files (deterministic, use `runs: 1`). Not yet available; use `copilot-sdk` until the mock executor plugin ships.
3. Write per-stimulus graders (one stimulus per test case).
4. Run `npm run eval:lint` to validate the spec.
5. Tag stimuli with `category` matching a suite filter in `.vally.yaml`.
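
Step 1 can be scripted. The sketch below is a hypothetical helper (`new_eval` is not an hve-core or Vally command) that creates the directory and a stub `eval.yaml`. The stub holds only reminders taken from this README as comments, because the Vally spec schema is not described here; the real keys and nesting have to come from the Vally documentation before `npm run eval:lint` can pass.

```bash
# Sketch only: scaffold evals/<name>/eval.yaml with a reminder checklist.
# Usage: new_eval <name> [root-dir]   (root-dir defaults to "evals")
new_eval() {
  name="$1"
  root="${2:-evals}"
  dir="$root/$name"

  # Never clobber an existing spec.
  if [ -e "$dir/eval.yaml" ]; then
    echo "refusing to overwrite $dir/eval.yaml" >&2
    return 1
  fi

  mkdir -p "$dir"
  cat > "$dir/eval.yaml" <<'EOF'
# Stub for a new Vally eval spec. This file holds reminders only; the actual
# keys and nesting must follow the Vally spec format.
#
# Checklist from evals/README.md:
#   - executor: copilot-sdk (mock is not yet available)
#   - runs: 3 or more for copilot-sdk
#   - one stimulus per test case, each with its own graders
#   - tag each stimulus with a category matching a suite filter in .vally.yaml
EOF

  # Report the path that was created.
  echo "$dir/eval.yaml"
}
```

For example, `new_eval my-suite` would create `evals/my-suite/eval.yaml`; `my-suite` is a placeholder name.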

## Anti-Patterns

* Don't use `runs: 1` for copilot-sdk evals (non-deterministic output needs multiple runs).
* Don't set the timeout below `120s` for copilot-sdk evals.
* Don't use `output-contains` as the sole grader for qualitative agent output.
* Don't bundle multiple test cases into one stimulus with an aggregate grader.
* Don't pin models in eval specs.
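
The first anti-pattern can be roughly checked with a text search. This is a heuristic sketch, not a Vally feature: `find_single_run_specs` is a hypothetical helper, and it assumes each spec mentions its executor by the literal string `copilot-sdk` and puts `runs:` at the start of its own line. It matches text rather than parsing YAML, so treat any hit as a prompt to look at the file.

```bash
# Sketch only: list eval.yaml files that mention copilot-sdk and also set runs: 1.
# Usage: find_single_run_specs [root-dir]   (root-dir defaults to "evals")
find_single_run_specs() {
  dir="${1:-evals}"
  find "$dir" -type f -name 'eval.yaml' | sort | while IFS= read -r spec; do
    # The pattern accepts "runs: 1" with an optional trailing comment,
    # but rejects values such as "runs: 10".
    if grep -q 'copilot-sdk' "$spec" &&
       grep -Eq '^[[:space:]]*runs:[[:space:]]*1[[:space:]]*(#.*)?$' "$spec"; then
      echo "$spec"
    fi
  done
}
```

Running `find_single_run_specs` from the repository root prints one path per suspicious spec and nothing when none are found.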

🤖 Crafted with precision by ✨Copilot following brilliant human instruction, then carefully refined by our team of discerning human reviewers.
