microsoft/hve-core
Public, mirrored from https://github.com/microsoft/hve-core
evals/script-validation/eval.yaml
name: script-validation
description: >
  Validate that agents correctly identify valid and invalid skill
  structures when presented with file listings. Uses copilot-sdk
  to test real agent reasoning about validation rules. Will migrate
  to the mock executor when the deterministic plugin ships.
type: capability
defaults:
  runs: 3
  timeout: 120s
  executor: copilot-sdk

# Skill paths are resolved relative to this spec's directory (evals/script-validation/),
# so they ascend to the repo root before descending into .github/skills.
environment:
  skills:
    - ../../.github/skills/coding-standards/python-foundational

scoring:
  threshold: 0.7

stimuli:
  - name: identify-valid-skill-structure
    prompt: |
      I have a skill directory with the following structure:
      ```
      my-skill/
      ├── SKILL.md (contains frontmatter with name, description, license)
      ├── references/
      │   └── api-guide.md
      └── scripts/
          └── validate.py
      ```
      Does this follow valid skill structure conventions? What required
      elements are present?
    tags:
      category: script-validation
      script: skill-validation
    graders:
      - type: output-matches
        name: identifies-valid-structure
        config:
          pattern: "(?i)valid|correct|proper|follows.*convention"
      - type: output-matches
        name: identifies-skill-md
        config:
          pattern: "(?i)SKILL\\.md|skill.entry|entrypoint"

  - name: identify-missing-skill-md
    prompt: |
      I have a skill directory with this structure:
      ```
      broken-skill/
      ├── README.md
      ├── references/
      │   └── guide.md
      └── scripts/
          └── run.py
      ```
      Is this a valid skill structure? What's missing?
    tags:
      category: script-validation
      script: skill-validation
    graders:
      - type: output-matches
        name: identifies-missing-skillmd
        config:
          pattern: "(?i)missing.*SKILL\\.md|no.*SKILL\\.md|SKILL\\.md.*required|SKILL\\.md.*missing"
      - type: output-matches
        name: identifies-invalid
        config:
          pattern: "(?i)invalid|not.*valid|missing.*required|incomplete"

  - name: identify-frontmatter-issues
    prompt: |
      This SKILL.md file has the following frontmatter:
      ```yaml
      ---
      name: my-tool
      ---
      ```
      Is this frontmatter complete for a skill file? What fields
      are missing according to skill conventions?
    tags:
      category: script-validation
      script: frontmatter-validation
    graders:
      - type: output-matches
        name: identifies-missing-description
        config:
          pattern: "(?i)description.*missing|missing.*description|need.*description|no.*description|description.*required|lacks.*description|without.*description|description.*absent|description.*not"
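
The `output-matches` patterns above can be checked locally before running the eval. The sketch below assumes the grader performs an unanchored regex search over the agent's response. The spec does not confirm this, and the `grade` helper and the sample responses are hypothetical. The patterns are copied from the spec, with the YAML `\\.` escape written as `\.`.

```python
import re

# Patterns copied from the spec; a subset is enough to show the behavior.
PATTERNS = {
    "identifies-valid-structure": r"(?i)valid|correct|proper|follows.*convention",
    "identifies-skill-md": r"(?i)SKILL\.md|skill.entry|entrypoint",
    "identifies-missing-skillmd": (
        r"(?i)missing.*SKILL\.md|no.*SKILL\.md"
        r"|SKILL\.md.*required|SKILL\.md.*missing"
    ),
    "identifies-invalid": r"(?i)invalid|not.*valid|missing.*required|incomplete",
}


def grade(name: str, output: str) -> bool:
    """Hypothetical stand-in for an output-matches grader.

    Assumption: the real grader does an unanchored search, so a match
    anywhere in the response counts as a pass.
    """
    return re.search(PATTERNS[name], output) is not None


# Hypothetical agent responses, written for illustration only.
good = "Yes, this is a valid structure: SKILL.md is present with frontmatter."
bad = "This structure is invalid because SKILL.md is missing."

print(grade("identifies-valid-structure", good))  # True
print(grade("identifies-missing-skillmd", bad))   # True
print(grade("identifies-invalid", bad))           # True

# Caveat: "valid" is a substring of "invalid", so the positive grader
# also passes on a response that rejects the structure.
print(grade("identifies-valid-structure", bad))   # True
```

Under that assumption, `identifies-valid-structure` cannot tell an agent that approves the structure from one that rejects it. Also, `.` does not match a newline by default, so alternatives such as `missing.*SKILL\.md` only match when both terms appear on the same line of the response.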