microsoft/hve-core
Public, mirrored from https://github.com/microsoft/hve-core
evals/script-validation/eval.yaml
name: script-validation
description: >
  Validate that agents correctly identify valid and invalid skill
  structures when presented with file listings. Uses copilot-sdk
  to test real agent reasoning about validation rules. Will migrate
  to the mock executor when the deterministic plugin ships.
type: capability
config:
  runs: 3
  timeout: 120s
  executor: copilot-sdk

environment: coding-standards

scoring:
  threshold: 0.7

stimuli:
  - name: identify-valid-skill-structure
    prompt: |
      I have a skill directory with the following structure:
      ```
      my-skill/
      ├── SKILL.md (contains frontmatter with name, description, license)
      ├── references/
      │   └── api-guide.md
      └── scripts/
          └── validate.py
      ```
      Does this follow valid skill structure conventions? What required
      elements are present?
    tags:
      category: script-validation
      script: skill-validation
    graders:
      - type: output-matches
        name: identifies-valid-structure
        config:
          pattern: "(?i)valid|correct|proper|follows.*convention"
      - type: output-matches
        name: identifies-skill-md
        config:
          pattern: "(?i)SKILL\\.md|skill.entry|entrypoint"

  - name: identify-missing-skill-md
    prompt: |
      I have a skill directory with this structure:
      ```
      broken-skill/
      ├── README.md
      ├── references/
      │   └── guide.md
      └── scripts/
          └── run.py
      ```
      Is this a valid skill structure? What's missing?
    tags:
      category: script-validation
      script: skill-validation
    graders:
      - type: output-matches
        name: identifies-missing-skillmd
        config:
          pattern: "(?i)missing.*SKILL\\.md|no.*SKILL\\.md|SKILL\\.md.*required|SKILL\\.md.*missing"
      - type: output-matches
        name: identifies-invalid
        config:
          pattern: "(?i)invalid|not.*valid|missing.*required|incomplete"

  - name: identify-frontmatter-issues
    prompt: |
      This SKILL.md file has the following frontmatter:
      ```yaml
      ---
      name: my-tool
      ---
      ```
      Is this frontmatter complete for a skill file? What fields
      are missing according to skill conventions?
    tags:
      category: script-validation
      script: frontmatter-validation
    graders:
      - type: output-matches
        name: identifies-missing-description
        config:
          pattern: "(?i)description.*missing|missing.*description|need.*description|no.*description|description.*required|lacks.*description|without.*description|description.*absent|description.*not"
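
The `output-matches` graders in this file can be tried locally with a few lines of Python. This is only a sketch: it assumes the grader performs an unanchored, case-insensitive regex search over the agent's output, which this file does not confirm, and `output_matches`, the sample responses, and `STRICTER_VALID` are hypothetical names introduced for illustration. The patterns are copied from the graders after YAML double-quote unescaping (`\\.` becomes `\.`).

```python
import re

# Patterns copied from the eval file, after YAML unescaping.
GRADER_PATTERNS = {
    "identifies-valid-structure": r"(?i)valid|correct|proper|follows.*convention",
    "identifies-skill-md": r"(?i)SKILL\.md|skill.entry|entrypoint",
    "identifies-missing-skillmd": (
        r"(?i)missing.*SKILL\.md|no.*SKILL\.md|"
        r"SKILL\.md.*required|SKILL\.md.*missing"
    ),
    "identifies-invalid": r"(?i)invalid|not.*valid|missing.*required|incomplete",
}

# Hypothetical tightened variant: word boundaries stop "valid" from
# matching inside "invalid". Not part of the original file.
STRICTER_VALID = r"(?i)\b(?:valid|correct|proper)\b|follows.*convention"


def output_matches(pattern: str, output: str) -> bool:
    """Assumed grader behaviour: pass if the pattern is found anywhere."""
    return re.search(pattern, output) is not None


# Made-up agent responses, used only to exercise the patterns.
good = "Yes, this is a valid skill: SKILL.md is present with the required frontmatter."
bad = "This structure is invalid because SKILL.md is missing."

for name, pattern in GRADER_PATTERNS.items():
    print(f"{name}: good={output_matches(pattern, good)} bad={output_matches(pattern, bad)}")

print("stricter valid check on bad:", output_matches(STRICTER_VALID, bad))
```

Under that search assumption, `identifies-valid-structure` also passes on the response that calls the structure invalid, because `valid` is a substring of `invalid`. The word-boundary variant is one possible way to tighten it, though it would still accept phrasing such as "not valid".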