microsoft/hve-core
Public, mirrored from https://github.com/microsoft/hve-core
evals/agent-behavior/expectations/code-review-functional.expectations.yml
# Bucket-A expectations for code-review-functional
# Format: per-agent YAML, 5–10 grader-worthy expectations grounded in the agent
# file's explicit promises and/or current matrix failures. This file is consumed
# by the next pass that rewrites stimuli + graders end-to-end; do not treat it
# as a Vally grader file directly.
#
# Note: code-review-functional is the functional-correctness sibling of
# code-review-standards. It reviews behavior, edge cases, error handling,
# concurrency, and security risk, NOT language style. Findings should be
# scoped to the diff and persisted under
# `.copilot-tracking/reviews/code-reviews/<branch>/<run>/functional-review.md`.
slug: code-review-functional
class: code-reviewer
agent_file: .github/agents/coding-standards/code-review-functional.agent.md
stimulus_file: evals/agent-behavior/stimuli/code-review-functional.yml
latest_result: evals/results/agent-matrix/2026-05-28/code-review-functional.json
source_review_date: 2026-05-28

expectations:
  - expectation_id: functional-scope-only
    summary: Findings address behavior/correctness, not language style.
    signal: Output focuses on behavior, edge cases, error handling, concurrency, security, or contracts.
    pass_criteria: |
      Findings name functional concerns (incorrect behavior, missing edge
      cases, error handling, race conditions, security risk, contract
      violations, performance correctness). Pure style findings (naming,
      formatting, idiom preference) are absent or deferred to
      `code-review-standards`.
    failure_modes:
      - Findings list formatting/naming/style issues as primary findings.
      - Mixes language-standards findings into functional review.
    priority: high
    contract_ref: "agent §Scope (functional correctness only; style is owned by code-review-standards)"

  - expectation_id: severity-per-finding
    summary: Each functional finding carries a severity label.
    signal: Output applies severity words per finding.
    pass_criteria: |
      Each functional finding has a case-insensitive severity from
      `critical|high|medium|low|info|warning`. Severity is per-finding.
    failure_modes:
      - Findings unlabeled.
      - Severities used only in a summary block.
    priority: high
    contract_ref: "agent §Output Contract (severity per finding); current `severity-vocab` grader"

  - expectation_id: findings-structure-present
    summary: Output presents findings in a structured form.
    signal: Output contains a severity-labeled table or per-finding sections.
    pass_criteria: |
      Output uses a markdown table with severity column OR per-finding
      sections using `finding|issue|concern|recommendation` language with
      each finding tied to a file path and line range when possible.
    failure_modes:
      - Single paragraph with no per-finding structure.
      - Bulleted list with no severity framing.
    priority: high
    contract_ref: "agent §Output Contract; current `findings-table-present` grader"

  - expectation_id: diff-scoped-findings
    summary: Findings are scoped to the reviewed diff.
    signal: Findings reference changed files or hunks from the diff.
    pass_criteria: |
      Findings cite changed files, line ranges, or hunks from the supplied
      diff. Findings that step outside the diff are explicitly marked as
      out-of-scope context or pre-existing risk.
    failure_modes:
      - Findings invented for files not in the diff.
      - Bulk findings about unrelated subsystems.
    priority: medium
    contract_ref: "agent §Scope (diff-scoped functional review)"

  - expectation_id: tracking-path-shape
    summary: Functional review artifact lives at the documented path.
    signal: Output names a path matching `.copilot-tracking/reviews/code-reviews/<branch>/<run>/functional-review.md`.
    pass_criteria: |
      When the agent reports persisting a functional review, the path
      starts with `.copilot-tracking/reviews/code-reviews/`, includes a
      normalized branch segment, includes a run identifier, and ends in
      `functional-review.md`.
    failure_modes:
      - Artifact written outside `.copilot-tracking/reviews/code-reviews/`.
      - Filename other than `functional-review.md`.
    priority: medium
    applies_when: "agent reports artifact creation"
    contract_ref: "agent §Tracking Artifact (functional-review.md)"

  - expectation_id: verdict-stated
    summary: Functional review ends with a verdict from the documented vocabulary.
    signal: Output names an overall verdict.
    pass_criteria: |
      Response concludes with an overall functional verdict drawn from
      `approve|approve with changes|request changes|block`. Verdict reflects
      the highest-severity finding.
    failure_modes:
      - No final verdict.
      - Verdict expressed only in informal prose.
    priority: medium
    contract_ref: "agent §Output Contract (functional verdict)"

  - expectation_id: no-source-edit
    summary: Review-only; no edits to source code or build manifests.
    signal: Output does not reference modifications to source-tree files.
    pass_criteria: |
      No occurrences of edit/create verbs paired with `.cs`/`.py`/`.ts`/`.js`/
      `.go`/`.rs`/`.java`/`package.json`/`pyproject.toml`/`Cargo.toml` paths.
      Proposed fixes appear as recommendations or fenced snippets, not as
      claimed edits.
    failure_modes:
      - Agent claims to apply a fix during functional review.
      - Edits build manifests while reviewing.
    priority: high
    contract_ref: "agent scope (review-only); current `no-source-edit` grader"
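
Three of the expectations above (`severity-per-finding`, `tracking-path-shape`, `verdict-stated`) name concrete vocabularies and a path shape, so they can be approximated with regular expressions. The sketch below is illustrative only and is not the repository's grader code: the function names are hypothetical, `<branch>` and `<run>` are each assumed to be a single path segment, and the verdict is assumed to sit on the last non-empty line of the response. Keyword matching like this is a heuristic; a word such as "high" or "block" in ordinary prose would also match.

```python
import re
from typing import List, Optional

# Severity vocabulary copied from `severity-per-finding`.
SEVERITY_RE = re.compile(
    r"\b(critical|high|medium|low|info|warning)\b", re.IGNORECASE
)

# Verdict vocabulary copied from `verdict-stated`. Longer phrases come first
# so "approve with changes" is not reported as plain "approve".
VERDICT_RE = re.compile(
    r"\b(approve with changes|request changes|approve|block)\b", re.IGNORECASE
)

# Path shape from `tracking-path-shape`. Assumption: <branch> and <run> are
# each exactly one non-empty path segment.
TRACKING_PATH_RE = re.compile(
    r"^\.copilot-tracking/reviews/code-reviews/[^/\s]+/[^/\s]+/functional-review\.md$"
)


def every_finding_has_severity(findings: List[str]) -> bool:
    """True when the list is non-empty and every finding contains a severity word."""
    return bool(findings) and all(SEVERITY_RE.search(f) for f in findings)


def tracking_path_ok(path: str) -> bool:
    """True when the reported artifact path matches the documented shape."""
    return TRACKING_PATH_RE.match(path) is not None


def final_verdict(response: str) -> Optional[str]:
    """Return the lowercased verdict on the last non-empty line, or None."""
    lines = [ln for ln in response.splitlines() if ln.strip()]
    if not lines:
        return None
    match = VERDICT_RE.search(lines[-1])
    return match.group(1).lower() if match else None
```

A grader built this way would skip `tracking_path_ok` unless the agent reports creating an artifact, mirroring the `applies_when` condition on that expectation.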