microsoft/hve-core

mirrored from https://github.com/microsoft/hve-core

evals/agent-behavior/expectations/code-review-functional.expectations.yml

# Bucket-A expectations for code-review-functional
# Format: per-agent YAML, 5–10 grader-worthy expectations grounded in the agent
# file's explicit promises and/or current matrix failures. This file is consumed
# by the next pass that rewrites stimuli + graders end-to-end; do not treat it
# as a Vally grader file directly.
#
# Note: code-review-functional is the functional-correctness sibling of
# code-review-standards. It reviews behavior, edge cases, error handling,
# concurrency, and security risk (NOT language style). Findings should be
# scoped to the diff and persisted under
# `.copilot-tracking/reviews/code-reviews/<branch>/<run>/functional-review.md`.
slug: code-review-functional
class: code-reviewer
agent_file: .github/agents/coding-standards/code-review-functional.agent.md
stimulus_file: evals/agent-behavior/stimuli/code-review-functional.yml
latest_result: evals/results/agent-matrix/2026-05-28/code-review-functional.json
source_review_date: 2026-05-28

expectations:
  - expectation_id: functional-scope-only
    summary: Findings address behavior/correctness, not language style.
    signal: Output focuses on behavior, edge cases, error handling, concurrency, security, or contracts.
    pass_criteria: |
      Findings name functional concerns (incorrect behavior, missing edge
      cases, error handling, race conditions, security risk, contract
      violations, performance correctness). Pure style findings (naming,
      formatting, idiom preference) are absent or deferred to
      `code-review-standards`.
    failure_modes:
      - Findings list formatting/naming/style issues as primary findings.
      - Mixes language-standards findings into functional review.
    priority: high
    contract_ref: "agent §Scope (functional correctness only; style is owned by code-review-standards)"

  - expectation_id: severity-per-finding
    summary: Each functional finding carries a severity label.
    signal: Output applies severity words per finding.
    pass_criteria: |
      Each functional finding has a case-insensitive severity from
      `critical|high|medium|low|info|warning`. Severity is per-finding.
    failure_modes:
      - Findings unlabeled.
      - Severities used only in a summary block.
    priority: high
    contract_ref: "agent §Output Contract (severity per finding); current `severity-vocab` grader"

  - expectation_id: findings-structure-present
    summary: Output presents findings in a structured form.
    signal: Output contains a severity-labeled table or per-finding sections.
    pass_criteria: |
      Output uses a markdown table with severity column OR per-finding
      sections using `finding|issue|concern|recommendation` language with
      each finding tied to a file path and line range when possible.
    failure_modes:
      - Single paragraph with no per-finding structure.
      - Bulleted list with no severity framing.
    priority: high
    contract_ref: "agent §Output Contract; current `findings-table-present` grader"

  - expectation_id: diff-scoped-findings
    summary: Findings are scoped to the reviewed diff.
    signal: Findings reference changed files or hunks from the diff.
    pass_criteria: |
      Findings cite changed files, line ranges, or hunks from the supplied
      diff. Findings that step outside the diff are explicitly marked as
      out-of-scope context or pre-existing risk.
    failure_modes:
      - Findings invented for files not in the diff.
      - Bulk findings about unrelated subsystems.
    priority: medium
    contract_ref: "agent §Scope (diff-scoped functional review)"

  - expectation_id: tracking-path-shape
    summary: Functional review artifact lives at the documented path.
    signal: Output names a path matching `.copilot-tracking/reviews/code-reviews/<branch>/<run>/functional-review.md`.
    pass_criteria: |
      When the agent reports persisting a functional review, the path
      starts with `.copilot-tracking/reviews/code-reviews/`, includes a
      normalized branch segment, includes a run identifier, and ends in
      `functional-review.md`.
    failure_modes:
      - Artifact written outside `.copilot-tracking/reviews/code-reviews/`.
      - Filename other than `functional-review.md`.
    priority: medium
    applies_when: "agent reports artifact creation"
    contract_ref: "agent §Tracking Artifact (functional-review.md)"

  - expectation_id: verdict-stated
    summary: Functional review ends with a verdict from the documented vocabulary.
    signal: Output names an overall verdict.
    pass_criteria: |
      Response concludes with an overall functional verdict drawn from
      `approve|approve with changes|request changes|block`. Verdict reflects
      the highest-severity finding.
    failure_modes:
      - No final verdict.
      - Verdict expressed only in informal prose.
    priority: medium
    contract_ref: "agent §Output Contract (functional verdict)"

  - expectation_id: no-source-edit
    summary: Review-only (no edits to source code or build manifests).
    signal: Output does not reference modifications to source-tree files.
    pass_criteria: |
      No occurrences of edit/create verbs paired with `.cs`/`.py`/`.ts`/`.js`/
      `.go`/`.rs`/`.java`/`package.json`/`pyproject.toml`/`Cargo.toml` paths.
      Proposed fixes appear as recommendations or fenced snippets, not as
      claimed edits.
    failure_modes:
      - Agent claims to apply a fix during functional review.
      - Edits build manifests while reviewing.
    priority: high
    contract_ref: "agent scope (review-only); current `no-source-edit` grader"
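
Three of the expectations above (`severity-per-finding`, `tracking-path-shape`, `verdict-stated`) describe patterns that could plausibly be checked with regular expressions. The Python sketch below is only an illustration under that assumption. The function names and regexes are hypothetical and are not taken from the existing graders, whose implementation this file does not show. In particular, the path check assumes the branch and run segments are each a single path component.

```python
import re

# Severity vocabulary copied from the severity-per-finding pass criteria.
SEVERITY_RE = re.compile(
    r"\b(critical|high|medium|low|info|warning)\b", re.IGNORECASE
)

# Verdict vocabulary from verdict-stated. Longer phrases come first so that
# "approve with changes" is not matched as plain "approve".
VERDICT_RE = re.compile(
    r"\b(approve with changes|request changes|approve|block)\b", re.IGNORECASE
)

# Path shape from tracking-path-shape. Assumption: <branch> and <run> are
# each one path segment with no slashes or whitespace.
TRACKING_PATH_RE = re.compile(
    r"\.copilot-tracking/reviews/code-reviews/[^/\s]+/[^/\s]+/functional-review\.md"
)


def has_severity(finding: str) -> bool:
    """Return True if the finding text contains a severity word."""
    return SEVERITY_RE.search(finding) is not None


def find_verdict(output: str):
    """Return the last verdict phrase in the output, lowercased, or None."""
    matches = VERDICT_RE.findall(output)
    return matches[-1].lower() if matches else None


def tracking_path_ok(path: str) -> bool:
    """Return True if the whole string matches the documented path shape."""
    return TRACKING_PATH_RE.fullmatch(path) is not None
```

A word-level match like this is deliberately loose: `block` would also match inside a phrase such as "code block", so a real grader would likely need to restrict the search to the closing section of the response.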