-
Notifications
You must be signed in to change notification settings - Fork 582
evals for api review command #2606
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,258 @@ | ||
| # API Review Command Eval Test Suite | ||
|
|
||
| Design document for a Go/Ginkgo-based evaluation framework to test the `/api-review` Claude command against known API review scenarios. | ||
|
|
||
| ## Overview | ||
|
|
||
| This test suite validates that the `/api-review` Claude command correctly identifies API documentation issues. Each test case consists of a patch file and expected issues. The suite applies patches to a clean clone of the repository, runs the API review command, and verifies the output matches expectations exactly (no missing issues, no hallucinated issues). | ||
|
|
||
| ## Directory Structure | ||
|
|
||
| ``` | ||
| tests/eval/ | ||
| βββ eval_test.go # Main Ginkgo test suite | ||
| βββ DESIGN.md # This file | ||
| βββ testdata/ | ||
| β βββ golden/ # Single-issue tests | ||
| β β βββ missing-optional-doc/ | ||
| β β β βββ patch.diff | ||
| β β β βββ expected.txt | ||
| β β βββ undocumented-enum/ | ||
| β β β βββ patch.diff | ||
| β β β βββ expected.txt | ||
| β β βββ valid-api-change/ | ||
| β β βββ patch.diff | ||
| β β βββ expected.txt # Empty file = no issues expected | ||
| β βββ integration/ # Multi-issue tests | ||
| β βββ new-field-multiple-issues/ | ||
| β βββ patch.diff | ||
| β βββ expected.txt | ||
| ``` | ||
|
|
||
| ## Test Case Format | ||
|
|
||
| ### patch.diff | ||
|
|
||
| Standard git diff format: | ||
|
|
||
| ```diff | ||
| diff --git a/config/v1/types.go b/config/v1/types.go | ||
| --- a/config/v1/types.go | ||
| +++ b/config/v1/types.go | ||
| @@ -10,0 +11,5 @@ | ||
| +// MyField does something | ||
| +// +optional | ||
| +// +kubebuilder:validation:Enum=Foo;Bar | ||
| +MyField string `json:"myField"` | ||
| ``` | ||
|
|
||
| ### expected.txt | ||
|
|
||
| One expected issue per line: | ||
|
|
||
| ``` | ||
| enum values Foo and Bar not documented in comment | ||
| optional field does not explain behavior when omitted | ||
| ``` | ||
|
|
||
| Empty file means the API change should pass review with no issues. | ||
|
|
||
| **Note**: Order of issues in `expected.txt` does not matter. Comparison uses semantic matching, not exact string matching. | ||
|
|
||
| ## Test Flow | ||
|
|
||
| ``` | ||
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | ||
| β 1. Pre-flight: β | ||
| β a. Verify local AGENTS.md and β | ||
| β .claude/commands/api-review.md exist β | ||
| β b. These will be copied to temp dir after clone β | ||
| β (ensures local changes are tested) β | ||
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ | ||
| β 2. Setup (once): β | ||
| β a. Shallow clone openshift/api to temp dir β | ||
| β b. Copy local AGENTS.md and .claude/ to temp dir β | ||
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ | ||
| β 3. For each test case (sequential): β | ||
| β a. Reset repo to clean state β | ||
| β b. Apply patch.diff β | ||
| β c. Run claude api-review on changed files β | ||
| β d. Run claude to compare output vs expected.txt β | ||
| β e. Parse true/false response, assert β | ||
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ | ||
| β 4. Teardown: Remove temp dir β | ||
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | ||
| ``` | ||
|
|
||
| ## Reset Between Tests | ||
|
|
||
| ```bash | ||
| git reset --hard origin/master && git clean -fd | ||
| ``` | ||
|
|
||
| - `git reset --hard origin/master`: Resets all tracked files to match the remote master branch, discarding any local commits and staged/unstaged changes | ||
| - `git clean -f`: Force remove untracked files (files not in git) | ||
| - `git clean -d`: Also remove untracked directories | ||
|
|
||
| After reset, re-copy `AGENTS.md` and `.claude/` from local source to preserve local modifications being tested. | ||
|
|
||
| **Remote origin handling**: The shallow clone creates `origin` automatically. Before reset, verify remote exists: | ||
|
|
||
| ```bash | ||
| git remote get-url origin || git remote add origin https://github.com/openshift/api.git | ||
| ``` | ||
|
|
||
| ## Claude Invocations | ||
|
|
||
| ### Step 1 - Run API Review | ||
|
|
||
| ```bash | ||
| claude --print -p "/api-review" --allowedTools "Bash,Read,Grep,Glob,Task" <files> | ||
| ``` | ||
|
|
||
| ### Step 2 - Compare Results | ||
|
|
||
| ```bash | ||
| claude --print -p "Given this API review output: | ||
| <output> | ||
|
|
||
| Expected issues (one per line): | ||
| <contents of expected.txt> | ||
|
|
||
| Compare the review output against the expected issues list. | ||
| The result is 'true' ONLY if: | ||
| 1. ALL expected issues are identified in the output | ||
| 2. NO additional issues are reported beyond what is expected | ||
|
|
||
| If any expected issue is missing, reply 'false'. | ||
| If any issue is reported that is NOT in the expected list, reply 'false'. | ||
|
|
||
| Reply with exactly 'true' or 'false' (no other text)." | ||
| ``` | ||
|
|
||
| Parse response, trim whitespace, check for `true` or `false`. | ||
|
|
||
| ## Pre-flight Check | ||
|
|
||
| Before cloning, verify these local files exist in the source repo: | ||
|
|
||
| - `AGENTS.md` | ||
| - `.claude/commands/api-review.md` | ||
|
|
||
| These are copied into the temp clone so that any local modifications to the review command are tested, not the remote versions. | ||
|
|
||
| ## Configuration | ||
|
|
||
| | Setting | Value | | ||
| |---------|-------| | ||
| | Timeout per Claude call | 5 minutes | | ||
| | Execution mode | Sequential | | ||
| | Clone depth | 1 (shallow) | | ||
| | Clone source | `https://github.com/openshift/api.git` | | ||
| | Reset between tests | Verify origin remote exists, `git reset --hard origin/master && git clean -fd`, re-copy local AGENTS.md and .claude/ | | ||
|
|
||
| --- | ||
|
|
||
| ## Phase 2 | ||
|
|
||
| ### Cost Tracking β IMPLEMENTED | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why do these have |
||
|
|
||
|
|
||
| Use `--output-format json` to capture `total_cost_usd` from each Claude invocation. Accumulate across all calls (review + judge) and print the total in `AfterSuite`. | ||
|
|
||
| ### Test Structure Reorganization β IMPLEMENTED | ||
|
|
||
| Reorganize `testdata/` into two categories: | ||
|
|
||
| ``` | ||
| tests/eval/testdata/ | ||
| βββ golden/ # Base truth tests - single isolated issues | ||
| β βββ missing-optional-doc/ | ||
| β β βββ patch.diff # Triggers ONLY missing-optional-doc | ||
| β β βββ expected.txt | ||
| β βββ undocumented-enum/ | ||
| β β βββ patch.diff # Triggers ONLY undocumented-enum | ||
| β β βββ expected.txt | ||
| β βββ missing-featuregate/ | ||
| β β βββ patch.diff # Triggers ONLY missing-featuregate | ||
| β β βββ expected.txt | ||
| β βββ valid-api-change/ | ||
| β βββ patch.diff # Triggers NO issues | ||
| β βββ expected.txt | ||
| βββ integration/ # Complex scenarios - multiple issues | ||
| βββ new-field-all-issues/ | ||
| β βββ patch.diff # Triggers multiple issues together | ||
| β βββ expected.txt | ||
| βββ partial-documentation/ | ||
| βββ patch.diff | ||
| βββ expected.txt | ||
| ``` | ||
|
|
||
| **Golden tests**: Each patch is carefully crafted to trigger exactly one issue type. These validate that the review command correctly identifies individual issue categories in isolation. | ||
|
|
||
| **Integration tests**: Patches that trigger multiple issues, testing the review command's ability to identify combinations of problems in realistic scenarios. | ||
|
|
||
| ### Model Selection β IMPLEMENTED | ||
|
|
||
| Each test tier has a default model, overridable via environment variable: | ||
|
|
||
| | Test Type | Default Model | Override Env Var | | ||
| |-----------|---------------|------------------| | ||
| | Golden tests | Sonnet | `EVAL_GOLDEN_MODEL` | | ||
| | Integration tests | Opus | `EVAL_INTEGRATION_MODEL` | | ||
| | Judge LLM | Haiku | `EVAL_JUDGE_MODEL` | | ||
|
|
||
| The test suite reads these at startup and applies per-tier: | ||
|
|
||
| ```go | ||
| goldenModel := getEnvOrDefault("EVAL_GOLDEN_MODEL", "claude-sonnet-4-5@20250929") | ||
| integrationModel := getEnvOrDefault("EVAL_INTEGRATION_MODEL", "claude-opus-4-5@20251101") | ||
| judgeModel := getEnvOrDefault("EVAL_JUDGE_MODEL", "claude-haiku-4-5-20251001") | ||
| ``` | ||
|
|
||
| Usage: | ||
| ```bash | ||
| # Use defaults | ||
| go test ./tests/eval/... | ||
|
|
||
| # Override golden tests to use Haiku | ||
| EVAL_GOLDEN_MODEL=claude-3-haiku-20240307 go test ./tests/eval/... | ||
|
|
||
| # Override all models | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Doesn't include an example of overriding the judge model |
||
| EVAL_GOLDEN_MODEL=claude-3-haiku-20240307 \ | ||
| EVAL_INTEGRATION_MODEL=claude-sonnet-4-20250514 \ | ||
| go test ./tests/eval/... | ||
| ``` | ||
|
|
||
| ### Patch Stability | ||
|
|
||
| Patches may fail to apply as `origin/master` evolves over time. Strategies: | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Did we come to a conclusion on this? |
||
|
|
||
| - Pin to a specific commit SHA in the clone step | ||
| - Use `git apply --3way` for better conflict handling | ||
| - Periodic patch refresh CI job | ||
|
|
||
| ### Error Handling | ||
|
|
||
| Current design does not address failure scenarios: | ||
|
|
||
| - Patch application failures | ||
| - Resource cleanup on test failures | ||
|
|
||
| Using `--output-format json` also enables better error handling in future phases: | ||
|
|
||
| - Claude CLI timeouts or crashes (detect via JSON parse failure or missing fields) | ||
| - Empty or malformed output (validate JSON structure) | ||
| - Authentication failures (check for error fields in JSON response) | ||
|
|
||
| ### Performance Optimizations | ||
|
|
||
| The API review step is the slowest part of the eval suite. Options to improve: | ||
|
|
||
| 1. **Skip linting by default** - Update api-review command to skip `make lint` unless explicitly requested. Linting adds significant time. | ||
|
|
||
| 2. **Cache review outputs** - For development, cache the review output keyed by patch hash. Skip re-running if cached result exists. Clear cache on command changes. | ||
|
|
||
| 3. **Parallel test execution** - Run golden tests in parallel (requires separate repo clones per test). | ||
|
|
||
| 4. **Smaller/faster model for development** - Use Haiku for rapid iteration, Sonnet/Opus for CI validation. | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allowed tools matters here?