feat: QA restructure, browser ref staleness, eval efficiency metrics (v0.4.0) (#83)
* feat: browser ref staleness detection via async count() validation
resolveRef() now checks element count to detect stale refs after page
mutations (e.g. SPA navigation). RefEntry stores role+name metadata
for better diagnostics. 3 new snapshot tests for staleness detection.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat: qa-only skill, qa fix loop, plan-to-QA artifact flow
Add /qa-only (report-only, Edit tool blocked), restructure /qa with
find-fix-verify cycle, add {{QA_METHODOLOGY}} DRY placeholder for
shared methodology. /plan-eng-review now writes test-plan artifacts
to ~/.gstack/projects/<slug>/ for QA consumption.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat: eval efficiency metrics — turns, duration, commentary across all surfaces
Add generateCommentary() for natural-language delta interpretation,
per-test turns/duration in comparison and summary output, judgePassed
unit tests, 3 new E2E tests (qa-only, qa fix loop, plan artifact).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* chore: bump version and changelog (v0.4.0)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs: update ARCHITECTURE, BROWSER, CONTRIBUTING, README for v0.4.0
- ARCHITECTURE: add ref staleness detection section, update RefEntry type
- BROWSER: add ref staleness paragraph to snapshot system docs
- CONTRIBUTING: update eval tool descriptions with commentary feature
- README: fix missing qa-only in project-local uninstall command
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs: add user-facing benefit descriptions to v0.4.0 changelog
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Merge pull request #55 from garrytan/v0.3.6-qa-upgrades
feat: E2E observability + eval infrastructure + all skills templated
fix: never clean up observability artifacts — partial file persists after finalize
Removing the _partial-e2e.json deletion from finalize(). These are small files
on a local disk and their persistence is the whole point of observability.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat: add E2E observability — heartbeat, progress.log, NDJSON persistence, savePartial()
session-runner: atomic heartbeat file (e2e-live.json), per-run log directory
(~/.gstack-dev/e2e-runs/{runId}/), progress.log + per-test NDJSON persistence,
failure transcripts to persistent run dir instead of tmpdir.
eval-store: 3 new diagnostic fields (exit_reason, timeout_at_turn, last_tool_call),
savePartial() writes _partial-e2e.json after each addTest() for crash resilience,
finalize() cleans up partial file.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat: eval persistence with auto-compare against previous run
EvalCollector accumulates test results during eval runs, writes JSON to
~/.gstack-dev/evals/{version}-{branch}-{tier}-{timestamp}.json, prints
a summary table, and automatically compares against the previous run.
- EvalCollector class with addTest() / finalize() / summary table
- findPreviousRun() prefers same branch, falls back to any branch
- compareEvalResults() matches tests by name, detects improved/regressed
- extractToolSummary() counts tool types from transcript events
- formatComparison() renders delta table with per-test + aggregate diffs
- Wire into skill-e2e.test.ts (recordE2E helper) and skill-llm-eval.test.ts
- 19 unit tests for collector + comparison functions
- schema_version: 1 for forward compatibility
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>