~cytrogen/gstack

ref: f9cfabeda8d6521e31134b88db7c50f47e7ae4dc gstack/test/helpers/eval-store.ts -rw-r--r-- 16.6 KiB
f9cfabed — Garry Tan a month ago
feat: add E2E observability — heartbeat, progress.log, NDJSON persistence, savePartial()

session-runner: atomic heartbeat file (e2e-live.json), per-run log directory
(~/.gstack-dev/e2e-runs/{runId}/), progress.log + per-test NDJSON persistence,
failure transcripts to persistent run dir instead of tmpdir.

eval-store: 3 new diagnostic fields (exit_reason, timeout_at_turn, last_tool_call),
savePartial() writes _partial-e2e.json after each addTest() for crash resilience,
finalize() cleans up partial file.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
84f52f3b — Garry Tan a month ago
feat: eval persistence with auto-compare against previous run

EvalCollector accumulates test results during eval runs, writes JSON to
~/.gstack-dev/evals/{version}-{branch}-{tier}-{timestamp}.json, prints
a summary table, and automatically compares against the previous run.

- EvalCollector class with addTest() / finalize() / summary table
- findPreviousRun() prefers same branch, falls back to any branch
- compareEvalResults() matches tests by name, detects improved/regressed
- extractToolSummary() counts tool types from transcript events
- formatComparison() renders delta table with per-test + aggregate diffs
- Wire into skill-e2e.test.ts (recordE2E helper) and skill-llm-eval.test.ts
- 19 unit tests for collector + comparison functions
- schema_version: 1 for forward compatibility

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>