~cytrogen/gstack

ref: 407b15692073dfd8aaff13e755d99a6022ff8ae5 gstack/test/skill-validation.test.ts -rw-r--r-- 22.7 KiB
78e519e3 — Garry Tan a month ago
feat: await support in browse js/eval + contributor mode v2 (#104)

* feat: support await in $B js and eval commands

Auto-wrap await expressions in async IIFE context so
$B js "await fetch(...)" works without SyntaxError.

- hasAwait() strips comments before detection
- js: expression wrapping (async()=>(expr))()
- eval: smart wrapping — single-line=expression, multi-line=block
- 6 new unit tests covering async, false-positive, and return semantics

* feat: redesign contributor mode — periodic reflection with 0-10 rating

Replace passive "report when things break" with active reflection:
- Rate gstack experience 0-10 at workflow step boundaries
- Historical calibration example (await bug) anchors the reporting bar
- "What would make this a 10" field focuses on actionable improvements
- Removed category lists in favor of judgment-based assessment

* test: add deterministic contributor mode preamble validation

40 new skill-validation tests (4 checks × 10 skills) verify:
- 0-10 rating scale present
- Calibration example present
- "What would make this a 10" field present
- Periodic reflection (not per-command)

Update existing E2E contributor eval for new report format.

* chore: bump version and changelog (v0.4.2)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: improve contributor mode + qa-quick E2E reliability

Contributor mode:
- Add "do not truncate" directive to template — agent was stopping
  after "My rating" without completing Steps/Raw output/What would
  make this a 10 sections
- Restore assertions for Steps to reproduce and Date footer

QA quick:
- Make test server URL prominent: top of prompt, explicit "already
  running" and "do NOT discover ports" instructions
- Bump session timeout 180s→240s and test timeout 240s→300s
- Set B= at top of prompt (was buried in prose)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use flexible assertions for contributor mode E2E

Agent writes thorough reports with creative section names
("Repro Steps" vs "Steps to reproduce"). Match intent not formatting:
- /repro|steps to reproduce/ for reproduction steps
- /date.*2026/ for date footer presence

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add E2E eval failure blame protocol

"Not related to our changes" is an extraordinary claim that requires
extraordinary proof. When evals fail during /ship:

1. Run the same eval on main — prove it fails there too
2. If it passes on main, it IS your change — trace the blame
3. If you can't verify, say "unverified" not "pre-existing"

Added to CLAUDE.md and as a comment in skill-e2e.test.ts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: update CONTRIBUTING.md and BROWSER.md for v0.4.2

CONTRIBUTING.md: update contributor mode description — now describes
periodic 0-10 reflection loop instead of passive friction detection.

BROWSER.md: add js/eval async documentation — await expressions are
auto-wrapped in async context, single-line eval returns values directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: restore v0.4.2 changelog entries lost during cherry-pick conflict

The base branch detection entries from main were dropped when resolving
the CHANGELOG conflict — should have merged both sets, not replaced.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1e06b6a5 — Garry Tan a month ago
fix: dynamic base branch detection across all SKILL templates (v0.3.10) (#81)

* feat: add {{BASE_BRANCH_DETECT}} resolver to gen-skill-docs

DRY placeholder for dynamic base branch detection across PR-targeting
skills. Detects via gh pr view (existing PR base) → gh repo view
(repo default) → fallback to main.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: ship skill detects base branch instead of hardcoding main

Replaces ~14 hardcoded 'main' references with dynamic detection via
{{BASE_BRANCH_DETECT}}. Fixes stacked branches and Conductor workspaces
targeting non-main branches. Adds --base <base> to gh pr create.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: review, qa, plan-ceo-review detect base branch dynamically

Same pattern as ship: replaces hardcoded 'main' with {{BASE_BRANCH_DETECT}}.
Also cleans up qa bash-isms (REPORT_DIR variable, port chaining).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: retro detects default branch instead of hardcoding origin/main

Retro queries commit history (not PR targets), so uses simpler detection:
gh repo view defaultBranchRef. Replaces ~11 origin/main refs with
origin/<default>.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add explicit cross-step references in gstack-upgrade template

Bash blocks are self-contained, but cross-block variable references
(INSTALL_DIR from Step 2) were implicit. Adds prose making them explicit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs+test: SKILL authoring guidance + regression tests

Adds "Writing SKILL templates" section to CLAUDE.md explaining that
templates are prompts, not scripts. Adds validation test catching
hardcoded 'main' in git commands, and resolver content test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update ARCHITECTURE + CONTRIBUTING for new placeholders

Add {{BASE_BRANCH_DETECT}} to ARCHITECTURE.md placeholder list.
Cross-reference CLAUDE.md template authoring guidance from CONTRIBUTING.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.3.10)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add missing blank line between resolver functions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add 3 E2E smoke tests for base branch detection

- /review: verifies Step 0 detection + git diff against detected base
- /ship: truncated dry-run (Steps 0-1 only, no push/PR), asserts no
  destructive actions
- /retro: verifies default branch detection for git log queries

Covers the {{BASE_BRANCH_DETECT}} resolver path (review), the ship
template's dual abort check, and retro's inline detection pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.4.2)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3e3843c4 — Garry Tan a month ago
feat: contributor mode, session awareness, recommendation format (#90)

* feat: contributor mode, session awareness, universal RECOMMENDATION format

- Rename {{UPDATE_CHECK}} → {{PREAMBLE}} across all 10 skill templates
- Add session tracking (touch ~/.gstack/sessions/$PPID, count active sessions)
- ELI16 mode when 3+ concurrent sessions detected (re-ground user on context)
- Contributor mode: auto-file field reports to ~/.gstack/contributor-logs/
- Universal AskUserQuestion format: context → question → RECOMMENDATION → options
- Update plan-ceo-review and plan-eng-review to reference preamble baseline
- Add vendored symlink awareness section to CLAUDE.md
- Rewrite CONTRIBUTING.md with contributor workflow and cross-project testing
- Add tests for contributor mode and session awareness in generated output
- Add E2E eval for contributor mode report filing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add Enum & Value Completeness to /review critical checklist

New CRITICAL review category that traces new enum values, status strings,
and type constants through every consumer outside the diff. Catches the
class of bugs where a new value is added but not handled in all switch/case
chains, allowlists, or frontend-backend contracts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: bump v0.4.1, user-facing changelog, update qa-only template and architecture docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add CHANGELOG style guide — user-facing, sell the feature

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: rewrite v0.4.1 changelog to be user-facing and sell the features

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add evals for RECOMMENDATION format, session awareness, and enum completeness

Free tests (Tier 1): RECOMMENDATION format + session awareness in all
preamble SKILL.md files, enum completeness checklist structure and CRITICAL
classification.

E2E eval: /review catches missed enum handlers when a new status value
is added but not handled in case/switch and notify methods.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add E2E eval for session awareness ELI16 mode

Stubs _SESSIONS=4, gives agent a decision point on feature/add-payments
branch, verifies the output re-grounds the user with project, branch,
context, and RECOMMENDATION — the ELI16 mode behavior for 3+ sessions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: contributor mode eval marked FAIL due to expected browse error

The test intentionally runs a nonexistent binary to trigger contributor
mode. The session runner's browse error detection catches "no such file
or directory...browse" and sets browseErrors, causing recordE2E to mark
passed=false. Override passed to check only exitReason since the browse
error is the expected scenario.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
f3ee0ee2 — Garry Tan a month ago
feat: QA restructure, browser ref staleness, eval efficiency metrics (v0.4.0) (#83)

* feat: browser ref staleness detection via async count() validation

resolveRef() now checks element count to detect stale refs after page
mutations (e.g. SPA navigation). RefEntry stores role+name metadata
for better diagnostics. 3 new snapshot tests for staleness detection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: qa-only skill, qa fix loop, plan-to-QA artifact flow

Add /qa-only (report-only, Edit tool blocked), restructure /qa with
find-fix-verify cycle, add {{QA_METHODOLOGY}} DRY placeholder for
shared methodology. /plan-eng-review now writes test-plan artifacts
to ~/.gstack/projects/<slug>/ for QA consumption.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: eval efficiency metrics — turns, duration, commentary across all surfaces

Add generateCommentary() for natural-language delta interpretation,
per-test turns/duration in comparison and summary output, judgePassed
unit tests, 3 new E2E tests (qa-only, qa fix loop, plan artifact).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: bump version and changelog (v0.4.0)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: update ARCHITECTURE, BROWSER, CONTRIBUTING, README for v0.4.0

- ARCHITECTURE: add ref staleness detection section, update RefEntry type
- BROWSER: add ref staleness paragraph to snapshot system docs
- CONTRIBUTING: update eval tool descriptions with commentary feature
- README: fix missing qa-only in project-local uninstall command

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add user-facing benefit descriptions to v0.4.0 changelog

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
41141007 — Garry Tan a month ago
feat: TODOS-aware skills, 2-tier Greptile replies, gitignore fix (#61)

* fix: log non-ENOENT errors in ensureStateDir() instead of silently swallowing

Replace bare catch {} with ENOENT-only silence. Non-ENOENT errors (EACCES,
ENOSPC) are now logged to .gstack/browse-server.log. Includes test for
permission-denied scenario with chmod 444.

* feat: merge TODO.md + TODOS.md into unified backlog with shared format reference

Merge TODO.md (roadmap) and TODOS.md (near-term) into one file organized by
skill/component with P0-P4 priority ordering and Completed section. Add shared
review/TODOS-format.md for canonical format. Add static validation tests.

* feat: add 2-tier Greptile reply system with escalation detection

Add reply templates (Tier 1 friendly, Tier 2 firm), explicit escalation
detection algorithm, and severity re-ranking guidance to greptile-triage.md.

* feat: cross-skill TODOS awareness + Greptile template refs in all skills

/ship Step 5.5: auto-detect completed TODOs, offer reorganization.
/review Step 5.5: cross-reference PR against open TODOs.
/plan-ceo-review, /plan-eng-review: TODOS context in planning.
/retro: Backlog Health metric. /qa: bug TODO context in diff-aware mode.
All Greptile-aware skills now reference reply templates and escalation detection.

* chore: bump version and changelog (v0.3.8)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: update CONTRIBUTING.md for v0.3.8 changes

Clarify test tier cost table (Tier 3 standalone vs combined), add TODOS.md
to "Things to know", mention Greptile triage in ship workflow description.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
7d266661 — Garry Tan a month ago
Merge pull request #55 from garrytan/v0.3.6-qa-upgrades

feat: E2E observability + eval infrastructure + all skills templated
40631041 — Garry Tan a month ago
fix: remove false-positive Exit code 1 pattern, fix NEEDS_SETUP test, update QA tests

- Remove /Exit code 1/ from BROWSE_ERROR_PATTERNS — too broad, matches any
  bash command exit code in the transcript (e.g., git diff, test commands).
  Remaining patterns (Unknown command, Unknown snapshot flag, binary not found,
  server failed, no such file) are specific to browse errors.

- Fix NEEDS_SETUP E2E test — accepts READY when global binary exists at
  ~/.claude/skills/gstack/browse/dist/browse (which it does on dev machines).
  Test now verifies the setup block handles missing local binary gracefully.

- Update QA skill structure validation tests to match current qa/SKILL.md
  template content (phases renamed, modes replaced tiers, output structure).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
a67dae5f — Garry Tan a month ago
fix: update check preamble exits 1 when up to date — convert all skills to .tmpl

The `[ -n "$_UPD" ] && echo "$_UPD"` line in 5 skills was missing `|| true`,
causing exit code 1 when the update check finds no update (empty $_UPD).

Fix: convert ship/, review/, plan-ceo-review/, plan-eng-review/, retro/ to
.tmpl templates using {{UPDATE_CHECK}} placeholder (same as browse/qa/etc).
All 9 skills now generated from templates — preamble changes propagate everywhere.

Also: regenerates qa/SKILL.md which had drifted from its template, adds 12 tests
validating the update check preamble exits 0 in all skills, removes completed TODO.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
76803d78 — Garry Tan a month ago
feat: 3-tier eval suite with planted-bug outcome testing (EVALS=1)

Adds comprehensive eval infrastructure:
- Tier 1 (free): 13 new static tests — cross-skill path consistency, QA
  structure validation, greptile format, planted-bug fixture validation
- Tier 2 (Agent SDK E2E): /qa quick, /review with pre-built git repo,
  3 planted-bug outcome evals (static, SPA, checkout — each with 5 bugs)
- Tier 3 (LLM judge): QA workflow quality, health rubric clarity,
  cross-skill consistency, baseline score pinning

New fixtures: 3 HTML pages with 15 total planted bugs, ground truth JSON,
review-eval-vuln.rb, eval-baselines.json. Shared llm-judge.ts helper (DRY).

Unified EVALS=1 flag replaces SKILL_E2E + ANTHROPIC_API_KEY checks.
`bun run test:evals` runs everything that costs money (~$4/run).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
5155fe3a — Garry Tan a month ago
Merge remote-tracking branch 'origin/main' into v0.3.5-qa-upgrades
a4683742 — Garry Tan a month ago
fix: enrich SKILL.md docs to pass LLM evals, upgrade judge to Sonnet 4.6 (#43)

* fix: enrich command descriptions and snapshot flags for LLM eval quality

14 command descriptions enriched with specific arg formats, valid values,
error behavior, and return types. Fixed header usage from <name> <value>
to <name>:<value>. Added cookie usage syntax. Snapshot flags now show
long names, ref numbering, and output format examples.

* refactor: auto-generate server.ts help text from COMMAND_DESCRIPTIONS

Replace hand-maintained help block with generateHelpText() that reads
from COMMAND_DESCRIPTIONS and SNAPSHOT_FLAGS. Eliminates help text
drift from source of truth.

* test: add usage consistency and pipe guard tests

Usage consistency test cross-checks Usage: patterns in implementation
against COMMAND_DESCRIPTIONS using structural skeleton comparison.
Pipe guard test ensures descriptions don't contain | which would break
markdown table rendering.

* chore: upgrade eval judge to Sonnet 4.6, update changelog

Switch LLM-as-judge evals from Haiku to Sonnet 4.6 for more stable,
nuanced scoring. Add changelog entry for all eval improvements.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
52050702 — Garry Tan a month ago
feat: SKILL.md template system, 3-tier testing, DX tools (v0.3.3) (#41)

* refactor: extract command registry to commands.ts, add SNAPSHOT_FLAGS metadata

- NEW: browse/src/commands.ts — command sets + COMMAND_DESCRIPTIONS + load-time validation (zero side effects)
- server.ts imports from commands.ts instead of declaring sets inline
- snapshot.ts: SNAPSHOT_FLAGS array drives parseSnapshotArgs (metadata-driven, no duplication)
- All 186 existing tests pass

* feat: SKILL.md template system with auto-generated command references

- SKILL.md.tmpl + browse/SKILL.md.tmpl with {{COMMAND_REFERENCE}} and {{SNAPSHOT_FLAGS}} placeholders
- scripts/gen-skill-docs.ts generates SKILL.md from templates (supports --dry-run)
- Build pipeline runs gen:skill-docs before binary compilation
- Generated files have AUTO-GENERATED header, committed to git

* test: Tier 1 static validation — 34 tests for SKILL.md command correctness

- test/helpers/skill-parser.ts: extracts $B commands from code blocks, validates against registry
- test/skill-parser.test.ts: 13 parser/validator unit tests
- test/skill-validation.test.ts: 13 tests validating all SKILL.md files + registry consistency
- test/gen-skill-docs.test.ts: 8 generator tests (categories, sorting, freshness)

* feat: DX tools (skill:check, dev:skill) + Tier 2 E2E test scaffolding

- scripts/skill-check.ts: health summary for all SKILL.md files (commands, templates, freshness)
- scripts/dev-skill.ts: watch mode for template development
- test/helpers/session-runner.ts: Agent SDK wrapper for E2E skill tests
- test/skill-e2e.test.ts: 2 E2E tests + 3 stubs (auto-skip inside Claude Code sessions)
- E2E tests must run from plain terminal: SKILL_E2E=1 bun test test/skill-e2e.test.ts

* ci: SKILL.md freshness check on push/PR + TODO updates

- .github/workflows/skill-docs.yml: fails if generated SKILL.md files are stale
- TODO.md: add E2E cost tracking and model pinning to future ideas

* fix: restore rich descriptions lost in auto-generation

- Snapshot flags: add back value hints (-d <N>, -s <sel>, -o <path>)
- Snapshot flags: restore parenthetical context (@e refs, @c refs, etc.)
- Commands: is → includes valid states enum
- Commands: console → notes --errors filter behavior
- Commands: press → lists common keys (Enter, Tab, Escape)
- Commands: cookie-import-browser → describes picker UI
- Commands: dialog-accept → specifies alert/confirm/prompt
- Tips: restore → arrow (was downgraded to ->)

* test: quality evals for generated SKILL.md descriptions

Catches the exact regressions we shipped and caught in review:
- Snapshot flags must include value hints (-d <N>, -s <sel>, -o <path>)
- is command must list all valid states (visible/hidden/enabled/...)
- press command must list example keys (Enter, Tab, Escape)
- console command must describe --errors behavior
- Snapshot -i must mention @e refs, -C must mention @c refs
- All descriptions must be >= 8 chars (no empty stubs)
- Tips section must use → not ->

* feat: LLM-as-judge evals for SKILL.md documentation quality

4 eval tests using Anthropic API (claude-haiku, ~$0.01-0.03/run):
- Command reference table: clarity/completeness/actionability >= 4/5
- Snapshot flags section: same thresholds
- browse/SKILL.md overall quality
- Regression: generated version must score >= hand-maintained baseline

Requires ANTHROPIC_API_KEY. Auto-skips without it.
Run: bun run test:eval (or ANTHROPIC_API_KEY=sk-... bun test test/skill-llm-eval.test.ts)

* chore: bump version to 0.3.3, update changelog

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add ARCHITECTURE.md, update CLAUDE.md and CONTRIBUTING.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: conductor.json lifecycle hooks + .env propagation across worktrees

bin/dev-setup now copies .env from main worktree so API keys carry
over to Conductor workspaces automatically. conductor.json wires up
setup and archive hooks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: complete CHANGELOG for v0.3.3 (architecture, conductor, .env)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>