~cytrogen/gstack

ref: 1a100a2a239a74af03c2bf32abb370419b06416a gstack/browse d---------
1a100a2a — xianren a month ago
fix: eliminate duplicate command sets in chain, improve flush perf and type safety

- Remove duplicate CHAIN_READ/CHAIN_WRITE/CHAIN_META sets from meta-commands.ts
  and import from commands.ts (single source of truth). The duplicated sets would
  silently fail to route new commands added to commands.ts.
- Replace read+concat+write log flush with fs.appendFileSync — O(new entries)
  instead of O(total log size) per flush cycle.
- Replace `any` types for contextOptions with Playwright's BrowserContextOptions
  and add proper types for storage state in recreateContext().
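
A minimal sketch of the flush change described in the second bullet, for illustration only. The function names `flushByRewrite` and `flushByAppend` are hypothetical and the newline-delimited log format is an assumption; only `fs.appendFileSync` is named in the commit message.

```typescript
import * as fs from "node:fs";

// Hypothetical helper names; the real flush code is not shown in this log.

// Old shape: re-reads and rewrites the whole log on every flush,
// so the cost grows with the total log size.
function flushByRewrite(logPath: string, pending: string[]): void {
  const existing = fs.existsSync(logPath) ? fs.readFileSync(logPath, "utf8") : "";
  fs.writeFileSync(logPath, existing + pending.map((line) => line + "\n").join(""));
}

// New shape: writes only the pending entries to the end of the file,
// so the cost grows with the number of new entries.
function flushByAppend(logPath: string, pending: string[]): void {
  if (pending.length === 0) return; // nothing buffered, skip the write
  fs.appendFileSync(logPath, pending.map((line) => line + "\n").join(""));
}
```

Both functions leave the file with the same contents; the append variant simply avoids touching the bytes that are already on disk.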

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
318ffdbd — Garry Tan a month ago
fix: js statement wrapping + click auto-routes option to selectOption (v0.4.5) (#117)

* fix: js statement wrapping + click auto-routes option to selectOption

Bug 1: js command wrapped all code as expressions — const, semicolons,
and multi-line code broke with SyntaxError. Added needsBlockWrapper()
and wrapForEvaluate() helpers (shared with eval) to detect statements
and use block wrapper {…} instead of expression wrapper (…).
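
One way the expression-versus-statement decision could look, as a rough sketch. The helper names come from the commit message, but the detection heuristic (newlines, semicolons, leading keywords) and the exact wrapper strings are assumptions, not the project's actual implementation.

```typescript
// Assumed heuristic: treat code as statements if it spans lines,
// contains a semicolon, or starts with a statement keyword.
function needsBlockWrapper(code: string): boolean {
  const trimmed = code.trim();
  if (trimmed.includes("\n") || trimmed.includes(";")) return true;
  return /^(const|let|var|function|class|if|for|while|return|try|throw)\b/.test(trimmed);
}

// Statements go in a block body {…}; plain expressions go in (…)
// so their value is returned implicitly.
function wrapForEvaluate(code: string): string {
  return needsBlockWrapper(code)
    ? `(() => {\n${code}\n})()`
    : `(() => (${code}))()`;
}
```

With a block wrapper the caller has to `return` explicitly to get a value back. A semicolon inside a string literal would also trigger the block path in this sketch, which a real implementation may want to handle.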

Bug 2: clicking <option> refs hung forever because Playwright can't
.click() native select UI. Click handler now checks ARIA role + DOM
tagName and auto-routes to selectOption() via parent <select>.

Bug 3: click timeouts on <option> elements gave no guidance. Now
throws helpful error: "Use browse select instead of click."

* chore: bump version and changelog (v0.4.5)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
c86faa79 — Garry Tan a month ago
fix: update check cache — 60min UP_TO_DATE TTL + --force flag (v0.4.4) (#110)

* fix: split update check cache TTL + add --force flag

UP_TO_DATE cache now expires after 60 min (was 720 min / 12 hours).
UPGRADE_AVAILABLE keeps 720 min TTL to keep nagging.

--force flag deletes cache before checking, used by /gstack-upgrade
standalone invocation to always get a fresh result from GitHub.
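
The split TTL can be pictured with a small sketch. The real check is a bash script, so this TypeScript version is only an illustration; `isCacheFresh` and its parameters are hypothetical, and `force` is modeled as "ignore the cache" rather than deleting a file.

```typescript
type CacheStatus = "UP_TO_DATE" | "UPGRADE_AVAILABLE";

// TTLs in minutes, taken from the commit message.
const TTL_MINUTES: Record<CacheStatus, number> = {
  UP_TO_DATE: 60,
  UPGRADE_AVAILABLE: 720,
};

// Hypothetical helper: decide whether a cached result may be reused.
function isCacheFresh(status: CacheStatus, ageMinutes: number, force = false): boolean {
  if (force) return false; // --force always triggers a fresh check
  return ageMinutes < TTL_MINUTES[status];
}
```

A 90-minute-old `UP_TO_DATE` entry is stale under this rule, while a 90-minute-old `UPGRADE_AVAILABLE` entry is still reused.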

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: /gstack-upgrade standalone uses --force for fresh check

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.4.4)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
276d0cc6 — Garry Tan a month ago
feat: always-on ELI16 + branch detection (v0.4.3) (#108)

* feat: always-on ELI16 + branch detection in preamble

- Add _BRANCH detection to preamble bash block (git branch --show-current)
- Merge ELI16 rules into default AskUserQuestion format (always-on)
- Remove _SESSIONS >= 3 conditional — better questions always
- Add simplification rules: plain English, no jargon, no raw function names
- Update tests for branch detection and simplification regression guard

* chore: bump version and changelog (v0.4.3)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
78e519e3 — Garry Tan a month ago
feat: await support in browse js/eval + contributor mode v2 (#104)

* feat: support await in $B js and eval commands

Auto-wrap await expressions in async IIFE context so
$B js "await fetch(...)" works without SyntaxError.

- hasAwait() strips comments before detection
- js: expression wrapping (async()=>(expr))()
- eval: smart wrapping — single-line=expression, multi-line=block
- 6 new unit tests covering async, false-positive, and return semantics
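
A rough sketch of the detection and expression wrapping listed above. `hasAwait` is named in the commit message; `stripComments` and `wrapAsyncExpression` are hypothetical names, and the regex-based comment stripping is an assumed, deliberately naive approach.

```typescript
// Naive comment removal: block comments first, then line comments.
// This also cuts "//" inside string literals, which is acceptable here
// because the result is used only for detection, never executed.
function stripComments(code: string): string {
  return code.replace(/\/\*[\s\S]*?\*\//g, "").replace(/\/\/.*$/gm, "");
}

// True when `await` appears as a keyword outside of comments.
function hasAwait(code: string): boolean {
  return /\bawait\b/.test(stripComments(code));
}

// Expression wrapping as described for the js command.
function wrapAsyncExpression(expr: string): string {
  return `(async () => (${expr}))()`;
}
```

The word-boundary match keeps identifiers such as `awaiting` from counting as a false positive.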

* feat: redesign contributor mode — periodic reflection with 0-10 rating

Replace passive "report when things break" with active reflection:
- Rate gstack experience 0-10 at workflow step boundaries
- Historical calibration example (await bug) anchors the reporting bar
- "What would make this a 10" field focuses on actionable improvements
- Removed category lists in favor of judgment-based assessment

* test: add deterministic contributor mode preamble validation

40 new skill-validation tests (4 checks × 10 skills) verify:
- 0-10 rating scale present
- Calibration example present
- "What would make this a 10" field present
- Periodic reflection (not per-command)

Update existing E2E contributor eval for new report format.

* chore: bump version and changelog (v0.4.2)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: improve contributor mode + qa-quick E2E reliability

Contributor mode:
- Add "do not truncate" directive to template — agent was stopping
  after "My rating" without completing Steps/Raw output/What would
  make this a 10 sections
- Restore assertions for Steps to reproduce and Date footer

QA quick:
- Make test server URL prominent: top of prompt, explicit "already
  running" and "do NOT discover ports" instructions
- Bump session timeout 180s→240s and test timeout 240s→300s
- Set B= at top of prompt (was buried in prose)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use flexible assertions for contributor mode E2E

Agent writes thorough reports with creative section names
("Repro Steps" vs "Steps to reproduce"). Match intent not formatting:
- /repro|steps to reproduce/ for reproduction steps
- /date.*2026/ for date footer presence

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add E2E eval failure blame protocol

"Not related to our changes" is an extraordinary claim that requires
extraordinary proof. When evals fail during /ship:

1. Run the same eval on main — prove it fails there too
2. If it passes on main, it IS your change — trace the blame
3. If you can't verify, say "unverified" not "pre-existing"

Added to CLAUDE.md and as a comment in skill-e2e.test.ts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: update CONTRIBUTING.md and BROWSER.md for v0.4.2

CONTRIBUTING.md: update contributor mode description — now describes
periodic 0-10 reflection loop instead of passive friction detection.

BROWSER.md: add js/eval async documentation — await expressions are
auto-wrapped in async context, single-line eval returns values directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: restore v0.4.2 changelog entries lost during cherry-pick conflict

The base branch detection entries from main were dropped when resolving
the CHANGELOG conflict — should have merged both sets, not replaced.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
3e3843c4 — Garry Tan a month ago
feat: contributor mode, session awareness, recommendation format (#90)

* feat: contributor mode, session awareness, universal RECOMMENDATION format

- Rename {{UPDATE_CHECK}} → {{PREAMBLE}} across all 10 skill templates
- Add session tracking (touch ~/.gstack/sessions/$PPID, count active sessions)
- ELI16 mode when 3+ concurrent sessions detected (re-ground user on context)
- Contributor mode: auto-file field reports to ~/.gstack/contributor-logs/
- Universal AskUserQuestion format: context → question → RECOMMENDATION → options
- Update plan-ceo-review and plan-eng-review to reference preamble baseline
- Add vendored symlink awareness section to CLAUDE.md
- Rewrite CONTRIBUTING.md with contributor workflow and cross-project testing
- Add tests for contributor mode and session awareness in generated output
- Add E2E eval for contributor mode report filing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add Enum & Value Completeness to /review critical checklist

New CRITICAL review category that traces new enum values, status strings,
and type constants through every consumer outside the diff. Catches the
class of bugs where a new value is added but not handled in all switch/case
chains, allowlists, or frontend-backend contracts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: bump v0.4.1, user-facing changelog, update qa-only template and architecture docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add CHANGELOG style guide — user-facing, sell the feature

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: rewrite v0.4.1 changelog to be user-facing and sell the features

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add evals for RECOMMENDATION format, session awareness, and enum completeness

Free tests (Tier 1): RECOMMENDATION format + session awareness in all
preamble SKILL.md files, enum completeness checklist structure and CRITICAL
classification.

E2E eval: /review catches missed enum handlers when a new status value
is added but not handled in case/switch and notify methods.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add E2E eval for session awareness ELI16 mode

Stubs _SESSIONS=4, gives agent a decision point on feature/add-payments
branch, verifies the output re-grounds the user with project, branch,
context, and RECOMMENDATION — the ELI16 mode behavior for 3+ sessions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: contributor mode eval marked FAIL due to expected browse error

The test intentionally runs a nonexistent binary to trigger contributor
mode. The session runner's browse error detection catches "no such file
or directory...browse" and sets browseErrors, causing recordE2E to mark
passed=false. Override passed to check only exitReason since the browse
error is the expected scenario.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
f3ee0ee2 — Garry Tan a month ago
feat: QA restructure, browser ref staleness, eval efficiency metrics (v0.4.0) (#83)

* feat: browser ref staleness detection via async count() validation

resolveRef() now checks element count to detect stale refs after page
mutations (e.g. SPA navigation). RefEntry stores role+name metadata
for better diagnostics. 3 new snapshot tests for staleness detection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: qa-only skill, qa fix loop, plan-to-QA artifact flow

Add /qa-only (report-only, Edit tool blocked), restructure /qa with
find-fix-verify cycle, add {{QA_METHODOLOGY}} DRY placeholder for
shared methodology. /plan-eng-review now writes test-plan artifacts
to ~/.gstack/projects/<slug>/ for QA consumption.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: eval efficiency metrics — turns, duration, commentary across all surfaces

Add generateCommentary() for natural-language delta interpretation,
per-test turns/duration in comparison and summary output, judgePassed
unit tests, 3 new E2E tests (qa-only, qa fix loop, plan artifact).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: bump version and changelog (v0.4.0)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: update ARCHITECTURE, BROWSER, CONTRIBUTING, README for v0.4.0

- ARCHITECTURE: add ref staleness detection section, update RefEntry type
- BROWSER: add ref staleness paragraph to snapshot system docs
- CONTRIBUTING: update eval tool descriptions with commentary feature
- README: fix missing qa-only in project-local uninstall command

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add user-facing benefit descriptions to v0.4.0 changelog

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
bb46ca6b — Garry Tan a month ago
feat: smart update check with auto-upgrade, snooze backoff, config CLI (v0.3.9) (#62)

* feat: add bin/gstack-config CLI for reading/writing ~/.gstack/config.yaml

Simple get/set/list interface for persistent gstack configuration.
Used by update-check and upgrade skill for auto_upgrade and update_check settings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: smart update check with 12h cache, snooze backoff, config disable

- Reduce cache TTL from 24h to 12h for faster update detection
- Add escalating snooze backoff: 24h → 48h → 1 week (resets on new version)
- Add update_check: false config option to disable checks entirely
- Clear snooze file on just-upgraded
- 14 new tests covering snooze levels, expiry, corruption, and config paths
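
The snooze schedule can be sketched as a small lookup. The durations come from the bullet above; the `SnoozeState` shape, the 1-based level counter, and both function names are assumptions made for illustration (the actual check is a bash script).

```typescript
// Snooze durations in hours: 24h, 48h, then 1 week for every later snooze.
const SNOOZE_HOURS = [24, 48, 168];

// Hypothetical state: which remote version was snoozed, and how many times.
interface SnoozeState {
  version: string;
  level: number; // 1-based count of consecutive snoozes for this version
}

function snoozeDurationHours(level: number): number {
  const clamped = Math.min(Math.max(level, 1), SNOOZE_HOURS.length);
  return SNOOZE_HOURS[clamped - 1];
}

// Escalate when the same version is snoozed again; reset on a new version.
function nextSnooze(prev: SnoozeState | null, remoteVersion: string): SnoozeState {
  if (prev !== null && prev.version === remoteVersion) {
    return { version: remoteVersion, level: prev.level + 1 };
  }
  return { version: remoteVersion, level: 1 };
}
```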

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: upgrade skill with auto-upgrade, 4-option prompt, vendored sync

- Auto-upgrade mode via config or GSTACK_AUTO_UPGRADE=1 env var
- 4-option AskUserQuestion: upgrade once, always, not now, never
- Step 4.5: sync local vendored copy after upgrading primary install
- Snooze write with escalating backoff on "Not now"
- Update preamble text in gen-skill-docs for new upgrade flow
- Regenerate all SKILL.md files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: simplify upgrade instructions, move auto-upgrade to completed

README now points to /gstack-upgrade instead of long paste commands.
Auto-upgrade TODO moved to Completed section (v0.3.8).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: bump version and changelog (v0.3.9)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
41141007 — Garry Tan a month ago
feat: TODOS-aware skills, 2-tier Greptile replies, gitignore fix (#61)

* fix: log non-ENOENT errors in ensureStateDir() instead of silently swallowing

Replace bare catch {} with ENOENT-only silence. Non-ENOENT errors (EACCES,
ENOSPC) are now logged to .gstack/browse-server.log. Includes test for
permission-denied scenario with chmod 444.
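
The error-handling pattern described here, sketched on a generic file read. `readOptionalFile` and its `log` callback are hypothetical; this is not the body of `ensureStateDir()`, only the "silence ENOENT, log everything else" idea.

```typescript
import * as fs from "node:fs";

// Hypothetical helper: a missing file is expected and stays silent;
// any other failure (EACCES, ENOSPC, EISDIR, ...) is reported.
function readOptionalFile(filePath: string, log: (msg: string) => void): string | null {
  try {
    return fs.readFileSync(filePath, "utf8");
  } catch (err) {
    const code = (err as { code?: string }).code;
    if (code !== "ENOENT") {
      log(`unexpected error reading ${filePath}: ${code ?? String(err)}`);
    }
    return null;
  }
}
```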

* feat: merge TODO.md + TODOS.md into unified backlog with shared format reference

Merge TODO.md (roadmap) and TODOS.md (near-term) into one file organized by
skill/component with P0-P4 priority ordering and Completed section. Add shared
review/TODOS-format.md for canonical format. Add static validation tests.

* feat: add 2-tier Greptile reply system with escalation detection

Add reply templates (Tier 1 friendly, Tier 2 firm), explicit escalation
detection algorithm, and severity re-ranking guidance to greptile-triage.md.

* feat: cross-skill TODOS awareness + Greptile template refs in all skills

/ship Step 5.5: auto-detect completed TODOs, offer reorganization.
/review Step 5.5: cross-reference PR against open TODOs.
/plan-ceo-review, /plan-eng-review: TODOS context in planning.
/retro: Backlog Health metric. /qa: bug TODO context in diff-aware mode.
All Greptile-aware skills now reference reply templates and escalation detection.

* chore: bump version and changelog (v0.3.8)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: update CONTRIBUTING.md for v0.3.8 changes

Clarify test tier cost table (Tier 3 standalone vs combined), add TODOS.md
to "Things to know", mention Greptile triage in ship workflow description.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2aa745cb — Garry Tan a month ago
feat: screenshot element/region clipping (v0.3.7) (#56)

* feat: screenshot element/region clipping (--clip, --viewport, CSS/@ref)

Add element crop (CSS selector or @ref), region clip (--clip x,y,w,h),
and viewport-only (--viewport) modes to the screenshot command. Uses
Playwright's native locator.screenshot() and page.screenshot({ clip }).
Full page remains the default. Includes 10 new tests covering all modes
and error paths.
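
A sketch of how a `--clip x,y,w,h` value might be parsed into the `{ x, y, width, height }` rectangle that Playwright's `page.screenshot({ clip })` accepts. `parseClip`, `ClipRect`, and the validation rules are assumptions for illustration; the command's real parsing and error messages are not shown in this log.

```typescript
interface ClipRect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Hypothetical parser for a "x,y,w,h" argument.
function parseClip(value: string): ClipRect {
  const parts = value.split(",").map((s) => (s.trim() === "" ? NaN : Number(s)));
  if (parts.length !== 4 || parts.some((n) => !Number.isFinite(n))) {
    throw new Error(`invalid clip "${value}": expected x,y,w,h`);
  }
  const [x, y, width, height] = parts;
  if (width <= 0 || height <= 0) {
    throw new Error(`invalid clip "${value}": width and height must be positive`);
  }
  return { x, y, width, height };
}
```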

* chore: bump version and changelog (v0.3.7)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add screenshot modes to BROWSER.md command reference

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
7d266661 — Garry Tan a month ago
Merge pull request #55 from garrytan/v0.3.6-qa-upgrades

feat: E2E observability + eval infrastructure + all skills templated
baf8acd5 — Garry Tan a month ago
fix: update check ignores stale UP_TO_DATE cache after version change

The UP_TO_DATE cache path exited immediately without checking if the
cached version still matched the local VERSION. After upgrading (e.g.
0.3.3 → 0.3.4), the cache still said "UP_TO_DATE 0.3.3" and the
script never re-checked against remote — so updates were invisible
until the 24h cache expired.

Now both UP_TO_DATE and UPGRADE_AVAILABLE verify cached version vs
local before trusting the cache.
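
The added guard can be sketched as a version comparison before the cache is trusted. The `UP_TO_DATE <version>` line format appears in the message above; the `UPGRADE_AVAILABLE <local> <remote>` layout and the name `cacheMatchesLocal` are assumptions, and the real script is bash rather than TypeScript.

```typescript
// Assumed cache line formats:
//   "UP_TO_DATE <local-version>"
//   "UPGRADE_AVAILABLE <local-version> <remote-version>"
function cacheMatchesLocal(cacheLine: string, localVersion: string): boolean {
  const [status, cachedLocal] = cacheLine.trim().split(/\s+/);
  if (status !== "UP_TO_DATE" && status !== "UPGRADE_AVAILABLE") return false;
  // A cache written by an older install must not be trusted.
  return cachedLocal === localVersion;
}
```

With this check, a cache line of `UP_TO_DATE 0.3.3` is ignored once the local version is 0.3.4, so the script falls through to a remote check.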

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2d88f5f0 — Garry Tan a month ago
test: add update-check exit code regression tests

Guards against the "exits 1 when up to date" bug that broke skill
preambles. Two new tests: real VERSION + unreachable remote, and
multi-call sequence verifying exit 0 in all states.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
3d750d89 — Garry Tan a month ago
Merge remote-tracking branch 'origin/main' into v0.3.6-qa-upgrades

# Conflicts:
#	test/skill-e2e.test.ts
1717ed28 — Garry Tan a month ago
fix: browse binary discovery broken for agents (v0.3.5) (#44)

* fix: replace find-browse with direct path in SKILL.md setup blocks

Agents were skipping the find-browse binary and guessing bin/browse
(wrong path). Now the setup block explicitly checks browse/dist/browse
with workspace-local priority, global fallback.

Also adds || true to update check to prevent misleading exit code 1.

Adds {{UPDATE_CHECK}} and {{BROWSE_SETUP}} template placeholders to
gen-skill-docs.ts so all skills share a single source of truth.

* refactor: convert qa/ and setup-browser-cookies/ to .tmpl templates

Replaces hardcoded update check and find-browse blocks with
{{UPDATE_CHECK}} and {{BROWSE_SETUP}} placeholders. Both skills
are now generated from templates via gen-skill-docs.

* test: add e2e and LLM eval tests for SKILL.md setup block

- 3 Agent SDK e2e tests: happy path, NEEDS_SETUP, non-git-repo
- LLM eval: setup block clarity + actionability >= 4
- New error pattern: 'no such file or directory.*browse'

These tests catch the exact failure mode where agents can't discover
the browse binary via SKILL.md instructions.

* chore: bump version and changelog (v0.3.5)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
76803d78 — Garry Tan a month ago
feat: 3-tier eval suite with planted-bug outcome testing (EVALS=1)

Adds comprehensive eval infrastructure:
- Tier 1 (free): 13 new static tests — cross-skill path consistency, QA
  structure validation, greptile format, planted-bug fixture validation
- Tier 2 (Agent SDK E2E): /qa quick, /review with pre-built git repo,
  3 planted-bug outcome evals (static, SPA, checkout — each with 5 bugs)
- Tier 3 (LLM judge): QA workflow quality, health rubric clarity,
  cross-skill consistency, baseline score pinning

New fixtures: 3 HTML pages with 15 total planted bugs, ground truth JSON,
review-eval-vuln.rb, eval-baselines.json. Shared llm-judge.ts helper (DRY).

Unified EVALS=1 flag replaces SKILL_E2E + ANTHROPIC_API_KEY checks.
`bun run test:evals` runs everything that costs money (~$4/run).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
6b69c46a — Garry Tan a month ago
feat: daily update check + /gstack-upgrade skill (v0.3.4) (#42)

* feat: add daily update check script + /gstack-upgrade skill

bin/gstack-update-check: pure bash, checks VERSION against remote once/day,
outputs UPGRADE_AVAILABLE or JUST_UPGRADED. Uses ~/.gstack/ for state.

gstack-upgrade/SKILL.md: new skill with inline upgrade flow for all preambles.
Detects global-git, local-git, vendored installs. Shows What's New from CHANGELOG.

browse/test/gstack-update-check.test.ts: 10 test cases covering all branch paths.

* refactor: remove version check from find-browse, simplify to binary locator

Delete checkVersion(), readCache(), writeCache(), fetchRemoteSHA(),
resolveSkillDir(), CacheEntry interface, REPO_URL/CACHE_PATH/CACHE_TTL
constants, and META output from find-browse.ts.

Version checking is now handled by bin/gstack-update-check (previous commit).

* feat: add update check preamble to all 9 skills

Every skill now runs bin/gstack-update-check on invocation. If an upgrade
is available, reads gstack-upgrade/SKILL.md inline upgrade flow.

Also adds AskUserQuestion to 5 skills that lacked it (gstack root, browse,
qa, retro, setup-browser-cookies) and Bash to plan-eng-review.

Simplifies qa and setup-browser-cookies setup blocks (removes META parsing).

* chore: bump version and changelog (v0.3.4)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove unused import + add corrupt cache test

Address pre-landing review findings:
- Remove unused mkdirSync import from gstack-update-check.test.ts
- Add Path I test: corrupt cache file falls through to remote fetch

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
5155fe3a — Garry Tan a month ago
Merge remote-tracking branch 'origin/main' into v0.3.5-qa-upgrades
a4683742 — Garry Tan a month ago
fix: enrich SKILL.md docs to pass LLM evals, upgrade judge to Sonnet 4.6 (#43)

* fix: enrich command descriptions and snapshot flags for LLM eval quality

14 command descriptions enriched with specific arg formats, valid values,
error behavior, and return types. Fixed header usage from <name> <value>
to <name>:<value>. Added cookie usage syntax. Snapshot flags now show
long names, ref numbering, and output format examples.

* refactor: auto-generate server.ts help text from COMMAND_DESCRIPTIONS

Replace hand-maintained help block with generateHelpText() that reads
from COMMAND_DESCRIPTIONS and SNAPSHOT_FLAGS. Eliminates help text
drift from source of truth.

* test: add usage consistency and pipe guard tests

Usage consistency test cross-checks Usage: patterns in implementation
against COMMAND_DESCRIPTIONS using structural skeleton comparison.
Pipe guard test ensures descriptions don't contain | which would break
markdown table rendering.

* chore: upgrade eval judge to Sonnet 4.6, update changelog

Switch LLM-as-judge evals from Haiku to Sonnet 4.6 for more stable,
nuanced scoring. Add changelog entry for all eval improvements.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
ff5cbbbf — Garry Tan a month ago
feat: add remote slug helper and auto-gitignore for .gstack/

- getRemoteSlug() in config.ts: parses git remote origin → owner-repo format
- browse/bin/remote-slug: shell helper for SKILL.md use (BSD sed compatible)
- ensureStateDir() now appends .gstack/ to project .gitignore if not present
- setup creates ~/.gstack/projects/ global state directory
- 7 new tests: 4 gitignore behavior + 3 remote slug parsing
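
A sketch of the remote-to-slug parsing named in the first bullet. `slugFromRemoteUrl` is a hypothetical pure helper, not the actual `getRemoteSlug()`; it assumes the remote URL has already been read from git and that the desired output is `owner-repo`. The URLs in the usage note are made-up examples.

```typescript
// Hypothetical helper: turn an SSH or HTTPS remote URL into "owner-repo".
// Returns null when the URL does not end in an owner/repo pair.
function slugFromRemoteUrl(url: string): string | null {
  const match = url.trim().match(/[:/]([^/:]+)\/([^/]+?)(?:\.git)?\/?$/);
  return match ? `${match[1]}-${match[2]}` : null;
}
```

For example, both `git@github.com:acme/widgets.git` and `https://github.com/acme/widgets` map to `acme-widgets` under this sketch.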