~cytrogen/gstack: Merge remote-tracking branch 'origin/main' into v0.3.6-qa-upgrades

27 files changed, 1154 insertions(+), 294 deletions(-)

M CHANGELOG.md
M SKILL.md
M SKILL.md.tmpl
A TODOS.md
M VERSION
A bin/gstack-update-check
M browse/SKILL.md
M browse/SKILL.md.tmpl
M browse/bin/find-browse
M browse/src/find-browse.ts
M browse/test/find-browse.test.ts
A browse/test/gstack-update-check.test.ts
A gstack-upgrade/SKILL.md
M plan-ceo-review/SKILL.md
M plan-eng-review/SKILL.md
M qa/SKILL.md
A qa/SKILL.md.tmpl
M retro/SKILL.md
M review/SKILL.md
M scripts/gen-skill-docs.ts
M setup
M setup-browser-cookies/SKILL.md
A setup-browser-cookies/SKILL.md.tmpl
M ship/SKILL.md
M test/helpers/session-runner.ts
M test/skill-e2e.test.ts
M test/skill-llm-eval.test.ts

M CHANGELOG.md => CHANGELOG.md +36 -1

@@ 1,11 1,46 @@
 # Changelog
 
-## Unreleased — 2026-03-14
+## 0.3.4 — 2026-03-13
+
+### Added
+- **Daily update check** — all 9 skills now check for new versions once per day via `bin/gstack-update-check` (pure bash, <5ms cached). Prompts user via AskUserQuestion with option to upgrade or defer 24h.
+- **`/gstack-upgrade` skill** — standalone upgrade command that detects install type (global-git, local-git, vendored), upgrades, and shows a "What's New" summary from CHANGELOG
+- **"Just upgraded" confirmation** — after upgrading, the next skill invocation shows "Running gstack v{new} (just updated!)" via `~/.gstack/just-upgraded-from` marker
+- **`AskUserQuestion` added to 5 skills** — gstack (root), browse, qa, retro, setup-browser-cookies now have AskUserQuestion in allowed-tools for upgrade prompts
+- **`Bash` added to plan-eng-review** — enables the update check preamble to run in plan review sessions
+- `browse/test/gstack-update-check.test.ts` — 10 test cases covering all script branch paths with `GSTACK_REMOTE_URL` env var for test isolation
+- `TODOS.md` for tracking deferred work
+
+### Changed
+- **Version check is now one system** — removed SHA-based `checkVersion()` from `browse/src/find-browse.ts` (~120 lines deleted) and `browse/test/find-browse.test.ts` (~100 lines deleted). Replaced by `bin/gstack-update-check` bash script using semver VERSION comparison with 24h cache.
+- Simplified `qa/SKILL.md` and `setup-browser-cookies/SKILL.md` setup blocks — removed old `BROWSE_OUTPUT`/`META` parsing, now use simple `find-browse` call
+- Updated `browse/bin/find-browse` shim comments to reflect simplified role (binary locator only)
+
+### Removed
+- `checkVersion()`, `readCache()`, `writeCache()`, `fetchRemoteSHA()`, `resolveSkillDir()`, `CacheEntry` interface from `browse/src/find-browse.ts`
+- `META:UPDATE_AVAILABLE` protocol from find-browse output
+- Old META-based upgrade instructions from qa and setup-browser-cookies SKILL.md files
+- Legacy `/tmp/gstack-latest-version` cache file (cleaned up by `setup` script)
+
+## 0.3.5 — 2026-03-14
+
+### Fixed
+- **Browse binary discovery broken for agents** — replaced `find-browse` indirection with explicit `browse/dist/browse` path in SKILL.md setup blocks. Agents were guessing `bin/browse` (wrong) instead of running `find-browse` to discover `browse/dist/browse` (correct).
+- **Update check exit code 1 misleading agents** — `[ -n "$_UPD" ] && echo "$_UPD"` returned exit code 1 when no update available, causing agents to think gstack was broken. Added `|| true`.
+- **browse/SKILL.md missing setup block** — `/browse` used `$B` in every example but never defined it. Added `{{BROWSE_SETUP}}` placeholder.
 
 ### Changed
 - Enriched 14 command descriptions with specific arg formats, valid values, error behavior, and return types
 - Fixed `header` usage from `<name> <value>` to `<name>:<value>` (matching actual implementation)
 - Added `cookie` usage syntax: `cookie <name>=<value>`
+- **Template system expanded** — added `{{UPDATE_CHECK}}` and `{{BROWSE_SETUP}}` placeholders to `gen-skill-docs.ts`. Converted `qa/SKILL.md` and `setup-browser-cookies/SKILL.md` to `.tmpl` templates. All 4 browse-using skills now generate from a single source of truth.
+- Setup block now checks workspace-local path first (for development), then falls back to global `~/.claude/skills/gstack/browse/dist/browse`
+
+### Added
+- 3 new e2e test cases for SKILL.md setup flow: happy path, NEEDS_SETUP, non-git-repo
+- LLM eval for setup block clarity (actionability + clarity >= 4)
+- `no such file or directory.*browse` error pattern in session-runner
+- TODO: convert remaining 5 non-browse skills to .tmpl files
 - Enriched 4 snapshot flag descriptions with defaults, output paths, and behavior details
 - Snapshot flags section now shows long flag names (`-i / --interactive`) alongside short
 - Added ref numbering explanation and output format example to snapshot docs

M SKILL.md => SKILL.md +15 -4

@@ 10,11 10,21 @@ description: |
 allowed-tools:
   - Bash
   - Read
+  - AskUserQuestion
 
 ---
 <!-- AUTO-GENERATED from SKILL.md.tmpl — do not edit directly -->
 <!-- Regenerate: bun run gen:skill-docs -->
 
+## Update Check (run first)
+
+```bash
+_UPD=$(~/.claude/skills/gstack/bin/gstack-update-check 2>/dev/null || .claude/skills/gstack/bin/gstack-update-check 2>/dev/null || true)
+[ -n "$_UPD" ] && echo "$_UPD" || true
+```
+
+If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (AskUserQuestion → upgrade if yes, `touch ~/.gstack/last-update-check` if no). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.
+
 # gstack browse: QA Testing & Dogfooding
 
 Persistent headless Chromium. First call auto-starts (~3s), then ~100-200ms per command.


@@ 23,8 33,11 @@ Auto-shuts down after 30 min idle. State persists between calls (cookies, tabs, 
 ## SETUP (run this check BEFORE any browse command)
 
 ```bash
-B=$(browse/bin/find-browse 2>/dev/null || ~/.claude/skills/gstack/browse/bin/find-browse 2>/dev/null)
-if [ -n "$B" ]; then
+_ROOT=$(git rev-parse --show-toplevel 2>/dev/null)
+B=""
+[ -n "$_ROOT" ] && [ -x "$_ROOT/.claude/skills/gstack/browse/dist/browse" ] && B="$_ROOT/.claude/skills/gstack/browse/dist/browse"
+[ -z "$B" ] && B=~/.claude/skills/gstack/browse/dist/browse
+if [ -x "$B" ]; then
   echo "READY: $B"
 else
   echo "NEEDS_SETUP"


@@ 48,8 61,6 @@ If `NEEDS_SETUP`:
 ### Test a user flow (login, signup, checkout, etc.)
 
 ```bash
-B=~/.claude/skills/gstack/browse/dist/browse
-
 # 1. Go to the page
 $B goto https://app.example.com/login

M SKILL.md.tmpl => SKILL.md.tmpl +4 -17

@@ 10,29 10,18 @@ description: |
 allowed-tools:
   - Bash
   - Read
+  - AskUserQuestion
 
 ---
 
+{{UPDATE_CHECK}}
+
 # gstack browse: QA Testing & Dogfooding
 
 Persistent headless Chromium. First call auto-starts (~3s), then ~100-200ms per command.
 Auto-shuts down after 30 min idle. State persists between calls (cookies, tabs, sessions).
 
-## SETUP (run this check BEFORE any browse command)
-
-```bash
-B=$(browse/bin/find-browse 2>/dev/null || ~/.claude/skills/gstack/browse/bin/find-browse 2>/dev/null)
-if [ -n "$B" ]; then
-  echo "READY: $B"
-else
-  echo "NEEDS_SETUP"
-fi
-```
-
-If `NEEDS_SETUP`:
-1. Tell the user: "gstack browse needs a one-time build (~10 seconds). OK to proceed?" Then STOP and wait.
-2. Run: `cd <SKILL_DIR> && ./setup`
-3. If `bun` is not installed: `curl -fsSL https://bun.sh/install | bash`
+{{BROWSE_SETUP}}
 
 ## IMPORTANT
 


@@ 46,8 35,6 @@ If `NEEDS_SETUP`:
 ### Test a user flow (login, signup, checkout, etc.)
 
 ```bash
-B=~/.claude/skills/gstack/browse/dist/browse
-
 # 1. Go to the page
 $B goto https://app.example.com/login

A TODOS.md => TODOS.md +24 -0

@@ 0,0 1,24 @@
+# TODOS
+
+## Auto-upgrade mode (zero-prompt)
+
+**What:** Add a `GSTACK_AUTO_UPGRADE=1` env var or `~/.gstack/config` option that skips the AskUserQuestion prompt and upgrades automatically when a new version is detected.
+
+**Why:** Power users and CI environments may want zero-friction upgrades without being asked every time.
+
+**Context:** The current upgrade system (v0.3.4) always prompts via AskUserQuestion. This TODO adds an opt-in bypass. Implementation is ~10 lines in the preamble instructions: check for the env var/config before calling AskUserQuestion, and if set, go straight to the upgrade flow. Depends on the full upgrade system being stable first — wait for user feedback on the prompt-based flow before adding this.
+
+**Effort:** S (small)
+**Priority:** P3 (nice-to-have, revisit after adoption data)
+
+## Convert remaining skills to .tmpl files
+
+**What:** Convert ship/, review/, plan-ceo-review/, plan-eng-review/, retro/ SKILL.md files to .tmpl templates using the `{{UPDATE_CHECK}}` placeholder.
+
+**Why:** These 5 skills still have the update check preamble copy-pasted. When the preamble changes (like the `|| true` fix in v0.3.5), all 5 need manual updates. The `{{UPDATE_CHECK}}` resolver already exists in `scripts/gen-skill-docs.ts` — these skills just need to be converted.
+
+**Context:** The browse-using skills (SKILL.md, browse/, qa/, setup-browser-cookies/) were converted to .tmpl in v0.3.5. The remaining 5 skills only use `{{UPDATE_CHECK}}` (no `{{BROWSE_SETUP}}`), so the conversion is mechanical: replace the preamble with `{{UPDATE_CHECK}}`, add the path to `findTemplates()` in `scripts/gen-skill-docs.ts`, and commit both .tmpl + generated .md.
+
+**Depends on:** v0.3.5 shipping first (the `{{UPDATE_CHECK}}` resolver).
+**Effort:** S (small, ~20 min)
+**Priority:** P2 (prevents drift on next preamble change)

M VERSION => VERSION +1 -1

@@ 1,1 1,1 @@
-0.3.3
+0.3.5

A bin/gstack-update-check => bin/gstack-update-check +88 -0

@@ 0,0 1,88 @@
+#!/usr/bin/env bash
+# gstack-update-check — daily version check for all skills.
+#
+# Output (one line, or nothing):
+#   JUST_UPGRADED <old> <new>       — marker found from recent upgrade
+#   UPGRADE_AVAILABLE <old> <new>   — remote VERSION differs from local
+#   (nothing)                       — up to date or check skipped
+#
+# Env overrides (for testing):
+#   GSTACK_DIR          — override auto-detected gstack root
+#   GSTACK_REMOTE_URL   — override remote VERSION URL
+#   GSTACK_STATE_DIR    — override ~/.gstack state directory
+set -euo pipefail
+
+GSTACK_DIR="${GSTACK_DIR:-$(cd "$(dirname "$0")/.." && pwd)}"
+STATE_DIR="${GSTACK_STATE_DIR:-$HOME/.gstack}"
+CACHE_FILE="$STATE_DIR/last-update-check"
+MARKER_FILE="$STATE_DIR/just-upgraded-from"
+VERSION_FILE="$GSTACK_DIR/VERSION"
+REMOTE_URL="${GSTACK_REMOTE_URL:-https://raw.githubusercontent.com/garrytan/gstack/main/VERSION}"
+
+# ─── Step 1: Read local version ──────────────────────────────
+LOCAL=""
+if [ -f "$VERSION_FILE" ]; then
+  LOCAL="$(cat "$VERSION_FILE" 2>/dev/null | tr -d '[:space:]')"
+fi
+if [ -z "$LOCAL" ]; then
+  exit 0  # No VERSION file → skip check
+fi
+
+# ─── Step 2: Check "just upgraded" marker ─────────────────────
+if [ -f "$MARKER_FILE" ]; then
+  OLD="$(cat "$MARKER_FILE" 2>/dev/null | tr -d '[:space:]')"
+  rm -f "$MARKER_FILE"
+  mkdir -p "$STATE_DIR"
+  echo "UP_TO_DATE $LOCAL" > "$CACHE_FILE"
+  if [ -n "$OLD" ]; then
+    echo "JUST_UPGRADED $OLD $LOCAL"
+  fi
+  exit 0
+fi
+
+# ─── Step 3: Check cache freshness (24h = 1440 min) ──────────
+if [ -f "$CACHE_FILE" ]; then
+  # Cache is fresh if modified within 1440 minutes
+  STALE=$(find "$CACHE_FILE" -mmin +1440 2>/dev/null || true)
+  if [ -z "$STALE" ]; then
+    # Cache is fresh — read it
+    CACHED="$(cat "$CACHE_FILE" 2>/dev/null || true)"
+    case "$CACHED" in
+      UP_TO_DATE*)
+        exit 0
+        ;;
+      UPGRADE_AVAILABLE*)
+        # Verify local version still matches cached old version
+        CACHED_OLD="$(echo "$CACHED" | awk '{print $2}')"
+        if [ "$CACHED_OLD" = "$LOCAL" ]; then
+          echo "$CACHED"
+          exit 0
+        fi
+        # Local version changed (manual upgrade?) — fall through to re-check
+        ;;
+    esac
+  fi
+fi
+
+# ─── Step 4: Slow path — fetch remote version ────────────────
+mkdir -p "$STATE_DIR"
+
+REMOTE=""
+REMOTE="$(curl -sf --max-time 5 "$REMOTE_URL" 2>/dev/null || true)"
+REMOTE="$(echo "$REMOTE" | tr -d '[:space:]')"
+
+# Validate: must look like a version number (reject HTML error pages)
+if ! echo "$REMOTE" | grep -qE '^[0-9]+\.[0-9.]+$'; then
+  # Invalid or empty response — assume up to date
+  echo "UP_TO_DATE $LOCAL" > "$CACHE_FILE"
+  exit 0
+fi
+
+if [ "$LOCAL" = "$REMOTE" ]; then
+  echo "UP_TO_DATE $LOCAL" > "$CACHE_FILE"
+  exit 0
+fi
+
+# Versions differ — upgrade available
+echo "UPGRADE_AVAILABLE $LOCAL $REMOTE" > "$CACHE_FILE"
+echo "UPGRADE_AVAILABLE $LOCAL $REMOTE"

M browse/SKILL.md => browse/SKILL.md +29 -0

@@ 10,16 10,45 @@ description: |
 allowed-tools:
   - Bash
   - Read
+  - AskUserQuestion
 
 ---
 <!-- AUTO-GENERATED from SKILL.md.tmpl — do not edit directly -->
 <!-- Regenerate: bun run gen:skill-docs -->
 
+## Update Check (run first)
+
+```bash
+_UPD=$(~/.claude/skills/gstack/bin/gstack-update-check 2>/dev/null || .claude/skills/gstack/bin/gstack-update-check 2>/dev/null || true)
+[ -n "$_UPD" ] && echo "$_UPD" || true
+```
+
+If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (AskUserQuestion → upgrade if yes, `touch ~/.gstack/last-update-check` if no). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.
+
 # browse: QA Testing & Dogfooding
 
 Persistent headless Chromium. First call auto-starts (~3s), then ~100ms per command.
 State persists between calls (cookies, tabs, login sessions).
 
+## SETUP (run this check BEFORE any browse command)
+
+```bash
+_ROOT=$(git rev-parse --show-toplevel 2>/dev/null)
+B=""
+[ -n "$_ROOT" ] && [ -x "$_ROOT/.claude/skills/gstack/browse/dist/browse" ] && B="$_ROOT/.claude/skills/gstack/browse/dist/browse"
+[ -z "$B" ] && B=~/.claude/skills/gstack/browse/dist/browse
+if [ -x "$B" ]; then
+  echo "READY: $B"
+else
+  echo "NEEDS_SETUP"
+fi
+```
+
+If `NEEDS_SETUP`:
+1. Tell the user: "gstack browse needs a one-time build (~10 seconds). OK to proceed?" Then STOP and wait.
+2. Run: `cd <SKILL_DIR> && ./setup`
+3. If `bun` is not installed: `curl -fsSL https://bun.sh/install | bash`
+
 ## Core QA Patterns
 
 ### 1. Verify a page loads correctly

M browse/SKILL.md.tmpl => browse/SKILL.md.tmpl +5 -0

@@ 10,14 10,19 @@ description: |
 allowed-tools:
   - Bash
   - Read
+  - AskUserQuestion
 
 ---
 
+{{UPDATE_CHECK}}
+
 # browse: QA Testing & Dogfooding
 
 Persistent headless Chromium. First call auto-starts (~3s), then ~100ms per command.
 State persists between calls (cookies, tabs, login sessions).
 
+{{BROWSE_SETUP}}
+
 ## Core QA Patterns
 
 ### 1. Verify a page loads correctly

M browse/bin/find-browse => browse/bin/find-browse +2 -2

@@ 1,11 1,11 @@
 #!/bin/bash
 # Shim: delegates to compiled find-browse binary, falls back to basic discovery.
-# The compiled binary adds version checking and META signal support.
+# The compiled binary handles git root detection for workspace-local installs.
 DIR="$(cd "$(dirname "$0")/.." && pwd)/dist"
 if test -x "$DIR/find-browse"; then
   exec "$DIR/find-browse" "$@"
 fi
-# Fallback: basic discovery (no version check)
+# Fallback: basic discovery
 ROOT=$(git rev-parse --show-toplevel 2>/dev/null)
 if [ -n "$ROOT" ] && test -x "$ROOT/.claude/skills/gstack/browse/dist/browse"; then
   echo "$ROOT/.claude/skills/gstack/browse/dist/browse"

M browse/src/find-browse.ts => browse/src/find-browse.ts +3 -128

@@ 1,28 1,14 @@
 /**
- * find-browse — locate the gstack browse binary + check for updates.
+ * find-browse — locate the gstack browse binary.
  *
  * Compiled to browse/dist/find-browse (standalone binary, no bun runtime needed).
- *
- * Output protocol:
- *   Line 1: /path/to/binary              (always present)
- *   Line 2+: META:<TYPE> <json-payload>   (optional, 0 or more)
- *
- * META types:
- *   META:UPDATE_AVAILABLE — local binary is behind origin/main
- *
- * All version checks are best-effort: network failures, missing files, and
- * cache errors degrade gracefully to outputting only the binary path.
+ * Outputs the absolute path to the browse binary on stdout, or exits 1 if not found.
  */
 
 import { existsSync } from 'fs';
-import { readFileSync, writeFileSync } from 'fs';
-import { join, dirname } from 'path';
+import { join } from 'path';
 import { homedir } from 'os';
 
-const REPO_URL = 'https://github.com/garrytan/gstack.git';
-const CACHE_PATH = '/tmp/gstack-latest-version';
-const CACHE_TTL = 14400; // 4 hours in seconds
-
 // ─── Binary Discovery ───────────────────────────────────────────
 
 function getGitRoot(): string | null {


@@ 55,114 41,6 @@ export function locateBinary(): string | null {
   return null;
 }
 
-// ─── Version Check ──────────────────────────────────────────────
-
-interface CacheEntry {
-  sha: string;
-  timestamp: number;
-}
-
-function readCache(): CacheEntry | null {
-  try {
-    const content = readFileSync(CACHE_PATH, 'utf-8').trim();
-    const parts = content.split(/\s+/);
-    if (parts.length < 2) return null;
-    const sha = parts[0];
-    const timestamp = parseInt(parts[1], 10);
-    if (!sha || isNaN(timestamp)) return null;
-    // Validate SHA is hex
-    if (!/^[0-9a-f]{40}$/i.test(sha)) return null;
-    return { sha, timestamp };
-  } catch {
-    return null;
-  }
-}
-
-function writeCache(sha: string, timestamp: number): void {
-  try {
-    writeFileSync(CACHE_PATH, `${sha} ${timestamp}\n`);
-  } catch {
-    // Cache write failure is non-fatal
-  }
-}
-
-function fetchRemoteSHA(): string | null {
-  try {
-    const proc = Bun.spawnSync(['git', 'ls-remote', REPO_URL, 'refs/heads/main'], {
-      stdout: 'pipe',
-      stderr: 'pipe',
-      timeout: 10_000, // 10s timeout
-    });
-    if (proc.exitCode !== 0) return null;
-    const output = proc.stdout.toString().trim();
-    const sha = output.split(/\s+/)[0];
-    if (!sha || !/^[0-9a-f]{40}$/i.test(sha)) return null;
-    return sha;
-  } catch {
-    return null;
-  }
-}
-
-function resolveSkillDir(binaryPath: string): string | null {
-  const home = homedir();
-  const globalPrefix = join(home, '.claude', 'skills', 'gstack');
-  if (binaryPath.startsWith(globalPrefix)) return globalPrefix;
-
-  // Workspace-local: binary is at $ROOT/.claude/skills/gstack/browse/dist/browse
-  // Skill dir is $ROOT/.claude/skills/gstack
-  const parts = binaryPath.split('/.claude/skills/gstack/');
-  if (parts.length === 2) return parts[0] + '/.claude/skills/gstack';
-
-  return null;
-}
-
-export function checkVersion(binaryDir: string): string | null {
-  // Read local version
-  const versionFile = join(binaryDir, '.version');
-  let localSHA: string;
-  try {
-    localSHA = readFileSync(versionFile, 'utf-8').trim();
-  } catch {
-    return null; // No .version file → skip check
-  }
-  if (!localSHA) return null;
-
-  const now = Math.floor(Date.now() / 1000);
-
-  // Check cache
-  let remoteSHA: string | null = null;
-  const cache = readCache();
-  if (cache && (now - cache.timestamp) < CACHE_TTL) {
-    remoteSHA = cache.sha;
-  }
-
-  // Fetch from remote if cache miss
-  if (!remoteSHA) {
-    remoteSHA = fetchRemoteSHA();
-    if (remoteSHA) {
-      writeCache(remoteSHA, now);
-    }
-  }
-
-  if (!remoteSHA) return null; // Offline or error → skip check
-
-  // Compare
-  if (localSHA === remoteSHA) return null; // Up to date
-
-  // Determine skill directory for update command
-  const binaryPath = join(binaryDir, 'browse');
-  const skillDir = resolveSkillDir(binaryPath);
-  if (!skillDir) return null;
-
-  const payload = JSON.stringify({
-    current: localSHA.slice(0, 8),
-    latest: remoteSHA.slice(0, 8),
-    command: `cd ${skillDir} && git stash && git fetch origin && git reset --hard origin/main && ./setup`,
-  });
-
-  return `META:UPDATE_AVAILABLE ${payload}`;
-}
-
 // ─── Main ───────────────────────────────────────────────────────
 
 function main() {


@@ 173,9 51,6 @@ function main() {
   }
 
   console.log(bin);
-
-  const meta = checkVersion(dirname(bin));
-  if (meta) console.log(meta);
 }
 
 main();

M browse/test/find-browse.test.ts => browse/test/find-browse.test.ts +4 -124

@@ 1,130 1,10 @@
 /**
- * Tests for find-browse version check logic
- *
- * Tests the checkVersion() and locateBinary() functions directly.
- * Uses temp directories with mock .version files and cache files.
+ * Tests for find-browse binary locator.
  */
 
-import { describe, test, expect, beforeEach, afterEach } from 'bun:test';
-import { checkVersion, locateBinary } from '../src/find-browse';
-import { mkdtempSync, writeFileSync, rmSync, existsSync, mkdirSync } from 'fs';
-import { join } from 'path';
-import { tmpdir } from 'os';
-
-let tempDir: string;
-
-beforeEach(() => {
-  tempDir = mkdtempSync(join(tmpdir(), 'find-browse-test-'));
-});
-
-afterEach(() => {
-  rmSync(tempDir, { recursive: true, force: true });
-  // Clean up test cache
-  try { rmSync('/tmp/gstack-latest-version'); } catch {}
-});
-
-describe('checkVersion', () => {
-  test('returns null when .version file is missing', () => {
-    const result = checkVersion(tempDir);
-    expect(result).toBeNull();
-  });
-
-  test('returns null when .version file is empty', () => {
-    writeFileSync(join(tempDir, '.version'), '');
-    const result = checkVersion(tempDir);
-    expect(result).toBeNull();
-  });
-
-  test('returns null when .version has only whitespace', () => {
-    writeFileSync(join(tempDir, '.version'), '  \n');
-    const result = checkVersion(tempDir);
-    expect(result).toBeNull();
-  });
-
-  test('returns null when local SHA matches remote (cache hit)', () => {
-    const sha = 'a'.repeat(40);
-    writeFileSync(join(tempDir, '.version'), sha);
-    // Write cache with same SHA, recent timestamp
-    const now = Math.floor(Date.now() / 1000);
-    writeFileSync('/tmp/gstack-latest-version', `${sha} ${now}\n`);
-
-    const result = checkVersion(tempDir);
-    expect(result).toBeNull();
-  });
-
-  test('returns META:UPDATE_AVAILABLE when SHAs differ (cache hit)', () => {
-    const localSha = 'a'.repeat(40);
-    const remoteSha = 'b'.repeat(40);
-    writeFileSync(join(tempDir, '.version'), localSha);
-    // Create a fake browse binary path so resolveSkillDir works
-    const browsePath = join(tempDir, 'browse');
-    writeFileSync(browsePath, '');
-    // Write cache with different SHA, recent timestamp
-    const now = Math.floor(Date.now() / 1000);
-    writeFileSync('/tmp/gstack-latest-version', `${remoteSha} ${now}\n`);
-
-    const result = checkVersion(tempDir);
-    // Result may be null if resolveSkillDir can't determine skill dir from temp path
-    // That's expected — the META signal requires a known skill dir path
-    if (result !== null) {
-      expect(result).toStartWith('META:UPDATE_AVAILABLE');
-      const jsonStr = result.replace('META:UPDATE_AVAILABLE ', '');
-      const payload = JSON.parse(jsonStr);
-      expect(payload.current).toBe('a'.repeat(8));
-      expect(payload.latest).toBe('b'.repeat(8));
-      expect(payload.command).toContain('git stash');
-      expect(payload.command).toContain('git reset --hard origin/main');
-      expect(payload.command).toContain('./setup');
-    }
-  });
-
-  test('uses cached SHA when cache is fresh (< 4hr)', () => {
-    const localSha = 'a'.repeat(40);
-    const remoteSha = 'a'.repeat(40);
-    writeFileSync(join(tempDir, '.version'), localSha);
-    // Cache is 1 hour old — should still be valid
-    const oneHourAgo = Math.floor(Date.now() / 1000) - 3600;
-    writeFileSync('/tmp/gstack-latest-version', `${remoteSha} ${oneHourAgo}\n`);
-
-    const result = checkVersion(tempDir);
-    expect(result).toBeNull(); // SHAs match
-  });
-
-  test('treats expired cache as stale', () => {
-    const localSha = 'a'.repeat(40);
-    writeFileSync(join(tempDir, '.version'), localSha);
-    // Cache is 5 hours old — should be stale
-    const fiveHoursAgo = Math.floor(Date.now() / 1000) - 18000;
-    writeFileSync('/tmp/gstack-latest-version', `${'b'.repeat(40)} ${fiveHoursAgo}\n`);
-
-    // This will try git ls-remote which may fail in test env — that's OK
-    // The important thing is it doesn't use the stale cache value
-    const result = checkVersion(tempDir);
-    // Result depends on whether git ls-remote succeeds in test environment
-    // If offline, returns null (graceful degradation)
-    expect(result === null || typeof result === 'string').toBe(true);
-  });
-
-  test('handles corrupt cache file gracefully', () => {
-    const localSha = 'a'.repeat(40);
-    writeFileSync(join(tempDir, '.version'), localSha);
-    writeFileSync('/tmp/gstack-latest-version', 'garbage data here');
-
-    // Should not throw, should treat as stale
-    const result = checkVersion(tempDir);
-    expect(result === null || typeof result === 'string').toBe(true);
-  });
-
-  test('handles cache with invalid SHA gracefully', () => {
-    const localSha = 'a'.repeat(40);
-    writeFileSync(join(tempDir, '.version'), localSha);
-    writeFileSync('/tmp/gstack-latest-version', `not-a-sha ${Math.floor(Date.now() / 1000)}\n`);
-
-    // Invalid SHA should be treated as no cache
-    const result = checkVersion(tempDir);
-    expect(result === null || typeof result === 'string').toBe(true);
-  });
-});
+import { describe, test, expect } from 'bun:test';
+import { locateBinary } from '../src/find-browse';
+import { existsSync } from 'fs';
 
 describe('locateBinary', () => {
   test('returns null when no binary exists at known paths', () => {

A browse/test/gstack-update-check.test.ts => browse/test/gstack-update-check.test.ts +188 -0

@@ 0,0 1,188 @@
+/**
+ * Tests for bin/gstack-update-check bash script.
+ *
+ * Uses Bun.spawnSync to invoke the script with temp dirs and
+ * GSTACK_DIR / GSTACK_STATE_DIR / GSTACK_REMOTE_URL env overrides
+ * for full isolation.
+ */
+
+import { describe, test, expect, beforeEach, afterEach } from 'bun:test';
+import { mkdtempSync, writeFileSync, rmSync, existsSync, readFileSync } from 'fs';
+import { join } from 'path';
+import { tmpdir } from 'os';
+
+const SCRIPT = join(import.meta.dir, '..', '..', 'bin', 'gstack-update-check');
+
+let gstackDir: string;
+let stateDir: string;
+
+function run(extraEnv: Record<string, string> = {}) {
+  const result = Bun.spawnSync(['bash', SCRIPT], {
+    env: {
+      ...process.env,
+      GSTACK_DIR: gstackDir,
+      GSTACK_STATE_DIR: stateDir,
+      GSTACK_REMOTE_URL: `file://${join(gstackDir, 'REMOTE_VERSION')}`,
+      ...extraEnv,
+    },
+    stdout: 'pipe',
+    stderr: 'pipe',
+  });
+  return {
+    exitCode: result.exitCode,
+    stdout: result.stdout.toString().trim(),
+    stderr: result.stderr.toString().trim(),
+  };
+}
+
+beforeEach(() => {
+  gstackDir = mkdtempSync(join(tmpdir(), 'gstack-upd-test-'));
+  stateDir = mkdtempSync(join(tmpdir(), 'gstack-state-test-'));
+});
+
+afterEach(() => {
+  rmSync(gstackDir, { recursive: true, force: true });
+  rmSync(stateDir, { recursive: true, force: true });
+});
+
+describe('gstack-update-check', () => {
+  // ─── Path A: No VERSION file ────────────────────────────────
+  test('exits 0 with no output when VERSION file is missing', () => {
+    const { exitCode, stdout } = run();
+    expect(exitCode).toBe(0);
+    expect(stdout).toBe('');
+  });
+
+  // ─── Path B: Empty VERSION file ─────────────────────────────
+  test('exits 0 with no output when VERSION file is empty', () => {
+    writeFileSync(join(gstackDir, 'VERSION'), '');
+    const { exitCode, stdout } = run();
+    expect(exitCode).toBe(0);
+    expect(stdout).toBe('');
+  });
+
+  // ─── Path C: Just-upgraded marker ───────────────────────────
+  test('outputs JUST_UPGRADED and deletes marker', () => {
+    writeFileSync(join(gstackDir, 'VERSION'), '0.4.0\n');
+    writeFileSync(join(stateDir, 'just-upgraded-from'), '0.3.3\n');
+
+    const { exitCode, stdout } = run();
+    expect(exitCode).toBe(0);
+    expect(stdout).toBe('JUST_UPGRADED 0.3.3 0.4.0');
+    // Marker should be deleted
+    expect(existsSync(join(stateDir, 'just-upgraded-from'))).toBe(false);
+    // Cache should be written
+    const cache = readFileSync(join(stateDir, 'last-update-check'), 'utf-8');
+    expect(cache).toContain('UP_TO_DATE');
+  });
+
+  // ─── Path D1: Fresh cache, UP_TO_DATE ───────────────────────
+  test('exits silently when cache says UP_TO_DATE and is fresh', () => {
+    writeFileSync(join(gstackDir, 'VERSION'), '0.3.3\n');
+    writeFileSync(join(stateDir, 'last-update-check'), 'UP_TO_DATE 0.3.3');
+
+    const { exitCode, stdout } = run();
+    expect(exitCode).toBe(0);
+    expect(stdout).toBe('');
+  });
+
+  // ─── Path D2: Fresh cache, UPGRADE_AVAILABLE ────────────────
+  test('echoes cached UPGRADE_AVAILABLE when cache is fresh', () => {
+    writeFileSync(join(gstackDir, 'VERSION'), '0.3.3\n');
+    writeFileSync(join(stateDir, 'last-update-check'), 'UPGRADE_AVAILABLE 0.3.3 0.4.0');
+
+    const { exitCode, stdout } = run();
+    expect(exitCode).toBe(0);
+    expect(stdout).toBe('UPGRADE_AVAILABLE 0.3.3 0.4.0');
+  });
+
+  // ─── Path D3: Fresh cache, but local version changed ────────
+  test('re-checks when local version does not match cached old version', () => {
+    writeFileSync(join(gstackDir, 'VERSION'), '0.4.0\n');
+    // Cache says 0.3.3 → 0.4.0 but we're already on 0.4.0
+    writeFileSync(join(stateDir, 'last-update-check'), 'UPGRADE_AVAILABLE 0.3.3 0.4.0');
+    // Remote also says 0.4.0 — should be up to date
+    writeFileSync(join(gstackDir, 'REMOTE_VERSION'), '0.4.0\n');
+
+    const { exitCode, stdout } = run();
+    expect(exitCode).toBe(0);
+    expect(stdout).toBe(''); // Up to date after re-check
+    const cache = readFileSync(join(stateDir, 'last-update-check'), 'utf-8');
+    expect(cache).toContain('UP_TO_DATE');
+  });
+
+  // ─── Path E: Versions match (remote fetch) ─────────────────
+  test('writes UP_TO_DATE cache when versions match', () => {
+    writeFileSync(join(gstackDir, 'VERSION'), '0.3.3\n');
+    writeFileSync(join(gstackDir, 'REMOTE_VERSION'), '0.3.3\n');
+
+    const { exitCode, stdout } = run();
+    expect(exitCode).toBe(0);
+    expect(stdout).toBe('');
+    const cache = readFileSync(join(stateDir, 'last-update-check'), 'utf-8');
+    expect(cache).toContain('UP_TO_DATE');
+  });
+
+  // ─── Path F: Versions differ (remote fetch) ─────────────────
+  test('outputs UPGRADE_AVAILABLE when versions differ', () => {
+    writeFileSync(join(gstackDir, 'VERSION'), '0.3.3\n');
+    writeFileSync(join(gstackDir, 'REMOTE_VERSION'), '0.4.0\n');
+
+    const { exitCode, stdout } = run();
+    expect(exitCode).toBe(0);
+    expect(stdout).toBe('UPGRADE_AVAILABLE 0.3.3 0.4.0');
+    const cache = readFileSync(join(stateDir, 'last-update-check'), 'utf-8');
+    expect(cache).toContain('UPGRADE_AVAILABLE 0.3.3 0.4.0');
+  });
+
+  // ─── Path G: Invalid remote response ────────────────────────
+  test('treats invalid remote response as up to date', () => {
+    writeFileSync(join(gstackDir, 'VERSION'), '0.3.3\n');
+    writeFileSync(join(gstackDir, 'REMOTE_VERSION'), '<html>404 Not Found</html>\n');
+
+    const { exitCode, stdout } = run();
+    expect(exitCode).toBe(0);
+    expect(stdout).toBe('');
+    const cache = readFileSync(join(stateDir, 'last-update-check'), 'utf-8');
+    expect(cache).toContain('UP_TO_DATE');
+  });
+
+  // ─── Path H: Curl fails (bad URL) ──────────────────────────
+  test('exits silently when remote URL is unreachable', () => {
+    writeFileSync(join(gstackDir, 'VERSION'), '0.3.3\n');
+
+    const { exitCode, stdout } = run({
+      GSTACK_REMOTE_URL: 'file:///nonexistent/path/VERSION',
+    });
+    expect(exitCode).toBe(0);
+    expect(stdout).toBe('');
+    const cache = readFileSync(join(stateDir, 'last-update-check'), 'utf-8');
+    expect(cache).toContain('UP_TO_DATE');
+  });
+
+  // ─── Path I: Corrupt cache file ─────────────────────────────
+  test('falls through to remote fetch when cache is corrupt', () => {
+    writeFileSync(join(gstackDir, 'VERSION'), '0.3.3\n');
+    writeFileSync(join(stateDir, 'last-update-check'), 'garbage data here');
+    // Remote says same version — should end up UP_TO_DATE
+    writeFileSync(join(gstackDir, 'REMOTE_VERSION'), '0.3.3\n');
+
+    const { exitCode, stdout } = run();
+    expect(exitCode).toBe(0);
+    expect(stdout).toBe('');
+    // Cache should be overwritten with valid content
+    const cache = readFileSync(join(stateDir, 'last-update-check'), 'utf-8');
+    expect(cache).toContain('UP_TO_DATE');
+  });
+
+  // ─── State dir creation ─────────────────────────────────────
+  test('creates state dir if it does not exist', () => {
+    const newStateDir = join(stateDir, 'nested', 'dir');
+    writeFileSync(join(gstackDir, 'VERSION'), '0.3.3\n');
+    writeFileSync(join(gstackDir, 'REMOTE_VERSION'), '0.3.3\n');
+
+    const { exitCode } = run({ GSTACK_STATE_DIR: newStateDir });
+    expect(exitCode).toBe(0);
+    expect(existsSync(join(newStateDir, 'last-update-check'))).toBe(true);
+  });
+});

A gstack-upgrade/SKILL.md => gstack-upgrade/SKILL.md +112 -0

@@ 0,0 1,112 @@
+---
+name: gstack-upgrade
+version: 1.0.0
+description: |
+  Upgrade gstack to the latest version. Detects global vs vendored install,
+  runs the upgrade, and shows what's new.
+allowed-tools:
+  - Bash
+  - Read
+  - AskUserQuestion
+---
+
+# /gstack-upgrade
+
+Upgrade gstack to the latest version and show what's new.
+
+## Inline upgrade flow
+
+This section is referenced by all skill preambles when they detect `UPGRADE_AVAILABLE`.
+
+### Step 1: Ask the user
+
+Use AskUserQuestion:
+- Question: "gstack **v{new}** is available (you're on v{old}). Upgrade now? Takes ~10 seconds."
+- Options: ["Yes, upgrade now", "Later (ask again tomorrow)"]
+
+**If "Later":** Run `touch ~/.gstack/last-update-check` to reset the 24h timer and continue with the current skill. Do not mention the upgrade again.
+
+### Step 2: Detect install type
+
+```bash
+if [ -d "$HOME/.claude/skills/gstack/.git" ]; then
+  INSTALL_TYPE="global-git"
+  INSTALL_DIR="$HOME/.claude/skills/gstack"
+elif [ -d ".claude/skills/gstack/.git" ]; then
+  INSTALL_TYPE="local-git"
+  INSTALL_DIR=".claude/skills/gstack"
+elif [ -d ".claude/skills/gstack" ]; then
+  INSTALL_TYPE="vendored"
+  INSTALL_DIR=".claude/skills/gstack"
+elif [ -d "$HOME/.claude/skills/gstack" ]; then
+  INSTALL_TYPE="vendored-global"
+  INSTALL_DIR="$HOME/.claude/skills/gstack"
+else
+  echo "ERROR: gstack not found"
+  exit 1
+fi
+echo "Install type: $INSTALL_TYPE at $INSTALL_DIR"
+```
+
+### Step 3: Save old version
+
+```bash
+OLD_VERSION=$(cat "$INSTALL_DIR/VERSION" 2>/dev/null || echo "unknown")
+```
+
+### Step 4: Upgrade
+
+**For git installs** (global-git, local-git):
+```bash
+cd "$INSTALL_DIR"
+STASH_OUTPUT=$(git stash 2>&1)
+git fetch origin
+git reset --hard origin/main
+./setup
+```
+If `$STASH_OUTPUT` contains "Saved working directory", warn the user: "Note: local changes were stashed. Run `git stash pop` in the skill directory to restore them."
+
+**For vendored installs** (vendored, vendored-global):
+```bash
+PARENT=$(dirname "$INSTALL_DIR")
+TMP_DIR=$(mktemp -d)
+git clone --depth 1 https://github.com/garrytan/gstack.git "$TMP_DIR/gstack"
+mv "$INSTALL_DIR" "$INSTALL_DIR.bak"
+mv "$TMP_DIR/gstack" "$INSTALL_DIR"
+cd "$INSTALL_DIR" && ./setup
+rm -rf "$INSTALL_DIR.bak" "$TMP_DIR"
+```
+
+### Step 5: Write marker + clear cache
+
+```bash
+mkdir -p ~/.gstack
+echo "$OLD_VERSION" > ~/.gstack/just-upgraded-from
+rm -f ~/.gstack/last-update-check
+```
+
+### Step 6: Show What's New
+
+Read `$INSTALL_DIR/CHANGELOG.md`. Find all version entries between the old version and the new version. Summarize as 5-7 bullets grouped by theme. Don't overwhelm — focus on user-facing changes. Skip internal refactors unless they're significant.
+
+Format:
+```
+gstack v{new} — upgraded from v{old}!
+
+What's new:
+- [bullet 1]
+- [bullet 2]
+- ...
+
+Happy shipping!
+```
+
+### Step 7: Continue
+
+After showing What's New, continue with whatever skill the user originally invoked. The upgrade is done — no further action needed.
+
+---
+
+## Standalone usage
+
+When invoked directly as `/gstack-upgrade` (not from a preamble), follow Steps 2-6 above. If already on the latest version, tell the user: "You're already on the latest version (v{version})."

M plan-ceo-review/SKILL.md => plan-ceo-review/SKILL.md +9 -0

@@ 14,6 14,15 @@ allowed-tools:
   - AskUserQuestion
 ---
 
+## Update Check (run first)
+
+```bash
+_UPD=$(~/.claude/skills/gstack/bin/gstack-update-check 2>/dev/null || .claude/skills/gstack/bin/gstack-update-check 2>/dev/null || true)
+[ -n "$_UPD" ] && echo "$_UPD"
+```
+
+If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (AskUserQuestion → upgrade if yes, `touch ~/.gstack/last-update-check` if no). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.
+
 # Mega Plan Review Mode
 
 ## Philosophy

M plan-eng-review/SKILL.md => plan-eng-review/SKILL.md +10 -0

@@ 10,8 10,18 @@ allowed-tools:
   - Grep
   - Glob
   - AskUserQuestion
+  - Bash
 ---
 
+## Update Check (run first)
+
+```bash
+_UPD=$(~/.claude/skills/gstack/bin/gstack-update-check 2>/dev/null || .claude/skills/gstack/bin/gstack-update-check 2>/dev/null || true)
+[ -n "$_UPD" ] && echo "$_UPD"
+```
+
+If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (AskUserQuestion → upgrade if yes, `touch ~/.gstack/last-update-check` if no). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.
+
 # Plan Review Mode
 
 Review this plan thoroughly before making any code changes. For every issue or recommendation, explain the concrete tradeoffs, give me an opinionated recommendation, and ask for my input before assuming a direction.

M qa/SKILL.md => qa/SKILL.md +26 -9

@@ 10,7 10,19 @@ allowed-tools:
   - Bash
   - Read
   - Write
+  - AskUserQuestion
 ---
+<!-- AUTO-GENERATED from SKILL.md.tmpl — do not edit directly -->
+<!-- Regenerate: bun run gen:skill-docs -->
+
+## Update Check (run first)
+
+```bash
+_UPD=$(~/.claude/skills/gstack/bin/gstack-update-check 2>/dev/null || .claude/skills/gstack/bin/gstack-update-check 2>/dev/null || true)
+[ -n "$_UPD" ] && echo "$_UPD" || true
+```
+
+If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (AskUserQuestion → upgrade if yes, `touch ~/.gstack/last-update-check` if no). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.
 
 # /qa: Systematic QA Testing
 


@@ 32,19 44,24 @@ You are a QA engineer. Test web applications like a real user — click everythi
 
 **Find the browse binary:**
 
+## SETUP (run this check BEFORE any browse command)
+
 ```bash
-BROWSE_OUTPUT=$(browse/bin/find-browse 2>/dev/null || ~/.claude/skills/gstack/browse/bin/find-browse 2>/dev/null)
-B=$(echo "$BROWSE_OUTPUT" | head -1)
-META=$(echo "$BROWSE_OUTPUT" | grep "^META:" || true)
-if [ -z "$B" ]; then
-  echo "ERROR: browse binary not found"
-  exit 1
+_ROOT=$(git rev-parse --show-toplevel 2>/dev/null)
+B=""
+[ -n "$_ROOT" ] && [ -x "$_ROOT/.claude/skills/gstack/browse/dist/browse" ] && B="$_ROOT/.claude/skills/gstack/browse/dist/browse"
+[ -z "$B" ] && B=~/.claude/skills/gstack/browse/dist/browse
+if [ -x "$B" ]; then
+  echo "READY: $B"
+else
+  echo "NEEDS_SETUP"
 fi
-echo "READY: $B"
-[ -n "$META" ] && echo "$META"
 ```
 
-If you see `META:UPDATE_AVAILABLE`: tell the user an update is available, STOP and wait for approval, then run the command from the META payload and re-run the setup check.
+If `NEEDS_SETUP`:
+1. Tell the user: "gstack browse needs a one-time build (~10 seconds). OK to proceed?" Then STOP and wait.
+2. Run: `cd <SKILL_DIR> && ./setup`
+3. If `bun` is not installed: `curl -fsSL https://bun.sh/install | bash`
 
 **Set up report directory (persistent, global):**

A qa/SKILL.md.tmpl => qa/SKILL.md.tmpl +337 -0

@@ 0,0 1,337 @@
+---
+name: qa
+version: 1.0.0
+description: |
+  Systematically QA test a web application. Use when asked to "qa", "QA", "test this site",
+  "find bugs", "dogfood", or review quality. Four modes: diff-aware (automatic on feature
+  branches — analyzes git diff, identifies affected pages, tests them), full (systematic
+  exploration), quick (30-second smoke test), regression (compare against baseline). Produces
+  structured report with health score, screenshots, and repro steps.
+allowed-tools:
+  - Bash
+  - Read
+  - Write
+  - AskUserQuestion
+---
+
+{{UPDATE_CHECK}}
+
+# /qa: Systematic QA Testing
+
+You are a QA engineer. Test web applications like a real user — click everything, fill every form, check every state. Produce a structured report with evidence.
+
+## Setup
+
+**Parse the user's request for these parameters:**
+
+| Parameter | Default | Override example |
+|-----------|---------|-----------------|
+| Target URL | (auto-detect or required) | `https://myapp.com`, `http://localhost:3000` |
+| Mode | full | `--quick`, `--regression .gstack/qa-reports/baseline.json` |
+| Output dir | `.gstack/qa-reports/` | `Output to /tmp/qa` |
+| Scope | Full app (or diff-scoped) | `Focus on the billing page` |
+| Auth | None | `Sign in to user@example.com`, `Import cookies from cookies.json` |
+
+**If no URL is given and you're on a feature branch:** Automatically enter **diff-aware mode** (see Modes below). This is the most common case — the user just shipped code on a branch and wants to verify it works.
+
+**Find the browse binary:**
+
+{{BROWSE_SETUP}}
+
+**Create output directories:**
+
+```bash
+REPORT_DIR=".gstack/qa-reports"
+mkdir -p "$REPORT_DIR/screenshots"
+```
+
+---
+
+## Modes
+
+### Diff-aware (automatic when on a feature branch with no URL)
+
+This is the **primary mode** for developers verifying their work. When the user says `/qa` without a URL and the repo is on a feature branch, automatically:
+
+1. **Analyze the branch diff** to understand what changed:
+   ```bash
+   git diff main...HEAD --name-only
+   git log main..HEAD --oneline
+   ```
+
+2. **Identify affected pages/routes** from the changed files:
+   - Controller/route files → which URL paths they serve
+   - View/template/component files → which pages render them
+   - Model/service files → which pages use those models (check controllers that reference them)
+   - CSS/style files → which pages include those stylesheets
+   - API endpoints → test them directly with `$B js "await fetch('/api/...')"`
+   - Static pages (markdown, HTML) → navigate to them directly
+
+3. **Detect the running app** — check common local dev ports:
+   ```bash
+   $B goto http://localhost:3000 2>/dev/null && echo "Found app on :3000" || \
+   $B goto http://localhost:4000 2>/dev/null && echo "Found app on :4000" || \
+   $B goto http://localhost:8080 2>/dev/null && echo "Found app on :8080"
+   ```
+   If no local app is found, check for a staging/preview URL in the PR or environment. If nothing works, ask the user for the URL.
+
+4. **Test each affected page/route:**
+   - Navigate to the page
+   - Take a screenshot
+   - Check console for errors
+   - If the change was interactive (forms, buttons, flows), test the interaction end-to-end
+   - Use `snapshot -D` before and after actions to verify the change had the expected effect
+
+5. **Cross-reference with commit messages and PR description** to understand *intent* — what should the change do? Verify it actually does that.
+
+6. **Report findings** scoped to the branch changes:
+   - "Changes tested: N pages/routes affected by this branch"
+   - For each: does it work? Screenshot evidence.
+   - Any regressions on adjacent pages?
+
+**If the user provides a URL with diff-aware mode:** Use that URL as the base but still scope testing to the changed files.
+
+### Full (default when URL is provided)
+Systematic exploration. Visit every reachable page. Document 5-10 well-evidenced issues. Produce health score. Takes 5-15 minutes depending on app size.
+
+### Quick (`--quick`)
+30-second smoke test. Visit homepage + top 5 navigation targets. Check: page loads? Console errors? Broken links? Produce health score. No detailed issue documentation.
+
+### Regression (`--regression <baseline>`)
+Run full mode, then load `baseline.json` from a previous run. Diff: which issues are fixed? Which are new? What's the score delta? Append regression section to report.
+
+---
+
+## Workflow
+
+### Phase 1: Initialize
+
+1. Find browse binary (see Setup above)
+2. Create output directories
+3. Copy report template from `qa/templates/qa-report-template.md` to output dir
+4. Start timer for duration tracking
+
+### Phase 2: Authenticate (if needed)
+
+**If the user specified auth credentials:**
+
+```bash
+$B goto <login-url>
+$B snapshot -i                    # find the login form
+$B fill @e3 "user@example.com"
+$B fill @e4 "[REDACTED]"         # NEVER include real passwords in report
+$B click @e5                      # submit
+$B snapshot -D                    # verify login succeeded
+```
+
+**If the user provided a cookie file:**
+
+```bash
+$B cookie-import cookies.json
+$B goto <target-url>
+```
+
+**If 2FA/OTP is required:** Ask the user for the code and wait.
+
+**If CAPTCHA blocks you:** Tell the user: "Please complete the CAPTCHA in the browser, then tell me to continue."
+
+### Phase 3: Orient
+
+Get a map of the application:
+
+```bash
+$B goto <target-url>
+$B snapshot -i -a -o "$REPORT_DIR/screenshots/initial.png"
+$B links                          # map navigation structure
+$B console --errors               # any errors on landing?
+```
+
+**Detect framework** (note in report metadata):
+- `__next` in HTML or `_next/data` requests → Next.js
+- `csrf-token` meta tag → Rails
+- `wp-content` in URLs → WordPress
+- Client-side routing with no page reloads → SPA
+
+**For SPAs:** The `links` command may return few results because navigation is client-side. Use `snapshot -i` to find nav elements (buttons, menu items) instead.
+
+### Phase 4: Explore
+
+Visit pages systematically. At each page:
+
+```bash
+$B goto <page-url>
+$B snapshot -i -a -o "$REPORT_DIR/screenshots/page-name.png"
+$B console --errors
+```
+
+Then follow the **per-page exploration checklist** (see `qa/references/issue-taxonomy.md`):
+
+1. **Visual scan** — Look at the annotated screenshot for layout issues
+2. **Interactive elements** — Click buttons, links, controls. Do they work?
+3. **Forms** — Fill and submit. Test empty, invalid, edge cases
+4. **Navigation** — Check all paths in and out
+5. **States** — Empty state, loading, error, overflow
+6. **Console** — Any new JS errors after interactions?
+7. **Responsiveness** — Check mobile viewport if relevant:
+   ```bash
+   $B viewport 375x812
+   $B screenshot "$REPORT_DIR/screenshots/page-mobile.png"
+   $B viewport 1280x720
+   ```
+
+**Depth judgment:** Spend more time on core features (homepage, dashboard, checkout, search) and less on secondary pages (about, terms, privacy).
+
+**Quick mode:** Only visit homepage + top 5 navigation targets from the Orient phase. Skip the per-page checklist — just check: loads? Console errors? Broken links visible?
+
+### Phase 5: Document
+
+Document each issue **immediately when found** — don't batch them.
+
+**Two evidence tiers:**
+
+**Interactive bugs** (broken flows, dead buttons, form failures):
+1. Take a screenshot before the action
+2. Perform the action
+3. Take a screenshot showing the result
+4. Use `snapshot -D` to show what changed
+5. Write repro steps referencing screenshots
+
+```bash
+$B screenshot "$REPORT_DIR/screenshots/issue-001-step-1.png"
+$B click @e5
+$B screenshot "$REPORT_DIR/screenshots/issue-001-result.png"
+$B snapshot -D
+```
+
+**Static bugs** (typos, layout issues, missing images):
+1. Take a single annotated screenshot showing the problem
+2. Describe what's wrong
+
+```bash
+$B snapshot -i -a -o "$REPORT_DIR/screenshots/issue-002.png"
+```
+
+**Write each issue to the report immediately** using the template format from `qa/templates/qa-report-template.md`.
+
+### Phase 6: Wrap Up
+
+1. **Compute health score** using the rubric below
+2. **Write "Top 3 Things to Fix"** — the 3 highest-severity issues
+3. **Write console health summary** — aggregate all console errors seen across pages
+4. **Update severity counts** in the summary table
+5. **Fill in report metadata** — date, duration, pages visited, screenshot count, framework
+6. **Save baseline** — write `baseline.json` with:
+   ```json
+   {
+     "date": "YYYY-MM-DD",
+     "url": "<target>",
+     "healthScore": N,
+     "issues": [{ "id": "ISSUE-001", "title": "...", "severity": "...", "category": "..." }],
+     "categoryScores": { "console": N, "links": N, ... }
+   }
+   ```
+
+**Regression mode:** After writing the report, load the baseline file. Compare:
+- Health score delta
+- Issues fixed (in baseline but not current)
+- New issues (in current but not baseline)
+- Append the regression section to the report
+
+---
+
+## Health Score Rubric
+
+Compute each category score (0-100), then take the weighted average.
+
+### Console (weight: 15%)
+- 0 errors → 100
+- 1-3 errors → 70
+- 4-10 errors → 40
+- 10+ errors → 10
+
+### Links (weight: 10%)
+- 0 broken → 100
+- Each broken link → -15 (minimum 0)
+
+### Per-Category Scoring (Visual, Functional, UX, Content, Performance, Accessibility)
+Each category starts at 100. Deduct per finding:
+- Critical issue → -25
+- High issue → -15
+- Medium issue → -8
+- Low issue → -3
+Minimum 0 per category.
+
+### Weights
+| Category | Weight |
+|----------|--------|
+| Console | 15% |
+| Links | 10% |
+| Visual | 10% |
+| Functional | 20% |
+| UX | 15% |
+| Performance | 10% |
+| Content | 5% |
+| Accessibility | 15% |
+
+### Final Score
+`score = Σ (category_score × weight)`
+
+---
+
+## Framework-Specific Guidance
+
+### Next.js
+- Check console for hydration errors (`Hydration failed`, `Text content did not match`)
+- Monitor `_next/data` requests in network — 404s indicate broken data fetching
+- Test client-side navigation (click links, don't just `goto`) — catches routing issues
+- Check for CLS (Cumulative Layout Shift) on pages with dynamic content
+
+### Rails
+- Check for N+1 query warnings in console (if development mode)
+- Verify CSRF token presence in forms
+- Test Turbo/Stimulus integration — do page transitions work smoothly?
+- Check for flash messages appearing and dismissing correctly
+
+### WordPress
+- Check for plugin conflicts (JS errors from different plugins)
+- Verify admin bar visibility for logged-in users
+- Test REST API endpoints (`/wp-json/`)
+- Check for mixed content warnings (common with WP)
+
+### General SPA (React, Vue, Angular)
+- Use `snapshot -i` for navigation — `links` command misses client-side routes
+- Check for stale state (navigate away and back — does data refresh?)
+- Test browser back/forward — does the app handle history correctly?
+- Check for memory leaks (monitor console after extended use)
+
+---
+
+## Important Rules
+
+1. **Repro is everything.** Every issue needs at least one screenshot. No exceptions.
+2. **Verify before documenting.** Retry the issue once to confirm it's reproducible, not a fluke.
+3. **Never include credentials.** Write `[REDACTED]` for passwords in repro steps.
+4. **Write incrementally.** Append each issue to the report as you find it. Don't batch.
+5. **Never read source code.** Test as a user, not a developer.
+6. **Check console after every interaction.** JS errors that don't surface visually are still bugs.
+7. **Test like a user.** Use realistic data. Walk through complete workflows end-to-end.
+8. **Depth over breadth.** 5-10 well-documented issues with evidence > 20 vague descriptions.
+9. **Never delete output files.** Screenshots and reports accumulate — that's intentional.
+10. **Use `snapshot -C` for tricky UIs.** Finds clickable divs that the accessibility tree misses.
+
+---
+
+## Output Structure
+
+```
+.gstack/qa-reports/
+├── qa-report-{domain}-{YYYY-MM-DD}.md    # Structured report
+├── screenshots/
+│   ├── initial.png                        # Landing page annotated screenshot
+│   ├── issue-001-step-1.png               # Per-issue evidence
+│   ├── issue-001-result.png
+│   └── ...
+└── baseline.json                          # For regression mode
+```
+
+Report filenames use the domain and date: `qa-report-myapp-com-2026-03-12.md`

M retro/SKILL.md => retro/SKILL.md +10 -0

@@ 10,8 10,18 @@ allowed-tools:
   - Read
   - Write
   - Glob
+  - AskUserQuestion
 ---
 
+## Update Check (run first)
+
+```bash
+_UPD=$(~/.claude/skills/gstack/bin/gstack-update-check 2>/dev/null || .claude/skills/gstack/bin/gstack-update-check 2>/dev/null || true)
+[ -n "$_UPD" ] && echo "$_UPD"
+```
+
+If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (AskUserQuestion → upgrade if yes, `touch ~/.gstack/last-update-check` if no). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.
+
 # /retro — Weekly Engineering Retrospective
 
 Generates a comprehensive engineering retrospective analyzing commit history, work patterns, and code quality metrics. Team-aware: identifies the user running the command, then analyzes every contributor with per-person praise and growth opportunities. Designed for a senior IC/CTO-level builder using Claude Code as a force multiplier.

M review/SKILL.md => review/SKILL.md +9 -0

@@ 14,6 14,15 @@ allowed-tools:
   - AskUserQuestion
 ---
 
+## Update Check (run first)
+
+```bash
+_UPD=$(~/.claude/skills/gstack/bin/gstack-update-check 2>/dev/null || .claude/skills/gstack/bin/gstack-update-check 2>/dev/null || true)
+[ -n "$_UPD" ] && echo "$_UPD"
+```
+
+If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (AskUserQuestion → upgrade if yes, `touch ~/.gstack/last-update-check` if no). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.
+
 # Pre-Landing PR Review
 
 You are running the `/review` workflow. Analyze the current branch's diff against main for structural issues that tests don't catch.

M scripts/gen-skill-docs.ts => scripts/gen-skill-docs.ts +36 -0

@@ 94,9 94,43 @@ function generateSnapshotFlags(): string {
   return lines.join('\n');
 }
 
+function generateUpdateCheck(): string {
+  return `## Update Check (run first)
+
+\`\`\`bash
+_UPD=$(~/.claude/skills/gstack/bin/gstack-update-check 2>/dev/null || .claude/skills/gstack/bin/gstack-update-check 2>/dev/null || true)
+[ -n "$_UPD" ] && echo "$_UPD" || true
+\`\`\`
+
+If output shows \`UPGRADE_AVAILABLE <old> <new>\`: read \`~/.claude/skills/gstack/gstack-upgrade/SKILL.md\` and follow the "Inline upgrade flow" (AskUserQuestion → upgrade if yes, \`touch ~/.gstack/last-update-check\` if no). If \`JUST_UPGRADED <from> <to>\`: tell user "Running gstack v{to} (just updated!)" and continue.`;
+}
+
+function generateBrowseSetup(): string {
+  return `## SETUP (run this check BEFORE any browse command)
+
+\`\`\`bash
+_ROOT=$(git rev-parse --show-toplevel 2>/dev/null)
+B=""
+[ -n "$_ROOT" ] && [ -x "$_ROOT/.claude/skills/gstack/browse/dist/browse" ] && B="$_ROOT/.claude/skills/gstack/browse/dist/browse"
+[ -z "$B" ] && B=~/.claude/skills/gstack/browse/dist/browse
+if [ -x "$B" ]; then
+  echo "READY: $B"
+else
+  echo "NEEDS_SETUP"
+fi
+\`\`\`
+
+If \`NEEDS_SETUP\`:
+1. Tell the user: "gstack browse needs a one-time build (~10 seconds). OK to proceed?" Then STOP and wait.
+2. Run: \`cd <SKILL_DIR> && ./setup\`
+3. If \`bun\` is not installed: \`curl -fsSL https://bun.sh/install | bash\``;
+}
+
 const RESOLVERS: Record<string, () => string> = {
   COMMAND_REFERENCE: generateCommandReference,
   SNAPSHOT_FLAGS: generateSnapshotFlags,
+  UPDATE_CHECK: generateUpdateCheck,
+  BROWSE_SETUP: generateBrowseSetup,
 };
 
 // ─── Template Processing ────────────────────────────────────


@@ 141,6 175,8 @@ function findTemplates(): string[] {
   const candidates = [
     path.join(ROOT, 'SKILL.md.tmpl'),
     path.join(ROOT, 'browse', 'SKILL.md.tmpl'),
+    path.join(ROOT, 'qa', 'SKILL.md.tmpl'),
+    path.join(ROOT, 'setup-browser-cookies', 'SKILL.md.tmpl'),
   ];
   for (const p of candidates) {
     if (fs.existsSync(p)) templates.push(p);

M setup => setup +7 -0

@@ 88,3 88,10 @@ else
   echo "  browse: $BROWSE_BIN"
   echo "  (skipped skill symlinks — not inside .claude/skills/)"
 fi
+
+# 4. First-time welcome + legacy cleanup
+if [ ! -d "$HOME/.gstack" ]; then
+  mkdir -p "$HOME/.gstack"
+  echo "  Welcome! Run /gstack-upgrade anytime to stay current."
+fi
+rm -f /tmp/gstack-latest-version

M setup-browser-cookies/SKILL.md => setup-browser-cookies/SKILL.md +19 -7

@@ 8,7 8,19 @@ description: |
 allowed-tools:
   - Bash
   - Read
+  - AskUserQuestion
 ---
+<!-- AUTO-GENERATED from SKILL.md.tmpl — do not edit directly -->
+<!-- Regenerate: bun run gen:skill-docs -->
+
+## Update Check (run first)
+
+```bash
+_UPD=$(~/.claude/skills/gstack/bin/gstack-update-check 2>/dev/null || .claude/skills/gstack/bin/gstack-update-check 2>/dev/null || true)
+[ -n "$_UPD" ] && echo "$_UPD" || true
+```
+
+If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (AskUserQuestion → upgrade if yes, `touch ~/.gstack/last-update-check` if no). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.
 
 # Setup Browser Cookies
 


@@ 25,13 37,15 @@ Import logged-in sessions from your real Chromium browser into the headless brow
 
 ### 1. Find the browse binary
 
+## SETUP (run this check BEFORE any browse command)
+
 ```bash
-BROWSE_OUTPUT=$(browse/bin/find-browse 2>/dev/null || ~/.claude/skills/gstack/browse/bin/find-browse 2>/dev/null)
-B=$(echo "$BROWSE_OUTPUT" | head -1)
-META=$(echo "$BROWSE_OUTPUT" | grep "^META:" || true)
-if [ -n "$B" ]; then
+_ROOT=$(git rev-parse --show-toplevel 2>/dev/null)
+B=""
+[ -n "$_ROOT" ] && [ -x "$_ROOT/.claude/skills/gstack/browse/dist/browse" ] && B="$_ROOT/.claude/skills/gstack/browse/dist/browse"
+[ -z "$B" ] && B=~/.claude/skills/gstack/browse/dist/browse
+if [ -x "$B" ]; then
   echo "READY: $B"
-  [ -n "$META" ] && echo "$META"
 else
   echo "NEEDS_SETUP"
 fi


@@ 42,8 56,6 @@ If `NEEDS_SETUP`:
 2. Run: `cd <SKILL_DIR> && ./setup`
 3. If `bun` is not installed: `curl -fsSL https://bun.sh/install | bash`
 
-If you see `META:UPDATE_AVAILABLE`: tell the user an update is available, STOP and wait for approval, then run the command from the META payload and re-run the setup check.
-
 ### 2. Open the cookie picker
 
 ```bash

A setup-browser-cookies/SKILL.md.tmpl => setup-browser-cookies/SKILL.md.tmpl +73 -0

@@ 0,0 1,73 @@
+---
+name: setup-browser-cookies
+version: 1.0.0
+description: |
+  Import cookies from your real browser (Comet, Chrome, Arc, Brave, Edge) into the
+  headless browse session. Opens an interactive picker UI where you select which
+  cookie domains to import. Use before QA testing authenticated pages.
+allowed-tools:
+  - Bash
+  - Read
+  - AskUserQuestion
+---
+
+{{UPDATE_CHECK}}
+
+# Setup Browser Cookies
+
+Import logged-in sessions from your real Chromium browser into the headless browse session.
+
+## How it works
+
+1. Find the browse binary
+2. Run `cookie-import-browser` to detect installed browsers and open the picker UI
+3. User selects which cookie domains to import in their browser
+4. Cookies are decrypted and loaded into the Playwright session
+
+## Steps
+
+### 1. Find the browse binary
+
+{{BROWSE_SETUP}}
+
+### 2. Open the cookie picker
+
+```bash
+$B cookie-import-browser
+```
+
+This auto-detects installed Chromium browsers (Comet, Chrome, Arc, Brave, Edge) and opens
+an interactive picker UI in your default browser where you can:
+- Switch between installed browsers
+- Search domains
+- Click "+" to import a domain's cookies
+- Click trash to remove imported cookies
+
+Tell the user: **"Cookie picker opened — select the domains you want to import in your browser, then tell me when you're done."**
+
+### 3. Direct import (alternative)
+
+If the user specifies a domain directly (e.g., `/setup-browser-cookies github.com`), skip the UI:
+
+```bash
+$B cookie-import-browser comet --domain github.com
+```
+
+Replace `comet` with the appropriate browser if specified.
+
+### 4. Verify
+
+After the user confirms they're done:
+
+```bash
+$B cookies
+```
+
+Show the user a summary of imported cookies (domain counts).
+
+## Notes
+
+- First import per browser may trigger a macOS Keychain dialog — click "Allow" / "Always Allow"
+- Cookie picker is served on the same port as the browse server (no extra process)
+- Only domain names and cookie counts are shown in the UI — no cookie values are exposed
+- The browse session persists cookies between commands, so imported cookies work immediately

M ship/SKILL.md => ship/SKILL.md +9 -0

@@ 13,6 13,15 @@ allowed-tools:
   - AskUserQuestion
 ---
 
+## Update Check (run first)
+
+```bash
+_UPD=$(~/.claude/skills/gstack/bin/gstack-update-check 2>/dev/null || .claude/skills/gstack/bin/gstack-update-check 2>/dev/null || true)
+[ -n "$_UPD" ] && echo "$_UPD"
+```
+
+If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (AskUserQuestion → upgrade if yes, `touch ~/.gstack/last-update-check` if no). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.
+
 # Ship: Fully Automated Ship Workflow
 
 You are running the `/ship` workflow. This is a **non-interactive, fully automated** workflow. Do NOT ask for confirmation at any step. The user said `/ship` which means DO IT. Run straight through and output the PR URL at the end.

M test/helpers/session-runner.ts => test/helpers/session-runner.ts +1 -0

@@ 33,6 33,7 @@ const BROWSE_ERROR_PATTERNS = [
   /Exit code 1/,
   /ERROR: browse binary not found/,
   /Server failed to start/,
+  /no such file or directory.*browse/i,
 ];
 
 export async function runSkillTest(options: {

M test/skill-e2e.test.ts => test/skill-e2e.test.ts +84 -1

@@ 132,6 132,89 @@ Report what each command returned.`,
     expect(result.browseErrors).toHaveLength(0);
     expect(result.exitReason).toBe('success');
   }, 90_000);
+
+
+  test('agent discovers browse binary via SKILL.md setup block', async () => {
+    const skillMd = fs.readFileSync(path.join(ROOT, 'SKILL.md'), 'utf-8');
+    const setupStart = skillMd.indexOf('## SETUP');
+    const setupEnd = skillMd.indexOf('## IMPORTANT');
+    const setupBlock = skillMd.slice(setupStart, setupEnd);
+
+    // Guard: verify we extracted a valid setup block
+    expect(setupBlock).toContain('browse/dist/browse');
+
+    const result = await runSkillTest({
+      prompt: `Follow these instructions to find the browse binary and run a basic command.
+
+${setupBlock}
+
+After finding the binary, run: $B goto ${testServer.url}
+Then run: $B text
+Report whether it worked.`,
+      workingDirectory: tmpDir,
+      maxTurns: 10,
+      timeout: 60_000,
+    });
+
+    expect(result.browseErrors).toHaveLength(0);
+    expect(result.exitReason).toBe('success');
+  }, 90_000);
+
+  test('SKILL.md setup block shows NEEDS_SETUP when binary missing', async () => {
+    // Create a tmpdir with no browse binary
+    const emptyDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-empty-'));
+
+    const skillMd = fs.readFileSync(path.join(ROOT, 'SKILL.md'), 'utf-8');
+    const setupStart = skillMd.indexOf('## SETUP');
+    const setupEnd = skillMd.indexOf('## IMPORTANT');
+    const setupBlock = skillMd.slice(setupStart, setupEnd);
+
+    const result = await runSkillTest({
+      prompt: `Follow these instructions exactly. Run the bash code block below and report what it outputs.
+
+${setupBlock}
+
+Report the exact output. Do NOT try to fix or install anything — just report what you see.`,
+      workingDirectory: emptyDir,
+      maxTurns: 5,
+      timeout: 30_000,
+    });
+
+    // Agent should see NEEDS_SETUP (not crash or guess wrong paths)
+    const allText = result.output || '';
+    expect(allText).toContain('NEEDS_SETUP');
+
+    // Clean up
+    try { fs.rmSync(emptyDir, { recursive: true, force: true }); } catch {}
+  }, 60_000);
+
+  test('SKILL.md setup block works outside git repo', async () => {
+    // Create a tmpdir outside any git repo
+    const nonGitDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-nogit-'));
+
+    const skillMd = fs.readFileSync(path.join(ROOT, 'SKILL.md'), 'utf-8');
+    const setupStart = skillMd.indexOf('## SETUP');
+    const setupEnd = skillMd.indexOf('## IMPORTANT');
+    const setupBlock = skillMd.slice(setupStart, setupEnd);
+
+    const result = await runSkillTest({
+      prompt: `Follow these instructions exactly. Run the bash code block below and report what it outputs.
+
+${setupBlock}
+
+Report the exact output — either "READY: <path>" or "NEEDS_SETUP".`,
+      workingDirectory: nonGitDir,
+      maxTurns: 5,
+      timeout: 30_000,
+    });
+
+    // Should either find global binary (READY) or show NEEDS_SETUP — not crash
+    const allText = result.output || '';
+    expect(allText).toMatch(/READY|NEEDS_SETUP/);
+
+    // Clean up
+    try { fs.rmSync(nonGitDir, { recursive: true, force: true }); } catch {}
+  }, 60_000);
 });
 
 // --- B4: QA skill E2E ---


@@ 264,7 347,7 @@ describeOutcome('Planted-bug outcome evals', () => {
     fs.mkdirSync(path.join(reportDir, 'screenshots'), { recursive: true });
     const reportPath = path.join(reportDir, 'qa-report.md');
 
-    // Phase 1: Agent SDK runs /qa Standard
+    // Phase 1: runs /qa Standard
     const result = await runSkillTest({
       prompt: `You have a browse binary at ${browseBin}. Assign it to B variable like: B="${browseBin}"

M test/skill-llm-eval.test.ts => test/skill-llm-eval.test.ts +13 -0

@@ 65,6 65,19 @@ describeEval('LLM-as-judge quality evals', () => {
     expect(scores.actionability).toBeGreaterThanOrEqual(4);
   }, 30_000);
 
+  test('setup block scores >= 4 on actionability and clarity', async () => {
+    const content = fs.readFileSync(path.join(ROOT, 'SKILL.md'), 'utf-8');
+    const setupStart = content.indexOf('## SETUP');
+    const setupEnd = content.indexOf('## IMPORTANT');
+    const section = content.slice(setupStart, setupEnd);
+
+    const scores = await judge('setup/binary discovery instructions', section);
+    console.log('Setup block scores:', JSON.stringify(scores, null, 2));
+
+    expect(scores.actionability).toBeGreaterThanOrEqual(4);
+    expect(scores.clarity).toBeGreaterThanOrEqual(4);
+  }, 30_000);
+
   test('regression check: compare branch vs baseline quality', async () => {
     // This test compares the generated output against the hand-maintained
     // baseline from main. The generated version should score equal or higher.