# gstack development ## Commands ```bash bun install # install dependencies bun test # run free tests (browse + snapshot + skill validation) bun run test:evals # run paid evals: LLM judge + E2E (~$4/run) bun run test:e2e # run E2E tests only (~$3.85/run) bun run dev # run CLI in dev mode, e.g. bun run dev goto https://example.com bun run build # gen docs + compile binaries bun run gen:skill-docs # regenerate SKILL.md files from templates bun run skill:check # health dashboard for all skills bun run dev:skill # watch mode: auto-regen + validate on change bun run eval:list # list all eval runs from ~/.gstack-dev/evals/ bun run eval:compare # compare two eval runs (auto-picks most recent) bun run eval:summary # aggregate stats across all eval runs ``` `test:evals` requires `ANTHROPIC_API_KEY`. E2E tests stream progress in real-time (tool-by-tool via `--output-format stream-json --verbose`). Results are persisted to `~/.gstack-dev/evals/` with auto-comparison against the previous run. ## Project structure ``` gstack/ ├── browse/ # Headless browser CLI (Playwright) │ ├── src/ # CLI + server + commands │ │ ├── commands.ts # Command registry (single source of truth) │ │ └── snapshot.ts # SNAPSHOT_FLAGS metadata array │ ├── test/ # Integration tests + fixtures │ └── dist/ # Compiled binary ├── scripts/ # Build + DX tooling │ ├── gen-skill-docs.ts # Template → SKILL.md generator │ ├── skill-check.ts # Health dashboard │ └── dev-skill.ts # Watch mode ├── test/ # Skill validation + eval tests │ ├── helpers/ # skill-parser.ts, session-runner.ts, llm-judge.ts, eval-store.ts │ ├── fixtures/ # Ground truth JSON, planted-bug fixtures, eval baselines │ ├── skill-validation.test.ts # Tier 1: static validation (free, <1s) │ ├── gen-skill-docs.test.ts # Tier 1: generator quality (free, <1s) │ ├── skill-llm-eval.test.ts # Tier 3: LLM-as-judge (~$0.15/run) │ └── skill-e2e.test.ts # Tier 2: E2E via claude -p (~$3.85/run) ├── qa-only/ # /qa-only skill (report-only QA, no fixes) ├── ship/ # Ship workflow skill ├── review/ # PR review skill ├── plan-ceo-review/ # /plan-ceo-review skill ├── plan-eng-review/ # /plan-eng-review skill ├── retro/ # Retrospective skill ├── document-release/ # /document-release skill (post-ship doc updates) ├── setup # One-time setup: build binary + symlink skills ├── SKILL.md # Generated from SKILL.md.tmpl (don't edit directly) ├── SKILL.md.tmpl # Template: edit this, run gen:skill-docs └── package.json # Build scripts for browse ``` ## SKILL.md workflow SKILL.md files are **generated** from `.tmpl` templates. To update docs: 1. Edit the `.tmpl` file (e.g. `SKILL.md.tmpl` or `browse/SKILL.md.tmpl`) 2. Run `bun run gen:skill-docs` (or `bun run build` which does it automatically) 3. Commit both the `.tmpl` and generated `.md` files To add a new browse command: add it to `browse/src/commands.ts` and rebuild. To add a snapshot flag: add it to `SNAPSHOT_FLAGS` in `browse/src/snapshot.ts` and rebuild. ## Writing SKILL templates SKILL.md.tmpl files are **prompt templates read by Claude**, not bash scripts. Each bash code block runs in a separate shell — variables do not persist between blocks. Rules: - **Use natural language for logic and state.** Don't use shell variables to pass state between code blocks. Instead, tell Claude what to remember and reference it in prose (e.g., "the base branch detected in Step 0"). - **Don't hardcode branch names.** Detect `main`/`master`/etc dynamically via `gh pr view` or `gh repo view`. Use `{{BASE_BRANCH_DETECT}}` for PR-targeting skills. Use "the base branch" in prose, `` in code block placeholders. - **Keep bash blocks self-contained.** Each code block should work independently. If a block needs context from a previous step, restate it in the prose above. - **Express conditionals as English.** Instead of nested `if/elif/else` in bash, write numbered decision steps: "1. If X, do Y. 2. Otherwise, do Z." ## Browser interaction When you need to interact with a browser (QA, dogfooding, cookie setup), use the `/browse` skill or run the browse binary directly via `$B `. NEVER use `mcp__claude-in-chrome__*` tools — they are slow, unreliable, and not what this project uses. ## Vendored symlink awareness When developing gstack, `.claude/skills/gstack` may be a symlink back to this working directory (gitignored). This means skill changes are **live immediately** — great for rapid iteration, risky during big refactors where half-written skills could break other Claude Code sessions using gstack concurrently. **Check once per session:** Run `ls -la .claude/skills/gstack` to see if it's a symlink or a real copy. If it's a symlink to your working directory, be aware that: - Template changes + `bun run gen:skill-docs` immediately affect all gstack invocations - Breaking changes to SKILL.md.tmpl files can break concurrent gstack sessions - During large refactors, remove the symlink (`rm .claude/skills/gstack`) so the global install at `~/.claude/skills/gstack/` is used instead **For plan reviews:** When reviewing plans that modify skill templates or the gen-skill-docs pipeline, consider whether the changes should be tested in isolation before going live (especially if the user is actively using gstack in other windows). ## CHANGELOG style CHANGELOG.md is **for users**, not contributors. Write it like product release notes: - Lead with what the user can now **do** that they couldn't before. Sell the feature. - Use plain language, not implementation details. "You can now..." not "Refactored the..." - Put contributor/internal changes in a separate "For contributors" section at the bottom. - Every entry should make someone think "oh nice, I want to try that." - No jargon: say "every question now tells you which project and branch you're in" not "AskUserQuestion format standardized across skill templates via preamble resolver." ## E2E eval failure blame protocol When an E2E eval fails during `/ship` or any other workflow, **never claim "not related to our changes" without proving it.** These systems have invisible couplings — a preamble text change affects agent behavior, a new helper changes timing, a regenerated SKILL.md shifts prompt context. **Required before attributing a failure to "pre-existing":** 1. Run the same eval on main (or base branch) and show it fails there too 2. If it passes on main but fails on the branch — it IS your change. Trace the blame. 3. If you can't run on main, say "unverified — may or may not be related" and flag it as a risk in the PR body "Pre-existing" without receipts is a lazy claim. Prove it or don't say it. ## Deploying to the active skill The active skill lives at `~/.claude/skills/gstack/`. After making changes: 1. Push your branch 2. Fetch and reset in the skill directory: `cd ~/.claude/skills/gstack && git fetch origin && git reset --hard origin/main` 3. Rebuild: `cd ~/.claude/skills/gstack && bun run build` Or copy the binary directly: `cp browse/dist/browse ~/.claude/skills/gstack/browse/dist/browse`